Presentation Material
Abstract
This presentation covers the analysis of a malware corpus as a multi-dimensional dataset. We start with a set of Portable Executable (PE) samples and scan them to collect attributes. These attributes characterize a malware sample and are typically represented as a one-dimensional set of key-value pairs. This view is fairly limited, however, and does little to identify traits that are useful for malware family attribution. We then represent the key-value pairs as a multi-dimensional dataset and visualize it using the following approaches:
- Byte Frequency Histogram
- Grayscale/RGB Byte Representation
- API Histogram
- Timebound API Histogram
These techniques help identify the defining attributes of a malware family and are therefore useful for clustering samples. The presentation will demo the analysis and visualization on multiple unclassified malware samples. We will start with a manual run of the tool and then look at examples that use the built-in API for automation.
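As a rough illustration of the first two visualizations, the sketch below computes a byte frequency histogram and a grayscale byte layout from raw bytes. It is not the tool's API: the function names `byte_histogram` and `grayscale_rows`, the `width` parameter, and the sample bytes are all assumptions made for this example.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Return a 256-bin count of how often each byte value occurs in data."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def grayscale_rows(data: bytes, width: int = 16) -> list[list[int]]:
    """Lay the bytes out row by row as grayscale pixels (0-255).

    The last row is zero-padded so that every row has the same width.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Hypothetical stand-in for the raw bytes of a PE sample (not a valid PE file).
sample = b"MZ" + bytes(6) + b"\x90" * 4

hist = byte_histogram(sample)
rows = grayscale_rows(sample, width=4)
```

Each row of `rows` can be handed to any plotting or imaging library as one line of pixels, and `hist` can be plotted directly as a bar chart. Comparing these outputs across samples is one way to look for shared structure within a family.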