Hackers of India

Analyzing Multi-Dimensional Malware Dataset

 Ankur Tyagi 


This presentation treats a malware corpus as a multi-dimensional dataset. We start with a set of Portable Executable samples and scan them to collect attributes. These attributes characterize a malware sample and are typically represented as a flat, one-dimensional set of key-value pairs. This view is limited, however, and does little to surface the traits that are useful for malware family attribution. We then represent the key-value pairs as a multi-dimensional dataset and visualize it using the following approaches:

  1. Byte Frequency Histogram
  2. Grayscale/RGB Byte Representation
  3. API Histogram
  4. Timebound API Histogram
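
As a rough illustration of the first two views, the sketch below counts byte values in a buffer and folds the same bytes into fixed-width rows that could be rendered as a grayscale image. It is not the tool shown in the talk: the function names, the row width of 16, and the stand-in sample bytes are all assumptions made for this example.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Return a 256-entry list where index i holds the count of byte value i."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def grayscale_rows(data: bytes, width: int = 16) -> list[list[int]]:
    """Fold raw bytes into rows of `width` pixels (0-255), zero-padding the last row."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))  # pad a short final row with black pixels
        rows.append(row)
    return rows


# Stand-in bytes for illustration only; this is not a valid PE file.
sample = b"MZ" + bytes(62) + b"PE\x00\x00"

hist = byte_histogram(sample)
image = grayscale_rows(sample, width=16)

print("most common byte value:", max(range(256), key=lambda v: hist[v]))
print("image size:", len(image), "rows x", len(image[0]), "columns")
```

Each row of `image` can be passed to any plotting or imaging library as pixel intensities. For real samples a wider row (for example 256) is a common choice, but the best width depends on file size.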

These techniques help identify the defining attributes of a malware family and are therefore useful for clustering samples. The presentation will demo the analysis and visualization on multiple unclassified malware samples. We will start with a manual run of the tool and then look at examples that use its built-in API for automation.
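
To make the API-based views and their use in clustering more concrete, here is a minimal sketch. It assumes each sample's behavior is available as a list of `(timestamp_seconds, api_name)` pairs, interprets "timebound" as bucketing calls into fixed time windows, and uses cosine similarity as one possible way to compare samples. The traces, function names, and window size are hypothetical and are not taken from the tool presented in the talk.

```python
import math
from collections import Counter


def api_histogram(calls):
    """Count how often each API name appears in a trace."""
    return Counter(name for _, name in calls)


def timebound_histogram(calls, window=1.0):
    """Group API counts into consecutive time windows of `window` seconds."""
    buckets = {}
    for timestamp, name in calls:
        index = int(timestamp // window)
        buckets.setdefault(index, Counter())[name] += 1
    return buckets


def cosine_similarity(a, b):
    """Cosine similarity between two histograms (1.0 means identical direction)."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical traces for two samples: (seconds since start, API name).
trace_a = [(0.1, "CreateFileW"), (0.4, "WriteFile"),
           (1.2, "WriteFile"), (2.5, "RegSetValueExW")]
trace_b = [(0.2, "CreateFileW"), (0.9, "WriteFile"),
           (1.1, "WriteFile"), (3.0, "connect")]

hist_a = api_histogram(trace_a)
hist_b = api_histogram(trace_b)
windows_a = timebound_histogram(trace_a, window=1.0)
similarity = cosine_similarity(hist_a, hist_b)

print("per-window counts for sample A:", windows_a)
print("similarity between A and B:", round(similarity, 3))
```

Pairwise similarities like this one could feed a clustering step, with high-scoring pairs treated as candidates for the same family. The threshold and clustering method are left open here.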