Hackers of India

Harnessing Intelligence from Malware Repositories

By Arun Lakhotia, Vivek Notani on 06 Aug 2015 @ Blackhat


Abstract

The number of unique malware samples has been doubling every year for over two decades. The majority of effort in malware analysis has focused on methods for preventing malware infection. We view the exponential growth of malware as an underutilized source of intelligence. Given that the number of malware authors is not doubling each year, the large volume of malware must contain evidence that connects them. The challenge is how to extract the connections.

Since malware is complex software, its development necessarily follows software engineering principles, such as modular programming and the use of third-party libraries. Thus, code shared between malware samples is a viable indicator of a connection between their creators. However, identifying such shared code is not straightforward. To survive in an environment that is hostile to it, malware uses a variety of deceptions, such as polymorphic packing, for the explicit purpose of making such connections difficult to infer.

By combining two orthogonal approaches - formal program analysis and data mining - we have developed a scalable method to search large-scale malware repositories for forensic evidence. Program analyses help peer through the deceptions employed by malware to extract fragments of evidence. Data mining helps organize this mass of fragments into a web of connections, which can then be queried in a variety of ways: to determine whether two apparently disparate cyber attacks are related; to transfer knowledge gained in countering one malware to other, similar malware; to get a holistic view of cyber threats; and to understand and track trends.
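
The idea of organizing extracted fragments into a web of connections can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sample names, and the fragment identifiers are all hypothetical, and the fragments are assumed to have already been produced by some program-analysis stage. The sketch builds an inverted index from fragment to samples, then links every pair of samples that share a fragment.

```python
from collections import defaultdict
from itertools import combinations


def build_connections(fragments_by_sample):
    """Link samples that share at least one fragment.

    fragments_by_sample maps a sample name to the set of fragment
    identifiers extracted from it (assumed output of an earlier
    program-analysis stage; the names here are hypothetical).
    Returns {(sample_a, sample_b): set of shared fragments}.
    """
    # Inverted index: fragment -> samples that contain it.
    samples_by_fragment = defaultdict(set)
    for sample, fragments in fragments_by_sample.items():
        for fragment in fragments:
            samples_by_fragment[fragment].add(sample)

    # Every pair of samples listed under the same fragment gets an edge.
    edges = defaultdict(set)
    for fragment, samples in samples_by_fragment.items():
        for a, b in combinations(sorted(samples), 2):
            edges[(a, b)].add(fragment)
    return dict(edges)


def related(edges, a, b):
    """Return the fragments connecting two samples (empty set if none)."""
    return edges.get(tuple(sorted((a, b))), set())


# Made-up sample names and fragment identifiers, for illustration only.
corpus = {
    "sample_A": {"f1", "f2", "f3"},
    "sample_B": {"f3", "f4"},
    "sample_C": {"f5"},
}
edges = build_connections(corpus)
print(related(edges, "sample_B", "sample_A"))  # {'f3'}
print(related(edges, "sample_A", "sample_C"))  # set()
```

A query such as "are these two attacks related?" then reduces to a dictionary lookup. Note that linking every pair under a fragment is quadratic in the number of samples sharing it, so a fragment present in most samples would need to be filtered out before this step in a large repository.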

This talk will summarize our method, describe VirusBattle - a web service for cloud-based malware analysis - developed at UL Lafayette, and present empirical evidence of the viability of mining large-scale malware repositories to draw meaningful inferences.

AI Generated Summary (may contain errors)

The speaker discusses a system that uses semantic hashing to enable fast search of code across various levels of granularity. This allows for identification of targeted attacks and of similar malware, and for faster incident response. The system is available online and reduces the cost of reverse engineering.
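
Searching by hash at more than one level of granularity can be sketched as follows. This is an illustration only, under stated assumptions: an exact SHA-256 of the instruction text stands in for the semantic hash described in the talk, the two levels (basic block and procedure) are chosen for the example, and the class, function, and sample names are hypothetical.

```python
import hashlib
from collections import defaultdict


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def block_hash(block):
    # Stand-in: an exact hash of the block text, NOT a semantic hash.
    return digest("\n".join(block))


def procedure_hash(blocks):
    # Order-independent hash over the procedure's block hashes.
    return digest("".join(sorted(block_hash(b) for b in blocks)))


class CodeIndex:
    """Hypothetical index keyed by hash at two levels of granularity."""

    def __init__(self):
        self.by_block = defaultdict(set)
        self.by_procedure = defaultdict(set)

    def add(self, sample, procedures):
        # procedures: list of procedures, each a list of blocks,
        # each block a list of instruction strings.
        for blocks in procedures:
            self.by_procedure[procedure_hash(blocks)].add(sample)
            for block in blocks:
                self.by_block[block_hash(block)].add(sample)

    def find_block(self, block):
        return self.by_block.get(block_hash(block), set())

    def find_procedure(self, blocks):
        return self.by_procedure.get(procedure_hash(blocks), set())


# Made-up code and sample names, for illustration only.
decrypt = [["xor al, 0x5a", "inc esi"], ["cmp esi, edi", "jne loop"]]
beacon = [["push 0x50", "call connect"]]

index = CodeIndex()
index.add("sample_A", [decrypt, beacon])
index.add("sample_B", [[decrypt[0], ["ret"]]])  # reuses one block only

print(index.find_procedure(decrypt))        # {'sample_A'}
print(sorted(index.find_block(decrypt[0])))  # ['sample_A', 'sample_B']
```

The whole-procedure query finds only the sample that contains the full routine, while the block-level query also surfaces the sample that reuses a single block. Each lookup is a single hash-table access, which is what makes this style of search fast.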

A questioner asks why semantic transformation is required, assuming that the same source code is reused but with different compilers and compiler options. The speaker responds that even with the same code, transformations such as packing, metamorphism, and polymorphism can be applied to defeat antivirus software. Additionally, compiling the same source code multiple times can result in variations due to different register choices, instruction sets, and memory locations.
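
The point about register choices and memory locations can be made concrete with a small sketch. It is a deliberately simplified, purely syntactic normalization and not the semantic hashing described in the talk: registers are renamed in order of first use and hexadecimal literals are masked before hashing. The register list, the toy assembly syntax, and the function names are assumptions made for the example.

```python
import hashlib
import re

# Assumed toy x86-like register names; a real tool would cover the full ISA.
REGISTER = re.compile(r"\b(e?[abcd]x|e?[sd]i|e?[sb]p)\b")
# Hex literals stand in for absolute addresses and constants.
ADDRESS = re.compile(r"\b0x[0-9a-fA-F]+\b")


def normalize(instructions):
    """Rename registers in order of first use and mask hex literals."""
    names = {}

    def rename(match):
        reg = match.group(0)
        if reg not in names:
            names[reg] = "R%d" % len(names)
        return names[reg]

    out = []
    for ins in instructions:
        ins = ADDRESS.sub("ADDR", ins.lower())
        out.append(REGISTER.sub(rename, ins))
    return out


def fragment_hash(instructions):
    text = "\n".join(normalize(instructions))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two made-up builds of the same logic: different registers and addresses.
variant_1 = ["mov eax, [0x401000]", "add eax, ebx", "mov [0x401008], eax"]
variant_2 = ["mov ecx, [0x5a2000]", "add ecx, edx", "mov [0x5a2008], ecx"]
# Different logic: subtraction instead of addition.
other = ["mov eax, [0x401000]", "sub eax, ebx", "mov [0x401008], eax"]

print(normalize(variant_1))
# ['mov R0, [ADDR]', 'add R0, R1', 'mov [ADDR], R0']
print(fragment_hash(variant_1) == fragment_hash(variant_2))  # True
print(fragment_hash(variant_1) == fragment_hash(other))      # False
```

A byte-level hash would treat the two variants as unrelated, while the normalized hash matches them. This kind of renaming does not handle reordered or substituted instructions, which is one reason a semantic rather than syntactic representation is needed against metamorphic code.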

The speaker also addresses questions about confidentiality of submissions, stating that the current web service run by a university does not have significant access control beyond the API key. However, they are working on creating a spin-off company to address these issues and provide a more reliable service.

Finally, the speaker invites interested parties to contact them and try out their system, which is currently available through a Python API package.