Hackers of India

COMMSEC: Enhancing Deep Learning DGA Detection Models Using Separate Character Embedding

By Vikash Yadav on 27 Nov 2018 @ HITB Sec Conf


Presentation Material

Abstract

A number of malware families rely on Domain Generation Algorithms (DGAs) to establish a communication link with a command-and-control (C2) server, in order to receive instructions and/or exfiltrate data to malicious actors.

In this talk, we introduce a novel approach to improving ML models’ accuracy in detecting new DGA types: a separate ML model that learns a character embedding representation from a normal English text corpus. The main model uses these general representations to transform domain names before feeding them to the classifier. This architecture avoids overfitting to the training data while still capturing essential contextual information about the language, enabling it to differentiate normal character sequences from random DGA sequences.
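The separate-embedding idea described above can be sketched in a few lines. In this illustration the embedding matrix stands in for one pre-trained on an English corpus by a separate model; here it is frozen random weights, and `ALPHABET`, `embed_domain`, and the 4-dimensional size are hypothetical choices, not the talk's actual parameters:

```python
import numpy as np

# Hypothetical character vocabulary for domain names.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

# Stand-in for an embedding pre-trained on normal English text by a
# separate model; random values here for illustration only. It is
# FROZEN: the downstream classifier never updates it, which is what
# prevents the representation from overfitting to the DGA training set.
rng = np.random.default_rng(0)
EMBEDDING = rng.normal(size=(len(ALPHABET), 4))

def embed_domain(domain: str, max_len: int = 20) -> np.ndarray:
    """Map a domain name to a (max_len, 4) matrix of frozen char embeddings."""
    idx = [ALPHABET.index(c) for c in domain.lower() if c in ALPHABET]
    idx = idx[:max_len]
    out = np.zeros((max_len, EMBEDDING.shape[1]))
    out[:len(idx)] = EMBEDDING[idx]  # plain lookup; no gradient flows back
    return out

# A classifier would then be trained on these fixed representations.
features = embed_domain("example")
```

In a real pipeline the classifier (e.g. a recurrent or convolutional network) consumes these fixed matrices, so only the classifier's own weights adapt to the labeled DGA data.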

We evaluated our models on three new DGA families to test their generalization ability on previously unseen DGA types, and compared the results against a unified architecture trained on identical train and test datasets. We found that our model achieves significantly better results than unified ML approaches on examples of new DGA malware families.

AI Generated Summary (may contain errors)

Here is a summary of the content:

The speaker discusses a method for detecting Domain Generation Algorithms (DGAs), which malware uses to generate random domain names. The traditional approach of manually analyzing network traffic is time-consuming and does not scale for large organizations like RBC, with roughly 80,000 employees and 80 million requests per day.

The speaker proposes a solution using an embedding layer, analogous to the pre-trained models used in image recognition (such as those trained on ImageNet), to learn normal English language patterns, then adding further layers to distinguish random character sequences from normal ones.

The model uses an LSTM-based architecture (the code has not been released) to separate normal character sequences from random ones. The speaker claims that this approach is more flexible than traditional methods and can be easily built with existing deep learning tools like TensorFlow or Keras.
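The talk's model is recurrent (LSTM-based) and built in Keras; as a simplified, dependency-free stand-in for the underlying intuition, even a character-bigram language model trained on a small English word list can separate English-like strings from random DGA-like ones. The `CORPUS` and the add-one smoothing below are illustrative assumptions, not the speaker's setup:

```python
from collections import Counter
from math import log

# Tiny stand-in corpus; the talk's model would learn from a large
# English text corpus instead.
CORPUS = ["google", "facebook", "wikipedia", "amazon", "twitter",
          "youtube", "linkedin", "network", "online", "internet"]

# Count character bigrams across the corpus.
bigrams = Counter()
for word in CORPUS:
    for a, b in zip(word, word[1:]):
        bigrams[a + b] += 1
total = sum(bigrams.values())

def avg_log_prob(s: str) -> float:
    """Average log-probability of a string's bigrams (add-one smoothed)."""
    pairs = list(zip(s, s[1:]))
    return sum(log((bigrams[a + b] + 1) / (total + 26 * 26))
               for a, b in pairs) / len(pairs)
```

English-like domains score noticeably higher under `avg_log_prob` than random character sequences, which is the same signal an LSTM learns, except the LSTM also captures longer-range context than adjacent pairs.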

During the Q&A session, the speaker addresses questions about the availability of the model, how it would perform against dictionary-based DGAs (which concatenate random words instead of random characters), and whether it could detect malware that generates domain names from popular article titles. The speaker suggests using word embedding techniques like Google’s Word2Vec to tackle these types of DGAs.
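The dictionary-DGA case the Q&A raises is hard for character-level models because the characters look like normal English; a word-level view segments the domain into dictionary words first, and those words could then be fed to Word2Vec-style embeddings as the speaker suggests. A minimal greedy-segmentation sketch, where the `WORDS` set is a hypothetical miniature dictionary:

```python
# Hypothetical miniature dictionary; a real system would use a full
# English word list.
WORDS = {"dark", "net", "cloud", "secure", "mail", "host", "fast", "data"}

def segment(domain: str):
    """Greedy longest-match split of a domain into dictionary words.

    Returns the list of words if the whole domain can be segmented,
    or None if some remainder matches no dictionary word.
    """
    parts, i = [], 0
    while i < len(domain):
        for j in range(len(domain), i, -1):   # try the longest word first
            if domain[i:j] in WORDS:
                parts.append(domain[i:j])
                i = j
                break
        else:
            return None                        # unsegmentable remainder
    return parts
```

Domains that segment cleanly into dictionary words (e.g. `darkcloudmail`) are candidates for a word-level check, while purely random strings fail to segment and remain in the character-level model's territory.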

Overall, the speaker presents a novel approach to detecting DGAs with deep learning techniques and emphasizes the need for automated solutions in large organizations.