Presentation Material
Abstract
Humans cannot scale to the amount of Threat Intelligence being generated. While the Security Community has mastered the use of machine-readable feeds from OSINT systems or third-party vendors, these usually provide IOCs or IOAs without contextual information. On the other hand, we have rich textual data that describes the operations of cyber attackers and their tools, tactics and procedures, contained in internal incident response reports, public blogs and white papers. Today, we can’t automatically consume or use this data because it is unstructured text. Threat Analysts manually go through it to extract information about the adversaries most relevant to their threat model, but that manual work is a bottleneck in both time and cost.
In this project we will automate this process using Machine Learning. We will share how we can use ML for Custom Entity Extraction to automatically extract entities specific to the cyber security domain from unstructured text. We will also share how this system can be used to generate insights such as:
- Identify patterns of attacks an enterprise may have faced
- Analyze which attacker techniques are most effective against the enterprise being defended
- Extract trends of techniques used in the overall ecosystem or a specific vertical industry
These insights can be used to make data-backed decisions about where to invest in the defenses of an enterprise. In this talk we will describe our solution for building an entity extraction system from public domain text specific to the security domain, using open-source ML tooling. The goal is to enable applied researchers to extract TI insights automatically, at scale and in real time.
We will cover:
- The importance of this process for threat intelligence, with some examples of actionable insights we can provide as a result of this research
- Overall Architecture of the system and ML principles used
- How we automatically created a training dataset for our domain using a dictionary of entities
- Supervised and unsupervised featurization methods we experimented with
- Experimentation and results from Statistical Modeling methods and Deep Learning Methods
- Recommendations and resources for Applied Researchers who may want to implement their own TI Extraction pipeline.
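The dictionary-based labeling step listed above can be illustrated with a minimal sketch. It assumes a small, hypothetical entity dictionary and a simple regex tokenizer, neither of which comes from the talk, and it emits BIO tags by longest-match lookup. It is one possible way to bootstrap a training set, not necessarily the pipeline the speakers built.

```python
import re

# Hypothetical entity dictionary for illustration; a real one would be far larger.
ENTITY_DICT = {
    "fancy bear": "ACTOR",
    "dragonfly": "ACTOR",
    "spearphishing": "TECHNIQUE",
}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def label_with_dictionary(text, entity_dict=ENTITY_DICT):
    """Return (token, BIO tag) pairs using longest-match dictionary lookup."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    # Tokenize dictionary keys so multi-word entities can match,
    # and try longer entries first.
    entries = sorted(
        ((tokenize(key), label) for key, label in entity_dict.items()),
        key=lambda entry: len(entry[0]),
        reverse=True,
    )
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for words, label in entries:
            n = len(words)
            if lowered[i:i + n] == words:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            # No dictionary entry starts at this token.
            i += 1
    return list(zip(tokens, tags))


if __name__ == "__main__":
    for token, tag in label_with_dictionary(
        "The White House was attacked by Fancy Bear."
    ):
        print(f"{token}\t{tag}")
```

A lookup like this tags every dictionary hit regardless of context, so the resulting labels are noisy. A model trained on them still has to learn from the surrounding words when a mention is really an entity.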
AI Generated Summary (may contain errors)
Main Topic: The speaker presents a machine learning model that can identify entities and techniques from unstructured threat intelligence (TI) data, for cybersecurity.
Key Points:
- The model is trained to extract insights from TI data without relying on pattern matching or text recognition.
- The model identifies entities and techniques based on the context in which they appear.
- The speaker demonstrates the model’s effectiveness by showing how it correctly identifies actors like “Fancy Bear” only when mentioned in an attack context.
- Future research directions include experimenting with attention networks, generative methods for data augmentation, and extracting more sophisticated relationships, including temporal ones.
- The speaker illustrates the potential impact of this technology by showcasing a graph that helps threat analysts decide where to place defenses or choke points within an organization.
Highlighted Malware Family: In purple, the wiki blurb about the malware family is highlighted.
Other Actors: The model also identifies other actors, such as Black Energy and Dragonfly.
Example Sentences:
- “This is a Fancy Bear” (no context, no actor identified)
- “The White House was attacked by Fancy Bear” (correctly identifies Fancy Bear as an attacker)
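To make the role of context concrete, here is a small sketch of the window features a sequence tagger might see for the token "Fancy" in each of the two sentences. The feature names and the window size are assumptions chosen for illustration; the summary does not say which features or model the speakers used.

```python
def context_features(tokens, i, window=2):
    """Build features for tokens[i] from the word itself and its neighbours."""
    feats = {
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"{offset:+d}:word.lower"
        # Positions outside the sentence get a padding marker.
        feats[key] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return feats


no_context = "This is a Fancy Bear".split()
attack = "The White House was attacked by Fancy Bear".split()

a = context_features(no_context, no_context.index("Fancy"))
b = context_features(attack, attack.index("Fancy"))

if __name__ == "__main__":
    print(a)
    print(b)
```

The token itself is identical in both cases. Only the neighbouring words differ ("is", "a" versus "attacked", "by"), and that difference is the signal a context-aware model can use to tell the two mentions apart.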
Graph Example: A graph is shown that plots the overlap between a commodity malware family and nation-state attackers’ techniques, highlighting the blurring line between the two.
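One simple way to quantify such an overlap is the Jaccard similarity between the sets of techniques extracted for each entity. The sketch below uses invented placeholder data, not results from the talk, and the summary does not state how the speakers computed their graph.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Invented placeholder data: techniques extracted per entity.
techniques = {
    "commodity_malware": {"spearphishing", "credential dumping", "powershell"},
    "nation_state_actor": {"spearphishing", "powershell", "supply chain compromise"},
}

shared = techniques["commodity_malware"] & techniques["nation_state_actor"]
score = jaccard(techniques["commodity_malware"], techniques["nation_state_actor"])

if __name__ == "__main__":
    print(sorted(shared), score)
```

A score near 1 would mean the two entities rely on nearly the same techniques, and a score near 0 would mean they share few or none.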
Conclusion: The speaker concludes that it’s time for TI to move beyond manual analysis and leverage machine learning to extract insights from unstructured data.