Automating Threat Hunting on the Dark Web and other nitty-gritty things

By Apurv Singh Gautam on 22 Aug 2020 @ Thedianainitiative
📹 Video 🔗 Link
#threat-hunting #nlp #machine-learning #cybersecurity-strategy
Focus Areas: ⚖️ Governance, Risk & Compliance, 🛡️ Security Operations & Defense, 🤖 AI & ML Security, 🕵️ Threat Intelligence

Presentation Material

Abstract

What’s the hype with the dark web? Why are security researchers focusing more on the dark web? How do you perform threat hunting on the dark web? Can it be automated? If you are curious about the answers to these questions, then this talk is for you. The dark web hosts several sites where criminals buy, sell, and trade goods and services such as drugs, weapons, and exploits. Hunting on the dark web can help identify, profile, and mitigate risks to an organization if done in a timely and appropriate manner. This is why threat intelligence obtained from the dark web can be crucial for any organization. In this presentation, you will learn why threat hunting on the dark web is necessary, the different methodologies for performing it, the process that follows a hunt, and how the hunted data is analyzed. The main focus of this talk will be automating threat hunting on the dark web. You will also learn what operational security (OpSec) is, why it is essential while hunting on the dark web, and how you can employ it in your daily life.

AI Generated Summary

The talk presented an automated framework for threat hunting on the dark web, focusing on extracting and analyzing data from criminal marketplaces and forums to identify emerging threats and compromised assets. The research area centers on proactive cyber threat intelligence (CTI) gathering from Tor and I2P hidden services.

Key techniques involve a hybrid approach combining automated scraping with human intelligence (HUMINT). The core toolchain uses Scrapy, a Python web crawling framework, configured with middleware for Tor proxy routing (via Privoxy) and forum authentication (handling logins and CAPTCHAs). Scrapy’s asynchronous scheduler crawls onion domains, while custom spiders parse HTML into structured items that are stored in a database such as Elasticsearch (see the sketch below). Natural language processing (NLP) and machine learning models are applied post-collection to filter irrelevant data (often 50–70% of scraped content), classify listings (e.g., stolen credentials, exploits), and cluster actor activity. HUMINT, that is, direct engagement with threat actors, supplements automation by validating data authenticity and surfacing new tactics, techniques, and procedures (TTPs).
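The crawling layer described above can be sketched in a few lines of Scrapy. The spider below is a minimal, hypothetical example, assuming a local Privoxy instance on 127.0.0.1:8118 forwarding to Tor; the .onion address, CSS selectors, and item fields are placeholders rather than details from the talk, and login/CAPTCHA handling is omitted.

```python
# Minimal sketch of a Scrapy spider that routes requests through Tor via a
# local Privoxy HTTP proxy. All URLs and selectors below are hypothetical.
import scrapy


class HiddenForumSpider(scrapy.Spider):
    name = "hidden_forum"

    # Placeholder .onion address; each target forum needs its own spider.
    start_urls = ["http://exampleonionaddress.onion/forum/"]

    custom_settings = {
        # Hidden services are slow and easily overloaded; crawl politely.
        "DOWNLOAD_DELAY": 5,
        "CONCURRENT_REQUESTS": 2,
    }

    # Privoxy listens on 8118 and forwards requests to the Tor SOCKS port.
    TOR_PROXY = "http://127.0.0.1:8118"

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={"proxy": self.TOR_PROXY})

    def parse(self, response):
        # Selectors are illustrative only; real forum markup varies widely.
        for post in response.css("div.post"):
            yield {
                "author": post.css("span.author::text").get(),
                "title": post.css("a.title::text").get(),
                "body": " ".join(post.css("div.body ::text").getall()),
                "url": response.url,
            }
        # Follow pagination through the same Tor proxy.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse, meta={"proxy": self.TOR_PROXY})
```

In a pipeline like the one described, items yielded by such spiders would then be pushed into a store such as Elasticsearch for the processing and analysis stages.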

The automated architecture maps to the threat intelligence lifecycle: direction (threat modeling to select target forums), collection (automated scraping of links and data), processing (parsing, translation, NLP filtering), analysis (linking indicators, trend identification, MITRE ATT&CK mapping), and dissemination (dashboards/alerts). Practical implications include early detection of data breaches (e.g., corporate databases for sale), identification of new attack vectors, and enhanced preparation for SOCs and incident responders. The speaker emphasized operational security (OPSEC), requiring isolated lab environments to avoid exposing personal IPs or tools during research. While most steps can be automated, forum-specific spider setup and NLP model training require manual configuration due to varying site architectures and data nuances.
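To illustrate the processing stage, here is a minimal sketch of a text classifier that could sit between collection and analysis, filtering out irrelevant posts and tagging the rest by listing type. The labels, training snippets, and the scikit-learn pipeline are assumptions for illustration; the talk does not prescribe a specific model or library.

```python
# Minimal sketch of the post-collection NLP step: TF-IDF features plus a
# linear classifier that routes scraped posts into rough categories.
# The categories and training data are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative training set; in practice this would be thousands of
# manually labeled posts per forum.
train_texts = [
    "selling corporate vpn credentials in bulk",
    "fresh combo list email pass dump",
    "rce exploit for popular cms for sale",
    "happy new year everyone",
    "what music are you listening to",
]
train_labels = [
    "stolen_credentials",
    "stolen_credentials",
    "exploit",
    "irrelevant",
    "irrelevant",
]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)

# Newly scraped posts tagged "irrelevant" can be dropped before analysts
# ever see them, which is where the bulk of the filtering happens.
new_posts = [
    "database dump of retail customers for sale",
    "favorite pizza toppings?",
]
print(list(zip(new_posts, clf.predict(new_posts))))
```

A classifier along these lines would feed the analysis stage, where the retained items are linked to indicators, trends, and ATT&CK techniques before dissemination.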

Disclaimer: This summary was auto-generated from the video transcript using AI and may contain inaccuracies. It is intended as a quick overview — always refer to the original talk for authoritative content. Learn more about our AI experiments.