Hackers of India

Real-time Ingestion of security telemetry data into Hadoop distributed system to respond to 0-day

 Pallav Jakhotiya   Vipul Sawant 



Threat landscape is changing very rapidly and we are seeing more and more targeted attacks. Detecting such attacks requires a data driven approach, which requires processing PBs of telemetry data (AV detections, system access logs, network statistics etc.) received from end points, firewalls, gateways etc. Distributed systems like Apache Hadoop allow for such processing, however ingesting data as soon as it arrives is needed for providing almost 0 day protection. Using traditional approach of using map-reduce batch data processing can provide very high throughput (number of events processed per second) but that comes at the cost of increased latency in the order of several minutes to few hours. Apache storm provides real-time processing of events with very low latency (in the order of few seconds) but it cannot be used to compute arbitrary functions on an arbitrary dataset in real time. This has given rise to “Lambda Architecture” of using combination of both “batch layer” or Map-Reduce and “speed layer” or real-time processing with apache storm for implementing big data systems.

In most use cases, apache hive is used as “batch layer” application to execute Map-Reduce jobs by simply writing SQL queries. But to support hive queries, the data must to be present at rest on distributed file system HDFS in the format that is understood by hive. Traditionally, Map-Reduce jobs have been used to implement the data ingestion service that performs ETL tasks of ingesting data into apache hive. But to support the “speed layer” of lambda architecture, the data ingestion service also needs to fulfill the low latency requirement. So, overall the ingestion service should accept incoming telemetry events in real time; perform required data formatting and cleansing and then send this processed stream of telemetry events to “speed layer” applications and also ingest these events into hive. To support the low latency requirement, the natural choice for implementation of data ingestion service is the Apache Storm since it supports real-time processing and also can stream events to hive using HCatalog streaming API. However our tests and research has indicated though Apache Storm supports required low latency but has low overall throughput (number of events stored per second) of ingesting the events into hive compared to Map-Reduce jobs due to limitations of the HCatalog streaming API and Hive MetaStore.

The technique we present makes use of combination of both Apache Storm and Map-Reduce to implement “Hybrid data ingestion pipeline” to support requirements of both “speed layer” and “batch layer” applications and also achieves the required high throughput requirement of ingestion into hive.