Presentation Material
Abstract
Software Composition Analysis (SCA) products report vulnerabilities in third-party dependencies by comparing libraries detected in an application against a database of known vulnerabilities. These databases typically incorporate multiple sources, such as bug tracking systems, source code commits, and mailing lists, and must be curated by security researchers to maximize accuracy.
We designed and implemented a machine learning system featuring a complete pipeline, from data collection, model training, and prediction on data items, to validation of new models before deployment. The process runs iteratively to generate better models from newer labels, and it incorporates self-training to automatically grow its training dataset.
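The self-training loop can be sketched as follows: train on the labeled items, pseudo-label high-confidence predictions on unlabeled items, and fold them into the training set for the next iteration. The toy classifier, data, and threshold below are illustrative assumptions, not the paper's actual model.

```python
from collections import Counter
import math

def train(examples):
    """Fit per-class word counts (a tiny Naive Bayes stand-in)."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter()
    for text, label in examples:
        counts[label].update(text.split())
        priors[label] += 1
    return counts, priors

def predict(model, text):
    """Return (label, confidence) for one item."""
    counts, priors = model
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values()) + 1
        score = math.log(priors[label] + 1)
        for word in text.split():
            score += math.log((counts[label][word] + 1) / total)
        scores[label] = score
    best = max(scores, key=scores.get)
    # Softmax over the two log-scores gives a rough confidence.
    z = sum(math.exp(s - scores[best]) for s in scores.values())
    return best, 1.0 / z

# Labeled seed data: 1 = vulnerability-related, 0 = not (toy examples).
labeled = [
    ("fix buffer overflow in parser", 1),
    ("patch xss in template rendering", 1),
    ("update readme wording", 0),
    ("bump version number in readme", 0),
]
pool = ["fix overflow in decoder", "update wording of docs"]
THRESHOLD = 0.7  # assumed pseudo-labeling cutoff

for _ in range(3):  # iterate: newer labels yield a newer model
    model = train(labeled)
    remaining = []
    for item in pool:
        label, conf = predict(model, item)
        if conf >= THRESHOLD:
            labeled.append((item, label))  # self-training: adopt pseudo-label
        else:
            remaining.append(item)  # low-confidence items wait for later rounds
    pool = remaining
```

Only predictions above the confidence threshold are adopted, which limits the noise that self-training feeds back into the training set.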
The deployed model is used to automatically predict the vulnerability-relatedness of each data item. This allows us to effectively discover vulnerabilities across the open-source library ecosystem.
To help maintain performance stability, our methodology also includes an additional evaluation step that automatically determines how well the model from a new iteration would fare. In particular, the evaluation measures how much the new model agrees with the old one, while aiming to improve metrics such as precision and recall.
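A minimal sketch of such a pre-deployment check: compare the candidate model's predictions with the current model's on the same labeled holdout, and gate deployment on agreement plus precision and recall. The thresholds and function names here are illustrative assumptions, not the system's actual criteria.

```python
def precision_recall(preds, truth):
    """Precision and recall of binary predictions against ground truth."""
    tp = sum(1 for p, t in zip(preds, truth) if p and t)
    fp = sum(1 for p, t in zip(preds, truth) if p and not t)
    fn = sum(1 for p, t in zip(preds, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def should_deploy(old_preds, new_preds, truth,
                  min_agreement=0.9, min_precision=0.8, min_recall=0.8):
    """Accept the new model only if it agrees enough with the old one
    and meets the precision/recall bars on the holdout."""
    agreement = sum(o == n for o, n in zip(old_preds, new_preds)) / len(truth)
    precision, recall = precision_recall(new_preds, truth)
    return (agreement >= min_agreement
            and precision >= min_precision
            and recall >= min_recall)

# Toy holdout of 10 items (1 = vulnerability-related).
truth     = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
old_preds = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
new_preds = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
```

Requiring high agreement with the deployed model guards against a new iteration that scores well on the holdout but behaves erratically on the bulk of unlabeled items.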
This is the first study of its kind across a variety of data sources, and our paper was recently awarded the ACM SIGSOFT Distinguished Paper Award at the Mining Software Repositories Conference (MSR) 2020.
AI Generated Summary (may contain errors)
The speaker highlights the importance of identifying vulnerabilities in third-party dependencies, which can have widespread consequences due to their extensive use. Several examples are provided, including regular expression denial-of-service (ReDoS), cross-site scripting, and cross-site request forgery. Such vulnerabilities have been found in widely used libraries, such as Trim, XXL Job, and OmniAuth.
The speaker emphasizes that these issues can have severe consequences if left unaddressed, especially since new classes of critical vulnerabilities are continually being discovered. It is essential to stay informed about known vulnerabilities and to actively monitor dependencies for potential flaws.
A machine learning approach can be effective in discovering vulnerabilities at scale without the need for static or dynamic analysis. However, this approach is not self-sufficient and requires continuous improvement.
The speaker concludes by emphasizing three key points:
- Continuously monitoring dependencies for issues is crucial.
- The number of dependencies and new types of vulnerabilities will increase over time.
- While machine learning can help discover vulnerabilities, it is essential to stay informed about known vulnerabilities and to continue discovering more through different methods.
Overall, the talk highlights the importance of proactive vulnerability management and the potential benefits of using a machine learning approach to identify issues in third-party dependencies.