Big Data Challenges in Open Source Management

Distributors and creators of open source software projects must attach or maintain the relevant licenses, notices, or both alongside their projects so that users can consume them in a compliant way. However, the reality on the ground is very different for developers.

Often, due to the inherently “open” nature of these projects, their code spreads, in part or as a whole, into multiple other open source projects. That creates potential compliance and security risks for consumers of these projects. The growing number of open source projects (already in the millions) and their scale (with hundreds of containers holding millions of lines of code) have made open source ecosystems even more complex. This makes finding the true ingredients and origins of open source projects a very challenging task for consumers.

Unique Identifiers

Black Duck works hard to identify millions of open source projects and provide accurate and complete data on those projects. These data enable us to effectively manage compliance and security issues for our customers. In this context, classifying millions of open source projects using features such as project names, vendors (or providers), and repositories (that are publicly available on the Web) is crucial to accurately identifying security and legal compliance issues with open source projects. Furthermore, software from one open source project often finds its way into other open source projects, making it even more difficult to identify each project uniquely. To that end, billions of features are needed from millions of open source projects to uniquely identify them automatically through computational techniques.
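
To make the idea of feature-based identification concrete, here is a minimal sketch. It is not Black Duck's actual method, which the text does not describe. It assumes each project can be reduced to a set of string features (name, vendor, repository, file fingerprints) and uses Jaccard similarity, a common set-overlap measure, as a stand-in matching technique. The project names, feature strings, and function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(unknown, catalog):
    """Return (project_name, score) for the catalog entry most similar to `unknown`."""
    return max(
        ((name, jaccard(unknown, feats)) for name, feats in catalog.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical catalog: every project is described by a set of features.
# Note that "file:a2" appears in both projects, mimicking shared code.
catalog = {
    "libalpha": {
        "name:libalpha", "vendor:example-org",
        "repo:example.org/libalpha", "file:a1", "file:a2",
    },
    "libbeta": {
        "name:libbeta", "vendor:example-org",
        "repo:example.org/libbeta", "file:b1", "file:a2",
    },
}

# Hypothetical features extracted from scanned code that embeds part of libalpha.
scanned = {"vendor:example-org", "file:a1", "file:a2"}

match = best_match(scanned, catalog)
print(match)
```

Because the two catalog entries share a vendor and a file, no single feature settles the match; only the overall overlap separates them. That is one reason a large number of features per project matters when code is copied between projects.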

Big Data Challenges

Identifying projects at this scale is a challenging big data problem. Did you know that Black Duck applies state-of-the-art data mining solutions to build the information in our knowledgebase? Black Duck Hub uses billions of features generated from millions of open source projects (collectively representing terabytes of data) to uniquely identify projects, which ultimately helps in mitigating compliance and security risks.

