
Talks Tech #35: Measuring and Remediating Open Source Software Risks

Written by Nirvi Badyal, March 15, 2023

Women Who Code Talks Tech 35     |     Spotify - iTunes - Google - YouTube - Text
Nirvi Badyal, Software Engineer at Scantist and Founder & President at HeForShe-NTU, shares her talk, “Measuring and Remediating Open Source Software Risks.” She talks about the convenience of using open source, as well as the risks, and discusses ways to measure vulnerabilities in dependency trees.


I started worrying about open source a few years ago while working on a couple of open source projects. I also wanted to make a meaningful impact on the digital world, and with so many data breaches happening, it was the right moment. I started working with my professor in this research domain. What is the importance of open source? Why do we really need it? What are the open source risks? How are enterprises currently measuring and remediating these risks? What techniques are current tools employing?

Open source is software whose code is open: anybody can see it, and anybody can contribute to it. There are benefits and downsides. It increases developers' productivity because we don't have to reinvent the wheel. If you are a Python developer, you might be aware of open source libraries like pandas or NumPy and use them in your applications. This reduces the lines of code we write because we don't have to build everything from scratch. And since it increases developers' productivity, it also speeds up the release cycle: companies can deliver enhancements and bug fixes to customers earlier.

Typically, 80% of a software application is open source. This is a testament to how pervasive open source is, and its use is increasing exponentially. There are 37 million open source component versions available right now across the four major package platforms, with 6 million new versions introduced in the past year alone. There is huge demand for open source because it increases productivity and lets companies speed up their release cycles. But there are risks everywhere in open source, and they are real and costly. From 2020 to 2021, there was a 650% increase in open source attacks. This means two things: attackers are becoming more sophisticated, and at the same time, there are more opportunities for attackers because open source is growing exponentially.

The Log4j vulnerability was in the news recently. It was a huge deal for many companies; big companies like Oracle and Apple were affected. In fact, around 60% of Java libraries were affected. If a hacker passes a malicious string to log4j, which is used as a dependency in an application, log4j performs a lookup through the system and ends up executing the attacker's code. It is a remote code execution vulnerability: "remote code" means the hacker is able to supply the code, the malicious payload, and then exploit it.

I generally look at GitHub stars to judge whether a package is the right one to use, but there isn't a way to really check whether it is a secure package. That's where the problem arises. Developers in most companies are not focused on whether the software they are writing is secure; they are focused on innovation and getting the development work done. There aren't enough eyes, or enough systematic ways, to measure and remediate these open source software risks. Open source itself does not have a secure software development cycle to detect bugs or vulnerabilities in new or existing packages, and there is no built-in way to update open source libraries when they carry risks.

The company is responsible for everything that ships. The attacker isn't going to worry about who wrote the code; he's going to find the easiest possible way into the company's application. The figure below shows an iceberg: a small chunk of ice above the water and a big chunk below. A company's application is the small chunk above the water. Below the water are the application's dependencies, which are usually open source libraries. Here, the application uses pandas and TensorFlow as its direct dependencies.

Pandas, in turn, uses other open source dependencies, like docutils, and TensorFlow uses others. This leads to transitive dependencies. If any dependency in the tree, like the one shown in red, has an open source risk, the complete path to it is vulnerable. That means the application is also vulnerable, because it is indirectly using the vulnerable dependency. Given that dependencies carry vulnerabilities and that dependency trees are complex, how can enterprises manage this risk, understand the ecosystem, and provide remediation? What are the minimum requirements needed?
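The iceberg picture can be sketched as a simple dependency graph: a risk anywhere below the waterline taints every path back up to the application. A minimal Python sketch, using an invented graph (these edges are illustrative, not real resolved dependencies):

```python
# Hypothetical dependency graph: each package maps to its direct
# dependencies. The edges are invented for illustration.
deps = {
    "app": ["pandas", "tensorflow"],
    "pandas": ["numpy", "docutils"],
    "tensorflow": ["numpy", "absl-py"],
    "numpy": [],
    "docutils": [],
    "absl-py": [],
}

def reaches_vulnerable(graph, root, vulnerable):
    """Return True if any direct or transitive dependency of `root`
    is in the `vulnerable` set, i.e. some path in the tree is tainted."""
    seen = set()
    stack = list(graph.get(root, []))
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        if pkg in vulnerable:
            return True
        stack.extend(graph.get(pkg, []))
    return False

print(reaches_vulnerable(deps, "app", {"docutils"}))  # True: app -> pandas -> docutils
print(reaches_vulnerable(deps, "app", {"requests"}))  # False: not in the tree
```

The traversal deliberately tracks visited packages, since real dependency graphs share nodes (numpy appears under both pandas and TensorFlow here).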

The minimum requirements start with: what are the open source dependencies? Before we ask what is vulnerable in the application, we first need to know its dependencies. Then, what vulnerabilities are associated with them? And if there are vulnerabilities, how do we respond and fix the dependency? Do we upgrade it to a higher version or downgrade it? So how do we actually find the open source dependencies in an ecosystem? On GitHub, the Python ecosystem has requirements.txt files, and npm has the package.json file. These metadata files list the dependencies a repository uses, and for each dependency they also give the version range the project wants to use. A typical example is shown below: the dependency list of the well-known Keras library, which uses pandas and SciPy, each with a version range. This is how current tools gather the dependencies of an application and the versions associated with them. At the end of the day, the application will use one particular version of each library, and tools determine that version from the ranges in these metadata files.
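As a rough sketch of that collection step, here is how a tool might pull (package, version-range) pairs out of a requirements.txt-style file. The file contents and the simple regex are invented for illustration; real resolvers handle many more cases (extras, markers, URLs):

```python
import re

# Invented requirements.txt contents, in the style described above.
requirements_txt = """\
pandas>=1.3,<2.0
scipy>=1.7
numpy==1.21.6
"""

def parse_requirements(text):
    """Split each line of a requirements-style metadata file into a
    (package name, version-range) pair, skipping blanks and comments."""
    deps = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The package name is the leading run of name characters;
        # everything after it is the version-range constraint.
        match = re.match(r"([A-Za-z0-9_.-]+)\s*(.*)", line)
        deps.append((match.group(1), match.group(2)))
    return deps

for name, spec in parse_requirements(requirements_txt):
    print(name, spec or "(any version)")
```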

We need the vulnerability information in order to map which versions are vulnerable, and therefore whether the application is vulnerable. How this data is collected is simple and complex at the same time. Simple, because these vulnerabilities are posted in a public database called the NVD. For example, the picture below shows a CVE describing a vulnerability in the TensorFlow library. It says versions prior to 2.9.0 and 2.8.1 are vulnerable, and that 2.9.0 and 2.8.1 carry the patch, which means those are clean versions. This is how current tools scrape the data, map the affected versions to their application, and find all the versions that are actually clean. Notice that this is all at the version level; how to upgrade to a clean, non-vulnerable version is also described in this information.
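A sketch of that version-level mapping: compare a resolved version against the patch boundary a CVE reports. The rule used here, "strictly older than the fixed release is affected," is a simplification of how NVD version ranges actually work, and the version numbers are just examples:

```python
def parse_version(v):
    """Turn a dotted version string like '2.8.1' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def is_affected(version, fixed_in):
    """Simplified NVD-style check: a version is affected if it is
    strictly older than the release that carries the patch."""
    return parse_version(version) < parse_version(fixed_in)

print(is_affected("2.8.0", "2.8.1"))  # True: older than the patched release
print(is_affected("2.9.0", "2.8.1"))  # False: already past the patch
print(is_affected("2.8.1", "2.8.1"))  # False: this is the patched release
```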

Once we have collected the dependencies and the vulnerabilities associated with them, we would like to have a remediation associated with each.

The blue box is an application, and the sub-boxes are the dependency versions of this application: some green boxes on the left, some brown boxes below, and yellow boxes above, 21 dependency versions in all. When we say we have to collect all open source dependencies, this is what an application's dependency tree looks like. But not all of it is relevant when we are trying to find a vulnerability in the app. Take the example where rencode 1.0.6 is vulnerable. Since the application only uses rencode through one family, only three of these versions are relevant for finding out whether the application calls that vulnerability. The application does not reach rencode through the green family or the brown family, so the remaining 18 versions are irrelevant.

So this is what I mean by the minimum requirements, and it is what current tools do: they collect the open source dependency trees and the vulnerabilities, and map them to see whether the application is potentially vulnerable. But this method leads to false positives, and we need more accurate approaches. Even though a vulnerability exists in a library, the application may not call the vulnerable code. For example, when we use the pandas library, we don't use all of its functions, only a few. Similarly, an application will only use a few functions of rencode. If the functions the application calls are not vulnerable, the application is not vulnerable, so at the version level we can't accurately say whether the application is vulnerable at all. This problem with version-level analysis leads to more accurate approaches like function-level analysis. In the same example, the three relevant versions are expanded down to the function level. The left side shows the case where the application is not vulnerable, and the right side shows the case where it is. On the left, following the arrows from the application's blue boxes, no path leads to a vulnerable function.

The vulnerable function is load. There is no path by which the application reaches load; the application never calls the vulnerable function in the library. On the right side, the situation is different: a function in the application calls the load function, which is vulnerable. Since the application calls the vulnerable function, we say the application is vulnerable. It's simple, but going from the version level down to the function level is costly in performance, even though it is accurate. A single version of a library like Keras or pandas can have hundreds of thousands of functions. It is infeasible to compute reachability of vulnerable functions for all of them across open source, which is a huge ecosystem of libraries.
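The reachability check itself is just a graph search over the call graph. A minimal sketch with an invented call graph; function names like `rencode.load` are illustrative labels for this example, not the library's real API:

```python
from collections import deque

# Invented call graph: each function maps to the functions it calls.
call_graph = {
    "app.main": ["app.decode", "pandas.read_csv"],
    "app.decode": ["rencode.dump"],
    "rencode.dump": [],
    "rencode.load": [],   # the vulnerable function in this example
    "pandas.read_csv": [],
}

def can_reach(graph, start, target):
    """Breadth-first search: can `start` reach `target` in the call graph?"""
    seen, queue = {start}, deque([start])
    while queue:
        fn = queue.popleft()
        if fn == target:
            return True
        for callee in graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False

# Left-hand case: no path to the vulnerable function, so not vulnerable.
print(can_reach(call_graph, "app.main", "rencode.load"))  # False

# Right-hand case: one edge changes and the application becomes vulnerable.
vulnerable_graph = {**call_graph, "app.decode": ["rencode.load"]}
print(can_reach(vulnerable_graph, "app.main", "rencode.load"))  # True
```

The cost the talk describes comes from running this search over call graphs with hundreds of thousands of nodes per library version, not from the search algorithm itself.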

How do we tackle that? My research is more targeted: we develop offline, targeted analyses to enable a quick online application analysis. The first two steps are similar to what I explained before. Once we collect the relevant versions, we build a function call graph over them and do the reachability analysis, which determines whether the application actually calls the vulnerable function. This approach locates only the essential functions associated with the vulnerability and calculates the risk to get the overall risk of the vulnerable application. Using this offline data, we do a one-hop, selective analysis at the online level. Instead of computing everything online, we only use pre-processed offline data, which is why it is real-time and lightweight. Being real-time and lightweight matters, because we don't have all the time in the world before data breaches happen. This is why we are currently developing these efficient techniques for real-time assessment of open source.
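The offline/online split might be sketched like this. Note this is an assumed design for illustration, not the actual research implementation, and the precomputed table is invented:

```python
# Offline phase (run once per library version): precompute, for each
# library version, the set of entry-point functions from which a
# known-vulnerable function is reachable. Contents invented.
vulnerable_entry_points = {
    ("rencode", "1.0.6"): {"rencode.load"},
    ("rencode", "1.0.7"): set(),   # patched: nothing reaches vulnerable code
}

def quick_online_check(app_calls, resolved_deps):
    """Online phase: a one-hop check that intersects the functions the
    application calls with each dependency's precomputed risky entry points,
    instead of re-running the full call-graph analysis."""
    findings = []
    for dep in resolved_deps:
        risky = vulnerable_entry_points.get(dep, set()) & app_calls
        if risky:
            findings.append((dep, sorted(risky)))
    return findings

# The app calls rencode.load through a vulnerable resolved version,
# so the cheap online check flags it.
print(quick_online_check({"rencode.load", "pandas.read_csv"},
                         [("rencode", "1.0.6")]))
```

The expensive reachability work happens once offline per library version; the online step is a set intersection, which is what makes the assessment real-time and lightweight.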
