The need

Bugs are a fact of life in software development. The later a defect is found in the development lifecycle, the more it costs to fix. If a defect is found after deployment, customers are impacted and developers must spend additional time reproducing the issue before they can issue a fix. This cycle of bug, deployment, analysis and fix is time-consuming and costly.

The idea

Certain patterns in the software project’s code base carry a higher risk of introducing a bug. A classification algorithm can learn these patterns and predict the likelihood that a file contains a bug. This allows earlier discovery of defects, reducing the cost of fixing them.

The solution

Custom classification models are created for GitHub projects, based on metadata associated with the historical commits. When Code Defect AI discovers new developer commits, it predicts if any files in the commit are at risk for defects. The rationale behind the prediction is presented using Local Interpretable Model-Agnostic Explanations (LIME) so that developers can trust and learn from the prediction.

Technical details for the Code Defect AI experiment

Supervised learning allows algorithms to predict an output based on historical examples of input-output pairs, i.e. labelled data. A supervised learning problem is termed a classification problem if the output variable is discrete. Certain patterns in the software project’s code base carry a higher risk of introducing a bug. For example, the risk is higher if a new developer is making changes to a file that historically has a higher incidence of bugs, if a commit involves files across multiple directories, or if the code update is spread across multiple regions in the file. These patterns can be learnt by a classification algorithm to predict the likelihood that a file in a commit contains a bug. This shifts defect discovery left, i.e. earlier in the lifecycle, thus reducing the cost of fixing defects.
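To make the kinds of patterns described above concrete, the following is a minimal sketch of how per-file features might be derived from commit metadata. It is illustrative only and is not the feature set Code Defect AI actually uses: the `Commit` and `FileChange` types, the `commit_features` function and the three feature names are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileChange:
    path: str   # repository-relative path of the changed file
    hunks: int  # number of separate changed regions in the file


@dataclass(frozen=True)
class Commit:
    author: str
    files: tuple  # tuple of FileChange


def commit_features(commit, history):
    """Build one feature row per file in `commit`.

    `history` holds earlier commits. All feature names are hypothetical.
    """
    # How many distinct directories does this commit touch?
    directories = {str(PurePosixPath(f.path).parent) for f in commit.files}
    rows = {}
    for change in commit.files:
        # How often has this author changed this file before?
        prior = sum(
            1
            for past in history
            if past.author == commit.author
            and any(f.path == change.path for f in past.files)
        )
        rows[change.path] = {
            "author_prior_commits_to_file": prior,
            "directories_in_commit": len(directories),
            "changed_regions": change.hunks,
        }
    return rows


# Usage with made-up data: bob has never touched src/a.py before.
history = [
    Commit("alice", (FileChange("src/a.py", 1),)),
    Commit("alice", (FileChange("src/a.py", 2), FileChange("docs/x.md", 1))),
]
new_commit = Commit("bob", (FileChange("src/a.py", 4), FileChange("tests/t.py", 1)))
features = commit_features(new_commit, history)
```

Rows like these, paired with a label saying whether the file later turned out to contain a bug, would form the labelled data that a classifier is trained on.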

Three custom classification models have been created for three GitHub projects, based on metadata associated with the historical commits. Labelled data for training each model has been created using the metadata collected from the GitHub repository. When Code Defect AI discovers new developer commits, it obtains the metadata for each of the commits and for the files in each commit. It then uses the project-specific model to predict whether any of the files in the commit carry a risk of having a bug. Many machine learning models behave as black boxes: the rationale behind a prediction is not readily available. We present the rationale behind the prediction using Local Interpretable Model-Agnostic Explanations (LIME) so that users develop greater trust in the prediction.
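The idea behind a LIME-style explanation can be sketched as follows: perturb the input around the instance being explained, query the black-box model on each perturbed sample, weight the samples by their proximity to the instance, and fit a weighted linear model whose coefficients serve as local feature weights. The code below is a simplified illustration of that idea in plain Python, not the `lime` package and not Code Defect AI's implementation. The `black_box_risk` model, its coefficients and the three features (changed regions, directories touched, author's prior commits) are invented for the example.

```python
import math
import random


def black_box_risk(features):
    """Hypothetical stand-in for a trained per-project classifier."""
    changed_regions, directories, prior_commits = features
    z = 0.6 * changed_regions + 0.9 * directories - 0.4 * prior_commits - 1.0
    return 1.0 / (1.0 + math.exp(-z))  # probability-like risk score


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def explain_locally(predict, instance, n_samples=500, scale=1.0,
                    kernel_width=2.0, seed=0):
    """Return one local weight per feature from a weighted linear surrogate."""
    rng = random.Random(seed)
    k = len(instance) + 1  # intercept plus one coefficient per feature
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for _ in range(n_samples):
        # Perturb the instance with Gaussian noise.
        sample = [v + rng.gauss(0.0, scale) for v in instance]
        offsets = [s - v for s, v in zip(sample, instance)]
        # Samples closer to the instance get a larger weight.
        weight = math.exp(-sum(d * d for d in offsets) / kernel_width ** 2)
        row = [1.0] + offsets
        y = predict(sample)
        # Accumulate the weighted normal equations X^T W X and X^T W y.
        for i in range(k):
            xtwy[i] += weight * row[i] * y
            for j in range(k):
                xtwx[i][j] += weight * row[i] * row[j]
    coefficients = solve_linear_system(xtwx, xtwy)
    return coefficients[1:]  # drop the intercept


# Which features push this (made-up) file towards a risky prediction?
weights = explain_locally(black_box_risk, [3.0, 2.0, 1.0])
```

A positive weight means that increasing the feature raises the predicted risk near this instance, and a negative weight means it lowers it. The real LIME method adds steps omitted here, such as feature discretization and selecting a small number of features to display.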
