A project report of GSoC'22 on "Improve Minerva OSS Dataset and implement models for Atarashi"

its-sushant/GSoC-22

Google Summer of Code 2022

GSoC @ FOSSology

Improve Minerva OSS Dataset and implement models for Atarashi

PROJECT OVERVIEW

In GSoC 2021, the Minerva Dataset was created to train machine learning models that predict license shortnames for Atarashi. Atarashi currently has four active agents for detecting license statements in source code, and the highest accuracy among them is 62%, from the tfidf agent. This summer I trained several machine learning and deep learning models on the Minerva Dataset and created Atarashi agents for the trained models. The highest accuracy so far is 63%, reached by both the LogisticRegression and LinearSVC agents that I implemented.

CONTRIBUTIONS

1. Atarashi agent based on Logistic regression.

The goal was to create an Atarashi agent for a logistic regression model trained on the Minerva Dataset. The model was trained in a Kaggle notebook.

Results:

The accuracy score below is for the agent created in Atarashi, as reported by evaluator.py.

  • Accuracy of agent
 Total files scanned = 100
 Successfully matched = 63

      ++++++++++++++++++ Result ++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
      ---> Total time elapsed: 2.76 Seconds  <---
      ---> Accuracy: 63.0%                     <---
      ++++++++++++++++++++++++++++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
  • Result from agent:
{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 1.0,
      "sim_type": "logisticRegression"
    }
  ]
}
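The accuracy figure above is the number of successfully matched files divided by the total number of files scanned. The sketch below reproduces that arithmetic; it is not the code of evaluator.py, and the function name `accuracy_percent` and the sample labels are assumptions made for illustration.

```python
def accuracy_percent(predicted, expected):
    """Return the percentage of files whose predicted shortname equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matched = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * matched / len(expected)


# Hypothetical labels: 100 files, 63 of which receive the correct shortname.
expected = ["Apache-2.0"] * 100
predicted = ["Apache-2.0"] * 63 + ["MIT"] * 37

print(f"Accuracy: {accuracy_percent(predicted, expected)}%")  # Accuracy: 63.0%
```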

2. Atarashi agent based on Linear Support Vector Machine.

The goal was to create an Atarashi agent for a linear support vector machine model trained on the Minerva Dataset.

Results:

The accuracy score below is for the agent created in Atarashi, as reported by evaluator.py.

  • Accuracy of agent
 Total files scanned = 100
 Successfully matched = 63

      ++++++++++++++++++ Result ++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
      ---> Total time elapsed: 2.06 Seconds  <---
      ---> Accuracy: 63.0%                     <---
      ++++++++++++++++++++++++++++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
  • Result from agent:
{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 1.0,
      "sim_type": "linearsvc"
    }
  ]
}
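A consumer of this JSON output might want only the best-scoring license. The following is a minimal sketch that assumes the structure shown above (`results` entries with `shortname` and `sim_score`); the helper name `best_match` is hypothetical and is not part of Atarashi.

```python
import json


def best_match(agent_output):
    """Return (shortname, sim_score) of the highest-scoring result, or None if there is none."""
    results = json.loads(agent_output).get("results", [])
    if not results:
        return None
    top = max(results, key=lambda result: result["sim_score"])
    return top["shortname"], top["sim_score"]


# The agent output shown above, as a JSON string.
sample = """
{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 1.0,
      "sim_type": "linearsvc"
    }
  ]
}
"""

print(best_match(sample))  # ('Apache-2.0', 1.0)
```

The same helper works for agents that return several candidates, because it picks the entry with the largest `sim_score`.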

3. Okapi BM25 agent

Implementing Okapi BM25 was not part of the original plan, but we decided to create an agent for it to check how BM25 works and how accurate it is. The implementation of the agent is based on this wiki.

Results:

The accuracy score below is for the agent created in Atarashi, as reported by evaluator.py.

  • Accuracy of agent:
 Total files scanned = 100
 Successfully matched = 62

      ++++++++++++++++++ Result ++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
      ---> Total time elapsed: 19.04 Seconds  <---
      ---> Accuracy: 62.0%                     <---
      ++++++++++++++++++++++++++++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
  • Result from agent:
{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "ECL-2.0",
      "sim_score": 36.85958665693663,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 36.58521980445177,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "SCEA",
      "sim_score": 36.321346243985616,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "Flora",
      "sim_score": 35.987182420391704,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "Flora-1.1",
      "sim_score": 35.987182420391704,
      "sim_type": "bm25"
    }
  ]
}
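To make the ranked `sim_score` values above easier to interpret, here is a minimal sketch of the standard Okapi BM25 scoring function in plain Python. It is not the agent's implementation: the class name `BM25`, the parameters `k1=1.5` and `b=0.75`, the non-negative IDF variant, the tokenizer, and the tiny license snippets are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Okapi BM25 over a small in-memory corpus of {shortname: license text}."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.names = list(corpus)
        self.docs = [Counter(tokenize(corpus[name])) for name in self.names]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avgdl = sum(self.lengths) / len(self.docs)
        n_docs = len(self.docs)
        # Document frequency: in how many documents each term occurs.
        doc_freq = Counter(term for doc in self.docs for term in doc)
        # Non-negative IDF variant: ln((N - df + 0.5) / (df + 0.5) + 1).
        self.idf = {
            term: math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            for term, df in doc_freq.items()
        }

    def score(self, query, index):
        """BM25 score of the document at `index` for the given query text."""
        doc, length = self.docs[index], self.lengths[index]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avgdl)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def rank(self, query, top_n=5):
        """Return the top_n (shortname, score) pairs, best first."""
        scored = [(name, self.score(query, i)) for i, name in enumerate(self.names)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]


# Hypothetical, heavily shortened license snippets.
corpus = {
    "Apache-2.0": "licensed under the apache license version 2.0",
    "MIT": "permission is hereby granted free of charge to any person",
    "GPL-3.0": "gnu general public license version 3",
}

ranking = BM25(corpus).rank("Licensed under the Apache License, Version 2.0")
for shortname, score in ranking:
    print(shortname, round(score, 3))
```

Unlike the classifier agents, BM25 scores are unbounded relevance values rather than probabilities, which is why the agent output above lists several candidates with scores around 36 instead of a single result with a score of 1.0.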

4. Packaging of trained models

The models trained on the Minerva Dataset had to be made available to Atarashi for predicting license shortnames. There were two ideas for doing so:

  • The first idea was to both train and test the models in Atarashi, so that the Atarashi codebase would also contain the binary files of the trained models. The Atarashi agent for a given model would then predict the license shortname from the binary file generated by training.
  • The second idea was to train the models in the Minerva Dataset repository itself, build a Python package for each trained model, and import that package into the Atarashi agent to predict the license shortname.

After discussing both options, we concluded that the second idea is the better one: keeping the binary files in the Atarashi codebase would eventually increase memory usage and might slow the software down. Packaging the models also lets anyone use them for their own purposes.

Packages:

  1. Linear support vector machine package
  2. Logistic regression package

Results:

(installing) (base) shushant@sushant-device:~$ pip install linearsvc
Collecting linearsvc
  Using cached linearsvc-1.0.1-py3-none-any.whl (12.8 MB)
Installing collected packages: linearsvc
Successfully installed linearsvc-1.0.1
(installing) (base) shushant@sushant-device:~$ pip install logreg
Collecting logreg
  Using cached logreg-0.1.0-py3-none-any.whl (46.6 MB)
Installing collected packages: logreg
Successfully installed logreg-0.1.0

📚 NOTEBOOKS

MAJOR PULL REQUESTS

👨🏻‍🏫 DELIVERABLES

| Tasks | Status | Links |
| --- | --- | --- |
| Logistic agent | Training and testing of the model are done | Agent, Model |
| Linearsvc agent | Training and testing of the model are done | Agent, Model |
| Okapi-BM25 agent | Implementation of the agent is done | Agent |
| Doc2vec Model | Training of the model is done; testing is pending | Notebook |
| Bert Model | Training of the model is done; testing is pending | Notebook |

REACH OUT TO ME!
