Home

Welcome to the atarashi wiki!

atarashi ~ 新しい（あたらしい）

Description of the approach

The main goal is to detect one license (or more, but that is for later) in a text file. For this, a collection of license text files exists.

The idea of detection is to actually determine the particular licensing of a file. Merely detecting that some license is present is not enough.

Therefore we have a text file f and a set of licenses L, containing license texts l_i. (Obviously markdown cannot do subscript / index notation on normal letters.)

In order to determine license-relevant texts and also particular licenses, there will be:

- A collection W of all words found in any of the license texts (∀ l_i ∈ L)
- A collection V of all words used in normal English language, taken from some data material from the Internet

Steps

The following steps are then applied to a file to compare it with all license files:

- For a given f, a so-called frequency of words is calculated
- From this frequency, a score is calculated for each l_i of L in a loop
  - For every word found in f:
    - Take the minimum of two counts: the occurrences of the word in f and its occurrences in the current l_i of L
    - Then a coefficient (see below how to calculate the weight) is applied to that minimum
    - This results in a value
  - Summing up all values results in a score for the relation of f and the current l_i
- All scores form a list; the highest score then refers to the best matching l_i
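
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the atarashi implementation: the function names (`word_frequency`, `score`, `best_match`), the simple regular-expression tokenizer, and the fallback to a coefficient of 1.0 when no weights are supplied are all assumptions made for the example.

```python
import re
from collections import Counter


def word_frequency(text):
    """Count how often each lowercase word occurs in a text (naive tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def score(file_freq, license_freq, weight=None):
    """Sum over all words in f of coefficient * min(count in f, count in l_i).

    `weight` maps word -> coefficient. If it is None, every word gets the
    placeholder coefficient 1.0 (an assumption for this sketch).
    """
    total = 0.0
    for word, count in file_freq.items():
        coefficient = 1.0 if weight is None else weight.get(word, 0.0)
        # A Counter returns 0 for words the license text does not contain.
        total += coefficient * min(count, license_freq[word])
    return total


def best_match(file_text, licenses, weights=None):
    """Score f against every license text and return (best name, all scores).

    `licenses` maps a license name to its text; `weights` optionally maps a
    license name to its word -> coefficient dictionary.
    """
    weights = weights or {}
    file_freq = word_frequency(file_text)
    scores = {
        name: score(file_freq, word_frequency(text), weights.get(name))
        for name, text in licenses.items()
    }
    return max(scores, key=scores.get), scores
```

With two toy license texts, a file that shares more words with one of them gets the higher score for it:

```python
licenses = {
    "MIT": "permission is hereby granted free of charge to any person",
    "GPL": "this program is free software you can redistribute it",
}
name, scores = best_match("Permission is hereby granted, free of charge", licenses)
print(name, scores)
```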

Weight Calculation

A coefficient assigns a weight to each individual word. It reflects two questions:

- Is the word license relevant in general, as opposed to normal English language (or, more generally, the normal context of a software distribution)?
- Is the word relevant just for a particular license text?

The coefficient is then greater than zero for a generally license-relevant word. It will have a high value for a word that matches a particular license text (e.g. info-zip) and a low value for terms that are popular in the specific domain of licensing (e.g. distribution).

For the actual calculation of this coefficient, tf-idf (https://en.wikipedia.org/wiki/Tf–idf) is considered.
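
As a rough illustration of how such a coefficient could be derived, the sketch below computes a tf-idf weight per word and license text. The wiki does not state which tf-idf variant is meant, so the smoothed idf term `log((1 + N) / (1 + df)) + 1` is an assumption, chosen here because it keeps every weight above zero. The function name `tf_idf_weights` is hypothetical, and the collection V of normal English words is not used in this sketch.

```python
import math
from collections import Counter


def tf_idf_weights(license_freqs):
    """Compute a tf-idf coefficient for every word of every license text.

    `license_freqs` maps a license name to a Counter of its words.
    Returns a mapping: license name -> {word: coefficient}.
    """
    n_licenses = len(license_freqs)

    # Document frequency: in how many license texts does each word occur?
    doc_freq = Counter()
    for freq in license_freqs.values():
        doc_freq.update(freq.keys())

    weights = {}
    for name, freq in license_freqs.items():
        total_words = sum(freq.values())
        weights[name] = {
            # Term frequency times smoothed inverse document frequency
            # (the smoothing is an assumption of this sketch).
            word: (count / total_words)
            * (math.log((1 + n_licenses) / (1 + doc_freq[word])) + 1)
            for word, count in freq.items()
        }
    return weights
```

In a toy example, a word that occurs in only one license text ends up with a higher coefficient than a word shared by all of them:

```python
freqs = {
    "info-zip": Counter({"info": 2, "zip": 2, "distribution": 1}),
    "mit": Counter({"permission": 1, "distribution": 1}),
}
w = tf_idf_weights(freqs)
print(w["info-zip"]["zip"] > w["info-zip"]["distribution"])
```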