Michael edited this page Jun 7, 2018 · 13 revisions

Welcome to the atarashi wiki!

atarashi ~ 新しい (あたらしい)

Description of the approach

The main goal is to detect one license (or more, but that is left for later) in a text file. For this purpose, a collection of license text files exists.

Detection here means determining the particular licensing of a file. Merely detecting that some license is present is not enough.

Therefore we have a text file f and a set of licenses L, containing license texts l_i. (Markdown has no subscript notation for ordinary letters, so l_i stands for l with index i.)

In order to determine license-relevant texts and also particular licenses, there will be

  • A collection W of all words found in all l_i ∈ L (that is, in every license text of the set)
  • A collection V of all words used in normal English language, taken from some data material from the Internet
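
As an illustration, the collection W could be built roughly as follows. This is a minimal sketch, not the project's actual code: the tokenization rule, the function names `tokenize` and `license_vocabulary`, and the two toy license texts are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words.

    Assumed rule: runs of letters/digits, optionally joined by hyphens,
    so that a name like "info-zip" stays one word.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def license_vocabulary(licenses):
    """Collect the set W of all words occurring in any license text l_i of L."""
    vocabulary = set()
    for text in licenses.values():
        vocabulary.update(tokenize(text))
    return vocabulary

# Toy stand-ins for the license collection L (hypothetical texts).
licenses = {
    "license-a": "Redistribution and use in source and binary forms",
    "license-b": "Permission to use, copy, and distribute this software",
}
W = license_vocabulary(licenses)
```

The collection V could be built the same way from a general English corpus. Which corpus to use is left open here, as it is in the description above.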

Steps

The following steps are then applied to a file in order to compare it with all license files:

  • For a given f, the frequency of each word (how often it occurs in f) is calculated
  • From these frequencies, a score is calculated for every l_i of L:
    • For every word found in f, meaning a loop over all words:
      • Take the minimum of two counts: first the occurrences of the word in the file, second its occurrences in the current l_i of L
      • Then apply a coefficient (see Weight Calculation below) to that minimum
      • This results in a value
    • Summing up all values results in a score for the relation between f and the current l_i
  • All scores form a list; the highest score refers to the best-matching l_i
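
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tokenization, the function names, and the toy license texts are hypothetical, and the default `weight` is a constant placeholder of 1.0 standing in for the coefficient described under Weight Calculation.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word occurs (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))

def score(name, file_freq, license_freq, weight):
    """Sum weight * min(count in f, count in l_i) over all words found in f."""
    total = 0.0
    for word, count_in_file in file_freq.items():
        overlap = min(count_in_file, license_freq.get(word, 0))
        total += weight(word, name) * overlap
    return total

def rank_licenses(file_text, licenses, weight=lambda word, name: 1.0):
    """Score f against every l_i and return the best match plus all scores."""
    file_freq = word_frequencies(file_text)
    scores = {
        name: score(name, file_freq, word_frequencies(text), weight)
        for name, text in licenses.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-ins for the license collection L (hypothetical texts).
licenses = {
    "A": "permission is granted to copy and distribute",
    "B": "redistribution in source and binary forms is permitted",
}
best, scores = rank_licenses(
    "Redistribution and use in source and binary forms", licenses
)
print(best, scores)
```

Taking the minimum caps the contribution of a word at the number of times the license itself uses it, so a file that repeats one license word many times does not inflate its score.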

Weight Calculation

A coefficient applies a weight to each individual word. It reflects two questions:

  • Is the word license-relevant in general, as opposed to normal English language (or, more generally, the normal context of a software distribution)?
  • Is the word relevant just for a particular license text?

The coefficient is then greater than zero for a generally license-relevant word. It will have a high value for a word that matches a particular license text (e.g. info-zip) and a low value for terms that are popular in the specific domain of licensing (e.g. distribution).

For the actual calculation of this coefficient, tf-idf (https://en.wikipedia.org/wiki/Tf–idf) is considered.
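
One possible way to derive such a coefficient from tf-idf is sketched below. The details are assumptions rather than the project's chosen formula: term frequency is taken as the word count divided by the license length, and a smoothed inverse document frequency, log((1 + N) / (1 + df)) + 1, is used so that every word occurring in a license keeps a weight above zero. The function names and toy texts are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into words, keeping hyphenated names together."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def tfidf_weights(licenses):
    """Return {license name: {word: tf-idf coefficient}} for all l_i in L."""
    counts = {name: Counter(tokenize(text)) for name, text in licenses.items()}
    n_licenses = len(counts)

    # Document frequency: in how many license texts does each word occur?
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    weights = {}
    for name, word_counts in counts.items():
        total_words = sum(word_counts.values())
        weights[name] = {
            # tf * smoothed idf (assumed variant, always greater than zero)
            word: (count / total_words)
            * (math.log((1 + n_licenses) / (1 + doc_freq[word])) + 1.0)
            for word, count in word_counts.items()
        }
    return weights

# Toy stand-ins for the license collection L (hypothetical texts).
licenses = {
    "first": "info-zip distribution terms",
    "second": "distribution terms apply",
}
weights = tfidf_weights(licenses)
print(weights["first"])
```

In this toy collection, "info-zip" occurs in only one text and so receives a higher coefficient than "distribution", which occurs in both. Words that appear in no license text have no entry and can be treated as weight zero.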
