Aman Jain edited this page Jun 8, 2018 · 13 revisions

Welcome to the atarashi wiki!

atarashi ~ 新しい (あたらしい)

Description of the approach

The main goal is to detect a license text in a text file (detecting more than one is left for later). For this purpose, a collection of license text files exists.

The idea of detection is to determine the particular license that applies to a file. Merely detecting that some license is present is not enough.

Therefore we have a text file f and a set of licenses L containing license texts l_i. (Markdown cannot render subscript or index notation on normal letters, so l_i stands for the i-th license text.)

In order to determine license-relevant texts and also particular licenses, there will be

  • A collection W of all words found in any license text l_i ∈ L (that is, for all l_i that are elements of L)
  • A collection V of all words used in normal English, taken from some data material from the Internet
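The two collections can be sketched in Python as plain sets. This is a minimal illustration, not the project's implementation: the `tokenize` and `license_vocabulary` helpers, the two shortened license snippets, and the tiny stand-in for V are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def license_vocabulary(licenses):
    """Return W: the set of all words found in any license text l_i in L."""
    words = set()
    for text in licenses.values():
        words.update(tokenize(text))
    return words


# Hypothetical, heavily shortened license texts standing in for L.
licenses = {
    "MIT": "Permission is hereby granted, free of charge",
    "Zlib": "This software is provided 'as-is', without any warranty",
}

W = license_vocabulary(licenses)

# Hypothetical stand-in for V; a real V would come from a large English corpus.
V = {"is", "this", "of", "any", "free", "without"}

# Words that occur in license texts but not in the general-English sample.
license_specific = W - V
```

Comparing W against V in this way gives a first impression of which words are license-relevant at all, before any weighting is applied.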

Steps

The following steps are then applied to a file to compare it with all license files:

  • For a given f, the frequency of each word is calculated
  • From these frequencies, a score is calculated for every l_i of L
    • For every word found in f:
      • Take the minimum of two counts: the occurrences of the word in the file and its occurrences in the current l_i
      • Apply a TF-IDF coefficient to that minimum (see below for how the weight is calculated)
      • This results in a value
    • Summing up all values results in a score for the relation between f and the current l_i
  • All scores form a list; the highest score refers to the best-matching l_i
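The steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the two toy license texts, and the hand-picked per-license weights are hypothetical and only stand in for real TF-IDF coefficients.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase alphabetic word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def score(file_freq, license_freq, weight):
    """Sum min(count in f, count in l_i) * weight over all words of f."""
    total = 0.0
    for word, count in file_freq.items():
        overlap = min(count, license_freq.get(word, 0))
        total += overlap * weight.get(word, 0.0)
    return total


def best_match(file_text, licenses, weights):
    """Score f against every l_i and return the best name plus all scores."""
    file_freq = word_frequencies(file_text)
    scores = {
        name: score(file_freq, word_frequencies(text), weights[name])
        for name, text in licenses.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data; real weights would be computed, not written by hand.
licenses = {
    "A": "permission granted free of charge",
    "B": "software provided without warranty",
}
weights = {
    "A": {"permission": 2.0, "granted": 2.0, "free": 1.0, "of": 0.25, "charge": 1.5},
    "B": {"software": 1.0, "provided": 1.0, "without": 0.5, "warranty": 2.0},
}

name, scores = best_match(
    "Permission is granted, free of charge, to any person", licenses, weights
)
```

Taking the minimum of the two counts means that a word repeated many times in the file cannot contribute more than it does in the license text itself.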

Weight Calculation

We use a coefficient to weight the individual words. It reflects two questions:

  • Is the word license-relevant in general, as opposed to normal English language (or, more generally, the normal context of a software distribution)?
  • Is it relevant just for a particular license text?

The coefficient is then greater than zero for a generally license-relevant word. It will have a high value for a word that matches a particular license text (e.g. info-zip) and a low value for terms that are popular in the specific domain of licensing (e.g. distribution).

For the actual calculation of this coefficient, the term frequency is required. At the same time, to recognize a particular license text, the inverse document frequency is required to distinguish license-specific terms from general license text terms. Thus, the coefficient is calculated using the tf-idf statistic (https://en.wikipedia.org/wiki/Tf–idf).
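One possible way to compute such a coefficient is sketched below. It assumes the variant idf = ln(N / df) + 1, so that terms occurring in every license keep a small positive weight; the page does not say which tf-idf variant is intended. The `tf_idf` helper and the two toy license texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(licenses):
    """Return {license name: {word: tf-idf weight}} for all license texts."""
    counts = {name: Counter(tokenize(text)) for name, text in licenses.items()}
    n_docs = len(counts)

    # Document frequency: in how many license texts does each word occur?
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    weights = {}
    for name, word_counts in counts.items():
        total = sum(word_counts.values())
        weights[name] = {
            # tf = relative frequency; idf = ln(N / df) + 1 (assumed variant).
            word: (n / total) * (math.log(n_docs / doc_freq[word]) + 1.0)
            for word, n in word_counts.items()
        }
    return weights


# Hypothetical toy license texts.
licenses = {
    "Info-ZIP": "info zip distribution of the software",
    "MIT": "distribution of the software without restriction",
}
w = tf_idf(licenses)
```

In this toy data, "info" occurs in only one license and therefore receives a higher weight than "distribution", which occurs in both.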

Open Points

  • Would we need normalization as used in text mining, i.e. making the statistics independent of document length? This could be good or unfortunate, because the length of a license text is also a characteristic.
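
To illustrate the trade-off, the sketch below compares raw word counts with length-normalized relative frequencies. The helper names and the toy word lists are assumptions for this example only.

```python
from collections import Counter


def raw_frequencies(words):
    """Absolute word counts; these grow with document length."""
    return Counter(words)


def relative_frequencies(words):
    """Word counts divided by document length; independent of length."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical documents: the long one repeats the short one ten times.
short_doc = ["license", "free"]
long_doc = ["license", "free"] * 10

raw_short = raw_frequencies(short_doc)
raw_long = raw_frequencies(long_doc)
rel_short = relative_frequencies(short_doc)
rel_long = relative_frequencies(long_doc)
```

After normalization the two documents become indistinguishable, which shows both the benefit (length independence) and the cost (the length characteristic is lost).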