Merge branch 'main' into doc/updates6
Signed-off-by: Aaditya Singh <[email protected]>
Aaditya-Singh78 committed Jul 17, 2024
2 parents e8733ae + e42c723 commit 20e1fa6
Showing 46 changed files with 2,792 additions and 39 deletions.
66 changes: 66 additions & 0 deletions docs/2024/ci-scanner/updates/2024-07-11.md
@@ -0,0 +1,66 @@
---
title: Week 6
author: Rajul Jha
tags: [gsoc24, CI]
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2024 Rajul Jha <rajuljha49gmail.com>
-->

# Week 6
*(July 05, 2024 - July 11, 2024)*

## Meeting 1
*(July 10, 2024)*

## Attendees
* [Rajul Jha](https://github.com/rajuljha)
* [Shaheem Azmal](https://github.com/shaheemazmalmmd)
* [Kaushlendra](https://github.com/Kaushl2208)
* [Avinal Kumar](https://github.com/avinal)

## Discussions
* Discussed the progress and completion of [#PR2785](https://github.com/fossology/fossology/pull/2785), which adds the relevant byte information to the nomos scanner's JSON output.
* Received mentor review on [#PR2784](https://github.com/fossology/fossology/pull/2784), which lets the user pass a custom `allowlist.json` file to allow certain licenses or keywords.
* Gave the mentors a demo of how the GitHub Action for FOSSology scanners works. I studied `docker` actions as well as `composite` actions and chose composite actions because:
  * **Emulation on our end**: Composite actions let us run multiple steps in a job, which makes it easier to set up the QEMU emulator for cross-platform image support.
  * **Uploading artifacts**: With a composite action, the user does not need a separate step to upload the results as an artifact, because that step can live inside the action itself. The user only provides the `report_format` key to tell the script which report to generate, which keeps the user's workflow free of clutter.


## Work Done
* Completed the allowlist functionality and opened [#PR2784](https://github.com/fossology/fossology/pull/2784) for it.
* The user can now pass an `allowlist.json` file in a particular [format](https://github.com/fossology/fossology/blob/master/utils/automation/allowlist.sample.json), like this:
```json
{
  "licenses": [
    "GPL-2.0-or-later",
    "GPL-2.0-only",
    "LGPL-2.1-or-later"
  ],
  "exclude": [
    "*/agent_tests/*",
    "src/vendor/*"
  ]
}
```
* The script first looks for the allowlist file passed by the user. If that is not found, it looks for an `allowlist.json` file in the root directory. If that is not found either, it looks for `whitelist.json`. If none of these exist, it populates an empty dictionary with `licenses` and `exclude` keys.
The decision tree looks like this:

![Screenshot](/img/ci/Whitelist_decision_tree.png)
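
The lookup order described above can be sketched in a few lines of Python. This is only an illustration, not the code from the PR: the function name `load_allowlist` and its parameters are hypothetical, and the key names follow the sample file shown earlier.

```python
import json
from pathlib import Path
from typing import Optional


def load_allowlist(user_path: Optional[str] = None, root: str = ".") -> dict:
    """Return the allowlist dictionary using the fallback order described above.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    candidates = []
    if user_path:
        candidates.append(Path(user_path))            # 1. file passed by the user
    candidates.append(Path(root) / "allowlist.json")  # 2. allowlist.json in the root
    candidates.append(Path(root) / "whitelist.json")  # 3. older whitelist.json name

    for candidate in candidates:
        if candidate.is_file():
            with candidate.open(encoding="utf-8") as handle:
                return json.load(handle)

    # 4. Nothing found: fall back to an empty allowlist.
    return {"licenses": [], "exclude": []}
```

Keeping the candidates in an ordered list makes the priority explicit and keeps the final fallback in one place.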

* As discussed and resolved in the previous meeting, the `start`, `end`, and `len` information has been added to the nomos JSON output in [#PR2785](https://github.com/fossology/fossology/pull/2785).

![Screenshot](/img/ci/Nomos_json_output.png)
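
One way a consumer could use the byte information is to slice the matched text back out of the scanned file. The sketch below assumes that `start` is a 0-based byte offset and `len` a byte count. The helper `matched_text` and the sample match entry are hypothetical; only the key names `start` and `len` come from the update above.

```python
def matched_text(file_bytes: bytes, start: int, length: int) -> bytes:
    """Return the bytes covered by a match that begins at `start` and spans `length` bytes."""
    if start < 0 or length < 0 or start + length > len(file_bytes):
        raise ValueError("match lies outside the file")
    return file_bytes[start:start + length]


# Hypothetical scanned content and match entry, for illustration only.
source = b"/* SPDX-License-Identifier: MIT */\nint main(void) { return 0; }\n"
match = {"start": 3, "len": 28}
print(matched_text(source, match["start"], match["len"]))
```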

* Started working on the line number part for `nomos` and `ojo` scanners.

* Researched the different types of GitHub Actions and decided to go with [`composite` actions](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action), as they make it easier to customize our action.
* Implemented a demo GitHub Action for testing and demonstrated it to the mentors.
![Screenshot](/img/ci/foss-action-test.png)

## Planning for next week
* Complete the action and test all cases and boundary conditions.
* Once the action is complete, decide on a name for it and publish it to the GitHub Marketplace.
* After that, resume work on the line number part for the `nomos` and `ojo` scanners.
2 changes: 1 addition & 1 deletion docs/2024/index.md
@@ -26,7 +26,7 @@ More info to come here.
| :--------------------------------------------------- | :----------------------------------------------------------- |
| [Aaditya Singh](https://github.com/aadsingh) | [Overhaul Scheduler Design](/docs/2024/scheduler) |
| [Abdelrahman Jamal](https://github.com/Hero2323) | [AI Powered License Detection](/docs/2024/license-detection) |
| [Abhishek Kumar](https://github.com/abhi-kumar17871) | [SPDX 3.0 Support](/docs/2024/spdx30) |
| [Abhishek Kumar](https://github.com/abhi-kumar17871) | [Support SPDX 3.0 Reports](/docs/2024/spdx30) |
| [Akash Sah](https://github.com/AkashSah2003) | [SPDX License Expression](/docs/2024/spdx-expression) |
| [Divij Sharma](https://github.com/dvjsharma) | [REST API Improvements](/docs/2024/rest) |
| [Rajul Jha](https://github.com/rajuljha) | [Improving CI Scanner](/docs/2024/ci-scanner) |
2 changes: 1 addition & 1 deletion docs/2024/license-detection/updates/2024-06-06.md
@@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal <email.here>

# Meeting 2

*(June 6,2023)*
*(June 6,2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
2 changes: 1 addition & 1 deletion docs/2024/license-detection/updates/2024-06-13.md
@@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal <email.here>

# Meeting 3

*(June 13,2023)*
*(June 13,2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
2 changes: 1 addition & 1 deletion docs/2024/license-detection/updates/2024-06-20.md
@@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal <email.here>

# Meeting 4

*(June 20,2023)*
*(June 20,2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
2 changes: 1 addition & 1 deletion docs/2024/license-detection/updates/2024-06-27.md
@@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal <email.here>

# Meeting 5

*(June 27,2023)*
*(June 27,2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
125 changes: 125 additions & 0 deletions docs/2024/license-detection/updates/2024-07-04.md
@@ -0,0 +1,125 @@
---
title: Week 6
author: Abdelrahman Jamal
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2024 Abdelrahman Jamal <email.here>
-->

# Meeting 6

*(July 4, 2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Bhardwaj](https://github.com/hastagAB)
- [Avinal Kumar](https://github.com/avinal)

## Discussion:

### Integration of Semantic Search with LLMs

- Initial Attempt
  - Prompt: The initial prompt focused on providing text and metadata to the LLM for license identification.
  - Issues: The LLM attempted to match all provided lines to a license, even when many lines were clearly irrelevant to licensing.

- Initial Prompt
```
[Task]
You are provided with text extracted from a file, along with potential license matches identified by a semantic search tool.
Your task is to carefully analyze the provided text and metadata to determine the actual software license(s) present in the original file.
Out of the 10 provided lines, not all matches will be correct or relevant, so focus on the most relevant lines in your analysis.
[Metadata Explanation]
The metadata provided for each line is a tuple containing four elements:
* **Line:** The actual line of text extracted from the file.
* **Potential License Match:** The name of a license that the semantic search tool believes the line might belong to.
* **License ID:** The SPDX identifier of the potential license match.
* **Matched License Text:** The specific text within the potential license that the line was matched to.
[Guidelines]
1. **License Identification:** If a license is found, clearly state its name and its corresponding SPDX identifier (e.g., MIT License, SPDX-License-Identifier: MIT). If multiple licenses are found, list them all.
2. **Evidence and Reasoning (Focus on Relevance and Clarity):**
* For each identified license, extract the specific text snippet(s) from the provided text that confirm its presence. Include surrounding context if it helps clarify the license's applicability. Prioritize the most relevant lines of text.
* Explain why the identified license is the most likely match, taking into account the potential license matches and the matched license text provided in the metadata.
* Only consider matches that are clear and obviously correct. The semantic search tool will always attempt to match lines to licenses, but these matches are not always accurate.
3. **Override Semantic Search:** If the semantic search tool's suggested match seems incorrect, feel free to disregard it and rely on your own knowledge and analysis to determine the correct license. Provide a clear explanation of why you chose a different license.
4. **Exclude Irrelevant Information:**
* Disregard copyright notices and statements and lines of code as they do not indicate the software license.
* Focus only on text that is found in licenses or clearly identifies licenses.
5. **No License Scenario:** If no license is detected in the text, explicitly state "No software license found."
6. **Ambiguity:** If the license cannot be confidently determined due to ambiguity or conflicting information, clearly state this and provide an explanation.
7. **Response Format:** Provide the results in the following format:
* **Licenses = [list of identified licenses]**
* **SPDX-IDs = [list of corresponding SPDX identifiers]**
If no licenses are found, both lists should be empty:
* **Licenses = []**
* **SPDX-IDs = []**
[Text and Metadata]
```
- Outcome: The LLM tried too hard to relate irrelevant lines to licenses, resulting in many false positives.
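
The response format requested in the prompt (`Licenses = [...]` and `SPDX-IDs = [...]`) lends itself to mechanical parsing. The following is a minimal sketch of such a parser, assuming the model follows the format literally. The function name `parse_list_field` is hypothetical and not part of the project.

```python
import re
from typing import List


def parse_list_field(response: str, field: str) -> List[str]:
    """Extract the items of a `field = [a, b]` line from a model response."""
    pattern = re.compile(rf"{re.escape(field)}\s*=\s*\[(.*?)\]", re.DOTALL)
    found = pattern.search(response)
    if found is None:
        return []
    # Split on commas and drop surrounding whitespace and quotes.
    items = (item.strip().strip("'\"") for item in found.group(1).split(","))
    return [item for item in items if item]


# Hypothetical model response in the requested format.
response = (
    "* **Licenses = [MIT License, Apache License 2.0]**\n"
    "* **SPDX-IDs = [MIT, Apache-2.0]**"
)
print(parse_list_field(response, "SPDX-IDs"))
```

Splitting on commas is a simplification: a license name that itself contains a comma would be split incorrectly, so comparing on SPDX identifiers is the safer choice.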

### Revised Approach

- Second Attempt
  - Prompt: Changed the task to identify relevant lines before determining licenses.
  - Issues: Reduced the number of irrelevant lines identified, but the problem of false positives persisted.

- Second Prompt
```
[Task]
From the following tuples, select those that are relevant to software licensing and ignore the rest.
A relevant tuple is a tuple that contains a line of text that is relevant and can be used to identify a license.
[Tuples]
Each tuple consists of three elements:
1. **Line:** The actual line of text extracted from the file. This is the element you need to evaluate for relevance to software licensing.
2. **Potential License Match:** The name of a license that the semantic search tool suggests the line might belong to (provided for reference).
3. **License ID:** The SPDX identifier of the potential license match (provided for reference).
[Guidelines]
1. **Select License-Specific Lines:** Choose only lines that:
* Explicitly mention license terms
* Directly quote from known license texts
* Include specific license references or titles.
2. **Ignore Irrelevant Lines:**
* Disregard lines that do not explicitly mention license terms.
* Ignore copyright notices, code snippets, comments, and general documentation.
* Ignore code documentation lines that seem to be documenting code or just general instructions or information.
* Do not select lines that are general descriptions, code, or comments unrelated to license terms.
3. **No License:** If no license is found, state "No software license found."
4. **Ambiguity:** If uncertain, explain the ambiguity.
5. **Response Format:**
* **Relevant Lines = [list of relevant lines]**
* **Licenses = [list of identified licenses from relevant lines]**
* **SPDX-IDs = [list of corresponding SPDX identifiers from relevant lines]**
[Text and Metadata]
```
- Outcome: The LLM still included irrelevant lines in its output, indicating a persistent issue with following the prompt guidelines.
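
The notes do not show how the three-element tuples were serialized under `[Text and Metadata]`. As a rough sketch, assuming one tuple per line, the prompt could be assembled as below. The `Candidate` type and both helper functions are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    line: str          # line of text extracted from the file
    license_name: str  # potential license match from the semantic search
    spdx_id: str       # SPDX identifier of that match


def render_tuples(candidates: List[Candidate]) -> str:
    """Serialize candidates as one tuple per line."""
    return "\n".join(
        f"({c.line!r}, {c.license_name!r}, {c.spdx_id!r})" for c in candidates
    )


def build_prompt(instructions: str, candidates: List[Candidate]) -> str:
    """Append the serialized tuples to the fixed instruction text."""
    return f"{instructions.rstrip()}\n{render_tuples(candidates)}\n"
```

Keeping the instruction text and the tuple rendering separate makes it easy to try prompt variations without touching the data formatting.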

### Key Findings
- Performance Issues: Despite detailed prompts, the LLMs struggled to correctly identify relevant lines and accurately match licenses.

- RAG Exploration: Kaushl suggested Retrieval-Augmented Generation (RAG), which may provide a more robust way to improve accuracy in license identification.

## Conclusions and Next Steps
- Improve Semantic Search: Continue refining the semantic search approach for better initial filtering of potential license lines.

- RAG Implementation: Investigate and implement RAG to enhance the LLM's ability to accurately identify relevant lines and match licenses.

- Further Prompt Engineering: Experiment with additional prompt variations to improve LLM performance.


- Performance Metrics: Establish metrics to evaluate the effectiveness of the integrated approach and analyze the results for further improvements.
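
Since false positives were the recurring problem, precision and recall over the predicted SPDX identifiers are one natural starting point for such metrics. The sketch below is only a suggestion, not a metric the project has adopted. The function name is hypothetical, and treating an empty set as a score of 1.0 is an assumed convention.

```python
from typing import Dict, Iterable


def precision_recall(predicted: Iterable[str], expected: Iterable[str]) -> Dict[str, float]:
    """Compare predicted SPDX IDs for one file against the ground truth."""
    predicted_set = set(predicted)
    expected_set = set(expected)
    true_positives = len(predicted_set & expected_set)
    # Assumed convention: an empty set counts as a perfect score on that side.
    precision = true_positives / len(predicted_set) if predicted_set else 1.0
    recall = true_positives / len(expected_set) if expected_set else 1.0
    return {"precision": precision, "recall": recall}


# One correct license plus one false positive.
print(precision_recall(["MIT", "GPL-2.0-only"], ["MIT"]))
```

A low precision with high recall would match the behaviour described in these notes, where the LLM related irrelevant lines to licenses.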