[ft-template] Add long-context data prep cookbook #250

Open
wants to merge 11 commits into base: main

Conversation

@ArturNiederfahrenhorst (Contributor) commented Jun 20, 2024

This PR adds a cookbook that shows users what to do if their dataset contains many examples that are "long".
It is not aimed at users who usually end up with context lengths around 512, but at those on the opposite end of the spectrum who ask themselves whether their dataset will fit.

@ArturNiederfahrenhorst (Contributor, Author) commented:

Adding a video walkthrough

@ArturNiederfahrenhorst (Contributor, Author) commented:

This requires a new release of llm-forge that includes (at least) the commit shown in the video, because we use GCP and that fails without it -> https://github.com/anyscale/llm-forge/pull/415

I'm uploading two smaller videos. Because this REQUIRES A100s, we ran it on GCP and still had to wait a while for the instances to be acquired. The training itself takes around 1.5 hours, which I'm adding to the README in a next commit.
The start: https://drive.google.com/file/d/1yyk6CiY1vg6JKhyImk2TPD0PfF2yE93d/view?usp=sharing
The end: https://drive.google.com/file/d/11O9fj_RvLJuMvHkVTQH_9UP296yzGzI8/view?usp=sharing

@ArturNiederfahrenhorst ArturNiederfahrenhorst marked this pull request as ready for review July 12, 2024 18:08
@kouroshHakha kouroshHakha changed the title Add long-context data prep cookbook [ft-template] Add long-context data prep cookbook Jul 15, 2024
@kouroshHakha (Contributor) left a comment:

I just read this cookbook and I am lost about what the exact focus is.

Is it context length extension? If so, then you should focus on the functionality and a sufficiently detailed abstract explanation of how it is done in the backend (like using RoPE scaling, etc.). We should only expose the details that help users make modeling decisions based on their data, not the implementation details. All the user needs to do is make sure their dataset is properly formatted for longer contexts and provide the desired context length in their config.

There is no need to go through the dataset example details; just give an already-prepared example dataset that has this length and show the dataset stats (average token length, etc.). I am not sure the importance of this cookbook clicks with the reader.

Contributor:

You forgot to add a reference to these in the main README.

"cell_type": "markdown",
"metadata": {},
"source": [
"# Fine-tuning on datasets with long context\n",
Contributor:

Change title to something more clickable:

Suggested change
"# Fine-tuning on datasets with long context\n",
"# Extending context length of LLMs via finetuning\n",

Contributor (Author):

This cookbook is not about extending the context, though, and I have avoided that wording everywhere, as far as I can see. You mention in your review:

I just read this cookbook and I am lost about what the exact focus is.
Is it context length extension?
It's not context-length extension, nor what your comment here suggests ("extending context length"). I only mention extension in the FAQ for that reason. We have always handled context-length extension as something that should be transparent to the user. We can tell users that quality suffers the more they exceed the native context length and that we use RoPE scaling; I agree with that! But that's literally one sentence.

The focus is how to prepare your dataset if you want to fine-tune with context lengths that are so long that users worry about whether they will fit. When you say "all the user needs to do is make sure their dataset is properly formatted for longer contexts" - I think that's THE part of a user's journey where they might need some help.

If I understand you correctly, we want to scrap this part, where we show how to chip away the parts of your dataset that are too long. Instead, we should demonstrate choosing a dataset whose context length exceeds the native context length of a model. With that in mind, I think we can scrap this cookbook altogether and just make this a one-liner in the main README.

"By estimating how many tokens an example from the dataset will result in, we can disgard examples that are too long.\n",
"You can use this as a template for creating your own datasets.\n",
"\n",
"> **_NOTE:_** To fine-tune with a context length of 8k tokens and the llama-3-8b.yaml config in this cookbook, we require GCP A2 nodes. For that, you need to instantiate your workspace in a region such as GCP's `us-central1` due to the relatively high availability. See [GCP's availability info](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones) for more information. If you only want to generate the dataset, you can do that anywhere."
Contributor:

Again here, just mention the resource requirement. How users should get A100s (GCP or not) should be discussed in a different thread; it's beyond the scope of this cookbook.

Contributor (Author):

Are you absolutely sure about this? If we remove this, users will have to find out for themselves which cloud will give them a good chance of getting A100s. If we don't put it here at the start, they will create their workspace on AWS only to find out later that they need to create it again on GCP, because you can't just move workspaces around.

"# Fill in your personal hugginface token with access to the tokenizer (You can use a similar tokenizer as a work-around)\n",
"HHUGGINFACE_TOKEN = \"\"\n",
"# The name of the model you want to fine-tune with. We use this only for tokenization so models with the same tokenizer are interoperable here.\n",
"MODEL_NAME = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
Contributor:

Llama-3-8B is already 8k context. So what is the point?

Contributor (Author):

The point of this notebook is to show users how to prepare their dataset if examples are rather long.
Just like it was on endpoints. There is no point in using a 16k dataset if it's transparent to them anyway.
I'm pointing users to the blogpost that sufficiently explains the quality considerations.

8k is already quite a long context length, and we are still talking about a template here. That is, this is also meant to be an example that people can just run. It takes about twice as long to grab the necessary number of A100s to fine-tune at 16k, and the fine-tuning itself will also take longer. That's why I chose 8k after all. This is not purely about context extension. The extension exists only as a side note in the FAQ for those whose context length exceeds the native context length.
