[ft-template] Add long-context data prep cookbook #250

Open
wants to merge 11 commits into base: main

Conversation

@ArturNiederfahrenhorst (Contributor) commented Jun 20, 2024

This PR adds a cookbook that shows users what to do if their dataset contains many examples that are "long".
It is not aimed at users who usually end up with context lengths around 512, but at those on the opposite end of the spectrum who ask themselves whether their dataset will fit.

@ArturNiederfahrenhorst (Contributor, Author) commented:

Adding a video walkthrough

@ArturNiederfahrenhorst (Contributor, Author) commented:

This requires a new release of llm-forge that includes (at least) the commit shown in the video, because we use GCP and that fails without it -> https://github.com/anyscale/llm-forge/pull/415

I'm uploading two smaller videos. Because this REQUIRES A100s, we ran it on GCP and still had to wait a while for the instances to be acquired. The training itself takes around 1.5 hours, which I'm adding to the README in a next commit.
The start: https://drive.google.com/file/d/1yyk6CiY1vg6JKhyImk2TPD0PfF2yE93d/view?usp=sharing
The end: https://drive.google.com/file/d/11O9fj_RvLJuMvHkVTQH_9UP296yzGzI8/view?usp=sharing

@ArturNiederfahrenhorst ArturNiederfahrenhorst marked this pull request as ready for review July 12, 2024 18:08
@kouroshHakha kouroshHakha changed the title Add long-context data prep cookbook [ft-template] Add long-context data prep cookbook Jul 15, 2024
@kouroshHakha (Contributor) left a comment:

I just read this cookbook and I am lost about what the exact focus is.

Is it context length extension? If so, then you should focus on the functionality and a sufficiently detailed abstract explanation of how it is done in the backend (like using RoPE scaling, etc.). We should only expose the details that help users make modeling decisions based on their data, not the implementation details. All the user needs to do is make sure their dataset is properly formatted for longer contexts and provide the desired context length in their config.

There is no need to go through the dataset example details; just give an already-prepared example dataset that has this length and show the dataset stats (average token length, etc.). I am not sure the importance of this cookbook clicks with the reader.

Contributor:

You forgot to add a reference to these in the main README.

"cell_type": "markdown",
"metadata": {},
"source": [
"# Fine-tuning on datasets with long context\n",
Contributor:

Change title to something more clickable:

Suggested change
"# Fine-tuning on datasets with long context\n",
"# Extending context length of LLMs via finetuning\n",

Contributor (Author):

This cookbook is not about extending the context, though, and I have avoided that wording everywhere, as far as I can see. You mention in your review:

I just read this cookbook and I am lost about what the exact focus is.
Is it context length extension?
It's not context-length extension, nor what your comment here suggests ("extending context length"). I only mention extension in the FAQ for that reason. We have always handled context-length extension as something that should be transparent to the user. We can tell users that quality suffers the more they exceed the native context length and that we use RoPE scaling; I agree with that! But that's literally one sentence.

The focus is how to prepare your dataset if you want to fine-tune with context lengths that are so long that users worry about whether they will fit. When you say "all the user needs to do is make sure their dataset is properly formatted for longer contexts" - I think that's THE part of a user's journey where they might need some help.

If I understand you correctly, we want to scrap this part, where we show how to chip away the parts of your dataset that are too long. Instead, we should demonstrate choosing a dataset whose context length exceeds the native context length of a model. With that in mind, I think we can scrap this cookbook altogether and just make this a one-liner in the main README.

"By estimating how many tokens an example from the dataset will result in, we can disgard examples that are too long.\n",
"You can use this as a template for creating your own datasets.\n",
"\n",
"> **_NOTE:_** To fine-tune with a context length of 8k tokens and the llama-3-8b.yaml config in this cookbook, we require GCP A2 nodes. For that, you need to instantiate your workspace in a region such as GCP's `us-central1` due to the relatively high availability. See [GCP's availability info](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones) for more information. If you only want to generate the dataset, you can do that anywhere."
Contributor:

Again here, just mention the resource requirement. How users should get A100s (GCP or not) should be discussed in a different thread; it's beyond the scope of this cookbook.

Contributor (Author):

Are you absolutely sure about this? If we remove this, users will have to find out for themselves which cloud will give them a good chance of getting A100s. If we don't put it here at the start, they will create their workspace on AWS only to find out later that they need to create it again on GCP, because you can't just move workspaces around.

"# Fill in your personal hugginface token with access to the tokenizer (You can use a similar tokenizer as a work-around)\n",
"HHUGGINFACE_TOKEN = \"\"\n",
"# The name of the model you want to fine-tune with. We use this only for tokenization so models with the same tokenizer are interoperable here.\n",
"MODEL_NAME = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
Contributor:

Llama-3-8B is already 8k context. So what is the point?

Contributor (Author):

The point of this notebook is to show users how to prepare their dataset if examples are rather long.
Just like it was on endpoints. There is no point in using a 16k dataset if it's transparent to them anyway.
I'm pointing users to the blogpost that sufficiently explains the quality considerations.

8k is already quite a long context length, and we are still talking about a template here. That is, this is also meant to be an example that people can just run. It takes about twice as long to grab the necessary number of A100s to fine-tune at 16k, and the fine-tuning itself will also take longer. That's why I chose 8k after all. This is not purely about context extension. The extension exists only as a side note in the FAQ for those whose context length exceeds the native context length.
