When doing an FP fine-tune, in TensorBoard the next round starts 50 steps ahead. #550

Open · marctessier opened this issue Sep 18, 2024 · 2 comments

@marctessier (Collaborator)

Bug description

When doing an FP fine-tune, in TensorBoard it looks like the next round starts 50 steps ahead.

See attached screenshot (Screenshot 2024-09-18 at 09 59 35).

How to reproduce the bug

Run an FP training. One epoch is enough to see the issue.

srun everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml --config-args training.max_epochs=1

Then fine-tune that job with an extra epoch.

srun everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml --config-args training.max_epochs=2 --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt"

Error messages and logs

No error message.

Environment

Standard environment, nothing special. This applies once PR #547 is merged (Resume at the end of the last trained epoch #547, Issue #534).

More info

none

@marctessier marctessier added the bug Something isn't working label Sep 18, 2024
@marctessier marctessier added this to the beta milestone Sep 18, 2024
@SamuelLarkin (Collaborator) commented Sep 19, 2024

I was thinking about this issue. To get a training loss value, we need to run at least one batch, but once we use up one batch we are no longer at step 0 for that resumed run. If we want to remove the gap, we could save the last losses and, when we reload the model, send those saved values to TensorBoard to bridge the gap.

Note that when we resume, PyTorch Lightning actually performs one epoch of evaluation, records it to TensorBoard, and only then proceeds with training. If we were to use the losses calculated during that first evaluation phase, we could get loss values at step 0, but they would most likely not align with the training losses calculated at the end of the previous run, i.e. the run prior to resuming, whose last checkpoint we are resuming from.
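To make that idea concrete, here is a minimal sketch (not EveryVoice code; the callback name, metric keys, and checkpoint key are all hypothetical) of a PyTorch Lightning callback that stashes the last logged training losses in the checkpoint and re-emits them at the resumed global step:

```python
# Hypothetical sketch only: stash the last logged losses in the checkpoint and
# re-log them at the resumed global step so the TensorBoard curves join up.
import pytorch_lightning as pl


class BridgeLossGap(pl.Callback):
    def __init__(self, loss_keys=("loss",)):  # metric names are assumptions
        self.loss_keys = loss_keys
        self._restored = None

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Record the most recent logged values of the losses we care about.
        checkpoint["bridge_loss_gap"] = {
            key: float(trainer.callback_metrics[key])
            for key in self.loss_keys
            if key in trainer.callback_metrics
        }

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        self._restored = checkpoint.get("bridge_loss_gap")

    def on_train_start(self, trainer, pl_module):
        # Re-emit the stored values at the step we are resuming from, before any
        # new batch is consumed, so the curve starts where the old run ended.
        if self._restored:
            pl_module.logger.log_metrics(self._restored, step=trainer.global_step)
            self._restored = None
```

The callback would be passed to the Trainer alongside the existing ones, e.g. `Trainer(callbacks=[..., BridgeLossGap(loss_keys=("total_loss",))])`. Whether those metric keys match what EveryVoice actually logs is an open question, and this would not remove the step offset itself, it would only make the curves visually continuous.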

@SamuelLarkin SamuelLarkin added documentation Improvements or additions to documentation and removed bug Something isn't working labels Sep 23, 2024
@SamuelLarkin (Collaborator)

After today's meeting, we agreed that the thing to do is probably to document the fact that, when resuming, these gaps are to be expected.
Before documenting this, let's give a quick try to running training until the first training losses are logged to TensorBoard and then resetting the training iterator. This may not be a great solution because we might log the same losses twice and make the graph even more confusing.
