When doing an FP fine-tune, in TensorBoard the next round starts 50 steps ahead. #550

Open · marctessier opened this issue Sep 18, 2024 · 2 comments

@marctessier (Collaborator)

Bug description

When doing an FP fine-tune, in TensorBoard it looks like the next round starts 50 steps ahead.

See attached screenshot (Screenshot 2024-09-18 at 09 59 35).

How to reproduce the bug

Run an FP training. One epoch is enough to see the issue.

srun everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml --config-args training.max_epochs=1

Then fine-tune that job with an extra epoch.

srun everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml --config-args training.max_epochs=2 --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt"

Error messages and logs

No error message.

Environment

Standard environment, nothing special. This applies once PR #547 is merged (Resume at the end of the last trained epoch #547, Issue #534).

More info

none

@marctessier marctessier added the bug Something isn't working label Sep 18, 2024
@marctessier marctessier added this to the beta milestone Sep 18, 2024
@SamuelLarkin (Collaborator) commented Sep 19, 2024

I was thinking about this issue. To get a training loss value, we need to run at least one batch, but once we use up one batch we are no longer at step 0 for that resumed run. If we want to remove the gap, we could save the last losses and, when we reload the model, send those saved values to TensorBoard to bridge the gap.

Note that when we resume, PyTorch Lightning actually performs one epoch of evaluation, records it to TensorBoard, and only then proceeds with training. If we were to use the losses calculated during that first evaluation phase, we could get loss values at step 0, but they would most likely not align with the training losses calculated at the end of the previous run, i.e. the run prior to resuming, whose last checkpoint we are resuming from.
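To make that idea concrete, here is a minimal sketch (not EveryVoice code; the callback name, metric keys, and checkpoint key are all hypothetical) of a PyTorch Lightning callback that stashes the last logged training losses in the checkpoint and re-emits them at the resumed global step:

```python
# Hypothetical sketch only: stash the last logged losses in the checkpoint and
# re-log them at the resumed global step so the TensorBoard curves join up.
import pytorch_lightning as pl


class BridgeLossGap(pl.Callback):
    def __init__(self, loss_keys=("loss",)):  # metric names are assumptions
        self.loss_keys = loss_keys
        self._restored = None

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Record the most recent logged values of the losses we care about.
        checkpoint["bridge_loss_gap"] = {
            key: float(trainer.callback_metrics[key])
            for key in self.loss_keys
            if key in trainer.callback_metrics
        }

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        self._restored = checkpoint.get("bridge_loss_gap")

    def on_train_start(self, trainer, pl_module):
        # Re-emit the stored values at the step we are resuming from, before any
        # new batch is consumed, so the curve starts where the old run ended.
        if self._restored:
            pl_module.logger.log_metrics(self._restored, step=trainer.global_step)
            self._restored = None
```

The callback would be passed to the Trainer alongside the existing ones, e.g. `Trainer(callbacks=[..., BridgeLossGap(loss_keys=("total_loss",))])`. Whether those metric keys match what EveryVoice actually logs is an open question, and this would not remove the step offset itself, it would only make the curves visually continuous.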

@SamuelLarkin SamuelLarkin added documentation Improvements or additions to documentation and removed bug Something isn't working labels Sep 23, 2024
@SamuelLarkin (Collaborator)

After today's meeting, we agreed that the thing to do is probably to document the fact that, when resuming, these gaps are to be expected.
Before documenting this, let's give a quick try to running training until the first training losses are logged to TensorBoard and then resetting the training iterator. This may not be a great solution because we might log the same losses twice and make the graph even more confusing.
