How to disable automatic checkpoint loading

I'm trying to run a loop over a set of parameters, and I want to create a new network for each parameter and let it train for a few epochs.

Currently my code looks like this:

def optimize_scale(self, epochs=5, comp_scale=100, scale_list=[1, 100]):
    trainer = pyli.Trainer(gpus=1, max_epochs=epochs)
    
    for scale in scale_list:
        test_model = CustomNN(num_layers=1, scale=scale, lr=1, pad=True, batch_size=1)
        trainer.fit(test_model)
        trainer.test(verbose=True)
        
        del test_model

Everything works fine for the first element of scale_list: the network trains for 5 epochs and completes the test, all of which can be seen in the console. For all following elements of scale_list, however, it doesn't work. The old network is not overwritten; instead, an old checkpoint is loaded automatically when trainer.fit(model) is called. In the console this is indicated by:

C:\Users\XXXX\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:623: UserWarning:
Checkpoint directory D:\XXXX\src\lightning_logs\version_0\checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
train_size = 8   val_size = 1    test_size = 1
Restoring states from the checkpoint path at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt

The consequence is that the second test outputs the same result, because the checkpoint from the old network, which had already finished all 5 epochs, was loaded. I thought that adding del test_model might help drop the model completely, but that did not work.

While searching I found a few closely related issues, for example: https://github.com/PyTorchLightning/pytorch-lightning/issues/368. However, I did not manage to fix my problem. I assume it has something to do with the fact that the new network, which should overwrite the old one, has the same name/version and therefore looks for the same checkpoints.

If anyone has an idea or knows how to circumvent this I would be very grateful.

Solution:

I think that, in your setup, you want to disable automatic checkpointing:

trainer = pyli.Trainer(gpus=1, max_epochs=epochs, enable_checkpointing=False)

You may need to explicitly save a checkpoint (with a different name) for each training session you are running.

You can manually save a checkpoint via:

trainer.save_checkpoint(f'checkpoint_for_scale_{scale}.pth')
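
Putting the pieces above together, one possible way to structure the loop is sketched below. This is only a sketch under assumptions, not a confirmed fix: run_scale_sweep, make_model, make_trainer and save_checkpoints are hypothetical names introduced here for illustration. The idea is to build a fresh trainer for every scale and to pass the freshly trained model to trainer.test explicitly, so the test step does not depend on a previously stored checkpoint.

```python
def run_scale_sweep(scale_list, make_model, make_trainer, save_checkpoints=False):
    """Train and test one fresh model per scale, each with its own trainer.

    make_model(scale) and make_trainer() are hypothetical factory callables
    supplied by the caller; they are not part of PyTorch Lightning.
    """
    results = {}
    for scale in scale_list:
        # A new trainer per run, so no callback state carries over
        # from the previous scale.
        trainer = make_trainer()
        model = make_model(scale)
        trainer.fit(model)
        # Test the in-memory model explicitly instead of letting the
        # trainer look for a saved checkpoint.
        results[scale] = trainer.test(model, verbose=True)
        if save_checkpoints:
            # Optional manual checkpoint with a distinct name per scale.
            trainer.save_checkpoint(f"checkpoint_for_scale_{scale}.pth")
    return results


# Possible usage with the names from the question (assumed, not executed here):
#
# results = run_scale_sweep(
#     [1, 100],
#     make_model=lambda s: CustomNN(num_layers=1, scale=s, lr=1, pad=True, batch_size=1),
#     make_trainer=lambda: pyli.Trainer(gpus=1, max_epochs=5, enable_checkpointing=False),
# )
```

Passing factories rather than a shared trainer keeps each run independent, and it also avoids the mutable default argument (scale_list=[1, 100]) used in the question's function signature.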