Closed
Labels: bug, checkpointing
Description
Bug description
https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L582
The linked line calls fs.listdir(dir_path), where dir_path is a URL rather than a plain filesystem path.
pyarrow rejects a URI in that position and raises:
pyarrow.lib.ArrowInvalid: FileSelector.base_dir must not be a URI, got: hdfs:///somewhere/MNIST
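To illustrate the mismatch, here is a minimal sketch of stripping the scheme from such a URL before it reaches a call that expects a bare path. The helper name strip_uri_scheme is hypothetical and is not part of pytorch_lightning, fsspec, or pyarrow. It only shows the kind of normalization that seems to be missing. fsspec itself provides fsspec.core.url_to_fs, which returns a filesystem together with a protocol-stripped path, but this report does not confirm whether that is the appropriate fix.

```python
from urllib.parse import urlsplit


def strip_uri_scheme(dir_path: str) -> str:
    """Return the path component of a URL, or the input unchanged if it is
    already a plain path. Hypothetical helper, for illustration only."""
    # Only treat the input as a URL when it contains an explicit "scheme://".
    if "://" not in dir_path:
        return dir_path
    # For "hdfs:///somewhere/MNIST" the scheme is "hdfs", the host part is
    # empty, and the remaining path is "/somewhere/MNIST".
    return urlsplit(dir_path).path


print(strip_uri_scheme("hdfs:///somewhere/MNIST"))  # /somewhere/MNIST
print(strip_uri_scheme("/somewhere/MNIST"))         # /somewhere/MNIST
```

Note that this sketch discards the host part of the URL, so it is only suitable when the filesystem object already knows which cluster to talk to.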
How to reproduce the bug
Pass an HDFS URL, such as hdfs:///somewhere/MNIST, as the Trainer's default_root_dir and call trainer.fit().
Error messages and logs
File "main.py", line 63, in main
trainer.fit(mnist_model, train_loader, valid_loader)
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1110, in _run
self._restore_modules_and_callbacks(ckpt_path)
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1063, in _restore_modules_and_callbacks
self._checkpoint_connector.resume_start(checkpoint_path)
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 78, in resume_start
self.resume_checkpoint_path = self._hpc_resume_path or checkpoint_path
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 66, in _hpc_resume_path
max_version = self.__max_ckpt_version_in_folder(dir_path_hpc, "hpc_ckpt_")
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 506, in __max_ckpt_version_in_folder
files = [os.path.basename(f["name"]) for f in fs.listdir(dir_path)]
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/fsspec/spec.py", line 1301, in listdir
return self.ls(path, detail=detail, **kwargs)
File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/fsspec/implementations/arrow.py", line 66, in ls
for entry in self.fs.get_file_info(FileSelector(path))
File "pyarrow/_fs.pyx", line 433, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: FileSelector.base_dir must not be a URI, got: hdfs:///somewhere/MNIST
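For context, the failing frame turns a directory listing into base names and looks for the highest "hpc_ckpt_" version. The following is a rough sketch of that kind of lookup, not Lightning's actual implementation. The function name max_ckpt_version, the ".ckpt" suffix, and the sample listing are assumptions made for illustration. The listing is a plain list of dicts with a "name" key, mimicking the shape that the traceback shows fs.listdir returning.

```python
import os
import re
from typing import Iterable, Optional


def max_ckpt_version(entries: Iterable[dict], name_key: str = "hpc_ckpt_") -> Optional[int]:
    """Return the highest version number among files named
    '<name_key><N>.ckpt', or None if there are none (illustrative only)."""
    pattern = re.compile(re.escape(name_key) + r"(\d+)\.ckpt")
    versions = []
    for entry in entries:
        # Same step as in the traceback: keep only the file's base name.
        base = os.path.basename(entry["name"])
        match = pattern.fullmatch(base)
        if match:
            versions.append(int(match.group(1)))
    return max(versions) if versions else None


# Assumed sample listing; a real one would come from fs.listdir(dir_path).
listing = [
    {"name": "/somewhere/MNIST/hpc_ckpt_1.ckpt"},
    {"name": "/somewhere/MNIST/hpc_ckpt_3.ckpt"},
    {"name": "/somewhere/MNIST/last.ckpt"},
]
print(max_ckpt_version(listing))  # 3
```

The lookup itself never needs the URL scheme. Only the listing call does, which is where the error above is raised.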
Environment
fsspec: 2022.10.0
pyarrow: 8.0.0
pytorch_lightning: 1.7.7