Error in fine-tuning on custom dataset #131
Hi, may I ask whether the IndexError occurs at the very beginning of training, or only after some steps?
For the RuntimeError about the in-place operation, could you please try replacing the code snippet at OFA/criterions/label_smoothed_cross_entropy.py, lines 220 to 225 (commit 630e193), with an out-of-place version of the same operation?
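Roughly, the pattern is to swap the in-place tensor update for its out-of-place counterpart and rebind the result. A minimal sketch of that pattern (the tensor names and shapes below are placeholders, not the exact OFA snippet):

```python
import math
import torch

# Placeholder tensors standing in for objects inside the criterion;
# names and shapes are assumptions, not the exact OFA code.
logits = torch.randn(2, 5, 10)                  # e.g. net_output[0], a view under DDP
constraint_masks = torch.rand(2, 5, 10) > 0.5   # mask of allowed output tokens

# In-place fill on a DDP output view -- the pattern that raises the RuntimeError:
#   logits.masked_fill_(~constraint_masks, -math.inf)

# Out-of-place equivalent: build a new tensor and rebind the name instead.
logits = logits.masked_fill(~constraint_masks, -math.inf)
```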
I am getting the same error. The index-out-of-range error only appears when I fine-tune with more than one GPU.
Did you solve the index error? I am having the same error when training on multiple GPUs, while there is no error when training on one GPU.
I followed the previous issues #76 #105 #56 and more to generate the TSV files for the VizWiz dataset. I ensured each row of the TSV has 6 values in the order required for training. If I try running multi-GPU training with these settings:
GPUS_PER_NODE=4
WORKER_CNT=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8214
export RANK=0
I get two errors:
File "/fs/cml-scratch/kseelman/VQA/OFA/data/file_dataset.py", line 115, in
column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]
IndexError: list index out of range
AND
RuntimeError: Output 0 of _DDPSinkBackward is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3778252) of binary: /fs/cml-scratch/kseelman/VQA/OFA/env/bin/python3
The weird part is that when I run with GPUS_PER_NODE=1, the list-index-out-of-range problem does not occur and training works. So single-GPU training works, but I want to use multi-GPU since fine-tuning takes a long time.
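One plausible cause of the IndexError is a row in the TSV with fewer than the expected 6 tab-separated fields (for example a stray tab inside a text field or a truncated final line); since each rank reads a different slice of the file under multi-GPU training, a single bad row can surface only in that setup. A quick sanity check along those lines (this helper script and the expected column count of 6 are assumptions based on the description above, not part of OFA):

```python
import sys

# Hypothetical helper, not part of OFA: report TSV rows whose column count
# differs from the expected value, which is what makes the
# selected_col_ids indexing in file_dataset.py raise "list index out of range".
def check_tsv(path, expected_cols=6, separator="\t"):
    bad_rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            n_cols = len(line.rstrip("\n").split(separator))
            if n_cols != expected_cols:
                bad_rows.append((line_no, n_cols))
    return bad_rows

if __name__ == "__main__":
    for line_no, n_cols in check_tsv(sys.argv[1]):
        print(f"line {line_no}: {n_cols} columns (expected 6)")
```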