
Error in fine-tuning on custom dataset #131

Closed
kyleseelman opened this issue Jun 19, 2022 · 4 comments

@kyleseelman
I followed the previous issues #76 #105 #56 and more to generate the tsv files for the VizWiz dataset. I made sure each row of the tsv has 6 values, in the order expected for training. When I try running multi-GPU training with:

GPUS_PER_NODE=4
WORKER_CNT=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8214
export RANK=0
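
(These variables are picked up by the PyTorch distributed launcher; roughly the equivalent explicit launch would be the following sketch, where the training script path and the trailing task/dataset arguments are placeholders rather than the exact OFA command:)

# placeholders: replace train.py and the trailing "..." with the actual
# script and arguments from the OFA finetuning script
python -m torch.distributed.launch \
    --nproc_per_node=${GPUS_PER_NODE} \
    --nnodes=${WORKER_CNT} \
    --node_rank=${RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    train.py ...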

I get two errors:

File "/fs/cml-scratch/kseelman/VQA/OFA/data/file_dataset.py", line 115, in
column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]
IndexError: list index out of range

AND

RuntimeError: Output 0 of _DDPSinkBackward is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3778252) of binary: /fs/cml-scratch/kseelman/VQA/OFA/env/bin/python3

The weird part is that when I run with GPUS_PER_NODE=1, the list-index-out-of-range problem does not occur and training works. So single-GPU training works, but I want to use multi-GPU since fine-tuning takes a long time.
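
For what it's worth, here is a quick sanity check over the tsv (the file name is a placeholder, and the tab separator and 6-column count are assumptions based on the format described above):

EXPECTED_COLS = 6  # assumed column count, per the format described above

with open("vizwiz_train.tsv", "r", encoding="utf-8") as f:  # placeholder path
    for line_no, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != EXPECTED_COLS:
            print(f"row {line_no}: expected {EXPECTED_COLS} columns, got {len(fields)}")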

@yangapku
Member

Hi, may I ask whether the IndexError occurs at the beginning of the training, or just happens during some steps?

@yangapku
Member

For the RuntimeError about the in-place operation, could you please try applying this modification in label_smoothed_cross_entropy.py?

Replace this code snippet

if "constraint_masks" in sample and sample["constraint_masks"] is not None:
constraint_masks = sample["constraint_masks"]
net_output[0].masked_fill_(~constraint_masks, -math.inf)
if self.constraint_start is not None and self.constraint_end is not None:
net_output[0][:, :, 4:self.constraint_start] = -math.inf
net_output[0][:, :, self.constraint_end:] = -math.inf

with

        if "constraint_masks" in sample and sample["constraint_masks"] is not None:
            constraint_masks = sample["constraint_masks"]
            net_output = list(net_output)
            net_output[0] = net_output[0].masked_fill(~constraint_masks, -math.inf)
        if self.constraint_start is not None and self.constraint_end is not None:
            net_output = list(net_output)
            index = torch.cat(
                [torch.tensor([x for x in range(4, self.constraint_start)], dtype=torch.int64),
                 torch.tensor([x for x in range(self.constraint_end, net_output[0].size()[-1])], dtype=torch.int64)],
                dim=0
            )
            mask = torch.zeros(net_output[0].size()).index_fill(-1, index, 1)\
                .to(dtype=torch.bool, device=net_output[0].device)
            net_output[0] = net_output[0].masked_fill(mask, -math.inf) 
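
The key difference is that the out-of-place masked_fill allocates a new tensor and rebinds net_output[0], instead of mutating the view returned through DDP. A standalone sketch of the idea (the tensor shapes and names here are illustrative only, not taken from the OFA code):

import math
import torch

logits = torch.randn(2, 3, 8)                      # stands in for net_output[0]
mask = torch.zeros_like(logits, dtype=torch.bool)
mask[..., 6:] = True                               # positions to be disallowed

# In-place: mutates the tensor itself, which fails if it is a view
# produced by an autograd function (the _DDPSinkBackward case above).
# logits.masked_fill_(mask, -math.inf)

# Out-of-place: builds a new tensor and rebinds the name, leaving the
# original view untouched.
logits = logits.masked_fill(mask, -math.inf)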

@JackCai1206

> Hi, may I ask whether the IndexError occurs at the beginning of the training, or just happens during some steps?

I am getting the same error. The index out of range only appears when I am fine-tuning with more than 1 GPU.

@HiLittleFriend

Did you solve the index error? I am having the same error when training on multiple GPUs, while there is no error training on one GPU.
