
Error in fine-tuning on custom dataset #131

Closed
kyleseelman opened this issue Jun 19, 2022 · 4 comments

@kyleseelman
I followed the previous issues #76 #105 #56 and more to generate the tsv files for the VizWiz dataset. I made sure each row of the tsv has 6 values, in the order expected for training. When I try running multi-GPU training with:

GPUS_PER_NODE=4
WORKER_CNT=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8214
export RANK=0
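
(These variables are picked up by the PyTorch distributed launcher; roughly the equivalent explicit launch would be the following sketch, where the training script path and the trailing task/dataset arguments are placeholders rather than the exact OFA command:)

# placeholders: replace train.py and the trailing "..." with the actual
# script and arguments from the OFA finetuning script
python -m torch.distributed.launch \
    --nproc_per_node=${GPUS_PER_NODE} \
    --nnodes=${WORKER_CNT} \
    --node_rank=${RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    train.py ...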

I get two errors:

File "/fs/cml-scratch/kseelman/VQA/OFA/data/file_dataset.py", line 115, in
column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]
IndexError: list index out of range

AND

RuntimeError: Output 0 of _DDPSinkBackward is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3778252) of binary: /fs/cml-scratch/kseelman/VQA/OFA/env/bin/python3

The weird part is that when I run with GPUS_PER_NODE=1, the list-index-out-of-range problem does not occur and training works. So single-GPU training works, but I want to use multi-GPU since fine-tuning takes a long time.
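
For what it's worth, here is a quick sanity check over the tsv (the file name is a placeholder, and the tab separator and 6-column count are assumptions based on the format described above):

EXPECTED_COLS = 6  # assumed column count, per the format described above

with open("vizwiz_train.tsv", "r", encoding="utf-8") as f:  # placeholder path
    for line_no, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != EXPECTED_COLS:
            print(f"row {line_no}: expected {EXPECTED_COLS} columns, got {len(fields)}")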

@yangapku
Member

Hi, may I ask whether the IndexError occurs at the beginning of the training, or just happens during some steps?

@yangapku
Member

For the RuntimeError about the in-place operation, could you please try applying this modification in label_smoothed_cross_entropy.py?

Replace this code snippet

if "constraint_masks" in sample and sample["constraint_masks"] is not None:
constraint_masks = sample["constraint_masks"]
net_output[0].masked_fill_(~constraint_masks, -math.inf)
if self.constraint_start is not None and self.constraint_end is not None:
net_output[0][:, :, 4:self.constraint_start] = -math.inf
net_output[0][:, :, self.constraint_end:] = -math.inf

with

        if "constraint_masks" in sample and sample["constraint_masks"] is not None:
            constraint_masks = sample["constraint_masks"]
            net_output = list(net_output)
            net_output[0] = net_output[0].masked_fill(~constraint_masks, -math.inf)
        if self.constraint_start is not None and self.constraint_end is not None:
            net_output = list(net_output)
            index = torch.cat(
                [torch.tensor([x for x in range(4, self.constraint_start)], dtype=torch.int64),
                 torch.tensor([x for x in range(self.constraint_end, net_output[0].size()[-1])], dtype=torch.int64)],
                dim=0
            )
            mask = torch.zeros(net_output[0].size()).index_fill(-1, index, 1)\
                .to(dtype=torch.bool, device=net_output[0].device)
            net_output[0] = net_output[0].masked_fill(mask, -math.inf) 
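
The key difference is that the out-of-place masked_fill allocates a new tensor and rebinds net_output[0], instead of mutating the view returned through DDP. A standalone sketch of the idea (the tensor shapes and names here are illustrative only, not taken from the OFA code):

import math
import torch

logits = torch.randn(2, 3, 8)                      # stands in for net_output[0]
mask = torch.zeros_like(logits, dtype=torch.bool)
mask[..., 6:] = True                               # positions to be disallowed

# In-place: mutates the tensor itself, which fails if it is a view
# produced by an autograd function (the _DDPSinkBackward case above).
# logits.masked_fill_(mask, -math.inf)

# Out-of-place: builds a new tensor and rebinds the name, leaving the
# original view untouched.
logits = logits.masked_fill(mask, -math.inf)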

@JackCai1206

> Hi, may I ask whether the IndexError occurs at the beginning of the training, or just happens during some steps?

I am getting the same error. The index out of range only appears when I am fine-tuning with more than 1 GPU.

@HiLittleFriend

Did you solve the index error? I am having the same error when training on multiple GPUs, while there is no error training on one GPU.
