DataLoader with workers not compatible with ImageRecordDataset #9974
Comments
Same issue here; that's pretty problematic for preprocessing-heavy datasets. Having done a bit of digging, here is what I think is happening: when the child processes are created, they each get a copy of the ImageRecordDataset. This ImageRecordDataset is holding an open read handle on the record file, so all workers end up sharing it. I tested the above solution, and it looks like it does solve the issue. Not sure what would be the best way to properly implement it though. Have a 'reinitialize()' function at the Dataset level? |
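The shared-handle failure mode described above can be reproduced with nothing but stdlib calls, independently of MXNet. A minimal sketch (the 4-byte "records" and the file layout are made up for illustration): after a fork, parent and child share one open file description, so a read in one process moves the seek offset seen by the other.

```python
import os
import tempfile

# Write a small file of three 4-byte "records".
tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
with open(path, 'wb') as f:
    f.write(b'AAAABBBBCCCC')

fd = os.open(path, os.O_RDONLY)  # opened once, inherited by the child

pid = os.fork()
if pid == 0:
    os.read(fd, 4)  # child reads "record 0", advancing the SHARED offset
    os._exit(0)
os.waitpid(pid, 0)

# The parent also expects record 0, but the offset has already moved.
data = os.read(fd, 4)
print(data)  # b'BBBB' rather than b'AAAA'

os.close(fd)
os.remove(path)
```

This is why each worker needs its own freshly opened handle rather than the one copied from the parent.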
hot-fix for this problem: use at your own risk:
|
If this is an acceptable fix, it'd be great to get it in the master branch since I'm sure other people will start hitting this error soon. |
@jwfromm I will try to work on a PR this week if time allows |
Unfortunately, it seems like this fix doesn't always work. When attempting to use it on an ImageNet record file, I got the following error
|
It looks like you might be running the fix twice, which would point the worker loop to itself? |
Although I'm not running it twice, this error is not related to the fix. It looks like it's a separate bug entirely, as it occurs even without the fix with num_workers set to 1. num_workers at 0 works fine though! I'll have to dig in a little more to see what's going on. |
The bug above is due to something on the master branch; when I revert to v1.1, your fix works great! |
The PR above is closed, but if we cannot use a record file in a DataLoader with multiprocessing, that is confusing. |
I still have this problem in 1.3.0:

/data1/zj/crnn.gluon/venv/bin/python /data1/zj/crnn.gluon/dataset.py
101
Process Process-5:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataloader.py", line 169, in worker_loop
batch = batchify_fn([dataset[i] for i in samples])
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataloader.py", line 169, in <listcomp>
batch = batchify_fn([dataset[i] for i in samples])
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataset.py", line 131, in __getitem__
item = self._data[idx]
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/vision/datasets.py", line 257, in __getitem__
record = super(ImageRecordDataset, self).__getitem__(idx)
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataset.py", line 189, in __getitem__
return self._record.read_idx(self._record.keys[idx])
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/recordio.py", line 265, in read_idx
return self.read()
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/recordio.py", line 163, in read
ctypes.byref(size)))
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:40:51] src/recordio.cc:65: Check failed: header[0] == RecordIOWriter::kMagic Invalid RecordIO File
Stack trace returned 10 entries:
[bt] (0) /data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x36bac2) [0x7fe0e5734ac2]
[bt] (1) /data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x36d5f83) [0x7fe0e8a9ef83]
[bt] (2) /data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/libmxnet.so(MXRecordIOReaderReadRecord+0x2a) [0x7fe0e8266bba]
[bt] (3) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7fe1048bce20]
[bt] (4) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7fe1048bc88b]
[bt] (5) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7fe1048b701a]
[bt] (6) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) [0x7fe1048aafcb]
[bt] (7) /data1/zj/crnn.gluon/venv/bin/python(PyObject_Call+0x47) [0x5c1797]
[bt] (8) /data1/zj/crnn.gluon/venv/bin/python(PyEval_EvalFrameEx+0x4ec6) [0x53bba6]
[bt] (9) /data1/zj/crnn.gluon/venv/bin/python(PyEval_EvalFrameEx+0x4b04) [0x53b7e4]
Traceback (most recent call last):
File "/data1/zj/crnn.gluon/dataset.py", line 148, in <module>
for i, (img, label) in enumerate(data_loader):
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataloader.py", line 242, in __next__
if self._rcvd_idx in self._data_buffer:
KeyboardInterrupt
Process Process-1:
Process Process-2:
Process Process-3:
Process Process-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataloader.py", line 166, in worker_loop
idx, samples = key_queue.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 94, in get
res = self._recv_bytes()
File "/usr/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataloader.py", line 166, in worker_loop
idx, samples = key_queue.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/data1/zj/crnn.gluon/venv/lib/python3.5/site-packages/mxnet/gluon/data/dataloader.py", line 166, in worker_loop
idx, samples = key_queue.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Process finished with exit code 1

The code is:

import time

from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision.datasets import ImageRecordDataset
from mxnet.gluon.data.vision.transforms import ToTensor

dataset = ImageRecordDataset('/data1/zj/data/crnn/txt/val.rec')
data_loader = DataLoader(dataset.transform_first(ToTensor()), 1, shuffle=True, num_workers=6)

print(len(dataset))
start = time.time()
for i, (img, label) in enumerate(data_loader):
    if (i + 1) % 10 == 0:
        print(time.time() - start)
        start = time.time() |
@WenmuZhou This should properly fix all kinds of situations: #12554 |
@zhreshold waiting for update |
@jwfromm @WenmuZhou The fix proposed in PR #12554 has been merged. |
@jwfromm @WenmuZhou Verified that both of the issues mentioned are not reproducible on the current master branch. PR #12554 should have fixed those. I am closing this issue. Please feel free to reopen if closed in error or if you still encounter this issue. Thanks! |
I have tested my code with mxnet-cu80 (1.5.0b20190221); this bug has been fixed, thanks |
FYI, this is incompatible with thread_pool=True in DataLoader (False is the default). |
Description
Using a DataLoader with a non-zero number of workers on an ImageRecordDataset crashes. Being able to have multiple workers is essential to high-speed training, and it is supported when using ImageRecordIter, so it should be possible with DataLoader, which has a much nicer API.
Environment info (Required)
Package used (Python/R/Scala/Julia):
I'm using Python 3.6
Build info (Required if built from source)
Pip Install
Error Message:
Minimum reproducible example
Steps to reproduce
Run the above code.
What have you tried to solve it?
This would require changes to how ImageRecordDataset accesses the records.