
SDK: Add an adapter layer that presents a CVAT task as a torchvision dataset #5417

Merged
merged 7 commits into cvat-ai:develop from pytorch-adapter on Dec 8, 2022

Conversation

@SpecLad (Contributor) commented Dec 2, 2022

Motivation and context

How has this been tested?

Unit tests; custom demo script.

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

label_id_to_index: Mapping[int, int]


class TaskDataset(torchvision.datasets.VisionDataset):
Contributor commented:

If we don't inherit the class, what problems will we have?

@nmanovic (Contributor) commented Dec 2, 2022

But at the same time, if we want to test the feature, we need to install torch and torchvision explicitly, even though we don't otherwise need them. What do you think?

@SpecLad (Contributor, Author) commented:

If we don't inherit the class, what problems will we have?

That depends on what the alternative is. The PyTorch DataLoader class requires the dataset to be an instance of torch.utils.data.Dataset, so we would at least have to inherit from that.

We could inherit directly from Dataset, rather than VisionDataset (in fact, for a long time, that was what my prototype did). The VisionDataset class provides dataset transform support that helps the class behave more like other torchvision datasets; however, we could, in principle, reimplement that ourselves. I eventually decided to use VisionDataset because, IMO, if you're working with vision datasets in PyTorch, you'll want to use torchvision anyway for its image processing transforms.

We could implement multiple classes (e.g. TaskDataset inheriting from Dataset and TaskVisionDataset inheriting from VisionDataset) and multiple sets of extras (with torchvision and without), but that just seems like overcomplicating things.
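As an illustration, here is a minimal sketch (not the code from this PR) of what inheriting from VisionDataset buys us: the standard transforms/transform/target_transform arguments are validated and normalized into self.transforms by the base class. The sample storage here is hypothetical.

from typing import Callable, List, Optional, Tuple

import torchvision.datasets


class VisionDatasetSketch(torchvision.datasets.VisionDataset):
    def __init__(
        self,
        samples: List[Tuple[object, object]],  # hypothetical (image, target) pairs
        transforms: Optional[Callable] = None,
        transform: Optional[Callable] = None,
        target_transform: Optional[Callable] = None,
    ) -> None:
        # VisionDataset rejects invalid combinations (e.g. transforms
        # together with transform) and stores the result as self.transforms.
        super().__init__(
            root="",
            transforms=transforms,
            transform=transform,
            target_transform=target_transform,
        )
        self._samples = samples

    def __getitem__(self, index: int) -> Tuple[object, object]:
        image, target = self._samples[index]
        if self.transforms is not None:
            image, target = self.transforms(image, target)
        return image, target

    def __len__(self) -> int:
        return len(self._samples)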

But at the same time, if we want to test the feature, we need to install torch and torchvision explicitly, even though we don't otherwise need them. What do you think?

I don't understand the question. Naturally we would have to install PyTorch and torchvision to test integration with them.


_ModelType = TypeVar("_ModelType")

_CACHE_DIR = Path(appdirs.user_cache_dir("CVAT_SDK", "CVAT.ai"))
Contributor commented:

Suggested change:
- _CACHE_DIR = Path(appdirs.user_cache_dir("CVAT_SDK", "CVAT.ai"))
+ _CACHE_DIR = Path(appdirs.user_cache_dir("cvat-sdk", "CVAT.ai"))

@SpecLad (Contributor, Author) commented:

Applied.

label_id_to_index: Mapping[int, int]


class TaskDataset(torchvision.datasets.VisionDataset):
Contributor commented:

Where can I find ProjectDataset?

@SpecLad (Contributor, Author) commented:

There is none at the moment; however, one could be built easily using ConcatDataset.

Frankly, I'm not sure if it would even make sense to create a Dataset out of the entire project - such a dataset would contain all of the subsets, so it wouldn't be suitable for either training or testing.
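For reference, a hedged sketch of that approach (the TaskDataset constructor arguments are assumptions, not necessarily the SDK's actual signature):

from torch.utils.data import ConcatDataset

# Hypothetical: IDs of the tasks belonging to one project, and an
# already configured CVAT SDK client.
task_ids = [42, 43, 44]
project_dataset = ConcatDataset([TaskDataset(client, task_id) for task_id in task_ids])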

@SpecLad force-pushed the pytorch-adapter branch 3 times, most recently from efaec34 to d19cea6 on December 7, 2022 at 18:49
@SpecLad changed the title from "SDK: Add an adapter layer that presents a CVAT task as a torchvision …" to "SDK: Add an adapter layer that presents a CVAT task as a torchvision dataset" on Dec 8, 2022
We might eventually support other types of datasets, so use a slightly
more specific name.
Forcing the user to specify a filter ensures that we won't surprise them later
by causing additional boxes to appear if we add support for more shape types.
@SpecLad (Contributor, Author) commented Dec 8, 2022

Some notes I made while developing this:

There's an issue with CVAT's data model that makes it somewhat difficult to translate CVAT tasks to datasets. In a dataset, a label is usually represented by a numeric index, but there's no corresponding concept in CVAT. CVAT labels have IDs, which are numeric, but they are not controllable by the user, and different projects will have different label IDs, even if their sets of labels are the same. Therefore, I don't think CVAT label IDs are suitable for use as indexes when working with neural networks.

In this PR, I assign indexes by enumerating all labels in order of ascending ID, which is at least somewhat controllable by the user (you can enter the labels in the correct order), but still not amazing. There's no way to determine from the CVAT UI which labels will have which indexes, and no way to affect the indexes once they've been created.
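In sketch form, the scheme amounts to the following (the shape of the label objects is an assumption):

from typing import Iterable, Mapping


def build_label_id_to_index(labels: Iterable) -> Mapping[int, int]:
    # Enumerate the task's labels in order of ascending label ID;
    # each label is assumed to expose a numeric `id` attribute.
    return {
        label.id: index
        for index, label in enumerate(sorted(labels, key=lambda label: label.id))
    }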

IMO, for better interoperability with deep learning frameworks, CVAT should assign each label a numeric index, and let the user customize it. Then we can use those assignments in the PyTorch adapter (and other adapters, should we make them).

Another way to help the user in this regard would be to let them specify a custom label-id-to-index mapping when instantiating the adapter class.


The way I extract the Nth frame in a chunk is by taking the Nth entry in the ZIP file's info list, which is the same method the UI uses. It's not great, though: we're relying on the ZIP-producing library that the server uses to record the files in the central directory in the order they were specified, which doesn't seem especially safe to me. I don't think the order of files in the central directory is meant to be part of ZIP's data model.
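In rough terms (a sketch, not the exact SDK code), the extraction looks like this:

import zipfile
from io import BytesIO


def read_frame_from_chunk(chunk_data: bytes, frame_in_chunk: int) -> bytes:
    with zipfile.ZipFile(BytesIO(chunk_data)) as chunk_zip:
        # Relies on the server having written the members in frame order.
        member = chunk_zip.infolist()[frame_in_chunk]
        return chunk_zip.read(member)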

IMO, it would be better to record some sort of manifest in each chunk file that would give the member name for each frame index. Alternatively, this manifest could be part of the data metadata (i.e. the result of GET /api/tasks/<id>/data/meta).


I haven't yet implemented some features that could be useful in the adapter layer. Some of them are already documented in the docstrings (as "limitations"). Some others are:

  • Accessing images from an external directory (rather than downloading chunks). Could be useful if the task uses files from a mounted share, and the user has also mounted that share. Shouldn't be too difficult to implement; the file names that need to be accessed are part of the data metadata that we already download (to check for deleted frames).

  • Limiting the dataset to one or several jobs that are part of the task. To support this, we need to a) filter out frames that are not in those jobs, and b) query the job annotations instead of the task annotations and combine them ourselves. I assume CVAT combines annotations on overlapping frames by just concatenating the corresponding arrays, but I would have to check to be sure. A rough sketch of the frame filtering appears after this list.
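The sketch for point a), assuming job objects expose the start_frame/stop_frame range that the CVAT API reports:

from typing import Iterable, Set


def frames_covered_by_jobs(jobs: Iterable) -> Set[int]:
    covered = set()
    for job in jobs:
        # Assumption: each job covers the inclusive frame range
        # [start_frame, stop_frame], as in the CVAT REST API.
        covered.update(range(job.start_frame, job.stop_frame + 1))
    return covered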

@nmanovic (Contributor) left a comment

OK. It looks good to me as an initial version. Let's merge it.

@nmanovic merged commit 487c60c into cvat-ai:develop on Dec 8, 2022
@SpecLad deleted the pytorch-adapter branch on December 8, 2022 at 16:00
@SpecLad (Contributor, Author) commented Dec 8, 2022

Oh, I forgot another note:

I'm currently wiping the entire task from the cache when a change in the updated_date is detected. This includes the data chunks, which might seem rather wasteful, given that CVAT (as far as I know) does not let the user modify the data once it has been uploaded; it would be much more efficient to keep the chunks when wiping the rest of the cache.

This is just because I wanted to protect against the scenario where a future version of CVAT does introduce a way to alter task data, and a user with an old version of the SDK receives stale data, because that version never purges data from the cache.

Ideally, there should be a way to independently verify how up-to-date each chunk is, so that we can redownload the absolute minimum amount of information needed. I was thinking we could use conditional HTTP queries (AKA If-Modified-Since) to redownload chunks if the updated_date of the task does not match the last modified timestamp of the local chunk. If the server replies that the chunk hasn't changed, we simply bump the last modified date of the local chunk file. This would be fairly efficient (still one HTTP query per chunk, but no actual data retransmission in the common case) and more future-proof.

Of course, this would need the server to support If-Modified-Since queries. I don't know if it currently does, or how difficult it would be to implement. However, given that we offload file transfers to Django (via django-sendfile2), I wouldn't be surprised if we got conditional queries for free.
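To sketch the client side of the idea (using plain requests for illustration; the real SDK would go through its generated API client, and the URL is hypothetical):

import email.utils
from typing import Optional

import requests


def refresh_chunk(session: requests.Session, chunk_url: str, cached_mtime: float) -> Optional[bytes]:
    response = session.get(
        chunk_url,
        headers={"If-Modified-Since": email.utils.formatdate(cached_mtime, usegmt=True)},
    )
    if response.status_code == 304:
        return None  # chunk unchanged; just bump the local file's timestamp
    response.raise_for_status()
    return response.content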

@zhiltsov-max (Contributor) commented:

[no way to specify label index in CVAT]
Another way to help the user in this regard would be to let them specify a custom label-id-to-index mapping when instantiating the adapter class.

Personally, I like this variant (at least as an option), since it is specific to the training. On the other hand, the ability to specify it in CVAT directly would make the process more transparent and provide a single point of configuration for this. Many datasets rely on the label order.

[all the cache is cleaned] This is just because I wanted to protect against the scenario where a future version of CVAT does introduce a way to alter task data

Probably, it's better to keep the cached chunks. CVAT doesn't allow changing data once it's uploaded, and there are no expected changes in this logic. The data can change, however, if we use a shared folder and caching is enabled. The data can also be changed in a project. Per-chunk update reporting sounds good to me, but it's probably quite far from what we have now.

[projects are not implemented, because it is difficult to use them]

I think we could allow selecting a subset to work with.

@nmanovic mentioned this pull request on Dec 12, 2022
@SpecLad (Contributor, Author) commented Dec 13, 2022

Another way to help the user in this regard would be to let them specify a custom label-id-to-index mapping when instantiating the adapter class.

This is now implemented in #5455.
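Usage would look roughly like this (the parameter name is illustrative; see #5455 for the actual API):

dataset = TaskDataset(
    client,
    task_id,
    label_id_to_index={10: 0, 11: 1, 12: 2},  # hypothetical CVAT label IDs
)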

mikhail-treskin pushed a commit to retailnext/cvat that referenced this pull request Jul 1, 2023