update reco datasets and tests #954

Closed
wants to merge 2 commits into from

Conversation

felixdittrich92
Contributor

@felixdittrich92 felixdittrich92 commented Jun 17, 2022

This PR:

  • simplifies the structure of the recognition datasets
  • updates the eval scripts
  • updates the tests

Any feedback is welcome 🤗

@felixdittrich92 felixdittrich92 added the labels module: datasets (Related to doctr.datasets), topic: text recognition (Related to the task of text recognition), and type: misc (Miscellaneous) Jun 17, 2022
@felixdittrich92 felixdittrich92 self-assigned this Jun 17, 2022
@felixdittrich92 felixdittrich92 added this to the 0.6.0 milestone Jun 17, 2022
@codecov

codecov bot commented Jun 17, 2022

Codecov Report

Merging #954 (42bd64e) into main (9530f81) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #954      +/-   ##
==========================================
+ Coverage   95.16%   95.18%   +0.01%     
==========================================
  Files         134      134              
  Lines        5520     5521       +1     
==========================================
+ Hits         5253     5255       +2     
+ Misses        267      266       -1     
Flag Coverage Δ
unittests 95.18% <100.00%> (+0.01%) ⬆️

Impacted Files Coverage Δ
doctr/datasets/cord.py 97.72% <100.00%> (ø)
doctr/datasets/datasets/base.py 98.03% <100.00%> (+0.03%) ⬆️
doctr/datasets/funsd.py 97.36% <100.00%> (ø)
doctr/datasets/ic03.py 97.72% <100.00%> (ø)
doctr/datasets/ic13.py 96.77% <100.00%> (ø)
doctr/datasets/iiit5k.py 96.96% <100.00%> (ø)
doctr/datasets/imgur5k.py 93.33% <100.00%> (ø)
doctr/datasets/mjsynth.py 95.83% <100.00%> (ø)
doctr/datasets/sroie.py 97.29% <100.00%> (ø)
doctr/datasets/svhn.py 95.23% <100.00%> (ø)
... and 4 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9530f81...42bd64e.

Collaborator

@frgfm frgfm left a comment

Thanks for the PR Felix!

I added a few questions 👍

- self.data: List[Tuple[Union[str, np.ndarray], Dict[str, Any]]] = []
+ self.data: List[Tuple[Union[str, np.ndarray], Union[Dict[str, Any], str]]] = []
Collaborator

I think we need to make a decision here on the output format for the __getitem__ output signature first
Then we'll update this accordingly

Either way, I agree that self.data should have a string as second member (to save memory, and then the getter can choose whether it should be packed within a dict or not)

Contributor Author

@felixdittrich92 felixdittrich92 Jun 23, 2022

I think we need to return a dict, given that we support datasets with more than one annotation (boxes/labels), something like

{image: ..,
boxes: .. (can be None),
labels: .. (can be None),
}
both None -> raise

to hold the key information. wdyt?
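
The dict layout proposed above could be sketched roughly as follows. This is only an illustration: `build_sample` is a hypothetical helper name, not part of docTR, and plain Python lists stand in for arrays.

```python
from typing import Any, Dict, List, Optional


def build_sample(
    image: str,
    boxes: Optional[List[List[float]]] = None,
    labels: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Pack one sample into a dict (hypothetical helper, not docTR API)."""
    # Both annotations missing -> the sample is unusable, so raise
    if boxes is None and labels is None:
        raise ValueError("a sample needs boxes, labels, or both")
    return {"image": image, "boxes": boxes, "labels": labels}


# Recognition-style sample: labels only, boxes stay None
sample = build_sample("word_0.png", labels=["xyz"])
```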

Collaborator

Mmmmh, about getitem:

  • dicts have a .get method, so I'd argue we should limit the keys to the target task ("labels" for text reco)

About self.data:

  • the simplest / lightest data format that fits the target task (string for text recognition)

So the getitem method needs to pack this into a dict
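
A minimal sketch of this split between storage and access, assuming a hypothetical `RecoDataset` class (the real docTR datasets load images and inherit from shared base classes, which is left out here):

```python
from typing import Dict, List, Tuple


class RecoDataset:
    """Hypothetical recognition dataset: stores bare strings, returns dict targets."""

    def __init__(self, samples: List[Tuple[str, str]]) -> None:
        # Lightest storage format for text recognition: (image path, label string)
        self.data: List[Tuple[str, str]] = list(samples)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[str, Dict[str, str]]:
        img_path, label = self.data[index]
        # Pack the label at access time, using only the key the task needs
        return img_path, {"labels": label}


ds = RecoDataset([("img_hello_0.png", "hello")])
img, target = ds[0]
```

With this layout `self.data` holds no per-sample dict; the dict is only created when a sample is requested.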

Contributor Author

Sounds good

Contributor Author

@frgfm after thinking again about the output, I would say it is better to return a dict only if we have more than one annotation (labels and boxes); otherwise we return the plain target, where the output should be clear.
Pure recognition only has str annotations and pure detection only has box annotations.
End-to-end OCR is the only format where we need both, and hence a dict. wdyt?

Collaborator

I'd say it's dangerous to have heterogeneous target data types (dict vs non-dict)

That makes transforms hard to implement 😅
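
To illustrate the concern, here is a rough sketch of a horizontal-flip target transform written for both conventions. The function names are hypothetical, and boxes are assumed to be relative `[xmin, ymin, xmax, ymax]` lists rather than arrays.

```python
from typing import Any, Dict, List, Union

Boxes = List[List[float]]


def flip_boxes(boxes: Boxes) -> Boxes:
    # Mirror relative coordinates around the vertical axis
    return [[1 - xmax, ymin, 1 - xmin, ymax] for xmin, ymin, xmax, ymax in boxes]


def hflip_target_dict(target: Dict[str, Any]) -> Dict[str, Any]:
    """Uniform dict targets: a single code path."""
    if target.get("boxes") is None:
        return target
    return {**target, "boxes": flip_boxes(target["boxes"])}


def hflip_target_mixed(target: Union[str, Boxes, Dict[str, Any]]) -> Any:
    """Mixed targets: every transform has to branch on the target type."""
    if isinstance(target, str):  # recognition: bare string, nothing to flip
        return target
    if isinstance(target, dict):  # end-to-end OCR: dict with boxes and labels
        return hflip_target_dict(target)
    return flip_boxes(target)  # detection: bare boxes


flipped = hflip_target_mixed([[0.25, 0.0, 0.5, 1.0]])
```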

Contributor Author

@felixdittrich92 felixdittrich92 Jul 1, 2022

Mh yes, I agree. I will update this next week so that all annotations are provided in a dict

- label = [path.split('_')[1]]
+ label = path.split('_')[1]
Collaborator

so the label used to be a list of strings? (with a single element)

Contributor Author

@felixdittrich92 felixdittrich92 Jun 23, 2022

Before: ['xyz'] (the brackets wrap the string; only list() would split it into characters)
After: 'xyz' :)
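
The difference can be checked in plain Python. The file name below is a made-up example of the `<index>_<word>_<id>` pattern implied by `path.split('_')[1]`:

```python
path = "1_xyz_0.jpg"  # hypothetical file name, word sits between the underscores

before = [path.split("_")[1]]      # brackets wrap the word in a one-element list
after = path.split("_")[1]         # the bare string
chars = list(path.split("_")[1])   # list() is what would split it into characters

print(before, after, chars)
```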

Collaborator

Yes, but I don't get how "before" used to work 🤔
Why would we have a list of strings? Is it because we weren't thorough enough in the unittest and it's a mistake? Or am I missing something?

Contributor Author

@felixdittrich92 felixdittrich92 Jun 25, 2022

Mmmhh no, the unit tests are fine, but until now we have had no case where a dataset for recognition was used (I introduced this a bit earlier). The first is the eval_recognition script with:
targets = [t['labels'][0] for t in targets]

The goal was to have a unified output (Tensor, Dict), but this makes no sense: we need a dictionary only if we have more than one annotation; otherwise str (reco) / np.ndarray (boxes) should be enough
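
As a small illustration of what the eval-script line has to undo, assuming each target is a dict whose "labels" entry is a one-element list (the sample values are made up):

```python
# Old format: every target is a dict holding a one-element list
targets_old = [{"labels": ["xyz"]}, {"labels": ["abc"]}]
words_old = [t["labels"][0] for t in targets_old]

# Format discussed in this comment: every target is already the bare string
targets_new = ["xyz", "abc"]
words_new = list(targets_new)

print(words_old == words_new)
```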

@felixdittrich92
Contributor Author

failing tests: link to image isn't currently available

@felixdittrich92
Contributor Author

@frgfm after thinking again about these changes, I think it is fine as is (I know there is some overhead in wrapping strings in a dict and a list, but this way we can ensure that everything, especially the transforms, is handled the same way).
I would say we update WordGenerator/CharacterGenerator the same way, so that all datasets return the same (Union[str, np.ndarray], Dict[str, Union[str, np.ndarray, int]]) tuple, where the dict contains labels, boxes, or both. wdyt?

@felixdittrich92
Contributor Author

I will rethink this

@felixdittrich92 felixdittrich92 deleted the reco-dataset-out branch September 2, 2022 10:40
@frgfm
Collaborator

frgfm commented Sep 5, 2022

@felixdittrich92 I feel like it's a rather subjective question because we're trying to balance:

  • uniformity of target format across all tasks
  • simplicity/clarity of output signature

Perhaps we could manage all this with a format_target_as_dict flag in the constructor, which would make the dataset return a dict target instead of the bare minimum?
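
One way such a flag could look, as a sketch only: the flag name comes from the comment above, while the `FlaggedRecoDataset` class, its default value, and the "labels" key are assumptions.

```python
from typing import Dict, List, Tuple, Union


class FlaggedRecoDataset:
    """Hypothetical dataset whose target format is chosen in the constructor."""

    def __init__(
        self,
        samples: List[Tuple[str, str]],
        format_target_as_dict: bool = False,  # default is an assumption
    ) -> None:
        self.data = list(samples)
        self.format_target_as_dict = format_target_as_dict

    def __getitem__(self, index: int) -> Tuple[str, Union[str, Dict[str, str]]]:
        img_path, label = self.data[index]
        if self.format_target_as_dict:
            # Uniform format across tasks
            return img_path, {"labels": label}
        # Bare minimum for text recognition
        return img_path, label


samples = [("img_hello_0.png", "hello")]
plain = FlaggedRecoDataset(samples)[0]
packed = FlaggedRecoDataset(samples, format_target_as_dict=True)[0]
```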

@felixdittrich92
Contributor Author

@frgfm I think the approach we now have in #1041 is OK (also for transformations). wdyt?

@frgfm
Collaborator

frgfm commented Sep 16, 2022

@felixdittrich92 #1041 takes one approach but doesn't offer the possibility to switch between dict & simple target

The good thing about minimal target (without dict) is that batching is much easier
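
A rough sketch of the batching difference, using hypothetical collate helpers and image paths in place of tensors. With bare targets the batch is just a list; with dict targets the consumer either keeps a list of dicts or merges them key by key:

```python
from typing import Any, Dict, List, Sequence, Tuple


def collate_simple(samples: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    # Bare targets: unzip and the batch is ready
    images, targets = zip(*samples)
    return list(images), list(targets)


def collate_dict(
    samples: Sequence[Tuple[str, Dict[str, Any]]]
) -> Tuple[List[str], Dict[str, List[Any]]]:
    # Dict targets: an extra merge step, and all samples must share the same keys
    images, targets = zip(*samples)
    merged = {key: [t[key] for t in targets] for key in targets[0]}
    return list(images), merged


batch_simple = collate_simple([("a.png", "xyz"), ("b.png", "abc")])
batch_dict = collate_dict([("a.png", {"labels": "xyz"}), ("b.png", {"labels": "abc"})])
```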

@felixdittrich92
Contributor Author

felixdittrich92 commented Sep 16, 2022

@frgfm mh 🤔 but I think it is currently a good way to handle samples which have multiple annotations (with = dict / without = string or np.array). Do we really need a switch?

Especially for a clear transformation format

@frgfm
Collaborator

frgfm commented Sep 21, 2022

Not necessarily but perhaps we should run a test to check latency overhead due to this 🤷‍♂️
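
Such a check could start from a micro-benchmark like the one below. It is a sketch with synthetic in-memory samples and hypothetical getter functions; it measures only the dict-packing step, not image loading or transforms, so the numbers would need to be confirmed on the real datasets.

```python
import timeit

# Synthetic (image path, label) pairs standing in for a recognition dataset
samples = [(f"img_{i}.png", f"word{i}") for i in range(1000)]


def get_plain(index: int):
    # Bare target: return the stored tuple as is
    return samples[index]


def get_dict(index: int):
    # Dict target: pack the label at access time
    img_path, label = samples[index]
    return img_path, {"labels": label}


runs = 20
t_plain = timeit.timeit(lambda: [get_plain(i) for i in range(len(samples))], number=runs)
t_dict = timeit.timeit(lambda: [get_dict(i) for i in range(len(samples))], number=runs)
print(f"plain: {t_plain:.4f}s, dict: {t_dict:.4f}s over {runs} passes")
```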

@felixdittrich92 felixdittrich92 removed this from the 0.6.0 milestone Sep 26, 2022
2 participants