update reco datasets and tests #954

felixdittrich92 · 2022-06-17T07:02:04Z

This PR:

simplifies the structure of the recognition datasets
updates the eval scripts
updates the tests

Any feedback is welcome 🤗

codecov · 2022-06-17T07:14:03Z

Codecov Report

Merging #954 (42bd64e) into main (9530f81) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #954      +/-   ##
==========================================
+ Coverage   95.16%   95.18%   +0.01%     
==========================================
  Files         134      134              
  Lines        5520     5521       +1     
==========================================
+ Hits         5253     5255       +2     
+ Misses        267      266       -1

Flag	Coverage Δ
unittests	`95.18% <100.00%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
doctr/datasets/cord.py	`97.72% <100.00%> (ø)`
doctr/datasets/datasets/base.py	`98.03% <100.00%> (+0.03%)`	⬆️
doctr/datasets/funsd.py	`97.36% <100.00%> (ø)`
doctr/datasets/ic03.py	`97.72% <100.00%> (ø)`
doctr/datasets/ic13.py	`96.77% <100.00%> (ø)`
doctr/datasets/iiit5k.py	`96.96% <100.00%> (ø)`
doctr/datasets/imgur5k.py	`93.33% <100.00%> (ø)`
doctr/datasets/mjsynth.py	`95.83% <100.00%> (ø)`
doctr/datasets/sroie.py	`97.29% <100.00%> (ø)`
doctr/datasets/svhn.py	`95.23% <100.00%> (ø)`
... and 4 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9530f81...42bd64e.

frgfm

Thanks for the PR Felix!

I added a few questions 👍

frgfm · 2022-06-23T16:39:37Z

doctr/datasets/cord.py

-        self.data: List[Tuple[Union[str, np.ndarray], Dict[str, Any]]] = []
+        self.data: List[Tuple[Union[str, np.ndarray], Union[Dict[str, Any], str]]] = []


I think we first need to decide on the output format (signature) of __getitem__.
Then we'll update this accordingly.

Either way, I agree that self.data should have a string as second member (to save memory, and then the getter can choose whether it should be packed within a dict or not)

I think we need to return a dict, given that we support datasets with more than one annotation (boxes/labels), something like:

{image: .., boxes: .. (can be None), labels: .. (can be None)} (both None -> raise)

to hold the key information. wdyt?
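
A minimal sketch of the dict target described in this comment. The helper name `build_target` is hypothetical (it is not part of doctr), boxes are typed as plain nested sequences instead of np.ndarray to keep the example dependency-free, and the image is left out of the dict for brevity:

```python
from typing import Any, Dict, List, Optional, Sequence


def build_target(
    boxes: Optional[Sequence[Sequence[float]]] = None,
    labels: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: pack whichever annotations exist into one dict."""
    if boxes is None and labels is None:
        # Mirrors the "both None -> raise" rule from the comment
        raise ValueError("a sample needs boxes, labels, or both")
    return {"boxes": boxes, "labels": labels}


# Recognition-only sample: labels are set, boxes stay None
print(build_target(labels=["xyz"]))
```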

Mmmmh, about getitem:

dicts have a .get method, so I'd argue we should limit the keys to those of the target task ("labels" for text reco)

About self.data:

the simplest / lightest data format that fits the target task (a string for text recognition)

So the getitem method needs to pack this into a dict
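
The split between storage and getter could look roughly like this. It is only a sketch: the class name `RecoDatasetSketch` is made up, image loading is skipped (the path is returned as-is), and wrapping the label in a one-element list is an assumption taken from the `t['labels'][0]` access quoted elsewhere in this review:

```python
from typing import Dict, List, Tuple


class RecoDatasetSketch:
    """Hypothetical recognition dataset: stores bare strings, returns dict targets."""

    def __init__(self, samples: List[Tuple[str, str]]) -> None:
        # Lightest storage that fits the task: (image path, label string)
        self.data: List[Tuple[str, str]] = list(samples)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[str, Dict[str, List[str]]]:
        img_name, label = self.data[index]
        # The dict is built only at access time, with task-specific keys only
        return img_name, {"labels": [label]}


ds = RecoDatasetSketch([("word_1.jpg", "xyz")])
print(ds.data[0])  # stored form: ('word_1.jpg', 'xyz')
print(ds[0])       # returned form: ('word_1.jpg', {'labels': ['xyz']})
```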

Sounds good

@frgfm after thinking again about the output, I would say it is better to return a dict only if we have more than one annotation (labels and boxes); otherwise we return the plain target, where the output should be clear.
Pure recognition has only str annotations and pure detection has only box annotations.
OCR end-to-end is the only format with a dict, where we need both. wdyt?

I'd say it's dangerous to have heterogeneous target data types (dict vs non-dict)

That makes transforms hard to implement 😅
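
To illustrate the concern, here is a sketch of a label transform that has to accept both target shapes. The function `lowercase_labels` and the `Target` alias are hypothetical and only serve the example:

```python
from typing import Any, Dict, Union

Target = Union[str, Dict[str, Any]]


def lowercase_labels(target: Target) -> Target:
    """Hypothetical transform that must handle dict and non-dict targets."""
    if isinstance(target, dict):
        # Dict branch: rewrite the labels, leave the other keys untouched
        return {**target, "labels": [s.lower() for s in target["labels"]]}
    # Bare-string branch: every transform would need this second code path
    return target.lower()


print(lowercase_labels("XyZ"))                # xyz
print(lowercase_labels({"labels": ["XyZ"]}))  # {'labels': ['xyz']}
```

With a single target type, the `isinstance` check and one of the two branches disappear from every transform.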

Mh yes, I agree. I will update this next week so that all annotations are provided in a dict

frgfm · 2022-06-23T16:42:04Z

doctr/datasets/mjsynth.py

-            label = [path.split('_')[1]]
+            label = path.split('_')[1]


So the label used to be a list of strings (with a single element)?

Before: ['xyz'] ( only list() splits the string )
After: 'xyz' :)
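
A small sketch of the before/after behaviour. The file name is a made-up example of an underscore-separated path, not taken from this PR:

```python
# Hypothetical file name; the label sits between the two underscores
path = "./3000/7/182_slinking_71711.jpg"

old_label = [path.split("_")[1]]  # before: one-element list
new_label = path.split("_")[1]    # after: bare string

print(old_label)        # ['slinking']
print(new_label)        # slinking
print(list(new_label))  # ['s', 'l', 'i', 'n', 'k', 'i', 'n', 'g']

# Indexing with [0] only recovers the word in the list form
print(old_label[0])     # slinking
print(new_label[0])     # s
```

The last two lines show why any consumer that indexes the label with `[0]` has to change together with the dataset.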

Yes, but I don't get how "before" used to work 🤔
Why would we have a list of strings? Is it because we weren't thorough enough in the unit tests and it's a mistake? Or am I missing something?

Mmmhh no, the unit tests are fine, but we have had no case where a recognition dataset was actually used (I introduced this a bit earlier). The first is the eval_recognition script, with:
targets = [t['labels'][0] for t in targets]

The goal was to have a unified output (Tensor, Dict), but this makes no sense: only if we have more than one annotation do we need a dictionary; otherwise str (reco) / np.ndarray (boxes) should be enough

felixdittrich92 · 2022-06-24T08:04:44Z

failing tests: link to image isn't currently available

felixdittrich92 · 2022-07-13T08:20:29Z

@frgfm after thinking again about these changes, I think it is fine as is ... (I know it's a relative overhead to wrap strings in a dict and a list, but this way we can ensure that everything, especially the transforms, is handled the same way).
I would say we update Word/Chargenerator the same way, so that all datasets return the same (Union[str, np.ndarray], Dict[str, Union[str, np.ndarray, int]]), where the dict contains labels, boxes, or both. wdyt?

felixdittrich92 · 2022-09-02T10:40:02Z

I will rethink this

frgfm · 2022-09-05T10:01:00Z

@felixdittrich92 I feel like it's a rather subjective question because we're trying to balance:

uniformity of target format across all tasks
simplicity/clarity of output signature

Perhaps we could manage all this by using a flag format_target_as_dict in the constructor that would return dict target instead of the bare minimum?
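
One way such a flag might look, as a sketch only. The flag name `format_target_as_dict` comes from the comment above; the class `FlaggedRecoDataset`, the default value, and the omitted image loading are assumptions:

```python
from typing import Dict, List, Tuple, Union

RecoTarget = Union[str, Dict[str, List[str]]]


class FlaggedRecoDataset:
    """Hypothetical dataset whose target format is chosen in the constructor."""

    def __init__(
        self,
        samples: List[Tuple[str, str]],
        format_target_as_dict: bool = False,
    ) -> None:
        self.data = list(samples)
        self.format_target_as_dict = format_target_as_dict

    def __getitem__(self, index: int) -> Tuple[str, RecoTarget]:
        img_name, label = self.data[index]
        if self.format_target_as_dict:
            # Uniform format shared with multi-annotation tasks
            return img_name, {"labels": [label]}
        # Bare minimum for the task
        return img_name, label


samples = [("word_1.jpg", "xyz")]
print(FlaggedRecoDataset(samples)[0])
print(FlaggedRecoDataset(samples, format_target_as_dict=True)[0])
```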

felixdittrich92 · 2022-09-05T10:28:29Z

@frgfm i think the way we have now in #1041 is ok (also for transformations) wdyt ?

frgfm · 2022-09-16T10:04:17Z

@felixdittrich92 #1041 takes one approach but doesn't offer the possibility to switch between dict & simple target

The good thing about minimal target (without dict) is that batching is much easier
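
The batching difference can be sketched with two collate functions. Both are hypothetical and work on plain Python values; a real pipeline would also stack the image tensors:

```python
from typing import Any, Dict, List, Tuple


def collate_plain(samples: List[Tuple[Any, str]]) -> Tuple[List[Any], List[str]]:
    """Minimal targets: unzip the samples and the batch is ready."""
    images, targets = zip(*samples)
    return list(images), list(targets)


def collate_dict(
    samples: List[Tuple[Any, Dict[str, Any]]]
) -> Tuple[List[Any], Dict[str, List[Any]]]:
    """Dict targets: each key has to be gathered across the batch."""
    images, targets = zip(*samples)
    # Assumes every target in the batch carries the same keys
    merged = {key: [t[key] for t in targets] for key in targets[0]}
    return list(images), merged


print(collate_plain([("img_a", "x"), ("img_b", "y")]))
print(collate_dict([("img_a", {"labels": ["x"]}), ("img_b", {"labels": ["y"]})]))
```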

felixdittrich92 · 2022-09-16T11:34:43Z

@frgfm mh 🤔 but I think it is currently a good way to handle samples which have multiple annotations (with = dict / without = string or np.array). Do we really need a switch?

Especially for a clear transformation format

frgfm · 2022-09-21T11:35:57Z

Not necessarily but perhaps we should run a test to check latency overhead due to this 🤷‍♂️
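
A rough way to run such a check with the standard library, as a sketch. The getters, the sample data, and the `bench` helper are hypothetical stand-ins; they measure only the cost of packing the target and ignore image decoding, which likely dominates in practice:

```python
import timeit
from typing import Callable

# Synthetic (path, label) pairs standing in for a recognition dataset
samples = [(f"img_{i}.jpg", f"word{i}") for i in range(1000)]


def get_plain(index: int):
    """Return the stored (path, str) pair unchanged."""
    return samples[index]


def get_dict(index: int):
    """Return the pair with the label packed into a dict and a list."""
    name, label = samples[index]
    return name, {"labels": [label]}


def bench(getter: Callable, repeat: int = 5, number: int = 20) -> float:
    """Best-of-`repeat` time in seconds for one full pass over the samples."""
    def one_pass():
        return [getter(i) for i in range(len(samples))]

    return min(timeit.repeat(one_pass, repeat=repeat, number=number)) / number


print(f"plain: {bench(get_plain) * 1e6:.1f} us per pass")
print(f"dict:  {bench(get_dict) * 1e6:.1f} us per pass")
```

The absolute numbers depend on the machine, so only the relative difference between the two lines is meaningful.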

felixdittrich92 added the module: datasets, topic: text recognition, and type: misc labels Jun 17, 2022

felixdittrich92 self-assigned this Jun 17, 2022

felixdittrich92 mentioned this pull request Jun 17, 2022

[datasets] Filter currupted and wrong annotated files in ready to use datasets #935

Closed

23 tasks

felixdittrich92 requested review from charlesmindee and frgfm June 17, 2022 07:12

felixdittrich92 added this to the 0.6.0 milestone Jun 17, 2022

This was referenced Jun 17, 2022

[Fix] SVT dataset: clip box values and add shape and label check #955

Merged

[Fix] MJSynth dataset: filter corrupted or missing images #956

Merged

frgfm reviewed Jun 23, 2022

felixdittrich92 force-pushed the reco-dataset-out branch from d803e72 to 02f7518 Compare June 24, 2022 05:41

felixdittrich92 force-pushed the reco-dataset-out branch from 02f7518 to 65291f8 Compare June 27, 2022 06:22

felixdittrich92 requested a review from frgfm June 27, 2022 08:54

frgfm mentioned this pull request Jun 28, 2022

Release tracker - v0.6.0 #791

Closed

85 tasks

felixdittrich92 added 2 commits July 1, 2022 23:16

update reco datasets and tests

051fac3

update abstractdataset generator output signature

42bd64e

felixdittrich92 force-pushed the reco-dataset-out branch from 5ee8b56 to 42bd64e Compare July 1, 2022 21:16

felixdittrich92 marked this pull request as draft July 1, 2022 21:17

felixdittrich92 closed this Sep 2, 2022

felixdittrich92 deleted the reco-dataset-out branch September 2, 2022 10:40

felixdittrich92 removed this from the 0.6.0 milestone Sep 26, 2022
