SVT dataset integration #597

felixdittrich92 · 2021-11-09T15:34:35Z

This PR integrates the Street View Text (SVT) dataset.

codecov · 2021-11-09T15:51:53Z

Codecov Report

Merging #597 (28e1a47) into main (400aec0) will decrease coverage by 0.11%.
The diff coverage is 97.05%.

❗ Current head 28e1a47 differs from pull request most recent head 7c3505e. Consider uploading reports for the commit 7c3505e to get more accurate results

@@            Coverage Diff             @@
##             main     #597      +/-   ##
==========================================
- Coverage   96.16%   96.04%   -0.12%     
==========================================
  Files         111      111              
  Lines        4300     4299       -1     
==========================================
- Hits         4135     4129       -6     
- Misses        165      170       +5

Flag	Coverage Δ
unittests	`96.04% <97.05%> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
doctr/datasets/svt.py	`96.96% <96.96%> (ø)`
doctr/datasets/__init__.py	`100.00% <100.00%> (ø)`
doctr/utils/data.py	`88.67% <0.00%> (-9.44%)`	⬇️
doctr/datasets/doc_artefacts.py	`93.33% <0.00%> (-0.79%)`	⬇️
doctr/datasets/iiit5k.py
doctr/datasets/datasets/base.py	`95.55% <0.00%> (+0.10%)`	⬆️

fg-mindee

Thanks! I added a few optimization comments and one about XML security.

doctr/datasets/detection.py

doctr/datasets/recognition.py

doctr/datasets/svt.py

fg-mindee

Still got many questions 🙃

doctr/datasets/svt.py

fg-mindee · 2021-11-10T13:42:42Z

doctr/datasets/svt.py

+                        (int(rect_tag.attrib['x']) + int(rect_tag.attrib['width']) / 2,
+                         int(rect_tag.attrib['y']) + int(rect_tag.attrib['height']) / 2,
+                         int(rect_tag.attrib['width']), int(rect_tag.attrib['height']), 0)
+                        for rect_tag in image_tag


According to what you said above, I don't see why we wouldn't need to check `if image_tag.tag == 'taggedRectangles'` here 🤔
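For reference, the tuple built in the diff above can be read as a conversion from a top-left rectangle to a center-based box. The sketch below isolates that arithmetic. The helper name `rect_to_rbox` is hypothetical, and reading the result as (x_center, y_center, width, height, angle) is an assumption drawn from the expression itself, not from the docTR documentation.

```python
from typing import Mapping, Tuple


def rect_to_rbox(attrib: Mapping[str, str]) -> Tuple[float, float, int, int, int]:
    """Convert string attributes x, y, width, height (top-left based)
    into (x_center, y_center, width, height, angle).

    Hypothetical helper: it mirrors the expression in the diff and
    assumes the trailing 0 stands for a rotation angle.
    """
    x, y = int(attrib["x"]), int(attrib["y"])
    w, h = int(attrib["width"]), int(attrib["height"])
    # Shift the top-left corner by half the size to get the center.
    return (x + w / 2, y + h / 2, w, h, 0)


# Attribute values taken from the taggedRectangle example in this thread.
print(rect_to_rbox({"height": "31", "width": "120", "x": "351", "y": "455"}))
# → (411.0, 470.5, 120, 31, 0)
```

Parsing each attribute once, as done here, also avoids calling `int(...)` twice on `width` and `height`.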

@fg-mindee

image.tag : imageName image.attrib : {} image.text : img/04_16.jpg
image.tag : address image.attrib : {} image.text : 324 South Glenoaks Boulevard Burbank CA 91502-1318
image.tag : lex image.attrib : {} image.text : BLOCKBUSTER,VIDEO,LAW,OFFICES,RAY,FOUNTAIN,CRIMINAL,DEFENSE,ATTORNEY,LAWYER,PIZZA,MAN,CAFE,COLOMBIA,GERRO,MARSHA,RICO,GLENOAKS,PHARMACY,AND,MEDICAL,SUPPLY,RAMIREZ,LYNN,TER,POGHOSYAN,ZARINE,JAS,GIFTS,KNOX,TIMOTHY,DDS,LILLY,PARK,HOME,THEATER,HAMOUI,MOBIL,MART,CLEANERS,TOKYO,YAKIDORI,SHIRVAN,REALTY,GROUP,NAPA,AUTO,CARE,CENTER,TRAN,DEBBY
image.tag : Resolution image.attrib : {'x': '1280', 'y': '1024'} image.text : None
image.tag : taggedRectangles image.attrib : {} image.text :

This is the output for one example; hopefully it makes things a bit clearer.

Why don't we need a check?
taggedRectangle has only attributes,
and the tag with the label is a child of taggedRectangle,
but in image_tag we have multiple tags: imageName, address, lex, Resolution and taggedRectangles

<taggedRectangle height="31" width="120" x="351" y="455"> <tag>MAGIC</tag> </taggedRectangle>

With XPath we can search the tree, for example:
element = (x.findall('.//td[@class="teamid"]'))
but in this case I would prefer the if statement ^^

Sorry, but I'm not really familiar with XML, so I can't tell from your snippet whether there is a hierarchy, whether this is a flat data structure, or whether we can access attribute values by name 😅

Generally speaking, nested FOR loops should be avoided, hence my previous question ;)
So, in Python, when processing the XML, from what I understand:

xml_root holds the list of pages/images

in each page/image, there are several attributes, and we're interested in "imageName", "x", "y", "width", "height" and those are on the same level

Is this correct?

No, let me try to explain with this snippet:
tagset is the root
image is the first-level tag
imageName, address, lex, Resolution and taggedRectangles are on the second level (x and y of Resolution are attributes, the others hold text)
taggedRectangles has one more level: taggedRectangle, with the attributes height, width, x, y
and each taggedRectangle has one more level: tag, with a text value
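The hierarchy described above can be sketched with the standard library as follows. The sample document is assembled by hand from values quoted in this thread (the `lex` list is shortened, and pairing the MAGIC rectangle with this image is only for illustration), so it is an assumption about the layout rather than an excerpt from the real SVT files. The loop uses the `if` check on the tag name discussed here.

```python
import xml.etree.ElementTree as ET

# Hand-built sample following the levels listed above (not a real SVT file).
SAMPLE = """
<tagset>
  <image>
    <imageName>img/04_16.jpg</imageName>
    <address>324 South Glenoaks Boulevard Burbank CA 91502-1318</address>
    <lex>BLOCKBUSTER,VIDEO</lex>
    <Resolution x="1280" y="1024" />
    <taggedRectangles>
      <taggedRectangle height="31" width="120" x="351" y="455">
        <tag>MAGIC</tag>
      </taggedRectangle>
    </taggedRectangles>
  </image>
</tagset>
"""

root = ET.fromstring(SAMPLE.strip())  # root is <tagset>
samples = []
for image in root:  # level 1: <image>
    name, boxes, labels = None, [], []
    for child in image:  # level 2: imageName, address, lex, ...
        if child.tag == "imageName":
            name = child.text
        elif child.tag == "taggedRectangles":
            for rect in child:  # level 3: <taggedRectangle>
                boxes.append(
                    tuple(int(rect.attrib[k]) for k in ("x", "y", "width", "height"))
                )
                labels.append(rect.find("tag").text)  # level 4: <tag>
    samples.append((name, boxes, labels))

print(samples)
# → [('img/04_16.jpg', [(351, 455, 120, 31)], ['MAGIC'])]
```

Attributes are read by name through `rect.attrib`, while text values such as the image name and the label come from `.text`.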
@fg-mindee
I also tried xmltodict, but it does not work because the XML file is not well formatted :/
The other way (and I think the only other way) is XPath, with something like this: .//imageName, but for readability I would prefer the current implementation 😅
wdyt?
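For comparison, here is a rough sketch of what the XPath-style alternative could look like. It uses only the limited path syntax of the standard library's `xml.etree.ElementTree`; the function name `parse_with_xpath` and the sample document are hypothetical. The commit history of this PR mentions defusedxml, which would be the safer choice for untrusted files, but the sketch sticks to the standard library so that it stays self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the hierarchy described in this thread.
XML_TEXT = """<tagset>
  <image>
    <imageName>img/04_16.jpg</imageName>
    <Resolution x="1280" y="1024" />
    <taggedRectangles>
      <taggedRectangle height="31" width="120" x="351" y="455">
        <tag>MAGIC</tag>
      </taggedRectangle>
    </taggedRectangles>
  </image>
</tagset>"""


def parse_with_xpath(xml_text):
    """Return a list of (image name, boxes, labels) using path queries
    instead of checking each child's tag name."""
    root = ET.fromstring(xml_text)
    results = []
    for image in root.iterfind("image"):
        name = image.findtext("imageName")
        # The path selects the rectangles directly, so no tag check is needed.
        rects = image.findall("taggedRectangles/taggedRectangle")
        boxes = [
            tuple(int(rect.get(key)) for key in ("x", "y", "width", "height"))
            for rect in rects
        ]
        labels = [rect.findtext("tag") for rect in rects]
        results.append((name, boxes, labels))
    return results


print(parse_with_xpath(XML_TEXT))
# → [('img/04_16.jpg', [(351, 455, 120, 31)], ['MAGIC'])]
```

This still iterates over images and rectangles, but the path expressions replace the explicit tag comparison; which version reads better is a matter of taste.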

Felix added 2 commits November 9, 2021 16:23

start svt

5798c33

Merge branch 'main' of https://github.com/felixdittrich92/doctr into SVT

9b19a19

Felix added 2 commits November 9, 2021 17:08

fix tests

75a3a54

fix codefact minor typos and add docu

e252e9a

felixdittrich92 marked this pull request as ready for review November 10, 2021 07:45

felixdittrich92 changed the title ~~[WIP] SVT dataset integration~~ SVT dataset integration Nov 10, 2021

Felix added 2 commits November 10, 2021 08:55

set defusedxml before xml import

24e186b

Merge branch 'main' of https://github.com/felixdittrich92/doctr into SVT

c933ec7

fg-mindee suggested changes Nov 10, 2021

fg-mindee self-assigned this Nov 10, 2021

fg-mindee added the module: datasets Related to doctr.datasets label Nov 10, 2021

apply changes

9c9db06

fg-mindee reviewed Nov 10, 2021

fg-mindee added topic: documentation Improvements or additions to documentation ext: tests Related to tests folder labels Nov 10, 2021

Felix added 2 commits November 10, 2021 15:38

rename child and image_tag

afdcc36

Merge branch 'main' of https://github.com/felixdittrich92/doctr into SVT

180a6a8

felixdittrich92 requested a review from fg-mindee November 10, 2021 16:39

Felix and others added 3 commits November 11, 2021 08:16

Merge branch 'main' of https://github.com/felixdittrich92/doctr into SVT

e26dad2

Merge branch 'main' of https://github.com/felixdittrich92/doctr into SVT

97ab10a

Merge branch 'main' of https://github.com/felixdittrich92/doctr into SVT

56f0023

fg-mindee mentioned this pull request Nov 12, 2021

[datasets] Extend the range of public datasets supported in docTR #587

Closed

7 tasks

felixdittrich92 added 2 commits November 12, 2021 15:29

remove check

28e1a47

solve merge conflicts

7c3505e

felixdittrich92 closed this Nov 13, 2021

felixdittrich92 deleted the SVT branch November 13, 2021 11:06

felixdittrich92 added a commit to felixdittrich92/doctr that referenced this pull request Nov 13, 2021

reopening mindee#597

e719034

fg-mindee mentioned this pull request Nov 15, 2021

reopening #597 SVT dataset integration #620

Merged

felixdittrich92 added a commit to felixdittrich92/doctr that referenced this pull request Nov 15, 2021

reopening mindee#597

5eb73bd

fg-mindee pushed a commit that referenced this pull request Nov 15, 2021

feat: Added support of SVT dataset (#620)

1207416

* start synth * cleanup * reopening #597 * apply changes
