[models] add ViTSTR TF and PT and update ViT to work as backbone #1055
Conversation
(force-pushed from 7b30704 to 16a5b0f)
Codecov Report
@@ Coverage Diff @@
## main #1055 +/- ##
==========================================
+ Coverage 95.17% 95.28% +0.11%
==========================================
Files 141 145 +4
Lines 5823 6046 +223
==========================================
+ Hits 5542 5761 +219
- Misses 281 285 +4
thanks Felix! I'd have to go through the paper more thoroughly for an accurate review, but I added some comments!
(force-pushed from ee253d0 to 11e16e0)
done
Looks good to me. Maybe I'll wait for a review from @frgfm for this model
@frgfm For TF we will switch to IntermediateLayerGetter later if we have pretrained ViT models 👍 About the model: it's quite simple, a ViT feature extractor with a custom head (nothing special inside).
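To illustrate what "ViT as feature extractor with a custom head" means here, a minimal PyTorch sketch; the class, argument names, and shapes are my assumptions for illustration, not doctr's actual implementation:

```python
import torch
from torch import nn


class ViTSTRSketch(nn.Module):
    """Illustrative only: a ViT encoder as feature extractor plus a plain linear head.

    `feature_extractor` is assumed to return token embeddings of shape (B, num_tokens, d_model).
    """

    def __init__(self, feature_extractor: nn.Module, d_model: int, vocab_size: int, max_length: int):
        super().__init__()
        self.feat_extractor = feature_extractor
        self.max_length = max_length
        # the "custom head": a single linear layer mapping each kept token to a character class
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.feat_extractor(x)           # (B, num_tokens, d_model)
        features = features[:, : self.max_length]   # keep only the first max_length tokens
        return self.head(features)                  # (B, max_length, vocab_size)
```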
@frgfm I will open another PR to add pretrained weights for ViT. We can add your requested changes in this one (if there is anything :) ).
Let's just lay out all the arguments for/against the patch_size default value (cf. comment).
# fix patch size if recognition task with 32x128 input
self.patch_size = (4, 8) if height != width else patch_size
tricky condition: what are all the possible cases as input, and what do we want as patch_size for each?
Oh, it could be anything ... currently:
- classification case: 32x32 -> (4, 4) (handled)
- recognition case: 32x128 -> (4, 8) (handled)
- detection case: 1024x1024 (not handled)
- any other size (not handled)
It will not fail, but each input size needs a different patch_size (see the sketch below for the token count each case yields).
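To make the cases above concrete, here is a small illustrative check of the patch grid (and hence the ViT sequence length) each combination produces; the helper is just for illustration, not part of the PR:

```python
def num_patches(img_size, patch_size):
    """Number of ViT tokens produced by a given input size and patch size (illustrative)."""
    h, w = img_size
    ph, pw = patch_size
    assert h % ph == 0 and w % pw == 0, "input must be divisible by the patch size"
    return (h // ph) * (w // pw)


print(num_patches((32, 32), (4, 4)))    # classification: 8 x 8  = 64 patches
print(num_patches((32, 128), (4, 8)))   # recognition:    8 x 16 = 128 patches
print(num_patches((32, 128), (4, 16)))  # alternative:    8 x 8  = 64 patches
# (1024, 1024) and arbitrary sizes would each need their own patch_size choice
```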
Ok, two questions then:
- how should the scale impact the patch size? (N,N) --> (H,W) implies that (2N,2N) --> (?,?)
- how should the aspect ratio impact the patch size? I see that (32,32) --> (4,4), but why doesn't (32,128) give (4,16)?
(4, 16) would also work, but I used the values from ParSeq for the PatchEmbedding of 32x128 samples.
which is (4, 8)
https://github.com/baudm/parseq/blob/main/configs/model/parseq.yaml
https://github.com/baudm/parseq/blob/main/configs/experiment/vitstr.yaml
If we used a fixed ratio this would be easy to scale ... but yeah, I took the values from the ParSeq paper / implementation.
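For the record, a sketch of the fixed-ratio idea mentioned above; this is hypothetical, not what the PR does (the PR hard-codes the ParSeq values instead):

```python
def scaled_patch_size(input_shape, base_shape=(32, 32), base_patch=(4, 4)):
    """Hypothetical fixed-ratio rule: scale the patch size with the input (8 pixels per patch side here)."""
    h, w = input_shape
    return (h * base_patch[0] // base_shape[0], w * base_patch[1] // base_shape[1])


print(scaled_patch_size((32, 32)))      # (4, 4)
print(scaled_patch_size((32, 128)))     # (4, 16) -- ParSeq uses (4, 8) instead
print(scaled_patch_size((1024, 1024)))  # (128, 128)
```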
@felixdittrich92 Did you add pretrained weights for the ViTSTR recognition model? I looked in the code and there was no URL link to pretrained weights.
Hi @ghassenBtb 👋, no, I have only added pretrained weights for the ViT backbone.
Thank you Felix for your quick response :)
Hi @ghassenBtb, we didn't pretrain ViTSTR on our internal dataset
Hi @charlesmindee I think the question from @ghassenBtb was more about whether you could train the model internally and provide the checkpoints 😅
Yes, it would be great if you could train the model on your internal French dataset :)
This PR adds ViTSTR (TF and PT) and updates ViT to work as a backbone.
(The % difference between the TF and PT results is normal; I faced the same thing for SAR and MASTER.)
Any feedback is very welcome 🤗
NOTE:
Unlike with the SAR or MASTER architectures, I am not able to fully train the model because ViT requires a lot of data and I cannot muster the computing power. So this is just a small test, based on our word generator, to show that it trains well.
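For context, roughly how such a quick sanity check can be set up (a minimal sketch assuming doctr's WordGenerator dataset on the PyTorch side; the vocab, character lengths, and sample count are arbitrary choices, and the actual run used the reference recognition training script):

```python
from torch.utils.data import DataLoader

from doctr.datasets import VOCABS, WordGenerator

# synthetic word crops, just enough to check that the loss decreases and decoding works
train_set = WordGenerator(
    vocab=VOCABS["french"],  # any doctr vocab works; "french" is an arbitrary choice here
    min_chars=1,
    max_chars=15,
    num_samples=10_000,
)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```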
slow tests: passed
PT: (training logs/screenshots not reproduced here)
TF: (training logs/screenshots not reproduced here)
Additional:
Prediction works as well (only tested with a model that reaches ~15% exact match after 9 epochs trained on WordGenerator samples).
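For completeness, a minimal inference sketch; the arch name "vitstr" and the image path are assumptions for illustration, while recognition_predictor and DocumentFile are doctr's existing APIs:

```python
from doctr.io import DocumentFile
from doctr.models import recognition_predictor

# "vitstr" is assumed to be the arch name registered by this PR
predictor = recognition_predictor("vitstr", pretrained=False)
# (a locally trained checkpoint would be loaded onto predictor.model here)

crops = DocumentFile.from_images("word_crop.png")  # hypothetical word-crop image
print(predictor(crops))  # e.g. [("hello", 0.87)]
```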