This issue was moved to a discussion.

Multilingual OCR Development Plan #1048

Closed
D-DanielYang opened this issue Oct 28, 2020 · 72 comments
Labels
language requests Multilingual language requests

Comments

@D-DanielYang
Collaborator

D-DanielYang commented Oct 28, 2020

| Model name | Description | Model size | Download | Update date |
| --- | --- | --- | --- | --- |
| ch | Chinese and English | 3.71M | inference model / trained model | 2020.9.22 |
| ch_tra | Chinese (Traditional) | 5.63M | inference model / trained model | 2021.1.21 |
| en | English | 2.56M | inference model / trained model | 2020.9.22 |
| fr | French | 2.65M | inference model / trained model | 2021.9.22 |
| ar | Arabic | 2.53M | inference model / trained model | 2021.1.21 |
| es | Spanish | 2.53M | inference model / trained model | 2021.1.21 |
| pt | Portuguese | 2.63M | inference model / trained model | 2021.1.21 |
| ru | Russian | 2.63M | inference model / trained model | 2021.1.21 |
| ge | German | 2.65M | inference model / trained model | 2020.9.22 |
| kr | Korean | 3.9M | inference model / trained model | 2020.9.22 |
| jp | Japanese | 4.23M | inference model / trained model | 2020.9.22 |
| it | Italian | 2.53M | inference model / trained model | 2021.1.21 |
| hi | Hindi | 2.63M | inference model / trained model | 2021.1.21 |
| ug | Uyghur | 2.63M | inference model / trained model | 2021.1.21 |
| fa | Persian | 2.63M | inference model / trained model | 2021.1.21 |
| ur | Urdu | 2.63M | inference model / trained model | 2021.1.21 |
| oc | Occitan | 2.53M | inference model / trained model | 2021.1.21 |
| mr | Marathi | 2.63M | inference model / trained model | 2021.1.21 |
| ne | Nepali | 2.63M | inference model / trained model | 2021.1.21 |
| rs_cyrillic | Serbian (Cyrillic) | 2.63M | inference model / trained model | 2021.1.21 |
| rs_latin | Serbian (Latin) | 2.53M | inference model / trained model | 2021.1.21 |
| bg | Bulgarian | 2.63M | inference model / trained model | 2021.1.21 |
| uk | Ukrainian | 2.63M | inference model / trained model | 2021.1.21 |
| be | Belarusian | 2.63M | inference model / trained model | 2021.1.21 |
| te | Telugu | 2.63M | inference model / trained model | 2021.1.21 |
| kn | Kannada | 2.63M | inference model / trained model | 2021.1.21 |
| ta | Tamil | 2.63M | inference model / trained model | 2021.1.21 |
| mg | Mongolian | -- | Ongoing | |
| bg | Bangla | -- | Need dict and corpus | |
| bm | Burmese | -- | Need dict and corpus | call for contribution |
| ku_cent | Kurdish (Central) | -- | PR8347 | call for contribution |
| od | Odia | -- | PR6348 | call for contribution |
| th | Thai | -- | PR6719 issue chat | call for contribution |

More TBC

Guideline for new language requests

To request support for a new language, open a PR that includes the following two files:

  1. In folder ppocr/utils/dict,
    submit the dictionary file to this path and name it {language}_dict.txt. It should contain a list of all characters in the language. See the other files in that folder for a format example.

  2. In folder ppocr/utils/corpus,
    submit the corpus to this path and name it {language}_corpus.txt. It should contain a list of words in your language.
    At least 50,000 words per language are probably needed.
    Of course, the more, the better.

  3. Call for contributions to add new language support for PaddleOCR.
    For anyone who might be interested in training a new language model, guidance on training the model is provided. We welcome contributions that add new language support to PaddleOCR.

If your language has unique elements, please tell me in advance in any way you can, for example with useful links or Wikipedia pages.
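As a rough illustration of how the two requested files relate, the sketch below derives a character dictionary from a corpus file. It assumes the dictionary format is one character per line, which should be checked against the existing files in ppocr/utils/dict; the function name `build_char_dict` and the file names in the usage note are hypothetical and not part of PaddleOCR.

```python
from pathlib import Path


def build_char_dict(corpus_path, dict_path):
    """Write every distinct non-whitespace character of the corpus to dict_path.

    Assumption: the dictionary holds one character per line. Compare with the
    existing files in ppocr/utils/dict before submitting.
    Returns (number of words in the corpus, number of distinct characters).
    """
    # Words are taken to be whitespace-separated tokens.
    words = Path(corpus_path).read_text(encoding="utf-8").split()
    # Sort by code point so the output is stable between runs.
    chars = sorted({ch for word in words for ch in word})
    Path(dict_path).write_text("\n".join(chars) + "\n", encoding="utf-8")
    return len(words), len(chars)
```

Calling `build_char_dict("xx_corpus.txt", "xx_dict.txt")` returns the word count, which can be compared against the 50,000-word guideline, together with the number of characters written.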

@D-DanielYang D-DanielYang added the language requests Multilingual language requests label Oct 29, 2020
@saheya

saheya commented Nov 2, 2020

Traditional Mongolian

@omar16100

I would love to work on "Bangla"

@levanpon98

I would be very happy if you did that for Vietnamese.

@HusseinYoussef

How about Arabic? That would be great.

@Hieung28

I've found that PaddleOCR cannot recognize some special characters (such as commas, semicolons, or periods) when the language is English. Is there any way I can fix this problem?
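One thing that may be worth checking, on the assumption that a recognition model can only output characters listed in the dictionary it was trained with, is whether the punctuation in question appears in that dictionary at all. The sketch below is a hypothetical helper (not a PaddleOCR API) that assumes a one-character-per-line dictionary file.

```python
from pathlib import Path


def missing_from_dict(dict_path, wanted):
    """Return the characters of `wanted` that have no entry in the dict file.

    Assumption: the dictionary file lists one character per line.
    """
    lines = Path(dict_path).read_text(encoding="utf-8").splitlines()
    known = {line for line in lines if line}  # ignore empty lines
    return [ch for ch in wanted if ch not in known]
```

For example, `missing_from_dict(path_to_dict, ",;.")` lists which of the three characters are absent from the dictionary at `path_to_dict`.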

@GmGniap

GmGniap commented Nov 27, 2020

I would like to contribute by adding the Burmese language. Do we only need to submit two text files, the dict and the corpus? What else do we need to provide?

@xeron56

xeron56 commented Nov 28, 2020

Adding "Bangla" will be great for the people in South Asia.

@giranntu

giranntu commented Dec 7, 2020

Adding "Traditional Chinese (zh-TW)" would be great support.

@Ru-Van

Ru-Van commented Dec 7, 2020

Do you have a pretrained Russian recognition model?

@SasiAravind

Hi, I would be very grateful if you added the "Tamil" language.

Tamil_dict.txt
Tamil_corpus.txt

If you need more help, please refer to this issue:
JaidedAI/EasyOCR#39

@fcakyon

fcakyon commented Dec 24, 2020

I can help with the Turkish language.

@krzynio

krzynio commented Jan 3, 2021

I can help with the Polish language.

@D-DanielYang D-DanielYang pinned this issue Jan 13, 2021
@D-DanielYang D-DanielYang unpinned this issue Jan 13, 2021
@D-DanielYang D-DanielYang pinned this issue Jan 25, 2021
@D-DanielYang D-DanielYang changed the title from "Multilingual OCR development Plan" to "Multilingual OCR Development Plan" Jan 25, 2021
@xmy0916
Contributor

xmy0916 commented Jan 26, 2021

@GmGniap Hello, can you provide the corpus file for the Burmese language?

@xmy0916
Contributor

xmy0916 commented Jan 26, 2021

@shahidul56 Hello, can you provide the corpus file for the Bangla language?

@azmat21

azmat21 commented Jan 26, 2021

None of the models updated on 2021.1.21 can be downloaded. They fail with the following error:
{ code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

@D-DanielYang
Collaborator Author

D-DanielYang commented Jan 27, 2021

None of the models updated on 2021.1.21 can be downloaded. They fail with the following error:
{ code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

Sorry for the invalid links. All of them have been fixed now, so please try again.

@redcinelli

I would be very happy if you did that for Vietnamese.

#1847 seems to be ongoing.

@xmy0916
Contributor

xmy0916 commented Jan 28, 2021

@redcinelli Thank you very much. The Vietnamese model is in training and will be available soon~

@fcakyon

fcakyon commented Jan 28, 2021

| Model name | Description | Model size | Download | Update date |
| --- | --- | --- | --- | --- |
| ch | Chinese and English | 3.71M | inference model / trained model | 2020.9.22 |
| cht | Chinese (Traditional) | 5.63M | inference model / trained model | 2021.1.21 |
| en | English | 2.56M | inference model / trained model | 2020.9.22 |
| fr | French | 2.65M | inference model / trained model | 2021.9.22 |
| ar | Arabic | 2.53M | inference model / trained model | 2021.1.21 |
| xi | Spanish | 2.53M | inference model / trained model | 2021.1.21 |
| pu | Portuguese | 2.63M | inference model / trained model | 2021.1.21 |
| ru | Russian | 2.63M | inference model / trained model | 2021.1.21 |
| ge | German | 2.65M | inference model / trained model | 2020.9.22 |
| kr | Korean | 3.9M | inference model / trained model | 2020.9.22 |
| jp | Japanese | 4.23M | inference model / trained model | 2020.9.22 |
| it | Italian | 2.53M | inference model / trained model | 2021.1.21 |
| hi | Hindi | 2.63M | inference model / trained model | 2021.1.21 |
| ug | Uyghur | 2.63M | inference model / trained model | 2021.1.21 |
| fa | Persian | 2.63M | inference model / trained model | 2021.1.21 |
| ur | Urdu | 2.63M | inference model / trained model | 2021.1.21 |
| rs | Serbian (Latin) | 2.53M | inference model / trained model | 2021.1.21 |
| oc | Occitan | 2.53M | inference model / trained model | 2021.1.21 |
| mr | Marathi | 2.63M | inference model / trained model | 2021.1.21 |
| ne | Nepali | 2.63M | inference model / trained model | 2021.1.21 |
| rsc | Serbian (Cyrillic) | 2.63M | inference model / trained model | 2021.1.21 |
| bg | Bulgarian | 2.63M | inference model / trained model | 2021.1.21 |
| uk | Ukrainian | 2.63M | inference model / trained model | 2021.1.21 |
| be | Belarusian | 2.63M | inference model / trained model | 2021.1.21 |
| te | Telugu | 2.63M | inference model / trained model | 2021.1.21 |
| ka | Kannada | 2.63M | inference model / trained model | 2021.1.21 |
| ta | Tamil | 2.63M | inference model / trained model | 2021.1.21 |
| mg | Mongolian | -- | Ongoing | |
| bg | Bangla | -- | Need dict and corpus | |
| vi | Vietnamese | -- | Need dict and corpus | |
| bm | Burmese | -- | Need dict and corpus | |
| tk | Turkish | -- | Need dict and corpus | |
| po | Polish | -- | Need dict and corpus | |

More TBC

Guideline for new language requests

To request support for a new language, open a PR that includes the following two files:

1. In folder [ppocr/utils/dict](./ppocr/utils/dict),
   submit the dictionary file to this path and name it `{language}_dict.txt`. It should contain a list of all characters in the language. See the other files in that folder for a format example.

2. In folder [ppocr/utils/corpus](./ppocr/utils/corpus),
   submit the corpus to this path and name it `{language}_corpus.txt`. It should contain a list of words in your language.
   At least 50,000 words per language are probably needed.
   Of course, the more, the better.

If your language has unique elements, please tell me in advance in any way you can, for example with useful links or Wikipedia pages.

@grasswolfs The model name for Turkish should be "tr" instead of "tk"; "tr" is the widely used abbreviation for Turkish.
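In the spirit of that suggestion, the sketch below maps a few of the model names from the quoted table to the ISO 639-1 code of the same language. The mapping is illustrative and not exhaustive, and the helper name `to_iso` is hypothetical; neither is part of PaddleOCR.

```python
# Model names from the quoted table that differ from the ISO 639-1 code
# of the language they describe. Illustrative only, not exhaustive.
ISO_639_1 = {
    "ge": "de",  # German
    "kr": "ko",  # Korean
    "jp": "ja",  # Japanese
    "tk": "tr",  # Turkish
    "po": "pl",  # Polish
    "xi": "es",  # Spanish
    "pu": "pt",  # Portuguese
    "ka": "kn",  # Kannada
}


def to_iso(model_name):
    """Return the ISO 639-1 code for a listed model name, else the name unchanged."""
    return ISO_639_1.get(model_name, model_name)
```

Names that are not in the mapping pass through unchanged, which does not by itself mean they are valid ISO 639-1 codes.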

@fcakyon

fcakyon commented Jan 28, 2021

I have also opened a PR for the Turkish dict and corpus: #1856

@tink2123
Collaborator

tink2123 commented Feb 2, 2021

Thanks @habout632 for adding Southeast Asian languages via #1896

@yumeliu

yumeliu commented Mar 16, 2021

Here is a dictionary for Greek.
el_dict.txt

@alenma04

Hi, do we have a model that detects all English characters along with special characters like . , " ( )?

@Jane-Ding
Contributor

Hi, thank you for the great work! I wonder whether you will add Traditional Chinese to the general model. Right now, the general model supports Simplified Chinese, English, and numbers.

@AzizDZH

AzizDZH commented Jan 25, 2023

Please add Tajik Language
tajik_corpus.txt
tajik_dict.txt

@ssavi-ict

ssavi-ict commented Mar 13, 2023

@fcakyon @D-DanielYang @xmy0916

I would like to contribute to Bangla Dictionary and Corpus. Can I do that?

Also, I have a few queries to ask -

  1. Could not clearly understand this line - If your language has unique elements, please tell me in advance within any way, such as useful links, wikipedia and so on.
  2. In the corpus of at least 50000 words, I am guessing all of them should be unique. Am I right?
  3. Is there any particular category of corpus words? Like except stop words or something similar to that?
  4. In ppocr/utils path I can not see any corpus directory.

Thanks in advance
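The thread does not say whether the corpus words must be unique. For anyone preparing a corpus who wants to see how many distinct words it contains, a minimal sketch follows; the function name `corpus_stats` is hypothetical, and words are assumed to be whitespace-separated.

```python
from pathlib import Path


def corpus_stats(corpus_path):
    """Return (total word count, list of unique words in first-seen order).

    Assumption: words are separated by whitespace.
    """
    words = Path(corpus_path).read_text(encoding="utf-8").split()
    # dict.fromkeys drops repeats while keeping the original order.
    unique = list(dict.fromkeys(words))
    return len(words), unique
```

Comparing `len(unique)` with the 50,000-word guideline shows how far a corpus is from that figure when counting distinct words only.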

@ariefwijaya

Please add Indonesian (id) and English (en) together.

@Truong-Thanh-Quang

Do you have any plans for a Vietnamese release?

@ursfan

ursfan commented May 19, 2023

Is it sufficient to change the file german_dict.txt if one wants to detect Fraktur, a historic German script, instead of the current script form? The dictionary that was learned for the German language should be the same, shouldn't it? For Tesseract, there is a trained file for Fraktur that is used to OCR historic documents.

@runachan19

Need Indonesian language support, please.

@zahamed

zahamed commented Jul 23, 2023

Hi, please add Bangla and English support. I have attached both files for Bangla:
bangla_dict.txt

bangla_corpus.txt

@EdwardYGLi

Hi team. Great work on Paddle, it's an amazing OCR engine! Can we please have Hebrew support in the multilanguage models?

Thanks!

@zahamed

zahamed commented Jul 29, 2023 via email

@shreyaahh

Can you provide support for any ancient scripts?

@hungtrieu07

Truong

I'm trying with my private data, but the results are very poor.

@SUPERustam

Sorry for my stupid question; I am a novice at DL. What is the difference between the inference model and the trained model?

Contributor

github-actions bot commented Jan 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 3, 2024
@taeefnajib
Contributor

I created a PR for Bangla

@juvebogdan

Does this list contain the latest models? If I want to fine-tune, for example, the German model, do I use the link on this page to download the pretrained model? If so, which yml file should I use? How do I know the architecture of these models?

@github-actions github-actions bot removed the stale label Jan 24, 2024
@Ligoml Ligoml unpinned this issue Feb 28, 2024
@AzizDZH

AzizDZH commented Mar 1, 2024

Please add Tajik Language
tajik_corpus.txt
tajik_dict.txt

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Jun 5, 2024
@SWHL SWHL converted this issue into discussion #12734 Jun 5, 2024
