title |
---|
Generating an optimized Thai keyboard layout |
I built a new Thai keyboard layout called Manoonchai.
All of the Thai keyboard layouts were created in the typewriter era. I have wanted to create a new one for some time, ever since I moved from Kedmanee (Thai's QWERTY) to Pattachote. The problem with both layouts is that they are not optimized for modern keyboards, and some of their characters are even obsolete, e.g. ฃ, ฅ, ฦ, ๏. Moreover, Thai numerals are seldom used in normal situations, so users are forced to switch to an English layout or use the numpad to type numbers. Lastly, I want the layout I create to be agnostic to keyboard form factor. I can use a 40% keyboard with both Kedmanee and Pattachote, but I still think it can be better.
Apparently there is a new layout called IKBAEB that shares all of these ideas, but I want to use a modern Thai corpus as the dataset and then generate the layout from it somehow. Maybe it will yield the same result, but this way I have an excuse to learn Rust.
For the name of the layout, I will use my family name for now.
- Learn just enough Rust to process a Thai-language corpus.
- Generate n-grams from the corpus.
- Create a typing effort model from the n-grams, similar to Carpalx.
- Discover and measure new layouts that score a lower effort in the model.
Since I'm using Rust for this project and I'm relatively new to the language, I'll start by gathering a corpus from various sources and finding out which keys are used most frequently. The code is simple: it scans all the words from the source, counts the characters, then sorts them by count. The data is not very useful on its own, but it gives an idea of which keys should be on the home row.
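As a rough illustration of that counting step, here is a minimal sketch. It is not the project's actual code: the function name `char_frequencies` and the inline sample text are made up for the example, and a real run would read the corpus from files instead. It relies on the fact that a Rust `char` is a single Unicode scalar value, so Thai vowels and tone marks are counted separately, which matches how they are typed as separate keys.

```rust
use std::collections::HashMap;

/// Count how often each character occurs in `text`, skipping whitespace.
/// Returns (character, count) pairs sorted by descending count; ties are
/// broken by character order so the output is deterministic.
/// (Hypothetical helper, named for this example only.)
fn char_frequencies(text: &str) -> Vec<(char, usize)> {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for ch in text.chars().filter(|c| !c.is_whitespace()) {
        *counts.entry(ch).or_insert(0) += 1;
    }
    let mut sorted: Vec<(char, usize)> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    sorted
}

fn main() {
    // Tiny stand-in for a real corpus file.
    let sample = "ไม่ได้ ไม่เป็น";
    for (ch, n) in char_frequencies(sample).iter().take(5) {
        println!("{} : {}", ch, n);
    }
}
```

Sorting the full list once is fine at this scale, because the number of distinct characters is tiny compared with the size of the corpus.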
Next, I plan to create a typing effort model similar to Carpalx, but with the finger travel distance altered a bit to suit my use of a 40% keyboard. Once the model is working, I'll train it with the text from all the sources, including my chat logs.
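To make the idea of scoring a layout against n-gram counts concrete, here is a deliberately simplified sketch. It is not Carpalx's model and not the final model for this project: it only sums a hypothetical per-key cost for every character of every n-gram, weighted by how often the n-gram occurs, and it ignores finger travel, hand alternation, and any other penalties. The names `layout_effort`, `key_cost`, and `missing_cost`, and all the numbers, are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Average effort per n-gram occurrence for a given layout.
/// `layout` maps a character to a key position, and `key_cost[position]`
/// is the (hypothetical) effort of pressing that key. Characters that the
/// layout does not contain are charged `missing_cost`.
fn layout_effort(
    ngrams: &HashMap<String, usize>,
    layout: &HashMap<char, usize>,
    key_cost: &[f64],
    missing_cost: f64,
) -> f64 {
    let mut total = 0.0;
    let mut occurrences = 0usize;
    for (ngram, &count) in ngrams {
        // Effort of one n-gram: the sum of the costs of its keys.
        let ngram_cost: f64 = ngram
            .chars()
            .map(|ch| {
                layout
                    .get(&ch)
                    .and_then(|&pos| key_cost.get(pos))
                    .copied()
                    .unwrap_or(missing_cost)
            })
            .sum();
        total += ngram_cost * count as f64;
        occurrences += count;
    }
    if occurrences == 0 { 0.0 } else { total / occurrences as f64 }
}

fn main() {
    // Made-up counts and costs, for illustration only.
    let ngrams = HashMap::from([("aba".to_string(), 2), ("bbb".to_string(), 1)]);
    let key_cost = [1.0, 3.0]; // position 0 is a cheap key, position 1 an expensive one
    let layout_a = HashMap::from([('a', 0), ('b', 1)]);
    let layout_b = HashMap::from([('a', 1), ('b', 0)]);
    println!("layout A: {:.3}", layout_effort(&ngrams, &layout_a, &key_cost, 5.0));
    println!("layout B: {:.3}", layout_effort(&ngrams, &layout_b, &key_cost, 5.0));
}
```

With a scoring function of this shape, comparing two candidate layouts comes down to comparing two numbers, which is what a search over layouts needs.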
I got some Thai corpus data (e.g. Wisesight, Wongnai), then generated the triads to see which 3-character substrings are used the most (Code). The triads will be one of the parameters used to calculate the typing effort.
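Triad extraction can be sketched as a sliding window of three characters over the text. The version below is a minimal sketch, not the linked code: the function names `count_triads` and `top_triads` are made up for the example, and it assumes that triads should not span whitespace, so each whitespace-separated chunk is processed on its own.

```rust
use std::collections::HashMap;

/// Count every 3-character substring (triad) in `text`.
/// Assumption: triads never cross whitespace, so each
/// whitespace-separated chunk is windowed separately.
fn count_triads(text: &str) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for chunk in text.split_whitespace() {
        // Collect chars first: windowing over bytes would cut Thai
        // characters apart, because each one is 3 bytes in UTF-8.
        let chars: Vec<char> = chunk.chars().collect();
        for window in chars.windows(3) {
            let triad: String = window.iter().collect();
            *counts.entry(triad).or_insert(0) += 1;
        }
    }
    counts
}

/// Return the `n` most frequent triads, highest count first.
/// Ties are broken alphabetically so the result is deterministic.
fn top_triads(counts: &HashMap<String, usize>, n: usize) -> Vec<(String, usize)> {
    let mut sorted: Vec<(String, usize)> =
        counts.iter().map(|(k, v)| (k.clone(), *v)).collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    sorted.truncate(n);
    sorted
}

fn main() {
    // Tiny stand-in for a real corpus.
    let counts = count_triads("ไม่เป็น ไม่ได้");
    for (triad, n) in top_triads(&counts, 5) {
        println!("{} : {}", triad, n);
    }
}
```

Chunks shorter than three characters produce no windows, so they are skipped without any special handling.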
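Here are the sample triads I got from the corpus: two top-30 lists back to back, first from the restaurant reviews, then from the social media messages.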
้าน : 134642
ร้า : 119805
ที่ : 118052
ไม่ : 102900
่อย : 82040
ได้ : 73344
นี้ : 72915
มาก : 69226
เป็ : 66661
แต่ : 66436
ป็น : 62878
เลย : 62262
ว่า : 59965
ค่ะ : 57345
ข้า : 53751
ั่ง : 51812
รับ : 51245
ร่อ : 50937
อร่ : 50575
นนี : 48245
หาร : 44594
ครั : 44076
าหา : 43952
และ : 43314
อาห : 43283
ื่อ : 41649
ให้ : 41496
น้ำ : 40458
ทาน : 40247
่าง : 38617
ที่ : 10920
ไม่ : 10329
ได้ : 7626
555 : 6047
รับ : 5944
ว่า : 5886
นี้ : 5704
การ : 5318
ื่อ : 5292
ให้ : 4747
ล้ว : 4504
เป็ : 4498
ครั : 4400
แล้ : 4359
ป็น : 4331
เลย : 4298
้อง : 4186
กิน : 3957
แต่ : 3957
กัน : 3939
ของ : 3727
และ : 3341
มาก : 3283
วัน : 3231
ค่ะ : 3181
กับ : 3085
ประ : 3003
่าง : 2989
ั้ง : 2978
้าง : 2968
Some of the triads appear in both lists, although the contexts are not the same (restaurant reviews vs. social media messages).
The rest can be read on my Medium.