Zero Width unicode characters #18

Merged · 4 commits · Jan 6, 2022

Conversation

@arjenpdevries (Contributor)

Great library!

Using it on an NLP task we study, I ran into a problem processing text drawn from the Web (where you find a lot of weird stuff!).
Specifically, we want to split on \u200B and \u200C, which are known as zero-width space (ZWSP) and zero-width non-joiner (ZWNJ), respectively.

This pull request modifies the code to do that by adding these characters to hyphens_and_underscore (you may want to rename that variable to also mention ZWSP if you decide to integrate the changes; I thought I'd first see whether you like the proposal). I added examples of the desired behavior to the tests.
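To illustrate the problem with plain Python (independent of syntok): a naive whitespace split leaves the zero-width characters buried inside a single token.

```python
# Zero-width characters glue adjacent words together for any whitespace-based splitter.
text = "zero\u200bwidth\u200cexample"  # ZWSP between "zero"/"width", ZWNJ between "width"/"example"

print(text.split())          # ['zero\u200bwidth\u200cexample'] -- one "word"
print(len(text.split()[0]))  # 18 characters, although only 16 are visible
```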

Background info:

@fnl (Owner) commented Jan 3, 2022

Thank you very much for your contribution! I did some research on the case; please help me understand this PR proposal better:

  1. ZWNJ is officially meant to be used inside words, not between them. To the best of my knowledge, this character should not be used to separate words; it exists to avoid ligatures between separate morphemes or syllables, i.e., sub-word tokens. Now, syntok is a word tokenizer, not a sub-word tokenizer. Therefore, the default behavior would seem to be that syntok should prune any occurrence of such a character and leave the surrounding tokens joined, applying whatever rules apply after its removal, since ZWNJ is purely meant for typographic usage. Why do you think your proposed solution is more useful, and if so, what is the evidence that this should be the default rather than some kind of optional behavior once syntok supports the correct default?

  2. As to ZWSP, that is an even more interesting case. I believe it should be treated just like any other space character. Surprisingly, the current, specifically Unicode-enabled (!) regex metacharacter \S that syntok uses to split tokens does not cover this snowflake of a space character. So changing the current space regex \S+ to [^\s\u200b]+ in the tokenizer would seem like the right default behavior for ZWSP (see the sketch after this list). Why do you think your proposed solution for ZWSP is more useful, and if so, what is the evidence that this should be the default rather than some kind of optional behavior?
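A minimal check of the behavior described in point 2, using plain Python re for illustration (the pattern is the one quoted above, not syntok's actual code):

```python
import re

text = "zero\u200bwidth space"

# U+200B is not covered by \s, so \S+ keeps it inside one token:
print("\u200b".isspace())                 # False
print(re.findall(r"\S+", text))           # ['zero\u200bwidth', 'space']

# The proposed default: treat ZWSP like any other whitespace when splitting.
print(re.findall(r"[^\s\u200b]+", text))  # ['zero', 'width', 'space']
```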

Again, thank you very much for your contribution; I am looking forward to your thoughts and reasoning on these two very special cases in the Unicode world!

@arjenpdevries (Contributor, Author)

Thanks for the quick response!

I have to agree with you: for ZWNJ it makes more sense to prune it (clearly, I did not think this through long enough).

For ZWSP, I thought it would behave like a hyphen: someone put it there to separate two tokens that do belong together, but did not want that connection to be visible except, e.g., at the end of a line, where the text may break at this point (whereas a hyphen makes the connection between the two tokens visible). At least, that is how I interpreted what I found on the net, though I have to admit I did not dive deeply into the Unicode documents themselves.

You can indeed easily argue that it should be treated as a (weird case of) space that is simply not visible.

(Whether you treat it as a space or as a hyphen, it may indeed be a good idea to make this processing optional?)

Meanwhile, in our Web data adventures, we identified two more weird cases: U+FE0F and U+FEFF. I think that for tokenization U+FE0F should simply be ignored, but you may want to think about the second one, U+FEFF, which is supposed to keep two tokens together (as opposed to ZWSP). If you do think about adding an option for the ZWSP processing mode, then maybe include ZWNBSP in a similar (but orthogonal) way.
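For reference, the official Unicode names of the code points mentioned here can be looked up with the standard library (just an aside, not part of the proposed change):

```python
import unicodedata

for cp in ("\u200b", "\u200c", "\ufe0f", "\ufeff"):
    print(f"U+{ord(cp):04X}  {unicodedata.name(cp)}")
# U+200B  ZERO WIDTH SPACE
# U+200C  ZERO WIDTH NON-JOINER
# U+FE0F  VARIATION SELECTOR-16
# U+FEFF  ZERO WIDTH NO-BREAK SPACE
```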

See also:

@fnl (Owner) commented Jan 4, 2022

The functionality you are suggesting for ZWSP is already supported in Unicode by the U+00AD SOFT HYPHEN (SHY) character, sometimes called the syllable hyphen. At least judging from yet another description of the purpose of Unicode zero-width characters, ZWSP would appear to be a very special space character that annoyingly is not covered by the \s regex metacharacter, but that should be used to split words.

As to the deprecated use of the BOM as ZWNBSP (U+FEFF), that seems to be meant to be used the same way as NBSP, but without a visible space. Its modern replacement is the Word Joiner, U+2060, which is supposed to be used in non-Indo-European scripts, so it seems you would not want to split words at this character.

U+FE0F is a Variation Selector in Unicode. I don't understand how this character fits with the rest of this discussion?

In summary, it seems the correct default behavior would then be:

  • ZWNJ U+200C, ZWJ U+200D, WJ U+2060, and BOM/ZWNBSP U+FEFF -> do nothing (do not split words/tokens here); syntok seems to be handling these cases correctly already
  • ZWSP U+200B -> change syntok to split on this character like on any space character (which includes NBSP and U+200A); see the sketch after this list
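As a minimal sketch of these defaults (illustration only, not syntok's actual implementation): split at whitespace plus ZWSP, and leave the joiner characters alone.

```python
import re

# Split on any whitespace plus ZWSP (U+200B); ZWNJ/ZWJ/WJ/BOM neither split
# nor get removed by the tokenizer itself.
TOKEN = re.compile(r"[^\s\u200b]+")

print(TOKEN.findall("zero\u200bwidth word"))  # ['zero', 'width', 'word']
print(TOKEN.findall("non\u200cjoiner"))       # ['non\u200cjoiner'] -- still one token
```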

@arjenpdevries (Contributor, Author)

Agreed.

Shall I modify my pull request accordingly?

(PS: Please do ignore the mention of U+FE0F; that is only relevant for the NLP library we use together with syntok.)

@fnl (Owner) commented Jan 4, 2022

If you are interested in doing that, I would be glad to merge a fix for U+200B space handling into syntok: so, yes, please! 👍

Only U+200B should be processed, and treated as a space (not a hyphen).

@arjenpdevries (Contributor, Author)

Done! I rolled back everything related to U+200C, modified U+200B to act as a special kind of space, and updated the tests accordingly.
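A quick sanity check of the merged behavior, assuming syntok's tokenizer API as shown in its README (syntok.tokenizer.Tokenizer, with tokens carrying a .value attribute); adjust if the interface differs:

```python
# Import path and Token.value attribute are assumed from syntok's README.
from syntok.tokenizer import Tokenizer

tok = Tokenizer()
values = [token.value for token in tok.tokenize("zero\u200bwidth word")]
print(values)  # expected after this PR: ['zero', 'width', 'word']
```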

fnl merged commit a26dbc0 into fnl:master on Jan 6, 2022

@fnl (Owner) commented Jan 6, 2022

Thank you very much! I will create a new syntok release.

@arjenpdevries (Contributor, Author)

Thanks for the coaching!

@fnl (Owner) commented Jan 6, 2022

Done; deployed with version 1.3.3.

Arjen, much more importantly: thank you very much for your contribution, for the interesting findings about missing Unicode support in syntok for some of the more esoteric parts of the standard, and for the great discussion!
