Zero Width unicode characters #18

Merged · 4 commits · Jan 6, 2022

Conversation

@arjenpdevries (Contributor)

Great library!

Using it on an NLP task we study, I ran into a problem processing text drawn from the Web (where you find a lot of weird stuff!).
Specifically, we want to split on \u200B and \u200C, which are known as zero-width space (ZWSP) and zero-width non-joiner (ZWNJ), respectively.

This pull request modifies the code to do that by adding these characters to hyphens_and_underscore (you may want to rename that variable to also mention ZWSP if you decide to integrate the changes; I thought I'd first see whether you like the proposal). I added examples of the desired behavior to the tests.
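To illustrate the problem with plain Python (independent of syntok): a naive whitespace split leaves the zero-width characters buried inside a single token.

```python
# Zero-width characters glue adjacent words together for any whitespace-based splitter.
text = "zero\u200bwidth\u200cexample"  # ZWSP between "zero"/"width", ZWNJ between "width"/"example"

print(text.split())          # ['zero\u200bwidth\u200cexample'] -- one "word"
print(len(text.split()[0]))  # 18 characters, although only 16 are visible
```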

Background info:

@fnl (Owner) commented Jan 3, 2022

Thank you very much for your contribution! I did some research on the case; please help me understand this PR proposal better:

  1. ZWNJ is officially meant to be used inside words, not between them. To the best of my knowledge, this character should not be used to separate words; it exists to avoid ligatures between separate morphemes or syllables, i.e., sub-word tokens. Now, syntok is a word tokenizer, not a sub-word tokenizer. Therefore, the default behavior would seem to be that syntok should prune any occurrence of such a character and leave the surrounding tokens joined, applying whatever rules apply after its removal, since ZWNJ is purely meant for typographic usage. Why do you think your proposed solution is more useful, and if so, what is the evidence that this should be the default rather than some kind of optional behavior once syntok supports the correct default?

  2. As to ZWSP, that is an even more interesting case. I believe it should be treated just like any other space character. Surprisingly, the current, specifically Unicode-enabled (!) regex metacharacter \S that syntok uses to split tokens does not cover this snowflake of a space character. So changing the current space regex \S+ to [^\s\u200b]+ in the tokenizer would seem like the right default behavior for ZWSP (see the sketch after this list). Why do you think your proposed solution for ZWSP is more useful, and if so, what is the evidence that this should be the default rather than some kind of optional behavior?
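A minimal check of the behavior described in point 2, using plain Python re for illustration (the pattern is the one quoted above, not syntok's actual code):

```python
import re

text = "zero\u200bwidth space"

# U+200B is not covered by \s, so \S+ keeps it inside one token:
print("\u200b".isspace())                 # False
print(re.findall(r"\S+", text))           # ['zero\u200bwidth', 'space']

# The proposed default: treat ZWSP like any other whitespace when splitting.
print(re.findall(r"[^\s\u200b]+", text))  # ['zero', 'width', 'space']
```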

Again, thank you very much for your contribution; I am looking forward to your thoughts and reasoning on these two very special cases in the Unicode world!

@arjenpdevries (Contributor, Author)

Thanks for the quick response!

I have to agree with you: for ZWNJ it makes more sense to prune it (clearly, I did not think this through long enough).

For ZWSP, I thought it would behave like a hyphen: someone put it there to separate two tokens that do belong together, but did not want that connection to be visible except, e.g., at the end of a line, where the text may break at this point (whereas a hyphen makes the connection between the two tokens visible). At least, that is how I interpreted what I found on the net, though I have to admit I did not dive deeply into the Unicode documents themselves.

You can indeed easily argue that it should be treated as a (weird case of) space that is simply not visible.

(Whether you treat it as a space or as a hyphen, it may indeed be a good idea to make this processing optional?)

Meanwhile, in our Web data adventures, we identified two more weird cases: U+FE0F and U+FEFF. I think that for tokenization U+FE0F should simply be ignored, but you may want to think about the second one, U+FEFF, which is supposed to keep two tokens together (as opposed to ZWSP). If you do think about adding an option for the ZWSP processing mode, then maybe include ZWNBSP in a similar (but orthogonal) way.
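For reference, the official Unicode names of the code points mentioned here can be looked up with the standard library (just an aside, not part of the proposed change):

```python
import unicodedata

for cp in ("\u200b", "\u200c", "\ufe0f", "\ufeff"):
    print(f"U+{ord(cp):04X}  {unicodedata.name(cp)}")
# U+200B  ZERO WIDTH SPACE
# U+200C  ZERO WIDTH NON-JOINER
# U+FE0F  VARIATION SELECTOR-16
# U+FEFF  ZERO WIDTH NO-BREAK SPACE
```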

See also:

@fnl (Owner) commented Jan 4, 2022

The functionality you are suggesting for ZWSP is already supported in Unicode by the U+00AD SOFT HYPHEN (SHY) character, sometimes called the syllable hyphen. At least judging from yet another description of the purpose of Unicode zero-width characters, ZWSP would appear to be a very special space character that annoyingly is not covered by the \s regex metacharacter, but that should be used to split words.

As to the deprecated use of the BOM as ZWNBSP (U+FEFF), that seems to be meant to be used the same way as NBSP, but without a visible space. Its modern replacement is the Word Joiner, U+2060, which is supposed to be used in non-Indo-European scripts, so it seems you would not want to split words at this character.

U+FE0F is a Variation Selector in Unicode. I don't understand how this character fits with the rest of this discussion?

In summary, it seems the correct default behavior would then be:

  • ZWNJ U+200C, ZWJ U+200D, WJ U+2060, and BOM/ZWNBSP U+FEFF -> do nothing (do not split words/tokens here); syntok seems to be handling these cases correctly already
  • ZWSP U+200B -> change syntok to split on this character like on any space character (which includes NBSP and U+200A); see the sketch after this list
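As a minimal sketch of these defaults (illustration only, not syntok's actual implementation): split at whitespace plus ZWSP, and leave the joiner characters alone.

```python
import re

# Split on any whitespace plus ZWSP (U+200B); ZWNJ/ZWJ/WJ/BOM neither split
# nor get removed by the tokenizer itself.
TOKEN = re.compile(r"[^\s\u200b]+")

print(TOKEN.findall("zero\u200bwidth word"))  # ['zero', 'width', 'word']
print(TOKEN.findall("non\u200cjoiner"))       # ['non\u200cjoiner'] -- still one token
```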

@arjenpdevries (Contributor, Author)

Agreed.

Shall I modify my pull request accordingly?

(PS: Please do ignore the mention of U+FE0F; that is only relevant for the NLP library we use together with syntok.)

@fnl (Owner) commented Jan 4, 2022

If you are interested in doing that, I would be glad to merge a fix for U+200B space handling into syntok: so, yes, please! 👍

Only U+200B should be processed, and treated as a space (not a hyphen).

@arjenpdevries (Contributor, Author)

Done! I rolled back everything related to U+200C, modified U+200B to act as a special kind of space, and updated the tests accordingly.
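A quick sanity check of the merged behavior, assuming syntok's tokenizer API as shown in its README (syntok.tokenizer.Tokenizer, with tokens carrying a .value attribute); adjust if the interface differs:

```python
# Import path and Token.value attribute are assumed from syntok's README.
from syntok.tokenizer import Tokenizer

tok = Tokenizer()
values = [token.value for token in tok.tokenize("zero\u200bwidth word")]
print(values)  # expected after this PR: ['zero', 'width', 'word']
```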

fnl merged commit a26dbc0 into fnl:master on Jan 6, 2022

@fnl (Owner) commented Jan 6, 2022

Thank you very much! I will create a new syntok release.

@arjenpdevries (Contributor, Author)

Thanks for the coaching!

@fnl (Owner) commented Jan 6, 2022

Done; deployed with version 1.3.3.

Arjen, much more importantly: thank you very much for your contribution, for the interesting findings about missing Unicode support in syntok for some of the more esoteric parts of the standard, and for the great discussion!
