Segmenting sentences at colons #9

Open
fhamborg opened this issue Jan 22, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@fhamborg

For example, the following snippet is extracted as a single sentence (ending at the last full stop), but it should perhaps be split at the colons.

Here they “warn” anyone who opposes his radical ideology:
Four police officers were sent to hospital:
Violence against police officers is not only acceptable with Bernie Sanders and Black Lives Matter terrorists, its necessary to create chaos and panic:
What kind of violent protest would be complete without Barack Obama’s good friend, domestic terrorist Bill Ayers:
It’s probably just a coincidence that on a day that Obama was too busy to attend Nancy Reagan’s funeral, he was able to address a crowd about his hate for Trump only hours before this organized chaos in Chicago:
And finally, we’re wondering how much our Organizer In Chief had to do with this Alinsky style chaos in Chicago:
Illegal aliens, paid Soros protesters, angry Black Lives Matter terrorists inspired by Obama’s race war and Bernie Sanders supporters who have absolutely no idea why they showed up, sent four innocent police officers to the hospital; prevented thousands of innocent Americans from exercising their First Amendment right.

Is this intentional? Is there a way to force splitting at colons? Besides this extreme example, I think I have come across many other cases where syntok did not split at colons.

@fnl
Owner

fnl commented Jan 22, 2020

Thank you, Felix, for bringing this up; a valid feature request. Colon (and semi-colon) handling is indeed a bit of a borderline affair, and technically they are sentence separators. It might make sense to support that, but I need to think about it a bit more. I'd also love to hear feedback/opinions from other users about this.

[Correcting the title and adding labels.]

@fnl changed the title from "incorrect handling of colons" to "Segmenting sentences at colons" on Jan 22, 2020
@fnl added the "enhancement" (New feature or request) label on Jan 22, 2020
@fhamborg
Author

fhamborg commented Jan 30, 2020

Yeah, I agree. Whether segmentation is sensible at colons and semicolons likely also depends on the text domain. Looking at the definition of each on Wikipedia, one finds that both have cases where segmentation would be required and others where it would not.

E.g., for the semicolon (cf. Wikipedia): "The semicolon or semi-colon (;) is a punctuation mark that separates major sentence elements. A semicolon can be used between two closely related independent clauses, provided they are not already joined by a coordinating conjunction. Semicolons can also be used in place of commas to separate the items in a list, particularly when the elements of that list contain commas."

Yet, at least for the colon, I found that nltk and CoreNLP actually do perform segmentation more often than not (if not always?).

@nmstoker

My two cents: those examples aren't really separate sentences because of the colons; they're separate sentences due to their content, and they just happen to have the (very odd) colons at the end. It's not normal English usage to end a sentence with a colon; in fact, it actively implies some following content. Therefore I would tend not to expect a split on a colon, and would prefer that this be left to people to deal with if there are special cases in their particular text source.

However, with a semi-colon I would be more open to the idea that they can be treated as separate sentences. It's not uncommon for editors looking to simplify text to turn such cases into two (or more) distinct sentences and it would be less surprising here than it would be with the colon case.
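
For anyone who does need colon splits for a particular text source, the caller-side handling could be a small post-processing step applied to each sentence string the segmenter returns. The following is only a minimal sketch under stated assumptions: `split_at_colons` is a hypothetical helper, not part of syntok, and the rule (split after a colon only when whitespace and then an uppercase letter or opening quote follows) is a heuristic chosen for this illustration.

```python
import re
from typing import List

# Hypothetical helper, not part of syntok.
# Assumed heuristic: a colon ends a sentence only if it is followed by
# whitespace and then an uppercase ASCII letter or an opening double quote.
# Colons inside times ("10:30") or URLs ("http://") have no whitespace
# after them, so they are never treated as boundaries.
_COLON_BOUNDARY = re.compile(r'(?<=:)\s+(?=[A-Z"“])')


def split_at_colons(sentence: str) -> List[str]:
    """Split one already-segmented sentence at colon boundaries."""
    parts = _COLON_BOUNDARY.split(sentence)
    # Drop empty fragments and surrounding whitespace (e.g. line breaks).
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    text = (
        "Four police officers were sent to hospital:\n"
        "And finally, we are wondering who organized this:\n"
        "Thousands were prevented from attending."
    )
    for piece in split_at_colons(text):
        print(piece)
```

A rule like this will still over-split in some domains (for example, "Ingredients: Flour, sugar"), which is one reason to keep it outside the library and tune it per text source.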

@fnl
Owner

fnl commented Apr 27, 2020

In general, libraries such as nltk and CoreNLP tend to severely over-split, which was the major reason for me to come up with my own. Hence, I agree, adding semicolons as potential markers could be interesting, while it seems unwise to elevate colons to official markers, too.

@fhamborg
Author

fhamborg commented Apr 27, 2020

> Hence, I agree, adding semicolons as potential markers could be interesting, while it seems unwise to elevate colons to official markers, too.

This seems feasible to me.

@fnl
Owner

fnl commented Apr 28, 2020

Release 1.3.1 now supports semi-colon segmentation.

I will leave this ticket open, however, as it was specifically about segmenting at colons.

3 participants