
A cleaned version of the Hebrew Sentiment data set published by Amram, A., Ben-David, A., and Tsarfaty, R. (2018).


OnlpLab/Hebrew-Sentiment-Data


Hebrew-Sentiment-Data

A cleaned version of the Hebrew Sentiment dataset published by Amram, A., Ben-David, A., and Tsarfaty, R. (2018) {1}. The original dataset can be found under the omilab GitHub.
This dataset was prepared and used for the paper "AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With".

We discovered data leakage in the original dataset (material shared between the test and train sets), so we provide a cleaned version with duplicates removed, along with texts that are almost identical (for example, two samples that say "I love you!" and "I love you!!" are considered identical). The cleaned data comes with a new train-dev-test split instead of the original train-test split. This repository provides both the new clean data and the full deduping code.
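To make the "almost identical" idea concrete, here is a minimal sketch of one way such near-duplicates could be collapsed. It is not the deduping code shipped in this repository: the function names `normalize` and `dedupe` are hypothetical, and the rule used (ignore punctuation, whitespace and case) is an assumption chosen only because it covers the "I love you!" example above.

```python
import unicodedata


def normalize(text: str) -> str:
    """Build a comparison key: drop punctuation, squeeze whitespace, ignore case.

    Assumption: two samples count as "almost identical" when their keys match.
    The repository's own deduping code may use a different criterion.
    """
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split()).casefold()


def dedupe(samples):
    """Keep the first sample seen for every comparison key, preserving order."""
    seen = set()
    unique = []
    for text in samples:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


samples = ["I love you!", "I love you!!", "I love you", "I hate this."]
print(dedupe(samples))  # ['I love you!', 'I hate this.']
```

Keying on a normalized string keeps the pass linear in the number of samples. A fuzzier criterion (for example an edit-distance threshold) would catch more variants but needs pairwise comparison or an index.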

Data and cleaning process information:

| Data set | Original set size | Deduped set size (token) | Deduped set size (morph) | Comment |
| --- | --- | --- | --- | --- |
| Train | 9220 | 5926 | 5932 | |
| Dev | 1026 | 846 | 847 | Originally didn't exist and was part of the train set |
| Test | 2561 | 1695 | 1696 | |

Duplication within the sets:

| Data set | Leakage percentage |
| --- | --- |
| Train | 24.6% |
| Dev | 3.8% |
| Test | 7.53% |

Leakage between data sets:

| Data sets | Leakage |
| --- | --- |
| Test - Train | 1120 (12%) |
| Test + Dev - Train | 1685 (16.22%) |
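As an illustration of how leakage between two splits could be counted, the sketch below checks each sample of one split against a set of keys built from the other. The function name `count_leaked` and the default exact-match key (whitespace-stripped text) are assumptions made for this example. The README does not say which matching rule or which denominator produced the percentages above, so the sketch reports only a raw count.

```python
def count_leaked(test, train, key=str.strip):
    """Count test samples whose key also appears among the train samples.

    `key` maps a sample to its comparison key. The default (strip surrounding
    whitespace) is an illustrative assumption, not the repository's rule.
    """
    train_keys = {key(sample) for sample in train}
    return sum(1 for sample in test if key(sample) in train_keys)


train = ["good movie", "bad service"]
test = ["good movie ", "great food", "bad service"]
print(count_leaked(test, train))  # 2
```

Passing a looser `key` function (one that also ignores punctuation, say) would make the same count cover near-identical samples as well as exact matches.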

{1} Amram, A., Ben-David, A., and Tsarfaty, R. (2018). Representations and Architectures in Neural Sentiment Analysis for Morphologically Rich Languages: A Case Study from Modern Hebrew. In: Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, NM (pp. 2242-2252).
