
GH-1503: Sentiment Datasets #1545

Merged
merged 5 commits into from
Apr 26, 2020

Conversation

@alanakbik (Collaborator) commented Apr 26, 2020

This PR adds two new sentiment datasets to Flair: AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment. See #1503.

There are also a number of improvements for the ClassificationCorpus and ClassificationDataset classes:

  • It is now possible to select from three memory modes ('full', 'partial' and 'disk'). Use 'full' if the entire dataset and all objects fit into memory. Use 'partial' if it does not, and 'disk' if even 'partial' does not fit into memory.
  • It is also now possible to provide "name maps" to rename labels in datasets. For instance, some sentiment analysis datasets use '0' and '1' as labels, while others use 'POSITIVE' and 'NEGATIVE'. By providing name maps you can rename labels so they are consistent across datasets.
  • You can now choose which splits to downsample (for instance, you might want to downsample 'train' and 'dev' but not 'test').
  • You can now specify the option "filter_if_longer_than" to filter out all sentences containing more than the given number of whitespace-separated tokens. This is useful for limiting corpus size, as some sentiment analysis datasets are gigantic.
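
As a rough illustration of the last point, the whitespace-based length filter can be pictured like this (a minimal sketch, not the actual Flair implementation; the helper name is illustrative):

```python
# Hypothetical sketch of a 'filter_if_longer_than' check: drop any sentence
# with more whitespace-separated tokens than the limit.
# (Illustrative only; not Flair's internal code.)

def keep_sentence(text: str, filter_if_longer_than: int) -> bool:
    """Return True if the sentence stays in the corpus."""
    if filter_if_longer_than < 0:
        return True  # assume a negative value disables filtering
    return len(text.split()) <= filter_if_longer_than

sentences = [
    "Great product , works as advertised .",  # 6 tokens, kept
    "word " * 80,                             # 80 tokens, dropped
]
filtered = [s for s in sentences if keep_sentence(s, 50)]
print(len(filtered))  # → 1
```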

Putting it all together, you can now create a MultiCorpus of 5 sentiment analysis datasets and limit the max length of each data point to 50 tokens, like this:

corpus = MultiCorpus([
    IMDB(filter_if_longer_than=50),
    SENTEVAL_SST_BINARY(filter_if_longer_than=50),
    SENTEVAL_MR(filter_if_longer_than=50),
    SENTIMENT_140().downsample(0.1, downsample_test=False),  # downsample train and dev, but not test
    AMAZON_REVIEWS(filter_if_longer_than=50, memory_mode='partial'),  # use partial memory mode since this dataset is huge
])
print(corpus)

In this example, we downsample SENTIMENT_140 ('train' and 'dev' but not 'test') and keep AMAZON_REVIEWS in 'partial' memory mode because it is too large to load fully into memory.
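
To make the datasets' labels consistent before combining them, the name maps described above amount to a simple label-renaming dictionary. A minimal sketch (the helper function below is illustrative, not the Flair API):

```python
# Sketch of label "name maps": rename raw dataset labels so they are
# consistent across corpora. The helper is illustrative, not Flair's API.

def apply_name_map(labels, name_map):
    """Rename each label via the map; labels not in the map pass through."""
    return [name_map.get(label, label) for label in labels]

# One dataset uses '0'/'1'; another already uses 'NEGATIVE'/'POSITIVE'.
name_map = {'0': 'NEGATIVE', '1': 'POSITIVE'}
print(apply_name_map(['0', '1', '1'], name_map))
# → ['NEGATIVE', 'POSITIVE', 'POSITIVE']
```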

@alanakbik alanakbik merged commit 8f9dec5 into master Apr 26, 2020
@alanakbik alanakbik deleted the GH-1503-sentiment-datasets branch April 26, 2020 14:10