
GH-1503: Sentiment Datasets #1545

Merged
merged 5 commits into from
Apr 26, 2020

Conversation

@alanakbik (Collaborator) commented Apr 26, 2020

This PR adds two new sentiment datasets to Flair: AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment. See #1503.

There are also a number of improvements for the ClassificationCorpus and ClassificationDataset classes:

  • It is now possible to select from three memory modes ('full', 'partial' and 'disk'). Use 'full' if the entire dataset and all objects fit into memory. Use 'partial' if it does not, and 'disk' if even 'partial' does not fit into memory.
  • It is also now possible to provide "name maps" to rename labels in datasets. For instance, some sentiment analysis datasets use '0' and '1' as labels, while others use 'POSITIVE' and 'NEGATIVE'. By providing name maps you can rename labels so they are consistent across datasets.
  • You can now choose which splits to downsample (for instance, you might want to downsample 'train' and 'dev' but not 'test').
  • You can now specify the option "filter_if_longer_than" to filter out all sentences containing more than the given number of whitespace-separated tokens. This is useful for limiting corpus size, as some sentiment analysis datasets are gigantic.
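
As a rough illustration of the last point, the whitespace-based length filter can be pictured like this (a minimal sketch, not the actual Flair implementation; the helper name is illustrative):

```python
# Hypothetical sketch of a 'filter_if_longer_than' check: drop any sentence
# with more whitespace-separated tokens than the limit.
# (Illustrative only; not Flair's internal code.)

def keep_sentence(text: str, filter_if_longer_than: int) -> bool:
    """Return True if the sentence stays in the corpus."""
    if filter_if_longer_than < 0:
        return True  # assume a negative value disables filtering
    return len(text.split()) <= filter_if_longer_than

sentences = [
    "Great product , works as advertised .",  # 6 tokens, kept
    "word " * 80,                             # 80 tokens, dropped
]
filtered = [s for s in sentences if keep_sentence(s, 50)]
print(len(filtered))  # → 1
```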

Putting it all together, you can now create a MultiCorpus of 5 sentiment analysis datasets and limit the max length of each data point to 50 tokens, like this:

corpus = MultiCorpus([
    IMDB(filter_if_longer_than=50),
    SENTEVAL_SST_BINARY(filter_if_longer_than=50),
    SENTEVAL_MR(filter_if_longer_than=50),
    SENTIMENT_140().downsample(0.1, downsample_test=False),  # downsample train and dev, but not test
    AMAZON_REVIEWS(filter_if_longer_than=50, memory_mode='partial'),  # use partial memory mode since this dataset is huge
])
print(corpus)

In this example, we downsample SENTIMENT_140 ('train' and 'dev' but not 'test') and keep AMAZON_REVIEWS in 'partial' memory mode because it is too large to load fully into memory.
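
To make the datasets' labels consistent before combining them, the name maps described above amount to a simple label-renaming dictionary. A minimal sketch (the helper function below is illustrative, not the Flair API):

```python
# Sketch of label "name maps": rename raw dataset labels so they are
# consistent across corpora. The helper is illustrative, not Flair's API.

def apply_name_map(labels, name_map):
    """Rename each label via the map; labels not in the map pass through."""
    return [name_map.get(label, label) for label in labels]

# One dataset uses '0'/'1'; another already uses 'NEGATIVE'/'POSITIVE'.
name_map = {'0': 'NEGATIVE', '1': 'POSITIVE'}
print(apply_name_map(['0', '1', '1'], name_map))
# → ['NEGATIVE', 'POSITIVE', 'POSITIVE']
```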

@alanakbik alanakbik merged commit 8f9dec5 into master Apr 26, 2020
@alanakbik alanakbik deleted the GH-1503-sentiment-datasets branch April 26, 2020 14:10