Option for TransformerWordEmbeddings to process long sentences. #1680
Conversation
get latest version from master
Merge …NLP-master # Conflicts: # flair/embeddings/token.py
@schelv thanks a lot for adding this - lots of people will surely find this useful! Could you also share the script you used for the Dutch CoNLL experiments? Then we can update the doc!
The script in Experiments.md is updated. I currently do not have the resources to train the model 5 times and average the score, so I'm leaving that for someone else to do. =]
No worries, maybe we can do this before the next release :) Thanks again!
Hello, great feature!
@alejandrojcastaneira
The long input is split into overlapping windows, and the transformer gives the embeddings for these windows.
To get as much context into the embeddings as possible, the window embeddings are "glued" back together to form the embedding of the long sequence.
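Roughly, the idea looks like the sketch below. It is illustrative only: `embed_long_sequence`, `window_size`, and `stride` are made-up names for this explanation, not the actual flair implementation in `flair/embeddings/token.py`.

```python
# Illustrative sketch of the sliding-window idea described above; names and
# parameters are placeholders, not the actual flair implementation.
import torch

def embed_long_sequence(subtoken_ids, model, window_size=512, stride=256):
    """Split a long subtoken sequence into overlapping windows, embed each
    window, then glue the per-window embeddings back together."""
    windows = []
    for start in range(0, len(subtoken_ids), stride):
        windows.append(subtoken_ids[start:start + window_size])
        if start + window_size >= len(subtoken_ids):
            break

    embedded = []
    for i, window in enumerate(windows):
        ids = torch.tensor([window])
        hidden = model(ids).last_hidden_state[0]  # (window_len, hidden_dim)
        if i == 0:
            # keep the whole first window
            embedded.append(hidden)
        else:
            # for later windows, keep only the positions not covered by the
            # previous window, so each kept position still has left context
            embedded.append(hidden[window_size - stride:])
    # concatenated length equals len(subtoken_ids)
    return torch.cat(embedded, dim=0)
```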
Hello! Great feature, indeed. I searched for this same functionality in TransformerDocumentEmbeddings, but I understood from the code that it does not support it.
I am doing a text classification task using this document embedding. What happens when I have a very long sentence as input? What are the possibilities for solving this issue?
I'm not sure. Either find a model that can handle texts of any length, or think of a way to shorten your text without losing its meaning (summarize it first, maybe?).
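A third possibility, sketched below as a rough workaround rather than a supported flair feature, is to chunk the document, embed each chunk, and mean-pool the chunk embeddings. Whether mean-pooling loses too much meaning depends on your task, and the whitespace chunking here is a simplification.

```python
# Hedged sketch: work around the document-length limit by chunking the text,
# embedding each chunk separately, and mean-pooling the chunk embeddings.
import torch
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedding = TransformerDocumentEmbeddings("bert-base-uncased")

def embed_long_document(text, chunk_size=100):
    # naive whitespace chunking; a real implementation should chunk by
    # subtokens so every chunk is guaranteed to fit the model's max length
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    sentences = [Sentence(chunk) for chunk in chunks]
    embedding.embed(sentences)
    # mean-pool the per-chunk document embeddings into one vector
    return torch.stack([s.embedding for s in sentences]).mean(dim=0)
```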
@schelv Hi, thanks for your great feature! But I get some strange subtokenized sentences after splitting a long sentence into windows. The first subtokenized sentence looks fine, but the second one is very strange: from my knowledge and your example, it should contain the subtokens that continue the text, but it does not.
Could you prepare some example code?
@djstrong See flair/flair/embeddings/token.py, lines 922 to 939 at commit cd3d7ed.
The example is a simple list with the ids of the subtokens. You can see that the second subtokenized sentence comes out wrong.
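For reference, a check along these lines can be reconstructed roughly as below. This is a guess at the kind of reproduction involved, not the exact code; the model name and stride are placeholders.

```python
# Rough reconstruction of the check discussed above: subtokenize a long text
# with a stride and inspect the overflowing subtoken ids. Per this thread,
# they look correct with transformers 3.0.0 but wrong from 3.0.1 onwards.
from transformers import AutoTokenizer

# use_fast=False: the behavior discussed here concerns the slow tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
long_text = "word " * 600  # long enough to overflow the 512-subtoken limit

encoded = tokenizer.encode_plus(
    long_text,
    max_length=512,
    stride=64,
    truncation=True,
    return_overflowing_tokens=True,
)
print(encoded["input_ids"][:10])           # first window of subtoken ids
# the overflowing tokens should continue the text in order; in the broken
# transformers versions they do not
print(encoded["overflowing_tokens"][:10])
```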
I confirm your findings with the newest transformers. I checked with 3.0.0 and there it works fine, but from 3.1.0 it is wrong. You should create an issue in transformers - will you?
Thank you! I found that it goes wrong from 3.0.1, so I should use 3.0.0 with flair for now.
@wangxinyu0922 Nicely found! I also noticed similar strange behavior recently (see #1902). Maybe the flair project could restrict the transformers version to <= 3.0.0 (or >= the version that fixes this), to avoid wrong output for others who try to use flair + transformers on long texts. If you create the issue in the transformers repository, can you provide a link to it? I'm very curious what causes this 😁
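In the meantime, a runtime guard along these lines could warn users. The version bounds are taken from this thread only (works in 3.0.0, wrong from 3.0.1), so treat them as an assumption.

```python
# Warn if the installed transformers version falls in the range this thread
# reports as returning wrong overflowing subtokens for long texts.
import warnings
import transformers
from packaging import version

if version.parse(transformers.__version__) > version.parse("3.0.0"):
    warnings.warn(
        f"transformers {transformers.__version__} may return wrong "
        "overflowing subtokens for long texts; "
        "see huggingface/transformers#8028"
    )
```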
The issue is here: huggingface/transformers#8028
I'm trying to reproduce the CoNLL NER score reported in the BERT paper using flair with document context, so I tried your great feature. I checked the code carefully because I was getting a very low score with it. However, I still cannot reproduce the reported score even after the bug was fixed...
Do you have a link?
I meant training a NER model with only the BERT embeddings. The reported score in the BERT paper is 92.8, and the authors said they trained the model with maximal document context. However, I can get a score of 91.6 at most. After looking through issues like google-research/bert#223, I believe the results are impossible to reproduce.
Yes, "impossible to reproduce" is normal for a Google paper.
TransformerWordEmbeddings has a maximum sequence length. Input sentences that are too long produce warnings and errors. These code modifications make it possible to get the word embeddings of longer texts.
I think this Fixes #1410, and it also Fixes #575 and Fixes #1519.
Edit: It is also possible to update the CoNLL-03 Named Entity Recognition (Dutch) results; the BERTje embedding improves the score quite a bit.
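For reference, usage looks roughly like the sketch below. The `allow_long_sentences` flag name is an assumption here; check `flair/embeddings/token.py` in your installed version.

```python
# Illustrative usage of the new option; the flag name `allow_long_sentences`
# is assumed here - verify it against flair/embeddings/token.py.
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

embedding = TransformerWordEmbeddings(
    "bert-base-cased",
    allow_long_sentences=True,  # split long inputs into overlapping windows
)

sentence = Sentence("a very long text " * 200)  # far beyond 512 subtokens
embedding.embed(sentence)
print(len(sentence[0].embedding))  # every token now has an embedding
```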