pytorch-pretrained-bert to pytorch-transformers upgrade #873
Awesome - really look forward to supporting this in Flair!
pytorch-transformers 1.0 was released today at https://github.com/huggingface/pytorch-transformers. A migration summary can be found in the readme there. Main takeaways from the migration process are:
Otherwise, pytorch-transformers 1.0 looks great and I'm looking forward to using it in flair as well as standalone.
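For illustration only (this is a generic sketch, not the migration summary from the readme): one headline change in pytorch-transformers is that models now always return tuples, so calling code needs a small adjustment compared to pytorch-pretrained-bert.

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

input_ids = torch.tensor([tokenizer.encode("Hello Flair !")])
with torch.no_grad():
    outputs = model(input_ids)      # pytorch-transformers: always a tuple
    last_hidden_state = outputs[0]  # first element is the last hidden state
```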
Looks great!
Using the
XLNet

I ran some per-layer analysis with the large XLNet model:
Transformer-XL

I also ran some per-layer analysis for the Transformer-XL embeddings:
Experiments with combinations of layers:
That's the reason why I chose layers
@alanakbik I'm planning to run per-layer analysis for all of the Transformer-based models. However, it is really hard to give a recommendation for default layer(s). Recently, I found this NAACL paper: Linguistic Knowledge and Transferability of Contextual Representations, which uses a "scalar mix" of all layers. An implementation can be found in the allennlp repo. Would be awesome if we could adopt that 🤗
@stefan-it from the paper it also seems that the best approach / layers would vary by task, so giving overall recommendations might be really difficult. From a quick look it seems like their implementation could be integrated as part of an embeddings class, i.e. after retrieving the layers, put them through this code. Might be interesting to try out!
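For reference, a condensed sketch of what such a scalar mix could look like, modeled on the allennlp idea of softmax-normalized per-layer weights plus a global scale (parameter names here are illustrative, not Flair's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarMix(nn.Module):
    """Learned weighted average of per-layer representations (Liu et al., 2019)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # one scalar weight per layer plus a global scaling factor
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_tensors):
        # layer_tensors: list of tensors, each of shape (seq_len, hidden_size)
        normed = F.softmax(self.weights, dim=0)
        mixed = sum(w * t for w, t in zip(normed, layer_tensors))
        return self.gamma * mixed
```

The allennlp version additionally supports an optional per-layer normalization, which this sketch leaves out.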
OpenAI GPT-1

Here is some per-layer analysis for the first GPT model:
I implemented a first prototype of the scalar mix approach. I was able to get an F-Score of 71.01 (over all layers, incl. the word embedding layer)!
Hi, I use branch |
Not able to import
Any suggestions?
You need to change the branch to GH-873
Thanks for the quick reply, I'll check
@NullPhantom just use:

```python
for token in s.tokens:
    print(token.embedding)
```

:)
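For completeness, a fuller usage sketch, assuming the GH-873 branch is installed (the default constructor arguments shown are assumptions for illustration, not necessarily what the branch ships with):

```python
from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

# hypothetical default model; check the branch for the exact identifier
embeddings = XLNetEmbeddings()

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
    print(token.text, token.embedding.shape)
```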
@stefan-it interesting results with the scalar mix! What is the effect on runtime, i.e. for instance comparing scalar mix with only one layer?
OpenAI GPT-2

I ran some per-layer experiments on the GPT-2 and GPT-2 medium models:
It does not look very promising, so scalar mix could help here!
@alanakbik In my preliminary experiments with the scalar mix implementation, I couldn't see any big performance issues, but I'll measure it whenever the implementation is ready. I'm currently focusing on per-layer analysis for the XLM model :)
GH-873: Add new type of embeddings: XLMEmbeddings
@stefan-it cool! Really looking forward to XLM! Strange that GPT-2 is not doing so well.
XLM

Here are the results from a per-layer analysis for the English XLM model:
I'm currently updating the Btw: using scalar mix does not help when using the
I adjusted the In order to avoid any regression bugs, I compared the performance with the old
@alanakbik Can I file a PR for these new embeddings? Maybe we can define some kind of roadmap for the newly introduced Transformer-based embeddings:
@stefan-it absolutely! This is a major upgrade that lots of people will want to use. With all the new features, it's probably time to do another Flair release (v0.4.3); the question is whether we wait for the features you outline or release in the very near future.
I'm going to work a bit on it until next week. As I just found a great suggestion/improvement in the
So it would be better to have a kind of generic base class :)
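To make the idea concrete, a rough sketch of what such a shared base class could look like (class and method names here are hypothetical, not what Flair actually ends up with; it only assumes Flair's existing TokenEmbeddings interface with _add_embeddings_internal):

```python
from abc import abstractmethod
from typing import List

from flair.data import Sentence
from flair.embeddings import TokenEmbeddings


class TransformerWordEmbeddingsBase(TokenEmbeddings):
    """Hypothetical common base for all pytorch-transformers based embeddings."""

    def __init__(self, layers: str = "-1", pooling_operation: str = "first",
                 use_scalar_mix: bool = False):
        super().__init__()
        self.layer_indexes = [int(x) for x in layers.split(",")]
        self.pooling_operation = pooling_operation
        self.use_scalar_mix = use_scalar_mix

    @abstractmethod
    def _build_token_embeddings(self, sentence: Sentence) -> None:
        """Model-specific part: tokenize, run the model, align subwords to tokens."""
        ...

    def _add_embeddings_internal(self, sentences: List[Sentence]) -> List[Sentence]:
        for sentence in sentences:
            self._build_token_embeddings(sentence)
        return sentences
```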
PR for the "first phase" is coming soon. I also added support for RoBERTa; see the RoBERTa: A Robustly Optimized BERT Pretraining Approach paper for more information. RoBERTa is currently not integrated into pytorch-transformers, so the implementation wraps the torch.hub module.
Add support for new RoBERTa embeddings (wrapper around torch.hub module)
The variance for a 0.1 downsampled CoNLL corpus is very high. I did some experiments for RoBERTa in order to compare different pooling operations for subwords using scalar mix:
I also used the complete CoNLL corpus (one run) with scalar mix:
BERT (base) achieves 92.2 (reported in their paper). Now I'm going to run some experiments with BERT (base) and scalar mix to have a better comparison :) Update: BERT (base) achieves an F-Score of 91.38 on the full CoNLL corpus with scalar mix.
The following Transformer-based architectures are now supported via pytorch-transformers:

- BertEmbeddings (updated API)
- OpenAIGPTEmbeddings (updated API, various fixes)
- OpenAIGPT2Embeddings (new)
- TransformerXLEmbeddings (updated API, tokenization fixes)
- XLNetEmbeddings (new)
- XLMEmbeddings (new)
- RoBERTaEmbeddings (new, via torch.hub module)

It is also possible to use a scalar mix of specified layers from the Transformer-based models. Scalar mix is proposed by Liu et al. (2019). The scalar mix implementation is copied and slightly modified from the allennlp repo (Apache 2.0 license).
An update: I've re-written the complete tokenization logic for all Transformer-based embeddings (except BERT). In the old version, I passed each token of a sentence into the model separately (which is not very efficient and causes major problems with the GPT-2 tokenizer). The latest version passes the complete sentence into the model. The embeddings for subwords are then aligned back to each "Flair" token in a sentence (I wrote some unit tests for that...). I also added the code for scalar mix from the allennlp repo. Here are some experiments with the new implementation on a downsampled (0.1) CoNLL corpus for NER. F-Score is measured and averaged over 4 runs, scalar mix is used:
I'm currently running experiments on the whole CoNLL corpus. Here are some results (only one run):
Notice: The feature-based result from the BERT paper is 96.1 (dev) and 92.4 - 92.8 for the base and large models (test). But they "include the maximal document context provided by the data". I found an issue in the But from these preliminary experiments, RoBERTa (/cc @myleott) seems to perform slightly better at the moment :)
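To illustrate the subword-to-token alignment described above, here is a minimal stand-alone sketch. The function name is hypothetical; only "first" pooling and a pooling_operation parameter are mentioned in the thread, the other pooling options shown are assumptions for illustration:

```python
import torch

def align_subword_embeddings(token_subword_counts, subword_embeddings,
                             pooling_operation="first"):
    """Map per-subword embeddings back to one embedding per original token.

    token_subword_counts: number of subword pieces each original token was split into
    subword_embeddings:   tensor of shape (num_subwords, hidden_size)
    """
    token_embeddings = []
    offset = 0
    for count in token_subword_counts:
        pieces = subword_embeddings[offset:offset + count]
        if pooling_operation == "first":
            token_embeddings.append(pieces[0])
        elif pooling_operation == "last":
            token_embeddings.append(pieces[-1])
        elif pooling_operation == "first_last":
            token_embeddings.append(torch.cat([pieces[0], pieces[-1]]))
        else:  # assumed "mean" pooling
            token_embeddings.append(pieces.mean(dim=0))
        offset += count
    return torch.stack(token_embeddings)
```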
@stefan-it were you using the scalar mix in these experiments on the full CoNLL? Were you always using the default layers as set in the constructor of each class?
I used scalar mix for all layers (incl. the word embedding layer, which is located at index 0) on the full CoNLL. E.g. for RoBERTa the init would be:

```python
emb = RoBERTaEmbeddings(model="roberta.base",
                        layers="0,1,2,3,4,5,6,7,8,9,10,11,12",
                        pooling_operation="first",
                        use_scalar_mix=True)
```

I'm not sure about the default parameters when using no scalar mix, because there's no literature about that, except BERT (base), where a concat of the last four layers was proposed.
…eeded when RoBERTa embeddings are used)
Ah great - ok, I'll run a similar experiment and report numbers. But aside from this, I think we are ready to merge the PR.
BTW here are some results using RoBERTa with default parameters and scalar mix, i.e. instantiated like this:

```python
RoBERTaEmbeddings(use_scalar_mix=True)
```

Results of three runs:
Using otherwise the exact same parameters as here. |
Documentation for the new PyTorch-Transformers embeddings is coming very soon :) I'll close this issue now (PR was merged).
@stefan-it
You can just use the latest version:

```
pip install --upgrade git+https://github.com/zalandoresearch/flair.git
```

:)
okay, thanks!
@stefan-it Even after running

```
pip install --upgrade git+https://github.com/zalandoresearch/flair.git
```

when I try to import, I get:

```
ImportError: cannot import name 'XLNetEmbeddings'
```
@DecentMakeover You can try again; I installed the latest version of flair and the problem disappeared. The last commit was 5 days ago.
@dshaprin okay, I'll check.
Hi,

the upcoming 1.0 version of pytorch-pretrained-bert will introduce several API changes, new models and even a name change to pytorch-transformers.

After the final 1.0 release, flair could support 7 different Transformer-based architectures:

- BertEmbeddings
- OpenAIGPTEmbeddings
- OpenAIGPT2Embeddings 🛡️
- TransformerXLEmbeddings
- XLNetEmbeddings 🛡️
- XLMEmbeddings 🛡️
- RoBERTaEmbeddings 🛡️ (currently not covered by pytorch-transformers)

🛡️ indicates a new embedding class for flair.

It also introduces a universal API for all models, so quite a few changes in flair are necessary to support both old and new embedding classes.

This issue tracks the implementation status for all 6 embedding classes 😊