
pytorch-pretrained-bert to pytorch-transformers upgrade #873

Closed · 7 tasks done
stefan-it opened this issue Jul 11, 2019 · 39 comments
Labels: feature (A new feature)

@stefan-it
Member

stefan-it commented Jul 11, 2019

Hi,

the upcoming 1.0 version of pytorch-pretrained-bert will introduce several API changes, new models and even a name change to pytorch-transformers.

After the final 1.0 release, flair could support 7 different Transformer-based architectures:

  • BERT -> BertEmbeddings
  • OpenAI GPT -> OpenAIGPTEmbeddings
  • OpenAI GPT-2 -> OpenAIGPT2Embeddings 🛡️
  • Transformer-XL -> TransformerXLEmbeddings
  • XLNet -> XLNetEmbeddings 🛡️
  • XLM -> XLMEmbeddings 🛡️
  • RoBERTa -> RoBERTaEmbeddings 🛡️ (currently not covered by pytorch-transformers)

🛡️ indicates a new embedding class for flair.

It also introduces a universal API for all models, so quite a few changes in flair are necessary to support both old and new embedding classes.

This issue tracks the implementation status for all 7 embedding classes 😊

@stefan-it stefan-it self-assigned this Jul 11, 2019
@stefan-it stefan-it added the feature A new feature label Jul 11, 2019
@alanakbik
Collaborator

Awesome - really look forward to supporting this in Flair!

@aychang95
Contributor

pytorch-transformers 1.0 was released today: https://github.com/huggingface/pytorch-transformers

A migration summary can be found in the README here


Main takeaways from the migration process are:

  • Models now output tuples: the quickest fix seems to be retrieving the first element of our model outputs with a safe 0 index
  • Serialization: serialization methods have been standardized with save_pretrained(save_directory); models are in evaluation mode by default and must be set to training mode during training
  • Optimizers: BertAdam and OpenAIAdam are replaced with AdamW, and schedulers are no longer part of the optimizer. The scheduler step must be called per batch, and schedulers are now standard PyTorch learning rate schedulers (a short sketch of these changes follows below)
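
To make the list above concrete, here is a rough migration sketch, assuming the pytorch-transformers 1.0 API (AdamW and WarmupLinearSchedule are the optimizer/scheduler names as I understand the new library; this is an illustration, not code from this issue):

import os
import torch
from pytorch_transformers import BertModel, BertTokenizer, AdamW, WarmupLinearSchedule

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")  # returned in evaluation mode by default

input_ids = torch.tensor([tokenizer.encode("Berlin and Munich are nice cities .")])
outputs = model(input_ids)      # models now return tuples
last_hidden_state = outputs[0]  # "safe 0 index" for the hidden states

# Training: switch to train mode explicitly, use AdamW plus a standard scheduler,
# and call scheduler.step() once per batch.
model.train()
optimizer = AdamW(model.parameters(), lr=3e-5)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=100, t_total=1000)
# ... loss.backward(); optimizer.step(); scheduler.step()

# Standardized serialization (the target directory must exist):
os.makedirs("./my-bert-checkpoint", exist_ok=True)
model.save_pretrained("./my-bert-checkpoint")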

Otherwise, pytorch-transformers 1.0 looks great and I'm looking forward to using it in flair as well as standalone.

@alanakbik
Collaborator

Looks great!

@stefan-it
Member Author

Using the last_hidden_states works e.g. for Transformer-XL, but fails badly for XLNet and OpenAI GPT-2. As we use a feature-based approach, I'm currently doing some extensive per-layer analysis for all of the architectures. I'll post the results here (I'm mainly using a 0.1 downsampled CoNLL-2003 English corpus for NER).
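
For context, a minimal sketch of how all layer outputs can be pulled from a pytorch-transformers model for this kind of feature-based, per-layer analysis (illustrative only; output_hidden_states is the library's config flag):

import torch
from pytorch_transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
# output_hidden_states=True makes the model also return the hidden states of every layer
model = XLNetModel.from_pretrained("xlnet-large-cased", output_hidden_states=True)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Berlin and Munich are nice cities .")])
with torch.no_grad():
    outputs = model(input_ids)

# Here the per-layer hidden states are the last element of the returned tuple
# (the exact tuple layout differs per model, so check the model's docstring).
all_hidden_states = outputs[-1]
print(len(all_hidden_states), all_hidden_states[0].shape)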

@stefan-it
Member Author

stefan-it commented Jul 17, 2019

XLNet

I ran some per-layer analysis with the large XLNet model:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 82.91
2 81.94
3 80.10
4 82.62
5 84.16
6 79.19
7 80.76
8 81.85
9 82.64
10 74.29
11 78.99
12 79.34
13 76.22
14 79.67
15 77.07
16 73.49
17 73.20
18 74.36
19 72.32
20 71.30
21 74.97
22 75.04
23 66.84
24 03.37

XLNetEmbeddings

To use the new XLNet embeddings in flair just do:

from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

embeddings = XLNetEmbeddings()

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

# Get embeddings
for token in s.tokens:
  print(token.embedding)

XLNetEmbeddings has three parameters (an example instantiation follows the list):

  • model: specifies the XLNet model. pytorch-transformers currently comes with xlnet-large-cased and xlnet-base-cased
  • layers: comma-separated string of layers. Default is 1; to use more layers (their outputs are concatenated), just pass: 1,2,3,4
  • pooling_operation: defines the pooling operation over subwords. By default, the first and last subword embeddings are concatenated and used. Other pooling operations are also available: first, last and mean
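
For example, a non-default setup could look like this (the values are just an illustration of the parameters listed above):

from flair.embeddings import XLNetEmbeddings

# concatenate layers 1-4 and mean-pool the subword embeddings of each token
embeddings = XLNetEmbeddings(model="xlnet-base-cased", layers="1,2,3,4", pooling_operation="mean")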

@stefan-it
Member Author

Transformer-XL

I also ran some per-layer analysis for the Transformer-XL embeddings:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 80.88
2 81.68
3 82.88
4 80.89
5 84.74
6 80.68
7 82.65
8 79.53
9 79.25
10 79.64
11 80.07
12 84.26
13 81.22
14 80.59
15 81.31
16 78.95
17 79.85
18 80.69

Experiments with combination of layers:

Layers F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1,2 80.84
1,2,3 81.99
1,2,3,4 78.44
1,2,3,4,5 80.89

That's the reason why I chose layers 1,2,3 as the default for TransformerXLEmbeddings for now.

TransformerXLEmbeddings

TransformerXLEmbeddings has two parameters (an example instantiation follows the list):

  • model: specifies the Transformer-XL model. pytorch-transformers currently comes with transfo-xl-wt103
  • layers: comma-separated string of layers. Default is 1,2,3; to use more layers (their outputs are concatenated), just pass: 1,2,3,4
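
An example instantiation with non-default layers (illustrative values only):

from flair.embeddings import TransformerXLEmbeddings

# use layers 1-4 instead of the default 1,2,3 (their outputs are concatenated)
embeddings = TransformerXLEmbeddings(model="transfo-xl-wt103", layers="1,2,3,4")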

stefan-it added a commit that referenced this issue Jul 18, 2019
@stefan-it
Member Author

stefan-it commented Jul 18, 2019

@alanakbik I'm planning to run per-layer analysis for all of the Transformer-based models.

However, it is really hard to give a recommendation for default layer(s).

Recently, I found this NAACL paper: Linguistic Knowledge and Transferability of Contextual Representations, which uses a "scalar mix" of all layers. An implementation can be found in the allennlp repo, see it here. Do you have any idea how we could use this technique here? 🤔

Would be awesome if we can adopt that 🤗
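
For reference, the core of the scalar mix idea is a learned, softmax-normalized weighted sum of all layer outputs, scaled by a scalar gamma. A minimal sketch of the mechanism from Liu et al. (2019), not the exact allennlp code:

import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Computes gamma * sum_k softmax(s)_k * h_k over a list of layer tensors."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalar_parameters = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_tensors):
        # layer_tensors: list of tensors with identical shape, e.g. (seq_len, hidden_size)
        weights = torch.softmax(self.scalar_parameters, dim=0)
        mixed = sum(weight * tensor for weight, tensor in zip(weights, layer_tensors))
        return self.gamma * mixed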

@alanakbik
Collaborator

@stefan-it from the paper it also seems that the best approach / layers would vary by task, so giving overall recommendations might be really difficult. From a quick look it seems like their implementation could be integrated as part of an embeddings class, i.e. after retrieving the layers, put them through this code. Might be interesting to try out!

@stefan-it
Member Author

stefan-it commented Jul 18, 2019

OpenAI GPT-1

Here are some per-layer analysis results for the first GPT model:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 74.90
2 66.93
3 64.62
4 67.62
5 62.01
6 58.19
7 53.13
8 55.19
9 52.61
10 52.02
11 64.59
12 70.08

I implemented a first prototype of the scalar mix approach. I was able to get an F-Score of 71.01 (over all layers, incl. the word embedding layer)!

OpenAIGPTEmbeddings

The OpenAIGPTEmbeddings comes with three parameters: model, layers and pooling_operation.
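
A short usage sketch, analogous to the XLNet example above (parameter values are illustrative):

from flair.data import Sentence
from flair.embeddings import OpenAIGPTEmbeddings

embeddings = OpenAIGPTEmbeddings(layers="1", pooling_operation="first_last")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)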

@ilham-bintang

To use the new XLNet embeddings in flair just do:

from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

embeddings = XLNetEmbeddings()

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

# Get embeddings
for token in s.tokens:
  print(token.embeddings)

Hi, I used the GH-873-pytorch-transformers branch and tried it, but it raised an error:
AttributeError: 'Token' object has no attribute 'embeddings'

@DecentMakeover

Not able to import

ImportError: cannot import name 'XLNetEmbeddings'

Any suggestions?

@ilham-bintang

Not able to import

ImportError: cannot import name 'XLNetEmbeddings'

Any suggestions?

You need to switch to the GH-873 branch

@DecentMakeover

Thanks for the quick reply, I'll check

@stefan-it
Member Author

@NullPhantom just use:

for token in s.tokens:
  print(token.embedding)

:)

@alanakbik
Collaborator

@stefan-it interesting results with the scalar mix! How is the effect on runtime, i.e. for instance comparing scalar mix with only one layer?

@stefan-it
Member Author

OpenAI GPT-2

I ran some per-layer experiments on the GPT-2 and the GPT-2 medium model:

Layer   GPT-2   GPT-2 medium   (F-Score, 0.1 downsampled CoNLL-2003 NER corpus)
1       42.41   45.58
2       10.26   48.52
3       15.20    2.17
4       22.51   18.50
5        0.00   16.22
6       21.71    8.03
7       12.70   15.85
8       14.10   17.74
9        0.00    6.70
10      18.75    0.00
11       0.00    3.22
12       5.62   11.18
13          -   17.09
14          -   14.25
15          -    0.00
16          -    7.02
17          -    8.03
18          -    0.00
19          -    0.00
20          -    9.49
21          -   10.65
22          -    8.38
23          -   18.74
24          -    5.15

It does not look very promising, so scalar mix could help here!

OpenAIGPT2Embeddings

To play around with the embeddings from the GPT-2 models, just use:

from flair.data import Sentence
from flair.embeddings import OpenAIGPT2Embeddings

embeddings = OpenAIGPT2Embeddings()

s = Sentence("Berlin and Munich")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)

@stefan-it
Member Author

@alanakbik In my preliminary experiments with the scalar mix implementation, I couldn't see any big performance issues, but I'll measure it once the implementation is ready.

I'm currently focussing on per-layer analysis for the XLM model :)

stefan-it added a commit that referenced this issue Jul 22, 2019
@alanakbik
Collaborator

@stefan-it cool! Really looking forward to XLM! Strange that GPT-2 is not doing so well.

@stefan-it
Member Author

XLM

Here are the results from a per-layer analysis for the English XLM model:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 76.92
2 75.91
3 75.61
4 73.52
5 73.66
6 70.75
7 70.90
8 63.58
9 64.04
10 57.38
11 54.70
12 56.96

XLMEmbeddings

The following snippet demonstrates the usage of the new XLMEmbeddings class:

from flair.data import Sentence
from flair.embeddings import XLMEmbeddings

embeddings = XLMEmbeddings()

s = Sentence("It is very hot in Munich now .")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)

@stefan-it
Member Author

I'm currently updating the BertEmbeddings class to the new pytorch-transformers API.

Btw: using scalar mix does not help when using the OpenAIGPT2Embeddings 😞

@stefan-it
Member Author

I adjusted the BertEmbeddings class to make it compatible with the new pytorch-transformers API.

In order to avoid any regression bugs, I compared the performance with the old pytorch-pretrained-BERT library. Here's the per-layer analysis:

Layer   pytorch-pretrained-BERT   pytorch-transformers   (F-Score, 0.1 downsampled CoNLL-2003 NER corpus)
1       80.15   81.35
2       79.49   82.65
3       84.20   83.44
4       83.71   84.58
5       87.71   88.81
6       86.56   87.34
7       87.61   87.13
8       86.67   85.20
9       88.17   87.73
10      89.20   85.66
11      86.65   87.44
12      85.82   87.06

@stefan-it
Member Author

stefan-it commented Jul 25, 2019

@alanakbik Can I file a PR for these new embeddings? Maybe we can define some kind of roadmap for the newly introduced Transformer-based embeddings:

  • PR for the current implementations
  • Next step: add support for scalar mix
  • Next step: performance tuning (maybe we can use a batch of sentences? But then we need to implement a mapping between each original token and the subword embeddings that belong to it)
  • Future steps: fine-tuning of these Transformer-based embeddings (instead of using a feature-based approach)

@alanakbik
Collaborator

@stefan-it absolutely! This is a major upgrade that lots of people will want to use. With all the new features, it's probably time to do another Flair release (v0.4.3), the question is whether we wait for the features you outline or release in the very near future?

@stefan-it
Member Author

I'm going to work a bit on it until next week. I just found a great suggestion/improvement in the pytorch-transformers repo (see issue here), so I would make the following code changes:

  • Currently, a lot of duplicate code is used for implementing at least 5 different models.
  • Duplicated parts: the pooling operation for subword-based architectures, tokenization and layer concatenation.

So it would be better to have a kind of generic base class :)
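
A rough sketch of what such a generic base class could look like; all names here are hypothetical and only illustrate factoring out subword pooling and layer concatenation (the tokenizer/model specifics would live in the subclasses):

from abc import ABC, abstractmethod
from typing import List, Tuple

import torch


class TransformerEmbeddingsBase(ABC):
    """Hypothetical base class holding the logic shared by all Transformer embeddings."""

    def __init__(self, layers: str = "1", pooling_operation: str = "first"):
        self.layer_indexes = [int(layer) for layer in layers.split(",")]
        self.pooling_operation = pooling_operation

    @abstractmethod
    def _hidden_states_and_spans(self, tokens: List[str]) -> Tuple[List[torch.Tensor], List[Tuple[int, int]]]:
        """Model-specific part: run tokenizer/model over the whole sentence and return
        (per-layer hidden states, subword span of each original token)."""

    def embed_tokens(self, tokens: List[str]) -> List[torch.Tensor]:
        hidden_states, spans = self._hidden_states_and_spans(tokens)
        embeddings = []
        for start, end in spans:
            per_layer = []
            for layer in self.layer_indexes:
                subwords = hidden_states[layer][start:end]  # (num_subwords, hidden_size)
                if self.pooling_operation == "first":
                    pooled = subwords[0]
                elif self.pooling_operation == "last":
                    pooled = subwords[-1]
                elif self.pooling_operation == "first_last":
                    pooled = torch.cat([subwords[0], subwords[-1]])
                else:  # "mean"
                    pooled = subwords.mean(dim=0)
                per_layer.append(pooled)
            embeddings.append(torch.cat(per_layer))  # concatenate the selected layers
        return embeddings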

@stefan-it
Member Author

PR for the "first phase" is coming soon.

I also added support for RoBERTa, see the RoBERTa: A Robustly Optimized BERT Pretraining Approach paper for more information.

RoBERTa is currently not integrated into pytorch-transformers, so I wrote an embedding class around the torch.hub module. I tested the model; here are some results for the base model (a short loading sketch follows the table):

Layer RoBERTa F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 75.16
2 80.29
3 81.01
4 80.41
5 80.52
6 80.23
7 81.31
8 84.25
9 80.12
10 78.09
11 76.50
12 81.12
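
For reference, a minimal sketch of the torch.hub loading that such a wrapper builds on (this follows the fairseq RoBERTa hub interface; the actual RoBERTaEmbeddings class adds the flair-specific logic on top):

import torch

# downloads roberta.base via torch.hub on first use
roberta = torch.hub.load("pytorch/fairseq", "roberta.base")
roberta.eval()

tokens = roberta.encode("Berlin and Munich are nice cities .")
# return_all_hiddens=True yields one tensor per layer, so individual layers can be selected
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
print(len(all_layers), all_layers[0].shape)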

@stefan-it
Member Author

stefan-it commented Aug 2, 2019

The variance for a 0.1 downsampled CoNLL corpus is very high. I did some experiments for RoBERTa in order to compare different pooling operations for subwords using scalar mix:

Pooling operation Run 1 Run 2 Run 3 Run 4 Avg.
first_last 76.12 76.56 79.12 79.17 77.74
first 78.91 81.41 76.79 80.00 79.28

I also used the complete CoNLL corpus (one run) with scalar mix:

Pooling operation F-Score
first_last 86.97
first 87.40

BERT (base) achieves 92.2 (reported in their paper). Now I'm going to run some experiments with BERT (base) and scalar mix to have a better comparison :)

Update: BERT (base) achieves an F-Score of 91.38 on the full CoNLL corpus with scalar mix.

stefan-it added a commit that referenced this issue Aug 4, 2019
The following Transformer-based architectures are now supported
via pytorch-transformers:

- BertEmbeddings (Updated API)
- OpenAIGPTEmbeddings (Updated API, various fixes)
- OpenAIGPT2Embeddings (New)
- TransformerXLEmbeddings (Updated API, tokenization fixes)
- XLNetEmbeddings (New)
- XLMEmbeddings (New)
- RoBERTaEmbeddings (New, via torch.hub module)

It is also possible to use a scalar mix of specified layers from the
Transformer-based models. Scalar mix is proposed by Liu et al. (2019).
The scalar mix implementation is copied and slightly modified from
the allennlp repo (Apache 2.0 license).
@stefan-it
Member Author

stefan-it commented Aug 4, 2019

An update:

I've re-written the complete tokenization logic for all Transformer-based embeddings (except BERT). In the old version, I passed each token of a sentence into the model separately (which is not very efficient and causes major problems with the GPT-2 tokenizer).

The latest version passes the complete sentence into the model. The embeddings for subwords are then aligned back to each "Flair" token in a sentence (I wrote some unit tests for that...).
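
A small sketch of the kind of alignment involved: the whole sentence is run through the subword tokenizer once, and consecutive subwords are then assigned back to each original token (hypothetical helper; real tokenizers have more edge cases than this greedy matching handles):

from typing import List, Tuple


def align_subwords_to_tokens(tokens: List[str], subwords: List[str]) -> List[Tuple[int, int]]:
    """Greedy alignment: assign consecutive subwords to each original token until
    the token's surface string is covered. Continuation markers are stripped."""
    spans, idx = [], 0
    for token in tokens:
        start, covered = idx, ""
        while len(covered) < len(token) and idx < len(subwords):
            covered += subwords[idx].lstrip("▁Ġ#")  # strip SentencePiece/BPE/WordPiece markers
            idx += 1
        spans.append((start, idx))
    return spans


# e.g. XLNet-style SentencePiece pieces for "Berlin and Munich":
print(align_subwords_to_tokens(["Berlin", "and", "Munich"], ["▁Berlin", "▁and", "▁Mun", "ich"]))
# -> [(0, 1), (1, 2), (2, 4)]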

I also added the code for scalar mix from the allennlp repo.

Here are some experiments with the new implementation on a downsampled (0.1) CoNLL corpus for NER. The F-Score is measured and averaged over 4 runs; scalar mix is used:

Model Pooling # 1 # 2 # 3 # 4 Avg.
RoBERTa (base) first 86.34 86.30 90.21 87.28 87.53
GPT-1 first_last 75.21 77.53 74.90 76.33 75.99
GPT-1 first 74.31 75.42 74.01 76.56 75.08
GPT-2 (medium) first_last 85.18 76.86 79.93 81.02 80.75
GPT-2 (medium) first 78.88 79.23 80.31 76.80 78.81
XLM (en) first_last 84.65 86.50 84.63 84.97 85.19
XLM (en) first 86.66 88.28 87.55 85.82 87.08
Transformer-XL - 81.03 80.17 78.67 81.34 80.53
XLNet (base) first_last 85.66 88.59 85.74 87.36 86.84
XLNet (base) first 88.81 86.65 86.01 85.72 86.80

I'm currently running experiments on the whole CoNLL corpus. Here are some results (only one run):

Model Pooling Dev Test
BERT (base, cased) first 94.74 91.38
BERT (base, uncased) first 94.61 91.03
BERT (large, cased) first 95.23 91.69
BERT (large, uncased) first 94.78 91.49
BERT (large, cased, whole-word-masking) first 94.88 91.16
BERT (large, uncased, whole-word-masking) first 94.94 91.20
RoBERTa (base) first 95.35 91.51
RoBERTa (large) first 95.83 92.11
RoBERTa (large) mean 96.31 92.31
XLNet (base) first_last 94.56 90.73
XLNet (large) first_last 95.47 91.49
XLNet (large) first 95.14 91.71
XLM (en) first_last 94.31 90.68
XLM (en) first 94.00 90.73
GPT-2 first_last 91.35 87.47
GPT-2 (large) first_last 94.09 90.63

Notice: The feature-based result from the BERT paper is 96.1 (dev) and 92.4 - 92.8 for base and large model (test). But they "include the maximal document context provided by the data". I found an issue in the allennlp repo (here) and a dev score of 95.3 seems to be possible (without using the document context).

But from these preliminary experiments, RoBERTa (/cc @myleott) seems to perform slightly better at the moment :)

@alanakbik
Collaborator

@stefan-it were you using the scalar mix in these experiments on the full CoNLL? Were you always using the default layers as set in the constructor of each class?

@stefan-it
Member Author

I used scalar mix for all layers (incl. word embedding layer, which is located at index 0) on the full CoNLL. E.g. for RoBERTa the init. would be:

emb = RoBERTaEmbeddings(model="roberta.base", layers="0,1,2,3,4,5,6,7,8,9,10,11,12", pooling_operation="first", use_scalar_mix=True)

I'm not sure about the default parameters when using no scalar mix, because there's no literature about that, except BERT (base), where a concat of the last four layers was proposed.

stefan-it added a commit that referenced this issue Aug 7, 2019
@alanakbik
Collaborator

Ah great - ok, I'll run a similar experiment and report numbers. But aside from this, I think we are ready to merge the PR.

@alanakbik
Collaborator

alanakbik commented Aug 7, 2019

BTW here are some results using RoBERTa with default parameters and scalar mix, i.e. instantiated like this:

RoBERTaEmbeddings(use_scalar_mix=True)

Results of three runs:

# 1 # 2 # 3
92.03 92.05 91.96

Using otherwise the exact same parameters as here.

yosipk added a commit that referenced this issue Aug 7, 2019
@stefan-it
Member Author

Documentation for the new PyTorch-Transformers embeddings is coming very soon :)

I'll close this issue now (the PR was merged).

@DecentMakeover

@stefan-it
if I do git pull, will I be able to access these new additions, or do I have to check out GH-873? Thanks

@stefan-it
Member Author

You can just use the latest master branch :) Or install it via:

pip install --upgrade git+https://github.com/zalandoresearch/flair.git

:)

@DecentMakeover

okay thanks !

@DecentMakeover

@stefan-it Even after running pip install --upgrade git+https://github.com/zalandoresearch/flair.git

When I try to import
from flair.embeddings import XLNetEmbeddings

I get ImportError: cannot import name 'XLNetEmbeddings'

@dshaprin

@DecentMakeover You can try again; I installed the latest version of flair and the problem disappeared. The last commit was 5 days ago.

@DecentMakeover

@dshaprin okay, I'll check.
