Bert sequence labelling #78

Merged: 36 commits merged into master on Jan 16, 2020
Conversation

@kermitt2 (Owner) commented Dec 28, 2019

Add BERT architecture for sequence labelling.

As noted here, the original CoNLL-2003 NER results reported in the Google Research paper are far from reproducible, and they likely correspond to token-level metrics rather than entity-level metrics (as computed by conlleval and used in previous work). In general, generic pre-trained transformer models appear to perform poorly on information extraction and NER tasks (whether fine-tuned or used as contextual embedding features) compared to ELMo.
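To make the metric difference concrete, here is a toy example (mine, not from the paper or conlleval): an entity that is only partially recovered still scores at the token level, but counts as a miss under exact-span, entity-level matching.

def spans(tags):
    # Extract (start, end, type) spans from IOB2 tags.
    out, start = [], None
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != tags[start][2:]):
            out.append((start, i - 1, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return out

gold = ["B-PER", "I-PER", "O"]
pred = ["B-PER", "O",     "O"]

token_tp = sum(g == p != "O" for g, p in zip(gold, pred))    # 1 correctly labelled token
entity_tp = len(set(spans(gold)) & set(spans(pred)))         # 0 exactly matching entities
print(token_tp, entity_tp)                                   # -> 1 0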

Still, it is a good exercise, and using SciBERT/BioBERT on scientific text achieves very good results, and faster ones, even compared to ELMo+BidLSTM-CRF.

As with the use of BERT for text classification in DeLFT, we use a data generator to feed BERT at prediction time (instead of the file-based input function of the original BERT implementation), which avoids reloading the whole TF graph for every batch. This is made possible by the FastPredict class in model.py, adapted from https://github.com/marcsto/rl/blob/master/src/fast_predict2.py by Marc Stogaitis.
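Roughly, the pattern looks like the sketch below (a simplified illustration, not the actual model.py code; the input_fn builder and attribute names are assumptions of mine):

class FastPredict(object):
    # Keep a single estimator.predict() generator alive and feed it new
    # batches through a Python generator, so the TF graph and checkpoint
    # are loaded once instead of once per batch.

    def __init__(self, estimator, input_fn_builder):
        self.estimator = estimator
        self.input_fn_builder = input_fn_builder  # turns a generator into an Estimator input_fn
        self.predictions = None
        self.next_batch = None
        self.closed = False

    def _batch_generator(self):
        while not self.closed:
            yield self.next_batch

    def predict(self, batch):
        self.next_batch = batch
        if self.predictions is None:
            # first call only: builds the graph and restores the weights
            self.predictions = self.estimator.predict(
                input_fn=self.input_fn_builder(self._batch_generator))
        # pull one prediction per example in the batch
        return [next(self.predictions) for _ in range(len(batch))]

    def close(self):
        # stop the feeding generator so the predict loop can terminate
        self.closed = True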

Using an NVIDIA GeForce 1080 GPU, we can process around 1000 tokens per second with this approach, which is 3 times faster than BiLSTM-CRF+ELMo, but 30 times slower than a BiLSTM-CRF alone (and 100 times slower than what we get with a Wapiti CRF model on a modern workstation ;).

@kermitt2 self-assigned this on Dec 28, 2019
@kermitt2 added the "enhancement" (New feature or request) label on Dec 28, 2019
@kermitt2 (Owner, Author) commented:

I will also add an optional CRF activation layer as an alternative to the current softmax layer (which is not well suited to sequence labelling).
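For reference, a CRF output layer in TF 1.x typically looks like the sketch below (a generic sketch, not the eventual DeLFT implementation; tensor names and shapes are illustrative):

import tensorflow as tf

def crf_layer(logits, labels, sequence_lengths):
    # logits: [batch, max_seq_len, num_labels] per-token scores from BERT
    # labels: [batch, max_seq_len] gold tag indices
    with tf.variable_scope("crf"):
        log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
            inputs=logits,
            tag_indices=labels,
            sequence_lengths=sequence_lengths)
        loss = tf.reduce_mean(-log_likelihood)
        # Viterbi decoding replaces the per-token argmax of the softmax layer
        predictions, _ = tf.contrib.crf.crf_decode(
            logits, transition_params, sequence_lengths)
    return loss, predictions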

@kermitt2 changed the title from "Bert sequence labeling" to "Bert sequence labelling" on Dec 28, 2019
@kermitt2 (Owner, Author) commented Dec 28, 2019

CoNLL 2003 NER
with softmax activation layer for fine-tuning
no parameter tuning, dev set ignored
BERT-base-en cased

average over 10 folds
            precision    recall  f1-score   support

       ORG     0.8804    0.8999    0.8900      1661
      MISC     0.7702    0.8181    0.7934       702
       PER     0.9604    0.9533    0.9568      1617
       LOC     0.9240    0.9258    0.9249      1668

	macro f1 = 0.9068
	macro precision = 0.9010
	macro recall = 0.9127 


** Worst ** model scores - 7
                  precision    recall  f1-score   support

             PER     0.9643    0.9524    0.9583      1617
             ORG     0.8642    0.8892    0.8766      1661
             LOC     0.9188    0.9293    0.9240      1668
            MISC     0.7510    0.8077    0.7783       702

all (micro avg.)     0.8932    0.9090    0.9010      5648


** Best ** model scores - 1
                  precision    recall  f1-score   support

             PER     0.9628    0.9592    0.9610      1617
             ORG     0.8937    0.9013    0.8975      1661
             LOC     0.9169    0.9257    0.9212      1668
            MISC     0.7867    0.8248    0.8053       702

all (micro avg.)     0.9062    0.9155    0.9109      5648

@kermitt2 (Owner, Author) commented Dec 28, 2019

CoNLL 2003 NER
with CRF activation layer for fine-tuning (slightly improved compared to softmax)
no parameter tuning, dev set ignored
BERT-base-en cased

average over 10 folds
            precision    recall  f1-score   support

       ORG     0.8793    0.9043    0.8916      1661
      MISC     0.7741    0.8201    0.7964       702
       PER     0.9632    0.9573    0.9602      1617
       LOC     0.9258    0.9257    0.9258      1668

	macro f1 = 0.9089
	macro precision = 0.9026
	macro recall = 0.9153 


** Worst ** model scores - 5
                  precision    recall  f1-score   support

             PER     0.9603    0.9573    0.9588      1617
             ORG     0.8776    0.8977    0.8875      1661
            MISC     0.7507    0.8148    0.7814       702
             LOC     0.9218    0.9257    0.9237      1668

all (micro avg.)     0.8968    0.9127    0.9047      5648


** Best ** model scores - 8
                  precision    recall  f1-score   support

             PER     0.9628    0.9604    0.9616      1617
             ORG     0.8735    0.9103    0.8915      1661
            MISC     0.7846    0.8248    0.8042       702
             LOC     0.9336    0.9269    0.9302      1668

all (micro avg.)     0.9045    0.9189    0.9116      5648

@kermitt2 (Owner, Author) commented Dec 29, 2019

After painful tuning of the hyperparameters on the dev set, this is the best I get with BERT-base+CRF:

CoNLL 2003 NER
with CRF activation layer for fine-tuning
hyperparameter tuning with dev set
BERT-base-en cased

average over 10 folds
            precision    recall  f1-score   support

       ORG     0.8804    0.9114    0.8957      1661
      MISC     0.7823    0.8189    0.8002       702
       PER     0.9633    0.9576    0.9605      1617
       LOC     0.9290    0.9316    0.9303      1668

	macro f1 = 0.9120
	macro precision = 0.9050
	macro recall = 0.9191 


** Worst ** model scores - 9
                  precision    recall  f1-score   support

             ORG     0.8736    0.9073    0.8901      1661
             PER     0.9596    0.9555    0.9575      1617
             LOC     0.9221    0.9293    0.9256      1668
            MISC     0.7757    0.8177    0.7961       702

all (micro avg.)     0.8992    0.9164    0.9078      5648


** Best ** model scores - 2
                  precision    recall  f1-score   support

             ORG     0.8897    0.9229    0.9060      1661
             PER     0.9627    0.9573    0.9600      1617
             LOC     0.9375    0.9353    0.9364      1668
            MISC     0.7862    0.8120    0.7989       702

all (micro avg.)     0.9110    0.9226    0.9168      5648

nerTagger.py Outdated
@@ -422,6 +531,10 @@ def annotate(output_format,


if __name__ == "__main__":

architectures = ['BidLSTM_CRF', 'BidLSTM_CNN_CRF', 'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 'BidLSTM_CRF_CASING',
'bert-base-en', 'bert-base-en', 'scibert', 'biobert']
Collaborator commented:

I noticed there are two 'bert-base-en' entries.

@kermitt2 (Owner, Author) replied:

Oops, the second one should be bert-large-en!
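Presumably the intended list would then read (a sketch with only the duplicate discussed here changed):

architectures = ['BidLSTM_CRF', 'BidLSTM_CNN_CRF', 'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 'BidLSTM_CRF_CASING',
                 'bert-base-en', 'bert-large-en', 'scibert', 'biobert']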

@kermitt2 merged commit 94d45e8 into master on Jan 16, 2020
@lfoppiano (Collaborator) commented:

I'm adding the results for the quantities model and for the superconductors model (run with bert-base-en); using BERT or SciBERT does not actually bring any improvement.

Quantities

(base) [lfoppian0@mdpfdm005 delft-bert]$ grep "verage over 10 folds" -A 14  DelftQuantities*
DelftQuantitiesNormal.o4743:Average over 10 folds
DelftQuantitiesNormal.o4743-                  precision    recall  f1-score   support
DelftQuantitiesNormal.o4743-
DelftQuantitiesNormal.o4743-      <unitLeft>     0.9473    0.9667    0.9569       258
DelftQuantitiesNormal.o4743-     <unitRight>     0.9400    0.8000    0.8633        11
DelftQuantitiesNormal.o4743-   <valueAtomic>     0.8219    0.8868    0.8530       304
DelftQuantitiesNormal.o4743-     <valueBase>     0.9714    0.7500    0.8457         8
DelftQuantitiesNormal.o4743-    <valueLeast>     0.8772    0.8063    0.8396        80
DelftQuantitiesNormal.o4743-     <valueList>     0.7391    0.7883    0.7617        60
DelftQuantitiesNormal.o4743-     <valueMost>     0.8961    0.8234    0.8580        77
DelftQuantitiesNormal.o4743-    <valueRange>     0.9889    0.9250    0.9522         8
DelftQuantitiesNormal.o4743-
DelftQuantitiesNormal.o4743-all (macro avg.)     0.8706    0.8888    0.8796          
DelftQuantitiesNormal.o4743-
DelftQuantitiesNormal.o4743-model config file saved
--
DelftQuantitiesSciBert.o4594:Average over 10 folds
DelftQuantitiesSciBert.o4594-                  precision    recall  f1-score   support
DelftQuantitiesSciBert.o4594-
DelftQuantitiesSciBert.o4594-      <unitLeft>     0.9491    0.8810    0.9138       258
DelftQuantitiesSciBert.o4594-     <unitRight>     0.8332    0.6727    0.7411        11
DelftQuantitiesSciBert.o4594-   <valueAtomic>     0.8599    0.8635    0.8617       304
DelftQuantitiesSciBert.o4594-     <valueBase>     0.9875    0.9875    0.9875         8
DelftQuantitiesSciBert.o4594-    <valueLeast>     0.9080    0.8475    0.8766        80
DelftQuantitiesSciBert.o4594-     <valueList>     0.7585    0.8583    0.8053        60
DelftQuantitiesSciBert.o4594-     <valueMost>     0.9026    0.8870    0.8946        77
DelftQuantitiesSciBert.o4594-    <valueRange>     0.9500    0.9500    0.9500         8
DelftQuantitiesSciBert.o4594-
DelftQuantitiesSciBert.o4594-all (macro avg.)     0.8886    0.8689    0.8786      

Superconductors:

(base) [lfoppian0@mdpfdm005 delft-bert]$ grep "verage over 10 folds" -A 14  DelftSuper*
DelftSuperBert.o4593:Average over 10 folds
DelftSuperBert.o4593-                  precision    recall  f1-score   support
DelftSuperBert.o4593-
DelftSuperBert.o4593-         <class>     0.4298    0.3107    0.3598        28
DelftSuperBert.o4593-      <material>     0.6509    0.7420    0.6932       143
DelftSuperBert.o4593-     <me_method>     0.5057    0.3167    0.3842        12
DelftSuperBert.o4593-      <pressure>     0.6522    0.3500    0.4472        10
DelftSuperBert.o4593-            <tc>     0.6988    0.5697    0.6271        76
DelftSuperBert.o4593-       <tcValue>     0.5219    0.5312    0.5257        16
DelftSuperBert.o4593-
DelftSuperBert.o4593-all (macro avg.)     0.6314    0.6102    0.6205          
DelftSuperBert.o4593-
DelftSuperBert.o4593-
DelftSuperBert.o4593-Leaving TensorFlow...
--
DelftSuperNormal.o4742:Average over 10 folds
DelftSuperNormal.o4742-                  precision    recall  f1-score   support
DelftSuperNormal.o4742-
DelftSuperNormal.o4742-         <class>     0.4324    0.2857    0.3394        28
DelftSuperNormal.o4742-      <material>     0.7915    0.7594    0.7750       143
DelftSuperNormal.o4742-     <me_method>     0.5980    0.2917    0.3820        12
DelftSuperNormal.o4742-      <pressure>     0.9333    0.4100    0.5646        10
DelftSuperNormal.o4742-            <tc>     0.8432    0.7461    0.7912        76
DelftSuperNormal.o4742-       <tcValue>     0.6232    0.6562    0.6375        16
DelftSuperNormal.o4742-
DelftSuperNormal.o4742-all (macro avg.)     0.7628    0.6716    0.7140          
DelftSuperNormal.o4742-
DelftSuperNormal.o4742-model config file saved
DelftSuperNormal.o4742-preprocessor saved
DelftSuperNormal.o4742-model saved
--
DelftSuperSciBert.o4590:Average over 10 folds
DelftSuperSciBert.o4590-                  precision    recall  f1-score   support
DelftSuperSciBert.o4590-
DelftSuperSciBert.o4590-         <class>     0.3579    0.3321    0.3419        28
DelftSuperSciBert.o4590-      <material>     0.7876    0.8126    0.7997       143
DelftSuperSciBert.o4590-     <me_method>     0.6787    0.5333    0.5950        12
DelftSuperSciBert.o4590-      <pressure>     0.5710    0.4900    0.5229        10
DelftSuperSciBert.o4590-            <tc>     0.7339    0.6079    0.6648        76
DelftSuperSciBert.o4590-       <tcValue>     0.6233    0.6500    0.6355        16
DelftSuperSciBert.o4590-
DelftSuperSciBert.o4590-all (macro avg.)     0.7129    0.6786    0.6951       

@lfoppiano (Collaborator) commented:

Another thing I noticed here is that the model config for bert-base-en has some strange values:

  1. use_BERT = false
  2. embeddings_name: glove-...

See below:

{
    "model_name": "superconductors-bert-bert-base-en",
    "model_type": "bert-base-en",
    "embeddings_name": "glove-840B",
    "char_vocab_size": 179,
    "case_vocab_size": 8,
    "char_embedding_size": 25,
    "num_char_lstm_units": 25,
    "max_char_length": 30,
    "max_sequence_length": 512,
    "word_embedding_size": 300,
    "num_word_lstm_units": 100,
    "case_embedding_size": 5,
    "dropout": 0.5,
    "recurrent_dropout": 0.5,
    "use_char_feature": true,
    "use_crf": false,
    "fold_number": 10,
    "batch_size": 6,
    "use_ELMo": false,
    "use_BERT": false,
    "labels": {
        "<PAD>": 0,
        "O": 1,
        "B-<tc>": 2,
        "I-<tc>": 3,
        "B-<material>": 4,
        "I-<material>": 5,
        "B-<tcValue>": 6,
        "I-<tcValue>": 7,
        "B-<class>": 8,
        "I-<class>": 9,
        "B-<me_method>": 10,
        "I-<me_method>": 11,
        "B-<pressure>": 12,
        "I-<pressure>": 13
    }
}

@kermitt2 (Owner, Author) commented:

Thanks @lfoppiano! What is the "normal" you're comparing with?

For sequence labelling, BERT-base indeed gives results similar to BidLSTM-CRF with GloVe in general (but only after tuning parameters), so this is in line with what is usually observed. With the recognition of software mentions, I saw that SciBERT was much better on scientific text than the normal BERT, but its scores are still significantly lower than ELMo+BidLSTM-CRF (minus 1-2 f-score). In contrast, for classification, SciBERT gives the best results on "scholarly" texts.

About the config: use_BERT = false is correct, because this parameter is for using BERT as "extracted features", i.e. as dynamic embeddings on top of GloVe for instance, in the usual architectures (it is the functional equivalent of the use_ELMo parameter). We could think about a better name to avoid confusion.

"embeddings_name": "glove-840B" because glove is the default value of the word embedding to be used. As there is no word embedding involved with fine-tuned BERT architecture, this parameter is ignored - but we could set it to None when a BERT architecture is selected.

@lfoppiano deleted the bert-sequence-labeling branch on January 28, 2020 07:27