
Option for TransformerWordEmbeddings to process long sentences. #1680

Merged
5 commits merged into flairNLP:master on Jun 10, 2020

Conversation

schelv
Contributor

@schelv schelv commented Jun 9, 2020

TransformerWordEmbeddings can have a maximum sequence length.
Input sentences that are too long produce warnings and errors.
These code modifications make it possible to get word embeddings for longer texts.

I think this fixes #1410, and it also fixes #575 and #1519.

Edit: It is also possible to update the CoNLL-03 Named Entity Recognition (Dutch) experiment: the BERTje embedding improves the score quite a bit.
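
For reference, a minimal usage sketch (the constructor flag name allow_long_sentences is an assumption on my side, taken from the attribute name used in the implementation discussed further down in this thread):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# enable the long-sentence handling added by this PR (flag name assumed)
embeddings = TransformerWordEmbeddings('bert-base-cased', allow_long_sentences=True)

# a text that is far longer than the model's 512-subtoken limit
long_sentence = Sentence("this toy sentence is repeated until it no longer fits " * 100)
embeddings.embed(long_sentence)

# every token now gets an embedding, not only the part that fits into one window
print(len(long_sentence.tokens), long_sentence.tokens[-1].embedding.shape)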

@schelv schelv changed the title from "Added the option for TransformerWordEmbeddings to process long sentences." to "Option for TransformerWordEmbeddings to process long sentences." on Jun 9, 2020
@alanakbik
Collaborator

@schelv thanks a lot for adding this - lots of people will surely find this useful! Could you also share the script you used for the Dutch CoNLL experiments? Then we can update the doc!

@schelv
Contributor Author

schelv commented Jun 10, 2020

The script in Experiments.md is updated.
I have not trained the model with the new option yet; I only evaluated with the new option by enabling it after loading the trained model.

I currently do not have the resources to train the model 5 times and average the scores, so I'm leaving that for someone else to do. =]

@alanakbik
Collaborator

No worries, maybe we can do this before the next release :) Thanks again!

@alanakbik alanakbik merged commit d565c53 into flairNLP:master Jun 10, 2020
@alejandrojcastaneira

Hello, great feature!
I'm pretty sure many people have been waiting for this.
Could you comment a little bit on how this enhancement works?
Best

@schelv
Contributor Author

schelv commented Jul 13, 2020

@alejandrojcastaneira
A transformer embedding has a maximum sequence length.
This implementation circumvents that limit by using a sliding-window approach.
Each window is at most "maximum sequence length" tokens long (the last window can be shorter), which makes it possible to compute the embedding representation with the transformer.
I've chosen a stride of (maximum sequence length)/2 for the sliding window. For the example sequence a b c d e f g h, where each letter is a piece of (maximum sequence length)/4 tokens of the entire sequence, the sliding windows are:

a b c d
    c d e f
        e f g h

The transformer gives the embeddings for these windows:

a_1 b_1 c_1 d_1
        c_2 d_2 e_2 f_2
                e_3 f_3 g_3 h_3

To get as much context into the embeddings as possible, the embeddings are "glued" back together to form the embedding of the long sequence: [a_1 b_1 c_1 d_2 e_2 f_3 g_3 h_3] (that is, [a_1 b_1 c_1] + [d_2 e_2] + [f_3 g_3 h_3]).
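
For the curious, here is a small self-contained sketch of this gluing logic (not the actual flair code; embed_window stands in for one transformer forward pass, and the tiny window_len is only for illustration):

import torch

def embed_long_sequence(subtoken_ids, embed_window, window_len=4):
    """Sliding-window embedding sketch.

    embed_window is any callable that maps a list of n subtoken ids to an
    (n, hidden_size) tensor, e.g. one transformer forward pass. Windows of
    window_len ids are taken with a stride of window_len // 2, and every
    position keeps the embedding from the window that gives it the most
    surrounding context.
    """
    stride = window_len // 2
    if len(subtoken_ids) <= window_len:
        return embed_window(subtoken_ids)

    starts = list(range(0, len(subtoken_ids) - stride, stride))
    pieces = []
    for i, start in enumerate(starts):
        window = subtoken_ids[start:start + window_len]
        emb = embed_window(window)                       # (len(window), hidden_size)
        lo = 0 if i == 0 else stride // 2                # drop the left half of the overlap
        hi = len(window) if i == len(starts) - 1 else stride + stride // 2
        pieces.append(emb[lo:hi])
    return torch.cat(pieces, dim=0)                      # (len(subtoken_ids), hidden_size)

# with window_len=4 and the sequence a b c d e f g h, this reproduces exactly
# [a_1 b_1 c_1] + [d_2 e_2] + [f_3 g_3 h_3] from the example above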

@Bunoviske

Hello! Great feature, indeed. I searched for this same functionality in TransformerDocumentEmbeddings, but as far as I understand it is not supported there. Here is a snippet of the code:

# tokenize and truncate to 512 subtokens (TODO: check better truncation strategies)
subtokenized_sentence = self.tokenizer.encode(sentence.to_tokenized_string(),
                                              add_special_tokens=True,
                                              max_length=512,
                                              truncation=True,
                                              )

I am doing a TextClassification task using this document embedding. What happens when I have a long sentence as input? What are the possibilities for solving this issue?

@schelv
Contributor Author

schelv commented Jul 15, 2020

I'm not sure. Either find a model that can handle texts of any length, or think of a way to shorten your text without losing its meaning (summarize it first, maybe?).
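
For what it's worth, here is a rough sketch of the second option for document embeddings: split the text into chunks that fit the model, embed each chunk with TransformerDocumentEmbeddings, and pool the chunk embeddings. Whitespace chunking and mean pooling are simplifying assumptions here, not something flair does for you:

import torch

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

document_embeddings = TransformerDocumentEmbeddings('bert-base-uncased')

def embed_long_document(text, max_words=200):
    # crude whitespace chunking; 200 words usually stays below 512 subtokens,
    # but that depends on the tokenizer and the text
    words = text.split()
    chunks = [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    chunk_vectors = []
    for chunk in chunks:
        sentence = Sentence(chunk)
        document_embeddings.embed(sentence)
        chunk_vectors.append(sentence.embedding.detach().clone())
    # mean-pool the per-chunk document embeddings into one vector for the whole text
    return torch.stack(chunk_vectors).mean(dim=0)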

@Bunoviske

Hey @schelv! Since you implemented this feature, maybe it makes sense to take a look at this comment.

@wangxinyu0922

@schelv Hi, thanks for your great feature! However, I get some strange subtokenized sentences from subtokenized_sentences in the function _add_embeddings_to_sentences. I feed in a sentence with 580 subtokens and get a subtokenized_sentences list with two subtokenized sentences: one has a length of 512 and the other has a length of 328. The first subtokenized sentence looks good; it contains the first 510 subtokens of the sentence (subtoken_ids_sentence[:510]).

However, the second subtokenized sentence is very strange. From my understanding and your example, the second subtokenized sentence should contain [subtoken_ids_sentence[510-256:510], subtoken_ids_sentence[510:]] (or subtoken_ids_sentence[510-256:]; I omit the [CLS] and [SEP] here). However, the second subtokenized sentence is not what I expect. After digging deeper into it, I find it is [subtoken_ids_sentence[-256:], reverse(subtoken_ids_sentence[-70:])]: the last 70 values are the reversed subtoken_ids_sentence[-70:]. I wonder why self.tokenizer.encode_plus() returns such an output; obviously, reverse(subtoken_ids_sentence[-70:]) will significantly hurt the performance of the NER model.

Though the function self.tokenizer.encode_plus() is from the transformers package, do you have any idea about this problem? It would be very helpful for my debugging!

@djstrong
Contributor

Could you prepare some example code?

@wangxinyu0922

wangxinyu0922 commented Oct 25, 2020

@djstrong
Hi, I've prepared a simple example of the problem; it is simplified from this code in flair:

while subtoken_ids_sentence:
    nr_sentence_parts += 1
    encoded_inputs = self.tokenizer.encode_plus(subtoken_ids_sentence,
                                                max_length=self.max_subtokens_sequence_length,
                                                stride=self.stride,
                                                return_overflowing_tokens=self.allow_long_sentences,
                                                truncation=True,
                                                )
    subtoken_ids_split_sentence = encoded_inputs['input_ids']
    subtokenized_sentences.append(torch.tensor(subtoken_ids_split_sentence, dtype=torch.long))
    if 'overflowing_tokens' in encoded_inputs:
        subtoken_ids_sentence = encoded_inputs['overflowing_tokens']
    else:
        subtoken_ids_sentence = None
sentence_parts_lengths.append(nr_sentence_parts)

import torch
import transformers
from transformers import AutoTokenizer
import pdb

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

max_subtokens_sequence_length = tokenizer.model_max_length
stride = tokenizer.model_max_length//2

nr_sentence_parts=0
subtoken_ids_sentence = [x for x in range(1000,1580)]
subtokenized_sentences=[]

while subtoken_ids_sentence:
    nr_sentence_parts += 1
    encoded_inputs = tokenizer.encode_plus(subtoken_ids_sentence,
                                                max_length=max_subtokens_sequence_length,
                                                stride=stride,
                                                return_overflowing_tokens=True,
                                                truncation=True,
                                                )
    # encoded_inputs = self.tokenizer.encode_plus(subtoken_ids_sentence,max_length=self.max_subtokens_sequence_length,stride=self.max_subtokens_sequence_length//2,return_overflowing_tokens=self.allow_long_sentences,truncation=True,)
    subtoken_ids_split_sentence = encoded_inputs['input_ids']
    subtokenized_sentences.append(torch.tensor(subtoken_ids_split_sentence, dtype=torch.long))

    if 'overflowing_tokens' in encoded_inputs:
        subtoken_ids_sentence = encoded_inputs['overflowing_tokens']
    else:
        subtoken_ids_sentence = None



print(subtokenized_sentences[0])
print(subtokenized_sentences[1])


# for reference
subtoken_ids_sentence = [x for x in range(1000,1580)]

print(subtoken_ids_sentence)

The example is a simple list with the ids [1000, ..., 1579], but the outputs I get are:
subtokenized_sentences[0]=[ 101, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174, 1175, 1176, 1177, 1178, 1179, 1180, 1181, 1182, 1183, 1184, 1185, 1186, 1187, 1188, 1189, 1190, 1191, 1192, 1193, 1194, 1195, 1196, 1197, 1198, 1199, 1200, 1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243, 1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259, 1260, 1261, 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278, 1279, 1280, 1281, 1282, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, 1294, 1295, 1296, 1297, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1312, 1313, 1314, 1315, 1316, 1317, 1318, 1319, 1320, 1321, 1322, 1323, 1324, 1325, 1326, 1327, 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 102]
and
subtokenized_sentences[1]=[ 101, 1323, 1324, 1325, 1326, 1327, 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 1515, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525, 1526, 1527, 1528, 1529, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 1559, 1560, 1561, 1562, 1563, 1564, 1565, 1566, 1567, 1568, 1569, 1570, 1571, 1572, 1573, 1574, 1575, 1576, 1577, 1578, 1579, 1578, 1577, 1576, 1575, 1574, 1573, 1572, 1571, 1570, 1569, 1568, 1567, 1566, 1565, 1564, 1563, 1562, 1561, 1560, 1559, 1558, 1557, 1556, 1555, 1554, 1553, 1552, 1551, 1550, 1549, 1548, 1547, 1546, 1545, 1544, 1543, 1542, 1541, 1540, 1539, 1538, 1537, 1536, 1535, 1534, 1533, 1532, 1531, 1530, 1529, 1528, 1527, 1526, 1525, 1524, 1523, 1522, 1521, 1520, 1519, 1518, 1517, 1516, 1515, 1514, 1513, 1512, 1511, 1510, 102]

You can see that subtokenized_sentences[1] first ranges from 1323 up to 1579, and then runs back down from 1578 to 1510.

@djstrong
Contributor

djstrong commented Oct 25, 2020

I confirm your findings with the newest transformers. I checked with 3.0.0 and there it works fine, but from 3.1.0 on it is wrong. You should create an issue in transformers - will you?

@wangxinyu0922

Thank you! I found that it goes wrong from 3.0.1 on, so I should use 3.0.0 with flair for now.

@schelv
Contributor Author

schelv commented Oct 26, 2020

@wangxinyu0922 Nicely found! I also noticed similar strange behavior recently (see #1902).
3.0.2 was the lowest version I tested, so it could be related.
I didn't investigate it as thoroughly as you did.

Maybe the flair project can restrict the transformers version to <= 3.0.0 or >= 3.version_that_fixes_this, to avoid wrong output for others who try to use flair + transformers on long texts.

If you create the issue in the transformers repository, can you provide a link to it? I'm very curious what causes this 😁
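
Until such a restriction is in place, the simplest safeguard is probably to pin transformers<=3.0.0 in your own environment. A runtime warning like the sketch below, which only assumes the affected range reported in this thread, at least makes the problem visible:

import warnings

import transformers
from packaging import version

# releases after 3.0.0 were observed in this thread to return corrupted
# overflowing_tokens from encode_plus(); the first fixed release was not
# known at the time of writing
if version.parse(transformers.__version__) > version.parse("3.0.0"):
    warnings.warn(
        "transformers > 3.0.0 may return partially reversed overflowing_tokens "
        "from tokenizer.encode_plus(); embeddings of long sentences can be wrong."
    )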

@djstrong
Contributor

The issue is here: huggingface/transformers#8028

@wangxinyu0922

wangxinyu0922 commented Oct 27, 2020

I'm trying to reproduce the CoNLL NER score reported in the BERT paper using Flair with document context, which is why I tried your great feature. I checked the code carefully because I got a very low score with the feature enabled. However, I still cannot reproduce the reported score now that the bug is fixed...

@schelv
Contributor Author

schelv commented Oct 27, 2020

... the CoNLL NER score reported in the BERT paper using Flair with document context ...

Do you have a link?
Is it one of these? https://github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md

@wangxinyu0922

I meant training a NER model with only the BERT embeddings. The reported score in the BERT paper is 92.8, and the authors said that they trained the model with the maximal document context. However, I can get a score of 91.6 at most. After looking through issues like google-research/bert#223, I believe the results are impossible to reproduce.

@schelv
Contributor Author

schelv commented Oct 27, 2020

Yes, "impossible to reproduce" is normal for a Google paper.
