
Option for TransformerWordEmbeddings to process long sentences. #1680

Merged
5 commits merged into flairNLP:master on Jun 10, 2020

Conversation

schelv
Contributor

@schelv schelv commented Jun 9, 2020

TransformerWordEmbeddings can have a maximum sequence length.
Input sentences that are too long produce warnings and errors.
These code modifications make it possible to get word embeddings for longer texts.

I think this fixes #1410, and it also fixes #575 and #1519.

Edit: It is also possible to update the CoNLL-03 Named Entity Recognition (Dutch) experiment: the BERTje embedding improves the score quite a bit.
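
For reference, a minimal usage sketch (the constructor flag name allow_long_sentences is an assumption on my side, taken from the attribute name used in the implementation discussed further down in this thread):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# enable the long-sentence handling added by this PR (flag name assumed)
embeddings = TransformerWordEmbeddings('bert-base-cased', allow_long_sentences=True)

# a text that is far longer than the model's 512-subtoken limit
long_sentence = Sentence("this toy sentence is repeated until it no longer fits " * 100)
embeddings.embed(long_sentence)

# every token now gets an embedding, not only the part that fits into one window
print(len(long_sentence.tokens), long_sentence.tokens[-1].embedding.shape)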

@schelv schelv changed the title from "Added the option for TransformerWordEmbeddings to process long sentences." to "Option for TransformerWordEmbeddings to process long sentences." on Jun 9, 2020
@alanakbik
Collaborator

@schelv thanks a lot for adding this - lots of people will surely find this useful! Could you also share the script you used for the Dutch CoNLL experiments? Then we can update the doc!

@schelv
Contributor Author

schelv commented Jun 10, 2020

The script in Experiments.md is updated.
I have not trained the model with the new option yet; I only evaluated with the new option by enabling it after loading the trained model.

I currently do not have the resources to train the model 5 times and average the scores, so I'm leaving that for someone else to do. =]

@alanakbik
Collaborator

No worries, maybe we can do this before the next release :) Thanks again!

@alanakbik alanakbik merged commit d565c53 into flairNLP:master Jun 10, 2020
@alejandrojcastaneira

Hello, great feature!
I'm pretty sure many people have been waiting for this.
Could you comment a little bit on how this enhancement works?
Best

@schelv
Contributor Author

schelv commented Jul 13, 2020

@alejandrojcastaneira
A transformer embedding has a maximum sequence length.
This implementation circumvents that limit by using a sliding-window approach.
Each window is at most "maximum sequence length" tokens long (the last window can be shorter), which makes it possible to compute the embedding representation with the transformer.
I've chosen a stride of (maximum sequence length)/2 for the sliding window. For the example sequence a b c d e f g h, where each letter is a piece of (maximum sequence length)/4 tokens of the entire sequence, the sliding windows are:

a b c d
    c d e f
        e f g h

The transformer gives the embeddings for these windows:

a_1 b_1 c_1 d_1
        c_2 d_2 e_2 f_2
                e_3 f_3 g_3 h_3

To get as much context into the embeddings as possible, the embeddings are "glued" back together to form the embedding of the long sequence: [a_1 b_1 c_1 d_2 e_2 f_3 g_3 h_3] (that is, [a_1 b_1 c_1] + [d_2 e_2] + [f_3 g_3 h_3]).
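
For the curious, here is a small self-contained sketch of this gluing logic (not the actual flair code; embed_window stands in for one transformer forward pass, and the tiny window_len is only for illustration):

import torch

def embed_long_sequence(subtoken_ids, embed_window, window_len=4):
    """Sliding-window embedding sketch.

    embed_window is any callable that maps a list of n subtoken ids to an
    (n, hidden_size) tensor, e.g. one transformer forward pass. Windows of
    window_len ids are taken with a stride of window_len // 2, and every
    position keeps the embedding from the window that gives it the most
    surrounding context.
    """
    stride = window_len // 2
    if len(subtoken_ids) <= window_len:
        return embed_window(subtoken_ids)

    starts = list(range(0, len(subtoken_ids) - stride, stride))
    pieces = []
    for i, start in enumerate(starts):
        window = subtoken_ids[start:start + window_len]
        emb = embed_window(window)                       # (len(window), hidden_size)
        lo = 0 if i == 0 else stride // 2                # drop the left half of the overlap
        hi = len(window) if i == len(starts) - 1 else stride + stride // 2
        pieces.append(emb[lo:hi])
    return torch.cat(pieces, dim=0)                      # (len(subtoken_ids), hidden_size)

# with window_len=4 and the sequence a b c d e f g h, this reproduces exactly
# [a_1 b_1 c_1] + [d_2 e_2] + [f_3 g_3 h_3] from the example above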

@Bunoviske

Hello! Great feature, indeed. I searched for this same functionality in TransformerDocumentEmbeddings, but as far as I understand it is not supported there. Here is a snippet of the code:

# tokenize and truncate to 512 subtokens (TODO: check better truncation strategies)
subtokenized_sentence = self.tokenizer.encode(sentence.to_tokenized_string(),
                                              add_special_tokens=True,
                                              max_length=512,
                                              truncation=True,
                                              )

I am doing a TextClassification task using this document embedding. What happens when I have a long sentence as input? What are the possibilities for solving this issue?

@schelv
Contributor Author

schelv commented Jul 15, 2020

I'm not sure. Either find a model that can handle texts of any length, or think of a way to shorten your text without losing its meaning (summarize it first, maybe?).
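
For what it's worth, here is a rough sketch of the second option for document embeddings: split the text into chunks that fit the model, embed each chunk with TransformerDocumentEmbeddings, and pool the chunk embeddings. Whitespace chunking and mean pooling are simplifying assumptions here, not something flair does for you:

import torch

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

document_embeddings = TransformerDocumentEmbeddings('bert-base-uncased')

def embed_long_document(text, max_words=200):
    # crude whitespace chunking; 200 words usually stays below 512 subtokens,
    # but that depends on the tokenizer and the text
    words = text.split()
    chunks = [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    chunk_vectors = []
    for chunk in chunks:
        sentence = Sentence(chunk)
        document_embeddings.embed(sentence)
        chunk_vectors.append(sentence.embedding.detach().clone())
    # mean-pool the per-chunk document embeddings into one vector for the whole text
    return torch.stack(chunk_vectors).mean(dim=0)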

@Bunoviske

Hey @schelv! Since you implemented this feature, maybe it makes sense to take a look at this comment.

@wangxinyu0922

@schelv Hi, thanks for your great feature! However, I get some strange subtokenized sentences from subtokenized_sentences in the function _add_embeddings_to_sentences. I feed in a sentence with 580 subtokens and get a subtokenized_sentences list with two subtokenized sentences: one has a length of 512 and the other has a length of 328. The first subtokenized sentence looks good; it contains the first 510 subtokens of the sentence (subtoken_ids_sentence[:510]).

However, the second subtokenized sentence is very strange. From my understanding and your example, the second subtokenized sentence should contain [subtoken_ids_sentence[510-256:510], subtoken_ids_sentence[510:]] (or subtoken_ids_sentence[510-256:]; I omit the [CLS] and [SEP] here). However, the second subtokenized sentence is not what I expect. After digging deeper into it, I find it is [subtoken_ids_sentence[-256:], reverse(subtoken_ids_sentence[-70:])]: the last 70 values are the reversed subtoken_ids_sentence[-70:]. I wonder why self.tokenizer.encode_plus() returns such an output; obviously, reverse(subtoken_ids_sentence[-70:]) will significantly hurt the performance of the NER model.

Though the function self.tokenizer.encode_plus() is from the transformers package, do you have any idea about this problem? It would be very helpful for my debugging!

@djstrong
Contributor

Could you prepare some example code?

@wangxinyu0922

wangxinyu0922 commented Oct 25, 2020

@djstrong
Hi, I've prepared a simple example of the problem; it is simplified from this code in flair:

while subtoken_ids_sentence:
    nr_sentence_parts += 1
    encoded_inputs = self.tokenizer.encode_plus(subtoken_ids_sentence,
                                                max_length=self.max_subtokens_sequence_length,
                                                stride=self.stride,
                                                return_overflowing_tokens=self.allow_long_sentences,
                                                truncation=True,
                                                )
    subtoken_ids_split_sentence = encoded_inputs['input_ids']
    subtokenized_sentences.append(torch.tensor(subtoken_ids_split_sentence, dtype=torch.long))
    if 'overflowing_tokens' in encoded_inputs:
        subtoken_ids_sentence = encoded_inputs['overflowing_tokens']
    else:
        subtoken_ids_sentence = None
sentence_parts_lengths.append(nr_sentence_parts)

import torch
import transformers
from transformers import AutoTokenizer
import pdb

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

max_subtokens_sequence_length = tokenizer.model_max_length
stride = tokenizer.model_max_length//2

nr_sentence_parts=0
subtoken_ids_sentence = [x for x in range(1000,1580)]
subtokenized_sentences=[]

while subtoken_ids_sentence:
    nr_sentence_parts += 1
    encoded_inputs = tokenizer.encode_plus(subtoken_ids_sentence,
                                                max_length=max_subtokens_sequence_length,
                                                stride=stride,
                                                return_overflowing_tokens=True,
                                                truncation=True,
                                                )
    # encoded_inputs = self.tokenizer.encode_plus(subtoken_ids_sentence,max_length=self.max_subtokens_sequence_length,stride=self.max_subtokens_sequence_length//2,return_overflowing_tokens=self.allow_long_sentences,truncation=True,)
    subtoken_ids_split_sentence = encoded_inputs['input_ids']
    subtokenized_sentences.append(torch.tensor(subtoken_ids_split_sentence, dtype=torch.long))

    if 'overflowing_tokens' in encoded_inputs:
        subtoken_ids_sentence = encoded_inputs['overflowing_tokens']
    else:
        subtoken_ids_sentence = None



print(subtokenized_sentences[0])
print(subtokenized_sentences[1])


# for reference
subtoken_ids_sentence = [x for x in range(1000,1580)]

print(subtoken_ids_sentence)

The example is a simple list with the ids [1000, ..., 1579], but the outputs I get are:
subtokenized_sentences[0]=[ 101, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174, 1175, 1176, 1177, 1178, 1179, 1180, 1181, 1182, 1183, 1184, 1185, 1186, 1187, 1188, 1189, 1190, 1191, 1192, 1193, 1194, 1195, 1196, 1197, 1198, 1199, 1200, 1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243, 1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259, 1260, 1261, 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278, 1279, 1280, 1281, 1282, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, 1294, 1295, 1296, 1297, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1312, 1313, 1314, 1315, 1316, 1317, 1318, 1319, 1320, 1321, 1322, 1323, 1324, 1325, 1326, 1327, 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 102]
and
subtokenized_sentences[1]=[ 101, 1323, 1324, 1325, 1326, 1327, 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 1515, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525, 1526, 1527, 1528, 1529, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 1559, 1560, 1561, 1562, 1563, 1564, 1565, 1566, 1567, 1568, 1569, 1570, 1571, 1572, 1573, 1574, 1575, 1576, 1577, 1578, 1579, 1578, 1577, 1576, 1575, 1574, 1573, 1572, 1571, 1570, 1569, 1568, 1567, 1566, 1565, 1564, 1563, 1562, 1561, 1560, 1559, 1558, 1557, 1556, 1555, 1554, 1553, 1552, 1551, 1550, 1549, 1548, 1547, 1546, 1545, 1544, 1543, 1542, 1541, 1540, 1539, 1538, 1537, 1536, 1535, 1534, 1533, 1532, 1531, 1530, 1529, 1528, 1527, 1526, 1525, 1524, 1523, 1522, 1521, 1520, 1519, 1518, 1517, 1516, 1515, 1514, 1513, 1512, 1511, 1510, 102]

You can see that subtokenized_sentences[1] first ranges from 1323 up to 1579, and then runs back down from 1578 to 1510.

@djstrong
Contributor

djstrong commented Oct 25, 2020

I confirm your findings with the newest transformers. I checked with 3.0.0 and there it works fine, but from 3.1.0 on it is wrong. You should create an issue in transformers - will you?

@wangxinyu0922

Thank you! I found that it goes wrong from 3.0.1 on, so I should use 3.0.0 with flair for now.

@schelv
Contributor Author

schelv commented Oct 26, 2020

@wangxinyu0922 Nicely found! I also noticed similar strange behavior recently (see #1902).
3.0.2 was the lowest version I tested, so it could be related.
I didn't investigate it as thoroughly as you did.

Maybe the flair project can restrict the transformers version to <= 3.0.0 or >= 3.version_that_fixes_this, to avoid wrong output for others who try to use flair + transformers on long texts.

If you create the issue in the transformers repository, can you provide a link to it? I'm very curious what causes this 😁
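
Until such a restriction is in place, the simplest safeguard is probably to pin transformers<=3.0.0 in your own environment. A runtime warning like the sketch below, which only assumes the affected range reported in this thread, at least makes the problem visible:

import warnings

import transformers
from packaging import version

# releases after 3.0.0 were observed in this thread to return corrupted
# overflowing_tokens from encode_plus(); the first fixed release was not
# known at the time of writing
if version.parse(transformers.__version__) > version.parse("3.0.0"):
    warnings.warn(
        "transformers > 3.0.0 may return partially reversed overflowing_tokens "
        "from tokenizer.encode_plus(); embeddings of long sentences can be wrong."
    )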

@djstrong
Contributor

The issue is here: huggingface/transformers#8028

@wangxinyu0922

wangxinyu0922 commented Oct 27, 2020

I'm trying to reproduce the CoNLL NER score reported in the BERT paper using Flair with document context, which is why I tried your great feature. I checked the code carefully because I got a very low score with the feature enabled. However, I still cannot reproduce the reported score now that the bug is fixed...

@schelv
Contributor Author

schelv commented Oct 27, 2020

... the CoNLL NER score reported in the BERT paper using Flair with document context ...

Do you have a link?
Is it one of these? https://github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md

@wangxinyu0922

I meant training a NER model with only the BERT embeddings. The reported score in the BERT paper is 92.8, and the authors said that they trained the model with the maximal document context. However, I can get a score of 91.6 at most. After looking through issues like google-research/bert#223, I believe the results are impossible to reproduce.

@schelv
Contributor Author

schelv commented Oct 27, 2020

Yes, "impossible to reproduce" is normal for a Google paper.
