WMT_data (updated regularly; suggestions welcome)

Preprocessed data for the Workshop on Statistical Machine Translation (WMT), collected from other sources.

When reimplementing NMT models, I found that the WMT14/15/16/17 data provided on the homepage is raw, and that it is hard to find processed data that exactly matches what a paper used. So I created this repository to collect the processed WMT data I have come across and that I am confident meets the requirements of the papers I have read.

WMT 2014

WMT 2014 English to French

The data is provided at: http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/

The data is used in these papers:

  • Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
  • Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.

WMT 2014 English to German

The data is provided at: https://nlp.stanford.edu/projects/nmt/

The data is used in these papers:

  • (exactly the data) Thang Luong, Hieu Pham, Christopher D. Manning: Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015: 1412-1421
  • (similar to the data) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
  • (similar to the data) Stephan Peitz, Joern Wuebker, Markus Freitag, and Hermann Ney. The RWTH Aachen German-English Machine Translation System for WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
  • (similar to the data) Tao Lei, Yu Zhang: Training RNNs as Fast as CNNs. CoRR abs/1709.02755 (2017)

WMT 2015 English to German

The WMT 2015 data is the WMT 2014 English to German data with the News Commentary corpus updated to v10.

The data is provided at: https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz

The data is used in this tutorial for OpenNMT: http://forum.opennmt.net/t/training-english-german-wmt15-nmt-engine/29
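Since the WMT 2015 data is distributed as a single `.tgz` archive, fetching and unpacking it can be scripted. The following is a minimal sketch and not part of this repository: the function name `download_and_extract` and the destination directory are hypothetical, and it assumes the link above is still reachable.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(url: str, dest_dir: str) -> list:
    """Download a gzipped tar archive and unpack it into dest_dir.

    Hypothetical helper for illustration; returns the archive member names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last path component of the URL.
    archive_path = dest / url.rsplit("/", 1)[-1]

    # Skip the download if the archive is already present locally.
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))

    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names
```

For example, `download_and_extract("https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz", "data/wmt15")` would place the archive and its extracted files under `data/wmt15` (an assumed directory name).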

WMT 2017

The WMT 2017 homepage provides preprocessed data: http://data.statmt.org/wmt17/translation-task/preprocessed/