Degrees of non-parallel corpora: parallel vs. noisy parallel vs. comparable vs. quasi-comparable

Dekai Wu

Human Language Technology Center
Department of Computer Science and Engineering
Hong Kong University of Science and Technology (HKUST)

This note clarifies terminology surrounding non-parallel corpora. It synthesizes and systematizes discussions in Fung & Cheung (COLING 2004), Fung & Cheung (EMNLP 2004), and Wu & Fung (IJCNLP 2005).

Table 1 contrasts different degrees of parallelism in bilingual corpora.

Table 1. Levels of (non)parallel corpora.
	parallel	noisy parallel	comparable	quasi-comparable
also known as				very-non-parallel
roughly sentence aligned?	yes	no	no	no
translation of same document?	yes	yes	no	no
roughly topic aligned?	yes	yes	yes	no
examples	Canadian Hansards; Hong Kong Hansards; Europarl	Hong Kong News corpus	date-aligned news stories; many Wikipedia articles on the same topic	TDT3 topic detection corpus; CLEF corpora for CLIR
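
The three yes/no criteria in the table can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not a method from the cited papers: the names `CorpusType` and `classify_corpus` are hypothetical, and combinations of answers that the table does not list are (by assumption) assigned to the first level whose requirements they still meet.

```python
from enum import Enum


class CorpusType(Enum):
    """The four levels of (non)parallelism named in the table."""
    PARALLEL = "parallel"
    NOISY_PARALLEL = "noisy parallel"
    COMPARABLE = "comparable"
    QUASI_COMPARABLE = "quasi-comparable"


def classify_corpus(sentence_aligned: bool,
                    same_document: bool,
                    topic_aligned: bool) -> CorpusType:
    """Map the three criteria from the table to a corpus level.

    Assumption: answer combinations not shown in the table fall
    through to the strongest level whose requirements are all met.
    """
    if sentence_aligned and same_document and topic_aligned:
        return CorpusType.PARALLEL
    if same_document and topic_aligned:
        return CorpusType.NOISY_PARALLEL
    if topic_aligned:
        return CorpusType.COMPARABLE
    return CorpusType.QUASI_COMPARABLE


if __name__ == "__main__":
    # The four columns of the table, in order.
    print(classify_corpus(True, True, True).value)
    print(classify_corpus(False, True, True).value)
    print(classify_corpus(False, False, True).value)
    print(classify_corpus(False, False, False).value)
```

Ordering the checks from strictest to weakest mirrors the table, where each level drops one more "yes" than the level before it.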

parallel corpus: contains sentence-aligned bilingual translations of the same document. The Hong Kong Laws Corpus is a parallel corpus with manually aligned sentences and is used as a parallel sentence resource for statistical machine translation systems. It contains 313,659 sentence pairs in Chinese and English. (From Fung & Cheung, EMNLP 2004.)
noisy parallel corpus: contains non-aligned sentences that are nevertheless mostly bilingual translations of the same document. Fung and McKeown (1997), Kikui (1999), and Zhao and Vogel (2002) extracted bilingual word senses, lexicons, and parallel sentence pairs from such corpora. A corpus such as Hong Kong News contains documents that are in fact rough translations of each other, focused on the same thematic topics, with some insertions and deletions of paragraphs. (From Fung & Cheung, EMNLP 2004.) Zhao and Vogel used a corpus of Chinese and English versions of news stories from the Xinhua News Agency, with “roughly similar sentence order of content”. This corpus can be more accurately described as a noisy parallel corpus. (From Wu & Fung, IJCNLP 2005.)
comparable corpus: contains non-sentence-aligned, non-translated bilingual documents that are topic-aligned. For example, newspaper articles from two sources in different languages, within the same window of published dates, can constitute a comparable corpus. Rapp (1995), Grefenstette (1998), Fung and Lo (1998), and Kaji (2003) derived bilingual lexicons or word senses from such corpora. Munteanu et al. (2004) constructed a comparable corpus of Arabic and English news stories by matching the publishing dates of the articles. (From Fung & Cheung, EMNLP 2004.) Munteanu et al. used comparable corpora of news articles published within the same 5-day window. In both cases, the corpora contain documents on matching topics; unlike our present objective of mining quasi-comparable corpora, these other methods assume corpora of on-topic documents. (From Wu & Fung, IJCNLP 2005.)
quasi-comparable corpus: contains non-aligned, non-translated bilingual documents that may be either on the same topic (in-topic) or not (off-topic). The TDT3 corpus is a good source of truly non-parallel, quasi-comparable data. It contains transcriptions of news stories from radio broadcasts or TV news reports from 1998-2000 in English and Chinese. The corpus has about 7,500 Chinese and 12,400 English documents, covering more than 60 different topics. Among these, 1,200 Chinese and 4,500 English documents are manually marked as in-topic. The remaining documents are marked as off-topic because they are either only weakly relevant to a topic or irrelevant to all topics in the existing documents. Of the in-topic documents, most are found to be comparable, and a few of the Chinese and English passages are almost translations of each other. Nevertheless, the considerable number of off-topic documents gives rise to a greater variety of sentences in terms of content and structure. Overall, the TDT3 corpus contains 110,000 Chinese sentences and 290,000 English sentences. A very small number of the bilingual sentences are translations of each other, while some others are bilingual paraphrases. (From Fung & Cheung, COLING 2004.)
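
The date-matching idea mentioned under "comparable corpus" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Munteanu et al.: the function name `pair_by_date`, the document identifiers, and the dates are all hypothetical, and "within the same window" is interpreted here as publication dates at most `window_days` apart.

```python
from datetime import date
from typing import List, Tuple

# A document is represented (hypothetically) as an (id, publication date) pair.
Doc = Tuple[str, date]


def pair_by_date(source_docs: List[Doc],
                 target_docs: List[Doc],
                 window_days: int) -> List[Tuple[str, str]]:
    """Return (source_id, target_id) pairs whose publication dates
    differ by at most `window_days` days.

    Assumption: date proximity alone is used as the pairing signal;
    the pairs are candidates only and are not guaranteed to be on-topic.
    """
    pairs = []
    for src_id, src_date in source_docs:
        for tgt_id, tgt_date in target_docs:
            if abs((src_date - tgt_date).days) <= window_days:
                pairs.append((src_id, tgt_id))
    return pairs


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    zh_docs = [("zh-1", date(2004, 1, 1)), ("zh-2", date(2004, 1, 20))]
    en_docs = [("en-1", date(2004, 1, 3)), ("en-2", date(2004, 2, 15))]
    print(pair_by_date(zh_docs, en_docs, window_days=5))
```

The nested loop compares every source document with every target document, which is fine for a sketch; a large collection would more plausibly be bucketed by date first.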

References

Pascale FUNG & Percy CHEUNG (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Dekang LIN and Dekai WU (editors), Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004). Barcelona, Spain: July 2004.
Pascale FUNG & Percy CHEUNG (2004). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004). Geneva, Switzerland: August 2004.
Dekai WU & Pascale FUNG (2005). Inversion Transduction Grammar constraints for mining parallel sentences from quasi-comparable corpora. In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP 2005), Lecture Notes in Computer Science 3651: 257-268.