Degrees of non-parallel corpora: parallel vs. noisy parallel vs. comparable vs. quasi-comparable

Human Language Technology Center
Department of Computer Science and Engineering
Hong Kong University of Science and Technology (HKUST)

This note clarifies the terminology surrounding non-parallel corpora. It synthesizes and systematizes discussions in Fung & Cheung (COLING 2004), Fung & Cheung (EMNLP 2004), and Wu & Fung (IJCNLP 2005).

Table 1 contrasts different degrees of parallelism in bilingual corpora.

| | parallel | noisy parallel | comparable | quasi-comparable |
|---|---|---|---|---|
| also known as | | | | very-non-parallel |
| roughly sentence aligned? | yes | no | no | no |
| translation of same document? | yes | yes | no | no |
| roughly topic aligned? | yes | yes | yes | no |
| examples | Canadian Hansards; Hong Kong Hansards; Europarl | Hong Kong News corpus | date-aligned news stories; many Wikipedia articles on the same topic | TDT3 topic detection corpus; CLEF corpora for CLIR |

Table 1. Levels of (non)parallel corpora.
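The three yes/no criteria in Table 1 can be read as a small decision table. The following is a minimal sketch of that reading in Python; the function name `degree_of_parallelism`, its parameter names, and the choice to reject combinations that Table 1 does not list are illustrative assumptions, not part of the cited papers.

```python
# Rows of Table 1, keyed by the answers to its three questions:
# (roughly sentence aligned?, translation of same document?, roughly topic aligned?)
TABLE_1 = {
    (True, True, True): "parallel",
    (False, True, True): "noisy parallel",
    (False, False, True): "comparable",
    (False, False, False): "quasi-comparable",
}


def degree_of_parallelism(sentence_aligned: bool,
                          same_document: bool,
                          topic_aligned: bool) -> str:
    """Return the Table 1 label for a corpus described by three yes/no answers.

    Hypothetical helper: combinations that Table 1 does not list
    (for example, sentence aligned but not a translation) raise ValueError
    instead of being guessed at.
    """
    key = (sentence_aligned, same_document, topic_aligned)
    if key not in TABLE_1:
        raise ValueError(f"combination {key} is not covered by Table 1")
    return TABLE_1[key]
```

Under this sketch, a corpus whose documents are rough translations of each other but lack sentence alignment, such as the Hong Kong News corpus in Table 1, would be described as `degree_of_parallelism(False, True, True)`. Note that "no" for topic alignment is a statement about the corpus as a whole: a quasi-comparable corpus may still contain some in-topic document pairs.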
parallel corpus
is a sentence-aligned corpus containing bilingual translations of the same document. The Hong Kong Laws Corpus is a parallel corpus with manually aligned sentences, and is used as a parallel sentence resource for statistical machine translation systems. It contains 313,659 Chinese-English sentence pairs. (From Fung & Cheung, EMNLP 2004.)
noisy parallel corpus
contains non-aligned sentences that are nevertheless mostly bilingual translations of the same document. Fung and McKeown (1997), Kiku (1999), and Zhao and Vogel (2002) extracted bilingual word senses, lexicons, and parallel sentence pairs from such corpora. A corpus such as Hong Kong News contains documents that are in fact rough translations of each other, focused on the same thematic topics, with some insertions and deletions of paragraphs. (From Fung & Cheung, EMNLP 2004.) Zhao and Vogel used a corpus of Chinese and English versions of news stories from the Xinhua News Agency, with “roughly similar sentence order of content”. This corpus can be more accurately described as a noisy parallel corpus. (From Wu & Fung, IJCNLP 2005.)
comparable corpus
contains non-sentence-aligned, non-translated bilingual documents that are topic-aligned. For example, newspaper articles from two sources in different languages, published within the same window of dates, can constitute a comparable corpus. Rapp (1995), Grefenstette (1998), Fung and Lo (1998), and Kaji (2003) derived bilingual lexicons or word senses from such corpora. Munteanu et al. (2004) constructed a comparable corpus of Arabic and English news stories by matching the publishing dates of the articles. (From Fung & Cheung, EMNLP 2004.) Munteanu et al. used comparable corpora of news articles published within the same 5-day window. In both cases, the corpora contain documents on matching topics; unlike our present objective of mining quasi-comparable corpora, these other methods assume corpora of on-topic documents. (From Wu & Fung, IJCNLP 2005.)
quasi-comparable corpus
contains non-aligned, non-translated bilingual documents that may be either on the same topic (in-topic) or not (off-topic). The TDT3 corpus is a good source of truly non-parallel, quasi-comparable data. It contains transcriptions of news stories from radio broadcasts and TV news reports from 1998-2000 in English and Chinese. The corpus has about 7,500 Chinese and 12,400 English documents, covering more than 60 different topics. Among these, 1,200 Chinese and 4,500 English documents are manually marked as in-topic. The remaining documents are marked as off-topic because they are either only weakly relevant to a topic or irrelevant to all topics in the existing documents. Of the in-topic documents, most are found to be comparable, and a few of the Chinese and English passages are almost translations of each other. Nevertheless, the considerable number of off-topic documents gives rise to a greater variety of sentences in terms of content and structure. Overall, the TDT3 corpus contains 110,000 Chinese sentences and 290,000 English sentences. A very small number of the bilingual sentences are translations of each other, while some others are bilingual paraphrases. (From Fung & Cheung, COLING 2004.)
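To make the in-topic versus off-topic balance of TDT3 concrete, the document counts quoted above can be turned into proportions. This is a minimal arithmetic sketch; the counts are the approximate figures given in the text, and the names `TDT3_DOCS` and `in_topic_share` are illustrative assumptions rather than anything defined in the cited papers.

```python
# Approximate TDT3 document counts as quoted in the text above.
TDT3_DOCS = {
    "Chinese": {"total": 7_500, "in_topic": 1_200},
    "English": {"total": 12_400, "in_topic": 4_500},
}


def in_topic_share(total: int, in_topic: int) -> float:
    """Fraction of documents manually marked as in-topic (hypothetical helper)."""
    return in_topic / total


# Per-language share of in-topic documents.
shares = {lang: in_topic_share(**counts) for lang, counts in TDT3_DOCS.items()}

# Share over both languages combined.
overall = in_topic_share(
    total=sum(c["total"] for c in TDT3_DOCS.values()),
    in_topic=sum(c["in_topic"] for c in TDT3_DOCS.values()),
)

for lang, share in shares.items():
    print(f"{lang}: {share:.1%} in-topic")
print(f"Overall: {overall:.1%} in-topic")
```

With these figures, about 16% of the Chinese documents and about 36% of the English documents are in-topic, or roughly 29% overall, so well over half of the corpus is off-topic. That is consistent with describing TDT3 as quasi-comparable rather than comparable.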

References