Degrees of non-parallel corpora: parallel vs. noisy parallel vs. comparable
vs. quasi-comparable
Dekai Wu
Human Language Technology Center
Department of Computer Science and Engineering
Hong Kong University of Science and Technology (HKUST)
This note clarifies terminology surrounding non-parallel corpora. It
synthesizes and systematizes discussions in Fung & Cheung (COLING 2004),
Fung & Cheung (EMNLP 2004), and Wu & Fung (IJCNLP 2005).
Table 1 contrasts different
degrees of parallelism in bilingual corpora.
| | parallel | noisy parallel | comparable | quasi-comparable |
|---|---|---|---|---|
| also known as | | | | very-non-parallel |
| roughly sentence aligned? | yes | no | no | no |
| translation of same document? | yes | yes | no | no |
| roughly topic aligned? | yes | yes | yes | no |
| examples | Canadian Hansards; Hong Kong Hansards; Europarl | Hong Kong News corpus | date-aligned news stories; many Wikipedia articles on the same topic | TDT3 topic detection corpus; CLEF corpora for CLIR |
Levels of (non)parallel corpora.
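The three yes/no rows of the table can be read as a small lookup. The sketch below is only an illustration of that reading, not part of the cited papers: the names `CorpusTraits` and `degree_of_parallelism` are hypothetical, and the choice to reject combinations that the table does not list is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusTraits:
    """The three yes/no properties used in the table (hypothetical helper type)."""
    sentence_aligned: bool  # roughly sentence aligned?
    same_document: bool     # translation of same document?
    topic_aligned: bool     # roughly topic aligned?


# Exactly the four combinations that appear in the table.
_DEGREES = {
    (True, True, True): "parallel",
    (False, True, True): "noisy parallel",
    (False, False, True): "comparable",
    (False, False, False): "quasi-comparable",
}


def degree_of_parallelism(traits: CorpusTraits) -> str:
    """Map a combination of traits to the table's column label.

    Assumption: combinations absent from the table (for example,
    sentence-aligned but not topic-aligned) are rejected rather than guessed.
    """
    key = (traits.sentence_aligned, traits.same_document, traits.topic_aligned)
    try:
        return _DEGREES[key]
    except KeyError:
        raise ValueError(f"combination not covered by the table: {key}") from None


if __name__ == "__main__":
    # A corpus of rough translations without sentence alignment.
    print(degree_of_parallelism(CorpusTraits(False, True, True)))
```

Keeping the table as an explicit dictionary, rather than a chain of conditions, makes it clear that only four of the eight possible combinations are given a name here.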
- parallel corpus
- is a sentence-aligned corpus containing bilingual translations of the
same document. The Hong Kong Laws Corpus, for example, is a parallel
corpus with manually aligned sentences that serves as a parallel sentence
resource for statistical machine translation systems. It contains 313,659
Chinese-English sentence pairs. (From Fung & Cheung, EMNLP 2004.)
- noisy parallel corpus
- contains non-aligned sentences that are nevertheless mostly bilingual
translations of the same document. Fung and McKeown (1997), Kikui (1999),
and Zhao and Vogel (2002) extracted bilingual word senses, lexicons, and
parallel sentence pairs from such corpora. A corpus such as Hong Kong
News contains documents that are in fact rough translations of each
other, focused on the same thematic topics, with some insertions and
deletions of paragraphs. (From Fung & Cheung, EMNLP 2004.) Zhao and Vogel
used a corpus of Chinese and English versions of news stories from the
Xinhua News agency, with “roughly similar sentence order of content”.
This corpus can be more accurately described as a noisy parallel corpus.
(From Wu & Fung, IJCNLP 2005.)
- comparable corpus
- contains non-sentence-aligned, non-translated bilingual documents that
are topic-aligned. For example, newspaper articles from two sources in
different languages, published within the same window of dates, can
constitute a comparable corpus. Rapp (1995), Grefenstette (1998), Fung
and Lo (1998), and Kaji (2003) derived bilingual lexicons or word senses
from such corpora. Munteanu et al. (2004) constructed a comparable corpus
of Arabic and English news stories by matching the publishing dates of
the articles. (From Fung & Cheung, EMNLP 2004.) Munteanu et al. used
comparable corpora of news articles published within the same 5-day
window. In both cases, the corpora contain documents on the same matching
topics; unlike our present objective of mining quasi-comparable corpora,
these other methods assume corpora of on-topic documents. (From Wu &
Fung, IJCNLP 2005.)
- quasi-comparable corpus
- contains non-aligned and non-translated bilingual documents that could
either be on the same topic (in-topic) or not (off-topic). The TDT3
corpus is a good source of truly non-parallel, quasi-comparable data. It
contains transcriptions of news stories from radio broadcasts and TV news
reports from 1998-2000 in English and Chinese. The corpus has about 7,500
Chinese and 12,400 English documents, covering more than 60 different
topics. Among these, 1,200 Chinese and 4,500 English documents are
manually marked as being in-topic. The remaining documents are marked as
off-topic, as they are either only weakly relevant to a topic or
irrelevant to all topics in the existing documents. Of the in-topic
documents, most are found to be comparable. A few of the Chinese and
English passages are almost translations of each other. Nevertheless, the
considerable number of off-topic documents gives rise to a greater
variety of sentences in terms of content and structure. Overall, the TDT3
corpus contains 110,000 Chinese sentences and 290,000 English sentences.
A very small number of the bilingual sentences are translations of each
other, while some others are bilingual paraphrases. (From Fung & Cheung,
COLING 2004.)
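
The date-window idea mentioned under the comparable corpus entry can be sketched as follows. This is a minimal illustration, not the procedure of Munteanu et al.: the function name `pair_by_date_window`, the `(doc_id, date)` input shape, and the reading of "5-day window" as "publication dates at most 5 days apart" are all assumptions made for the example.

```python
from datetime import date


def pair_by_date_window(source_docs, target_docs, window_days=5):
    """Pair documents across languages whose publication dates are close.

    source_docs, target_docs: lists of (doc_id, publication_date) tuples.
    Returns (source_id, target_id) pairs whose dates differ by at most
    window_days. Date proximity only suggests topical overlap; it does not
    guarantee that two articles are on the same topic.
    """
    pairs = []
    for src_id, src_date in source_docs:
        for tgt_id, tgt_date in target_docs:
            if abs((src_date - tgt_date).days) <= window_days:
                pairs.append((src_id, tgt_id))
    return pairs


if __name__ == "__main__":
    english = [("en1", date(2004, 1, 1)), ("en2", date(2004, 1, 20))]
    arabic = [("ar1", date(2004, 1, 4)), ("ar2", date(2004, 2, 1))]
    print(pair_by_date_window(english, arabic))
```

The nested loop is quadratic in the number of documents, which is acceptable for a sketch; a larger collection would more likely be bucketed by date first.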
References