Friday, April 26, 2013

Accurate Methods for the Statistics of Surprise and Coincidence. Ted Dunning. Computational Linguistics 1993.
  • Ideas
    • "ordinary words are 'rare', any statistical work with texts must deal with the reality of rare events ... Unfortunately, the foundational assumption of most common statistical analyses used in computational linguistics is that the events being analyzed are relatively common."
    • Counting a word w can be viewed as a series of Bernoulli trials: each token is tested to see whether it is w. Assuming a uniform probability p that it is w, the count is binomially distributed and, for np(1-p) > 5 (where n = number of tokens), approximately normally distributed. The normal approximation breaks down when np(1-p) < 1, which is the typical situation for rare words.
    • Given outcomes k, propose a model with parameters w. The likelihood of the parameter value w is P(k|w). A hypothesis is a subset of the parameter space W.
    • The likelihood ratio of the hypothesis (parameter subspace) W0 is LR = max_{w \in W0} P(k|w) / max_{w \in W} P(k|w).
    • Fact: -2 log LR is asymptotically \chi^2-distributed with dim W - dim W0 degrees of freedom (see the sketch after this entry's references).
  • References
    • More info on parametric and distribution-free tests: Bradley (1968), and Mood, Graybill, and Boes (1974).
    • Likelihood ratio tests: Mood et al. (1974)
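A minimal sketch (mine, not from the paper) of the likelihood-ratio test above, instantiated for the 2x2 contingency-table case Dunning applies to word co-occurrence counts; the function name and the example counts are illustrative.

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """-2 log LR for independence in a 2x2 table of co-occurrence counts.

    k11 = count(w1 with w2), k12 = count(w1 without w2),
    k21 = count(w2 without w1), k22 = count(neither).
    """
    def f(counts):
        # sum of k*log(k/total); empty cells contribute 0 (limit of k log k)
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    cells = [k11, k12, k21, k22]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    return 2 * (f(cells) - f(rows) - f(cols))

# Arbitrary counts; here dim W - dim W0 = 1, so compare against the chi^2
# critical value 3.84 (p < 0.05) or 6.63 (p < 0.01).
print(llr_2x2(110, 2442, 111, 29114))
```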

Thursday, April 25, 2013

Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension. Daniel Andrade, Takuya Matsuzaki, Jun'ichi Tsujii. TALIP 2012
  • Ideas
    • Use only statistically significant context---determined using Bayesian estimate of PMI
    • "calculate a similarity score ... using the probability that the same pivots [(words from the seed lexicon)] will be extracted for both the query word and the translation candidate."
    • "several context [features] ... a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence."
    • "In order to make these context positions comparable across Japanese and English ... we use several heuristics to adjust the dependency trees appropriately."
  • Comments
    •  "the degree of association is defined as a measurement for finding words that co-occur, or which do not co-occur, more often than we would expect by pure chance [e.g.] Log-Likelihood-Ratio ... As an alternative, we suggest to use the statistical significance of a positive association"
    • Heuristics to make dependency trees comparable are language-pair-specific (for EN-JP only)
  • Related work
    • Standard approach
      • Fix corpora in two languages, and pivot words (seed lexicon)
      • For each query word, construct vector of pivot words, and compare.
        • Construct vector: some measure of association between query word and pivot word
        • Compare: some similarity measure suitable for the association measure (a minimal sketch of this pipeline follows this related-work list)
      • "Context" is a bag-of-words, usually a sentence (or doc?).
    • Variations of standard approach
      • Peters and Picchi 1997: PMI
      • Fung 1998: tf.idf and cosine similarity
      • Rapp 1999: log-likelihood ratio and Manhattan distance
      • Koehn and Knight 2002: Spearman correlation
      • Pekar et al 2006: conditional probability
      • Laroche and Langlais 2010: log-odds-ratio and cosine similarity
    • Variations incorporating syntax
      • Rapp 1999: use word order (assumes word ordering is similar for both languages)
      • Pekar et al 2006: use verb-noun dependency
      • Otero & Campos 2008: POS tag the corpus and use lexico-syntactic patterns as features; e.g. extract (see, SUBJ, man) from "A man sees a dog." and use (see, SUBJ, *) to find translations for "man".
      • Garera et al 2009: use predecessors and successors in dependency graph (and do not use bag-of-words at all)
    • Variations incorporating non-pivot words (to overcome the "seed lexicon bottleneck")
      • Gaussier et al 2004: construct a vector over all words (not just pivot words) for the query word and for each pivot word. Then build the query's pivot-word vector as before, but fill each entry with the similarity between the all-words query vector and the all-words pivot vector, rather than a direct query-pivot association measure.
      • Dejean et al 2002: use domain-specific multilingual thesaurus
    • Variations incorporating senses
      • Ismail and Manandhar 2010: construct query vector "given" another word (the sense-disambiguator (SD) word, say). For a query word, one can construct different vectors given different SD words. For each vector, find translation.
    • Probabilistic approach
      • Haghighi et al 2008: use a generative model where source and target words are generated from a common latent subspace. Maximize likelihood in the graphical model to learn the source-target matchings.
        • "suffers from high computational costs ... They did not compare [with] ...  standard context vector approaches, which makes it difficult to estimate the possible gains from their method."
    • Graph-based approach
      • Laws et al 2010
        • one graph per language, words as nodes, 3 types of nodes (adjectives, verbs, nouns) and 3 types of edges (adjectival modification, verb-object relation, noun coordination), edge weights represent strength of correlation
        • seed lexicon for connecting the two graphs
        • node pair similarity computed using SimRank-like algorithm
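A minimal sketch of the "standard approach" summarized above (association vector over pivot words + similarity measure), under assumed data structures. PMI and cosine are just one of the association/similarity pairings from the list of variations, and the names below (cooc, freq, seed_lexicon) are illustrative.

```python
import math

def pmi_vector(word, pivots, cooc, freq, n_contexts):
    """Vector of PMI(word, pivot) over pivots that co-occur with `word` at least once."""
    return {p: math.log(cooc[(word, p)] * n_contexts / (freq[word] * freq[p]))
            for p in pivots if cooc.get((word, p), 0) > 0}

def cosine(u, v):
    num = sum(u[k] * v[k] for k in u.keys() & v.keys())
    den = (math.sqrt(sum(x * x for x in u.values())) *
           math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def rank_candidates(query_vec, seed_lexicon, candidate_vecs):
    """Map the source-pivot vector through the seed lexicon, then rank target candidates."""
    # assumes a one-to-one seed lexicon (source pivot -> target pivot)
    translated = {seed_lexicon[p]: score
                  for p, score in query_vec.items() if p in seed_lexicon}
    return sorted(candidate_vecs,
                  key=lambda c: cosine(translated, candidate_vecs[c]),
                  reverse=True)
```
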
Attended a literature review on Question Answering by Akihiro Katsura. Some interesting references.
  • Green, Chomsky, et al. 1961. The BASEBALL system.
    • rule-based
  • Isozaki et al. 2009.
    • machine learning-based
  • Methodology of QA
    • Question analysis
      • Xue et al. SIGIR 2008.---Retrieval models for QA archives
    • Text retrieval
      • Jones et al. IPM 2000. ---Okapi/BM25
      • Berger et al. SIGIR 2000.---Bridging the lexical chasm (OOV problem)
      • Brill et al. TREC 2001.---Data intensive QA
    • Answer candidate extraction
      • Lafferty et al. ICML 2001.---CRF
      • Ravichandran & Hovy. ACL 2002.---Surface text patterns for QA
    • Answer selection
      • Clarke et al. SIGIR 2001.---Redundancy in QA
  • Other related work
    • Cao et al. WWW 2008.---Recommending questions
    • Wang et al. SIGIR 2009.---Similar questions
    • Wang et al. SIGIR 2009.---answer ranking
    • Jeon et al. SIGIR 2006.---answer ranking with non-textual features (votes, etc.)

Wednesday, April 24, 2013

Identifying Word Translations from Comparable Documents Without a Seed Lexicon. Reinhard Rapp, Serge Sharoff, Bogdan Babych. LREC 2012
  • Idea
    • Assume only document-aligned comparable corpora (and no seed lexicon---"typically comprising at least 10,000 words")
    • Characterize each article by a set of keywords
    • "Formulate translation identification as a variant of the word alignment problem in a noisy setting"
      • actually solved using a neural net-style algorithm by Rumelhart & McClelland (1987)
  • Comments
    • "If ... in language A two words co-occur more often than expected by chance, then their translated equivalents in language B should also co-occur more frequently than expected."
  • Experiments
    • Preprocessing
      • lemmatization (of corpora and evaluation pairs)
      • "we use the log-likelihood score as a measure of keyness [or salience of words in a document], since it has been shown to be robust to small [documents] ... the threshold of 15.13 for the log-likelihood score is a conservative recommendation for statistical significance."
      • "[we] applied a threshold of five [occurrences] ... [and] added all words of the ... gold standard(s) [even if they were below the threshold]"
    • Gold standard
      • "The source language words in the gold standard were supposed to be systematically derived from a large corpus, covering a wide range of frequencies, parts of speech, and variances of their
        distribution. In addition, the corpus from which the gold standard was derived was supposed to be completely separate from the development set (Wikipedia)."
      • "list of words extracted from the British National Corpus (BNC) by Adam Kilgarriff for the purpose of examining distributional variability." http://kilgarriff.co.uk/bnc-readme.html

A Linguistically Grounded Graph Model for Bilingual Lexicon Extraction. Florian Laws, Lukas Michelbacher, Beate Dorow, Christian Scheible, Ulrich Heid, Hinrich Schutze. COLING 2010
  • TOREAD

Tuesday, April 23, 2013

Addressing polysemy in bilingual lexicon extraction from comparable corpora. Darja Fiser, Nikola Ljubesic, Ozren Kubelka. LREC 2012
  • Idea
    • Get source word senses (using sense tagger), construct context vectors for each sense, and then find target translation.
      • To compute sense-specific vectors: split occurrences of the source word into groups by sense, and build context vectors separately for each group (see the sketch at the end of this entry).
      • Translate context vectors into the target language using the seed lexicon
    • Combine info from several taggers to improve accuracy.
      • Take only those words where the tags of both taggers agree.
  • Comments (on the classical approach, using a context vector of words)
    • "The main idea behind [the classical] approach is the assumption that a source word and its translation appear in similar contexts in their respective languages, so that in order to identify them their contexts are compared via a seed dictionary (Fung, 1998; Rapp, 1999)"
    • "[the classical approach] approach gives good results for a specialized domain even though the seed dictionary is quite small (Fiser et al., 2011)."
    • "... for closely related languages, ... the same quality of the results can be achieved by exploiting the lexical overlap between the languages instead of using a seed dictionary (Ljubesic and Fiser, 2011).