CS 790: Statistical Natural Language Processing:
Models and Methods, Fall 2006
Instructor: Shaojun Wang
Mondays and Wednesdays, 4:10-5:25,
Fawcett Hall, Room 105
Syllabus
- Foundations
Basic Concepts from Information Theory: Properties of entropy,
Kullback-Leibler divergence, mutual information, Shannon game, the source-channel
model. Applications: speech, translation,
information retrieval.
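As a rough illustration of the quantities listed above, the following minimal sketch computes entropy, Kullback-Leibler divergence, and mutual information for small discrete distributions. The function names and the toy coin and joint distributions are assumptions made for this example; they are not taken from the course materials.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits; p maps outcomes to probabilities."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def mutual_information(joint):
    """I(X;Y) in bits; joint maps (x, y) pairs to probabilities."""
    px, py = {}, {}
    for (x, y), pxy in joint.items():
        # Accumulate the two marginal distributions.
        px[x] = px.get(x, 0.0) + pxy
        py[y] = py.get(y, 0.0) + pxy
    return sum(pxy * math.log2(pxy / (px[x] * py[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

# Hypothetical toy distributions, chosen only for illustration.
fair = {"H": 0.5, "T": 0.5}
biased = {"H": 0.9, "T": 0.1}
copied = {("a", "a"): 0.5, ("b", "b"): 0.5}  # Y is an exact copy of X

print(entropy(fair))               # 1.0
print(kl_divergence(biased, fair))
print(mutual_information(copied))  # 1.0
```

For a uniform reference distribution over two outcomes, D(p || q) equals 1 - H(p), which gives a quick sanity check on the middle value.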
References for class lectures:
- Chapter 2, Cover and Thomas text
- Claude Shannon. A mathematical theory of communication.
Bell System Technical Journal, 27, pp. 379-423 and 623-656, 1948.
- Claude Shannon. Prediction and entropy of printed
English. Bell System Technical Journal, 30, pp. 50-64, 1951.
- P. Brown, S. Della Pietra, V. Della Pietra, J. Lai
and R. Mercer. An
estimate of an upper bound for the entropy of English.
Computational Linguistics, 18(1), pp. 31-40, 1992.
- Stanford Information Theory Class.
- MIT Information Theory Class.
Statistics and linguistics, linguistic
essentials: parts of speech, morphology, phrase structure, semantics,
pragmatics.
References:
- Steven Abney. Statistical methods and linguistics.
The Balancing Act, J. Klavans and P. Resnik, eds., MIT Press, 1996.
- Steven Abney. Statistical methods.
Encyclopedia of Cognitive Science, Nature Publishing Group,
Macmillan, 2002.
- F. Pereira. Formal grammar
and information theory: together again?. Philosophical Transactions
of the Royal Society, 358(1769):1239-1253, April 2000.
- R. Rosenfeld. Two decades of statistical language modeling:
Where do we go from here? Proceedings of the IEEE, 88(8),
pp. 1270-1278, 2000.
- Alan M. Turing. Computing machinery and intelligence.
Mind, pp. 433-460, 1950.
- Basics of Language Modeling and N-grams
Entropy rate of a stochastic process,
Shannon-McMillan-Breiman theorem, perplexity and alternative measures, data
sparseness, conditional modeling, history partitioning, word frequencies,
Zipf's law, type-token curves, vocabulary and n-gram growth.
References:
N-grams: smoothing, discounting, the Good-Turing estimate, the
zero frequency problem, the backoff model, a Dirichlet language model, n-gram
data structures, the CMU-Cambridge toolkit.
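To make the Good-Turing estimate concrete, here is a minimal sketch of the basic, unsmoothed formula: an item seen r times receives the adjusted count r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct items seen exactly r times, and the probability mass reserved for unseen items is N_1 / N. The function name, the fallback to the raw count when N_{r+1} = 0, and the toy token list are assumptions for this illustration; practical implementations smooth the N_r values first.

```python
from collections import Counter

def good_turing(counts):
    """Return (unseen_mass, adjusted_counts) from a dict of raw counts."""
    total = sum(counts.values())              # N: total number of tokens
    freq_of_freq = Counter(counts.values())   # N_r: number of types seen r times
    unseen_mass = freq_of_freq[1] / total     # N_1 / N
    adjusted = {}
    for word, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
        # Assumption: keep the raw count when N_{r+1} is zero, since the
        # unsmoothed formula would otherwise assign a count of zero.
        adjusted[word] = (r + 1) * n_r1 / n_r if n_r1 > 0 else float(r)
    return unseen_mass, adjusted

# Hypothetical toy corpus, chosen only for illustration.
tokens = "a a a b b c c d e f".split()
unseen_mass, adjusted = good_turing(Counter(tokens))
print(unseen_mass)    # mass reserved for unseen words
print(adjusted["d"])  # adjusted count for a word seen once
```

In this toy corpus three of the ten tokens are singletons, so 0.3 of the probability mass is set aside for words never observed.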
References:
- Jelinek text, chapter 15.
- Arthur Nadas. Good, Jelinek, Mercer, and Robbins on
Turing's estimate of probabilities. American Journal of Mathematical
and Management Sciences, 11, 229-308, 1991.
- Alon Orlitsky, Narayana P. Santhanam, Junan Zhang.
Always Good Turing: Asymptotically optimal probability estimation.
Science, 302(5644), pp. 427-431, 2003.
- Slava M. Katz. Estimation of probabilities from
sparse data for the language model component of a speech recognizer.
IEEE Transactions on Acoustics, Speech and Signal Processing,
35(3), pp. 400-401, 1987.
- S. Chen and J. Goodman. An
empirical study of smoothing techniques for language modeling.
Computer Speech & Language, 13(4), pp. 319-358, October 1999.
- Irving J. Good. The
population frequencies of species and the estimation of population
parameters. Biometrika, 40, pp. 237-264, 1953.
- Arthur Nadas, On Turing's formula for word
probabilities. IEEE Transactions on Acoustics, Speech, and Signal
Processing. 33(6), 1414-1416, 1985.
- H. Ney, U. Essen, and R. Kneser. On
structuring probabilistic dependences in stochastic language
modelling. Computer Speech & Language, 8(1), pp. 1-38, 1994.
- H. Ney, S. Martin, and F. Wessel. Statistical
language modeling using leaving-one-out, Corpus-based methods in
language and speech processing, S. Young and G. Bloothooft (Editors),
pp. 174-207, Kluwer Academic Publishers, 1997.
- Geoffrey Sampson's Good-Turing
Frequency Estimation page
- The EM Algorithm
The basic algorithm and example
applications. The mathematics underlying the algorithm.
References:
- Finite State Models
Markov chains, hidden Markov models and the
forward-backward algorithm, deleted interpolation, tagging.
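The forward pass of the forward-backward algorithm can be sketched as follows: it computes the likelihood of an observation sequence under a hidden Markov model by summing over all hidden state paths in time linear in the sequence length. The function signature and the two-state toy tagger with made-up probabilities are assumptions for this illustration only.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) under the HMM, summed over all hidden state paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-tag model (N = noun, V = verb) with invented parameters.
states = ("N", "V")
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"fish": 0.7, "swim": 0.3}, "V": {"fish": 0.4, "swim": 0.6}}

likelihood = forward(["fish", "swim"], states, start_p, trans_p, emit_p)
print(likelihood)
```

A useful check on any such implementation is that the likelihoods of all observation sequences of a fixed length sum to one.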
References:
- Stochastic Grammars
The Chomsky hierarchy, probabilistic context free grammars and the
inside-outside algorithm, lexicalized probabilistic parsing, phrase
structure grammars and dependency grammars, link grammars, the structured
language model, language and complexity.
References:
- John Lafferty's notes on probabilistic context free grammars. 2001.
- F. Jelinek, J. Lafferty, and R. Mercer. Basic
methods of probabilistic context free grammars. Speech Recognition and
Understanding, P. Laface and R. De Mori, eds., Springer, pp. 347-360,
1992.
- K. Lari and S. Young. The estimation of stochastic
context-free grammars using the inside-outside algorithm. Computer
Speech & Language, 4, pp. 35-56, 1990.
- Steven Abney, David McAllester, and Fernando Pereira.
Relating probabilistic grammars and automata. Proceedings of the
37th Annual Meeting of the ACL, pp. 542-549, 1999.
- T. Booth and R. Thompson. Applying probability
measures to abstract languages. IEEE Transactions on Computers, 22,
pp. 442-450, 1973.
- E. Charniak. Immediate-head parsing for language models. Proceedings
of the 39th Annual Meeting of the ACL, pp. 116-123, 2001.
- C. Chelba and F. Jelinek. Structured language modeling.
Computer Speech and Language, 14(4), pp. 283-332, 2000.
- Z. Chi. Statistical properties of probabilistic context-free grammars.
Computational Linguistics, 25(1), pp. 131-160, 1999.
- Michael Collins. Three generative, lexicalised
models for statistical parsing. Proceedings of the 35th Annual Meeting
of the ACL, 1997.
- Michael Collins. A new statistical parser based on
bigram lexical dependencies. Proceedings of the 34th Annual Meeting of
the ACL, 1996.
- Brian Roark. Probabilistic top-down
parsing and language modeling. Computational Linguistics,
27(2), pp. 249-285, 2001.
- A. Stolcke. An efficient probabilistic
context-free parsing algorithm that computes prefix probabilities.
Computational Linguistics, 21(2), pp. 165-201, 1995.
- See also chapters 11-12 of the Manning text.
- See also chapters 9-13 of the Jurafsky text.
- Latent Semantic Analysis
The singular value decomposition (SVD),
latent semantic indexing (LSI), dimensionality reduction, non-negative
factorizations.
References:
- J. Bellegarda. Exploiting
latent semantic information in statistical language modeling.
Proceedings of the IEEE, 88(8), pp. 1279-1296, 2000.
- M. Berry, S. Dumais, and G. O'Brien. Using
linear algebra for intelligent information retrieval. SIAM
Review, 37(4), pp. 573-595, 1995.
- Thomas Hofmann. Unsupervised learning by probabilistic latent
semantic analysis. Machine Learning, 42(1), pp. 177-196, 2001.
- D. Blei, A. Ng and M. Jordan. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993-1022, 2003
- S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R.
Harshman. Indexing by latent semantic analysis. Journal of the
American Society for Information Science, 41(6), pp. 391-407, 1990.
- Colorado LSA homepage
- Tennessee LSI web page
- Maximum Entropy
Random fields and exponential models,
duality: maximum likelihood and maximum entropy, iterative scaling,
information geometry and alternating minimization, prior and regularization, conditional random fields.
References:
- Adam Berger's tutorial on MaxEnt
- A. Berger, S. Della Pietra, and V. Della Pietra.
A maximum entropy approach to natural language processing.
Computational Linguistics, 22(1), 1996.
- S. Della Pietra, V. Della Pietra and J. Lafferty.
Inducing features of random fields. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 19(4), pp. 380-393, 1997.
- S. Chen and R. Rosenfeld. A
survey of smoothing techniques for ME models. IEEE Trans. Speech
and Audio Processing, 8(1), pp. 37-50, 2000.
- J. Lafferty, A. McCallum and F. Pereira.
Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
Proceedings of the 18th International Conference on Machine Learning, 2001.
- R. Rosenfeld. A
maximum entropy approach to adaptive statistical language modeling.
Computer Speech and Language, 10, pp. 187-228, 1996. slides.
- J. Darroch and D. Ratcliff. Generalized
iterative scaling for log-linear models. The Annals of Mathematical
Statistics, 43(5), pp. 1470-1480, 1972.
- E. Jaynes. Papers on Probability, Statistics, and Statistical
Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
- S. Khudanpur and J. Wu. Maximum
entropy techniques for exploiting syntactic, semantic and collocational
dependencies in language modeling. Computer Speech and
Language, 14(4), pp. 355-372, 2000.
- R. Lau. Adaptive statistical language
modeling, S.M. Thesis, EECS Department, MIT, 1994
- A. Ratnaparkhi. Learning
to parse natural language with maximum entropy models, Machine
Learning, 34, pp. 151-175, 1999.
- See also chapters 13-14 of the Jelinek text.
- Machine Translation and Information Retrieval
Statistical machine translation, statistical
information retrieval, statistical information extraction.
References:
- P. Brown, J. Cocke, S. Della Pietra, V. Della
Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin.
A statistical approach to machine translation.
Computational Linguistics, 16, pp. 79-85, 1990.
- P. Brown, S. Della Pietra, V. Della Pietra, and R.
Mercer. The
mathematics of statistical machine translation: parameter estimation.
Computational Linguistics, 19(2), pp. 263-311, 1993.
- Jamie Callan's slides on statistical information retrieval. 2001.
- The Lemur Toolkit for Language Modeling and Information Retrieval.
- Language Modeling of Biological Data
Biological sequences: nucleotide bases,
amino acids, proteins; biological structures: primary structure, secondary
structure, tertiary structure, quaternary structure.
References:
- Durbin text.
- David B. Searls. The
language of genes. Nature, 420(14), pp. 211-217, 2002.
- David B. Searls. The linguistics of DNA,
American Scientist, 80(6), pp. 579-591, 1992.
- David B. Searls. Computational
linguistics of biological sequences, Artificial Intelligence and
Molecular Biology, Lawrence Hunter (Editor), MIT Press, 1993.
- Workshop on
language modeling of biological data, 2001.
Reference Texts
- Frederick Jelinek, Statistical
Methods for Speech Recognition, MIT Press, 1997.
- Christopher Manning and Hinrich Schuetze, Foundations of Statistical Natural
Language Processing, MIT Press, 1999.
- Daniel Jurafsky and James Martin, Speech and Language
Processing, Prentice Hall, 2000.
- Thomas M. Cover and Joy A. Thomas, Elements of
Information Theory, Wiley-Interscience, 1991
- Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison, Biological
Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids,
Cambridge University Press, 1998.