CS790: Information Theory, Machine Learning and Statistics
Fall 2007


Suggested Reading List

Entropy, Mutual Information and Semi-Supervised Learning

[A04] S. Abney (2004). Understanding the Yarowsky algorithm. Computational Linguistics 30(3).

[CJ06] A. Corduneanu and T. Jaakkola (2006). Data dependent regularization. In Semi-supervised learning. MIT Press.

[HS07] G. Haffari and A. Sarkar (2007). Analysis of semi-supervised learning with the Yarowsky algorithm. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI 2007).

Data Compression and Online Learning

[AW01] K. Azoury and M. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246.

[BRY98] A. Barron, J. Rissanen and B. Yu (1998). The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(5):2734-2760.

[H97] D. Haussler (1997). A general minimax result for relative entropy. IEEE Transactions on Information Theory, 43(4):1276-1280.

[HSSW98] D. Helmbold, R. Schapire, Y. Singer, and M. Warmuth (1998). On-line portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325-347.

[LB04] F. Liang and A. R. Barron (2004). Exact minimax strategies for predictive density estimation, data compression, and model selection. IEEE Transactions on Information Theory, 50:2708-2726.

[XB00] Q. Xie and A. Barron (2000). Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, 46:431-445.

Duality between Channel Capacity and Rate Distortion

[CC02] T. Cover and M. Chiang (2002). Duality between channel capacity and rate distortion with two-sided state information. IEEE Transactions on Information Theory, 48(6):1629-1638.

[CB04] M. Chiang and S. Boyd (2004). Geometric programming duals of channel capacity and rate distortion. IEEE Transactions on Information Theory, 50(2):245-258.

Maximum Entropy, Information Geometry and Boosting

[BDD96] A. Berger, S. Della Pietra, and V. Della Pietra (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1).

[CT84] I. Csiszár and G. Tusnády (1984). Information geometry and alternating minimization procedures. Statistics & Decisions, 1:205-237.

[CSS02] M. Collins, R. Schapire and Y. Singer (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1/2/3).

[DDL97] S. Della Pietra, V. Della Pietra and J. Lafferty (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393.

[DDL] S. Della Pietra, V. Della Pietra and J. Lafferty (2001). Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, CMU.

[JDD97] J. Lafferty, S. Della Pietra and V. Della Pietra (1997). Statistical learning algorithms based on Bregman distances. Proceedings of 1997 Canadian Workshop on Information Theory, 77-80.

[L99] J. Lafferty (1999). Additive models, boosting, and inference for generalized divergences. Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT'99).

[LMP01] J. Lafferty, A. McCallum and F. Pereira (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML).

[LL01] G. Lebanon and J. Lafferty (2001). Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems (NIPS), 14.

Coding Theory and Inference in Graphical Models

[GJ07] A. Globerson and T. Jaakkola (2007). Approximate inference using conditional entropy decompositions. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics.

[KFL01] F. Kschischang, B. Frey, and H. Loeliger (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519.

[MU06] A. Montanari and R. Urbanke (2006). Modern coding theory: the statistical mechanics and computer science points of view.

[RU01] T. Richardson and R. Urbanke (2001). The capacity of low-density parity-check codes under message-passing decoding. IEEE Transactions on Information Theory, 47(2):599-618.

[SJ07] D. Sontag and T. Jaakkola (2007). New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems (NIPS).

[WJW05a] M. Wainwright, T. Jaakkola, and A. Willsky (2005). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory, 51(11):3697-3717.

[WJW05b] M. Wainwright, T. Jaakkola, and A. Willsky (2005). A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313-2335.

[WJ03] M. Wainwright and M. Jordan (2003). Graphical models, exponential families, and variational inference. UC Berkeley, Dept. of Statistics, Technical Report 649.

[WJW01] M. Wainwright, T. Jaakkola and A. Willsky (2001). Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 45(9):1120-1146.

[YFW05] J. Yedidia, W. Freeman, and Y. Weiss (2005). Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312.

[YFW03] J. Yedidia, W. Freeman and Y. Weiss (2003). Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, Chapter 8, 239-269.