December 28, 2008

Statistics 36-350: Data Mining (Fall 2008)

Since class begins Monday, this is a good time for the public website to make its appearance. As before, lecture notes will also be posted here; you can use the RSS feed for this entry to keep track of them.

  1. Introduction to the course (25 August)
  2. Information retrieval and similarity searching (25 August)
  3. Multidimensional scaling and a first glance at classification (27 August)
  4. A little about page-rank (29 August)
    Homework #1, due 8 September: assignment, R, newsgroups.tgz data file
    Solutions
  5. Image search, abstraction and invariance; the accompanying slides (8 September)
  6. Finding informative features (10 September)
    Additional reading: David Feldman, "Introduction to Information Theory", chapter 1
  7. Information and interaction among features (12 September)
    Additional reading: Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", arxiv:cs.AI/0308002
    Homework #2, due 22 September: assignment
    solutions, solutions code
    Note: Information theory, axiomatic foundations, connections to statistics — elaboration on some points raised in lecture (12 September)
  8. Categorization: types of categorization, basic classifiers and finding simple clusters in data (15 September)
  9. Hierarchical clustering; how many clusters? (17 September)
  10. Yet more clustering (19 September; slides)
  11. Making better features: transformations, principal components (22 September)
  12. Mathematics of principal components analysis; interpretations and limitations of PCA (24 September)
  13. Yet more on linear dimensionality reduction: PCA + information retrieval = Latent semantic indexing. Factor analysis: motivations, historical roots, preliminaries to estimation (26 September)
    Optional reading: Deerwester et al., "Indexing by Latent Semantic Analysis" [PDF]
    Optional reading: Landauer and Dumais, "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge" [PDF]
    Optional reading: Thurstone, "The Vectors of Mind"
    Homework #3, due 3 October: assignment
  14. More on factor analysis: estimation and the rotation problem (29 September)
  15. Principal Components versus Factor Analysis: worked examples, basic goodness-of-fit testing for factor analysis; R code for lecture (1 October)
  16. The truth about principal components and factor analysis: strengths, limitations, factor models as graphical models, factor models and mixture models, Thomson's sampling model; R code for Thomson's model (3 October)
    Homework #4, due Friday, 10 October: assignment, nci.kmeans, nci.pca2.kmeans
  17. Regression: predicting quantitative features: point prediction; expectations and mean-square optimality; regression functions; regression as smoothing; linear regression as linear smoothing; other kinds of linear smoothers; nearest-neighbor regression; kernel regression. R code for figures, data for running example (6 October)
  18. The truth about linear regression: optimal linear prediction; shifting distributions and omitted variables; rights and obligations of probabilistic assumptions; abuses of linear regression; how to hurt angels (8 October)
  19. Extending linear regression: weighted least-squares, heteroskedasticity, local linear regression. R code for figures, data for running example (10 October)
  20. Mid-term review (13 October; no hand-out)
  21. Mid-term: exam, solutions (15 October)
  22. Evaluating predictive models: in-sample and generalization error; over-fitting and under-fitting; model selection, capacity control, cross-validation. R for figures. (20 October)
  23. Using cross-validation: mechanics and examples (22 October; notes forthcoming)
  24. Using non-parametric smoothing: adaptive smoothing, testing parametric forms (24 October; notes forthcoming)
    Homework #5, due Friday, 31 October: assignment; solutions
  25. Prediction trees 1: mostly regression trees, plus a "classification tree we can believe in" (27 October)
  26. Prediction trees 2: classification trees (29 October and 3 November)
  27. Bootstrapping, Bagging, and Random Forests (5 November)
  28. Combining Predictive Models and the Power of Diversity (7 November)
  29. Linear Classifiers and the Perceptron Algorithm (10 November)
  30. Logistic Regression and Newton's Method (12 November)
    Homework #7, due Friday, 21 November: assignment; solutions
  31. Neural Networks: The Mathematical Reality (14 November)
  32. Neural Networks: The Biological Myth (17 November)
  33. Support Vector Machines (19 November)
  34. Support vector machines continued (21 November; same handout as previous)
    Homework #8, due Monday, 1 December: assignment; solutions
  35. The Lecture Full of Fail: The wrong data, lying data, covariate shift, low base-rates and overwhelming false positives, response Waste, fraud and abuse (24 November)
    Homework #9, due 15 December: assignment; solutions

Corrupting the Young; Enigmas of Chance

Posted at December 28, 2008 10:49

Three-Toed Sloth