October 12, 2006

Data Mining (36-350) Lecture Notes, Weeks 4--7

These handouts are derivative work (read: shamelessly ripped off), amplifying and expanding those created by Tom Minka when he invented this course. (See his originals here.) Posted here in response to a number (> 1) of requests. See here for the first three weeks' handouts.

Note to students in 36-350: This page will not be kept up to date with the handouts or with other course documents; use Blackboard!

  1. September 20 and 25 (Lecture 6): Partitioning Data into Clusters. Supervised and unsupervised learning. Social and organizational aspects of categorization. Finding categories in data via clustering. Characteristics of good clusters. The k-means algorithm for clustering. Search algorithms, search landscapes, hill climbing, local minima. Algorithms for hierarchical clustering. Avoiding spherical clusters. See also: slides to accompany the second half, showing clustering of images.
  2. September 27 (Lecture 7): Making Better Features. Transforming features to enhance invariance. Transforming features to improve their distribution. Projecting high-dimensional data into lower dimensions. Principal component analysis: informal description and example.
  3. October 2 (Lecture 8): More on Principal Component Analysis. Mathematical basis: maximizing the variance of the projected points. Mathematical basis: minimizing reconstruction error. Interpretation of PCA results.
  4. October 4: Review of course to date. (No handout.)
  5. October 9 (Lecture 9): Evaluating Predictive Models. Classification and linear regression as examples of predictive modeling. Error measures a.k.a. loss functions; examples. In-sample error. Out-of-sample or generalization error; why it matters, relation to in-sample error. Model selection. An example of over-fitting. Approaches to limiting over-fitting and its ill effects.
  6. October 11 (Lecture 10): Regression Trees. Difficulties of fitting global models in complex systems. Recursive partitioning and simple local models as a solution. Prediction trees in general. Regression trees in particular. An example. Tree growing. Tree pruning via cross-validation.

Corrupting the Young; Enigmas of Chance

Posted at October 12, 2006 11:40

Three-Toed Sloth