Data Mining (36-350) Lecture Notes, Weeks 1--3
These handouts are a shameless rip-off of (that is, derivative work
amplifying and expanding) those created by Tom Minka when he
invented this course. (See his originals here.) Posted
here in response to a number (> 1) of requests.
Lecture 5 is also a shameless rip-off, or rather an explication,
of Aleks Jakulin's "Quantifying and Visualizing Attribute Interactions".
Note to students in 36-350: This page will not keep up to date with
the handouts, or with other course documents; use Blackboard!
Documents by Similarity (28 August 2006). Why similarity search? Defining
similarity and distance. The bag-of-words representation. Normalizations.
More on Similarity Search (30 August 2006). Stemming, linguistic issues. Picking
out good features, or at least ignoring non-discriminative ones. Inverse
document frequency. Using feedback from the searcher.
Images by Similarity (6 September 2006). Representation and
abstraction. How to search images without looking at images; a failure-mode.
The bag-of-colors representation. More examples. Invariance and
representation. See also: slides illustrating this lecture.
Informative Features (11--13 September 2006). More on finding good features.
Entropy and uncertainty. Information and entropy. Ranking features by
Among Features (18 September 2006). Redundancy and enhancement of
information. Information-sharing graphs. Examples.
Posted at September 16, 2006 12:56