Nearest Neighbor Methods in Statistics and Data Mining

Last update: 21 Apr 2025 21:17
First version: 24 February 2024

--- In lieu of explaining here, I will point readers at my lecture notes on this subject from my class on data mining / statistical learning (2022 version of said notes). Much of those notes are "Cover (1968a, 1968b) and Cover and Hart (1967) made even simpler"...

--- Something which, in retrospect, I should have paid more attention to: One-nearest-neighbor regression, and one-nearest-neighbor classification, "interpolates", i.e., it always fits the training data perfectly. (Every training point is its own nearest neighbor.) And yet 1NN regression or classification does generalize; in fact the classical analyses by Cover show that, as the sample size goes to infinity, its risk approaches twice the risk of the optimal regression function/classifier. (There are some continuity assumptions buried in that, but not onerous ones.) Now if the optimal classifier is only right 75% of the time, that's guaranteeing 1NN will approach a coin-toss, but if the optimal classifier has an error rate of, say, 0.001%, then 1NN will approach an error rate of no more than 0.002%, which doesn't sound that bad. The point, however, is that this clearly shows it's possible both to interpolate and to generalize. So some people, including me, really shouldn't have been so surprised when this turned out to happen a lot with deep neural networks.

To revisit: using clustering algorithms (like k-means) as locality sensitive hashing for finding nearest neighbors more efficiently. (This is not an original idea of mine, I just need to find where I saw it, and figure out if it really works...)

Lászlo Györfi, Michael Kohler, Adam Krzyzak and Harro Walk, A Distribution-Free Theory of Nonparametric Regression
Trevor Hastie and Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Alberto Abadie and Guido W. Imbens, "Large Sample Properties of Matching Estimators for Average Treatment Effects", Econometrica 74 (2006): 235--267 [PDF. "Matching" is essentially how you say "nearest-neighbors" in causal-inference-ese...]
Mona Azadkia, "Optimal choice of \( k \) for \( k \)-nearest neighbor regression", arxiv:1909.05495
Gérard Biau, Benoît Cadre, Laurent Rouvière, "Statistical analysis of \( k \)-nearest neighbor collaborative recommendation", Annals of Statistics 38 (2010): 1568--1592, arxiv:1010.0499
Samory Kpotufe, "k-NN Regression Adapts to Local Intrinsic Dimension", pp. 729--737 in NIPS-2011
Alexander Kraskov and Harald Stögbauer and Peter Grassberger, "Estimating Mutual Information", Physical Review E 69 (2004): 066138, arxiv:cond-mat/0305641

Thomas M. Cover
- "Estimation by the Nearest Neighbor Rule", IEEE Transactions on Information Theory 14 (1968): 50--55 [PDF reprint]
- "Rates of Convergence for Nearest Neighbor Procedures", pp. 413--415 in B. K. Kinariwala and F. F. Kuo (eds.), Proceedings of the Hawaii International Conference on Systems Sciences (Honolulu: University of Hawaii Press, 1968) [PDF reprint]
Thomas M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory 13 (1967): 21--27 [PDF reprint]

Octavio César Mesner and CRS, "Conditional Mutual Information Estimation for Mixed Discrete and Continuous Variables with Nearest Neighbors", IEEE Transactions on Information Theory 67 (2021): 464--484, arxiv:1912.03387

Marcello Pelillo, "Alhazen and the nearest neighbor rule", Pattern Recognition Letters 38 (2014): 34--37 [This sounds good to me, but I'd be very curious what professional historians of science would make of it.]

Gerard Biau and Luc Devroye, Lectures on the Nearest Neighbor Method
Benoit Cadre and Qian Dong, "Dimension reduction for regression estimation with nearest neighbor method", Electronic Journal of Statistics 4 (2010): 436--460
Deng Cai, "A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search", arxiv:1612.07545
Timothy I. Cannings, Thomas B. Berrett, Richard J. Samworth, "Local nearest neighbour classification with applications to semi-supervised learning", Annals of Statistics 48 (2020): 1789--1814
Yao-ban Chan, Peter Hall, "Robust nearest-neighbor methods for classifying high-dimensional data", Annals of Statistics 37 (2009): 3186--3203, arxiv:0909.0184
Yihua Li, Devavrat Shah, Dogyoon Song, Christina Lee Yu, "Nearest Neighbors for Matrix Estimation Interpreted as Blind Regression for Latent Variable Model", arxiv:1705.04867
Xingye Qiao, Jiexin Duan, Guang Cheng, "Rates of Convergence for Large-scale Nearest Neighbor Classification", arxiv:1909.01464
Yue Xing, Qifan Song, Guang Cheng, "Benefit of Interpolation in Nearest Neighbor Algorithms", arxiv:1909.11720