Fast Near Neighbor Search in High-Dimensional Binary Data

Shrivastava, Anshumali, and Ping Li. Fast Near Neighbor Search in High-Dimensional Binary Data. Cornell University Preprint 1813:37987, 2013, available at http://hdl.handle.net/1813/37987.
Fast Near Neighbor Search in High-Dimensional Binary Data Shrivastava, Anshumali; Li, Ping Numerous applications in search, databases, machine learning, and computer vision, can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating hash tables directly using the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections.