TY  - THES
T1  - Probabilistic Hashing Techniques For Big Data
T2  - Computer Science
Y1  - 2015
A1  - Anshumali Shrivastava
AB  - We investigate probabilistic hashing techniques for addressing computational and memory challenges in large scale machine learning and data mining systems. In this thesis, we show that the traditional idea of hashing goes far beyond near-neighbor search and there are some striking new possibilities. We show that hashing can improve state of the art large scale learning algorithms, and it goes beyond the conventional notions of pairwise similarities. Despite being a very well studied topic in literature, we found several opportunities for fundamentally improving some of the well know textbook hashing algorithms. In particular, we show that the traditional way of computing minwise hashes is unnecessarily expensive and without loosing anything we can achieve an order of magnitude speedup. We also found that for cosine similarity search there is a better scheme than SimHash. In the end, we show that the existing locality sensitive hashing framework itself is very restrictive, and we cannot have efficient algorithms for some important measures like inner products which are ubiquitous in machine learning. We propose asymmetric locality sensitive hashing (ALSH), an extended framework, where we show provable and practical efficient algorithms for Maximum Inner Product Search (MIPS). Having such an efficient solutions to MIPS directly scales up many popular machine learning algorithms. We believe that this thesis provides significant improvements to some of the heavily used subroutines in big-data systems, which we hope will be adopted.
JF  - Computer Science
PB  - Cornell University
VL  - Ph.D.
UR  - https://ecommons.cornell.edu/handle/1813/40886
ER  - 

TY  - CONF
T1  - b-Bit Minwise Hashing in Practice
T2  - Internetware'13
Y1  - 2013
A1  - Ping Li
A1  - Anshumali Shrivastava
A1  - König, Arnd Christian
AB  - Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20   80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented.
JF  - Internetware'13
UR  - http://www.nudt.edu.cn/internetware2013/
ER  - 

TY  - CONF
T1  - Beyond Pairwise: Provably Fast Algorithms for Approximate K-Way Similarity Search
T2  - Neural Information Processing Systems (NIPS)
Y1  - 2013
A1  - Anshumali Shrivastava
A1  - Ping Li
JF  - Neural Information Processing Systems (NIPS)
ER  - 

TY  - CONF
T1  - Fast Multi-task Learning for Query Spelling Correction
T2  - The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012)
Y1  - 2012
A1  - Xu Sun
A1  - Anshumali Shrivastava
A1  - Ping Li
JF  - The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012)
UR  - http://dx.doi.org/10.1145/2396761.2396800
ER  - 

TY  - CONF
T1  - Fast Near Neighbor Search in High-Dimensional Binary Data
T2  - The European Conference on Machine Learning (ECML 2012)
Y1  - 2012
A1  - Anshumali Shrivastava
A1  - Ping Li
JF  - The European Conference on Machine Learning (ECML 2012)
ER  - 

TY  - CONF
T1  - GPU-based minwise hashing: GPU-based minwise hashing
T2  - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)
Y1  - 2012
A1  - Ping Li
A1  - Anshumali Shrivastava
A1  - Arnd Christian König
JF  - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)
UR  - http://doi.acm.org/10.1145/2187980.2188129
ER  - 

TY  - CONF
T1  - Query spelling correction using multi-task learning
T2  - Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume)
Y1  - 2012
A1  - Xu Sun
A1  - Anshumali Shrivastava
A1  - Ping Li
JF  - Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume)
UR  - http://doi.acm.org/10.1145/2187980.2188153
ER  -