TY - THES T1 - Probabilistic Hashing Techniques For Big Data T2 - Computer Science Y1 - 2015 A1 - Anshumali Shrivastava AB - We investigate probabilistic hashing techniques for addressing computational and memory challenges in large scale machine learning and data mining systems. In this thesis, we show that the traditional idea of hashing goes far beyond near-neighbor search and there are some striking new possibilities. We show that hashing can improve state of the art large scale learning algorithms, and it goes beyond the conventional notions of pairwise similarities. Despite being a very well studied topic in literature, we found several opportunities for fundamentally improving some of the well know textbook hashing algorithms. In particular, we show that the traditional way of computing minwise hashes is unnecessarily expensive and without loosing anything we can achieve an order of magnitude speedup. We also found that for cosine similarity search there is a better scheme than SimHash. In the end, we show that the existing locality sensitive hashing framework itself is very restrictive, and we cannot have efficient algorithms for some important measures like inner products which are ubiquitous in machine learning. We propose asymmetric locality sensitive hashing (ALSH), an extended framework, where we show provable and practical efficient algorithms for Maximum Inner Product Search (MIPS). Having such an efficient solutions to MIPS directly scales up many popular machine learning algorithms. We believe that this thesis provides significant improvements to some of the heavily used subroutines in big-data systems, which we hope will be adopted. JF - Computer Science PB - Cornell University VL - Ph.D. UR - https://ecommons.cornell.edu/handle/1813/40886 ER - TY - CONF T1 - b-Bit Minwise Hashing in Practice T2 - Internetware'13 Y1 - 2013 A1 - Ping Li A1 - Anshumali Shrivastava A1 - König, Arnd Christian AB - Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20   80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented. JF - Internetware'13 UR - http://www.nudt.edu.cn/internetware2013/ ER - TY - CONF T1 - Beyond Pairwise: Provably Fast Algorithms for Approximate K-Way Similarity Search T2 - Neural Information Processing Systems (NIPS) Y1 - 2013 A1 - Anshumali Shrivastava A1 - Ping Li JF - Neural Information Processing Systems (NIPS) ER - TY - CONF T1 - Fast Multi-task Learning for Query Spelling Correction T2 - The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012) Y1 - 2012 A1 - Xu Sun A1 - Anshumali Shrivastava A1 - Ping Li JF - The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012) UR - http://dx.doi.org/10.1145/2396761.2396800 ER - TY - CONF T1 - Fast Near Neighbor Search in High-Dimensional Binary Data T2 - The European Conference on Machine Learning (ECML 2012) Y1 - 2012 A1 - Anshumali Shrivastava A1 - Ping Li JF - The European Conference on Machine Learning (ECML 2012) ER - TY - CONF T1 - GPU-based minwise hashing: GPU-based minwise hashing T2 - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume) Y1 - 2012 A1 - Ping Li A1 - Anshumali Shrivastava A1 - Arnd Christian König JF - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume) UR - http://doi.acm.org/10.1145/2187980.2188129 ER - TY - CONF T1 - Query spelling correction using multi-task learning T2 - Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume) Y1 - 2012 A1 - Xu Sun A1 - Anshumali Shrivastava A1 - Ping Li JF - Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume) UR - http://doi.acm.org/10.1145/2187980.2188153 ER -