TY - CONF
T1 - b-Bit Minwise Hashing in Practice
T2 - Internetware'13
Y1 - 2013
A1 - Ping Li
A1 - Anshumali Shrivastava
A1 - König, Arnd Christian
AB - Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200GB are presented.
JF - Internetware'13
UR - http://www.nudt.edu.cn/internetware2013/
ER -

TY - CONF
T1 - Beyond Pairwise: Provably Fast Algorithms for Approximate K-Way Similarity Search
T2 - Neural Information Processing Systems (NIPS)
Y1 - 2013
A1 - Anshumali Shrivastava
A1 - Ping Li
JF - Neural Information Processing Systems (NIPS)
ER -

TY - CONF
T1 - Exact Sparse Recovery with L0 Projections
T2 - 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Y1 - 2013
A1 - Ping Li
A1 - Cun-Hui Zhang
JF - 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
ER -

TY - CHAP
T1 - Entropy Estimations Using Correlated Symmetric Stable Random Projections
T2 - Advances in Neural Information Processing Systems 25
Y1 - 2012
A1 - Ping Li
A1 - Cun-Hui Zhang
ED - P. Bartlett
ED - F.C.N. Pereira
ED - C.J.C. Burges
ED - L. Bottou
ED - K.Q. Weinberger
JF - Advances in Neural Information Processing Systems 25
UR - http://books.nips.cc/papers/files/nips25/NIPS2012_1456.pdf
ER -

TY - CONF
T1 - Fast Multi-task Learning for Query Spelling Correction
T2 - The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)
Y1 - 2012
A1 - Xu Sun
A1 - Anshumali Shrivastava
A1 - Ping Li
JF - The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)
UR - http://dx.doi.org/10.1145/2396761.2396800
ER -

TY - CONF
T1 - Fast Near Neighbor Search in High-Dimensional Binary Data
T2 - The European Conference on Machine Learning (ECML 2012)
Y1 - 2012
A1 - Anshumali Shrivastava
A1 - Ping Li
JF - The European Conference on Machine Learning (ECML 2012)
ER -

TY - CONF
T1 - GPU-based minwise hashing
T2 - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)
Y1 - 2012
A1 - Ping Li
A1 - Anshumali Shrivastava
A1 - Arnd Christian König
JF - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)
UR - http://doi.acm.org/10.1145/2187980.2188129
ER -

TY - CHAP
T1 - One Permutation Hashing
T2 - Advances in Neural Information Processing Systems 25
Y1 - 2012
A1 - Ping Li
A1 - Art Owen
A1 - Cun-Hui Zhang
ED - P. Bartlett
ED - F.C.N. Pereira
ED - C.J.C. Burges
ED - L. Bottou
ED - K.Q. Weinberger
JF - Advances in Neural Information Processing Systems 25
UR - http://books.nips.cc/papers/files/nips25/NIPS2012_1436.pdf
ER -

TY - CONF
T1 - Query spelling correction using multi-task learning
T2 - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)
Y1 - 2012
A1 - Xu Sun
A1 - Anshumali Shrivastava
A1 - Ping Li
JF - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)
UR - http://doi.acm.org/10.1145/2187980.2188153
ER -

TY - JOUR
T1 - Testing for Membership to the IFRA and the NBU Classes of Distributions
JF - Journal of Machine Learning Research - Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012)
Y1 - 2012
A1 - Radhendushka Srivastava
A1 - Ping Li
A1 - Debasis Sengupta
VL - 22
UR - http://jmlr.csail.mit.edu/proceedings/papers/v22/srivastava12.html
ER -