首页 | 本学科首页   官方微博 | 高级检索  
     检索      

M2LSH:基于LSH的高维数据近似最近邻查找算法
引用本文:李灿,钱江波,董一鸿,陈华辉.M2LSH:基于LSH的高维数据近似最近邻查找算法[J].电子学报,2017,45(6):1431.
作者姓名:李灿  钱江波  董一鸿  陈华辉
作者单位:宁波大学信息科学与工程学院,浙江宁波,315211
基金项目:国家自然科学基金,浙江省自然科学基金,宁波市自然科学基金
摘    要:在许多应用中,LSH(Locality Sensitive Hashing)以及各种变体,是解决近似最近邻问题的有效算法之一.虽然这些算法能够很好地处理分布比较均匀的高维数据,但从设计方案来看,都没有针对数据分布不均匀的情况做相应的优化.针对这一问题,本文提出了一种新的基于LSH的解决方案(M2LSH,2 Layers Merging LSH),对于数据分布不均匀的情况依然能得到一个比较好的查询效果.首先,将数据存放到具有计数功能的组合哈希向量表示的哈希桶中,然后通过二次哈希将这些桶号投影到一维空间,在此空间根据各个桶中存放的数据个数合并相邻哈希桶,使得新哈希桶中的数据量能够大致均衡.查询时仅访问有限个哈希桶,就能找到较优结果.本文给出了详细的理论分析,并通过实验验证了M2LSH的性能,不仅能减少访问时间,也可提高结果的正确率.

关 键 词:近似最近邻  KNN查询  局部敏感哈希  高维数据
收稿时间:2015-10-07

M2LSH:An LSH Based Technique for Approximate Nearest Neighbor Searching on High Dimensional Data
LI Can,QIAN Jiang-bo,DONG Yi-hong,CHEN Hua-hui.M2LSH:An LSH Based Technique for Approximate Nearest Neighbor Searching on High Dimensional Data[J].Acta Electronica Sinica,2017,45(6):1431.
Authors:LI Can  QIAN Jiang-bo  DONG Yi-hong  CHEN Hua-hui
Abstract:The LSH (Locality Sensitive Hashing) method and its variants represent the state-of-the-art indexing techniques to process ANN (Approximate Nearest Neighbor) searches in a high dimensional data space.Although the existing LSH based techniques can efficiently deal with uniform distributed data,it is noticed from the principle of their design schemes that these techniques cannot handle non-uniform distributed data well.To tackle the above problem,this paper presents a new LSH based technique,called M2LSH (2 Layers Merging LSH),to efficiently process ANN searches on non-uniform distributed data.The key idea is as follows.First,data objects are stored in a hash table with multiple buckets (i.e.,the first layer),each of which corresponds to a combined hash vector used as its hash number.The count of data objects in each hash bucket is recorded.Secondly,using the hash functions for a second layer hash table,the bucket numbers from the first layer are projected into a one-dimensional space.In this space,some adjacent hash buckets from the first layer are merged so as to make the data objects uniformly distributed in the merged buckets at the second layer.Therefore,the M2LSH can not only improve the searching efficiency but also increase the accuracy of the search results.This paper also provides a detailed theoretical accuracy analysis for the M2LSH technique.Experiments using synthetic and real data show that the theoretical estimates are quite accurate,and the proposed M2LSH technique can efficiently process ANN searches with low false positive and negative rates.
Keywords:approximate nearest neighbor (ANN)  k-nearest neighbor (KNN) search  locality sensitive hashing (LSH)  high dimensionality
本文献已被 万方数据 等数据库收录!
点击此处可从《电子学报》浏览原始摘要信息
点击此处可从《电子学报》下载免费的PDF全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号