ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

Mechanism analysis of the accelerator for k-nearest neighbor algorithm based on data partition

Cite this: https://doi.org/10.3969/j.issn.0253-2778.2018.04.009
  • Received Date: 19 September 2017
  • Revised Date: 11 April 2018
  • Publish Date: 30 April 2018
  • Because it makes no hypotheses about the underlying data distribution, is simple to implement, and has strong generalization ability, the k-nearest neighbor (kNN) classification algorithm is widely used in face recognition, text classification, sentiment analysis, and other fields. kNN requires no training process: it simply stores the training instances until an unlabeled instance arrives and then performs prediction. However, kNN must compute the similarity between the unlabeled instance and every training instance, which makes it difficult to handle large-scale data. To overcome this difficulty, the process of computing the nearest neighbors is converted into a constrained optimization problem, and an estimate is given of the difference between the optimal objective values obtained with and without data partition. Theoretical analysis of this estimate indicates that partitioning the data by clustering can reduce this difference, so the clustering-based k-nearest neighbor algorithm can retain strong generalization ability. Experimental results on public datasets show that the clustering-based k-nearest neighbor algorithm largely recovers the same nearest neighbors as the original kNN, thus obtaining higher classification accuracy.
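For concreteness, one standard way to express the nearest-neighbor computation as a constrained optimization problem (the paper's exact objective may differ in detail) is, for a query $x$ and training instances $x_1, \dots, x_n$:

$$\min_{\alpha \in \{0,1\}^n} \ \sum_{i=1}^{n} \alpha_i \,\lVert x - x_i \rVert^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i = k,$$

whose optimal $\alpha$ selects exactly the $k$ nearest neighbors of $x$. Partitioning the training set restricts the feasible set to the instances in the cluster(s) assigned to $x$, and the analysis summarized above estimates how much the optimal objective value can change under this restriction.

The sketch below illustrates, under assumptions of my own (plain k-means partitioning, searching only the single closest cluster, illustrative parameter names such as n_clusters and k), how a clustering-based kNN accelerator of this kind can be implemented; it is not the paper's algorithm verbatim.

```python
import numpy as np

def kmeans_partition(X, n_clusters=10, n_iter=50, seed=0):
    """Partition the training set X (n x d) into n_clusters groups with plain k-means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign every instance to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster becomes empty.
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers, labels

def knn_predict_clustered(x, X, y, centers, labels, k=5):
    """Predict the class of query x by majority vote among its k nearest neighbors,
    searched only inside the cluster whose centroid is closest to x
    (y must hold non-negative integer class labels)."""
    c = np.linalg.norm(centers - x, axis=1).argmin()   # closest cluster to the query
    idx = np.where(labels == c)[0]                     # training instances in that cluster
    dists = np.linalg.norm(X[idx] - x, axis=1)
    nearest = idx[np.argsort(dists)[:k]]               # k nearest within the cluster only
    return np.bincount(y[nearest]).argmax()            # majority vote
```

A typical (hypothetical) call is `centers, labels = kmeans_partition(X_train)` followed by `y_hat = knn_predict_clustered(x_query, X_train, y_train, centers, labels)`, so that only one cluster is scanned at query time instead of all n training instances.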
