论文标题
有限的保证算法,可通过最大可能性最小化凹杂质
Bounded Guaranteed Algorithms for Concave Impurity Minimization Via Maximum Likelihood
论文作者
论文摘要
分区算法在许多科学和工程学科中起关键作用。分区算法将集合划分为许多不相交的子集或分区。通常,所得分区的质量是通过每个分区中的杂质量来衡量的,分区质量越小。通常,对于由分区的函数指定的给定杂质度量,找到最小杂质分区是一个NP-固定问题。让$ m $是集合中$ n $维元素的数量,而$ k $是所需分区的数量,然后在所有可能的分区上进行详尽的搜索以找到最低分区的分区,其复杂性(k^m)$的复杂性很快对于许多$ k $和$ k $和$ m $ abor的应用程序很快就变得不切实际。因此,已经提出了许多具有多项式时间复杂性的近似算法,但是很少有有限的保证。在本文中,构建了一类杂质功能的上限和下限。基于这些界限,我们提出了一种基于最大似然原理的限制性保证的低复杂性分区。对算法的有限保证的理论分析是针对两个众所周知的杂质函数Gini指数和熵的。当$ k \ geq n $时,提议的算法以最低的近似值和多项式时间复杂性$ o(nm)$来实现最先进的结果。此外,提出了一种$ o((n-k)n^2+nm)$的启发式贪婪算法,以$ k <n $提议。尽管贪婪的算法没有提供有限的保证,但其性能与最先进的方法相当。我们的结果还概括了一些知名的信息理论界限,例如Fano的不平等和Boyd-Chiang的界限。
Partitioning algorithms play a key role in many scientific and engineering disciplines. A partitioning algorithm divides a set into a number of disjoint subsets or partitions. Often, the quality of the resulted partitions is measured by the amount of impurity in each partition, the smaller impurity the higher quality of the partitions. In general, for a given impurity measure specified by a function of the partitions, finding the minimum impurity partitions is an NP-hard problem. Let $M$ be the number of $N$-dimensional elements in a set and $K$ be the number of desired partitions, then an exhaustive search over all the possible partitions to find a minimum partition has the complexity of $O(K^M)$ which quickly becomes impractical for many applications with modest values of $K$ and $M$. Thus, many approximate algorithms with polynomial time complexity have been proposed, but few provide bounded guarantee. In this paper, an upper bound and a lower bound for a class of impurity functions are constructed. Based on these bounds, we propose a low-complexity partitioning algorithm with bounded guarantee based on the maximum likelihood principle. The theoretical analyses on the bounded guarantee of the algorithms are given for two well-known impurity functions Gini index and entropy. When $K \geq N$, the proposed algorithm achieves state-of-the-art results in terms of lowest approximations and polynomial time complexity $O(NM)$. In addition, a heuristic greedy-merge algorithm having the time complexity of $O((N-K)N^2+NM)$ is proposed for $K<N$. Although the greedy-merge algorithm does not provide a bounded guarantee, its performance is comparable to that of the state-of-the-art methods. Our results also generalize some well-known information-theoretic bounds such as Fano's inequality and Boyd-Chiang's bound.