k-Means+++: Outliers-Resistant Clustering

Preprint

Article

A peer-reviewed article of this preprint also exists.

This version is not peer-reviewed

Submitted: 18 November 2020

Posted: 19 November 2020

Abstract

The $k$-means problem is to compute a set of $k$ centers (points) that minimizes the sum of squared distances to a given set of $n$ points in a metric space. Arguably the most common algorithm for solving it is $k$-means++, which is easy to implement and provides a provably small approximation factor in time that is linear in $n$. We generalize $k$-means++ to support: (i) non-metric spaces and any pseudo-distance function. In particular, it supports M-estimator functions that handle outliers, e.g., where the distance $\mathrm{dist}(p,x)$ between a pair of points is replaced by $\min \{\mathrm{dist}(p,x),1\}$. (ii) $k$-means clustering with $m\geq 1$ outliers, i.e., where the $m$ farthest points from the $k$ centers are excluded from the total sum of distances. This is the first algorithm whose running time is linear in $n$ and polynomial in $k$ and $m$.
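To make the two notions in the abstract concrete, the following is a minimal sketch, not the algorithm of the paper. It assumes Euclidean points given as tuples and shows (a) the capped pseudo-distance $\min \{\mathrm{dist}(p,x),1\}$, (b) a $k$-means++-style seeding loop in which the sampling weight is an arbitrary cost function, and (c) the $k$-means cost with the $m$ farthest points excluded. The function names `capped_dist`, `seed_centers`, and `cost_with_outliers`, the `cap` parameter, and the choice to sample proportionally to the capped distance are all illustrative assumptions.

```python
import math
import random


def capped_dist(p, x, cap=1.0):
    """Pseudo-distance min{dist(p, x), cap}; cap=1 matches the abstract's example.

    Assumption: dist is the Euclidean distance between two coordinate tuples.
    """
    return min(math.dist(p, x), cap)


def seed_centers(points, k, cost=capped_dist, rng=None):
    """k-means++-style seeding with a pluggable cost (illustrative sketch).

    The first center is chosen uniformly at random; each later center is
    sampled with probability proportional to its cost to the nearest
    center chosen so far.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    nearest = [cost(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(nearest) == 0:
            # Every point already has zero cost; fall back to uniform choice.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=nearest, k=1)[0])
        # Update each point's cost to its nearest chosen center.
        nearest = [min(d, cost(p, centers[-1])) for p, d in zip(points, nearest)]
    return centers


def cost_with_outliers(points, centers, m):
    """Sum of squared Euclidean distances to the nearest center,
    excluding the m points farthest from the centers."""
    sq = sorted(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return sum(sq[:max(0, len(sq) - m)])
```

For example, with points `[(0, 0), (0, 1), (10, 10), (100, 100)]` and centers `[(0, 0), (10, 10)]`, calling `cost_with_outliers(points, centers, 1)` drops the point `(100, 100)` and sums only the remaining squared distances, whereas `m = 0` recovers the ordinary $k$-means cost.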

Subject: Computer Science and Mathematics - Algebra and Number Theory

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
