Machine Learning Technique | Sub-Method | Details |
---|---|---|
Supervised Learning | Regression | Predicts a continuous numeric value from labeled data, for example forecasting share-market profits or losses; it is applicable across many fields (see the sketch after this table). Examples: linear regression and support vector regression. |
Supervised Learning | Classification | Assigns data points to predefined categories. Examples: decision trees, support vector machines (SVM), logistic regression, and random forests. |
Unsupervised Learning | Clustering | Groups similar data points into clusters without predefined categories. Examples: hierarchical clustering, K-Means, and DBSCAN. |
Unsupervised Learning | Dimensionality Reduction [6] | Reduces the number of features in a dataset while retaining the important information. Examples: t-distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA). |
Semi-Supervised Learning | | Combines labeled data (supervised) with unlabeled data (unsupervised) during training [7]. |
Reinforcement Learning | | Uses an agent that interacts with an environment and learns the actions that maximize a reward. Examples: Q-learning, deep reinforcement learning, etc. [8]. |
Deep Learning | | Uses neural networks to learn complex patterns and representations from data. Examples: Recurrent Neural Networks (RNNs) for sequential data and Convolutional Neural Networks (CNNs) for image analysis [9]. |
Ensemble Methods | | Combines multiple base models to improve overall performance. Examples: boosting and bagging (Bootstrap Aggregating) [10]. |
Natural Language Processing (NLP) | | Used for understanding and processing human language [11]. |
Time Series Analysis | | Analyzes sequences of data points collected over a time interval. Examples: Long Short-Term Memory (LSTM) networks and Autoregressive Integrated Moving Average (ARIMA) [12]. |
Anomaly Detection Algorithms | | Identifies outliers or unusual patterns in data. Examples: One-Class SVM and Isolation Forest [13]. |
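To make the contrast between a few of these families concrete, the following is a minimal sketch, assuming scikit-learn and NumPy are available; the synthetic data, model choices, and parameter values are illustrative assumptions rather than anything prescribed by the table above.

```python
# Illustrative sketch (scikit-learn assumed): regression, classification, anomaly detection.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Supervised regression: predict a continuous target.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)

# Supervised classification: predict a discrete label.
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_cls)

# Anomaly detection: flag unusual points (-1 = outlier, 1 = inlier).
iso = IsolationForest(random_state=0).fit(X)
print(reg.coef_, clf.score(X, y_cls), np.sum(iso.predict(X) == -1))
```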
Types of Learning | |
---|---|---|
Property | Supervised Learning | Unsupervised Learning
Definition | Assigns class labels to the input data. | Groups the input data.
Depends On | A labeled training set. | No prior knowledge is required.
No. of Classes | Known | Unknown
Training Data | Contains both input features and target labels (desired outputs). | Contains only input features.
Learning Objective | Develops a predictive model from the inputs and outputs of the data source. | Interprets the input data.
Training Process | Learns the relationship between input features and target labels for prediction or classification. | Identifies patterns using techniques such as clustering and dimensionality reduction.
Examples | Classification and regression. | Clustering, anomaly detection, and topic modeling.
Purpose | Predicts upcoming (unseen) observations. | Develops a descriptive model for understanding the data and discovering unknown properties of the data source.
Evaluation | Done using metrics such as mean squared error, accuracy, precision, recall, F1-score, etc. (see the sketch after this table). | Done using internal measures (e.g., the silhouette score) or domain-specific evaluations.
Applications | Spam detection, image recognition, medical diagnosis, and stock price prediction. | Customer segmentation, image compression, recommendation systems, and exploratory data analysis. |
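A minimal sketch of the evaluation contrast in the rows above, assuming scikit-learn: the supervised model is scored against known labels (accuracy, F1), while the unsupervised clustering is scored with an internal measure (silhouette score). The two-blob dataset and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, f1_score, silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-2, size=(100, 2)), rng.normal(loc=2, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Supervised: evaluate against the target labels.
clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)
print("accuracy:", accuracy_score(y, pred), "F1:", f1_score(y, pred))

# Unsupervised: evaluate with an internal measure, no labels needed.
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))
```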
Requirements | Details |
---|---|
Data Scalability | The algorithm's capability to compact and scale with the data [15]. |
Deals With | Different types of attributes, outliers, and noise. |
Knowledge | Requires domain-specific (vertical) knowledge. |
Finds | Clusters |
Orders Input Data | Whether the input data is processed in ascending or descending order. |
Dimensionality | Addresses dimensionality of the data [16]. |
Clustering Approach | Details | Sub-Clustering Methods |
---|---|---|
Partitioning | Uses a relocation technique to group data by moving entities from one group to another [17]. | 1. CLARA 2. CLARANS 3. EM Clustering 4. FCM 5. K-Modes 6. K-Means 7. K-Medoids 8. PAM 9. X-Means |
Hierarchical | Creates clusters based on the similarity of objects [18]. | 1. AGNES 2. BIRCH 3. CHAMELEON 4. CURE 5. DIANA 6. ECHIDNA 7. ROCK |
Density Based | Creates clusters using a radius as a constraint: the data points within a given radius are considered one group, and the remaining points are treated as noise [19]. | 1. DBSCAN 2. OPTICS 3. DBCLASD 4. DENCLUE |
Grid Based | Imposes a grid on the data space; the density of each grid cell is then used for the clustering process [20]. | 1. CLIQUE 2. OPTIGRID 3. STING 4. WAVE CLUSTER |
Model Based | Uses a statistical approach in which weights (probability distributions) are assigned to individual objects, and the data is clustered based on these weights [21]. | 1. EM 2. COBWEB 3. SOMS |
Soft Clustering | Each data point may be assigned to more than one cluster, including clusters with which it has minimum similarity [22]. | 1. FCM 2. GK 3. SOM 4. GA Clustering |
Hard Clustering | Each data point is assigned to exactly one cluster, the one with which it has maximum similarity [23]. | 1. K-Means |
Bi-clustering | A data mining technique that clusters the rows and columns of a data matrix simultaneously [24]. | 1. OPSM 2. Samba 3. JSa |
Graph Based | A graph is a collection of vertices (nodes); in graph-based clustering, nodes are assigned weights and clustering is done based on these weights [25]. | 1. Graph-based k-means algorithm |
Partitioning Clustering Algorithms | Details |
---|---|
K-Means | Splits the data source into k clusters [26] (see the sketch after this table). |
Parallel k / h-Means | A version of k-means for big data sources. It runs the k-means clustering algorithm in parallel on the data to partition it into groups; parallelization distributes the computation across multiple processors, cores, or machines to accelerate clustering and improve efficiency, especially for large datasets [27]. |
Global k-means | An incremental version of k-means that seeks a globally optimal solution by considering multiple initializations and avoiding convergence to local minima [28]. |
K-Means++ | An improved seeding strategy for k-means that decreases the average squared distance between the points of a cluster and its center [29]. |
PAM (Partitioning Around Medoids) | Begins by choosing k medoids and then repeatedly exchanges medoid objects with non-medoid objects. It is a robust clustering algorithm that reduces the influence of outliers and noise, enhancing the quality of the clusters [30]. |
CLARA (Clustering LARge Applications) | Applies medoid-based clustering to samples drawn from a data source containing a large number of objects, which reduces storage space and computational time [31]. |
CLARANS (Clustering Large Applications based upon RANdomized Search) | An improvement over CLARA for large clustering applications; it performs a randomized search over a data source containing a huge number of objects [32]. |
EM Clustering | Similar to k-means, but instead of Euclidean distance it uses a statistical approach that alternates expectation (E) and maximization (M) steps over the data items [33]. |
FCM (Fuzzy C-Means) | Groups the data set into clusters in which every data point belongs to every cluster with a particular degree of membership [34]. |
K-Modes | Groups a set of data entities into clusters based on categorical attributes, using modes (the most frequent values) instead of means [35]. |
K-Medoids | A version of k-means that, instead of the mean, uses the most centrally located object of a cluster (the one with the minimum sum of distances to the other points) [36]. |
PAM (Partitioning Around Medoids) | Finds k medoids in the data source and assigns each object to the nearest medoid in order to create clusters [37]. |
X-Means | A version of k-means that repeatedly subdivides clusters and refines them using a criterion such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) [38]. |
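As referenced in the K-Means row, here is a minimal K-Means sketch assuming scikit-learn; `init="k-means++"` selects the seeding strategy described above, and the synthetic three-blob data and k=3 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in (-3, 0, 3)])

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=2).fit(X)
print("centers:\n", km.cluster_centers_)
print("inertia (within-cluster sum of squares):", km.inertia_)
print("first 10 labels:", km.labels_[:10])
```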
Hierarchical Based Clustering Types | Details |
---|---|
Agglomerative Clustering | A bottom-up approach in which every entity starts as its own cluster and clusters are merged recursively until the user's constraints are satisfied; in other words, agglomerative clustering builds the result by combining clusters upward (see the sketch after this table) [39]. |
Divisive Clustering | A top-down approach that begins with a single cluster containing all the data and recursively splits it into smaller clusters until the user's constraints are satisfied; in other words, divisive clustering builds the result by splitting clusters downward [40]. |
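A short agglomerative (bottom-up) sketch, assuming scikit-learn; scikit-learn does not ship a divisive (top-down) variant, so only the agglomerative side is illustrated, and the data and linkage choice are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in (-2, 2)])

# Ward linkage merges the pair of clusters that least increases total within-cluster variance.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print("first 10 labels:", agg.labels_[:10], "| number of clusters:", agg.n_clusters_)
```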
Hierarchical Clustering Algorithms Types | Details |
---|---|
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) | Designed to cluster big data sources; it uses a memory-efficient data structure and performs the clustering in a single pass over the data (see the sketch after this table) [41]. |
CURE (Clustering Using REpresentatives) | Uses random sampling and partitioning and then merges the partitions; this reduces running time and memory use while producing good-quality clusters [42]. |
ROCK (RObust Clustering using linKs) | Uses the links between data points to cluster the data [43]. |
CACTUS (Clustering Categorical Data Using Summaries) | Used on data sources containing categorical data; it reduces the execution time of clustering and is applicable to data sources of any size [44]. |
SNN (Shared Nearest Neighbor) | Used on data sources whose density is high and not stable (non-uniform) [45]. |
AGNES (AGglomerative NESting) | Starts with each object in a singleton cluster and recursively merges clusters based on object similarity [46]. |
CHAMELEON | Merges clusters based on the interconnectivity and closeness of the objects in the clusters [47]. |
DIANA (DIvisive ANAlysis clustering algorithm) | A top-down clustering algorithm that begins with all data points in one cluster and splits it recursively to form sub-clusters [48]. |
ECHIDNA (Efficient Clustering of Hierarchical Data for Network Traffic Analysis) | Applicable to mixed-type attributes derived from network traffic data [49]. |
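As referenced in the BIRCH row, a hedged BIRCH sketch assuming scikit-learn; the `threshold` and `branching_factor` values below are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (-3, 0, 3)])

# BIRCH builds a compact in-memory summary tree in a single pass, then clusters its leaves.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3).fit(X)
print("subclusters found:", birch.subcluster_centers_.shape[0])
print("first 10 labels:", birch.predict(X)[:10])
```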
Density Clustering Algorithms | Details |
---|---|
OPTICS (Ordering Points To Identify the Clustering Structure) | A variant of density-based clustering whose purpose is to generate clusters of different densities and shapes [50]. |
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) | Used to cluster data containing many outliers and much noise [51]; it is applicable to big data sources. |
SUBCLU (SUBspace CLUstering) | A clustering algorithm suggested for efficiently clustering subspace data [52]. |
DENCLUE (DENsity-based CLUstEring) | A clustering algorithm suggested for multimedia data and datasets that contain much noise [52]. |
DENCLUE-IM (improved DENCLUE) | Designed to cluster multimedia data and datasets containing much noise and many outliers [54]. |
DBCLASD (Distribution-Based Clustering of LArge Spatial Databases) | A clustering algorithm suggested for spatial data [55]. |
Data Point Type | Point Details |
---|---|
Core | Points in the interior of a specific cluster, i.e., points with at least the minimum required number of neighbors within the given radius (see the sketch after this table)
Border | Points that are not core points but lie within the neighborhood of a core point
Noise | Points that are neither core points nor border points
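A DBSCAN sketch, assuming scikit-learn, that recovers the three point types above: scikit-learn exposes core samples directly, points labeled -1 are noise, and the remaining clustered non-core points are the border points. The `eps` and `min_samples` values and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)
X = np.vstack([
    rng.normal(loc=-3, scale=0.3, size=(80, 2)),
    rng.normal(loc=3, scale=0.3, size=(80, 2)),
    rng.uniform(-8, 8, size=(10, 2)),          # scattered points, likely noise
])

db = DBSCAN(eps=0.6, min_samples=5).fit(X)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask
print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())
```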
Grid-Based Clustering Algorithms | Details
---|---|
CLIQUE (CLustering In QUEst) | Identifies the subspaces of a high-dimensional data space that allow better clustering, using density- and grid-based concepts; every dimension is divided into intervals of equal length [56]. |
OPTIGRID | A grid-based clustering algorithm that finds an optimal grid size using the boundaries of the clusters [57]. |
STING (STatistical INformation Grid) | A grid-based clustering technique in which the data space is recursively split into a limited number of cells; it concentrates on the value space surrounding the data points rather than on the data points alone [58]. |
Wave Cluster | A multi-resolution grid-based clustering algorithm that identifies the borders between clusters using the wavelet transform, which processes a signal by dividing it into different frequency sub-bands [59]. |
MAFIA (Merging of Adaptive FInite IntervAls) | A bottom-up adaptive algorithm for clustering subspace data [60]. |
BANG (BAtch Neural Gas) | Clustering is done using a neighbor-search algorithm whose output is the pattern values [61]. |
CLIQUE (CLustering In QUEst) | CLIQUE combines the density-based and grid-based approaches to clustering (see the sketch after this table) [56]. |
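To illustrate the grid-based idea (a grid over the data space whose cell densities drive the clustering), here is a toy 2-D sketch using only NumPy; it is an assumption-laden illustration of the general principle, not an implementation of CLIQUE, STING, or any other algorithm named above.

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(150, 2)) for c in (-3, 3)])

bins = 10                # each dimension split into equal-length intervals
density_threshold = 5    # a cell is "dense" if it holds at least this many points

# Count points per grid cell and keep only the dense cells.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
dense_cells = np.argwhere(counts >= density_threshold)
print("dense cells (row, col indices):")
print(dense_cells)
```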
Model Based Clustering Algorithms Types | Details |
---|---|
EM (Expectation-Maximization) | A variant of k-means that, instead of Euclidean distance, uses a statistical approach alternating expectation (E) and maximization (M) steps over the data items (see the sketch after this table) [62]. |
COBWEB | An incremental system for hierarchical conceptual clustering that can be used to predict missing attributes or the class of a new object; it was proposed by Douglas H. Fisher [63]. |
SOMs (Self-Organizing Maps) | A clustering technique that maps multidimensional data onto a lower-dimensional representation for easier interpretation [64]. |
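As referenced in the EM row, a Gaussian-mixture sketch assuming scikit-learn: `GaussianMixture` runs the expectation (E) and maximization (M) steps internally; the number of components and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=-2, scale=0.5, size=(100, 2)),
               rng.normal(loc=2, scale=1.0, size=(100, 2))])

# EM fits the mixture parameters; predict_proba gives soft (probabilistic) memberships.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=7).fit(X)
print("means:\n", gmm.means_)
print("soft responsibilities of the first point:", gmm.predict_proba(X[:1]))
print("hard labels of the first 10 points:", gmm.predict(X[:10]))
```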
Property | Hard Clustering | Soft Clustering |
---|---|---|
Each data point is assigned to | a single cluster | multiple clusters |
Cluster similarity | maximum | minimum |
Soft Clustering Algorithms | Details
---|---|
FCM (Fuzzy C-Means) | Groups the data set into clusters in which every data point belongs to every cluster with a particular degree of membership (see the sketch after this table). |
GK (Gustafson-Kessel) | A variant of the fuzzy c-means algorithm that uses an adaptive distance norm to identify clusters of different shapes in the data source. |
SOMs (Self-Organizing Maps) | A clustering technique that maps multidimensional data onto a lower-dimensional representation for easier interpretation. |
GA (Genetic Algorithm) Clustering | Finds clustering solutions by treating clustering as a search/optimization problem and applying biologically inspired operators such as selection, mutation, and crossover [65]. |
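As referenced in the FCM row, a minimal fuzzy c-means sketch written from scratch in NumPy (FCM is not part of scikit-learn); the fuzzifier m, the number of clusters, the iteration count, and the data are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U (n_points x c); each row sums to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update cluster centers as membership-weighted means.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Update memberships from distances to the centers.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=-2, size=(60, 2)), rng.normal(loc=2, size=(60, 2))])
centers, U = fuzzy_c_means(X, c=2)
print("centers:\n", centers)
print("membership degrees of the first point:", U[0])  # sums to 1 across clusters
```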
Bi-clustering Based Clustering Algorithms | Details |
---|---|
FCM (Fuzzy C-Means) | Groups the data set into clusters in which every data point belongs to every cluster with a particular degree of membership. |
GK (Gustafson-Kessel) | A variant of the fuzzy c-means algorithm that uses an adaptive distance norm to identify clusters of different shapes in the data source [66]. |
SOMs (Self-Organizing Maps) | A clustering technique that maps multidimensional data onto a lower-dimensional representation for easier interpretation. |
GA (Genetic Algorithm) Clustering | Finds clustering solutions by treating clustering as a search/optimization problem and applying biologically inspired operators such as selection, mutation, and crossover. |
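The bi-clustering methods named earlier (OPSM, Samba, JSa) are not available in common Python libraries, so the sketch below uses scikit-learn's SpectralCoclustering purely as an illustration of the same idea, i.e., clustering the rows and columns of a data matrix simultaneously; the matrix and the cluster count are assumptions.

```python
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Synthetic matrix with a planted block (bicluster) structure.
X, rows, cols = make_biclusters(shape=(60, 40), n_clusters=3, noise=0.5, random_state=9)

model = SpectralCoclustering(n_clusters=3, random_state=9).fit(X)
print("row cluster of each row (first 10):", model.row_labels_[:10])
print("column cluster of each column (first 10):", model.column_labels_[:10])
```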
Graph Based Clustering Algorithms | Details |
---|---|
Graph-based k-means algorithm | Its purpose is to split a graph (a collection of nodes) into subgraphs based on the distances between nodes. Distances between nodes can be calculated with, for example, the Chebyshev, squared Euclidean, Euclidean, or Manhattan distance [67] (see the sketch below). |
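A small sketch of the node-to-node distance measures listed above, assuming SciPy is available; the node coordinates are illustrative, and how the resulting distance matrix is fed into a graph-based k-means variant is left open.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(10)
nodes = rng.normal(size=(5, 2))   # 5 illustrative node positions

for metric in ("chebyshev", "sqeuclidean", "euclidean", "cityblock"):
    D = cdist(nodes, nodes, metric=metric)   # pairwise node-to-node distances
    print(metric, "max pairwise distance:", D.max().round(3))
```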
Performance Evaluation of Clustering Algorithms | Details |
---|---|
Data Mining Tasks | Data mining tasks are of two types: descriptive and predictive. Clustering is a descriptive data mining task.
Type of Learning / Knowledge | Supervised / Unsupervised / Reinforcement learning.
Dimensionality | If the clustering algorithm deals with data that has many attributes, it is said to be multi-dimensional (High / Medium / Low).
Data Sources | Data Set / File / Data Base |
Unstructured or Structured Data | Structured data is easily formed into clusters, but unstructured data is not, so algorithms are used to convert unstructured data into structured data; this conversion also makes it possible to discover new patterns. Clustering uses structured data in most cases.
Data Types used in Clustering | Clustering algorithms process two types of data: qualitative (categorical) and quantitative (numerical). Qualitative (subjective) data can be split into categories, for example a person's gender (male, female, or other); it is of three types: nominal (unordered), ordinal (ordered), and binary (true (1) / false (0)). Quantitative data is measurable and is of two types: discrete (countable) and continuous (measurable), for example a student's height.
ETL Operations used | Extraction, Transformation and loading operations are performed on the data source. |
Data Preprocessed | Data cleaning and transformation applied to make the data suitable for analysis.
Data Preprocessing Methods | Common data preprocessing methods include cleaning, instance selection, normalization, scaling, one-hot encoding, data transformation, feature extraction, feature selection, and dimensionality reduction.
Hierarchical Clustering Algorithms Type | It is of two types: Divisive (Top-Down) or Agglomerative (Bottom-Up).
No Of Clustering Algorithms | The total count of clustering algorithms considered, i.e., the sum of the number of main clustering algorithms and the number of sub-clustering algorithms.
Algorithms Threshold / Stops At What Level | Hierarchical clustering algorithms stop at a level defined by the user's preferences.
Algorithm Stability | Whether the algorithm yields consistent clusters; different clustering runs/applications are used to determine the number of clusters.
Programming Language | The language used to implement and run the clustering algorithm (Python, Java, .NET, etc.).
Number Of Inputs For The Clustering Process | The clustering algorithm, the algorithm's constraints, the number of levels, and the number of clusters per level.
Number Of Levels | In hierarchical clustering, the number of levels is the number of splits a divisive (top-down) algorithm performs going down, or the number of merges an agglomerative (bottom-up) algorithm performs going up.
Level Wise Clusters | The number of clusters at each level or stage.
Data Points per Cluster | Depends on the type of clustering algorithm used and the preferences defined by the user.
Similarity Functions / Similarity Measures | Used to quantify how similar or dissimilar two clusters (or data points) are in a clustering analysis and to identify good clusters in a given data set. Many similarity measures are in current use, including Weighted, Average, Chord, Mahalanobis, Mean Character Difference, Index of Association, Canberra metric, Czekanowski coefficient, Pearson coefficient, Minkowski metric, Manhattan (city-block) distance, Kullback-Leibler divergence, clustering coefficient, cosine, K-mean, etc.
Intra Cluster Distance | Indicates how near the data points within a cluster are to each other; a low value means the cluster is tightly coupled, otherwise it is loosely coupled.
Inter Cluster Distance | Measures the separation or dissimilarity between different clusters, i.e., how distinct or well separated the clusters are from each other.
Sum Of Squared Errors (SSE) Or Other Errors | A measure of the difference between the actual and the expected result of the model.
Likelihood Of Clusters | The degree of similarity of the data points to their clusters.
Unlikelihood Of Clusters | The degree of dissimilarity of the data points to their clusters.
Number Of Variable Parameters At Each Level | The input parameters, such as thresholds, that are changed while the algorithm runs.
Outlier | In the clustering process, any object that does not belong to any cluster is called an outlier.
Clusters Compactness | Measured by the inertia, i.e., the Within-Cluster Sum of Squares; lower inertia indicates better (more compact) clustering.
Purpose | To develop a model of the data for prediction and understanding.
Clustering Scalability | The ability of each cluster to grow or shrink as part of the whole.
Total Number of Clusters | The total number of clusters generated by the clustering algorithm after its execution.
Interpretability | The understandability and usability of the clusters after they are generated.
Convergence | The convergence criterion is a condition that controls the change in cluster centers; the change should become minimal before the algorithm stops.
Clusters Shape | Each clustering algorithm produces clusters of different shapes, e.g., K-Means: hyper-spherical; centroid-based approaches: convex; CURE: arbitrary; partitional clustering: ellipsoidal; CLARANS: polygon-shaped; DBSCAN: concave (arbitrary); etc.
Output | Clusters |
Space Complexity | The amount of memory or storage required by the clustering algorithm for the input data, data structures, and variables used while clustering a given dataset. Space complexity = auxiliary space + space for the input values.
Time Complexity | The time taken to execute the statements of the algorithm. Typical time complexities of clustering algorithms: BIRCH O(n); CURE O(s^2 log s) (s = sample size); ROCK O(n^3); CLARANS O(n^2); CHAMELEON O(n^2); STING O(n); CLIQUE O(n); K-Means O(n); K-Medoids O(n^2); PAM O(n^2); CLARA O(n); etc.
Clusters Visualization | The process of representing clusters or groups of data points in a visual form, giving insight into patterns, relationships, and structures within the data (see the sketch below). Techniques and tools for visualizing clusters include scatter plots, dendrograms, heatmaps, t-distributed Stochastic Neighbor Embedding (t-SNE) plots, Principal Component Analysis (PCA) plots, silhouette plots, K-Means clustering plots, hierarchical clustering dendrograms, density-based clustering visualizations, and interactive visualization tools such as Matplotlib, Seaborn, Plotly, D3.js, and Tableau.
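Finally, a hedged sketch tying a few of the evaluation criteria above together, assuming scikit-learn and Matplotlib: inertia as the compactness measure, the silhouette score as an internal measure, inter-cluster distance between cluster centers, and a scatter-plot visualization. The data and k are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(80, 2)) for c in (-3, 0, 3)])

km = KMeans(n_clusters=3, n_init=10, random_state=11).fit(X)
labels = km.labels_

print("inertia (compactness, lower is better):", km.inertia_)
print("silhouette score (internal measure):", silhouette_score(X, labels))

# Inter-cluster separation: distance between two of the cluster centers.
centers = km.cluster_centers_
print("distance between centers 0 and 1:", np.linalg.norm(centers[0] - centers[1]))

# Cluster visualization as a scatter plot colored by label.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(centers[:, 0], centers[:, 1], marker="x", c="red")
plt.title("K-Means clusters and their centers")
plt.show()
```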