ML algorithms are commonly categorized by how they are trained, and one such category is supervised learning, in which an algorithm learns to map inputs to outputs from a labeled dataset of input-output pairs [47]; that is, the algorithm is trained on instances whose labels are known. The effectiveness of supervised learning stems from having the correct target outputs available: during training, the algorithm repeatedly computes a loss function whose value guides adjustments to the model's weights so that its output approaches the desired one. For instance, when predicting car prices from feature data such as previous prices, model number, brand, mileage, and accident history, the algorithm learns to estimate a car's value from inputs like model, make, and mileage.

Supervised learning has broad applications, including image recognition, Natural Language Processing (NLP), and speech recognition; in this study it is employed to distinguish Sybil accounts from legitimate Twitter accounts. It comes in two key types: classification and regression. A classification problem concerns assigning data to predefined classes, e.g., classifying animal images as cats or dogs, or segregating individuals by age group (20-30 versus over 30); classification algorithms therefore determine the group to which an instance belongs. Common classification algorithms include decision trees, support vector machines, and random forests. They are trained on labeled data and then used to classify new data, but the right algorithm must first be chosen based on factors such as dataset size, training method, and the desired accuracy.
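To make this workflow concrete, the following minimal sketch trains a supervised classifier with scikit-learn; the toy feature values, the binary price labels, and the choice of a decision tree are illustrative assumptions rather than the configuration used in this study.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative labeled dataset: each row is [mileage, age_in_years];
# label 1 = "sells above average", 0 = "sells below average".
X = [[20000, 2], [150000, 10], [5000, 1], [90000, 7], [30000, 3], [120000, 9]]
y = [1, 0, 1, 0, 1, 0]

# Hold out part of the labeled data to estimate accuracy on unseen instances.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Train on labeled instances, then classify new (test) instances.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print("Predicted classes:", clf.predict(X_test))
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))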
Sahami et al. [48] tackle such a classification problem with a naive Bayes classifier that distinguishes spam from legitimate email, enabling automated filters that sort inbox messages and improve the user experience by excluding unwanted content. Unlike classification, where the ML algorithm predicts one of a set of predetermined classes, regression predicts a numeric value, e.g., a car's selling price or a woman's due date. Unlike supervised learning, unsupervised learning does not rely on labeled training data; instead, it analyzes unlabeled data to reveal patterns and relationships, which makes it applicable to tasks such as clustering and anomaly detection [49].
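By contrast, the following sketch applies unsupervised learning, clustering unlabeled points with k-means in scikit-learn; the data and the choice of two clusters are assumptions made purely for illustration.

from sklearn.cluster import KMeans

# Unlabeled data: only feature vectors are available, no target labels.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# k-means reveals structure by grouping nearby points into clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster index assigned to each unlabeled point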
The K-nearest neighbors (KNN) algorithm is a flexible supervised method that learns from labeled data in order to predict labels for unlabeled instances. It memorizes the training instances and later uses them to predict labels for new instances, serving both classification and regression tasks depending on how it is trained. To label a new instance, KNN compares it with the memorized instances and assigns a label based on similarity [50].
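A minimal sketch of this behavior with scikit-learn's KNN classifier is shown below; the toy instances and the value of K are assumptions for illustration only.

from sklearn.neighbors import KNeighborsClassifier

# Memorized (training) instances with known labels.
X_train = [[1, 1], [1, 2], [8, 8], [9, 8]]
y_train = ["cat", "cat", "dog", "dog"]

# KNN stores the labeled instances and, at prediction time,
# labels a new instance according to its most similar neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[1.5, 1.0]]))  # -> ['cat'], since its nearest neighbors are mostly cats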
Next, to illustrate KNN in a real-world setting, consider using it to predict housing-price trends that inform investors' decisions. First, when the KNN algorithm is employed as a classifier, the classes are defined and the model's accuracy is assessed on labeled instances by splitting the dataset into training and testing portions, using techniques such as cross-validation, including re-substitution validation, K-fold cross-validation, and repeated K-fold cross-validation, each with its own strengths and applications. Next, the number of neighbors (K) used for prediction is chosen, which influences accuracy: an optimal K balances the detection of genuine patterns against sensitivity to noise. The algorithm then computes the distances between the instance and its neighbors, and the classes of the closest K neighbors decide the predicted class, using distance metrics such as the Euclidean, Manhattan, and Minkowski distances; refer to formula (1) for these methods [51].
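The sketch below combines these steps, using 5-fold cross-validation to compare candidate values of K for a KNN classifier with a Euclidean metric; the example dataset and the candidate K values are assumptions chosen for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # a small, labeled example dataset

# Evaluate several candidate neighbor counts with 5-fold cross-validation
# and keep the K that yields the best mean accuracy.
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"K={k}: mean accuracy {scores.mean():.3f}")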
Another distance measure is cosine similarity, depicted in formula (2) and commonly used in text retrieval, as discussed by Lu [11]. Additionally, KNN can employ diverse distance formulas: Batchelor [14] introduces the Minkowski distance formula (3) and the Chi formula (4), and Michalski and Stepp [52] provide the correlation equation (5); these constitute distinct distance metrics for calculating the distance between an instance and its neighbors in KNN.
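For reference, the widely used standard forms of these metrics are sketched below for feature vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n); these are the common textbook definitions and may differ in notation from the numbered formulas referenced above.

\[ d_{\mathrm{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p}, \quad \text{with } p = 1 \text{ (Manhattan) and } p = 2 \text{ (Euclidean)}, \]

\[ \cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}, \qquad d_{\chi^{2}}(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^{2}}{x_i + y_i}, \]

\[ d_{\mathrm{corr}}(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}. \]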
Each of these equations calculates the distance between the instance under study and the other instances in the dataset, and the choice of metric leads to different outputs and accuracy measurements; these effects are explored in ongoing research. Medjahed et al. [53] investigate breast cancer diagnosis using KNN and its dependence on diverse distance metrics, and Iswanto et al. [28] conduct a comparable study of the impact of these distance equations on stroke disease detection.

After calculating the distances between the studied instance and all other instances in the dataset, the algorithm selects the K nearest neighbors, with K determining how many neighbors are considered. For instance, with K set to 100, it chooses the 100 neighbors with the smallest distances to the instance. This step can be computationally intensive, especially for large datasets, since it requires many distance calculations and comparison operations. Once the K nearest neighbors are identified, the algorithm predicts the instance's class from their classes in a step called voting, which comes in weighted and majority variants: weighted voting assigns different weights to the neighbors' votes (for example, giving closer neighbors more influence), whereas majority voting counts every vote equally. This adaptability lets developers emphasize specific class attributes when needed.
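To tie these steps together, here is a compact from-scratch sketch of the prediction phase (distance computation, neighbor selection, and both voting schemes); the Euclidean metric, the toy instances, and K = 3 are assumptions chosen only for illustration.

import math
from collections import Counter

def knn_predict(train, query, k=3, weighted=False):
    """Predict a class for `query` from labeled pairs train = [(features, label), ...]."""
    # 1. Compute the distance from the query to every memorized instance (Euclidean here).
    distances = [(math.dist(features, query), label) for features, label in train]
    # 2. Keep the K neighbors with the smallest distances.
    neighbors = sorted(distances)[:k]
    # 3. Vote: majority voting counts each neighbor once; weighted voting gives
    #    closer neighbors a larger say via inverse-distance weights.
    votes = Counter()
    for dist, label in neighbors:
        votes[label] += (1.0 / (dist + 1e-9)) if weighted else 1.0
    return votes.most_common(1)[0][0]

train = [((1, 1), "legit"), ((1, 2), "legit"), ((8, 8), "sybil"), ((9, 8), "sybil")]
print(knn_predict(train, (2, 1), k=3))                 # majority voting -> 'legit'
print(knn_predict(train, (2, 1), k=3, weighted=True))  # weighted voting -> 'legit'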