Preprint Article, Version 1. This version is not peer-reviewed.

Unsupervised Modeling of E-Customers’ Profiles: Multiple Correspondence Analysis with Hierarchical Clustering of Principal Components and Machine Learning Classifiers

Version 1: Received: 4 November 2024 / Approved: 5 November 2024 / Online: 6 November 2024 (12:23:35 CET)

How to cite: Vrhovac, V.; Orošnjak, M.; Ristić, K.; Sremcev, N.; Jocanović, M.; Spajić, J.; Brkljač, N. Unsupervised Modeling of E-Customers’ Profiles: Multiple Correspondence Analysis with Hierarchical Clustering of Principal Components and Machine Learning Classifiers. Preprints 2024, 2024110363. https://doi.org/10.20944/preprints202411.0363.v1

Abstract

The rapid growth of e-commerce has transformed customer behaviours, demanding deeper insights into how demographic factors shape online user preferences. To understand the impact of these changes, this study performs a threefold analysis. First, the study investigates how demographic factors (e.g., age, gender, education, income) influence e-customer preferences in Serbia. From a sample of n = 906 respondents, we test conditional dependencies between demographics and four user preferences: “purchase frequency”, “the most important property when buying for the first time”, “the most important property before repeating a purchase”, and “reasons for quitting an online purchase”. Of the 24 hypotheses tested, the study rejects 8 (p < 0.05), suggesting a statistically significant association of demographics with purchase frequency (p < 0.01) and with reasons for quitting the purchase (p < 0.01). However, although the reported test statistics suggest an association, they do not show how interactions between categories shape e-customer profiles. Consequently, the second part applies MCA-HCPC (Multiple Correspondence Analysis with Hierarchical Clustering on Principal Components) to identify user profiles. The analysis reveals three main clusters: (1) young female unemployed e-customers driven mainly by customer reviews; (2) retirees and older adults with infrequent purchases, hesitant to buy without experiencing the product in person; (3) employed, highly educated, male midlife adults who prioritise fast and accurate delivery over price. In the third stage, the study uses the identified clusters as labels for Machine Learning (ML) classification with the following algorithms: Gradient Boosting Machine (GBM), Decision Tree (DT), k-Nearest Neighbors (kNN), Gaussian Naïve Bayes (GNB), Random Forest (RF) and Support Vector Machine (SVM).
The results suggest high classification performance of GBM (AUROC = 0.994), RF (AUROC = 0.994) and SVM (AUROC = 0.902) in identifying user profiles. Lastly, Permutation Feature Importance (PFI) suggests that age, work status, education, and income are the main determinants shaping e-customer profiles and are therefore relevant for developing marketing strategies.
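
The first stage tests dependence between pairs of categorical variables. The abstract does not name the test used, so the following is only a minimal sketch that assumes a Pearson chi-square test of independence without continuity correction, with Cramér's V added as an effect-size measure. The function name `chi_square_independence` and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import sqrt


def chi_square_independence(rows, cols):
    """Return the Pearson chi-square statistic, degrees of freedom, and
    Cramér's V for two equally long sequences of categorical labels."""
    if len(rows) != len(cols):
        raise ValueError("both variables need one label per respondent")
    n = len(rows)
    observed = Counter(zip(rows, cols))  # joint counts per contingency cell
    row_totals = Counter(rows)           # marginal counts of the first variable
    col_totals = Counter(cols)           # marginal counts of the second variable

    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            # Expected count under independence of the two variables.
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    k = min(len(row_totals), len(col_totals)) - 1
    cramers_v = sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, dof, cramers_v


# Hypothetical toy data (assumption): NOT the survey responses from the study.
age_group = ["young"] * 40 + ["older"] * 40
frequency = ["weekly"] * 30 + ["rarely"] * 10 + ["weekly"] * 10 + ["rarely"] * 30

chi2, dof, v = chi_square_independence(age_group, frequency)
print(f"chi2={chi2:.2f}, dof={dof}, Cramer's V={v:.2f}")
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05. The p-value speaks only to whether an association exists, while Cramér's V (between 0 and 1) indicates how strong it is.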

Keywords

e-commerce; customer profiles; demographics; user preferences; multiple correspondence analysis; hierarchical clustering; machine learning

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
