Dimensionality Reduction and Clustering of Global Large Enterprise Data Using PCA, UMAP, and Gaussian Mixture Models
Keywords:
Dimensionality Reduction, Clustering, PCA, UMAP, K-Means, Gaussian Mixture Model, Business Analytics
Abstract
In the modern business landscape, large corporations generate high-dimensional datasets that combine financial, operational, and market indicators, often producing complex and partially overlapping structures that are difficult to interpret in the original feature space. This study benchmarks linear dimensionality reduction using Principal Component Analysis (PCA) against non-linear reduction using Uniform Manifold Approximation and Projection (UMAP), and examines how these representations affect clustering quality under k-means and Gaussian Mixture Models (GMM). Data preprocessing includes missing-value handling, categorical encoding, numeric coercion, and feature standardization, so that all features are learned on a comparable scale. Clustering performance is evaluated using the silhouette score, which jointly reflects within-cluster cohesion and between-cluster separation. The results indicate that the UMAP plus GMM pipeline achieves the highest silhouette score (0.57) among the pipelines compared, suggesting that manifold-based representations combined with probabilistic clustering capture heterogeneous corporate structures more effectively than linear projections and hard assignments. The findings support the use of non-linear, model-based pipelines for corporate segmentation tasks, particularly when clusters may overlap due to mixed business profiles and cross-sector similarities.
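To make the evaluation metric concrete, the following is a minimal sketch of the silhouette score in plain Python. It is an illustration only, not the study's code: the function name `silhouette_score`, the Euclidean distance choice, and the four toy points are assumptions introduced here. For each point, `a` is the mean distance to the other members of its own cluster (cohesion) and `b` is the smallest mean distance to the members of any other cluster (separation); the point's silhouette is `(b - a) / max(a, b)`, and the score is the mean over all points.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: silhouette is 0 by convention
        a = mean_dist(i, own)  # within-cluster cohesion
        b = min(  # separation from the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups

print(f"good assignment: {good:.3f}")
print(f"bad assignment:  {bad:.3f}")
```

On this toy data the assignment that follows the two groups scores close to 1, while the assignment that cuts across them scores below 0. Scores range from -1 to 1, so the reported 0.57 sits between these two extremes.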

