
Comparison of Ensemble Machine Learning Methods for Labor Force Classification in Indonesia Using Random Forest, XGBoost, and CatBoost

1Politeknik Statistika STIS, Indonesia

2Politeknik Statistika STIS, Jl. Otto Iskandardinata 64C Jakarta Timur, Indonesia

Received: 28 Dec 2020; Published: 24 Sep 2024.
Open Access Copyright (c) 2024 Jurnal Teknologi dan Sistem Komputer under http://creativecommons.org/licenses/by-sa/4.0.

Abstract
The National Labor Force Survey (Survei Angkatan Kerja Nasional, Sakernas) is a large periodic survey, so it requires complex data processing and sound validation to maintain data quality. One Sakernas item that is still filled in and validated manually is the main field of employment (lapangan pekerjaan utama). To support this validation, machine learning can be applied using the information recorded in the other questionnaire items. This study uses Random Forest, XGBoost, and CatBoost to classify the main field of employment in the August 2019 Sakernas. The three models perform almost equally well in terms of precision, recall, and F1-score: above 90% for the primary and tertiary sectors and 80% for the secondary sector. The Random Forest, XGBoost, and CatBoost models reach accuracies of 91.80%, 90.88%, and 91.84%, respectively. The Area Under the Curve (AUC) values of all three models are relatively high; CatBoost has the highest values, at 1.00, 0.97, and 0.98 for the primary, secondary, and tertiary sectors, respectively.
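The per-sector metrics named in the abstract (precision, recall, F1, accuracy, and a one-vs-rest AUC) can be illustrated with a minimal sketch. It uses only the Python standard library and a made-up set of ten labels and scores; the data, the function names `per_class_report` and `one_vs_rest_auc`, and the sector label strings are assumptions for illustration, not the study's data or code.

```python
from collections import Counter

# Hypothetical sector labels; the study classifies into primary/secondary/tertiary.
SECTORS = ["primary", "secondary", "tertiary"]


def per_class_report(y_true, y_pred, labels):
    """Return overall accuracy and precision/recall/F1 for each class."""
    pairs = Counter(zip(y_true, y_pred))  # counts of (true, predicted) pairs
    report = {}
    for c in labels:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
    return accuracy, report


def one_vs_rest_auc(y_true, scores, positive):
    """AUC of one class against all others, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (NOT from Sakernas): 4 primary, 3 secondary, 3 tertiary records.
y_true = ["primary"] * 4 + ["secondary"] * 3 + ["tertiary"] * 3
y_pred = ["primary"] * 4 + ["secondary", "secondary", "tertiary",
                            "tertiary", "tertiary", "secondary"]
# Made-up model scores for the "secondary" class, one per record.
secondary_scores = [0.1, 0.0, 0.2, 0.1, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6]

accuracy, report = per_class_report(y_true, y_pred, SECTORS)
auc_secondary = one_vs_rest_auc(y_true, secondary_scores, "secondary")

print(f"accuracy = {accuracy:.2f}")
for sector in SECTORS:
    m = report[sector]
    print(f"{sector}: precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} f1={m['f1']:.2f}")
print(f"one-vs-rest AUC (secondary) = {auc_secondary:.3f}")
```

Reporting the metrics per sector, as the abstract does, shows differences that a single accuracy figure hides: in this toy data the primary class is classified perfectly while the other two are not.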

Note: This article has supplementary file(s).

Supplementary files:
Description of Research Attributes (type: Research Instrument, 15KB)
Copyright Transfer Agreement (type: Other, 364KB)
Keywords: sakernas; random forest; xgboost; catboost


