Synthetic Minority Over-Sampling for Improving Imbalanced Data in Educational Web Usage Mining

Wacharawan Intayoad
Chayapol Kamyod
Punnarumol Temdee

Abstract

Educational data mining is the practice of extracting and discovering new knowledge from education data. Because education data is often complex and imbalanced, it requires a data preprocessing step or appropriate learning algorithms to yield accurate analysis and interpretation. Many studies emphasize classification and clustering methods to gain insight into education data. However, few previous works have focused exclusively on preprocessing education data, particularly on the problem of imbalanced datasets. The objective of this research is therefore to improve the accuracy of classification on educational web usage data. Our study applies the synthetic minority over-sampling technique (SMOTE) to preprocess a raw web usage dataset. The minority class consists of the students who failed the examination, and the majority class consists of the students who passed. In our experiments, four synthetic minority over-sampling methods are applied to balance the number of samples in the minority class: SMOTE and its variants Borderline-SMOTE1, Borderline-SMOTE2, and SVM-SMOTE. The methods are evaluated by comparing the results of three well-known classifiers: naive Bayes, decision tree, and k-nearest neighbors. The experiments use real-world education datasets. The results show that synthetic minority over-sampling methods improve the detection of the minority class and improve classification performance in terms of precision, recall, and F1-value.
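The core idea of basic SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote`, the toy features (page views, minutes online), and all values are hypothetical and chosen only for illustration. The Borderline-SMOTE and SVM-SMOTE variants, which restrict where synthetic points are generated, are not covered here.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (page views, minutes online) per student.
passed = [(30 + i, 120 + 2 * i) for i in range(20)]  # majority class
failed = [(5, 20), (7, 25), (6, 18), (9, 30)]        # minority class

# Over-sample the minority class until both classes have the same size.
new_failed = smote(failed, n_synthetic=len(passed) - len(failed), k=3)
balanced_failed = failed + new_failed
print(len(passed), len(balanced_failed))  # prints: 20 20
```

Because every synthetic point lies on a line segment between two real minority samples, the new points stay inside the region already occupied by the minority class. In practice the over-sampling would be applied to the training split only, so that the classifiers are still evaluated on real, unmodified samples.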


How to Cite
W. Intayoad, C. Kamyod, and P. Temdee, “Synthetic Minority Over-Sampling for Improving Imbalanced Data in Educational Web Usage Mining”, ECTI-CIT Transactions, vol. 12, no. 2, pp. 118–129, Feb. 2019.

Section
Artificial Intelligence and Machine Learning (AI)
