Application of the Random Forest Algorithm and Graph Theory to Text Analysis

วัชรีวรรณ จิตต์สกุล
สุนันฑา สดสี

Abstract

This research applies the random forest algorithm and graph theory to text analysis. Three benchmark comment datasets, collected from “www.imdb.com”, “www.yelp.com”, and “www.amazon.com” and provided by the UCI Machine Learning Repository, are used to evaluate the proposed approach. First, the random forest algorithm is applied to extract words from the comment datasets. Second, the relationships between words in the comment datasets are represented as a graph. Finally, three centrality measures, betweenness centrality (BC), closeness centrality (CC), and degree centrality (DC), are used to determine the importance of each word and to compare the words extracted by the random forest algorithm with those obtained from the graph. The extracted words fell into three groups: matched words, similar words, and exception words. Across the three benchmark datasets, the matched words had average BC, CC, and DC of 94.24010, 2.0369, and 23.5736; the similar words had 127.6935, 2.0286, and 25.1273; and the exception words had 38.5155, 2.1053, and 18.4643, respectively. Because the average CC was nearly the same in all three groups while BC and DC differed clearly, BC and DC were better suited than CC to text analysis based on word centrality. In addition, the similar words had higher average BC and DC than the matched words and the exception words.
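The graph step and the three centrality measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how word relationships are defined or how the scores are normalized, so the sketch assumes that adjacent words in a comment are linked by an undirected edge, and it uses the textbook definitions: raw degree, closeness as (reachable nodes) / (sum of shortest-path distances), and unnormalized betweenness computed with Brandes' algorithm. The function names are hypothetical, and the random forest word-extraction step is not shown.

```python
from collections import deque


def build_word_graph(comments):
    """Build an undirected word graph (ASSUMPTION: adjacent words are linked)."""
    graph = {}
    for comment in comments:
        words = comment.lower().split()
        for word in words:
            graph.setdefault(word, set())
        for u, v in zip(words, words[1:]):
            if u != v:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def degree_centrality(graph):
    """Raw degree: the number of distinct neighbours of each word."""
    return {word: len(neighbours) for word, neighbours in graph.items()}


def closeness_centrality(graph):
    """(number of reachable nodes) / (sum of shortest-path distances to them)."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted graph)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                           # nodes in order of discovery
        preds = {v: [] for v in graph}       # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)      # number of shortest paths
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):            # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {v: s / 2 for v, s in score.items()}


# Toy comments (made up for illustration, not taken from the datasets).
comments = ["good movie", "good food", "bad movie"]
graph = build_word_graph(comments)
dc = degree_centrality(graph)
cc = closeness_centrality(graph)
bc = betweenness_centrality(graph)
for word in sorted(graph):
    print(word, dc[word], round(cc[word], 4), bc[word])
```

In this toy graph the words form the chain food, good, movie, bad, so the two inner words receive the highest degree and betweenness. Averaging such scores over a group of words would give figures comparable in kind, though not necessarily in scale, to the averages reported above.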


Keywords: random forest; graph theory; text analysis; word centrality

Article Details

Section
Physical Sciences
Author Biographies

วัชรีวรรณ จิตต์สกุล

Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Wong Sawang Subdistrict, Bang Sue District, Bangkok 10800

สุนันฑา สดสี

Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Wong Sawang Subdistrict, Bang Sue District, Bangkok 10800
