Using N-gram and Frequent Max Substring Techniques for Index-Term Extraction from Non-Segmented Texts: A Comparison of Two Techniques

Todsanai Chumwatana

Abstract

The amount of electronically stored information in non-segmented texts has grown rapidly, and the number of such documents is still increasing. This makes index-term extraction an essential task, and several techniques have been proposed for extracting index-terms from non-segmented texts to support indexing. In this paper, we investigate two index-term extraction techniques for non-segmented texts: the n-gram technique and the frequent max substring technique. Many research communities have acknowledged the n-gram technique as a viable solution for extracting index-terms from non-segmented texts such as Chinese, Japanese, Korean, and Thai, as well as genome and protein sequences in bioinformatics. The frequent max substring technique has been proposed as an alternative method for extracting index-terms, and it provides significant benefits for indexing non-segmented texts. Experimental studies and comparison results are presented to compare the two techniques. The experimental results lead to the following observations. The n-gram technique requires less space to extract the index-terms than the frequent max substring technique. Meanwhile, the frequent max substring technique improves on the n-gram technique in terms of performance, as it can be applied to many non-segmented texts without the need to determine the dimensions of the terms.
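
As an illustration of the n-gram idea mentioned in the abstract, the following is a minimal sketch of character n-gram extraction over a non-segmented string. It is not the implementation evaluated in the paper; the function names `char_ngrams` and `ngram_index_terms` and the sample input are assumptions made for this example. It also shows why the term length has to be fixed beforehand: `n` is a required parameter.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text, left to right.

    Hypothetical helper for illustration; n must be chosen in advance.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A sliding window of width n yields len(text) - n + 1 grams
    # (an empty list when the text is shorter than n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_index_terms(text, n):
    """Count each distinct n-gram; the keys act as candidate index-terms."""
    return Counter(char_ngrams(text, n))


# Example on a short genome-like string with n = 3.
terms = ngram_index_terms("ACGTACGT", 3)
for gram, freq in sorted(terms.items()):
    print(gram, freq)
```

Running the sketch with a different `n` produces a different set of candidate terms, which is the parameter choice that the frequent max substring technique is described as avoiding.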

Article Details

How to Cite
[1]
T. Chumwatana, “Using N-gram and Frequent Max Substring Techniques for Index-Term Extraction from Non-Segmented Texts: A Comparison of Two Techniques”, JIST, vol. 3, no. 1, pp. 8–15, Jun. 2012.
Section
Research Article: Soft Computing (Detail in Scope of Journal)
