Using N-gram and Frequent Max Substring Techniques for Index-Term Extraction from Non-Segmented Texts: A Comparison of Two Techniques

Todsanai Chumwatana

doi:10.14456/jist.2012.2

Published: Jun 30, 2012

Keywords:

Frequent max substrings, Frequent substrings, n-gram technique, Frequent max substring techniques, n-grams

Todsanai Chumwatana

Faculty of Information Technology, Rangsit University

Abstract

The amount of electronically stored information in non-segmented texts has grown rapidly, and the number of such documents is still increasing. This makes index-term extraction an essential task, and several techniques have been proposed for extracting index terms from non-segmented texts to support indexing. In this paper, we investigate two index-term extraction techniques for non-segmented texts: the n-gram technique and the frequent max substring technique. Many research communities have acknowledged that the n-gram technique is a viable solution for extracting index terms from non-segmented texts such as Chinese, Japanese, Korean and Thai, and from genome or protein sequences in bioinformatics. In addition, the frequent max substring technique has been proposed as an alternative method for extracting index terms, and it provides significant benefits for indexing non-segmented texts. Experimental studies and results comparing the two techniques are presented. The experimental results lead to the following observations. The n-gram technique requires less space to extract the index terms than the frequent max substring technique. The frequent max substring technique, however, improves on the n-gram technique in terms of performance, as it can be applied to many non-segmented texts without the need to determine the length of the terms in advance.
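
To make the contrast between the two techniques concrete, the following is a minimal Python sketch and not the implementation evaluated in the paper. The function names `char_ngrams` and `frequent_maximal_substrings`, the `min_freq` threshold, and the sample string are all hypothetical. In particular, the sketch assumes that a "frequent max substring" is a substring that occurs at least `min_freq` times and is not contained in any longer substring meeting the same threshold; the paper's exact definition and data structure may differ. It uses brute-force enumeration, which is only practical for very short inputs.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order.

    The term length `n` has to be chosen before extraction.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequent_maximal_substrings(text, min_freq=2):
    """Brute-force illustration of length-free term extraction.

    ASSUMPTION (not taken from the paper): a term is kept when it occurs
    at least `min_freq` times and is not contained in a longer substring
    that also occurs at least `min_freq` times.
    """
    # Count every substring once per starting position.
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    frequent = {s for s, c in counts.items() if c >= min_freq}
    # Keep only the frequent substrings that no other frequent one contains.
    return {
        s: counts[s]
        for s in frequent
        if not any(s != t and s in t for t in frequent)
    }


if __name__ == "__main__":
    sample = "catdogcatdogbird"  # made-up text without word boundaries
    print(Counter(char_ngrams(sample, 3)).most_common(4))
    print(frequent_maximal_substrings(sample, min_freq=2))
```

In this sketch the n-gram extractor emits every window of the chosen length, so the result depends on picking `n` beforehand, whereas the second function returns variable-length terms without that parameter, at the cost of enumerating far more candidate substrings.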

How to Cite

[1]

T. Chumwatana, “Using N-gram and Frequent Max Substring Techniques for Index-Term Extraction from Non-Segmented Texts: A Comparison of Two Techniques”, JIST, vol. 3, no. 1, pp. 8–15, Jun. 2012.

Issue

Vol. 3 No. 1 (2012): Journal of Information Science and Technology (JIST) [Jan. 2012 - Jun. 2012]

Section

Research Article: Soft Computing (Detail in Scope of Journal)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


