Using N-gram and Frequent Max Substring Techniques for Index-Term Extraction from Non-Segmented Texts: A Comparison of Two Techniques

Main Article Content

Todsanai Chumwatana


The amount of electronically stored information in non-segmented texts has grown rapidly, and the number of such documents is still increasing. This makes index-term extraction an essential task, and several techniques have been proposed for extracting index-terms from non-segmented texts in order to support indexing. In this paper, we investigate two index-term extraction techniques for non-segmented texts: the n-gram technique and the frequent max substring technique. Many research communities have acknowledged that the n-gram technique is one of the viable solutions for extracting index-terms from non-segmented texts such as Chinese, Japanese, Korean, and Thai, as well as genome or protein sequences in the area of bioinformatics. Besides this, the frequent max substring technique has been proposed as an alternative method for extracting index-terms, and it provides significant benefits for indexing non-segmented texts. Experimental studies and comparison results are presented in order to compare the two techniques. From the experimental results, the following observations can be made. The n-gram technique requires less space to extract the index-terms when compared to the frequent max substring technique. Meanwhile, the frequent max substring technique improves on the n-gram technique in terms of performance, as it can be applied to many non-segmented texts without the need to determine the dimensions of the term in advance.
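To make the contrast between the two techniques concrete, the following is a minimal Python sketch. It is not the implementation evaluated in the paper. The function names `char_ngrams` and `frequent_max_substrings`, the `min_freq` threshold, and the working definition of a frequent max substring used here (a substring that occurs at least `min_freq` times and is not contained in a longer substring with the same frequency) are assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every overlapping character n-gram of a fixed length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def frequent_max_substrings(text, min_freq=2):
    """Naive illustration (assumed definition, not the paper's algorithm).

    Keeps substrings that occur at least min_freq times and that are not
    contained in a longer substring occurring equally often.
    """
    # Enumerate all substrings by start and end position; this is
    # quadratic in the text length and only suitable for tiny examples.
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    frequent = {s: c for s, c in counts.items() if c >= min_freq}
    return {
        s: c
        for s, c in frequent.items()
        if not any(s != t and s in t and c == frequent[t] for t in frequent)
    }


bigrams = char_ngrams("abcabc", 2)
max_terms = frequent_max_substrings("abcabc", min_freq=2)
```

In this sketch the n-gram function needs the term length `n` to be chosen up front, whereas the substring function only needs a frequency threshold, which mirrors the trade-off described in the abstract. The brute-force enumeration also hints at why a substring-based approach can require more space than fixed-length n-grams.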


Article Details

How to Cite
Chumwatana, T. (2012). Using N-gram and Frequent Max Substring Techniques for Index-Term Extraction from Non-Segmented Texts: A Comparison of Two Techniques. JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY, 3(1), 8-15.
Research Article: Soft Computing (Detail in Scope of Journal)


