Improving language identification of web page using optimum profile

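Language is an indispensable tool for human communication, and at present the language that dominates the Internet is English. Language identification is the process of automatically determining which of a predetermined set of languages a given piece of content is written in (e.g., English, Malay, Danish, Estonian, Czech, Slovak, etc.). The ability to identify other languages in relation to English is highly desirable, and the goal of this research is to improve the method used to achieve this. Three methods are studied: distance measurement, the Boolean method, and the proposed method, namely optimum profile. Initial experiments found that distance measurement and the Boolean method are not reliable for identifying the language of European web pages. The authors therefore propose optimum profile, which uses N-gram frequency and N-gram position to identify the language of a web page. The results show that the proposed method gives the highest performance, with an accuracy of 91.52%.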
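The record's description mentions N-gram frequency and N-gram position but does not define the methods in detail. As a rough illustration only, the sketch below shows one common way to combine the two ideas: build a ranked character N-gram profile per language and compare rank positions with an "out-of-place" distance. This is an assumption about what a distance-based baseline could look like, not the authors' optimum profile method. The function names, the `n_max` and `top_k` parameters, and the toy training sentences are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character N-grams (1..n_max) to its rank."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    )
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place_distance(doc_profile, lang_profile):
    """Sum rank differences; N-grams absent from the language profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place_distance(doc, lang_profiles[lang]),
    )


# Toy training samples (hypothetical; real profiles need far more text).
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "malay": "saya suka makan nasi goreng dan minum teh tarik di kedai makan itu",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(identify("the dog and the cat", profiles))
```

With profiles this small the result is not reliable; the sketch only shows how frequency (which N-grams enter the profile) and position (their rank) could both contribute to a decision.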

Bibliographic Details
Main Authors: Ng, Choon-Ching; Selamat, Ali
Format: Article
Published: Springer 2011
Subjects: P Language and Literature
Online Access: http://eprints.utm.my/id/eprint/44976/
id my.utm.44976
record_format eprints
spelling my.utm.44976 2017-01-31T06:09:26Z http://eprints.utm.my/id/eprint/44976/ P Language and Literature. Springer 2011. Article. PeerReviewed. Ng, Choon-Ching and Selamat, Ali (2011) Improving language identification of web page using optimum profile. Communications in Computer and Information Science, 180 (2). pp. 157-166. ISSN 1865-0929
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
topic P Language and Literature
description Language is an indispensable tool for human communication, and at present the language that dominates the Internet is English. Language identification is the process of automatically determining which of a predetermined set of languages a given piece of content is written in (e.g., English, Malay, Danish, Estonian, Czech, Slovak, etc.). The ability to identify other languages in relation to English is highly desirable, and the goal of this research is to improve the method used to achieve this. Three methods are studied: distance measurement, the Boolean method, and the proposed method, namely optimum profile. Initial experiments found that distance measurement and the Boolean method are not reliable for identifying the language of European web pages. The authors therefore propose optimum profile, which uses N-gram frequency and N-gram position to identify the language of a web page. The results show that the proposed method gives the highest performance, with an accuracy of 91.52%.
format Article
author Ng, Choon-Ching
Selamat, Ali
title Improving language identification of web page using optimum profile
publisher Springer
publishDate 2011
url http://eprints.utm.my/id/eprint/44976/