MARC View: Training data selection for record linkage classification

This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs the unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions we...

Bibliographic Details
Main Authors:	Zaturrawiah Ali Omar, Zamira Hasanah Zamzuri, Noratiqah Mohd Ariff, Mohd Aftar Abu Bakar
Format:	Article
Language:	English
Published:	MDPI AG 2023
Subjects:	QA1-939 Mathematics QA75.5-76.95 Electronic computers. Computer science
Online Access:	https://eprints.ums.edu.my/id/eprint/42203/1/ABSTRACT.pdf https://eprints.ums.edu.my/id/eprint/42203/2/FULL%20TEXT.pdf https://eprints.ums.edu.my/id/eprint/42203/ https://doi.org/10.3390/sym15051060

id	my.ums.eprints.42203
record_format	eprints
spelling	my.ums.eprints.42203 2024-12-10T06:57:01Z https://eprints.ums.edu.my/id/eprint/42203/ Training data selection for record linkage classification Zaturrawiah Ali Omar Zamira Hasanah Zamzuri Noratiqah Mohd Ariff Mohd Aftar Abu Bakar QA1-939 Mathematics QA75.5-76.95 Electronic computers. Computer science This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs the unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, with both balanced (symmetry) and imbalanced (asymmetry) distributions tested. The top and imbalanced construction was found to be the most effective in producing training data with 100% correct labels. Random forest and support vector machine classification algorithms were compared, and random forest with the top and imbalanced construction produced an F1-score comparable to probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F1-score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising the creation of high-quality training data, this new approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications. MDPI AG 2023 Article NonPeerReviewed text en https://eprints.ums.edu.my/id/eprint/42203/1/ABSTRACT.pdf text en https://eprints.ums.edu.my/id/eprint/42203/2/FULL%20TEXT.pdf Zaturrawiah Ali Omar and Zamira Hasanah Zamzuri and Noratiqah Mohd Ariff and Mohd Aftar Abu Bakar (2023) Training data selection for record linkage classification. Symmetry, 15. pp. 1-17. https://doi.org/10.3390/sym15051060
institution	Universiti Malaysia Sabah
building	UMS Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaysia Sabah
content_source	UMS Institutional Repository
url_provider	http://eprints.ums.edu.my/
language	English English
topic	QA1-939 Mathematics QA75.5-76.95 Electronic computers. Computer science
spellingShingle	QA1-939 Mathematics QA75.5-76.95 Electronic computers. Computer science Zaturrawiah Ali Omar Zamira Hasanah Zamzuri Noratiqah Mohd Ariff Mohd Aftar Abu Bakar Training data selection for record linkage classification
description	This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs the unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, with both balanced (symmetry) and imbalanced (asymmetry) distributions tested. The top and imbalanced construction was found to be the most effective in producing training data with 100% correct labels. Random forest and support vector machine classification algorithms were compared, and random forest with the top and imbalanced construction produced an F1-score comparable to probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F1-score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising the creation of high-quality training data, this new approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications.
format	Article
author	Zaturrawiah Ali Omar Zamira Hasanah Zamzuri Noratiqah Mohd Ariff Mohd Aftar Abu Bakar
author_facet	Zaturrawiah Ali Omar Zamira Hasanah Zamzuri Noratiqah Mohd Ariff Mohd Aftar Abu Bakar
author_sort	Zaturrawiah Ali Omar
title	Training data selection for record linkage classification
title_short	Training data selection for record linkage classification
title_full	Training data selection for record linkage classification
title_fullStr	Training data selection for record linkage classification
title_full_unstemmed	Training data selection for record linkage classification
title_sort	training data selection for record linkage classification
publisher	MDPI AG
publishDate	2023
url	https://eprints.ums.edu.my/id/eprint/42203/1/ABSTRACT.pdf https://eprints.ums.edu.my/id/eprint/42203/2/FULL%20TEXT.pdf https://eprints.ums.edu.my/id/eprint/42203/ https://doi.org/10.3390/sym15051060
_version_	1818835189031239680
score	13.251813
