Scalability and performance in duplicate detection: relational vs. graph database


Bibliographic Details
Main Author: Muhammad Farhad, Khaharruddin
Format: Undergraduate Project Paper
Language: English
Published: 2023
Subjects:
Online Access:https://umpir.ump.edu.my/id/eprint/46051/1/Scalability%20and%20performance%20in%20duplicate%20detection%20%20relational%20vs.%20graph%20database.pdf
https://umpir.ump.edu.my/id/eprint/46051/
Description
Summary: Customer data is a vital asset for businesses, as it provides insight into the individuals and organizations that engage with a company. This information enables businesses to target their audience effectively, personalize their offerings, and optimize their marketing and sales strategies. However, the presence of duplicate data in customer records can lead to inaccuracies and inefficiencies in business operations. Traditional techniques for managing duplicate data in tabular formats, such as column matching and linkage keys, are limited by their inability to manage complex data relationships and by their reliance on predefined schemas. As a result, they may not be effective at identifying and resolving duplicate data in certain use cases. Graph-based algorithms, such as community detection, may prove beneficial because they can capture more complex relationships between data. This approach enables the discovery of intricate patterns and the management of vast datasets. This study assesses the scalability and the performance of algorithms available on prevalent graph database systems for identifying duplicate customer data within those databases. To evaluate the proposed technique, its efficacy is contrasted with the performance of relational databases. The study found that PostgreSQL demonstrates exceptional scalability and efficiency in handling large datasets, outpacing both Neo4J and MySQL. However, Neo4J excels in exact-duplicate and near-duplicate detection algorithms, showcasing its strength in handling complex, interconnected data structures. The choice between PostgreSQL and Neo4J should therefore be based on the specific task requirements: PostgreSQL is preferable for quickly identifying similar items in large datasets, while Neo4J is more suitable for tasks that involve discovering communities within intricate networks.
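To make the two detection styles the abstract contrasts concrete, the following is a minimal, hypothetical sketch (not from the study itself): exact-duplicate detection via a normalized linkage key, as in the tabular approach, and near-duplicate detection via pairwise string similarity, which approximates the relationship-centric matching a graph database supports. The sample records, field names, and similarity threshold are illustrative assumptions.

```python
# Illustrative sketch only; records and threshold are invented for the example.
from collections import defaultdict
from difflib import SequenceMatcher

customers = [
    {"id": 1, "name": "Alice Tan",  "email": "alice@example.com"},
    {"id": 2, "name": "alice tan",  "email": "ALICE@EXAMPLE.COM"},  # exact duplicate after normalization
    {"id": 3, "name": "Alicia Tan", "email": "alicia@example.com"}, # near duplicate of 1 and 2
]

def exact_duplicates(records):
    """Group records on a normalized linkage key (lowercased name + email),
    the classic tabular/relational technique."""
    groups = defaultdict(list)
    for r in records:
        key = (r["name"].strip().lower(), r["email"].strip().lower())
        groups[key].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

def near_duplicates(records, threshold=0.8):
    """Flag record pairs whose normalized names are similar but not identical,
    a simple stand-in for similarity-based matching over linked data."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a = records[i]["name"].strip().lower()
            b = records[j]["name"].strip().lower()
            ratio = SequenceMatcher(None, a, b).ratio()
            if threshold <= ratio < 1.0:
                pairs.append((records[i]["id"], records[j]["id"]))
    return pairs
```

In a relational database the first function corresponds to a `GROUP BY ... HAVING COUNT(*) > 1` query over a normalized key column, while the second is closer to the graph-style workload the study evaluates on Neo4J, where candidate pairs become edges whose clusters can then be found with community detection.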