Scalability and performance in duplicate detection: relational vs. graph database
| Main Author: | |
|---|---|
| Format: | Undergraduates Project Papers |
| Language: | en |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://umpir.ump.edu.my/id/eprint/46051/1/Scalability%20and%20performance%20in%20duplicate%20detection%20%20relational%20vs.%20graph%20database.pdf https://umpir.ump.edu.my/id/eprint/46051/ |
Summary:

Customer data is a vital asset for businesses, as it provides insight into the individuals and organizations that engage with a company. This information enables businesses to target their audience effectively, personalize their offerings, and optimize their marketing and sales strategies. However, the presence of duplicate data in customer records can lead to inaccuracies and inefficiencies in business operations. Traditional techniques for managing duplicate data in tabular formats, such as column matching and linkage keys, are limited by their inability to handle complex data relationships and by their reliance on predefined schemas. As a result, they may not be effective at identifying and resolving duplicate data in certain use cases. Graph-based algorithms, such as community detection, may prove beneficial because they can capture more complex relationships between data, enabling the discovery of intricate patterns and the management of vast datasets. This study assesses the scalability and performance of duplicate-detection algorithms available in prevalent graph database systems for identifying duplicate customer data, contrasting their efficacy with that of relational databases. The study found that PostgreSQL demonstrates exceptional scalability and efficiency in handling large datasets, outpacing both Neo4J and MySQL. However, Neo4J excels at exact-duplicate and near-duplicate detection, showcasing its strength in handling complex, interconnected data structures. The choice between PostgreSQL and Neo4J should therefore be driven by the task at hand: PostgreSQL is preferable for quickly identifying similar items in large datasets, while Neo4J is more suitable for tasks that involve discovering communities within intricate networks.
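The two approaches the summary contrasts can be sketched in Python. This is a minimal illustration, not the study's actual method: the field names, the normalization, and the similarity rule are assumptions, and connected components stand in for the more sophisticated community-detection algorithms a graph database would offer.

```python
from collections import defaultdict

def exact_duplicates(records, key_fields):
    # Relational-style "column matching": group records whose chosen
    # columns match exactly after simple normalization (the linkage key).
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f].strip().lower() for f in key_fields)
        groups[key].append(rec)
    return [g for g in groups.values() if len(g) > 1]

def near_duplicate_communities(records, similar):
    # Graph-style detection: link any two records the `similar` predicate
    # accepts, then take connected components via union-find. Each
    # component of size > 1 is a cluster of suspected duplicates.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similar(records[i], records[j]):
                parent[find(i)] = find(j)

    components = defaultdict(list)
    for i, rec in enumerate(records):
        components[find(i)].append(rec)
    return [c for c in components.values() if len(c) > 1]
```

For example, two records sharing an e-mail address but with differently formatted names would be missed by a strict linkage key on the name column alone, yet grouped into one community by the graph-style pass, which is the kind of case where the study found Neo4J's approach advantageous.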
