Data lakehouse architecture for self-service data analytics
| Main Author: | |
|---|---|
| Format: | Undergraduates Project Papers |
| Language: | English |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://umpir.ump.edu.my/id/eprint/46041/1/Data%20lakehouse%20architecture%20for%20self-service%20data%20analytics.pdf<br>https://umpir.ump.edu.my/id/eprint/46041/ |
| Summary: | This project focuses on designing and implementing a Data Lakehouse architecture to facilitate self-service analytics. The objectives include creating a collaborative analytics environment, streamlining the management of multiple ETL processes, adopting a cost-effective and non-proprietary architecture, integrating with BI tools, ensuring high query performance for interactive visualization, enabling data warehousing capabilities, and offering a self-service data discovery and metadata platform. The project followed an iterative development methodology comprising requirement gathering and planning, design, implementation, testing, deployment, and maintenance phases. The logical design consists of six layers: data ingestion, storage, catalog, semantic, processing, and consumption. The physical design used Dremio as the core component, with Apache Iceberg as the data format and the Arrow Flight engine for query processing. The project also adopted an Integrated Multi-Zone Analytics Framework to handle different data tasks and workloads. The implementation was performed using Docker, and testing validated the achievement of the objectives. Deployment was done on an Azure Kubernetes Service (AKS) cluster, although the deployment of the Dremio cluster was hindered by budget and resource limitations. Maintenance activities included security, backup, node monitoring, cost-usage tracking, configuration tuning, metadata management, and training and support. The project concludes that the objectives were achieved and suggests future enhancements, such as using Project Nessie for catalog storage, considering Dremio Enterprise Edition for advanced features, and exploring Databricks and MLflow if extensive machine learning workloads are expected. |
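
The summary notes that the implementation was performed using Docker. As a rough illustration only, not the project's actual configuration, a single-node Dremio Community Edition instance can be stood up with a compose file along these lines; the `dremio/dremio-oss` image and the three ports are Dremio's published defaults, while the volume name is arbitrary:

```yaml
# Minimal local sketch of the stack's core component (assumed defaults, not the project's config).
services:
  dremio:
    image: dremio/dremio-oss
    ports:
      - "9047:9047"    # web UI
      - "31010:31010"  # JDBC/ODBC, the usual BI-tool connection path
      - "32010:32010"  # Arrow Flight endpoint for high-throughput queries
    volumes:
      - dremio-data:/opt/dremio/data   # persist metadata and reflections across restarts
volumes:
  dremio-data:
```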

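The query path the summary describes, Dremio's Arrow Flight engine serving interactive clients, can be exercised from Python with `pyarrow`. This is a generic sketch rather than code from the project; the host, credentials, and dataset path `lakehouse.sales.orders` are placeholders:

```python
# Sketch: query a Dremio coordinator over Arrow Flight (placeholder host/credentials/table).
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://localhost:32010")

# Basic auth returns a bearer-token header to attach to subsequent calls.
token = client.authenticate_basic_token("admin", "example-password")
options = flight.FlightCallOptions(headers=[token])

query = "SELECT * FROM lakehouse.sales.orders LIMIT 10"  # hypothetical dataset path
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Fetch the result stream from the first endpoint; rows arrive as Arrow record batches.
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table)
```

Because results stream back as columnar Arrow batches rather than row-serialized ODBC results, this path is the mechanism behind the summary's objective of high query performance for interactive visualization.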