APC Tech Blog

This is the technical blog of AP Communications Co., Ltd.

How to Migrate from Snowflake to an Open Data Lakehouse Using Delta Lake UniForm

Preface

Hello, I'm Jonathan Brideau, a Senior Product Manager at Databricks, leading initiatives centered around Delta Lake, particularly the Universal Format, or UniForm. Today's session addresses the increasingly common scenario of transitioning from traditional or proprietary systems like Snowflake to an open data lakehouse architecture using Delta Lake UniForm.

In this session, we will delve into how adopting Delta Lake UniForm expands the Delta Lake ecosystem, enabling broader tool integration. With this unique capability, once data is saved, it becomes accessible across any supported engine, optimizing data accessibility and streamlining both analytics and operational processes.

We will first identify the pressing challenges associated with proprietary data formats and explain why transitioning to an open data lakehouse built around Delta Lake UniForm is not only strategic but necessary for modern data-driven enterprises.

Thank you for joining us today. I look forward to exploring these transformative approaches, demonstrating the significant benefits of Delta Lake UniForm in overcoming common data storage and accessibility challenges. Let’s embark on this insightful journey together.

Challenges and Architectural Evolution

Wayne Enterprises: The Challenges Faced

Many organizations, including Wayne Enterprises, encounter issues accessing and integrating data across different platforms during the migration process. Lois, the star analyst at Wayne Enterprises, personifies this challenge. Although she had identified the optimal tools for her analytical needs, she struggled to access the necessary data because it was isolated in incompatible platforms.

This scenario highlights a common problem faced by many companies and raises questions about how such situations occur and what solutions exist.

Initial Cloud Transition

Reflecting on Wayne Enterprises' initial transition into cloud technology provides crucial insights into the origins of these challenges. Initially, the cloud architecture was not designed to manage multiple data sources and tools centrally, a standard expectation in modern frameworks. This oversight often led to the creation of data silos.

Exploring Solutions

Examining the architectural decisions that led to the current data management issues is critical in identifying effective solutions. Implementing technologies like Delta Lake UniForm enables integrated data access across various data platforms, mitigating these challenges.

Adopting a centralized architecture, moving away from the constraints imposed by data silos, represents an important step for many organizations. We will explore the technical and organizational challenges that arise during this transition and discuss strategies to effectively navigate these obstacles.

Establishing and Designing the New Architecture

  • Cost and Performance Optimization: One primary goal of the transition is to scale costs efficiently with the increased volume of data, allowing for better management of organizational data costs and effective allocation of needed resources.

  • Centralized Data Management: Another key focus is managing all data consistently in one place, without maintaining multiple pipelines across different data warehouses and data science stacks, thereby enhancing data integrity and accessibility.

  • Tool Access Rights: As principal analyst, Lois found that her preferred tool, DuckDB, could not read data locked in Snowflake's proprietary format. With UniForm, such tools can access the data directly, allowing Lois to use the best tool for each job.
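To make the third point concrete, the following sketch assembles the table-creation DDL that turns on UniForm's Iceberg metadata generation. The two table property names follow the Delta Lake/Databricks documentation; the table name, columns, and the `uniform_ddl` helper are illustrative, and actually running the DDL would require a Spark or Databricks session.

```python
# Documented Delta table properties that enable UniForm's Iceberg support.
UNIFORM_PROPS = {
    "delta.enableIcebergCompatV2": "true",            # Iceberg-compatible types/layout
    "delta.universalFormat.enabledFormats": "iceberg",  # also emit Iceberg metadata
}

def uniform_ddl(table: str) -> str:
    """Render a CREATE TABLE statement with UniForm enabled (illustrative schema)."""
    props = ", ".join(f"'{k}' = '{v}'" for k, v in UNIFORM_PROPS.items())
    return (
        f"CREATE TABLE {table} (id BIGINT, ts TIMESTAMP) "
        f"USING DELTA TBLPROPERTIES ({props})"
    )

print(uniform_ddl("analytics.events"))
```

A table created this way remains a normal Delta table for Delta clients, while Iceberg readers such as DuckDB or Snowflake can consume the same files through the generated Iceberg metadata.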

Based on these objectives, the next-generation data architecture design could significantly enhance Wayne Enterprises' data management efficiency and accelerate business growth. Increasing the availability and operability of data not only improves analytical precision but also fosters the discovery of new insights.

Governance and Implementation

This section focuses on "Governance and Implementation" in the setup of the data lakehouse, extending the use of the Delta Lake Universal Format (UniForm) beyond mere data storage to include catalog management.

  1. Simplified Data Migration: Deploying UniForm allows data to be saved once while integrating with any tool within the architecture. This approach eliminates the need for manual data transformation and reduces the risk of data duplication, thus improving overall data management efficiency.

  2. Coexistence of Metadata: Metadata plays a critical role and coexists with the data itself, for example, in S3 buckets. In this setup, directories for Delta logs and Iceberg metadata are maintained in parallel, ensuring seamless access regardless of the analysis engine used.

  3. Provision of a Unified Catalog: Establishing a catalog is a key component in the governance and implementation of Delta Lake's lakehouse architecture. This ensures consistent policy application across various engines and provides unified governance. However, carefully selecting the appropriate catalog is necessary to adequately meet an organization's strategic needs.
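The "coexistence of metadata" described in point 2 can be pictured as a single table directory holding both logs. The sketch below builds that layout with empty placeholder files in a temporary directory; the file names are typical of Delta and Iceberg conventions, but the real metadata contents are omitted.

```python
# Sketch of the on-disk layout a UniForm table produces: one set of Parquet
# data files, plus a Delta transaction log AND Iceberg metadata side by side.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "sales_table"
root.mkdir(parents=True)

# Data files, shared by both formats
(root / "part-00000.parquet").touch()

# Delta transaction log, read by Delta clients
(root / "_delta_log").mkdir()
(root / "_delta_log" / "00000000000000000000.json").touch()

# Iceberg metadata generated by UniForm, read by Iceberg clients
(root / "metadata").mkdir()
(root / "metadata" / "v1.metadata.json").touch()

for p in sorted(root.rglob("*")):
    print(p.relative_to(root))
```

Because both logs point at the same Parquet files, the data is stored once yet remains readable from either ecosystem.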

This section outlines effective strategies to streamline technical data transformations and avoid duplication, promoting a more efficient approach to data governance and management. Delta Lake UniForm enables these technical mechanisms, ensuring efficient operations across a diverse data ecosystem.

By leveraging tools available on the Databricks AI platform and using Delta Lake Universal Format (UniForm), it becomes feasible to store data centrally while ensuring accessibility across various engines, including Snowflake and DuckDB. This approach eliminates the need for multiple, separate stacks for BI, AI, and ML workloads, creating a unified layer that effectively supports these activities.

A key architectural goal was efficiently scaling costs with increased data volumes. This goal was successfully achieved by addressing challenges frequently encountered with data loading and ETL processes within Snowflake, ultimately leading to substantial cost reductions. Data stored in an open format in customer-specific buckets simplifies access and utilization between different engines, streamlining operations.

The transition to a managed data lakehouse using Delta Lake UniForm not only optimizes data handling but also significantly boosts cost efficiency and performance. A notable benefit is the seamless integration and navigation between diverse analytical tools, which substantially reduces IT infrastructure complexity.

From the discussion in this section, the key takeaway is that by effectively utilizing a single data format to access various data sources and tools, enterprises can pursue a more flexible and robust data strategy. This strategic approach aligns well with the initial session goals, ensuring cost-effective scalability and improved operational efficiency in data-intensive environments.

Looking Ahead: Broad Utilization and Extensive Use Cases

This section positions Delta Lake's Universal Format within the broader vision for the future of data management, with special emphasis on integration with Unity Catalog. Unity Catalog exposes Iceberg REST Catalog API endpoints, making UniForm tables accessible through a common open API regardless of the catalog backend, so any Iceberg reader can interact with the same interface. Unity Catalog already supports this capability, including connections from Snowflake.
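As a rough illustration of that open API, the sketch below builds the URL an Iceberg REST client would call first. The workspace hostname is a hypothetical placeholder; the `/api/2.1/unity-catalog/iceberg` prefix follows the Databricks documentation, and `/v1/config` is the standard discovery route in the Iceberg REST Catalog specification. No network call is made here.

```python
# Sketch: targeting Unity Catalog's Iceberg REST Catalog API from a client.
from urllib.parse import urljoin

WORKSPACE = "https://example-workspace.cloud.databricks.com"  # placeholder
ICEBERG_REST_PREFIX = "/api/2.1/unity-catalog/iceberg/"

def config_url(workspace: str) -> str:
    """URL an Iceberg REST client calls first to discover catalog settings."""
    return urljoin(workspace, ICEBERG_REST_PREFIX) + "v1/config"

print(config_url(WORKSPACE))
```

Any engine that speaks the Iceberg REST protocol can point at such an endpoint and read UniForm tables without knowing which catalog backs it.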

This session demonstrates how this technological advancement can enhance future-proofing and envision various use cases. The transition from a data lake to a lakehouse signifies more than mere technology transfer; it embodies maximizing data flexibility and accessibility, enabling enterprises to use data more strategically.

About the special site during DAIS

This year, we have set up a special site to report on the session content and the on-site atmosphere at DAIS! We plan to update the blog every day during DAIS, so please take a look.

www.ap-com.co.jp