I'm Sasaki from the Global Engineering Department of the GLB Division. I attended Data + AI Summit 2023 (DAIS) in person and wrote this article summarizing the content of one of the sessions.
Articles about the DAIS sessions are collected on the special site below.
How to build a modern data pipeline with Delta Live Tables
This time, I would like to walk through the session "Delta Live Tables A to Z: Best Practices for Modern Data Pipelines" in an easy-to-understand way. The talk introduces how to create and update datasets using Delta Live Tables (DLT), a system that integrates Spark SQL, Structured Streaming, and Delta. I think it will be of great interest to data engineers, data scientists, and other data professionals.
This blog consists of two parts, and this is Part 1. In it, we introduce Delta Live Tables and cover the complexity of data pipelines, the types of datasets, and how data quality is ensured.
Introducing Delta Live Tables and the Complexity of Data Pipelines
The talk introduced a system called Delta Live Tables, which integrates Spark SQL, Structured Streaming, and Delta to create and update datasets. It automatically handles dependency management, quality checks, governance, and version control, resolving much of the complexity of data pipelines.
Features of Delta Live Tables
Delta Live Tables has the following features:
- A system that integrates Spark SQL, Structured Streaming, and Delta
- Ability to create and update datasets
- Dependency management, quality checks, governance, version control, and more are handled automatically
This solves the complexity of data pipelines and enables efficient data processing.
Resolving Data Pipeline Complexity

A data pipeline is a series of processes such as data collection, transformation, storage, and analysis. These processes are often built from a combination of multiple systems and tools, which adds to the complexity of the pipeline. Delta Live Tables resolves this complexity by providing features such as:
- Dependency management: automatically tracks how each step in the data pipeline relates to the others.
- Quality checks: automatically validates your data and notifies you if there are any issues.
- Governance: centrally manages data access rights and usage policies to achieve proper governance.
- Versioning: automatically manages dataset versions and even lets you access past versions.
These features solve the complexity of data pipelines and enable efficient data processing.
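To make the dependency-management idea concrete, here is a minimal pure-Python sketch (not the actual DLT engine) of how an update order can be derived automatically from declared dependencies; the dataset names are hypothetical:

```python
# Illustrative sketch, not the DLT engine: derive the update order of a
# pipeline automatically from each dataset's declared dependencies.
from graphlib import TopologicalSorter

# Each (hypothetical) dataset maps to the set of datasets it reads from.
dependencies = {
    "bronze_events": set(),
    "silver_events": {"bronze_events"},
    "daily_report": {"silver_events"},
}

# A topological sort yields an order in which every dataset is updated
# only after the datasets it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['bronze_events', 'silver_events', 'daily_report']
```

In DLT you only declare what each dataset reads; the engine infers this graph and ordering for you.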
Dataset Types and Ensuring Data Quality
The talk then covered dataset types and ensuring data quality. Two types of datasets are provided, streaming tables and materialized views, and a mechanism called expectations was introduced, in which a boolean expression is evaluated for each row to ensure data quality. The speaker also explained that you can create tables, define schemas, and cast types using the new streaming table syntax.
Streaming tables and materialized views
There are two types of datasets: streaming tables and materialized views.
- Streaming table: a table that is updated incrementally as new data arrives, allowing you to continuously add and update data.
- Materialized view: a table that stores precomputed query results for fast reads.
Each of these datasets serves a different purpose, so choosing the right dataset is important.
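The difference between the two dataset types can be sketched in plain Python (this is an illustration of the concept, not the DLT API; all names are hypothetical):

```python
# Illustrative sketch, not the DLT API: contrast how the two dataset
# types are kept up to date.

source = [1, 2, 3]

# A "streaming table" processes each newly arrived record once and
# appends the result; existing rows are left untouched.
streaming_table = []

def stream_update(new_rows):
    streaming_table.extend(row * 10 for row in new_rows)

# A "materialized view" stores a precomputed result and is kept fresh
# by recomputing it from its source.
def refresh_materialized_view():
    return sorted(row * 10 for row in source)

stream_update(source)        # initial load processes all existing rows
source.append(4)
stream_update([4])           # incremental update: only the new row
materialized_view = refresh_materialized_view()  # recomputed from source

print(streaming_table)    # [10, 20, 30, 40]
print(materialized_view)  # [10, 20, 30, 40]
```

Both end up with the same contents here; the difference is the update model, which is why streaming tables suit append-heavy ingestion while materialized views suit precomputed query results.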
Ensuring data quality with expectations
A boolean expression called an expectation is evaluated for each row to ensure data quality. An expectation is a conditional expression that judges whether the data is valid, and it has the following characteristics.
- Expectations are evaluated for each row in the dataset, preserving the quality of the data.
- When an expectation is not met, the violating rows can be retained with a warning, dropped, or cause the update to fail, improving the reliability of the data.
Setting expectations allows you to operate your dataset while ensuring data quality.
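As a minimal sketch of the idea (again plain Python, not the DLT API; the field names and predicate are made up for illustration), an expectation is just a predicate applied to every row, with failing rows handled by a policy such as "drop":

```python
# Illustrative sketch, not the DLT API: an expectation is a boolean
# expression evaluated per row; here we apply a "drop" policy and count
# violations for monitoring.

rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},   # violates the expectation below
    {"id": 3, "age": 58},
]

# The expectation: a predicate that must hold for each row.
def expect_valid_age(row):
    return 0 <= row["age"] <= 120

# "Drop" policy: keep only rows satisfying the expectation,
# and record how many were rejected.
kept = [r for r in rows if expect_valid_age(r)]
dropped = len(rows) - len(kept)

print(kept)     # rows with id 1 and 3
print(dropped)  # 1
```

In DLT the violation counts are surfaced automatically as pipeline metrics, so you can monitor data quality over time.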
Leverage new streaming table syntax
Using the new streaming table syntax, you can do things like:
- Create a table: You can create a streaming table or a materialized view.
- Schema definition: You can define the schema of the table to clarify the structure of the data.
- Type casting: You can convert data types and keep data consistent.
Leveraging these features makes it easier to create and update datasets and helps ensure data quality.
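The schema-definition and type-casting steps above can be sketched conceptually in plain Python (this is not the DLT streaming table syntax itself; the schema and field names are hypothetical):

```python
# Illustrative sketch, not DLT syntax: declare a schema, then cast
# incoming string fields to the declared types so the data stays
# consistent across the table.

schema = {"id": int, "amount": float, "country": str}

def cast_row(raw):
    # Cast each field to the type declared in the schema.
    return {name: typ(raw[name]) for name, typ in schema.items()}

raw_rows = [
    {"id": "1", "amount": "19.99", "country": "JP"},
    {"id": "2", "amount": "5.00", "country": "US"},
]

table = [cast_row(r) for r in raw_rows]
print(table[0])  # {'id': 1, 'amount': 19.99, 'country': 'JP'}
```

In the actual streaming table syntax you declare the schema and casts once, and the engine applies them to every batch of arriving data.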
In this article, I introduced Delta Live Tables and covered the complexity of data pipelines, the types of datasets, and ensuring data quality. You can use this knowledge to build modern data pipelines more effectively. In Part 2, we will discuss efficient data processing and pipeline optimization.
This content is based on reports from members participating in DAIS sessions on site. During the DAIS period, articles related to the sessions will be posted on the special site below, so please take a look.
Translated by Johann
Thank you for your continued support!