APC 技術ブログ

This is the technical blog of AP Communications Co., Ltd.

Databricks-03. Working with Delta Tables to Understand Delta Lake

Introduction

This is Abe from the Lakehouse Department of the GLB Division. A data lake is used as storage for accumulating raw data, but when data is migrated from a database to a data lake its meta information is lost, so there was a problem that the data lake became a black box that merely stores data. Delta Lake solves these issues. In this article, I would like to explain Delta Lake while working with Delta tables.

What is Delta Lake?

Here is a brief description of Delta Lake. Delta Lake is open source software and technology for building a data lakehouse on top of existing cloud storage. It supports database operations with ACID properties (ACID transactions), providing reliable and consistent data. The ACID properties are as follows.

  • Atomicity: A transaction is either executed in full or not at all; if it fails partway through, the table reverts to its original state.
  • Consistency: Transactions are executed according to rules defined for the table in advance, keeping the table consistent.
  • Isolation: Transactions executed by multiple users against the same table are isolated from one another and do not interfere with each other.
  • Durability: The property of withstanding system failures; once a transaction completes successfully, its data is preserved even if a failure occurs afterwards.

The default format for tables created in Databricks is Delta, and such tables are called Delta tables. Delta itself consists of parquet data files and a json transaction log (the Delta Log). The parquet files are read and converted to Delta format when the Delta table is created. The Delta Log will be discussed later. If you want to understand Delta Lake in more depth, please refer to the official Azure documentation.

What is Delta Lake
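As a supplementary, minimal sketch of that conversion step (the parquet path below is a hypothetical example, not one used elsewhere in this article), an existing parquet directory can also be converted to Delta in place with the CONVERT TO DELTA command.

-- Hypothetical path: convert an existing parquet directory to a Delta table in place
CONVERT TO DELTA parquet.`dbfs:/tmp/sample_parquet_data`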

Working with Delta Tables

We will use SQL to manipulate the Delta table. To create a new notebook, select your user name from Users in the workspace, right-click on the user name, and select Notebook from the displayed operation list.

SQL is the main language used in the created notebook, so select SQL from the Default Language pull-down.

Start by creating a Delta table called wine with the following schema.

Field name Field type
alcohol FLOAT
color STRING
volume FLOAT
delicious BOOLEAN

Create table

Create a wine table.

CREATE TABLE wine
(alcohol FLOAT, color STRING, volume FLOAT, delicious BOOLEAN)

I have created a table.

Insert records

Next, insert records into the Delta table you just created.

INSERT INTO wine VALUES
(15.1, 'red', 150, true),
(12.6, 'white', 180, true),
(10.9, 'amber', 40, false),
(16.1, 'yellow', 400, false);

SELECT * FROM wine

I was able to confirm that the records were inserted. You can also download the table data shown in the results as a CSV file from the table pulldown, or add it to a dashboard.

You can also click the + tab to see visualizations and data profiles.

Visualization is a feature that allows you to easily graph your data. Create a bar chart for each color of the wine table.

Data Profiles allow you to view the distribution and basic statistics of your data. However, it takes time to display, so be careful when the amount of data is large.

Update table

The wine with color amber currently has delicious set to false, but let's make it true. Update the table with UPDATE + table name, and use the SET clause to change the delicious column from false to true.

UPDATE wine
SET delicious = true
WHERE color = 'amber';

SELECT * FROM wine

I can confirm that the record has been updated.

Delete record

I would like to remove wines with “false” in the delicious column from the records.

DELETE FROM wine
WHERE delicious = false;

SELECT * FROM wine

I can confirm that the record has been deleted.

Upsert records by joining tables and views

Now let's create a new view and merge it with the existing wine table. If a wine of the same color already exists, add the volumes together and update the existing row of the wine table; if the wine is new and delicious, insert that record. First, create a view called new_wine.

CREATE OR REPLACE TEMP VIEW new_wine(alcohol, color, volume, delicious) AS VALUES
(15.1, 'red', 200, true),
(20.4, 'rose', 40.0, true),
(10.6, 'orange', 58.5, true),
(18.9, 'black', 300.2, false);

SELECT * FROM new_wine

A view has been created. Next, set conditions in the WHEN clauses and merge it into the wine table.

MERGE INTO wine a
USING new_wine b
ON a.color = b.color
WHEN MATCHED THEN
  UPDATE SET volume = a.volume + b.volume
WHEN NOT MATCHED AND b.delicious = true THEN
  INSERT *;
SELECT * FROM wine

I was able to add the volumes of the red wines together.

Delete table

Deleting a table is done with DROP TABLE + table name.

DROP TABLE wine

I deleted the table.

What is Delta Log

From here, we will work with the table using the Delta Log. The Delta Log is Delta Lake's transaction log, which records changes to a table (transactions) in chronological order. This is what enables atomicity guarantees, table metadata management, time travel, and so on. Time travel, described later, is a feature that lets you go back in time and refer to previous versions of a table. In the second half of this article, I will inspect the Delta Log and explain how to perform time travel.

Getting table information with the DESCRIBE command

Before checking the Delta Log of the wine table, recreate the table that was deleted earlier and run the transactions used so far again. This time, an id column was added and the MERGE condition was changed from matching on color to matching on id. A comment was also added to the alcohol column, since comments can be attached to columns when creating a table.

CREATE TABLE wine(
  id INT,
  alcohol FLOAT 
  COMMENT 'Low alcohol degree!',
  color STRING,
  volume FLOAT,
  delicious BOOLEAN);

INSERT INTO wine VALUES
  (1, 15.1, 'red', 150, true),
  (2, 12.6, 'white', 180, true),
  (3, 10.9, 'amber', 40, false),
  (4, 16.1, 'yellow', 400, false);

UPDATE wine 
SET delicious = true
WHERE color = 'amber';

DELETE FROM wine
WHERE delicious = false;

CREATE OR REPLACE TEMP VIEW new_wine(id, alcohol, color, volume, delicious) AS VALUES
(1, 15.1, 'red', 200, true),
(6, 20.4, 'rose', 40.0, true),
(7, 10.6, 'orange', 58.5, true),
(8, 18.9, 'black', 300.2, false);
  
MERGE INTO wine a
USING new_wine b
ON a.id = b.id
WHEN MATCHED THEN
  UPDATE SET volume = a.volume + b.volume
WHEN NOT MATCHED AND b.delicious = true THEN
  INSERT *;

Use DESCRIBE + table name to check the table information.

DESCRIBE wine

You can see the data type and comment for each column. If you want to check the information of a specific column only, write DESCRIBE + table name + database name.table name.column name.

DESCRIBE wine default.wine.alcohol

I was able to get the information of the alcohol column. If you want to check the meta information of the table, write DESCRIBE + EXTENDED + table name.

DESCRIBE EXTENDED wine

The execution results, downloaded as a CSV file, are shown below (they are hard to read in the UI because the results table needs to be scrolled).

I was able to get the meta information of the table. Each meta information is described below.

  • Catalog: catalog name
  • Database: database name
  • Table: table name
  • Type: The type of table. Table types include managed tables and unmanaged tables (external tables).
  • Location: where the table is stored
  • Provider: data format
  • Owner: table owner
  • Table Properties: Table properties. Delta tables are versioned by the Delta Log, and this shows the minimum reader version and minimum writer version of the table.

*Simply put, the difference between managed and unmanaged tables is whether Databricks manages both the table data and the meta information, or only the meta information. For more information, please refer to the following article, which explains it in an easy-to-understand way; a minimal example also follows the link.

Managed and unmanaged tables
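As a rough sketch of the difference (the LOCATION path below is a hypothetical example): a managed table lets Databricks manage both the data files and the metadata, while an unmanaged (external) table only registers metadata for data kept at a path you specify.

-- Managed table: Databricks manages both the data files and the metadata
CREATE TABLE wine_managed
(alcohol FLOAT, color STRING, volume FLOAT, delicious BOOLEAN);

-- Unmanaged (external) table: only metadata is managed; the data lives at the specified path
CREATE TABLE wine_external
(alcohol FLOAT, color STRING, volume FLOAT, delicious BOOLEAN)
LOCATION 'dbfs:/tmp/external/wine_external';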

Another command to check the meta information of a table is DESCRIBE + DETAIL + table name. Due to the display width, only part of the execution results is shown below.
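The command run here looks like this.

DESCRIBE DETAIL wine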

Location etc. can be confirmed in the same way as before.

Check Delta Logs

One way to check the Delta Log for a table is to use the File System utilities (dbutils.fs) of Databricks Utilities (dbutils). To check the Delta Log of the wine table, use the dbutils.fs.ls command to get a directory listing from the Databricks File System (DBFS). For details on dbutils and DBFS, please refer to the official Azure documentation.

Databricks Utilities

What is the Databricks File System (DBFS)?

Delta Logs are files stored in the _delta_log directory under the table's directory. You can specify the file path and display it as follows.

%python
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/wine/_delta_log"))

Since the default language of this notebook is SQL, when using Python you need to start the cell with the %python magic command, as in the code above. Magic commands are also summarized in the following article.

Databricks-02. From cluster creation to code execution - APC 技術ブログ

The execution results, downloaded as a CSV file, are shown below.

The _delta_log directory consists of two types of files (crc and json), one pair per transaction. Since the files are indexed from 0 to 4, we know that there have been 5 transactions.

I will explain each type of file. The checkpoint files (items 3 and 4 below) are created every 10 commits to the table; they do not exist here because only 5 transactions have been run. However, the checkpoint creation frequency is optimized, and a checkpoint is also created when the RESTORE command described later is executed.

  1. json file: The transaction log is recorded as json files, which are automatically created in the _delta_log directory when the Delta table is created.

  2. crc (Cyclic Redundancy Check) file: Contains statistics for each table version and is used to check data integrity.

  3. Checkpoint file: A parquet file into which the table's version information is aggregated. When referring to the table, only the Delta Log entries created after the checkpoint need to be read, which reduces computation cost. There is also a mechanism that deletes old transaction logs after a certain period of time.

  4. Last Checkpoint file: Stores a reference to the most recent checkpoint, reducing computation cost when retrieving the latest table state.

  5. __tmp_path_dir directory: Stores temporary data backups and temporary files during write processing. Because this directory only holds temporary files, the files no longer exist once writing is complete.

As an example, to check the Delta Log corresponding to the last transaction (the MERGE), we can refer to the json file with index 4.

SELECT * FROM json.`dbfs:/user/hive/warehouse/wine/_delta_log/00000000000000000004.json`

In the execution result you can check the add and remove columns. The remove column records files that were removed, for example by DELETE or, as in this case, by UPDATE, which rewrites the original data and removes the old files. The add column records files that were added, either when new data is inserted or when the data is rewritten without the deleted rows.
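As a minimal sketch building on the query above (assuming the standard Delta Log json fields), the file paths recorded in the add and remove entries can be pulled out directly.

-- Extract only the added and removed data file paths from the commit log entry
SELECT `add`.path AS added_file, `remove`.path AS removed_file
FROM json.`dbfs:/user/hive/warehouse/wine/_delta_log/00000000000000000004.json`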

Why small files are created and file layout optimization

As I mentioned at the beginning, Delta Lake consists of parquet data files and a transaction log (the Delta Log). A parquet file is generated every time data is added: for example, INSERT statements, streaming from a data source, or batch jobs writing output will keep creating new files without end. That is why the parquet files created over and over are described as small files. When creating a DataFrame from a Delta table, all of the generated parquet files are read. Therefore, a large number of small files degrades data loading efficiency, that is, query performance. The OPTIMIZE command compacts such small files into larger ones and optimizes the file layout. It takes ZORDER BY + column name as a parameter and optimizes the layout using the specified column(s). An image of the file layout before and after ZORDER execution is shown below.

(Quoted from "Support Z-Order clustering of data to reduce the amount of data read" in Delta 2.0 - The Foundation of your Data Lakehouse is Open | Delta Lake)

You can see that, when all files are read, the layout after OPTIMIZE appears much more efficient to read. As an example, let's run file optimization with ZORDER on the wine table. For the column, specify one with high cardinality (many distinct values); typical examples are customer IDs and product IDs.

OPTIMIZE wine
ZORDER BY id

This time the data is very small, so the effect of OPTIMIZE is not noticeable, but the more data there is, the more pronounced the improvement in query performance becomes.
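Although the effect is small here, one simple way to gauge it, reusing a command covered earlier, is to compare the numFiles and sizeInBytes values reported by DESCRIBE DETAIL before and after running OPTIMIZE.

-- numFiles and sizeInBytes show how many data files back the table and their total size
DESCRIBE DETAIL wine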

Restoring past table states with time travel

Time travel is the ability to view previous versions of data and restore tables to an earlier state; the Delta Log automatically versions every change made to your data. To see the transaction history, run DESCRIBE + HISTORY + table name.

DESCRIBE HISTORY wine

The processing contents of each transaction are displayed. You can see that every transaction, from the creation of the wine table through OPTIMIZE, is versioned. The current table version is 5, but if you want to refer to the version 3 table from before the MERGE, execute the following.

SELECT 
  * 
FROM 
  wine 
VERSION AS OF 3

This query only references version 3 of the table; the table itself has not been restored. If you want to restore the table, write RESTORE TABLE + table name + TO VERSION AS OF + version number.

RESTORE TABLE wine 
TO VERSION AS OF 3

I have shown how to restore a table by specifying a version, but you can also restore using the Delta Log timestamp. Display the timestamps at which the files under the _delta_log directory were created.

%python
!ls -lt /dbfs/user/hive/warehouse/wine/_delta_log

You can see that a checkpoint file has been created, which was not there before. Since the version of the table to restore is 3, we can see that the Delta Log as of 09:20 on March 13, 2023 should be referenced. By specifying the time with TIMESTAMP AS OF instead of the version number, you can restore the table to its state at that point in time.

RESTORE TABLE wine 
TO TIMESTAMP AS OF '2023-03-13 09:20:01'

Browse the wine table to see if you were able to restore to the previous version.

SELECT 
  * 
FROM 
  wine

I was able to restore the table.

Reference article

Delta 2.0 - The Foundation of your Data Lakehouse is Open

Conclusion

In this article, in order to understand Delta Lake, we worked through Delta table operations as well as checking table history and restoring tables using the Delta Log. Delta Lake itself consists of parquet data files and a json transaction log (the Delta Log), which is what enables ACID transactions and table restoration. I would like to continue posting details of verification using Databricks, so please take a look again. We provide a wide range of support, from the introduction of a data analysis platform using Databricks to support for bringing development in-house. If you are interested, please contact us.

www.ap-com.co.jp

We are also looking for people to work with us! We look forward to hearing from anyone who is interested in APC.

Translated by Johann
