Introduction
This is Chen from the Lakehouse Department of the GLB Business Division. This article is based on a report by Mr. Gibo, who attended Data + AI Summit 2023 (DAIS) on site. In the session "Optimizing Batch and Streaming Aggregations", independent Apache Spark and Databricks expert Jacek Laskowski gave an overview of Spark SQL and Spark Structured Streaming in the latest version of Apache Spark, 3.4.1. The article is aimed at business people interested in data processing and AI, as well as anyone interested in Spark and Delta Lake.
Spark SQL
Spark SQL is an Apache Spark module built on top of the distributed computation model of the RDD API. It lets you write queries using SQL or the DataFrame API, in a style familiar from SQL and Pandas. Specifically, it has the following features:
- The ability to write queries using SQL and DataFrames
- Access to a variety of data sources via the Data Source API
- High-speed processing through query optimization
This allows for simpler and more efficient data processing than programs using the traditional RDD API.
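As a minimal sketch of these features (with made-up data), the following Scala snippet writes the same aggregation with the DataFrame API and as a SQL query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A small in-memory DataFrame (illustrative data).
val sales = Seq(("apple", 3), ("banana", 5), ("apple", 2)).toDF("item", "qty")

// DataFrame API: group and aggregate.
sales.groupBy("item").sum("qty").show()

// The equivalent SQL query against a temporary view.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT item, SUM(qty) AS total FROM sales GROUP BY item").show()
```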
Spark Structured Streaming
Structured Streaming is an extension of Spark SQL for writing streaming queries, which enables near-real-time data processing. It has the following three features:
- Capable of processing streaming data
- Features specialized for stream processing, such as windowed aggregations and watermarks
- Queries can be written in the same way as batch queries
These facilitate real-time data analysis and application development.
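As a minimal sketch of these features (reusing the SparkSession spark from the previous snippet), the code below uses Spark's built-in rate source, which generates test rows with a timestamp column; the window and watermark durations are arbitrary example values:

```scala
import org.apache.spark.sql.functions.{col, window}

// The built-in "rate" source emits (timestamp, value) rows for testing.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// A windowed aggregation with a watermark to bound streaming state.
val counts = stream
  .withWatermark("timestamp", "1 minute")
  .groupBy(window(col("timestamp"), "30 seconds"))
  .count()

// The same groupBy/count would work on a batch DataFrame;
// only readStream/writeStream differ from the batch version.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()
```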
Latest features and services
Apache Spark 3.4.1 enables even more optimized batch and streaming processing. The specific features are as follows:
- Query optimization with Adaptive Query Execution (AQE)
- Fast streaming processing with Delta Lake support
- Accelerating real-time processing by improving Structured Streaming
These features result in more efficient and faster data processing.
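As a sketch of how these show up in code (again reusing the SparkSession from above): AQE is driven by configuration and has been enabled by default since Spark 3.2, and Delta Lake tables can be read incrementally as a stream, assuming the delta-spark package is on the classpath; the table path below is a placeholder:

```scala
// Adaptive Query Execution (enabled by default since Spark 3.2;
// set explicitly here only for illustration).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// Read a Delta Lake table incrementally as a stream
// (requires the delta-spark package; the path is a placeholder).
val events = spark.readStream
  .format("delta")
  .load("/tmp/delta/events")
```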
Query execution process
Before executing a query, Apache Spark must resolve the unresolved logical plan. This is the job of the Analyzer. The resolved plan is then rewritten by the logical optimizer so that the query execution plan is efficient.
Specifically, this phase provides functionality such as:
- Resolving unresolved attributes and table references (the Analyzer)
- Applying rule-based query optimizations
- Rewriting logical operators into more efficient forms
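A simple way to observe these phases is to print a query's plans. The sketch below (a toy query, reusing the SparkSession from above) shows the analyzed and optimized logical plans produced before a physical plan is selected:

```scala
// A toy aggregation query.
val df = spark.range(10).selectExpr("id % 2 AS key", "id AS value")
val agg = df.groupBy("key").sum("value")

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
agg.explain(extended = true)

// The individual plans are also accessible programmatically.
println(agg.queryExecution.analyzed)
println(agg.queryExecution.optimizedPlan)
```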
Aggregation types and optimization
The talk introduced the different types of aggregation possible. Specifically, the following aggregations are available:
- basic aggregation
- multidimensional aggregation
- legal aggregation
- window aggregation
Each of these aggregation operations has an optimal physical operator. HashAggregateExec is the preferred physical operator for hash-based aggregation and the fastest aggregation option in Spark SQL. Under memory pressure, however, execution can fall back to sort-based aggregation. In that case, the speaker noted, increasing the memory available to the executors may improve performance.
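To see which physical operator Spark picks for an aggregation, you can inspect the physical plan. In this sketch, a sum grouped by a numeric key typically compiles to HashAggregateExec:

```scala
val data = spark.range(1000000L).selectExpr("id % 100 AS key", "id AS value")

// explain() prints the physical plan; for this query it typically shows
// HashAggregate operators (a partial aggregation followed by a final one).
data.groupBy("key").sum("value").explain()
```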
Problems and countermeasures
Apache Spark also has some limitations, such as memory limits and restrictions on user-defined functions, and appropriate optimizations and alternatives were proposed to address them. It was also recommended to rewrite a problematic streaming query as the corresponding batch query in order to identify the root cause of an issue.
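As a sketch of that debugging technique, the same aggregation can be written both as a streaming query and as an equivalent batch query over the same source; the Delta table path and the key column below are placeholders:

```scala
// Streaming version of the aggregation.
val streamingCounts = spark.readStream
  .format("delta")
  .load("/tmp/delta/events") // placeholder path
  .groupBy("key")            // placeholder column
  .count()

// The corresponding batch query over the same source: easier to debug,
// because results and plans can be inspected directly.
val batchCounts = spark.read
  .format("delta")
  .load("/tmp/delta/events")
  .groupBy("key")
  .count()

batchCounts.explain()
batchCounts.show()
```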
Summary
This talk provided an introduction to Spark SQL and Structured Streaming in Apache Spark 3.4.1. Taking advantage of the latest features and services makes data analysis and application development easier, and concrete methods and alternatives were proposed for common problems. Use this knowledge to process data efficiently with Apache Spark!
Conclusion
This article is based on reports from members who attended the DAIS sessions on site. During the DAIS period, articles related to the sessions will be posted on the special site below, so please take a look.
Translated by Johann