APC Tech Blog


The technical blog of AP Communications Co., Ltd.

Databricks Streaming: Project Lightspeed Goes Hyperspeed


The session opened by introducing what real-time data processing is and how it relates to today's business environment.

Real-time data processing means analyzing data continuously as it arrives and delivering the results immediately, which allows prompt business decisions. Advances in ultra-low-latency processing have significantly improved its performance.

The session then covered the primary objectives of Databricks' Project Lightspeed. The project focuses on simplifying the ingestion and processing of real-time data, which makes business operations more efficient and streamlines decision-making.

Towards the end, the session went deeper into how real-time data is generated, captured, and processed. It emphasized how these steps affect business operations and why they are crucial in the modern business environment.

After this detailed walkthrough, the session concluded with the announcement of the latest AI tools. They are designed to provide improved ultra-low-latency processing and are likely to drive next-generation streaming data processing and redefine industry standards.

In summary, this session was quite beneficial for understanding the trends in real-time data processing, the importance of ultra-low latency processing, and how Project Lightspeed is specifically contributing to these elements.

A Closer Look at Project Lightspeed: Advanced Features Pave the Way for Next-Gen Stream Processing

In the world of data streaming, Databricks' Project Lightspeed is showing remarkable progress. It adds features while keeping usability a priority, and this user-friendly approach sets the platform apart in stream data processing.

As highlighted in the session, many customers are currently leveraging streaming for about 5-10 significant use cases. However, with the evolution of Project Lightspeed, this number is expected to increase to the hundreds. As its services expand, Databricks promises to continue offering a seamless experience.

The session also highlighted an IDC MarketScape report that positions Databricks as a leading platform for stream processing. In particular, the report rates highly Databricks' ability to simplify streaming data processing, reduce the amount of code required, and make deployment and orchestration smooth. This recognition reaffirms Databricks' commitment to delivering an integrated, advanced experience to its users.

Databricks chooses to bring all the data it handles into one platform rather than splitting its services. As a result, the feature set stays coherent: the platform's simplicity and ease of use are not compromised even as new features are added.

The future of the stream data processing industry is undoubtedly promising. With projects like Project Lightspeed leading the way, an increase in business use cases can be anticipated. As a reporter focused on data and AI, I look forward to observing advancements in this exciting realm. This session confirmed my belief that Databricks' innovative approach in stream processing design is setting a new industry standard.

Progress of the Project and Benchmark Data

Year after year, Project Lightspeed has made significant progress in handling complex use cases. The data flow itself is strikingly simple: load data from topics, process it as quickly as possible, and immediately write it to a Delta table. This flow was designed to make the most of the Delta table's durability and its suitability for further transformation.
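The flow described above can be sketched with Spark Structured Streaming roughly as follows. This is a minimal sketch, not the benchmark code from the session: it assumes the "topics" are Kafka topics, and the function names, broker address, table name, and checkpoint path are hypothetical placeholders.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's built-in Kafka source (all values are placeholders)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_topic_to_delta(spark, bootstrap_servers, topic, table_name, checkpoint_dir):
    """Read a topic as a stream, lightly transform it, and append it to a Delta table.

    `spark` is an existing SparkSession with the Kafka and Delta connectors available
    (an assumption of this sketch). Returns the running StreamingQuery.
    """
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the value as binary; cast it and keep the broker timestamp.
    events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    return (
        events.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_dir)
        .toTable(table_name)
    )
```

Calling `start_topic_to_delta(spark, "broker:9092", "events", "events_bronze", "/tmp/checkpoints/events")` would start the stream; the checkpoint location is what lets the query resume where it left off after a restart.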

Performance Evaluation Procedure

The measurement has three parts: data is loaded into a Delta table, queries are executed on that table, and the time required to do so is recorded. This gives an empirical measure of data handling and processing performance and makes Project Lightspeed's role clearer.
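As a rough illustration of this kind of measurement, the sketch below times a load step and a query step separately and in total. The function name is hypothetical, and a plain Python list stands in for the Delta table; it is not the benchmark harness used in the session.

```python
import time


def measure_load_and_query(load_fn, query_fn):
    """Time a load step followed by a query step; durations are in seconds."""
    t0 = time.perf_counter()
    load_fn()
    t1 = time.perf_counter()
    result = query_fn()
    t2 = time.perf_counter()
    return {
        "load_s": t1 - t0,
        "query_s": t2 - t1,
        "total_s": t2 - t0,
        "result": result,
    }


# Stand-ins for "load into a Delta table" and "query the table".
table = []
timings = measure_load_and_query(
    load_fn=lambda: table.extend(range(1000)),
    query_fn=lambda: sum(table),
)
print(timings)
```

In a real evaluation, `load_fn` and `query_fn` would wrap the actual write and the actual query, and the run would be repeated several times to smooth out noise.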

Detailed explanations about the technical challenges this project is facing and how it is addressing them will be provided in the next section.

Detailed Explanation of Stream Pipelining and Stateful Workloads

Over the past two years, Project Lightspeed has worked steadily on next-generation stream data processing with Spark Structured Streaming. This article focuses on the results around stream pipelining and stateful workloads.

Utilizing Stream Pipelining for Data Processing

First, let's take a closer look at stream pipelining. Random cycles have been unavoidable throughout the process developed so far. They appeared naturally in the first attempt, grew to an unexpectedly large scale, and hindered aggregation.

To solve this, the focus was on optimizing performance after the random cycle, and in particular on recovering the computation time lost to it.

Scaling Stateful Workloads

Next, let's delve deeper into stateful workloads. Processing large numbers of random cycles exposed specific challenges, and stream pipelining was introduced as an effective feature to overcome them.

This feature not only limits the growth of random cycles but also significantly improves the scaling of stateful workloads. Typical irregularities in inconsistent patterns are also detected and corrected more reliably by the stream pipeline.

In short, stream pipelining and stateful workloads play a central role in next-generation stream data processing, and Project Lightspeed aims to enhance both further.

An Exploration into the Evolution of Stream Pipelining and Stateful Workloads

Leveraging Use Cases and Industry Applications

Project Lightspeed continues to evolve, focusing on a few key applications such as event processing, routing, and filtering, and on the use cases they enable.

Event processing requires an immediate, real-time response to incoming data. Each event acts as a trigger that launches a business rule.

For instance, if a user violates the platform's rules, that user needs to be removed immediately. Traditionally, even when violations were detected, reporting them and taking the necessary action was often delayed.

With Project Lightspeed, these actions can now be executed in real time. Appropriate action can be taken at the right moment, which strengthens the overall safety and reliability of the platform.
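The idea of an event triggering a business rule can be sketched in plain Python as follows. The event shape, the rule, and the action names are hypothetical and only illustrate the pattern; a production pipeline would express the same logic as a streaming query.

```python
from dataclasses import dataclass


@dataclass
class Event:
    user_id: str
    kind: str


def violation_rule(event):
    """Hypothetical rule: a 'policy_violation' event triggers removal of the user."""
    if event.kind == "policy_violation":
        return ("remove_user", event.user_id)
    return None


def process_events(events, rules):
    """Apply every rule to each event as it arrives and yield the triggered actions."""
    for event in events:
        for rule in rules:
            action = rule(event)
            if action is not None:
                yield action


stream = [Event("u1", "login"), Event("u2", "policy_violation"), Event("u1", "post")]
actions = list(process_events(stream, [violation_rule]))
print(actions)  # [('remove_user', 'u2')]
```

Because `process_events` is a generator, an action is produced as soon as the triggering event is seen rather than after the whole input has been read, which is the property the real-time case depends on.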

Project Lightspeed also plays a vital role in real-time analytics, where working from the latest information is crucial.

Integrating Project Lightspeed across the pipeline makes it possible to follow rapidly changing information and act on it immediately, which improves both the speed and the accuracy of data analysis.

The appeal of Project Lightspeed lies in enabling these real-time operations: it turns data into action almost instantly and delivers quick, tangible results.

The applications align with the everyday demands of the modern digital society. Project Lightspeed meets these needs and presents new possibilities. Let us keep our eyes on its evolution.

Advanced Stateful Processing and Progress of Custom Python Data Sourcing

With the release of Spark 4.0 fast approaching, anticipation is growing for a new feature, custom Python data sourcing. Let's see what it involves.

What is Custom Python Data Sourcing?

Planned as one of the features of Spark 4.0, custom Python data sourcing (the Python Data Source API) makes it possible to build custom data sources and sinks in Python. Previously, these could only be written in Scala or Java, which was a limitation for Python users.

The new feature will let Python developers build data sources and sinks freely, greatly improving flexibility in data processing. This is a major advantage for developers who want to run batch processing and write data through their own custom sinks.
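A custom batch source might look roughly like the sketch below. It is written against the `pyspark.sql.datasource` module as shipped with Spark 4.0; the source name `counter`, its `count` option, and the two-column schema are hypothetical. The `ImportError` fallback is only there so the sketch can be read and run without PySpark 4.0 installed.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Fallback for environments without PySpark 4.0: plain base classes.
    DataSource = DataSourceReader = object


class CounterReader(DataSourceReader):
    """Produces `count` rows of the form (id, label)."""

    def __init__(self, options):
        # Option values arrive as strings, so convert explicitly.
        self.count = int(options.get("count", 3))

    def read(self, partition):
        # Each yielded tuple becomes one row matching the declared schema.
        for i in range(self.count):
            yield (i, f"row-{i}")


class CounterDataSource(DataSource):
    """Hypothetical source exposed to Spark under the short name 'counter'."""

    @classmethod
    def name(cls):
        return "counter"

    def schema(self):
        return "id int, label string"

    def reader(self, schema):
        return CounterReader(self.options)
```

With a Spark 4.0 session, the source would typically be registered with `spark.dataSource.register(CounterDataSource)` and then read with `spark.read.format("counter").option("count", 5).load()`. Streaming sources and sinks follow the same pattern with their own reader and writer classes.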

Relation with Stateful Processing

Behind the introduction of custom Python data sourcing is Spark's evolving data processing technology: stateful processing. 'Stateful' means retaining state across the records being processed, so that the result for a new record can depend on what came before.

Building custom Python data sources that can handle various states and conditions can therefore be viewed as an evolution of stateful processing. It extends the flexibility and freedom Spark offers and is a very attractive update for data scientists and engineers who use Python.
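To make "holding state" concrete, the sketch below keeps a running count per key across successive micro-batches. It is plain Python with hypothetical names and only illustrates the concept; it does not use Spark's stateful streaming operators.

```python
from collections import defaultdict


class RunningCounter:
    """Keeps a per-key count that survives across micro-batches."""

    def __init__(self):
        self.state = defaultdict(int)

    def process_batch(self, keys):
        """Fold one batch into the state and return updated counts for the keys seen."""
        for key in keys:
            self.state[key] += 1
        return {key: self.state[key] for key in set(keys)}


counter = RunningCounter()
first = counter.process_batch(["a", "b", "a"])
second = counter.process_batch(["a", "c"])
print(first, second)
```

The second batch reports a count of 3 for `"a"` because the two occurrences from the first batch are remembered. A stateless operator would see only the current batch and report 1.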


The integration of custom Python data sourcing in Spark 4.0 brings new possibilities to data processing using Python. As existing technology is bolstered by a fresh framework, diverse data analysis can be anticipated. After the completion of the session, users will be able to apply these latest advancements to realize more efficient data sourcing, processing, and management.

About our special DAIS site

This year, we have prepared a special site to report on session contents and the atmosphere at the DAIS venue. We plan to update the blog every day during DAIS, so please take a look.