Introduction
This is Abe from the Lakehouse Department of the GLB Division.
The DATA + AI SUMMIT (DAIS) hosted by Databricks will be held again this year!
Last fall, Databricks announced a next-generation platform called the Data Intelligence Platform, positioned as the successor to the Lakehouse Platform.
Searching this year's DAIS sessions for the keyword returns 11 sessions, which suggests that Databricks is actively promoting it.
https://www.databricks.com/dataaisummit/agenda?page=1&query=Data+Intelligence+Platform
In this article, I would like to explain the history of the Data Intelligence Platform and then introduce related sessions.
- Introduction
- History of technological evolution surrounding data analysis platforms
- 1960s: Early data analysis
- 1970s: Emergence of Relational Databases
- 1980s: Establishment of Data Warehouse (DWH)
- 1990s: Business Intelligence (BI)
- 2000s: Big Data and Distributed Computing
- 2010s-1: Cloud Computing and Real-Time Analysis
- 2010s-2: The birth of data lakes and their evolution to data lakehouses
- Challenges of Data Lakehouses and Databricks' Data Intelligence Platform
- Interesting Sessions
- Conclusion
- About the special site during DAIS
History of technological evolution surrounding data analysis platforms
As the saying goes, to learn any discipline you should start from its history (or so they say). I think it is important to first learn the history leading up to the coining of the term Data Intelligence Platform and to understand how the technology has evolved. In fact, few companies are yet able to make full use of the platforms and technologies introduced below, so I hope this overview also helps companies considering data utilization understand what stage they are currently at.
1960s: Early data analysis
This was the era when mainframe computers with large-scale data processing capabilities appeared and computer technology was introduced into commercial and scientific data processing. Data analysis was carried out in various industries and was used for statistical analysis, regression analysis, numerical simulation and modeling.
Mainframes were also used in NASA's space exploration activities, including the famous Apollo program; the photo below shows a mainframe computer installed at NASA in the 1960s.
Source: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%95%E3%83%AC%E3%83%BC%E3%83%A0
1970s: Emergence of Relational Databases
The relational database (the model underlying the RDBMS, or relational database management system) was proposed in 1970 by IBM researcher Edgar F. Codd. His paper "A Relational Model of Data for Large Shared Data Banks" introduced a new approach to database management and had a major impact on the subsequent development of database technology.
Although they are commonplace today, concepts such as tables (relations), keys, and normalization were proposed by Codd, and his relational model later led to the development of SQL.
Then, commercial relational database systems such as Oracle and IBM DB2 appeared, dramatically improving data management and manipulation, and greatly advancing the use of data in business and scientific research.
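To make these relational concepts concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, columns, and data are made up purely for illustration and are not tied to any system mentioned above.

```python
# Tables (relations), keys, and a declarative SQL query, shown with sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two relations linked by a key; normalization keeps customers and orders
# in separate tables instead of repeating customer data on every order.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0)])

# A declarative SQL query joining the relations on the key.
cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
print(cur.fetchall())  # e.g. [('Alice', 200.0), ('Bob', 50.0)]
```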
1980s: Establishment of Data Warehouse (DWH)
This was the period when the basic concept and technology of data warehouses were established and their role as a corporate decision support system became clear.
As the business environment became more complex and companies had a growing need to make decisions quickly and accurately, the concept of data warehouses developed through the following breakthroughs:
- Hardware evolution
  - Hardware technology evolved from mainframes to minicomputers and then personal computers (PCs), improving data processing capabilities
  - The cost of large-capacity storage fell, making data storage more practical
- Evolution of software and database technology
  - The spread of RDBMSs (relational database management systems) made it easier to manage and query data
  - IBM's DB2 and Oracle emerged, forming the foundation of database technology
- The birth of the ETL process
  - The ETL (Extract, Transform, Load) process was introduced, automating data extraction, transformation, and loading
  - Data consistency and quality improved, increasing the reliability of analysis
Through the above technological advances, an analysis platform was born that could aggregate business data processed by the ETL process into a DWH and analyze it efficiently.
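As a rough illustration of what an ETL job does, here is a minimal sketch in plain Python. The file name, column names, and the sqlite "DWH" target are hypothetical placeholders, not part of any product described above.

```python
# Minimal ETL sketch: extract rows from a CSV file, transform them
# (fix types, drop malformed records), and load them into a DWH table.
import csv
import sqlite3

def extract(path):
    # Extract: read raw records from a source file (path is a placeholder).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: normalize types and skip bad records to protect quality.
    for row in rows:
        try:
            yield {"region": row["region"].strip(), "sales": float(row["sales"])}
        except (KeyError, ValueError):
            continue

def load(rows, conn):
    # Load: append the cleaned records into the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS dwh_sales (region TEXT, sales REAL)")
    conn.executemany("INSERT INTO dwh_sales VALUES (:region, :sales)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract("sales.csv")), conn)
```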
1990s: Business Intelligence (BI)
BI is a technology and process that allows companies to collect and analyze data and support decision-making.
Born in the 1980s and established in the 1990s, BI made it possible to transform data for analysis in the DWH and then visualize it with BI tools.
For example, reporting tools appeared that let users view data visually as graphs, and even users with no technical knowledge could build reports with simple drag-and-drop operations.
In addition, the emergence of OLAP (Online Analytical Processing) contributed greatly to the evolution of BI. Concepts such as multidimensional data models and cubes (data cubes) were also born, making it possible to quickly execute complex queries and perform multidimensional data analysis.
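To give a feel for the multidimensional analysis that OLAP enabled, here is a small sketch using pandas. This is an assumption for illustration only; OLAP systems of the era used dedicated cube engines, not pandas.

```python
# Aggregating a measure (sales) over two dimensions (region, product),
# in the spirit of an OLAP cube.
import pandas as pd

df = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 200, 50],
})

# Pivot the data into a small "cube" with subtotals along each dimension.
cube = df.pivot_table(values="sales", index="region", columns="product",
                      aggfunc="sum", margins=True)
print(cube)
```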
2000s: Big Data and Distributed Computing
This was a time when big data and distributed computing technologies rapidly developed, dramatically transforming the way data was processed and analyzed.
The spread of the Internet and the emergence of social media (SNS) and IoT generated huge amounts of data every day, increasing the demand for real-time processing and for handling streaming data.
In addition, data formats such as text, images, videos, and sensor data became more diverse, making it important to ensure the reliability and quality of data.
In response to the above business requirements, distributed computing evolved, and Apache Hadoop, MapReduce, and NoSQL databases (MongoDB, Cassandra, etc.) appeared, making it possible to process big data using distributed computing.
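As a toy illustration of the MapReduce programming model that Hadoop popularized, here is a word count written in plain Python. In a real cluster the map, shuffle, and reduce phases run in parallel across many machines; this single-process version only mimics the structure.

```python
# Word count in the MapReduce style: map emits (key, 1) pairs,
# shuffle groups them by key, reduce sums each group.
from collections import defaultdict

documents = ["big data on distributed systems", "big data at scale"]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, ...}
```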
2010s-1: Cloud Computing and Real-Time Analysis
Cloud computing is a service that provides computing resources (servers, storage, databases, networking, software, etc.) via the Internet, and has revolutionized the IT infrastructure of businesses and individuals.
Companies can now use flexible and cost-effective infrastructure, and with the evolution of real-time analysis technology, they can make quick decisions and take action. This has further advanced data utilization in various industries, improving their competitiveness.
2010s-2: The birth of data lakes and their evolution to data lakehouses
A data lake is a central repository that stores a large amount of raw data in various formats. Early data lakes were mainly built on on-premise Hadoop clusters, and an ecosystem centered on Hadoop was built, including Apache Hive (data warehouse), Apache Pig (data flow language), and Apache HBase (NoSQL database). With the spread of cloud computing, cloud-based data lakes have appeared, greatly improving flexibility and scalability.
However, data lakes have the disadvantage that they do not support ACID transactions and that metadata is difficult to manage, so data quality cannot be guaranteed. Databricks therefore advocated the data lakehouse, a data management architecture that integrates the functions of data lakes and data warehouses, and such systems have come into use, making it possible to analyze data in a wide variety of formats.
On a data lakehouse, data collection, processing, and storage, as well as workloads such as machine learning and integration with BI tools, can all be performed on a single platform.
Of course, these big data processing capabilities are built on the distributed processing technology of Apache Spark.
To share assets such as data with multiple users on a single platform, data governance such as access management was strengthened and data lifecycle management was carried out.
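As a rough sketch of a lakehouse-style pipeline, here is a minimal PySpark example that reads raw files from a data lake path and persists them as a Delta table with ACID guarantees. The paths, column name, and table name are placeholders, and the snippet assumes a Spark environment with the Delta Lake package available (as on Databricks).

```python
# Ingest raw data from the lake and persist it as a curated Delta table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Schema-on-read ingestion of raw JSON files (placeholder path).
raw = spark.read.json("/data/lake/raw/events/")

# Light cleansing before writing to the curated layer (placeholder column).
clean = raw.filter(F.col("event_type").isNotNull())

# Delta adds ACID transactions and metadata handling on top of open files,
# so the same table can serve both BI and machine learning workloads.
clean.write.format("delta").mode("overwrite").saveAsTable("curated.events")
```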
Challenges of Data Lakehouses and Databricks' Data Intelligence Platform
However, Databricks states that previous lakehouses have the following challenges.
- Technical skill barrier: Data queries require specialized skills in SQL, Python, and BI, making the learning curve steep.
- Data accuracy and curation: In large organizations, it is difficult to find appropriate and accurate data, and large-scale curation and planning are required.
- Management complexity: If the data platform is not managed by highly skilled technical personnel, costs may rise and performance may decrease.
- Governance and privacy: Governance requirements around the world are evolving rapidly, and the advent of AI is amplifying concerns about lineage, security, and privacy.
- New AI applications: To build generative AI applications that meet domain-specific requirements, organizations must develop and tune LLMs on a platform separate from the data and connect them to the data through manual engineering.
Source: https://www.databricks.com/jp/blog/what-is-a-data-intelligence-platform
In other words, the challenges are that as the platform has evolved the required technical skills have become more advanced, that the growth of data and the emergence of AI have made data governance requirements even more demanding, and that the lakehouse does not offer integrated LLM development capabilities.
In response to these challenges, the idea of using (generative) AI to further democratize data was proposed. I will not go into the details here, but users can now access data more easily than ever and understand what the data means, data managers can strengthen data governance, and, as a platform, the generative AI development features and integrations with other services have been improved and continue to evolve.
Interesting Sessions
That concludes our look back at the history behind the Data Intelligence Platform. I think the best way to understand Databricks' Data Intelligence Platform, though, is to watch the keynotes and sessions at DAIS.
For that reason, I would like to end this article by introducing the Data Intelligence Platform sessions that will be presented at DAIS.
- DATABRICKS DATA INTELLIGENCE PLATFORM: INTRODUCTORY OVERVIEW
This session should give you a systematic understanding of Databricks' Data Intelligence Platform.
In particular, it will cover generative AI features such as the Databricks Assistant, AI Semantic Search, AI Documentation, the Query Editor Assistant, and Predictive I/O, as well as how to build AI applications using LLMs, RAG, and fine-tuning.
- YOUR GUIDE TO DATA ENGINEERING ON THE DATA INTELLIGENCE PLATFORM
This session is a guide for those involved in data engineering work on how to use Databricks functions to solve problems.
The agenda for the session is as follows:
- Data ingestion into the Data Intelligence Platform
- Building a reliable streaming data pipeline with Delta Live Tables
- Data orchestration with Databricks workflows
- Data governance with Unity Catalog
- Useful AI features such as Databricks Assistant
Conclusion
We have looked back at the history behind the Data Intelligence Platform alongside the technological advances that shaped it. Data analysis platforms have evolved from early mainframe-based analysis through the data lakehouse, and now the Data Intelligence Platform has been born. We look forward to the future evolution of the platform powered by generative AI.
About the special site during DAIS
This year, we have prepared a special site to report on session content and the atmosphere on-site at DAIS! We plan to update the blog every day during DAIS, so please take a look.