APC Tech Blog

This is the technical blog of AP Communications Co., Ltd.

Udemy’s Data and AI Journey: Migrating from a Managed Big Data Platform to Databricks

Preface

This session delved into early challenges faced by Udemy, specifically focusing on the hurdles encountered while transitioning from a managed big data platform to Databricks.

Legacy and Transition Challenges

Initially, Udemy adopted new tooling, upgrading from MyCBC 4.8 to MyCBC 5.5 and adapting its workloads to S3, and this quickly led to considerable issues. These problems, recounted by team member David, highlighted the challenges of relying on legacy technology. The subsequent move from on-premises infrastructure to Amazon EMR 5 was intended as a significant upgrade, yet despite regular updates the stack felt persistently outdated, always playing catch-up with already-aging software versions.

They also implemented cluster auto-scaling, but tight coupling to static EC2 resources limited the flexibility they needed. Mixing older technologies such as Hive and Sqoop with newer ones also led to 'noisy neighbor' issues on shared clusters.

Understanding these initial challenges provides foundational insights into how Udemy overcame these barriers and significantly enhanced its data management and analytical capabilities. In the following section, we will further explore how these issues were addressed, and how Databricks features were leveraged to revolutionize Udemy's operations.

Originally, Udemy managed most of its data jobs with Apache Airflow, which often required spinning up new job clusters. This raised significant concerns about operational cost and efficiency, prompting Udemy to seek alternative solutions.
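As a rough illustration of that pattern (not Udemy's actual configuration), the sketch below shows how an Airflow task defined with the Databricks provider can request a brand-new job cluster on every run; the DAG name, notebook path, and cluster sizing are assumptions.

```python
# Hypothetical Airflow DAG: every run of this task asks Databricks to create
# (and later tear down) a fresh job cluster, so each execution pays the
# cluster start-up cost. All names and sizes below are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="daily_etl_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_etl = DatabricksSubmitRunOperator(
        task_id="run_etl_notebook",
        databricks_conn_id="databricks_default",
        json={
            "run_name": "daily_etl_example",
            "new_cluster": {  # a new cluster is created for this run
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
            "notebook_task": {"notebook_path": "/Repos/etl/daily_job"},
        },
    )
```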

During the decision-making process, Udemy conducted a careful evaluation of advanced Databricks features such as Unity Catalog, DBSQL, and Delta Lake. Specific tests were designed to assess whether these features could streamline enterprise data management and enhance analytical capabilities.
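As a hedged sketch of the kind of checks such an evaluation might include (the catalog, schema, and table names below are hypothetical), Unity Catalog exposes tables through a three-level namespace, and Delta Lake allows querying earlier versions of a table:

```python
# Illustrative only: table names are not from the session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Unity Catalog addresses tables as catalog.schema.table, giving one governed
# namespace across workspaces.
current = spark.table("main.analytics.course_enrollments")

# Delta Lake time travel: query an earlier version of the same table.
previous = spark.sql(
    "SELECT * FROM main.analytics.course_enrollments VERSION AS OF 0"
)

print(current.count(), previous.count())
```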

The evaluation demonstrated that the Databricks platform not only resolved the issues associated with Apache Airflow but was also cost-effective. The integration of sophisticated data management tools like Delta Lake greatly improved the efficiency and responsiveness of the data pipeline, facilitating quicker data access and enabling more robust analytics.

The transition to Databricks was based on thorough technical evaluation and careful planning. This case study serves as a practical guide for other companies considering similar transitions in the data and AI realms, providing a valuable benchmark in these areas. Through this transition, Udemy demonstrated how careful consideration of technical capabilities can enhance resource management and overall corporate efficiency.

Decision-Making and Transition Strategy

When Udemy decided to transition from the managed big data platform Daedalus to Databricks, several considerations played a crucial role. Although Daedalus provided adequate in-house testing features, the broader support ecosystem and enhanced security benefits offered by other platforms significantly influenced Udemy's decision.

To streamline the decision-making process, Udemy employed a strategy termed 'user-driven metrics'. This strategic approach involved having team members evaluate potential platforms based on multiple key criteria, including product integration, user experience, data quality, ease of building data pipelines, and Business Intelligence (BI) and application SQL capabilities.

Among these diverse metrics, Databricks was particularly highly rated, impressing with advanced features such as Unity Catalog, DBSQL, and Delta Lake. These elements significantly enhanced data integration, analysis, and management, and were crucial in Udemy’s efforts toward data democratization. The impact of these features on data handling and security was a decisive factor in Udemy’s transition strategy.

This comprehensive evaluation and methodical approach exemplify how organizations can manage transition strategies while enhancing their data and AI capabilities. These strategic insights are extremely valuable for others considering similar initiatives, ensuring smooth transitions and robust data infrastructure.

Udemy managed the migration to Databricks strategically to ensure a successful transition, attending to technical details and new development work along the way. The migration process centered on key phases such as data transformation, integration of ETL tools, and enhancement of dashboard capabilities.

A fundamental challenge was converting all existing data to the Delta format. This step was crucial as it allowed different data formats to be integrated into a single consistent catalog, significantly streamlining data management.
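A minimal sketch of what such a conversion can look like using Delta Lake's CONVERT TO DELTA command, assuming Parquet source data; the bucket, path, and table names are illustrative, not Udemy's:

```python
# Hypothetical example: convert an existing Parquet dataset to Delta and
# register it in a single catalog. Paths and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adds a Delta transaction log on top of the existing Parquet files.
spark.sql(
    "CONVERT TO DELTA parquet.`s3://example-bucket/warehouse/events/` "
    "PARTITIONED BY (event_date DATE)"
)

# Expose the converted data as a table under one consistent catalog.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.raw.events
    USING DELTA
    LOCATION 's3://example-bucket/warehouse/events/'
""")
```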

For ETL (Extract, Transform, Load), the adoption of Delta Lake enabled fast and accurate data ingestion, which in turn improved the accuracy of downstream analysis. ETL tools like EBT, iSpark, and Solid Spark were used effectively, and reverse ETL integration with third-party vendors like Hightouch further expanded the range of data applications.
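As one hedged example of the kind of ingestion step Delta Lake simplifies (the table and path names are assumptions, not details from the session), an incremental load can be written as an idempotent MERGE:

```python
# Illustrative incremental upsert into a Delta table; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stage today's extract as a temporary view.
updates = spark.read.parquet("s3://example-bucket/landing/enrollments/dt=2024-01-01/")
updates.createOrReplaceTempView("staging_enrollments")

# MERGE keeps the load idempotent: re-running it updates rows instead of duplicating them.
spark.sql("""
    MERGE INTO main.analytics.enrollments AS t
    USING staging_enrollments AS s
    ON t.enrollment_id = s.enrollment_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```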

Dashboard technologies also saw notable improvements. The use of Databricks, along with tools like Tableau, enhanced dashboard capabilities, making data more accessible and understandable for all users.

Overall, Udemy's seamless transition to Databricks promoted unified data management, efficient data processing, and enhanced user interface functionalities. This transition highlighted the potential and transformative possibilities that can be achieved through well-planned technological advancements in an organization.

Udemy’s Data and AI Journey: Automation and Verification

Udemy's transition to Databricks focused on addressing common challenges seen in many data platforms, aiming to enable quicker and more efficient data processing. A key focus of this endeavor was on 'Automation and Verification', essential elements to ensure the data migration process is stable, accurate, and efficient.

1. Importance of Automation and Migration Tracking

During the data migration process, automating updates to the migration status is indispensable. Udemy introduced a migration tracker that automatically updates data status with each execution. The tracker uses Spark to verify whether data has been migrated correctly before the next operation proceeds. This automation minimizes the need for manual intervention, significantly enhancing data workflow efficiency.
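The session did not show the tracker's code, but a minimal sketch of the idea, comparing row counts between the source and the migrated table and appending the result to a tracker table, might look like this (all table names are hypothetical):

```python
# Hypothetical migration-tracker step: verify a migrated table and record the result.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def verify_and_track(source_table: str, target_table: str, tracker_table: str) -> bool:
    # Compare row counts between the legacy source and the migrated Delta table.
    source_count = spark.table(source_table).count()
    target_count = spark.table(target_table).count()
    migrated_ok = source_count == target_count

    # Append the verification result so each run updates the migration status.
    status = spark.createDataFrame(
        [(target_table, source_count, target_count, migrated_ok)],
        "table_name STRING, source_rows LONG, target_rows LONG, verified BOOLEAN",
    ).withColumn("checked_at", F.current_timestamp())
    status.write.format("delta").mode("append").saveAsTable(tracker_table)

    return migrated_ok
```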

2. Improved Execution Time Through Spark Applications

Udemy found that using Spark applications for data migration resulted in execution times that were three times faster than the previous platform. This speed improvement not only accelerates data processes but also provides timelier data insights, enabling quicker business decisions and better utilization of data resources.

3. Strategic Use of Variables for Optimization

As part of the automation, various variables are set to determine which actions run under specific scenarios. This level of customization enhances the reliability and resilience of the data migration process, ensuring tasks are executed precisely.
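The session did not enumerate these variables, but as a purely illustrative sketch, such scenario switches might look like this:

```python
# Hypothetical scenario variables for a migration run; none of these names come
# from the session, they only illustrate the idea of per-scenario control.
migration_config = {
    "mode": "incremental",   # "full" reloads everything; "incremental" copies only new partitions
    "dry_run": False,        # when True, verify and log but write nothing
    "on_mismatch": "halt",   # "halt" stops the run; "retry" re-queues the table
    "max_retries": 3,
}

if migration_config["dry_run"]:
    print("Dry run: verification only, no data will be written.")
```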

By focusing on 'Automation and Verification', Udemy significantly strengthened its data processing capabilities. These enhancements are crucial in maintaining data integrity and efficiency, supporting Udemy's goal of providing an accessible and robust learning platform. This journey exemplifies the profound impact that sophisticated data management strategies can have on an organization's operational capabilities.

Final Optimization and Stabilization

Let's delve into the 'Final Optimization and Stabilization' section of Udemy's data and AI journey, following the transition to Databricks.

This session emphasized stabilization and cost optimization after the migration. Despite numerous challenges in the initial stages, Udemy achieved the desired cost efficiency over time.

Cost Management

Leveraging the features provided by Databricks is essential for cost management. Capabilities such as tagging, thinning, and RBAC (Role-Based Access Control) not only maintain cost efficiency but also enhance security and compliance, aligning with best practices in data management.
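As a hedged illustration of what this can look like in practice (the tag values, table, and group names are assumptions), cluster tags make spend attributable by team, and Unity Catalog grants restrict who can query the data:

```python
# Illustrative only: tag values, table, and group names are hypothetical.
job_cluster_spec = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Custom tags propagate to the underlying cloud resources, so cost reports
    # can be broken down by team or cost center.
    "custom_tags": {"team": "data-platform", "cost_center": "analytics"},
}

# Role-based access control with Unity Catalog SQL (run inside a Databricks
# workspace where the catalog and group exist):
# GRANT SELECT ON TABLE main.analytics.enrollments TO `data-analysts`;
```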

Use of Ephemeral Compute Resources

For the first time, Udemy adopted ephemeral compute resources. Managing such resources requires careful preparation and data management. Ensuring proper data backup, automating data schema verification, and implementing rollback processes are crucial to minimizing impacts during migration.
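A minimal sketch of two of those safeguards, a schema check before writing and a rollback via Delta's RESTORE command, assuming hypothetical table and column names:

```python
# Illustrative safeguards; table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Verify the target schema before loading data from an ephemeral cluster.
expected_columns = {"enrollment_id", "user_id", "course_id", "enrolled_at"}
actual_columns = set(spark.table("main.analytics.enrollments").columns)
if actual_columns != expected_columns:
    raise ValueError(f"Schema drift detected: {actual_columns ^ expected_columns}")

# If a load goes wrong, Delta can roll the table back to a prior version
# (versions remain available until VACUUM removes the old files):
# RESTORE TABLE main.analytics.enrollments TO VERSION AS OF 42;
```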

Driving Automation

Process automation enhances task accuracy and reduces the likelihood of errors associated with manual intervention. This leads to reduced operational costs, increased efficiency, and establishes a robust foundation for further data-driven decision-making across various business operations.

The speaker highlighted how these strategic measures have strengthened Udemy’s approach to data management and analysis since the transition. They have successfully democratized data analysis, promoting more agile and informed decision-making throughout the organization.

With the migration project concluded and its benefits increasingly apparent, these efficiencies can be leveraged for continuous improvement and innovation. Udemy's case provides enlightening insights for companies revising their data strategies, offering a robust model for continuously refining and optimizing data processes on an advanced data platform like Databricks.

About the special site during DAIS

This year, we have set up a special site to report on session contents and the atmosphere on the ground at DAIS! We plan to update the blog every day during DAIS, so please take a look.

www.ap-com.co.jp