APC Tech Blog


This is the technical blog of AP Communications Co., Ltd.

The MLOps Platform at WGU: Solutions to Production ML with Databricks


This post reports on the session "The MLOps Platform at WGU: Solutions to Production ML with Databricks," presented by the MLOps team at Western Governors University (WGU). WGU is an online university that handles a vast amount of data, and the session focused on the early stages of implementing MLOps (Machine Learning Operations) there and the challenges faced during the design phase.

Design Phase and Initial Challenges

The session highlighted that the design stage involved numerous difficulties. The most prominent were establishing standardized processes, setting up effective source control, and implementing robust monitoring. Addressing these hurdles was critical to the project's success and required extensive effort from the project team.

The approach adopted by WGU’s MLOps team involved identifying key issues, prioritizing them based on impact, applying the necessary tools and processes appropriately, and pursuing continuous improvement in methodologies.

Furthermore, the session emphasized that implementing MLOps is more than a technical challenge. It requires a comprehensive organizational strategy, and building a large-scale system successfully depends on strong cooperation among diverse stakeholders.

The session was insightful, detailing the early challenges WGU faced in adopting MLOps with Databricks and the solutions the team implemented to overcome them. The following sections cover the team's strategies and specific efforts in more detail.

WGU’s MLOps platform, MARVIN, addresses the complexities of deploying ML models at scale using Databricks. This section looks more closely at some of the key strategies the session emphasized.

One of the primary challenges WGU faces is ensuring traceability and auditability of models. Under the policy of "Everything as Code," all aspects including workflows, compute settings, and permissions management are meticulously handled through code. This ensures traceability of all data, workflows, experiments, models, and permissions associated with ML projects.
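As a rough illustration of the "Everything as Code" idea, the sketch below describes one project's workflows, compute settings, and permissions as a single declarative object and renders it deterministically so that changes show up as clean diffs in source control. This is not MARVIN's actual format: `ProjectSpec`, `render`, and every field name and value are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class ProjectSpec:
    """Hypothetical declarative description of one ML project."""
    name: str
    workflows: tuple = ()
    compute: dict = field(default_factory=dict)
    # Maps a group name to a permission level (illustrative values only).
    permissions: dict = field(default_factory=dict)


def render(spec: ProjectSpec) -> str:
    """Serialize with sorted keys so identical specs always produce identical text."""
    return json.dumps(asdict(spec), sort_keys=True, indent=2)


# Example values are assumptions made up for this sketch.
spec = ProjectSpec(
    name="demo-project",
    workflows=("train", "score"),
    compute={"node_type": "example-node", "num_workers": 2},
    permissions={"demo-project-team": "CAN_MANAGE"},
)
print(render(spec))
```

Because the rendered text is stable, committing it alongside the model code gives reviewers and auditors one place to see what changed and when.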

Here, MLflow plays a crucial role. MLflow tracks a model from development through deployment, so a production model can be traced back to the work that produced it and rolled back in a traceable way, bridging the gap between development and production.
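To make the idea of a traceable roll-back concrete, here is a minimal in-memory sketch. It deliberately does not use MLflow's API; `ToyRegistry`, `ModelVersion`, and all identifiers are hypothetical stand-ins meant only to show how each model version can carry its lineage (training run and code revision) and how a roll-back can be recorded rather than silently applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    version: int
    run_id: str      # training run that produced this version
    git_commit: str  # code revision used for that run


class ToyRegistry:
    """In-memory stand-in for a model registry. This is NOT MLflow's API."""

    def __init__(self):
        self._versions = []
        self._production_history = []  # promoted versions, oldest first
        self.audit_log = []            # (action, version number) tuples

    def register(self, run_id, git_commit):
        mv = ModelVersion(len(self._versions) + 1, run_id, git_commit)
        self._versions.append(mv)
        return mv

    def promote(self, version):
        self._production_history.append(self._versions[version - 1])
        self.audit_log.append(("promote", version))

    def production(self):
        return self._production_history[-1] if self._production_history else None

    def rollback(self):
        """Return production to the previously promoted version and log it."""
        if len(self._production_history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self._production_history.pop()
        restored = self._production_history[-1]
        self.audit_log.append(("rollback", restored.version))
        return restored


registry = ToyRegistry()
v1 = registry.register(run_id="run-a", git_commit="abc123")
v2 = registry.register(run_id="run-b", git_commit="def456")
registry.promote(v1.version)
registry.promote(v2.version)
restored = registry.rollback()
print(restored.version, restored.run_id, restored.git_commit)
```

After the roll-back, the production model still points at a specific run and commit, and the audit log records both promotions and the roll-back itself.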

Next, the session covered simplifying and standardizing the production process. WGU promotes automation through CI/CD and sets up separate folders for the development, staging, and production environments. This structure keeps the production process reproducible and standardized.
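One way to picture the separate-folder layout is a small helper that a CI/CD job could call to resolve where a project lives in each environment. The sketch below is an assumption for illustration: the environment names, the `root/env/project` layout, and the functions `project_folder` and `promotion_target` are hypothetical and not taken from the session.

```python
from pathlib import PurePosixPath

# Assumed environment names, in promotion order.
ENVIRONMENTS = ("dev", "staging", "prod")


def project_folder(root, project, env):
    """Build the folder path for a project in one environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return PurePosixPath(root) / env / project


def promotion_target(env):
    """Next environment in the dev -> staging -> prod sequence, or None at the end."""
    i = ENVIRONMENTS.index(env)
    return ENVIRONMENTS[i + 1] if i + 1 < len(ENVIRONMENTS) else None


for env in ENVIRONMENTS:
    print(env, project_folder("/Projects", "demo-project", env))
```

Deriving every path from the same function means the three environments cannot drift apart in structure, which is the property that makes promotion between them repeatable.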

The session also emphasized making models easy to tune so that performance holds up after deployment. Continuous monitoring and timely updates are crucial, and the MARVIN platform is designed to handle these tasks efficiently.

Through these efforts, WGU has established an efficient and effective operational framework for ML models, maximizing the benefits of ML in products serving approximately 170,000 students.

WGU’s MLOps platform, MARVIN, focuses on efficient deployment of educational ML models using Databricks, with strict security measures and automated project processes.

Stringent Security Management

At the core of MARVIN’s design is a robust security layer that protects the operational environment. The platform uses Databricks service principals and closely monitors all associated access tokens. This careful token management is crucial for preventing unauthorized access and potential data breaches, and it significantly strengthens overall system security.

Permissions within Databricks are managed rigorously on a group basis. When a project starts, a new group dedicated to that project is created and the relevant team members are added to it. This approach ensures strict control over access to workspaces, experiments, models, and data, establishing a safe and organized operational framework.
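The group-per-project pattern can be sketched in a few lines of plain Python. This is only an illustration of the idea that grants go to the group and never to individuals; it does not call any Databricks API, and `bootstrap_project_access`, `can_access`, the `-team` suffix, and the resource names are all hypothetical.

```python
def bootstrap_project_access(project, members):
    """Build a hypothetical access plan: one group per project, grants go to the group."""
    group = f"{project}-team"
    resources = ("workspace", "experiments", "models", "data")
    return {
        "group": group,
        "members": sorted(set(members)),
        # Every resource is granted to the group, never to an individual user.
        "grants": {resource: group for resource in resources},
    }


def can_access(plan, user, resource):
    """A user has access only through membership in the project group."""
    return user in plan["members"] and resource in plan["grants"]


plan = bootstrap_project_access("demo-project", ["alice", "bob", "alice"])
print(plan["group"], plan["members"])
print(can_access(plan, "alice", "models"), can_access(plan, "carol", "models"))
```

With this shape, onboarding or offboarding someone is a single membership change, and the grants themselves never need to be touched.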

Automated Project Initiation

Deploying a project within MARVIN triggers three major pipelines, which keep project execution seamless and automated. Initiating a project creates a single, unique project instance that marks the start of operations. As the project evolves, team members can rely on Databricks’ automation capabilities to roll out updates across its various components.

MARVIN configures these systems itself, combining security and automation to support secure operations and the rapid deployment of educational models that serve WGU’s extensive student body.

About the special site during DAIS

This year, we have prepared a special site to report on session content and on what is happening on-site at DAIS! We plan to update the blog every day during DAIS, so please take a look.