This interview first appeared in the TDWI Business Intelligence Journal Volume 25, No 2 (2020)
A growing number of organizations are using predictive analytics to make product recommendations, identify market segments, calculate customer lifetime value, and more. In the beginning, predictive modeling was a “cottage industry.” Only a few people developed, implemented, and maintained a limited number of models. However, the need for models has grown (often to the thousands) in many organizations.
The cottage industry approach is no longer viable, as more people are involved. Data pipelines must be established; model building, testing, and updating must be automated and documented; and the scope of data and analytics governance must be expanded. The term ModelOps is increasingly used to describe the automation of the model building, implementing, and updating processes.
Assume you are a consultant to a Chief Analytics Officer who was recently hired to “take analytics to the next level” in the company. Doing this requires scaling analytics in terms of the people, data, models, technologies, and processes. Please provide your perspective and advice on the following questions and issues:
- In terms of people, to what extent can business analysts (often with upgraded skills) and citizen data scientists contribute to and be integrated into a ModelOps environment? A company’s staffing isn’t likely to be all data scientists.
Staffing is a common challenge that many fledgling data science programs face. The question asked is: “How can we scale to deliver more analytics models to the business even faster?” The first place to start is to ensure that the highly specialized data scientists are spending their time where they add the most value. This might mean selecting and building analytic models initially, but eventually they should be spending more time mentoring and reviewing the analytic models of citizen data scientists.
The key is to break down stages of the modern analytics life cycle process to enable project managers, business analysts, and data engineers to each do as much of the work as possible. The second place to focus on is efficiency in the model testing, deployment, monitoring, and updating process. Delivering an analytics model into production is more time-consuming than typically anticipated. Continuous deployment and containers can help with the process.
- The need for data engineers to build data pipelines for operational systems and analytics has grown, resulting in shortages of personnel with these skills. How should a company ensure that it has the data engineering skills it needs?
To ensure that your organization has the data engineering skills it needs, begin by understanding how data engineers are spending their time on projects. Specifically, how much time are they spending finding data, validating data, and testing the feasibility of the project?
Quite often, a noticeable portion of their time is spent early in the discovery phase of the project. These efforts can easily be performed by their business analyst counterparts, often more quickly, because business and process knowledge lets analysts make quick, iterative decisions.
We take this one step further by understanding how much of the data engineer’s time is spent performing architecture- or operations-related work. We institute an architecture pattern book process whereby data engineers spend their time applying the architecture pattern to the business project rather than reinventing the wheel. The goal is to find ways to increase speed and agility for data engineers.
- As companies use additional data sources, often obtained from data brokers, how can they maintain data quality and ensure that the use of the data does not violate regulations and laws such as the GDPR and CCPA?
When it comes to proper use of data, this is where the data governance office (DGO) gets involved. Self-service data analytics or agile delivery teams should be able to incorporate external or third-party data into the enterprise data lake as needed. In the case where the external data is going to be used by multiple business groups, we recommend that the DGO include this in their reference data management responsibilities.
At a minimum, we strive to incorporate data governance and data quality at the point of data ingestion into the data lake. Data engineers, business analysts, and data scientists know to do this anytime their source data is not already in the data lake. For the discovery phase, the DGO is notified. For the usage and implementation phase, the DGO coordinates with data owners and data stewards as necessary.
- To what extent do current analytics workbenches support ModelOps? What advances do you anticipate in these workbenches? How do programming languages such as R and Python fit into a ModelOps environment?
We are pleased to see that most data science platforms have incorporated ModelOps as part of their life cycle. Only a few years ago, enterprises were just recognizing the need to operationalize analytics, and the work was handled manually.
Most data science platforms continue to recognize that data scientists need to be allowed to work in their programming language of choice, such as Python and R, and data science notebooks are becoming the de facto standard for workbenches.
The emphasis has shifted from building analytics models to the ongoing work of operationalizing and maintaining analytics models in production. The work now is to monitor whether a deployed analytics model maintains its effectiveness over time, how it competes with other similar models in production, and whether updating data features or reinforcement training is needed.
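To make the monitoring idea concrete, here is a minimal Python sketch of one way to check whether a deployed model is holding up over time. It is an illustration only, not something prescribed in the interview: the names `ModelHealth` and `check_model_health`, the choice of accuracy as the metric, and the 5-point `max_drop` threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ModelHealth:
    """Summary of one monitoring check (hypothetical structure)."""
    model_name: str
    baseline_accuracy: float
    recent_accuracy: float
    needs_attention: bool


def accuracy(predictions: Sequence[int], actuals: Sequence[int]) -> float:
    """Fraction of predictions that match the observed outcomes."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be non-empty and the same length")
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(predictions)


def check_model_health(
    model_name: str,
    baseline_accuracy: float,
    predictions: Sequence[int],
    actuals: Sequence[int],
    max_drop: float = 0.05,  # assumed tolerance; tune per model and metric
) -> ModelHealth:
    """Flag the model when recent accuracy falls more than max_drop below baseline."""
    recent = accuracy(predictions, actuals)
    needs_attention = (baseline_accuracy - recent) > max_drop
    return ModelHealth(model_name, baseline_accuracy, recent, needs_attention)


if __name__ == "__main__":
    # Recent labeled outcomes for a hypothetical churn model.
    health = check_model_health(
        "churn_v1",
        baseline_accuracy=0.90,
        predictions=[1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
        actuals=[1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
    )
    print(health)
```

In practice a check like this would run on a schedule for every production model, so the team sees a short list of models that need attention rather than inspecting each one by hand.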
- What processes need to be put in place for ModelOps?
ModelOps essentially provides the ability to maintain analytics models at scale. Its monitoring capability enables the data science team to quickly see whether any analytics models need attention or are trending in the wrong direction. When the data science team needs to adjust and update a model, the team must have an A/B testing process in place so that it can react quickly and analyze the impact of changes.
In addition, efficient model deployment processes through ModelOps can leverage continuous integration and continuous delivery (CI/CD) so that a change can be made and deployed quickly. Both of these capabilities enable data science teams to react quickly to notifications delivered from an analytics platform that utilizes ModelOps.
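As an illustration of how an A/B testing process might feed a deployment decision, the Python sketch below compares a current model (A) with an updated model (B) using a two-proportion z-test. The interview does not specify a statistical method; the test, the function names `two_proportion_z_test` and `should_promote`, and the 0.05 significance level are assumptions chosen for the example.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, trials_a: int, successes_b: int, trials_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: A and B have the same success rate."""
    if trials_a <= 0 or trials_b <= 0:
        raise ValueError("each variant needs at least one trial")
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        # Both variants succeeded always or never; no evidence of a difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def should_promote(
    successes_a: int,
    trials_a: int,
    successes_b: int,
    trials_b: int,
    alpha: float = 0.05,  # assumed significance level
) -> bool:
    """Promote B only if it is better than A and the difference is significant."""
    z, p_value = two_proportion_z_test(successes_a, trials_a, successes_b, trials_b)
    return z > 0 and p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts: accepted recommendations out of recommendations served.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z={z:.2f}, p={p:.4f}, promote={should_promote(100, 1000, 150, 1000)}")
```

A function like `should_promote` could serve as a gate in a CI/CD pipeline, so that an updated model is deployed automatically only when the A/B results support it.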