Machine Learning (ML) deployment is one of the dark sides of both Data Science and Data Engineering. Managed services like Databricks might help.
The daily mood
As already mentioned in a previous post, I have the privilege of shadowing a newly started Data Lake project.
The team already prepared data and built a first Machine Learning (ML) model for a specific use-case.
They are currently in the process of deploying the scoring application and find this quite challenging. We are going to discuss why it is difficult, and of course how technology and automation may help.
Big Data & ML adoption
In 2009, the Knowledge Discovery and Data Mining (KDD) conference ran its competition (KDD Cup), which reached the IT world with a disruptive report of lessons learnt in large-scale ML projects. The industry was just starting to realize the rise of Big Data (e.g. IoT) and the potential of ML for business.
In the following years, Business Intelligence (BI) organizations not only embraced digital transformation but also initiated a long-term transition away from Data Warehouses (DW) and Massively Parallel Processing (MPP) appliances.
Starting in 2012, they evaluated resilient systems like Apache Hadoop to reduce storage and processing costs, and to automate business-related classifications or predictions via supervised ML.
The Hadoop ecosystem was large. It included HDFS, MapReduce, Pig and Hive, to name a few standards, plus additional (mostly open-source) components selected or backed as differentiators by Data Management platform distributors: Cloudera, Hortonworks and MapR.
Since only a few Data Scientists and Big Data infrastructure experts were available on the job market, Data Engineers had to learn "the hard way", struggling to re-implement analysis techniques, for example SCD historization tables, atop non-POSIX-compliant file systems. Results didn't meet expectations.
By 2014, new technologies were required to improve performance and accessibility. Apache Spark, the in-memory cluster computing engine backed by Databricks, reached critical mass thanks to its broad support for ML algorithms, its programming APIs, and code compatibility between batch and real-time processing.
In 2016, managed Hadoop services from the major public cloud providers became less expensive (consumption-based vs. node-based pricing) and more robust (central vs. decentralized delivery) than the on-premises alternatives from distributors.
Object stores replaced block stores, Spark replaced MapReduce, and Hadoop was no longer required. Cloudera merged with Hortonworks and MapR went bankrupt. The next step was to make Data virtualization (the SQL layer) and Data Science (the automation layer) more accessible to the industry.
ML gaps
When doing Maths at university, students usually transition from academic papers and calculators to programming languages: R for numerical/array computing and statistical modelling, Python for general-purpose application development, C++ for object-oriented research and industrial engineering.
They may also use proprietary environments like Matlab and SAS, but they gain little awareness of the service and infrastructure layers. As a result, it will later be easier for them to adopt an application-oriented mindset (Service Delivery contract) than an operational mindset (Service Level Agreement).
The gap is even bigger with ML because it is a subset of Data Science, a research area that puts the project focus more on exploration and prototyping than on robustness, scalability and reproducibility. Even worse, the Data Scientist's deliverable, i.e. the model, is a living asset that is easily feared or distrusted.
Indeed, an application able to generate predictions from live data basically consists of an interface for serving an ML model, itself based on a complex statistical algorithm manually configured by a human and automatically calibrated by a machine from historical data used for training and testing.
No surprise that the (evil) model raises some incomprehension or suspicion. Fortunately, Data Engineers and DataOps, who have less in-depth but broader knowledge, can bridge the gap between research and production. The following table shows the different concerns at each stage of the ML lifecycle:
| Stage | Data Scientist ♥ notebooks | Data Engineer ♥ workflows | DataOps ♥ audits | DevOps ♥ scripts | SysOps ♥ infra |
| --- | --- | --- | --- | --- | --- |
| 1 | Data collection, distribution analysis | Connection from source, staging, data augmentation | Architecture definition, metadata, governance | Data store provisioning | Network, security |
| 2 | Feature augmentation / engineering | Data preparation, cleansing | Job scheduling | Engine provisioning | Compute sizing |
| 3 | Feature/algorithm selection, error magnitude | Data integration, schema validation | Application monitoring | Build & test automation (CI) | System monitoring |
| 4 | Training, parameter tuning, dumping | Model deployment to job/service | Tracking & tracing | Configuration management | Alerting |
| 5 | Model evaluation (performance, accuracy) | Versioning, integration tests | Packaging, system tests | Promotion, release (CD) | Ticketing |
ML deployment
As more and more tools emerged on each side of the river, three different options remained for ensuring compatibility and reproducibility of ML models (e.g. between Python and Spark) when deploying them into a productive environment and embedding them into a business application:
- Most widely adopted: Use the Python programming language for all stages of the ML lifecycle, e.g.:
  - the IPython interactive shell, or Jupyter or Apache Zeppelin web-based notebooks (1)
  - the Python Data Analysis Library / pandas for data manipulation (2)
  - Featuretools for feature extraction (3)
  - scikit-learn pipelines for modelling and workflow definition (4)
  - Tox for testing, the Python Package Index (PyPI) for packaging, the Spark Python API / PySpark for execution (5)
- Most reliable: Use the Spark framework itself to (re-)build the scoring application
- Best from an architecture perspective: Use a model interchange format specification
  - A few years ago, the Predictive Model Markup Language (PMML) was the most common standard format for exporting models from R, Python and Spark. I used to play with JPMML, which was probably the only open-source library allowing models in that format to be executed directly, and it actually didn't do well: not only was it based on the Spark Java API, it was also very limited and barely maintained.
  - More recently, Microsoft, AWS and Facebook came together to create ONNX, a more promising alternative, at least in the long run, which is already well supported in Python.
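To make the first and third options more concrete, here is a minimal sketch (not our production code) of a scikit-learn pipeline trained in Python and then exported to the ONNX interchange format. The data, feature count and file name are made up, and it assumes the skl2onnx package is installed:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Dummy training data: 100 samples, 3 features
X = np.random.rand(100, 3).astype(np.float32)
y = X @ np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Option 1: the whole workflow (preprocessing + model) as one scikit-learn pipeline
pipeline = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipeline.fit(X, y)

# Option 3: export the fitted pipeline as a single ONNX graph
onnx_model = convert_sklearn(
    pipeline, initial_types=[("input", FloatTensorType([None, 3]))]
)
with open("scoring_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```

The resulting file can then be executed by any compliant runtime (e.g. onnxruntime), independently of the Python environment it was trained in.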
In our current project we more or less adopted the first approach.
Our setup
In our specific project context, we acknowledged the value of some historical track & trace records available in an AWS S3 bucket for predicting the future duration of a critical business process, a practice called ML problem framing. We then chose the Databricks Unified Data Analytics Platform for ML development and collaboration, set up on AWS. Our goal is of course to meet the business goals, and eventually deploy something to production.
What is the Databricks Unified Data Analytics Platform?
Databricks is a Machine-Learning-as-a-Service (MLaaS) offering available atop AWS and Azure IaaS. A couple of alternatives (like Dataiku or Mode) target the same goal of increasing development collaboration and automating the deployment life-cycle. The Databricks platform ships with the following features:
- High-end IPython-style notebooks
  - shared workspace
  - cell-oriented development and execution
  - markdown support for documentation
  - syntax highlighting and code completion
  - unlike Zeppelin in Zepl or Jupyter in AWS SageMaker, the Databricks notebook is proprietary, but the recent acquisition of redash.io hints at the product's direction and suggests an open-sourcing may come soon
- Staging and access via the Delta table format
  - storage-agnostic SQL-query layer atop AWS S3, Azure Storage and Hadoop HDFS
  - support for ACID transactions, as detailed in this academic paper
- Code life-cycle management via MLflow
  - Machine Learning Operations (MLOps) platform
  - tracking & tracing, model packaging and sharing
  - integration with AWS SageMaker and Azure Machine Learning
- Code derivation to
  - visualization dashboards
  - scheduled jobs
- Serverless execution on the most recent versions of Apache Spark
- Scala code compilation via "package cells"
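As a taste of how these pieces combine, here is a minimal sketch assuming a Databricks notebook (where the `spark` session is predefined) and a hypothetical Delta table of historical process runs; the table path, feature names and algorithm are illustrative, not our actual model:

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor

# Read historical records from a Delta table (ACID storage layer atop S3)
df = (spark.read.format("delta")
      .load("/mnt/datalake/process_runs")  # hypothetical mount point
      .toPandas())

X = df[["queue_length", "payload_size"]]  # made-up features
y = df["duration_minutes"]                # made-up target

# Track the experiment and package the model with MLflow
with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100)
    model.fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # packaged for later serving
```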
Our approach
Everybody in the team was quite new to the challenge. The stakeholders provided the required data. The Data Scientist did well at analysing the data and at creating and training a first model. The Data Engineer worked on data cleansing and on model enrichment based on live data (cf. Training vs. Inference).
According to the Udemy course "Deployment of Machine Learning Models", it is best practice to score data on-the-fly via a REST API rather than to implement an asynchronous messaging and scoring system, for example via stream processing. So the team followed the recommendation: a requirement that Databricks unfortunately doesn't help much with, except that MLflow supports serving a model as a standalone Docker image executed on AWS Elastic Container Service (ECS), as well as deployment to Azure ML and AWS SageMaker (both alternative MLaaS to Databricks).
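To illustrate the recommended pattern (MLflow can also serve a logged model out of the box), here is a minimal REST scoring sketch in Flask; the model file, payload shape and port are hypothetical, not our actual service:

```python
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical model dumped during training

@app.route("/score", methods=["POST"])
def score():
    # Expect a JSON body like {"features": [[1.2, 3.4]]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```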
Our challenge
Since each model is actually specific to a given context, we'll have to create, deploy and maintain many of them. Also, we will have to find a way to route central endpoint requests to the right scoring code, as sketched below. So many reasons for us to be concerned about deployment automation, right from the beginning of the project. The team started to play around with AWS SageMaker for exposing the endpoint, and for executing and re-training the models. Since Databricks and SageMaker now both occupy a significant place in the project, while presenting a large functional overlap, I assume we should be prepared to answer the questions "why both?" (with the resulting complexity), "why not just one?" (with the eventual compromise), "what is best for our use-case?" (and others to come).
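The routing concern could look like the following sketch: a single endpoint dispatching each request to the model trained for its context. The registry, context names and file layout are purely illustrative:

```python
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical registry: one serialized model per business context
MODEL_PATHS = {
    "invoicing": "models/invoicing.pkl",
    "shipping": "models/shipping.pkl",
}
_cache = {}

def get_model(context):
    # Lazy-load and cache each context's model
    if context not in _cache:
        _cache[context] = joblib.load(MODEL_PATHS[context])
    return _cache[context]

@app.route("/score/<context>", methods=["POST"])
def score(context):
    if context not in MODEL_PATHS:
        return jsonify({"error": "unknown context"}), 404
    features = request.get_json()["features"]
    return jsonify({"prediction": get_model(context).predict(features).tolist()})
```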
Conclusion
We just dug into the processes and tools involved in a larger ML project, with a focus on ML model deployment. That central process step is actually critical for architecture decisions, since it is functionally the link between ML development and operations, and technically the crossroads between Big Data and DevOps. We obviously want the project to be successful and to bring some return for the business. I am looking forward to experimenting with and evaluating the different options.