035: ML model lifecycle with MLflow

Because a Machine Learning model is a living asset, it is best practice to automate its development lifecycle to ensure repeatability.


The daily mood

Today was my weekly office day and I had the opportunity to meet a colleague from the SRE team whom I had not seen in a long time. He told me about the SRE organisation, its pains and achievements, which I found very insightful. We also joined three other colleagues for quite a long lunch break on the terrace of a restaurant. None of us had much social contact during the last months of lock-down, and it looks like we all enjoyed meeting again.

Besides that, I am now looking at some ML tooling for enforcing best practices and operationalization, which we are using in a project (see my previous post on ML model deployment) and with which I was not yet very familiar. The first candidate, MLflow, is actually pretty straightforward, at least in theory.


What is MLflow

MLflow is an ML lifecycle management platform written in Python. It supports the Data Scientist with an automation pipeline for training, evaluation, storage and serving of ML models, typically for online scoring applications.

MLflow is built around the following 4 features, accessible via Web-UI and API:
  • Tracking: Logs all experiments in the form of timestamped parameters and results so that they can be accessed anytime for search, replay and comparison.
  • Projects: Organises code into a "standard" structure based on Conda and Docker for reproducible development and execution across different operating systems.
  • Models: Packages model deliverables into a "standard" structure based on Python object serialization (i.e. Pickle) plus configuration for the downstream tool (e.g. Apache Spark, Azure ML or AWS SageMaker).
  • Registry: Centrally manages and shares models for better collaboration and automation. The registry provides model versioning and lineage, stage transitions and annotations (see the sketch right after this list).
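
Since the Registry is not covered further below, here is a minimal sketch of how a model would typically be registered and promoted. It assumes a database-backed tracking server (which the registry requires); the model name and run id are placeholders:

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged under a given run in the Model Registry
# (requires a database-backed tracking server; names and ids are placeholders)
result = mlflow.register_model("runs:/<run-id>/model", "sklearn-example")

# Promote the newly created version through the lifecycle stages
client = MlflowClient()
client.transition_model_version_stage(
    name="sklearn-example",
    version=result.version,
    stage="Production",
)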

There are some public images and charts of MLflow on Docker Hub and Helm Hub that claim to be "production-ready", as well as a managed offering available as part of the Databricks platform.


Setup

$ pip3 install conda mlflow


Tracking

We'll run a piece of code that makes some dummy calls to MLflow's log_metric, log_param and log_artifacts functions.
 
$ wget https://raw.githubusercontent.com/mlflow/mlflow/master/examples/quickstart/mlflow_tracking.py
$ python3 mlflow_tracking.py
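
For reference, the script boils down to a handful of dummy logging calls, along these lines (a simplified sketch, not a verbatim copy of the quickstart code):

import os
from random import random, randint
from mlflow import log_metric, log_param, log_artifacts

# Log one dummy hyper-parameter and one dummy metric for the current run
log_param("param1", randint(0, 100))
log_metric("foo", random())

# Write a dummy output file and log the whole directory as run artifacts
os.makedirs("outputs", exist_ok=True)
with open("outputs/test.txt", "w") as f:
    f.write("hello world!")
log_artifacts("outputs")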

All tracking information is stored on the local file system, under the mlruns directory by default.

$ tree mlruns
mlruns
└── 0
    ├── 31a77fbb24614d549866908e960a3fe2
    │   ├── artifacts
    │   │   └── test.txt
    │   ├── meta.yaml
    │   ├── metrics
    │   │   └── foo
    │   ├── params
    │   │   └── param1
    │   └── tags
    │       ├── mlflow.source.git.commit
    │       ├── mlflow.source.name
    │       ├── mlflow.source.type
    │       └── mlflow.user
    └── meta.yaml

6 directories, 9 files

The Web-UI offers a nice representation of these runs.

$ mlflow ui
$ sensible-browser http://localhost:5000


Projects

Projects currently need to be packaged by hand, i.e. as a directory structure containing an MLproject file and a conda.yaml file, both in YAML format.

Example of MLproject:

name: sklearn_logistic_example

conda_env: conda.yaml

entry_points:
  main:
    command: "python train.py"

Example of conda.yaml:

name: sklearn-example
channels:
  - defaults
  - anaconda
  - conda-forge
dependencies:
  - python=3.6
  - scikit-learn=0.19.1
  - pip
  - pip:
    - mlflow>=1.0

Once done, you can run the project from a URI (e.g. local file system, remote file store or Git repository):

$ mlflow run examples/sklearn_elasticnet_wine -P alpha=0.4
$ mlflow run https://github.com/mlflow/mlflow-example.git -P alpha=0.4
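
For illustration, here is a hypothetical train.py entry point showing how such a -P parameter typically reaches the script; it assumes the MLproject entry point declares an alpha parameter and a command like "python train.py {alpha}" (a sketch, not the actual example code):

import sys

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# The -P alpha=0.4 value is substituted into the command line by `mlflow run`
alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5

# Dummy training data just to make the sketch self-contained
X, y = make_regression(n_samples=100, n_features=4, random_state=42)

with mlflow.start_run():
    model = ElasticNet(alpha=alpha).fit(X, y)
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")   # saved under the run's artifacts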


Models

Models currently need to be created externally and provisioned into the MLflow project directory structure as an MLmodel file in YAML format and a model.pkl file in binary format.

Example of MLmodel:

time_created: 2018-05-25T17:28:53.35

flavors:
  sklearn:
    sklearn_version: 0.19.1
    pickled_model: model.pkl
  python_function:
    loader_module: mlflow.sklearn

As we can see, this MLmodel file describes two flavors from a larger set of built-in flavors. The mlflow.sklearn module defines a log_model() function that allows a Python script to save a model in MLflow format. Such models can then be loaded with the load_model() function from the same module for further Scikit-learn object handling, or from the mlflow.pyfunc module for inference.
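
A minimal, self-contained sketch of this round-trip (the dataset and model are arbitrary, only the mlflow calls matter):

import mlflow
import mlflow.pyfunc
import mlflow.sklearn
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Save the model in MLflow format: this writes the MLmodel and model.pkl files shown above
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(LogisticRegression(max_iter=200).fit(X, y), "model")

model_uri = "runs:/{}/model".format(run.info.run_id)

# Load it back as a native scikit-learn object for further handling...
sk_model = mlflow.sklearn.load_model(model_uri)

# ...or as a generic python_function flavor that only exposes predict() for inference
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
print(pyfunc_model.predict(pd.DataFrame(X[:5])))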


Model Serving

Once the model is ready, it seamlessly translates into a very simple REST API with just one POST endpoint called "/invocations".

First and foremost, the server can run as a local web server based on Flask.

$ mlflow models serve --model-uri runs:/<run-id>/model --port 5001
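
To sanity-check the endpoint, you can POST a JSON payload; the exact payload convention depends on the MLflow version, but with an MLflow 1.x server a pandas-split-oriented payload like the following should work (column names and values are made up):

import requests

# Two dummy records in "pandas-split" orientation (MLflow 1.x convention)
payload = {
    "columns": ["feature_1", "feature_2"],
    "data": [[1.0, 2.0], [3.0, 4.0]],
}

response = requests.post(
    "http://localhost:5001/invocations",
    json=payload,
    headers={"Content-Type": "application/json; format=pandas-split"},
)
print(response.json())   # list of predictions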

Since the Flask development server is neither secure nor production-grade, it is highly recommended to expose the endpoint behind a gateway. In the case of a Spark model with scalability requirements, the server should also delegate the scoring process to an external Spark cluster, which adds corresponding latency.

A better option is to run the model as a Docker container provisioned through the out-of-the-box integration with Azure ML and AWS SageMaker.

$ mlflow azureml build-image --help
Register an MLflow model with Azure ML and build an Azure ML ContainerImage for deployment.

Note: In my specific project context we are not much interested in Azure ML since everything else is hosted in AWS. Also, MLflow's support for Azure ML looks much more limited than that for AWS SageMaker.

$ mlflow sagemaker build-and-push-container --help
Build new MLflow Sagemaker image, assign it a name, and push to ECR.
$ mlflow sagemaker deploy --help
Deploy model on Sagemaker as a REST API endpoint.


You can also deploy your trained model programmatically via MLflow's API, using for example the deploy() function from the mlflow.sagemaker module, which automatically creates a SageMaker endpoint.
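
A rough sketch of such a call, as it looked in MLflow 1.x (argument names may differ in other versions, and all ARNs, names and regions below are placeholders):

import mlflow.sagemaker

mlflow.sagemaker.deploy(
    app_name="my-model",                        # becomes the SageMaker endpoint name
    model_uri="runs:/<run-id>/model",           # same runs:/ URI scheme as for local serving
    execution_role_arn="arn:aws:iam::<account>:role/<sagemaker-role>",
    region_name="eu-west-1",
    mode="create",                              # or "replace" to update an existing endpoint
    instance_type="ml.m5.large",
    instance_count=1,
)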


Take away

I am starting to better understand the capabilities and limitations of the processes and tools involved in a larger ML project, on which only a few people have developed expertise, at least inside our organisation. To be honest, I had actually expected more from the Databricks platform than what I've seen so far with their managed notebook, workflow and deployment approach. It is not much automated and supports development only, not operations. Next I definitely want to have a closer look at AWS SageMaker, and will try to figure out whether the two MLaaS platforms are overlapping or complementary.
