The New Azure Machine Learning Services

Introduction

Last week at Ignite 2018, Microsoft released a public preview of their new Azure Machine Learning Services. The outcome is a much clearer and easier-to-understand product portfolio for everything that revolves around building custom Machine Learning models, including:

  • Data handling and cleaning
  • Model training and testing (including automated training)
  • Model deployment and monitoring

In this post, we’ll walk through the different services of Azure Machine Learning, clarify the terminology and give some pointers on how to get started.

Azure Machine Learning Services Overview

To start out with, we’ll have a quick look at the different services that make up Azure Machine Learning.

Azure Machine Learning Workspace

A Machine Learning Workspace is the central unit that contains everything needed to perform Machine Learning on Azure. It holds our compute resources for training models, our data, our model files (including containerized models), our deployments, and our run histories. Workspaces are now fully embedded in the Azure Portal, as well as in the new SDK and CLI.

Azure Machine Learning Workspace

Each Workspace contains the following items:

  • Experiments – These are the “playgrounds” where we train and test our models. We could either have one experiment per problem we are trying to solve, or even one experiment per Machine Learning algorithm.
  • Compute – These are the compute targets for training our models in Azure.
  • Models – The output files for our trained models.
  • Images – Container images that contain packaged/containerized versions of our models, ready for deployment.
  • Deployments – Our deployed models, running serverless on Azure Container Instances, on an Azure Kubernetes Service cluster, on FPGAs, or on Azure IoT Edge.
  • (Activities) – A rundown of what we’ve done in our Workspace.
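
As a quick taste of the new Python SDK, here is a minimal sketch of creating a Workspace from code – the name, subscription ID, resource group, and region below are all placeholders:

    from azureml.core import Workspace

    # create a new Workspace (Workspace.get() retrieves an existing one);
    # all names and IDs here are placeholders
    ws = Workspace.create(name='my-workspace',
                          subscription_id='<subscription-id>',
                          resource_group='my-resource-group',
                          location='westeurope')

    # persist the connection details to a local config file, so that
    # later scripts can simply call Workspace.from_config()
    ws.write_config()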

Upon creation of a Workspace, Azure automatically creates a set of resources that will be used during the lifecycle of the Workspace:

  • Machine Learning Service Workspace – our central unit of management
  • Storage Account – used for storing our model files, log outputs, etc. from our experiment runs
  • Azure Container Registry (ACR) – used for storing our containerized models
  • Key Vault – used for storing any secrets generated during the process
  • Application Insights – used for monitoring our models that have been deployed into production

Next, let’s have a look at the process that Azure Machine Learning follows.

Building a Model and putting it into Production

Going from having data to putting a Machine Learning model into production is a pretty straightforward process with Azure Machine Learning. In total, it encompasses only a few steps:

  1. Ingest data into system or make data accessible
  2. Write code for our model, any framework supported (PyTorch, TensorFlow, Scikit-Learn, etc.)
  3. Train our model on either Azure Batch AI, a DSVM (Data Science Virtual Machine), Kubernetes (via Azure Kubernetes Service), or locally
  4. Azure Machine Learning will store the output artefacts of our models in a Blob Storage
  5. Given the artefacts, we’ll have Azure Machine Learning build a containerized model, using a score.py script that loads the model and performs inferencing/scoring on incoming data (a sketch of such a script follows below)
  6. Azure Machine Learning will push the container images into Azure Container Registry
  7. From there, we can have Azure Machine Learning deploy them to Azure Container Instances, an Azure Kubernetes Service cluster, FPGAs, or Azure IoT Edge
  8. Application Insights is wired in, so that we can receive operational metrics for monitoring our models

Azure Machine Learning Overview
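
To make step 5 a bit more concrete, here is a minimal sketch of what such a score.py could look like – the registered model name and the use of joblib for a scikit-learn model are assumptions on my part:

    import json
    import joblib
    from azureml.core.model import Model

    def init():
        # called once when the container starts – load the registered model
        global model
        model_path = Model.get_model_path('my-model')  # placeholder model name
        model = joblib.load(model_path)

    def run(raw_data):
        # called for every scoring request – raw_data is the JSON request body
        data = json.loads(raw_data)['data']
        predictions = model.predict(data)
        return json.dumps({'predictions': predictions.tolist()})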

I personally enjoyed the new quickstart guide, as it nicely documents the steps from training a model all the way to deploying it on Azure Container Instances.
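
For reference, the deployment part (steps 5 to 7) boils down to a handful of SDK calls. The following is a sketch under the assumption that a trained model file sits in outputs/model.pkl and that a score.py and a conda environment file exist – all names and paths are placeholders:

    from azureml.core import Workspace
    from azureml.core.model import Model
    from azureml.core.image import ContainerImage
    from azureml.core.webservice import AciWebservice, Webservice

    ws = Workspace.from_config()

    # steps 4/5: register the trained model artefact with the Workspace
    model = Model.register(workspace=ws,
                           model_path='outputs/model.pkl',  # placeholder path
                           model_name='my-model')

    # step 5: package the model and the scoring script into a container image
    image_config = ContainerImage.image_configuration(execution_script='score.py',
                                                      runtime='python',
                                                      conda_file='conda_env.yml')

    # steps 6/7: build the image, push it to ACR, and deploy it serverless to ACI
    aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
    service = Webservice.deploy_from_model(workspace=ws,
                                           name='my-service',
                                           models=[model],
                                           image_config=image_config,
                                           deployment_config=aci_config)
    service.wait_for_deployment(show_output=True)
    print(service.scoring_uri)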

Azure Machine Learning SDK and CLI

Finally, there is a dedicated CLI and SDK for using Azure Machine Learning – the days where we had to install the heavy Azure Machine Learning Workbench just to get a command line are finally gone! The Workbench application has been deprecated, and most of its parts have moved into Azure, the SDK, and the CLI. Both are easy to get started with.

The new Azure Machine Learning CLI

The SDK, as well as the CLI for the most part, allows us to fully orchestrate Azure Machine Learning, including model training (locally or on Azure), model management, model deployment, and model monitoring. The CLI is built on top of the Python SDK and hence offers a somewhat smaller subset of its features.
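
As an example, submitting a training script as an experiment run takes only a few lines – train.py and the experiment name are placeholders:

    from azureml.core import Workspace, Experiment, ScriptRunConfig

    ws = Workspace.from_config()
    experiment = Experiment(workspace=ws, name='my-experiment')

    # run train.py (a placeholder script) on the local machine; a remote
    # compute target can be configured via the run configuration instead
    src = ScriptRunConfig(source_directory='.', script='train.py')
    run = experiment.submit(src)
    run.wait_for_completion(show_output=True)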

The Python SDK also offers additional capabilities for building data pipelines, e.g., for data cleaning or for pulling data from different systems. Furthermore, it can perform automated Machine Learning as well as hyperparameter tuning. More on those in the next sections.

Automated Machine Learning

Automated Machine Learning is a new service that can automatically evaluate multiple different Machine Learning algorithms for us and then pick the best-performing model. In short:

  • We bring the training data and associated labels (as Pandas DataFrames or NumPy arrays)
  • The service automatically tries out different models and returns a score for each

Historically, testing different Machine Learning algorithms has been a pretty time-consuming task. With this service, it is reduced to a few lines of Python. While this certainly won’t make a Data Scientist’s job redundant, it definitely makes them more productive.

Azure Automated Machine Learning Overview

In its current release, Automated Machine Learning is able to solve classification and regression problems using a set of algorithms from scikit-learn. We can run it on Azure (again, Batch AI, DSVM, or AKS) or locally. Once we’ve found a good model, we can deploy it in the same way as we would deploy a “hand-trained” model.
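
As a rough sketch of such a run (using scikit-learn’s digits dataset purely as example data – the iteration count and metric are arbitrary choices):

    from azureml.core import Workspace, Experiment
    from azureml.train.automl import AutoMLConfig
    from sklearn.datasets import load_digits

    ws = Workspace.from_config()
    experiment = Experiment(workspace=ws, name='automl-example')

    # example data only – any Pandas DataFrame or NumPy array works
    X, y = load_digits(return_X_y=True)

    # let the service try 20 different pipelines and rank them by accuracy
    automl_config = AutoMLConfig(task='classification',
                                 X=X, y=y,
                                 iterations=20,
                                 primary_metric='accuracy')

    run = experiment.submit(automl_config, show_output=True)
    best_run, fitted_model = run.get_output()  # the best-performing model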

Automated Hyperparameter Tuning Service

The new Azure Machine Learning also includes the capability to perform automated hyperparameter tuning. The service is called “Hyperdrive”, even though the name is hardly mentioned anywhere except in the Python namespace (azureml.train.hyperdrive). It works the following way:

  1. First, we specify which hyperparameters we want to optimize – This is completely up to us: e.g., our learning rate, our batch size, number of hidden layers or neurons in a deep neural network, dropout rates, etc.
  2. Next, we select a metric we want to optimize – Again, this is up to us, but most likely this would be our accuracy
  3. Lastly, we let it run on a compute target of our choice (preferably a Batch AI cluster)

The service supports early termination policies, which allow us to greatly reduce the search space in those cases where the system notices that a training run is showing poor metrics. Furthermore, sampling of the search space (e.g., random or grid sampling) is supported.
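
Here is a hedged sketch of what this could look like in the SDK – the hyperparameter names, ranges, training script, and cluster name are all illustrative:

    from azureml.core import Workspace, Experiment
    from azureml.train.estimator import Estimator
    from azureml.train.hyperdrive import (HyperDriveRunConfig, RandomParameterSampling,
                                          BanditPolicy, PrimaryMetricGoal, uniform, choice)

    ws = Workspace.from_config()
    experiment = Experiment(workspace=ws, name='hyperdrive-example')

    # an estimator wrapping a placeholder training script; compute_target
    # would typically be the name of a Batch AI cluster in the Workspace
    estimator = Estimator(source_directory='.',
                          entry_script='train.py',
                          compute_target='my-batchai-cluster')

    # step 1: the hyperparameters to explore, here sampled randomly
    sampling = RandomParameterSampling({
        '--learning-rate': uniform(0.001, 0.1),
        '--batch-size': choice(16, 32, 64),
    })

    # early termination: cancel runs falling too far behind the best one so far
    policy = BanditPolicy(evaluation_interval=1, slack_factor=0.1)

    # steps 2 and 3: pick the metric to optimize (train.py is assumed to log
    # 'accuracy' during its runs) and submit to the compute target
    hyperdrive_config = HyperDriveRunConfig(estimator=estimator,
                                            hyperparameter_sampling=sampling,
                                            policy=policy,
                                            primary_metric_name='accuracy',
                                            primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                            max_total_runs=20)

    run = experiment.submit(hyperdrive_config)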

Azure Batch AI

Azure Batch AI is a service that enables distributed Deep Learning (or Machine Learning in general). Batch AI completely manages the underlying VM-based infrastructure; we just need to tell it how many nodes we want and whether we want it to autoscale according to our needs.

Azure Machine Learning helps package our training experiments and deploy them easily on a Batch AI cluster. Additionally, Batch AI takes care of input and output data flow, storing all the results at the end of a run in Azure Blob storage or Azure Files.
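
Creating such a cluster from the SDK is a short affair – the VM size and node counts below are just example values:

    from azureml.core import Workspace
    from azureml.core.compute import ComputeTarget, BatchAiCompute

    ws = Workspace.from_config()

    # a GPU cluster that autoscales between 0 and 4 nodes on demand
    config = BatchAiCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                       autoscale_enabled=True,
                                                       cluster_min_nodes=0,
                                                       cluster_max_nodes=4)
    cluster = ComputeTarget.create(ws, 'my-batchai-cluster', config)
    cluster.wait_for_completion(show_output=True)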

AI Toolkit for Azure IoT Edge

The AI Toolkit for Azure IoT Edge allows us to deploy our trained Machine Learning models to Azure IoT Edge. Azure IoT Edge is, as the name suggests, an IoT runtime that can be deployed on premises, hence close to the IoT devices. Some Machine Learning models might be trained and built in the cloud but will need to live at the edge, for reasons such as:

  • Data sovereignty requirements
  • Privacy concerns
  • Bandwidth limitations

Under the hood, the Azure IoT Edge runtime deploys the containerized models coming out of Azure Machine Learning and can run on anything from small devices all the way up to regular servers.

Project Brainwave (FPGAs for ML)

Project Brainwave allows us to deploy trained models to FPGAs in Azure for ultra-low-latency inferencing. It currently supports a number of deep network architectures:

  • ResNet 50
  • ResNet 152
  • VGG-16
  • SSD-VGG
  • DenseNet-121

Summary

The major overhaul of Azure Machine Learning is a big step forward. In my opinion, it really enables Data Scientists and Machine Learning experts to focus more on their actual work, rather than on infrastructure or dealing with deploying, monitoring and taking care of models. In the next posts, we’ll walk through some of the services and give them a try!

Until then, if you have any questions or suggestions, feel free to reach out via Twitter.
