
examples/05_operationalize/aks_locust_load_test.ipynb


Copyright (c) Recommenders contributors.

Licensed under the MIT License.

AKS Load Testing

Once a model has been deployed to production, it is important to ensure that the deployment target can support the expected load (the number of users and the expected response speed). This is critical for production recommendation systems, which must serve multiple users simultaneously. As the number of concurrent users grows, the load on the recommendation system can increase significantly, so understanding the limits of any operationalized system is necessary to avoid unwanted system failures or slow response times for users.

To perform this kind of load test, we can leverage tools that simulate user requests at varying rates and measure how many requests per second the service can handle and what its average response time is. This notebook walks through the process of load testing a model deployed on Azure Kubernetes Service (AKS).

This notebook assumes an AKS Webservice was used to deploy the model from an Azure Machine Learning service Workspace. An example of this approach is provided in the LightGBM Operationalization notebook.

We use Locust to perform the load testing; see the Locust documentation for more details about this tool.
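Locust may not be present in the base environment. Note that the code in this notebook uses the pre-1.0 Locust API (`HttpLocust`, `--no-web`), which was removed in Locust 1.0, so if installing, a pre-1.0 release is assumed (published on PyPI under the name `locustio`):

```shell
# Install a pre-1.0 Locust release, which still provides the HttpLocust
# class and the --no-web command-line flag used later in this notebook
pip install "locustio<1.0"
```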

python
import os
import subprocess
from tempfile import TemporaryDirectory
from urllib.parse import urlparse

import requests

from azureml.core import Workspace
from azureml.core import VERSION as azureml_version
from azureml.core.webservice import AksWebservice

from recommenders.datasets.criteo import load_pandas_df

# Check core SDK version number
print("Azure ML SDK version: {}".format(azureml_version))
python
# We increase the cell width to capture all the output from locust later
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Create a temporary directory for generated files

python
TMP_DIR = TemporaryDirectory()

Retrieve the AKS service information

python
# this must match the service name that has been deployed
SERVICE_NAME = 'lightgbm-criteo'
python
ws = Workspace.get(
    name="<AZUREML-WORKSPACE-NAME>",
    subscription_id='<AZURE-SUBSCRIPTION-ID>',
    resource_group='<AZURE-RESOURCE-GROUP>',
)
python
aks_service = AksWebservice(ws, name=SERVICE_NAME)
python
# Get the scoring URI
url = aks_service.scoring_uri
parsed_url = urlparse(url)

# Setup authentication using one of the keys from aks_service
headers = dict(Authorization='Bearer {}'.format(aks_service.get_keys()[0]))

Get Sample data for testing

python
# Grab some sample data
df = load_pandas_df(size='sample')
python
data = df.iloc[0, :].to_json()
print(data)
python
# Ensure the aks service is running and provides expected results
aks_service.run(data)
python
# Make sure an HTTP request to the service will also work
response = requests.post(url=url, json=data, headers=headers)
print(response.json())

Setup LocustFile

Locust uses a locust file (defaulting to locustfile.py) which controls the user behavior.

In this example we create a UserBehavior class which encapsulates the tasks the user will perform each time it is started. We are only interested in ensuring the service can handle a request with sample data, so the only task defined is the score task, a simple POST request like the one made manually above.

The next class defines how a user is instantiated: in this case, the user starts an HTTP session with the host server and executes the defined tasks. Each task is repeated after waiting for a short period, drawn uniformly at random between the min and max wait times (in milliseconds).
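The uniform wait behavior described above can be sketched as follows. This mimics what Locust does between task executions; the `min_wait` and `max_wait` values mirror those used in the locustfile below.

```python
import random

min_wait, max_wait = 1000, 2000  # milliseconds, as in the locustfile

def next_wait_ms():
    # Locust-style wait: a uniform random draw between min_wait and max_wait
    return random.uniform(min_wait, max_wait)

# Every sampled wait falls inside the configured interval
samples = [next_wait_ms() for _ in range(1000)]
print(min(samples) >= min_wait, max(samples) <= max_wait)
```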

python
locustfile = """
from locust import HttpLocust, TaskSet, task


class UserBehavior(TaskSet):
    @task
    def score(self):
        self.client.post("{score_url}", json='{data}', headers={headers})


class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    # min and max time to wait before repeating task
    min_wait = 1000
    max_wait = 2000
""".format(data=data, headers=headers, score_url=parsed_url.path)

locustfile_path = os.path.join(TMP_DIR.name, 'locustfile.py')
with open(locustfile_path, 'w') as f:
    f.write(locustfile)

The next step is to start the Locust load test tool. It can be run with a web interface or directly from the command line. Here we run it from the command line, specifying the number of concurrent users, how fast users should spawn, and how long the test should run. All of these options can also be controlled via the web UI, which additionally provides more information on failures, so it is useful to read the documentation for more advanced usage. Here we simply run the test and capture the summary results.

python
cmd = "locust -H {host} -f {path} --no-web -c {users} -r {rate} -t {duration} --only-summary".format(
    host='{url.scheme}://{url.netloc}'.format(url=parsed_url),
    path=locustfile_path,
    users=200,  # concurrent users
    rate=10,  # hatch rate (users / second)
    duration='1m',  # test duration
)
process = subprocess.run(cmd, shell=True, stderr=subprocess.PIPE)
print(process.stderr.decode('utf-8'))

Load Test Results

Above you can see the number of requests and failures, statistics on response time, and the number of requests per second the server is handling.

The second table shows the distribution of response times, which helps in understanding how the load impacts response speed across all requests and whether outliers may be degrading performance.
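To see why the percentile distribution matters, consider the sketch below (not Locust's own implementation): a handful of made-up response times where a single slow request barely moves the median but dominates the tail percentiles.

```python
# Hypothetical response times in milliseconds; one outlier at 250 ms
response_times_ms = [52, 48, 61, 55, 49, 250, 58, 53, 47, 60]

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, int(round(pct / 100.0 * len(ordered))))
    return ordered[rank - 1]

# The median stays low while the 99th percentile exposes the outlier
for pct in (50, 90, 99):
    print("p{}: {} ms".format(pct, percentile(response_times_ms, pct)))
```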

Cleanup temporary directory

python
TMP_DIR.cleanup()