Model serving#
MLRun model serving lets you compose multi-stage, real-time pipelines that combine data manipulation with model execution. The architecture scales out while keeping latency low.
Basic model serving#
The most basic model serving capability is deployment of a single model. To do that, you:
1. Create an MLRun function of type serving that implements a serving class with the load and predict methods. The MLRun function marketplace comes with a range of such functions that support the most common frameworks.
2. Add the model to the function, using the add_model() method.
3. Test and deploy a model server, using the deploy() method.
This results in a single model endpoint that can execute the model and return the model prediction.
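The load/predict contract described above can be sketched in plain Python. This is a minimal illustration, not MLRun code: the class name MeanThresholdModel, its constructor arguments, and the {"inputs": [...]} request shape are assumptions made for the example. In MLRun, a serving class would typically subclass mlrun.serving.V2ModelServer instead of being a standalone class.

```python
class MeanThresholdModel:
    """Toy serving class (hypothetical, not an MLRun base class)."""

    def __init__(self, name, model_path=None):
        self.name = name
        self.model_path = model_path
        self.model = None  # populated by load()

    def load(self):
        # A real implementation would deserialize a model from
        # self.model_path; here the "model" is a hard-coded threshold.
        self.model = {"threshold": 0.5}

    def predict(self, body):
        # Assumed request shape: {"inputs": [[...], ...]}, one row per sample.
        threshold = self.model["threshold"]
        return [int(sum(row) / len(row) > threshold) for row in body["inputs"]]


model = MeanThresholdModel("toy-model")
model.load()  # runs once, before any request is served
result = model.predict({"inputs": [[0.9, 0.8], [0.1, 0.2]]})
print(result)
```

Separating load from predict keeps the expensive step (reading the model) out of the per-request path.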
See Basic model serving class and Using built-in model serving classes.
Optionally, you can create a mock server, which runs the model as an in-memory object within your development environment. This allows testing the model without deploying it.
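The idea of a mock server can be illustrated with a small in-process stand-in. The sketch below is not MLRun's mock server: InMemoryServer, RowLengthModel, and the test() method are hypothetical names, and the JSON round trip is an assumption used to mimic what a deployed HTTP endpoint would do to the payload.

```python
import json


class RowLengthModel:
    """Hypothetical model that returns the length of each input row."""

    def load(self):
        self.ready = True

    def predict(self, body):
        return [len(row) for row in body["inputs"]]


class InMemoryServer:
    """Hypothetical stand-in for a mock server; runs the model in-process."""

    def __init__(self, model):
        self.model = model
        self.model.load()  # load eagerly, as a deployed server would at startup

    def test(self, body):
        # Round-trip through JSON so that non-serializable payloads fail
        # here, in development, rather than after deployment.
        request = json.loads(json.dumps(body))
        outputs = self.model.predict(request)
        return json.loads(json.dumps({"outputs": outputs}))


server = InMemoryServer(RowLengthModel())
response = server.test({"inputs": [[1, 2, 3], [4, 5]]})
print(response)
```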
Routers and ensembles#
A single serving function can host more than one model. You can call add_model() multiple times, specifying a different model for each model key. Each add_model() call creates another model endpoint.
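As a rough illustration of routing by model key, the following plain-Python sketch registers several models on one object and dispatches each request to the model named in the call. ToyServingFunction and ConstantModel are hypothetical; only the method name add_model mirrors the text above, and its signature here is an assumption, not MLRun's.

```python
class ConstantModel:
    """Hypothetical model that predicts the same value for every sample."""

    def __init__(self, value):
        self.value = value

    def predict(self, body):
        return [self.value for _ in body["inputs"]]


class ToyServingFunction:
    """Hypothetical router: one function object hosting several models."""

    def __init__(self):
        self.models = {}

    def add_model(self, key, model):
        # Each call registers one more endpoint under its own key.
        self.models[key] = model

    def endpoints(self):
        return sorted(self.models)

    def infer(self, key, body):
        if key not in self.models:
            raise KeyError(f"no model registered under key {key!r}")
        return self.models[key].predict(body)


fn = ToyServingFunction()
fn.add_model("always-zero", ConstantModel(0))
fn.add_model("always-one", ConstantModel(1))

request = {"inputs": [[1.0], [2.0]]}
zero_out = fn.infer("always-zero", request)
one_out = fn.infer("always-one", request)
print(fn.endpoints(), zero_out, one_out)
```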
You can also create an ensemble of models, where a call to one model endpoint combines the results of several other models.
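One common way to combine results is majority voting. The sketch below assumes binary classifiers and uses hypothetical names (make_threshold_model, MajorityVoteEnsemble); it shows the general technique and is not MLRun's ensemble implementation.

```python
from collections import Counter


def make_threshold_model(threshold):
    """Return a hypothetical binary classifier keyed on the first feature."""

    def predict(body):
        return [int(row[0] > threshold) for row in body["inputs"]]

    return predict


class MajorityVoteEnsemble:
    """Hypothetical ensemble endpoint that combines member predictions."""

    def __init__(self, models):
        self.models = models  # mapping: model key -> predict callable

    def predict(self, body):
        # One list of predictions per member model.
        per_model = [predict(body) for predict in self.models.values()]
        # Regroup so that each tuple holds all votes for a single sample.
        votes_per_sample = zip(*per_model)
        return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


ensemble = MajorityVoteEnsemble(
    {
        "low": make_threshold_model(0.2),
        "mid": make_threshold_model(0.5),
        "high": make_threshold_model(0.8),
    }
)
votes = ensemble.predict({"inputs": [[0.9], [0.6], [0.1]]})
print(votes)
```

An odd number of binary voters avoids ties; with an even number or with multi-class outputs, a tie-breaking rule would need to be chosen explicitly.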
Model serving pipeline#
Model execution is usually part of a larger pipeline, in which model serving is just a single step. Usually, a range of data processing occurs before and after the model is executed. The pipeline may even involve more than one model, as well as filters and rules that govern the execution of the models.
MLRun implements model serving pipelines using its graph capabilities. A graph lets you define steps, such as data processing, data enrichment, and data manipulation, that run before the model is called, as well as steps that run afterward on the model output.
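The shape of such a pipeline can be sketched as a linear chain of plain functions, each taking an event and returning a new one. All names here (scale, model_step, label, run_flow) are hypothetical, and a simple list stands in for the graph; MLRun's graph API is not used.

```python
def scale(event):
    # Pre-processing step: rescale raw values from 0..100 to 0..1.
    return {"inputs": [[v / 100 for v in row] for row in event["inputs"]]}


def model_step(event):
    # Model step: flag a row when any scaled feature exceeds 0.5.
    return {"outputs": [int(max(row) > 0.5) for row in event["inputs"]]}


def label(event):
    # Post-processing step: map numeric outputs to readable labels.
    names = {0: "normal", 1: "alert"}
    return {"labels": [names[o] for o in event["outputs"]]}


def run_flow(steps, event):
    # Feed the output of each step into the next one, in order.
    for step in steps:
        event = step(event)
    return event


flow_result = run_flow([scale, model_step, label], {"inputs": [[10, 90], [20, 30]]})
print(flow_result)
```

Keeping each step a small function with a single input and output makes it easy to test steps in isolation and to reorder or replace them.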