In this video, we demo recent enhancements for deploying, scaling, and monitoring machine learning models with Red Hat OpenShift AI.
Join us as we showcase key features such as intelligent auto-scaling, which minimizes idle compute costs by scaling models down to zero when they are not in use. Learn how to serve large language models (LLMs) across multiple nodes, overcoming the memory limits of a single node's GPUs. Finally, we demonstrate the power of inference graphs, which let you chain multiple models together behind a single, scalable endpoint for complex tasks. The demo also shows real-time visibility into model health and performance, helping you ensure your models are production-ready and fully observable.
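For readers who want to try the auto-scaling behavior outside the video, here is a minimal sketch of how an existing KServe InferenceService could be patched for scale-to-zero and paused with the new stop feature, using the Kubernetes Python client. The namespace, the service name, and especially the stop annotation key are assumptions for illustration, not details taken from the video.

```python
# Sketch: enable scale-to-zero and pause an idle model on an existing
# KServe InferenceService via the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

NAMESPACE = "dog-breed-demo"        # placeholder project name
ISVC_NAME = "dog-breed-classifier"  # placeholder InferenceService name


def patch_isvc(body):
    """Apply a merge patch to the InferenceService custom resource."""
    return api.patch_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace=NAMESPACE,
        plural="inferenceservices",
        name=ISVC_NAME,
        body=body,
    )


# Scale-to-zero: with the serverless (Knative-backed) deployment mode,
# minReplicas: 0 lets the predictor scale down to zero when it gets no traffic.
patch_isvc({"spec": {"predictor": {"minReplicas": 0}}})

# Stop/pause: the annotation key below is an assumption about the new "stop"
# feature shown in the demo; check your OpenShift AI release notes for the
# exact annotation name before using it.
patch_isvc({"metadata": {"annotations": {"serving.kserve.io/stop": "true"}}})
```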
0:00 – Introduction to the OpenShift AI platform and the model serving and metrics team.
1:00 – Demo 1: An end-to-end model run using a dog breed classification model.
1:56 – Using a transformer to normalize images before they are passed to the model (see the transformer sketch below the timestamps).
2:20 – Querying the model from a Python notebook and checking the metrics (see the notebook query sketch below the timestamps).
3:07 – The new stop annotation feature for pausing inactive models to save resources.
3:38 – Explanation of intelligent auto-scaling, including scaling down to zero when idle.
4:07 – Demo 2: Multi-node feature for distributing a single model inference workload across multiple nodes.
4:18 – Explanation of the multi-node feature, which helps overcome the limit of single-node GPU memory for LLMs.
4:49 – Explanation of the inference graph feature, which allows chaining multiple models together.
4:59 – Demo 3: An inference graph that uses the image classification model to classify a dog image and then feeds that output into the vLLM model to generate a poem (see the inference graph sketch below the timestamps).
5:50 – The video concludes with a summary of the key features demonstrated.
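The image-normalization step from the first demo (1:56) can be pictured as a KServe-style transformer that sits in front of the predictor. Below is a minimal sketch assuming the kserve Python SDK; the model name, predictor host, input payload format, and exact method signatures are assumptions and vary between SDK versions.

```python
# Sketch: a transformer that decodes and normalizes images before they reach
# the dog breed classifier. Names and payload shapes are placeholders.
import base64
import io

import numpy as np
from PIL import Image
from kserve import Model, ModelServer


class ImageTransformer(Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host  # the predictor this transformer fronts
        self.ready = True

    def preprocess(self, payload: dict, headers: dict = None) -> dict:
        # Decode each base64-encoded image, resize it, and scale pixel values
        # to [0, 1] so the classifier receives consistently normalized input.
        instances = []
        for instance in payload["instances"]:
            raw = base64.b64decode(instance["image"]["b64"])
            img = Image.open(io.BytesIO(raw)).convert("RGB").resize((224, 224))
            instances.append((np.asarray(img) / 255.0).tolist())
        return {"instances": instances}


if __name__ == "__main__":
    model = ImageTransformer(
        "dog-breed-classifier",
        predictor_host="dog-breed-classifier-predictor",
    )
    ModelServer().start([model])
```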
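The notebook query at 2:20 could look roughly like the following, assuming a V1-style "predict" REST endpoint; the route URL, model name, token, and image path are placeholders for whatever the demo environment exposes. After sending a few requests like this, the corresponding request metrics can be checked in the OpenShift AI dashboard, as shown in the video.

```python
# Sketch: send one inference request to the served model from a notebook.
import base64

import requests

route = "https://dog-breed-classifier-demo.apps.example.com"  # placeholder route
model_name = "dog-breed-classifier"                           # placeholder model name
token = "<auth-token>"                                        # placeholder bearer token

# Encode a local test image as base64 and wrap it in a simple instances payload.
with open("collie.jpg", "rb") as f:
    payload = {"instances": [{"image": {"b64": base64.b64encode(f.read()).decode()}}]}

resp = requests.post(
    f"{route}/v1/models/{model_name}:predict",
    json=payload,
    headers={"Authorization": f"Bearer {token}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # e.g. the predicted breed and confidence scores
```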
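Demo 3's chained setup maps naturally onto KServe's InferenceGraph resource, where a Sequence node passes the first model's response to the second model. The sketch below creates a two-step graph with the Kubernetes Python client; the graph name, namespace, and service names are placeholders, not the ones used in the video.

```python
# Sketch: chain the image classifier and the vLLM model behind one endpoint
# using a KServe InferenceGraph with a two-step Sequence node.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

graph = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "InferenceGraph",
    "metadata": {"name": "dog-poem-graph", "namespace": "dog-breed-demo"},
    "spec": {
        "nodes": {
            "root": {
                "routerType": "Sequence",
                "steps": [
                    # Step 1: classify the dog image using the original request body.
                    {"serviceName": "dog-breed-classifier", "data": "$request"},
                    # Step 2: feed the classifier's response to the vLLM model,
                    # which generates a poem about the predicted breed.
                    {"serviceName": "vllm-poem-generator", "data": "$response"},
                ],
            }
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1alpha1",
    namespace="dog-breed-demo",
    plural="inferencegraphs",
    body=graph,
)
```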