
Deploying A GPU-Powered LLM on Cloud Run

Discover how you can deploy your own GPU-powered Large Language Model (LLM) on Google Cloud Run. This video walks through taking an open model like Gemma and deploying it as a scalable, serverless service with GPU acceleration. We explore the essential Dockerfile configuration and the `gcloud run deploy` command, highlighting key flags for optimizing performance and managing costs effectively. This setup lets your AI agent's intelligent core scale independently of the rest of your application.
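
The container described in the video typically pairs Ollama with a Dockerfile that bakes the model weights into the image. Here is a minimal sketch; the `ollama/ollama` base image is public, but the exact model tag and port setup may differ from what the codelab uses:

```dockerfile
# Sketch only -- see the codelab for the exact Dockerfile.
FROM ollama/ollama

# Cloud Run routes traffic to the container port (8080 here);
# bind Ollama to all interfaces on that port.
ENV OLLAMA_HOST=0.0.0.0:8080

# Pull the model at build time so cold starts don't have to
# download gigabytes of weights. "gemma3" is an assumed tag.
RUN ollama serve & sleep 5 && ollama pull gemma3

EXPOSE 8080
```

Baking the weights into the image trades a larger container for much faster cold starts, which matters when instances scale to zero between requests.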

Chapters:

0:00 – Introduction
0:32 – Why deploy a separate LLM?
1:32 – Serving the LLM with Ollama and Dockerfile
2:41 – Deploying to Cloud Run with `gcloud run deploy`
2:59 – Hardware configuration
3:20 – Performance and cost control
3:40 – Deployment summary
3:57 – Summary and next steps
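
The deployment and hardware-configuration steps above boil down to a single command. This is a sketch of the general shape; the service name, project, and image path are placeholders, and exact flag values (CPU, memory, concurrency) should come from the codelab:

```shell
# Sketch of a GPU-enabled Cloud Run deployment. Placeholders:
# ollama-gemma, PROJECT_ID, REPO. Flags of note:
#   --gpu / --gpu-type        attach one NVIDIA L4 GPU
#   --no-cpu-throttling       keep CPU allocated between requests
#   --concurrency             limit parallel requests per instance
#   --max-instances           cap scale-out to control GPU cost
gcloud beta run deploy ollama-gemma \
  --image us-central1-docker.pkg.dev/PROJECT_ID/REPO/ollama-gemma \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 8 \
  --memory 32Gi \
  --no-cpu-throttling \
  --concurrency 4 \
  --max-instances 1
```

Keeping `--max-instances` low is the main cost lever here: each instance holds a GPU, so the cap bounds your worst-case spend.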

Resources:
Codelab → https://goo.gle/aaiwcr-3
GitHub Repository → https://goo.gle/4pYAmMi
Google Cloud Run GPU → https://goo.gle/46EYI6g
ADK Documentation → https://goo.gle/46Thw0d

Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#GoogleCloud #LLM #CloudRun

Speaker: Amit Maraj
Products Mentioned: Cloud GPUs, Cloud Run

Date: October 7, 2025