AI-651
Advanced Cloud Native and Distributed AI Computing
Master Kubernetes, Ray, Terraform, and GitHub Actions to design, deploy, and manage distributed AI systems and cloud-based AI pipelines with scalability and fault tolerance.
Details
This course equips students with the tools and techniques needed to deploy AI pipelines, APIs, microservices, and open-source models in the cloud using Kubernetes, Ray, Terraform, and GitHub Actions. The curriculum emphasizes the design of distributed AI systems that span multiple nodes, addressing scalability and fault tolerance as well as the trade-offs among consistency, availability, and partition tolerance.
Students will explore the fundamentals of Kubernetes for container orchestration, Ray for distributed computing, Terraform for infrastructure as code, and GitHub Actions for CI/CD pipelines. Through hands-on projects, they will learn to integrate these technologies to build robust, scalable AI systems. By the end of the course, participants will have a deep understanding of distributed system design principles and practical experience deploying AI solutions in cloud environments.
What you will learn in this course
Deploy AI pipelines, APIs, and microservices with Kubernetes.
Use Ray for distributed computing across multiple nodes.
Implement infrastructure as code with Terraform.
Automate CI/CD workflows with GitHub Actions.
Design distributed AI systems with scalability and fault tolerance.
Balance the trade-offs among consistency, availability, and partition tolerance (the CAP theorem) in AI systems.
Apply distributed system design principles to real-world AI projects.
Prerequisites
- AI-101 - Modern AI Python Programming
- AI-461 - Distributed AI Computing
- AI-301 - Cloud Native AI Microservices