We are looking for an exceptional MLOps Team Lead to own, build, and scale the infrastructure and automation that power Labs' state-of-the-art Large Language Models (LLMs) and AI systems.
This is a technical leadership role that blends hands-on engineering with strategic vision. You will define MLOps best practices, build high-performance ML infrastructure, and lead a world-class team working at the intersection of AI research and production-grade ML systems.
You will work closely with LLM Algorithm Researchers, ML Engineers, and Data Scientists to enable fast, scalable, and reliable ML workflows, covering everything from distributed training to real-time inference optimization.
If you have deep technical expertise, thrive in high-scale AI environments, and want to lead the next generation of MLOps, we want to hear from you.
Requirements:
3+ years of experience in MLOps, ML infrastructure, or AI platform engineering.
2+ years of hands-on experience in ML pipeline automation, large-scale model deployment, and infrastructure scaling.
Expertise in deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) and MLOps platforms (e.g., Kubeflow, MLflow, TFX).
Proven track record of building production-grade ML systems that scale to billions of predictions daily.
Deep knowledge of Kubernetes, cloud-native architectures (AWS/GCP), and infrastructure as code (Terraform, Helm, ArgoCD).
Strong software engineering skills in Python, Bash, and Go, with a focus on writing clean, maintainable, and scalable code.
Experience with observability & monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry).
Strong background in security, compliance, and model governance for AI/ML systems.
This position is open to all candidates.

















