We're seeking an experienced AI/ML Platform Engineer to join our Foundations team, the group behind our next-generation GenAI platform powering innovation across the company and beyond. This team builds scalable, high-performance AI systems for both internal users and external customers, designed to run seamlessly across cloud and on-premise environments using the latest advancements in hardware. In this role, you'll lead efforts in distributed training, large-scale inference, resource optimization, and robust model lifecycle management using MLOps best practices. Your work will be critical to accelerating research, supporting production-grade AI infrastructure, and driving the development of our internal AI ecosystem.
Responsibilities:
Architect and build scalable ML infrastructure for training and inference workloads across heterogeneous compute environments (on-premise and cloud).
Design and implement distributed systems to support model lifecycle management, from data ingestion and preprocessing to training orchestration and deployment.
Optimize performance and cost-efficiency of large-scale model training and serving pipelines using technologies like Ray, Kubernetes, Spark, and GPU schedulers.
Collaborate with AI researchers, data scientists, and product teams to understand their workflows and translate them into reusable platform services and APIs.
Drive adoption of best practices for CI/CD, observability, and reproducibility in ML systems.
Contribute to the long-term vision and technical roadmap of the ML platform, ensuring it evolves to meet the growing demands of AI across the company.
Requirements:
5+ years of experience building large-scale distributed systems or platforms, preferably in ML or data-intensive environments.
Proficiency in Python with strong software engineering practices, including familiarity with data structures and design patterns.
Deep understanding of orchestration systems (e.g., Kubernetes, Airflow, Argo) and distributed computing frameworks (e.g., Ray, Spark, Dask).
Experience with GPU compute infrastructure, containerization (Docker), and cloud-native architectures.
Proven track record of delivering production-grade infrastructure or developer platforms.
Solid grasp of ML workflows, including model training, evaluation, and inference pipelines.
This position is open to all candidates.