Role and Responsibilities
Own and evolve our internally built platform, ensuring scalability, reliability, and performance
Manage and optimize our monitoring stack, GPU workloads, and capacity across multi-cloud environments
Oversee dozens of projects and Kubernetes clusters, ensuring efficient operation and continuous improvement
Collaborate closely with the CloudOPS architect to align infrastructure strategies
Build and lead strong teams with a culture of high ownership and engineering excellence
Implement and maintain best practices for infrastructure management, automation, and deployment
Balance hands-on technical work with team leadership and strategic planning
Work with stakeholders across Research and Engineering to align infrastructure initiatives with business and technical goals
Requirements
Proven experience in Infrastructure Engineering or Platform Engineering roles
Strong technical background with the ability to dive deep into complex systems and codebases
Experience managing large-scale, distributed systems and GPU workloads
Expertise in cloud technologies (multi-cloud experience preferred) and Kubernetes
Proven track record of building and leading high-performing technical teams
Excellent communication skills and ability to work with various stakeholders
Strong problem-solving skills and ability to handle complex, large-scale technical challenges
Experience with platform engineering and building internal developer platforms (preferred)
Familiarity with modern monitoring and observability tools and practices
Knowledge of infrastructure-as-code and GitOps principles
Ability to balance deep technical expertise with leadership skills