We are seeking a highly skilled AI Application Team Lead to build and lead a team responsible for developing, running, and optimizing large-scale AI workloads on our AI hardware platform. This role focuses on benchmarking state-of-the-art models (e.g., LLaMA, DeepSeek), executing MLPerf suites, analyzing system-level performance, and driving cross-stack optimizations across hardware, runtime, and software frameworks.
The ideal candidate combines strong technical depth in AI/ML systems, hands-on experience with LLM workloads, and leadership capability to guide a high-performance engineering team.
Responsibilities
Lead and mentor a team of AI application and performance engineers.
Run and optimize AI workloads (LLaMA, DeepSeek, etc.) and execute MLPerf benchmarks.
Analyze end-to-end performance and identify HW/SW bottlenecks.
Develop optimization strategies across models, kernels, frameworks, and runtime.
Build profiling, debugging, and validation tools for large-scale AI workloads.
Collaborate with hardware, compiler, and device software teams to improve performance.
Requirements
5+ years of experience in AI/ML engineering, performance optimization, or ML systems.
Deep understanding of LLM architectures, training & inference mechanics, and modern ML frameworks.
Strong proficiency in the PyTorch ecosystem, with a specific focus on performance tuning via Triton, CUDA, or MLIR-based compiler frameworks.
Hands-on expertise in profiling and optimizing kernels (GEMM, attention, softmax, token pipelines).
Demonstrated experience running or tuning MLPerf or similar large-scale benchmarks.
Strong Python and C++ development skills.
Proven leadership experience: mentoring, guiding, or managing engineers.
This position is open to all candidates.