We are seeking a skilled software engineer to join our NPU software stack development team. This role involves developing high-performance GPU programming frameworks, runtime systems, and libraries for AI/ML workloads. You will be responsible for implementing, optimizing, and maintaining GPU software stack components to support distributed AI training and inference.
Key Responsibilities
Identify, analyze, and optimize bottlenecks in the distributed NPU ecosystem
Design and develop the NPU memory management system
Design and develop an optimized NPU development framework, execution path, and debugging support
Develop compatibility with AI frameworks (Triton, PyTorch, JAX)
Write high-quality, well-tested code with comprehensive documentation
Collaborate with other teams (Hardware, Network, QA, AI Framework Integration)
Participate in code reviews and technical design discussions.
Requirements:
Required Qualifications
5+ years of experience in distributed system programming
3+ years of experience with NPU programming (Triton, CUDA, HIP, OpenCL)
Expert-level C/C++ programming with focus on performance optimization
Expert-level Python programming with focus on DL/ML frameworks (PyTorch, JAX, etc.)
Deep understanding of NPU architecture, memory tiering, and programming models
Knowledge of NPU runtime systems
Experience with performance profiling and optimization tools
Strong problem-solving and debugging skills
Experience with version control systems, ticketing systems, and collaborative development
Team player with excellent communication skills
Fast learner, highly organized, detail-oriented, and highly motivated
Preferred Qualifications
Experience with NPU software stack development
Experience with large-scale NPU systems (100+ NPUs)
Experience with AI-oriented DL/ML workloads and distributed training and inference
Familiarity with containerization and orchestration.
This position is open to all candidates.















