You'll help define how AI models are deployed and scaled in production, driving decisions on everything from memory orchestration and compute scheduling to inter-node communication and system-level optimizations. This is an opportunity to work with top engineers, researchers, and partners across the company and leave a mark on the way generative AI reaches real-world applications.
What You'll Be Doing:
Design and evolve scalable architectures for multi-node LLM inference across GPU clusters.
Develop infrastructure to optimize latency, throughput, and cost-efficiency of serving large models in production.
Collaborate with model, systems, compiler, and networking teams to ensure holistic, high-performance solutions.
Prototype novel approaches to KV cache handling, tensor/pipeline parallel execution, and dynamic batching.
Evaluate and integrate new software and hardware technologies relevant to core Spectrum-X capabilities, such as load balancing, telemetry, congestion control, and vertical application integration.
Work closely with internal teams and external partners to translate high-level architecture into reliable, high-performance systems.
Author design documents, internal specs, and technical blog posts and contribute to open-source efforts when appropriate.
What We Need to See:
Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or equivalent experience.
8+ years of experience building large-scale distributed systems or performance-critical software.
Deep understanding of deep learning systems, GPU acceleration, and AI model execution flows, and/or high-performance networking.
Solid software engineering skills in C++ and/or Python, ideally with demonstrated familiarity with CUDA or similar platforms.
Strong system-level thinking across memory, networking, scheduling, and compute orchestration.
Excellent communication skills and the ability to collaborate across diverse technical domains.
Ways to Stand Out from the Crowd:
Experience working on LLM training or inference pipelines, transformer model optimization, or model-parallel deployments.
Demonstrated success in profiling and optimizing performance bottlenecks across the LLM training or inference stack.
Experience with AI accelerators, distributed communication patterns, congestion control, and/or load balancing.
Proven track record of optimizing complex systems deployed at scale with measurable impact.
Passion for solving tough technical problems and finding high-impact solutions.















