What you'll be doing:
Collaborate with research teams to onboard new LLMs and VLMs into our open-source AI runtimes.
Optimize inference workloads using sophisticated profiling and simulation tools.
Build SOLID, extensible inference software systems and refine robust APIs.
Implement and debug low-level GPU code to harness the latest hardware features.
Own end-to-end inference acceleration features and work with teams around the world to deliver production-grade products.
What we need to see:
B.Sc., M.Sc., or equivalent experience in Computer Science or Computer Engineering.
5+ years of relevant hands-on software engineering experience.
Deep knowledge of software design principles.
Strong proficiency in at least one systems programming language and one scripting language.
Strong grasp of machine learning concepts.
A people person with excellent communication skills who enjoys collaboration and teamwork.
Ways to stand out from the crowd:
Familiarity with our DL software stack, e.g. Triton Inference Server, TensorRT-LLM, and Model Optimizer.
Proven track record of performance modeling, profiling, debugging, and development in a performance-critical setting with our accelerators.
Familiarity with LLM quantization, fine-tuning, and caching algorithms.
Proficiency in GPU kernel programming (CUDA or OpenCL).
Prior experience working on a large software project with 50+ contributors.