What you'll be doing:
Learn our architecture with a focus on the technology that we drive.
Optimize AI/ML model training time at large scale.
Code and build proof-of-concept prototypes.
Design and define protocols and APIs for leveraging our technology in a data center.
Research and evaluate algorithms currently used in related applications.
Participate in defining hardware and system features, and assist software and hardware groups in enabling new technologies.
What we need to see:
B.Sc./M.Sc. or equivalent experience in Electrical Engineering or Computer Science from a leading university.
3-5 years of proven industry experience, specifically in software engineering and distributed AI system training.
Familiarity with networking concepts, terms, and software stack.
Passion for problem-solving and algorithms research and development.
Background in distributed AI/ML model training on GPU clusters.
Ways to stand out from the crowd:
Background in data center architecture.
Experience with a collective communications library such as NCCL.
Good understanding of the OS, driver, and performance aspects of a system.
Background in network synchronization protocols such as IEEE 1588 PTP.
Good command of Python and C/C++.