What Youll Be Doing:
Lead research and development of end-to-end networking solutions for distributed AI training and inference at scale, with a focus on job completion time, failure resiliency, telemetry, scheduling, and placement.
Analyze current deployments, develop prototypes, and recommend architectural improvements.
Stay abreast of the latest research; become the teams authority in emerging networking techniques and technologies.
Design, simulate, and validate new systems using novel, scalable network simulator NSX.
Develop and test prototypes on large-scale GPU clusters (e.g., Israel-1).
Collaborate across hardware, firmware, and software teams to translate ideas into real networking product features.
Publish patents and present research at leading conferences.
What We Need to See:
M.Sc. or PhD (preferred) in Computer Science, Electrical/Computer Engineering, or related fieldor B.Sc. with research experience and publications.
5+ years of relevant experience.
Deep expertise in networking and communication internals (NCCL, RDMA, congestion control, routing).
Strong software engineering skills in C++ and/or Python.
Excellent system-level design and problem-solving abilities.
Outstanding communication and collaboration skills across technical domains.
Ways to Stand Out from the Crowd:
Proven passion for solving sophisticated technical problems and delivering impactful solutions.
Record of publications in top-tier conferences.
Experience in designing and building large-scale AI training clusters.
Post-PhD research experience
Practical understanding of deep learning systems, GPU acceleration, and AI model execution flows.















