It's a unique legacy of innovation that's fuelled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. Doing what's never been done before takes vision, innovation, and the world's best talent. As our employee, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.
Our Networking team is looking for an AI & HPC Clusters group manager to join the Cloud Solutions group. In this role, you will build, manage, and maintain the biggest cluster in our Networking R&D to validate and test the next-generation networking cloud technology and reference architectures that are being released to our customers. We are currently working on next-generation Blackwell GPU platform AI clouds with our XDR (800G InfiniBand) and Spectrum-X800 next-generation technology. Come join the team and see how you can make a lasting impact on the world.
What you'll be doing:
Lead a group that is responsible for building, managing, and maintaining SW R&D clusters composed of Linux, Windows, and VMware systems; x86 and ARM CPUs; GPUs; and Ethernet and InfiniBand technologies.
Work closely with the engineering and architecture teams to understand, plan, and build new clusters for validating and testing new Networking technology solutions.
Drive the design and implementation of automated systems to deploy, configure, maintain, and monitor these clusters.
Drive the design and implementation of resource management systems for multiuser environments with different needs on these clusters.
Manage the R&D lab, including inventory, power, space, and cooling.
Build, expand, and mentor the team to address growing demands and requirements.
Innovate! Influence our Networking cluster management tools so that they shine in our customers' eyes.
What We Need to See:
A degree in Computer Science, Engineering, or a related field.
5+ years of managerial experience, including managing managers.
10+ years of relevant overall professional experience.
Experience in data center management at a multidisciplinary company, including handling power, cooling, and space.
Experience in managing HPC/AI clusters.
Deep understanding of operating systems, computer networks, and high-performance hardware.
Deep knowledge of distributed resource scheduling systems and orchestration tools such as Slurm and Kubernetes (K8s).
Strong organizational and project management skills; comfortable multitasking in a dynamic environment with shifting priorities and changing requirements.
Enthusiastic and ambitious personality, encouraging a positive and productive work environment.
Ways to Stand Out From the Crowd:
Knowledge of HPC and AI solution technologies from CPUs and GPUs to high-speed interconnects and supporting software.
Familiarity with CUDA and managing GPU-accelerated computing systems.
Experience with and knowledge of InfiniBand.