You'll join a small team working around the globe to build some of the most cutting-edge data centers in the world. This role focuses on deploying server and compute clusters built on brand-new GPU platforms that power AI and machine learning. You'll be working with some of the world's largest and most sophisticated customers and supercomputers. You'll work alongside our InfiniBand and Ethernet network engineers to deploy a complete solution for customers looking to adopt our technology into their business.
Opportunities for global travel and learning about the newest GPU-related technologies are plentiful as we seek to build, shape and expand this new aspect of our business.
What you will be doing:
Primary responsibilities will include managing and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.
Support operational and reliability aspects of large-scale AI clusters with a focus on performance at scale, real-time monitoring, logging and alerting.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.
Be part of an on-call rotation to support production systems.
What we need to see:
5+ years providing in-depth support and deployment services, solving problems for hardware and software products.
Knowledge of and experience with Linux system administration: process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, and network routing/advanced networking (tuning and monitoring).
Experience with cluster management technologies, e.g. Bright Cluster Manager.
Scripting proficiency.
Good social skills with the ability to own and deliver resolutions for customer-blocking issues as they arise.
Superb communication and presentation/oral skills.
Excellent verbal and written English skills.
Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.
Candidates should have a minimum of a four-year degree from an accredited university or college in Computer Science, or Electrical or Computer Engineering.
Industry-standard Linux certifications.
Ways to stand out from the crowd:
InfiniBand experience.
Experience with GPU-focused hardware/software.
Experience with MPI.
Automation tooling background (Ansible, Salt, Puppet, etc.).
Experience with Ethernet and storage technologies.