Senior HPC Cluster Engineer

באר שבערעננה

הגשת מועמדות יצירת בקשה לפגישה עם המעסיק

באר שבערעננה

פורסם לפני יותר מחודשיים

פורסמה ברשת

We have continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI the next era of computing. We are a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our lifes work, to amplify human imagination and intelligence. Make the choice to join us today!

As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and implementation of ground breaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek an expert to identify architectural changes and/or completely new approaches for our GPU Compute Clusters. As an expert, you will help us with the strategic challenges we encounter including: compute, networking, and storage design for large scale, high performance workloads, effective resource utilization in a heterogeneous compute environment, evolving our private/public cloud strategy, capacity modeling, and growth planning across our global computing environment.

What you'll be doing:

Building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions.

Maintaining and building deep learning clusters at scale.

Supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows.

Root cause analysis and suggest corrective action for problems large and small scales.

Finding and fixing problems before they occur.

Requirements:
What we need to see:

Bachelors degree in Computer Science, Electrical Engineering or related field or equivalent experience.

Minimum 5 years of experience designing and operating large scale compute infrastructure.

Experience analyzing and tuning performance for a variety of HPC workloads.

Working knowledge of cluster configuration managements tools such as Ansible, Puppet, Salt.

Experience with HPC cluster job schedulers such as SLURM, LSF.

In depth understating of container technologies like Docker, Singularity, Shifter, Charliecloud.

Proficient in Centos/RHEL and/or Ubuntu Linux distros including Python programming and bash scripting.

Experience with HPC workflows that use MPI.

Ways to stand out from the crowd:

Understanding of MLPerf benchmarking.

Familiarity with InfiniBand with IBOP and RDMA.

Understanding of fast, distributed storage systems like Lustre and GPFS for HPC workloads.

Background with Software Defined Networking and HPC cluster networking.

Familuarity with deep learning frameworks like PyTorch and TensorFlow.

This position is open to all candidates.

מידת ההתאמה שלי לתפקיד

עדכון כישורים

התאמה למשרה

התאמתך לתפקיד מחושבת על פי כישורך (כפי שסיפרת לנו עליהם) מול דרישות המעסיק - אין בכך כדי להעיד על קבלתך לעבודה (זה יחליט המעסיק)

למציאת הכשרות רלוונטיות עדכון כישורים

כישורים חסרים

InfiniBand ו-RDMAgot it don't got itMPI לזרימות עבודה HPCgot it don't got itאופטימיזציה של עומס HPCgot it don't got itבדיקת ביצועים MLPerfgot it don't got itטכנולוגיות מיכליםgot it don't got itכלים לניהול צביריםgot it don't got itכתיבת סקריפטים ב-Python/Bashgot it don't got itמסגרות למידה עמוקהgot it don't got itמערכות אחסון מבוזרותgot it don't got itמתזמני עבודה ל-HPCgot it don't got itעיצוב צבירי חישוב GPUgot it don't got itרשתות מוגדרות תוכנהgot it don't got itשליטה בהפצות לינוקסgot it don't got itתשתית למידה עמוקהgot it don't got it

משרות חדשות במערכת שיכולות לעניין אותך

DevOps Expert

פורסם לפני יותר מחודשיים

DevOps Expert – Guideline Group Guideline Group (traded on the Tel Aviv stock exchange) is a leading, well-established, company in ...

לוח אפשרויות קריירה - הדרך החכמה להייטק

Senior HPC Cluster Engineer