We are looking for a motivated SRE Manager to join our global DevOps group in our Tel Aviv R&D center. The group is responsible for the reliability and availability of the production environment hosting Cortex products, and for enabling the entire Cortex R&D group through CI tools, infrastructure, and automation. In this role you will be part of a DevOps group that plans, executes, and reports on infrastructure and code projects, and that manages and carries out high-pressure production maintenance work and issue handling.
The candidate will be a hands-on manager with an established background in operations and cloud infrastructure, a track record of developing and hiring extraordinary talent, strong technical ability, great communication skills, and the motivation to achieve results in a dynamic, fast-paced environment.
In this role, you will have full accountability for leading and managing a skilled team of Site Reliability Engineers, responsible for maintaining and enhancing our infrastructure, ensuring the resilience of our systems, and driving operational excellence.
Your Impact:
Team Leadership – Lead, mentor, and develop a team of SREs, fostering a culture of collaboration, innovation, and accountability
Reliability and Availability – Take ownership of the reliability and availability of our production environments, ensuring uninterrupted service to our users
Operational Efficiency – Drive initiatives such as postmortems, RCAs, and remediation processes to optimize operations, reduce downtime, and enhance system performance
Own monitoring processes, continuously improve alerts and metrics, and work with the development teams to improve their applications' SLOs
Manage and maintain optimal on-call rotations and shifts, define escalation paths, and take ownership of major incidents
Give management clear visibility into our SLOs and SLAs
Cloud Expertise – Utilize your expertise in cloud platforms, with a strong emphasis on GCP, to optimize our infrastructure and leverage cloud-native technologies
Scripting and Automation – Demonstrate high proficiency in scripting languages, with a preference for Python, to automate routine tasks and processes
Technology Evaluation – Stay up to date with cutting-edge technologies, evaluate their potential impact on our operations, and implement them when appropriate.
Requirements:
Leadership – 3+ years of experience leading SRE or Operations teams supporting large-scale production environments
5+ years in SRE, DevOps, or Operations roles
Cloud Proficiency – High proficiency in the GCP ecosystem
Monitoring – Understanding of SRE concepts such as alert improvement, SLIs, SLOs, and avoiding alert fatigue
Scripting Skills – Strong scripting skills, particularly in Python
Containerization – Experience with virtualized and containerized environments, including Kubernetes and Docker
Infrastructure-as-Code – Familiarity with IaC tools such as Terraform
Communication – Excellent communication and interpersonal skills, with the ability to collaborate effectively across teams
Adaptability – A knack for quickly grasping new technologies and the ability to manage multiple responsibilities simultaneously
Service Reliability – Experience navigating the complexities of business and service reliability.
This position is open to all candidates.