We are looking for an experienced Site Reliability Engineer (SRE) with a passion for cloud-native system observability and a track record of implementing state-of-the-art monitoring solutions that offer comprehensive insights.
As an SRE, you will be instrumental in driving the adoption of progressive delivery practices, ensuring the deployment of robust and reliable systems with minimal operational disruptions.
Responsibilities
Master the art of cloud-native system observability by identifying and deploying monitoring tools and solutions that provide deep operational insights, ensuring the reliability and performance of cloud infrastructure.
Champion progressive delivery methods, employing strategies and technologies that enable the smooth and reliable deployment of systems, minimizing downtime and operational friction.
Live and breathe system metrics, utilizing data to drive significant improvements across the platform. Your knack for interpreting complex data into actionable plans will be key to enhancing system reliability and performance.
Commit to maintaining high system uptime, rigorously meeting and exceeding Service Level Agreements (SLAs) and Service Level Objectives (SLOs), as measured by Service Level Indicators (SLIs), ensuring the platform remains highly available and performant.
Adopt a proactive approach to system optimization, continuously seeking opportunities to improve infrastructure before issues arise, enhancing system efficiency and reducing the likelihood of unexpected downtime.
Work closely with Engineering, DevOps, and Product teams to integrate observability and reliability best practices into the architectural and infrastructure design, ensuring security and performance from the ground up.
Lead and contribute to the design and support of best-in-class integrations with third-party partners, vendors, and clients, alongside Architects, Developers, and System and Security Owners.
Train and educate the Technology team on SRE principles, tools, and best practices.
Respond to and manage incidents with a focus on rapid recovery and minimizing impact, utilizing insights gained to prevent future occurrences.
Requirements:
Implement Advanced Observability Frameworks: Design and deploy comprehensive observability systems to monitor the health, performance, and reliability of cloud-native applications. Utilize advanced tools for logging, metrics collection, and event monitoring to ensure deep visibility into system operations.
Deep knowledge of cloud platforms (AWS, GCP, Azure) and experience with cloud-native technologies.
Deep understanding of Kubernetes infrastructure.
Proficiency in monitoring tools (Datadog, Prometheus, Grafana) and experience in setting up comprehensive monitoring and alerting systems.
Excellent problem-solving skills and the ability to work under pressure to resolve incidents and ensure system reliability.
Progressive Delivery Expertise: Leverage progressive delivery techniques such as canary releases (e.g., with Argo Rollouts) – BIG advantage.
Tracing and Debugging: Manage distributed tracing systems (Datadog APM / Jaeger / OpenTelemetry) to diagnose and troubleshoot complex issues across microservices architectures. Employ effective logging and tracing strategies to pinpoint root causes of incidents and performance bottlenecks – BIG advantage.
Programming and Scripting Skills: Proficiency in programming and scripting languages such as Python, Go, and Bash – MUST.
Good presentation skills: ability to articulate technically advanced issues to all audiences, and ability to mentor and train internal staff.
Strong organizational skills and excellent attention to detail.
Ability to effectively prioritize and execute tasks.
Self-driven.
Excellent English.
This position is open to all candidates.