We are looking for a Site Reliability Engineer to join our DevOps team. You will ensure the reliability, performance, and scalability of our back-office solutions, which serve as the foundation for the entire purchasing process. In this role, you will lead the development of SRE capabilities, meet SLI/SLO/SLA targets, and establish effective monitoring systems. You will enhance our Software Development Lifecycle by integrating reliability and scalability, working with cross-functional teams, and supporting production environments. Additionally, you will implement incident management processes and conduct post-mortem analyses to drive continuous improvement. If you have a strong engineering and automation background and are passionate about the E-commerce field, we would love to hear from you.
Roles and Responsibilities:
Develop and implement SRE capabilities to enhance the reliability, availability, and performance of Admin solutions.
Design and maintain proactive monitoring and alerting systems for deep visibility into critical business flows, beyond simple statuses, to identify functional issues.
Drive improvements in the Software Development Lifecycle (SDLC) for reliability and scalability from design to deployment.
Collaborate with development and operations teams to troubleshoot production incidents affecting the purchase flow, using root cause analysis.
Lead SRE initiatives to boost system resilience and operational efficiency.
Implement best practices for incident management, conduct blameless post-mortems, and contribute to capacity planning and performance testing to ensure scalability.
Requirements:
5+ years of experience as a Site Reliability/DevOps Engineer
Deep understanding of E-commerce flows, specifically with back-office operations and order processing – must
Experience as an Automation/Software Engineer with a strong understanding of software development principles and in building, testing, and deploying distributed systems – must
Experience in designing, implementing, and utilizing monitoring and observability platforms such as DataDog, NewRelic, Prometheus/Grafana, or ELK stack – must
Proficiency in scripting and automation using languages such as Python or Java – must
Ability to create dashboards, alerts, and insightful queries – must
Experience with AWS services to build and operate scalable and resilient applications (e.g., EC2, ECS/EKS, RDS, S3, Lambda, CloudWatch) – plus
Experience in automating infrastructure provisioning, application deployments, and repetitive operational tasks – plus
Proactive approach with excellent problem-solving skills
Strong collaborator, with an ability to work with cross-functional teams
Proficient in English.
This position is open to all candidates.