Required Site Reliability Engineer
We work in a flexible, hybrid model, so you can choose the home-office balance that works best for you.
Responsibilities:
Monitor, manage, and operate our cloud service. Scale the service with the required monitoring and alerting capabilities, and develop incident management, security, and compliance processes.
Work closely with R&D to make sure new features are reliable, easily deployable, and support the requirements of the service in terms of scale and security.
Establish a regular operational feedback cycle into our engineering teams.
Manage the Service Operations team to operate with a culture of business and customer-centricity by maintaining our SLA for each service, including incident response, problem management, and service upgrades.
Develop and drive, as the primary owner, the communication strategy for internal and external stakeholders (including customers) to convey service health, tracking against SLAs, current and historical incidents, upcoming events, or upgrades.
Ensure all technical procedures are documented, reviewed, and updated, and actively contribute to the maintenance of operational standards & policies.
Collaborate with the Support team to understand and improve user experience, performance, incident response, and the serviceability of our offerings.
Collaborate with the internal R&D team to automate infrastructure services and system administration tasks wherever possible and implement a monitoring strategy to provide rapid feedback and diagnostics in the event of a service disruption.
Create relationships with other departments, including Marketing, Product Management, Engineering, and Customer Success, to make sure we provide services with high availability and superior performance for all our customers.
Requirements:
At least 4 years of relevant industry experience in maintaining a high availability production environment as SRE / Automation Engineer.
At least 3 years of experience with service operations and extensive knowledge of cloud infrastructure planning and operations, design and deployment, as well as system life cycle management in supporting a SaaS infrastructure.
Solid understanding of Networking/VPCs/monitoring & alerting frameworks and tools.
Substantial experience in operating a high-availability cloud infrastructure.
Experience with cloud platforms like Azure or AWS.
Experience with running distributed systems deployed across multiple geographies around the globe.
Knowledge of security practices, tooling, and automation.
Experience with monitoring tools such as DataDog, New Relic, Grafana, Prometheus.
Experience with automation tools such as Ansible, Terraform.
Advanced knowledge of at least one scripting language such as Python or PowerShell.
Experience with CI/CD tools like Jenkins, Octopus or VSTS.
Some experience with relational database systems and SQL.
This position is open to all candidates.