The ideal candidate enjoys working in a fast-paced environment with highly innovative technologies.
Your Impact
Provision, configure, and support resilient hybrid cloud deployment architectures using the automation framework
Collaborate with development teams to ensure applications are production-ready, scalable, and reliable from the outset
Manage the CI/CD platform and Linux infrastructure, and collaborate with other SREs to deploy and maintain the automation framework, perform capacity planning, and create and review operational runbooks.
Set up critical infrastructure and develop tools and frameworks to automate operational tasks, including the deployment of machines, services, and applications
Participate in the Incident Command on-call rotation supporting critical applications and services.
Conduct root cause analysis of critical business and production issues and drive future preventive measures
Manage scalability, capacity planning, redundancy, and resiliency
Maintain service availability and performance SLAs based on business and product requirements.
Contribute to documentation related to design, deployment, validation, and operations
Design proactive service monitoring, alerting, and trend analysis of underlying infrastructure, and support the operations team in implementation
Establish end-to-end monitoring and alerting on all critical components of the application.
Requirements:
6+ years of system engineering experience on mission-critical, enterprise-level systems
6+ years of experience using Infrastructure-as-Code to build large-scale environments, mainly on Linux platforms (Ubuntu, SUSE, CentOS).
3+ years of experience working with cloud environments, primarily Google Cloud Platform
Demonstrated Linux/Systems experience in a hybrid (cloud, on-prem) environment
Strong experience with CI/CD pipelines and tools such as GitHub, Jenkins, and Artifactory
Must have a strong foundation in Linux operating systems, including troubleshooting, design, and implementation
Expertise in configuration management with frameworks such as Terraform, Ansible, and Helm.
Experience with Linux vulnerability management processes and patching
Must have programming knowledge of Python, Bash, Perl, or Go to automate infrastructure workflows
Understanding of software development methodologies and practices, including agile development, continuous integration, and continuous delivery
Understanding of network firewalls, load balancers, and complex network designs
Experience with monitoring technologies such as Datadog, Nagios, Graphite, Cacti, and Grafana.
Understanding of Kubernetes, the container lifecycle, and troubleshooting
Hands-on knowledge of high-availability approaches such as load balancing, failover, clustering, and disaster recovery
Excellent problem-solving, critical thinking, communication, and teamwork skills
Passion, drive, energy, a sense of humor, and a great attitude.
This position is open to all candidates.