What You'll Do:
Design, build, and support tooling, automation, and infrastructure to maximize the reliability, scalability, and performance of our Cognition.
Proactively identify, mitigate, and resolve issues, leveraging AI-driven insights and automation where possible.
Develop robust monitoring, alerting, and incident response strategies; ensure actionable observability across all critical systems.
Drive best practices in CI/CD, Infrastructure-as-Code, environment provisioning, and disaster recovery.
Collaborate closely with engineering teams to build, deploy, and maintain highly available services in production.
Take responsibility for uptime, reliability, and the operational excellence of our Cognition.
Help define and measure SLOs/SLAs to ensure world-class service delivery.
Requirements:
3+ years in Site Reliability, DevOps, or related Infrastructure Engineering roles in 24/7 production environments.
Deep experience operating, automating, and supporting distributed systems on AWS or similar clouds.
Experience with Infrastructure-as-Code (e.g., Terraform, CloudFormation) and CI/CD tooling (e.g., Jenkins, GitHub Actions).
Strong skills in Python, Bash, or comparable scripting languages for automation.
Hands-on experience with observability stacks (e.g., New Relic, Grafana, CloudWatch, Datadog) and incident response.
Familiarity with microservices architectures and patterns for resilience/scalability (e.g., throttling, retries, circuit breakers).
Experience with common data stores (MySQL/RDS, DocumentDB, Elasticsearch, Redis).
Working knowledge of Node.js/TypeScript backends (bonus: performance optimization and monitoring); experience with Java, Python, or Go is a plus.
Interest or experience in applying AI for infrastructure automation, monitoring, or optimization (a strong plus).
A collaborative mindset with strong communication skills, able to work independently and comfortably across teams and disciplines.
Able to thrive in a fast-paced, high-growth environment and ready to tackle complex system challenges at scale.
Data-driven, analytical thinker with the ability to dive into metrics, identify insights, and drive product improvements.
Startup-ready: comfortable with ambiguity, with a bias for learning, action, and innovation.
This position is open to all candidates.