It starts with you – an engineer driven to build modern, real-time data platforms that help teams move faster with trust. You care about great service, performance, and cost. You'll architect and ship a top-of-the-line open streaming data lake/lakehouse and data stack, turning massive threat signals into intuitive, self-serve data and fast retrieval for humans and AI agents – powering a unified foundation for AI-driven, mission-critical workflows across cloud and on-prem.
If you want to make a meaningful impact, join our company's mission, and build best-in-class data systems that move the world forward – this role is for you.
The Responsibilities
Build self-serve platform surfaces (APIs, specs, CLI/UI) for streaming and batch pipelines with correctness, safe replay/backfills, and CDC.
Run the open data lake/lakehouse across cloud and on-prem; enable schema evolution and time travel; tune partitioning and compaction to balance latency, freshness, and cost.
Provide serving and storage across real-time OLAP, OLTP, document engines, and vector databases.
Own the data layer for AI – trusted datasets for training and inference, feature and embedding storage, RAG-ready collections, and foundational building blocks that accelerate AI development across the organization.
Enable AI-native capabilities – support agentic pipelines, self-tuning processes, and secure sandboxing for model experimentation and deployment.
Make catalog, lineage, observability, and governance first-class – with clear ownership, freshness SLAs, and access controls.
Improve performance and cost by tuning runtimes and I/O, profiling bottlenecks, planning capacity, and keeping spend predictable.
Ship paved-road tooling – shared libraries, templates, CI/CD, IaC, and runbooks – while collaborating across AI, ML, Data Science, Engineering, Product, and DevOps. Own architecture, documentation, and operations end-to-end.
Requirements:
6+ years in software engineering, data engineering, platform engineering, or distributed systems, with hands-on experience building and operating data infrastructure at scale.
Streaming & ingestion – Technologies like Flink, Structured Streaming, Kafka, Debezium, Spark, dbt, Airflow/Dagster
Open data lake/lakehouse – Table formats like Iceberg, Delta, or Hudi; columnar formats; partitioning, compaction, schema evolution, time-travel
Serving & retrieval – OLAP engines like ClickHouse or Trino; vector databases like Milvus, Qdrant, or LanceDB; low-latency stores like Redis, ScyllaDB, or DynamoDB
Databases – OLTP systems like Postgres or MySQL; document/search engines like MongoDB or Elasticsearch; serialization with Avro/Protobuf; warehouse patterns
Platform & infra – Kubernetes, AWS, Terraform or similar IaC, CI/CD, observability, incident response
Performance & cost – JVM tuning, query optimization, capacity planning, compute/storage cost modeling
Engineering craft – Java/Scala/Python, testing, secure coding, AI coding tools like Cursor, Claude Code, or Copilot.
This position is open to all candidates.