This is a research-first role focused on deeply understanding LLM internals to improve the security of AI agents. You'll design careful experiments on activations and interpretable features (e.g., probing, attribution and ablation/patching, representation-geometry analyses) to uncover the mechanisms behind jailbreaks, indirect prompt injection, and other attacks. You'll then translate those insights into signals that can be used for detection and analysis of model responses.
The field of LLM interpretability at scale is growing rapidly, with several major publications in recent months and major opportunities for innovation.
What You'll Do
Investigate model internals, including activation and feature analysis, unsupervised clustering, and discovery of directions in latent space. This may also require training specific model components to improve interpretability metrics.
Design security-grounded evaluations: curate datasets for different attack types, and evaluate the performance of white-box (model internals) methods against black-box (input/output only) baselines.
Publish and share: produce posts for our company Labs and open artifacts; when the work is strong, aim for tier-1 ML venues (NeurIPS, ICML, etc.) and security forums. Publish code and/or trained models when they offer novelty relevant to the community.
Build tools: several open source libraries exist (like Anthropic's attribution graphs infra), but research in the field moves quickly, so you will need to build and adapt tools to your own research directions. This also includes agents that automate research work and distill knowledge from the experiments you design.
Requirements:
Deep learning expertise with a track record of non-trivial research (industry or academia) in LLMs or other domains (e.g., CV, speech). We care that you've changed models or methods in meaningful ways (architecture/training/eval), not just used them.
Strong experimental design and scientific writing; comfort pre-registering hypotheses, testing causal claims, and proposing novel directions in a fast-changing field.
PhD or equivalent research experience in industry (5+ years in a leading research team). A publication record or a portfolio of high-impact open artifacts will make you stand out.
Familiarity with AI frameworks (e.g., HuggingFace Transformers, LangChain, scikit-learn, PyTorch). Experience with a production-grade codebase with several contributors is a bonus.
Experience in data analysis: visualization, exploration, cleanup.
Knowledge of GenAI tools such as LLM orchestration and integration packages, agents, and RAG systems is a bonus.
This position is open to all candidates.













