Senior Research Data Engineer: MSR AI for Science
Job description
- Data integration for structure & dynamics: Build ingestion/curation pipelines for structural/biophysical data (mmCIF/PDB, EM maps/particles, binding/biophysics, spectroscopy); implement map/volume preprocessing (e.g., resolution filtering, normalization) and alignment to model inputs/outputs.
- Cryo-EM expertise: Operationalize end-to-end flows from raw image stacks/particles to 3D maps and model-ready tensors; interoperate with community formats (e.g., EMDB/EMPIAR, mmCIF) and link to sequences/annotations.
- Signal & information content: Design dataset diagnostics (e.g., mutual-information-like measures, effective sample size, SNR proxies) to quantify what data teach the model; build active-learning loops that maximize learning per euro of data collection time.
- Model-aware data services: Implement scalable, versioned data services and feature stores that feed training/evaluation; design loaders/augmentations optimized for throughput and correctness (GPU-aware).
- Training-at-scale engineering: Own distributed data pipelines and orchestration for large runs on Azure; profile and tune I/O, storage tiers, data locality, and caching; monitor cost, utilization, and failure modes.
- Quality, governance, and reproducibility: Codify schemas/ontologies, metadata contracts, unit/integration tests, and lineage; automate validation and data drift detection; maintain documentation and examples.
- Partner across disciplines: Work closely with ML researchers, structural biologists, and drug designers; translate experimental constraints into robust computational workflows; communicate clearly and proactively.
Requirements
We seek a highly motivated Senior RSDE to join our Biomolecular Emulator (BioEmu) team. The BioEmu project aims to model the dynamics and function of proteins: how they change shape, bind to each other, and bind small molecules. This approach will help us understand biological function and dysfunction at a structural level and lead to more effective and targeted drug discovery. Our BioEmu-1 model was published in Science (see our blog post for links to our open-source software and other resources and this explainer video).
Required:
- PhD or equivalent experience in Computer Science, Machine Learning, Applied Mathematics, Computational Biology, or a related field.
- Strong software engineering skills in Python (packaging, testing, CI), with systems thinking for data-intensive ML.
- Deep learning experience (PyTorch/JAX/TensorFlow) and solid foundations in linear algebra, probability, and statistics.
- Proven experience designing robust data pipelines for large-scale ML (HPC or cloud).
- Ability to reason about learning signal and to assess information content of real-world scientific datasets.
- Excellent collaboration and communication in interdisciplinary teams.
Preferred:
- Hands-on cryo-EM experience (e.g., map reconstruction, refinement, or pipeline tooling).
- CUDA or C++ for performance-critical components; experience with mixed precision and memory-efficient training.
- Experience integrating experimental data into ML models (e.g., constraints/priors from cryo-EM, binding assays, spectroscopy).
- Familiarity with MD data, structure prediction systems, or protein design workflows.
- Experience with cost-optimization for data collection and cloud utilization; clear track record of building reliable, maintainable research software at scale.
- Experience with structural biology or molecular biology data/techniques (e.g., cryo-EM, binding assays, spectroscopy, expression, sequencing).