Senior Backend Engineer, ML Infrastructure & Reliability
Job description
This is a backend software engineering role with end-to-end reliability ownership.
You will design, build, and operate a production Django backend that orchestrates ML inference workflows across internal services and third-party APIs. The core challenge is high-throughput orchestration: asynchronous execution, retries, idempotency, backpressure, failure handling, and system-level observability.
Infrastructure and Terraform are supporting tools. The primary output of this role is reliable production software.
You will work closely with ML engineers and backend teams to turn research systems into robust, production-grade services.
What You'll Do
- Design, build, and maintain Django services that coordinate and serve ML inference workflows.
- Own high-throughput asynchronous execution using queues, workers, and schedulers.
- Design safe orchestration patterns: idempotency, deduplication, retries, rate limiting, and backpressure.
- Build and operate systems with clear SLOs, error budgets, and on-call ownership.
- Lead incident response, write postmortems, and drive long-term reliability improvements.
- Implement end-to-end observability: metrics, logs, traces, dashboards, alerts, and runbooks.
- Improve the reliability of service integrations using timeouts, circuit breakers, fallbacks, and dependency health modeling.
- Collaborate with ML engineers to productionize training and inference pipelines.
- Own CI/CD and deployment workflows for backend and ML-facing services.
- Use Infrastructure as Code (Terraform) to support reliability, scalability, and repeatability.
- Optimize performance and cost across compute, storage, databases, and external dependencies.

- High ownership over core production systems that power ML inference
- Real reliability and scale problems, not maintenance work
- Close collaboration with backend and ML engineers
- Opportunity to define reliability standards as the platform scales
If you've owned Django services in production, built high-throughput async systems, and care deeply about reliability, this role should feel familiar.
Requirements
- Strong background as a Python backend engineer with ownership of production systems.
- Hands-on experience running Django in production (ORM usage, migrations, performance tuning, request lifecycle).
- Experience integrating with multiple internal and external services in reliability-critical paths.
- Proven experience building and operating asynchronous job systems (e.g., Celery, RQ, Arq, or equivalents).
- Hands-on experience with workflow or orchestration systems (Temporal, Prefect, Airflow, Step Functions).
- Solid understanding of distributed systems reliability: timeouts, retries, idempotency, rate limiting, backpressure, and failure isolation.
- Experience defining and operating SLOs/SLAs, including alerting and on-call participation.
- Strong Linux, networking, and debugging fundamentals.
- Working knowledge of cloud platforms (AWS and/or GCP).
- Practical experience using Infrastructure as Code (Terraform) as part of a broader system.

- Experience operating ML inference or training infrastructure at scale.
- Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo Workflows).
- Experience with distributed tracing and observability stacks (OpenTelemetry, Prometheus, Grafana, ELK/Loki).
- Experience operating Postgres and caches (e.g., Redis) in high-throughput systems.
- Startup or greenfield system ownership experience.