Senior Backend Engineer, ML Infrastructure & Reliability

Graswald GmbH
13 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

API
Artificial Intelligence
Airflow
Amazon Web Services (AWS)
Cloud Computing
Databases
Continuous Integration
Data Deduplication
Software Debugging
Linux
Distributed Systems
Django
Systems Theories
Python
PostgreSQL
Performance Tuning
Redis
Prometheus
Software Engineering
Grafana
Backend
Machine Learning Operations
Celery
Terraform

Job description

This is a backend software engineering role with end-to-end reliability ownership.

You will design, build, and operate a Django production backend that orchestrates ML inference workflows across internal services and third-party APIs. The core challenge is high-throughput orchestration: asynchronous execution, retries, idempotency, backpressure, failure handling, and system-level observability.

Infrastructure and Terraform are supporting tools. The primary output of this role is reliable production software.

You will work closely with ML engineers and backend teams to turn research systems into robust, production-grade services.

What You'll Do

  • Design, build, and maintain Django services that coordinate and serve ML inference workflows.
  • Own high-throughput asynchronous execution using queues, workers, and schedulers.
  • Design safe orchestration patterns: idempotency, deduplication, retries, rate limiting, and backpressure.
  • Build and operate systems with clear SLOs, error budgets, and on-call ownership.
  • Lead incident response, write postmortems, and drive long-term reliability improvements.
  • Implement end-to-end observability: metrics, logs, traces, dashboards, alerts, and runbooks.
  • Improve reliability of service integrations using timeouts, circuit breakers, fallbacks, and dependency health modeling.
  • Collaborate with ML engineers to productionize training and inference pipelines.
  • Own CI/CD and deployment workflows for backend and ML-facing services.
  • Use Infrastructure as Code (Terraform) to support reliability, scalability, and repeatability.
  • Optimize performance and cost across compute, storage, databases, and external dependencies.

What We Offer

  • High ownership over core production systems that power ML inference
  • Real reliability and scale problems, not maintenance work
  • Close collaboration with backend and ML engineers
  • Opportunity to define reliability standards as the platform scales
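The orchestration patterns named above (idempotency, deduplication, retries, backpressure) can be sketched in a few lines of Python. This is an illustrative sketch only: the names (`submit_with_retries`, `call_provider`, the in-memory result cache) are hypothetical and not part of any actual Graswald codebase, where a durable store such as Redis or Postgres would back the idempotency cache instead of a dict.

```python
import hashlib
import time

# Hypothetical in-memory cache standing in for a durable store (e.g. Redis).
_seen_results = {}  # idempotency key -> cached result


def idempotency_key(payload: str) -> str:
    """Derive a stable key so retries and duplicate submissions deduplicate."""
    return hashlib.sha256(payload.encode()).hexdigest()


def submit_with_retries(payload: str, call_provider, max_attempts: int = 4,
                        base_delay: float = 0.01) -> str:
    """Run call_provider at most once per distinct payload, retrying
    transient failures with exponential backoff."""
    key = idempotency_key(payload)
    if key in _seen_results:
        # Duplicate submission: return the cached result, do not re-run.
        return _seen_results[key]
    for attempt in range(max_attempts):
        try:
            result = call_provider(payload)
            _seen_results[key] = result
            return result
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

In a real Celery-based system the retry/backoff piece would typically live in the task configuration rather than hand-rolled loops; the dedup-by-key idea carries over unchanged.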

If you've owned Django services in production, built high-throughput async systems, and care deeply about reliability, this role should feel familiar.

Requirements

  • Strong background as a Python backend engineer with ownership of production systems.
  • Hands-on experience running Django in production (ORM usage, migrations, performance tuning, request lifecycle).
  • Experience integrating with multiple internal and external services in reliability-critical paths.
  • Proven experience building and operating asynchronous job systems (e.g., Celery, RQ, Arq, or equivalents).
  • Hands-on experience with workflow or orchestration systems (Temporal, Prefect, Airflow, Step Functions).
  • Solid understanding of distributed systems reliability: timeouts, retries, idempotency, rate limiting, backpressure, and failure isolation.
  • Experience defining and operating SLOs/SLAs, including alerting and on-call participation.
  • Strong Linux, networking, and debugging fundamentals.
  • Working knowledge of cloud platforms (AWS and/or GCP).
  • Practical experience using Infrastructure as Code (Terraform) as part of a broader system.

Nice to Have

  • Experience operating ML inference or training infrastructure at scale.
  • Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo Workflows).
  • Experience with distributed tracing and observability stacks (OpenTelemetry, Prometheus, Grafana, ELK/Loki).
  • Experience operating Postgres and caches (e.g., Redis) in high-throughput systems.
  • Startup or greenfield system ownership experience.
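One of the failure-isolation patterns the requirements mention is the circuit breaker: after repeated failures, stop calling a struggling dependency and fail fast until a cooldown elapses. A minimal sketch, assuming a single-threaded caller; the class name and thresholds are illustrative, and production systems would typically reach for a maintained library instead.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker: after `threshold` consecutive failures
    the circuit opens and calls fail fast until `reset_after` seconds pass,
    isolating the failing dependency from the rest of the system."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Pairing this with per-call timeouts and a fallback response covers the timeout/circuit-breaker/fallback triad the role description lists.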

About the company

Graswald AI is transforming how the world's most iconic brands create content using AI. Backed by leading investors and powered by a world-class team, we're redefining the fashion content process - no physical studios, samples, or logistics - just cutting-edge AI and automation. We build AI systems that power large-scale content generation for global brands. Our core production application is a Django-based backend that coordinates high-throughput ML inference across many internal systems and external providers. As usage grows, reliability, orchestration, and operational correctness are critical to the business. This role exists to ensure those systems remain dependable, observable, and scalable as we grow.

At Graswald AI, we are building the AI operating system for fashion brands and retailers to drive efficiency, flexibility, and profitability. Today we specialise in generating eCommerce and campaign imagery and video. In just the past year, we've brought on 50 enterprise fashion brands, helping them reduce costs, accelerate timelines, and maintain the highest standards of visual quality. Backed by leading VCs and strategic investors - including Lakestar, Orendt Studios, and prominent angels - we are building the full software stack and operating system for enterprise fashion brands, enabling brands to create, scale, and connect with their customers like never before.

Apply for this position