Senior Machine Learning Site Reliability Engineer

Prisma
Municipality of Madrid, Spain
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Municipality of Madrid, Spain

Tech stack

Amazon Web Services (AWS)
Systems Engineering
Cloud Engineering
Software Quality
Database Storage Structures
Software Debugging
Distributed Systems
DNS
HP Systems Insight Manager
Python
PostgreSQL
Machine Learning
RabbitMQ
Redis
Reliability Engineering
Software Engineering
Management of Software Versions
Software Vulnerability Management
Datadog
Pulumi
MTTR
Reliability of Systems
PySpark
Kubernetes
Infrastructure Automation Frameworks
Cloudflare
Kafka
Machine Learning Operations
Terraform
Elixir
Microservices

Job description

Since 2015, we've been using our love of data and tech to rethink motor insurance and bring drivers a great experience at a great price. Our story began in Italy, where we've quickly become the number one online motor insurance provider. In fact, we're trusted by over 4 million drivers. And now we're expanding to help millions more drivers in the UK and Spain.

To help fuel that growth, we need a Senior Machine Learning Site Reliability Engineer to join our Infrastructure team. This team is the beating heart of Prima. You'll be joining over 300 engineers across software development, infrastructure, operations and security. Fueled by curiosity, experimentation and collaboration, you'll help deliver scalable, impactful solutions that shape the future of insurance. Excited to make an impact? Here are the details:

  • Hands-on Reliability & System Engineering: Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs, working directly on production infrastructure, and collaborating closely with software engineers on system design and reliability improvements
  • Automation, Operations & Incident Response: Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR, participate in and lead incident response, and drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
  • Performance, Capacity & Security: Continuously analyze and optimize system performance and cost, provide data, insights, and recommendations to inform capacity planning, and support security best practices through hands-on vulnerability remediation and threat mitigation

Requirements

  • SRE & Cloud Engineering: Hands-on experience with SRE practices in production, strong AWS expertise, Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
  • Automation, Software Engineering and MLOps: Demonstrate strong software engineering fundamentals with an emphasis on code quality and maintainability. This includes solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging), hands-on experience with PySpark, and a consistent focus on writing clean, well-structured, and maintainable code. Familiarity with MLOps practices such as model registries, model versioning, retraining workflows, and end-to-end deployment lifecycles is also expected
  • Reliability, Data & Operations: Lead incident response and RCAs, improve system reliability, and engage stakeholders to propose solutions, share learnings, and mentor others
  • Regulated Environments & Security: Experience operating in highly regulated industries (e.g. Insurance, Banking, Healthcare), managing sensitive data, and supporting secure networking setups, including exposure to security technologies such as Cloudflare
  • Distributed Systems & Microservices: Strong understanding of microservices architectures, their principles and trade-offs, with the ability to troubleshoot and maintain distributed systems and supporting technologies (RabbitMQ, Kafka, PostgreSQL, Redis)
  • Observability & Platform Operations: Hands-on experience with Datadog for platform and application monitoring, performance optimisation, and solid fundamentals in database structures and operational troubleshooting, with exposure to systems built in languages such as Rust and Elixir

At Prima, we celebrate uniqueness. If you don't meet every requirement but are passionate about this role, we still want to hear from you. Innovation thrives on diverse perspectives.

About the company

At Prisma, we are building the data layer for modern applications. If you are fascinated by the leading-edge architecture and technology used in today’s data-intensive, highly scalable software systems, with distributed graph data on a massive scale, but you want the energy, challenges, and freedom that come with working in a small startup, then a job at Prisma might be for you.

With funding from top-tier investors Amplify Partners and Kleiner Perkins, we are a small, distributed team working on making the advanced data infrastructure developed by large tech companies accessible to all application developers around the world. Our hard work is paying off, with adoption and implementation of Prisma by some of the most successful and interesting companies out there today, and the fun is just beginning!
