Staff Engineer - Performance, Reliability & AI Automation

Factorial

Barcelona, Spain

4 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Remote

Barcelona, Spain

Tech stack

Artificial Intelligence

Cloud Computing

Cloud Engineering

Databases

Continuous Integration

Distributed Systems

Load Testing

MySQL

Online Transaction Processing

Performance Tuning

Ruby on Rails

Reliability Engineering

Software Systems

TypeScript

Datadog

Performance Testing

Backend

Low Latency

Kafka

GraphQL

Vertica

Job description

The team's primary goal is to increase Factorial's quality, performance, and scalability by continuously improving the way we build our product. We focus on strengthening our tools, maintaining foundational elements, and promoting best practices in close collaboration with the rest of the engineering organization.

Our mission is to equip product builders with robust, AI-enabled tools and practices to deliver with quality, confidence, and efficiency. We work across teams to improve software patterns, remove obstacles, complete unfinished migrations, reduce tech debt, and raise the overall engineering bar across the company.

Within this context, performance and reliability are an increasingly important area of focus. As Factorial continues to scale, we want to strengthen how we define service health, measure system behavior, standardize observability, validate systems under load, and apply modern AI workflows to help engineers interpret signals and act faster.

As a Staff Engineer at Factorial, you'll be part of a team of 200+ engineers. We look for people who are curious, proactive, technically strong, and effective communicators: people who enjoy solving complex technical problems, raising standards across teams, and building systems that help the whole organization perform at a higher level.

In this role, you will help shape how Factorial defines, measures, and improves performance and service health across the engineering organization. You will work closely with product teams and infrastructure teams to improve observability, load validation, performance visibility, and the engineering practices around these topics.

You will partner with senior engineering leaders across product, infrastructure, and DX to raise the company-wide standard for performance and reliability.

This is a cross-cutting staff role with broad impact. You will contribute through hands-on technical work and technical leadership, and by helping teams adopt stronger practices around SLOs, observability, performance optimization, and AI-assisted analysis workflows.

You'll work in a multicultural, English-speaking environment where your technical depth, systems thinking, and problem-solving skills will directly contribute to our mission of improving business management processes for companies worldwide.

Factorial serves 15,000+ active customers and 1M+ active users across business-critical workflows. Our current environment includes:

a large Ruby on Rails backend with GraphQL APIs
TypeScript frontend applications
complex CI/CD workflows
MySQL with replicas for OLTP workloads
ClickHouse for analytical workloads
Kafka for event-driven and asynchronous processing
a multi-region cloud architecture
meaningful TPS, IOPS, concurrency, and sustained operational load

Don't let unfamiliarity with every part of this stack hold you back. If you have worked in complex systems, know how to reason about performance and reliability at scale, and are excited to help teams build better software with better signals, we want to talk to you.

Responsibilities

Defining and evolving SLIs and SLOs for critical product journeys
Improving and standardizing observability, dashboards, and service health visibility across teams
Investigating bottlenecks and regressions across application, database, asynchronous, and system layers
Driving improvements in latency, throughput, scalability, and reliability
Building more structured load testing workflows for critical paths
Helping teams validate system behavior under realistic traffic, concurrency, and tenant-scale conditions
Analyzing capacity, saturation, and behavior under peak load and growth scenarios
Defining practices and tooling to help prevent performance regressions before production
Working closely with product and infrastructure teams to align on performance priorities and system behavior under load
Designing AI-assisted workflows to support metric and alert interpretation, anomaly analysis, incident investigation, performance insight generation, and more

Requirements

Strong hands-on experience improving performance, scalability, and reliability in complex software systems
Experience defining or operating SLIs, SLOs, and service health frameworks
Strong knowledge of observability practices and tools such as Datadog
Experience investigating production bottlenecks across application, database, and distributed system layers
Experience building or improving load testing, benchmarking, or performance validation workflows
Experience diagnosing tail latency, throughput issues, and performance variability in production
Broad experience working with cloud-based production systems
Strong communication skills, including technical writing and cross-team alignment
A proactive mindset and strong ownership mentality
Significant experience building and operating production systems at scale
Experience working in large-scale environments with meaningful traffic and operational complexity
Experience with Ruby on Rails, MySQL, Kafka, GraphQL, ClickHouse, or equivalent technologies
Previous experience in Performance Engineering or Reliability Engineering
Interest in modern AI tools and practical use of agentic workflows in engineering
