Staff Engineer - Performance, Reliability & AI Automation
Job description
The team's primary goal is to increase Factorial's quality, performance, and scalability by continuously improving the way we build our product. We focus on strengthening our tools, maintaining foundational elements, and promoting best practices in close collaboration with the rest of the engineering organization.
Our mission is to equip product builders with robust, AI-enabled tools and practices to deliver with quality, confidence, and efficiency. We work across teams to improve software patterns, remove obstacles, complete unfinished migrations, reduce tech debt, and raise the overall engineering bar across the company.
Within this context, performance and reliability are an increasingly important area of focus. As Factorial continues to scale, we want to strengthen how we define service health, measure system behavior, standardize observability, validate systems under load, and apply modern AI workflows to help engineers interpret signals and act faster.
As a Staff Engineer at Factorial, you'll be part of a team of 200+ engineers. We look for people who are curious, proactive, and technically strong, who communicate effectively, and who enjoy solving complex technical problems, raising standards across teams, and building systems that help the whole organization perform at a higher level.
In this role, you will help shape how Factorial defines, measures, and improves performance and service health across the engineering organization. You will work closely with product teams and infrastructure teams to improve observability, load validation, performance visibility, and the engineering practices around these topics.
You will partner with senior engineering leaders across product, infrastructure, and DX to raise the company-wide standard for performance and reliability.
This is a cross-cutting staff role with broad impact. You will contribute through hands-on technical work and technical leadership, and by helping teams adopt stronger practices around SLOs, observability, performance optimization, and AI-assisted analysis workflows.
You'll work in a multicultural, English-speaking environment where your technical depth, systems thinking, and problem-solving skills will directly contribute to our mission of improving business management processes for companies worldwide.
Factorial serves 15,000+ active customers and 1M+ active users across business-critical workflows. Our current environment includes:
- a large Ruby on Rails backend with GraphQL APIs
- TypeScript frontend applications
- complex CI/CD workflows
- MySQL with replicas for OLTP workloads
- ClickHouse for analytical workloads
- Kafka for event-driven and asynchronous processing
- a multi-region cloud architecture
- meaningful TPS, IOPS, concurrency, and sustained operational load
Don't let unfamiliarity with every part of this stack hold you back. If you have worked in complex systems, know how to reason about performance and reliability at scale, and are excited to help teams build better software with better signals, we want to talk to you.
- Defining and evolving SLIs and SLOs for critical product journeys
- Improving and standardizing observability, dashboards, and service health visibility across teams
- Investigating bottlenecks and regressions across application, database, asynchronous, and system layers
- Driving improvements in latency, throughput, scalability, and reliability
- Building more structured load testing workflows for critical paths
- Helping teams validate system behavior under realistic traffic, concurrency, and tenant-scale conditions
- Analyzing capacity, saturation, and behavior under peak load and growth scenarios
- Defining practices and tooling to help prevent performance regressions before production
- Working closely with product and infrastructure teams to align on performance priorities and system behavior under load
- Designing AI-assisted workflows for tasks such as metric and alert interpretation, anomaly analysis, incident investigation, and generating performance insights
Requirements
- Strong hands-on experience improving performance, scalability, and reliability in complex software systems
- Experience defining or operating SLIs, SLOs, and service health frameworks
- Strong knowledge of observability practices and tools such as Datadog
- Experience investigating production bottlenecks across application, database, and distributed system layers
- Experience building or improving load testing, benchmarking, or performance validation workflows
- Experience diagnosing tail latency, throughput issues, and performance variability in production
- Broad experience working with cloud-based production systems
- Strong communication skills, including technical writing and cross-team alignment
- A proactive mindset and strong ownership mentality
- Significant experience building and operating production systems at scale
- Experience working in large-scale environments with meaningful traffic and operational complexity
- Experience with Ruby on Rails, MySQL, Kafka, GraphQL, ClickHouse, or equivalent technologies
- Previous experience in Performance Engineering or Reliability Engineering
- Interest in modern AI tools and practical use of agentic workflows in engineering