Site Reliability Engineer

TechSpace Solutions Inc.
Cincinnati, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Cincinnati, United States of America

Tech stack

Java
API
Amazon Web Services (AWS)
Cloud Computing
Databases
Data Systems
Software Debugging
Distributed Systems
Payment Systems
Performance Tuning
Ruby on Rails
Reliability Engineering
Runbook
Software Engineering
SQL Databases
Datadog
Grafana
Information Technology
Splunk
New Relic (SaaS)
Service Stack
Microservices

Job description

  • The client is an enterprise-grade embedded finance platform that enables organizations to build, launch, and scale compliant banking, payments, and lending solutions.
  • We are seeking a Principal Software Engineer to join our Production Engineering team. This is a hands-on technical leadership role focused on operating, debugging, and improving highly distributed, mission-critical payment systems. The ideal candidate thrives in complex production environments and enjoys solving deep technical challenges across applications, infrastructure, and data systems.
  • Lead production triage and incident response across APIs, payment systems, distributed services, infrastructure, and databases.
  • Diagnose and resolve complex production issues spanning code, infrastructure, data, and third-party dependencies.
  • Partner with engineering teams to implement permanent fixes and improve platform reliability.
  • Design and implement monitoring, alerting, automation, and operational tooling.
  • Improve system observability, resiliency, and debuggability.
  • Work across a mixed technology stack including Ruby on Rails, Java, AWS, APIs, and SQL databases.
  • Develop runbooks and diagnostic workflows for operational excellence.
  • Mentor engineers and influence best practices across engineering and SRE teams.
  • Participate in architectural discussions to build highly reliable and scalable systems.

Requirements

  • 10+ years of experience in Software Engineering, Production Engineering, SRE, or Distributed Systems.
  • Strong experience debugging production issues end-to-end (application, infrastructure, data, and dependencies).

Hands-on experience with:

  • AWS and cloud-native environments
  • Ruby on Rails and/or Java
  • APIs, Microservices, and Distributed Systems
  • SQL and database troubleshooting
  • Observability tools such as Splunk, Datadog, and New Relic

Deep understanding of:

  • System behavior in production
  • Fault isolation and troubleshooting
  • Performance optimization and resiliency patterns

Also required:

  • Excellent communication and stakeholder management skills.
  • Ability to work effectively during incidents and high-pressure situations.

Preferred Qualifications:

  • Experience in Payments, FinTech, Banking, or other regulated environments.
  • Experience building and operating large-scale, high-availability platforms.
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
