Site Reliability Engineer
Job description
- Ensure system reliability and performance across multi-cloud, multi-region platforms using first-principles thinking.
- Build and maintain comprehensive observability solutions (OpenTelemetry, New Relic, Grafana, Prometheus) that provide actionable insights into system health and performance.
- Automate infrastructure provisioning and deployments using Terraform and infrastructure-as-code practices
- Define, implement, and monitor SLOs/SLIs that align with business-critical SLAs and drive accountability for reliability.
- Manage and optimize Kubernetes clusters (EKS, GKE) with a focus on security hardening, performance, and operational excellence.
- Lead incident response efforts, troubleshoot complex system issues, restore service quickly, and conduct thorough root cause analysis
- Implement preventive measures and reliability improvements based on lessons learned from incidents and system behavior patterns.
- Partner with platform engineers and developers to embed reliability best practices into system architecture and delivery pipelines
- Proactively scale infrastructure capacity based on growth forecasts and traffic patterns.
- Contribute to architecture reviews with a deep focus on reliability, performance, and operational sustainability.
- Foster a culture of continuous improvement, systematic problem-solving, and operational excellence.
- Work in a small, high-impact team where your contributions directly shape system reliability and operational practices.
- Focus on strategic engineering rather than firefighting. We build monitoring, automation, and guardrails that prevent problems rather than just reacting to them.
- Engage in first-principles thinking and a continuous-improvement culture that values thoughtful design over quick fixes.
- Collaborate across a multi-cloud environment (AWS, Google Cloud Platform, Kubernetes) supporting diverse, mission-critical workloads.
- Partner with platform engineers, developers, and principal engineers who provide technical guidance and collaboration
- Own reliability for systems that directly impact business outcomes and customer experiences
- Work alongside platform engineers to ensure the platforms they build are operationally sound and reliable at scale.
At Red Ventures, reliability isn't just about keeping systems running; it's about engineering resilience through thoughtful observability, automation, and operational discipline. You'll work with passionate engineers who value systematic problem-solving, learn from failures, and build reliability into every layer of the stack.
We are committed to providing equal employment opportunities to qualified individuals with disabilities. This includes providing reasonable accommodation where appropriate. Should you require a reasonable accommodation to apply or participate in the job application or interview process, please contact
If you are based in California, we encourage you to read this important information for California residents linked here.
At Red Ventures, we believe in real human connection. That's why we do not hire anyone through text, social media, or email alone. As part of the hiring process, you should expect live conversations with RV teammates before any offer is made. Also, keep an eye on the sender: we only use official @redventures.com email addresses at the portfolio level or business-specific email addresses (e.g., @thepointsguy.com), not ones like "redventurescareer.com." We will never ask candidates to send money, buy equipment, or share financial account info during your journey with us. You can always find our open roles on redventures.com. If you receive a message that seems suspicious, please use redventures.com to verify the opportunity.
For more, the U.S. Federal Trade Commission has published helpful articles to help individuals learn more about protecting themselves from recruiter scams. If you think you've been targeted, feel free to report it to your local authorities. Stay safe out there!
Requirements
- 3-5 years of experience in SRE, DevOps, or cloud infrastructure engineering roles
- Experience leveraging AI/ML tools to enhance observability, including anomaly detection, alert noise reduction, and predictive incident identification
- Experience using generative AI or LLM-based tools to accelerate debugging, runbook creation, and operational knowledge sharing
- Strong hands-on experience with AWS and Google Cloud Platform
- Deep Kubernetes expertise (EKS, GKE), including security, networking, and operational best practices
- Proficiency with infrastructure-as-code using Terraform
- Experience building and maintaining observability systems (New Relic, Grafana, Prometheus, OpenTelemetry, or similar)
- Solid understanding of CI/CD pipelines and automated deployment strategies (Harness, Jenkins, GitLab CI, or similar)
- Strong scripting and automation skills (Python, Bash, Go, or similar languages)
- Proven track record of maintaining high-availability systems (99.9%+ uptime)
- Deep understanding of distributed systems, microservices architectures, and scalability patterns
- Experience with incident management, troubleshooting complex systems, and learning from failures
- Strong first-principles thinking: the ability to reason from fundamentals rather than relying solely on existing patterns
- Excellent written and verbal communication skills with the ability to explain complex technical concepts clearly
Bonus Points For:
- Cloud certifications (AWS Solutions Architect, Google Cloud Platform Professional Cloud Architect, or equivalent)
- Experience with data platform infrastructure (Databricks, Snowflake, or similar)
- Familiarity with security scanning and remediation tools (Wiz, Aqua, Prisma Cloud, or similar)
- Knowledge of compliance frameworks (SOC 2, PCI-DSS, HIPAA) and their operational implications
- Experience with chaos engineering, resilience testing, or systematic failure injection
- Database performance tuning and optimization expertise (PostgreSQL, MySQL, etc.)
- Experience with log aggregation and analytics platforms (ELK Stack, Splunk, or similar)
- Understanding of cloud security, network architecture, and multi-region deployment patterns
- Familiarity with DLP (Data Loss Prevention) solutions (Netskope, Zscaler, or similar)
- Background working with regulated industries or highly available consumer-facing applications
Benefits & conditions
This range reflects total cash compensation, which may include base salary only or base salary plus target bonus, depending on the role. Where eligible, equity may also be offered separately and not included below. Actual compensation varies based on location, experience, and qualifications.
- Total Cash Compensation Range: $100,000 - $145,000 per year
Additionally, the following benefits are provided by Red Ventures, subject to eligibility requirements.
- Health Insurance Coverage (medical, dental, and vision)
- Life Insurance
- Short and Long-Term Disability Insurance
- Flexible Spending Accounts
- Holiday Pay
- 401(k) with match
- Employee Assistance Program
- Paid Parental Bonding Benefit Program
- Flexible Paid Time Off (PTO): We believe time to rest and recharge is essential. That's why we offer a generous and flexible PTO policy. Full-time employees accrue 20 days of PTO per full calendar year, with an increase to 25 days after five years of service.
We offer competitive salaries and a comprehensive benefits program for full-time employees, including medical, dental, and vision coverage, paid time off, life insurance, disability coverage, an employee assistance program, a 401(k) plan, and a paid parental leave program.