Site Reliability Engineer
Job description
We're on a mission to democratize audio creation by building world-class audio infrastructure for our customers. As a Site Reliability Engineer, you'll play a key role in improving our platform's developer operations, including observability, monitoring, and overall reliability. You will be part of a cross-functional team dedicated to implementing robust DevOps practices and enhancing infrastructure and site reliability engineering (SRE). A customer-focused mindset is essential, as the team collaborates closely with stakeholders to ensure solutions meet business and user needs. In addition to a focus on observability, you will contribute hands-on by developing features, automating workflows, and supporting the deployment of advanced machine-learning models. Strong communication skills are vital for working effectively with engineers, product teams, and stakeholders across the organization.
Key Responsibilities
- Responding to incidents and helping stabilize our technology stack
- Investigating performance issues and runtime errors
- Writing and shipping features for our customers (we expect everyone to be an engineer as well!)
- Accelerating our CI/CD process
- Deploying machine learning models onto GPU-based systems on AWS
- Supporting engineers on high-value deployments and teaching them CI/CD best practices
- Identifying and resolving security issues
- Automating tests and supporting our engineers in building great software
Requirements
- Strong experience with monitoring/observability tools (Grafana, Prometheus, or similar).
- Proficiency in Python, Docker, Kubernetes, and CI/CD pipelines.
- Hands-on cloud experience (AWS or similar).
- A passion for designing and implementing scalable observability solutions.
- Minimum 3 years' experience working in a backend-related role.
Preferred Qualifications
- Security expertise or interest in vulnerability assessments.