Site Reliability Engineer (SRE)
Job description
a highly reliable fleet of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.
Service Optimization: Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.
CI/CD Management: Build and optimize CI/CD tools and processes to ensure efficient and reliable deployments.
Monitoring and Incident Response: Develop and manage monitoring, alerting, and incident response strategies to minimize downtime and enable rapid recovery.
Root Cause Analysis: Conduct comprehensive root cause analyses for system failures and implement long-term preventive measures.
Automation and Efficiency: Automate repetitive tasks and optimize system performance to improve operational efficiency.
On-Call Support: Participate in on-call coverage of weekday business hours and once-monthly weekend shifts.
Collaboration and Customer Engagement
Cross-Functional Teamwork: Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.
Reliability Advocacy: Champion SRE best practices and foster a culture of operational excellence across the organization.
Global Team Collaboration: Collaborate with a distributed team of engineers worldwide to provide round-the-clock support.
Customer Support: Interface with customers to address and resolve reported incidents, ensuring a seamless user experience.
Requirements
Qualifications and Skills
SRE Expertise: Proven experience as a Site Reliability Engineer or in a similar role, with at least five years supporting complex distributed systems.
Observability Tools: Experience with monitoring and debugging tools such as Prometheus, Vector, Grafana, Superset, or Kibana.
Cloud Platforms: Proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode).
Database Knowledge: Experience with SQL databases; familiarity with PostgreSQL is a plus but not required.
Programming Skills: Proficiency in a programming language such as Python, Go, or Rust.
Linux Expertise: Strong experience with Linux systems, including performance tuning and system-level troubleshooting.
Communication Skills: Excellent written and verbal communication skills, with the ability to convey technical concepts clearly to diverse audiences, including customers and cross-functional teams.
Employment type: Full-time
Location: Spain
Hiring organization: Hydrolix