Site Reliability Engineer
Job description
We are looking for an experienced Site Reliability Engineer (SRE) to join our team in Barcelona. You will focus on ensuring the resilience and observability of our platform, applying SRE principles such as SLIs/SLOs, error budgets, and toil reduction. You will work with experts to keep our critical infrastructure stable, scalable, and highly available through automation and data analysis.
As an eDOer, you will have clear objectives, great challenges, and a clear overview of how your work contributes to the global company project and its customers.
We use cloud platforms such as Google Cloud Platform, with a focus on the transition toward software-defined infrastructure. As an SRE, you will use these tools to maximize uptime and system stability:
- Kubernetes
- ArgoCD
- Horizon WorkSpace (Virtual Desktops)
- Certificate Management
- GCVE
- GKE
- Docker
- Google Cloud, Azure, AWS
- F5 LoadBalancers
- Fortigate Firewall
- ZTNA/SASE
- Identity Management
- Datadog
- Grafana
- Kibana
- Elasticsearch
- Kafka
- Corporate Services
- Jira Service
- Security Services
- GitHub
- Jenkins
You will be responsible for:
- Leading incident response, triage, and troubleshooting in complex distributed systems.
- Designing and implementing automated remediation strategies to reduce operational toil.
- Managing comprehensive observability (monitoring, alerts, logs) to maintain ecosystem health.
- Facilitating blameless post-mortems and driving improvements based on incident learning.
- Acting as an internal consultant and evangelist, training multidisciplinary product teams to instrument their code effectively (using OpenTelemetry/APM) and to build their own custom dashboards.
We apply the GitOps paradigm and extreme automation to ensure stability, using:
- Terraform for Infrastructure as Code
- GitLab CI and Jenkins for CI/CD
- Docker for the build, ship, and run anywhere philosophy
- Kubernetes as the preferred orchestrator
- Helm for easily deployable applications
As an SRE, your mission will be to:
- Define and monitor SLIs/SLOs to align technical performance with business needs.
- Optimize infrastructure through code to ensure scalability and availability.
- Create and manage an efficient, automated product deployment lifecycle.
- Collaborate with other company teams to implement best practices for the development lifecycle.
- Implement continuous integration, delivery and deployment methodologies.
Your ultimate goal is to drive down Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR) by providing automated, correlated insights during high-severity incidents.
Requirements
Are you passionate about system stability and continuous improvement? Join us to lead the reliability of our global services.
We are looking for professionals with a data-driven approach to reliability and a passion for technical diagnosis:
- Solid experience in the incident management lifecycle and triage.
- Advanced debugging skills in distributed systems.
- Solid experience in Infrastructure as Code: Terraform, Terragrunt.
- Experience with scripting and configuration languages: Python, Go, YAML, ...
- Experience with automation tools: Terraform, ArgoCD, Crossplane...
- Experience with orchestrators: Kubernetes.
- Proactive and data-driven approach to reliability improvement.
- Experience with GCP.
- Methodical (Definition of Done, Ways of working).
- Always looking for innovation and improvement.
- A can-do attitude is a must.
- An open-to-learning attitude is a must.
- Desirable: Network Engineer experience.
- Desirable: Docker experience.
- Desirable: Experience creating MCPs and A2A.
- Experience with Applied AI Tools: Demonstrated comfort using practical AI tools such as GitHub Copilot, ChatGPT, or other AI-powered coding assistants.
- Experimentation Mindset: Curiosity and eagerness to explore, experiment with, and integrate emerging AI-driven solutions into different workflows.
- AI-Enhanced Problem Solving: Ability to effectively leverage AI tools to enhance productivity.
- Adaptability and Learning Agility: Enthusiastic about continuously learning and quickly adapting to new AI features and capabilities.
- Collaboration with AI: Experience or interest in collaborating closely with AI tools to complement traditional practices.
Benefits & conditions
Prime Plus membership and a competitive salary and benefits package, including flexible benefits, performance-based bonuses, a day off on your birthday, discounts and partnerships, relocation support, and premium equipment with role-based selection options. Through our equipment lifecycle program, you can take ownership of your device when it reaches its refresh cycle.
Continuous learning to fuel your growth and explore new horizons! Learn and grow with free Coursera access, soft skills workshops, tech training, leadership development, and more. Plus, enjoy a great onboarding program.
Growth opportunities to empower your career and unleash your potential! Personalised career paths and the eVOLVE Program will help you discover, grow, and thrive. Internal mobility opportunities let you pursue horizontal career changes and promotions.
Your well-being is our priority. Embrace freedom and flexibility! At eDO, we value flexibility, employee care, and transparency. We offer a hybrid home-office model focused on outcomes, so you can find the work-life balance that suits you best.
Work hard, party hard! We believe in having fun and connecting with colleagues! Join eDO for after-work events, padel tournaments, parties, and more. Create communities based on your passions, like sports and music. Come to work as you are, with no dress code, and enjoy free fruit, coffee, and tea at our offices.
Enjoy a dynamic and healthy environment! Be innovative, take risks, and share your ideas. Our diverse and open-minded teams support high performance, learning, and growth. You'll work in an environment with an Agile mindset and recognition at its core.