Site Reliability Engineer - IDM Team

Lunik Explorers at Work
31 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Málaga, Madrid or Sevilla

Tech stack

Microsoft Active Directory
Active Directory Federation Services
Amazon Web Services (AWS)
User Authentication
Authentication Protocols
Azure
Bash
Ubuntu (Operating System)
CentOS
Cloud Computing
Configuration Management
Computer Programming
Linux
Distributed Systems
DNS
Fault Tolerance
Monitoring of Systems
Icinga
Identity and Access Management
Python
Lightweight Directory Access Protocol (LDAP)
Microsoft Operating Systems
OAuth
Open Source Technology
OpenStack
Performance Tuning
PowerShell
Reliability Engineering
OpenID Connect
Ansible
Prometheus
Security Assertion Markup Language (SAML)
Data Logging
Scripting (Bash/Python/Go/Ruby)
Okta
Fluentd
Grafana
Reliability of Systems
Containerization
Kubernetes
Patch Management
Operational Systems
Puppet
Terraform
Docker
ELK
VMware

Job description

Work model: Hybrid (2 days in the office per week)
Job type: Full-time
Job location: Málaga, Madrid or Sevilla

As a Site Reliability Engineer (SRE) in the IDM team, you will contribute to the reliability, availability, and performance of mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, applying your technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.

The role requires prior experience as an SRE or in a similar function, as well as solid knowledge of the technologies and methodologies described below. A collaborative mindset, a focus on continuous improvement, and strong teamwork skills are key to success in this role. Candidates should ideally have a background in open-source systems and Linux, although knowledge of and experience with Microsoft systems will also count in their favor.

Responsibilities

Reliability & Availability
Contribute to maintaining and improving system reliability, uptime, and performance across production environments.
Support tracking of service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
Assist in improving incident response processes and implementing fault-tolerant systems.

Automation & Infrastructure
Develop and maintain automation tools for infrastructure management.
Collaborate with development teams to integrate reliability practices into CI/CD pipelines.
Contribute to improving the scalability and resilience of cloud infrastructure.

Monitoring & Observability
Implement and maintain monitoring systems and alerts to proactively identify issues.
Help define key performance metrics and support the implementation of logging and observability solutions.

Incident Management & Root Cause Analysis
Participate in incident response, assisting with root cause analysis and post-mortems.
Document findings and collaborate on improving procedures and playbooks.

Work closely with other SREs, software engineers, and cross-functional teams to ensure service reliability.
Contribute to continuous improvement initiatives to reduce toil and optimize resource utilization.

Requirements

Required Soft Skills

Problem-Solving & Critical Thinking
Ability to analyze and troubleshoot complex technical issues.
Continuous improvement mindset with innovative problem-solving skills.

Strong verbal and written communication skills to explain technical issues.
Ability to collaborate with multidisciplinary teams.

Adaptability & Flexibility
Comfortable working in dynamic environments with shifting priorities.
Open to new technologies and adaptable in improving processes.

Ownership & Accountability
Strong commitment to production system reliability.
Proactive in identifying and resolving issues.

Resilience under Pressure
Ability to remain calm and focused during critical incidents.

Required Technical Skills

Infrastructure Automation & Configuration Management
Experience with IaC tools such as Terraform, Ansible, AWX, or Puppet.
Knowledge of Docker and Kubernetes.
Familiarity with cloud platforms (AWS, GCP, or Azure) is not mandatory but will count in your favor.
Administration of hypervisors (VMware or OpenStack is a plus).
DNS management in Microsoft and open-source environments (BIND, CoreDNS, etc.).

Monitoring & Observability
Hands-on experience with tools like Prometheus, Grafana, and Icinga.
Knowledge of logging and tracing (ELK stack, Fluentd, OpenTelemetry).

Authentication & Identity Management
Familiarity with authentication protocols: LDAP, SAML, OAuth, OpenID Connect.
Experience with tools such as Active Directory, ADFS, FreeIPA, and Keycloak is a plus.
Knowledge of MFA solutions (PrivacyIDEA, Azure MFA, Duo, Okta, etc.).

Experience supporting incident management and documenting post-mortems.

Operating Systems
Administration of Ubuntu and CentOS. Experience with Microsoft operating systems will count in your favor but is not a requirement.
Knowledge of security, performance tuning, and patch management.

Microsoft Systems Management
Knowledge of Active Directory, GPOs, DNS, and replication.

Scripting & Programming
Proficiency in PowerShell, Bash, Python, and Ansible.
Ability to automate tasks and manage infrastructure as code.

Containerization & Orchestration
Experience with Docker, Podman, and Kubernetes.
Deployment and management of containerized applications.

Performance Tuning & Optimization
Ability to identify and resolve bottlenecks in distributed systems.
