Site Reliability Engineer - IDM Team

Lunik Explorers at Work

31 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Málaga, Madrid or Sevilla

Tech stack

Microsoft Active Directory

Active Directory Federation Services

Amazon Web Services (AWS)

User Authentication

Authentication Protocols

Azure

Bash

Ubuntu (Operating System)

CentOS

Cloud Computing

Configuration Management

Computer Programming

Linux

Distributed Systems

DNS

Fault Tolerance

Monitoring of Systems

Icinga

Identity and Access Management

Python

Lightweight Directory Access Protocol (LDAP)

Microsoft Operating Systems

OAuth

Open Source Technology

OpenStack

Performance Tuning

PowerShell

Reliability Engineering

OpenID Connect

Ansible

Prometheus

Security Assertion Markup Language (SAML)

Data Logging

Scripting (Bash/Python/Go/Ruby)

Okta

Fluentd

Grafana

Reliability of Systems

Containerization

Kubernetes

Patch Management

Operational Systems

Puppet

Terraform

Docker

ELK

VMware

Job description

Work model: Hybrid (2 days in the office per week)

Job type: Full-time

Job location: Málaga, Madrid or Sevilla

As a Site Reliability Engineer (SRE) in the IDM team, you will be responsible for contributing to the reliability, availability, and performance of mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, applying your technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.

The role requires prior experience as an SRE or in similar functions, as well as solid knowledge of the technologies and methodologies described below. A collaborative mindset, focus on continuous improvement, and strong teamwork skills will be key to success in this role. Candidates should ideally have a background in open-source systems and Linux, although knowledge and experience with Microsoft systems will also be considered positively.

Responsibilities

Reliability & Availability

Contribute to maintaining and improving system reliability, uptime, and performance across production environments. Support tracking of service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs). Assist in improving incident response processes and implementing fault-tolerant systems.

Automation & Infrastructure

Develop and maintain automation tools for infrastructure management. Collaborate with development teams to integrate reliability practices into CI/CD pipelines. Contribute to improving scalability and resilience of cloud infrastructure.

Monitoring & Observability

Implement and maintain monitoring systems and alerts to proactively identify issues. Help define key performance metrics and support the implementation of logging and observability solutions.

Incident Management & Root Cause Analysis

Participate in incident response, assisting with root cause analysis and post-mortems.
Document findings and collaborate on improving procedures and playbooks.

Work closely with other SREs, software engineers, and cross-functional teams to ensure service reliability. Contribute to continuous improvement initiatives to reduce toil and optimize resource utilization.

Requirements

Required Soft Skills

Problem-Solving & Critical Thinking

Ability to analyze and troubleshoot complex technical issues. Continuous improvement mindset with innovative problem-solving skills.

Strong verbal and written communication skills to explain technical issues. Ability to collaborate with multidisciplinary teams.

Adaptability & Flexibility

Comfortable working in dynamic environments with shifting priorities. Open to new technologies and adaptable in improving processes.

Ownership & Accountability

Strong commitment to production system reliability. Proactive in identifying and resolving issues.

Resilience under Pressure

Ability to remain calm and focused during critical incidents.

Required Technical Skills

Infrastructure Automation & Configuration Management

Experience with IaC tools such as Terraform, Ansible, AWX, or Puppet. Knowledge of Docker and Kubernetes. Familiarity with cloud platforms (AWS, GCP, or Azure) is not mandatory but will be considered positively. Administration of hypervisors (VMware or OpenStack is a plus). DNS management in Microsoft and open-source environments (BIND, CoreDNS, etc.).

Monitoring & Observability

Hands-on experience with tools like Prometheus, Grafana, and Icinga. Knowledge of logging and tracing (ELK stack, Fluentd, OpenTelemetry).

Authentication & Identity Management

Familiarity with authentication protocols: LDAP, SAML, OAuth, OpenID Connect. Experience with tools such as Active Directory, ADFS, FreeIPA, or Keycloak is a plus. Knowledge of MFA solutions (PrivacyIDEA, Azure MFA, Duo, Okta, etc.). Experience supporting incident management and documenting post-mortems.

Operating Systems

Administration of Ubuntu and CentOS. Experience with Microsoft operating systems will be considered favorably but is not a requirement. Knowledge of security, performance tuning, and patch management.

Microsoft Systems Management

Knowledge of Active Directory, GPOs, DNS, and replication.

Scripting & Programming

Proficiency in PowerShell, Bash, Python, and Ansible. Ability to automate tasks and manage infrastructure as code.

Containerization & Orchestration

Experience with Docker, Podman, and Kubernetes. Deployment and management of containerized applications.

Performance Tuning & Optimization

Ability to identify and resolve bottlenecks in distributed systems.