STACKIT CLOUD SITE RELIABILITY ENGINEER STORAGE

Barcelona, Spain

3 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English, German

Job location

Barcelona, Spain

Tech stack

API

Bash

Cloud Computing

File Systems

Elasticsearch

Monitoring of Systems

Python

NetApp Applications

Reliability Engineering

Ansible

Prometheus

Ceph

Data Logging

System Availability

Grafana

Kubernetes

Storage Technologies

REST

Job description

* Automation: You automate the provisioning and operating processes in the storage environment, driven by your own aspiration to become a little better every day and to continuously optimize our products.
* Architecture: With your team, you support a robust and efficient storage architecture, because it is important to you to build a stable, reliable long-term solution that our customers like to use.
* End-to-end responsibility: Identifying with the products we provide to our customers is very important to us. That is why we actively practice end-to-end responsibility and receive support from many internal STACKIT service teams to refine our services.
* Performance and capacity planning: You analyze and optimize the performance of our existing systems with regard to future scaling of the landscape. This also includes forward-looking capacity planning.
* Incident and post-mortem analysis: You handle (major) incidents involving storage as part of STACKIT's incident & problem management process, with the aim of deriving mitigating measures for the future and then successfully implementing them.

Requirements

Experience and skills you will need:

* You want to make a big difference and play a key role in shaping the solution with state-of-the-art cloud technologies.
* You have experience with one or more storage products (e.g. NetApp, Cohesity, Pure, Ceph) in the area of block, object, backup or file storage, and have good knowledge of cloud environments and their architectures.
* You are an expert in the operation of storage infrastructure (e.g. solution scenarios, provisioning, scaling, migration, incident response) and its automation (e.g. using Golang / Python, Bash, Ansible).
* You are already familiar with containerized system landscapes in the storage environment (e.g. k8s).
* You have experience in monitoring, alerting and logging to ensure complete system monitoring (e.g. Prometheus, Grafana, Elasticsearch).
* Ideally, you are already working with APIs and developing them further (e.g. REST APIs with Golang and Python).
* You enjoy the challenges of operating storage systems (e.g. protocols, troubleshooting, performance analyses, high availability, lifecycle).
* You have passion and enthusiasm for new technologies and topics related to various storage systems.
* You like to be part of a motivated team that always strives for improvement and continuously develops itself (and the products).
* Your excellent communication skills in English (and optionally in German) form the basis for successful cooperation in international, agile teams.

About the company

Make an amazing climb in your career in an international team of experts. Our company provides technology services for the entire Schwarz Group in more than 30 countries across Europe and the US. Our vision is to be the leading ecosystem for a better life. We built the European sovereign cloud STACKIT. With XM Cyber we set new standards in defending against cybercrime. We run AI better than anyone. With us you will find a variety of opportunities to grow and do your best at your calling: IT. We exist to improve life with our products and services, for today's generation and future generations. We act future proof!

The impact you will create:

* Stability & reliability: You are responsible for maintaining and optimizing the stability and availability of our highly available, resilient storage infrastructure (block, object, backup and file storage). You ensure this through proactive monitoring, resolving incidents on your own responsibility and preventing their recurrence in the future.