Observability Infrastructure Engineer

Adyen N.V.
Amsterdam, Netherlands
6 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Tech stack

API
Data Stores
Software Debugging
Linux
File Systems
Distributed Systems
Elasticsearch
Python
Prometheus
Software Engineering
Data Streaming
Data Logging
Data Processing
Grafana
Software Troubleshooting
Kubernetes
Bare Metal
Build Tools
Operational Systems

Job description

We are looking for an experienced Observability Infrastructure Engineer to join our Platform Engineering organization. You will be part of the team responsible for building and running our observability pillars on premises and on Kubernetes. Our systems collect, process, and store the logs, metrics, and traces that allow hundreds of product teams to monitor their services in real time.

This is a role for a builder and a problem solver who enjoys deep technical troubleshooting across distributed systems and then turns recurring issues into automated, repeatable solutions. You will work in a large-scale environment where we manage petabytes of data and thousands of servers. We are currently in the middle of a major transformation focused on automating operations and enabling self-service for our users.

What you will do

  • Build the next generation of our platform: Design and implement the future architecture of our logging and metrics systems. You will play a key role in redesigning our infrastructure to support new global regions, ensuring data isolation and regulatory compliance in different geographies, and more.
  • Own infrastructure operations: You will take full ownership of our hybrid infrastructure, managing the lifecycle of over 1,500 servers across both bare-metal and Kubernetes environments.
  • Automate to reduce toil: You will write code in Go or Python to eliminate manual operational tasks. Your goal is to build self-healing systems that do not require manual intervention during the night. You will improve our CI pipelines to ensure that changes to our clusters are safe, predictable, and automated.
  • Optimize for scale and performance: You will dive deep into performance bottlenecks within our distributed tracing and logging pipelines. We deal with high-volume data streams that can overwhelm standard configurations. You will tune our Elasticsearch clusters, optimize Prometheus and VictoriaMetrics storage, and ensure our OpenTelemetry implementation can handle peak traffic without missing a beat.
  • Reliability and Engineering: You will participate in on-call rotations, but your primary focus will be engineering solutions that stop alerts from firing in the first place. You will help us upgrade our stack to the latest versions and ensure our platform remains secure and performant. You will improve the self-service experience by implementing automated guardrails and quota management to prevent noisy tenants from destabilizing the platform, while designing safer API access patterns for our users.

Requirements

  • 4+ years of experience in the observability domain or in a relevant platform/infrastructure domain, including software troubleshooting.
  • Observability Stack Expertise: You have hands-on experience operating core telemetry data stores at scale, e.g. Elasticsearch/OpenSearch/VictoriaLogs/ClickHouse for logging, Prometheus/VictoriaMetrics for metrics, and Grafana Tempo for distributed tracing.
  • Linux Experience: You understand the operating system at a kernel level and can debug complex networking, file system, and performance issues on both bare metal and virtualized hardware.
  • Production Kubernetes Experience: Proven hands-on experience operating and troubleshooting production workloads on Kubernetes (on-prem and/or cloud), including strong day-to-day use of kubectl and Kubernetes primitives (e.g. Namespaces, Pods, Deployments/StatefulSets, Services, Ingress, ConfigMaps/Secrets).
  • Software Engineering Mindset: You are proficient in Go or Python and do not just write scripts; you build tools and automation platforms that treat infrastructure as code.

Nice to have

  • Experience with large-scale, multi-tenant isolation and quota or cost governance approaches for telemetry platforms.
  • Familiarity with regulated environments where security, auditability, and data handling requirements shape platform design decisions.

About the company

Adyen provides payments, data, and financial products in a single solution for customers like Facebook, Uber, H&M, and Microsoft - making us the financial technology platform of choice. At Adyen, everything we do is engineered for ambition. For our teams, we create an environment with opportunities for our people to succeed, backed by the culture and support to ensure they are enabled to truly own their careers. The people of Adyen are motivated individuals who tackle unique technical challenges at scale and solve them as a team. Together, we deliver innovative and ethical solutions that help businesses achieve their ambitions faster.

Our unique approach is a product of our diverse perspectives. This diversity of backgrounds and cultures is essential in helping us maintain our momentum. Our business and technical challenges are unique, and we need as many different voices as possible to join us in solving them - voices like yours. No matter who you are or where you're from, we welcome you to be your true self at Adyen.
