SRE Engineer
Job description
Location: Spain

At Finom, we're not just redefining the entrepreneurial experience - we're empowering our employees to make a real difference. Your work matters, and your impact extends far beyond product metrics. We nurture innovation and an inspiring work environment where bold ideas thrive. Maintaining our start-up spirit, we prioritize thorough research, swift implementation of solutions, and making sure that every effort benefits our users, employees, partners, and our business as a whole.

We are looking for a Senior SRE Engineer to drive the design, implementation, and evolution of our Kubernetes-based platform in a multi-cloud environment (GCP/AWS). At Finom, SREs are not just executors of tasks; they are the architects of reliability. This role requires strong ownership of reliability, scalability, and platform architecture for high-load, mission-critical systems operating 24/7.

Requirements

What You Will Be Doing
Lead the Platform Evolution: Design and operate our Kubernetes ecosystem (GKE, multi-cluster) with a focus on high availability and zero-downtime operations.
Build "Paved Roads": Own and evolve our PaaS strategy, using GitOps (ArgoCD) and CI/CD (GitLab) to empower domain teams to deploy independently.
Architect Reliability: Define and implement our observability strategy across metrics, logs, and tracing (Prometheus, VictoriaMetrics, OpenTelemetry).
Drive Infrastructure-as-Code: Lead the automation of our infrastructure using Terraform, ensuring all resources are standardized and version-controlled.
Own the Error Budget: Partner with engineering teams to establish and manage SLOs, SLAs, and incident management frameworks.
Disaster Recovery Mastery: Design and participate in regular DR drills, implementing blue/green and active/passive strategies across regions to ensure service continuity.
Innovate Operations: Proactively apply AI-driven approaches to improve operational efficiency and automate bottleneck detection.

Who You Are
Production K8s Mastery: Strong hands-on experience managing Kubernetes (GKE preferred) in high-load, multi-cluster production environments.
Cloud Infrastructure: Deep experience with GCP (AWS is a strong plus) and Terraform for large-scale infrastructure.
GitOps Expertise: Solid experience with ArgoCD, GitLab CI, and the "Infrastructure as Code" philosophy.
Observability Expert: Deep knowledge of the Prometheus/Grafana stack and of implementing tracing and logging at scale.
System Design: Proven ability to design highly available 24/7 systems with automated failover and rollback capabilities.
English Fluency: English level B2+ for effective cross-functional communication.

Nice-to-Haves
Compliance Knowledge: Understanding of banking-grade standards like PCI DSS, GDPR, or ISO 27001.
Distributed Systems: Experience with Kafka (Confluent), RabbitMQ, or managing high-load Redis and PostgreSQL clusters.
AI for Ops: Experience using AI tools to improve alerting, anomaly detection, or engineering efficiency.
Security-Minded: Experience with Vault for secret management and credential rotation.

Our Infrastructure Landscape
Primary Cloud: GCP (~90%)
Orchestration & Deploy: GKE, ArgoCD, GitLab CI
Automation: Terraform
Data & Messaging: PostgreSQL, Kafka, Redis, RabbitMQ
Observability: Prometheus, Grafana, VictoriaMetrics, OpenTelemetry, Cloud Logging
Security: Vault

What You Will Get In Return
Make a genuine impact on the product - Join our upward trajectory, and grow with us. We provide the resources and opportunities for c