Site Reliability Engineer
Job description
Principal Site Reliability Engineer (GCP)

Who We Are
Veson Nautical empowers the global maritime industry to navigate complexity on all sides of the trade. Veson's platform combines AI-driven workflows, trusted data, and seamless collaboration to deliver the insight and context needed for confident, competitive decision-making.

The Opportunity
As a Principal Site Reliability Engineer for Google Cloud Platform (GCP) at Veson Nautical, you will be responsible for designing, building, monitoring, and supporting the GCP infrastructure that underpins our rapidly growing SaaS platform and the services and products that depend upon it.
Our business and our platforms are experiencing rapid growth, which ensures we have no shortage of exciting and challenging projects to work on.

The Team
This is a hands-on technical role where you will take responsibility for our global GCP infrastructure while working closely with the embedded Site Reliability Engineers on our Shipfix product and with software engineers across the organization. There is the potential for this to become a larger team, so we are looking for someone with an interest in future leadership.
We are looking for an engineer who can think systematically and manage complex systems at scale through automation. The successful candidate will be comfortable participating in architectural discussions and working with other stakeholders in Cybersecurity, Global Infrastructure and Engineering.
This position will be based in our London office in Southwark, in a hybrid model where we expect a minimum of two days per week of in-person attendance.

Our Stack
Google Cloud Platform - primarily PaaS services (Bigtable, Cloud SQL, Dataflow, Datastore, GKE, GCS, KMS, Pub/Sub)
Email - ingestion through Microsoft Graph API automation and IMAP integrations
Elasticsearch - hosted with Kubernetes Operator
CI/CD - GitLab Pipelines and ArgoCD
Infrastructure-as-Code - Terraform, Terragrunt and Atlantis
Monitoring and Security - Cloud Armor Enterprise, Grafana/Grafana Tempo, OpenTelemetry, OpsGenie, Renovate, Sentry
AI Tools - Augment Code, GitHub Copilot, Claude Code

Key Responsibilities
Design, implement, and manage scalable, reliable, and secure cloud infrastructure on Google Cloud Platform.
Oversee the provisioning and management of containerized applications using Docker and Kubernetes.
Drive automation initiatives for infrastructure provisioning and configuration management using Terraform and other IaC tools.
Partner closely with development teams to ensure reliability, performance, and scalability of platforms.
Establish and maintain comprehensive monitoring, alerting, and observability practices.
Build processes and discipline to improve consistency, visibility, and documentation across infrastructure and operations.
Lead incident response efforts and ensure service uptime.
Develop automation, monitoring, and management solutions.
Prepare infrastructure for integration and future growth.

Skills / Experience Needed to Be Successful in This Role
Required
Previous experience working on a large-scale Software-as-a-Service (SaaS) platform which supported thousands of global users in a 24x7x365 environment.
Previous experience as part of a Site Reliability Engineering / Cloud Ops / Platform / Infrastructure team.
Operational experience with Google Cloud Platform, Kubernetes and Terraform.
Programming or scripting experience in Python, bash, or a similar language.
Experience with cloud cost management (budgeting, anomaly detection, cost analysis and reporting, etc.).

Highly Desirable
Previous experience architecting large Google Cloud infrastructure.
Previous experience deploying / operating / monitoring Elasticsearch clusters.
Experience leading a geographically distributed team.

Nice To Have Skills
KEDA / ArgoCD.
PostgreSQL and BigQuery database management experience.

We are focused on building a diverse and inclusive workforce. If you're excited about this role, but do not meet 100% of the qualifications