Senior Site Reliability Engineer

Kraken

31 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Tech stack

Amazon Web Services (AWS)

Application Performance Management

Software as a Service

Relational Databases

Django

Python

PostgreSQL

Linux Distribution

RabbitMQ

Reliability Engineering

Software Deployment

Datadog

Data Logging

Kubernetes

Celery

Terraform

Docker

Job description

As a Site Reliability Engineer within the newly created 'Product Reliability' team, you'll be responsible for ensuring the availability, performance, and scalability of the products on our platform. Your proficiency in supporting products that serve millions of customers will ensure stability and high performance for our brands and clients. You'll keep up with best practices in building products for scale. Your communication skills and attention to detail will be indispensable as you pinpoint areas for enhancement, ensure optimal product performance, and continuously improve our reliability and efficiency.

Teach and support product teams on best practices for reliability, implementation patterns and effective usage of our existing platforms
Support product teams in improving the performance and availability of their systems
Be hands-on in code and infrastructure to help product teams with reliability improvements
Provide comprehensive feedback to the wider Platform group on improvements to be made to core infrastructure based on observations and first-hand experience in the code base
Support the build-out of proof-of-concept requirements in product teams as needed to evolve application deployment architecture to align with business growth as well as enhance scalability and system resilience
Collaborate with product teams to support the release of new features and services, ensuring adherence to reliability and performance standards
Guide product teams in designing systems for resilience and graceful failure under heavy load
Assist application teams with post-incident tasks and follow-ups, and contribute to the creation and review of post-mortem documentation
Analyse incident metrics to identify trends and potential improvements, communicating these insights to the product teams
Help solve interesting and difficult problems. There's a great opportunity for disruption in the global energy market

Requirements

Great communication skills, working effectively with developers, product managers and other business stakeholders to understand, design and deliver impactful projects and reliability improvements
Proficient using AWS; we use a lot of different AWS services and not just the standard few
Strong Python skills; particularly with Django, the Django ORM and Celery
Good expertise in several of the following areas:
PostgreSQL, or a similar RDBMS, particularly in Amazon RDS at scale
Docker and Kubernetes; we use Amazon EKS in production
Datadog, or a similar logging/monitoring tool
Message queues, event-driven asynchronous processing or similar technologies; we use RabbitMQ
Terraform, or a similar infrastructure-as-code tool
Experience with a Linux distribution
Previous experience working in small, highly autonomous teams

Previous experience as a Site Reliability Engineer
Experience working on SaaS platforms, including engaging product teams to ensure up-skilling and knowledge sharing across teams
Experience managing and supporting a large-scale, internet-facing service
Experience in responding to incidents and outages, writing technical incident reports and organising incident retrospectives
Experience working with very large relational databases
Experience in using service level objectives to improve application performance
A proactive, innovative mindset

About the company

Help us use technology to make a big green dent in the universe! Kraken powers some of the most innovative global developments in energy. We're a technology company focused on creating a smart, sustainable energy system. From optimising renewable generation and creating a more intelligent grid to enabling utilities to provide excellent customer experiences, our operating system for energy is transforming the industry around the world in a way that benefits everyone. It's a really exciting time in energy. Help us make a real impact on shaping a better, more sustainable future.

Our Global Platform Engineering Reliability group is responsible for architecting, developing, and maintaining the resilient and scalable infrastructure that powers and supports our platforms.

Kraken is a certified Great Place to Work in France, Germany, Spain, Japan and Australia. In the UK we are one of the Best Workplaces on Glassdoor with a score of 4.7. Check out our Welcome to the Jungle site (FR/EN) to learn more about our teams and culture.