Site Reliability Engineer

ScienceLogic

Reston, United States

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Reston, United States

Tech stack

Microsoft Windows

Agile Methodologies

Business Analytics Applications

JIRA

Bash

Software as a Service

Cloud Computing

Cloud Engineering

Computer Security

Data Structures

Data Stores

DevOps

Distributed Systems

Amazon DynamoDB

Perl

Groovy

Information Technology Operations

Python

PostgreSQL

Linux System Administration

MySQL

Networking Basics

PowerShell

Reliability Engineering

Ansible

Prometheus

Software Engineering

SQL Databases

Trello

Scripting (Bash/Python/Go/Ruby)

CloudFormation

Performance Monitor

CloudWatch

Terraform

New Relic (SaaS)

Job description

We're passionate about automation and solving complex business and technology challenges. Our team combines SRE, DevOps, Software Development and Information Security knowledge to help make Cloud operations agile and elastic within the boundaries of our security and governance framework. If you are well versed in cloud technologies, have an automation mindset, and are an ardent follower of the SRE discipline, our team would benefit from your skill set!

Be a key contributor on an Agile development team, collaboratively realizing business value through an iterative software development lifecycle
Build and execute the monitoring strategy for ScienceLogic SaaS infrastructure
Define, deploy, and maintain system and service monitors
Be the authority on monitoring technologies such as Prometheus, AWS CloudWatch, Scylla Manager, and New Relic, and use them to provide next-generation monitoring solutions for ScienceLogic SaaS
Employ advanced monitoring practices and technologies to detect and automatically resolve platform issues before they impact the customer's experience.
Participate in architecture and operations reviews
Identify operations SLAs and SLOs, and automate their measurement using SLIs
Triage incident response, document SOPs and runbooks, and train NOC team members
Participate in a shared on-call manager rotation for escalations during incidents and outages, occasionally during off-hours
Provide dashboarding and analytics solutions to internal teams based on requirements

Requirements

We're seeking an experienced Site Reliability Engineer who is passionate about building and owning modern monitoring and observability solutions at scale. You'll play a key role in designing proactive monitoring strategies, defining SLIs/SLOs, automating detection and remediation, and improving platform reliability across our SaaS environment.

The ideal candidate is a hands-on engineer with strong cloud, automation, and scripting experience, deep familiarity with tools like Prometheus, AWS CloudWatch, and New Relic, and a collaborative mindset. You enjoy solving complex problems, mentoring others, and continuously improving systems before issues impact customers.

8+ years of software development or site reliability engineering or equivalent experience
Skilled at problem solving, algorithms, and data structures
Building tools and scripting frameworks from scratch
Working with cloud automation tools such as CloudFormation, Terraform, CDK, and aws-cli
Scripting languages such as Python, Groovy, PowerShell, Bash, Perl, etc.
Configuration automation using Ansible or equivalent tools
Exposure to Windows and Linux administration
Project management tools like Jira, Trello
Prior experience with datastore technologies such as Postgres, MySQL, SQL, and DynamoDB is desirable
Familiarity with basic networking, security and cloud engineering concepts
Team player who is eager to help others succeed through mentoring and leading by example
Highly collaborative with effective written and verbal communication skills

Benefits & conditions

Comprehensive medical, dental and vision plans
401(k) plan with employer match
Flexible Paid Time Off (FTO) so that you can take the time that you need to re-energize
Volunteer Time Off (VTO) - take two days off per calendar year to volunteer with your preferred charitable organization
5-year Service Milestone Sabbatical
Paid parental leave
Generous employee referral bonus program
Pet insurance
HQ Office centrally located in Reston Town Center featuring a well-stocked kitchen with rotating snacks and beverages, and catered lunch on Thursdays
Regular virtual company-wide events, including cooking classes, yoga, meditation and more
Mentorship and professional development opportunities with experienced product marketing leaders
The opportunity to learn and develop from some of the best and brightest minds in the industry!

About the company

ScienceLogic is going through a product transformation and the Site Reliability team is at the forefront of it. We are responsible for the design, deployment, and maintenance of the Cloud Infrastructure used for running the company's revenue-generating go-forward SaaS product line. ScienceLogic's current SaaS product is a single-tenancy, highly available and secure platform used by many customers for achieving their AIOps objectives. Cloud Operations leads the SaaS portfolio from the front by onboarding new customers on their own dedicated instance of the product, performing capacity planning, platform maintenance, upgrades, security and triaging incident response for the SaaS platform.

ScienceLogic is a leader in IT Operations Management, providing modern IT operations with actionable insights to resolve and predict problems faster in a digital, ephemeral world. Its solution sees everything across cloud and distributed architectures, contextualizes data through relationship mapping, and acts on this insight through integration and automation.