Site Reliability Engineer
Job description
We're passionate about automation and solving complex business and technology challenges. Our team combines SRE, DevOps, software development, and information security knowledge to help make cloud operations agile and elastic within security and governance boundaries. If you are well versed in cloud technologies, have an automation mindset, and are an ardent follower of the SRE discipline, then our team will benefit from your skill set!
- Be a key contributor on an Agile development team, collaboratively realizing business value through an iterative software development lifecycle
- Build and execute the monitoring strategy for ScienceLogic SaaS infrastructure
- Define, deploy, and maintain system and service monitors
- Be the authority on monitoring technologies such as Prometheus, AWS CloudWatch, Scylla Manager, and New Relic to provide next-generation monitoring solutions for ScienceLogic SaaS
- Employ advanced monitoring practices and technologies to detect and automatically resolve platform issues before they impact the customer's experience
- Participate in architecture and operations reviews
- Identify and automate measurement of operational SLAs and SLOs using SLIs
- Triage incidents, document SOPs and runbooks, and train NOC team members
- Participate in shared on-call manager rotation for escalations during incidents and outages, occasionally during off hours
- Provide dashboarding and analytics solutions to internal teams based on their requirements
Requirements
We're seeking an experienced Site Reliability Engineer who is passionate about building and owning modern monitoring and observability solutions at scale. You'll play a key role in designing proactive monitoring strategies, defining SLIs/SLOs, automating detection and remediation, and improving platform reliability across our SaaS environment.
The ideal candidate is a hands-on engineer with strong cloud, automation, and scripting experience, deep familiarity with tools like Prometheus, AWS CloudWatch, and New Relic, and a collaborative mindset. You enjoy solving complex problems, mentoring others, and continuously improving systems before issues impact customers.
- 8+ years of software development or site reliability engineering or equivalent experience
- Skilled at problem solving, algorithms, and data structures
- Building tools and scripting frameworks from scratch
- Working with Cloud Automation tools like CloudFormation, Terraform, CDK, aws-cli
- Scripting languages like Python, Groovy, PowerShell, Bash, Perl, etc.
- Configuration automation using Ansible or equivalent tools
- Exposure to Windows and Linux administration
- Project management tools like Jira, Trello
- Prior experience with datastore technologies like Postgres, MySQL, SQL, and DynamoDB is desirable
- Familiarity with basic networking, security and cloud engineering concepts
- Team player who is eager to help others succeed through mentoring and leading by example
- Highly collaborative with effective written and verbal communication skills
Benefits & conditions
- Comprehensive medical, dental and vision plans
- 401(k) plan with employer match
- Flexible Paid Time Off (FTO) so that you can take the time that you need to re-energize
- Volunteer Time Off (VTO) - take two days off per calendar year to volunteer with your preferred charitable organization
- 5-year Service Milestone Sabbatical
- Paid parental leave
- Generous employee referral bonus program
- Pet insurance
- HQ Office centrally located in Reston Town Center featuring a well-stocked kitchen with rotating snacks and beverages, and catered lunch on Thursdays
- Regular virtual company-wide events, including cooking classes, yoga, meditation and more
- Mentorship and professional development opportunities with experienced product marketing leaders
- The opportunity to learn and develop from some of the best and brightest minds in the industry!