Senior Monitoring & Observability Engineer, Los Angeles

O'Neil Digital Solutions, LLC

Los Angeles, United States of America

9 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$ 125K

Job location

Los Angeles, United States of America

Tech stack

Microsoft Windows

API

Amazon Web Services (AWS)

Data analysis

Azure

Bash

Border Gateway Protocol

Business Software

Program Optimization

Computer Networks

System Configuration

Data Deduplication

Dynamic Host Configuration Protocol

DevOps

DNS

Monitoring of Systems

Information Technology Operations

Virtual Private Networks (VPN)

Python

Linux System Administration

Log Analysis

Windows Server

Nagios

Network Monitoring

Open Shortest Path First

Paessler Router Traffic Grapher

Performance Tuning

Powershell

Reliability Engineering

Ansible

Runbook

Security Information and Event Management

Systems Integration

TCP/IP

Virtual Local Area Networks

Zabbix

Datadog

Scripting (Bash/Python/Go/Ruby)

Computer Networking Systems

Google Cloud Platform

Cloud Platform System

System Availability

Grafana

Firewalls (Computer Science)

Information Technology

SolarWinds (Software)

ArcSight Event Correlation

Splunk

Appdynamics

Dynatrace

Job description

Data Analysis Inc. is the controlling entity of the O'Neil family of businesses, supporting companies across global equity markets, health care, financial services, digital news, insurance, and other industries. Our technology teams support a global footprint and build reliable, efficient systems that help our businesses serve customers with speed, accuracy, and integrity.

We are looking for people who enjoy solving complex technical problems, improving operational reliability, and partnering across teams to make systems more resilient.

About the Role

We are seeking a Senior Monitoring & Observability Engineer to design, implement, tune, and support enterprise monitoring and observability platforms across infrastructure, networks, cloud environments, and critical business applications.

This role is more than dashboard monitoring. The right candidate will have hands-on experience building and improving observability platforms, reducing alert noise, automating operational workflows, supporting incident response, and helping teams use data to improve reliability and performance.

You will serve as a senior escalation point for monitoring, alerting, incident response, root cause analysis, and platform optimization. You will work closely with Infrastructure, Security, DevOps, IT Operations, and application teams to improve visibility, system availability, and operational efficiency.

Compensation: $115K - $125K base pay, plus a 10% yearly bonus target
Location: 12655 Beatrice St., Los Angeles, CA 90066

What You'll Do

In this role, you will:

Monitor and support IT infrastructure, network systems, cloud environments, and business applications using enterprise monitoring and observability tools.
Design, configure, tune, and improve monitoring platforms, including dashboards, alerts, integrations, synthetic checks, log pipelines, APM configurations, and reporting.
Serve as a senior escalation point for TOC/NOC engineers and other technical teams during incidents and complex troubleshooting efforts.
Lead or support incident response, root cause analysis, escalation, and post-incident review processes.
Help improve alert quality by reducing noise, tuning thresholds, deduplicating events, improving correlation, and strengthening runbook-driven response.
Analyze logs, network traffic, events, metrics, traces, and performance data to identify trends, risks, outages, and improvement opportunities.
Build and maintain automation scripts and tooling to improve operational efficiency, alerting quality, response time, and platform reliability.
Document monitoring standards, troubleshooting procedures, dashboards, system configurations, and operational runbooks.
Partner with Infrastructure, Security, DevOps, application, vendor, and service provider teams to resolve platform and infrastructure issues.
Participate in a 24/7 on-call rotation and provide senior-level support during major incidents.
Mentor junior TOC/NOC engineers on monitoring tools, dashboards, alert handling, troubleshooting, and incident response practices.

Requirements

Qualified candidates should have:

3+ years of experience in IT operations, network monitoring, systems administration, infrastructure operations, or a similar technical operations role.
Hands-on experience implementing, configuring, tuning, or supporting one or more enterprise monitoring or observability platforms, such as Datadog, Dynatrace, AppDynamics, Splunk, SolarWinds Orion, SolarWinds DPA, Nagios, PRTG, or Zabbix.
Strong understanding of monitoring and observability concepts, including logs, metrics, traces, alerts, dashboards, synthetic monitoring, APM, and incident workflows.
Experience supporting Windows and/or Linux environments.
Experience with at least one major cloud platform, such as AWS, Azure, or Google Cloud Platform.
Working knowledge of networking concepts and protocols, including TCP/IP, DNS, DHCP, VPN, VLANs, BGP, and OSPF.
Experience with scripting or automation using Python, PowerShell, Bash, Ansible, or similar tools.
Familiarity with ITIL practices related to incident, problem, and change management.
Working knowledge of cybersecurity best practices, firewalls, configurations, and SIEM tools.
Ability to troubleshoot complex infrastructure, application, network, and monitoring issues in a high-pressure environment.
Strong communication skills and the ability to translate technical monitoring data into clear operational actions.

Preferred Qualifications

The ideal candidate will also have:

Experience designing or maturing observability platforms in an enterprise environment.
Experience working in or with Site Reliability Engineering teams or SRE-aligned practices.
Experience with SLIs, SLOs, service health measurement, error budgets, and post-incident improvement.
Experience improving alert quality through threshold tuning, deduplication, event correlation, runbook automation, or workflow improvements.
Experience with APIs, configuration-as-code, CI/CD pipelines, and monitoring-as-code practices.
Experience mentoring junior engineers or serving as a technical lead during incidents.

Certifications

The following certifications are helpful but not required:

CompTIA A+, Network+, or Security+
Microsoft Fundamentals certifications, including Azure, Microsoft 365, or Windows Server
AWS Cloud Practitioner or Azure Fundamentals
ITIL Foundation
Vendor certifications in Datadog, Splunk, Dynatrace, AppDynamics, SolarWinds, or similar platforms

Monitoring / IT Operations Experience

Do you have at least 3 years of hands-on experience in IT operations, infrastructure operations, network monitoring, systems administration, or a similar technical operations role? Please answer yes or no and briefly describe the systems, infrastructure, applications, or environments you have supported.

Observability Platform Experience

Have you personally configured, implemented, tuned, or administered an enterprise monitoring or observability platform such as Datadog, Dynatrace, AppDynamics, Splunk, SolarWinds, Nagios, PRTG, Zabbix, or a similar tool? Please answer yes or no and include the specific tools you have used, along with what you configured or improved, such as dashboards, alerts, integrations, APM, synthetic monitoring, log pipelines, runbooks, or alert tuning.

Scripting / Automation Experience

Do you have hands-on scripting or automation experience using Python, PowerShell, Bash, Ansible, or a similar tool? Please answer yes or no and describe one script, workflow, or automation you built or maintained to improve monitoring, alerting, troubleshooting, incident response, or operational efficiency.

On-Call / Incident Response Availability

Are you willing and able to participate in a 24/7 on-call rotation and provide support during high-priority incidents, outages, or escalations? Please answer yes or no and briefly describe your experience with incident response, outage troubleshooting, escalation support, root cause analysis, or post-incident reviews.

Benefits & conditions


Referral program
Professional development assistance
Tuition reimbursement
Parental leave
401(k)
Health insurance
Retirement plan

This role is performed primarily in an office environment. The position requires extended periods of sitting, working on a computer, using a telephone or collaboration tools, and entering data. The role may require lifting up to 10 pounds.

This position also requires participation in an on-call rotation and may require support during high-priority incidents outside of standard business hours.

Equal Opportunity Employer

Data Analysis Inc. is an equal opportunity employer. Employment decisions are based on merit, qualifications, competence, performance, and business needs. We do not discriminate on the basis of race, color, religion, marital status, age, national origin, ancestry, physical or mental disability, medical condition, pregnancy, genetic information, gender, sexual orientation, gender identity or expression, veteran status, or any other status protected under federal, state, or local law.

Pay: $115,000.00 - $125,000.00 per year

Benefits:

401(k)
401(k) matching
Dental insurance
Employee assistance program
Flexible spending account
Health insurance
Health savings account
Life insurance
Paid time off
Parental leave
Professional development assistance
Referral program
Retirement plan
Tuition reimbursement
Vision insurance

Application Question(s):

This is an on-site role. Do you currently live within commutable distance to Los Angeles, 90066?
