Senior Monitoring & Observability Engineer, Los Angeles

O'Neil Digital Solutions, LLC
Los Angeles, United States of America
9 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $125K

Job location

Los Angeles, United States of America

Tech stack

Microsoft Windows
API
Amazon Web Services (AWS)
Data analysis
Azure
Bash
Border Gateway Protocol
Business Software
Program Optimization
Computer Networks
System Configuration
Data Deduplication
Dynamic Host Configuration Protocol
DevOps
DNS
Monitoring of Systems
Information Technology Operations
Virtual Private Networks (VPN)
Python
Linux System Administration
Log Analysis
Windows Server
Nagios
Network Monitoring
Open Shortest Path First
Paessler Router Traffic Grapher
Performance Tuning
PowerShell
Reliability Engineering
Ansible
Runbook
Security Information and Event Management
Systems Integration
TCP/IP
Virtual Local Area Networks
Zabbix
Datadog
Scripting (Bash/Python/Go/Ruby)
Computer Networking Systems
Google Cloud Platform
Cloud Platform System
System Availability
Grafana
Firewalls (Computer Science)
Information Technology
SolarWinds (Software)
ArcSight Event Correlation
Splunk
AppDynamics
Dynatrace

Job description

Data Analysis Inc. is the controlling entity of the O'Neil family of businesses, supporting companies across global equity markets, health care, financial services, digital news, insurance, and other industries. Our technology teams support a global footprint and build reliable, efficient systems that help our businesses serve customers with speed, accuracy, and integrity.

We are looking for people who enjoy solving complex technical problems, improving operational reliability, and partnering across teams to make systems more resilient.

About the Role

We are seeking a Senior Monitoring & Observability Engineer to design, implement, tune, and support enterprise monitoring and observability platforms across infrastructure, networks, cloud environments, and critical business applications.

This role is more than dashboard monitoring. The right candidate will have hands-on experience building and improving observability platforms, reducing alert noise, automating operational workflows, supporting incident response, and helping teams use data to improve reliability and performance.

You will serve as a senior escalation point for monitoring, alerting, incident response, root cause analysis, and platform optimization. You will work closely with Infrastructure, Security, DevOps, IT Operations, and application teams to improve visibility, system availability, and operational efficiency.

  • Compensation: $115K - $125K base pay, plus a 10% yearly bonus target
  • Location: 12655 Beatrice St., Los Angeles, CA 90066

What You'll Do

In this role, you will:

  • Monitor and support IT infrastructure, network systems, cloud environments, and business applications using enterprise monitoring and observability tools.
  • Design, configure, tune, and improve monitoring platforms, including dashboards, alerts, integrations, synthetic checks, log pipelines, APM configurations, and reporting.
  • Serve as a senior escalation point for TOC/NOC engineers and other technical teams during incidents and complex troubleshooting efforts.
  • Lead or support incident response, root cause analysis, escalation, and post-incident review processes.
  • Help improve alert quality by reducing noise, tuning thresholds, deduplicating events, improving correlation, and strengthening runbook-driven response.
  • Analyze logs, network traffic, events, metrics, traces, and performance data to identify trends, risks, outages, and improvement opportunities.
  • Build and maintain automation scripts and tooling to improve operational efficiency, alerting quality, response time, and platform reliability.
  • Document monitoring standards, troubleshooting procedures, dashboards, system configurations, and operational runbooks.
  • Partner with Infrastructure, Security, DevOps, application, vendor, and service provider teams to resolve platform and infrastructure issues.
  • Participate in a 24/7 on-call rotation and provide senior-level support during major incidents.
  • Mentor junior TOC/NOC engineers on monitoring tools, dashboards, alert handling, troubleshooting, and incident response practices.

Requirements

Qualified candidates should have:

  • 3+ years of experience in IT operations, network monitoring, systems administration, infrastructure operations, or a similar technical operations role.
  • Hands-on experience implementing, configuring, tuning, or supporting one or more enterprise monitoring or observability platforms, such as Datadog, Dynatrace, AppDynamics, Splunk, SolarWinds Orion, SolarWinds DPA, Nagios, PRTG, or Zabbix.
  • Strong understanding of monitoring and observability concepts, including logs, metrics, traces, alerts, dashboards, synthetic monitoring, APM, and incident workflows.
  • Experience supporting Windows and/or Linux environments.
  • Experience with at least one major cloud platform, such as AWS, Azure, or Google Cloud Platform.
  • Working knowledge of networking concepts and protocols, including TCP/IP, DNS, DHCP, VPN, VLANs, BGP, and OSPF.
  • Experience with scripting or automation using Python, PowerShell, Bash, Ansible, or similar tools.
  • Familiarity with ITIL practices related to incident, problem, and change management.
  • Working knowledge of cybersecurity best practices, firewalls, configurations, and SIEM tools.
  • Ability to troubleshoot complex infrastructure, application, network, and monitoring issues in a high-pressure environment.
  • Strong communication skills and the ability to translate technical monitoring data into clear operational actions.

Preferred Qualifications

The ideal candidate will also have:

  • Experience designing or maturing observability platforms in an enterprise environment.
  • Experience working in or with Site Reliability Engineering teams or SRE-aligned practices.
  • Experience with SLIs, SLOs, service health measurement, error budgets, and post-incident improvement.
  • Experience improving alert quality through threshold tuning, deduplication, event correlation, runbook automation, or workflow improvements.
  • Experience with APIs, configuration-as-code, CI/CD pipelines, and monitoring-as-code practices.
  • Experience mentoring junior engineers or serving as a technical lead during incidents.

Certifications

The following certifications are helpful but not required:

  • CompTIA A+, Network+, or Security+
  • Microsoft Fundamentals certifications, including Azure, Microsoft 365, or Windows Server
  • AWS Cloud Practitioner or Azure Fundamentals
  • ITIL Foundation
  • Vendor certifications in Datadog, Splunk, Dynatrace, AppDynamics, SolarWinds, or similar platforms

  • Monitoring / IT Operations Experience

Do you have at least 3 years of hands-on experience in IT operations, infrastructure operations, network monitoring, systems administration, or a similar technical operations role? Please answer yes or no and briefly describe the systems, infrastructure, applications, or environments you have supported.

  • Observability Platform Experience

Have you personally configured, implemented, tuned, or administered an enterprise monitoring or observability platform such as Datadog, Dynatrace, AppDynamics, Splunk, SolarWinds, Nagios, PRTG, Zabbix, or a similar tool? Please answer yes or no and include the specific tools you have used, along with what you configured or improved, such as dashboards, alerts, integrations, APM, synthetic monitoring, log pipelines, runbooks, or alert tuning.

  • Scripting / Automation Experience

Do you have hands-on scripting or automation experience using Python, PowerShell, Bash, Ansible, or a similar tool? Please answer yes or no and describe one script, workflow, or automation you built or maintained to improve monitoring, alerting, troubleshooting, incident response, or operational efficiency.

  • On-Call / Incident Response Availability

Are you willing and able to participate in a 24/7 on-call rotation and provide support during high-priority incidents, outages, or escalations? Please answer yes or no and briefly describe your experience with incident response, outage troubleshooting, escalation support, root cause analysis, or post-incident reviews.

Benefits & conditions

  • Referral program
  • Professional development assistance
  • Tuition reimbursement
  • Parental leave
  • 401(k)
  • Health insurance
  • Retirement plan

This role is performed primarily in an office environment. The position requires extended periods of sitting, working on a computer, using a telephone or collaboration tools, and entering data. The role may require lifting up to 10 pounds.

This position also requires participation in an on-call rotation and may require support during high-priority incidents outside of standard business hours.

Equal Opportunity Employer

Data Analysis Inc. is an equal opportunity employer. Employment decisions are based on merit, qualifications, competence, performance, and business needs. We do not discriminate on the basis of race, color, religion, marital status, age, national origin, ancestry, physical or mental disability, medical condition, pregnancy, genetic information, gender, sexual orientation, gender identity or expression, veteran status, or any other status protected under federal, state, or local law.

Pay: $115,000.00 - $125,000.00 per year

  • 401(k)
  • 401(k) matching
  • Dental insurance
  • Employee assistance program
  • Flexible spending account
  • Health insurance
  • Health savings account
  • Life insurance
  • Paid time off
  • Parental leave
  • Professional development assistance
  • Referral program
  • Retirement plan
  • Tuition reimbursement
  • Vision insurance

Application Question(s):

  • This is an on-site role. Do you currently live within commuting distance of Los Angeles, CA 90066?
  • Monitoring / IT Operations Experience
