NOC Engineer
Job description
The NOC Engineer is a senior operational engineering role responsible for improving the availability, stability, and reliability of enterprise IT and OT systems across a multi-affiliate, regulated environment. This role leads complex incident response, resolves cross-domain production issues, and reduces repeat incidents through advanced troubleshooting, observability, automation, and disciplined operational execution.
The Network Operations Center plays a critical role in enterprise operations and supports the continued evolution of a broader command center model for IT and OT operations. This is an opportunity to join a talented team, help strengthen monitoring and operational capabilities, and contribute to meaningful enterprise reliability work. If you are a hands-on engineer who enjoys solving difficult technical problems, improving operations, and helping build something stronger, we encourage you to apply.
This role serves as a top-tier escalation point, supports advanced first- and second-level troubleshooting across Windows, Linux, networking, enterprise applications, and infrastructure platforms, and is expected to develop strong technical and operational documentation, including SOPs, runbooks, troubleshooting guides, incident reports, post-incident reviews, and operational summaries. This position may also participate in a rotational on-call schedule and may be required to provide after-hours support for major incidents, critical issues, maintenance activities, or operational escalations.
Incident Management, Escalation & Service Restoration
- Lead major and critical incidents end-to-end, including restoration strategy, technical coordination, stakeholder communications, and escalation management.
- Act as the senior escalation point for network outages, infrastructure failures, and service-impacting incidents, driving timely restoration with minimal supervision.
- Manage incident bridges with clear communication, accurate timelines, and disciplined coordination across infrastructure, application, security, platform, and vendor teams.
- Ensure post-incident reviews are complete, actionable, and tracked through closure with clear owners, due dates, and validation steps.
Advanced Technical Troubleshooting & Network Engineering
- Troubleshoot and restore complex production issues across Layer 2 / Layer 3 networking, servers, applications, identity services, virtualization, infrastructure platforms, and OT-related systems.
- Perform advanced hands-on troubleshooting across routers, switches, firewalls, Windows servers, Linux systems, VPNs, load balancers, and critical infrastructure dependencies, including work with Cisco and Juniper network products and their command-line interfaces (CLI).
- Apply strong working knowledge of TCP/IP, routing, switching, VLANs, DNS, DHCP, VPN technologies, firewalls, and enterprise network protocols to isolate failure domains and restore service quickly and accurately.
- Use logs, metrics, dashboards, packet captures, traces, and vendor / platform command-line tools to diagnose issues, identify root cause, restore service, and partner with engineering teams or vendors on permanent fixes.
Monitoring, Observability & Automation
- Work with enterprise monitoring and event management platforms to improve alert quality, service visibility, and operational awareness.
- Proactively monitor network and infrastructure health, investigate performance issues, and identify trends that may affect availability, latency, or service quality.
- Design and improve automation using APIs, scripting, coding, and operational tooling to reduce manual effort, improve consistency, and strengthen command center capabilities.
Operational Readiness, Documentation & Continuous Improvement
- Review new and changed services for operational readiness, including monitoring, alerting, dependencies, runbooks, support models, and escalation paths.
- Support high-risk changes, maintenance windows, and cutovers by validating outcomes, detecting regressions, and coordinating rollback when needed.
- Develop and maintain SOPs, procedures, runbooks, network diagrams, troubleshooting guides, incident reports, and operational summaries, while using incident trends and support metrics to drive continuous improvement.
Technical Escalation & Mentorship
- Serve as the senior technical escalation point for complex production issues, major incidents, and high-impact service degradations.
- Provide hands-on coaching and technical direction to junior engineers and NOC personnel during troubleshooting, restoration, and incident response activities.
- Partner with engineering and service owners to improve supportability, resilience, observability, and production readiness for critical services and platforms.
Requirements
Bachelor's degree in computer science, information technology, or a related field; or equivalent work experience. (Typically, four years of additional related, progressive work experience would be needed for candidates applying for this position who do not possess a bachelor's degree. A minimum of two years of additional directly related technical experience is required.)
Must have five or more years of experience.
Strong experience leading or supporting high-severity incidents in a production environment.
Strong hands-on troubleshooting across Windows, Linux, networking, and enterprise infrastructure.
Solid knowledge of TCP/IP, routing, switching, VLANs, DNS, DHCP, VPNs, firewalls, and load balancing.
Experience with enterprise monitoring, alerting, and ticketing platforms.
Experience using logs, dashboards, packet captures, traces, and network / system diagnostic tools.
Experience with Python, scripting, APIs, automation, or coding-based solutions.
Strong writing and communication skills, with the ability to create clear, accurate, and professional SOPs, runbooks, network diagrams, technical procedures, incident reports, post-incident reviews, and operational summaries.
Ability to work effectively in a 24/7 environment, remain calm during major incidents, and participate in a rotational on-call schedule as needed.
Preferred Qualifications
Experience in regulated, multi-affiliate, or OT / industrial environments.
CCNA, CCNP, or equivalent networking knowledge.
Experience with observability platforms, metrics, logs, traces, and alert design.
Familiarity with ITIL-based incident, problem, and change management practices.
Experience with cloud, hybrid infrastructure, or configuration / automation tooling.
Experience helping build or mature a command center or enterprise operations function.
Candidates will complete a short technical simulation involving real-world troubleshooting scenarios. The simulation may include troubleshooting scenarios across networking, Windows, Linux, incident response, automation, and operational decision-making.