Software Engineer

THE JUDGE GROUP, INC.

Charlotte, United States of America

2 days ago

Role details

Contract type

Temporary contract

Employment type

Full-time (> 32 hours)

Working hours

Shift work

Languages

English

Experience level

Senior

Compensation

$154K

Job location

Charlotte, United States of America

Tech stack

Artificial Intelligence

Systems Engineering

Build Automation

Bash

Linux

Middleware

Python

Windows Server

OpenShift

PowerShell

Reliability Engineering

Site Reliability Engineering Practices

Ansible

Runbook

Software Engineering

Systems Architecture

Scripting (Bash/Python/Go/Ruby)

Load Balancing

MTTR

Git Flow

Kubernetes

Splunk

Job description

We are seeking a senior, hands-on Software Engineer to support and evolve large-scale application and middleware platforms with a Site Reliability Engineering (SRE) mindset. This role focuses on production reliability, observability, and automation, shifting operations from reactive support to proactive, engineered reliability.

You will serve as an L2/L3 escalation point for mission-critical systems, owning incident response, problem management, and runbook-driven operations. You'll also build automation, infrastructure-as-code, and observability solutions that reduce toil, improve MTTR, and increase platform stability across VM-based and container-adjacent environments, including OpenShift (OCP).

This role supports a fast-growing platform portfolio (200+ applications, scaling rapidly) and requires strong architectural understanding, technical depth, and the ability to adapt across technologies.

What You'll Do

Act as a senior escalation point for L2/L3 production incidents, leading troubleshooting, recovery, and stabilization of application and middleware services.
Apply SRE practices daily: define and improve reliability signals, enhance alert quality, conduct blameless post-incident reviews, and prioritize systemic fixes over manual work.
Design and operate observability solutions (logs, metrics, traces, dashboards, and actionable alerts) to improve detection, diagnosis, and recovery times.
Build and maintain automation and infrastructure-as-code to support repeatable, audited, and resilient operations across VM and container-adjacent platforms.
Develop standardized operational automation (status checks, start/stop/restart patterns) to reduce dependency bottlenecks and enable safe self-service.
Implement intelligent automation (including AI-assisted operations where appropriate) with strong guardrails for accuracy, security, and compliance.
Monitor and remediate configuration drift; support automated compliance validation aligned with enterprise risk and change management.
Integrate infrastructure and operational automation into CI/CD pipelines for safer, consistent rollouts.
Support shared platform components such as ingress, load balancing integrations, and common middleware services.
Create and maintain runbooks, operational documentation, and validation procedures to ensure consistent execution and operational readiness.
Participate in an on-call rotation supporting 24x7 production operations.

Reduced incident frequency and faster recovery times through better observability and automation.
Measurable reduction in operational toil and manual intervention.
Reliable, auditable, and repeatable platform operations at scale.
Clear, maintainable documentation and runbooks that enable consistent execution.
Strong partnership with application, infrastructure, and security teams.

Requirements

5+ years of experience in software engineering, systems engineering, or production operations, or equivalent practical experience.
Hands-on experience supporting production applications or middleware in complex, highly available environments.
Strong troubleshooting skills with the ability to understand system architecture, capacity constraints, and failure modes.
Experience with automation or scripting (e.g., Python, Bash, PowerShell, or similar).
Experience working in Linux and/or Windows Server production environments.
Familiarity with Git-based workflows and infrastructure or configuration as code.
Ability to learn new technologies quickly and adapt across diverse platforms.

Experience supporting container-adjacent or Kubernetes-based platforms, including OpenShift (OCP).
Experience implementing SRE operating practices (reliability metrics, alert engineering, toil reduction).
Experience with observability platforms (e.g., Splunk, Elastic, or similar) beyond a single tool.
Experience with automation frameworks (Ansible or equivalent).
Experience integrating operations with CI/CD pipelines.
Exposure to responsible AI usage in operations (automation assistance, predictive signals, guarded remediation).
Strong communication skills and experience working in regulated or enterprise environments.