Lead Associate Principal, Software Engineering: DevOps in Chicago

Energy Jobline

Chicago, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Chicago, United States of America

Tech stack

Java

Agile Methodologies

Amazon Web Services (AWS)

Applications Architecture

Application Configuration Access Protocols

Application Performance Management

Mobile Application Development

Cloud Computing

Cloud Computing Security

Continuous Integration

Data Stores

Software Design Patterns

DevOps

Distributed Systems

Enterprise Architecture Framework

Protocol Buffers

Identity and Access Management

JSON

Python

NoSQL

Systems Development Life Cycle

Role-Based Access Control

Reliability Engineering

Site Reliability Engineering Practices

Ansible

Prometheus

Scala

Security Software

Software Deployment

Software Engineering

SQL Databases

Data Streaming

Datadog

Data Logging

Google Cloud Platform

Performance Testing

Grafana

Git

TOGAF

Event Driven Architecture

Kubernetes

Information Technology

Avro

HashiCorp

Kafka

API Design

Terraform

Software Version Control

Dynatrace

PagerDuty

Jenkins

Artifactory

Microservices

Job description

We are seeking an experienced Site Reliability Engineer / DevOps Infrastructure Lead to support a highly scalable, cloud-based technology platform. This individual will collaborate closely with product, infrastructure, operations, security, architecture, network, testing, and production control teams to gather technical requirements, improve platform reliability, and drive operational excellence.

Guide the implementation of CI/CD pipelines within a Kubernetes environment.
Review, configure, and support execution of Terraform and Ansible automation pipelines delivered by product teams.
Support the setup of shared infrastructure platforms, including multi-region Kubernetes and Kafka clusters.
Gather application deployment and sizing requirements to support expected workloads.
Define and enforce Service Level Objectives, Service Level Indicators, and Error Budgets in partnership with product teams.
Lead blameless post-mortems and drive resolution of action items to reduce repeat incidents.
Design and implement observability frameworks covering metrics, logs, and distributed tracing across platform services.
Identify and automate repetitive operational work to reduce toil and improve efficiency.
Partner with product teams to embed reliability requirements and non-functional requirements early in the software development lifecycle.
Monitor application performance and partner with product teams to tune systems.
Work with product team leads and technical practitioners to create deployment and reliability plans.
Collaborate with Enterprise Architecture and Renaissance architecture teams to define implementation architecture.
Promote application configuration standards that support a strong security posture.
Partner with access management and security teams to establish roles and permissions using least-privilege strategies.
Collaborate with integration and performance testing teams to leverage integrated release testing in the Release Acceptance environment.
Work with production control teams on monitoring, failover, logging, and alerting strategies.
Own and continuously improve incident response runbooks, on-call rotations, and escalation procedures.
Conduct capacity planning and load forecasting to proactively address scalability requirements.
Implement and validate infrastructure failover scenarios.
Partner with network teams on connectivity planning and issue resolution, including connectivity between on-premises environments and AWS.
Follow and support program-level Agile practices to improve collaboration and delivery.
Develop documentation for technical infrastructure, architecture, and reliability support.

Requirements

The ideal candidate will bring strong experience in AWS, Kubernetes, Kafka, CI/CD, Terraform, Ansible, observability, incident management, and large-scale distributed systems. This role requires a hands-on technical leader who can guide infrastructure implementation, promote reliability best practices, improve system observability, and support high-performance, multi-region cloud environments.

Bachelor's degree in Computer Science, a related technical field, or equivalent professional experience.
7+ years of experience building large-scale, data-centric technology solutions.
7+ years of recent experience participating on a DevOps or SRE team, or serving as a product owner for a DevOps/SRE team.
Strong understanding of Kanban and/or Agile methodologies.
Familiarity with SRE principles as described in Google's SRE practices, including error budgets, toil elimination, and the service reliability hierarchy.
Ability to succeed in a fast-paced environment with frequent changes.
Strong communication skills with the ability to engage both technical and non-technical audiences.
Self-starter who takes initiative to research, learn, and deliver solutions.
Collaborative team player with a humble, team-first mindset.

Required Technical Skills

Strong experience with AWS EC2, Kubernetes, Kafka, Jenkins, Terraform, Ansible, and HashiCorp Vault.
Experience with observability tools such as Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent platforms.
Experience with incident management and on-call tooling such as PagerDuty, OpsGenie, or similar tools.
Strong knowledge of microservices and streaming data-intensive application architecture.
Experience with application architecture, networking, and cloud security.
Experience setting up AWS platforms for high-performance requirements.
Broad experience with API-based development.
Experience using Git and Artifactory for source control and artifact management.
Strong knowledge of multi-AZ and multi-region failover architecture.
Familiarity with chaos engineering principles and tooling such as Chaos Monkey, Gremlin, or LitmusChaos.
Fluency with data formats and structures including JSON, Protobuf, and Avro.
Experience with SQL and NoSQL databases, as well as in-memory data stores.
Software development experience with Java, Python, Scala, and/or Golang.
Experience with at least two of the following:

Web or mobile application development
Unix/Linux environments
Event-driven systems
Transaction processing systems
Distributed and parallel systems
Large-scale software system development
Security software development
Public cloud platforms

Strong understanding of industry best practices, software design patterns, and architecture principles.
Knowledge of enterprise architecture frameworks such as TOGAF.
Ability to define and document architecture strategies, technical designs, and requirements across enterprise architecture domains.
Ability to define service-based and component-based architectures and visually communicate enterprise architecture concepts.

Certifications

AWS Certified Solutions Architect and/or AWS DevOps Engineer certification.
Kubernetes and/or Kafka certification.
Google Cloud Professional Site Reliability Engineer certification or equivalent SRE-focused certification.
Project or program management certification.

The ideal candidate is a senior technical professional with deep experience in cloud infrastructure, DevOps, SRE, and enterprise-scale distributed systems. This person should be comfortable partnering across multiple technical teams, driving reliability standards, improving observability, supporting incident response, and helping build resilient cloud platforms that can scale across multi-region environments.
