Lead Associate Principal, Software Engineering: DevOps in Chicago

Energy Jobline
Chicago, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Chicago, United States of America

Tech stack

Java
Agile Methodologies
Amazon Web Services (AWS)
Applications Architecture
Application Configuration Access Protocols
Application Performance Management
Mobile Application Development
Cloud Computing
Cloud Computing Security
Continuous Integration
Data Stores
Software Design Patterns
DevOps
Distributed Systems
Enterprise Architecture Framework
Protocol Buffers
Identity and Access Management
JSON
Python
NoSQL
Systems Development Life Cycle
Role-Based Access Control
Reliability Engineering
Site Reliability Engineering Practices
Ansible
Prometheus
Scala
Security Software
Software Deployment
Software Engineering
SQL Databases
Data Streaming
Datadog
Data Logging
Google Cloud Platform
Performance Testing
Grafana
Git
TOGAF
Event Driven Architecture
Kubernetes
Information Technology
Avro
HashiCorp
Kafka
API Design
Terraform
Software Version Control
Dynatrace
PagerDuty
Jenkins
Artifactory
Go
Microservices

Job description

We are seeking an experienced Site Reliability Engineer / DevOps Infrastructure Lead to support a highly scalable, cloud-based technology platform. This individual will collaborate closely with product, infrastructure, operations, security, architecture, network, testing, and production control teams to gather technical requirements, improve platform reliability, and drive operational excellence.

  • Guide the implementation of CI/CD pipelines within a Kubernetes environment.
  • Review, configure, and support execution of Terraform and Ansible automation pipelines delivered by product teams.
  • Support the setup of shared infrastructure platforms, including multi-region Kubernetes and Kafka clusters.
  • Gather application deployment and sizing requirements to support expected workloads.
  • Define and enforce Service Level Objectives, Service Level Indicators, and Error Budgets in partnership with product teams.
  • Lead blameless post-mortems and drive resolution of action items to reduce repeat incidents.
  • Design and implement observability frameworks covering metrics, logs, and distributed tracing across platform services.
  • Identify and automate repetitive operational work to reduce toil and improve efficiency.
  • Partner with product teams to embed reliability requirements and non-functional requirements early in the software development lifecycle.
  • Monitor application performance and partner with product teams to tune systems.
  • Work with product team leads and technical practitioners to create deployment and reliability plans.
  • Collaborate with Enterprise Architecture and Renaissance architecture teams to define implementation architecture.
  • Promote application configuration standards that support a strong security posture.
  • Partner with access management and security teams to establish roles and permissions using least-privilege strategies.
  • Collaborate with integration and performance testing teams to leverage integrated release testing in the Release Acceptance environment.
  • Work with production control teams on monitoring, failover, logging, and alerting strategies.
  • Own and continuously improve incident response runbooks, on-call rotations, and escalation procedures.
  • Conduct capacity planning and load forecasting to proactively address scalability requirements.
  • Implement and validate infrastructure failover scenarios.
  • Partner with network teams on connectivity planning and issue resolution, including connectivity between on-premises environments and AWS.
  • Follow and support program-level Agile practices to improve collaboration and delivery.
  • Develop documentation for technical infrastructure, architecture, and reliability support.

Requirements

The ideal candidate will bring strong experience in AWS, Kubernetes, Kafka, CI/CD, Terraform, Ansible, observability, incident management, and large-scale distributed systems. This role requires a hands-on technical leader who can guide infrastructure implementation, promote reliability best practices, improve system observability, and support high-performance, multi-region cloud environments.

  • Bachelor's degree in Computer Science, a related technical field, or equivalent professional experience.
  • 7+ years of experience building large-scale, data-centric technology solutions.
  • 7+ years of recent experience working on a DevOps or SRE team, or serving as a product owner for a DevOps/SRE team.
  • Strong understanding of Kanban and/or Agile methodologies.
  • Familiarity with SRE principles as defined in Google's SRE practices, including error budgets, toil elimination, and the service reliability hierarchy.
  • Ability to succeed in a fast-paced environment with frequent changes.
  • Strong communication skills with the ability to engage both technical and non-technical audiences.
  • Self-starter who takes initiative to research, learn, and deliver solutions.
  • Collaborative team player with a humble, team-first mindset.

Required Technical Skills

  • Strong experience with AWS EC2, Kubernetes, Kafka, Jenkins, Terraform, Ansible, and HashiCorp Vault.
  • Experience with observability tools such as Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent platforms.
  • Experience with incident management and on-call tooling such as PagerDuty, OpsGenie, or similar tools.
  • Strong knowledge of microservices and streaming data-intensive application architecture.
  • Experience with application architecture, networking, and cloud security.
  • Experience setting up AWS platforms for high-performance requirements.
  • Broad experience with API-based development.
  • Experience using Git and Artifactory for source control and artifact management.
  • Strong knowledge of multi-AZ and multi-region failover architecture.
  • Familiarity with chaos engineering principles and tooling such as Chaos Monkey, Gremlin, or LitmusChaos.
  • Fluency with data formats and structures including JSON, Protobuf, and Avro.
  • Experience with SQL and NoSQL databases, as well as in-memory data stores.
  • Software development experience with Java, Python, Scala, and/or Golang.
  • Experience with at least two of the following:
      • Web or mobile application development
      • Unix/Linux environments
      • Event-driven systems
      • Transaction processing systems
      • Distributed and parallel systems
      • Large-scale software system development
      • Security software development
      • Public cloud platforms
  • Strong understanding of industry best practices, software design patterns, and architecture principles.
  • Knowledge of enterprise architecture frameworks such as TOGAF.
  • Ability to define and document architecture strategies, technical designs, and requirements across enterprise architecture domains.
  • Ability to define service-based and component-based architectures and visually communicate enterprise architecture concepts.

Certifications

  • AWS Certified Solutions Architect and/or AWS DevOps Engineer certification.
  • Kubernetes and/or Kafka certification.
  • Google Cloud Professional Site Reliability Engineer certification or equivalent SRE-focused certification.
  • Project or program management certification.

The ideal candidate is a senior technical professional with deep experience in cloud infrastructure, DevOps, SRE, and enterprise-scale distributed systems. This person should be comfortable partnering across multiple technical teams, driving reliability standards, improving observability, supporting incident response, and helping build resilient cloud platforms that can scale across multi-region environments.
