Site Reliability Engineer

Cooper Standard
Northville, United States of America
15 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Northville, United States of America

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Audit Trail
Bash
Bioinformatics
Cloud Computing
Code Review
Databases
Computer Engineering
Continuous Integration
Data Retrieval
Linux
Disaster Recovery
DNS
GitHub
Monitoring of Systems
Identity and Access Management
Virtual Private Networks (VPN)
Python
PostgreSQL
Modbus
Message Queuing Telemetry Transport (MQTT)
Routing
Nginx
Public Key Infrastructure
Reliability Engineering
Ansible
Prometheus
OPC Unified Architecture
Runbook
Data Logging
Scripting (Bash/Python/Go/Ruby)
Transport Layer Security
Data Ingestion
Istio
System Availability
Grafana
GitLab CI
Kubernetes
Information Technology
Low Latency
InfluxDB
HashiCorp
CloudWatch
Terraform
Software Version Control

Job description

Liveline enables dramatic improvements in manufacturing performance through a unique application of artificial intelligence to provide real-time process control and predictive assistants for plant personnel. Our focus is on automating complex processes, not simply providing dashboards for managers and operators.

Our team combines experts in AI with world-class process engineers who can focus on the "last mile" with customers: extracting data from the process and implementing controls on the shop floor. We speak the language of AI and of industrial controllers.

Our hardware and software offerings are scalable and cost-effective whether customers have one production line or hundreds, delivering an ROI that's attractive to small and medium-sized enterprises.

We are passionate about democratizing the power of analytics and advanced automation for manufacturers of almost any size. Through our approach, producers can demystify complex processes and free up valuable technicians to focus on more advanced tasks instead of constantly monitoring and adjusting equipment parameters.

A Liveline Technologies SRE is responsible for the reliability, performance, observability, and operational excellence of Liveline's production services. This spans from the factory-floor edge systems to AWS cloud components. You will help build and run resilient infrastructure, automate repetitive work with code (Terraform, Bash, Python), implement monitoring and alerting (Prometheus/Grafana), and participate in incident response/on-call to ensure uptime for mission-critical manufacturing systems. You'll collaborate closely with controls engineers, data scientists, and software teams to safely deploy changes, define SLIs/SLOs, and continuously improve availability and latency for real-time process control.

  • Operate Production Systems: Maintain high availability, performance, and security of Liveline's production stack across AWS and plant/edge environments.

  • Observability & Monitoring: Stand up, tune, and maintain Prometheus/Grafana dashboards, alerts, recording rules, and runbooks. Implement logs/traces (e.g., OpenTelemetry) and actionable alerting.

  • Infrastructure as Code: Build and manage reproducible infrastructure with Terraform (VPC, IAM, EC2/EKS/ECS, RDS, S3, CloudWatch, CloudTrail). Apply version control, code reviews, and plan/apply workflows.

  • Automation & Tooling: Write Bash and Python scripts and small services to automate operational tasks, health checks, failover routines, backup/restore, and environment bootstrapping.

  • NOC / Incident Response: Participate in a follow-the-sun/on-call rotation; triage and resolve incidents, lead initial comms, and produce blameless postmortems with clear corrective actions.

  • SLIs/SLOs/Error Budgets: Define and instrument SLIs (availability, latency, error rate, freshness), set SLOs with stakeholders, and manage error budgets to guide release velocity and reliability tradeoffs.

  • Networking & Connectivity: Support secure, reliable connectivity between factory networks and cloud (site-to-site VPNs, routing, DNS, TLS, private subnets, security groups, network ACLs).

  • Databases & Storage: Operate and tune PostgreSQL/TimescaleDB, InfluxDB, or similar time-series/relational stores; manage backups, PITR, replication, partitioning, and performance baselining.

  • CI/CD & Release Engineering: Contribute to build/deploy pipelines (e.g., GitHub Actions/GitLab CI), implement canaries/blue-green strategies, and enforce change management and rollback plans.

  • Security & Compliance: Enforce least-privilege IAM, secret management (AWS Secrets Manager/SSM), encryption, artifact signing, and basic hardening for Linux and Kubernetes workloads.

  • Edge & OT Collaboration: Partner with process/controls engineers to ensure reliable data ingestion from PLCs/industrial gateways (e.g., OPC UA/Modbus), and safe deploys to plant edge nodes.

  • Cost, Capacity & Performance: Right-size compute/storage, set budgets/alerts, forecast capacity, and optimize resource utilization without compromising SLOs.

  • Documentation & Runbooks: Author and maintain runbooks, architecture diagrams, operational playbooks, and disaster recovery procedures.

Requirements

  • Bachelor's Degree in IT, Computer Science, or Computer Engineering (or equivalent experience).
  • 5+ years of experience in a corporate IT or startup setting.
  • Familiarity with containers (Docker) and orchestration (Kubernetes or ECS).
  • Experience running production workloads, participating in on-call, and writing postmortems.
  • Strong communication skills with the ability to explain tradeoffs to non-SRE stakeholders.
  • Intellectual curiosity, ownership mindset, and bias for automation.
  • Willingness and ability to travel to customer sites and plants, as necessary.

Nice to Have

  • Kubernetes (EKS), Helm, Kustomize.
  • Service Mesh/Ingress (Envoy, NGINX, ALB).
  • Logging/Tracing: OpenSearch/ELK, Loki, OpenTelemetry.
  • Config Management: Ansible.
  • Secrets & PKI: HashiCorp Vault, mTLS.
  • Edge/Industrial Protocols: OPC UA, Modbus, MQTT; experience with industrial gateways.
  • Compliance exposure (SOC 2, ISO 27001) and change management (ITIL).

Cooper Standard is an Equal Employment Opportunity employer. All qualified applicants/employees will receive consideration for employment without regard to that individual's age, race, color, religion or creed, national origin or ancestry, sex (including pregnancy), sexual orientation, gender, gender identity, physical or mental disability, veteran status, genetic information, ethnicity, citizenship, or any other characteristic protected by law. Please note that Cooper Standard maintains a list of preferred recruiting agencies and only members of our Talent Acquisition Department have the authority to engage and authorize recruiting services. We also do not seek or accept unsolicited resumes from third-party recruiters or staffing agencies. Any unsolicited resumes sent to Cooper Standard will be considered unencumbered and free from any charge whatsoever.
