Senior Site Reliability Engineer - OpenShift Dedicated
Job description
The Red Hat OpenShift Dedicated Site Reliability Engineering (SRE) team is looking for a Senior Software Engineer to join our global team. In this role, you will work on Red Hat OpenShift, which is enterprise Kubernetes, as part of a team that develops and operates Red Hat OpenShift Dedicated, a public cloud service based on Red Hat OpenShift for large enterprise customers.

You'll play a key role in contributing to solutions that make Red Hat OpenShift Dedicated scalable, featureful, resilient, and secure while maintaining a balance between development and operations work. You'll contribute to the design and development of automation software to provision, upgrade, monitor, and heal a large global fleet of Red Hat OpenShift clusters deployed across multiple public clouds. You'll participate in a global on-call rotation and help lead incident management, root cause analysis, and continuous improvement activities, managing engineering efforts against a service-level agreement (SLA) and error budget.

OpenShift SRE is a sophisticated, global, fast-paced team inside the world's open source leader with constant opportunities to learn new skills and innovate new solutions to meet our customers' demands. As a Senior Software Engineer on this team, you will directly contribute to Red Hat's success in the rapidly growing Kubernetes as a Service (KaaS) market.

What You Will Do

* Design and write automation software to provision, upgrade, monitor, and heal a large global fleet of Red Hat OpenShift clusters deployed across multiple public clouds
* Identify single points of failure and other high-risk architecture issues; propose and implement more resilient resolutions
* Participate in the release cycles of our offerings, deploying code to integration, staging, and production environments, integrating with continuous integration (CI) and continuous delivery (CD) tooling, monitoring, and change management
* Perform software updates, peer code reviews, testing, and Common Vulnerabilities and Exposures (CVE) analysis; respond to security threats
* Interact with automated monitoring and healing infrastructure to ensure healthy environments
* Provide engineering support to Red Hat's global technical support team to resolve customer issues
* Create and maintain standard operating procedures (SOPs) for performing maintenance tasks, applying configuration changes, and remediating problems in our environment
* Participate in a global on-call rotation, including periodic weekend and holiday on-call duties

What You Will Bring

* 3+ years of software engineering experience using object-oriented languages; Golang and Python are preferred
* Experience managing Linux-based systems in a public cloud like Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure
* Commercial experience with enterprise system monitoring; knowledge of Prometheus is a plus
* Experience with container technology, Kubernetes, OpenShift, and configuration management tools (Red Hat Ansible Automation, Puppet, or Chef) is a big plus
* Demonstrated ability to quickly and accurately troubleshoot system issues
* Solid written and verbal communication skills in English

Equal Opportunity Policy (EEO)

Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, veteran status, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.