Senior Site Reliability Engineer

Oracle
Austin, United States of America
13 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Reston, United States of America

Tech stack

Microsoft Active Directory
API
Artificial Intelligence
Systems Engineering
Backup Devices
Bash
Unix
Profiling
Continuous Integration
Data Integrity
Software Debugging
Linux
Distributed Data Store
Failover
Information Technology Operations
Internet Small Computer System Interface (iSCSI)
Python
Kerberos (Protocol)
Lightweight Directory Access Protocol (LDAP)
PostgreSQL
Uptime
Open Source Technology
Oracle Applications
Performance Tuning
Reliability Engineering
Cloud Services
Ansible
Prometheus
Software Engineering
Systems Architecture
Oracle Linux
Ceph
Data Storage Management
Enterprise Software Applications
Autoscaling
Delivery Pipeline
Grafana
Containerization
Kubernetes
Infrastructure Automation Frameworks
Storage Technologies
Information Technology
Deployment Automation
Patch Management
Terraform
Oracle Cloud Infrastructure
Docker
Jenkins
NVMe
VMware

Job description

At Oracle Cloud Infrastructure (OCI), we build the future of the cloud for Enterprises as a diverse team of fellow creators and inventors. We act with the speed and attitude of a start-up, with the scale and customer-focus of the leading enterprise software company in the world. Values are OCI's foundation and how we deliver excellence. We strive for equity, inclusion, and respect for all. We are committed to the greater good in our products and our actions. We are constantly learning and taking opportunities to grow our careers and ourselves. We challenge each other to stretch beyond our past to build our future. You are the builder here. You will be part of a team of intelligent, motivated, and diverse people and given the autonomy and support to do your best work. It is a dynamic and flexible workplace where you'll belong and be encouraged.

We are seeking a highly skilled Senior Site Reliability Engineer to join our IT team, focusing on supporting Linux infrastructure services and advancing our CI/CD pipelines through Ansible automation. This role is pivotal in ensuring the reliability, scalability, and efficiency of our systems as we transition to more automated infrastructure management practices.

We are seeking a skilled Site Reliability Engineer to design, build, operate, and automate services for traditional IT infrastructure. The ideal candidate will have expertise in Oracle Linux, Ansible, and software development, ensuring the reliability, scalability, and efficiency of our systems.

As a Senior Site Reliability Engineer, you will be responsible for:

  • System Design and Operation:
      • Design and manage distributed Unix-based systems, particularly Oracle Linux.
      • Implement auto-scaling and self-healing infrastructure to ensure uptime and durability.
      • Tune system internals, including kernel parameters, networking, and filesystems, for high performance.
      • Maintain timely OS patching and compliance posture across environments.
      • Integrate systems with enterprise identity services such as Active Directory, LDAP, and Kerberos.
  • Storage Management:
      • Design, implement, and manage distributed storage solutions using technologies like GlusterFS.
      • Ensure data reliability and availability through replication strategies and geo-replication.
      • Monitor and optimize storage performance, addressing bottlenecks and ensuring scalability.
      • Collaborate with development teams to understand storage requirements and provide appropriate solutions.
  • Automation and Infrastructure as Code:
      • Develop and maintain infrastructure automation using Ansible and Terraform.
      • Automate deployment pipelines, service configurations, and patch management.
      • Develop scripts and services in Python and Bash to enhance infrastructure delivery workflows.
      • Extend APIs and platform automation to drive efficiency and repeatability.
  • Observability and Incident Response:
      • Develop observability stacks using tools like Prometheus, Grafana, and other open-source telemetry tools.
      • Create dashboards and SLO/SLI-based alerts for real-time monitoring of production systems.
      • Participate in a global 24/7 on-call rotation, leading responses for high-severity incidents.
      • Conduct post-incident analysis (RCA) and drive remediations that improve long-term reliability.
  • Collaboration and Standards:
      • Partner with development teams to embed reliability in deployment pipelines.
      • Help define system architecture standards and maintain robust platform documentation.
      • Mentor engineers in Unix performance, observability, and debugging practices.
      • Champion a culture of automation, resilience, and continuous improvement.

Only Oracle brings together the data, infrastructure, applications, and expertise to power everything from industry innovations to life-saving care. And with AI embedded across our products and services, we help customers turn that promise into a better future for all. Discover your potential at a company leading the way in AI and cloud solutions that impact billions of lives.

Requirements

A great Senior Site Reliability Engineer will make all the difference for delivering quality solutions to our customers. Are you passionate about designing, developing, testing, and delivering infrastructure services? Do you thrive in a fast-paced environment, and want to be an integral part of a truly great team?

  • US Government TS/SCI clearance with polygraph.
  • U.S. citizenship required for Federal Government customer.
  • Education and Experience:
      • Bachelor's or Master's degree in Computer Science or related engineering field.
      • 5+ years of experience in software development/IT operations.
  • Technical Skills:
      • 5+ years in SRE, Infrastructure, or Systems Engineering roles managing production services.
      • Deep expertise with Unix/Linux systems, particularly Oracle Linux.
      • Experience in kernel tuning, performance profiling, and debugging complex system issues.
      • Proficiency in Python and Bash scripting.
      • Strong grasp of Infrastructure as Code tools like Ansible and Terraform.
      • Experience running hybrid infrastructure (on-premises) with VMware, containers, and Kubernetes.
      • Hands-on experience with monitoring, telemetry, and observability stacks.
      • Expertise in distributed storage systems, particularly GlusterFS.
      • Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF.
  • Soft Skills:
      • Excellent problem-solving skills; ability to multi-task and prioritize.
      • Ability to work independently and perform well under pressure.
      • Strong communication and collaboration skills with the ability to engage and influence.
      • Self-motivated, and able and willing to help where help is needed.
      • Able to build relationships, with cultural sensitivity, goal alignment, and learning agility.
      • Able to work effectively with geographically distributed teams.
  • Preferred Qualifications:
      • Experience with virtualization and container technologies (e.g., Docker, Kubernetes).
      • Experience with continuous integration platforms such as Jenkins.
      • Experience with monitoring and alerting technologies (e.g., Prometheus, Grafana).
      • Experience with PostgreSQL, including an understanding of replication, failover, and backups.
      • Familiarity with other distributed storage systems like Ceph or MinIO.

About the company

 Oracle offers integrated suites of applications plus secure, autonomous infrastructure in the Oracle Cloud. For more information about Oracle (NYSE: ORCL), please visit us at www.oracle.com.

Our mission is to help people see data in new ways, discover insights, unlock endless possibilities.
