Site Reliability Engineer - SRE
Job description
Our organization is seeking a motivated Site Reliability Engineer (SRE) to join our dynamic Advisor Platform Engineering team. This role focuses on safeguarding the availability, performance, and scalability of our mission-critical, Azure-hosted platform. As an individual contributor, you will apply your expertise in cloud infrastructure, automation, and observability to maintain and enhance our systems, collaborating closely with Agile development teams to embed reliability principles throughout the application lifecycle.
- Monitor, maintain, and optimize Azure infrastructure, ensuring the health, performance, and availability of IaaS, PaaS, and SaaS components.
- Enhance observability by defining, measuring, and refining Service Level Indicators (SLIs) and Service Level Objectives (SLOs) using Azure Monitor, Application Insights, and Log Analytics (KQL).
- Develop automation and tooling using scripting languages (PowerShell, Bash, Python) and potentially C#/.NET to eliminate manual tasks and improve efficiency.
- Participate in a 24/7 on-call rotation, contributing to incident triage, mitigation, root cause analysis (RCA), and the implementation of preventive actions.
- Collaborate with software development, QA, and other technology teams to ensure reliability, scalability, and performance requirements are met.
- Contribute to capacity planning, load testing, and performance tuning initiatives across a .NET and React microservice architecture.
- Create and maintain clear documentation for systems, processes, and runbooks.
- Troubleshoot and support integrations with third-party systems via APIs and SSO implementations.
This is a hybrid position requiring on-site work in South Austin, TX, from Monday to Thursday, with an option for remote work on Fridays. The role includes participation in a 24/7 on-call rotation to support critical production releases and incidents outside of standard business hours.
Requirements
Education: A Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience is required.
Experience: This position requires 2-5 years of experience in Site Reliability Engineering, DevOps, Cloud Operations/Engineering, Systems Administration, or Software Engineering with a strong operational focus. Hands-on experience with platform development, automation strategies, and monitoring strategies within Azure is necessary. Familiarity with SQL and the .NET stack is also required for effective application troubleshooting.
Technical Skills:
- Hands-on experience managing and troubleshooting production workloads on Microsoft Azure (IaaS & PaaS).
- Proficiency with Azure monitoring tools (Azure Monitor, Application Insights, Log Analytics) and the Kusto Query Language (KQL).
- Solid scripting skills for automation (e.g., PowerShell, Bash, Python).
- Experience with CI/CD concepts and tools, particularly Azure DevOps pipelines.
- Proficiency with Git workflows and platforms like Azure Repos or GitHub.
- Solid understanding of networking concepts (TCP/IP, DNS, HTTP/HTTPS, TLS, Load Balancing, Firewalls).
Preferred Qualifications
- A basic understanding of, or development experience with, C#/.NET applications.
- Experience with Infrastructure as Code (IaC) tools like ARM templates, Bicep, or Terraform.
- Familiarity with containerization technologies (Docker) and orchestration (Kubernetes, Azure Container Apps).
- Experience supporting relational (e.g., Azure SQL) and NoSQL (e.g., Cosmos DB) databases.
- Basic familiarity with modern JavaScript front-end technologies like React and TypeScript.
- Experience working in Agile development environments or within the financial services industry.
- Microsoft Certified: Azure Administrator Associate (AZ-104) or DevOps Engineer Expert (AZ-400).