Site Reliability Engineer
Job description
As a Site Reliability Engineer (Cloud Ops), you will help operate and continuously improve Featurespace's world-leading product, ARIC Risk Hub, delivered as a robust cloud-based SaaS solution. You will work as part of the Cloud Operations / SRE team to ensure our platform is reliable, scalable, measurable, repeatable, secure, and cost-effective.
You will participate in designing, developing, deploying, monitoring, supporting, documenting, and troubleshooting our SaaS platform, collaborating closely with engineering, data science, internal stakeholders, external vendors, and customers to deliver excellent service outcomes.
Responsibilities
We hire people who are willing to adapt to a varied role. Alongside the responsibilities below, we ask that you take ownership of other duties as required.
- Operate and support production deployments of ARIC Risk Hub SaaS, including deploying, maintaining, monitoring, upgrading, and troubleshooting platform and application components.
- Build software and systems to manage platform infrastructure and applications.
- Continuously evaluate and improve technology and operational processes to increase quality, reduce costs, and improve time-to-market.
- Participate in service resilience and failure testing, including predictable and unpredictable failure scenarios.
- Provide second-line operational support for SaaS customers, ensuring timely and high-quality issue resolution.
- Gather service performance data and generate reports and insights to guide reliability and scalability improvements.
- Develop, maintain, and document internal processes and operational runbooks.
- Collaborate with engineering and data science teams to drive new and improved ARIC Risk Hub capabilities.
- Participate in an on-call roster, including out-of-hours support as required.
This is a hybrid position. The expected number of days in the office will be confirmed by your Hiring Manager.
Requirements
- Experience administering cloud infrastructure or supporting cloud applications (preferably AWS).
- Working knowledge of Linux, shell scripting, and command-line tools.
- Ability to write or maintain code in at least one high-level programming language (e.g., Python).
- Understanding of networking fundamentals (e.g., DNS, routing, firewalls).
- Familiarity with source control systems (e.g., Git).
- Exposure to CI/CD concepts and pipelines.
- Familiarity with monitoring, metrics, and alerting systems.
- Experience operating and supporting production-grade services.
- Ability to write clear technical documentation and follow defined operational processes.
Preferred
- Infrastructure as Code and configuration management experience (e.g., Terraform, SaltStack, Ansible).
- Experience with containerization (Docker) and Kubernetes (deploying or operating services).
- Exposure to service mesh technologies (e.g., Istio).
- Experience building or operating cloud-native or serverless applications.
- Familiarity with observability and data platforms such as Prometheus, Grafana, MongoDB, Elasticsearch, Kafka, and HashiCorp Vault.
- Understanding of application and data security fundamentals (authentication, authorization, encryption, TLS).
- Awareness of regulated standards (e.g., PCI-DSS, SOC2, ISO27001).
- Relevant industry experience supporting cloud-based SaaS platforms in production environments.
- Excellent interpersonal and communication skills, with the ability to collaborate across teams and organizations.
- Strong attention to detail and a proactive, best-practice-driven approach to work.
- Passion for learning new skills and technologies and staying current with industry developments.
- Curiosity, innovation, and enthusiasm for solving complex problems.
- Strong time-management skills and the ability to prioritize effectively.