Site Reliability Engineer - SRE
Quinnox Inc
New York, United States of America
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $100K
Job location: New York, United States of America
Tech stack
Amazon Web Services (AWS)
Application Release Automation
Bash
Cloud Engineering
Cluster Analysis
Continuous Integration
DevOps
Disaster Recovery
Monitoring of Systems
Identity and Access Management
Java Virtual Machine (JVM)
Python
Key Management
Networking Basics
Oracle Applications
Performance Tuning
PowerShell
Release Management
Reliability Engineering
Site Reliability Engineering Practices
Prometheus
Calypso
SQL Databases
Data Logging
Scripting (Bash/Python/Go/Ruby)
Enterprise Software Applications
System Availability
Grafana
Reliability of Systems
Database Performance
Infrastructure as Code (IaC)
GitLab
CloudFormation
Low Latency
Deployment Automation
Terraform
Splunk
Virtual Private Clouds
Dynatrace
Job description
- Own reliability, availability, and performance of Calypso across production and non-production environments
- Design, implement, and operate end-to-end SRE practices, including monitoring, alerting, incident management, and capacity planning
- Build and manage CI/CD pipelines using GitLab, enabling automated build, deployment, and release of Calypso components
- Automate deployment and environment provisioning on Amazon Web Services (AWS) using Infrastructure as Code (IaC) principles
- Develop and maintain automation scripts using PowerShell, Shell (Bash), and Python for operational tasks, deployments, and monitoring
- Ensure high availability and resiliency of Calypso services through failover strategies, clustering, and disaster recovery planning
- Implement observability frameworks, including logging, metrics, and distributed tracing for proactive issue detection
- Define and monitor SLOs/SLIs/SLAs, ensuring system performance meets business expectations in a trading environment
- Lead incident management and root cause analysis (RCA), ensuring quick resolution of production issues and prevention of recurrence
- Optimize system performance, including JVM tuning, database performance, and application-level optimizations for high-volume trade processing
- Manage environment stability, including handling batch jobs, EOD processing, and trade lifecycle events in Calypso
- Collaborate with development, QA, and infrastructure teams to ensure smooth releases and production readiness
- Implement security best practices, including access controls, secrets management, and compliance with regulatory requirements
- Support release management and deployment strategies, including blue-green deployments, canary releases, and rollback mechanisms
- Drive continuous improvement and automation, reducing manual intervention and improving system reliability
- Maintain runbooks, playbooks, and operational documentation for support and incident handling
- Support production releases and provide hypercare support, ensuring system stability during critical business cycles
Pay: $90,000.00 - $100,000.00 per year
Requirements
- 6-10+ years of experience in Site Reliability Engineering / DevOps / Production Support on capital markets platforms or enterprise applications (experience with Calypso V17/V18 is an added advantage)
- Strong hands-on experience with:
  - Amazon Web Services (EC2, S3, networking, IAM, VPC)
  - GitLab CI/CD pipelines
  - Scripting: PowerShell, Bash/Shell (Python is an added advantage)
- Experience with:
  - Monitoring tools (e.g., ELK, Prometheus, Grafana, Splunk)
  - CI/CD and release automation
  - Infrastructure as Code (Terraform, CloudFormation - preferred)
- Strong understanding of:
  - Linux/Unix systems
  - Networking fundamentals and cloud architecture
  - Basic database concepts (Oracle/SQL)
- Experience supporting high-availability, low-latency enterprise systems