Lead Cloud Engineer (AWS)
ConglomerateIT LLC
Chamblee, United States of America
yesterday
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $94K
Job location: Chamblee, United States of America
Tech stack
Amazon Web Services (AWS)
Cloud Computing
Cloud Engineering
Data Recovery
DevOps
Disaster Recovery
Monitoring of Systems
Identity and Access Management
Python
OpenID
Ansible
Load Balancing
MTTR
Terraform
Job description
Keep AWS environments and customer applications stable, secure, cost-efficient, and resilient at all times. Focus is on making deployments feel routine, keeping incidents manageable, and ensuring operations run in a predictable, controlled manner.
Key Responsibilities
- Lead incident management end-to-end, including handling critical outages and ensuring long-term fixes are implemented.
- Ensure production environments across AWS accounts and applications remain stable and reliable.
- Continuously optimize cloud spend using tagging strategies, rightsizing, and lifecycle controls.
- Strengthen observability across systems by making logs, metrics, tracing, and alerts actionable and meaningful.
- Build and scale reusable automation, playbooks, and operational best practices to support the team.
- Enforce secure access through least-privilege principles, regular audits, and credential hygiene.
- Define and maintain robust backup and disaster recovery strategies with periodic validation and documentation.
Core Functional Areas
- Application Operations: Manage deployments, perform smoke validations, track performance baselines, and ensure reliable rollback mechanisms.
- Cloud Infrastructure Management: Oversee AWS services like EC2, EKS, RDS, networking (VPC, security groups, transit gateways), IAM/OIDC, and edge components such as CloudFront and load balancers.
- Incident Management: Run high-priority incident bridges, maintain clear stakeholder communication, and drive effective post-incident reviews.
- Monitoring & Observability: Develop dashboards, alerts, and synthetic monitoring while maintaining a strong signal-to-noise ratio.
- Operational Excellence: Standardize processes via runbooks and reduce manual effort through automation.
- Backup & Recovery: Manage backup strategies, retention policies, cross-region replication, and validate recovery through regular testing.
- Cost Optimization: Control cloud expenses using savings plans, reserved instances, tagging discipline, and cleanup of unused assets.
Daily Activities
- Keep monitoring systems sharp by reducing alert noise and fixing visibility gaps.
- Act on operational issues quickly: resolve or escalate without leaving anything unclear.
- Review incidents and alerts from the previous cycle, prioritize them, and assign ownership.
- Update runbooks and documentation with new fixes, learnings, and recurring patterns.
- Validate backup success and confirm that recovery points are usable.
- Assist in deployments by ensuring readiness and verifying post-release checks.
Weekly Focus
- Strengthen observability by refining alerts and filling in missing telemetry signals.
- Review patches, recent changes, and rollback scenarios to identify improvement areas.
- Conduct a consolidated operations review across incidents, deployments, cost trends, capacity, and backup health.
- Perform recovery drills or partial restore validations to ensure disaster readiness.
Monthly Deliverables
- Refresh and maintain critical runbooks while validating disaster recovery readiness through drills or actual restore tests.
- Publish key operational insights such as uptime/SLO adherence, MTTR, deployment reliability, monitoring coverage, backup compliance, and cost optimization metrics.
- Drive closure of recurring operational issues like unstable releases and excessive alert noise.
Success Indicators
- Efficient incident resolution with most issues handled via well-defined runbooks.
- Controlled and optimized cloud spending that aligns with system growth, supported by strong tagging discipline.
- Reliable backup systems with consistent restore validation and full compliance.
- Clean, dependable dashboards with accurate alerting, minimal noise, and proper escalation flows.
- Smooth and predictable release cycles with very few failures.
Preferred Qualifications
- Certification as an AWS Solutions Architect.
- Relevant certifications in networking.
- ITIL or similar service management certification.
Founded in 2014, ConglomerateIT is a global leader in delivering innovative IT solutions and services. Headquartered in the USA with a presence in the UK, Canada, and India, we specialize in offering industry-leading expertise and cutting-edge products that help our clients maximize their technological investments. Our focus on best-in-class solutions, a highly knowledgeable team, and proactive talent mapping ensures we remain at the forefront of the IT industry.
Requirements
- 8 to 10+ years of experience in cloud or application operations, with strong hands-on work in AWS environments.
- Proven ability to handle incidents, build effective monitoring systems, and automate operational workflows using tools like Terraform, Ansible, or Python.
- Hands-on exposure to AWS backup and disaster recovery setups, including real restore validations.
- Solid understanding of cloud networking concepts and architecture.
ConglomerateIT is driven by our Center for Excellence and Innovation, an initiative dedicated to keeping us ahead in a rapidly evolving technology landscape. We understand that building strong relationships is key to our success, and this commitment has enabled us to partner with Fortune 500 companies and leading system integrators worldwide. Our ability to provide local talent on a global scale ensures that we can meet the contingent project requirements of our clients efficiently and effectively.
Benefits & conditions
- $45.00 per hour