Site Reliability Engineer

NIGHTWING LLC
Sterling, United States of America
9 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Sterling, United States of America

Tech stack

Java
JavaScript
Amazon Web Services (AWS)
Data analysis
Azure
Unix
Command-Line Interface
Cloud Computing
Configuration Management
Computer Programming
Databases
Linux
DevOps
Monitoring of Systems
Systems Analysis
Information Technology Operations
Python
Network Security
Log Analysis
Network Connections
Network Protocols
Operational Data Store
Performance Tuning
Reliability Engineering
Ansible
DataOps
Data Logging
Google Cloud Platform
Containerization
Kubernetes
Patch Management
Cloud Optimization
Puppet
Docker
Go

Job description

The Site Reliability Engineer (SRE) works closely with contract leadership, Platform teams, and the Sponsor to refine the operational and technical strategy for automating key portions of IT operations, enabling the Product team (Platform) to bring new software and features to production as quickly as possible. The SRE performs and analyzes manual IT operations and administration tasks (log analysis, performance tuning, patch management, testing, and incident response) and converts them into automated tasks. The SRE also works with the Platform, Network, and Data Operations teams to assist in deployment planning and system onboarding, and assists with monitoring, system analysis, and IT operations support. Daily tasks include, but are not limited to:

  • Work with the Sponsor, Mission partners, and technical personnel to deliver a robust, scalable operations architecture that meets the customer's goals for the enterprise.

  • Analyze, define, and document requirements for data, workflow, logical processes, hardware and operating system environments, network connectivity, other system interfaces, internal and external checks and controls, and outputs.

  • Monitor and track metrics, logs and traces across all services in the system/network and provide context for identifying root causes in the event of an incident, performance degradation, or availability issue.

  • Perform network and cloud optimization and resilience planning.

  • Develop capabilities to automate hardware/software provisioning, monitoring, patching, and troubleshooting.

  • Collaborate with and assist the Platform team and leadership in maintaining network and security health and addressing intrusions or inappropriate activities.

  • Optimize business processes, workflows, and service operations by building efficient on-call processes and streamlining alerting workflows.

  • Leverage operational data to automate systems administration, operations, and incident response processes in order to improve enterprise reliability and manage IT environment complexity.

  • Work with LSA, Lab Manager, and CM to compose technical documents including Design, Deployment, System specifications and Host Nation baselines, updates, user manuals, training materials, installation guides, proposals, and reports.

  • Work with the OM to implement ITSM best practices for ICA/Service discrepancy and reporting, issue resolution and operations support to include Tier 2/3 escalation.

Requirements

  • Programming: Proficiency in at least one programming language (e.g., Python, Go, Java, or JavaScript) is essential for automating tasks and developing tools.

  • Linux/Unix Systems Administration: Strong knowledge of Linux/Unix operating systems, including command-line tools and system administration tasks.

  • Networking: Understanding of network protocols, infrastructure, and troubleshooting techniques.

  • Database Management: Familiarity with database technologies and principles.

  • Automation: Experience with automation tools and techniques, such as configuration management (e.g., Ansible, Puppet, Chef) and orchestration (e.g., Kubernetes).

  • Monitoring and Logging: Experience with monitoring tools and logging systems.

  • Problem-Solving: Strong analytical and problem-solving skills to diagnose and resolve system issues.

  • Communication: Ability to communicate technical information clearly and concisely to both technical and non-technical audiences.

  • Collaboration: Ability to work effectively with cross-functional teams, including software developers and operations personnel.

Desired Skills:

  • Cloud Technologies: Experience with cloud platforms (e.g., AWS, Google Cloud, Azure).

  • Containerization: Knowledge of containerization technologies (e.g., Docker, Kubernetes).

  • DevOps Principles: Understanding of DevOps principles and practices.

  • Service Level Objectives (SLOs) and Service Level Agreements (SLAs): Experience with defining, tracking, and managing SLOs and SLAs.

  • Data Analysis: Experience with data analysis and visualization tools.

Desired Certs:

  • Global Skill Development Council (GSDC) Site Reliability Engineering (SRE) Foundation Certification (CSREF).

About the company

_At Nightwing, we value collaboration and teamwork. You'll have the opportunity to work alongside talented individuals who are passionate about what they do. Together, we'll leverage our collective expertise to drive innovation, solve complex problems, and deliver exceptional results for our clients._

_Thank you for considering joining us as we embark on this new journey and shape the future of cybersecurity and intelligence together as part of the Nightwing team._

_Nightwing is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability or veteran status, age, or any other federally protected class._
