Senior Systems Engineer (Linux / HPC Environment)
Job description
As a Senior Systems Engineer, you will play a critical role in supporting, administering, and maintaining a Linux-based high-performance computing (HPC) environment that underpins advanced analytics, statistical modeling, and research activities. Your primary goal will be to ensure system reliability, security, and top-tier performance while working with cross-functional teams to deliver scalable technical solutions for evolving business needs.
Key areas of responsibility include:
System Administration:
- Administer and maintain Linux-based HPC systems.
- Perform regular system updates, patch management, and robust security hardening.
- Monitor, tune, and optimize system performance to ensure high availability and efficiency.
Platform Support:
- Provide advanced (Tier 3) technical support for complex HPC platform issues.
- Troubleshoot and resolve system outages or performance issues with minimal downtime.
- Translate business and analytical requirements into workable technical solutions.
Collaboration & Communication:
- Partner closely with data engineers, data scientists, analysts, and various stakeholders to understand and address their technology needs.
- Document system configurations, processes, troubleshooting steps, and incident resolutions.
- Drive knowledge sharing and support continuous process improvement activities.
Security & Compliance:
- Implement and maintain security best practices, protocols, and regular audits.
- Conduct vulnerability assessments to mitigate risks and protect sensitive data.
- Ensure all systems adhere to organizational and regulatory compliance standards.
Project & Engineering Support:
- Engage in system enhancements, upgrades, and performance initiatives to keep pace with technology advances.
- Support system architecture and design decisions for both new and existing platforms.
- Assist with the implementation and integration of new tools, features, and capabilities.
On-Call Support:
- Participate in an on-call rotation to support critical systems and ensure maximum uptime.
Requirements
- Extensive hands-on experience in Linux system administration, including scripting with Bash or other shells.
- Proficiency with automation frameworks such as Ansible (including Ansible Automation Platform) for configuration management and deployment.
- Background supporting high-performance computing environments, such as systems utilizing SLURM workload manager or Open OnDemand portals.
- Understanding of analytical and statistical software tools such as Python, R, MATLAB, SAS, or similar platforms.
- Exceptional troubleshooting and root-cause analysis skills, with the ability to resolve complex technical issues under pressure.
- Highly effective communication skills, enabling collaboration with both technical specialists and business teams.
- Commitment to security best practices and experience with vulnerability assessments in enterprise environments.
- Strong documentation skills with a focus on process consistency and incident management.
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a relevant technical field (or equivalent experience).
- Prior experience in system engineering roles within computational, research, or analytics-driven organizations is strongly preferred.
- U.S. citizenship is required due to ongoing project needs.
- Ability to work onsite as required; onsite engagement is full-time unless otherwise specified.
- Willingness to participate in on-call rotations to ensure system uptime and reliability.