Manager, Cloud & Research Computing Platforms
Job description
As Manager, Cloud & Research Computing Platforms, you will report directly to the Principal Investigator of the MANIAC Lab and lead a technical team of systems administrators and research software engineers. In this role, you will develop high-level programmatic plans across multiple workstreams and translate them into detailed technical roadmaps for the Lab's systems engineering efforts. You will collaborate extensively with the U.S. ATLAS Computing operations program, the international ATLAS software and computing community, IRIS-HEP partners, and IT teams within the Physical Sciences Division.
Success in this position requires deep technical expertise, strong communication skills, and disciplined organizational capabilities to address complex cyberinfrastructure challenges and ensure reliable operations. You will guide the Lab's research and development agenda for computing facilities, advancing the transition from traditional high-throughput computing (HTC) architectures to modern cloud-native systems, federated operational models, and AI-assisted monitoring, diagnostics, and facility operations. Your leadership will be instrumental in shaping a forward-looking R&D program designed to meet the evolving demands of the High-Luminosity LHC (HL-LHC).
Responsibilities
- Leads the MANIAC Lab's distributed computing and IT systems team, composed of systems administrators and software engineers, overseeing Linux systems, cloud-native services, storage, networking, and cybersecurity.
- Supports team development through training, mentorship, and continuous learning opportunities.
- Develops clear technical plans, team goals, and operational milestones across all Lab-supported computing platforms.
- Partners with the Principal Investigator to implement strategic upgrades and ensure reliable, efficient operation of the Lab's cyberinfrastructure.
- Guides modernization efforts, including automation, cloud-native adoption, and improved data-delivery workflows.
- Collaborates with U.S. ATLAS, IRIS-HEP, and university partners to support shared operations and expand research capabilities.
- Monitors system performance and applies proactive measures to improve reliability and scalability.
- Engages with researchers to understand computing needs and deliver solutions that support data-intensive science.
- Ensures adherence to best practices for network operations and cybersecurity.
- Manages the progress of a single team by maintaining accurate, up-to-date logs, and ensures that all projects have the necessary management oversight and approvals for successful completion.
- Ensures the implementation of approved best practices and information technology policies that result in the highest-quality systems administration.
- Manages the creation of standards and procedures for maintaining production servers and their operating systems. Manages the installation, configuration, and maintenance of operating systems and utility software.
- Performs other related work as needed.
Requirements
Minimum requirements include a college or university degree in a related field.
Work Experience:
Minimum requirements include knowledge and skills developed through 7+ years of work experience in a related job discipline.
- Bachelor's degree in computer science or a related field in the physical sciences.
- Experience managing large-scale computing systems in academic, research, or enterprise environments.
- Demonstrated leadership of technical staff and successful delivery of complex cyberinfrastructure projects.
- Strong background in scientific or high-performance computing, distributed systems, and emerging cloud-native technologies.
- Experience implementing modern operational practices such as container orchestration, automation, and advanced data-delivery services.
- Familiarity with secure, policy-compliant operations, including network security and identity management.
- Experience supporting large CPU/GPU clusters, multi-petabyte storage systems, and data-intensive workflows.
- Proven ability to evaluate and integrate new technologies to enhance performance and efficiency.
- Record of effective collaboration with external partners and participation in professional technical communities.
Preferred Competencies
- Strong leadership, communication, and collaboration skills, with the ability to work effectively with researchers, technical staff, and institutional partners.
- Ability to operate in a dynamic research environment and stay current with advances in scientific and cloud-native computing.
- Proficiency in managing Unix/Linux systems, distributed storage platforms (e.g., Ceph), and high-performance networking.
- Familiarity with container orchestration and cloud-native technologies, including Kubernetes, CI/CD pipelines, and GitOps methodologies.
- Strong analytical and problem-solving abilities, with experience diagnosing and resolving complex infrastructure challenges.
- Experience applying automation, monitoring, and modern operational practices to improve system reliability and efficiency.
- Demonstrated ability to guide teams, build consensus, and drive process innovation in multi-stakeholder technical environments.
Working Conditions
- Full-time on-site presence at the University of Chicago's Hyde Park campus is required.
- Must be able to physically install server and networking equipment in professional data center environments.
Benefits & conditions
$66,500.00 - $83,100.00 per year