Manager, Cloud & Research Computing Platforms

The University of Chicago

Chicago, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$ 135K

Job location

Chicago, United States of America

Tech stack

Artificial Intelligence

Computing Platforms

Systems Engineering

Cloud Computing

Data Centers

Linux

Distributed Data Store

Distributed Systems

Identity and Access Management

Networking Hardware

Network Security

Utility Software

Ceph

Computer Network Operations

Reliability of Systems

Kubernetes

Information Technology

Operational Systems

Job description

As Manager, Cloud & Research Computing Platforms, you will report directly to the Principal Investigator of the MANIAC Lab and lead a technical team of systems administrators and research software engineers. In this role, you will develop high-level programmatic plans across multiple workstreams and translate them into detailed technical roadmaps for the Lab's systems engineering efforts. You will collaborate extensively with the U.S. ATLAS Computing operations program, the international ATLAS software and computing community, IRIS-HEP partners, and IT teams within the Physical Sciences Division.

Success in this position requires advanced technical depth, strong communication skills, and disciplined organizational capabilities to address complex cyberinfrastructure challenges and ensure reliable operations. You will guide the Lab's research and development agenda for computing facilities, advancing the transition from traditional HTC architectures to modern cloud-native systems, federated operational models, and AI-assisted monitoring, diagnostics and facility operations. Your leadership will be instrumental in shaping a forward-looking R&D program designed to meet the evolving demands of the HL-LHC.

Responsibilities

Leads the MANIAC Lab's distributed computing and IT systems team, composed of systems administrators and software engineers, overseeing Linux systems, cloud-native services, storage, networking, and cybersecurity.
Supports team development through training, mentorship, and continuous learning opportunities.
Develops clear technical plans, team goals, and operational milestones across all Lab-supported computing platforms.
Partners with the Principal Investigator to implement strategic upgrades and ensure reliable, efficient operation of the Lab's cyberinfrastructure.
Guides modernization efforts, including automation, cloud-native adoption, and improved data-delivery workflows.
Collaborates with U.S. ATLAS, IRIS-HEP, and University partners to support shared operations and expand research capabilities.
Monitors system performance and applies proactive measures to improve reliability and scalability.
Engages with researchers to understand computing needs and deliver solutions that support data-intensive science.
Ensures adherence to best practices for network operations and cybersecurity.
Manages a single team's progress by maintaining accurate and up-to-date logs, and ensures that all projects have the necessary management oversight and approvals for successful completion.
Ensures the implementation of approved best practices and information technology policies that result in the highest quality systems administration.
Manages the creation of standards and procedures to maintain production servers that run the operating system. Manages the installation, configuration, and maintenance of operating systems and utility software.
Performs other related work as needed.

Requirements

Minimum requirements include a college or university degree in a related field.

Work Experience:

Minimum requirements include knowledge and skills developed through 7+ years of work experience in a related job discipline.
Bachelor's degree in computer science or related field in the physical sciences.
Experience managing large-scale computing systems in academic, research, or enterprise environments.

Demonstrated leadership of technical staff and successful delivery of complex cyberinfrastructure projects.
Strong background in scientific or high-performance computing, distributed systems, and emerging cloud-native technologies.
Experience implementing modern operational practices such as container orchestration, automation, and advanced data-delivery services.
Familiarity with secure, policy-compliant operations, including network security and identity management.
Experience supporting large CPU/GPU clusters, multi-petabyte storage systems, and data-intensive workflows.
Proven ability to evaluate and integrate new technologies to enhance performance and efficiency.
Record of effective collaboration with external partners and participation in professional technical communities.

Preferred Competencies

Strong leadership, communication, and collaboration skills, with the ability to work effectively with researchers, technical staff, and institutional partners.
Ability to operate in a dynamic research environment and stay current with advances in scientific and cloud-native computing.
Proficiency in managing Unix/Linux systems, distributed storage platforms (e.g., Ceph), and high-performance networking.
Familiarity with container orchestration and cloud-native technologies, including Kubernetes, CI/CD pipelines, and GitOps methodologies.
Strong analytical and problem-solving abilities, with experience diagnosing and resolving complex infrastructure challenges.
Experience applying automation, monitoring, and modern operational practices to improve system reliability and efficiency.
Demonstrated ability to guide teams, build consensus, and drive process innovation in multi-stakeholder technical environments.

Working Conditions

Presence on campus full time at the Hyde Park campus of the University of Chicago is required.
Additionally, you should be capable of physically setting up server and networking equipment within professional data center environments.

Benefits & conditions

$66,500.00 - $83,100.00 per year

About the company

The MANIAC Lab, within the Enrico Fermi Institute of the Physical Sciences Division at the University of Chicago, designs, deploys, and operates advanced cyberinfrastructure in support of forefront particle physics research. The Lab operates one of the three sites of the Midwest Tier-2 (MWT2) federation, a data-intensive, high-throughput computing center that appears to ATLAS as a unified logical facility through harmonized services, federated operations, and shared configuration management. A shared Tier-3 Analysis Facility complements these resources through a cloud-native Kubernetes environment integrating large-scale CPU and GPU resources, Ceph object storage, BinderHub, Coffea-Casa, Dask, and ServiceX. This platform supports more than 500 ATLAS physicists and serves as a national testbed for high-bandwidth analysis workflows and emerging AI-augmented research methodologies. The Lab integrates advanced data-delivery systems with modern parallel scheduling frameworks and operates the Scalable Systems Laboratory, a cloud-native software testing platform for the NSF Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP). IRIS-HEP leads the development of next-generation software and computing technologies for the HL-LHC era, and the Lab contributes to these efforts through research on federated analysis environments, token-based data-access infrastructures, and next-generation HTTP/S caching technologies. The Lab also maintains the ATLAS distributed analytics and AI-assisted observability and operations platform, a large-scale Elasticsearch-based system indexing more than eight years of workflow, data-transfer, and network-telemetry metadata. This infrastructure underpins AI-driven anomaly detection, operational intelligence, and natural-language interfaces that support distributed facility operations and improve reliability across U.S. ATLAS sites. 
In addition, the Lab provides comprehensive computation and data-management support for HEP and astrophysics experiments within the Enrico Fermi Institute. It played a key role in building the online computing infrastructure for the South Pole Telescope (SPT-3G) and maintains its associated analysis systems. The Lab operates distributed data-management services for the XENON dark-matter experiment at Gran Sasso, supports simulation and analysis activities for the KOTO experiment at J-PARC (advancing frontier CP-violation studies via ultra-rare kaon decays), and contributes to simulation R&D for future collider initiatives.