Senior HPC Systems Administrator
Hyde Park Campus
86755 Research Computing Center
About the Unit
The University of Chicago Research Computing Center (RCC), a unit in the Office of Research and National Laboratories (RNL), provides high-end research computing resources to researchers at the University of Chicago. It is dedicated to enabling research by providing access to centrally managed High Performance Computing (HPC), storage, and visualization resources. These resources include hardware, software, high-level scientific and technical user support, and the education and training required to help researchers make full use of modern HPC technology and local and national supercomputing resources.
The Office of Research and National Laboratories oversees the conduct of sponsored research, research program development, multi-institutional research institutes, national laboratory board, and contract management functions. RNL supports the development and coordination of research-related communications and educational programs at the University of Chicago. RNL oversees the management of two Department of Energy contracts for Argonne National Laboratory and Fermi National Accelerator Laboratory. When combined with the Lab R&D budgets, the office oversees approximately $1.4 billion in sponsored research. RNL works closely with individual scholars, departments, and divisions to encourage, seed, and coalesce research across the University, Argonne, and Fermilab campuses.
The University of Chicago is seeking a highly qualified Senior HPC Systems Administrator to join the systems and operations team that builds and manages RCC HPC systems and facility operations. The individual in this position will be involved in the procurement and management of HPC hardware and software.
- Installing, configuring, and maintaining large computer clusters/servers and software.
- Performing day-to-day operations of the systems, including systems administration, monitoring, and storage performance, up to and including network components. Managing the systems' network switches, parallel file system, and HPC software stack and tools.
- Configuration of the scheduling and queuing system.
- Diagnosing and resolving system operational problems quickly and effectively. Coordinating with vendors to resolve hardware and software problems. Assisting users with access and other help desk ticket requests or issues.
- Using scripting/programming skills to enable system-level automation, problem detection, security maintenance, and patch management.
- Building and deploying open source software and software from vendors/partners.
- Providing reliable and efficient backups/restores for all managed systems.
- Documenting system administration procedures for routine and complex tasks.
- Maintaining and monitoring the security of the HPC systems and servers.
- Other duties as assigned.
- Ability to understand and translate researchers' scientific goals into computational requirements.
- Ability to work well with faculty and researchers.
- Ability to identify and gain expertise in appropriate new technologies and/or software tools.
- Ability to function as part of an interactive team while demonstrating self-initiative to achieve project goals and the Research Computing Center's mission.
- Strong analytical skills and problem-solving ability.
Education, Experience or Certifications:
- Bachelor's degree in Computer Science or a closely related field, or at least five years of experience in HPC system administration or managing large HPC clusters, required.
- A minimum of five years of full-time Linux system administration experience in a large distributed computing environment required.
- At least two years of experience supporting a Linux HPC cluster used for scientific research preferred.
Technical Knowledge or Skills:
- Experience installing, configuring, and maintaining job management tools (such as Slurm, Moab, TORQUE, or PBS) required.
- Experience installing, configuring, and troubleshooting MPI and OpenMP preferred.
- Experience with operating system deployment tools (e.g. xCAT, Rocks) preferred.
- Experience configuring, administering, and supporting network storage subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI) preferred.
- Hands-on experience with at least one distributed file system (Spectrum Scale/GPFS, Lustre, BeeGFS, Gluster, IBRIX, PVFS, etc.) required.
- Direct experience working with InfiniBand (must at least be able to demonstrate a working knowledge of InfiniBand concepts, OFED layers, and subnet managers) required.
- Experience installing, configuring, tuning, and maintaining scientific application software on large-scale systems preferred.
- Experience supporting HPC compilers and libraries preferred.
- Experience with systems automation tools such as Ansible or Puppet preferred.
- Experience installing, configuring, maintaining, and/or using performance monitoring and optimization tools preferred.
- Cover letter
NOTE: When applying, all required documents MUST be uploaded under the Resume/CV section of the application.
Depends on Qualifications
The University of Chicago is an Affirmative Action/Equal Opportunity/Disabled/Veterans Employer and does not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, national or ethnic origin, age, status as an individual with a disability, protected veteran status, genetic information, or other protected classes under the law. For additional information please see the University's Notice of Nondiscrimination.
Staff job seekers in need of a reasonable accommodation to complete the application process should call 773-702-5800 or submit a request via the Applicant Inquiry Form.
The University of Chicago's Annual Security & Fire Safety Report (Report) provides information about University offices and programs that provide safety support, crime and fire statistics, emergency response and communications plans, and other policies and information. The Report can be accessed online at: http://securityreport.uchicago.edu. Paper copies of the Report are available, upon request, from the University of Chicago Police Department, 850 E. 61st Street, Chicago, IL 60637.
The University of Chicago is an urban research university that has driven new ways of thinking since 1890. Our commitment to free and open inquiry draws inspired scholars to our global campuses, where ideas are born that challenge and change the world.
We empower individuals to challenge conventional thinking in pursuit of original ideas. Students in the College develop critical, analytic, and writing skills in our rigorous, interdisciplinary core curriculum. Through graduate programs, students test their ideas with UChicago scholars, and become the next generation of leaders in academia, industry, nonprofits, and government.
To learn more about the University, visit http://www.uchicago.edu/