HPC System Administrator

Apr 05, 2017
Institution Type
Four-Year Institution
About The Unit: The University of Chicago Research Computing Center (RCC), a unit in the Office of the Vice President for Research and for National Laboratories (OVPRNL), provides high-end research computing resources to researchers at the University of Chicago. It is dedicated to enabling research by providing access to centrally managed High Performance Computing (HPC), storage, and visualization resources. These resources include hardware, software, high-level scientific and technical user support, and the education and training required to help researchers make full use of modern HPC technology and local and national supercomputing resources.

The Office of the Vice President for Research and for National Laboratories oversees the conduct of sponsored research, technology transfer, research program development, multi-institutional research institutes, national laboratory board, and contract management functions. OVPRNL supports the development and coordination of research-related communications and educational programs at The University of Chicago. OVPRNL oversees the management of two Department of Energy contracts for Argonne National Laboratory and Fermi National Accelerator Laboratory. When combined with the Lab R&D budgets, the office oversees approximately $1.4 billion in sponsored research. OVPRNL works closely with individual scholars, departments, and divisions to encourage, seed, and coalesce research across the University, Argonne, and Fermilab campuses.

Unit Job Summary: HPC System Administrator: The University of Chicago is seeking a highly qualified HPC system administrator to join its system and operation team, which builds and manages RCC HPC systems and facility operations. The individual in this position will be involved in the procurement and management of HPC hardware and software. The responsibilities include but are not limited to:
- Installing, configuring, and maintaining large computer clusters/servers and software.
- Handling day-to-day operations of the systems, including systems administration, monitoring, and storage performance, up to and including network components.
- Managing the system's network switch, parallel file system, and HPC software stack and tools.
- Configuring the scheduling and queuing system.
- Diagnosing and resolving system operational problems quickly and effectively.
- Coordinating with vendors to resolve hardware and software problems.
- Assisting users with access and other help desk ticket requests or issues.
- Building and deploying open source software and software from vendors/partners.
- Providing reliable and efficient backups/restores for all managed systems.
- Documenting system administration procedures for routine and complex tasks.
- Maintaining and monitoring the security of the HPC systems and servers.
- Other duties as assigned.

Unit Education: Bachelor's degree in Computer Science or a closely related field OR at least five years of experience in HPC system administration or managing large HPC clusters required.

Unit Experience:
- A minimum of three years of full-time Linux system administration experience in a large distributed computing environment required.
- Experience with installing, configuring, and maintaining job management support tools (such as Moab, TORQUE, SLURM, or PBS) required.
- Experience installing MPI libraries and OpenMP required.
- Experience with operating system deployment tools (e.g., xCAT, Rocks) preferred.
- Experience configuring, administering, and supporting storage subsystems (e.g., IBM, NetApp, DataDirect Networks, LSI) preferred.
- Experience with one or more distributed file systems (Lustre, Gluster, GPFS, GFS, IBRIX, PVFS, etc.) required.
- Direct experience working with InfiniBand (must at least be able to demonstrate a working knowledge of InfiniBand concepts, OFED layers, and subnet managers) required.
- Experience configuring, installing, tuning, and maintaining scientific application software preferred.
- At least two years of experience supporting Linux HPC clusters used for scientific research preferred.
- Experience supporting HPC compilers and libraries preferred.
- Scientific programming experience preferred.
- Experience configuring, installing, maintaining, and/or using performance monitoring and optimization tools preferred.

Unit Job Function Competencies:
- Excellent interpersonal and communication skills required.
- Ability to plan, organize, prioritize tasks, and complete assigned projects with minimal supervision required.
- Ability to develop and maintain programs and scripts that aid in the operation and automation of administrative tasks using various shell and scripting languages (Bash, Perl, Python) required.