HPC System Administrator

Jun 08, 2017
Institution Type
Four-Year Institution

Stanford University

Job Number:

The Stanford Research Computing Center (SRCC) is seeking outstanding applicants for the position of Research Computing System Administrator/Engineer. Working with research units across the campus, the successful candidate will join a dynamic and growing team of technology specialists to support Stanford's research portfolio. This position will focus specifically on the management and support of an HPC cluster and multi-petabyte storage platform that provides essential infrastructure for Stanford's growing bioinformatics and genomics communities.

The ideal candidate will have demonstrated expertise in managing and supporting a wide variety of infrastructure platforms and end-users in academic or lab environments. Outstanding written and verbal communication skills are essential in this position, as are patience and creativity. You will be expected to do independent analysis, troubleshooting and problem-solving, but you also must work collaboratively within teams and across organizational boundaries.

The SRCC is a joint effort of the Dean of Research and IT Services. The SRCC offers high performance computing platforms, consulting, tools, system engineering, and system administration in support of computational and data-intensive research across the campus.

Core Duties*:
  • Support and administration of research computing clusters, servers and storage systems, including installation, network and security configuration, monitoring, maintenance, application software build/configuration, upgrading, patching, and complex user problem solving. Those systems may be in Stanford data centers or in Stanford research labs and units.
  • Provision computing platforms and associated storage and networking for research environments, incorporating novel technical solutions as needed to meet research requirements. Install, test and configure software tools, libraries and compilers to meet researchers' needs.
  • Customize environments as requested by research teams, with specific focus on the optimization of end-users' experiences.
  • Provide advanced cyberinfrastructure training and consultation for faculty, postdocs and graduate students across a wide array of university research units and departments.
  • Ensure systems are configured and managed in accordance with Stanford policies and any regulatory requirements specific to data sources and classifications.
  • Conceive, design, develop, optimize, integrate, and maintain information technology at a complex level.
  • Troubleshoot highly complex problems for which the analysis and resolution require extensive knowledge of many diverse system components.
  • Develop long range technology plans.
  • Provide leadership and IT solutions for complex problems.

    *Other duties may be assigned.

    Minimum Requirements:
    Education and Experience
    Bachelor's degree and eight years of related, increasingly technical work experience, or a combination of education and relevant experience. Strong, demonstrated knowledge of Linux and demonstrated experience managing complex multiuser server and storage environments are also required.

    Knowledge, Skills and Abilities
    Advanced knowledge of Linux is required; experience managing, using, supporting and consulting on research computing cyberinfrastructure in an academic or research environment is strongly preferred. Proven ability to deliver outstanding system and service administration and end-user support in a thorough and timely manner is needed. This position requires that you be able to juggle multiple competing priorities, work quickly and accurately, and demonstrate initiative in conceptualizing technical projects and moving them successfully to completion. You must be able to do independent analysis, troubleshooting and problem resolution, but also must work collaboratively with other team members and across organizational group boundaries.

    This position requires hands-on experience building and supporting multi-tenant Linux servers/clusters and their associated networks, file systems and storage devices in production research environments. Specifically, the technical knowledge needed to be successful in this position includes:
  • Expert demonstrated knowledge of clustered Linux systems, including securing systems, and day-to-day troubleshooting, monitoring, support, software packaging, and working within industry-wide best practices
  • Experience administering, configuring, and supporting HPC clusters, including systems with accelerators, and high performance file systems and storage. This includes hardware installation, configuration, upgrades and repairs
  • Knowledge of and experience utilizing data and system security techniques, practices and standards as they relate to HPC and research systems, storage and networks
  • Knowledge of and experience with high speed data transfer methods such as, but not limited to, GridFTP, Globus Online, Aspera, bbftp and/or similar large volume, high speed data transfer technology
  • Experience installing and supporting parallel computing environments (e.g., OpenMPI, MVAPICH, etc.)
  • Hands-on experience installing, configuring and supporting job schedulers and resource managers (e.g., SLURM, OGE, LSF, Torque, Maui, etc.)
  • Exceptional written and verbal communication skills
  • Experience using and programming automated system management tools, both at a general level (e.g., Puppet) and at the cluster level (e.g., Rocks)
  • Experience managing and supporting InfiniBand-based networks is desired but not required
  • Experience installing, configuring, managing and supporting GPFS parallel file systems is desired but not required
  • Familiarity with TCP/IP, Internet routing protocols, private and public networks, VLANs, firewalls, load balancers, addressing schemes, subnet creation and subnet masking. Proven ability to troubleshoot basic network issues and to communicate and work with a team of network engineers to solve possible network design issues in HPC
  • Familiarity with the intersection of storage and networking disciplines, e.g., transport media, speeds of media, storage networks, IP-based storage delivery, and other storage delivery technologies
  • Experience with some of the following applications: Git, Apache, Tomcat, Subversion, Kerberos, LDAP
  • Software installation and maintenance experience supporting research codes and clients
  • Exceptional client service and communication, focusing on proactive system administrator actions and interactions to reduce or remove barriers to clients' efficient use of resources to advance research

    Physical Requirements:
    This position requires the ability to lift and manipulate storage and compute servers, rack and unrack equipment up to 40 pounds, and occasionally climb ladders.

    Working Conditions:
    The position will support equipment in several off-campus locations, so having a valid driver's license is necessary. The position is expected to respond to critical system problems off-hours and must also be available for routine on-site system maintenance and patching, typically scheduled for evenings and weekends so as to minimize the disruption of research work. The position is expected to rotate on-call duties during winter break and other closures.

    Working Standards:
  • Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
  • Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for safety; communicates safety concerns; uses and promotes safe behaviors based on training and lessons learned.
  • Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide, http://adminguide.stanford.edu/.

    Job: Information Technology Services

    Location: Business Affairs: Administrative Systems (IT)
    Schedule: Full-time
    Classification Level:

    To be considered for this position, please visit our website and apply online at the following link: stanfordcareers.stanford.edu

    Stanford is an equal opportunity employer and all qualified applicants will receive consideration without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other characteristic protected by law.

