Data Quality Engineer

Location
Chicago
Posted
Feb 22, 2017
Institution Type
Four-Year Institution
About The Unit: The Biological Sciences Division (BSD) is the largest operating unit of the University of Chicago. It includes the Pritzker School of Medicine, twenty-three academic departments, twelve interdisciplinary degree-granting committees, and more than a dozen research centers and institutes. Our mission is to discover and create new knowledge of living systems, to preserve and communicate knowledge through education, and to nurture and sustain a community of scholars. These scholars pursue this mission through research, the education of basic scientists, physicians, and others interested in living things, and through enlightened and compassionate care of patients in a humane, academic environment. Join us in transforming cancer research. The Center for Data Intensive Science is developing the emerging field of data science with a focus on applications to problems in biology, medicine, and health care. Our vision is a world in which researchers have ready access to the data and tools required to make discoveries that lead to deeper understanding and improved quality of life. We democratize access, speed discovery, create new knowledge and foster innovation through implementation using data at scale. Our scientific data clouds and commons include: The NCI Genomic Data Commons (GDC) is a next generation unified system that supports the hosting and standardization of genomic and clinical data from cancer research programs. The GDC provides the cancer research community with a data service supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. This is the foundation of a genomic precision medicine platform and will enable the development of a knowledge system for cancer. The GDC is an open-source, scalable, modern informatics framework that enables previously infeasible collaborative efforts between scientists. 
The Bionimbus Protected Data Cloud is an analysis platform for bioinformatics at scale. Bionimbus is the first open-source cloud-based computational platform that allows researchers authorized by NIH to compute over human genomic data in a secure and compliant fashion. Bionimbus and related cloud-based infrastructure are used by researchers working on cancer, diabetes and neuropsychiatric disorders. The Open Science Data Cloud (OSDC) provides the scientific community with resources for storing, sharing, and analyzing terabyte- and petabyte-scale scientific datasets. The OSDC is a data science ecosystem in which researchers can house and share their own scientific data, access complementary public datasets, build and share customized virtual machines with whatever tools are necessary to analyze their data, and perform the analysis to answer their research questions. The OSDC is a collaboration with the not-for-profit Open Cloud Consortium. The Biomedical Data Commons (BDC) is cloud-based infrastructure that we are developing for a consortium of medical research centers and commercial partners and that provides secure, compliant cloud services for managing and analyzing genomic data, electronic medical records (EMR), medical images, and other PHI data. It provides resources to researchers so they can more easily make discoveries from large, complex controlled-access datasets. The BDC is a collaboration with the not-for-profit Open Cloud Consortium.

Unit Job Summary: We're looking for a problem solver with a background working in data integrity and testing to ensure high-quality data and metadata are distributed to the cancer research community. Elevate your career with this opportunity to work with one of the world's largest collections of harmonized cancer genomic data. This role focuses on the Genomic Data Commons, which is at the forefront of both cutting-edge research and production systems supporting cancer research. You will join a team of engineers developing innovative technologies who will keep you challenged in our dynamic environment as we work together to pursue discovery through data-driven cancer research. You will join the team as the lead engineer for data quality and integrity. You will focus on leading data quality efforts related to data integration, higher level data products, and distribution to the cancer research community. To accomplish this, you will work across multiple teams to build and automate frameworks such as anomaly detection, reporting, and alerting to ensure data quality. You will gain expertise not only in the data itself but also in the systems, so that you can interrogate the data and understand gaps in data quality. Data and metadata quality has a broad scope; therefore, you are expected to work collaboratively across teams to determine priorities and best methods for achieving objectives. Key responsibilities include: Data Quality and Integrity - Drive the design of the data QA infrastructure and execution of testing protocols to validate pipelines, integrated datasets, and data products. Use a combination of exploratory, regression and automated testing to ensure data quality standards. Assess appropriate inclusion/exclusion of data based on the defined data dictionary; assist in evaluation of data dictionaries and utilize data specification and code to validate data as it relates to quality.
Data Quality Improvement - Proactively identify potential data issues and downstream impact. Identify existing data issues and perform research and root cause analyses to determine resolution. Work collaboratively with software engineers and bioinformaticians to achieve and verify resolution. Establish processes and standards to improve data quality assurance and implement efficiencies in data management. Define measurements and metrics to conduct and present routine data reports to the project team and stakeholders. Data Management - Participate in data acquisition and integration planning efforts including data modeling, data dictionary definitions, and data harmonization pipeline development. Develop a deep understanding of multiple genomic datasets and the technical data management software and processes of the underlying system. Define data quality and integrity criteria and develop a comprehensive data quality management plan to lead key data QC efforts through team collaboration for all phases of the data management life cycle. Technical Writing - Contribute written knowledge and expertise to system documentation, user documentation, scientific manuscripts, reporting, grant proposals and reports, and presentation materials. Stay abreast of existing and emerging technologies and QC tools in the cancer genomics space. Other duties as assigned. This at-will position is wholly or partially funded by contractual grant funding which is renewed under provisions set by the grantor of the contract. Employment will be contingent upon the continued receipt of these grant funds and satisfactory job performance.

Unit Education: Bachelor's degree in Computer Science, Informatics, Bioinformatics, Biological Sciences or related field, or four (4) years of equivalent experience required. Master's or doctoral degree in Computer Science, Informatics, Bioinformatics, Biological Sciences or related field highly preferred.

Unit Experience: Minimum three (3) years of experience working in data quality and integrity engineering or testing required. Experience with data modeling, analysis, design, development, testing, and documentation required. Experience with data quality standards and practices required. Experience writing and executing data-centric test cases to validate data required. Experience writing database queries, reading and understanding database queries, and utilizing other database artifacts required. Experience with Python required. Experience working with Linux/Unix systems and basic shell scripting required. Experience with biospecimen and clinical data curation preferred. Experience with advanced high-throughput genomic technologies preferred. Experience providing bioinformatics services or support preferred. Experience using NCI datasets (TCGA, TARGET, and CGCI) preferred. Experience with graph and NoSQL databases preferred.

Unit Job Function Competencies: Ability to lead across a collaborative team environment required. Ability and willingness to acquire new programming languages, statistical and computational methods, and background in research area required. Ability to prioritize and manage workload to meet critical project milestones and deadlines required. Confidentiality related to sensitive matters such as strategic initiatives, trade secrets, quiet periods, and scientific discoveries yet to be put in the public domain required. Ability to take a broad plan and break it into incremental tasks and oversee the completion of each task required. Ability to come into a team used to minimal supervision and oversight and ensure accountability for deliverables and outcomes required. Ability to persuade others to adopt new structures or systems in order to meet objectives required. Ability to gain the trust of management in order to gain the authority to successfully coordinate the team required.