Somatic Mosaicism across Human Tissues (SMaHT) Network Data Use Policy

Version 1.1.2 – Dated April 3, 2024
SMaHT Website: https://smaht.org/
SMaHT Data Portal: https://data.smaht.org/

I. Introduction

This policy encourages open sharing of the data generated by the Somatic Mosaicism across Human Tissues (SMaHT) network.

All SMaHT network grants are engaged in developing or refining technology to investigate how somatic mosaicism in human cells influences biology and disease. SMaHT will systematically document and catalog DNA sequence variants within genomes, identified using state-of-the-art sequencing technologies. SMaHT will spur technological development, enabling researchers to detect diverse somatic genetic variants.

Methodologies, data, and technologies generated by the SMaHT network will be shared with the broader research community to spur continued advancement.

II. Terminology

Data Producers: This refers to the different categories of SMaHT-funded projects that will generate data. These projects include the Genome Characterization Centers (GCCs), the Tool & Technology Development (TTD) projects, the Tissue Procurement Center (TPC), the SMaHT Organizational Center (OC), and the Data Analysis Center (DAC).

Inherited variants: Refers to any genetic variation present in every cell of a donor’s body. These variants are typically inherited from the donor’s mother via the egg or their father via the sperm and are also referred to as “germline.” Inherited variants have the potential to result in donor identification.

Non-inherited variants: Refers to any genetic variation that is not present in every cell of a donor’s body. These variants arise after the fusion of a sperm and an egg and are also referred to as “somatic.” Non-inherited variants are unlikely to result in donor identification.

Functional genetic data: Refers to any data that reflect modifications to DNA other than sequence variation, as well as downstream gene expression data. Examples include chromatin accessibility measures, epigenetic DNA modification, and single-cell or bulk RNA count data. These data are unlikely to result in donor identification.

Protected SMaHT Data: Refers to data under restricted access as they can result in donor identification. These data will be stored in a protected database and only accessible to approved researchers. The protected data include but are not limited to:

  • Information from the donor’s medical record that could lead to donor identification, such as health and disease history
  • Inherited (aka germline) variant calls and/or DNA and RNA sequence data from an individual donor

Open SMaHT Data: Refers to data without restricted access as they are unlikely to result in donor identification. The open data include but are not limited to:

  • De-identified information from a donor’s medical record, such as sex, age group, and cause of death
  • Sequence data and variant calls that are aggregated across multiple donors
  • Non-inherited (aka somatic) variant data from individual donors

Approved Researcher: Refers to a scientific researcher who is approved to access protected SMaHT data. Scientific researchers can only access protected SMaHT data after approval is granted through formal access control. Examples of formal data access control include, but are not limited to, dbGaP or an internal approval process for SMaHT-affiliated researchers and their institutions.

General Researcher: Refers to a scientific researcher who does not have access to protected SMaHT data.

III. SMaHT Data Sharing, Permissions & Availability

SMaHT data covered under this data use policy are as follows:

  • DNA sequencing data
  • RNA sequencing data
  • Epigenetic profiles
  • Protocols (both experimental and computational) used to generate and process data
  • Data associated with DNA & RNA sequencing data. This includes (but is not limited to) the metadata files produced from sequencing and tissue and donor-specific information characterizing the DNA & RNA data.

As defined in Section II, SMaHT data will be classified into two categories: protected and open. Data availability is tabulated below:

| Data Type | Public or General Researcher | Approved Researcher |
| --- | --- | --- |
| Protocols used to generate data | Available | Available |
| De-identified information from a donor’s medical record, such as sex, age group, and cause of death | Available | Available |
| Non-inherited (aka somatic) variant data from individual donors | Available | Available |
| Individual-level functional data | Available | Available |
| Sequence data and variant calls that are aggregated across multiple donors | Available | Available |
| Inherited (aka germline) variant calls and/or DNA and RNA sequence data from an individual donor | Unavailable | Available |
| Information from the donor’s medical record that could lead to donor identification, such as health and disease history | Unavailable | Available |
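
For groups that need to encode this table in software, the availability rules can be sketched as a simple lookup. The Python below is a minimal illustration only and is not part of the SMaHT Data Portal; the short data-type keys, the `Role` enum, and the `is_available` function are hypothetical names introduced for this example. It covers the default rules in this section and does not account for the sample-specific limitations in Appendix B.

```python
from enum import Enum


class Role(Enum):
    """Researcher categories as defined in Section II."""
    GENERAL = "Public or General Researcher"
    APPROVED = "Approved Researcher"


# Hypothetical short keys standing in for the rows of the table above.
OPEN_DATA = {
    "protocols",
    "deidentified_donor_info",
    "somatic_variants_individual",
    "functional_individual",
    "aggregated_sequence_and_variants",
}
PROTECTED_DATA = {
    "germline_or_individual_sequence",
    "identifying_medical_record",
}


def is_available(data_type: str, role: Role) -> bool:
    """Return True if the data type is available to the given role."""
    if data_type in OPEN_DATA:
        # Open data are available to everyone.
        return True
    if data_type in PROTECTED_DATA:
        # Protected data are available to approved researchers only.
        return role is Role.APPROVED
    raise KeyError(f"Unknown data type: {data_type}")
```

For example, `is_available("germline_or_individual_sequence", Role.GENERAL)` evaluates to `False`, while the same call with `Role.APPROVED` evaluates to `True`.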

A small amount of SMaHT data has additional data use and sharing limitations. These data and their limitations are described in Appendix B.

IV. Data Submission Schedule

To facilitate regular data transfer, each data producer will submit data to the DAC as soon as possible on a rolling submission schedule of approximately every 6 months.

V. Data Release Schedule

The DAC will make data that pass QC criteria available to SMaHT researchers for analysis immediately, without any embargo. Separately, the DAC will package all submitted data that pass QC criteria into a data freeze for regular release to all researchers, with an expected cadence of about 6-12 months between data freezes. Access to protected SMaHT data depends on whether the researcher has approval to access it, as described in Sections II and III. Data can be accessed at https://data.smaht.org.

VI. Using SMaHT Data

  1. Anyone may download and analyze open SMaHT data without restriction.
  2. Only approved researchers may download and analyze protected SMaHT data. Publications (including preprints) and presentations resulting from these data must not lead to the publication of potentially identifying information.
  3. Researchers using unpublished SMaHT data in a publication or presentation must contact the specific data producer to discuss possible coordinated publication. Unpublished data are data that have not yet been described in a peer-reviewed publication. Coordinated publication may take the form of (but is not limited to) active participation in, input on, or review of the publication or presentation by the data producer. Additionally, publications and presentations should cite SMaHT appropriately (see Section VII).
  4. Researchers using SMaHT data should follow the publication & attribution guidelines below (section VII). Efforts to release coordinated publications simultaneously (such as bundled publications in a single journal) will be encouraged.

VII. Publication & Attribution of SMaHT Data

  1. All publications using SMaHT data should follow the processes outlined in the SMaHT Publication Policy.
  2. Publications and/or presentations using SMaHT data must (a) cite the SMaHT Marker Paper (TBD: PMID/doi), (b) indicate which released SMaHT dataset was used, and (c) cite all relevant publications and preprints describing the data being used.
  3. SMaHT data producers that generated SMaHT data used in publications and/or presentations must be acknowledged. The individual grants comprising SMaHT and their grant numbers are available in Appendix A.
  4. An example of acknowledgments text that can be used is below:
    1. Some data used in this work are from the NIH-funded Somatic Mosaicism across Human Tissues (SMaHT) network and were provided by the SMaHT Data Analysis Center (DAC) [1UM1DA058230] on behalf of the SMaHT network. More information about the SMaHT network is available online at https://smaht.org/.

VIII. Compliance

When a violation of the SMaHT data use policy is identified, it should be reported to SMaHT’s NIH Program Directors (Amy Lossie - amy.lossie@nih.gov; Geetha Senthil - senthilgs@mail.nih.gov; Jill Morris - jill.morris@nih.gov). The NIH Program Directors will seek to identify the cause(s) of the violation, determine whether the data user needs to take any action to remedy the breach, and, if remediating actions are recommended, ensure that they are taken.

This policy will be reviewed and, if necessary, updated semi-annually.

Contributors

SMaHT Policy Working Group
Co-Chairs: Jimmy Bennett (PI - UW-SCRI GCC), Lucinda Fulton (PI - OC), Heather Lawson (PI - OC)
Document Coordinator: Jeffrey Ou (Project Manager - UW-SCRI GCC)

SMaHT Data Analysis Center
Members: Peter Park (PI - Harvard University), Elizabeth Chun (Project Manager - Harvard University)

SMaHT Network Grantees - Appendix A

Revisions

  • Version 1.1.2 - April 3, 2024
    • Added a link to the Publication Policy
  • Version 1.1.1 - March 22, 2024
    • Added additional clarity that the restrictions in Appendix B.I are for Yale Fibroblasts, iPSC lines, and other samples originating from living subjects
    • Added a contact point for reporting violations of the SMaHT Data Use Policy
  • Version 1.1 - February 2024
    • Added details regarding datasets with specific data use and sharing limitations (see Appendix B)
  • Version 1.0 – January 11, 2024

Appendix A: Table of SMaHT Network Funded Projects

| Component | Awardee | Institution | Title | Award Number |
| --- | --- | --- | --- | --- |
| Organizational Center | Ting Wang (Contact), Heather Lawson, Lucinda Antonacci-Fulton | Washington University | WashU Somatic Mosaicism across Human Tissues (SMaHT) Program Organizational Center | 1U24NS132103 |
| Tissue Procurement Center | Thomas Bell | National Disease Research Interchange | Tissue Procurement Center (TPC) Supporting the Somatic Mosaicism across Human Tissues (SMaHT) Network | 1U24MH133204 |
| Data Analysis Center | Peter Park | Harvard University | Data Analysis Center for Somatic Mosaicism Across Human Tissues Network | 1UM1DA058230 |
| Genome Characterization Center | Kirstin Ardlie (Contact), John Niall, and Pradeep Natarajan | Broad Institute, Inc. | Whole Individual Comprehensive KnowlEDge: Somatic Mosaicism across Human Tissues (WICKed SMaHT) | UM1DA058235 |
| Genome Characterization Center | James Bennett (Contact), Evan Eichler, and Andrew Stergachis | University of Washington & Seattle Children’s Research Institute | Mosaicism in Human Tissues, from Telomere to Telomere to RFA-22-013: Somatic Mosaicism across Human Tissues Program: Genome Characterization Centers | UM1DA058220 |
| Genome Characterization Center | Soren Germer (Contact) and Samuel Aparicio | New York Genome Center | New York Genome Characterization Center: Somatic Mosaicism across Human Tissues | UM1DA058236 |
| Genome Characterization Center | Richard Gibbs, Rui Chen, and Harsha Doddapaneni | Baylor College of Medicine | Comprehensive Somatic Variant Characterization at the HGSC | UM1DA058229 |
| Genome Characterization Center | Ting Wang (Contact), Robert Fulton and Hui Shen | Washington University | WashU-VAI Somatic Mosaicism across Human Tissues (SMaHT) Program Genome Characterization Center | UM1DA058219 |
| Tool and Technology Development | Alexej Abyzov | Mayo Clinic Rochester | Hybrid approach for comprehensive mutation detection in a cell | UG3NS132128 |
| Tool and Technology Development | Kathleen Burns (Contact), Bradley Bernstein, and Alice Eungung Lee | Dana-Farber Cancer Institute | Single molecule detection of L1 insertions and intermediates | UG3NS132127 |
| Tool and Technology Development | Fei Chen (Contact), Jason Buenrostro, Jason Daniel, and Gad Getz | Broad Institute, Inc. | A Platform for Scalable Spatial Somatic Variant Profiling | UG3NS132135 |
| Tool and Technology Development | Sangita Choudhury (Contact), Alice Eungung Lee, and Christopher Walsh | Boston Children’s Hospital | Detection and Characterization of Somatic Mutations in Human Tissue Utilizing Duplex-Consensus Sequencing | UG3NS132144 |
| Tool and Technology Development | Gilad Evrony | New York University School of Medicine | Ultra-High Fidelity Single-Molecule Profiling of Mosaic Double- and Single-Strand DNA Mutations and Damage | UG3NS132024 |
| Tool and Technology Development | Thomas Fazzio (Contact) and Manuel Garber | University of Massachusetts Medical School Worcester | varCUT&Tag: A Method for Simultaneous Identification and Characterization of Sequence Variants in Regulatory Elements and Genes | UG3NS132136 |
| Tool and Technology Development | Fulai Jin (Contact) and Yan Li | Case Western Reserve University | Simultaneous mapping of somatic mosaicism and kb-resolution 3D genome in single cells | UG3NS132061 |
| Tool and Technology Development | Dan Landau (Contact) and Rahul Satija | Weill Medical College of Cornell University | Single-Cell Multi-omics to Link Clonal Mosaicism (CM) Genotypes with Chromatin, Epigenomic, Transcriptomic and Protein Phenotypes | UG3NS132139 |
| Tool and Technology Development | Gabor Marth (Contact) and Hunter Underhill | University of Utah | A reference-free computational algorithm for comprehensive somatic mosaic mutation detection | UG3NS132134 |
| Tool and Technology Development | Ryan Mills (Contact), Alan Boyle, and Michael McConnell | University of Michigan Ann Arbor | Molecular and Computational Tools for Identifying Somatic Mosaicism in Human Tissues | UG3NS132084 |
| Tool and Technology Development | Fritz Sedlazeck (Contact) and Tao Wu | Baylor College of Medicine | Identification of somatic/ mosaic SV and transposon activity and their crosstalk to DNA epigenetic Modifications | UG3NS132105 |
| Tool and Technology Development | Alexander Urban | Stanford University | Establishing and benchmarking advanced methods to comprehensively characterize somatic genome variation in single human cells | UG3NS132146 |
| Tool and Technology Development | Christopher Walsh (Contact) and Peter Park | Boston Children’s Hospital | Development of an Efficient High Throughput Technique for the Identification of High-Impact Non-Coding Somatic Variants Across Multiple Tissue Types | UG3NS132138 |
| Tool and Technology Development | Chenghang Zong | Baylor College of Medicine | Develop accurate high-coverage and high-throughput single-cell Duplex-seq chemistry and multi-omics platforms for simultaneous profiling of somatic mutation and the transcriptome in single human cells | UG3NS132132 |

Appendix B: Tables of Data Use Conditions for Samples with Special Data Sharing Restrictions

B.I Yale Fibroblasts, iPSC lines and other samples originating from living subjects

Sample set description: Yale Fibroblasts, iPSC lines, and other samples originating from living subjects.

Data use table:

| Data Type | Public or General Researcher | Approved Researcher |
| --- | --- | --- |
| Protocols used to generate data | Available | Available |
| De-identified information from a donor such as sex, age group | Available | Available |
| Non-inherited (aka somatic) variant data from individual donors | Unavailable | Available |
| Individual level functional data | Unavailable | Available |
| Sequence data and variant calls that are aggregated across multiple donors | Available | Available |
| Inherited (aka germline) variant calls and/or DNA and RNA sequence data from an individual donor | Unavailable | Available |
| Information from the donor’s medical record that could lead to donor identification, such as health and disease history | Unavailable | Available |
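
The difference between the B.I conditions and the default conditions in Section III can be checked mechanically. The Python below is a minimal sketch under stated assumptions: the short keys and the `newly_restricted` function are hypothetical names introduced here, and each dictionary records only the "Public or General Researcher" column of the corresponding table (True for Available, False for Unavailable).

```python
# Availability to the Public or General Researcher under the default
# rules in Section III. Keys are hypothetical short names for table rows.
DEFAULT_GENERAL = {
    "protocols": True,
    "deidentified_donor_info": True,
    "somatic_variants_individual": True,
    "functional_individual": True,
    "aggregated_sequence_and_variants": True,
    "germline_or_individual_sequence": False,
    "identifying_medical_record": False,
}

# Availability to the Public or General Researcher for the B.I sample set.
B1_GENERAL = {
    "protocols": True,
    "deidentified_donor_info": True,
    "somatic_variants_individual": False,
    "functional_individual": False,
    "aggregated_sequence_and_variants": True,
    "germline_or_individual_sequence": False,
    "identifying_medical_record": False,
}


def newly_restricted(default: dict, special: dict) -> list:
    """List data types that are open by default but restricted in `special`."""
    return sorted(k for k, is_open in default.items() if is_open and not special[k])


print(newly_restricted(DEFAULT_GENERAL, B1_GENERAL))
# prints ['functional_individual', 'somatic_variants_individual']
```

In other words, for B.I samples the individual-level somatic variant data and individual-level functional data move from open to protected, while every other row matches the default table.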