Somatic Mosaicism across Human Tissues (SMaHT) Network Data Use Policy
Version 1.1.2 – Dated April 3, 2024
SMaHT Website: https://smaht.org/
SMaHT Data Portal: https://data.smaht.org/
I. Introduction
This policy encourages sharing openly the data generated by the Somatic Mosaicism across Human Tissues (SMaHT) network.
All SMaHT network grants are engaged in developing or refining technology to investigate how somatic mosaicism in human cells influences biology and disease. SMaHT will systematically document and catalog DNA sequence variants within genomes, identified using state-of-the-art sequencing technologies. SMaHT will spur technological development, enabling researchers to detect diverse somatic genetic variants.
Methodologies, data, and technologies generated by the SMaHT network will be shared with the broader research community to spur continued advancement.
II. Terminology
Data Producers: This refers to the different categories of SMaHT-funded projects that will generate data. These projects include the Genome Characterization Centers (GCCs), the Tool & Technology Development (TTD) projects, the Tissue Procurement Center (TPC), the SMaHT Organizational Center (OC), and the Data Analysis Center (DAC).
Inherited variants: Refers to any genetic variation present in every cell of a donor’s body. These variants are typically inherited from the donor’s mother via the egg or from their father via the sperm and are also referred to as “germline.” Inherited variants have the potential to result in donor identification.
Non-inherited variants: Refers to any genetic variation that is not present in every cell of a donor’s body. These variants arise after the fusion of a sperm and an egg and are also referred to as “somatic.” Non-inherited variants are unlikely to result in donor identification.
Functional genetic data: Refers to any data that reflect modifications to DNA other than sequence variation, as well as downstream gene expression data. Examples include chromatin accessibility measures, epigenetic DNA modification, and single-cell or bulk RNA count data. These data are unlikely to result in donor identification.
Protected SMaHT Data: Refers to data under restricted access as they can result in donor identification. These data will be stored in a protected database and only accessible to approved researchers. The protected data include but are not limited to:
- Information from the donor’s medical record that could lead to donor identification, such as health and disease history
- Inherited (aka germline) variant calls and/or DNA and RNA sequence data from an individual donor
Open SMaHT Data: Refers to data without restricted access as they are unlikely to result in donor identification. The open data include but are not limited to:
- De-identified information from a donor’s medical record, such as sex, age group, and cause of death
- Sequence data and variant calls that are aggregated across multiple donors
- Non-inherited (aka somatic) variant data from individual donors
Approved Researcher: Refers to a scientific researcher who is approved to access protected SMaHT data. Scientific researchers can only access protected SMaHT data after approval is granted through formal access control. Examples of formal data access control include, but are not limited to, dbGaP or an internal approval process for SMaHT-affiliated researchers and their institutions.
General Researcher: Refers to a scientific researcher who does not have access to protected SMaHT data.
III. SMaHT Data Sharing, Permissions & Availability
SMaHT data covered under this data use policy are as follows:
- DNA sequencing data
- RNA sequencing data
- Epigenetic profiles
- Protocols (both experimental and computational) used to generate and process data
- Data associated with DNA & RNA sequencing data. This includes (but is not limited to) the metadata files produced from sequencing and tissue and donor-specific information characterizing the DNA & RNA data.
As defined in Section II, SMaHT data will be classified into two categories: protected and open. Data availability is tabulated below:
Data Type | Public or General Researcher | Approved Researcher |
---|---|---|
Protocols used to generate data | Available | Available |
De-identified information from a donor’s medical record, such as sex, age group, and cause of death | Available | Available |
Non-inherited (aka somatic) variant data from individual donors | Available | Available |
Individual-level functional data | Available | Available |
Sequence data and variant calls that are aggregated across multiple donors | Available | Available |
Inherited (aka germline) variant calls and/or DNA and RNA sequence data from an individual donor | Unavailable | Available |
Information from the donor’s medical record that could lead to donor identification, such as health and disease history | Unavailable | Available |
A small amount of SMaHT data has additional data use and sharing limitations. These data and their limitations are described in Appendix B.
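The availability table above can be read as a simple lookup from data type and researcher role to access. The following is a minimal Python sketch of that reading, not part of the policy and not how the SMaHT Data Portal is implemented. The labels in `OPEN_DATA` and `PROTECTED_DATA`, the `Role` enum, and the `is_available` function are hypothetical names introduced for illustration, and the sketch covers only the default classification, not the Appendix B exceptions.

```python
from enum import Enum


class Role(Enum):
    """The two columns of the availability table (hypothetical names)."""
    GENERAL = "public or general researcher"
    APPROVED = "approved researcher"


# Hypothetical short labels for the rows of the table marked "Available"
# to both general and approved researchers (open data).
OPEN_DATA = {
    "protocols",
    "deidentified_donor_info",
    "somatic_variants_individual",
    "functional_data_individual",
    "aggregated_sequence_and_variants",
}

# Hypothetical short labels for the rows marked "Unavailable" to general
# researchers and "Available" to approved researchers (protected data).
PROTECTED_DATA = {
    "germline_variants_or_sequence_individual",
    "identifying_medical_record_info",
}


def is_available(data_type: str, role: Role) -> bool:
    """Return True if the table lists data_type as available to role."""
    if data_type in OPEN_DATA:
        return True
    if data_type in PROTECTED_DATA:
        return role is Role.APPROVED
    raise KeyError(f"unknown data type: {data_type}")
```

For example, `is_available("somatic_variants_individual", Role.GENERAL)` is true under the default table, while `is_available("identifying_medical_record_info", Role.GENERAL)` is false.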
IV. Data Submission Schedule
To facilitate regular data transfer, each data producer will submit data to the DAC as soon as possible, on a rolling schedule of approximately one submission every 6 months.
V. Data Release Schedule
The DAC will make data that pass QC criteria immediately available to SMaHT researchers for analysis, without any embargo. Separately, the DAC will package all submitted data that pass QC criteria into a data freeze for regular release to all researchers, with an expected cadence of about 6-12 months between data freezes. Access to protected SMaHT data depends on whether the researcher has approval to access it, as described in Sections II and III. Data will be accessible at https://data.smaht.org.
VI. Using SMaHT Data
- Anyone may download and analyze open SMaHT data without restriction.
- Only approved researchers may download and analyze protected SMaHT data. Publications (including preprints) and presentations resulting from these data must not lead to the publication of potentially identifying information.
- Researchers using unpublished SMaHT data in a publication or presentation must contact the specific data producer to discuss possible coordinated publication. Unpublished data are data that have not yet been described in a peer-reviewed publication. Coordinated publication may take the form of (but is not limited to) active participation, input, or review of the publication or presentation by the data producer. Additionally, publications and presentations should cite SMaHT appropriately (see Section VII).
- Researchers using SMaHT data should follow the publication & attribution guidelines below (section VII). Efforts to release coordinated publications simultaneously (such as bundled publications in a single journal) will be encouraged.
VII. Publication & Attribution of SMaHT Data
- All publications using SMaHT data should follow the processes outlined in the SMaHT Publication Policy.
- Publications and/or presentations using SMaHT data must (a) cite the SMaHT Marker Paper (TBD: PMID/doi), (b) indicate which released SMaHT dataset was used, and (c) cite all relevant publications and preprints describing the data being used.
- SMaHT data producers that generated SMaHT data used in publications and/or presentations must be acknowledged. The individual grants comprising SMaHT and their grant numbers are available in Appendix A.
- An example of acknowledgments text that can be used is below:
- Some data used in this work are from the NIH-funded Somatic Mosaicism across Human Tissues (SMaHT) network and were provided by the SMaHT Data Analysis Center (DAC) [1UM1DA058230] on behalf of the SMaHT network. More information about the SMaHT network is available online at https://smaht.org/.
VIII. Compliance
When a violation of the SMaHT data use policy is identified, it should be reported to SMaHT’s NIH Program Directors (Amy Lossie - amy.lossie@nih.gov; Geetha Senthil - senthilgs@mail.nih.gov; Jill Morris - jill.morris@nih.gov). The NIH Program Directors will seek to identify the cause(s) of the violation, determine whether the data user needs to take any action to remedy the breach, and, if remediating actions are recommended, ensure that they are taken.
This policy will be reviewed and, if necessary, updated semi-annually.
Contributors
SMaHT Policy Working Group
Co-Chairs: Jimmy Bennett (PI - UW-SCRI GCC), Lucinda Fulton (PI - OC), Heather Lawson (PI - OC)
Document Coordinator: Jeffrey Ou (Project Manager - UW-SCRI GCC)
SMaHT Data Analysis Center
Members: Peter Park (PI - Harvard University), Elizabeth Chun (Project Manager - Harvard University)
SMaHT Network Grantees - Appendix A
Revisions
- Version 1.1.2 - April 3, 2024
- Added a link to the Publication Policy
- Version 1.1.1 - March 22, 2024
- Added additional clarity that the restrictions in Appendix B.I are for Yale Fibroblasts, iPSC lines, and other samples originating from living subjects
- Added a contact point for reporting violations of the SMaHT Data Use Policy
- Version 1.1 - February 2024
- Added details regarding datasets with specific data use and sharing limitations (see Appendix B)
- Version 1.0 – January 11, 2024
Appendix A: Table of SMaHT Network Funded Projects
Component | Awardee | Institution | Title | Award Number |
---|---|---|---|---|
Organizational Center | Ting Wang (Contact), Heather Lawson, Lucinda Antonacci-Fulton | Washington University | WashU Somatic Mosaicism across Human Tissues (SMaHT) Program Organizational Center | 1U24NS132103 |
Tissue Procurement Center | Thomas Bell | National Disease Research Interchange | Tissue Procurement Center (TPC) Supporting the Somatic Mosaicism across Human Tissues (SMaHT) Network | 1U24MH133204 |
Data Analysis Center | Peter Park | Harvard University | Data Analysis Center for Somatic Mosaicism Across Human Tissues Network | 1UM1DA058230 |
Genome Characterization Center | Kirstin Ardlie (Contact), John Niall, and Pradeep Natarajan | Broad Institute, Inc. | Whole Individual Comprehensive KnowlEDge: Somatic Mosaicism across Human Tissues (WICKed SMaHT) | UM1DA058235 |
Genome Characterization Center | James Bennett (Contact), Evan Eichler, and Andrew Stergachis | University of Washington & Seattle Children’s Research Institute | Mosaicism in Human Tissues, from Telomere to Telomere to RFA-22-013: Somatic Mosaicism across Human Tissues Program: Genome Characterization Centers | UM1DA058220 |
Genome Characterization Center | Soren Germer (Contact) and Samuel Aparicio | New York Genome Center | New York Genome Characterization Center: Somatic Mosaicism across Human Tissues | UM1DA058236 |
Genome Characterization Center | Richard Gibbs, Rui Chen, and Harsha Doddapaneni | Baylor College of Medicine | Comprehensive Somatic Variant Characterization at the HGSC | UM1DA058229 |
Genome Characterization Center | Ting Wang (Contact), Robert Fulton and Hui Shen | Washington University | WashU-VAI Somatic Mosaicism across Human Tissues (SMaHT) Program Genome Characterization Center | UM1DA058219 |
Tool and Technology Development | Alexej Abyzov | Mayo Clinic Rochester | Hybrid approach for comprehensive mutation detection in a cell | UG3NS132128 |
Tool and Technology Development | Kathleen Burns (Contact), Bradley Bernstein, and Alice Eungung Lee | Dana-Farber Cancer Institute | Single molecule detection of L1 insertions and intermediates | UG3NS132127 |
Tool and Technology Development | Fei Chen (Contact), Jason Buenrostro, Jason Daniel, and Gad Getz | Broad Institute, Inc. | A Platform for Scalable Spatial Somatic Variant Profiling | UG3NS132135 |
Tool and Technology Development | Sangita Choudhury (Contact), Alice Eungung Lee, and Christopher Walsh | Boston Children’s Hospital | Detection and Characterization of Somatic Mutations in Human Tissue Utilizing Duplex-Consensus Sequencing | UG3NS132144 |
Tool and Technology Development | Gilad Evrony | New York University School of Medicine | Ultra-High Fidelity Single-Molecule Profiling of Mosaic Double- and Single-Strand DNA Mutations and Damage | UG3NS132024 |
Tool and Technology Development | Thomas Fazzio (Contact) and Manuel Garber | University of Massachusetts Medical School Worcester | varCUT&Tag: A Method for Simultaneous Identification and Characterization of Sequence Variants in Regulatory Elements and Genes | UG3NS132136 |
Tool and Technology Development | Fulai Jin (Contact) and Yan Li | Case Western Reserve University | Simultaneous mapping of somatic mosaicism and kb-resolution 3D genome in single cells | UG3NS132061 |
Tool and Technology Development | Dan Landau (Contact) and Rahul Satija | Weill Medical College of Cornell University | Single-Cell Multi-omics to Link Clonal Mosaicism (CM) Genotypes with Chromatin, Epigenomic, Transcriptomic and Protein Phenotypes | UG3NS132139 |
Tool and Technology Development | Gabor Marth (Contact) and Hunter Underhill | University of Utah | A reference-free computational algorithm for comprehensive somatic mosaic mutation detection | UG3NS132134 |
Tool and Technology Development | Ryan Mills (Contact), Alan Boyle, and Michael McConnell | University of Michigan Ann Arbor | Molecular and Computational Tools for Identifying Somatic Mosaicism in Human Tissues | UG3NS132084 |
Tool and Technology Development | Fritz Sedlazeck (Contact) and Tao Wu | Baylor College of Medicine | Identification of somatic/ mosaic SV and transposon activity and their crosstalk to DNA epigenetic Modifications | UG3NS132105 |
Tool and Technology Development | Alexander Urban | Stanford University | Establishing and benchmarking advanced methods to comprehensively characterize somatic genome variation in single human cells | UG3NS132146 |
Tool and Technology Development | Christopher Walsh (Contact) and Peter Park | Boston Children’s Hospital | Development of an Efficient High Throughput Technique for the Identification of High-Impact Non-Coding Somatic Variants Across Multiple Tissue Types | UG3NS132138 |
Tool and Technology Development | Chenghang Zong | Baylor College of Medicine | Develop accurate high-coverage and high-throughput single-cell Duplex-seq chemistry and multi-omics platforms for simultaneous profiling of somatic mutation and the transcriptome in single human cells | UG3NS132132 |
Appendix B: Tables of Data Use Conditions for Samples with Special Data Sharing Restrictions
B.I Yale Fibroblasts, iPSC lines and other samples originating from living subjects
Sample set description: Yale Fibroblasts, iPSC lines, and other samples originating from living subjects.
Data use table:
Data Type | Public or General Researcher | Approved Researcher |
---|---|---|
Protocols used to generate data | Available | Available |
De-identified information from a donor, such as sex and age group | Available | Available |
Non-inherited (aka somatic) variant data from individual donors | Unavailable | Available |
Individual-level functional data | Unavailable | Available |
Sequence data and variant calls that are aggregated across multiple donors | Available | Available |
Inherited (aka germline) variant calls and/or DNA and RNA sequence data from an individual donor | Unavailable | Available |
Information from the donor’s medical record that could lead to donor identification, such as health and disease history | Unavailable | Available |
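
The B.I data use table above can also be written as a small mapping, which makes it easy to see which data types remain open for this sample set. The following is a minimal Python sketch for illustration only. The row labels, the `BI_AVAILABILITY` mapping, and the `bi_open_data_types` function are hypothetical names, and the values are copied from the table.

```python
# Hypothetical short labels for the rows of the B.I table.
# Each value is (public_or_general_researcher, approved_researcher).
BI_AVAILABILITY = {
    "protocols": (True, True),
    "deidentified_donor_info": (True, True),
    "somatic_variants_individual": (False, True),
    "functional_data_individual": (False, True),
    "aggregated_sequence_and_variants": (True, True),
    "germline_variants_or_sequence_individual": (False, True),
    "identifying_medical_record_info": (False, True),
}


def bi_open_data_types() -> list[str]:
    """Return the B.I data types available to general researchers, sorted."""
    return sorted(
        name for name, (general, _approved) in BI_AVAILABILITY.items() if general
    )
```

Under this table, only protocols, de-identified donor information, and data aggregated across multiple donors are available to general researchers for the B.I samples.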