4D Nucleome Hackathon

Project 1. Finding TF motifs implicated in Hi-C contacts through machine learning.

Leading lab: Noble Lab at University of Washington
Stakeholders: Anupama Jha (anupamaj@uw.edu), Xiao Wang (wang3702@uw.edu)
Desired deliverable: A benchmark of current Hi-C machine learning models for finding known and novel sequence elements relevant for tissue-specific Hi-C contacts.
Expected coding experience level: Intermediate-Advanced.
Motivation: Two major classes of machine learning models exist for predicting Hi-C contacts. The first class predicts Hi-C contacts from DNA sequence alone and is evaluated on held-out chromosomes in the same tissue. Such models can be used to study the impact of sequence variation on Hi-C contacts within a tissue. Examples include Akita, Orca, and DeepC. The second class predicts Hi-C contacts by combining DNA sequence with other epigenetic measurements such as ATAC-seq, DNase-seq and TF ChIP-seq; it can predict Hi-C contacts on held-out chromosomes within a tissue and, provided the epigenetic tracks are available, in new tissues. Examples include Epiphany and C.Origami. While the predictive performance of these models has been thoroughly studied, evaluations of their interpretability, especially their ability to capture sequence elements relevant to tissue-specific gene regulation, are lacking.
Post-Hackathon Summary: Two main classes of machine learning models exist for predicting Hi-C contacts. The first class predicts Hi-C contacts from DNA sequence alone and is evaluated on held-out chromosomes in the same tissue. Such models can be used to study the impact of sequence variation on Hi-C contacts within a tissue. Examples include Akita, Orca, and DeepC. The second class predicts Hi-C contacts by combining DNA sequence with other epigenetic measurements such as ATAC-seq, DNase-seq and TF ChIP-seq; it can predict Hi-C contacts on held-out chromosomes within a tissue and, provided the epigenetic tracks are available, in new tissues. Examples include Epiphany and C.Origami. While the predictive performance of these models has been thoroughly studied, evaluations of their interpretability, especially their ability to capture sequence elements relevant to tissue-agnostic and tissue-specific gene regulation, are lacking. In this project, we systematically evaluated one model from each class, Orca and Epiphany, on its ability to capture tissue-agnostic and tissue-specific gene expression signals, such as transcription factors. We ran Integrated Gradients, a deep learning interpretation method, to compute attributions for both models. We then leveraged RNA-seq and ATAC-seq data from GTEx and ENCODE across hundreds of tissues to find tissue-agnostic transcription factors. In parallel, we found tissue-specific transcription factors in the three target tissues for which we had trained Hi-C models: h1ESC, GM12878 and HFF. Subsequently, we used ATAC-seq data in h1ESC, GM12878 and HFF to find the accessible regions of the genome, and searched for motif occurrences of transcription factors in the accessible regions of the held-out chromosomes using FIMO. Finally, we evaluated whether the models treated particular transcription factors as significant for 3D genome structure by comparing attribution scores at motif loci with those at background regions.
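As an illustration of the attribution step described in the summary, the following is a minimal sketch of Integrated Gradients on a toy logistic model. The toy model, its weights, the all-zero baseline and the function names are assumptions made for this example only; they are not the Orca or Epiphany models or the code used by the team.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy differentiable model (hypothetical stand-in for a Hi-C predictor)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_gradient(x, weights, bias):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    Gradients are averaged along the straight path from `baseline` to `x`,
    then scaled by (x - baseline) for each feature.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_gradient(point, weights, bias)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


# Toy input: four features, two of which should receive zero attribution
# (feature 1 equals the baseline, feature 3 has zero weight).
weights = [2.0, -1.0, 0.5, 0.0]
bias = -1.0
x = [1.0, 0.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("f(x) - f(baseline):", model(x, weights, bias) - model(baseline, weights, bias))
```

The last two printed values should agree closely, which is the completeness property of Integrated Gradients: attributions sum to the change in model output between the baseline and the input. In the project setting, the per-position attributions would then be summarized over motif loci and compared with background regions.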

Project 2. Inference of chromatin looping status from live cell imaging data.

Leading lab: Li lab at University of North Carolina, Chapel Hill (UNC-CH) and Hu lab at Cleveland Clinic
Stakeholders: Hongyu Yu (hongyuyu@email.unc.edu) and Shreya Mishra (mishras10@ccf.org)
Desired deliverable: (1) Develop a computational pipeline to estimate the frequency and duration of chromatin looping status from live cell imaging data. (2) Characterize the cell-to-cell variability of the kinetics of chromatin looping events. (3) Compare chromatin looping dynamics among different genomic loci.
Expected coding experience level: Basic familiarity with Python and R.
Motivation: Recently developed live cell imaging technologies (PMID: 31124784, 30038397, 33310227) provide a powerful tool to study the kinetics of chromatin spatial organization in live cells, facilitating a deeper understanding of chromatin folding dynamics and gene regulation. In contrast to the rapid development of experimental technologies, tailored computational methods for analyzing live cell imaging data are still lacking. The goal of this project is to develop a stand-alone, user-friendly computational pipeline to infer the underlying chromatin looping status from live cell imaging data. The resulting software can be applied to characterize the temporal dynamics of both CTCF-CTCF loops and enhancer-promoter interactions, and has the potential to yield new insights into transcriptional bursting and gene regulation.
Post-Hackathon Summary: During the hackathon, we achieved the following:
1) Applied Mach et al.'s HMM to Gabriele et al.'s data and showed that the loop fraction predicted by the HMM differs from the one predicted by BILD.
2) Tried to replicate the analysis in Gabriele et al. using the tracklib library developed by the authors. We successfully replicated the MSD fitting but not the loop analysis.
3) Systematically improved the current HMM in the following directions: (a) Tested various imputation methods to fill in missing data points: filling with the preceding value, smoothing the trajectory with a Savitzky–Golay filter, imputing with (weighted) moving average models, interpolating with splines, and applying Kalman filters. (b) Changed the emission distribution from normal to multivariate normal. (c) Fitted the parameters of the emission distributions before fitting the HMM: for the un-looped state, parameters were estimated from control strains; for the looped state, the variance was set to 15% of that of the un-looped state. (d) Rederived the expectation-maximization algorithm used in the HMM to handle trajectories with missing values, and modified the hmmlearn library accordingly.
4) Generated simulated data to evaluate our new model: (a) Found the optimal transition matrix by comparing the MSD of real and simulated data. (b) Simulated missing values in trajectories based on the missing-data patterns estimated from real data.
In the end, we implemented all of the above improvements in Python. The improved model has higher accuracy and precision on the simulated data. It also returns more realistic loop fractions and loop lifetimes on real data.
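One common way to let an HMM cope with missing frames, sketched below, is to give a missing observation an emission likelihood of 1 in every state, so that only the transition probabilities carry information across the gap. This is a standard textbook treatment and not necessarily the derivation the team used; the sketch also uses a univariate Gaussian emission and Viterbi decoding for brevity. All numerical values (state means, standard deviation, transition probabilities, the toy trajectory) are made up for illustration; only the 15% variance ratio comes from the summary above.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi_with_missing(obs, log_start, log_trans, means, stds):
    """Most likely state path for a Gaussian HMM; `None` marks a missing frame.

    A missing frame contributes log-likelihood 0 (likelihood 1) in every state,
    which marginalizes the unobserved value out of the model.
    """
    n_states = len(means)

    def log_emit(x, s):
        return 0.0 if x is None else gaussian_logpdf(x, means[s], stds[s])

    score = [log_start[s] + log_emit(obs[0], s) for s in range(n_states)]
    backptr = []
    for x in obs[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit(x, s))
        score = new_score
        backptr.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# State 0 = looped, state 1 = un-looped (hypothetical parameters, in micrometres).
unlooped_std = 0.12
looped_std = math.sqrt(0.15 * unlooped_std ** 2)  # looped variance = 15% of un-looped
means = [0.15, 0.45]
stds = [looped_std, unlooped_std]
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]

# Toy trajectory of locus-to-locus distances with three missing frames.
distances = [0.50, 0.42, None, 0.16, 0.14, None, None, 0.13, 0.48, 0.52]
path = viterbi_with_missing(distances, log_start, log_trans, means, stds)
loop_fraction = path.count(0) / len(path)
print("decoded states:", path)
print("loop fraction:", loop_fraction)
```

The same emission rule carries over to the forward-backward recursions used inside expectation-maximization, where missing frames also contribute nothing to the updates of the emission parameters.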

Project 3. In silico variant prioritization using sequence-based predictive models.

Leading lab: Pollard Lab at UCSF
Stakeholders: Katie Gjoni (katie.gjoni@gladstone.ucsf.edu), Shu Zhang (shu.zhang@gladstone.ucsf.edu) and Katie Pollard (katherine.pollard@gladstone.ucsf.edu)
Desired deliverable: Implementing various machine learning models to score variants for disruption to predicted results.
Expected coding experience level: Strong background in Python, and experience working in bash.
Motivation: Predictive bioinformatics algorithms that take DNA sequences as input are frequently used to test the effects of genetic variants in high-throughput in silico perturbation experiments. Previously, there was no standard procedure for formatting individual variants into a set of ready-to-use inputs. Our lab recently developed SuPreMo, a computational tool that converts individual genetic variants into model-ready DNA sequences. In addition, SuPreMo is directly integrated with Akita, a model that predicts genome folding. This project intends to extend this work into a set of SuPreMo-based tools that comprehensively score variants for effects on various chromatin profiles and gene expression. This entails three coding days of 1) implementing machine learning models, 2) processing their outputs to allow for fair comparison, and 3) evaluating statistical methods for comparing outputs from reference and mutated sequences. These tools would be important for prioritizing putative pathogenic variants for experimental studies, decoding the grammar of noncoding DNA sequences, discovering new sequence motifs, designing tissue-specific enhancers, and uncovering novel roles of sequence elements.
Post-Hackathon Summary: SuPreMo is a pipeline that streamlines in silico mutagenesis (ISM) for sequence-based predictive models, and SuPreMo-Akita generates scores for disruption to genome folding. Our team set out to 1) incorporate new models into SuPreMo in addition to Akita, and 2) add functionality to SuPreMo. Adding new models to the pipeline allows for faster, easier, and more flexible prediction of variant effects on various features. Our team developed two new versions: SuPreMo-Enformer, which performs ISM on various predicted genomic tracks, including gene expression, accessibility, and epigenetic modifications, and SuPreMo-ExPecto, which performs ISM with predicted gene expression values for one gene at a time. Users can now take advantage of the flexible SuPreMo pipeline while also predicting variant effects on new genomic annotations across the different conditions provided by the three models, such as window size, cell type, assay type, and variant distance to feature. We added functionality that affects both SuPreMo-generated sequences and scores. Users can now input a set of individual-specific SNVs for SuPreMo to generate a personalized reference genome. A use case for this option is having variants from both a disease sample and its matched normal. In addition, we created options for mutagenesis of a variant and its surrounding regions. These options include mutating the GC content of the sequence, shuffling the sequence, and mutating transcription factor binding motifs in the sequence. With regard to scoring variants, users can now quantify the disruption caused by a variant and obtain a p-value when they input a set of control variants. In this case, the query variant score is compared to the null distribution of scores generated by the control variants. Finally, disruption scores can be weighted based on additional features of the variant. For example, a variant can be prioritized if it disrupts nearby genes, histone marks of active enhancers or promoters, or regions of chromatin accessibility. All of these options increase the possibilities of ISM with SuPreMo.
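The comparison of a query variant against a null distribution of control variants can be sketched as an empirical p-value. The function name, the add-one correction and the toy scores below are assumptions for illustration; SuPreMo's actual implementation and options may differ.

```python
def empirical_pvalue(query_score, control_scores):
    """One-sided empirical p-value of a query disruption score.

    Counts control scores at least as large as the query score. The add-one
    correction (an assumption of this sketch) keeps the p-value above zero
    when no control variant is as disruptive as the query.
    """
    if not control_scores:
        raise ValueError("at least one control score is required")
    n_as_extreme = sum(1 for s in control_scores if s >= query_score)
    return (n_as_extreme + 1) / (len(control_scores) + 1)


# Toy disruption scores (hypothetical values, higher = more disruptive).
controls = [0.02, 0.05, 0.01, 0.10, 0.03, 0.04, 0.07, 0.02, 0.06]
query = 0.09
p_value = empirical_pvalue(query, controls)
print("empirical p-value:", p_value)
```

With this estimator the smallest attainable p-value is 1 / (number of controls + 1), so the size of the control set bounds the resolution of the test.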

Project 4. Integrative analysis of single-cell Hi-C datasets.

Leading lab: Jian Ma Lab at CMU
Stakeholders: Akanksha Sachan (akanksha.11.05.07@gmail.com) and Wendy Yang (muyuy@andrew.cmu.edu)
Desired deliverable: In this project, we aim to build a workflow for scHi-C data analysis using existing software (Higashi and Fast-Higashi) to achieve the following two goals: (1) use Higashi/Fast-Higashi to embed single cells and impute scHi-C data, then use the embeddings to call compartments/TADs; (2) if time permits, identify single-cell-level 3D genome features (sub-compartments) with scGHOST based on the imputed scHi-C contact maps, and study the cell-to-cell variability of those features.
Expected coding experience level: Python, bash/terminal experience.
Motivation: Cellular heterogeneity can be observed by embedding cells according to their genome architectural features. This also enables connecting genome architecture to other epigenetic data modalities to assess its functional roles comprehensively. This project focuses on analysing the cell-to-cell variability of 3D genome features from sparse scHi-C data.
Post-Hackathon Summary: Standardizing embedding, imputation, and 3D genomic feature analysis for scHi-C datasets is challenging but necessary for mapping the heterogeneity of chromatin architecture across different cell types. To this end, we created a comprehensive single-cell Hi-C data analysis pipeline during the hackathon. We applied the pipeline to two datasets, from the human prefrontal cortex (5k cells) and the mouse hippocampal formation (15k cells), both sequenced using sn-m3C-seq. We benchmarked 5 computational methods for generating single-cell embeddings from Hi-C data, clustering the cells into their types and sub-types and scoring the results with quantitative metrics such as the adjusted Rand index (ARI) for a fair comparison across methods. We evaluated the performance of 2 methods (Higashi and scHiCluster) for imputing sparse single-cell Hi-C contact maps and visualized the intra-chromosomal matrices. We also extended the benchmarking to assess the ability of the methods to perform cell-type clustering using 3D genomic features such as A/B compartment scores computed in single cells. Quantifying such features across different cell types would aid the discovery of cell-type-specific compartments, sub-compartments, and other architectural features that may play functional roles in gene expression and DNA replication.
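For readers unfamiliar with the ARI metric mentioned above, the following is a minimal, dependency-free sketch of how it can be computed from annotated cell types and cluster assignments. The function name and the toy labels are illustrative assumptions; in practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case: nothing to disagree about

    # Contingency counts: cells sharing both a true type and a predicted cluster.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # e.g. both labelings put every cell in one cluster
    return (index - expected) / (max_index - expected)


# Toy example: four cells, annotated types versus clusters from an embedding.
cell_types = ["Exc", "Exc", "Inh", "Inh"]
clusters = [0, 0, 1, 2]
ari = adjusted_rand_index(cell_types, clusters)
print("ARI:", ari)
```

An ARI of 1 means the clustering reproduces the annotated cell types exactly (up to relabeling), while values near 0 indicate agreement no better than chance.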

Project 5. Evaluating the effects of Hi-C and TF on cis-regulation of gene expression.

Leading lab: Christina Leslie Lab at MSKCC
Stakeholders: Alireza Karbalayghareh (karbalayghareh@gmail.com) and Rui Yang (ruy4001@med.cornell.edu)
Desired deliverable: In this project, we aim to explore the impact of 3D genome structure on gene expression. GraphReg, a deep learning model, leverages 3D interactions along with 1D epigenomic data or genomic DNA sequences to predict gene expression. During the hackathon, participants will have the opportunity to engage in: a) Biological application of the model: Participants will apply the GraphReg model to germinal center B cell data. Using feature attributions, they will investigate potential enhancer-promoter interactions or distal regulatory elements influencing gene expression. b) Technical benchmarking of the data: Epiphany and ChromaFold are two deep learning models to predict Hi-C contact maps using 1D epigenomic tracks or scATAC-seq matrices. Participants will benchmark the effectiveness of GraphReg predictions by comparing experimental Hi-C data against model-predicted data, and analyze the results.
Expected coding experience level: Basic knowledge of Python, bash/terminal experience.
Motivation: In this project, we hope to explore two directions: a) a biological application of GraphReg to a novel dataset; b) a benchmark of real experimental Hi-C data vs. Hi-C data predicted by previously published models.
Post-Hackathon Summary: We have investigated the impact of Hi-C and transcription factors (TF) on cis-regulation of gene expression. We included 38 TF ChIP-seq datasets as inputs for GraphReg to examine the influence of TF binding on gene expression. Additionally, we assessed model performance using graphs constructed from real Hi-C, predicted Hi-C, and without a 3D structure. Our findings indicate that incorporating 3D information enhances the model's robustness in predicting in-silico perturbations, as confirmed by TF CRISPR knockout experiments. Notably, while the Epiphany model trained with mean squared error loss produces blurred predictions, it successfully identifies 90% of enhancer-promoter (E-P) interactions from real Hi-C data. Furthermore, GraphReg shows robustness against false-positive interactions, efficiently identifying relevant E-P interactions and accurately predicting gene expression. Moreover, we have calculated SHAP values using GraphReg to explore the role of TF binding in specific gene regulation.

Project 6. Polymer model benchmarking.

Leading lab: Plewczynski Laboratory at the Centre of New Technologies (CeNT), University of Warsaw, Warsaw, Poland
Stakeholders: Jedrzej Kubica (j.kubica@cent.uw.edu.pl), Dariusz Plewczynski (d.plewczynski@cent.uw.edu.pl)
Desired deliverable: Collection of 3D chromatin models for selected cell lines and experimental assays, protocol for critical assessment of the models, catalog of problems solvable by chromatin modeling, classification of modeling methods.
Expected coding experience level: Familiar with Python or C.
Motivation: The aim of the project is to identify software that can be recommended as state-of-the-art practice in the modeling of spatial chromatin organization. In recent years, various approaches have been taken to predict the structure of chromatin, the majority of which leverage data on epigenomic modifications or imaging data. During the hackathon, we plan to analyze a meta-ensemble of chromatin structure models obtained from state-of-the-art software. The rationale behind the project is that different approaches have their advantages as well as drawbacks. For the research community to benefit from the software, it is important to identify opportunities to improve it. The goal of our project is to provide direction for reaching a high level of confidence in the modeling of chromatin structure.
Post-Hackathon Summary: The aim of the project was to create criteria for chromatin model comparison and subsequently to benchmark software for chromatin modeling based on those criteria. The proposed deliverables and the corresponding results are described below.
1) Collection of 3D chromatin models for selected cell lines and experimental assays. Result: we obtained two models (based on Hi-C and ChIA-PET data) per software package (5 software packages in total) for the Tier 1 cell line GM12878.
2) Protocol for critical assessment of the models. Result: we developed a prototype methodology for model visualization, model vs. model comparison, and model vs. experiment comparison.
3) Catalog of problems solvable by chromatin modeling. Result: for the purposes of the hackathon we implemented different modeling methodologies, which aim to answer different problems. LoopSage can reconstruct the thermodynamic ensemble of possible structures of a specific region from the first principles of loop extrusion, and realistically reconstructs experimental heat maps by averaging over that ensemble. MultiEM is a multiscale model that can represent different scales of chromatin by applying a multi-scale potential. DIMES and PHi-C, on the other hand, are two models that were implemented for a more direct (top-down) reconstruction of structures from Hi-C contact matrices.
4) Classification of modeling methods. Result: we classified the output models based on the input data (Hi-C vs. ChIA-PET) as well as the modeling scale. Furthermore, according to our literature review, modeling methods can be divided into deterministic (e.g. MultiEM, DIMES, PHi-C) and stochastic (e.g. LoopSage). We noted the ability of stochastic methods to generate ensembles of structures, which comes at the cost of the computational power they require.
Our prototype pipeline is modular and scalable, so we plan to collaborate on extending it to other software packages and on performing multi-scale model comparison. We would also like to examine the most uncorrelated models, and the software that generated them, to identify the characteristics that cause the differences.
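One simple way to quantify model vs. model agreement, sketched below, is to correlate the pairwise bead-to-bead distances of two 3D structures. This metric is an assumption chosen for illustration and is not necessarily the one used in the prototype pipeline; the function names and toy structures are likewise hypothetical. Because only internal distances are compared, the score does not depend on how the two models are rotated or translated.

```python
import math


def pairwise_distances(coords):
    """Flattened upper triangle of the bead-to-bead distance matrix."""
    return [math.dist(coords[i], coords[j])
            for i in range(len(coords))
            for j in range(i + 1, len(coords))]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


def model_similarity(coords_a, coords_b):
    """Correlation between the distance maps of two models with equal bead counts."""
    if len(coords_a) != len(coords_b):
        raise ValueError("models must have the same number of beads")
    return pearson(pairwise_distances(coords_a), pairwise_distances(coords_b))


# Toy structures with 20 beads each (hypothetical coordinates).
helix = [(math.cos(0.5 * i), math.sin(0.5 * i), 0.1 * 0.5 * i) for i in range(20)]
shifted_helix = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in helix]
rod = [(float(i), 0.0, 0.0) for i in range(20)]

same_shape = model_similarity(helix, shifted_helix)
different_shape = model_similarity(helix, rod)
print("helix vs shifted helix:", same_shape)
print("helix vs rod:", different_shape)
```

Computing this score for every pair of models yields a similarity matrix from which the most uncorrelated models can be picked out for closer inspection.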

Project 7. Unveiling the role of epigenomic features on RNA splicing throughout neural development using machine learning.

Leading lab: Yin Shen Lab at UCSF
Stakeholders: Jing Wang (Jing.Wang4@ucsf.edu), Ian Jones (Ian.Jones3@ucsf.edu)
Desired deliverable: The deliverable of this study will be a comprehensive analysis outlining correlations between chromatin accessibility profiles obtained from ATAC-seq and PLAC-seq data and the corresponding RNA splicing patterns extracted from RNA-seq data. This will potentially include visualizations, statistical models, and identified key regulatory elements impacting splicing events.
Expected coding experience level: Basic familiarity with Python and R.
Motivation: This study aims to elucidate the relationship between chromatin accessibility and RNA splicing using multi-omics data obtained from human brain samples. Leveraging RNA-seq, ATAC-seq, and PLAC-seq datasets, our primary objective is to analyze the influence of chromatin accessibility patterns on the splicing landscape across multiple subtypes of the human brain.
Post-Hackathon Summary: First, we would like to express our sincere thanks to the supporters and everyone who helped. We believe all members had a very positive and memorable experience during the meeting. Each member was highly active, and with solid support from the hackathon supporters, we had a reliable platform to explore and test everything we wanted. Thanks to the efforts of the hackathon committee staff, we were able to use AWS storage, GPU, and CPU resources. Our team applied machine learning to predict RNA splicing events from epigenetic features. We evaluated different models and found that random forest and deep learning show the most promise for future analysis. We plan to continue our analysis after the hackathon, and we believe our findings could contribute to impactful research in the near future. Last but certainly not least, thank you to everyone who supported the hackathon.

Project 8. SmellEnhancer: Integrating Machine Learning for Enhancer-based Olfactory Receptor Gene Regulation.

Leading lab: Lomvardas Lab at Columbia University
Stakeholders: Miao Wang (mw3777@columbia.edu), Isabella Pirozzolo (idp2121@cumc.columbia.edu).
Desired deliverable: Generate models to investigate the cooperative regulation of enhancers on gene expression in cell development.
Expected coding experience level: Basic knowledge of Python, bash/terminal experience.
Motivation: Enhancer-promoter interactions drive cell differentiation and cell fate determination. However, how cell-type-specific enhancers cooperate during these processes remains unknown. The olfactory system detects and identifies odor signals following the “one receptor per neuron” rule, in which each mature olfactory sensory neuron (mOSN) expresses only one out of ~2000 olfactory receptor (OR) gene alleles. OR gene activation is accomplished by forming chromatin interactions between OR genes and OR-specific enhancer elements named Greek Islands (GIs). At least 63 GIs are distributed across the intergenic regions of the mouse genome. Olfactory receptor genes therefore provide an ideal system to study how enhancers cooperate to regulate transitions in gene transcription during cell fate decisions.
Post-Hackathon Summary: Olfactory sensory neurons express one olfactory receptor (OR) out of a thousand choices. The chosen OR outcompetes multiple others that are expressed in developing cells. We aimed to develop a model to predict the chosen OR from a set of regulatory elements. Using a multiome dataset that combined RNA-seq with ATAC-seq in the same single cells, we identified the chosen olfactory receptor in each cell, extracted the ATAC-seq peaks over regulatory sequences of interest, and used these cleaned data files to build several models. We built and optimized linear regression, random forest, and decision tree models. These were largely unable to predict OR choice or expression level. They were limited both by data sparsity and by the lack of information on additional elements important for choice. Finally, we built a stacked model designed to predict one of the top five chosen ORs. It achieved higher accuracy, suggesting that chromatin accessibility may partially determine OR choice. We will continue to build on this model and to deepen the datasets that could uncover the rules governing this process.

Project 9. Building reproducible 3D genome simulations frameworks by populating the polymer model zoo.

Leading lab: Fudenberg lab at USC
Stakeholders: Sasha Galitsyna (galitsyn@mit.edu), Max Tortora (tortora@usc.edu), Geoff Fudenberg (fudenber@usc.edu)
Desired deliverable: Initial commits for the 3D genome polymer model zoo and a draft design document to unify polymer model specifications.
Expected coding experience level: Python background. Prior experience with biophysics, file formats, and open source software development would be helpful.
Motivation: Despite progress in reproducible analysis of experimental data, software for the specification and analysis of biophysical simulations of 3D genomes has lagged. Existing simulation engines, often adapted from protein folding applications (e.g., HooMD, OpenMM, LAMMPS), lack modularity and tend toward complex, unsustainably coded frameworks. The Open Chromosome Collective (Open2C) aims to remedy this by developing an open-source polymer modeling framework for 3D genome simulation, polychrom. Exploration and implementation of published polymer models into a unified codebase are needed to develop a robust and flexible high-level API. Hackathon team members will build polymer models of 3D genome organization across cell states and organisms, populate a polymer model zoo, and understand the design principles for a robust polymer simulation framework.
Post-Hackathon Summary: 3D genome polymer simulations are important for understanding chromatin dynamics and interpreting Hi-C data. Simulation software typically presents a steep learning curve, requiring knowledge of physics, familiarity with complex programming interfaces, and tolerance for long run times. To remedy this, the main aim of our project was to comparatively assess, and potentially improve, the reliability, performance and API friendliness of various polymer simulation frameworks for coarse-grained chromatin simulations. By benchmarking various molecular dynamics schemes, we showed that the hydrodynamics-based dissipative particle dynamics (DPD) method significantly outperformed the more commonly used Langevin integrator for simulations of large (>100,000 particles) polymeric regions, without loss of accuracy. We further highlighted how this DPD backend may be tractably applied in the context of chromatin biology by devising a novel pipeline based on polychrom-hoomd (https://github.com/open2c/polychrom-hoomd) for quantitative simulations of single-cell Hi-C data. Finally, we deployed a new interactive API based on WandB (https://wandb.ai) for live tracking of simulation results and performance. The project also involved the development of streamlined tools to facilitate the integration of experimental data, such as compartment profiles or single-cell Hi-C constraints, into the simulation workflow, as well as the implementation of a robust analysis pipeline that processes both experimental and simulation data within a unified software ecosystem. The code generated during the hackathon can be found here: https://github.com/4dn-hack-team9-2024.