Exploring the current paradigm of gene regulation
How do cells know when to activate a certain gene? This information is encoded in the sequence of the DNA, but our understanding of this code is incomplete. Researchers have now tested how much information can be extracted from sequence data to predict which gene is active in which tissue.
A good storyteller knows exactly which anecdotes will bring his stories’ characters to life. By telling the right story at the right time, our genome even manages to give rise to hundreds of different cell types with characteristic life stories breathing an individual identity into every cell.
DNA snippets scattered across the genome harbor the code that directs the script of a cell’s life, successively switching genes on and off. Sequences called enhancers play an outstanding role in this process. They attract transcription factor proteins that start the expression of genes, thereby “enhancing” their activity. In some cases, they are located far away from the gene they activate.
Researchers Philipp Benner and Martin Vingron from the Max Planck Institute for Molecular Genetics (MPIMG) set out to decipher the instructions of the activation patterns in distinct cell types and embryonic tissues of the mouse.
With a series of statistical and bioinformatic analyses, the scientists identified several hundred tissue-specific DNA subsequences or “codewords” in enhancers that guide transcription factors, not only confirming sequences already known from other studies, but also identifying many new ones. The results have been published in several articles in NAR Genomics and Bioinformatics and the Journal of Computational Biology.
“Today, researchers assume that all the information is in the DNA sequence, including information for specific cell types, tissues, and organs,” says Martin Vingron, Director at the MPIMG. According to the prevailing theory, transcription factor proteins recognize “codewords” in enhancers that are specific for a certain cell type, allowing the genome to tell a cell’s story by jumping to the right chapters. “We wanted to see how far this approach would take us and test its limits,” says Vingron.
The researchers developed a program that identifies DNA sequences that the cell recognizes in order to activate genes in a tissue-specific way. They achieved this by training a statistical model with existing experimental data, telling it which enhancer is active in which tissue. Specifically, they used sequencing data from eight tissues of the embryonic mouse, including heart, lung, brain, and liver.
By comparing sequence data between the tissues, the program learned to recognize sequence patterns in enhancers that are characteristic for certain tissues.
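To make the idea concrete, here is a minimal sketch in Python of how a program can learn tissue-specific sequence patterns. It is an illustration only, not the researchers' software: it counts short subsequences (k-mers) in each enhancer and fits a plain logistic regression by gradient descent, whereas the published work uses regularized k-mer logistic regression on real data. The sequences, the tissue labels "A" and "B", the k-mer length, and all function names are invented for this example.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length; an arbitrary choice for this toy example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_features(seq, k=K):
    """Return the relative frequency of every possible k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts[kmer] / total for kmer in KMERS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=1.0, epochs=500):
    """Fit logistic regression weights by plain batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(seq, w, b):
    """Estimated probability that seq is an enhancer of tissue A."""
    x = kmer_features(seq)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Invented toy enhancers: tissue A is rich in "GAT", tissue B in "CCA".
tissue_a = ["ACGATAGATTGATC", "TTGATAGGATACGA", "GATCGATTAGATAA"]
tissue_b = ["ACCCATCCAGTCCA", "TCCAGGCCATTCCA", "CCATACCAGCCAAT"]

X = [kmer_features(s) for s in tissue_a + tissue_b]
y = [1] * len(tissue_a) + [0] * len(tissue_b)
w, b = train_logistic(X, y)

# Classify two sequences the model has not seen during training.
print(predict("AAGATTCGATGGAT", w, b) > 0.5)
print(predict("GCCATTCCAACCAG", w, b) > 0.5)
```

In this sketch, the only thing the model ever sees is the sequence itself, so any success in telling the two tissues apart must come from information contained in the DNA.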
This told the researchers how much cell type-specific regulatory information is actually contained in the DNA sequence of enhancers, explains Philipp Benner, who is a postdoctoral researcher in Vingron’s lab: “The better our algorithm can classify any given enhancer, the more information it contains about the tissue or cell types that it is responsible for.”
The statistical classifiers can also identify DNA subsequences that might underlie cell type-specific gene activation. In fact, Benner found several hundred new codewords in addition to patterns that have been identified in other studies.
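As a rough illustration of how candidate codewords can be read off from sequence data, the following Python sketch ranks k-mers by how much more often they occur in the enhancers of one tissue than in those of another. This simple enrichment score is a stand-in chosen for brevity, not the method used in the study, which derives the patterns from its trained classifiers. The sequences, the function names, and the pseudocount are assumptions made for this example.

```python
import math
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer occurrence across a list of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    return counts


def enriched_kmers(foreground, background, k=3, pseudocount=1.0, top=5):
    """Rank k-mers by the log2 ratio of foreground to background frequency."""
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    vocab = set(fg) | set(bg)
    # Pseudocounts keep the ratio finite for k-mers missing from one set.
    fg_total = sum(fg.values()) + pseudocount * len(vocab)
    bg_total = sum(bg.values()) + pseudocount * len(vocab)
    scores = {
        kmer: math.log2(((fg[kmer] + pseudocount) / fg_total)
                        / ((bg[kmer] + pseudocount) / bg_total))
        for kmer in vocab
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Invented toy enhancers for two hypothetical tissues.
tissue_a = ["ACGATAGATTGATC", "TTGATAGGATACGA", "GATCGATTAGATAA"]
tissue_b = ["ACCCATCCAGTCCA", "TCCAGGCCATTCCA", "CCATACCAGCCAAT"]

for kmer, score in enriched_kmers(tissue_a, tissue_b):
    print(kmer, round(score, 2))
```

A high score marks a subsequence as characteristic of the first tissue. Whether such a pattern is actually bound by a transcription factor is a separate question that sequence statistics alone cannot answer.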
“Overall, we established a strong and, most importantly, an interpretable model,” says Benner.
“With our advanced methods, the predictions are promising but far from perfect,” says Vingron. “Our results indicate that we might really have only a fragmentary understanding of the actual cell type-specific regulatory code.”
It is possible that not all the required information is contained in the DNA sequence of enhancers and that some of it is distributed elsewhere in the genome. Some cross-references in the storybook of the genome might still hide in other regulatory sequences, like promoter regions that are in close proximity to the gene itself.
Parts of the project were funded by the Berlin Institute for the Foundations of Learning and Data (BIFOLD) of the German Federal Ministry of Education and Research (BMBF).
Contact for scientific information:
Prof. Martin Vingron
Director, Head of the Department “Computational Molecular Biology”
Max Planck Institute for Molecular Genetics
+49 30 8413-1150
Dr. Philipp Benner
Guest Scientist at MPIMG
Federal Institute for Materials Research and Testing
+49 30 8104-3647
Benner, Philipp. Computing leapfrog regularization paths with applications to large-scale k-mer logistic regression. Journal of Computational Biology 28.6 (2021): 560-569. DOI: https://doi.org/10.1089/cmb.2020.0284
Benner, Philipp, and Martin Vingron. Quantifying the Tissue-Specific Regulatory Information within Enhancer DNA Sequences. NAR Genomics and Bioinformatics 3.4 (2021). DOI: https://doi.org/10.1093/nargab/lqab095
Benner, Philipp, and Martin Vingron. ModHMM: A modular supra-Bayesian genome segmentation method. Journal of Computational Biology 27.4 (2020): 442-457. DOI: https://doi.org/10.1007/978-3-030-17083-7_3