Novel mapping approach for DNA sequence binding motifs sharply expands library of genetic knowledge

In a study with wide-ranging impact, researchers increased the number of known DNA sequence binding motifs for eukaryotic transcription factors more than 10-fold, including doubling the number known for human transcription factors.

This new insight significantly improves the capacity to predict gene expression mechanisms in many disease-related problems and across essentially all of eukaryotic biology.

The study, led by Matthew Weirauch, PhD, a computational biologist in the Center for Autoimmune Genomics and Etiology, was published Sept. 11, 2014 in the journal Cell. The findings have enabled researchers who study any organism to begin to understand how genes are regulated on a global scale. For human disease, the study increases researchers’ ability to understand the function of disease-associated genetic variants that fall in non-coding regions.

An estimated 90 percent of disease-associated variants are non-coding, meaning they fall in parts of an organism's DNA that do not encode protein sequences. “Doubling our knowledge of human DNA sequence binding motifs essentially doubles our chance of figuring out which proteins these variants might affect the binding of,” Weirauch says.

The center focuses primarily on the genesis of lupus and other immunological diseases, and it explores the mechanisms of disease through the complex interactions of genetics, the immune system and environmental factors such as stress, exercise and diet.

Two findings of the study surprised researchers. “First, that our scheme for mapping DNA sequence binding motifs across organisms based on protein similarity works for most protein families,” says Weirauch. “Second, the fact that we increased knowledge of these motifs so substantially across all of eukaryotic life, from less than one percent to almost 40 percent of all proteins.”

In a pictorial overview of the transcription factor (TF) selection strategy and motif inferences, this first figure shows a network schematic depicting TFs (nodes), the links between related TFs (edges), and their motif status (node color). The figure depicts all 3,715 TFs across 246 species that contain a single bZIP domain.

Citation

Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, Zheng H, Goity A, van Bakel H, Lozano JC, Galli M, Lewsey MG, Huang E, Mukherjee T, Chen X, Reece-Hoyes JS, Govindarajan S, Shaulsky G, Walhout AJ, Bouget FY, Ratsch G, Larrondo LF, Ecker JR, Hughes TR. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158(6):1431-1443.

Lead Researcher:

Matthew Weirauch, PhD

This figure is a close-up of the boxed region in the first figure. Here, motifs are shown for characterized TFs. Researchers noticed that motifs within the left group strongly resemble one another, as do motifs within the right group, as predicted by the amino acid percent identity (AA %ID) of their DNA-binding domains (DBDs). However, the motifs from the left and right groups are not related, as predicted by the fact that the DBD %ID between their TF members falls below the inference threshold for bZIPs. That is, there are no links between the two groups. Motifs with blue outlines were determined using PBMs; red-outlined motifs are from the Transfac database. The findings illustrated here improve the understanding of gene expression mechanisms connected to disease.
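The threshold idea described in this caption can be illustrated with a minimal Python sketch. It is not the study's implementation. It assumes the DBD sequences are already aligned to equal length, and the function names (percent_identity, infer_motif), the toy sequences, the motif labels and the 70 percent threshold are all hypothetical placeholders; the caption indicates only that a family-specific threshold exists for bZIPs, not its value.

```python
from typing import Dict, Optional, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns whose residues are identical.

    Assumes `a` and `b` are pre-aligned DBD sequences of equal length,
    with '-' marking a gap. Columns where both are gaps are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def infer_motif(
    query_dbd: str,
    characterized: Dict[str, Tuple[str, str]],
    threshold: float,
) -> Optional[Tuple[str, str, float]]:
    """Return (TF name, motif label, %ID) of the most similar characterized TF
    whose DBD %ID meets the threshold, or None if no TF qualifies."""
    best: Optional[Tuple[str, str, float]] = None
    for name, (dbd, motif) in characterized.items():
        pid = percent_identity(query_dbd, dbd)
        if pid >= threshold and (best is None or pid > best[2]):
            best = (name, motif, pid)
    return best


# Toy, made-up data: name -> (aligned DBD sequence, motif label).
CHARACTERIZED = {
    "TF_A": ("MKRARNRVAA", "MOTIF_A"),
    "TF_B": ("LEKQRNREAS", "MOTIF_B"),
}

# Hypothetical threshold; a real one would be chosen per protein family.
THRESHOLD = 70.0

if __name__ == "__main__":
    print(infer_motif("MKRARNRIAA", CHARACTERIZED, THRESHOLD))
    print(infer_motif("GGGGGGGGGG", CHARACTERIZED, THRESHOLD))
```

In this sketch, a query TF inherits a motif only from a characterized TF it is linked to, that is, one whose DBD identity meets the threshold. A query that falls below the threshold for every characterized TF gets no inferred motif, which mirrors the absence of links between the two groups in the figure.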