Genes aren’t the only organizational unit within an organism’s genetic sequence. Researchers have found other, larger structures whose functions are less clear. Now, in the March Physical Review E, a team reports evidence for the largest unit yet seen: stretches of tens of millions of DNA “letters” that appear to form a single unit of some kind. Furthermore, the individual genes within these “superstructures” are more closely related in function than chance would suggest, implying that the new structures have some biological purpose.
The human genome contains about 3 billion “letters,” the chemical nucleotides adenine, cytosine, guanine, and thymine (A, C, G, and T) that encode genes. Gene sequences are embedded within much longer stretches of DNA known as isochores. An isochore is defined by the percentage of G and C in its sequence, which can be below 40 percent or as high as 52 percent and stays relatively uniform over the isochore’s length, typically several hundred thousand nucleotides.
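To make the GC percentage concrete, here is a minimal Python sketch that computes it over consecutive windows of a sequence. The function names, the window size, and the toy sequence are illustrative assumptions, not anything taken from the study; real isochores span hundreds of thousands of nucleotides rather than ten.

```python
def gc_fraction(seq):
    """Fraction of letters in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window):
    """GC fraction of each consecutive, non-overlapping window of the given size."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


# Toy sequence (made up): a GC-poor stretch followed by a GC-rich one.
toy = "ATATATTAAT" + "GCGCGGCCGA"
print(windowed_gc(toy, 10))  # [0.0, 0.9]
```

A region whose windows all give similar values would count as compositionally uniform in this simplified picture.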
Pedro Carpena of Harvard University and the University of Málaga in Spain and his colleagues wanted to look for regularity on scales larger than a single isochore. They searched for large-scale GC-rich and GC-poor regions in the sequences of human chromosomes using a conventional algorithm based on the concept of entropy–a measure of the number of different ways that a state can be generated. If ten coin flips come out with five heads followed by five tails, the whole sequence has high entropy, as there are many ways to achieve such a 50-50 result. But the two runs of five identical flips each have very low entropy, as there’s only one way for each to occur. Carpena and his colleagues split the chromosome’s nucleotide sequence into segments that would maximize the difference in entropy between individual segments and the sequence as a whole.
However, even in a random sequence of coin flips, there are likely to be stretches of the same value due solely to chance. So researchers check that the calculated change in entropy is larger than would be expected from a truly random series, to be sure that the substructures are real.
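The coin-flip example can be worked through in a few lines of Python. The sketch below is only an illustration under stated assumptions: it uses the Shannon entropy of the symbol frequencies, finds the single cut that most reduces the length-weighted entropy of the two parts relative to the whole, and uses plain shuffling as the random baseline. The function names and the number of shuffles are hypothetical, and the article does not specify the exact formula the team used.

```python
import math
import random
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def best_split(seq):
    """Return (cut, gain) for the cut that maximizes the drop from the
    whole-sequence entropy to the length-weighted entropy of the two parts."""
    n = len(seq)
    h_whole = shannon_entropy(seq)
    best_cut, best_gain = None, -1.0
    for cut in range(1, n):
        left, right = seq[:cut], seq[cut:]
        h_parts = (len(left) * shannon_entropy(left)
                   + len(right) * shannon_entropy(right)) / n
        gain = h_whole - h_parts
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


def shuffle_p_value(seq, trials=200, seed=0):
    """Fraction of shuffled copies whose best gain matches or beats the real one."""
    rng = random.Random(seed)
    _, observed = best_split(seq)
    symbols = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(symbols)
        if best_split(symbols)[1] >= observed:
            hits += 1
    return hits / trials


flips = "HHHHHTTTTT"
print(best_split(flips))       # (5, 1.0): cut between the heads and the tails
print(shuffle_p_value(flips))  # small value: shuffles rarely split this cleanly
```

Scanning every cut this way takes time quadratic in the sequence length, so it suits only toy inputs; a chromosome-scale analysis would need running counts and repeated splitting of the resulting segments.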
Traditionally, a random sequence of nucleotides has been used as the baseline for comparison. Real DNA, however, has long-range order. That is, the probability of a given nucleotide appearing is not independent of the nucleotides that came before it. To take this into account, Carpena and his colleagues developed a new baseline model with realistic long-range ordering and compared their segmented sequences with it. They found that each human chromosome divided into a few huge segments, each tens of millions of nucleotides long, which is longer than any previously known organizational structure in the genome. These segments, which they’ve dubbed “superstructures,” were found to contain two hundred genes on average. Several simpler statistical crosschecks also found evidence for structures on the same scale.
To find out whether these structures are biologically meaningful or just statistical anomalies, the team looked at the underlying genes in the superstructures. They used a separate database that assigns to each gene a series of descriptive words characterizing its biological function, such as “membrane,” “metabolic,” or “signaling.” Two genes selected at random are likely to share six of these descriptive terms. Two genes on the same isochore share roughly 15. Any two genes in the same superstructure, the team found, are likely to share roughly 18 terms. The team considers this a remarkable level of similarity, given that genes in a superstructure are on average a hundred times farther apart than genes in an isochore.
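The comparison of shared descriptive terms can be sketched as a simple pairwise count. In the Python example below, the gene names, their terms, and the function name are all made up for illustration; the article does not say which database the team used or how exactly the averages were computed.

```python
from itertools import combinations
from statistics import mean


def mean_shared_terms(annotations, genes):
    """Average number of descriptive terms shared by each pair of the given genes."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        raise ValueError("need at least two genes")
    return mean(len(annotations[a] & annotations[b]) for a, b in pairs)


# Made-up genes and terms, for illustration only.
annotations = {
    "geneA": {"membrane", "signaling", "transport"},
    "geneB": {"membrane", "signaling", "metabolic"},
    "geneC": {"membrane", "metabolic"},
    "geneD": {"nucleus", "transcription"},
}

same_region = ["geneA", "geneB", "geneC"]  # hypothetical neighbors in one region
print(mean_shared_terms(annotations, same_region))        # pairs within the region
print(mean_shared_terms(annotations, list(annotations)))  # all pairs, as a baseline
```

In this toy data the within-region average comes out higher than the all-pairs average, which is the kind of gap the team reports between genes in the same superstructure and genes chosen at random.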
To Carpena and his colleagues, this functional similarity suggests that superstructures do have a biological function, even if they can’t yet identify it. They hope to study how these superstructures might be related to the three-dimensional structures of chromosomes, which are coiled up inside a cell’s nucleus.
Wentian Li of the North Shore Long Island Jewish Health System says the new work is the first to explicitly call attention to patterns at such large scales, although there had been indications of such large-scale structures in the past. He is especially impressed with one of the correlation techniques the team used and with their use of the database of gene descriptors.