How to Compare Books or Genomes
You know when you’re reading Charles Dickens that you’re not reading the same English language as that of Stephen King. But how, exactly, do they differ? Language change is not easy to define or measure, and a team of researchers now shows why. They also identify a useful way to make the comparison, and from Google Books data they calculate the rate at which the English language has changed since 1850. The method can be applied to degrees of similarity between other symbol sequences too, such as the genetic sequences of organisms’ DNA—a comparison helpful for gauging the biochemical functions of different parts of the genome.
It’s been known since the 1930s that word usage in many languages has a so-called power-law distribution: the frequency with which a word is used is proportional to that word’s rank in a list ordered by usage frequency, raised to some negative power. For example, if the power were -1, the 20th most frequently used word would be 1/20 as common as the most frequently used word. With this distribution, a few words are used very often (“and” and “the” in English, say), but most are used very rarely. Such a distribution is “heavy-tailed” because the tail of the distribution (the low-usage end) decreases only slowly; most words appear in this tail. In this distribution, “there is no ‘typical’ frequency with which words occur,” says Martin Gerlach of the Max Planck Institute for the Physics of Complex Systems in Dresden, Germany.
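For a concrete sense of the numbers, here is a minimal Python sketch of such a rank-frequency law with exponent -1 over a finite vocabulary; the vocabulary size is an arbitrary illustrative choice, not a figure from the study.

```python
# Illustrative only: a Zipf-like rank-frequency law with exponent -1,
# normalized over a finite vocabulary of V word ranks.
V = 50_000                                   # vocabulary size (arbitrary choice)
weights = [rank ** -1.0 for rank in range(1, V + 1)]
total = sum(weights)
prob = [w / total for w in weights]          # probability of the word at each rank

# The 20th most common word is 1/20 as probable as the most common one.
print(prob[19] / prob[0])                    # -> 0.05
```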
There are two ways in which we could imagine comparing, say, Dickens’ Hard Times to Stephen King’s The Shining. We could count words that appear in one book but not the other (“snowmobile” doesn’t appear in Hard Times), or we could compare word usage frequencies (Dickens uses “gentleman” more often). Which is more revealing? Gerlach and his collaborators have devised a mathematical function that embraces both measures and allows a more nuanced and consistent comparison.
They define a “distance” Dα between two word-use distributions (two different texts, say) as a function that involves summing up the word probabilities (that is, frequencies of use) in each distribution, with each word probability being raised to an exponent α. When α = 1, Dα is equal to a quantity already familiar in information theory, called the Jensen-Shannon divergence [1]. It draws on the notion that a sequence of symbols can be assigned an information content that depends on the frequency of each symbol.
Gerlach and colleagues then show that varying α is equivalent to changing the attribute by which you compare two word distributions. For example, the Jensen-Shannon divergence (α = 1) corresponds to a comparison of word usage frequencies, while the case of α = 0 corresponds to simply counting the number of words that appear in both distributions.
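The article does not spell out the formula, so the sketch below makes an assumption: it builds the order-α divergence from a Tsallis-type generalized entropy, a common construction that reduces to the standard Jensen-Shannon divergence at α = 1 and, at α = 0, depends only on which words occur in each text rather than on how often. The two toy “texts” and all parameters are invented for illustration.

```python
# A sketch of a generalized divergence between two word-frequency distributions,
# ASSUMING a Tsallis-type entropy of order alpha; the exact formula used by
# Gerlach and colleagues may differ in details.
import math
from collections import Counter

def entropy(probs, alpha):
    """Generalized entropy H_alpha; reduces to Shannon entropy (in nats) at alpha = 1."""
    probs = [p for p in probs if p > 0]
    if abs(alpha - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)

def divergence(text_a, text_b, alpha):
    """Jensen-Shannon-style divergence of order alpha between two word lists."""
    counts_a, counts_b = Counter(text_a), Counter(text_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    p = {w: counts_a[w] / n_a for w in vocab}
    q = {w: counts_b[w] / n_b for w in vocab}
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}          # mixture of the two texts
    return entropy(m.values(), alpha) - 0.5 * (
        entropy(p.values(), alpha) + entropy(q.values(), alpha))

a = "the gentleman saw the gentleman by the coach".split()
b = "the snowmobile crossed the frozen lake by the hotel".split()
print(divergence(a, b, alpha=1.0))   # standard Jensen-Shannon divergence
print(divergence(a, b, alpha=0.0))   # depends only on which words each text uses
```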
Choosing the best measure for comparison is difficult, Gerlach and colleagues say, because the heavy tails of the distributions, where much of the important information resides, are saddled with large amounts of noise. That is, there’s a high degree of randomness in the statistics of rarely used words, which can affect the apparent distance between distributions. The team finds that Dα with α = 2 is the optimal choice for bringing out a true measure of distance above the noise. Some previous studies didn’t take this noise problem into account [2], so their conclusions, regarding the rate of change of language for example, should be re-evaluated, says Gerlach.
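The noise problem can be seen in a small simulation, again illustrative and with arbitrary sizes, using the same assumed Tsallis-type form of the divergence as above: two independent samples drawn from the identical Zipf-like word distribution should be at distance zero, yet the finite-sample estimate is not, and the discrepancy is far larger at α = 1 than at α = 2.

```python
# Illustrative simulation (arbitrary sizes): two samples from the SAME heavy-tailed
# word distribution should have divergence zero, but the estimate is biased by the
# noisy tail -- much more so at alpha = 1 than at alpha = 2.
import numpy as np

rng = np.random.default_rng(0)
V, N = 100_000, 10_000                       # vocabulary size, words per sample
p_true = 1.0 / np.arange(1, V + 1)           # Zipf-like, exponent -1
p_true /= p_true.sum()

def empirical(sample):
    counts = np.bincount(sample, minlength=V)
    return counts / counts.sum()

def entropy(p, alpha):
    p = p[p > 0]
    return -np.sum(p * np.log(p)) if alpha == 1 else (1 - np.sum(p ** alpha)) / (alpha - 1)

def divergence(p, q, alpha):
    m = 0.5 * (p + q)
    return entropy(m, alpha) - 0.5 * (entropy(p, alpha) + entropy(q, alpha))

p_hat = empirical(rng.choice(V, size=N, p=p_true))
q_hat = empirical(rng.choice(V, size=N, p=p_true))
print("alpha = 1:", divergence(p_hat, q_hat, alpha=1))   # noticeably above zero
print("alpha = 2:", divergence(p_hat, q_hat, alpha=2))   # much closer to zero
```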
Using data from Google Books, Gerlach and colleagues say that English language usage has in fact been diverging steadily since the mid-nineteenth century as the square of the separation in time. What’s more, this increasing distance is due more to changes in low-frequency words than high-frequency ones, information that was not available with previous techniques. The team says that their results should also hold for non-English languages with similar usage distributions, such as Russian and Spanish [3].
“These authors have systematically addressed an almost always overlooked and serious issue, and one that has worried me greatly,” says Peter Sheridan Dodds of the University of Vermont, who has also studied word usage patterns. Dodds adds that he and his colleagues had previously, “with much less rigor,” arrived at a conclusion equivalent to Gerlach’s result that α = 2 gives the most meaningful measure of distance [4].
The Jensen-Shannon divergence has also been used to compare DNA sequences, for example to distinguish genome regions that code for proteins from those that do not [1]. The DNA symbol distributions can also be heavy-tailed [5], so Gerlach expects the same challenges in reliably measuring distances between distributions.
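As a toy illustration of the genomic use, not the actual segmentation procedure of Grosse and colleagues [1], the same distance can be applied to the short-“word” (here, trinucleotide) composition of two DNA stretches; the sequences below are made up.

```python
# Toy illustration only: compare the trinucleotide composition of two DNA
# stretches with the Jensen-Shannon divergence (in bits). This is not the
# segmentation method of Grosse et al., just the same distance measure.
import math
from collections import Counter

def kmer_probs(seq, k=3):
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def shannon(probs):
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

def jensen_shannon(p, q):
    kmers = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in kmers}
    return shannon(m.values()) - 0.5 * (shannon(p.values()) + shannon(q.values()))

seq_1 = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"      # made-up sequences
seq_2 = "ATATATATATATTTTTTAAAAATATATATTTTAAATATAT"
print(jensen_shannon(kmer_probs(seq_1), kmer_probs(seq_2)))
```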
This research is published in Physical Review X.
–Philip Ball
Philip Ball is a freelance science writer in London. His latest book is How Life Works (Picador, 2024).
References
- I. Grosse, P. Bernaola-Galván, P. Carpena, R. Román-Roldán, J. Oliver, and H. E. Stanley, “Analysis of Symbolic Sequences Using the Jensen-Shannon Divergence,” Phys. Rev. E 65, 041905 (2002).
- J.-B. Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331, 176 (2011).
- M. Gerlach and E. G. Altmann, “Stochastic Model for the Vocabulary Growth in Natural Languages,” Phys. Rev. X 3, 021006 (2013).
- P. S. Dodds, K. D. Harris, I. M. Kloumann, C. A. Bliss, and C. M. Danforth, “Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter,” PLoS ONE 6, e26752 (2011).
- R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, “Linguistic Features of Noncoding DNA Sequences,” Phys. Rev. Lett. 73, 3169 (1994).