Adamic and Zipf's Law

NEWS & VIEWS
COMPLEX SYSTEMS
Unzipping Zipf's law

One mathematical model can account for power-law distributions in a variety of systems. Eschewing system-specific assumptions, it utilizes a shared feature of the observed distributions: they all describe the division of items into groups.
WORDLE.NET
LADA ADAMIC
Perhaps the only thing more abundant in both natural and man-made systems than power laws are the models that have been developed to explain them. Writing in the New Journal of Physics, Baek et al.¹ argue that because such models depend on the specifics of each system, they fail to capture the shared cause of this regularity. The authors instead propose a general model that can be applied to any division of items into groups, and that can, for example, account for Zipf's law of word frequencies in text, the popularity of last names, and city and county populations.

Scientists have been captivated by power laws with reason. Whereas other probability distributions invariably bend on log-log scales, the power law continues as a perfectly straight line over as many orders of magnitude as the system size allows. A power-law distribution, in special cases referred to as Zipf's law or a Pareto distribution, specifies that the probability of observing an item of size k is proportional to k^(-γ), with γ being typically between 1 and 3. The implications of the distribution are even more striking than its heavy-tailed shape: there are a few megacities but many small towns; a small number of individuals hold a substantial fraction of the total wealth; and there are roughly 2.5 million Smiths in the United States, whereas most last names are uncommon.

In fact, a heavy-tailed distribution of sizes tends to hold for a wide range of systems in which items are assigned to bins: species to genera, readers to books, visitors to websites, written words to vocabulary (Fig. 1), and even casualties to wars².

Various models have been proposed to explain one or several of the observed power laws. Two main criticisms are commonly aimed at such models. First, many distributions deviate, at least slightly, from a straight line on a log-log scale. Often the deviation is an exponential cut-off in the tail of the distribution and is not captured by the model.
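As a rough illustration of the straight-line property and of the first criticism, the following Python sketch compares the log-log slope of a pure power law with that of a power law multiplied by an exponential cut-off. The function names, the exponent of 2 and the cut-off scale of 1,000 are assumptions chosen only for illustration; they are not values or methods taken from Baek et al.

```python
import math

GAMMA = 2.0      # assumed exponent, inside the typical range of 1 to 3
CUTOFF = 1000.0  # assumed cut-off scale, chosen only for illustration


def power_law(k, gamma=GAMMA):
    """Unnormalized power law: weight proportional to k**(-gamma)."""
    return k ** (-gamma)


def power_law_with_cutoff(k, gamma=GAMMA, k_c=CUTOFF):
    """Unnormalized power law damped by an exponential cut-off exp(-k / k_c)."""
    return k ** (-gamma) * math.exp(-k / k_c)


def log_log_slope(f, k1, k2):
    """Slope of the line joining (k1, f(k1)) and (k2, f(k2)) on log-log axes."""
    return (math.log(f(k2)) - math.log(f(k1))) / (math.log(k2) - math.log(k1))


def slopes_per_decade(f, points=(1, 10, 100, 1000, 10000)):
    """Log-log slope of f over each successive decade of group size k."""
    return [log_log_slope(f, a, b) for a, b in zip(points, points[1:])]


pure_slopes = slopes_per_decade(power_law)
cutoff_slopes = slopes_per_decade(power_law_with_cutoff)

if __name__ == "__main__":
    print("pure power law:", [round(s, 3) for s in pure_slopes])
    print("with cut-off:  ", [round(s, 3) for s in cutoff_slopes])
```

Under these assumptions the pure power law keeps a slope of -2 in every decade, which is the straight line on a log-log plot. The version with a cut-off stays close to -2 for small k and bends sharply downward once k reaches the cut-off scale, which is the kind of deviation in the tail that a pure power-law model does not capture.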
Second, models tend to contain system-specific elements that limit their generalizability. Early pursuits of more general models were undertaken by, among others, Herbert Simon, who wrote³: "No one supposes that there is any connection between horse-kicks suffered by soldiers in the German army and blood cells on a microscopic slide other than that the same urn scheme provides a satisfactory abstract model for both phenomena."

The urn model proposed by Simon is related to other preferential-attachment growth models, also known as cumulative-advantage or rich-get-richer processes. Yule developed⁴ one of the oldest such models, proposing that genera grow in proportion to the number of species they contain, by assuming that each species has an equal likelihood of generating a speciation event.

Whereas preferential-attachment models continue to be used to explain power-law distributions observed in various contexts⁵,⁶, some power laws prompt different explanations. For example, Zipf's distribution of word frequencies can result from a principle of least effort⁷,⁸ in the evolution of language, or even from random sequences of letters and spaces⁹. This leaves open the possibility that there is a more general, global explanation of power laws that is independent of system-specific details.

Figure 1 | Most frequently occurring words in a novel. Words are sized in proportion to their frequency of appearance in The Mayor of Casterbridge, a novel by Thomas Hardy that is part of a corpus analysed by Baek and colleagues¹. The word frequencies follow Zipf's law; the single word "the" appears 6,775 times in the novel, whereas 4,959 words occur only once, among them "grouping" and "mathematical".

Just such an explanation has been developed by Baek and colleagues¹. Their random group formation (RGF) model is built on the only common feature among all the systems modelled: that M items are divided among N groups. Entropy is maximized when an item is equally likely to be found at any of M addresses across the groups. Next, one derives a distribution of group sizes that minimizes the amount of information needed to locate an item knowing only the size of the group to which it belongs. This objective, in addition to the constraints of total number of groups and items, and the maximum group size, is sufficient to derive the RGF function, a power-law distribution of group sizes with an exponential cut-off.

There are several remarkable aspects to this finding. The RGF function closely fits observed group size distributions without incorporating any knowledge of system-specific dynamics. In contrast to previous models, which would typically tune their parameters by fitting the empirically observed distribution, the RGF model requires no tuning. The power-law exponent in the RGF function is given directly once one specifies M, N and the maximum observed group size. Furthermore, the exponential cut-off observed in empirical data is an essential component of the RGF model, rather than a correction introduced to fit the data. The RGF model just as easily fits word-frequency distributions representing entire books as it fits random subsamples of the same texts, something that alternative models generally cannot do. Finally, the approach is flexible enough to incorporate system-specific constraints, as needed.

The work of Baek and colleagues¹ may be the first to provide a truly general explanation of the prevalence of power-law distributions in frequency counts. But it is not yet ready to replace other models entirely. For many, if not all, systems the intuition behind the assumption that one wishes to minimize the information cost of locating an item needs to be further developed. By contrast, growth models usually integrate intuition about a system's evolution. Furthermore, the power-law exponents produced by the RGF model in some cases differ from those estimated previously using maximum-likelihood fits to data². Nevertheless, by deriving power-law distributions from very general system-independent principles, Baek et al. have raised the bar for other models. A model purporting to explain a power-law distribution should be as general as Baek and colleagues' model, or it should be able to reproduce additional features of the system it models, beyond the familiar straight line on a log-log plot.

Lada Adamic is in the School of Information and the Center for the Study of Complex Systems, University of Michigan, Ann Arbor, Michigan 48109-1107, USA.
e-mail: ladamic@umich.edu

1. Baek, S. K., Bernhardsson, S. & Minnhagen, P. New J. Phys. 13, 043004 (2011).
2. Newman, M. E. J. Contemp. Phys. 46, 323-351 (2005).
3. Simon, H. A. Biometrika 42, 425-440 (1955).
4. Yule, G. U. Phil. Trans. R. Soc. Lond. B 213, 21-87 (1925).
5. Barabási, A.-L. & Albert, R. Science 286, 509-512 (1999).
6. Huberman, B. A. & Adamic, L. A. Nature 401, 131 (1999).
7. Zipf, G. K. Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology 1st edn (Hafner, 1949).
8. Ferrer i Cancho, R. & Solé, R. V. Proc. Natl Acad. Sci. USA 100, 788-791 (2003).
9. Miller, G. A. Am. J. Psychol. 70, 311-314 (1957).

164 | NATURE | VOL 474 | 9 JUNE 2011
© 2011 Macmillan Publishers Limited. All rights reserved

STEM CELLS
iPS cells under attack

Induced pluripotent stem cells offer promise for patient-specific regenerative therapy. But a study now cautions that, even when immunologically matched, these cells can be rejected after transplantation. See Letter p.212
EFFIE APOSTOLOU & KONRAD HOCHEDLINGER
In 2006, Takahashi and Yamanaka¹ made a groundbreaking discovery. When they introduced four specific genes associated with embryonic development into adult mouse cells, the cells were reprogrammed to resemble embryonic stem cells (ES cells). They named these cells induced pluripotent stem cells (iPS cells). This approach does not require the destruction of embryos, and so assuaged the ethical concerns surrounding research on ES cells². What's more, researchers subsequently noted that the use of custom-made adult cells derived from human iPS cells might ultimately allow the treatment of patients with debilitating degenerative disorders. Given that such cells' DNA is identical to that of the patient, it has been assumed, although never rigorously tested, that they wouldn't be attacked by the immune system³.

On page 212 of this issue, however, Zhao et al.⁴ show, in a mouse transplantation model, that some iPS cells are immunogenic, raising concerns about their therapeutic use.

To examine the immunogenicity of mouse iPS cells, the authors used a simple teratoma-formation assay. Briefly, they injected iPS cells into mice that were either immune-compromised or genetically matched with the donor cells. This normally results in the formation of benign tumours called teratomas, which consist of many types of differentiated cells. Zhao et al. validated their approach by showing that a line of genetically matched (autologous) ES cells gives rise to teratomas, whereas a line of unmatched ES cells is rejected by the immune system of the recipient animals before it can produce teratomas (Fig. 1a).

Surprisingly, transplantation of autologous iPS cells derived from fetal fibroblasts into matched mice also resulted in the rejection of teratomas (Fig. 1b). To rule out the possibility that the viral vectors used to introduce the reprogramming genes (which integrate into the host-cell genome) were responsible for the immune rejection, the authors used a different method, the episomal approach, to generate iPS cells. This produced similar results, albeit with a weakened immune response. Together, these data indicate that, in this assay, matched iPS cells are more immunogenic than matched ES cells.

Zhao et al.⁴ further identified the antigens that probably caused immune rejection of the iPS cells. By analysing the gene-expression profiles of iPS-cell-derived teratomas, they discovered a group of just nine genes that were expressed at abnormally high levels. Indeed, inducing expression of three of these genes (Hormad1, Zg16 and Cyp3a11) in the non-immunogenic ES cells significantly impaired these cells' ability to form teratomas on transplantation into genetically matched mice. Whether activation of the same genes is also necessary for immunogenicity of iPS cells remains to be tested. Teratoma regression may be due to a
[Figure 1 labels: Blastocyst; Matched ES cells; Normal tumour growth; Immune tolerance; Fibroblasts; Genetic reprogramming; Matched iPS cells; Overexpression of Hormad1 and Zg16; Specific T-cell response; Tumour regression; Immune rejection]
Figure 1 | Immunogenicity of induced pluripotent stem cells. a, Zhao et al.⁴ find that embryonic stem cells (ES cells) derived from blastocyst embryos from a given genetic background grow into teratomas on transplantation into mice of the same genetic background. The immune system therefore tolerates autologous ES cells. b, By contrast, autologous induced pluripotent stem cells (iPS cells), reprogrammed from fetal fibroblasts by viral or non-viral genetic approaches, elicit an unexpected immune reaction in genetically identical mice, resulting in their rejection.