DOI 10.1007/s11192-010-0252-2
Abstract We report here a simple method to identify ‘emerging topics’ in the life sciences.
First, keywords were selected from MeSH terms on PubMed by filtering the terms
based on the increment rate of their appearance; they were then sorted into groups dealing
with the same topics by ‘co-word’ analysis. These groups were defined as ‘emerging
topics’. A survey of the emerging keywords with high increment rates of appearance
between 1972 and 2006 showed that emerging topics changed dramatically year by year, and
that a major shift occurred in the late 90s: from topics covering technical and
conceptual aspects of molecular biology to more systematic ‘-omics’-related
and nanoscience-related topics. We further investigated trends in emerging topics within
various sub-fields of the life sciences.
Introduction
R. L. Ohniwa (corresponding author)
Institute of Basic Medical Sciences, Graduate School of Comprehensive Human Sciences,
University of Tsukuba, Tennoh-dai, Tsukuba 305-8575, Japan
e-mail: ohniwa@md.tsukuba.ac.jp
A. Hibino
Japan Society for the Promotion of Science, Interdisciplinary Cultural Studies, Graduate School of Arts
and Science, University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan
K. Takeyasu
Laboratory of Plasma Membrane and Nuclear Signaling, Graduate School of Biostudies,
Kyoto University, Yoshidahonmachi, Sakyo-ku, Kyoto 606-8501, Japan
112 R. L. Ohniwa et al.
converging and fading out every day. In this constantly changing research environment, it
is valuable to survey research trends in order to identify new issues as they emerge. This
critical information can help scientists to decide the direction of their future research and
allows governments and agencies to implement policies quickly and distribute research
funds more efficiently. In particular, it is important to identify research topics with a high
potential to expand and develop in the near future; such research topics are called
‘emerging topics’ here when they are identified at an early stage of development.
Two serious obstacles preventing the identification of emerging topics are the vast size
of scientific communities and the huge amount of research produced by each community—
over 800,000 life science articles were published in 2008. Thus, it is difficult to follow all
the information about research trends, and it is not practical to manually identify important
and emerging topics and judge whether each topic is emergent. To fill this gap, a simple
method to identify emerging topics would be beneficial.
Metrical approaches, first developed in the field of scientometrics, have provided useful
scoring systems that permit the identification of trends in given research areas—e.g.,
number of researchers, amount of grants, number of articles, number of citations. Among
the various metrical methods, content analysis of original articles can link quantitative
studies (e.g., number of papers) to qualitative studies (e.g., content of papers) (Callon et al.
1983, 1986; Leydesdorff 1995), and the keywords attached to the articles are often used for
content analysis. The number of articles containing certain keywords is a measure of the
research activity related to those keywords, and transitions in the frequency of particular
keywords can be regarded as indicators of research trends (Callon et al. 1986).
Since a single keyword represents only one aspect of a research topic, a set of
keywords is required to give a precise view of the topic (Fig. 1).
A research topic can be regarded as “an issue of a coherent set of subject-related research
problems, concepts and methods upon which attention is focused by scientific researchers,”
irrespective of the social and intellectual background of the researchers involved. This
concept is based on the notion of scientific specialty by Braam et al. (1991). Thus, a
‘topic’ can be regarded as an assembly of individual keywords.
[Fig. 1 diagram; surviving labels: Overall trends; Phenomena and Processes; Problems; Diseases; Methods; Techniques; Devices]
Trends in life sciences 113
The PubMed (MEDLINE) database and MeSH (Medical Subject Headings) terms are suitable
sources for the extraction of emerging keywords. PubMed, a literature database run by the
National Library of Medicine (NLM), contains about 10 million articles (Fig. 2) and is
large enough to cover almost all fields of life sciences. MeSH is a popular keyword
thesaurus developed by NLM and is typically used on PubMed to support literature
searches (Schulman 2000). Because over 10 terms representing each article’s content are
assigned and controlled by professionals educated by NLM, we can assume that the use of
terms remains consistent among authors.
Two types of terms are not appropriate for use as emerging keywords. First,
MeSH includes many keywords that represent the style or background of an article
rather than its research content. For instance, many articles carry ‘English Abstracts’
and/or ‘Research Support, U.S. Govn’t P. H. S.’. These terms describe the format of the
article and the funding that supported the authors. Second, MeSH includes quite general
terms, such as ‘Animals’ and ‘Proteins’. MeSH terms are organized hierarchically,
and such general terms sit near the top of the hierarchy. To
keep these terms out of the investigation of emerging keywords, we
excluded all MeSH terms classified in four trees of the hierarchy that are irrelevant to research
topics. In addition, we excluded the terms in the top three levels of the MeSH
hierarchical tree (see details in “Materials and methods” section). Through this operation,
193,234 terms were selected from a total of 230,100 MeSH terms from 1971 to 2008. The
remaining terms fell into trees such as Materials, Phenomena, Diseases,
Techniques and Devices, which are suitable for the investigation of research topics (Fig. 1 and
“Materials and methods” section). We used these selected terms for the following
analyses.
We defined ‘emerging keywords in a certain year’ as MeSH terms whose increment of
appearance frequency is in the top 5% of all terms in that year. This
increment is calculated as the ratio of the MeSH term’s frequency in years 3 and 4 to its
frequency in years 1, 2, 3 and 4, where years 1 to 4 are four consecutive years
and year 2 is the target year (see “Materials and methods” section). This
operational definition enabled us to neglect terms that already had a high frequency of
appearance and those with an extremely low frequency, which have little development
potential and are likely to simply fade away. Thus, the emerging keywords extracted were
MeSH terms with relatively low frequencies of appearance and high increment rates; i.e., a
high potential to develop in the future. In total, we selected 63,798 distinct emerging
keywords (N = 4,808,377) from 1972 to 2006 (Supplemental Table C).
[Fig. 2 plot; axes: year (1971 to 2006), number of articles, and MeSH terms per article]
The 20 most frequent emerging keywords in each 5-year period are summarized in
Table 1. We found turnover in the keywords rather than
consistency. While 19 keywords were listed in two consecutive periods, the other 102
keywords appeared in only one period, suggesting that the ‘emerging topics’ implied by
these emerging keywords have turned over dynamically in the past 30 years.
[Table 1 The 20 most frequent emerging keywords in each 5-year period; surviving column headers: 1977–1981, 1982–1986, 1987–1991, 1992–1996, 1997–2001 and 2002–2006, each with a count column (Num)]
research target of the regulation of translation and the techniques for gene silencing, was
listed among the top 20 emerging keywords in 2002–2006, and was valued in 2002. Thus,
the analysis of emerging keywords alone can pick up topics that are later highly valued as
innovative.
It might be reasonable to assume that novel keywords are easily identified as emerging
keywords. To investigate the novelty of the emerging topics, the year when each emerging
keyword first appeared on PubMed was determined using the NLM MeSH browser (see
“Materials and methods” section). A comparison between the year of first appearance
(initial year) and the year in which the term was logged as an emerging keyword (emerging
year) revealed that the initial year of many keywords was indeed identical to the emerging
year (Fig. 4, solid ellipse). The number of such keywords increased in the early 80s and
gradually decreased from the mid-90s. The period of increase coincides with the period when
identification and manipulation techniques for genes and proteins were beginning to emerge.
Thus, the development of molecular biology seems to have yielded plenty of novel results.
Fig. 3 The networks of the top 20 emerging keywords. Only keywords linked to other
keywords are shown. The labels on the clusters are topic names that we added. Technical terms in
the labels are defined as follows: ‘Lymphocyte’: a type of white blood cell in the vertebrate immune system,
‘Oncogene’: genes related to cancer, ‘Ontogenesis’: the biological process that causes an organism to
develop its shape
Fig. 4 Initial year of emerging topics on PubMed. The year when a particular MeSH term first appeared on
PubMed (initial year) was compared with the year when the term was identified as an emerging
keyword (emerging year). The X, Y, and Z axes represent the initial year, emerging year and frequency of
emerging keywords, respectively. The solid ellipse indicates emerging keywords for which the initial and
emerging years are identical. The dashed ellipse indicates emerging keywords with initial years before 1970
Another interesting finding is that the majority of keywords initially appeared before the
70s and were identified as emerging keywords only later (Fig. 4, dashed ellipse). For instance,
‘Histones’, which are essential proteins that pack genomic DNA in eukaryotic cells, first
appeared as a MeSH term in 1963 and later re-emerged in 2001–2003. Additional
examples include ‘Phylogeny’ (appeared in 1967), ‘Microscopy, Fluorescence’ (1963) and
‘Models, Molecular’ (1967), all of which are listed among the top 50 emerging keywords
in the late-90s. The number of such emerging keywords has dramatically increased since
the mid-90s. This trend is likely a result of the sufficient development of techniques to
solve various old but fundamental issues that could not be addressed before the 90s. In
recent years, the rediscovery and refocusing of old research topics seem to promote
important breakthroughs for the development of life sciences.
The historical dynamics of sub-fields of the life sciences, such as molecular biology, microbiology
and physiology, can reveal trends in life science research in
more detail. The number of emerging keywords in a particular category is a good
benchmark for the activity of the field. Here, we focused on journal
communities, because a journal community can be regarded as the unit in which the
boundaries of a professional discipline are established through knowledge production
(Fujigaki 1998). Therefore, the analysis of journal communities will illustrate the trends in
the sub-fields. We applied an index named PF (Perspective Factor), which we
previously proposed for evaluating journal quality (Ohniwa et al. 2004,
2007), to measure the abundance of emerging keywords in particular scientific
communities. The PF value of a given community c in year y, $\mathrm{PF}_{c\,\mathrm{in}\,y}$, is defined as
follows:
$$\mathrm{PF}_{c\,\mathrm{in}\,y} = A_{c\,\mathrm{in}\,y} / B_{c\,\mathrm{in}\,y}$$
Fig. 5 PF of journals in specific fields. a Biochemistry, molecular biology and cell biology, b genetics,
c microbiology, d development, e immunology, f bioinformatics, g neuroscience and h physiology
Table 3 Journals with the five highest PF values in each year, 2004 to 2006
Year  PF    Journal
2006  3.31  Small (Weinheim an der Bergstrasse, Germany)
2006  3.05  Journal of nanoscience and nanotechnology
2006  1.94  Nano letters
2006  1.92  Nanomedicine (London, England)
2006  1.91  Conservation biology: the journal of the Society for Conservation Biology
2005  3.90  Journal of nanoscience and nanotechnology
2005  2.89  Journal of microencapsulation
2005  2.24  Nano letters
2005  2.15  Journal of contaminant hydrology
2005  2.05  Water research
2004  5.06  IEEE transactions on pattern analysis and machine intelligence
2004  3.57  IEEE transactions on image processing: a publication of the IEEE Signal Processing Society
2004  3.29  Journal of nanoscience and nanotechnology
2004  2.86  IEEE computer graphics and applications
2004  2.67  IEEE transactions on medical imaging
PF values of all journals on PubMed were calculated between 2004 and 2006, and the 5 journals with the
best PF values were selected
[Figure plot; x axis: year, 1972 to 2006]
values of journals related to development and immunology decreased in the late-90s, while
those of journals related to the field of bioinformatics reached a peak in 2003. PF values of
journals related to physiology and neuroscience remained relatively low throughout the
period studied. The present data thus illustrate sub-field-specific fluxes and refluxes in
various research topics over the last 30 years.
We now focus on two issues: (i) the identification of the fields that
currently contain many emerging topics and (ii) the PF trends in the leading journals of
the life sciences as a whole. For issue (i), we selected the journals with the five highest PF
values in each year from 2004 to 2006 (Table 3), and found that journals handling nanoscience
consistently ranked high. This is consistent with the emerging topics found for 2002–2006 (Fig. 3). For
issue (ii), we focused on the journals Nature, Science and Proceedings of the National
Academy of Sciences of the United States of America (PNAS). They cover almost all sub-categories of the life sciences
and are believed to lead worldwide life science research, and their impact factors
are quite high. The annual transition of PF values for these journals between 1973
and 2006 showed that the absolute PF values increased until 1997 and
decreased thereafter (Fig. 6).
Concluding remarks
In this study, we proposed a simple method for the identification of emerging keywords,
and reported its successful application to the historical dynamics of life science
research. The novelty of the method is that emerging keywords are
selected based on their increment rates after calculating the rates of “all”
the keywords handled in a literature database. The keywords selected in this manner should
be free from the bias of the analysts.
Our historical survey of the emerging topics suggests two stages in the development of
life science research over the past 30 years: a ‘progressive stage’ from the 80s to the 90s
and a ‘re-evaluation stage’ from the late-90s to the present (Fig. 7). During the progressive
stage, various novel topics implied by emerging keywords were constantly and
continuously discovered (Fig. 4). The analysis showed that ‘progressive stage’ was likely to be
Datasets
We collected the MeSH terms attached to articles published between 1971 and 2008 from
PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez?db=PubMed) in April 2009. A total of
13,954,189 articles were included in the analysis. To identify the set of MeSH terms
attached to each article, we extracted the terms tagged as <NameOfSubstance> and
<DescriptorName> from the XML-style data, and then eliminated overlapping terms
within each article. Then, to eliminate terms unrelated to research topics, we excluded
all MeSH terms under the hierarchies “[M] Named Groups”, “[N] Health Groups”,
“[V] Publication Characteristics”, “[Z] Geographicals”, “[L01.453] Information
Services”, “[L01.178] Communications Media”, “[L01.143] Communication”, “[L01.346]
Information Centers” and “[L01.737] Publishing”. We further excluded the terms
occupying the 1st, 2nd and 3rd levels of the MeSH tree structure, to avoid
contamination by quite general terms that are not suitable for clarifying topics at the research level.
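The extraction and filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the sample XML is a hypothetical, heavily simplified record (real PubMed XML is more deeply nested), the tree numbers used with it are made up, and counting the category letter as level 1 of the hierarchy is an assumption, since the text does not say how levels were counted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified records; only the two tag names come from the text.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1</PMID>
    <DescriptorName>Histones</DescriptorName>
    <DescriptorName>Animals</DescriptorName>
    <NameOfSubstance>Histones</NameOfSubstance>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>2</PMID>
    <DescriptorName>Phylogeny</DescriptorName>
    <DescriptorName>Phylogeny</DescriptorName>
  </PubmedArticle>
</PubmedArticleSet>
"""

TERM_TAGS = ("DescriptorName", "NameOfSubstance")

def terms_per_article(xml_text):
    """Return {pmid: set of terms}; the set removes overlaps within an article."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("PMID")
        terms = set()
        for tag in TERM_TAGS:
            for node in article.iter(tag):
                if node.text:
                    terms.add(node.text.strip())
        result[pmid] = terms
    return result

# Excluded hierarchies as listed in the text.
EXCLUDED_CATEGORIES = {"M", "N", "V", "Z"}
EXCLUDED_SUBTREES = ("L01.453", "L01.178", "L01.143", "L01.346", "L01.737")
MIN_LEVEL = 4  # assumption: levels 1 to 3 are dropped, level 4 and deeper kept

def tree_level(tree_number):
    # Assumption: the category letter is level 1 and each dot-separated
    # node adds one level, so "L01.453" would be level 3.
    return 1 + len(tree_number.split("."))

def keep_tree_number(tree_number):
    """True if a MeSH tree number survives both exclusion rules."""
    if tree_number[0] in EXCLUDED_CATEGORIES:
        return False
    if any(tree_number == s or tree_number.startswith(s + ".")
           for s in EXCLUDED_SUBTREES):
        return False
    return tree_level(tree_number) >= MIN_LEVEL

articles = terms_per_article(SAMPLE_XML)
```

A term can carry several tree numbers in MeSH; how such terms were handled is not stated in the text, so this sketch only decides one tree number at a time.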
Among MeSH terms, we defined emerging keywords as follows. First, the increment rate
(I) of MeSH term a in year b was calculated as:
$$I_{a\,\mathrm{in}\,b} = X_{a\,\mathrm{in}\,b} / Y_{a\,\mathrm{in}\,b}$$
where $X_{a\,\mathrm{in}\,b}$ is the total number of appearances of MeSH term a on PubMed in years
b + 1 and b + 2, and $Y_{a\,\mathrm{in}\,b}$ is the total number of appearances of MeSH term a in years
b - 1, b, b + 1 and b + 2. The terms ranked in the top 5% of $I_{a\,\mathrm{in}\,b}$ in year b were
defined as emerging keywords.
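As an illustration of this definition, the following sketch computes the increment rate from per-year counts and keeps the top 5% of terms. The data layout (a dictionary of per-year counts), the function names, the toy counts and the handling of terms with no appearances in the window are all assumptions made for the example.

```python
import math

def increment_rate(freq, term, year):
    """I = X / Y for one term: X sums years b+1 and b+2, Y sums years b-1 to b+2."""
    counts = freq.get(term, {})
    x = counts.get(year + 1, 0) + counts.get(year + 2, 0)
    y = x + counts.get(year - 1, 0) + counts.get(year, 0)
    # Assumption: a term absent from the whole window gets no rate at all.
    return x / y if y else None

def emerging_keywords(freq, year, top_fraction=0.05):
    """Return the terms whose increment rate ranks in the top fraction."""
    rated = []
    for term in freq:
        rate = increment_rate(freq, term, year)
        if rate is not None:
            rated.append((rate, term))
    rated.sort(reverse=True)
    # Assumption: round the cut-off up and always keep at least one term.
    n_top = max(1, math.ceil(len(rated) * top_fraction))
    return [term for _, term in rated[:n_top]]

# Toy counts (hypothetical): articles per year containing each term.
freq = {
    "rising": {1999: 10, 2000: 20, 2001: 40, 2002: 80},
    "flat":   {1999: 50, 2000: 50, 2001: 50, 2002: 50},
    "fading": {1999: 80, 2000: 40, 2001: 20, 2002: 10},
}

top_terms = emerging_keywords(freq, 2000)
```

A term with constant frequency has a rate of 0.5, so rates above 0.5 indicate growth across the four-year window and rates below 0.5 indicate decline.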
As an example, the distribution of $I_{a\,\mathrm{in}\,2000}$ is shown in Supplemental Figure A. In 2000,
521,392 articles were published including a total of 6,118,069 MeSH terms (56,234 types).
For each MeSH term, we calculated $X_{a\,\mathrm{in}\,2000} / Y_{a\,\mathrm{in}\,2000}$, where $X_{a\,\mathrm{in}\,2000}$ is the number of
articles published in 2001 and 2002 containing the MeSH term a, and $Y_{a\,\mathrm{in}\,2000}$ is the
number of articles published in 1999, 2000, 2001 and 2002 containing the MeSH term a.
The MeSH terms meeting the criterion ‘$X_{a\,\mathrm{in}\,2000} / Y_{a\,\mathrm{in}\,2000} > 0.577$’ were ranked in the top
5%. The average increment rates of the extracted emerging keywords have remained
almost constant for the past 30 years (Supplemental Figure B).
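The threshold of 0.577 for the year 2000 can be read as a growth condition. As a short worked step, write $f_t$ for the number of articles containing term a in year t (this notation is introduced here only for illustration). Then, provided the term appears at least once in the window:

$$I_{a\,\mathrm{in}\,2000} = \frac{f_{2001} + f_{2002}}{f_{1999} + f_{2000} + f_{2001} + f_{2002}} > 0.577 \iff f_{2001} + f_{2002} > \frac{0.577}{1 - 0.577}\,\bigl(f_{1999} + f_{2000}\bigr) \approx 1.36\,\bigl(f_{1999} + f_{2000}\bigr)$$

In other words, a term qualified as emerging in 2000 when its appearances in 2001 and 2002 exceeded roughly 1.36 times its appearances in 1999 and 2000.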
The definition of emerging MeSH terms proposed here is similar to Thomson
Reuters’ “fast breaking papers” (http://sciencewatch.com/dr/fbp/) and “fast moving fronts”
(http://sciencewatch.com/dr/fmf/), which are based on the percentage increase in
citations and the percentage increase in the number of core papers, respectively. These methods
were discussed in Small (2006) and Upham and Small (2010). As mentioned in the
Introduction section, our approach focuses on keywords rather than citations, so as to
investigate the content of research topics in more depth.
The 20 most frequent emerging keywords were collected for each period
and examined for co-appearance in the same articles. The
co-appearance of the keywords was visualized with the Pajek package (Batagelj and Mrvar 2002). To
eliminate weak relations among keywords, the threshold for drawing an edge was set at 5%
of the number of keywords linked by the edge.
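The co-appearance step can be sketched as a simple pair count over articles. This is only one possible reading: the text does not spell out how the 5% threshold is computed, so the sketch assumes an edge is kept when the number of co-appearances reaches 5% of the article count of the rarer of the two keywords. The keyword names and the toy data are hypothetical, and the visualization itself is not reproduced.

```python
from collections import Counter
from itertools import combinations

def coword_edges(articles, keywords, threshold=0.05):
    """Return {(a, b): co-appearance count} for keyword pairs above the threshold."""
    keywords = set(keywords)
    term_count = Counter()
    pair_count = Counter()
    for terms in articles:
        # Sort so that each pair is always stored in the same order.
        present = sorted(keywords & set(terms))
        term_count.update(present)
        pair_count.update(combinations(present, 2))
    edges = {}
    for (a, b), n in pair_count.items():
        # Assumed reading of the threshold: compare against the rarer keyword.
        if n >= threshold * min(term_count[a], term_count[b]):
            edges[(a, b)] = n
    return edges

# Toy data (hypothetical): each set holds the keywords of one article.
articles = (
    [{"A", "B"}]
    + [{"A", "C"}] * 10
    + [{"A"}] * 19
    + [{"B"}] * 29
)

edges = coword_edges(articles, ["A", "B", "C"])
```

Here A and B each occur in 30 articles but share only one, which falls below 5% of 30, so that pair is dropped; A and C share 10 articles and are linked.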
When the topics represented by MeSH terms appear or are rearranged on PubMed, the
NLM MeSH browser saves this information in a History Note for MeSH terms, tagged
as <NameOfSubstance>, or in the Date of Entry for Supplementary Concept Records,
tagged as <DescriptorName>. The earliest year listed in either the History Note or Date of
Entry for a particular topic was considered to be that topic’s initial year.
References
Batagelj, V., & Mrvar, A. (2002). Pajek - analysis and visualization of large networks. Lecture Notes in
Computer Science, 2265, 477–478.
Braam, R. R., Moed, H. F., & van Raan, A. F. J. (1991). Mapping of science by combined co-citation and
word analysis. I. Structural aspects. Journal of the American Society for Information Science, 42(4),
233–251.
Callon, M., Courtial, J. P., Turner, W. A., & Bauin, S. (1983). From translations to problematic networks—
an introduction to co-word analysis. Social Science Information Sur Les Sciences Sociales, 22,
191–235.
Callon, M., Law, J., & Rip, A. (1986). Mapping the dynamics of science and technology: Sociology of
science in the real world. London: The Macmillan Press.
Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific
literature. Journal of the American Society for Information Science and Technology, 57(3), 359–377.
Ding, Y., Chowdhury, G. G., & Foo, S. (2001). Bibliometric cartography of information retrieval research
by using co-word analysis. Information Processing and Management, 37, 817–842.
Fujigaki, Y. (1998). Filling the gap between discussions on science and scientists’ everyday activities:
Applying the autopoiesis system theory to scientific knowledge. Social Science Information, 37(1),
5–22.
Lee, W. H. (2008). How to identify emerging research fields using scientometrics: An example in the field of
Information Security. Scientometrics, 76, 503–525.
Leydesdorff, L. (1995). The challenge of scientometrics. Leiden, The Netherlands: DSWO Press, Leiden
University.
Noyons, E., Moed, H., & van Raan, A. F. J. (1999). Integrating research performance analysis and science
mapping. Scientometrics, 46(3), 591–604.
Ohniwa, R. L., Denawa, M., Kudo, M., Nakamura, K., & Takeyasu, K. (2004). Perspective factor: a novel
indicator for the assessment of journal quality. Research Evaluation, 13, 175–180.
Ohniwa, R. L., Hibino, A., & Takeyasu, K. (2007). Perspective factor; past, present and future of life sciences.
Proceedings of international society for scientometrics and informatics 2007, II, pp. 908–909.
Price, D. J. D. (1963). Little science, big science. New York: Columbia University Press.
Schulman, J. (2000). Using medical subject headings (MeSH) to examine patterns in American medicine:
preliminary consideration of vocabulary change as a metric. http://www.nlm.nih.gov/mesh/patterns.html.
Small, H. (2006). Tracking and predicting growth areas in science. Scientometrics, 68(3), 595–610.
Tseng, Y. H., Lin, Y. I., Lee, Y. Y., Hung, W. C., & Lee, C. H. (2009). A comparison of methods for
detecting hot topics. Scientometrics, 81(1), 73–90.
Upham, S. P., & Small, H. (2010). Emerging research fronts in science and technology: patterns of new
knowledge development. Scientometrics, 83, 15–38.