The bibliometric database at the Swedish Research
Council – contents, methods and indicators
Ulf Kronman, Magnus Gunnarsson and Staffan Karlsson
2010‐05‐11. Version 1.0
Contents
1 Introduction
2 Data source and database
3 Data properties
3.1 Journals
3.2 Document types
3.3 Subject fields
3.4 Addresses
3.5 Author names
3.6 References
4 Data preparation
4.1 Re‐classification of publication type Letter
4.2 Subject field classification and fractionalisation
4.3 Address deduplication
4.4 Address matching
4.5 Address counting and fractionalisation
4.6 Reference adjustments
4.7 Citation windows
4.8 Self citations
4.9 Citation reference values for Thomson subject fields (μf)
4.10 Citation reference values for journals (μj)
4.11 Citation percentile threshold values for fields (τf)
5 Methods for analyses
5.1 Publication address/field fractions
6 Indicators
6.1 Publication counts (P)
6.2 Share of un‐cited publications (puc)
6.3 Share of self‐citations (csc)
6.4 Field normalised citation rate (cf)
6.5 Share of publications above the 90th, 95th and 99th citation percentile
6.6 Journal normalised citation rate (cj)
6.7 Journal to field normalised citation rate (jf)
7 References
8 Appendices
8.1 SRC denotation of parameters and indicators
8.3 Thomson subject fields
1 Introduction
The Department of Research Policy Analysis at the Swedish Research Council (SRC) maintains and
develops a database for bibliometric analyses.
This document describes in detail the properties of the SRC bibliometric database, the data preparation,
the analysis tools and methods, and the resulting bibliometric indicators used at the SRC.
2 Data source and database
The bibliometric analyses made at the SRC are based on scientific publication data records licensed
from the US‐based company Thomson Reuters. The database corresponds approximately to the data
that can be retrieved in the Thomson Reuters web service Web of Science. The licensed products are the
Thomson Reuters indices Science Citation Index Expanded, Social Sciences Citation Index and Arts and
Humanities Citation Index1.
The SRC database contains publication records of all serial titles and publication types covered by
Thomson Reuters2 together with their reference lists. SRC licenses records for articles published from
1982 and onwards; approximately 30 million records and 560 million references (citations). The
database is updated once a year in March‐April.
Data is delivered from Thomson Reuters in the form of tagged text files containing about 1.3 million
new publication records per year. As Thomson also complements and corrects old data, the delivery
of new records is accompanied by batches of “gaps” and “corrections” that are supposed to fill gaps
and replace old amended records. The SRC receives updates, gaps and corrections by the end of
January each year, a shipment that includes all changes made to the licensed products during the
previous calendar year. Since there is a certain time lag between a scientific document’s publication
and its entry into the Thomson indexes, the January shipment does not include all publications that
were published during the preceding year. To make the SRC database complete for the preceding year,
the January shipment is supplemented with the updates, gaps and corrections made to the Thomson
indexes during the first quarter of the year; this supplement is loaded into the SRC database in April
each year. With this addition, the publication numbers for the previous year are close to complete.
The publication data is parsed from the delivered text files and loaded into a relational database by
means of an in‐house developed loader program. Citation counts and reference values for field
and journal normalisation of citation counts are then calculated and stored in the database.
3 Data properties
This section describes the general properties of the bibliometric data as it is delivered from Thomson
Reuters.
3.1 Journals
The Thomson data is organised around journal issues, rather than publications or journal titles. This
has implications for the later classification of publications into subject fields, since it is the journal
issues that are subject classified by Thomson, not journals or individual publications.
1 Certain data included herein are derived from the Science Citation Index Expanded, prepared by
Thomson Reuters®, Philadelphia, Pennsylvania, USA. © Copyright Thomson Scientific® 2006. All
rights reserved.
For each issue Thomson registers a full serial title, ISSN, publisher address, etc. As both the serial title
and the ISSN may change over the years for what is generally considered one and the
same journal, there is no consistent way of identifying journals in the Thomson data. Thomson makes its
own judgement of what is to be considered a journal and assigns each journal a unique identifier
called a sequence number.
For each issue the sequence number, the full title of the journal and four different normalised shorter
title variants are given; title_11, title_20, title_29, where the number indicates the maximum number of
characters in the title variant. There is also an ISO title and an ISSN for each issue, but about
100 000 issues lack an ISO title, and around 6000 issues lack an ISSN.
Identifier                               1982‐2008   2003‐2008   2008
Number of unique sequence numbers        23 954      15 307      10 812
Number of unique values in title_11      15 953      11 515      10 608
Number of unique values in title_20      16 020      11 623      10 619
Number of unique values in title_29      16 007      11 545      10 609
Number of unique values in full title    23 366      14 946      10 815
Table 1. The number of unique values for various potential journal identifiers in the SRC Thomson database.
A list of all journal titles indexed in the Thomson database can be found at the Thomson website1. In
January 2009, the list covered 15 717 titles.
The lack of a consistent identifier for what is called a “journal” in everyday usage calls for some
caution when doing journal‐based analyses. Tests performed by the SRC have indicated that
title_11 is the most stable and reliable identifier, and the SRC presently uses this field for identifying
journals.
3.2 Document types
Thomson classifies the publications as belonging to one (1) of a number of document types (around
30), the most important being articles, reviews, letters, and meeting abstracts. See appendix 8.2 for a list of
document types, together with the number of records and the share for each document type in the
SRC database. Since the products that the SRC licenses from Thomson do not include the ISI Conference
Proceedings index, the SRC database and analyses do not presently2 include the Thomson document
type Proceedings Papers.
Bibliometric analyses including citation level normalisations (i.e. comparisons to other publications of
equal type) take the document type into account, since different document types tend to have different
citation characteristics.
Of the document types listed in appendix 8.2, Chronology and Note are no longer used separately
but are included in the type Article. Document type Discussion is included in the type Editorial Material
from 1996 onwards.
See also 4.1 for details on how SRC treats document types.
3.3 Subject fields
The subject classification in the SRC database is primarily based on the Thomson classification system
that assigns different scientific field categories to journal issues. The number of fields used in the
Thomson database has varied over the years; presently there are 255 different subject fields in the
database, of which 247 have been used after year 2000. The Thomson subject fields are listed in
appendix 8.3, together with the number of publications in each category in the SRC database.
1 http://scientific.thomson.com/mjl/
Each journal issue is classified by Thomson as belonging to one or several (maximum 7) of these 255
scientific fields. For “regular” journals the fields used for the issues tend to stay the same over time,
but for monograph series or report series the classification may vary from issue to issue, which can
result in a serial title being classified as belonging to more than 30 different fields.
3.4 Addresses
The SRC bibliometric database contains about 56 million author or reprint addresses. 10 million of the
addresses can be considered duplicates, generated by Thomson’s double registration of author and
reprint addresses, whose methodology has varied over the years; see “Reprint addresses” below.
The 660 000 Swedish addresses1 in the SRC database have been refined using a set of
address‐to‐organisation matching rules. The procedure for the address matching is described in
section 4.4 Address matching under Data preparation.
The addresses for the publications in the Thomson database are of two types: author addresses and
reprint addresses. A reprint address contains name, affiliation and address details for the corresponding
author of the publication. Before 1998 the address of the corresponding author was only registered in
the database as a reprint address, but beginning in 1998 the address of the corresponding author is
registered both as an author address and as a reprint address, resulting in a number of duplicate
addresses in the database. See section 4.3 for details on how SRC handles these duplicates.
3.5 Author names
The SRC bibliometric database contains 88 million author name entries. Author names are not
database normalised (merged), which means that each occurrence of an author name is recorded as a
separate entry, even if the name is exactly the same as a previous entry. The reason not to merge
identical author names is that there is no identifier of persons in the input data, and therefore there is
no way to tell if two identical author names denote the same one or two different persons (a
homonym).
The author names are registered in the Thomson database using the form Lastname, Initial(s). No
diacritics are registered in the database, so for instance the Swedish characters å, ä, ö end up as a, a, o
(on a few occasions older transliteration rules are used, turning for instance ä into ae and ö into oe).
Varying transliteration, inconsistent use of initials and mix‐ups of middle names and given names,
together with the homonym problem mentioned above, often cause problems when trying to locate
the publications of a specific researcher.
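As an illustration of the name form described above, the following minimal Python sketch builds a "Lastname, Initial(s)" string and strips diacritics. The function name `thomson_style_name` and the choice to concatenate initials without periods are assumptions made for this example; the actual Thomson normalisation rules are not specified in this document.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, so that e.g. å, ä, ö become a, a, o."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def thomson_style_name(last_name, given_names):
    """Hypothetical helper: build 'Lastname, Initial(s)' without diacritics.

    Assumption: initials are concatenated without periods or spaces.
    """
    initials = "".join(part[0].upper() for part in given_names.split())
    return f"{strip_diacritics(last_name)}, {strip_diacritics(initials)}"


print(thomson_style_name("Sjöström", "Åsa Maria"))
```

Two different researchers can map to the same string under such a scheme, which is the homonym problem mentioned above.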
3.5.1 Reprint/corresponding author
Each author record is given a number of attributes, of which reprint author is one. The reprint author
name is also stored as a part of the reprint address. The reprint/corresponding author of a publication
may therefore be retrieved either by fetching the author record marked as reprint author or by
extracting the author name from the reprint address information.
3.5.2 Corporate authors
The Thomson data files also contain an author type marked as corporate authors, denoting the name of
a group of authors belonging to a large researcher community or an organisational body. In the SRC
database corporate author names are being stored and used in the same way as ordinary author
names.
3.5.3 Authors and affiliations
In the source files delivered by Thomson Reuters each publication record contains a list of author
names and a list of author addresses. However, for records entered in the Thomson system before
2008 there is no indication of which addresses relate to which authors, except for the
corresponding author. As author names and author addresses are processed in sequence according to
the information from the journal issue, a relation between the first author name and the first author
address can be presumed, but Thomson gives no guarantee of that.
As of the second half of 2008, the delivered records contain an indication of which author names and
addresses belong together.
The reprint address also contains an author name, and therefore this information can be used as an
identifier of an author‐affiliation relation.
Details about author‐affiliation relations are not presently used in the calculation of any of the regular
indicators that SRC produces.
3.6 References
The reference lists of the publications in the SRC bibliometric database contain about 560 million
references (outbound citations) to other publications.
Each of the publication records in the Thomson database contains a list of the publication’s references
to other publications. These reference records contain, among other things, a generated reference ID
number (RefNumberID), built by a special Thomson algorithm from selected key data of the referred
publication. The same algorithm is also used to generate a unique identifier for all registered
publications in the Thomson database, but here it is called the ItemNumberID.
By matching the RefNumberIDs from the reference lists of newly registered publications with the
ItemNumberIDs of previously registered publications, the number of references to each publication
(i.e. the number of inbound citations) can be calculated. Since ItemNumberIDs are generated for all
publications entered in the database, there is no problem if the chronological order of the referring
publication and the cited publication is reversed (i.e. the referring publication is entered in the
database before the cited publication). When the matching algorithm is run, a match will occur
anyhow.
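The matching described above can be sketched as a simple lookup of reference IDs against item IDs. This is a minimal illustration, not the SRC implementation; the function name and the in-memory data structures are assumptions, and the real matching runs in a relational database.

```python
from collections import Counter


def count_inbound_citations(item_ids, reference_lists):
    """Count inbound citations by matching RefNumberIDs to ItemNumberIDs.

    item_ids: set of ItemNumberIDs of all publications in the database.
    reference_lists: dict mapping a citing ItemNumberID to the list of
        RefNumberIDs found in its reference list.
    Returns (citation counts per ItemNumberID, number of unmatched references).
    """
    counts = Counter({item_id: 0 for item_id in item_ids})
    unmatched = 0
    for ref_ids in reference_lists.values():
        for ref_id in ref_ids:
            if ref_id in item_ids:
                counts[ref_id] += 1
            else:
                unmatched += 1  # reference to a source outside the database
    return counts, unmatched


items = {"A", "B", "C"}
refs = {"B": ["A", "X"], "C": ["A", "B"], "A": ["C"]}
counts, unmatched = count_inbound_citations(items, refs)
```

Because the lookup is against the complete set of item IDs, the order in which citing and cited publications were loaded does not matter, as noted above.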
Almost half (47%) of the references in the SRC database point to sources external to the SRC
database, i.e. their generated RefNumberIDs do not match any item in the database. These non‐
matched references usually refer to conference proceedings, reports, books, web‐published material or
journals not covered by the Thomson database.
See also section 4.6 for details on how the SRC deals with references.
4 Data preparation
After the tagged text data for the Thomson publication records has been loaded into the database, a
number of SRC‐specific refinements and calculations are performed to generate the underlying data
needed for indicator calculations.
4.1 Re-classification of publication type Letter
At the SRC bibliometric analyses are mostly based only on the document types article and review. Since
several fields, the Thomson document types are re‐classified, so that types article and letter (and the
previously used document types note and chronology) are joined into a new SRC‐specific document
type; article, covering 70% of the publications in the database.
The reason for not treating letters as a separate document type is that grouping letters into
subject categories tends to generate groups too small for stable statistics, and that the average citation
level for letters tends to be rather low (even though there are well‐cited letters). So at the SRC, letters
are grouped together with original articles and compared to the citation rates of those when doing
citation normalisations.
4.2 Subject field classification and fractionalisation
The SRC methodology uses the Thomson fields of the journal issue as a proxy to classify the
individual publications, so that each publication is classified as belonging to 0 to 7 scientific fields
via the issue it is published in. During this mapping process, an SRC reclassification of publications in
journal issues classified as multidisciplinary is also applied, see 4.2.1 below.
The subject field mapping results in a number of links between Thomson subject fields and
publications, and in a count of the number of fields used for classification, which is stored in each
publication record. The information about the fields and the number of fields is used for subject
fractionalisation in later calculations of citation reference values and citation indicators.
4.2.1 Re-classifying publications in Thomson subject field Multidisciplinary Sciences
Thomson Reuters uses the subject tag Multidisciplinary Sciences for journals that contain papers from
many different scientific fields, such as PNAS, Science and Nature. This poses a well‐known problem
for bibliometric analyses involving field relative indexes, since this means that papers in these journals
are not compared to other papers of their ‘true’ field, but rather to other papers in journals classified
as Multidisciplinary Sciences. For example, an article titled The origins of Afroasiatic was published in 2004 in the journal Science, which is tagged as Multidisciplinary Sciences, and by the end of 2008, this
article had accumulated 5 citations. Since the average citation count for articles in that field that year
was 2.13, the field normalised citation rate for this article was 2.35. However, if this article had been
published in Journal of the Royal Asiatic Society, which is tagged as Asian Studies, it would have had a
field normalised citation rate of 14.3, since the average citation count for Asian Studies articles that year
was 0.35.
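The arithmetic in this example can be reproduced with a few lines of Python. The sketch assumes that the field normalised citation rate of a single article is its citation count divided by the average citation count of its field, which is how the figures above are obtained; the function name is hypothetical.

```python
def field_normalised_rate(citations, field_average):
    """Citations to one article divided by the field's average citation count."""
    return citations / field_average


as_multidisciplinary = field_normalised_rate(5, 2.13)
as_asian_studies = field_normalised_rate(5, 0.35)
print(round(as_multidisciplinary, 2), round(as_asian_studies, 1))  # 2.35 14.3
```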
This problem is aggravated by the fact that the journals tagged with Multidisciplinary Sciences are typically
prominent journals with high average citation levels. An article published in any of these journals
is thus likely to receive a lower field normalised citation rate than would have been the case if it had
been published in a journal with more specific subject tagging.
In an effort to find more article‐relevant scientific field classifications for publications in journals
classified as Multidisciplinary Sciences, the publications in these journals are being re‐classified by an
in‐house developed algorithm, using the publication’s references and the (inbound) citations to the
publication itself. The algorithm is based on the assumption that the Thomson subject fields of the
articles referred to in reference lists and of the citing articles indicate which subject areas the
referring/referred article belongs to. The classification algorithm is described in detail in a separate
article from the SRC (Gunnarsson et al. 2009).
Using the SRC reference‐citation‐based reclassification method 50% of the publications in journals
classified as multidisciplinary can be classified as belonging to other subject areas. For articles
published after year 2001, 90% are moved from multidisciplinary to other subject areas using this
method. The re‐classification produced by this method is now included as a standard procedure in the
4.3 Address deduplication
As described in section 3.4 above, the set of reprint addresses and the set of author addresses partly
overlap. This, in combination with the fact that the SRC often uses address fractionalisation, means
that there is need for a method to identify and remove address duplicates.
Originally, the production year could be used to identify the records for which the reprint address is a
duplicate (before or after 1998), but since records older than 1998 have been added or amended
according to the new method, this method does not work anymore. After discussions with Thomson
Reuters and after detailed tests, SRC settled on the following criteria for identifying address
duplicates:
• The reprint address of a publication record is a duplicate
o if there is an author address that has the same organisation, sub‐organisation, city,
state and country
o or if the organisation and sub‐organisation fields of the reprint address are NULL and
there is at least one author address where those fields are not NULL in the publication
record.
• An author address of a publication is a duplicate if there is another author address in the
publication record with the same organisation, sub‐organisation, city, state and country.
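The criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SRC code: the `Address` type and function names are invented for the example, NULL is represented by `None`, and "those fields are not NULL" is read as both organisation and sub‐organisation being non‐NULL.

```python
from typing import NamedTuple, Optional


class Address(NamedTuple):
    kind: str  # "author" or "reprint"
    organisation: Optional[str]
    sub_organisation: Optional[str]
    city: Optional[str]
    state: Optional[str]
    country: Optional[str]


def compare_key(address):
    """The five fields that are compared when looking for duplicates."""
    return address[1:]


def deduplicate(addresses):
    """Return the addresses of one publication record with duplicates removed."""
    kept = []
    author_keys = set()
    # An author address is a duplicate of an earlier identical author address.
    for address in addresses:
        if address.kind == "author" and compare_key(address) not in author_keys:
            author_keys.add(compare_key(address))
            kept.append(address)
    has_named_author = any(
        a.organisation is not None and a.sub_organisation is not None for a in kept
    )
    for address in addresses:
        if address.kind != "reprint":
            continue
        if compare_key(address) in author_keys:
            continue  # same organisation, sub-organisation, city, state, country
        if (address.organisation is None and address.sub_organisation is None
                and has_named_author):
            continue  # empty reprint address next to a fully named author address
        kept.append(address)
    return kept


record = [
    Address("author", "Univ Lund", "Dept Biotechnol", "Lund", None, "Sweden"),
    Address("author", "Univ Lund", "Dept Biotechnol", "Lund", None, "Sweden"),
    Address("reprint", "Univ Lund", "Dept Biotechnol", "Lund", None, "Sweden"),
    Address("reprint", None, None, "Lund", None, "Sweden"),
]
unique = deduplicate(record)
```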
4.4 Address matching
The addresses supplied by the journal are parsed into several fields by Thomson:
• Complete address as supplied by publisher
• Organisation
• Sub‐organisation (as department)
• Postal code
• Street address
• City
• State or region
• Country
The addresses supplied by the authors may be formatted in a number of different ways, not always
with the main organisation name given as the first instance of the address, which can make it difficult
for Thomson to identify the main organisation, especially for organisations in non‐English speaking
countries. The organisation name may also be given in various synonym forms, especially if the name
is translated ad hoc to English from a non‐English name form.
Example 1
Full address      Tufts Univ, Dept Mech Engn, Medford, MA 02155 USA
Organisation      Tufts Univ
Sub‐organisation  Dept Mech Engn
City              Medford
State             MA
Country           USA
Street            NULL
Postal code       02155 AP
Example 2
Full address      Karolinska Inst, Huddinge Univ Hosp, Dept Immunol Microbiol Pathol & Infect Dis, Div Clin Bacteriol, S‐14186 Huddinge, Sweden
Organisation      Karolinska Inst
City              Huddinge
State             NULL
Country           Sweden
Street            Huddinge Univ Hosp, Dept Immunol Microbiol Pathol & Infect Dis, Div Clin Bacteriol
Postal code       S‐14186 BC
Example 3
Full address      Univ Lund, Ctr Chem, Dept Biotechnol, S‐22100 Lund, Sweden
Organisation      Univ Lund
Sub‐organisation  Ctr Chem; Dept Biotechnol
City              Lund
State             NULL
Country           Sweden
Street            NULL
Postal code       S‐22100 BC
Example 4
Full address      Inst Microbiol, Dept Clin Immunol, Goteborg, Sweden
Organisation      Inst Microbiol
Sub‐organisation  NULL
City              Goteborg
State             NULL
Country           Sweden
Street            Dept Clin Immunol
Postal code       NULL
Fig 2. Four examples of Thomson address records. The second record is an example of two different organisations
written in the same address and the fourth record is an example of a Swedish address with a hard‐to‐identify
organisation field.
The variations in address formatting and naming of organisations lead to problems when trying to
identify the publications of an organisation while doing a bibliometric analysis. The SRC has therefore
developed a semi‐automatic address‐to‐organisation matching process to improve the identification of
publications from Swedish organisations.
The address matching is based on 17 000 rules containing unique strings that match text strings in the
various Thomson address fields and link the 660 000 addresses that are designated with country
SWEDEN to 600 Swedish organisation records. The organisation records also contain information about
the type of organisation; for instance university, university college, university hospital, hospital, company,
organisation, museum, governmental, etc.
The matching process is performed in a sequence of steps, beginning with the most straightforward
synonym matching algorithms, followed by more intricate string matching rules:
1. The Thomson organisation field is searched for unique variants of Swedish organisation
names. The 4 600 rules of this step match 610 000 Swedish addresses (around 90%).
2. The Thomson organisation and Thomson city fields are searched in conjunction for
combinations making up unique identifications of Swedish organisations. The 9 000 rules
of this step match 25 000 addresses.
3. A set of 500 unique string identifiers for Swedish universities and university colleges is
matched against the full address field. This procedure also picks up addresses where
several Swedish university names have been written into the same address and therefore
produces some address duplicates (that later are handled as split addresses). This step
matches around 42 000 addresses.
4. The Thomson organisation field is searched for unique identifying strings, and the
procedure is also repeated without the inclusion of the Thomson city field to handle
organisations located in several cities. These steps match approximately 13 000 addresses.
5. The full address field is searched for unique identifying strings and combination of strings
once more, this time trying to catch residuals from previous matches. This matches about
800 addresses.
6. A number of addresses that have been manually linked to organisations or considered as
unidentifiable are being processed using the whole address field as identifier. This is used
for 200 addresses.
7. Names of joint university organisations are being searched in Thomson organisation and
Thomson city fields in two steps and split addresses for those are produced. This
produces around 800 organisation matches.
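The stepwise matching above can be sketched as a cascade in which each later step only handles addresses that earlier steps left unmatched. The sketch below covers simplified versions of steps 1 to 3 only. All rule tables, organisation names and the function name are invented for illustration; the actual SRC rule set of 17 000 rules is not reproduced in this document.

```python
# Hypothetical rule tables, for illustration only.
ORG_RULES = {"Univ Lund": "Lund University"}
ORG_CITY_RULES = {("Inst Technol", "Lund"): "Lund University"}
FULL_ADDRESS_RULES = {
    "Karolinska Inst": "Karolinska Institutet",
    "Huddinge Univ Hosp": "Huddinge University Hospital",
}


def match_address(address):
    """Return the set of organisations matched for one parsed address.

    address: dict with the keys 'organisation', 'city' and 'full_address'.
    """
    # Step 1: the organisation field is a known unique name variant.
    organisation = address["organisation"]
    if organisation in ORG_RULES:
        return {ORG_RULES[organisation]}
    # Step 2: the organisation and city fields together identify an organisation.
    pair = (organisation, address["city"])
    if pair in ORG_CITY_RULES:
        return {ORG_CITY_RULES[pair]}
    # Step 3: identifying strings in the full address. Several may match,
    # in which case the address is split between organisations.
    full = address["full_address"]
    return {name for needle, name in FULL_ADDRESS_RULES.items() if needle in full}


lund = {"organisation": "Univ Lund", "city": "Lund",
        "full_address": "Univ Lund, Ctr Chem, Dept Biotechnol, S-22100 Lund, Sweden"}
joint = {"organisation": "Unknown", "city": "Huddinge",
         "full_address": "Karolinska Inst, Huddinge Univ Hosp, Huddinge, Sweden"}
```

An empty result set corresponds to an address that would be passed on to the later steps or to manual handling.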
The result of the address matching process is a relational database table with 690 000 links1 between
Swedish organisations and their publications. The links cover over 99%2 of the Swedish publications
in the SRC database and work to get 100% coverage is ongoing.
The matching process allows a single address to be matched to more than one organisation, which in
practice means that addresses are split. Because of this all statistics based on or including the number
of addresses per publication are updated.
For Denmark, Norway and Finland a first rudimentary normalisation of organisation names for
universities, university colleges and university hospitals has been performed and there are plans to
extend this normalisation to a rule matching system of the same type as for Swedish addresses in the
future.
4.4.1 Country name normalisation
For all 56 million addresses in the SRC database a normalisation of country names is performed by
using a system of 353 rules for matching name variants to 246 standardised3 country names. For
instance ”Scotland” and ”England” get normalised to ”United Kingdom”, ”Cambodia” and ”Khmer
Republic” map to ”Cambodia”; and ”Fed Rep Ger” and ”Ger Dem Rep” to ”Germany”.
No attempt is made to correct possible errors in country assignment (organisation names assigned to the
wrong country), but there are plans for such a procedure in the future.
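A rule table of this kind can be sketched as a plain lookup. The six rules below are the examples given above; the case‐insensitive exact matching and the fallback of returning unmatched names unchanged are assumptions made for the sketch.

```python
# Name variant -> standardised country name (examples from the text above).
COUNTRY_RULES = {
    "SCOTLAND": "United Kingdom",
    "ENGLAND": "United Kingdom",
    "CAMBODIA": "Cambodia",
    "KHMER REPUBLIC": "Cambodia",
    "FED REP GER": "Germany",
    "GER DEM REP": "Germany",
}


def normalise_country(raw_name):
    """Map a country name variant to its standardised form.

    Assumption: matching ignores case and surplus whitespace, and names
    without a rule are returned unchanged.
    """
    key = " ".join(raw_name.upper().split())
    return COUNTRY_RULES.get(key, raw_name)


print(normalise_country("Scotland"))
```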
4.5 Address counting and fractionalisation
Since almost all indicators calculated by the SRC are being weighted based on the analysed unit’s
share of the addresses in the publication, the number of addresses in each publication needs to be
counted and stored as a part of the publication record.
After the address matching the total number of addresses is stored as a part of each publication record
and used as a denominator when calculating fractionalised publication counts and weighted citation
indicators.
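Address fractionalisation can be illustrated with a short sketch in which each publication contributes 1/n to an organisation for every one of its n addresses that belongs to that organisation. The input format and function name are assumptions made for the example.

```python
from collections import defaultdict


def fractional_counts(publications):
    """Fractionalised publication counts per organisation.

    publications: list of publications, each given as the list of
        organisations behind its (deduplicated) addresses.
    """
    totals = defaultdict(float)
    for addresses in publications:
        if not addresses:
            continue  # a publication without addresses contributes nothing
        for organisation in addresses:
            totals[organisation] += 1.0 / len(addresses)
    return dict(totals)


fractions = fractional_counts([["A", "A", "B"], ["A"], ["B", "C"]])
```

In this example organisation A receives 2/3 + 1, B receives 1/3 + 1/2 and C receives 1/2, so the fractions sum to the total number of publications (3).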
One rationale for using address fractionalisation, and for using the fractions to weight citation averages,
is the correlation between the number of addresses and the number of citations. Publications with
many authors usually attract more citations, even disregarding self‐citations. If citation values for
publications were attributed to organisations without any form of fractionalisation and weighting, a
sort of inflation in citations would be created, since highly‐cited publications would be attributed to
more organisations than lower‐cited publications.
1 The number of links between organisations and publications is greater than the number of addresses,
due to joint organisations and several organisations contained within the same address.
2 Measured September 2009.
The SRC presently does not use any kind of fractionalisation based on author names and/or the number
of authors of the publications, since there presently are no links between authors and affiliations, and
the SRC rarely does analyses on individual researchers.
4.6 Reference adjustments
As mentioned above Thomson uses an elaborate mechanism for creating links between referring and
cited documents, but the SRC does some adjustments to this citation matching mechanism while
counting citations for publications.
4.6.1 Split references
The Thomson algorithm that builds the RefNumberIDs and the ItemNumberIDs is not perfect;
sometimes different publications can get the same ItemNumberIDs1. Of the 560 million references in
the SRC database about 160 000 (0.2‰) are duplicates, i.e. the same reference ID number points to two
different publication records. When this situation occurs there is no way to automatically differentiate
between the two records that are being referred to and the citation to these publications is therefore
split between them. The SRC database thus contains a few publications with a non‐integer number of
citations.
4.6.2 Duplicate references
In some scientific areas there are citing traditions that lead to a publication occurring several times in the
same reference list. The references may for instance point to different pages of the referred publication.
In the SRC database, these duplicate references are only counted once, i.e. the same
RefNumberID may occur only once in each reference list.
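The two adjustments in 4.6.1 and 4.6.2 can be sketched together. The code below is an illustration only: the data structures are invented, and splitting a citation equally between the records that share an ID is an assumption, since the text only states that the citation is split between them.

```python
from collections import defaultdict


def count_adjusted_citations(reference_lists, id_to_publications):
    """Count citations with duplicate references removed and ID collisions split.

    reference_lists: dict mapping a citing publication to its list of
        RefNumberIDs (the same ID may occur more than once).
    id_to_publications: dict mapping an ID to the list of publication
        records that share it (normally exactly one).
    """
    citations = defaultdict(float)
    for ref_ids in reference_lists.values():
        for ref_id in set(ref_ids):  # each ID counts once per reference list
            targets = id_to_publications.get(ref_id, [])
            for publication in targets:
                # Assumption: equal split between records sharing the ID.
                citations[publication] += 1.0 / len(targets)
    return dict(citations)


adjusted = count_adjusted_citations(
    {"p1": ["R1", "R1", "R2"], "p2": ["R1"]},
    {"R1": ["a"], "R2": ["b", "c"]},
)
```

Here publication a receives 2 citations, while b and c each receive 0.5, which shows how non‐integer citation counts arise.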
4.7 Citation windows
Citations to publications can be measured using different citation time windows. The citation window
stipulates from which publication years publications are to be searched for references when gathering
citations to a publication published a certain publication year. For instance, a three year citation
window used for a publication published year 2000 will mean that references from publications
published the same year as the cited publication (2000) and two succeeding years (2001‐2002) are
counted as citations. A citation window of six years for the same publication will count references
from publications published years 2000‐2005. An open citation window will count references from all
publications published up to the time the analysis is performed. For a publication published year 2000
and analysed year 2009, references from publications published years 2000‐2009 would then be
counted as citations.
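The window definitions above can be expressed in a few lines of Python. The function name and input format are assumptions made for this sketch, as is the choice to ignore citing publications dated earlier than the cited publication.

```python
def citations_in_window(publication_year, citing_years, window=None):
    """Count citations inside a citation window.

    citing_years: publication years of the citing publications.
    window: window length in years (e.g. 3 or 6), or None for an open window.
    Assumption: citing years before publication_year are not counted.
    """
    if window is None:
        return sum(1 for year in citing_years if year >= publication_year)
    last_year = publication_year + window - 1
    return sum(1 for year in citing_years if publication_year <= year <= last_year)


citing = [2000, 2001, 2002, 2003, 2005, 2009]
cw3 = citations_in_window(2000, citing, window=3)  # years 2000-2002
cw6 = citations_in_window(2000, citing, window=6)  # years 2000-2005
cwo = citations_in_window(2000, citing)            # open window
```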
The fixed citation windows of 3 and 6 years have the advantage that citation counts for all publications
older than 3 or 6 years can be compared on like grounds, which makes them a good choice for
preparing time series of citation counts. Once the citation window for a publication has closed, its
citation count also stays the same no matter when the analysis is done; for instance, if the citations to
a publication published in 2000 are measured using a 3‐year window in 2003, 2005, 2008 or
2009, the result will be the same, since only citations from publications from 2000‐2002 are
counted.
1 Usually, this occurs when two publications have the same first author, the same title, the same
The open citation window has the advantage of gathering as much citation data as possible for
indicator calculation, and makes it possible to spot so‐called “sleeping beauties”, publications
that do not start to be cited until a number of years after their publication. But the open window
has the drawback of not being time‐neutral, since the worldwide general rate of citations increases
each year. Internal studies at the SRC have shown that there is a good correlation between normalised
citation rates calculated from 3‐year, 6‐year and open citation windows at aggregated levels.
If an analysis using a fixed 3‐year or 6‐year citation window includes publications that are younger
than 3 or 6 years, respectively, the window will act as an open window for those publications, i.e. all
citations from the publication date up to the analysis date will be included in the citation count. Since
the same will be valid for the publications included in the citation reference value for the same year
(the denominator used for normalisation) the deviation from the fixed window will not cause any
major differences in the normalised citation rate.
The SRC pre‐calculates citation counts for all publications in the database using a 3‐year window
(Cw3), a 6‐year window (Cw6) and an open window (Cwo).
4.8 Self citations
References between publications are usually considered to reflect some kind of scientific recognition
and the number of citations a publication receives can thus be said to be a measure of the amount of
scientific recognition it has gained. But if a researcher refers to his or her own previous work, this is not a
measure of recognition from the rest of the research community. Therefore it is customary to try to
remove these self citations from the citation counts in bibliometric studies.
The SRC method for removing self citations is based on all author names in both the referring and the
cited publication. If any of the author names in Thomson format (lastname + initials) in the author list
of the referring publication is found in the author list of the cited publication, the citation is considered
to be a self‐citation. No attempt is made to differentiate homonyms (different researchers sharing the same
Thomson form of name), and there is no separate rule for publications with long author lists.
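The matching rule can be illustrated with a minimal Python sketch. The function name and the normalisation of case and whitespace are assumptions made for the illustration and are not taken from the SRC database; the author names are invented.

```python
def is_self_citation(citing_authors, cited_authors):
    """Return True if any author name occurs in both author lists.

    Names are expected in Thomson format (last name + initials).
    Upper-casing and whitespace collapsing are assumptions of this sketch.
    """
    def normalise(name):
        return " ".join(name.upper().split())

    citing = {normalise(name) for name in citing_authors}
    cited = {normalise(name) for name in cited_authors}
    # One shared name is enough; homonyms are not separated.
    return not citing.isdisjoint(cited)


# Invented example: the two publications share the author "LEE K".
print(is_self_citation(["SMITH J", "LEE K"], ["Lee K", "BROWN A"]))
```

Because homonyms are not separated, two different researchers who share the same Thomson form of name would also be counted as a self‐citation by such a rule.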
Using this process to identify self‐citations, the SRC calculates values for the number of citations
including self‐citations (Csci), and the number of citations excluding self‐citations (Cscx) for all three
citation windows and stores these pre‐calculated values for each publication in the database. Each
publication will thus have attributes for six citation values: Csci,w3, Csci,w6, Csci,wo, Cscx,w3, Cscx,w6 and Cscx,wo.
4.9 Citation reference values for Thomson subject fields (μf)
After the various citation values have been calculated for all publications in the database, the basic
data are in place to start to calculate the field reference values used for field normalisation of citations.
The average number of citations for publications of a certain document type, from a certain year, in a certain
field is called the Field Reference Value and is denoted μf or FRV¹.
The field reference value is calculated for each combination of subject field, publication year and
document type according to the following formula:
$$\mu_f = \frac{\sum_{i=1}^{P} \frac{C_i}{S_i}}{\sum_{i=1}^{P} \frac{1}{S_i}} \qquad [1]$$
where:
μf = weighted average citation rate for field, year and document type
P = the number of publications of the studied document type the studied year classified as
belonging to the subject field in question
Ci = the number of citations to publication i
(according to separately specified citation window and self‐citation handling)
Si = the number of subject fields the publication i has been classified as belonging to
¹ The Field Reference Value has previously been called the Field Citation Score, denoted as FCS.
The use of the number of fields each publication is classified in (Si) in the denominator of both of the
sums in the formula means that the average citation values are based on publication subject
fractions, and the resulting field reference value (μf) will be a weighted average¹ with regard to the
publications.
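As an illustration of formula [1], the following Python sketch computes a field reference value for one combination of field, year and document type. The function name and the input layout (pairs of citation count and number of fields) are assumptions made for the example, and the numbers are invented.

```python
def field_reference_value(publications):
    """Weighted average citation rate for one field, year and document type.

    publications: iterable of (C_i, S_i) pairs, where C_i is the citation
    count and S_i the number of subject fields of publication i.
    """
    publications = list(publications)
    weighted_citations = sum(c / s for c, s in publications)
    weights = sum(1 / s for _, s in publications)
    return weighted_citations / weights


# Invented data: three publications classified in 1, 2 and 5 fields.
sample = [(10, 1), (4, 2), (0, 5)]
print(field_reference_value(sample))  # (10 + 2 + 0) / (1 + 0.5 + 0.2)
```

The publication classified in five fields contributes only a fifth of a publication to this field's average, which is what makes the result a weighted average.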
Publications without any subject field classification are excluded from the calculation, as
are publications without any author addresses, since the latter cannot be a part of any country or
organisation analysis, and should therefore not be a part of the reference value for the field.
When calculating the field reference values all references to the publications are counted as potential
incoming citations, regardless of publication year, document type or field, before being filtered by
conditions regarding citation window and self‐citations.
At the SRC the field reference values are calculated for all of the six variants of citation values for each
publication; with (sci) or without (scx) self‐citations, with citation windows of 3 (w3) or 6 (w6) years, or
an open window (wo). The calculated values are denoted μf[sci,w3], μf[sci,w6], μf[sci,wo], μf[scx,w3], μf[scx,w6] and μf[scx,wo] respectively, and stored in a separate field reference value table in the database as a foundation
for the later calculation of the field normalisation of citation rates.
Since the field normalised citation rate is the most commonly used indicator, pre‐calculated weighted
average field normalised citation values for all the combinations of citation windows and self‐citation
filtering are stored as a part of each publication record in the database. The calculated values are
denoted cf[sci,w3], cf[sci,w6], cf[sci,wo], cf[scx,w3], cf[scx,w6] and cf[scx,wo] respectively. See section 6 (Indicators) below for a
description of how the field normalised citation rate is calculated at the SRC.
4.10 Citation reference values for journals (μj)
Sometimes it can be of interest to study how much a publication has been cited in relation to
publications of the same document type and the same publication year in the same journal, an indicator
called the journal normalised citation rate. To be able to do this, a mean citation value for each document
type, year and journal has to be calculated. This value is called the journal reference value and is denoted μj or JRV².
For each journal, identified by the field title_11 in the journal issue records, the journal reference
values are calculated according to the following formula:
$$\mu_j = \frac{\sum_{i=1}^{P} C_i}{P} \qquad [2]$$
where:
μj = the journal reference value
Ci = the number of citations to publication i
(according to separately specified citation window and self‐citation handling)
P = the number of publications of the studied document type the studied year published in the
journal in question
¹ A publication classified in only one field will have a fraction weight of 1, whereas a publication
classified in 5 different fields will only have a fraction weight of 0.2 for the calculation of each of the
field reference values.
At the SRC the journal reference values are calculated for all of the six variants of citation values for
each publication; with (sci) or without (scx) self‐citations, with citation windows of 3 (w3) or 6 (w6)
years, or an open window (wo). The calculated values are denoted μj[sci,w3], μj[sci,w6], μj[sci,wo], μj[scx,w3], μj[scx,w6] and μj[scx,wo] respectively, and stored in a separate table in the database for later calculation of
journal normalised citation rates.
Publications without any author addresses are excluded from the calculation of the journal reference
value, since these cannot be a part of any country or organisation analysis, and should therefore not be a
part of the reference value for the journal.
4.11 Citation percentile threshold values for fields (τf)
As a complement to the study of publication citation rates in relation to mean values, it can be of
interest to study how many and what share of publications are cited more than a specified percentile
threshold of a subject field. Commonly used percentile thresholds are 90%, 95% and 99%, which
indicate that a publication is among the 10%, 5% or 1% most cited in a field if it has yielded more
citations than the corresponding percentile threshold value.
The percentile threshold values are defined and calculated using the same conditions as the field
average reference values, i.e. publications of the same document type, the same publication year
and the same Thomson subject field are grouped to calculate the percentile values. Publications
without any subject field classifications are excluded from the calculation, as well as publications
without any author addresses.
At the SRC the percentile thresholds for the fields are calculated on subject fractions of the
publications, which means that each publication gets a fraction weight in inverse proportion to the
number of fields it is classified in when the shares of publications are summed up.
The percentiles are calculated by sorting the publications in order of the number of citations to each
publication. Each publication is assigned a fraction weight that is the inverse of the number of fields it
is classified in. Then groups of publications containing 90%, 95% and 99% of the total number of
weighted publication fractions are extracted and the number of citations to the publication at the top
of the list is noted as the corresponding percentile threshold value. If the number of (weighted)
publications is such that the 90% (or 95% or 99%) limit goes through a publication (and not between
two publications), the average citation count for the two publications on both sides of the limit is used
as the percentile threshold value¹.
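The weighted percentile procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and input layout are hypothetical, exact fractions are used to avoid rounding problems, and the handling of the boundary case (averaging the two neighbouring citation counts when the cumulative weight lands exactly on the limit) is an assumption of the sketch and may differ from the definition applied in the SRC's SAS procedure.

```python
from fractions import Fraction


def percentile_threshold(publications, percentile):
    """Weighted citation threshold for one field, year and document type.

    publications: list of (citations, number_of_fields) pairs.
    percentile: integer such as 90, 95 or 99.
    """
    ranked = sorted(publications)  # ascending by citation count
    weights = [Fraction(1, fields) for _, fields in ranked]
    target = Fraction(percentile, 100) * sum(weights)
    cumulative = Fraction(0)
    for index, ((citations, _), weight) in enumerate(zip(ranked, weights)):
        cumulative += weight
        if cumulative == target and index + 1 < len(ranked):
            # Assumed boundary rule: average the counts on both sides.
            return (citations + ranked[index + 1][0]) / 2
        if cumulative >= target:
            return float(citations)
    raise ValueError("no publications given")


# Invented data: ten publications with 0..9 citations, one field each.
sample = [(citations, 1) for citations in range(10)]
print(percentile_threshold(sample, 90))  # 8.5
print(percentile_threshold(sample, 95))  # 9.0
```

With ten equally weighted publications the 90% limit falls exactly between the publications with 8 and 9 citations, so the sketch returns their average; the 95% limit falls within the last publication, whose citation count is returned.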
The percentile calculation is performed for all six combinations of citation windows and self‐citation
handling for each field and the results are stored in the field reference value table in the database as 18
different threshold values, one for each combination of percentile threshold (90, 95 and 99), citation
window (w3, w6, wo) and handling of self‐citations (sci, scx). The values are denoted τ90f[sci,w3],
τ95f[sci,w3], τ99f[sci,w3], etc.
1 The calculation is made in the statistics software SAS, using the SAS standard definition for
5 Methods for analyses
Data for bibliometric analyses are extracted from the SRC database in a number of different
ways, the most common being to retrieve sets of publication data in the form of lists of publication
fractions. Each publication is split into fractions by author address and subject classification, so that a
publication with two addresses and three subject classifications is split into six fractions, one for each
combination of address and subject. The basis of the subject classification is the Thomson subject
classification of journal issues included in the tagged text files; see section 3.3 above.
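The splitting of a publication into address and subject fractions can be sketched as below. The function name, the dictionary keys and the sample values are assumptions made for the illustration; the weight of 1/(A·S) per fraction follows the weighting described for the indicators in section 6.

```python
from itertools import product


def split_into_fractions(addresses, subject_fields):
    """Return one fraction per combination of address and subject field."""
    # Each fraction carries an equal share of the publication.
    weight = 1 / (len(addresses) * len(subject_fields))
    return [
        {"address": address, "field": field, "weight": weight}
        for address, field in product(addresses, subject_fields)
    ]


# Invented publication with two addresses and three subject fields.
fractions = split_into_fractions(
    ["Univ A, Sweden", "Univ B, Norway"],
    ["Oncology", "Immunology", "Cell Biology"],
)
print(len(fractions))  # 6
```

The weights of all fractions of a publication sum to 1, so no publication counts more than once when fractions are added up.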
The final analyses are either performed with the statistical program SAS or with in‐house developed
programs and scripts. Usually the analyses are post‐processed in Microsoft Excel for tabulation and
diagrams.
5.1 Publication address/field fractions
The output from the SQL database is typically a list of relevant data for each publication to be
analysed, with one row for each combination of publication, address and subject field. Each row
usually contains the following data:
• Normalised organisation name of the Swedish address, or Thomson organisation name if non‐
Swedish address
• Normalised country name
• Full address
• Publication year
• Document type
• Subject field
• The number of subject fields in the classification of this publication
• The number of addresses for this publication
• The number of citations in 6 variants with different citation windows and self‐citations
included or removed
• Field reference values in corresponding 6 variants for the field the fraction is classified in
• Percentile thresholds for 90th, 95th, and 99th percentile with 3 different citation windows and
self‐citations excluded
• Journal reference value, open citation window, self‐citations excluded
An extraction of fractions for Swedish publications 1982‐2009 generates about 1 million rows.
Extraction of world publication data for the same period generates about 73 million rows, which
corresponds to about 27 GB of data.
6 Indicators
In this section the bibliometric indicators most commonly used at the SRC are presented,
together with descriptions of how they are calculated.
6.1 Publication counts (P)
An analysed unit’s publications may be counted either in full counts or in fractional counts based on
the unit’s share of addresses in the publication, and, in relevant cases, the number of subject fields
assigned to the publication. The number of publications in full counts is denoted Ph and the number of
fractionalised publications is denoted Pr.
A commonly used bibliometric indicator is the number of publications produced per year, which is
6.2 Share of un-cited publications (puc)
It can also be of interest to study what share of an analysed unit’s publications have not made an
impact in the scientific community, i.e. have not received any citations besides self‐citations.
The indicator is calculated according to the following formula:
$$p_{uc} = \frac{\sum_{i=1}^{R_{uc}} \frac{1}{A_i}}{\sum_{i=1}^{R} \frac{1}{A_i}} \qquad [3]$$
where:
puc = weighted average share of uncited publications attributed to the analysed unit
Ruc = the number of publication fractions belonging to uncited publications (self‐citations removed) attributed to
the analysed unit
R = the total number of publication fractions attributed to the analysed unit
Ai = the total number of addresses in the publication of fraction i
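A minimal Python sketch of this calculation is given below, assuming that each of the unit's address fractions is represented by the publication's number of addresses and its citation count with self‐citations removed. The function name, input layout and numbers are invented for the illustration.

```python
def share_uncited(fractions):
    """Address-weighted share of uncited publications for one unit.

    fractions: iterable of (A_i, citations_excluding_self_citations) pairs,
    one pair per address fraction attributed to the unit.
    """
    fractions = list(fractions)
    uncited = sum(1 / a for a, citations in fractions if citations == 0)
    total = sum(1 / a for a, _ in fractions)
    return uncited / total


# Invented data: two of three fractions belong to uncited publications.
sample = [(1, 0), (2, 3), (4, 0)]
print(share_uncited(sample))  # (1 + 0.25) / (1 + 0.5 + 0.25)
```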
6.3 Share of self-citations (csc)
The analysed unit’s share of self‐citations is easily calculated by dividing the number of self‐citations
by the total number of citations to the unit’s publications. Since the SRC calculates the weighted
average based on the unit’s share of addresses in the analysed publications, the formula gets a bit more
complicated:
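$$c_{sc} = \frac{\sum_{i=1}^{R} \frac{C_{sci(i)} - C_{scx(i)}}{A_i}}{\sum_{i=1}^{R} \frac{C_{sci(i)}}{A_i}} \qquad [4]$$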
where:
csc = weighted average share of self‐citations for the analysed publications
Csci(i) = the number of citations to the publication of fraction i, self‐citations included
Cscx(i) = the number of citations to the publication of fraction i, self‐citations excluded
R = the number of publication fractions attributed to the analysed unit
Ai = the total number of addresses in the publication of fraction i
The share of self‐citations can be calculated using different citation windows, depending on what is
the most suitable for the analysis in question.
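The calculation can be sketched in Python under the assumption that the indicator is the address‐weighted number of self‐citations divided by the address‐weighted number of citations including self‐citations, which is how the definitions above read. The function name, input layout and numbers are invented.

```python
def share_self_citations(fractions):
    """Address-weighted share of self-citations for one unit.

    fractions: iterable of (A_i, C_sci, C_scx) triples, i.e. the number of
    addresses and the citation counts with and without self-citations.
    """
    fractions = list(fractions)
    self_citations = sum((sci - scx) / a for a, sci, scx in fractions)
    all_citations = sum(sci / a for a, sci, _ in fractions)
    # Note: undefined (division by zero) if the unit has no citations.
    return self_citations / all_citations


# Invented data: 2 of 10 citations to the first publication are
# self-citations; the second publication has none.
sample = [(1, 10, 8), (2, 4, 4)]
print(share_self_citations(sample))  # 2 / (10 + 2)
```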
6.4 Field normalised citation rate (cf)
The field normalised citation rate is one of the so‐called “state‐of‐the‐art” bibliometric indicators. The
general idea of the indicator is to relate the number of citations to a publication or a group of
publications to the average citation level of a group of comparable publications of the same document
type, publication year and Thomson scientific field.
The SRC calculates its field normalised citation rate (cf) indicator using a publication fraction oriented
method, which means that the number of citations of each subject‐address fraction of a publication is
normalised against an average citation rate for the same document type, publication year and subject field.
When the final average normalised citation rate for the analysed unit’s publications is calculated, each
publication fraction is weighted by its share of all subject‐address fractions for that publication, so that
the resulting average will be a weighted average.
The SRC average cf is calculated according to the following formula:
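$$c_f = \frac{\sum_{i=1}^{R} \frac{C_i}{\mu_{f(i)}} \cdot \frac{1}{S_i A_i}}{\sum_{i=1}^{R} \frac{1}{S_i A_i}} \qquad [5]$$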
where:
cf = the weighted average field normalised citation rate
Ci = the number of citations to the publication of fraction i
(according to separately specified citation window, self‐citations removed)
μf(i) = the field reference value for the field of fraction i
R = the number of publication fractions attributed to the analysed unit
Si = the number of subject fields the publication of fraction i has been classified as belonging to
Ai = the total number of addresses in the publication of fraction i
The field normalised citation rate can be calculated using different citation windows, depending on
the situation, but at the SRC it always excludes self‐citations. Presently, the SRC does not make any
adjustments when normalising against very low‐cited fields.
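The fraction oriented calculation can be illustrated with a short Python sketch: each fraction's citation count is first divided by its field reference value, and the ratios are then averaged with the weight 1/(Si·Ai). The function name, input layout and numbers are assumptions made for the example.

```python
def field_normalised_citation_rate(fractions):
    """Weighted average field normalised citation rate for one unit.

    fractions: iterable of (C_i, mu_f_i, S_i, A_i) tuples, one per
    subject-address fraction attributed to the unit.
    """
    fractions = list(fractions)
    weighted_ratios = sum((c / mu) / (s * a) for c, mu, s, a in fractions)
    weights = sum(1 / (s * a) for _, _, s, a in fractions)
    return weighted_ratios / weights


# Invented data: one fraction cited twice the field average and one
# cited at half the field average, with equal weights of 0.5.
sample = [(10, 5.0, 1, 2), (3, 6.0, 2, 1)]
print(field_normalised_citation_rate(sample))  # 1.25
```

A value of 1 means that the unit's publications are cited at the world average for their fields, years and document types.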
Please note that even though the SRC cf indicator resembles the CWTS “crown” CPP/FCSm indicator,
it is not the same indicator. The CWTS indicator groups publications and calculates average citation
levels for both the numerator and the denominator before citation normalisation is done (Moed et al.
1995), whereas the SRC indicator does the normalisation at publication fraction level and the
averaging is done after that. This difference is described in detail by Lundberg (2007). Furthermore,
the CWTS crown does not seem to use address fraction weighting, which means that crown values
will usually be higher than cf values for the same set of publications, due to the correlation between
numbers of authors/addresses and citations discussed in the section about fractionalisation above.
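The difference between the two orders of operation can be seen in a small numerical sketch. It ignores fractionalisation entirely and uses invented citation counts and reference values; the function names are hypothetical and the code is not an implementation of either institution's actual procedure.

```python
def average_of_ratios(publications):
    """Normalise each publication first, then average (unweighted)."""
    return sum(c / mu for c, mu in publications) / len(publications)


def ratio_of_averages(publications):
    """Average citations and reference values first, then divide."""
    total_citations = sum(c for c, _ in publications)
    total_reference = sum(mu for _, mu in publications)
    return total_citations / total_reference


# Invented data: (citations, field reference value) for two publications,
# one in a highly cited field and one in a low-cited field.
sample = [(20, 10.0), (1, 2.0)]
print(average_of_ratios(sample))  # (2.0 + 0.5) / 2 = 1.25
print(ratio_of_averages(sample))  # 21 / 12 = 1.75
```

In the ratio of averages the publication from the highly cited field dominates the result, whereas in the average of ratios both publications count equally.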
6.5 Share of publications above the 90th, 95th and 99th citation percentile
If the average field normalised citation rate for an analysed unit says something about the average
impact of the unit’s publications, the share of publications cited above a certain citation percentile can
tell us something about the distribution of the impact of the unit’s publications. Is an average
normalised citation rate of 1.2 the result of a majority of well‐cited publications or a few highly‐cited
publications?
This indicator is calculated by looking at how many of a unit’s publication fractions are cited
more than the citation level for the percentile in question for the subject field the fraction is classified in. If the publication is cited more than the threshold for the field, the value is 1, otherwise it is 0. This value is called the “cited over threshold for #th percentile” value and is denoted PPRC#f(i) in the formula below.
The SRC calculation of the indicator is weighted on the analysed unit’s share of address fractions to
the publications according to the following formula:
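$$p_{PRC\#f} = \frac{\sum_{i=1}^{R} \frac{P_{PRC\#f(i)}}{S_i A_i}}{\sum_{i=1}^{R} \frac{1}{S_i A_i}} \qquad [6]$$
where:
pPRC#f = the weighted average share of publications cited above the #th percentile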
R = the number of publication fractions attributed to the analysed unit
PPRC#f(i) = the cited over threshold for #th percentile value of fraction i
(according to separately specified citation window, self‐citations removed)
Ai = the total number of addresses in the publication of fraction i
Si = the number of subject fields the publication of fraction i has been classified as belonging to
The indicator can be calculated using different citation windows, but it seldom includes self‐citations.
It is worth pointing out that the share of publications cited more than the 99th percentile is usually less
than 1%, and correspondingly for the other percentile values.
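A minimal Python sketch of the indicator is given below. It assumes that each subject‐address fraction carries its citation count, the threshold value for its field, and the numbers of fields and addresses of the publication; the function name and the data are invented.

```python
def share_above_threshold(fractions):
    """Weighted share of a unit's publications cited above a threshold.

    fractions: iterable of (C_i, tau_i, S_i, A_i) tuples, where tau_i is
    the percentile threshold of the field the fraction is classified in.
    """
    fractions = list(fractions)
    above = sum(
        (1 if c > tau else 0) / (s * a) for c, tau, s, a in fractions
    )
    total = sum(1 / (s * a) for _, _, s, a in fractions)
    return above / total


# Invented data: the first and third fractions exceed their thresholds.
sample = [(12, 10.0, 1, 1), (3, 10.0, 1, 2), (30, 25.0, 2, 1)]
print(share_above_threshold(sample))  # (1 + 0.5) / (1 + 0.5 + 0.5) = 0.75
```

A publication whose citation count equals the threshold is not counted, since the text requires it to be cited more than the threshold.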
6.6 Journal normalised citation rate (cj)
This indicator shows how an analysed unit’s publications are cited in relation to the average citation
rate for publications of the same document type and publication year in the same journal.
Since no normalisation against subject fields is performed here, the data set for calculation is only
fractionalised on addresses (not subject‐address as customary).
The indicator is calculated according to the following formula:
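$$c_j = \frac{\sum_{i=1}^{R} \frac{C_i}{\mu_{j(i)}} \cdot \frac{1}{A_i}}{\sum_{i=1}^{R} \frac{1}{A_i}} \qquad [7]$$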
where:
cj = the weighted average journal normalised citation rate
R = the number of publication fractions attributed to the analysed unit
Ci = the number of citations to the publication of fraction i
(according to separately specified citation window, self‐citations removed)
Ai = the total number of addresses in the publication of fraction i
μj(i) = the journal reference value for the publication of fraction i
The indicator can be calculated using different citation windows, but it seldom includes self‐citations.
Only publications with at least one subject classification and one address are considered in this
calculation.
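The journal normalised calculation differs from the field normalised one in that only address fractions are used as weights. A short Python sketch, with a hypothetical function name and invented numbers, could look like this:

```python
def journal_normalised_citation_rate(fractions):
    """Address-weighted average journal normalised citation rate.

    fractions: iterable of (C_i, mu_j_i, A_i) tuples, one per address
    fraction attributed to the unit.
    """
    fractions = list(fractions)
    weighted_ratios = sum((c / mu) / a for c, mu, a in fractions)
    weights = sum(1 / a for _, _, a in fractions)
    return weighted_ratios / weights


# Invented data: one publication cited twice its journal average (one
# address) and one cited at half its journal average (two addresses).
sample = [(6, 3.0, 1), (2, 4.0, 2)]
print(journal_normalised_citation_rate(sample))  # 2.25 / 1.5 = 1.5
```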
6.7 Journal to field normalised citation rate (jf)
Sometimes, it can be of interest to study the average citation rate of the journals a unit publishes in, in relation to the average citation rate of the fields the journal is classified in.
This indicator is called the journal to field normalised citation rate and is calculated according to the
following formula:
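$$j_f = \frac{\sum_{i=1}^{R} \frac{\mu_{j(i)}}{\mu_{f(i)}} \cdot \frac{1}{S_i A_i}}{\sum_{i=1}^{R} \frac{1}{S_i A_i}} \qquad [8]$$
where:
R = the number of publication fractions attributed to the analysed unit
μj(i) = the journal reference value for the publication of fraction i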
(according to separately specified citation window, self‐citations removed)
μf(i) = the field reference value for the field where fraction i is classified
(according to separately specified citation window, self‐citations removed)
Si = the number of subject fields the journal of fraction i has been classified as belonging to
Ai = the total number of addresses in the publication of fraction i
The indicator can be calculated using different citation windows, but self‐citations are seldom
included.
7 References
Gunnarsson, M.; Fröberg, J.; Jacobsson, C. & Karlsson, S. (2008) Subject classification of publications in the
Thomson database based on references and citations. Vetenskapsrådet, Stockholm.
Karlsson, S. & Wadskog, D. (2006). Hur mycket citeras svenska publikationer? Bibliometrisk översikt över
Sveriges vetenskapliga publicering mellan 1982 och 2004. Vetenskapsrådets rapportserie 13:2006.
(www.vr.se/download/18.5b5b80b310e317e3c0680001207/Bibliometrirapport)
Lundberg, Jonas (2007). Lifting the crown – citation z‐score. Journal of Informetrics, 1 (2007), 145‐154.
Moed, H. F., Debruin, R. E., & Vanleeuwen, T. N. (1995). New bibliometric tools for the assessment of
National Research Performance – Database description, overview of indicators and first
applications. Scientometrics, 33(3), 381–422.
8 Appendices
8.1 SRC denotation of parameters and indicators
8.1.1 Denotation overview
General rules guiding a suggested extendable denotation scheme:
• Absolute numbers are denoted with upper case letters (P, C).
• Relative numbers (quotients) are denoted with lower case letters (p, c).
• Reference values for normalisations are denoted using Greek characters (μ, τ)
• Index letters are used to indicate special conditions regarding the indicator or the reference
value. In situations where it is not possible to use index letters, indices may be written using
brackets or underscores, e.g. cf may be denoted as c[f] or c_f.
• Methodological aspects such as fractionalisation, weighting, averaging, normalisation level, length
of citation windows and removal of self‐citations are not indicated in the suggested denotations
where they can be considered to be an integral part of the resulting indicator. However,
where the handling of self‐citations or fractionalisation is a vital part of the resulting indicator,
they are indicated. Please see the method index below the denotation list.
• In situations where the same methods are used throughout a study and the methods are clearly
stated, methodological indices may be omitted and raw denotations such as P, C or c may be used
to make the presentation less cluttered.
8.1.2 General abbreviations
p publication
sc self‐citation
h whole counts
r fractionalised
j journal
t top
%, prc percentile
u un‐, non‐, zero
a author
y year
f field
g group
w citation window
8.1.3 Specific abbreviations
Denotation | English description | Swedish description
P | Number of publications (counted as separately defined) | Antal publikationer (beräknat enligt separat definition)
Ph | Number of publications, whole counts | Antal publikationer, utan fraktionering
Pr | Number of publications, fractionalised counts (as separately defined) | Antal publikationer, fraktionerat (enligt separat definition)
Pf#% | Number of publications cited more than the #th percentile of the field, usually 90, 95 and 99 | Antal publikationer citerade mer än #:e percentilen i fältet, vanligen 90, 95 och 99
Puc | Number of non‐cited publications | Antal ej citerade publikationer
pf#% | Share of publications cited more than the #th percentile of the field, usually 90, 95 and 99 | Andel publikationer citerade mer än #:e percentilen i fältet, vanligen 90, 95 och 99
pf50% | Share of publications cited more than the median (50th percentile) of the field | Andel publikationer citerade mer än medianen (50:e percentilen) i fältet
top#f | World‐relative share of publications cited more than the #th percentile in the field | Relativ andel publikationer citerade fler gånger än den #:e percentilen för fältet
puc | Share of non‐cited publications | Andel ej citerade publikationer
C | Total number of citations | Totalt antal citeringar
Cp | Number of citations to a single publication | Antal citeringar till en publikation
Cy | Number of citations to publications published year y | Antal citeringar till publikationer publicerade