The bibliometric database at the Swedish Research Council – contents, methods and indicators

Ulf Kronman, Magnus Gunnarsson and Staffan Karlsson

2010‐05‐11. Version 1.0


1 Introduction

The Department of Research Policy Analysis at the Swedish Research Council (SRC) maintains and

develops a database for bibliometric analyses.

This document describes the properties of the SRC bibliometric database, data preparation, analysis

tools and methods, and the resulting bibliometric indicators used at the SRC in detail.

2 Data source and database

The bibliometric analyses made at the SRC are based on scientific publication data records licensed

from the US‐based company Thomson Reuters. The database corresponds approximately to the data

that can be retrieved in the Thomson Reuters web service Web of Science. The licensed products are the

Thomson Reuters indices Science Citation Index Expanded, Social Science Citation Index and Arts and

Humanities Citation Index¹.

The SRC database contains publication records of all serial titles and publication types covered by

Thomson Reuters², together with their reference lists. SRC licenses records for articles published from

1982 and onwards; approximately 30 million records and 560 million references (citations). The

database is updated once a year in March‐April.

Data is delivered from Thomson Reuters in the form of tagged text files containing about 1.3 million

new publication records per year. As Thomson also complements and corrects old data, the delivery

of new records is accompanied by batches of “gaps” and “corrections” that are supposed to fill gaps

and replace old amended records. The SRC receives updates, gaps and corrections by the end of

January each year, a shipment that includes all changes made to the licensed products during the

previous calendar year. Since there is a certain time lag between a scientific document’s publication

and its entering into the Thomson indexes, the January shipment does not include all publications that

were published during the preceding year. To make the SRC database complete for the

preceding year, the January shipment is supplemented in April with the updates, gaps and corrections

made to the Thomson indexes during the first quarter of the year. With this addition, the publication

numbers for the previous year are close to complete.

The publication data is parsed from the delivered text files and loaded into a relational database by

means of an in‐house developed loader program. Citation counts and reference values for field

and journal normalisation of citation counts are then calculated and stored in the database.

3 Data properties

This section describes the general properties of the bibliometric data as it is delivered from Thomson

Reuters.

3.1 Journals

The Thomson data is organised around journal issues, rather than publications or journal titles. This

has implications on the later classification of publications into subject fields, since it is the journal

issues that are being subject classified by Thomson, not journals, nor individual publications.

¹ Certain data included herein are derived from the Science Citation Index Expanded, prepared by

Thomson Reuters®, Philadelphia, Pennsylvania, USA. © Copyright Thomson Scientific® 2006. All

rights reserved.


For each issue Thomson registers a full serial title, ISSN, publisher address, etc. As both the serial title

and the ISSN may change over the years for something generally being considered as one and the

same journal, there is no consistent way of identifying journals in the Thomson data. Thomson makes its

own judgement of what is to be considered as a journal and assigns each journal a unique identifier

called a sequence number.

For each issue the sequence number, the full title of the journal and four different normalised shorter

title variants are given; title_11, title_20, title_29, where the number indicates the maximum number of

characters in the title variant. There is also an ISO title and an ISSN number for each issue, but about

100 000 issues lack ISO title, and around 6000 issues lack ISSN number.

                                        1982‐2008   2003‐2008   2008
Number of unique sequence numbers       23 954      15 307      10 812
Number of unique values in title_11     15 953      11 515      10 608
Number of unique values in full title   23 366      14 946      10 815

Table 1. The number of unique values for various potential journal identifiers in the SRC Thomson database.

A list of all journal titles indexed in the Thomson database can be found at the Thomson website¹. In

January 2009, the list covered 15 717 titles.

The lack of a consistent identifier for what is commonly called a “journal” indicates some need for

caution when doing journal‐based analyses. The tests that SRC has performed have indicated that

title_11 is the most stable and reliable identifier, and SRC presently uses this field for identifying

journals.

3.2 Document types

Thomson classifies the publications as belonging to one (1) of a number of document types (around

30), the most important being articles, reviews, letters, and meeting abstracts. See appendix 8.2 for a list of

document types, together with the number of records and the share for each document type in the

SRC database. Since the products that the SRC licenses from Thomson do not include the ISI Conference

Proceedings index, the SRC database and analyses do not presently² include the Thomson document

type Proceedings Papers.

Bibliometric analyses including citation level normalisations (i.e. comparisons to other publications of

equal type) take the document type into account, since different document types tend to have different

citation characteristics.

Of the document types listed in appendix 8.2 Chronology and Note are not used separately anymore,

but included in the type Article. Document type Discussion is included in the type Editorial Material as

from 1996.

See also 4.1 for details on how SRC treats document types.

3.3 Subject fields

The subject classification in the SRC database is primarily based on the Thomson classification system

that assigns different scientific field categories to journal issues. The number of fields used in the

Thomson database has varied over the years; presently there are 255 different subject fields in the

¹ http://scientific.thomson.com/mjl/


database, whereof 247 have been used after year 2000. The Thomson subject fields are listed in

appendix 8.3, together with the number of publications in each category in the SRC database.

Each journal issue is classified by Thomson as belonging to one or several (maximum 7) of these 255

scientific fields. For “regular” journals the fields used for the issues tend to stay the same over time,

but for monograph series or report series the classification may vary from issue to issue, leading up to

a serial title classified as belonging to more than 30 different fields.

3.4 Addresses

The SRC bibliometric database contains about 56 million author or reprint addresses. 10 million of the

addresses can be considered as duplicates, generated by Thomson double registration of author and

reprint addresses that has varied in methodology over the years, see “Reprint addresses” below.

The 660 000 Swedish addresses¹ in the SRC database have been refined using a set of

address‐to‐organisation matching rules. The procedure for the address matching is described in section 4.4

Address matching under Data preparation.

The addresses for the publications in the Thomson database are of two types: author addresses and

reprint addresses. A reprint address contains name, affiliation and address details for the corresponding

author of the publication. Before 1998 the address of the corresponding author was only registered in

the database as a reprint address, but beginning in 1998 the address of the corresponding author is

registered both as an author address and as a reprint address, resulting in a number of duplicate

addresses in the database. See section 4.3 for details on how SRC handles these duplicates.

3.5 Author names

The SRC bibliometric database contains 88 million author name entries. Author names are not

database normalised (merged), which means that each occurrence of an author name is recorded as a

separate entry, even if the name is exactly the same as a previous entry. The reason not to merge

identical author names is that there is no identifier of persons in the input data, and therefore there is

no way to tell if two identical author names denote the same one or two different persons (a

homonym).

The author names are registered in the Thomson database using the form Lastname, Initial(s). No

diacritics are registered in the database, so for instance the Swedish characters å, ä, ö will end up as a, a, o (on a few occasions older transliteration rules are used, turning for instance ä into ae and ö into oe).

Varying transliteration, inconsistent use of initials and mix‐up of middle names and given names

often cause problems when trying to locate the publications of a specific researcher, together with the

homonym problem mentioned above.
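
The name form described above can be illustrated with a short Python sketch that strips diacritics and reduces given names to initials. The function names and the exact output format are assumptions made for illustration; the actual Thomson rules are not specified in this document, and the older ae/oe transliteration is not handled.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, so that å, ä, ö become a, a, o."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def thomson_name_key(last_name: str, given_names: str) -> str:
    """Build a 'Lastname, Initial(s)' key (hypothetical format)."""
    initials = "".join(part[0].upper() for part in given_names.split())
    return f"{strip_diacritics(last_name)}, {strip_diacritics(initials)}"


if __name__ == "__main__":
    print(thomson_name_key("Sjöström", "Åke Göran"))
```

Two different researchers can produce the same key, which is the homonym problem mentioned above.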

3.5.1 Reprint/corresponding author

Each author record is given a number of attributes, of which reprint author is one. The reprint author

name is also stored as a part of the reprint address. The reprint/corresponding author of a publication

may therefore be retrieved either by fetching the author record marked as reprint author or by

extracting the author name from the reprint address information.

3.5.2 Corporate authors

The Thomson data files also contain an author type marked as corporate authors, denoting the name of

a group of authors belonging to a large researcher community or an organisational body. In the SRC


database corporate author names are stored and used in the same way as ordinary author

names.

3.5.3 Authors and affiliations

In the source files delivered by Thomson Reuters each publication record contains a list of author

names and a list of author addresses. However, for records entered in the Thomson system before

2008 there is no indication of which addresses relate to which authors, except for the

corresponding author. As author names and author addresses are processed in sequence according to

the information from the journal issue, a relation between the first author name and the first author

address can be presumed, but Thomson gives no guarantee of this.

From the second half of 2008, the delivered records contain an indication of which author names and

addresses belong together.

The reprint address also contains an author name, and therefore this information can be used as an

identifier of an author‐affiliation relation.

Details about author‐affiliation relations are not presently used in the calculation of any of the regular

indicators that SRC produces.

3.6 References

The reference lists of the publications in the SRC bibliometric database contain about 560 million

references (outbound citations) to other publications.

Each of the publication records in the Thomson database contains a list of the publication’s references

to other publications. These reference records contain, among other things, a generated reference ID

number (RefNumberID), built by a special Thomson algorithm from selected key data of the referred

publication. The same algorithm is also used to generate a unique identifier for all registered

publications in the Thomson database, but here it is called the ItemNumberID.

By matching the RefNumberIDs from the reference lists of newly registered publications with the

ItemNumberIDs of previously registered publications, the number of references to each publication

(i.e. the number of inbound citations) can be calculated. Since ItemNumberIDs are generated for all

publications entered in the database, there is no problem if the chronological order of the referring

publication and the cited publication is reversed (i.e. the referring publication is entered in the

database before the cited publication). When the matching algorithm is run, a match will occur

anyhow.
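
The matching of reference IDs against item IDs can be sketched in Python as follows. This is only an illustration under assumed data structures (plain lists of IDs and pairs); it is not the SRC loader or the Thomson algorithm, and the function name is hypothetical.

```python
from collections import Counter


def count_inbound_citations(item_ids, references):
    """Count inbound citations per registered publication.

    item_ids:   iterable of ItemNumberIDs of registered publications.
    references: iterable of (citing_item_id, ref_number_id) pairs.
    Returns (citations per ItemNumberID, number of unmatched references).
    """
    known = set(item_ids)
    counts = Counter({item_id: 0 for item_id in known})
    unmatched = 0
    for _citing_id, ref_id in references:
        if ref_id in known:
            counts[ref_id] += 1   # match, regardless of registration order
        else:
            unmatched += 1        # reference to a source outside the database
    return counts, unmatched
```

Because the match is made on the complete set of IDs, the order in which the citing and the cited publication were registered does not affect the result.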

Almost half (47%) of the references in the SRC database are pointing to sources external to the SRC

database, i.e. their generated RefNumberIDs do not match any item in the database. These

non‐matched references usually refer to conference proceedings, reports, books, web‐published material or

journals not covered by the Thomson database.

See also section 4.6 for details on how the SRC deals with references.

4 Data preparation

After the tagged text data for the Thomson publication records has been loaded into the database, a

number of SRC‐specific refinements and calculations are performed to generate the underlying data

needed for indicator calculations.

4.1 Re-classification of publication type Letter

At the SRC, bibliometric analyses are mostly based only on the document types article and review. Since


several fields, the Thomson document types are re‐classified, so that types article and letter (and the

previously used document types note and chronology) are joined into a new SRC‐specific document

type; article, covering 70% of the publications in the database.

The reason for not treating letters as a separate document type is that the groupings of letters into

subject categories tend to generate too small groups for stable statistics and that the average citation

level for letters tends to be rather low (even though there are well‐cited letters). So at the SRC, letters

are grouped together with original articles and compared to the citation rates of those while doing

citation normalisations.

4.2 Subject field classification and fractionalisation

The SRC methodology uses the Thomson fields of the journal issue as a proxy to classify the

individual publications, so that each publication gets classified as belonging to 0 to 7 scientific fields

via the issue it is published in. During this mapping process, an SRC reclassification of publications in

journal issues classified as multidisciplinary is also applied, see 4.2.1 below.

The subject field mapping results in a number of links between Thomson subject fields and

publications and a count of the number of fields used for classification that is stored in each

publication record. The information about the fields and number of fields is used for a subject

fractionalisation in later calculations of citation reference values and citation indicators.
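
A minimal sketch of the subject fractionalisation, assuming that a publication classified in S fields contributes 1/S to each of them (consistent with the use of the number of fields as a denominator in the field reference value). The data layout and function name are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_field_counts(publication_fields):
    """Sum subject fractions per field.

    publication_fields: dict mapping publication id -> list of subject fields.
    """
    totals = defaultdict(Fraction)
    for fields in publication_fields.values():
        if not fields:                 # unclassified: contributes to no field
            continue
        weight = Fraction(1, len(fields))
        for field in fields:
            totals[field] += weight
    return dict(totals)
```

Exact fractions are used here so that the field totals always sum to the number of classified publications.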

4.2.1 Re-classifying publications in Thomson subject field Multidisciplinary Sciences

Thomson Reuters uses the subject tag Multidisciplinary Sciences for journals that contain papers from

many different scientific fields, such as PNAS, Science and Nature. This poses a well‐known problem

for bibliometric analyses involving field relative indexes, since this means that papers in these journals

are not compared to other papers of their ‘true’ field, but rather to other papers in journals classified

as Multidisciplinary Sciences. For example, an article titled The origins of Afroasiatic was published in 2004 in the journal Science, which is tagged as Multidisciplinary Sciences, and by the end of 2008, this

article had accumulated 5 citations. Since the average citation count for articles in that field that year

was 2.13, the field normalised citation rate for this article was 2.35. However, if this article had been

published in Journal of the Royal Asiatic Society, which is tagged as Asian Studies, it would have had a

field normalised citation rate of 14.3, since the average citation count for Asian Studies articles that year

was 0.35.
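
The arithmetic of this example can be written out as follows, where the symbol c for the field normalised citation rate and C for the citation count are notation assumed here for illustration:

```latex
\[
c = \frac{C}{\mu_f}, \qquad
\frac{5}{2.13} \approx 2.35, \qquad
\frac{5}{0.35} \approx 14.3
\]
```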

This problem is aggravated by the fact that the journals tagged with Multidisciplinary Sciences regularly

are prominent journals with high average citation levels. An article published in any of these journals

is thus likely to receive a lower field normalised citation rate than would have been the case if it had

been published in a journal with more specific subject tagging.

In an effort to find more article‐relevant scientific field classifications for publications in journals

classified as Multidisciplinary Sciences, the publications in these journals are being re‐classified by an

in‐house developed algorithm, using the publication’s references and the (inbound) citations to the

publication itself. The algorithm is based on the assumption that the Thomson subject fields of the

articles referred to in reference lists and the citing articles indicate which subject areas the

referring/referred article belongs to. The classification algorithm is described in detail in a separate

article from the SRC (Gunnarsson et al. 2009).

Using the SRC reference‐citation‐based reclassification method 50% of the publications in journals

classified as multidisciplinary can be classified as belonging to other subject areas. For articles

published after year 2001, 90% are moved from multidisciplinary to other subject areas using this

method. The re‐classification produced by this method is now included as a standard procedure in the


4.3 Address deduplication

As described in section 3.4 above, the set of reprint addresses and the set of author addresses partly

overlap. This, in combination with the fact that the SRC often uses address fractionalisation, means

that there is need for a method to identify and remove address duplicates.

Originally, the production year could be used to identify the records for which the reprint address is a

duplicate (before or after 1998), but since records older than 1998 have been added or amended

according to the new method, this method does not work anymore. After discussions with Thomson

Reuters and after detailed tests, SRC settled on the following criteria for identifying address

duplicates:

• The reprint address of a publication record is a duplicate

o if there is an author address that has the same organisation, sub‐organisation, city,

state and country

o or if the organisation and sub‐organisation fields of the reprint address are NULL and

there is at least one author address where those fields are not NULL in the publication

record.

• An author address of a publication is a duplicate if there is another author address in the

publication record with the same organisation, sub‐organisation, city, state and country.
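
The criteria above can be sketched in Python as follows. The dictionary keys are hypothetical field names, NULL is represented by None, and the second reprint criterion is interpreted as requiring an author address where both organisation fields are filled; that reading is an assumption.

```python
KEY_FIELDS = ("organisation", "sub_organisation", "city", "state", "country")


def address_key(address):
    """Tuple of the fields compared when looking for duplicates."""
    return tuple(address.get(field) for field in KEY_FIELDS)


def deduplicate_addresses(author_addresses, reprint_address=None):
    """Return the addresses of one publication record without duplicates."""
    unique, seen = [], set()
    for address in author_addresses:
        key = address_key(address)
        if key not in seen:            # keep first of identical author addresses
            seen.add(key)
            unique.append(address)
    if reprint_address is not None:
        same_as_author = address_key(reprint_address) in seen
        no_org = (reprint_address.get("organisation") is None
                  and reprint_address.get("sub_organisation") is None)
        author_has_org = any(
            a.get("organisation") is not None
            and a.get("sub_organisation") is not None
            for a in author_addresses
        )
        if not (same_as_author or (no_org and author_has_org)):
            unique.append(reprint_address)
    return unique
```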

4.4 Address matching

The addresses supplied by the journal are parsed into several fields by Thomson:

• Complete address as supplied by publisher

• Organisation

• Sub‐organisation (such as a department)

• Postal code

• Street address

• City

• State or region

• Country

The addresses supplied by the authors may be formatted in a number of different ways, not always

with the main organisation name given as the first instance of the address, which can make it difficult

for Thomson to identify the main organisation, especially for organisations in non‐English speaking

countries. The organisation name may also be given in various synonym forms, especially if the name

is translated ad hoc to English from a non‐English name form.

Example 1
Full address        Tufts Univ, Dept Mech Engn, Medford, MA 02155 USA
Organisation        Tufts Univ
Sub‐organisation    Dept Mech Engn
City                Medford
State               MA
Country             USA
Street              NULL
Postal code         02155 AP

Example 2
Full address        Karolinska Inst, Huddinge Univ Hosp, Dept Immunol Microbiol Pathol & Infect Dis, Div Clin Bacteriol, S‐14186 Huddinge, Sweden
Organisation        Karolinska Inst
City                Huddinge
State               NULL
Country             Sweden
Street              Huddinge Univ Hosp, Dept Immunol Microbiol Pathol & Infect Dis, Div Clin Bacteriol
Postal code         S‐14186 BC

Example 3
Full address        Univ Lund, Ctr Chem, Dept Biotechnol, S‐22100 Lund, Sweden
Organisation        Univ Lund
Sub‐organisation    Ctr Chem; Dept Biotechnol
City                Lund
State               NULL
Country             Sweden
Street              NULL
Postal code         S‐22100 BC

Example 4
Full address        Inst Microbiol, Dept Clin Immunol, Goteborg, Sweden
Organisation        Inst Microbiol
Sub‐organisation    NULL
City                Goteborg
State               NULL
Country             Sweden
Street              Dept Clin Immunol
Postal code         NULL

Fig 2. Four examples of Thomson address records. The second record is an example of two different organisations

written in the same address and the fourth record is an example of a Swedish address with a hard‐to‐identify

organisation field.

The variations in address formatting and naming of organisations lead to problems when trying to

identify the publications of an organisation while doing a bibliometric analysis. The SRC has therefore

developed a semi‐automatic address‐to‐organisation matching process to improve the identification of

publications from Swedish organisations.

The address matching is based on 17 000 rules containing unique strings that match text strings in the

various Thomson address fields and link the 660 000 addresses that are designated with country

SWEDEN to 600 Swedish organisation records. The organisation records also contain information of

the type of organisation; for instance university, university college, university hospital, hospital, company,

organisation, museum, governmental, etc.

The matching process is performed in a sequence of steps, beginning with the most straightforward

synonym matching algorithms, followed by more intricate string matching rules:

1. The Thomson organisation field is searched for unique variants of Swedish organisation

names. The 4 600 rules of this step match 610 000 Swedish addresses (around 90%).

2. The Thomson organisation and Thomson city fields are searched in conjunction for

combinations making up unique identifications of Swedish organisations. The 9 000 rules

of this step match 25 000 addresses.

3. A set of 500 unique string identifiers for Swedish universities and university colleges is

matched against the full address field. This procedure also picks up addresses where

several Swedish university names have been written into the same address and therefore

produces some address duplicates (that later are handled as split addresses). This step

matches around 42 000 addresses.

4. The Thomson organisation field is searched for unique identifying strings and


procedure is also repeated without the inclusion of Thomson city field to handle

organisations located in several cities. These steps match approximately 13 000 addresses.

5. The full address field is searched for unique identifying strings and combination of strings

once more, this time trying to catch residuals from previous matches. This matches about

800 addresses.

6. A number of addresses that have been manually linked to organisations or considered as

unidentifiable are being processed using the whole address field as identifier. This is used

for 200 addresses.

7. Names of joint university organisations are being searched in Thomson organisation and

Thomson city fields in two steps and split addresses for those are produced. This

produces around 800 organisation matches.
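
The stepwise character of the matching process can be illustrated with the Python sketch below. It assumes that each rule is a simple case-insensitive substring test on one address field and that a later step is only tried when earlier steps gave no match; the real SRC rules and their exact sequencing are more elaborate, and all names here are hypothetical.

```python
def match_address(address, rule_steps):
    """Match one address against an ordered sequence of rule sets.

    address:    dict of address fields (e.g. 'organisation', 'full_address').
    rule_steps: list of steps; each step is a list of
                (field, search_string, organisation) rules.
    Returns the organisations found by the first step that gives a match.
    """
    for rules in rule_steps:
        hits = {
            organisation
            for field, needle, organisation in rules
            if needle.lower() in (address.get(field) or "").lower()
        }
        if hits:
            return sorted(hits)   # several hits mean the address is split
    return []
```

When one step returns more than one organisation, the address would be treated as a split address.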

The result of the address matching process is a relational database table with 690 000 links¹ between

Swedish organisations and their publications. The links cover over 99%² of the Swedish publications

in the SRC database and work to get 100% coverage is ongoing.

The matching process allows a single address to be matched to more than one organisation, which in

practice means that addresses are split. Because of this all statistics based on or including the number

of addresses per publication are updated.

For Denmark, Norway and Finland a first rudimentary normalisation of organisation names for

universities, university colleges and university hospitals has been performed and there are plans to

extend this normalisation to a rule matching system of the same type as for Swedish addresses in the

future.

4.4.1 Country name normalisation

For all 56 million addresses in the SRC database a normalisation of country names is performed by

using a system of 353 rules for matching name variants to 246 standardised³ country names. For

instance ”Scotland” and ”England” get normalised to ”United Kingdom”, ”Cambodia” and ”Khmer

Republic” map to ”Cambodia”; and ”Fed Rep Ger” and ”Ger Dem Rep” to ”Germany”.
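
A lookup of this kind can be sketched in Python as below. The table contains only the name variants mentioned above; the lower-casing of keys and the choice to return unknown names unchanged are assumptions, not documented SRC behaviour.

```python
# Only the variants named in the text; the real rule set has 353 rules.
COUNTRY_RULES = {
    "scotland": "United Kingdom",
    "england": "United Kingdom",
    "cambodia": "Cambodia",
    "khmer republic": "Cambodia",
    "fed rep ger": "Germany",
    "ger dem rep": "Germany",
}


def normalise_country(raw_name: str) -> str:
    """Map a country name variant to its standardised form."""
    key = " ".join(raw_name.split()).lower()   # collapse spaces, ignore case
    return COUNTRY_RULES.get(key, raw_name)    # unknown names are left as-is
```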

No attempt to correct possible errors in country assignment (organisation names assigned to the

wrong country) is done, but there are plans for such a procedure in the future.

4.5 Address counting and fractionalisation

Since almost all indicators calculated by the SRC are being weighted based on the analysed unit’s

share of the addresses in the publication, the number of addresses in each publication needs to be

counted and stored as a part of the publication record.

After the address matching the total number of addresses is stored as a part of each publication record

and used as a denominator when calculating fractionalised publication counts and weighted citation

indicators.
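
As an illustration, the fractionalised publication count for an analysed unit can be sketched in Python as follows. Each publication is represented simply as a list with the unit (for example a country) assigned to each of its addresses; this representation and the function names are assumptions.

```python
def address_fraction(unit, address_units):
    """Share of a publication's addresses that belong to the unit."""
    if not address_units:
        return 0.0
    return address_units.count(unit) / len(address_units)


def fractionalised_publication_count(unit, publications):
    """Sum of the unit's address fractions over all publications."""
    return sum(address_fraction(unit, units) for units in publications)


if __name__ == "__main__":
    pubs = [["Sweden", "Sweden", "Norway", "USA"], ["Sweden"], ["USA", "Norway"]]
    print(fractionalised_publication_count("Sweden", pubs))
```

The same fractions can serve as weights when averaging citation indicators over a unit's publications.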

One rationale for using address fractionalisation and using fractions for weighting of citation averages

is the correlation between the number of addresses and the number of citations. Publications with

many authors usually attract more citations, even disregarding self‐citations. If citation values for

publications were attributed to organisations without any form of fractionalisation and weighting, a

sort of inflation in citations would be created, since highly‐cited publications would be attributed to

more organisations than lower‐cited publications.

¹ The number of links between organisations and publications is greater than the number of addresses,

due to joint organisations and several organisations contained within the same address.

² Measured September 2009.

The SRC presently does not use any kind of fractionalisation based on author names and/or the number

of authors to the publications, since there presently are no links between authors and affiliations, and

SRC rarely does analyses on individual researchers.

4.6 Reference adjustments

As mentioned above Thomson uses an elaborate mechanism for creating links between referring and

cited documents, but the SRC does some adjustments to this citation matching mechanism while

counting citations for publications.

4.6.1 Split references

The Thomson algorithm that builds the RefNumberIDs and the ItemNumberIDs is not perfect;

sometimes different publications can get the same ItemNumberIDs¹. Of the 560 million references in

the SRC database about 160 000 (0.2‰) are duplicates, i.e. the same reference ID number points to two

different publication records. When this situation occurs there is no way to automatically differentiate

between the two records that are being referred to and the citation to these publications is therefore

split between them. The SRC database thus contains a few publications with a non‐integer number of

citations.
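
The splitting of citations can be sketched in Python as follows, under the assumption that a citation is divided equally between the records sharing an ID (the text above only states that the citation is split). The data structures and the function name are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def split_citation_counts(records_by_item_id, reference_ids):
    """Distribute citations over records that share an ItemNumberID.

    records_by_item_id: dict ItemNumberID -> list of publication record keys.
    reference_ids:      iterable of RefNumberIDs found in reference lists.
    """
    counts = defaultdict(Fraction)
    for ref_id in reference_ids:
        records = records_by_item_id.get(ref_id, [])
        for record in records:
            counts[record] += Fraction(1, len(records))   # equal share
    return dict(counts)
```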

4.6.2 Duplicate references

In some scientific areas there are citing traditions that cause a publication to occur several times in the

same reference list. The reference may for instance point to different pages of the referred publication.

In the SRC database, these duplicate references are only being counted once, i.e. the same

RefNumberID is only entitled to occur once in each reference list.
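
In Python, this rule amounts to an order-preserving removal of repeated IDs within one reference list. The sketch below assumes the list is given as plain RefNumberID values.

```python
def unique_references(reference_list):
    """Keep the first occurrence of each RefNumberID in one reference list."""
    return list(dict.fromkeys(reference_list))
```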

4.7 Citation windows

Citations to publications can be measured using different citation time windows. The citation window

stipulates from which publication years publications are to be searched for references when gathering

citations to a publication published a certain publication year. For instance, a three year citation

window used for a publication published year 2000 will mean that references from publications

published the same year as the cited publication (2000) and two succeeding years (2001‐2002) are

counted as citations. A citation window of six years for the same publication will count references

from publications published years 2000‐2005. An open citation window will count references from all

publications published up to the time the analysis is performed. For a publication published year 2000

and analysed year 2009, references from publications published years 2000‐2009 would then be

counted as citations.
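
The window definitions above can be expressed as a short Python sketch. It works on publication years only and, for the open window, counts citing publications from the publication year onwards, following the example above; the function name and parameters are assumptions.

```python
def count_citations(pub_year, citing_years, window=None):
    """Count citations to one publication within a citation window.

    pub_year:     publication year of the cited publication.
    citing_years: publication years of the citing publications.
    window:       number of years including the publication year,
                  or None for an open window.
    """
    if window is None:
        return sum(1 for year in citing_years if year >= pub_year)
    last_year = pub_year + window - 1
    return sum(1 for year in citing_years if pub_year <= year <= last_year)
```

For a publication from 2000, `window=3` counts citing years 2000 to 2002 and `window=6` counts 2000 to 2005.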

The fixed citation windows 3 and 6 years have the advantage that citation counts to all publications

older than 3 or 6 years can be compared on like grounds and therefore make a good choice for

preparing time series of citation counts. The citation count for a publication published before the end

of the used citation window will also stay the same, no matter when analysis is done; for instance if

the citations to a publication published in year 2000 are measured using a 3‐year window in 2003, 2005, 2008 or

2009, the result will be the same, since only citations from publications from year 2000‐2002 are

counted.

¹ Usually, this occurs when two publications have the same first author, the same title, the same


The open citation window has the advantage of gathering as much citation data as possible for

indicator calculation, and has the possibility of spotting so called “sleeping beauties” – publications

that do not start to be cited until a number of years after their publication. But the open window

has the drawback of non‐time‐neutrality, since the worldwide general rate of citations is increasing for

each year. Internal studies at the SRC have shown that there is a good correlation between normalised

citation rates calculated from 3‐year, 6‐year and an open citation window at aggregated levels.

If an analysis using a fixed 3‐year or 6‐year citation window includes publications that are younger

than 3 or 6 years, respectively, the window will act as an open window for those publications, i.e. all

citations from the publication date up to the analysis date will be included in the citation count. Since

the same will be valid for the publications included in the citation reference value for the same year

(the denominator used for normalisation) the deviation from the fixed window will not cause any

major differences in the normalised citation rate.

The SRC pre‐calculates citation counts for all publications in the database using a 3‐year window

(Cw3), a 6‐year window (Cw6) and an open window (Cwo).

4.8 Self citations

References between publications are usually considered to reflect some kind of scientific recognition

and the number of citations a publication receives can thus be said to be a measure of the amount of

scientific recognition it has gained. But if a researcher refers to his own previous work, this is not a

measure of recognition from the rest of the research community. Therefore it is customary to try to

remove these self citations from the citation counts in bibliometric studies.

The SRC method for removing self citations is based on all author names in both the referring and the

cited publication. If any of the author names in Thomson format (lastname + initials) in the author list

of the referring publication is found in the author list of the cited publication, the citation is considered

to be a self‐citation. No attempt to differentiate homonyms (different researchers sharing the same

Thomson form of name) is done, and there is no separate rule for publications with long author lists.
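
The self-citation test described above can be sketched in Python as a set intersection on author names. The light normalisation of case, commas and spaces is an assumption added for the illustration; the text above only states that names are compared in Thomson format.

```python
def _name_key(name: str) -> str:
    """Normalise 'Lastname, Initials' for comparison (assumed rule)."""
    return " ".join(name.upper().replace(",", " ").split())


def is_self_citation(citing_authors, cited_authors) -> bool:
    """True if any author name occurs in both author lists."""
    citing = {_name_key(name) for name in citing_authors}
    cited = {_name_key(name) for name in cited_authors}
    return not citing.isdisjoint(cited)
```

As in the method described above, homonyms are not distinguished, so two different researchers with the same name form will be flagged as a self-citation.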

Using this process to identify self‐citations, the SRC calculates values for the number of citations

including self‐citations (Csci), and the number of citations excluding self‐citations (Cscx) for all three

citation windows and stores these pre‐calculated values for each publication in the database. Each

publication will thus have attributes for six citation values: Csci,w3, Csci,w6, Csci,wo, Cscx,w3, Cscx,w6 and Cscx,wo.

4.9 Citation reference values for Thomson subject fields (μf)

After the various citation values have been calculated for all publications in the database, the basic

data are in place to start to calculate the field reference values used for field normalisation of citations.

The average number of citations for publications of a certain document type, publication year and

field is called the Field Reference Value and is denoted μf or FRV1.

The field reference value is calculated for each combination of subject field, publication year and

document type according to the following formula:

$$\mu_f = \frac{\sum_{i=1}^{P} \frac{C_i}{S_i}}{\sum_{i=1}^{P} \frac{1}{S_i}} \qquad [1]$$

where:

μf = weighted average citation rate for field, year and document type

1 The Field Reference Value has previously been called the Field Citation Score, denoted as FCS.

P = the number of publications of the studied document type and publication year classified as

belonging to the subject field in question

Ci = the number of citations to publication i

(according to separately specified citation window and self‐citation handling)

Si = the number of subject fields the publication i has been classified as belonging to

Because the number of fields each publication is classified in (Si) appears in the denominator of both

sums in the formula, the average citation values are based on publication subject

fractions, and the resulting field reference value (μf) will be a weighted average1 with regard to the

publications.
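
As a worked illustration of the weighted average, the following Python sketch computes a field reference value from a list of (citations, number of fields) pairs for one combination of field, year and document type. The function name and the input layout are assumptions chosen for the example.

```python
def field_reference_value(publications):
    """Weighted average citation rate for one field, year and document type.

    publications: list of (citations, n_fields) pairs, where n_fields is the
    number of subject fields the publication is classified in (S_i >= 1).
    Each publication contributes with the fraction weight 1 / S_i.
    """
    if not publications:
        raise ValueError("at least one publication is required")
    weighted_citations = sum(c / s for c, s in publications)
    total_weight = sum(1 / s for _, s in publications)
    return weighted_citations / total_weight


# Hypothetical field with two publications:
# 10 citations, classified in 1 field (weight 1.0)
#  4 citations, classified in 2 fields (weight 0.5)
print(field_reference_value([(10, 1), (4, 2)]))  # (10 + 2) / (1 + 0.5) = 8.0
```

An unweighted mean of the same two publications would be 7.0; the fraction weights give the publication classified in a single field more influence on the reference value of that field.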

Publications without any subject field classification are excluded from the calculation, as

are publications without any author addresses, since the latter cannot be part of any country or

organisation analysis and should therefore not be part of the reference value for the field.

When calculating the field reference values all references to the publications are counted as potential

incoming citations, regardless of publication year, document type or field, before being filtered by

conditions regarding citation window and self‐citations.

At the SRC the field reference values are calculated for all of the six variants of citation values for each

publication: with (sci) or without (scx) self‐citations, and with citation windows of 3 (w3) or 6 (w6) years, or

an open window (wo). The calculated values are denoted μf[sci,w3], μf[sci,w6], μf[sci,wo], μf[scx,w3], μf[scx,w6] and μf[scx,wo] respectively, and stored in a separate field reference value table in the database as a foundation

for the later calculation of the field normalisation of citation rates.

Since the field normalised citation rate is the most commonly used indicator, pre‐calculated weighted

average field normalised citation values for all the combinations of citation windows and self‐citation

filtering are stored as a part of each publication record in the database. The calculated values are

denoted cf[sci,w3], cf[sci,w6], cf[sci,wo], cf[scx,w3], cf[scx,w6] and cf[scx,wo] respectively. See section 6.4 below for a

description of how the field normalised citation rate is calculated at the SRC.

4.10 Citation reference values for journals (μj)

Sometimes it can be of interest to study how much a publication has been cited in relation to

publications of the same document type and publication year in the same journal, an indicator

called the journal normalised citation rate. To be able to do this, a mean citation value for each document

type, year and journal has to be calculated. This value is called the journal reference value and is denoted μj or JRV2.

For each journal, identified by the field title_11 in the journal issue records, the journal reference

values are calculated according to the following formula:

$$\mu_j = \frac{\sum_{i=1}^{P} C_i}{P} \qquad [2]$$

where:

μj = the journal reference value

1 A publication classified in only one field will have a fraction weight of 1, whereas a publication

classified in 5 different fields will only have a fraction weight of 0.2 for the calculation of each of the

field reference values.

Ci = the number of citations to publication i

(according to separately specified citation window and self‐citation handling)

P = the number of publications of the studied document type and publication year published in the

journal in question

At the SRC the journal reference values are calculated for all of the six variants of citation values for

each publication: with (sci) or without (scx) self‐citations, and with citation windows of 3 (w3) or 6 (w6)

years, or an open window (wo). The calculated values are denoted μj[sci,w3], μj[sci,w6], μj[sci,wo], μj[scx,w3], μj[scx,w6] and μj[scx,wo] respectively, and stored in a separate table in the database for later calculation of

journal normalised citation rates.

Publications without any author addresses are excluded from the calculation of the journal reference

value, since these cannot be part of any country or organisation analysis and should therefore not be

part of the reference value for the journal.

4.11 Citation percentile threshold values for fields (τf)

As a complement to the study of publication citation rates in relation to mean values, it can be of

interest to study how many and what share of publications are cited more than a specified percentile

threshold of a subject field. Commonly used percentile thresholds are 90%, 95% and 99%, which

indicate that a publication is among the 10%, 5% or 1% most cited in a field if it has yielded more

citations than the corresponding percentile threshold value.

The percentile threshold values are defined and calculated using the same conditions as the field

average reference values, i.e. publications of the same document type, the same publication year

and the same Thomson subject field are grouped to calculate the percentile values. Publications

without any subject field classifications are excluded from the calculation, as well as publications

without any author addresses.

At the SRC the percentile thresholds for the fields are calculated on subject fractions of the

publications, which means that each publication gets a fraction weight in inverse proportion to the

number of fields it is classified in when the shares of publications are summed up.

The percentiles are calculated by sorting the publications in order of the number of citations to each

publication. Each publication is assigned a fraction weight that is the inverse of the number of fields it

is classified in. Then groups of publications containing 90%, 95% and 99% of the total number of

weighted publication fractions are extracted and the number of citations to the publication at the top

of the list is noted as the corresponding percentile threshold value. If the number of (weighted)

publications is such that the 90% (or 95% or 99%) limit goes through a publication (and not between

two publications), the average citation count for the two publications on both sides of the limit is used

as the percentile threshold value1.
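
A simplified Python sketch of the weighted sorting step follows. It returns the citation count of the publication at which the cumulative fraction weight first reaches the requested share, and it deliberately leaves out the averaging rule for limits that fall on a boundary. The function name, the input layout and the small floating-point tolerance are assumptions, and the SAS percentile definition used by the SRC may give different values in boundary cases.

```python
def percentile_threshold(publications, share):
    """Simplified weighted percentile threshold for one field group.

    publications: list of (citations, n_fields) pairs.
    share: requested share of the total fraction weight, e.g. 0.90.
    Returns the citation count of the publication at which the cumulative
    weight (1 / n_fields per publication) first reaches share * total.
    """
    if not publications:
        raise ValueError("at least one publication is required")
    ranked = sorted(publications, key=lambda pub: pub[0])
    total_weight = sum(1 / n_fields for _, n_fields in ranked)
    target = share * total_weight
    cumulative = 0.0
    for citations, n_fields in ranked:
        cumulative += 1 / n_fields
        # Small tolerance so that exact boundaries survive float rounding.
        if cumulative + 1e-9 >= target:
            return citations
    return ranked[-1][0]


# Hypothetical group: ten single-field publications with 0..9 citations.
group = [(citations, 1) for citations in range(10)]
print(percentile_threshold(group, 0.90))  # 8: only the publication with 9 citations is above
```

In this example a publication has to be cited more than 8 times to count as being among the 10% most cited, which agrees with the "cited more than the threshold" reading used in the text.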

The percentile calculation is performed for all six combinations of citation windows and self‐citation

handling for each field and the results are stored in the field reference value table in the database as 18

different threshold values, one for each combination of percentile threshold (90, 95 and 99), citation

window (w3, w6, wo) and handling of self‐citations (sci, scx). The values are denoted τ90f[sci,w3],

τ95f[sci,w3], τ99f[sci,w3], etc.

1 The calculation is made in the statistics software SAS, using the SAS standard definition for

5 Methods for analyses

Data for bibliometric analyses are extracted from the SRC database in a number of different

ways, the most common being to retrieve sets of publication data in the form of lists of publication

fractions. Each publication is split into fractions by author address and subject classification, so that a

publication with two addresses and three subject classifications is split into six fractions, one for each

combination of address and subject. The basis of the subject classification is the Thomson subject

classification of journal issues included in the tagged text files; see section 3.3 above.
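
The splitting into fractions can be sketched in Python as the Cartesian product of addresses and subject fields. The function name, the dictionary keys and the stored weight of 1 / (addresses × fields) are assumptions made for illustration; the sketch only shows the row structure, not the SRC database schema.

```python
from itertools import product


def split_into_fractions(publication_id, addresses, subject_fields):
    """Return one row per combination of address and subject field."""
    n_addresses = len(addresses)
    n_fields = len(subject_fields)
    return [
        {
            "publication": publication_id,
            "address": address,
            "subject_field": field,
            "n_addresses": n_addresses,
            "n_fields": n_fields,
            # Assumed weight: the row's share of all fractions of the publication.
            "weight": 1 / (n_addresses * n_fields),
        }
        for address, field in product(addresses, subject_fields)
    ]


# Hypothetical publication with two addresses and three subject fields.
rows = split_into_fractions(
    "pub-1",
    ["Univ A, Sweden", "Univ B, Norway"],
    ["Physics", "Chemistry", "Materials Science"],
)
print(len(rows))  # 6
```

Keeping the number of addresses and the number of fields on every row makes it possible to compute either address-only or subject-address weights later without going back to the publication record.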

The final analyses are either performed with the statistical program SAS or with in‐house developed

programs and scripts. Usually the analyses are post‐processed in Microsoft Excel for tabulation and

diagrams.

5.1 Publication address/field fractions

The output from the SQL database is typically a list of relevant data for each publication to be

analysed, with one row for each combination of publication, address and subject field. Each row

usually contains the following data:

• Normalised organisation name of the Swedish address, or Thomson organisation name if non‐

Swedish address

• Normalised country name

• Full address

• Publication year

• Document type

• Subject field

• The number of subject fields in the classification of this publication

• The number of addresses for this publication

• The number of citations in 6 variants with different citation windows and self‐citations

included or removed

• Field reference values in corresponding 6 variants for the field the fraction is classified in

• Percentile thresholds for the 90th, 95th and 99th percentiles with 3 different citation windows and

self‐citations excluded

• Journal reference value, open citation window, self‐citations excluded

An extraction of fractions for Swedish publications 1982‐2009 generates about 1 million rows.

Extraction of world publication data for the same period generates about 73 million rows, which

corresponds to about 27 GB of data.

6 Indicators

In this section the bibliometric indicators most commonly used at the SRC are presented,

together with descriptions of how they are calculated by the SRC.

6.1 Publication counts (P)

An analysed unit’s publications may be counted either in full counts or in fractional counts based on

the unit’s share of addresses in the publication, and, in relevant cases, the number of subject fields

assigned to the publication. The number of publications in full counts is denoted Ph and the number of

fractionalised publications is denoted Pr.

A commonly used bibliometric indicator is the number of publications produced per year, which is

6.2 Share of un-cited publications (puc)

It can also be of interest to study what share of an analysed unit’s publications have not made an

impact in the scientific community, i.e. that have not yielded any citations besides self‐citations.

The indicator is calculated according to the following formula:

$$p_{uc} = \frac{\sum_{i=1}^{R_{uc}} \frac{1}{A_i}}{\sum_{i=1}^{R} \frac{1}{A_i}} \qquad [3]$$

where:

puc = weighted average share of uncited publications attributed to the analysed unit

Ruc = the number of publication fractions belonging to uncited publications (self‐citations removed) attributed to

the analysed unit

R = the total number of publication fractions attributed to the analysed unit

Ai = the total number of addresses in the publication of fraction i
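
The address-weighted share can be illustrated with a short Python sketch. It assumes that the analysed unit's address fractions are given as (number of addresses, citations excluding self-citations) pairs; the function name and the input layout are hypothetical.

```python
def share_uncited(fractions):
    """Address-weighted share of uncited publication fractions.

    fractions: list of (n_addresses, citations_excl_self) pairs, one per
    address fraction attributed to the analysed unit. Each fraction is
    weighted by 1 / n_addresses.
    """
    total_weight = sum(1 / a for a, _ in fractions)
    uncited_weight = sum(1 / a for a, citations in fractions if citations == 0)
    return uncited_weight / total_weight


# Hypothetical unit with three address fractions:
# a sole-address uncited publication, and one cited and one uncited
# publication that each have two addresses.
print(share_uncited([(1, 0), (2, 5), (2, 0)]))  # (1 + 0.5) / (1 + 0.5 + 0.5) = 0.75
```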

6.3 Share of self-citations (csc)

The analysed unit’s share of self‐citations is easily calculated by dividing the number of self‐citations

by the total number of citations to the unit’s publications. Since the SRC calculates the weighted

average based on the unit’s share of addresses in the analysed publications, the formula gets a bit more

complicated:

$$c_{sc} = \frac{\sum_{i=1}^{R} \frac{C_{sci(i)} - C_{scx(i)}}{C_{sci(i)}}}{R} \qquad [4]$$

where:

csc = weighted average share of self‐citations for the analysed publications

Csci(i) = the number of citations to the publication of fraction i, self‐citations included

Cscx(i) = the number of citations to the publication of fraction i, self‐citations excluded

R = the number of publication fractions attributed to the analysed unit

The share of self‐citations can be calculated using different citation windows, depending on what is

the most suitable for the analysis in question.

6.4 Field normalised citation rate (cf)

The field normalised citation rate is one of the so‐called “state‐of‐the‐art” bibliometric indicators. The

general idea of the indicator is to relate the number of citations to a publication or a group of

publications to the average citation level of a group of comparable publications of the same document

type, publication year and Thomson scientific field.

The SRC calculates its field normalised citation rate (cf) indicator using a publication fraction oriented

method, which means that the number of citations of each subject‐address fraction of a publication is

normalised against an average citation rate for the same document type, publication year and subject

When the final average normalised citation rate for the analysed unit’s publications is calculated, each

publication fraction is weighted by its share of all subject‐address fractions for that publication, so that

the resulting average will be a weighted average.

The SRC average cf is calculated according to the following formula:

$$c_f = \frac{\sum_{i=1}^{R} \frac{C_i}{\mu_{f(i)} A_i S_i}}{\sum_{i=1}^{R} \frac{1}{A_i S_i}} \qquad [5]$$

where:

cf = the weighted average field normalised citation rate

Ci = the number of citations to the publication of fraction i

(according to separately specified citation window, self‐citations removed)

μf(i) = the field reference value for the field of fraction i

Si = the number of subject fields the publication of fraction i has been classified as belonging to

The field normalised citation rate can be calculated using different citation windows, depending on

the situation, but at the SRC it always excludes self‐citations. Presently, the SRC does not make any

adjustments when normalising against very low‐cited fields.
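
The fraction-oriented calculation can be sketched in Python as a weighted mean of per-fraction ratios Ci / μf(i). The sketch weights each fraction by 1 / (Ai · Si), i.e. its share of all subject-address fractions of the publication as described in the text; this reading of the weighting, the function name and the input layout are assumptions and not a transcription of formula [5].

```python
def field_normalised_citation_rate(fractions):
    """Weighted average of field normalised citation ratios.

    fractions: list of (citations, field_reference_value, n_addresses,
    n_fields) tuples, one per subject-address fraction of the analysed unit.
    Citations are assumed to exclude self-citations, and every field
    reference value is assumed to be greater than zero.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for citations, mu_f, n_addresses, n_fields in fractions:
        weight = 1 / (n_addresses * n_fields)  # assumed fraction weight
        weighted_sum += (citations / mu_f) * weight
        total_weight += weight
    return weighted_sum / total_weight


# Hypothetical unit with three fractions:
# ratio 2.0 with weight 1.0, ratio 0.5 with weight 0.5, ratio 2.0 with weight 0.5
unit = [(6, 3.0, 1, 1), (4, 8.0, 2, 1), (4, 2.0, 1, 2)]
print(field_normalised_citation_rate(unit))  # (2.0 + 0.25 + 1.0) / 2.0 = 1.625
```

Since no adjustment is made for very low-cited fields, a small reference value in the denominator can dominate the average; the sketch does nothing to dampen that effect.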

Please note that even though the SRC cf indicator resembles the CWTS “crown” CPP/FCSm indicator,

it is not the same indicator. The CWTS indicator groups publications and calculates average citation

levels for both the numerator and the denominator before citation normalisation is done (Moed et al.,

1995), whereas the SRC indicator does the normalisation at publication fraction level and

averages afterwards. This difference is described in detail by Lundberg (2007). Furthermore,

the CWTS crown does not seem to use address fraction weighting, which means that crown values

usually will be higher than cf values for the same set of publications, due to the correlation between

numbers of authors/addresses and citations discussed in the section about fractionalisation above.
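
The difference between normalising after averaging and averaging after normalising can be shown with a small unweighted Python example. Both functions are simplified illustrations with hypothetical names; they ignore fraction weighting and are not the CWTS or SRC implementations.

```python
def ratio_of_averages(publications):
    """Average first, normalise afterwards (the grouping approach)."""
    total_citations = sum(c for c, _ in publications)
    total_reference = sum(mu for _, mu in publications)
    return total_citations / total_reference


def average_of_ratios(publications):
    """Normalise each publication first, average afterwards."""
    return sum(c / mu for c, mu in publications) / len(publications)


# Hypothetical set: (citations, field reference value) per publication.
# One publication in a high-cited field cited at the field average, and one
# in a low-cited field cited at twice the field average.
pubs = [(10, 10.0), (2, 1.0)]
print(ratio_of_averages(pubs))   # 12 / 11, about 1.09
print(average_of_ratios(pubs))   # (1.0 + 2.0) / 2 = 1.5
```

In the first variant the publication from the high-cited field dominates both sums, whereas the second variant gives each publication the same influence regardless of the citation level of its field.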

6.5 Share of publications above the 90th, 95th and 99th citation percentile

If the average field normalised citation rate for an analysed unit says something about the average

impact of the unit’s publications, the share of publications cited above a certain citation percentile can

tell us something about the distribution of the impact of the unit’s publications. Is an average

normalised citation rate of 1.2 the result of a majority of well‐cited publications or a few highly‐cited

publications?

This indicator is calculated by counting how many of a unit’s publication fractions are cited

more than the citation level for the percentile in question in the subject field the fraction is classified in. If the publication is cited more than the threshold for the field, the value is 1, otherwise it is 0. This value is called PPRC#f(i), the cited‐over‐threshold value for the #th percentile, in the formula below.

The SRC calculation of the indicator is weighted on the analysed unit’s share of address fractions to

the publications according to the following formula:

$$p_{PRC\#f} = \frac{\sum_{i=1}^{R} \frac{P_{PRC\#f(i)}}{A_i S_i}}{\sum_{i=1}^{R} \frac{1}{A_i S_i}} \qquad [6]$$

where:

pPRC#f = the weighted average share of publications cited above the #th percentile

PPRC#f(i) = the cited over threshold for #th percentile value of fraction i

Si = the number of subject fields the publication of fraction i has been classified as belonging to

The indicator can be calculated using different citation windows, but it seldom includes self‐citations.

It is worth pointing out that the share of publications cited more than the 99th percentile is usually less

than 1%, and correspondingly for the other percentile values.

6.6 Journal normalised citation rate (cj)

This indicator shows how an analysed unit’s publications are cited in relation to the average citation

rate for publications of the same document type and publication year in the same journal.

Since no normalisation against subject fields is performed here, the data set for calculation is only

fractionalised on addresses (not subject‐address as customary).

The indicator is calculated according to the following formula:

$$c_j = \frac{\sum_{i=1}^{R} \frac{C_i}{\mu_{j(i)} A_i}}{\sum_{i=1}^{R} \frac{1}{A_i}} \qquad [7]$$

where:

cj = the weighted average journal normalised citation rate

Ci = the number of citations to the publication of fraction i

μj(i) = the journal reference value for the publication of fraction i

The indicator can be calculated using different citation windows, but it seldom includes self‐citations.

Only publications with at least one subject classification and one address are considered in this

calculation.

6.7 Journal to field normalised citation rate (jf)

Sometimes, it can be of interest to study the average citation rate of the journals a unit publishes in, in relation to the average citation rate of the fields the journal is classified in.

This indicator is called the journal to field normalised citation rate and is calculated according to the

following formula:

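$$j_f = \frac{\sum_{i=1}^{R} \frac{\mu_{j(i)}}{\mu_{f(i)} A_i S_i}}{\sum_{i=1}^{R} \frac{1}{A_i S_i}} \qquad [8]$$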

where:

R = the number of publication fractions attributed to the analysed unit

μj(i) = the journal reference value for the publication of fraction i

μf(i) = the field reference value for the field where fraction i is classified

Si = the number of subject fields the journal of fraction i has been classified as belonging to

The indicator can be calculated using different citation windows, but self‐citations are seldom

included.

7 References

Gunnarsson, M., Fröberg, J., Jacobsson, C. & Karlsson, S. (2008). Subject classification of publications in the

Thomson database based on references and citations. Vetenskapsrådet, Stockholm.

Karlsson, S. & Wadskog, D. (2006). Hur mycket citeras svenska publikationer? Bibliometrisk översikt över

Sveriges vetenskapliga publicering mellan 1982 och 2004. Vetenskapsrådets rapportserie 13:2006.

(www.vr.se/download/18.5b5b80b310e317e3c0680001207/Bibliometrirapport)

Lundberg, J. (2007). Lifting the crown – citation z‐score. Journal of Informetrics, 1(2), 145‐154.

Moed, H. F., De Bruin, R. E., & Van Leeuwen, T. N. (1995). New bibliometric tools for the assessment of

National Research Performance – Database description, overview of indicators and first

applications. Scientometrics, 33(3), 381–422.

8 Appendices

8.1 SRC denotation of parameters and indicators

8.1.1 Denotation overview

General rules guiding a suggested extendable denotation scheme:

• Absolute numbers are denoted with upper case letters (P, C).

• Relative numbers (quotients) are denoted with lower case letters (p, c).

• Reference values for normalisations are denoted using Greek characters (μ, τ)

• Index letters are used to indicate special conditions regarding the indicator or the reference

value. In situations where it is not possible to use index letters, indices may be written using

brackets or underscores, e.g. cf may be denoted as c[f] or c_f.

• Methodological aspects such as fractionalisation, weighting, averaging, normalisation level, length

of citation windows and removal of self‐citations are not indicated in suggested denotations

where those can be considered to be an integral part of the resulting indicator. However,

where the handling of self‐citations or fractionalisation is a vital part of the resulting indicator,

they are indicated. Please see the method index below the denotation list.

• In situations where the same methods are used throughout a study and methods are clearly

stated, methodological indices may be omitted and raw denotations such as P, C or c may be used

to make the presentation less cluttered.

8.1.2 General abbreviations

p publication

sc self‐citation

h whole counts

r fractionalised

j journal

t top

%, prc percentile

u un‐, non‐, zero

a author

y year

f field

g group

w citation window

8.1.3 Specific abbreviations

Denotation | English description | Swedish description

P | Number of publications (counted as separately defined) | Antal publikationer (beräknat enligt separat definition)

Ph | Number of publications, whole counts | Antal publikationer, utan fraktionering

Pr | Number of publications, fractionalised counts (as separately defined) | Antal publikationer, fraktionerat (enligt separat definition)

Pf#% | Number of publications cited more than the #th percentile of the field, usually 90, 95 and 99 | Antal publikationer citerade mer än #:e percentilen i fältet, vanligen 90, 95 och 99

Puc | Number of non‐cited publications | Antal ej citerade publikationer

pf#% | Share of publications cited more than the #th percentile of the field, usually 90, 95 and 99 | Andel publikationer citerade mer än #:e percentilen i fältet, vanligen 90, 95 och 99

pf50% | Share of publications cited more than the median (50th percentile) of the field | Andel publikationer citerade mer än medianen (50:e percentilen) i fältet

top#f | World‐relative share of publications cited more than the #th percentile in the field | Relativ andel publikationer citerade fler gånger än den #:e percentilen för fältet

puc | Share of non‐cited publications | Andel ej citerade publikationer

C | Total number of citations | Totalt antal citeringar

Cp | Number of citations to a single publication | Antal citeringar till en publikation

Cy | Number of citations to publications published year y | Antal citeringar till publikationer publicerade
