Bibliometrics and Transaction Log Analysis
Bibliometrics
Citation Analysis
Bibliometrics
Definitions:
Quantitative study of literatures as reflected in bibliographies
Use of quantitative analysis and statistics to describe patterns of
©Gary Geisler ♦ Simmons College ♦ LIS 403 ♦ Spring, 2004
Bibliometrics
Generally speaking, bibliometrics helps explore questions
about bodies of literature and the authors that produce it:
How scholarly is the cited literature?
How current is the cited literature?
How research oriented is it?
How interdisciplinary is it?
Who writes that literature?
Where does the literature appear?
Bibliometrics
More specifically, enables investigation of basic research
questions:
Providing a macro perspective on scientific communication
Determining the influence of a single author
Describing the relationship between two or more authors or works
Demonstrating the emergence of new subject fields
Describing the growth of literature on a subject
Quantifying the productivity of individual authors
Bibliometrics
Findings can be applied to range of practical problems:
Collection development
Thesaurus development
Development of indexes, abstracts, taxonomies, metadata
Collection pruning
Bibliometrics
Two distinct bibliometric approaches have developed in
parallel
Analysis of distribution properties resulting in statistical laws or
mathematical models
Range of methods that enable specific descriptions of the content,
Bibliometrics
Bibliometric laws
Lotka’s Law of scientific productivity
Describes the frequency of publication by authors in a given field
Demonstrates that only a small percentage of authors in a field are
Bibliometrics
Bibliometric laws
Bradford’s Law of Core and Scatter in Journals
Demonstrates that a small portion of journals in a field contain a
substantial portion of relevant articles in the field
Journals in a single field can be divided into three parts, each containing
the same number of articles:
1. A core of journals, few in number, that produces one-third of all the articles
2. A second zone, containing the same number of articles as the first, but a greater
number of journals
3. A third zone, containing the same number of articles as the second, but a
still greater number of journals
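The three-zone division above can be sketched in a few lines. This is only an illustration: the function name `bradford_zones` and the article counts are made up, and real data rarely splits into exactly equal thirds.

```python
def bradford_zones(articles_per_journal, zones=3):
    """Split journals, ranked by productivity, into zones with roughly equal article totals."""
    ranked = sorted(articles_per_journal, reverse=True)
    target = sum(ranked) / zones  # articles each zone should hold
    result, current, running = [], [], 0
    for count in ranked:
        current.append(count)
        running += count
        # Close a zone once the cumulative total reaches its share;
        # the last zone simply takes whatever journals remain.
        if len(result) < zones - 1 and running >= target * (len(result) + 1):
            result.append(current)
            current = []
    result.append(current)
    return result


# Hypothetical articles-per-journal counts for one field.
counts = [40, 20, 10, 10, 8, 6, 6, 5, 5, 4, 3, 3]
for number, zone in enumerate(bradford_zones(counts), start=1):
    print(f"Zone {number}: {len(zone)} journals, {sum(zone)} articles")
```

With these invented counts each zone holds 40 articles, while the number of journals needed to supply them grows from zone to zone.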
Bibliometrics
Bibliometric laws
Zipf’s Law of Word Frequency
Predicts the frequency of words within a text
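Zipf's Law is usually stated as: the frequency of a word is roughly inversely proportional to its rank, so rank times frequency stays roughly constant. A minimal sketch for checking that on a text follows; the function name `rank_frequency` and the simple tokenizing rule are assumptions made for illustration.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer: letters and apostrophes
    ranked = Counter(words).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


sample = "the cat and the dog and the bird"
for rank, word, freq in rank_frequency(sample):
    # Under Zipf's Law, rank * freq would be roughly constant on a large text.
    print(rank, word, freq, rank * freq)
```

A sentence this short will not follow the law; the pattern only emerges on a reasonably large body of text.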
Bibliometrics
Citation analysis
Tool to identify core sets of articles, authors, or journals of particular fields of
study, and to describe relationships and trends within and between these entities
When one author cites another author, a relationship is established between:
Authors
Journals, publishers
Disciplines, fields, subject areas
Keywords
Institutions, countries, languages
Bibliometrics
Citation analysis
Three distinct approaches
Co-citation analysis
Bibliographic coupling
Bibliometrics
Co-citation analysis
Method used to establish a subject similarity between two documents
Number of times two documents are jointly cited in other documents
If papers A and B are both cited by paper C, they can be said to be
related to one another, even though they don’t directly cite each other
The more papers A and B are both cited by, the stronger their
relationship is
Can be used to map the topical relatedness of clusters of authors, journals
or articles
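Counting co-citations as described above is straightforward. A minimal sketch follows, assuming a hypothetical `references` dictionary that maps each citing paper to the papers it cites; the paper labels are invented.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(references):
    """Count how often each pair of documents is cited together.

    references maps a citing paper to the list of papers it cites.
    """
    counts = Counter()
    for cited in references.values():
        # Every pair appearing in the same reference list is co-cited once.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


references = {
    "C": ["A", "B"],
    "D": ["A", "B", "E"],
    "F": ["A", "E"],
}
for pair, count in cocitation_counts(references).most_common():
    print(pair, count)
```

Here A and B are cited together by C and by D, so their co-citation count is 2; higher counts indicate a stronger relationship.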
Bibliometrics
Co-citation analysis
“Influential Authors in LIS 2000-2002 - A First Author Co-citation Map”
http://www.umu.se/inforsk/
Bibliometrics
Co-citation analysis
AuthorLink Co-citation Map
http://faculty.cis.drexel.edu/~xlin/authorlink.html
Bibliometrics
Bibliographic coupling
Assumes two documents that both cite the same document have
something in common
Links two papers that cite the same articles, so that if papers A and B
both cite paper C, they may be said to be related, even though they
do not directly cite each other
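Bibliographic coupling looks at the same citation data from the other direction: the link between two papers is the set of references they share. A minimal sketch, assuming the same hypothetical citing-paper-to-references dictionary; the name `coupling_strength` is illustrative.

```python
def coupling_strength(references, paper_a, paper_b):
    """Number of cited documents that two citing papers have in common."""
    return len(set(references[paper_a]) & set(references[paper_b]))


references = {
    "A": ["C", "D", "E"],
    "B": ["C", "E", "F"],
    "G": ["H"],
}
print(coupling_strength(references, "A", "B"))  # A and B share C and E
print(coupling_strength(references, "A", "G"))  # no shared references
```

Unlike a co-citation count, this value is fixed once both papers are published, because their reference lists do not change.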
Bibliometrics
Co-word analysis
Based on analysis of co-occurrence of keywords used to index
documents
Useful for:
Mapping the content of research in a field
Creating indexes or thesauri for a given subject domain
Supplementing search terms in information retrieval systems
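One way to use keyword co-occurrence to supplement search terms is to rank the keywords that most often share a record with a given term. The sketch below assumes each document is represented simply as a list of index keywords; the function name and the sample records are invented.

```python
from collections import Counter


def related_terms(indexed_docs, term, top=3):
    """Rank keywords by how often they are assigned to the same document as term."""
    counts = Counter()
    for keywords in indexed_docs:
        unique = set(keywords)
        if term in unique:
            counts.update(unique - {term})
    return counts.most_common(top)


# Hypothetical index records: one keyword list per document.
docs = [
    ["bibliometrics", "citation analysis", "impact factor", "journals"],
    ["bibliometrics", "citation analysis", "impact factor"],
    ["bibliometrics", "citation analysis"],
    ["impact factor", "peer review"],
]
print(related_terms(docs, "bibliometrics"))
```

The top-ranked keywords are candidates for query expansion or for related-term entries in a thesaurus.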
Bibliometrics
Co-word analysis
ConceptLink
http://faculty.cis.drexel.edu/~xlin/conceptlink.html
[Embedded page excerpt describing AuthorLink and ConceptLink; body text garbled in extraction. Figure caption: A concept map for the search "back pain". Four clusters are clearly seen (as indicated by the circles, which are not generated by the system; they are added here for illustration only).]
Bibliometrics
Example of practical value of citation analysis
Collection development
Collection planning: determine information needs, make decisions about priorities
Collection implementation: organizing collection, creating useful indexing aids for
finding resources
Tasks require knowledge about the structure of a subject field, about information
resources used, about important themes and terminology upon which the
collection can be organized and indexed
Co-citation analysis, bibliographic coupling, co-word analysis can each be useful:
Mapping the structure and use of the relevant literature
Bibliometrics
Measuring growth and obsolescence
Use of citation data to measure half-life of articles, journals, fields
Median citation age: based on publishing years of citing publications and
publishing years of citations
Price index: measure of how many citations in a publication are at most five years
old at the time of publishing
Index value is a measure of the increase of publications in the subject field
If a field grows 10% a year, the literature doubles in about 7 years, and 39% of
the literature was published during the past five years
Humanities have a low Price index; obsolescence is slow
Emerging sciences have a high Price index; obsolescence is relatively quick
Can be calculated annually to demonstrate changes and trends
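The measures on this slide can be sketched as small functions. Two assumptions are made here: the Price index is computed as the share of references at most five years old, and the growth figures use a continuous exponential growth model, which happens to reproduce the "about 7 years" and "39%" values quoted above. Function names and the sample reference list are invented.

```python
import math
import statistics


def price_index(citing_year, cited_years, window=5):
    """Share of references that are at most `window` years old at publication."""
    recent = sum(1 for year in cited_years if citing_year - year <= window)
    return recent / len(cited_years)


def median_citation_age(citing_year, cited_years):
    """Median age of the references at the time the citing work was published."""
    return statistics.median(citing_year - year for year in cited_years)


def doubling_time(rate):
    """Years for a literature to double, assuming continuous growth at `rate` per year."""
    return math.log(2) / rate


def share_published_recently(rate, years=5):
    """Fraction of the whole literature published in the last `years` years."""
    return 1 - math.exp(-rate * years)


cited = [2003, 2001, 1999, 1990, 1985]  # hypothetical reference list of a 2004 paper
print(price_index(2004, cited))          # 3 of 5 references are recent
print(median_citation_age(2004, cited))
print(round(doubling_time(0.10), 1))
print(round(share_published_recently(0.10), 2))
```

A discrete model (10% compounded once a year) gives slightly different values, so the choice of model is worth stating when reporting these figures.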
Bibliometrics
Impact Factor
Measure of the frequency with which the “average article” in a journal
has been cited in a particular year or period
A = total citations in a year (example: 2001)
B = 2001 citations to journal (X) articles published in years 1999-2000 (subset of A)
C = number of articles published in journal (X) in years 1999-2000
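In the standard two-year definition, the impact factor is the ratio B / C using the quantities defined above; A only provides the pool from which B is drawn. A minimal sketch with made-up numbers (the function and parameter names are illustrative):

```python
def impact_factor(citations_to_recent_articles, recent_articles_published):
    """Two-year impact factor: B / C in the slide's notation."""
    if recent_articles_published == 0:
        raise ValueError("journal published no articles in the two-year window")
    return citations_to_recent_articles / recent_articles_published


# Hypothetical journal X: 300 citations in 2001 to its 1999-2000 articles (B),
# and 120 articles published in 1999-2000 (C).
print(impact_factor(300, 120))
```

With these invented figures, the "average article" from 1999-2000 was cited 2.5 times in 2001.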
Bibliometrics
Impact Factor
Provides an approximation of the prestige of journals in which
individuals have been published
Gives library administrator information about journals in existing
collection and journals being considered for acquisition
Can be useful, but many caveats about use (eliminate self-citations,
Bibliometrics
Strengths of bibliometrics as a research approach
Methods are objective and repeatable
Results have a wide range of potential practical value
Does not require human subject interaction
High reliability in that data are collected unobtrusively, from the
published record, and can be easily replicated by others
Bibliometrics
Limitations of bibliometrics as a research approach
Results are only valid to the extent that citations are assumed to represent a
significant link between citing and cited documents, a questionable
assumption:
Citations are made for many reasons other than topic similarity or quality
Citations that should be made often are not
Technical issues related to data obtained from citation indexes and
bibliographies
Variations and misspellings of author names, authors with the same name,
incomplete coverage of non-English publications
Bibliometrics
Bibliometric methods not widely used by librarians for
practical problems
In recent years, however:
Rapid emergence of new subject fields and interdisciplinary
publications
Explosive growth in number of available documents
Bibliometrics provides tools that can help librarians deal with
challenges posed:
Bibliometrics
Bibliometric related resources
ISI Web of Knowledge
Simmons Libraries -> GSLIS -> Online databases pulldown menu
Userid: simm23 Password: educate
Try:
ISI Web of Science - citations to a given article or author
ISI Journal Citation Reports - Social Sciences, subject category;
Transaction Log Analysis
Number of digital documents and users of those documents
growing rapidly
Findings from the How Much Information? project
(http://www.sims.berkeley.edu/research/projects/how-much-info-2003/)
New stored information grew about 30% a year between 1999 and 2002
Almost 800 MB of recorded information is produced per person each year
The World Wide Web contains about 170 terabytes of information on its
surface; about seventeen times the size of the Library of Congress print
collections
Transaction Log Analysis
Basic concepts of bibliometrics can also be applied to patterns of
usage beyond citations
Transaction log analysis or webmetrics
Analyzing usage patterns in a digital environment
Allows range of other types of observations
Citations do not necessarily reflect usage
Transaction logs generally do reflect real usage
Web server log analysis
ILL records, circulation records
Browsing data
Transaction Log Analysis
Web log data
One or more log files on the Web server can record:
IP address of requesting computer
Date and time of request
Page (filename) requested
Referrer page (URL of page that brought user)
Web browser/operating system of requesting computer
Search terms used from search engine
Can also relatively easily create customized logs for a given system to
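As an illustration of how these fields are pulled out of a log, the sketch below parses one line in the widely used Apache "combined" log format. The sample line, the host names, and the assumption that the search engine passes its terms in a `q` parameter of the referrer URL are all hypothetical.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs, urlparse

# Combined log format: host ident user [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def search_terms(referrer, parameter="q"):
    """Extract search terms from a referrer URL (parameter name is an assumption)."""
    values = parse_qs(urlparse(referrer).query).get(parameter, [])
    return values[0].split() if values else []


sample = (
    '192.0.2.10 - - [12/Feb/2004:10:15:32 -0500] "GET /details.php?id=42 HTTP/1.0" '
    '200 5120 "http://search.example.com/?q=open+video" "Mozilla/4.0"'
)
record = parse_line(sample)
print(record["ip"], record["time"], record["page"])
print(search_terms(record["referrer"]))
```

Servers can be configured to log other formats, so the pattern would need adjusting to match the format actually in use.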
Transaction Log Analysis
Types of possible analysis
Session level: complete sequence of requests/queries by a given user
Characterize actions of and information sought by user
What is the user trying to accomplish?
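Session-level analysis first requires grouping raw requests into sessions. A common heuristic, used here as an assumption, is to treat requests from the same IP address as one session until there is a gap of more than 30 minutes; the IP address is only an approximate stand-in for a user. The function name and sample data are invented.

```python
from datetime import datetime, timedelta


def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp, page) requests into sessions.

    A new session starts when an IP address is seen for the first time
    or after it has been idle for longer than `timeout`.
    """
    sessions = []
    last_seen = {}  # ip -> (time of last request, that request's session)
    for ip, when, page in sorted(requests, key=lambda request: request[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous[0] > timeout:
            session = []
            sessions.append(session)
        else:
            session = previous[1]
        session.append((ip, when, page))
        last_seen[ip] = (when, session)
    return sessions


requests = [
    ("192.0.2.10", datetime(2004, 2, 12, 10, 0), "/main"),
    ("192.0.2.99", datetime(2004, 2, 12, 10, 2), "/main"),
    ("192.0.2.10", datetime(2004, 2, 12, 10, 5), "/search?q=video"),
    ("192.0.2.10", datetime(2004, 2, 12, 11, 0), "/main"),
]
for session in split_sessions(requests):
    print([page for _, _, page in session])
```

Once requests are grouped, the sequence of pages and queries within each session can be read as a trace of what the user was trying to accomplish.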
Transaction Log Analysis
Types of possible analysis
Page/object level: access to specific pages or objects in the system
Which pages are most popular?
Which files, images, videos are most frequently looked at or downloaded?
Errors resulting from page or resource requests
Query level: how users navigate or attempt to find information or
resources
Which query terms are used?
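Page-level and query-level questions mostly reduce to counting. A minimal sketch follows, assuming the requested pages and the query strings have already been extracted from the log; the function names and sample values are invented.

```python
from collections import Counter


def top_pages(pages, n=3):
    """Most frequently requested pages, most popular first."""
    return Counter(pages).most_common(n)


def query_stats(queries):
    """Return (term frequencies, mean number of terms per query)."""
    term_counts = Counter()
    total_terms = 0
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        total_terms += len(terms)
    mean_terms = total_terms / len(queries) if queries else 0.0
    return term_counts, mean_terms


pages = ["/main", "/search", "/main", "/details", "/main", "/search"]
queries = ["back pain", "back pain therapy", "video"]
print(top_pages(pages))
term_counts, mean_terms = query_stats(queries)
print(term_counts.most_common(2), mean_terms)
```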
Transaction Log Analysis
Example 1: Analyzing user queries from Excite search engine logs
Jansen, Bernard J., & Amanda Spink. (2000). Methodological approach in discovering user search patterns through web log analysis: using the Excite search engine. Bulletin of the American Society for Information Science. 27, no1: 15-17.
http://www.asis.org/Bulletin/Oct-00/janses___spink.html
Log of 1 million queries each in 1997 and 1999:
Mean queries per user session: 4.8 in 1997, 2.0 in 1999
Mean terms per query: 2.4 in 1997, 2.35 in 1999
Users most often view at most 10 results
Only about 8% of users use Boolean queries
Transaction Log Analysis
Example 2: Analyzing user activity on Open Video site
The open-video.org Web site was redesigned in September 2003
How are users using the redesigned site?
Which pages are most popular?
Which options in the search results page do they use?
Transaction Log Analysis
Example 2: Analyzing user activity on
Open Video site
User activity in 4 months after redesign:
Total of 69,589 ‘unique’ visitors
Total of 140,135 downloads

Page                           Views
Video Details                  348,974
Search Results                 276,745
Main                           150,622
Popular Video                   61,429
Special Collection Details      12,227
New Video                        4,133
Project Information              4,004
Detailed Search                  3,097
Special Collections              3,013
Related Video                    2,842
Project News                     2,427
Random Video                     1,835
Contributing Video Info          1,503
Help on Playing Video            1,465
Project Publications               521
Browser Compatibility              390
Project Contacts                   334
Transaction Log Analysis
Example 2: Analyzing user activity on
Open Video site
User activity in 4 months after redesign:
Finding video by popularity much more common than by lists of new or random video
Transaction Log Analysis
Example 2: Analyzing user activity on Open Video site
Which options do users use to sift search results?
Visual layout of results
Ordering criteria
Transaction Log Analysis
Example 2: Analyzing user activity on Open Video site
Sifting options - User choice of visual layout of results options

Option              # of Selections
Large thumbnails    221,540 *
Text                 13,223
Small thumbnails     16,029
Thumbnails only      12,730
* Default choice
Transaction Log Analysis
Example 2: Analyzing user activity on Open Video site
Sifting options - User choice of ordering criteria of results

Option        # of Selections
Relevance     258,386 *
Title           3,700
Year            6,735
Duration        1,320
Popularity      6,604
* Default choice
Transaction Log Analysis
Example 2: Analyzing user activity on Open Video site
Sifting options - User choice of size of visible set of results

Option    # of Selections
10        252,207 *
20          4,923
30          3,600
50          5,585
100         7,350
All        10,430
* Default choice
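One way to read the three sifting tables is to ask how often the default option was the one in effect. The sketch below computes that share from the counts reported on the slides; the function name `default_share` is illustrative, and the calculation says nothing about why users stayed with the defaults.

```python
def default_share(selections, default):
    """Fraction of all recorded selections that used the default option."""
    return selections[default] / sum(selections.values())


layout = {"Large thumbnails": 221_540, "Text": 13_223,
          "Small thumbnails": 16_029, "Thumbnails only": 12_730}
ordering = {"Relevance": 258_386, "Title": 3_700, "Year": 6_735,
            "Duration": 1_320, "Popularity": 6_604}
page_size = {"10": 252_207, "20": 4_923, "30": 3_600,
             "50": 5_585, "100": 7_350, "All": 10_430}

for name, table, default in [("layout", layout, "Large thumbnails"),
                             ("ordering", ordering, "Relevance"),
                             ("page size", page_size, "10")]:
    print(f"{name}: {default_share(table, default):.1%} used the default")
```

In all three tables the default accounts for well over four-fifths of the selections.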
Transaction Log Analysis
Limitations of transaction log analysis
Assumption that an IP address represents a unique user is often not true
Dynamic IP addresses - same user can have different IP addresses
Shared computers - different users can have same IP address
Web pages can be cached, both by the client machine and by the
Internet Service Provider (ISP)
Do not know user motivation for page, query selection
Privacy concerns - user registration can obviate variable IP address