Bibliometric Mapping Tools v1.0 The Manual
Paul Meara January 2014
0: This manual describes how to use the Bibliometric Mapping Tools program located at http://www.lognostics.co.uk/tools/mappingtools/
The manual contains the following sections:
1: general introduction
2: preparing your data for analysis
3: running the program
4: advice and counsel
5: about Bibliometric Mapping Tools
6: cautions and disclaimers
7: References
8: Appendix – how to obtain GEPHI
The program (BMT for short) takes a text file containing an author list, and turns it into a .gml file which can be used with the mapping program GEPHI (Bastian, Heymann and Jacomy 2009). BMT is not itself a mapping program. Its main purpose is to simplify the job of preparing bibliometric data for display.
Instructions for how to obtain a copy of GEPHI are included in Section 8 of this manual.
BMT is not a general bibliometric data processor. It is specifically designed to prepare data for an author co-citation analysis. The general approach is described in work by Small (e.g. Small 1973, cf. also Garfield 1993). Small argued that clusters of authors who are cited together identify systematic trends in a research field. He suggested that changing patterns of citations can be used to identify “research fronts”, “invisible colleges” and “paradigm shifts” within a discipline. The basic approach is outlined in Price (1965) and some specific examples can be found in Boyack and Klavans (2010) and Åström (2007). Author co-citation analysis is not widely used in the Humanities (cf. Hellqvist 2010) and there are some suggestions that the use of citations in humanities research is very different from what we expect to find in scientific research. Examples of how the method can be used with sources in Applied Linguistics can be found in Meara (2012) and Meara (2013).
2: preparing your data for analysis
The raw data for an author co-citation analysis consists of a set of papers that cover the area you are interested in. Assembling this list of papers is not as straightforward as it might seem at first glance. Your data set needs to be large enough to be interesting and representative of the area you have chosen to analyse, and yet small enough to be manageable. Most bibliometric work seems to rely on pre-defined data sets, where you have little room for manoeuvre over what work to include or omit from your selection. A typical example would be a study of all the papers that appear in a specific journal between 2000 and 2010, or a study of all the papers that appear in a specific set of journals in 2005. Data sets of this sort are fairly uncomplicated. A more complicated example would be, say, all the papers published in a single journal that deal with vocabulary acquisition. In this case, you have to make a judgment call as to whether a specific paper merits inclusion or not: some papers deal with vocabulary acquisition within the broader context of reading, for example, and you will need to decide whether the vocabulary content is sufficient to make this paper an appropriate source or not.
Once you have assembled your data set, you need to extract the author data from the bibliographical entries of each paper. Again, this is not straightforward, particularly if the literature you are dealing with is fairly old. Modern citation practices are rarely followed in early journals. If you are particularly well-resourced, then you may be able to get hold of the necessary data by mining it from one of the very large citation databases such as the Web of Knowledge (http://wok.mimas.ac.uk). If you are interested in a narrow area, you may find that some citation data is available on-line. (The lognostics site (http://www.lognostics.co.uk) has an extensive collection of citation data for vocabulary research, for example.) However, this manual assumes that these options are not available to you, and that you are going to have to construct your author lists by hand.
Before you start to prepare your data for BMT, you need to make a decision as to whether you intend to do an all-inclusive author analysis, or a first author only analysis. Current wisdom (e.g. Zhao and Strotman 2007) seems to suggest that it doesn't make a lot of difference which of these approaches you adopt – though this conclusion comes from an area where it is common for papers to be authored by six or more authors, and it's not obvious that it holds for the more cautious humanities. The first author only approach is easier to implement, but it seems to me it loses a lot of data. Specifically, it tends to down-play the importance of younger authors, and magnify the importance of better established researchers.
The approach recommended here, then, is the all-inclusive approach. In this approach, you need to identify all the authors cited in each individual paper in your data set. For BMT, you need to assemble this data into a specific format explained below. This format allows BMT to prepare a data file which can be mapped using GEPHI.
The format that BMT requires is as follows:
For each paper, you need an alphabetical list of names. Each name is listed only once, irrespective of how many times it is cited in the paper. Each individual name should be listed on a single line of a .txt file. You can use any format you like for the names, as long as you are consistent, and as long as your text does not contain white space or punctuation. I prefer to use lower-case entries, with an underscore joining the surname to the initials, like this:
meara_pm (not Meara, P. M.) or
Once you have your name list, you need to add two delimiter lines, which BMT uses to identify the start and finish of each individual paper. The first line of your list should look like this:
[Laufer&Nation_1996
where you replace Laufer&Nation_1996 with any label that serves for you to identify the paper your names list comes from. Avoid white space and punctuation in these labels.
The last line looks like this:
]
If you follow these instructions, then you should find that you end up with a file that looks something like this:
[meara_1992
arnaud_p
laufer_b
nation_isp
schmitt_n
]
[Schmitt_2003
laufer_b
meara_pm
nation_isp
schmitt_d
schmitt_n
]
[Laufer_2015
laufer_b
meara_pm
nation_isp
schmitt_n
talmy_d
xian_pq
]
This file summarises a set of three papers: meara_1992 cites arnaud_p, laufer_b, nation_isp and schmitt_n; Schmitt_2003 cites laufer_b, meara_pm, nation_isp, schmitt_d and schmitt_n; Laufer_2015 cites laufer_b, meara_pm, nation_isp, schmitt_n, talmy_d and xian_pq.
Note that the names within each source are in alphabetical order, but the order that the sources appear in is arbitrary. The important point to note is that the citation data from each paper starts and ends with a square bracket.
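If you want to check a data file before you submit it, the format rules above can be sketched in a few lines of Python. This is only an illustration, not part of BMT: the function name parse_bmt and the rule that names may contain only letters and underscores are my own assumptions, and BMT itself may be more or less strict.

```python
import re

# Assumption: a name consists only of letters and underscores.
NAME_RE = re.compile(r"[A-Za-z_]+")

DEMO = """[meara_1992
arnaud_p
laufer_b
nation_isp
schmitt_n
]
[Schmitt_2003
laufer_b
meara_pm
nation_isp
schmitt_d
schmitt_n
]
[Laufer_2015
laufer_b
meara_pm
nation_isp
schmitt_n
talmy_d
xian_pq
]
"""


def parse_bmt(text):
    """Parse BMT-style input into {label: [names]} and collect problems."""
    papers, problems = {}, []
    label, names = None, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.startswith("["):
            if label is not None:
                problems.append(f"line {lineno}: new paper before closing ']'")
            label, names = line[1:], []
        elif line == "]":
            if label is None:
                problems.append(f"line {lineno}: ']' without an opening label")
                continue
            if names != sorted(names):
                problems.append(f"{label}: names are not in alphabetical order")
            if len(names) != len(set(names)):
                problems.append(f"{label}: a name is listed more than once")
            papers[label] = names
            label = None
        elif label is None:
            problems.append(f"line {lineno}: name outside any paper: {line!r}")
        elif not NAME_RE.fullmatch(line):
            problems.append(f"line {lineno}: unexpected characters in {line!r}")
        else:
            names.append(line)
    if label is not None:
        problems.append(f"{label}: missing closing ']'")
    return papers, problems


papers, problems = parse_bmt(DEMO)
print(len(papers), "papers;", len(problems), "problems")
```

The alphabetical check uses Python's default string ordering, so it is only reliable if you have entered every name in lower case.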
Once you have prepared this list of names, you need to save it as a text file. The easiest way to do this is to prepare your data using a very simple word-processor like the Microsoft NOTEPAD program, or GEDIT in Linux. These programs automatically save your text in .txt format. If you want to, you can use a more complicated word processor, but if you do this you must remember to save your data using the .txt option. This saves your data in a very small file, with only minimal formatting. You will find some practical suggestions for preparing data files in section 4.
3: running BMT
You can access the BMT program at http://www.lognostics.co.uk/tools/
Click on the RUN button to start the program. The BMT interface looks like this:
The button in the top left hand corner gives you access to the BMT Manual.
Drag and Drop your data into the large square at the bottom of the screen. To do this, you will need to carry out the following steps:
1. Open your data file.
2. Open the edit menu in your word processor.
3. Click Select All. This will highlight all the text in the file.
4. Left Click on your text, but do not release the button.
5. Drag your mouse to the box in the BMT window.
6. Release the mouse button.
This should insert a copy of your data into the BMT data box.
There are three boxes on the right hand side of the BMT screen. The role of these boxes is explained later. For the moment, leave the default values in place.
Click the SUBMIT button to begin processing your data. BMT will then generate two further files for you to use.
The first file, Names.txt, is a frequency list of all the names which appear in your data. You can access this file by clicking on the Names.txt download button.
This opens a new window containing your frequency data – all the names in your data in alphabetical order, along with the number of times each name appears in your data.
Your browser should now allow you to save this file to your computer.
The main use of this names file is that it allows you to check that you have been consistent in the way that you have entered your data. Names that have variant spellings, names that have different initials, names that contain non-alphabetic characters, blank lines in your data file, and so on, can all be easily identified by a careful perusal of this file. Remember that any corrections that you identify need to be made in the original data file, and not in Names.txt itself.
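A frequency list of this kind is easy to reproduce for yourself, which can be handy when you are checking a small test file. The sketch below is my own illustration, not BMT's code: it assumes the data has already been read into a dictionary that maps each paper label to its list of names, and the function name name_frequencies is hypothetical.

```python
from collections import Counter

# The three demonstration papers from section 2.
PAPERS = {
    "meara_1992": ["arnaud_p", "laufer_b", "nation_isp", "schmitt_n"],
    "Schmitt_2003": ["laufer_b", "meara_pm", "nation_isp", "schmitt_d", "schmitt_n"],
    "Laufer_2015": ["laufer_b", "meara_pm", "nation_isp", "schmitt_n", "talmy_d", "xian_pq"],
}


def name_frequencies(papers):
    """Return (name, number of papers citing it) pairs in alphabetical order."""
    counts = Counter()
    for names in papers.values():
        counts.update(set(names))  # set(): a name counts once per paper
    return sorted(counts.items())


for name, count in name_frequencies(PAPERS):
    print(name, count)
```

Variant spellings tend to show up as neighbouring entries with low counts, so it is worth reading the sorted list slowly.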
It is difficult to overstate how important it is to make sure that your data is extremely well proofed.
The second file generated by BMT is a .gml file that you can use as input to Gephi. You can download this file by clicking on the data.gml download button.
Files in .gml format have the following structure:
a header line,
a set of lines that identify the nodes to appear in your graph,
a set of lines that identify the connections between the nodes in your graph,
an end line.
A typical .gml file looks something like this:
graph [CREATOR "demodata2014"
node [ id 1 label "xian_pq" ]
node [ id 2 label "talmy_d" ]
node [ id 3 label "meara_pm" ]
node [ id 4 label "schmitt_n" ]
node [ id 5 label "arnaud_p" ]
node [ id 6 label "schmitt_d" ]
node [ id 7 label "laufer_b" ]
node [ id 8 label "nation_isp" ]
edge [ source 8 target 2 type undirected value 1]
edge [ source 7 target 4 type undirected value 3]
edge [ source 5 target 7 type undirected value 1]
edge [ source 8 target 6 type undirected value 1]
edge [ source 7 target 8 type undirected value 3]
edge [ source 8 target 4 type undirected value 3]
edge [ source 3 target 2 type undirected value 1]
edge [ source 2 target 1 type undirected value 1]
edge [ source 3 target 1 type undirected value 1]
edge [ source 7 target 3 type undirected value 2]
edge [ source 6 target 4 type undirected value 1]
edge [ source 5 target 4 type undirected value 1]
edge [ source 8 target 1 type undirected value 1]
edge [ source 5 target 8 type undirected value 1]
edge [ source 3 target 8 type undirected value 2]
edge [ source 7 target 2 type undirected value 1]
edge [ source 4 target 1 type undirected value 1]
edge [ source 3 target 6 type undirected value 1]
edge [ source 3 target 4 type undirected value 2]
edge [ source 7 target 1 type undirected value 1]
edge [ source 4 target 2 type undirected value 1]
edge [ source 7 target 6 type undirected value 1]
]
This file summarises a graph consisting of eight nodes and 22 edges linking these nodes.
The .gml files generated by BMT from even a small dataset are very large. The reason for this is that the number of edges increases rapidly as the number of names cited in a paper grows. A paper with only 10 cited authors will generate 10*9/2=45 edges; a paper with 50 cited authors will generate 50*49/2=1225 edges. For a data set consisting of 50 papers each with a large bibliography, the resulting .gml file might contain several thousand co-citations.
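The way co-citation edges pile up can be sketched as follows. This is an illustration under my own assumptions (the function name cocitation_weights is hypothetical, and BMT's internal workings may differ): every pair of names cited in the same paper counts as one co-citation, and the value of an edge is the number of papers in which the pair occurs together.

```python
from collections import Counter
from itertools import combinations
from math import comb

# The three demonstration papers from section 2.
PAPERS = {
    "meara_1992": ["arnaud_p", "laufer_b", "nation_isp", "schmitt_n"],
    "Schmitt_2003": ["laufer_b", "meara_pm", "nation_isp", "schmitt_d", "schmitt_n"],
    "Laufer_2015": ["laufer_b", "meara_pm", "nation_isp", "schmitt_n", "talmy_d", "xian_pq"],
}


def cocitation_weights(papers):
    """Count, for each pair of names, the papers in which both are cited."""
    weights = Counter()
    for names in papers.values():
        # Sorting keeps each pair in one fixed order, e.g. (laufer_b, meara_pm).
        for pair in combinations(sorted(set(names)), 2):
            weights[pair] += 1
    return weights


weights = cocitation_weights(PAPERS)
nodes = {name for pair in weights for name in pair}
print(len(nodes), len(weights))            # 8 22
print(weights[("laufer_b", "schmitt_n")])  # 3
print(comb(10, 2), comb(50, 2))            # 45 1225
```

For the demonstration data this gives the same eight nodes and 22 edges as the .gml file shown above, and the last line reproduces the two edge counts quoted for bibliographies of 10 and 50 authors.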
It is normal to generate smaller files by setting a threshold for inclusion in the final data set. BMT lets you set two thresholds.
The inclusion threshold eliminates any authors who are cited infrequently. For example, setting this threshold to 5 would exclude any author whose name is cited in four or fewer papers in the data set. Deciding what inclusion threshold to work with is something of a black art that requires a bit of experimentation. Typically research in small-scale bibliometrics likes to work with graphs which contain about 100 nodes, and it is usual to select an inclusion threshold that gives you a graph about that big. The default value, 2, excludes all names which appear in only a single paper. This will typically eliminate about half of the names in a raw data file.
The weighting threshold eliminates weak links in the graph. The default value, 1, includes all the co-citations identified among the included names. Raising this threshold to a higher value prunes the graph, leaving only the very strongest co-citations. This can be a useful simplification if your graph is particularly dense.
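The effect of the two thresholds can be sketched like this. It is a minimal illustration, assuming that a name is kept when it is cited in at least as many papers as the inclusion threshold, and that a link is kept when its value is at least the weighting threshold; the function name thresholded_graph is hypothetical, and BMT may handle details (such as names left with no links) differently.

```python
from collections import Counter
from itertools import combinations

# The three demonstration papers from section 2.
PAPERS = {
    "meara_1992": ["arnaud_p", "laufer_b", "nation_isp", "schmitt_n"],
    "Schmitt_2003": ["laufer_b", "meara_pm", "nation_isp", "schmitt_d", "schmitt_n"],
    "Laufer_2015": ["laufer_b", "meara_pm", "nation_isp", "schmitt_n", "talmy_d", "xian_pq"],
}


def thresholded_graph(papers, inclusion=2, weighting=1):
    """Return (kept names, {pair: value}) after applying both thresholds."""
    # Inclusion threshold: number of papers in which each name is cited.
    freq = Counter(name for names in papers.values() for name in set(names))
    kept = {name for name, count in freq.items() if count >= inclusion}
    # Co-citation values among the names that survive.
    weights = Counter()
    for names in papers.values():
        for pair in combinations(sorted(set(names) & kept), 2):
            weights[pair] += 1
    # Weighting threshold: drop the weak links.
    edges = {pair: value for pair, value in weights.items() if value >= weighting}
    return kept, edges


kept, edges = thresholded_graph(PAPERS, inclusion=2, weighting=1)
print(sorted(kept), len(edges))
```

With the demonstration data, the default thresholds leave four names and six links, and raising the weighting threshold to 3 leaves only the three links that occur in all three papers.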
The parameter label for this data set is a simple text string that identifies the data described by the .gml file. This label appears as part of the first line of the .gml file generated by BMT.
You will find it useful to make these labels easy to interpret. You will probably generate a lot of .gml files from a single data set, so it is advisable to include in the label the two threshold values.
A good label would look something like this: file27-7-3
where file27.txt is, say, the name of the data file being analysed, 7 indicates an inclusion threshold of seven citations, and 3 indicates a weighting threshold of three.
Your browser should let you save the data.gml file to your own computer, and from there you should be able to use it with GEPHI or another mapping program. It's advisable to change the name of this file when you save it to something that reflects the source of the data. For example, a file called file27-7-3.gml is easily identified as a .gml file based on data from file27.txt using inclusion threshold 7 and weighting threshold 3.
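If you generate your own test files, the labelling convention and the .gml layout shown earlier can be combined in a short sketch. The helper names make_label and to_gml are hypothetical, the node numbering is arbitrary, and the layout simply imitates the sample file in this manual; I have not checked it against BMT's own output beyond that.

```python
from pathlib import Path


def make_label(data_file, inclusion, weighting):
    """Build a label such as file27-7-3 from the file name and thresholds."""
    return f"{Path(data_file).stem}-{inclusion}-{weighting}"


def to_gml(label, edges):
    """Render {(name_a, name_b): value} in the layout of the sample .gml file."""
    names = sorted({name for pair in edges for name in pair})
    ids = {name: number for number, name in enumerate(names, start=1)}
    lines = [f'graph [CREATOR "{label}"']
    lines += [f'node [ id {ids[name]} label "{name}" ]' for name in names]
    lines += [
        f"edge [ source {ids[a]} target {ids[b]} type undirected value {value}]"
        for (a, b), value in sorted(edges.items())
    ]
    lines.append("]")
    return "\n".join(lines) + "\n"


label = make_label("file27.txt", 7, 3)
gml_text = to_gml(label, {("laufer_b", "nation_isp"): 3})
print(label + ".gml")
print(gml_text)
```

Writing gml_text to a file named after the label keeps the file name and the CREATOR line in step with each other.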
Instructions on how to use the GEPHI software are beyond the scope of this manual. See section 8 for how to download Gephi and where to find the Gephi Tutorials. It is a very good idea to practice using Gephi with very small data sets before you attempt any serious research of your own.
4: advice and counsel
This section provides some advice on best practice.
The worst part of doing a bibliometric analysis is the preparation of the data files. Apart from the sheer slog of inputting the data, you will need to spend ages proof-reading and re-proof-reading your data file to make sure that you have eliminated all the errors.
You can make things slightly simpler for yourself by adopting the following hints.
Do everything in lower case. This isn't very important, it just saves your fingers a bit of work.
Replace all non-alphabetical characters by _. This eliminates some spelling variants.
Replace accented characters by their unaccented forms. This saves a lot of fingerwork.
Only use one initial unless you need two to differentiate people with the same first initial.
Exclude the names of Corporate Organisations.
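The first three hints are mechanical enough to automate. The sketch below is one possible way of doing it, not something BMT does for you: the function name normalise_name is hypothetical, and it assumes you supply the surname and the initials separately. Letters that have no accent-free decomposition in Unicode (ø, for example) are simply dropped, so names containing them still need attention by hand.

```python
import re
import unicodedata


def _plain_lower(text):
    """Strip accents via Unicode decomposition and convert to lower case."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


def normalise_name(surname, initials):
    """Return an entry such as meara_pm from ('Meara', 'P. M.')."""
    # Runs of non-alphabetic characters inside the surname become one underscore.
    family = re.sub(r"[^a-z]+", "_", _plain_lower(surname)).strip("_")
    # Initials keep their letters only.
    inits = re.sub(r"[^a-z]", "", _plain_lower(initials))
    return f"{family}_{inits}"


print(normalise_name("Meara", "P. M."))  # meara_pm
print(normalise_name("Åström", "F"))     # astrom_f
```

Deciding whether one initial is enough, and spotting corporate authors, still has to be done by eye.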
You might find it helps to do your preliminary data entry using a spreadsheet, rather than a word processor. (Thanks to David Coulson for this suggestion.) Using a spreadsheet means that you can enter the names in the order they appear in your source, and sort them into alphabetical order when the list is complete. This is a lot quicker, and less error-prone than sorting your data into alphabetical order by hand.
Some spreadsheet programs remember what you type into them, and this means that common entries will automatically be completed for you. This is a huge saving on effort, particularly where long names are involved, and it ensures a high level of consistency.
You will not be able to use the spreadsheet data directly. You will need to copy and paste your spreadsheet data into a text file before you can submit it to BMT for analysis.
If you use a word processor, you may find it useful to turn on the feature which displays non-printing characters. This will enable you to spot blank spaces at the end of lines, and other extraneous characters that might appear in your data file.
You might also like to try using voice input. This approach does work, but not as well as you might expect. Foreign names are a particular problem, as most systems expect to get their input in a single language.
Before you do any serious work with BMT, do experiment with some small data sets whose properties you understand. Make sure that BMT generates the outputs you are expecting. If it doesn't, then you need to examine your data file carefully for extraneous errors. Invisible blank spaces are a serious problem (see above).
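Invisible characters are easier to find with a small script than by eye. The following sketch is my own suggestion and is not part of BMT; the function name find_invisible_problems is hypothetical. It reports blank lines, stray white space and carriage returns, each with its line number.

```python
def find_invisible_problems(raw):
    """Return (line number, description) pairs for hard-to-see problems."""
    problems = []
    lines = raw.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a single newline at the end of the file is not a blank line
    for lineno, line in enumerate(lines, start=1):
        if "\r" in line:
            problems.append((lineno, "carriage return"))
            line = line.replace("\r", "")
        if line.strip() == "":
            problems.append((lineno, "blank line"))
        elif line != line.strip():
            problems.append((lineno, "leading or trailing white space"))
        elif any(ch.isspace() for ch in line):
            problems.append((lineno, "white space inside entry"))
    return problems


sample = "[p1\nlaufer_b \n\nmeara_pm\r\n]\n"
for lineno, description in find_invisible_problems(sample):
    print(lineno, description)
```

To check a real file, read it with open(path, newline="") so that Python does not silently convert the line endings before the check runs.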
There are some outstanding issues with line endings that I have not yet fully resolved. If the program is behaving in an unexpected manner with your data, then please submit a bug report to the author. We will try to deal with these issues expeditiously.
5: about Bibliometric Mapping Tools
If you need to cite BMT, our preferred format is:
Meara, PM (2014) Bibliometric Mapping Tools. v1.0 Swansea: Lognostics. http://www.lognostics.co.uk/tools/
6: cautions and disclaimers
BMT is freely available for use by bona fide researchers. You should note, however, that you use the programs at your own risk. Every effort has been made to ensure that the programs are fit for purpose, but the authors accept no liability for any errors or damage caused by their use. You need to ensure the integrity of your own data.
7: references
Åström, F
Changes in the LIS Research Front: Time-Sliced Cocitation Analyses of LIS Journal Articles, 1990–2004. Journal of the American Society for Information Science and Technology 58,7(2007), 947-957.
Bastian, M, S Heymann and M Jacomy
Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media. 2009.
Boyack, KW and R Klavans
Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology 61,12(2010), 2389-2404.
Garfield, E
Co-Citation Analysis of the Scientific Literature: Henry Small on Mapping the Collective Mind of Science. Current Contents 19(1993), 3-13.
Hellqvist, B
Referencing in the humanities and its implications for citation analysis. Journal of the American Society for Information Science and Technology 61,2(2010), 310-318.
Meara, PM
The bibliometrics of vocabulary acquisition: An exploratory study. RELC Journal 43,1(2012),
Meara, PM
Vocabulary research in The Modern Language Journal: A bibliometric analysis. Vocabulary Learning and Instruction 3(2013). doi:10.7820/vli.v03.1.meara
Price, DJ
Networks of scientific papers. Science 149(1965), 510-515.
Small, H
Co-citation in the scientific literature: a new measure of the relationship between two documents. Journal of the American Society for Information Science 24(1973), 265-269.
Zhao, D and A Strotman
All-author vs. first-author co-citation analysis of the Information Science field using Scopus. Proceedings of the American Society for Information Science and Technology 2007 Annual Meeting, October 19 - 24, 2007, Milwaukee. 2007.
8: How to obtain Gephi
Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.
It runs on Windows, Linux and Mac OS X. Gephi is open-source and free.
You can download a copy of Gephi from http://gephi.org. The current version is Gephi 0.8.2beta. The Gephi site also includes a comprehensive set of tutorials which will take you through Gephi's main features. You can access these tutorials at: