Faculty of Information and Natural Sciences
Degree programme of Computer Science and Engineering
Ilkka Rämö
Case Study: Creating a Recognized and
Popular Web Site
Master’s Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Technology.
Espoo, April 30, 2010
Supervisor: Professor Lauri Malmi, Aalto University
Faculty of Information and Natural Sciences MASTER’S THESIS Degree Programme of Computer Science and Engineering
Author: Ilkka Rämö
Name of the Thesis:
Case Study: Creating a Recognized and Popular Web Site
Date: April 30, 2010 Number of pages: 9 + ??
Professorship: Software Technology
Supervisor: Professor Lauri Malmi, Aalto University
Instructor: Antti-Jaakko Koskenniemi M.Sc., Inessiivi Media Oy
Short abstract over the Thesis.
Keywords: thesis, internet services, world wide web, search engine optimization, web analytics
Informaatio- ja luonnontieteiden tiedekunta TIIVISTELMÄ Tietotekniikan koulutusohjelma
Tekijä: Ilkka Rämö
Työn nimi:
Tunnetun ja suositun Web-palvelun rakentaminen
Päiväys: 30. huhtikuuta 2010 Sivumäärä: 9 + ??
Professuuri: Ohjelmistotekniikka
Valvoja: Professori Lauri Malmi, Aalto-yliopisto
Ohjaaja: MA Antti-Jaakko Koskenniemi, Inessiivi Media Oy
Abstraktin teksti.
Avainsanat: diplomityö, web-palvelut, hakukoneoptimointi, web-analytiikka
This includes some acknowledgements about the people helping me to accomplish this Thesis.
Espoo, April 1, 2010
Ilkka Rämö
Abbreviations
List of Figures
List of Tables
1 Introduction
1.1 Problem
1.2 Methods
1.3 Thesis Outline
2 Background
2.1 World Wide Web
2.2 Search Engines
2.2.1 Motivation
2.2.2 Structure And Functionality
2.2.3 Service Providers
2.3 Search Engine Optimization
2.3.1 Reasoning For Good Ranking
2.3.2 Core Practices
2.3.3 Key Word Analysis
2.3.4 Optimizing Pages
2.3.5 Web Analytics
2.3.6 Related Work
2.4 Progressive Enhancement
2.5.2 Social Networking (Facebook, MySpace)
2.5.3 Flickr
2.5.4 YouTube
2.6 Summary
3 Case Stadissa.fi
3.1 Solution
3.2 Web Site Overview
3.2.1 Site Description
3.2.2 Technology
3.2.3 Traffic, Year 2009
3.3 Improvements
3.3.1 Selected Key Words
3.3.2 Navigation And Crawlability
3.3.3 URLs And Titles
3.3.4 Content
3.3.5 Leveraging Social Media
3.4 Summary
4 Analysis
4.1 Search Engine Visibility
4.2 Site Visits
4.3 User Engagement
4.4 Summary
5 Conclusions
5.1 Results
5.2 Future Work
Abbreviations
CSS Cascading Style Sheets
DHTML Dynamic HTML
FTP File Transfer Protocol
GA Google Analytics
HTML HyperText Markup Language
HTTP Hypertext Transfer Protocol
IR Information Retrieval
JS JavaScript, a dynamic client-side programming language, which is executed in the web browser
LAMP Linux OS + Apache HTTP Server + MySQL database software + PHP, Python or Perl scripting language
OS Operating System
PE Progressive Enhancement
PHP PHP: Hypertext Preprocessor
SEF Search Engine Friendly
SEM Search Engine Marketing
SEO Search Engine Optimization
SERP Search Engine Results Page
UI User Interface
URL Uniform Resource Locator
WWW World Wide Web, The Web, W3
List of Figures
1 Simple PageRank example
2 Return of results for a specified search query
3 Front page of the case site
4 Example event page of the case site
5 Example search results page of the case site
6 Three tier architecture of the application
7 Traffic figures, year 2009
8 Traffic trend, year 2009
9 Example event: Madonna’s concert
10 Traffic source types, year 2009
11 Traffic from different search engines, year 2009
12 Top keywords, which resulted in page visits, year 2009
13 User engagement, year 2009
14 Listing all the venues
15 Title of the old event page
16 Title of the new event page
17 Title of the event venue page
18 Old event page
19 New event page
20 Event page summary in Google search results
22 Event venue page in Google search results
23 Related events in the same category
24 Related events on the same venue
25 Recommendations
26 Social bookmarking
27 Example social bookmark in Facebook
28 Example snapshot of Google rankings after optimizations
29 Google results for “nhl helsinki 2010 liput”
30 Google results for “helsingin jäähalli osoite”
31 Google results for “madonna jätkäsaari”
32 Comparison of visit trend 2009-2010
33 Comparison of visits 2009-2010
34 Comparison of visits to other similar sites
35 Traffic source types, Q1 2010
36 Traffic from different search engines, Q1 2010
37 Comparison of customer engagement 2009-2010
38 Relational proportions for number of pageviews, Q1 2010
List of Tables
1 Top Search Engines - Visits
2 Top Search Engines - Volume
3 Top 5 Ranking Factors
4 Top Positive On-Page Ranking Factors
5 Top 5 Negative Ranking Factors
6 Selected Keywords For The Site
Introduction
The Internet is full of web sites, which range from simple company profile pages to complex online services. Information is continuously being published online, and new business opportunities arise as users become more and more familiar with the electronic channels.
As the number of web sites keeps growing, it becomes easier for a single web site to disappear among all the others. This poses a severe problem for businesses that market their products and services in online channels. It is self-evident that, in order to benefit from the huge potential of these channels, businesses must ensure sufficient visibility for their web sites. If users can’t find a web site, it effectively doesn’t exist and creates no additional value for the business.
This thesis focuses on a particular web site, http://www.stadissa.fi, which has been developed by a small Finnish company, Inessiivi Media Oy. The main concept of the site is to offer an extensive event calendar, which is scoped strictly to the Helsinki area. The site was established in 2008 and has been developed further since.
1.1 Problem
Any content or service provider building a web site needs to answer the following question:
How to make a successful web site?
This problem can be investigated from multiple different viewpoints (technical perspective, marketing perspective, usability perspective, etc.), which is one of the reasons why web site development is an ongoing effort.
In this thesis, the focus is on the acquisition of organic traffic (traffic which originates from other web sites without an explicit agreement) and on the marketing perspective of the problem.
Being popular among the users, receiving positive attention within user groups and attracting users to visit the site over and over again are all vital aspects for a web site. A site without users doesn’t serve any purpose nor create any value for the business.
A reasonable method for studying this extensive and challenging problem is to break it down into smaller subproblems. The aim of this thesis is to find solutions to the following subproblems:
• How to get more users to visit the site?
• How to keep existing users on the site?
• How to make users come back to the site?
1.2 Methods
The research method of this thesis is to find good development principles in the literature, and to implement some of these principles in the web site, which is used as a case study. As the focus of this thesis is on improved traffic, both attracting new users to the site and retaining existing users, the development areas consist of improved site visibility, good provisioning of relevant information and the accuracy of the provisioned information.
The effects of implementing some of these development principles are investigated by analyzing the usage patterns of the site. The site has been connected to a web analytics system, which gathers data about visits, clicks, trends and other relevant quantitative factors that describe the usage of the site. This information forms a basis for any development work that aims at improving the popularity and effectiveness of the site.
1.3 Thesis Outline
Chapter 2 presents a theoretical background for the areas of this thesis by introducing the concepts of the World Wide Web, search engines, search engine optimization, web analytics and social media.
The rest of the chapters present the actual contribution of this thesis. Chapter 3 introduces a subset of solutions, which have been selected as primary improvements for the web site at hand. The implementations of these improvements are also presented and explained.
The results and possible effects of the improvements are analyzed in Chapter 4.
Background
This chapter gives the theoretical background for the rest of the thesis. The reader is introduced to the concepts and tools which have been used in the study.
2.1 World Wide Web
In March 1989, Tim Berners-Lee wrote a proposal [1] that referenced ENQUIRE, a database and software project he had built in 1980, and described a more elaborate information management system. With help from Robert Cailliau, he published a more formal proposal [2] to build a “Hypertext project” called “WorldWideWeb” (one word, also “W3”) as a “web” of “hypertext documents” to be viewed by “browsers”, using a client-server architecture [3]. This proposal started the successful era of the World Wide Web (WWW, The Web, W3), which has generated totally new opportunities for companies to publish information about themselves and to promote the products and services they’re selling.
Before the Web’s arrival, the Internet was difficult for experienced computer people to use and nearly impossible for everyone else. If you did not know what you were looking for, precisely where it was to be found, and how to access it, you were out of luck. The information was scattered around in multiple isolated computers, and the tools for finding and accessing this information were cumbersome and inefficient.
FTP (File Transfer Protocol) provided a method for transferring individual files from one computer to another. Remote access with Telnet was used to connect to other computers in the network and to use them as if through a normal Unix terminal session. The most important publishing method and information service on the Internet before the WWW was Gopher, which was based on menu-structured information. However, it lacked the hypertextual capabilities of the WWW, and it mainly worked well for text-only users with slow connections.
The Web turned this situation around in two critically important ways. First, it gave information providers the tools they needed to make their content findable and accessible. The Web enabled them to organize and manage their information resources to present their information in a simple, architecture-independent, non-proprietary format, to make their resources more readily known and available to others, and to control access to them from across the Internet.
Second, it enabled information providers to interconnect information in a new and powerful way through something called hyperlinks. A hyperlink is an electronic cross-reference that functions like a traditional cross-reference - a notation in one place that directs the reader to related, often more detailed information in another place. Through hyperlinks, information providers became able to interconnect their information with information and resources on other Internet sites, as well as to interconnect information between their own documents and other information objects on their own site. [4]
The Internet as we know it today is rapidly growing in size, and the number of reachable documents in the WWW is exploding. Pingdom, a Swedish web monitoring company, presents statistics on Internet usage for the year 2009 in [5]. These calculations state that there were 234 million web sites in the Web in December 2009, with 47 million new sites added during 2009. Based on Google’s calculations from the size of their search index, the number of unique URLs (Uniform Resource Locators) in the Web increased from 26 million to one trillion (1,000,000,000,000) in the ten years from 1998 to 2008 [6]. These statistics don’t tell the exact number of different documents, but rather give examples of the enormous size of The Web.
2.2 Search Engines
2.2.1 Motivation
Information overload generated a new problem for the users who were trying to find information. Despite the revolution in information storage and access ushered in by the Web, users initiating web searches found themselves floundering. They were looking for the proverbial needle in an enormous, ever-growing information haystack. [7]
What would you do to find something in The Web? The most obvious method is to use some of the major search engines, type in your search phrase and crawl through the result list to find the information you were looking for. [8]
Initially, the Internet was nothing like the web of interconnected sites that’s become one of the greatest business facilitators of our time. Instead, it was actually a collection of FTP sites that users could access to download or upload files. To find a specific file in this structure, users had to navigate through each file. Of course it was possible to point directly to a file - if you knew the exact address of the file you were looking for. Finding relevant information was difficult and really slow.
The first search engines were actually programs which downloaded directory listings for all of the files that were stored on anonymous FTP sites. These listings were stored in a searchable database of web sites. This functionality is similar to the file indexing mechanisms of personal computers, which make it easier to find a particular file on a hard disk.
The first advanced robot, which was developed at the University of Washington, was called WebCrawler. It indexed the full text of documents, allowing users to search through this text, and therefore delivered more relevant search results. By providing users a way to search through the whole HTML document, the WebCrawler system represented a huge change in how web robots worked.
WebCrawler was followed by Lycos and Infoseek, the first of which launched on 20 July 1994 with 54,000 documents indexed, and by January 1995 had indexed 1.5 million documents. In 1995 AltaVista came onto the scene and was quickly recognized as the top search engine due to the speed with which it returned results. It was also the first search engine to use natural language queries (see 2.2.2).
Google was launched at the end of 1998. Google has grown to become the most popular search engine in existence, mainly owing to its ease of use, the number of pages it indexes, and the relevancy of its results. More information about Google is presented in Chapter 2.2.3. [9]
Today, a large number of Web page accesses originate from search engines. In 1998, the 10th WWW User Survey by Georgia Tech found that Internet search engines were used by 85 percent of Web users [10].
Scientific American magazine published an article in February 2005 about the rise of the search engines, stating that in less than a decade, Internet search engines had completely changed how people gathered information [11]. There was no longer a need to go to a library to look up information; finding it had become a matter of seconds and a few keystrokes.
Additionally, Web searchers are not random visitors. When searchers enter a series of words into a search engine query, they are actively searching out a specific product or service. Thus, the traffic the web site receives from search engines is already targeted. [12]
2.2.2 Structure And Functionality
A search engine implements four basic mechanisms [13]: Discovery, Indexing, Ranking and Return Of Results.
Discovery
Discovery is a method which finds new web sites. This is accomplished using specific software programs called bots (webbots, robots, spiders, crawlers), which travel through web sites by following the web links that connect pages together. As these bots make their way around the Internet, they collect content, such as text and links, from web sites and store it in a database.
When a crawler is first released on the Web, it’s usually seeded with a few web sites and it begins on one of those sites. The first thing it does on that first site is to take note of the links on the page. Then it “reads” the text and begins to follow the links that it collected previously. This network of links is called the crawl frontier; it’s the territory that the crawler is exploring in a very systematic way.
The links in a crawl frontier will sometimes take the crawler to other pages on the same web site, and sometimes they will take it away from the site completely. The crawler will follow the links until it hits a dead end and then backtracks and begins the process again until every link on a page has been followed.
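The frontier-driven traversal described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any real search engine: the toy link graph `TOY_WEB`, the `example` URLs and the helper `toy_fetch_links` are hypothetical stand-ins for downloading pages. The sketch also assumes a first-in, first-out frontier (breadth-first order); popping from the other end of the queue would give the depth-first, backtracking behavior mentioned in the text.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/news", "http://c.example/"],
    "http://b.example/news": [],
    "http://c.example/": ["http://a.example/"],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages reachable from the seed URLs and return them in visiting order."""
    frontier = deque(seeds)  # links collected but not yet followed
    seen = set(seeds)        # prevents queueing the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # FIFO order is an assumption of this sketch
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://a.example/"], toy_fetch_links)
print(order)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other, and `max_pages` is a simple budget that a real crawler would replace with politeness and revisit policies.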
When a crawler reads a web page, it sees only the textual content of the page, encoded in HTML. No graphics, styles or other types of media are displayed. The only way for the crawler to jump from one page to another is to follow the hyperlinks included in the page. The crawler doesn’t understand any dynamic client-side features, such as DHTML (Dynamic HTML) effects, JS (JavaScript) functionalities etc.
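To illustrate what such a text-only view of a page might look like, the following sketch uses Python's standard `html.parser` module to collect the visible text and the hyperlink targets of a page while ignoring script and style content. The class name `PageReader` and the sample markup are hypothetical, and real crawlers are considerably more elaborate; the point is only that a navigation target set by JavaScript never shows up as a link.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Collects visible text and hyperlink targets, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script and style elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


# Hypothetical page: one plain hyperlink and one JavaScript-only navigation.
SAMPLE_HTML = """
<html><head><title>Events</title>
<style>body { color: red; }</style></head>
<body><h1>Concerts</h1>
<a href="/venues">All venues</a>
<script>window.location = "/hidden";</script>
</body></html>
"""

reader = PageReader()
reader.feed(SAMPLE_HTML)
reader.close()
print(reader.links)
print(reader.text_parts)
```

In this sketch the page `/hidden` is reachable only through client-side code, so it would stay undiscovered unless a plain hyperlink to it exists somewhere.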
Since the Web is dynamic, last month’s crawled page may contain different content this month. Therefore, crawling is a never-ending process. Spiders return exhausted, carrying several new and many updated pages, only to be immediately given another root URL and told to start over. However, some pages change more often than others, so a crawler must decide which pages to revisit and how often.
When a spider visits a webpage, it consumes resources, such as bandwidth and hits quotas, belonging to the page’s host and the Internet at large. Polite spiders try to minimize their impact. Additionally, website administrators can use a robots.txt file to block spiders from accessing parts of their sites.
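How a polite spider might honor a robots.txt file can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the bot names and the `www.example.com` URLs below are made up for illustration; only the `RobotFileParser` class and its `parse` and `can_fetch` methods come from the standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and bot names are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules without any network access


def may_fetch(user_agent, url):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleSpider", "http://www.example.com/events/123"))
print(may_fetch("ExampleSpider", "http://www.example.com/admin/users"))
print(may_fetch("BadBot", "http://www.example.com/events/123"))
```

Note that robots.txt is a convention rather than an enforcement mechanism: it only has an effect on spiders that choose to check it before fetching.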
Eventually, crawlers return with URLs for new or refreshed pages that need to be added to or updated in the search engine’s indexes. [7]
Indexing
Indexing means storing the found web site links, the summaries of the found pages and other related information in special servers, which are called index servers. These servers form the basis for search engine information storage and they are the most valuable part of the search engines.
The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information. Sometimes it can take a while for new pages or changes that the spider finds to be added to the index, and thus a web page may have been ’spidered’ but not yet ’indexed’. Until it is indexed - added to the index - it is not available to those searching with the search engine. [9]
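As a rough illustration of the "giant book" idea, the sketch below keeps a copy of each page and, in addition, a word-to-pages lookup table (an inverted index). The inverted index is an assumption of this sketch, chosen because it is a common way to make stored pages searchable; the source does not state how any particular search engine organizes its index. The class name `TinyIndex` and the example URLs are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


class TinyIndex:
    """Toy index: stores a copy of each page plus a word -> URLs lookup table."""

    def __init__(self):
        self.pages = {}                   # url -> stored copy of the page text
        self.postings = defaultdict(set)  # word -> set of urls containing it

    def remove(self, url):
        """Drop a page and all of its words from the index, if present."""
        old_text = self.pages.pop(url, None)
        if old_text is None:
            return
        for word in tokenize(old_text):
            self.postings[word].discard(url)

    def add_or_update(self, url, text):
        """Store a new page, or replace the stored copy of a changed page."""
        self.remove(url)
        self.pages[url] = text
        for word in tokenize(text):
            self.postings[word].add(url)

    def lookup(self, word):
        """Return the set of URLs whose stored copy contains the word."""
        return set(self.postings.get(word.lower(), set()))


index = TinyIndex()
index.add_or_update("/events/1", "Madonna concert in Helsinki")
index.add_or_update("/events/2", "Ice hockey in Helsinki")
print(sorted(index.lookup("helsinki")))

# The page changes; until add_or_update runs, searches still see the old copy.
index.add_or_update("/events/1", "Concert cancelled")
print(sorted(index.lookup("madonna")))
```

The gap between a crawler fetching a changed page and `add_or_update` being called corresponds to the 'spidered' but not yet 'indexed' state described above.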
Ranking
Ranking is performed using specific algorithms, which order the stored pages by how important they are. In its simplest form, ranking of a given web page is calculated by counting citations or backlinks (links that are pointing to this particular page from other pages) to the particular page, which gives an approximation of the page’s importance or quality.
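The simplest form of ranking mentioned above, counting backlinks, can be sketched as follows. The link graph `LINKS` and its page names are hypothetical, and the choice to count each linking page only once and to ignore self-links is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["events", "venues"],
    "events": ["venues", "home"],
    "venues": ["home"],
    "blog": ["home", "events"],
}


def backlink_counts(links):
    """Count, for every page, how many other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # each linking page counts once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


counts = backlink_counts(LINKS)
# Order by descending backlink count, breaking ties alphabetically.
ranking = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranking)
```

A plain count treats every backlink as equally valuable, which is exactly the weakness that rank-propagating schemes try to address.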
All the search engine providers have their own ranking algorithms, and these are kept strictly confidential in order to prevent misuse of the search engines. Ranking is based on calculating backlinks, i.e. a page is considered relevant if multiple other pages link to it. Probably the most advanced ranking mechanism is the one provided by Google, PageRank, which can be formulated like this: a page has high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and when a page has a few highly ranked backlinks. Figure 1 demonstrates the propagation of rank from one pair of pages to another.
Figure 1: Simple PageRank example. The number on each page element shows the calculated PageRank of the page. This rank is evenly split among the successor pages, which illustrates how the backlink rankings affect the PageRank. For example, the top-left page affects the successor page ranking much more than the bottom-left page, as it has a very high PageRank. [14]
Another, more intuitive description for PageRank was described by Sergey Brin and Lawrence Page in [15]:
“PageRank can be thought of as a model of user behavior. We assume there is a ’random surfer’ who is given a web page at random and keeps clicking on links, never hitting ’back’ but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.”
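The random surfer model can be turned into a small iterative computation. The sketch below is a simplified illustration and not Google's actual algorithm: the damping value 0.85 (the probability that the surfer keeps clicking instead of getting bored), the fixed number of iterations, the treatment of pages without outgoing links and the four-page graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly splitting each page's rank
    evenly among the pages it links to (simplified sketch)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # surfer starts on a random page
    for _ in range(iterations):
        # With probability (1 - damping) the surfer gets bored and jumps
        # to a page chosen uniformly at random.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: assume a jump to a random page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: a, b and d all link to c, and c links back to a.
GRAPH = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page c collects rank from three backlinks, while page a has only one backlink but receives it from the highly ranked page c. Page a therefore ends up far above b and d, which illustrates that a few highly ranked backlinks can matter as much as many ordinary ones.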
Return Of Results
Return of results is the component of a search engine, which is used to respond to user search queries. When a user enters a query into a search engine, the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document’s title and sometimes parts of the text [16].
Search queries can be divided into three broad categories [17]:
• Navigational queries. The purpose of such queries is to reach a particular site that the user has in mind, either because they visited it in the past or because they assume that such a site exists. This type of search is sometimes referred as “known item” search in classical IR (Information Retrieval), but is mostly used in the evaluation of various systems. With respect to evaluation, navigational queries have usually only one “right” result (up to syntactic/semantic aliases).
• Informational queries. The purpose of such queries is to find information assumed to be available on the web in a static form. No further interaction is predicted, except reading. By static form we mean that the target document is not created in response to the user query.
• Transactional queries. The purpose of such queries is to reach a site where further interaction will happen. This interaction constitutes the transaction defining these queries. The main categories for such queries are shopping, finding various web-mediated services, downloading various type of file (images, songs, etc), accessing certain databases (e.g. Yellow Pages type data), finding servers (e.g. for gaming) etc.
The engine looks for the words or phrases exactly as entered. Most search engines support the use of the boolean operators AND, OR and NOT to further specify the search query. These are for literal searches that allow the user to refine and extend the terms of the search.
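The boolean operators map naturally onto set operations over the pages that contain each word. The sketch below is a toy illustration with hypothetical page identifiers and texts; it assumes whitespace-separated words and exact, case-insensitive word matches.

```python
# Hypothetical mini collection: page id -> page text.
PAGES = {
    "p1": "madonna concert helsinki",
    "p2": "ice hockey helsinki tickets",
    "p3": "madonna tickets london",
}


def pages_with(word):
    """Return the ids of the pages that contain the exact word."""
    return {pid for pid, text in PAGES.items() if word.lower() in text.lower().split()}


both = pages_with("madonna") & pages_with("helsinki")    # madonna AND helsinki
either = pages_with("madonna") | pages_with("hockey")    # madonna OR hockey
without = pages_with("tickets") - pages_with("madonna")  # tickets NOT madonna

print(sorted(both), sorted(either), sorted(without))
```

AND narrows the result set, OR widens it, and NOT removes pages from it, which is how these operators let the user refine and extend a literal search.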
There are also more advanced features available in today’s search engines. Proximity search is a method for finding documents where two or more separately matching term occurrences are within a specified distance, where distance is a number of intermediate words or characters. By limiting search results to only include matches where the words are within the specified maximum proximity, or distance, the search results are assumed to be of higher relevance than the matches where the words are scattered [18].
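A proximity check measured in intermediate words can be sketched as below. The function name `within_proximity` and the sample sentence are hypothetical, and the sketch assumes simple whitespace tokenization without any handling of punctuation.

```python
def within_proximity(text, term_a, term_b, max_distance):
    """Return True if term_a and term_b occur with at most max_distance
    intermediate words between them (whitespace tokenization assumed)."""
    words = text.lower().split()
    positions_a = [i for i, word in enumerate(words) if word == term_a.lower()]
    positions_b = [i for i, word in enumerate(words) if word == term_b.lower()]
    return any(
        abs(i - j) - 1 <= max_distance  # words strictly between the two terms
        for i in positions_a
        for j in positions_b
        if i != j
    )


SENTENCE = "tickets for the madonna concert in helsinki"
print(within_proximity(SENTENCE, "madonna", "helsinki", 2))
print(within_proximity(SENTENCE, "madonna", "helsinki", 1))
print(within_proximity(SENTENCE, "tickets", "helsinki", 3))
```

In the sample sentence two words separate "madonna" and "helsinki", so the pair matches with a maximum distance of 2 but not with 1.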
Conceptual search is an automated IR method that is used to search electronically stored unstructured text (for example, digital archives, email, scientific literature, etc.) for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query [19].
Natural language queries allow the user to type a question in the same form one would ask it to a human.
Search engine providers (especially Google) have also applied Web search technology to other services, including search for images, videos, discussion groups, news, online products, location information including maps, etc. The base technology provides means to index any content and this additional abstraction provides users with richer information.
The results of the search query are organized and displayed in order, based on relevance ranking. Figure 2 gives a simplified view on the different stages of returning search results.
2.2.3 Service Providers
Nowadays, there are quite a few players in the search engine market. All of them are trying to crawl through The Web, collect even the smallest pieces of information and present them to the information-craving users. Despite the continuous battle between the biggest players, Google has taken a considerable lead both in online search and advertising markets [21].
This fact was also admitted by Google’s nearest competitor Yahoo!, as their Chief Financial Officer Susan Decker stated in an interview in 2006: “It’s not our goal to be No. 1 in Internet search. We would be very happy to maintain our market share.” [22]
Experian Hitwise, one of the leading global information services companies, gathers statistics on Internet usage from a sample of over 25 million users worldwide. Tables 1 and 2 show snapshots of the four weeks ending January 2nd 2010, covering the visits and volumes of the most used search engines [23].
Table 1: Top Search Engines - Visits
Rank Search Engine Visits
1. Google 64.05%
2. Yahoo! Search 10.96%
3. Bing 8.91%
4. Google Image Search 4.19%
5. Ask.com 2.26%
6. AOL Search 0.98%
7. Yahoo! Image Search 0.50%
8. Dogpile 0.49%
9. Sphere 0.47%
10. Yahoo! Video 0.41%
users want to initiate a search operation. Many web browser vendors, on both desktop and mobile devices, have selected Google search (http://www.google.com) as the home page of the browser, and therefore it has even been considered the starting page of the Web.
Table 2: Top Search Engines - Volume
Rank Search Engine Searches
1. www.google.com 72.25%
2. search.yahoo.com 14.83%
3. www.bing.com 8.91%
4. www.ask.com 2.53%
5. www.aolsearch.com 0.77%
Table 2 shows that users also rely more on Google search for many consecutive search operations: almost three out of four searches are performed using Google’s search engine.
Based on these statistics, it’s obvious that visibility in Google is one of the most critical factors for any web site that aims to be recognized and found by users. It’s also evident that being able to act as a good landing page for requests which originate from search engines is a vital aspect of a web site.
But it’s not enough that the pages of a web site get indexed by search engines. There are millions and millions of pages stored in the search engine indexes, and a single page can very easily disappear among millions of similar pages. Chapter 2.3 gives an overview on the methods, which make it possible to improve visibility and ranking in search engines.
2.3 Search Engine Optimization
“Search engine optimization (SEO) is the process of improving the volume or quality of traffic to a web site from search engines via ’natural’ or unpaid (’organic’ or ’algorithmic’) search results.”
- Wikipedia [24]
2.3.1 Reasoning For Good Ranking
To find a page on the Web, many Web users go to Google (or their favorite search engine), issue keyword queries, and look at the results. If the users cannot find relevant pages after several iterations of keyword queries, they are likely to give up and stop looking for further pages on the Web. Therefore, a page that is not indexed by Google (or is ranked only at the bottom) is unlikely to be viewed by many Web users. [25]
iProspect conducted a Search Engine User Behavior study in 2006 [26], which produced striking results on the importance of good ranking in search engines. Key findings were that 62% of search engine users click on a search result within the first page of results, and a full 90% of the users click on a result within the first three pages of search results. On the other hand, 41% of the users reported that they change their search term and/or search engine if they do not find what they are looking for on the first page, and a full 88% of the users do so if they do not find what they seek in the first three pages.
White and Dumais present search engine switching behavior using large-scale log-based analysis and survey data in [27]. Their findings demonstrate the relationship between search engine switching and other factors such as dissatisfaction with the quality of the results, the desire for broader topic coverage and verification of encountered information. Additionally, they also reveal sufficient consistency in users’ search behavior prior to engine switching. The study shows that users quite rarely paginate (i.e. request the next page of search results for the current query), and mainly only for the first few pages. More often users perform another query or click some link on the SERP (Search Engine Results Page).
2.3.2 Core Practices
The core practices of good SEO are fairly simple [13]:
• Understand how your pages are viewed by search engine software
• Take common sense steps to make sure your pages are optimized from the viewpoint of these search engines. Fortunately, this essentially means practicing good design, which makes your sites easy to use for human visitors as well
• Avoid certain over-aggressive SEO practices (i.e. black hat SEO), which can get your sites blacklisted by the search engines
There are considered to be two main areas of SEO methods and tactics in use: white hat and black hat. White hat SEO, also known as ethical SEO, consists of optimization methods which are generally approved by the search engines. These methods tend to produce results that last a long time, and they try to follow all the guidelines given by the search engine vendors.
Black hat SEO tactics such as spamdexing (search engine spamming) attempt to redirect search results to particular target pages in a fashion that is against the search engines’ terms of service. Black hat SEO consists of the use of hidden links, hidden text, user agent cloaking and sneaky redirects, which try to bait people into linking to the site and hence increase the PageRank of the site. Using black hat SEO techniques is considered dangerous, as sites optimized like this may eventually be banned either temporarily or permanently once the search engines discover these biased optimizations. Trust and credibility are a limited commodity on the Internet and are much easier to lose than gain.
SEOmoz, a popular provider of SEO software, surveys top SEO experts worldwide on their opinions of the algorithmic elements that comprise search engine rankings. These surveys are conducted every two years. The 2009 survey collected top ranking factors, both positive and negative, helping to provide transparency into what matters and what doesn’t for best practices in search engine optimization. [28]
The ranking factors were rated by a panel of 72 SEO experts. Their feedback is aggregated and averaged into the percentage scores below. For each, the degree to which the experts felt this factor was important for achieving high rankings as well as the degree of variance in opinion was calculated. Thus, factors that are high in importance are those where experts agree the most that the factor is critical to rankings.
Table 3 presents the results of the top 5 factors which positively affect the ranking of a web page. It includes both On-Page (i.e. methods that manipulate the actual page content in order to improve its ranking) and Off-Page (i.e. methods that build links to and from external pages and therefore improve the PageRank of the page) ranking factors.
Table 3: Top 5 Ranking Factors
Rank | Topic | Importance
1. | Keyword Focused Anchor Text from External Links | 73% very high importance
2. | External Link Popularity (quantity/quality of external links) | 71% very high importance
3. | Diversity of Link Sources (links from many unique root domains) | 67% very high importance
4. | Keyword Use Anywhere in the Title Tag | 66% very high importance
5. | Trustworthiness of the Domain Based on Link Distance from Trusted Domains (e.g. TrustRank, Domain mozTrust, etc.) | 66% very high importance
This thesis focuses mostly on On-Page optimization methods, as those can be performed by developing the site. Table 4 presents the results of the most important positive On-Page ranking factors:
Table 4: Top Positive On-Page Ranking Factors
Rank | Topic | Importance
1. | Keyword Use Anywhere in the Title Tag | 66% very high importance
2. | Existence of Substantive, Unique Content on the Page | 65% very high importance
3. | Keyword Use as the First Word(s) of the Title Tag | 63% very high importance
4. | Keyword Use in the Root Domain Name (e.g. keyword.com) | 60% high importance
5. | Site Architecture of the Domain (whether intelligent, useful hierarchies are employed) | 52% moderate importance
6. | Recency (freshness) of Page Creation | 50% moderate importance
7. | Keyword Use Anywhere in the H1 Headline Tag | 49% moderate importance
8. | Keyword Use in Internal Link Anchor Text on the Page | 47% moderate importance
Table 5: Top 5 Negative Ranking Factors
Rank | Topic | Importance
1. | Cloaking with Malicious/Manipulative Intent | 68% very high importance
2. | Link Acquisition from Known Link Brokers/Sellers | 56% high importance
3. | Links from the Page to Web Spam Sites/Pages | 51% moderate importance
4. | Cloaking by User Agent | 51% moderate importance
5. | Frequent Server Downtime & Site Inaccessibility | 51% moderate importance
2.3.3 Key Word Analysis
An often overlooked, but vitally important, factor of search engine optimization is keyword research. All the optimization in the world will do no good if the site is optimized for the wrong keywords. Basically, keywords capture the essence of the web site. Keywords are what a potential visitor to the site puts into a search engine to find web sites related to a specific subject. The chosen keywords are used throughout the whole optimization process, and using correct keywords in the web site content can mean the difference between the site being shown on the first search results pages and being buried under hundreds of more relevant results.
To decide which keywords should be used on a web site, the simplest but most relevant question is: Who needs this service? It’s an elementary question, but it is the most important one when searching for the correct keywords. It’s also important to use the words that real people use when talking about the products. In addition to the terms that one can think of, people also look for web sites using variations of words and phrases, including misspellings.
Another important aspect of correctly defined keywords is that search engines aim to return relevant information to searchers. An important part of good white hat SEO is that a site should only target keywords that are relevant from the site’s point of view, i.e. every search result based on the selected keywords should lead to valuable and relevant information for the user.
2.3.4 Optimizing Pages
The foundation of a successful search engine optimization program consists of three components: the text component, the link component, and the popularity component. When all these components are utilized, the site gains far more search engine visibility than sites that do not use them.
For the target audience to find the site on the search engines, the pages must contain keyword phrases that match the phrases the target audience is typing into search queries. The most important text for a search engine is the most important text for the target audience - the text the target audience is going to read when they arrive at the site.
The way the web pages are linked to each other also affects the site’s search engine visibility. If search engine spiders can find the pages quickly and easily, the site has a much better chance of appearing at the top of search results. The site that end users click the most will usually rank higher. Sometimes, a popular web site will consistently rank higher than sites that use plenty of keywords. Building an appealing and interesting web site is very important for getting good visibility in search engines. [12]
When performing page optimizations, it’s important to avoid black hat SEO methods, which might end up getting the site banned from the search indexes. Google publishes its own guidelines at the following URL:
http://www.google.com/webmasters/guidelines.html
These include guidelines for design and content creation, both for technical and quality aspects of the implementation.
Crawlable Links
As stated in Chapter 2.2.2, the only way a crawler can navigate from one page to another is by following the hyperlinks that connect the two pages. This requires the site developer to create a clear navigation hierarchy, where all the pages on the site are accessible through at least one static text link. Dynamic browser features, such as Javascript menus, DHTML effects or Flash components, hinder crawler access, and they must be implemented in a way that doesn’t affect the crawlability of the site.
All the crawlable links should also be statically formed without any query parameters, as some of those might be stripped by a crawler and therefore result in broken links in the search index. Additionally, including keywords in the URLs acts as a good ranking factor from the SEO point of view, as mentioned in Table 4 in Chapter 2.3.2.
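As an illustration of statically formed, keyword-bearing links, the following minimal Python sketch turns an event identifier and name into a path without query parameters. The `/event/<id>/<slug>` scheme and the function names are assumptions made for this example; they are not the URL structure actually used by the case site.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase the text, strip accents, and join alphanumeric runs with hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return "-".join(words)


def static_event_path(event_id: int, event_name: str) -> str:
    """Build a parameter-free path (hypothetical scheme: /event/<id>/<slug>)."""
    return f"/event/{event_id}/{slugify(event_name)}"
```

A link such as `static_event_path(10498, "Madonna Helsinki")` keeps the numeric identifier for lookup while exposing the keywords in the path itself.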
Page Titles
Page titles are one of the most important elements of site optimization. When a crawler examines a site, the first elements it looks at are the page titles. And when a site is ranked in search results, page titles are again one of the top elements considered [8]. In addition, all major crawlers will use the text of the title tag in search results listings.
Page Content
Web site content is one of the most highly debated elements in search engine optimization, mostly because many rather unethical SEO practitioners have turned to black hat SEO techniques, such as keyword stuffing, to try to artificially improve search engine ranking. Despite these less than honest approaches, web site content is still an important part of any web site optimization strategy.
The content on the site is the main draw for visitors. Whether the site sells products or simply provides information about services, what brings visitors to the site are the words on the page. Product descriptions, articles, blog entries, and even advertisements are all scanned by spiders and crawlers as they work to index the Web. [8]
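Location is one major factor in how search engines determine relevancy. Search engines check whether the keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.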
Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Those with a higher frequency are often deemed more relevant than other web pages.
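The two signals discussed here, how early a keyword appears and how often it appears relative to other words, can be approximated with a few lines of Python. This is only a sketch of the idea: real search engines use far more elaborate and unpublished scoring, and the function names are hypothetical.

```python
import re
from typing import List, Optional


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_frequency(text: str, keyword: str) -> float:
    """Share of the words on the page that equal the keyword."""
    words = tokenize(text)
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def first_position(text: str, keyword: str) -> Optional[int]:
    """Zero-based word index of the first occurrence, or None if absent."""
    words = tokenize(text)
    try:
        return words.index(keyword.lower())
    except ValueError:
        return None
```

A low `first_position` corresponds to a keyword near the top of the page, and a higher `keyword_frequency` to a page that mentions the keyword often in relation to its length.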
META Tags
META tags are information inserted into the "head" area of the web pages. Other than the title tag, information in the head area of the web pages is not seen by those viewing the pages in browsers. Instead, meta information in this area is used to communicate information that a human visitor may not be concerned with.
META tags have never been a guaranteed way to gain a top ranking on crawler-based search engines. Today, the most valuable feature they offer the web site owner is the ability to control to some degree how their web pages are described by some search engines. They also offer the ability to prevent pages from being indexed at all.
META description
This tag makes it possible to influence the description of a page in the crawlers that support the tag. However, not all search engines use this tag; Google, for example, automatically generates its own description for the page, either with or without guidance from the META description. On balance, it is worthwhile to use the META description tag, because it gives some degree of control with various crawlers. An easy way to do this is often to take the first sentence or two of body copy from the web page and use that as the META description content.
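The approach of reusing the first sentence or two of body copy can be sketched as follows. The function name and the simple sentence-splitting rule are assumptions made for this illustration; a production implementation would likely also limit the length of the description.

```python
import html
import re


def meta_description(body_text: str, max_sentences: int = 2) -> str:
    """Build a META description tag from the first sentences of the body copy."""
    # Collapse whitespace, then split after sentence-ending punctuation.
    clean = " ".join(body_text.split())
    sentences = re.split(r"(?<=[.!?])\s+", clean)
    summary = " ".join(sentences[:max_sentences])
    # Escape quotes and angle brackets so the attribute value stays valid HTML.
    escaped = html.escape(summary, quote=True)
    return f'<meta name="description" content="{escaped}" />'
```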
META keywords
This tag provides additional text for search engines to index along with the body copy. However, for most major crawlers it doesn’t create any additional value, as most crawlers now ignore the tag. The META keywords tag is sometimes useful as a way to reinforce the important terms on a page. Just adding important terms to the META keywords tag is extremely unlikely to help the page do well for those terms, but they might work in conjunction with the text in the body copy.
META robots
This tag makes it possible to specify that a particular page should not be indexed by a search engine. To keep spiders out, the following META tag is added between the head tags of each page that is not to be indexed:
<meta name="robots" content="noindex" />
Most major search engines support the META robots tag. However, the robots.txt convention for blocking indexing is more efficient, as there is no need to add tags to each and every page.
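To show how robots.txt rules are interpreted, the sketch below uses the `urllib.robotparser` module from the Python standard library. The rules in `ROBOTS_TXT` and the example paths are hypothetical; the actual robots.txt of the case site is not reproduced in this thesis.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search.php
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# True when the given user agent is allowed to fetch the URL.
event_allowed = parser.can_fetch("Googlebot", "http://www.example.com/event.php?id=1")
admin_allowed = parser.can_fetch("Googlebot", "http://www.example.com/admin/users")
```

With these rules a single file covers every page under `/admin/`, whereas the META robots tag would have to be added to each of those pages separately.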
Canonical URLs
If the site has identical or vastly similar content that’s accessible through multiple URLs, the canonical link element gives more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to the preferred version of the page. [29]
Adding a link tag that specifies the preferred version of the page:
<link rel="canonical" href="http://www.example.com/master-page" />
to the "head" section of these duplicate content URLs:
http://www.example.com/master-page
tells Google that the duplicates all refer to the canonical URL. All the URL properties, such as PageRank and related signals, are transferred as well, which concentrates the PageRank on this particular page.
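As a sketch of how duplicate URLs could be mapped to one preferred version, the Python example below removes query parameters that do not change the page content and renders the corresponding link tag. The parameter names in `IGNORED_PARAMS` are assumptions made for this example, as is the idea that duplicates differ only by such parameters.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change what the page shows.
IGNORED_PARAMS = {"sessionid", "sort", "ref"}


def canonical_url(url: str) -> str:
    """Drop ignored query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Render the link tag to place in the head section of each duplicate."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}" />'
```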
2.3.5 Web Analytics
”If you can not measure it, you can not improve it.” - Lord Kelvin
That statement is ultimately the purpose of web analytics. By enabling a site developer to identify what works and what doesn’t from a visitor’s point of view, web analytics is the foundation for running a successful website. Web analytics is a thermometer for a website - constantly checking and monitoring the online health of the site. Web analytics gives insights into the effectiveness of search engine marketing and search engine optimization efforts by presenting the number of unique visits and page views. Returning users can be differentiated from new users, and bounce rate percentages give information about user engagement. [30]
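The engagement figures mentioned above can be made concrete with a small Python sketch. It assumes the common definition of bounce rate as the share of visits with exactly one page view; the `Visit` record and its fields are hypothetical and do not describe the internal data model of any analytics product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Visit:
    visitor_id: str  # hypothetical cookie-based visitor identifier
    page_views: int  # number of pages viewed during the visit


def bounce_rate(visits: List[Visit]) -> float:
    """Share of visits that ended after a single page view."""
    if not visits:
        return 0.0
    bounces = sum(1 for v in visits if v.page_views == 1)
    return bounces / len(visits)


def pages_per_visit(visits: List[Visit]) -> float:
    """Average number of page views per visit."""
    if not visits:
        return 0.0
    return sum(v.page_views for v in visits) / len(visits)
```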
Google Analytics (GA) [31]
Google Analytics is an enterprise-class web analytics solution that gives rich insights into website traffic and marketing effectiveness. Google Analytics uses a first-party cookie and JavaScript code to collect information about visitors and to track advertising campaign data. Google Analytics anonymously tracks how visitors interact with a website, including where they came from, what they did on a site, and whether they completed any of the site’s conversion goals. When a user enters the site from a search results listing, GA also collects information about the search query that led the user to the site.
A web site is attached to GA by inserting the following JavaScript snippet on every page that is to be tracked:
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
  var pageTracker = _gat._getTracker("UA-xxxxxx-x");
  pageTracker._trackPageview();
} catch (err) {}
</script>
When a user visits the site, this script sends information about the user to the Google data collection server. GA generates an hourly report about the site activity and presents this in an online service, where the site administrator can investigate user behavior in the site.
Google Webmaster Central (GWT) [32]
Google Webmaster Central is a free service, which provides information for webmasters. It allows webmasters to check the indexing status of a site and provides help in optimizing the visibility of the site. The tool can be used to submit and check a sitemap, which describes the structure of the site. It provides methods for fine-tuning search bot crawling by setting crawl rates. Statistics about how the Google bot has accessed the site are also available through this tool. Chapter 2.2.2 mentioned the robots.txt file, which can be used to restrict the access of the Google bot. GWT provides methods for generating and testing these restrictions.
The most useful part of the tool is the statistics reporting functionality, which provides in-depth information about search engine attributes of the site. These include internal and external links, keywords that led users to the site, Google indexing statistics, possible errors which occurred while the bot was indexing the site, etc. These reports form a firm basis for SEO efforts, as the results of the optimizations can be checked really quickly.
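The sitemap submitted through the tool is an XML file that lists the URLs of the site. A minimal sketch of generating one with the Python standard library is shown below; it follows the public sitemaps.org format, while the function name and the example URLs are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: Iterable[str]) -> str:
    """Return a sitemap XML document (without XML declaration) listing the URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # Special characters such as & are escaped automatically on output.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```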
2.3.6 Related Work
Xing and Lin studied the conditions under which SEO exists and its impact on the advertisement market in [33].
Viljakainen, Bäck and Lindqvist wrote a research study about the media and marketing landscape, investigating the most important trends in the advertisement markets over a five-year time span until 2013. The study touches on the areas of Search Engine Marketing (SEM) in all its forms, including keyword marketing, visibility in online directories and search engine optimization. [34]
Search engine optimization has previously been studied in practice in the Master’s Thesis of Juha Paananen, who presented measures for using SEO in search engine marketing. He also implemented optimizations for a Finnish marketing company, Primasoft [35].
2.4 Progressive Enhancement
PE (Progressive Enhancement) is a modern approach to developing web documents that are accessible across any browser or device that has access to the Internet. The idea of PE is to separate a document’s content, presentation and behavior, which supports document accessibility, semantics, forward-compatibility and usability [36].
PE originates from an older idea of enabling a system to continue operating properly in the event of the failure of some of its components [37]. This property, graceful degradation (a.k.a. fault tolerance), was a carry-over from the engineering world. Since 1994, for the web development community it was at its core about giving the latest and greatest browsers the full-course meal experience while tossing a few scraps to the sad folk unfortunate enough to be using Netscape 4. About a decade later, this property began to be questioned and it was found lacking on many levels. In 2003 the first blueprint of progressive enhancement was introduced. The biggest difference between graceful degradation and progressive enhancement is that the former focuses on browser capabilities, whereas the latter focuses entirely on the content that is being generated [38].
As web crawlers are not able to see anything other than the textual content of the page, this approach is really beneficial for SEO. Because the basic content of the site is always accessible to search engine spiders, pages built with PE methods avoid problems that may hinder search engine indexing [39].
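To illustrate what a text-only crawler can see, the following Python sketch extracts the textual content of an HTML page and ignores everything inside script and style elements. It is a simplification built on the standard `html.parser` module, and the class and function names are hypothetical; it does not claim to reproduce how any real crawler parses pages.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that appears outside script and style elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def crawler_text(html_source: str) -> str:
    """Return the text content of the page, joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html_source)
    parser.close()
    return " ".join(parser.chunks)
```

Content that is written into the page only by a script never reaches the output, which is why a page whose basic content is present in the plain markup remains fully readable for such a crawler.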
2.5 Social Media
With the emergence of Web 2.0 and multiple related social media applications like Flickr, YouTube, Facebook, Wikipedia etc., research interest has grown in multiple aspects of social media including data sharing, image tagging, mashups, ontologies, retrieval etc. While these contributions have significantly advanced the state of the art from the technology perspective, not much research attention has been given so far to the end-user or social aspect of social media research.
An important point to consider in all these applications is that the user contribution is totally voluntary. Further the decision making is completely distributed and there are no means for central coordination or explicit communication between the various participating users. This brings us to the important issue of user motivation and that the individual users will contribute to such social media networks only based on their personal utility decisions. [40]
Social media services are dependent on user contributions to provide value to their products, and as a result designers of such systems build features targeted at increasing the amount of content a given user contributes. One such feature is a content feed, which publishes stories about a user or set of users and makes the stories available to others. Such feeds may cause users to increase their rate of content contribution, either by increasing user awareness of product features and the socially acceptable means of using them, encouraging users to contribute content to attract the attention of their peers, or a combination of these effects. [41]
2.5.1 Social Bookmarking
Social bookmarking is a method for Internet users to share, organize, search, and manage bookmarks of web resources. Unlike file sharing, the resources themselves aren’t shared, merely bookmarks that reference them. Like social networking, bookmarking sites are meant to engage users and provide interaction. However, bookmarking sites also provide a way to more easily market articles and links.
2.5.2 Social Networking (Facebook, MySpace)
A basic social network is a collection of people you know and stay in contact with. You swap ideas and information, and recommend friends and services to each other. Sharing in a social network is based on trust; if the recommendation is unbiased and is from a known source then you are more likely to trust it than not. The best recommendations are therefore those that come from unbiased and trustworthy sources [9].
In the past few years social networks have become some of the most popular sites around. The ability to utilize and benefit from today’s explosion of social media sites depends on providing tools that allow users to productively participate. In order to participate, users must be able to find resources (both people and information) that they find valuable.
Social networking sites like Facebook and MySpace have highlighted an emerging form of social context composed of a digital record of an individual’s set of connections to other users of a given system. These forms of social network services have attracted attention to the idea of personal social networks, directed graphs connecting individuals to their friends, and how these networks have value for forging connections between users. Once generated, these personal social graphs can also serve as a novel form of reference set for collaborative filtering, replacing the association "people who like this also like that" with the association "people who like me also like this". [42]
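The idea of using a personal social graph as the reference set for collaborative filtering can be sketched in a few lines of Python. The data structures and the ranking rule (number of friends who like an item) are assumptions made for this illustration and are not taken from [42].

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(
    user: str,
    friends: Dict[str, Set[str]],
    likes: Dict[str, Set[str]],
    limit: int = 3,
) -> List[str]:
    """Rank items liked by the user's friends that the user has not liked yet."""
    already_liked = likes.get(user, set())
    counts = Counter(
        item
        for friend in friends.get(user, set())
        for item in likes.get(friend, set())
        if item not in already_liked
    )
    # Most friends first; ties are broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _count in ranked][:limit]
```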
2.5.3 Flickr
Flickr is one of the crop of new ”social media” sites, along with blogs, wikis and their kin, that are transforming the Web to a participatory medium where the users are actively creating, evaluating and distributing information. Like many other social media sites, Flickr also allows users to designate others as ”friends” or ”contacts” and offers an interface to see in one place the latest images submitted by friends. The friends lists form the social network backbone of social media sites. [43]
2.5.4 YouTube
Video content is becoming a predominant part of users’ daily lives on the Web. By allowing users to generate and distribute their own content to large audiences, the Web has been transformed into a major channel for the delivery of multimedia. In fact, a number of services in the current Web 2.0 are offering video-based functions as alternatives to text-based ones, such as video reviews for products, video ads and video responses. Most of this huge success of multimedia content is due to the change in the user’s role from consumer to creator. [44]
2.6 Summary
This Chapter introduced some relevant topics and concepts which should be considered when developing a web site. However, leveraging these aspects in a real implementation is an iterative effort, which needs to be planned well. There are no guarantees that these methods will bring success for a web site. After all, it’s the content of the site which attracts users the most.
In Chapter 3, some of these topics are implemented on a real web site, and their effects are measured and analyzed.
3 Case Stadissa.fi
3.1 Solution
”It is common sense to take a method and try it. If it fails, admit it frankly and try another. But above all, try something.”
- Franklin D. Roosevelt
The nature of this thesis is empirical, as it concentrates on improving a real web site. The site has been public already for a year and a half, quite a long time for a modern web site. However, this period of time has generated a good set of information about the use of the site, potential shortcomings of the site and improvement ideas for making the site better.
This information has been collected using Google Analytics, a web analytics service suite, which collects usage information of a web site and gives insights into traffic and marketing effectiveness. Another service, which gives a lot of information and guidance especially for the search engine optimization effort, is Google Webmaster Central (see Chapter 2.3.5).
Inessiivi Media Oy has also maintained an intranet site, where many of the discussions, improvement ideas and innovations are documented. This documentation has been written by the members of the company, and it reflects ideas that have emerged over the time of running the site.
Combining analytical statistics and prioritized improvement ideas, the following set of solutions can be identified:
1. Getting new visitors to enter the site
• Improved discovery of the site in search engines
• Better positioning in search results
• User generated marketing by use of social media
2. Keeping the users in the site
• More related and recommended content, which attracts users to dive in the site
• Cross-site linking between different pages of the site
3. Making the users come back to the site
• Enough information for the user
• Accuracy and relevance of the information
• Better user experience
3.2 Web Site Overview
This Chapter introduces the case site by describing its main features, briefly introducing the technical architecture of the implementation, and giving an insight into the traffic numbers during year 2009.
3.2.1 Site Description
As mentioned in Chapter 1, the web site offers an extensive event calendar, ranging from music events to theaters, exhibitions and sports events. The main page of the site (figure 3) presents a calendar view, which lists the events for the current week, broken down into four main categories: music, sports, theaters and miscellaneous events.
Figure 3: Front page of the site presents all the events for all the categories for the current week.
The calendar view also gives information about the starting times and the venues of the events. The user can drill down to a specific category and see the events at a more granular level, broken down into subcategories. Another dimension of the service is time. The user can navigate to a certain week by using the calendar navigator in the top-left corner of the calendar. A calendar week is always presented such that the last row is the last day of the selected week.
The user can select an event by clicking a link in the calendar view. Each event is presented on its own page, which gives more information about the particular event, including time, price and location information. The event page offers an integration with a public transportation service for finding the optimal route and schedule to the event location. It’s also possible to send the event information to a friend using an e-mail form. An example event page is presented in figure 4.
The site includes an internal search functionality, which is initiated from the header part of the main page. This results in a search results list, which is also time sensitive. By default, events that are happening in the future are returned first. It’s possible to navigate back in time to show results from the past. An example view of the search results page is presented in figure 5.
Figure 5: Example search results page of the case site
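The time-sensitive ordering of the search results can be sketched as follows. The text above only states that future events are returned first; the ordering inside each group (soonest upcoming event first, most recent past event first) and the tuple-based event representation are assumptions made for this Python illustration.

```python
from datetime import date
from typing import List, Tuple

# (event name, event date); a simplified, hypothetical event record.
EventRow = Tuple[str, date]


def order_results(events: List[EventRow], today: date) -> List[EventRow]:
    """Return upcoming events first, followed by past events."""
    upcoming = sorted((e for e in events if e[1] >= today), key=lambda e: e[1])
    past = sorted((e for e in events if e[1] < today), key=lambda e: e[1], reverse=True)
    return upcoming + past
```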
3.2.2 Technology
The site has been built using rapid Web application technologies. It runs on a standard LAMP architecture:
• Linux Operating System
• Apache HTTP Web Server
• MySQL Database Engine
• PHP scripting language
LAMP is a very popular Web application technology stack. Based on a survey provided by Netcraft [45], the Apache HTTP server is dominant in the Web server market, with a share of about 47%. MySQL is also widely used among software developers: according to MySQL market share estimates [46], the database software is downloaded over 65,000 times per day.
The application architecture follows the widely used three-tier model, where the application logic is split into three parts.
Client tier
This tier takes care of rendering the user interface in the client web browser. In addition to the back-end implementation, the site has some front-end enhancements, which are executed in the users’ web browsers. These are implemented using CSS (Cascading Style Sheets) styles and DHTML (Dynamic HTML) / Javascript effects. Several free-of-charge Javascript frameworks are available; this site uses jQuery, Prototype and Script.aculo.us.
Middle tier
This tier contains the core logic of the application. It consists of a logical data model and a controller framework. The logical data model contains entities, which describe the information elements required in the site. These include e.g. event, venue, address, category, etc. These are implemented using the object-oriented features of the PHP language. The controller logic has been implemented using PHP scripts, which handle HTTP requests initiated by users’ web browsers. Based on the requested URL resource, the PHP script collects the HTTP request parameters, retrieves the required information from the database, and builds a logical data model, which is then aggregated into a suitable form. This data model is used to create a dynamic HTML page, which is then returned to the client tier for rendering.
Database tier
Database tier contains the physical data model of the application. It consists of a relational database, which runs in a MySQL database engine. Database schema contains tables for all the entities, and this data forms the basis for the information presented in the application.
Figure 6 presents a simplified view of the application architecture.
Figure 6: Three tier architecture of the application [47]
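The request flow through the three tiers can be sketched compactly. The example below is written in Python for brevity, although the site itself is implemented in PHP, and it uses an in-memory SQLite database as a stand-in for MySQL. The table layout, the `Event` fields and the function names are assumptions made for this illustration and do not reproduce the actual schema of the site.

```python
import html
import sqlite3
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs


@dataclass
class Event:
    """Entity of the logical data model (middle tier)."""
    id: int
    name: str
    venue: str


def load_event(db: sqlite3.Connection, event_id: int) -> Optional[Event]:
    """Database tier access: fetch one row and map it to an entity."""
    row = db.execute(
        "SELECT id, name, venue FROM event WHERE id = ?", (event_id,)
    ).fetchone()
    return Event(*row) if row else None


def handle_request(db: sqlite3.Connection, query_string: str) -> str:
    """Controller: read request parameters, load data, return HTML for the client tier."""
    params = parse_qs(query_string)
    try:
        event_id = int(params["id"][0])
    except (KeyError, ValueError):
        return "<h1>Bad request</h1>"
    event = load_event(db, event_id)
    if event is None:
        return "<h1>Event not found</h1>"
    return f"<h1>{html.escape(event.name)}</h1><p>{html.escape(event.venue)}</p>"


def demo_database() -> sqlite3.Connection:
    """In-memory SQLite database standing in for MySQL in this sketch."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, name TEXT, venue TEXT)")
    db.execute("INSERT INTO event VALUES (10498, 'Madonna', 'Jätkäsaari')")
    return db
```

Calling `handle_request(demo_database(), "id=10498")` walks through all three steps: parameter collection, database lookup and HTML generation.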
3.2.3 Traffic, Year 2009
Visits
In 2009, the site was visited by approximately 150,000 unique visitors. Naturally, some error margin needs to be included here, as the unique visitor tracking is based on cookies. If a user clears all the cookies on his/her computer, the next visit will be tracked as unique, even if the user is actually the same as before. Also, if a public computer is being used, all the visits are tracked as one user, even if there were multiple persons using the computer.
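The two sources of error described above can be demonstrated with a small Python sketch. The hit records, which pair a cookie identifier with the person actually using the browser, are invented for this example; a real analytics service only ever sees the cookie identifier.

```python
from typing import List, Tuple

# (cookie identifier, real person); the second field is unknown to the tracker.
Hit = Tuple[str, str]


def tracked_unique_visitors(hits: List[Hit]) -> int:
    """Unique visitors as seen by cookie-based tracking."""
    return len({cookie_id for cookie_id, _person in hits})


def actual_people(hits: List[Hit]) -> int:
    """Number of distinct persons behind the hits."""
    return len({person for _cookie_id, person in hits})


hits = [
    ("c1", "anna"),  # first visit
    ("c2", "anna"),  # same person after clearing cookies
    ("c5", "anna"),  # cookies cleared again
    ("c3", "ben"),   # public computer
    ("c3", "cia"),   # another person on the same public computer
]
```

Here the tracker reports four unique visitors although only three persons visited: cleared cookies inflate the figure, while the shared computer deflates it.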
Figure 7 shows the number of visits, unique visitors and page views of the site during the time span 1.1.-31.12.2009:
Figure 7: Traffic figures, year 2009
The number of visitors is pretty good, taking into account that there haven’t been any marketing activities for the site. In figure 8, these visits have been plotted on a trend line, which shows how user activity has evolved throughout the year.
Figure 8: Traffic trend, year 2009
Except for some high peaks here and there, the number of visits has remained basically on the same level through the whole year. The peaks are naturally interesting, as by investigating them it’s possible to get information about users’ intentions and reasons for visiting the site.
Drilling down to the actual content, in addition to the front page of the site, the most visited page during 2009 was this event:
http://www.stadissa.fi/event.php?id=10498 (12,244 unique views)
This particular event was Madonna’s concert, held in Jätkäsaari, Helsinki (figure 9), and it was a really popular event in the Finnish music scene.
Figure 9: Example event: Madonna’s concert
Similar peaks occurred during other "big" events, such as popular concerts and exhibitions. A notable detail was that ticket selling events were also really popular, with hundreds or even thousands of unique page views.
Traffic Sources
Figure 10 presents statistics of the different source types that created traffic to the site. The share of search engines is overwhelming, at almost 80% of the whole traffic. This confirms that search engines have a very big role in bringing new visitors to the site. Additionally, the small number of visits from referring sites gives good insight into the problem of not being actively linked from external sites and thus not getting good ranking figures based on the Off-Page elements of SEO.
Figure 10: Traffic source types, year 2009
Figure 11: Traffic from different search engines, year 2009
Looking at these statistics, it’s evident that optimizing for Google is the key element of getting more visibility for the site. Getting a good ranking in Google would presumably be the greatest factor affecting the number of visitors.
Keywords
As the previous section described, traffic originating from search engines formed the majority of all the traffic that reached the site. This supports the theoretical assumptions about the importance of search engines as a method of finding information on the web, presented in Chapter 2.2.1.
The top 5 keywords that resulted in page visits are presented in figure 12.
User Engagement
Looking at the traffic source type statistics, it’s evident that Google is the most important source for the site traffic. Combined with the user engagement statistics presented in figure 13, the usage pattern of the site seems straightforward:
Users are looking for event information in Google, they find a link pointing to this site and click it. After having visited the site, they return to Google, possibly to perform another search. This site acts as a simple landing page for Google searches, and is not used thoroughly. Navigation inside the site is very minimal, and the pages are acting as individual information elements.
Figure 13: User engagement, year 2009
Keeping the users on the site for under a minute on average creates challenges in promoting other parts of the site. The high bounce rate indicates that users return directly to the Google search after seeing the particular search result they were looking for.
3.3 Improvements
This Chapter introduces the implemented improvements, divided in subsections. Every improvement is explained briefly and the relevant implementation part is presented in a simple manner.
3.3.1 Selected Key Words
As described in Chapter 2.3.3, choosing relevant keywords for the site is the first step of any search engine optimization effort. For this particular site, the message to users is that it is an extensive calendar where one can find any event occurring in the Helsinki area. Event venues are also well represented on the site, so optimizing for them is a good way of getting better rankings.
Keywords chosen for the site consist of event names, possibly followed by Helsinki, which emphasizes the location-specific nature of the service. Event names are built from artist names, sports teams, exhibition names, etc. Concert tickets seem to be among the most frequently requested information, and therefore this information is also included on the event pages, which indicate the sale of tickets.
Table 6 lists examples of the keyword patterns that were chosen to represent the message of the site and for which the pages are optimized.
Table 6: Selected Keywords For The Site

| Keyword | Rationale | Examples |
|---|---|---|
| <event name> | Users very often search for a specific event by its name, artist name, team name, etc. | "whitney houston konsertti", "hullut päivät" |
| <event name> + "helsinki" | The site contains only events that occur in the Helsinki area. This keyword combination serves those requests. | "madonna helsinki", "mäkihyppy helsinki" |
| <event name> + <year> | Testing different keyword combinations has revealed that it is natural for users to use time as a criterion when searching for event information. | "fitness classic 2010", "sembalot 2010" |
| <event name> + "liput" | As mentioned above, users are looking for information about event tickets, and this combination is a natural way of asking for that information. | "billy idol liput", "green day liput" |
| <event venue> | When users want to find information about a particular venue, the name of the venue is the most natural way of searching for it. | "helsingin jäähalli", "ravintola dubrovnik" |
| <event venue> + "osoite" | Users might also be interested in finding a specific venue. This keyword combination should return the address of the venue. | "gloria osoite", "suvilahti voimala osoite" |
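The keyword patterns of Table 6 can be expressed as a small generator, sketched below. This is only an illustration of how the patterns combine; the function names `event_keywords` and `venue_keywords` are hypothetical and do not come from the site's implementation.

```python
def event_keywords(event_name, year=None, city="helsinki"):
    """Build the event-related keyword patterns of Table 6 for one event."""
    name = event_name.strip().lower()
    keywords = [
        name,               # <event name>
        f"{name} {city}",   # <event name> + "helsinki"
        f"{name} liput",    # <event name> + "liput" (tickets)
    ]
    if year is not None:
        keywords.append(f"{name} {year}")  # <event name> + <year>
    return keywords


def venue_keywords(venue_name):
    """Build the venue-related keyword patterns of Table 6 for one venue."""
    name = venue_name.strip().lower()
    return [
        name,              # <event venue>
        f"{name} osoite",  # <event venue> + "osoite" (address)
    ]


print(event_keywords("Green Day", year=2010))
print(venue_keywords("Gloria"))
```

A list like this could serve as a checklist when verifying that the title and body text of a page contain the phrases the page is meant to rank for.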
3.3.2 Navigation And Crawlability
Chapters 2.2.2 and 2.3.4 explained how search engine bots crawl a site. The only way for a bot to access a specific page is to find a link pointing to that page. Even though human users are becoming more and more familiar with accessing information through dynamic search queries, a bot requires that the site has a crawlable hierarchy, which makes it possible to reach all the pages on the site by following hyperlinks.
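The crawlability requirement can be checked mechanically: a page is reachable only if some chain of hyperlinks leads to it from the starting page. The following minimal sketch performs a breadth-first traversal over a link graph and reports the pages a bot could not reach. The link graph and its paths (such as `/event/1` and `/venue/b`) are hypothetical and serve only as an illustration.

```python
from collections import deque


def reachable_pages(links, start):
    """Return the set of pages reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def unreachable_pages(links, start):
    """Return the pages that exist in the graph but cannot be reached from start."""
    all_pages = set(links)
    for targets in links.values():
        all_pages.update(targets)
    return all_pages - reachable_pages(links, start)


# Hypothetical link graph: page -> pages it links to.
site_links = {
    "/": ["/musiikki/", "/urheilu/"],
    "/musiikki/": ["/event/1"],
    "/urheilu/": [],
    "/event/1": ["/venue/a"],
    "/venue/b": [],  # no page links here, so a bot cannot find it
}

print(sorted(unreachable_pages(site_links, "/")))
```

In this example the venue page without any inbound link is reported as unreachable, which is the situation the improvements in this section aim to remove.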
Categorized Calendar Views
The front page of the site is naturally a good starting point for the bot to begin crawling event information. However, as mentioned above, the site also divides the events into four main categories.
To create a logical hierarchy for the categorized views of the calendar, a set of URLs needs to be introduced:
    http://www.stadissa.fi/musiikki/
    http://www.stadissa.fi/urheilu/
    http://www.stadissa.fi/teatteri/
    http://www.stadissa.fi/muut/
Technically, this is implemented using the rewrite capability of the web server, which maps the logical URLs internally to the correct PHP script. Rewritten URLs (sometimes known as short, fancy URLs, or search engine friendly - SEF) are used to provide shorter and more relevant-looking links to web pages. The technique adds a degree of separation between the files used to generate a web page and the URL that is presented to the world. [48]
For these particular logical URLs, a simple set of rewrites is defined:
    RewriteRule ^musiikki/(.*) index.php?group=1&date=$1
    RewriteRule ^urheilu/(.*) index.php?group=2&date=$1
    RewriteRule ^teatteri/(.*) index.php?group=3&date=$1
    RewriteRule ^muut/(.*) index.php?group=4&date=$1
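The effect of these rules can be illustrated with a small simulation. The sketch below is not the web server's implementation. It only mimics the pattern matching, assuming that the rules are evaluated in a per-directory context where the leading slash has already been removed from the path. It ignores rule flags and query string handling, and the date value used in the example is a hypothetical format.

```python
import re

# The four category rules as (pattern, substitution template) pairs.
REWRITE_RULES = [
    (re.compile(r"^musiikki/(.*)"), "index.php?group=1&date={0}"),
    (re.compile(r"^urheilu/(.*)"), "index.php?group=2&date={0}"),
    (re.compile(r"^teatteri/(.*)"), "index.php?group=3&date={0}"),
    (re.compile(r"^muut/(.*)"), "index.php?group=4&date={0}"),
]


def rewrite(path):
    """Return the internal target for a logical path, or the path unchanged."""
    for pattern, template in REWRITE_RULES:
        match = pattern.match(path)
        if match:
            # The captured remainder of the path becomes the date parameter.
            return template.format(match.group(1))
    return path


print(rewrite("musiikki/"))
print(rewrite("urheilu/2010-04-30"))  # hypothetical date format
print(rewrite("tapahtumapaikat"))     # no category rule applies
```

The simulation shows that a category URL without a trailing value results in an empty date parameter, so the PHP script presumably needs to treat an empty date as a default view.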
These categorized views, in addition to the front page, give the Google bot a straightforward way to crawl all the event pages on the site.
Event Venues
Every event is bound to a specific venue. Venue pages can naturally be reached through event pages, but for a search bot to reach a venue page this way, there needs to be at least one event that occurs in that specific venue. However, as mentioned above, the site also aims to offer a good amount of information about event venues, even when a venue currently has no upcoming events.
In order to get the venue pages indexed efficiently by search engines, an individual page is generated, which lists all the venues:
    http://www.stadissa.fi/tapahtumapaikat
This URL results in a page that lists all the venues on the site, grouped into the main categories (music, sports, theaters, miscellaneous); see figure 14.
Figure 14: Listing all the venues
With this page in place, it is an easy task for the Google bot to crawl all the venues on the site.
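A venue listing of this kind can be sketched as follows. The example groups venues by main category and renders a plain list of hyperlinks, which is all a bot needs in order to discover every venue page. The venue URLs (under a hypothetical `/paikka/` path), the category assignments, and the function name `render_venue_index` are assumptions made for illustration and do not describe the actual implementation of the site.

```python
from collections import defaultdict
from html import escape

# Main categories in the order they are listed on the page.
CATEGORIES = ["music", "sports", "theaters", "miscellaneous"]


def render_venue_index(venues):
    """Render venues, given as (name, category, url) tuples, as grouped HTML links."""
    grouped = defaultdict(list)
    for name, category, url in venues:
        grouped[category].append((name, url))

    lines = []
    for category in CATEGORIES:
        entries = grouped.get(category)
        if not entries:
            continue  # skip categories without venues
        lines.append(f"<h2>{escape(category)}</h2>")
        lines.append("<ul>")
        for name, url in sorted(entries):
            lines.append(f'  <li><a href="{escape(url)}">{escape(name)}</a></li>')
        lines.append("</ul>")
    return "\n".join(lines)


# Illustrative data; URLs and category assignments are hypothetical.
sample_venues = [
    ("Helsingin jäähalli", "sports", "/paikka/helsingin-jaahalli"),
    ("Gloria", "music", "/paikka/gloria"),
]

print(render_venue_index(sample_venues))
```

Because every venue appears as an ordinary hyperlink on a single page, a venue remains reachable for a bot even when it has no upcoming events.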