
Barriers to Information Access across Languages on the Internet: Network and Language Effects

Anett Kralisch

Humboldt University Berlin

[email protected]

Thomas Mandl

University of Hildesheim

[email protected]

Abstract

This paper investigates the role of language in accessing information on the Internet. We combined data about website visitors through log-file analysis with data about web-hosts and links obtained from a crawler. Results suggest that language may represent a double barrier: first, the number of native speakers determines the number of web-hosts, and hence the amount of information and the interconnectedness of information sources. Second, to access information on a particular website the languages offered are an even more important factor than network effects: non-native speakers and links from websites in other languages are always underrepresented. Our results are in line with the Information Foraging Theory, the Revised Hierarchy Model, network and market theories, and emphasize the role of language on the Internet. Insight into these processes is helpful when website translation represents important investment decisions, or when aiming to diminish the digital divide.

1. Introduction

The World Wide Web is a global network where people with very different cultural and linguistic backgrounds meet. Websites, as nodes of this network, offer services and information to their users. Easy worldwide information exchange is one of the core advantages of the Web. The question is raised: How is the flow of information between users and websites structured? In particular, the role of language as a potential barrier to information flow is the focus of our interest.

Two aspects of information flow are investigated: first we study how websites with information in different or same languages are linked to each other. Second, our analysis investigates how users from different linguistic backgrounds benefit from the information available on the Internet, depending on the language in which the information is offered. These kinds of analyses are anchored in general network and market theories.

The second aspect, the investigation of language-related link following behaviour, yields important insight into the role of language when accessing information on the Web. Theories about the costs and values of surfing behaviour (Information Foraging Theory) and about the costs of language use (Revised-Hierarchy-Model) provide the theoretical background.

Knowledge about this matter is helpful for two reasons. The first is commercial: Since website translations and adaptations often represent important investment decisions, data and knowledge about language-determined information access is valuable for appropriate linguistic adaptation. The second reason is ethical: insight into the role of language helps realise the goal of increasing participation in Internet communication, reducing the “digital divide”. This perspective is gaining importance with the increasing desire to enable wide-ranging citizen participation on the Internet.

A language-related analysis of the Web’s hyperlink structure is a precondition for accurate evaluation of the role of language in information seeking, and especially for link following behaviour. In addition, insight into how language affects hyperlink structures yields valuable information towards developing appropriate information retrieval algorithms.

2. Related Research

Investigations of hyperlink structures and link following behaviour represent two major areas of research in the field of Web Information Systems.

Investigations of hyperlink networks range from predominantly social analyses aiming to identify social networks, to primarily technology-oriented investigations of the web’s structure. One of the aims of the latter is to enhance information retrieval algorithms. The authors of [14, 5], for example, studied the global hyperlink structure of the Internet and developed formal concepts such as bow-tie, core, in and out components, tendrils or disconnected components. Authors focussing on social processes analysed these in terms of both the source and the result of hyperlink setting behaviour. McPherson et al. [17] argue that “Confucianism” had a large impact on strengthening homogeneity among South-Korean political actors on the Web, which is mirrored in the hyperlink connectivity between their websites. Palmer et al. [18] have shown how the number of links affects the trustworthiness of a website. With specific regard to language and culture, authors such as Bharat et al. [4], Halavais [11], or Baeza-Yates & Poblete [1] studied the role of geographic borders, language affiliation, and culture on link setting behaviour. It was shown that the number of links within a country domain is generally much higher than to any other country domain [4]. Results revealed strong geographical connections (e.g. Norway, Sweden, Denmark, Estonia, Soviet Union, Finland), yet they are sometimes overridden by language affiliation (e.g. Brazil – Portugal) [11]. Finally, hyperlink structures can exhibit patterns that are specific to one region [1]. In all of these studies, data was aggregated on the national or regional level but not on a linguistic level.

One of the most important contributions to the analysis of link following behaviour was provided by Pirolli and Card [19, 20] and their Theory of Information Foraging. As an explanation of users’ strategies for information seeking, information gathering, and consumption on the Internet, the model predicts that a user follows a link if the expected informational benefit from following the links is not exceeded by the costs of accessing it. Based on the Theory of Information Foraging, other authors developed further predictive models in following years. The number of available links, the number of previously accessed pages, the search goal, and numerous other determinants were identified as having an impact on the perceived value of information gain [3].

So far, language-related aspects have received little attention in studies of Information Foraging. In [13] we propose an approach for studying this process.

We therefore aim to investigate the impact of language on users’ link following behaviours. For the purpose of our investigation we furthermore take the language-related hyperlink distribution into account. None of the previous studies have adopted this type of combined approach. The role of language as a potential barrier to information access on the Internet is consequently investigated as a two-dimensional impact (see figure 1 below). The paper is structured as follows: First, we identify language-related aspects that are potential determinants of (1) the hyperlink structure of the Internet and/or (2) the users’ link following behaviours. Based on this argumentation, we develop hypotheses. Section 4 describes methods and data used for our analysis. Results are presented in Section 5. Next the results and their limitations are discussed. We conclude by giving an outlook for future research and on possible applications.

3. Conceptual Framework and Hypotheses

3.1. Link Setting Behaviour, Hyperlink Structure and the Role of Language

Various determinants of link setting behaviour and hyperlink structure can be identified. The role of language has a hierarchical two-step impact: the impact of the number of (native) speakers of a certain language on the number of web-hosts in that language, and the impact of the number of web-hosts in a certain language on the number of hyperlinks linking from/between websites of that language.

The relationship between the number of (potential) Internet users speaking a certain language and the number of web-hosts in that language can be founded on two arguments. The first reasoning is based on a market perspective where the number of potential customers/visitors determines the number of web-hosts, and hence the extent of available services or information. Consequently, the number of web-hosts in each language increases with a growing number of users speaking that language as a native or non-native language. The second argument is the fact that a higher number of (native) speakers increases the number of people who are able to create a website in that particular language.

As a result, the number of hyperlinks linking one website to the next should be higher for websites offering information in a language with many speakers than for those offering information in a language with few speakers.

However, predicting the relationship between the number of speakers and the number of web-hosts is not straightforward. Despite a lack of empirical research, it can be expected that the number of users and the number of websites are not directly proportional, due to scale, network, and threshold effects. In addition, different wealth and heterogeneous education levels as well as discriminatory marketing goals (see footnote 1) [9] represent further influencing factors.

The connection between the number of web-hosts and the number of (potential) hyperlinks is derived from simple network effects: a larger number of nodes permits more arches between them. Each additional node increases the number of potential arches by n+1 (n= number of nodes).

A higher number of existing web-hosts in a certain language consequently leads to a higher number of potential links to and from websites in that language.
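The network effect described above can be illustrated with a short sketch. It assumes that links connect distinct hosts only (no self-links) and counts directed links by default; the function names are illustrative and not part of the study. The exact increment per additional node depends on this counting convention, so the numbers show the growth pattern rather than reproduce the paper's own calculation.

```python
def potential_links(n: int, directed: bool = True) -> int:
    """Number of possible links between n distinct nodes (self-links excluded)."""
    pairs = n * (n - 1) // 2          # unordered pairs of distinct nodes
    return 2 * pairs if directed else pairs


def added_by_next_node(n: int, directed: bool = True) -> int:
    """Potential links gained when one node joins a network of n nodes."""
    return potential_links(n + 1, directed) - potential_links(n, directed)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, potential_links(n), added_by_next_node(n))
```

Under these assumptions the number of potential links grows roughly with the square of the number of hosts, which is why link counts have to be normalised by the number of web-hosts before languages are compared.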

In order to analyse the impact of language independently from simple network effects, the number of links referring from websites in a certain language needs to be evaluated with regard to the total number of websites in that language.

We derive the following hypotheses:

Footnote 1: Even if all speakers of a small language market are bilingual (i.e. they are proficient in the language of a bigger language market), the use of their native tongue can represent an additional service that discriminates a product from others by enhancing its value/perception.

H1: The number of in-links from websites that offer information in language y, relative to the number of web-hosts in language y, is higher than the number of in-links from websites that offer information in language x, relative to the number of web-hosts in language x, if language y, in contrast to language x, is one of the target website’s languages.

The hypothesis is expressed by the following equation:

$$\frac{il_x}{h_x} < \frac{il_y}{h_y}$$

il = number of in-links

h = number of web-hosts on the Internet

x = a language that is not offered on the target website

y = a language that is offered on the target website

3.2. Link Following Behaviour and the Role of Language

The Theory of Information Foraging predicts link following behaviour based on a trade-off between the perceived costs and values of following that link. The resulting “net value” represents the perceived “usefulness”. An examination of the impact of language on link following behaviour consequently requires an analysis of the potential additional costs and values of using that language.

There are two major cost aspects that are related to the user’s proficiency level in a certain language: cognitive effort and time invested towards understanding (and accessing) the website. Following the Revised-Hierarchy-Model [8] and research results from psycholinguistics [10], cognitive effort and time invested increase with lower language proficiency. As a result, users who are not native speakers of one of the website’s languages have higher costs for accessing and understanding the information or service offered on that website.

A discussion of whether or not the use of a user’s native tongue also increases the website’s value, or whether it exclusively diminishes the users’ costs is beyond the purpose of our paper. Nevertheless, since the use of a native language only rarely diminishes the value of a website (e.g. negative language-associated values – [7, 9]), it can be argued that information / a service offered in the user’s mother tongue always enhances the website’s net value: either because it decreases the perceived cost and/or because it adds to the website’s value. Therefore, following the Information Foraging Theory, native speakers of the target website’s languages are more likely to access the website.
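The cost and value argument can be made concrete with a toy decision rule. This is only a sketch: the functional form (cost inversely proportional to proficiency), the proficiency scale from 0 to 1, and all numbers are assumptions chosen for illustration, not quantities taken from the Information Foraging Theory or the Revised-Hierarchy-Model.

```python
def language_cost(base_cost: float, proficiency: float) -> float:
    """Access cost that rises as proficiency falls (hypothetical form).

    proficiency is assumed to lie in (0, 1], with 1.0 for a native speaker.
    """
    if not 0 < proficiency <= 1:
        raise ValueError("proficiency must be in (0, 1]")
    return base_cost / proficiency


def follows_link(expected_value: float, base_cost: float, proficiency: float) -> bool:
    """Follow the link if the expected benefit is not exceeded by the cost."""
    return expected_value >= language_cost(base_cost, proficiency)


if __name__ == "__main__":
    # Same page, same expected value, different proficiency levels.
    print(follows_link(expected_value=5.0, base_cost=2.0, proficiency=1.0))
    print(follows_link(expected_value=5.0, base_cost=2.0, proficiency=0.25))
```

With identical expected value, only the user with high proficiency follows the link in this toy setting, which mirrors the argument that non-native speakers face higher access costs.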

Again, an analysis of the number of website visitors per language relative to the total number of Internet users per language permits the identification of the impact of language independently from simple network effects.

The following hypothesis can be inferred:

H2: The number of website visitors of native language y relative to the number of in-links from web pages in language y, relative to the total number of Internet users with native language y, is higher than the number of website visitors with native language x relative to the number of in-links linking from websites in language x, relative to the total number of Internet users with native language x, if y is a language offered on the target website.

The hypothesis is expressed by the following equation:

$$\frac{u_x}{il_x \cdot tu_x} < \frac{u_y}{il_y \cdot tu_y}$$

u = number of users/website visitors

il = number of in-links

tu = total number of Internet users

x = a language that is not offered on the target website

y = a language that is offered on the target website
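A minimal sketch of how the H2 quantity could be computed per language is given below. The function name and the input figures are hypothetical and serve only to show the arithmetic. The sketch returns no value when a language has zero in-links or zero users, because the ratio is undefined in that case; this matters in practice, since Table 1 reports zero in-links for some languages.

```python
from typing import Optional


def h2_ratio(u: float, il: float, tu: float) -> Optional[float]:
    """Website visitors relative to in-links and to total Internet users.

    u  = number of website visitors with a given native language
    il = number of in-links from pages in that language
    tu = total number of Internet users with that native language
    Returns None when the ratio is undefined (il or tu equal to zero).
    """
    if il == 0 or tu == 0:
        return None
    return u / (il * tu)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    offered = h2_ratio(u=2000, il=1000, tu=50)       # language y
    not_offered = h2_ratio(u=40, il=20, tu=100)      # language x
    print(offered, not_offered, not_offered < offered)
    print(h2_ratio(u=10, il=0, tu=5))                # undefined ratio
```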

3.3. Reciprocity of Language-related Link Setting and Link Following Behaviour

Language-related link setting and link following behaviour are also characterized by their mutual interdependency. First, as mentioned above, link setting behaviour can be understood as an anticipation of link following behaviour. Links will therefore lead to websites in other languages less often. On the other hand, we argued that link following behaviour is partly determined by the number of existing links. Due to this double and reciprocal impact, the effect of language is expected to be of an exponential rather than linear nature.

Second, as a result of a lower number of direct links leading to websites in other languages, non-native speakers will more likely have to follow more links to access the target website. In accordance with the Information Foraging Theory this again increases the costs for non-native speakers, affecting their likelihood to access that website.

Figure 1 illustrates the role of language as a barrier to information, with regard to its impact on the number of hyperlinks and on the number of website visitors.

Figure 1. The role of language as a barrier to information on the Internet


4. Methods and Measures

4.1. Data

Our study is based on data collected primarily on a multilingual E-health website by complementary means: non-reactive data is provided by a self-developed web-crawler based on Jobo (www.matuschek.net) and by the website’s log-files. In order to assure data validity and continuity, crawling and log-file analyses were carried out on separate data sets over three months. The crawler traverses all pages of the site by following the hyperlinks. In addition, it queries search engines to collect information about which other websites link to the website. We also cross-validated the links found by the crawler with the external referrers resulting from the analysis of the website’s log-file.

For the language identification, we integrated a language identifier into the crawler. We chose Ngramj (http://sourceforge.net/projects/ngramj/) which is based on an algorithm using n-grams of characters [6]. Obviously, some pages contain text in more than one language [16]. In those cases we assume that there is one main language and adopt the results from the system.
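The idea behind character n-gram language identification can be sketched as follows. This is not the Ngramj API and not the exact algorithm of [6]; it is a toy variant that compares trigram frequency profiles by cosine similarity, with two tiny training samples chosen purely for illustration. A realistic identifier would be trained on much larger corpora per language.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams after normalising whitespace and case."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy training samples (assumption: far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt über den faulen hund",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify_language(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


if __name__ == "__main__":
    print(identify_language("the lazy dog"))
    print(identify_language("den faulen hund"))
```

Because the function always returns the single best-matching profile, it behaves like the simplification described above: a page with text in several languages is assigned one main language.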

The web mining process recognizes all links to our health site registered at a search engine. For each of these links, the dataset contains the URL of both the source and target page and their language. For this analysis, all web pages are considered independent objects regardless of potential relationships. All language versions are presented with the same interface. More detailed information about the website’s structure and content can be found in [12].

Information about the website’s visitors is inferred from the website’s log-file. The log-file provides, among other things, information about the user’s IP address, the requested page, the language in which the page was requested, and a session-id. We excluded robots from the data set by detecting them by their IP addresses or navigation patterns (e.g., regularity of access, number of page requests – see [22]). The majority of this work was done automatically using the sessionizing tool WUMPREP (www.hypknowsys.de). The use of session-ids assures better data quality in terms of data aggregation, from the level of page requests to the session level.
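The aggregation from page requests to sessions, together with a simple robot filter, might look like the sketch below. It does not reproduce WUMPREP; the record fields (`session_id`, `ip`, `page`), the list of known robot addresses, and the request threshold are all hypothetical, and the regularity-of-access heuristic mentioned above is not modelled.

```python
from collections import defaultdict


def sessions_without_robots(requests, known_robot_ips=frozenset(), max_requests=200):
    """Group log records by session-id and drop sessions that look automated.

    A session is dropped if any request comes from a known robot IP or if
    it contains more page requests than max_requests (hypothetical threshold).
    """
    by_session = defaultdict(list)
    for record in requests:
        by_session[record["session_id"]].append(record)

    kept = {}
    for session_id, records in by_session.items():
        if any(r["ip"] in known_robot_ips for r in records):
            continue
        if len(records) > max_requests:
            continue
        kept[session_id] = records
    return kept


# Hypothetical log records; the IP addresses use a documentation range.
EXAMPLE_REQUESTS = (
    [{"session_id": "s1", "ip": "192.0.2.1", "page": "/en/a"},
     {"session_id": "s1", "ip": "192.0.2.1", "page": "/en/b"},
     {"session_id": "s2", "ip": "192.0.2.9", "page": "/de/a"}]
    + [{"session_id": "s3", "ip": "192.0.2.5", "page": f"/en/{i}"} for i in range(5)]
)

if __name__ == "__main__":
    kept = sessions_without_robots(
        EXAMPLE_REQUESTS, known_robot_ips={"192.0.2.9"}, max_requests=3
    )
    print(sorted(kept))
```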

Geographic information is obtained from the IP address using specialized software (Geoselect – www.geobytes.com) and contains, among others, the following items: country, city, and certainty of information. Due to the detailed geographic data, we can attribute information about the users’ native language to each session, even in the cases of multilingual countries with different official languages (e.g. Canada, Switzerland). It should be noted that the geographic information provided by this data is not specific enough to reveal information that might be in conflict with privacy laws or ethical issues.

Data about the number of hosts and Internet users per language are obtained from public statistics (Languages and Internet UNESCO Culture Sector: http://portal.unesco.org/culture/en/ev.php and www.glreach.com). If not indicated otherwise these data are from 2005.

4.2. Limitations

The major drawbacks of our method of data collection are the lack of control of data about the users’ native languages and the web-hosts’ languages, and the potential inaccuracy of the web-crawler and the language recognition system. The users’ native languages are inferred from their IP addresses. Due to the nature of this kind of data processing, uncertainty cannot be avoided. However, there is no other efficient method available that would allow the processing of such a large amount of data.

In a similar manner, data about the languages of web-hosts are always approximate due to the often difficult assignment of a website to only one language, and the imperfect accuracy of automatic crawlers. Furthermore, when language assignment is based on the country-domain, non-English language websites using “neutral” domains such as “.org” or even “.com” are not assessed.

Finally, replacing session-ids by cookies would allow for a better data quality in terms of determining the number of website visitors. However, it might also bias the results due to the diverging acceptance of cookies among users.

4.3. Measures

In order to determine the impact of language as a barrier to information flow, we compare data about Internet users, website visitors, web-hosts, and in-links for English, French, Spanish, German, and Portuguese (the “L1 group”) with data about other languages. We chose Japanese, Chinese, and Russian as representative languages for the group of languages that are not offered on the investigated website (“L2 group”). Visitors of the website, and hence speakers of these languages, come from around the world.

Website visitors are measured through sessions. This means that a visitor who visited the website several times during the investigated periods is measured as several different website visitors. This approach is justified since repeated visits to the website can be interpreted as an indicator of a low (linguistic) barrier. The same interpretation would be applied if these sessions were assigned to different users.


5. Results

5.1. The Number of Internet Users and Web-hosts

Absolute. Figures show that the number of web-hosts per language correlates with the number of Internet users. However, three exceptions were encountered in the analysis, which are described in more detail below.

Relative. Figure 2 depicts the ratio between the number of web-hosts and the number of Internet users per language as a function of the number of Internet users. Web-hosts and Internet users are measured as the percentage of the total number of web-hosts, and Internet users, respectively. A value of 1 therefore represents a balanced relationship between web-hosts and Internet users per language.

Figure 2. Number of web-hosts/Internet users as a function of the number of Internet users

X-axis: Number of Internet users

Y-axis: Percentage of web-hosts/percentage of Internet users (Data for 2003; Source: www.glreach.com)

The numbers show an exponentially decreasing ratio of web-hosts to Internet users with a decreasing number of Internet users. The percentage of English hosts is almost twice as high as the percentage of English native-speaking Internet users, a large group, yet for the much smaller Portuguese native-speaker group it is only about a third of the share of users. Three exceptions appear here again: the Chinese and the Spanish groups, with an extremely low number of web-hosts compared to the number of Internet users, and the German group, with a relatively high number of hosts. That means that Germany seems to have more hosts than we would expect from its number of Internet users, whereas China has far fewer hosts than expected.

5.2. The Number of Web-hosts and In-links

Two datasets were used for this analysis: data from a web crawler and data about referrers from the server-log. The web crawler revealed 4,220 links pointing to pages within the website. The usage analysis revealed 6,370 distinct links, which were traversed and followed 35,348 times. The overlap between both sets is only about 20%. This means that the search engine queried by the crawler has not registered all pages linking to the site, and that many links known to the search engine were not used.

Due to the uncertainties associated with these methods we conducted a comparative analysis. The number of sessions from the usage analysis and the number of in-links from web mining are shown in Table 1. The last column shows the global percentage of websites per language. Language was determined with the language identifier.

Table 1. Source page languages (Feb 05-Apr 05)

Source page language for referrers determined by page view and in-link analysis

Source page language   Sessions   % of sessions   In-links   % of in-links   % of websites
English                19591      55.4%           2247       53.2%           68.40
French                 744        2.1%            74         1.8%            3.00
German                 2354       6.7%            1436       34.0%           5.80
Spanish                94         0.3%            14         0.3%            2.40
Portuguese             1796       5.1%            5          0.1%            1.40
Japanese               1657       4.7%            0          0.0%            5.90
Russian                0          0               0          0.0%            1.90
Chinese                442        1.3%            2          0.0%            3.90

Hypothesis H1 predicts more visits and links for L1 languages than expected from their global share, and fewer for L2 languages (Table 1). This is the case for all L2 languages, which are clearly underrepresented.

The hypothesis cannot be confirmed for all L1 languages. English is a special case due to its dominance on the web. It also dominates both data sets of in-links. For Spanish, the language recogniser seems to produce many errors by identifying Catalan. For French, the hypothesis needs to be rejected. However, French pages are the fourth most frequently viewed, whereas French is only the fifth most popular language for websites globally. With respect to in-links, German has an unexpectedly high share. This is probably due to the fact that the site is hosted in Germany.
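The comparison described above can be reproduced from the percentages in Table 1 by dividing each language's share of sessions and of in-links by its global share of websites. The sketch below uses only the Table 1 values; the variable and function names are illustrative, and a ratio above 1 is read here as over-representation, which is an interpretation of the text rather than a measure defined by the authors.

```python
# language: (% of sessions, % of in-links, % of websites), copied from Table 1
TABLE_1 = {
    "English":    (55.4, 53.2, 68.40),
    "French":     (2.1, 1.8, 3.00),
    "German":     (6.7, 34.0, 5.80),
    "Spanish":    (0.3, 0.3, 2.40),
    "Portuguese": (5.1, 0.1, 1.40),
    "Japanese":   (4.7, 0.0, 5.90),
    "Russian":    (0.0, 0.0, 1.90),
    "Chinese":    (1.3, 0.0, 3.90),
}
L1_GROUP = {"English", "French", "German", "Spanish", "Portuguese"}


def representation(table):
    """Return {language: (session ratio, in-link ratio)} relative to website share."""
    return {
        lang: (sessions / websites, in_links / websites)
        for lang, (sessions, in_links, websites) in table.items()
    }


RATIOS = representation(TABLE_1)

if __name__ == "__main__":
    for lang, (session_ratio, link_ratio) in RATIOS.items():
        group = "L1" if lang in L1_GROUP else "L2"
        print(f"{lang:<11}{group}  sessions {session_ratio:5.2f}  in-links {link_ratio:5.2f}")
```

On these figures every L2 language stays below 1 on both ratios, while among the L1 languages only German exceeds 1 for in-links, and German and Portuguese exceed 1 for sessions.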

5.3. The Number of Internet Users and Website Visitors

Absolute. There is no simple direct relationship between the number of Internet users and website visitors. Yet, a systematic relationship can be found if the users are separated into the groups of non-native speakers and native speakers. Regardless of the number of Internet users per language, L2 website visitors are in every case less represented than any L1 language. However, within the groups, there is again only a slight tendency of a direct, positive relationship between the number of Internet users and the number of website visitors.


Relative. In Figure 3 we illustrate the percentage of website visitors per language (relative to the total number of Internet users) as a function of the number of Internet users per language.

Figure 3. Website visitors/1000 Internet users as a function of the number of Internet users

X-axis: Number of Internet users (Source: www.glreach.com)

Y-axis: Website visitors/1000 Internet users

Series: February 2005, March 2005, April 2005

Here again it can be noted that the number of L2 website visitors is consistently lower than the number of L1 website visitors, despite the fact that the number of Chinese and Japanese Internet users is higher than that of almost all L1 language groups.

If non-native speakers and native speakers are analysed in two separate groups, results for both groups suggest an exponentially increasing percentage of website users (relative to the number of Internet users) with a decreasing number of Internet users. The lower the number of Internet users, the higher the percentage of users who visit the website. Two major exceptions appear in the native speaker group: the English and the German groups are represented by a higher number of website visitors than expected.

5.4. The Number of Web-hosts, In-links and Website Visitors

Characteristics of In-links. The links crawled from a web search engine were further analysed. We considered the language of the target and the source page. The uncertainty associated with this analysis is even higher because the language identification robot could err either way. For most links, the target and source language are the same. These links can be called monolingual. Table 2 shows the frequency of a number of languages in the set of referrer pages for German and English target pages.

Table 2. Frequency of languages in referrer pages

                         Only external links          All links
                         (target page language)       (target page language)
Referrer page language   English      German          English      German
Chinese                  0.30%        0.12%           1.1%         0.3%
Czech                    0.30%        0.47%           1.1%         1.1%
Danish                   1.49%        0.95%           5.6%         2.1%
Dutch                    0.90%        0.12%           3.3%         0.3%
English                  75.82%       43.97%          72.2%        41.8%
French                   0.00%        1.30%           0.0%         2.9%
German                   19.70%       50.24%          11.1%        45.2%
Italian                  0.00%        0.35%           0.0%         0.8%
Portuguese               0.30%        0.00%           1.1%         0.0%
Spanish                  0.00%        0.47%           0.0%         1.1%
Swedish                  0.90%        0.12%           3.3%         0.3%
Vietnamese               0.00%        0.30%           3.3%         0.3%

The table shows that most existing links are monolingual. This trend is stronger for English than for German.

Use of In-links. We compared the links mined from a search engine with the links from the web usage analysis, i.e. existing links are compared to the ones actually used by web surfers. The exact overlap between the two sets is here again rather small. This is due to several reasons, also rooted in the uncertainties of the data acquisition methods.

In order to gain a larger data set, the exactness condition is relaxed to host equality, i.e. all pages on one host are considered as identical. The overlap between the two sets reaches then 20%. For this set we can calculate how often monolingual links were used for website visits. It should be noted that in each language group between 3% and 25% of the visitors were referred to the website by a search engine, with an average higher percentage among the L1 users.
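Relaxing exact URL equality to host equality can be sketched as follows. The URLs are hypothetical, and the overlap measure used here (shared items divided by all distinct items in both sets) is an assumption, since the text does not state how the overlap percentage was computed.

```python
from urllib.parse import urlparse


def host(url: str) -> str:
    """Extract the lower-cased host part of a URL."""
    return urlparse(url).netloc.lower()


def overlap_share(crawled, used, by_host=False):
    """Share of items found in both link sets, relative to all distinct items.

    With by_host=True all pages on one host are treated as identical.
    """
    a = {host(u) for u in crawled} if by_host else set(crawled)
    b = {host(u) for u in used} if by_host else set(used)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical referrer URLs, for illustration only.
CRAWLED = ["http://a.example.org/page1", "http://b.example.org/x"]
USED = ["http://a.example.org/page2", "http://c.example.org/y"]

if __name__ == "__main__":
    print(overlap_share(CRAWLED, USED))                 # exact URL match
    print(overlap_share(CRAWLED, USED, by_host=True))   # host-level match
```

In this toy case the two sets share no exact URL but do share one host, which shows how the relaxed condition enlarges the usable overlap.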

Interestingly, users reach English target pages almost always (97%) via an English referrer page. That means that although 24% of all links pointing to pages in English in the health site are not in English (see table 2), these links are hardly ever used.

For German, there is a contrary trend. Most German pages seem to be reached via links from pages in English (74%), although there are more monolingual links. However, they do not seem to be used as much. This may be due to the global dominance of English.

In order to validate the results obtained from language recognition based web-crawling we carried out further analyses that were based on usage analysis without language recognition.


Table 3. Which language group uses the existing in-links (ordered by Top-Level-Domain)?

Data from April 2005

Language group with the highest percentage per in-link

TLD     Engl.   Fren.   Ger.    Span.   Port.   Jap.    Rus.    Chin.   Other
.com    86%     3%      1%      2%      1%      0%      0%      2%      5%
.edu    96%     1%      0%      1%      0%      0%      0%      0%      2%
.org    82%     5%      3%      1%      2%      0%      0%      1%      8%
.uk     87%     2%      0%      1%      2%      0%      0%      0%      5%
.ca     80%     10%     0%      0%      5%      0%      0%      0%      5%
.fr     10%     42%     29%     6%      0%      0%      0%      6%      6%
.de     59%     3%      14%     6%      4%      0%      0%      1%      12%
.mx     14%     0%      0%      86%     0%      0%      0%      0%      0%
.es     14%     0%      0%      86%     0%      0%      0%      0%      0%
.br     3%      0%      0%      3%      94%     0%      0%      0%      0%
.jp     29%     0%      0%      0%      0%      71%     0%      0%      0%
.ru     6%      1%      3%      0%      1%      0%      65%     0%      23%
.cn     8%      0%      0%      0%      0%      0%      0%      77%     15%
.nl     0%      1%      0%      3%      0%      0%      1%      0%      95%
.be     5%      1%      0%      2%      1%      0%      0%      0%      90%

Table 4. Which in-links are used within the language groups?

Data from April 2005

In-links with the highest percentage per language group … non-mentioned Top-Level-Domains are used less than 2%

TLD     Engl.   Fren.   Ger.    Span.   Port.   Jap.    Rus.    Chin.   Other
.com    7%      5%      0%      1%      2%      6%      1%      15%     2%
.edu    12%     1%      0%      1%      1%      1%      0%      1%      2%
…..
.de     41%     42%     51%     44%     40%     50%     36%     60%     48%
.ru     0%      0%      0%      0%      0%      0%      47%     0%      0%
total   100%    100%    100%    100%    100%    100%    100%    100%    100%

As depicted in the two tables above, referrers with country-specific links are always used most by users with the corresponding mother tongue (e.g. .jp by Japanese). The .de in-links are the only exception, since they were used more by English native speakers than by German native speakers.

However, most country-specific referrers have little importance in an overall view: the .de in-links are used most by all language groups (exception: .ru for the Russian group; see the grey cells in Table 4). The outstanding role of German language in-links, as already mentioned in section 5.1, is confirmed at this point.

We also analysed in more detail the usage behaviour of visitors referred to the website from a .com or a .de in-link, as examples of English and German language referrers.

Table 5. To which language version of the website do .com in-links lead?

Data from April 2005

Language group with the highest %age per language version

              Target page
User          German   English   French   Spanish   Portuguese
English       0%       99%       0%       0%        0%
French        6%       94%       0%       0%        0%
German        56%      44%       0%       0%        0%
Spanish       6%       42%       0%       48%       3%
Portuguese    0%       29%       0%       12%       59%
Japanese      11%      89%       0%       0%        0%
Russian       0%       100%      0%       0%        0%
Chinese       2%       98%       0%       0%        0%
Other         3%       97%       0%       0%        0%

Table 6. To which language version of the website do .de in-links lead?

Data from April 2005

Language group with the highest %age per language version

              Target page
User          German   English   French   Spanish   Portuguese
English       3%       96%       0%       1%        0%
French        10%      53%       37%      0%        0%
German        84%      16%       0%       0%        0%
Spanish       6%       41%       0%       51%       2%
Portuguese    1%       37%       0%       5%        57%
Japanese      3%       92%       1%       3%        1%
Russian       5%       95%       1%       0%        0%
Chinese       2%       96%       0%       1%        1%
Other         7%       91%       1%       1%        0%

Again outcomes are in line with previous results. In addition, analysis of the usage behaviour revealed an interesting and important detail: regardless of the language of the referrer, the vast majority of the L2 users visit the English version of the website, whereas L1 users are roughly split in half between their native language version and the English version. The German native speakers alone barely use the .de in-link for visiting the English version. (It should however be noted that a .com website is not necessarily an English language website and even a .de website does not always offer information in German.)

Comparison with web-hosts. In addition to the analyses in section 5.1, we examined website use by L1 users and L2 users in more depth.


Absolute. Within the L1 user group a higher number of web-hosts leads to a higher number of website visitors, whereas the opposite holds for the L2 user group.

Figure 4 illustrates these effects for February.

X-axis: number of Internet users / total number of Internet users. Y-axis: number of website visitors per language group.

Figure 4. Number of website visitors as a function of the percentage of Internet users per language (Feb 05)

Relative. Figure 5 shows that the increase of website visitors relative to the number of web-hosts is again not linear but exponential: native speakers of lesser-represented languages are seen to be relatively more represented on the website than native speakers of bigger language groups. Being over-represented within our data sample, the German native speakers again represent an exception.

X-axis: percentage of web-hosts per language. Y-axis: number of website visitors per language / percentage of web-hosts per language. Series: February 2005, March 2005, April 2005.

Figure 5. Number of website visitors/web-hosts as a function of the number of web-hosts

Finally, the results concerning our second hypothesis (H2) reveal a lower share of website visitors per web-host and Internet user for the L2 group. Only the Russian user group has a higher share than the lowest L1 user group, the English group (2.13 vs. 0.93 (Chinese) and 0.77 (Japanese)).

X-axis: language groups (English, German, French, Spanish, Portuguese, Russian, Chinese, Japanese). Y-axis: number of website visitors / (1000 Internet users * percentage of web-hosts). Series: February 2005, March 2005, April 2005.

Figure 6. Number of website visitors per web-host and Internet user
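The two normalisations plotted in Figures 5 and 6 can be sketched as follows. This is only an illustration under our reading of the axis labels (visitors divided by the web-host percentage, and visitors per 1000 Internet users divided by the web-host percentage); the function names are hypothetical, and the numbers in the usage lines are made up and are not taken from the study.

```python
def visitors_per_webhost_share(visitors, webhost_pct):
    """Figure 5 measure: website visitors per percentage point of web-hosts."""
    return visitors / webhost_pct


def visitors_per_user_and_webhost(visitors, internet_users, webhost_pct):
    """Figure 6 measure (assumed reading of the axis label):
    visitors per 1000 Internet users, divided by the web-host percentage."""
    per_1000_users = visitors / (internet_users / 1000)
    return per_1000_users / webhost_pct


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(visitors_per_webhost_share(300, 3.0))
    print(visitors_per_user_and_webhost(500, 2_000_000, 2.5))
```

Dividing by both quantities removes the size advantage of large language groups, so that any remaining difference between L1 and L2 groups can be attributed to the languages offered on the website.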

6.

Discussion

Outcomes from our study confirm the vast majority of our argumentation and hypotheses. Thus, language has a major impact on who uses a website: (multilingual) websites reach primarily L1 users in the language(s) provided.

Our analysis shows that, with three exceptions, the number of web-hosts per language follows the number of Internet users in that language. Consequently, more websites are available to Internet users from language groups that are well represented on the Internet than to users from underrepresented language groups. Furthermore, the relationship between Internet users and web-hosts is not linear but exponential, which further increases the benefit for well-represented language groups. This is a typical network effect occurring on the web.

The fact that Spanish and Chinese native speakers are underrepresented whereas the group of German native speakers is over-represented is likely due to the impact of e-commerce and the users’ purchasing power (GDP per capita in thousand $: Spanish 7.1; Chinese 7.2, which is lower than that of every other investigated group except the Russian group; source: www.glreach.com).

Our hypothesis H1 predicts that within the set of pages linking to a site, there will be proportionally more pages in the language of that site. For our health site, this hypothesis can be confirmed for all L2 languages not present on the website and for two of the five L1 languages. It should be noted that here the findings cannot be explained by economic power: based on GDP per capita and the number of web-hosts, Japanese (as well as Italian and Dutch) should exhibit a much higher number of in-links. Referring to the total GDP per language group, Japanese (Italian), Portuguese, French, and Spanish should be represented by higher percentages (www.glreach.com/globstats).

When analysing the effect of the number of web-hosts on the number of website visitors, the impact of the website’s language offer became visible: L2 users are much less represented on the website than L1 users. This mere language effect appears to be much stronger than the network effect: none of the L2 user groups have more visitors than any of the L1 user groups – regardless of the number of in-links or web-hosts.

The network effect has two different consequences, one within the L1 user group and one within the L2 user group. On the one hand, among native speakers a higher number of web-hosts and links leads to a higher number of visitors. The network effects, which disadvantage smaller language groups with fewer website offers, seem to be mediated by the users’ surfing behaviour: native speakers of these smaller groups tend to visit the same websites in their native language instead of visiting websites in other languages. These conclusions are inferred from the exponential increase in website visitors per web-host as the number of web-hosts decreases. The only exception in our case is the group of German native speakers, which is over-represented on the website. This might be a website-specific effect due to the German origin of the website. Yet the overall cultural impact of the website should be limited, since the health-related topics presented (originally developed for physicians) are rather universal.

On the other hand, with regard to L2 users, a higher number of web-hosts leads to a lower number of website visitors. Again, the surfing behaviour can be interpreted as a preference for information presented in the native tongue: the more alternative websites are available in the users’ native tongue, the higher the probability that a user chooses not to visit this non-native language website.

The results providing evidence for the preference of websites in the user’s native language are perfectly in line with the Theory of Information Foraging [19, 20] and the Revised-Hierarchy-Model [8].

Due to the relationship between the number of Internet users and the number of web-hosts, the impact of the number of Internet users is similar to that of web-hosts and links. As in the analysis of the impact of the number of web-hosts, smaller language groups are over-represented on the website if the percentage of website visitors is analysed relative to the number of Internet users.

Again, the group of German native speakers but also the group of English native speakers represent exceptions.

Our final examination of the number of website visitors in relation to the number of web-hosts and Internet users confirmed our second hypothesis, with one exception.

To sum up, language may represent a double barrier for information access on the Internet. First, the size of the language group disadvantages smaller language groups due to the lower number of web-hosts and links, resulting in fewer native language information offers on the Internet. Second, with regard to accessing the information on a particular website, it was shown that the languages in which the information is presented influence who accesses the website. L2 users are strongly under-represented regardless of other network effects.

Consequently, the size of the language groups and the membership in either the L1 or the L2 user group have an impact on the number of website users per language group. Website-specific characteristics and e-commerce-related issues are likely to be the reasons for the few exceptions encountered in our analysis.

Figure 7 summarizes our results.

Figure 7. The impact of language: network and language effects

7.

Implications

Our study provides evidence for the crucial role of language when information is accessed on the World Wide Web. Despite the role of English as Lingua Franca of the Internet, information presentation in the users’ native language seems to be the most decisive factor for attracting website visitors. The role of language becomes more important if the users have sufficient websites available in their native tongues as alternatives. Native speakers of languages that are less represented on the Internet seem to be more willing to visit L2 websites due to the lack of other information sources. Presenting website information in those languages might therefore be a useful differentiation strategy since a higher percentage of users within these language groups could be attracted.

The insights provided by our study are helpful when website translation represents an important investment decision. They furthermore contribute to understanding the digital divide and to diminishing it.


8.

Limits and Outlook

Our investigation was limited to eight languages, predominantly part of the Indo-European language group. It is not clear how language similarity or cultural affiliation (these languages are mainly spoken within Western culture) affected the results.

Furthermore, no data about the users’ (Internet/computer) literacy levels or domain knowledge were collected. These variables may determine the value of a website and hence the probability of accessing the website. Domain knowledge was also shown to enhance language proficiency [21]. It may therefore also affect the user’s perceived costs.

For future research, we want to investigate the role of multilingual pages with text in more than one language. In order to analyse the role of language in information seeking and especially link following behaviour, the language of the link label in the source page is also important. It may already contain a hint as to whether the target page is in a different language. In order to evaluate these aspects, a more powerful language identifier is necessary. Language identification is also important because the correspondence between language and country-level domain is not always given (e.g. [2]). Currently, we are developing a language identifier that will be able to detect very short passages in a different language, such as link labels.
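To illustrate the kind of task involved, the following is a minimal sketch of language guessing for short link labels based on character-trigram overlap, loosely in the spirit of n-gram-based text categorization [6]. It is not the identifier under development; the function names and the two sample sentences are hypothetical, and a usable identifier would need far larger training texts and a more robust scoring scheme.

```python
def trigrams(text):
    """Set of character trigrams of `text`, lower-cased and space-padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def identify_language(label, samples):
    """Return the key of the sample text that shares the most trigrams
    with `label`, or None if the label is too short to yield a trigram."""
    grams = trigrams(label)
    if not grams:
        return None
    profiles = {lang: trigrams(sample) for lang, sample in samples.items()}
    return max(profiles, key=lambda lang: len(grams & profiles[lang]))


# Hypothetical toy training texts, one per language.
SAMPLES = {
    "en": "more information about health topics for patients and doctors on this page",
    "de": "weitere informationen zu gesundheitsthemen für patienten und ärzte auf dieser seite",
}

if __name__ == "__main__":
    print(identify_language("health topics", SAMPLES))
    print(identify_language("für patienten", SAMPLES))
```

Very short labels yield only a handful of trigrams, which is why short passages are the hard case for this family of methods.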

References

[1] Baeza-Yates, Ricardo and Poblete, Barbara (2003). Evolution of the Web Structure. In: Poster Proceedings of the Twelfth International World Wide Web Conference (WWW 2003), Budapest, 20-24 May, http://www2003.org/cdrom/papers/poster/p103/p103-baeza-yates/p103-baeza-yates.html

[2] Baeza-Yates, Ricardo, Castillo, Carlos, and López, Vicente. Characteristics of the Web of Spain. http://www.catedratelefonica.upf.es/webes/2005/ [last visit: 03 Sept 2005]

[3] Bernard, Michael Lewis (2000). Examining a Metric for Predicting the Accessibility of Information within Hypertext Structures. Unpublished Ph.D. Thesis at Wichita State University, Human Factors Psychology.

[4] Bharat, K., Chang, B.-W., Henzinger, M. & Ruhl, M. (2001). Who Links to Whom: Mining Linkage between Web Sites. In: Proceedings of IEEE International Conference on Data Mining (ICDM '01), San Jose, California, November 2001.

[5] Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Raymie Stata, Andrew Tomkins & Janet Wiener (2000). Graph Structure of the Web. In: Proceedings of the 9th International World Wide Web Conference.

[6] Cavnar, William B. and John M. Trenkle (1994): N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, 161-175.

[7] Dmoch, Thomas (1997). Interkulturelle Werbung. Verhaltenswissenschaftliche Grundlage für die Standardisierung erlebnisbetonter Werbung, Aachen.

[8] Dufour, R. and Kroll, J. (1999). Matching Words to Concepts in Two Languages: A Test of The Concept Mediation Model of Bilingual Representation. Memory & Cognition, 23, 166-180.

[9] Grin, François (1994). The Economics of Language. Match or Mismatch?. Revue Internationale de Science Politique, 15(1), 25-42.

[10] Hahne, Anja (2001). What’s Different in Second-Language-Processing? Evidence from Event-Related Brain Potentials, Journal of Psycholinguistic Research, 30(3), 251-265.

[11] Halavais, A. (2000). National borders on the World Wide Web. New Media & Society, 2(1), 7-28.

[12] Kralisch, A., Eisend, M. and Berendt, B. (2005). The Impact of Culture on Website Navigation Behaviour, In: Proceedings of the 11th International Conference on Human-Computer Interaction (HCI), Las Vegas, NV. July 22-27, 2005.

[13] Kralisch, A. and Köppen, V. (2005). The Impact of Language on Website Use and User Satisfaction, In: Proceedings of the 13th European Conference on Information Systems: Information Systems in a Rapidly Changing Economy

[14] Mandl, Thomas (2005a). Link-Analyse und alternative Verfahren zur Qualitätsbewertung im Web Information Retrieval. In: Datenbank-Spektrum: Zeitschrift für Datenbanktechnologie, 12, 16-25.

[15] Mandl, Thomas (2005b): The quest for the best pages on the web. To appear in: Information Service & Use.

[16] Martins, Bruno and Silva, Mário (2005). Language Identification in Web Pages. In Proceedings of the 2005 ACM SAC Symposium on Applied Computing (SAC). Santa Fe, New Mexico, USA. March 13-17, 2005, 764-768.

[17] McPherson, M., Smith-Lovin, L. and Cook, J.M. (2001). Birds of a Feather: Homophily in Social Networks, Annual Review of Sociology, 27, 415-444.

[18] Palmer, J. W., Bailey, J. P., & Faraj, S. (2000). The role of intermediaries in the development of trust on the WWW: The use and prominence of trusted third parties and privacy statements. Journal of Computer-Mediated Communication, 5(3).

[19] Pirolli, P. and Card, S. (1995). Information Foraging in Information Access Environments. In: Proceedings of the Association for Computing Machinery’s Conference on Human Factors in Computing Systems, 51-66.

[20] Pirolli, P. and Card, S. (1999). Information Foraging. Psychological Review, 106(4), 634-675.

[21] Steffensen, M.S., Joag-Dev, C. and Anderson, R.C. (1979). A Cross-cultural Perspective on Reading Comprehension. Reading Research Quarterly, 15(1), 10-29.

[22] Tan, P.-N. and Kumar, V. (2002). Discovery of Web Robot Sessions based on their Navigational Patterns. Data Mining and Knowledge Discovery, 6(1), 9-35.

