Relationship Analysis between User’s Contexts and Real Input Words through Twitter

Yutaka Arakawa, Shigeaki Tagashira and Akira Fukuda

Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan

Email: arakawa, shigeaki, [email protected]

Abstract—In this paper, we propose a method to evaluate the effectiveness of our proposed context-aware text entry by using Twitter. We focus on “geo-tagged” public tweets because they include important user contexts, namely real location and time. We also focus on TV program listings because 50% of iPhone traffic in Japan is generated from home, where users often tweet while watching TV. A cyclical collecting system based on the Streaming API and Search API of Twitter is proposed for gathering the target tweets efficiently. To find the relationship between a user’s contexts and the words actually used, we compare the words actually tweeted with words obtained from the Local Search API of Yahoo! Japan, which is used for our context-aware text entry, and with words obtained from TV program listings. We analyze 471274 tweets collected from 15 December 2009 to 10 June 2010 to identify their relationship to landmark information and TV programs. As a result, we show that 5.1% of tweets include landmark words and 9% of tweets include TV program words. Additionally, we show that there are location-dependent words and time-dependent words.

I. INTRODUCTION

A recent survey in Japan[1] found that over fifty percent of Internet users access the Internet from mobile devices. Among them, over eighty percent access information not through the hierarchical menus of official sites but through search engines such as Google. Moreover, current mobile devices support not only text messaging but also web mail such as Gmail, which means that users have more opportunities to input long text. The increase of text input on mobile devices drives the demand for improved text input methods.

Recent mobile phones are generally equipped with intelligent text entry that provides predictive transform. This function consists of dictionaries, syntactic parsing, and learning. When a user inputs “a”, it picks up the words that start with “a” from a dictionary and recommends candidate words that seem appropriate according to syntactic parsing. In addition, it sorts the candidates based on the user’s input history, such as frequency or the time stamp of the latest use. Recently, iWnn[2], a Japanese text entry system adopted in many kinds of mobile phones, has come to suggest more appropriate words according to the current season, the time, the body of a received e-mail, and the relationship between superior and subordinate. As another approach that differs from text entry, “Google Suggest” provides often-used keyword combinations for optimizing search terms and reducing keystrokes.

Meanwhile, we have proposed context-aware text entry[3], which can suggest useful words based on the user’s context such as location, presence, and time. Among the above-mentioned three components of predictive transform in text entry, we mainly focus on how to build the dictionary dynamically. The reason is that a word not included in the dictionary cannot be recommended, and we are not specialists in philological syntactic parsing. In our proposed system, the dictionary in the mobile phone is updated periodically based on the user’s current contexts, in cooperation with a dictionary creation server on the Internet, which generates the user’s current dictionary dynamically by using several public Web APIs. In our current prototype, “location” and “time” are adopted as the user’s context: landmark names surrounding the user are added based on the user’s location, and TV program titles and performers’ names are added based on “time”. The reason why we use TV programs is that 50% of iPhone traffic in Japan is transmitted through home WiFi networks. As a result, if you input “H” near the venue of Globecom2010, the system may suggest “Hyatt Regency Miami” as one of the candidates. If you input “J” or “K” while watching 24, the system may return “Jack Bauer” or “Kiefer Sutherland”, respectively. We have already constructed an OpenWnn-based prototype system on an Android terminal[4]. We have also tried several system architectures and confirmed that one of them achieves a sufficient response time[5]. In addition, we are evaluating our system through demonstrations and questionnaires with several participants.

Although our system looks effective, an important issue remains. Since we started this research with the assumption that this kind of system would be convenient for us, there is no evidence or quantitative evaluation demonstrating its effectiveness. It is hard to gather a large number of results through questionnaire-based evaluation. In addition, if we log the sentences actually input on a mobile phone, we must consider privacy protection for personal data.

In this paper, we propose a method for evaluating context-aware text entry by using Twitter. Twitter is a micro-blogging service on the Internet, where a short message of up to 140 characters, called a tweet, can be posted. These tweets are generally open to the public. We focus on Twitter for two reasons: 1) we can obtain a huge amount of public text from various users, and 2) the Geotagging API released in November 2009 enables users to add a geocode to each tweet. This means that we can extensively and publicly collect real input sentences that include users’ real locations. Therefore, we think that analyzing the collected data clarifies the relationship between users’ contexts and real input words. As a result, it can show the effectiveness of our proposed text entry quantitatively.

[Figure: a train transit application, an e-mail application, and a map application, each using a dictionary that is dynamically generated by using public APIs on the Internet (nearest station API, schedule/calendar API, landmark information API).]

Fig. 1. Typical Effective Examples of context-aware text entry

First, we construct a tweet collecting system that obtains Japanese tweets with geocodes, in which we effectively combine two Twitter APIs, the Streaming API and the Search API, to gather a huge amount of tweets. Our system has gathered about half a million tweets since 15 December 2009. Next, we analyze the collected data by comparing it with data obtained from other APIs. In this paper, we use the “Yahoo! Local Search API[6]” for obtaining landmark information and a “TV program listing on the Internet” for obtaining TV program titles and performers’ names. These are the same sources used for building the dictionary in our context-aware text entry. In our relationship analysis, both kinds of data are split into words by using the “Yahoo! Japanese language morphological analysis API” and the “Yahoo! Key phrase extraction API”.

As a result, we show that 5.1% of tweets include landmark words and 9% of tweets include TV program words. Additionally, we show that there are location-dependent words and time-dependent words. The rest of the paper is organized as follows. Section II presents our previously proposed context-aware input method editor. Section III explains Twitter, and Section IV explains the relationship analysis. Finally, results are shown in Section V.

II. CONTEXT-AWARE TEXT ENTRY FOR MOBILE PHONE

Fig. 1 shows typical service examples in which our proposed context-aware text entry works effectively. It indicates that the importance of words varies with location (i.e., the user’s context). For example, nearby station names are used at stations, landmark names are used at new places, and product names are used at bookstores and electronics retail stores. The most characteristic point is that the dictionary is automatically and dynamically updated by mashing up public Web APIs on the Internet. A nearest station API can be used for obtaining the names of stations near the user. Likewise, a landmark information API can be used to search for landmark names surrounding the user.

[Figure: block diagram spanning the local device and, on the Internet, the internal server and external servers. Labeled components include a GPS sensor, an acceleration sensor, other sensors, a context updater, a context estimation engine, an asynchronous context-aware select & sort engine, API access modules (XML parser), Japanese language morphological analysis (MeCab), the “Personal Context Dictionary”, a kana-kanji conversion API, the IME (ATOK, direct plug-in), and external APIs (Yahoo, Google, Amazon, Tabelog, GuruNavi, schedule, presence, location).]

Fig. 2. The architecture of prototype system

In addition, we introduced a learning process into this system. If a user selects a word from the suggested candidates at a station, the system judges that it may be used at the same place in the future. If a word is not used, the system judges that the word is not useful at this place. By repeating these learning processes cyclically, words related to a certain place come to be suggested there, and normal words are suggested in other places.
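As an illustration, the learning process described above can be sketched as a per-place score table. The following is a minimal Python sketch and not the prototype's actual implementation: the class name `PlaceLearner`, the reward and penalty values, and the place keys are all assumptions introduced here for illustration.

```python
from collections import defaultdict


class PlaceLearner:
    """Toy per-place word scores (hypothetical; not the prototype's code)."""

    def __init__(self, reward=1.0, penalty=0.5):
        # The reward and penalty sizes are illustrative assumptions.
        self.reward = reward
        self.penalty = penalty
        self.scores = defaultdict(float)  # (place, word) -> score

    def feedback(self, place, suggested, selected):
        """Raise the score of the selected word and lower the unused ones."""
        for word in suggested:
            if word == selected:
                self.scores[(place, word)] += self.reward
            else:
                self.scores[(place, word)] -= self.penalty

    def rank(self, place, candidates):
        """Sort candidates by their score at this place, highest first.

        sorted() is stable, so words with equal scores keep their input order.
        """
        return sorted(candidates, key=lambda w: -self.scores[(place, w)])
```

Because scores are keyed by place, a word that is boosted at one place keeps its default rank elsewhere, which corresponds to normal words being suggested in other places.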

A. System architecture

The architecture of the prototype system is shown in Figure 2. It is composed of three parts: the local device, our server on the Internet, and general web services on the Internet. The local device has various sensors such as GPS and an acceleration sensor. In our prototype system, we use a PC as the local device and adopt the Google Maps API in place of a GPS sensor for setting the user’s location visually.

The internal server in the center of Figure 2 is the main part of our proposed system. It collects information, estimates the user’s context, creates the dynamic dictionary, and suggests words by utilizing the user’s context. These functions could be implemented on the local device. However, we place them on a server on the global network because not only the accuracy of the estimation algorithm but also the processing speed is important. Besides, we design the system so that collecting sensor information and inputting text on the local device work asynchronously. The dictionary is updated whenever the location changes. As a result, the local device only searches a pre-constructed database when text is input. This architecture prevents the processing speed from slowing down as the number of external web servers increases.

The external servers on the right of Figure 2 are not our servers but are provided by several companies. This prototype system cooperates with the Yahoo Local Search API, the Google Maps API, and the Gurunavi API. Words provided by these APIs are the material for the personal context-aware dictionary.

We developed two prototype systems. One is an extension of OpenWnn for Android; the other is an ATOK Direct Plug-in for Windows and Mac. “ATOK[7]” is one of the major text entry systems in Japan, along with Microsoft’s text entry.

[Figure: real-time tweets are collected from the Streaming API (10∼15% of all tweets) and filtered to Japanese, geotagged tweets (less than 1%), which are stored in the database; the Twitter IDs of users who once geotagged are used to collect past tweets from the Search API, which are filtered in the same way.]

Fig. 3. Cyclical collecting system based on Streaming API and Search API

B. Remaining Issue

Since we started this research with the assumption that this kind of system would be convenient for us, there is no evidence or quantitative evaluation demonstrating its effectiveness. It is hard to gather a large number of results through questionnaire-based evaluation. In addition, if we log the sentences actually input on a mobile phone, we must consider privacy protection for personal data.

III. TWITTER

Twitter[8] is one of today’s major micro-blogging and social networking services, in which a short message of up to 140 characters, called a tweet, can be posted. Tweets are generally open to the public as a “public timeline”. Since Twitter releases many kinds of APIs for general users, we can obtain other users’ tweets through these APIs. In this paper, we use the Streaming API and the Search API for obtaining tweets.

The Streaming API, which was officially released in January 2010, allows near-real-time access to users’ tweet timelines. Tweets created by a public account are candidates for inclusion in the Streaming API. However, the Streaming API only provides randomly sampled tweets, which amount to about 10% of all tweets. The Search API allows us to search tweets with a query in which we can set parameters such as target text, language, user ID, geocode, and time span. In this paper, we use this API for collecting the past tweets of users who have once posted with a geocode. How to combine these two APIs is described in the following section.

IV. RELATIONSHIP ANALYSIS

To analyze the relationship between users’ contexts and the sentences actually input, we construct a collecting system for Twitter and compare the collected tweets with landmark information obtained from the Yahoo! Local Search API[6] and TV information obtained from an online TV program listing[9].

[Figure: the time and location of each tweet (e.g., “2010-06-28T17:05:32, 33.59723, 130.217793, I am staying Hyatt Regency Miami.”) are used to obtain performers’ names from the TV program listing and landmark names from the Yahoo! Local Search API; the tweet text and the obtained names are parsed into words by language morphological analysis and compared.]

Fig. 4. Flow of relationship analysis

These APIs are the same as those used for making the dictionary in our previously proposed context-aware text entry.

A. Cyclical collecting system for Twitter

We construct a tweet collecting system that obtains Japanese tweets tagged with geocodes, in which we effectively combine two Twitter APIs, the Streaming API and the Search API, to gather a huge amount of tweets. Fig. 3 shows the cyclical tweet collecting system based on the Streaming API and the Search API. The reason for using two APIs is as follows. Since tweets obtained through the Streaming API are written in various languages, we need to filter them and pick up the target tweets, which are written in Japanese and have a geocode, as shown on the left side of Fig. 3. As a result, we obtain less than 1% of the tweets. If a user wants to add a geocode to his or her own tweets, the user must have a client that can tag the current location through the Twitter API. In other words, a user who has tagged a tweet once is likely to post other tagged tweets. Therefore, we pick up the IDs of users who posted a geo-tagged tweet, and we cyclically collect their past tweets through the Search API.
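One cycle of this collection can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the actual collector: it makes no real Twitter API calls, `stream_sample` and `fetch_past_tweets` are hypothetical stand-ins for the Streaming API and the Search API, and the tweet dictionary fields (`id`, `user`, `lang`, `geo`) are illustrative names.

```python
def is_target(tweet):
    """Keep only Japanese tweets that carry a geocode."""
    return tweet.get("lang") == "ja" and tweet.get("geo") is not None


def collect_cycle(stream_sample, fetch_past_tweets, known_users, database):
    """Run one cycle: filter the stream, then back-fill from geotagging users.

    stream_sample:     iterable of tweet dicts (stand-in for the Streaming API)
    fetch_past_tweets: callable user_id -> iterable of tweet dicts
                       (stand-in for the Search API)
    known_users:       set of user IDs seen posting a geotagged tweet
    database:          dict tweet_id -> tweet, which also removes duplicates
    """
    # Step 1: filter the sampled stream and remember who geotags.
    for tweet in stream_sample:
        if is_target(tweet):
            database[tweet["id"]] = tweet
            known_users.add(tweet["user"])
    # Step 2: collect past tweets of those users and filter them the same way.
    for user in sorted(known_users):
        for tweet in fetch_past_tweets(user):
            if is_target(tweet):
                database[tweet["id"]] = tweet
    return database
```

Calling `collect_cycle` repeatedly with the same `known_users` and `database` objects gives the cyclical behavior: each pass over the stream may add new user IDs, and each added user yields further geotagged tweets from the past.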

B. Matching Process

A tweet consists of three data items: time information, location information, and the input text, as shown in Fig. 4. From the location information and the “Yahoo! Local Search API”, we pick up the names of the surrounding landmarks within one kilometer of the user’s current location. Typical landmarks are stations, city halls, schools, hospitals, post offices, and so on. From the time information and the TV program listing, we pick up performers’ names and TV program titles. As target channels, we adopt 12 key stations in the Tokyo area and the Fukuoka area. Fukuoka, where our university is located, is one of the major cities in western Japan. Since it is hard to collect past TV program listings and the obtainable data is extremely large, we only analyze about one month of data (from 7 January 2010 to 2 February 2010).
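The one-kilometer landmark selection can be illustrated with a great-circle distance filter. In the analysis itself the radius search is performed by the Yahoo! Local Search API; the following Python sketch only stands in for that step, using the haversine formula, and the function names and the `(name, lat, lon)` landmark tuples are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby_landmarks(lat, lon, landmarks, radius_km=1.0):
    """Return names of landmarks within radius_km of the tweet position.

    landmarks: iterable of (name, lat, lon) tuples (hypothetical format)
    """
    return [name for name, la, lo in landmarks
            if haversine_km(lat, lon, la, lo) <= radius_km]
```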

[Figure: number of collected tweets per day from 15 Dec. 2009 to 10 June 2010, with the initial period marked “Only Streaming API”.]

Fig. 5. Distribution of collected tweets per day

All the data are split into words by using the “Yahoo! Japanese language morphological analysis API” and the “Yahoo! Key phrase extraction API”. After that, we compare these words with each other and evaluate the matching rate.
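The matching rate can be read as the fraction of tweets that share at least one word with the words obtained for their context. The following Python sketch shows this reading under stated assumptions: the paper does not give an exact formula, the function name `matching_rate` is hypothetical, and the inputs are assumed to be already split into word sets (the morphological analysis step is not reproduced here).

```python
def matching_rate(tweet_words, context_words):
    """Fraction of tweets sharing at least one word with their context.

    tweet_words:   list of word sets, one per tweet
    context_words: list of word sets (landmark or TV program words),
                   aligned with tweet_words by index
    """
    if not tweet_words:
        return 0.0
    # A tweet counts as a match when its word set intersects its context set.
    hits = sum(1 for words, ctx in zip(tweet_words, context_words) if words & ctx)
    return hits / len(tweet_words)
```

For example, a tweet split into {"staying", "Hyatt", "Regency", "Miami"} matches a landmark set that contains "Hyatt", while a tweet whose position returns no landmarks can never match.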

V. RESULTS

Fig. 5 shows the distribution of the 471274 tweets collected from 15 December 2009 to 10 June 2010. The geographical scope of the tweets is limited to the area of Japan, defined as the area from latitude 24 north and longitude 123 east to latitude 46 north and longitude 146 east. This limitation is due to the Yahoo! Local Search API, which is provided only by Yahoo! Japan. Since we used only the Streaming API at first, the number of tweets in the first month is 10 to 100 times smaller than in subsequent periods. This indicates that our proposed cyclical tweet collecting system is very effective.
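The geographical limitation amounts to a bounding-box test on each tweet's geocode. A minimal Python sketch follows; the names `JAPAN_BOX` and `in_japan_box` are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
# Bounding box taken from the text: latitude 24 to 46 north,
# longitude 123 to 146 east. Inclusive boundaries are an assumption.
JAPAN_BOX = {"lat_min": 24.0, "lat_max": 46.0, "lon_min": 123.0, "lon_max": 146.0}


def in_japan_box(lat, lon, box=JAPAN_BOX):
    """Return True if the position lies inside the bounding box."""
    return (box["lat_min"] <= lat <= box["lat_max"]
            and box["lon_min"] <= lon <= box["lon_max"])
```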

The average length of the collected tweets is 48.8 characters, and tweets of about 30 characters are the most common. From these results, we think that abbreviated notation is often used in tweets. The average and maximum numbers of landmarks obtained at a position from the Yahoo! Local Search API are 22.9 and 71, respectively. For 10.2% of positions, no landmark information can be obtained from this API. The maximum number of landmarks per position is 66. Meanwhile, the average and maximum numbers of words obtained from the TV program listing are 149.1 words/hour and 790 words/hour, respectively, which is about 10 times larger than for landmark information.

The percentage of tweets including the words obtained according to the tweeted position is 5

Finally, we discuss the dependency on time and location. Fig. 6 shows the geographical distribution of tweets that include “noodle”. Since the plots are widely distributed all over Japan, the word “noodle” can be regarded as a location-independent word. On the other hand, each kind of plot (circle and plus) in Fig. 7 is concentrated in a certain area. In this figure, circle plots and plus plots show the geographical distribution of tweets that include “Shinjuku” and “Shibuya”, respectively. The centers of the concentrated areas are Shinjuku station and Shibuya station of JR (Japan Railways). From this result, the words “Shinjuku” and “Shibuya” can be defined as location-dependent words.

[Figure: map of Japan with plots of tweets that include “noodle”.]

Fig. 6. Geographical distribution of a location independent word: Noodle

[Figure: map of central Tokyo (Nakano-ku, Shibuya-ku, Meguro-ku, Minato-ku, Shinjuku-ku, Bunkyo-ku, Chiyoda-ku) with Shinjuku station and Shibuya station marked.]

Fig. 7. Geographical distribution of location dependent words: Shibuya, Shinjuku

[Figure: number of tweets per day, with date ticks at 9 May, 16 May, 23 May, 30 May, and 6 June (Sundays).]

Fig. 8. Distribution of a time dependent word: Ryoma-den (per day)

Fig. 8 and Fig. 9 show the distribution of tweets that include “Ryoma-den” per day and per hour, respectively. “Ryoma-den” is a popular TV program currently broadcast by the Japan Broadcasting Corporation (NHK) at 20 o’clock every Sunday. As shown in Fig. 8, the number of tweets on Sundays is obviously larger than on other days of the week. We can also see that the number of tweets at 20 o’clock is remarkably larger than in other time slots. As a result, the word “Ryoma-den” highly depends on time.
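The per-day and per-hour distributions can be computed directly from tweet timestamps. The following Python sketch assumes timestamps in the format shown in Fig. 4 (for example, 2010-06-28T17:05:32). The function names are hypothetical, and `peak_share` is only one possible indicator of time dependence introduced here for illustration, not a measure defined in this paper.

```python
from collections import Counter
from datetime import datetime


def time_profile(timestamps):
    """Count tweets per weekday (0 = Monday .. 6 = Sunday) and per hour."""
    by_weekday, by_hour = Counter(), Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")
        by_weekday[dt.weekday()] += 1
        by_hour[dt.hour] += 1
    return by_weekday, by_hour


def peak_share(counter):
    """Share of all tweets that fall into the single busiest bucket."""
    total = sum(counter.values())
    return max(counter.values()) / total if total else 0.0
```

A word whose tweets concentrate in one weekday and one hour gives a peak share close to 1, whereas a word used evenly over the day gives a per-hour peak share close to 1/24.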

We are now picking up other typical words that highly depend on either time or location. We hope that by picking up such context-aware words, our context-aware text entry system will be improved.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a cyclical tweet collecting system and have collected nearly half a million geo-tagged tweets written in Japanese for analyzing the relationship between users’ context (location and time) and the words actually input. We collected 471274 tweets from 15 December 2009 to 10 June 2010. Statistical analysis shows that 5.1% of tweets include landmark words and 9% of tweets include TV program words. Additionally, geographical mapping provides evidence of the location and time dependence of the words actually input. As a first step, we have focused on Japanese tweets, but we believe that this relationship exists regardless of language. We are now trying to pick up words with high location dependency by calculating the geographical distribution ratio.

[Figure: number of tweets that include “Ryoma-den” per hour.]

Fig. 9. Distribution of a time dependent word: Ryoma-den (per hour)

ACKNOWLEDGMENT

This work was carried out as part of a joint research program of the NTT Service Integration Laboratories and the National Institute of Informatics, using the facilities provided by them.

REFERENCES

[1] rTYPE. (2009) Survey of mobile web site. (in Japanese). [Online]. Available: http://release.center.jp/2008/11/0502.html (last access: 2009/12/1)

[2] OMRON SOFTWARE. (2009) iwnn. (in Japanese). [Online]. Available: http://www.omronsoft.co.jp/SP/mobile/iwnn/ (last access: 2009/12/1)

[3] S. Suematsu, Y. Arakawa, S. Tagashira, and A. Fukuda, “Network-based context-aware input method editor,” in The Sixth International Conference on Networking and Services (ICNS 2010), 7 March 2010, pp. 1–6.

[4] Y. Arakawa, S. Suematsu, S. Tagashira, Y. Yamaguchi, Y. Tanaka, and A. Fukuda, “Implementation of network-based context-aware japanese input method editor,” in IEICE Technical Reports, ser. MoMuC2009-58, vol. 109, no. 380, 21 January 2010, pp. 31–34, (in Japanese).

[5] S. Suematsu, Y. Arakawa, S. Tagashira, and A. Fukuda, “On improvement of response time for network-based context-aware japanese input method editor,” in IEICE General Conference, no. B-15-18, 19 March 2010, (in Japanese).

[6] Yahoo Japan Corporation, “Yahoo! developer network - map,” http://developer.yahoo.co.jp/webapi/map/ (last access: 2009/12/1), 2009, (in Japanese).

[7] JustSystems Corporation, “Atok.com,” http://www.atok.com/ (last access: 2009/12/1), 2009, (in Japanese).

[8] Twitter, “Twitter,” http://twitter.com/.

[9] Toshiba, “Net de navi,” http://tvsurf.jp/tv/.