• No results found

In this section, we formulate the problem of TkLUS queries.

C.2.1

Modeling Social Media Data

In the context of social media, we identify three different but interrelated concepts. In particular, social media consists of many posts that are posted by users who form a social network through their interactions in the posts.

Definition 1. (Social Media Post) A social media post is a 4-tuple p= (uid, t, l, W),

where uid identifies the user who publishes the post, t is the timestamp when the post is published, l is the location where the user publishes the post, and W is a set of words(w1, w2, . . . , wn)that capture the textual content of the post.

The location field may be unavailable for many social media posts, de- pending on whether localization is supported (enabled) by a social media platform (a user). For example, Twitter users can choose to record their loca- tions or not in tweets if their devices are GPS-enabled. In this paper, we focus on social media posts that have non-empty location fields, and make use of them to find relevant results for user queries.

The size of p.W is always small since social media platforms impose length constraints on posts. For example, a tweet contains at most 140 char- acters that can only convey a small number of words. On the other hand, not all textual words in an original social media post are included in the ab- straction p. We assume the use of a vocabularyWthat excludes popular stop words (e.g., this and that).

Definition 2. (Social Network) A social network referred in social media platform

is a directed graph G= (U, Ereply, lreply, Eforward, lforward)where:

C.2. Problem Formulation

1. U is a set of vertices, each corresponding to a user.

2. Ereply is a set of directed edges that capture the “reply” relationships between

users in U. Particularly, an edge e = hu1, u2i ∈ Ereply means that user u1

replies to user u2in one or more posts.

3. lreply: lreplymaps a reply edge to the set of involved posts. Given two users u1

and u2, lreply(u1, u2)returns all the posts in which u1replies to u2.

4. Eforward is a set of directed edges that capture the “forward” relationships be-

tween users in U. Particularly, an edge e= hu1, u2i ∈Eforward means that

user u1forwards user u2’s post(s).

5. lforward : lforward maps a forward edge to the set of involved posts. Given two

users u1and u2, lforward(u1, u2)returns all u2’s posts forwarded by u1.

The location field may be unavailable for many social media posts, de- pending on whether localization is supported (enabled) by a social media platform (a user). For example, Twitter users can choose to record their loca- tions or not in tweets if their devices are GPS-enabled. In this paper, we focus on social media posts that have non-empty location fields, and make use of them to find relevant results for user queries.

The size of p.W is always small since social media platforms impose length constraints on posts. For example, a tweet contains at most 140 char- acters that can only convey a small number of words. On the other hand, not all textual words in an original social media post are included in the ab- straction p. We assume the use of a vocabularyWthat excludes popular stop words (e.g., this and that).

Definition 3. (Social Network) A social network referred in social media platform

is a directed graph G= (U, Ereply,

lreply, Eforward, lforward)where:

1. U is a set of vertices, each corresponding to a user.

2. Ereply is a set of directed edges that capture the “reply” relationships between

users in U. Particularly, an edge e = hu1, u2i ∈ Ereply means that user u1

replies to user u2in one or more posts.

3. lreply: lreplymaps a reply edge to the set of involved posts. Given two users u1

and u2, lreply(u1, u2)returns all the posts in which u1replies to u2.

4. Eforward is a set of directed edges that capture the “forward” relationships be-

tween users in U. Particularly, an edge e= hu1, u2i ∈Eforward means that

user u1forwards user u2’s post(s).

5. lforward : lforward maps a forward edge to the set of involved posts. Given two

Paper C.

The social network in a social media platform is complex as it involves both content-independent and content-dependent aspects. The definition above captures both aspects in different types of graph edges and auxiliary mappings. In particular, each reply edge in Ereply must involve at least one

post where one user replies to another, whereas each edge in Eforward must

involve at least one post which one user forwards.

We use P = {(uid, t, l, W)} to denote the set of all social media posts of interest, and D = (P, U, G) to denote the geo-tagged social media data. For convenience, we also use Puto denote all the posts by a user u∈U.

C.2.2

Problem Definition

In the context of geo-tagged social media data, people may pose questions like What are the events going on in New York City? and How can I find a babysit- ter in LA? Among all social media users in U, there can be some local users, who have relevant knowledge indicated in their posts, to answer such ques- tions. Without loss of generality, we formalize the questions as the following problem.

Problem Definition. (Top-k Local User Search in Social Media) In the con- text of geo-tagged social media dataD = (P, U, G), given a query q(l, r, W)

where q.l is a query location, q.r is a distance value, and q.W is a keyword set that captures user needs, a top-k local user search (TkLUS) finds a k-user set

Ek⊆ D.U that satisfies the following conditions:

1. ∀u∈ Ek,∃p∈Pu such thatkq.l, p.lk ≤q.r4and p.W∩q.W6=∅.

2. ∀u ∈ Ek and ∀u0 ∈ D.U\ Ek, either u0 does not satisfy condition 1 or score(u0, q) ≤score(u, q).

Here, score(.) is a score function to measure the relevance of a user to a search request. Our score functions, to be detailed in Section C.3, consider both keyword and distance relevances.

We give a simplified example. Figure C.1 shows a map where a Tk- LUS query is issued at the location indicated by the cross (43.6839128037, -79.37356590) with a single keyword “hotel” and a range of 10 km. The tweets containing “hotel” (and their users) are listed in Table C.1, and their locations are also indicated in the map. According to different user scoring functions, user u1 having more tweets or user u5 having more replies/forwards (not

shown in the table) will be returned.

To find the top-k users for a TkLUS query is a non-trivial problem. The user set U and the post set P are very large on many social media platforms like Twitter. It is definitely inefficient to check the sets iteratively. Also, it

4kl

1, l2kdenotes the Euclidean distance between locations l1and l2. The techniques proposed

C.3. Scoring Tweets and Users