3. Language Information Inference Using Social Profiles
3.2. Problem Description
This section first gives some background information on the inference problem and then introduces two main challenges in modelling the problem, as well as the corresponding proposed solutions. Finally, it gives the formal problem definition.
3.2.1. Background
In general, a social profile consists of multiple fields, each of which details a particular aspect of information about the user, such as education background, hobbies etc. Different platforms may use different fields to construct user profiles. This work considers three commonly used fields in the profiles of popular SNSs such as LinkedIn, Facebook and Google Plus, for the language information inference problem:
• Summary: Unstructured text that gives a general introduction about the user. Because there is no structure restriction, the focus of this field varies from user to user.
• Education Background: Structured text that details each study experience of the user by subsections. Each subsection could include attributes like school name, study major, etc.
• Work Experience: Structured text that details the user’s work experience by subsections. Each subsection could include attributes like company name, role, work period, etc.
In practice, the language information of some users is already known. For example, it may be explicitly stated in the user's profile, or it may be predictable from the user's interactions with the system. The problem is therefore how to infer the language information of the remaining users, based on the textual information in the given social profiles and the known language information of a subset of the users.
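To make the setting concrete, the following minimal Python sketch shows one way the three profile fields and the partially known language information could be represented. The class and field names (Profile, Education, Work, known_languages) are assumptions made for illustration; they are not taken from any SNS platform or from the implementation used in this work.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Education:
    # One subsection of the Education Background field.
    school: str
    major: Optional[str] = None


@dataclass
class Work:
    # One subsection of the Work Experience field.
    company: str
    role: Optional[str] = None
    period: Optional[str] = None


@dataclass
class Profile:
    user_id: str
    summary: str = ""  # unstructured free text
    education: List[Education] = field(default_factory=list)
    work: List[Work] = field(default_factory=list)
    # Known languages, or None when they still have to be inferred.
    known_languages: Optional[Set[str]] = None


def split_by_label(profiles: List[Profile]) -> Tuple[List[Profile], List[Profile]]:
    """Separate profiles with known language information from the rest."""
    labelled = [p for p in profiles if p.known_languages is not None]
    unlabelled = [p for p in profiles if p.known_languages is None]
    return labelled, unlabelled


# Illustrative data only.
profiles = [
    Profile("u1", summary="Software engineer.", known_languages={"French"}),
    Profile("u2", education=[Education("Example University", "Physics")]),
]
labelled, unlabelled = split_by_label(profiles)
```

In this sketch, the inference task amounts to predicting the languages of the profiles in `unlabelled` using the profiles in `labelled` as training data.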
3.2.2. Challenges and Solutions
As discussed in Section 3.1, the attributes in a social profile, especially the location-related attributes, may implicitly suggest which languages a user comprehends. For example, a user who has lived in a place for a long period of time probably comprehends one of the languages spoken in that place. Based on this assumption, language inference using social profiles can be modelled as a supervised classification problem. However, there are two main challenges in modelling this problem:
(1) Users' online social profiles are usually incomplete, and some profiles even lack critical information, as outlined in Section 3.1. Additionally, SNS platforms generally do not ask for certain information when a user is populating their social profile, even though that information may provide important evidence for language information inference, e.g. the location of institutions in the Education Background field.
In order to alleviate this problem, this work proposes to link each experience of the user to its corresponding location (if missing) by exploiting external resources. For example, a university can be linked to its homepage, from which the location can be harvested. More language-related attributes can then be imported to help inference. The specific strategy adopted in this work is detailed in Section 3.4.1.
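The idea of filling in a missing location from an external resource can be sketched as follows. This is only an illustration under stated assumptions: the dictionary INSTITUTION_LOCATION is a hypothetical stand-in for an external resource such as an institution's homepage, and the function name and record layout are invented for the example. The strategy actually adopted in this work is the one described in Section 3.4.1.

```python
from typing import Dict, Optional

# Hypothetical stand-in for an external resource (e.g. data harvested
# from institution homepages). Keys are normalised institution names.
INSTITUTION_LOCATION: Dict[str, str] = {
    "trinity college dublin": "Ireland",
    "universidad de sevilla": "Spain",
}


def enrich_location(
    experience: Dict[str, Optional[str]],
    lookup: Dict[str, str] = INSTITUTION_LOCATION,
) -> Dict[str, Optional[str]]:
    """Return a copy of the experience with a location filled in if missing.

    A location already present in the profile is never overwritten.
    If the institution is unknown to the lookup, the location stays None.
    """
    enriched = dict(experience)
    if not enriched.get("location"):
        name = (enriched.get("institution") or "").strip().lower()
        enriched["location"] = lookup.get(name)
    return enriched


# Illustrative records only.
missing = enrich_location({"institution": "Universidad de Sevilla", "location": None})
present = enrich_location({"institution": "Trinity College Dublin", "location": "Dublin"})
unknown = enrich_location({"institution": "Some Unlisted School", "location": None})
```

The enriched location can then be added to the profile as an additional language-related attribute.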
(2) Some attributes in the profile may be misleading in the inference process. Because languages can be regionally related or mutually intelligible, they may share similar discriminative features. When many such languages are included in the target language set, textual attributes alone may not be sufficient to distinguish between them.
In this work, Chinese, French, German, Hindi and Spanish are selected as target inference languages. Of the five languages, Spanish, French and German are used in combination as official languages in a number of multilingual countries. For example, both French and Spanish are official languages of Equatorial Guinea, and French and German are used as official languages of Luxembourg [18]. Also, French has lexical similarities of 0.75 and 0.29 with Spanish and German respectively (where the maximum of 1.0 denotes a total overlap between vocabularies) [19]. By contrast, Hindi and Chinese are spoken as an official language only in India and China respectively; they have no vocabulary overlap with each other or with the above three languages. Lower prediction accuracy is therefore expected on French, German and Spanish (when compared with Chinese and Hindi), as it is more difficult to identify their discriminative features. This intuition is validated in the experiments conducted as part of this research.
It is noted that, apart from Hindi, many other languages are spoken in India (in the context of this research they are considered to be related languages of Hindi), and that tertiary education within the country is conducted primarily through English. Although English and the other Indian languages are not included in the set of target inference languages in this PhD research, the location-related information about the user can still increase the estimated probability that the user has language expertise in Hindi. The introduction of other related Indian languages is expected to decrease the inference accuracy on both Hindi and these languages.
[18] http://en.wikipedia.org/wiki/List_of_multilingual_countries_and_regions
[19] http://en.wikipedia.org/wiki/Lexical_similarity
This work takes two types of relation into consideration when modelling the language inference problem in an attempt to address this challenge:
• Dependency relation between languages: The above example also hints that if a user knows French, she has a much higher probability of also knowing Spanish than Chinese. Thus, this dependency relation between languages could be helpful in inferring users' language information.
• Social relation between users: Although new users have no direct friendship or follower connections with other users, they can be related through the information available in their social profiles. This PhD work focuses on the same-experience relation, i.e. two profiles share a study experience (studied the same major in the same institute) or a work experience (worked in the same role in the same company), to help inference. For example, two users who share the same work experience are likely to know a common language, because employees in the same department of a company need to communicate with each other. Thus, it is reasonable to assume that users with the same-experience relation are likely to know a common language.
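The same-experience relation described above can be sketched in a few lines of Python. The input layout (tuples of user, experience kind and two attributes) and the function name are assumptions made for illustration, and matching is done by simple case-insensitive string equality, which is a simplification rather than the matching procedure used in this work.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Set, Tuple

# (user_id, kind, attribute_1, attribute_2), where kind is "study" with
# (major, institute) or "work" with (role, company). Hypothetical layout.
Experience = Tuple[str, str, str, str]


def same_experience_edges(experiences: Iterable[Experience]) -> Set[Tuple[str, str]]:
    """Return undirected user pairs that share a study or work experience."""
    groups = defaultdict(set)
    for user_id, kind, attr_1, attr_2 in experiences:
        # Normalise case and whitespace so trivially different spellings match.
        key = (kind, attr_1.strip().lower(), attr_2.strip().lower())
        groups[key].add(user_id)

    edges = set()
    for users in groups.values():
        # Sorting gives each undirected pair a single canonical form.
        for u1, u2 in combinations(sorted(users), 2):
            edges.add((u1, u2))
    return edges


# Illustrative data only.
experiences = [
    ("u1", "study", "Computer Science", "Example University"),
    ("u2", "study", "computer science", "example university"),
    ("u3", "work", "Engineer", "Example Corp"),
    ("u1", "work", "Engineer", "Example Corp"),
    ("u4", "work", "Manager", "Example Corp"),
]
edges = same_experience_edges(experiences)
```

Here u4 is not connected to u1 or u3: working in the same company is not enough, since the relation also requires the same role.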
Therefore, this work proposes a model that collectively considers the three factors outlined above: enhanced textual attributes, language relation and social relation, to model the problem of language inference using social profiles.
Table 3-1: Definition of Notations

Group   Notation          Description
Data    N                 Number of users/profiles
Data    K                 Number of target languages
Data    W                 User attribute matrix
Data    v = (u_i, l_j)    A correlation node consisting of user u_i and language l_j
Data    V                 Correlation node matrix
Data    X                 Attribute matrix of V
Data    y ∈ {+1, -1}      Has or doesn't have expertise
Data    E                 Social and language relations
Model   α, β, γ           Model coefficients
Model   Z_1, Z_2          Normalization
3.2.3. Problem Definition
The input of N social profiles can be represented as G = (U, L, W), where U is the set of |U| = N users and L is the set of |L| = K target inference languages; W is an attribute matrix associated with the users in U, in which each row corresponds to a user, each column corresponds to an attribute of the profile, and an entry w_ij denotes the value of the j-th attribute in the profile of user u_i. The objective of the work is to learn a model that can effectively infer what languages a user comprehends.
This work defines the correlation node v = (u_i, l_j), which is associated with a binary label y, to represent the output of the problem. A label y = +1 means that user u_i ∈ U comprehends language l_j ∈ L, and y = -1 means the opposite. Each user in U is mapped to K correlation nodes, one for each of the K languages in L, so a set V of K·N correlation nodes and a corresponding label set Y are obtained. Because the language information of some users is known, i.e. the label values of the corresponding correlation nodes are given, these labels are denoted as set Y_L. The remaining labels are denoted as set Y_U, where Y = Y_L ∪ Y_U. In addition, the correlation nodes are connected through the two types of relations defined above, which constitute an undirected edge set E. The definition of E is detailed in the next section. The notation is summarised in Table 3-1.
Therefore, given a partially labelled network G = (V, U, L, Y_L, E, W), the objective of the language inference problem is to learn a predictive function f: G → Y_U.
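The construction of the correlation nodes and of the labelled and unlabelled parts of the label set can be sketched as follows. The function name and input layout are hypothetical, and the sketch assumes that when a user's language information is known it is known for all K target languages, so that any target language not listed for that user receives the label -1. The edge set E is left out, as its definition is given in the next section.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # correlation node v = (user, language)


def build_correlation_nodes(
    users: List[str],
    languages: List[str],
    known: Dict[str, Set[str]],
) -> Tuple[List[Node], Dict[Node, int], List[Node]]:
    """Build V, the known labels (Y_L) and the nodes whose labels form Y_U.

    `known` maps each user with known language information to the set of
    target languages that user comprehends (assumed complete for that user).
    """
    nodes = list(product(users, languages))  # K * N correlation nodes
    labelled: Dict[Node, int] = {}
    unlabelled: List[Node] = []
    for user, language in nodes:
        if user in known:
            labelled[(user, language)] = 1 if language in known[user] else -1
        else:
            unlabelled.append((user, language))
    return nodes, labelled, unlabelled


# Illustrative data only.
users = ["u1", "u2"]
languages = ["Chinese", "French", "German", "Hindi", "Spanish"]
known = {"u1": {"French", "Spanish"}}
V, Y_L, unlabelled_nodes = build_correlation_nodes(users, languages, known)
```

A predictive function f would then assign a label in {+1, -1} to every node in `unlabelled_nodes`.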