2.6 Data
2.6.1 Usernames
The dataset contains partially encoded usernames of buyers. Each buyer's username is recorded as a string consisting of the first letter of the username, a number of asterisks (*) covering the middle part, the last letter, and then the current number of wins in brackets. For example, the entry “a***s(19)” means that the first letter of the username is “a”, the last is “s”, and the person has a total of 19 wins on eBay to date. If users did not buy any new product over the duration of data collection, this encoding would give almost 100% certainty that each distinct entry relates to a different person. However, the data collection took place over 44 days, so additional purchases may have been made over that period. Moreover, some buyers can win more than one product over the dataset duration, or continue bidding in other auctions after winning a product, so reasonable increases in the number of won auctions are possible. This information is used in the algorithm to identify unique users. eBay usernames may contain lowercase letters, capital letters, numbers, and special characters, which include full stops, asterisks, underscores and dashes (see footnote 1).
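The entry format described above can be parsed with a short regular expression. The following is a minimal sketch, not the procedure actually used for the dataset: the function name `parse_entry` and the `MaskedEntry` container are hypothetical, and it assumes every entry follows the pattern first character, one or more asterisks, last character, wins in brackets.

```python
import re
from typing import NamedTuple


class MaskedEntry(NamedTuple):
    user1: str   # masked username without the wins part, e.g. "a***s"
    first: str   # first character of the username
    last: str    # last character of the username
    wins: int    # total number of eBay wins at the time of the bid


# Assumed pattern: <first char><one or more '*'><last char>(<wins>)
_ENTRY_RE = re.compile(r"^(?P<first>.)(?P<mask>\*+)(?P<last>.)\((?P<wins>\d+)\)$")


def parse_entry(entry: str) -> MaskedEntry:
    """Split an encoded entry such as 'a***s(19)' into its components."""
    match = _ENTRY_RE.match(entry)
    if match is None:
        raise ValueError(f"unexpected entry format: {entry!r}")
    user1 = match["first"] + match["mask"] + match["last"]
    return MaskedEntry(user1, match["first"], match["last"], int(match["wins"]))
```

For the example from the text, `parse_entry("a***s(19)")` yields the masked username `a***s`, the characters `a` and `s`, and 19 wins.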
Usernames must be at least 6 characters long. A username is a unique identifier of a person, and one unique username is assigned automatically once a person registers on the website. It is later possible to change it to a preferred one. Given that most buyers stay with their randomly assigned username, which typically includes a mixture of letters, special characters and numbers, the first and last characters of a username are likely to form a unique combination. The upper bound on the number of combinations is calculated under the assumption that the characters forming a username are assigned at random. Whatever the total length of the username, knowing only the first and last characters gives the number of permutations with repetition: $66^2 = 4356$ (66 is the total number of possible characters: 26 lowercase letters, 26 capital letters, 10 digits and 4 special characters), which means that the probability of randomly picking two identical pairs of characters is $1/66^2 = 2.296 \times 10^{-4}$. In the dataset there are multiple observations for each person (multiple bids). Additionally, a person can buy as many objects as they wish on eBay. Given that the probability that another person's username has the same first and last characters is very low, of magnitude $10^{-4}$, and that the same user places multiple bids, the probability that any two entries with the same first and last characters belong to the same person is high (the upper bound being $1 - 2.296 \times 10^{-4}$). The first and last characters might not be random, but $2.296 \times 10^{-4}$ nevertheless gives a lower bound on the probability that two usernames share the same first and last characters, which, even if it is in fact higher, is a very small number, close to zero. This shows that it is very unlikely that there will be two people with the same first and last character of the username in the dataset.

1 Additional restrictions include (citation from the eBay website): “User IDs can’t contain: any characters except letters, numbers, full stops, asterisks, underscores or dashes; elements that imply an email address or web address, including but not limited to .com, .net, .org, .edu or any variation (for example, com or -com), although your user ID can contain an element of an email address or web address that identifies you or your brand (for example, if your web address is xyz.co.uk you can use xyz as an element of your user ID); consecutive underscores; an underscore, hyphen or full stop at the beginning or end of a user ID (for example, -cardcollector); the word ’eBay’; the letter ’e’ followed by numbers; obscene or profane words that breach our profanity policy; the same user ID as another member; a user ID that is similar to the name of an eBay Shop; a term that could be confused with someone else’s trademark or brand (for example, ’CocaColaSeller’); a term that may reasonably mislead another user into thinking that the account is held by a law enforcement agency or other regulatory authority (for example, Trading Standards UK).”
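The arithmetic behind these figures can be checked in a few lines. This is only a sketch under the same assumption as in the text, namely that the first and last characters are drawn independently and uniformly from the 66 admissible characters; the variable names are illustrative.

```python
import string

# Admissible characters as listed in the text: lowercase and capital letters,
# digits, and four special characters (full stop, asterisk, underscore, dash).
alphabet = string.ascii_lowercase + string.ascii_uppercase + string.digits + ".*_-"

n_chars = len(alphabet)   # 66
n_pairs = n_chars ** 2    # ordered (first, last) pairs with repetition: 4356
p_match = 1 / n_pairs     # chance that a random pair equals a given pair

print(n_chars, n_pairs, f"{p_match:.3e}")
```

Since eBay does not allow an underscore, hyphen or full stop at the beginning or end of a user ID, the true number of admissible pairs is smaller than 4356, which is consistent with treating 4356 as an upper bound.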
An additional piece of information is the number of total wins on eBay. This gives an additional way to distinguish users when more than one username has the same first and last character. The two extreme cases are: 1) treating users as the same whenever the first and last letters are the same, or 2) treating them as the same only when the first and last letters as well as the number of wins are the same. An alternative approach is to treat entries as the same user when the first and last letters are the same, subject to restrictions on what counts as a reasonable change in the number of total wins: for example, entries are split into different users if the total wins decrease over time, or increase too fast to be plausible.
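The two extreme cases amount to counting distinct values under two different grouping keys. The sketch below uses made-up entries purely for illustration; they are not taken from the dataset.

```python
# Hypothetical (masked username, total wins) pairs, one per observed bid.
entries = [("a***s", 19), ("a***s", 20), ("a***s", 19), ("b***k", 3)]

# Case 1: same user whenever the masked username matches.
users_by_name = {name for name, _wins in entries}

# Case 2: same user only if masked username AND number of wins match.
users_by_name_and_wins = set(entries)

print(len(users_by_name), len(users_by_name_and_wins))
```

Case 1 can only undercount and case 2 can only overcount relative to any rule that merges entries sharing a masked username, which is why they serve as the two extremes.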
Figure 2.1: Two methods of distinguishing increases in number of wins.
The user variable is split into user1 (containing only the username part) and numbEbayWins (containing the number of total eBay wins). The total numbers of users identified through the different methods are shown in Table 2.1. The table gives the description of each method, the variable that corresponds to it, and the resulting number of distinct usernames. user is the default variable containing both the encoded username and the number of total wins; user1 is the variable created by splitting the user variable, containing only the username part of it. The other variables are created by imposing additional constraints on the increases and decreases in the number of total wins over time. This was done by first sorting the observations so as to create an increasing ranking in the number of wins over time for each pair of first and last letters. There are two ways to approach this ranking, and the results are slightly different (Figure 2.1). The observations for each user can be sorted first by time and then by number of wins, or first by number of wins and then by time. I have decided on the second approach, reflecting the belief that smaller increases in the number of wins are more likely to refer to the same person. The difference between the two methods is illustrated with an example in Figure 2.1.
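The two sort orders can be illustrated on a toy example. The observations below are hypothetical (time, wins) pairs for a single masked username, and labelling the time-first order as Method 1 is an inference from the text, which only names Method 2 explicitly.

```python
# Hypothetical observations for one masked username: (time in hours, total wins).
obs = [(1.0, 19), (2.0, 5), (3.0, 20), (4.0, 6)]

# Method 1 (assumed label): sort by time first, then by number of wins.
method1 = sorted(obs, key=lambda o: (o[0], o[1]))

# Method 2 (used in the text): sort by number of wins first, then by time.
method2 = sorted(obs, key=lambda o: (o[1], o[0]))

print(method1)
print(method2)
```

Under Method 2 the entries with 5 and 6 wins end up adjacent, as do those with 19 and 20 wins, which reflects the stated belief that small increases in wins are more likely to belong to the same person.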
Table 2.1: Number of users when different rules of user identification are applied

| Variable | Identification rule | Total users |
|---|---|---|
| user1 | username | 1156 |
| user | username + number of wins match | 6123 |
| user1NonD | username + number of wins not decreasing | 3994 |
| userCoded1 | username + number of wins not decreasing + ≤10 wins per hour | 4075 |
| user15 | username + number of wins not decreasing + ≤5 wins per hour | 3995 |
| user1M | username + ≤1 win in the first 1.2 min + ≤5 wins per hour | 3997 |

I am using Method 2, that is, sorting the users by the total number of wins first, and then by time. The variable user1NonD is then created by requiring that, for each username, if the number of wins increases but the time decreases, the following entry starts a new user; that is, Method 2 from Figure 2.1 is applied without any additional restrictions on the increases in the number of wins. The variable userCoded1 is created such that the difference in the number of wins is at most 10 if the time difference is below 24 hours, and at most 10 wins per hour if the time difference is greater than 24 hours, in addition to what is required for user1NonD. The next identifying variable is user15, which allows an increase of no more than 10 wins for any time difference of less than 5 hours, and an increase of no more than 5 per hour for a time difference larger than 5 hours. The restrictions on the increases are arbitrary. Nevertheless, wins for the same user are often close to each other in time (for example within a 5 or 10 hour period), which is the reason for an initial period, set to a given number of hours, that allows for faster increases. Otherwise, if a simple ratio per hour were used, an increase of 1 within a one-minute difference would not be allowed, and this can easily occur if someone wins and then immediately places another bid which wins again. The last variable, user1M, is created by requiring an increase no larger than one in the first 1.2 minutes, with the other conditions remaining the same as for user15. The difficulty lies in choosing the best method of user identification: as can be seen, the upper bound on the number of users is 6123 (user), while the lower bound is 1156 (user1), obtained when any increase in the number of wins is allowed. Different ways of restricting these increases lead to different numbers of resulting users in the dataset.
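The splitting rules described above can be sketched as follows. This is an illustrative reading of the text rather than the code used to build the variables: `split_users` and `make_rate_rule` are hypothetical names, times are assumed to be measured in hours, each entry is compared only with the entry immediately before it in the Method 2 ordering, and "at most k wins per hour" is interpreted as a win increase of at most k times the elapsed hours. The extra condition used for user1M is not covered.

```python
from typing import Callable, List, Optional, Tuple

Obs = Tuple[float, int]               # (time in hours, total wins)
Rule = Callable[[float, int], bool]   # (time gap, win increase) -> allowed?


def make_rate_rule(initial_hours: float, initial_cap: int,
                   rate_per_hour: float) -> Rule:
    """Allow up to `initial_cap` extra wins within `initial_hours`,
    and up to `rate_per_hour` wins per elapsed hour after that."""
    def rule(dt_hours: float, d_wins: int) -> bool:
        if dt_hours < initial_hours:
            return d_wins <= initial_cap
        return d_wins <= rate_per_hour * dt_hours
    return rule


def split_users(obs: List[Obs],
                rule: Optional[Rule] = None) -> List[Tuple[float, int, int]]:
    """Label the observations of ONE masked username with user ids.

    Observations are sorted by wins, then time (Method 2). A new user starts
    when time goes backwards while wins go up, or when `rule` rejects the
    increase. With rule=None this mimics the user1NonD idea.
    """
    ordered = sorted(obs, key=lambda o: (o[1], o[0]))
    labelled = []
    user_id = 0
    prev: Optional[Obs] = None
    for time, wins in ordered:
        if prev is not None:
            dt, dw = time - prev[0], wins - prev[1]
            if dt < 0 or (rule is not None and not rule(dt, dw)):
                user_id += 1
        labelled.append((time, wins, user_id))
        prev = (time, wins)
    return labelled


# Parameter choices read off the text (an interpretation, not a specification):
user_coded1_rule = make_rate_rule(initial_hours=24.0, initial_cap=10, rate_per_hour=10.0)
user15_rule = make_rate_rule(initial_hours=5.0, initial_cap=10, rate_per_hour=5.0)
```

For example, `split_users([(1.0, 19), (2.0, 5), (3.0, 20), (4.0, 6)])` assigns the entries with 5 and 6 wins to one user and those with 19 and 20 wins to another, because the step from 6 wins at hour 4 to 19 wins at hour 1 goes backwards in time.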
It is not clear which method should be used, although identification by user can be ruled out, since some increases in the number of total wins should be allowed. Decreases in the number of total wins cannot take place, so, conditional on there being no mistakes in recording the data, user1NonD is the lower bound on the number of bidders in the dataset. Given that
typos, which could slightly influence the decreasing number of wins in time, but this is certainly a marginal problem. As can be seen, applying some restrictions on the increases in the number of wins, as in user15 or user1M, does not change the number of bidders by much (only +1 or +3 relative to user1NonD), so using this variable seems reasonable and practically equivalent to the other two methods. The user1NonD method has been chosen, and the variable which distinguishes between bidders has been named user. This is the one mentioned in Table 4.1 under this name. This variable has been used for creating all other variables, statistics and data analysis based on user identification.