Chapter 4 Using Transactions Data to Improve Consumer Re-
C.1 Data Set Construction
The original Ni et al. (2012) data set contains six transaction types, namely: product purchase, product return, service contract purchase, service contract return, sales discount, and miscellaneous transaction (Table 1A in Ni et al., 2012). As our proposed forecasting framework requires only purchase and return timestamps for each transaction as data input, we are able to use most of the transactions in Ni et al. (2012). We create the data set used in the current study through a few screening procedures. Our goal is to subset the products that are suitable for our analysis and make sure each transaction has the correct purchase and return timestamps. The amount of data reduction for each screening step is presented in Table C.1. We elaborate the screening process as follows.
Table C.1: Data Screening Process
Data Screening Step Data Reduction Sample Size
Ni et al. (2012) data set 173,262
Remove service, miscellaneous, and on-line transactions 20,035 153,227 Matching returns and discounts with purchases 27,800 125,427
Product categorization 27,009 98,418
Remove appliances and information goods 12,693 85,725
First, we remove the two service transaction types as well as the miscellaneous ones, since it is unclear what the return policy is for the service contracts. In addition, many service contracts were returned along with the product they were attached to. We exclude the “miscellaneous” transactions, because they do not have product descriptions. On-line transactions were a very small fraction (1.9%) of this retailer’s sales when the data was gathered. As a result, we are not in a position to investigate differences between on-line
and in-store return behavior, and hence the on-line transactions are eliminated as well. Second, the remaining three types of observations in Ni et al. (2012) are product pur- chase, return, and discounts. We match the 14,707 product returns with their corresponding purchase observations, which is necessary because a purchase contains information about whether it was returned but does not show when the return happened. In other words, if a consumer purchases an item and then later returns it, the purchase and return timestamps are entered in two separate observations, which need to be linked for our analysis. Similarly, if a purchased item has a discount, the amount of discount is entered as a separate obser- vation. To obtain the actual price paid for the discounted items, we matched the purchases with the discounts. Next, we describe how the matching is carried out. The purchase-return matching is not straightforward since the original data does not have an identifier variable which directly links a purchase observation with its return observation. We resolve this by sorting the data on certain existing variables that will make the purchase return link. Specifically, we sorted the data by three variables – original ticket number, return indicator, and transaction type, in descending priority. See Ni et al. (2012) for variable definitions. Table C.2 contains an example of the outcomes of this sorting method. Note that if a pur- chase is returned, both the purchase and the return observation have the return indicator set to “Y”. The rationale for this sorting is straightforward – if a purchase is returned, its return observation is now right below the purchase observation. We make two additional checks for each matched purchase-return pair: return comes after purchase, and they have the same price. This procedure is able to match 12,783 (87%) return observations with their purchase counterparts. We also matched the 1,447 observations of purchase discounts with their associated purchases. The matching strategy for the purchase-discount pairs is similar to the purchase-return case, using the original ticket number, product ID, and transaction type as the sorting variables. Note that there is no indicator variable to tell whether a purchase observation is associated with a discount observation. Therefore, we resort to product ID for matching. Again, an example is provided in Table C.2. Although the product ID variable is missing in 24% of the observations, this set of sorting variables matched 1,221 (84%) of the discount observations with their associated purchases. To sum-
Table C.2: Examples of Matched Purchase-Return and Purchase-Discount Pairs
Original Ticket Number Product ID Return Indicator Transaction Type Unit Price Date (purchase-return matching example)
Purchase 86702346281 538012 Y 1 39.99 Feb 16 2000 Return 86702346281 538012 Y 2 -39.99 Feb 17 2000
(purchase-discount matching example)
Purchase 373201712537 808789 N 1 179.99 Dec 15 2012 Discount 373201712537 808789 N 5 -4.99 Dec 15 2012
marize, the reduction of 27,800 observations in this screening step is due to three reasons: merging of returns and discounts into purchases, missing product IDs, and missing return indicators.
Third, we categorize products so that a separate analysis could be conducted for each category. Note that this data set is not appropriate for product-specific analysis, since a product on average has one return per 23 days. Our categorization is a fine-tuning on the original categorization in Ni et al. (2012), which blends the accessories for a product along with the product itself into the same category. For example, many products in the TV category are in fact TV stands and cables. Table C.3 contains details of the categorization process. Some transactions have either no or uninformative product descriptions. Along with the transactions that do not fit into any of the categories in Table C.3, we have left 27,009 transactions uncategorized, which are removed in this screening step.
Last, we ensure that each category has a fair number of returns. The “appliances” category has 58 returns in total, which makes it inappropriate for analysis. The “information goods” category has a very low return rate (4.5%) and a very short return lag (10 days). In addition, the common return policy for information goods is to not accept open-box items for refund, which is distinct from other products. Therefore, we exclude appliances and information goods from our final sample for the forecasting analysis.