IV. Data-Driven Dynamic Pricing
4.2 Model Formulation and Preliminaries
A firm sells a product over a planning horizon of T periods. At the beginning of each period t, the firm sets a selling price pt ∈ P = [pl, ph] and determines a replenishment decision, the order-up-to level, yt ∈ Y = {yl, yl + 1, . . . , yh}, t = 1, . . . , T.
During period t, a random number of customers Dt(pt, z) arrive, where z ∈ Z is a parameter vector and Z is a compact and convex set. Suppose Dt(·, ·) takes integer values in D, which ranges from dl ≥ 0 to dh ≥ dl (dh may be infinite), and the average demand E[Dt(p, z)] at the true value z is positive at any price p ∈ P. Realized demands are satisfied as much as possible by on-hand inventory, and unsatisfied demands are lost. We consider the scenario with censored demand, i.e., the firm only observes sales data min{Dt(pt, z), yt} in period t, but not the actual demand. The cost structure includes a unit holding cost h and a unit shortage cost b; the inventory ordering cost is normalized to zero. Suppose the inventory replenishment lead time is zero. The objective of the firm is to dynamically determine its pricing and inventory replenishment decisions in each period to maximize the expected total profit.
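The within-period accounting described above can be sketched in a few lines. This is only an illustration: the function name `simulate_period` and its argument order are hypothetical, and the effective stock level is taken to be max{yt, xt}, as in the profit expression (4.1).

```python
def simulate_period(x, y, demand, p, h, b):
    """One period of the lost-sales system with zero lead time.

    x: inventory before ordering, y: order-up-to level,
    demand: realized demand (not seen by the firm when censored),
    p: selling price, h: unit holding cost, b: unit shortage cost.
    Returns (sales, next_inventory, profit, censored).
    """
    on_hand = max(x, y)           # the firm cannot order below its current stock
    sales = min(demand, on_hand)  # the only demand signal the firm observes
    leftover = on_hand - sales    # carried into the next period
    lost = demand - sales         # unmet demand is lost and unobserved
    profit = p * sales - h * leftover - b * lost
    censored = demand >= on_hand  # sales hit the stock level, so demand is censored
    return sales, leftover, profit, censored
```

Note that p * sales − b * lost equals p·D − (b + p)(D − y)+ whenever demand exceeds stock, which is why the shortage term in the profit function carries the coefficient (b + p).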
The demand model described above is a parametric model, i.e., for a given p ∈ P, the firm knows the probability mass function f(·; p, z) of Dt(p, z) up to the unknown parameter vector z. Assume f(·; p, z) is differentiable with respect to z. Clearly, if the firm knew the value of z, then this would be a standard dynamic joint pricing and inventory control problem, which has been extensively studied in the literature. However, in our setting, the firm does not know the parameter vector z, so it has to learn about demand from past sales data, which are obtained through price and ordering experimentation. Furthermore, this chapter is concerned with the case in which a business constraint prevents the firm from conducting extensive price experimentation. Thus, the firm is subject to a constraint on the number of times it can change its selling price.
The firm therefore needs to extract demand information from sales data while satisfying the constraint on the number of price changes, and exploit the extracted information to maximize its expected total profit.
Remark 1. In the subsequent analysis, we will focus on the case in which the selling price is continuous and the demand and order quantities are discrete. However, we point out that the results, as well as all analyses, carry over to the case with continuous demands and ordering quantities, i.e., Dt(pt, z) is a continuous random variable and Y = [yl, yh] ⊂ R+.
The Complete Information Problem. Let xt denote the inventory level at the beginning of period t before the replenishment decision, and suppose the initial inventory level is x1 = 0. Given a pricing and inventory policy φ = ((p1, y1), (p2, y2), . . . , (pT, yT)), the total expected profit over the planning horizon is
\[
V^{\phi}(T) = \sum_{t=1}^{T} \Big( p_t\,\mathbb{E}[D_t(p_t, z)] - h\,\mathbb{E}\big[\max\{y_t, x_t\} - D_t(p_t, z)\big]^{+} - (b + p_t)\,\mathbb{E}\big[D_t(p_t, z) - \max\{y_t, x_t\}\big]^{+} \Big). \tag{4.1}
\]
If the firm knows the parameters z and thus also the distribution of Dt a priori,
then dynamic programming can be used to compute the optimal pricing and inventory replenishment decisions. In that case, and if in addition there is no constraint on the number of price changes, then it is known (see e.g. Sobel (1981)) that a myopic policy is optimal for problem (4.1). Let G(p, y, z) denote the single-period profit function, i.e.,
\[
G(p, y, z) = p\,\mathbb{E}[D(p, z)] - h\,\mathbb{E}[y - D(p, z)]^{+} - (b + p)\,\mathbb{E}[D(p, z) - y]^{+}, \tag{4.2}
\]
where D(p, z) is the generic random demand when the true parameter is z and the selling price is p, and suppose G(·, ·, z) has a unique maximizer (p∗, y∗) on P × Y. Then the optimal strategy φ∗ for the firm is to order up to y∗ and set the price at p∗ in each period.
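To make the single-period profit function concrete, the sketch below evaluates G(p, y, z) exactly for one illustrative demand family and searches a grid for the myopic decision. The Poisson demand with linear mean z[0] − z[1]·p is an assumption made only for this example (the chapter allows a general parametric family), and the function names are hypothetical. The price grid merely approximates the continuous set P.

```python
import math

def poisson_pmf(d, lam):
    """P(D = d) for Poisson demand with mean lam > 0."""
    return math.exp(-lam + d * math.log(lam) - math.lgamma(d + 1))

def single_period_profit(p, y, z, h, b):
    """G(p, y, z) from (4.2) under the ASSUMED mean demand z[0] - z[1] * p."""
    lam = z[0] - z[1] * p  # must be positive on the price range used
    # E[y - D]^+ is a finite sum over d = 0, ..., y.
    overage = sum((y - d) * poisson_pmf(d, lam) for d in range(y + 1))
    # E[D - y]^+ = E[D] - y + E[y - D]^+, since D - y = (D - y)^+ - (y - D)^+.
    underage = lam - y + overage
    return p * lam - h * overage - (b + p) * underage

def myopic_decision(z, h, b, prices, levels):
    """Grid search for the maximizer of G over a price grid and integer levels."""
    return max(((p, y) for p in prices for y in levels),
               key=lambda py: single_period_profit(py[0], py[1], z, h, b))
```

For instance, `myopic_decision((10.0, 1.0), 1.0, 3.0, [1.0 + 0.5 * i for i in range(15)], range(0, 21))` returns the best grid point for P = [1, 8] and Y = {0, . . . , 20} under these assumed parameters.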
Definition of Regret. In our setting, the firm does not know the parameter vector z a priori, so it needs to develop an adaptive policy φ which determines the selling price pt and replenishment level yt for each period t based on historical information, i.e., past selling prices, order-up-to levels, and sales data, subject to the constraint on the number of price changes. To measure the performance of a policy φ, we define the regret as the total profit loss of policy φ compared with that of the optimal policy φ∗ when complete information is available and there is no constraint on the number of price changes. That is,
Rφ(T ) = Vφ∗(T ) − Vφ(T ).
It is clear that Rφ(T) ≥ 0, and the smaller the regret, the better policy φ performs.
The Traditional Maximum Likelihood Estimation. To estimate the unknown parameters z of a distribution, a commonly used method is maximum likelihood estimation (MLE). For 1 ≤ t1 ≤ t2 < ∞, let {pt1, pt1+1, . . . , pt2} be a sequence of given prices for periods {t1, t1 + 1, . . . , t2}. If the corresponding realized demands {dt1, dt1+1, . . . , dt2} can be observed and there is no censored data, then an estimate of z can be computed using the standard MLE given by
\[
\hat{z} = \arg\max_{z \in Z} \prod_{t=t_1}^{t_2} f(d_t; p_t, z). \tag{4.3}
\]
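As an illustration of (4.3) with fully observed demand, the following sketch maximizes the log of the product over a finite set of candidate parameter vectors. The Poisson family with mean z[0] − z[1]·p and the grid search over candidates are assumptions of this example only. A continuous optimizer over the compact set Z could be substituted.

```python
import math

def poisson_log_pmf(d, lam):
    """log P(D = d) for Poisson demand with mean lam > 0."""
    return -lam + d * math.log(lam) - math.lgamma(d + 1)

def log_likelihood(z, prices, demands):
    """Log of the product in (4.3) under the ASSUMED mean z[0] - z[1] * p."""
    return sum(poisson_log_pmf(d, z[0] - z[1] * p)
               for p, d in zip(prices, demands))

def mle_grid(prices, demands, candidates):
    """Return the candidate z with the largest log-likelihood."""
    return max(candidates, key=lambda z: log_likelihood(z, prices, demands))
```

Maximizing the log-likelihood rather than the raw product gives the same maximizer and avoids numerical underflow when t2 − t1 is large.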
In our setting, however, the traditional MLE will not work due to censored demand data. Indeed, the true demand dt is observed only when dt < yt. If dt ≥ yt, then the firm observes the sales quantity yt with the implication that the demand dt is no less than yt. Therefore, the likelihood function (4.3) cannot be applied under censored demand data.
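For comparison, the textbook way to account for this kind of right-censoring is to replace the probability mass of an unobserved demand by the tail probability P(D ≥ yt) whenever sales equal the stock level. The sketch below shows that standard construction under an assumed Poisson demand with mean z[0] − z[1]·p. It is offered only as an illustration and is not necessarily the estimator this chapter goes on to develop.

```python
import math

def poisson_pmf(d, lam):
    """P(D = d) for Poisson demand with mean lam > 0."""
    return math.exp(-lam + d * math.log(lam) - math.lgamma(d + 1))

def censored_log_likelihood(z, prices, levels, sales):
    """Log-likelihood of sales data min{d_t, y_t} under the ASSUMED Poisson model."""
    total = 0.0
    for p, y, s in zip(prices, levels, sales):
        lam = z[0] - z[1] * p
        if s < y:
            # No stockout: the sale equals the true demand.
            total += math.log(poisson_pmf(s, lam))
        else:
            # Stockout: only the event {D >= y} is known.
            tail = 1.0 - sum(poisson_pmf(d, lam) for d in range(y))
            total += math.log(max(tail, 1e-300))  # guard against rounding to zero
    return total
```

When no period stocks out, every observation takes the first branch and the function reduces to the logarithm of the product in (4.3).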
possesses the desired convergence rate under the mean-squared error measure.
A Technical Result. We next develop an upper bound for regret from estimation error in a general setting, which will be used in our subsequent analysis. Suppose that a firm maximizes an objective function H(p, y, z) over decision variables p and y without knowing the values of underlying parameters z a priori, where z ∈ Z ⊂ ℝ^{r3} for some integer r3 ≥ 1. The objective function may be multimodal, and the decision variables satisfy p ∈ P ⊂ ℝ^{r1} for some integer r1 ≥ 1 and y ∈ Y ⊂ ℤ^{r2} for some integer r2 ≥ 1. The firm learns an estimate ẑ of z through noisy observations during the decision process. We impose the following regularity conditions.
Assumption A (Regularity Conditions).
i) There is a unique global maximizer of H(p, y, z) on P × Y, denoted by (p∗(z), y∗(z)), i.e.,
\[
(p^{*}(z), y^{*}(z)) = \arg\max_{p \in P,\, y \in Y} H(p, y, z),
\]
and it falls into the interior of P × Y.
ii) For any y ∈ Y, H(p, y, z) is twice differentiable with respect to p ∈ P with bounded second order derivatives.
iii) H(p, y, z) satisfies the Lipschitz condition on P × Y, i.e., there exists some constant K1 > 0 such that ‖H(p1, y1, z) − H(p2, y2, z)‖ ≤ K1(‖p1 − p2‖ + ‖y1 − y2‖) for any p1, p2 ∈ P and y1, y2 ∈ Y.
iv) p∗(z) is locally Lipschitz on P at the true underlying parameter z. That is, there exist constants δ > 0 and K2 > 0 such that when ‖z′ − z‖ < δ, we have ‖p∗(z′) − p∗(z)‖ ≤ K2‖z′ − z‖.
v) If z is the true underlying parameter, then there exists a constant δ > 0 such that when ‖z′ − z‖ < δ, we have y∗(z′) = y∗(z). (If y is continuous, then there exist constants δ > 0 and K3 > 0 such that when ‖z′ − z‖ < δ, we have ‖y∗(z′) − y∗(z)‖ ≤ K3‖z′ − z‖.)
Under these assumptions, we have the following basic result.
Theorem IV.1. (Regret from Estimation Error). Suppose ẑ is an estimator of z using c data points, and for any ε > 0 it satisfies
\[
\mathbb{P}\big(\|\hat{z} - z\| \ge \epsilon\big) \le K_4 e^{-c K_5 \epsilon^{2}} \tag{4.4}
\]
for some constants K4 > 0 and K5 > 0. Then, there exists a positive constant K6 such that
\[
H(p^{*}(z), y^{*}(z), z) - \mathbb{E}\big[H(p^{*}(\hat{z}), y^{*}(\hat{z}), z)\big] \le \frac{K_6}{c}.
\]
This theorem will play an important role in proving the main results in this chapter. Its proof is provided in Appendix A.
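A short calculation suggests why a rate of order 1/c is plausible. It is a heuristic reading of the theorem rather than the proof in Appendix A. Integrating the tail bound (4.4) gives a mean-squared error of order 1/c for the estimator:

```latex
\[
\mathbb{E}\,\|\hat{z} - z\|^{2}
  = \int_{0}^{\infty} \mathbb{P}\big(\|\hat{z} - z\|^{2} \ge s\big)\, ds
  = \int_{0}^{\infty} \mathbb{P}\big(\|\hat{z} - z\| \ge \sqrt{s}\big)\, ds
  \le \int_{0}^{\infty} K_4 e^{-c K_5 s}\, ds
  = \frac{K_4}{c K_5}.
\]
```

Since the maximizer is interior and H has bounded second-order derivatives in p (Assumption A, parts i and ii), the loss in H near the optimum is at most quadratic in the pricing error, and by the local Lipschitz condition (part iv) that error is at most proportional to the estimation error when the latter is small. Combining these observations with the bound above is consistent with a profit loss of order 1/c.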