Binary Linear Codes with Optimal Scaling and Quasi-Linear Complexity

(1)

Binary Linear Codes with Optimal Scaling

and Quasi-Linear Complexity

Arman Fazeli

University of California San Diego 9500 Gilman Drive, La Jolla, CA 92093

[email protected] Hamed Hassani University of Pennsylvania 3101 Walnut St, Philadelphia, PA 19104 [email protected] Marco Mondelli Stanford University 350 Serra Mall, Stanford, CA 94305

[email protected]

Alexander Vardy

University of California San Diego 9500 Gilman Drive, La Jolla, CA 92093

[email protected]

November 3, 2017

Abstract

We present the first family of binary codes that attains optimal scaling and quasi-linear complexity, at least for the binary erasure channel (BEC). In other words, for any fixed δ>0, we provide codes that ensure re-liable communication at rates within ε > 0 of the Shannon capacity with block length n = O(1/ε2+δ), construction complexityΘ(n), and encoding/decoding complexityΘ(n log n). Furthermore, this scaling between the gap to capacity and the block length is optimal in an information-theoretic sense.

Our proof is based on the construction and analysis of binary polar codes obtained from large kernels. It was recently shown that, for all binary-input symmetric memoryless channels, conventional polar codes (based on a2×2 kernel) allow reliable communication at rates within ε>0 of the Shannon capacity with block length, construction, encoding and decoding complexity all bounded by a polynomial in1/ε. In par-ticular, this means that the block lengthn scales as O(1/εµ₎_{, where µ is referred to as the scaling exponent.} It is furthermore known that the optimal scaling exponent is µ = 2, and it is achieved by random linear codes. However, for general channels, the decoding complexity of random linear codes is exponential in the block length. As far as conventional polar codes, their scaling exponent depends on the channel, and for the BEC it is given by µ=3.63. This falls far short of the optimal scaling guaranteed by random codes.

(2)

1. Introduction

Shannon’s coding theorem implies that for every binary-input memoryless symmetric (BMS) channelW, there is a capacityI(W)such that the following holds: for all ε>0 and Pe>0, there exists a binary code of rate at

leastI(W)₋ε which enables communication over W with probability of error at most Pe. Ever since the

pub-lication of Shannon’s famous paper [48], the holy grail of coding theory was to find explicit codes that achieve Shannon capacity with polynomial-time complexity of construction and decoding. Today, several such families of codes are known, and the principal remaining challenge is to characterize how fast we can approach capacity as a function of the code block lengthn. Specifically, we now have explicit binary codes (which can be con-structed and decoded in polynomial time) of lengthn and rate R, such that the gap to capacity ε= I(W)₋R required to achieve any fixed error probabilityPe>0 vanishes as a function of n. The fundamental theoretical

problem is to characterize how fast this happens. Equivalently, we can fix ε = I(W)₋R and ask how large does the block lengthn need to be as a function of ε. That is, we are interested in the scaling between the block length and the gap to capacity, under the constraint of polynomial-time construction and decoding.

It is known that the optimal scaling is of the formn=O(1/εµ₎_{, where µ is referred to as the scaling}

expo-nent. It is furthermore known that the best possible scaling exponent is µ = 2, and it is achieved by random linear codes — although, of course, random codes do not admit efficient decoding. In this paper, we present the first family of binary codes that attains both optimal scaling and quasi-linear complexity on the the binary erasure channel (BEC). Specifically, for any fixed δ>0, we exhibit codes that ensure reliable communication on the BEC at rates within ε > 0 of the Shannon capacity, with block length n = O(1/ε2+δ₎_{, construction}

complexityΘ(n), and encoding/decoding complexityΘ(n log n).

To establish this result, we use polar codes, invented by Arıkan [1] in 2009. However, while Arıkan’s polar codes are based upon a specific2×2 kernel, we use`_{× `}binary polarization kernels, where`is a sufficiently large constant. The main technical challenge is to prove that this construction works. To this end, we choose the polarization kernel uniformly at random from the set of all`_{× `}nonsingular binary matrices, and show that with probability at least1−O(1/`), the resulting scaling exponent is at most2+δ. Since`is a constant

that depends only on δ, the choice of a polarization kernel can be, in principle, de-randomized in time O(2`), without affecting the complexity of construction or the encoding/decoding complexity of the resulting code.

1.1. Background and context

A sequence of papers starting with [4, 49] in 1960s and culminating with [15, 46] shows that for any discrete memoryless channelW and any code of length n and rate R that achieves error-probability PeonW, we have

I(W)₋R > const√(Pe, W) n − O log n n (1) where the constant (which is given explicitly in [46]) depends onW and Pe, but not onn. This immediately

implies that ifn = O(1/εµ₎_{, where ε} ₌ _I₍_W₎₋_{R is the gap to capacity, then µ} _> _{2. We further note that}

expressions similar to (1) were derived from the perspective of threshold phenomena in [52] and from the per-spective of statistical physics in [43]. The fact that µ>2 also follows from a simple heuristic argument. For simplicity, consider the special case of transmission over the BEC with erasure probabilityp. As n→ ∞, the number of erasures will tend to the normal distribution with meannp and standard deviation pnp(1−p). Thus channel randomness yields a variation in the fraction of erasures of order1/√n. This implies that, in order to achieve a fixed error probability, the gap to capacity ε has to scale at least as1/√n.

(3)

probabilityp is equivalent to removing each column of this generator matrix independently with probability p. The probability of error (under maximum-likelihood decoding) is equal to the probability that such residual matrix is not full-rank. This probability is easy to compute, and the desired scaling result immediately follows. Unfortunately, random linear codes cannot be decoded efficiently. On general BMS channels, this task is NP-hard [3]. On the BEC, decoding a general binary linear code takes timeO(nω₎_{, where ω is the exponent of}

matrix multiplication. This leads to the following natural question: what is the lowest possible scaling exponent for binary codes that can be constructed, encoded, and decoded efficiently? For the BEC, we take efficiently to mean linear or quasi-linear complexity. Here is a brief survey of the current state of knowledge on this question.

Forney’s concatenated codes [7] are a classical example of a capacity-achieving family of codes. However, their construction and decoding complexity are exponential in the inverse gap to capacity1/ε (see [9] for more details), so they are definitely not efficient. In recent years, three new families of achieve capacity-achieving codes have been discovered; let us review what is known regarding their scaling exponents.

Polar codes: Achieve the capacity of any BMS channel under a successive-cancellation decoding algorithm [1] that runs in timeO(n log n). It was shown in [9] that the block length, construction complexity, and de-coding complexity are all bounded by a polynomial in1/ε, which implies that the scaling exponent µ is finite. Later, a sequence of papers [8, 11, 21, 42] provided rigorous upper and lower bounds on µ. The specific value of µ depends on the channel W. It is known that µ = 3.63 on the BEC. The best-known bounds valid for any BMS channelW are given by 3.5796µ64.714.

Spatially-coupled LDPC codes: Achieve the capacity of any BMS channel under a belief-propagation decod-ing algorithm [29] that runs in linear time. A simple heuristic argument yields that the scaldecod-ing exponent of these codes is roughly3 (see [38, Section VI-D]). However, a rigorous proof of this statement remains elusive and appears to be technically challenging.

Reed-Muller codes: Achieve capacity of the BEC under maximum-likelihood decoding [27, 28] that runs in timeO(nω₎_{. While it has been observed empirically that the performance of Reed-Muller codes on the}

BEC is close to that of random codes [39], no bounds on the scaling exponent of these codes are known.

1.2. Our main result: Binary linear codes with optimal scaling and quasi-linear complexity

Our main result is the first family of binary codes for transmission over the BEC that achieve optimal scaling between the gap to capacity ε and the block length n, and that can be constructed, encoded, and decoded in quasi-linear time. In other words, the block length, construction, encoding, and decoding complexity are all bounded by a polynomial in1/ε and, moreover, the degree of this polynomial approaches the information-theoretic lower bound µ>2. Somewhat informally (cf. Theorem 2), this result can be stated as follows. Theorem 1. Consider transmission over a binary erasure channel W with capacity I(W). FixPe∈ (0, 1)and

arbitrary δ>0. Then for all R< I(W), there exists a sequence of binary linear codes of rateR that guaran-tee error probability at mostPeon the channelW, and whose block length n satisfies

n 6 β

I(W)₋Rµ with µ 6 2+δ, (2)

where β = 1+2 P_e−0.013

is a universal constant. Moreover, the codes in this sequence have construction complexityΘ(n)and encoding/decoding complexity Θ(n log n).

Remark. In the statement of the theorem, the error probability is upper-bounded by a fixed constantPe.

(4)

To prove Theorem 1, we will show that there exist`_{× `} binary kernels, such that polar codes constructed from these kernels achieve capacity with a scaling exponent µ(`) that tends to the optimal value of2 as `

grows. In Section 3, we will furthermore provide an upper bound on the required value of ` as a function of δ (and nothing else). The claim regarding the construction and encoding/decoding complexities immedi-ately follows from known results on polar codes [1, 47, 50]. This is so since polar codes constructed from`_{× `}

binary kernels maintain the recursive structure of conventional polar codes, and thereby inherit construction complexityΘ(n)and encoding/decoding complexityΘ(n log n).

1.3. A primer on polar codes

Like many fundamental discoveries, polar codes are rooted in a simple and beautiful basic idea. Polarization is induced via a simple linear transformation consisting of many Kronecker products of a small binary matrixK, called the polarization kernel, with itself. Following [23], we assume herein thatK is an`_{× `} non-singular binary matrix, which cannot be transformed into an upper triangular matrix under any column permutations. In this framework, the conventional polar codes of Arıkan [1] correspond to the special case where

K = 1 0

1 1

LetW:_{0, 1_{} →} Y be a BMS channel, characterized in terms of its transition probabilities W(y_|x), for ally∈Y and x∈ {0, 1}. Further, letU= (U1, U2, . . . , Un)be a block ofn = `m bits chosen uniformly at

random from_{0, 1_}n. We encodeU as X = UK⊗m _{and transmit}_{X through n independent copies of W, as}

shown below:

To appear in the IEEE TRANSACTIONS ONINFORMATIONTHEORY, 2011 3

. . . . . . U1 U2 Un Y1 Y2 Yn X1 X2 Xn W W W

K

⊗m (3)

This, in a nutshell, is the polarization transformation of Arıkan [1]. To understand what polarization means in this context, let us consider a number of channels associated with this transformation. First, there is the chan-nelWn:_{0, 1_}n _→Ynthat corresponds ton independent uses of W. Arıkan [1] also introduces the channel W∗:_{0, 1_}n_→_Yn_{with transition probabilities given by}_W∗₍_y_|_u_{) =}_Wn _y

u K⊗m. Finally, Arıkan [1] fur-ther defines, for eachi∈ [n], the channelWi :{0, 1} →Yn×{0, 1}i−1that is “seen” by the bitUi, as follows:

Wi y, v|ui) def= 1 2n−1

∑

u∈{0,1}n−i W∗y(v, u_i, u) = 1 2n−1

∑

u∈{0,1}n−i Wny(v, u_i, u)K⊗m , (4)

where(_·,·)stands for vector concatenation. It is easy to show thatWi y, v|ui)is indeed the probability of the

event that(Y1, Y2, . . . , Yn) =y and(U1, U2, . . . , Ui−1) =v given the event Ui =ui.

The key observation of [1] is that, as n grows, the n bit-channels Wi defined in (4) start polarizing: they

approach either a noiseless channel or a useless channel. Formally, given a BMS channelW, its capacity I(W)

and Bhattacharyya parameterZ(W)are given by

(5)

Given δ∈ (0, 1), let us say that a bit-channelWiis δ-bad if Z(Wi) >1−δ and δ-good if Z(Wi) 6 δ. Then

the polarization theorem of Arıkan [1, Theorem 1] can be informally stated as follows.

Theorem (Polarization Theorem). For every δ∈ (0, 1), almost all bit-channels become either δ-good or δ-bad asn → ∞. In fact, as n → ∞, the fraction of δ-good bit-channels approaches the capacity I(W)of the un-derlying channelW, while the fraction of δ-bad bit-channels approaches 1₋I(W).

This theorem naturally leads to the construction of capacity-achieving polar codes, as long as δ is o(1/n). Specifically, an(n, k)polar code is constructed by selecting a set_Aofk good bit-channels to carry the infor-mation bits, while the input to all the other bit-channels is frozen to zeros. In practice, the code parametersk and δ are usually selected according to the target rate of the code and/or the desired probability of error.

Let us now focus on binary erasure channelsW, where the erasure probability z is equal to the Bhattacharyya parameterZ(W)defined in (5). It is easy to see that when the underlying channel is BEC(z), then for all i, the i-th bit-channel is also a BEC. Specifically, the bit-channel Wi is BEC(zm(i)), wherezm(i)is a polynomial of

degree at mostn in z. The proof of the polarization theorem then follows by studying the evolution of the Bhat-tacharyya parameterszm(i)asm grows. For a fixed kernel K, these n = `m Bhattacharyya parameterszm(i)

can be viewed as values of a random variableZminduced by the uniform distribution on the`mbit-channels.

For any given kernelK, the recursive construction of K⊗m_allows_{_Z

m}m∈Nto form a supermartingale:

Z0= z and Zm+1= fBm,K(Zm), for Bm ∼Uniform[`] (6)

where f1,K(z), f1,K(z), . . . , f`,K(z)are the erasure probabilities of the` bit-channels obtained after a single

step of polarization. We shall refer to the set fi,K(z) : i ∈ [`] as the polarization behavior of K. We will

show later in this paper that f1,K(z), f1,K(z), . . . , f`,K(z)are all polynomials of degree at most`inz.

One can view (6) as a stochastic process on an infinite binary tree, where in each step we take one of the

` available branches with uniform probability. The polarization theorem is then reduced to the martingale convergence theorem for supermartingales, which in this case implies thatlimm→∞Zm(1−Zm) = 0. This

shows that the erasure probability of the bit-channels polarizes to0 or 1 as m _→ ∞. The speed with which this polarization phenomenon take place is the determining factor in the decay rate of the gap to the capacity as a function of the block length`m. We elaborate on this in the next subsection.

1.4. On the rate of polarization in various regimes

The performance of polar codes has been analyzed in several regimes. In the error exponent regime, the rate R < I(W)is fixed, and it is studied how the error probabilityPescales as a function of the block lengthn.

In [2] it is proved that the error probability under SC decoding behaves roughly as2−√n. An even more refined scaling is proved in [12].

In the error floor regime, the code is fixed, i.e., the rateR and the block length n are fixed, and it is studied how the error probabilityPescales as a function of the channel parameter. In [5] it is proved that the stopping

distance of polar codes scales as√n, which implies good error floor performance under BP decoding. The authors of [5] also provide simulation results that show no sign of error floor for transmission over the BEC and over the binary-input AWGN channel. This conjecture is settled in [42], where it is showed that polar codes do not exhibit error floors for the transmission over any BMS channel.

The focus of this paper is on the scaling exponent regime, where the error probabilityPe is fixed, and it is

studied how the gap to capacity I(W)₋R scales as a function of the block length n. As mentioned earlier, ifn is O(1/(I(W)₋R)µ₎_{, then we say that the family of codes has scaling exponent µ. For polar codes,}

(6)

µ ≈ 3.627. In [9], it is shown that the block length, construction, encoding and decoding complexity are all

bounded by a polynomial in the inverse of the gap to capacity for the transmission over any BMS channel. This implies that there exists a finite scaling exponent µ. Rigorous bounds on µ are provided in [8,11,42]. In [11], it is proved that3.5796 µ66, and it is conjectured that the lower bound can be increased up to 3.627, i.e.,

up to the value heuristically computed for the BEC. In [8], the upper bound is refined to5.702. The current best upper bounds on the scaling exponent are provided in [42]: for any BMS channel, µ64.714; and for the special case of the BEC, µ63.639, which approaches the value obtained heuristically in [21]. As a side note, let us point out that the heuristic method of [21] is based on a “Scaling Assumption” that requires the existence of a particular limit. The results of [8, 11, 42], as well as the result presented in this paper, do not rely on such an assumption.

In [42], it is also proved that, by allowing a less favorable scaling between the gap to capacity and the block length (i.e., a larger scaling exponent), the error probability goes to0 sub-exponentially fast in n. This inter-mediate regime is referred to as moderate deviations regime. Here, neither the rate nor the error probability are fixed, and it is studied how the gap to capacityI(W)₋R and the error probability Pejointly scale as functions

of the block lengthn (see [42, Theorem 3]).

In a nutshell, the scaling exponent of Arıkan’s polar codes is around4 (its exact value depends on the trans-mission channel and it can be bounded as3.579 6 µ 6 4.714). On the contrary, random codes achieve the

optimal scaling exponent of2. This means that, in order to obtain the same gap to capacity, the block length of polar codes needs to be roughly the square of the block length of random codes. Hence, one natural question is how to improve the scaling exponent of polar codes.

One possible approach consists in acting on the decoding algorithm. In particular, the successive cancel-lation list decoder proposed in [51] empirically provides a significant performance improvement. However, in [41], it is proved a negative result for list decoders: the introduction of any finite list cannot improve the scaling exponent under MAP decoding for the transmission over any BMS channel. Furthermore, for the spe-cial case of the BEC, it is also proved that the scaling exponent under SC decoding does not change even if one is given a finite number of helps from a genie.

Another approach is to consider the polarization of kernels larger than the original2_×2 matrix proposed by Arıkan. Indeed, such kernels have the potential to improve the scaling behavior of polar codes. For the error exponent, in [23] it is proved that, as`goes large, the error probability scales roughly as2−n. For the scaling exponent, in [6] it is observed that µ can be reduced when` >8. In the recent paper [45], it is shown that, for the transmission over the erasure channel, the optimal scaling exponent µ =2 is approached by using a large kernel and a large alphabet. Furthermore, in [10], the author gives evidence supporting the conjecture that, in order to obtain µ = 2, it suffices to consider a large kernel over a binary alphabet. In this paper, we finally settle such a conjecture: we show that the scaling exponent µ(`) obtained from the polarization of an`_{× `}

kernel tends to2, as`goes large.

2. Outline of the Proof

The proof consists of three main steps. In the following, we describe the main ideas behind each of them. Step 1. As mentioned in Section 1.3, whenm→∞, almost all the bit-channels polarize, i.e., the process Zm

almost surely takes its value inside the set_{0, 1_}. In order to study the finite-length behavior of polar codes, we need to understand how fast the processZmpolarizes. In other words, given a (small) number ε> 0, how

fast does the quantityP{Zm ∈ [ε, 1−ε]}vanish withm? To answer this question, we first relate the decay

speed ofZm with some simpler quantity that can be directly computed from the kernel matrixK.

(7)

Figure 1: The figure on the left shows that fi,K(z)exhibits a sharp transition of order roughlyO(`−1/2), when

the kernelK is random. The two figures on the right compare different choices of the kernel K: the red curve corresponds to Arıkan’s kernel; the black curve to the kernelK16from [6]; and the blue curve is obtained by

taking the average of the functions fi,K(z)for a random kernel.

behavior of the processgα(Zm) = (Zm(1−Zm))αfor α>0. By Markov inequality, we have

P{Zm ∈ [ε, 1−ε]} 6

_ε 2

−α

E[gα(Zm)]. (7)

Furthermore, in order to boundE[gα(Zm)], we can write:

gα(Zm) = (fBm,K(Zm−1)(1− fBm,K(Zm−1))) α _{= (} Zm−1(1−Zm−1))α fBm,K(Zm)(1− fBm,K(Zm)) Zm−1(1−Zm−1) α =gα(Zm−1) fBm,K(Zm)(1− fBm,K(Zm)) Zm−1(1−Zm−1) α . Hence, after some simple calculations, we conclude that

E[gα(Zm)] 6 (λ∗α,K)m, (8) where λ∗_α,K = sup z∈(0,1) 1 ` `

∑

i=1 (fi,K(z)(1− fi,K(z)))α (z(1−z))α . (9)

Step 2. We show that the quantity λ∗_α,K is bounded above by `−1/2+5α with probability1−O(1/`) over the random choice ofK. To do so, we prove that, as`grows, the functions fi,Kwill behave as a step function

for most of the non-singular kernels. Note that, for anyi and for any K, fi,Kis an increasing polynomial with

fi,K(0) = 0 and fi,K(1) = 1. As`increases, we prove that, with probability1−O(1/`)over the choice of

K, fi,K(z) has a sharp threshold around the pointz = i/`. More precisely, forz 6 i/`−5 log`/ √

`, we have that fi,K(z) 6 `−(2+log`); and forz>i/` +5 log`/

√

`we have that fi,K(z) >1− `−(2+log`)(see also

Figure 1). Let us now go back to (9) and use this “sharpness” property of fi,K in order to upper bound λ∗α,K.

In the right-hand-side of (9), let us only evaluate the term inside the supremum forz = 1/2. By using the “sharpness” property, it is not hard to see that this term will be of order

`−1/2log` + `−α(2+log`)

(8)

for sufficiently large`. With some more effort, we can show a bound which is valid for anyz_{∈ (}0, 1)(and not only forz=1/2).

Step 3. We derive a finite-length scaling law for polar codes by using the results of the previous steps. From (7), (8), and the upper bound on λ∗_α,K, we conclude thatP_{Zm ∈ [ε, 1−ε]} = O ε−α(`−1/2+5α)m. Denote

the desired error probability byPeand let ε= Pe`−m. Then,

P{Zm ∈ [Pe`−m, 1−Pe`−m]} =O(`m/(2+δ)), (11)

where δ can be made arbitrarily small by choosing a small enough α. As the blocklength n is equal to `m, (11) implies that the gap to capacity is of ordern1/(2+δ)_{. By bounding also}P_{_Z

m >1−Pe`−m}, the desired

scaling result follows.

3. Main Result

As mentioned in the introduction, the main result of this paper is to provide a family of binary codes that achieves optimal scaling between gap to capacity and block length, as well as quasi-linear complexity of con-struction, encoding and decoding. This is done by showing that binary polar codes obtained from large kernels possess those properties.

Theorem 2 (Binary Polar Codes with Optimal Scaling and Quasi-Linear Complexity). Consider the transmis-sion over a BECW with capacity I(W). LetK_∈ F`₂×`be a kernel that is selected uniformly at random among all`_{× `}non-singular binary matrices. FixPe ∈ (0, 1)and letC`(n, R, Pe)be the code obtained by polarizing

K with block length n = `m for somem∈ N and rate R < I(W)such that the error probability under suc-cessive cancellation decoding is at mostPe. Fix a small constant δ >0. Then, there exists`0(δ)such that for

any` > `₀(δ), with high probability over the choice ofK, there is a codeC`(n, R, Pe)that satisfies

n6 ₍ β

I(W)₋R)µ, with µ 6 2+δ, (12)

where β is a constant given by (1+2 P_e−0.01)3. This code has construction complexityΘ(n), and encod-ing/decoding complexityΘ(n log n).

It also possible to show that`needs to be of orderexp(1/δ1.01), and additional details about this fact are provided at the end of the section. The theorem above follows from the following result that characterizes the behavior of the polarization process.

Theorem 3 (Optimal Scaling of Polarization Process). LetK ∈ F`_×`

2 be a kernel that is selected uniformly

at random among all`_{× `}non-singular binary matrices. LetZm be the random process defined in(6) with

initial conditionZ0 =z. Fix Pe∈ (0, 1)and a small constant δ>0. Then, there exists`0(δ)such that, with

high probability over the choice ofK, for any` > `₀(δ)and for anym>1

P{Zm 6Pe`−m} >1−z−c0`−

m

µ, (13)

with

µ(K) 62+δ, (14)

and wherec0is a constant given by1+2 Pe−0.01.

For the sake of clarity, note that in (13) the kernelK is fixed and the probability space is defined with respect to the random processZm, while (14) holds with high probability over the choice of the kernelK. We are now

(9)

Proof of Theorem 2. Consider the transmission over the BEC(z) of a polar code with block length n= `mand rateR obtained by polarizing the`_{× `}kernelK, where` > `₀(δ). By Theorem 3, there is at least a fraction

1₋z₋c0`−

m

µ(K) _{of the synthetic channels whose erasure probability is at most}_P

e`−m, wherec0=1+2 Pe−0.01

and, with high probability over the choice ofK, µ(K) 62+δ. Then, if we take

R=1−z− (1+2 P_e−0.01)`−µ(mK)_, ₍₁₅₎

a simple union bound yields that the error probability under successive cancellation decoding is at mostPe. As

I(W) = 1−z, by re-arranging (15), formula (12) immediately follows with β = (1+2 P_e−0.01)2+δ_.

With-out loss of generality, we can assume that δ < 1, hence we can take β as prescribed by the statement of the theorem. The claim on the construction complexity follows from the fact that the erasure probabilities of the synthetic channels can be computed exactly according to the recursion (6). The claim on the encoding/decoding complexity follows from [23, 47].

The rest of the section is devoted to prove Theorem 3. The basic idea consists in bounding the number of un-polarized synthetic channel. To do so, let us define the polarization measure functiongα(z)as

gα(z) , z(1−z)

α

, (16)

where α ∈ (0, 1)is a fixed parameter. The first step is to show that an upper bound onE[gα(Zm)]yields an

lower bound onP_{Zm 6Pe`−m}. This is done in Lemma 1, whose statement and proof immediately follow.

Lemma 1. LetK ∈F`_×`

2 be an`× `non-singular binary kernel such that none of its column permutations is

upper triangular. Let Zm be the random process defined in(6) with initial condition Z0 = z. Fix α ∈ (0, 1)

and definegα(z)as in(16). Fix ρ, Pe∈ (0, 1)and assume that, for anym>1,

E[gα(Zm)] 6 `−mρ. (17)

Then, for anym>1,

P{Zm 6Pe`−m} >1−z−c1`−m(ρ−α), (18)

wherec1=2Pe−α+Pe.

Proof. First of all, we upper boundP_{Zm ∈ [Pe`−m, 1−Pe`−m]}as follows:

P Zm ∈ Pe`−m, 1−Pe`−m (a) =P gα(Zm) >gα(Pe`−m) (b) 6 E[gα(Zm)] gα(Pe`−m) (c) 6 `− mρ gα(Pe`−m) (d) 62 P−α e `−m(ρ−α), (19)

where the equality (a) uses the concavity of the functiongα(·); the inequality (b) follows from Markov

inequal-ity; the inequality (c) uses the hypothesisE[gα(Zm)] 6 `−mρ; and the inequality (d) uses that1−Pe`−m >

(10)

and let A0, B0, and C0 be the fraction of synthetic channels in A, B, and C, respectively, that will have a vanishing erasure probability asn _→∞. More formally,

A0 =lim inf m0_→_∞ P n Zm ∈0, Pe`−m , Zm+m0 6 `−m0 o , B0 =lim inf m0_→_∞ P n Zm ∈ Pe`−m, 1−Pe`−m , Zm+m06 `−m0 o , C0 =lim inf m0_→_∞ P n Zm ∈ 1−Pe`−m, 1 , Zm+m0 6 `−m0 o . (21)

Recall that any`_{× `}non-singular binary matrix none of whose column permutations is upper triangular po-larizes symmetric channels [23]. Hence, as K satisfies this condition by hypothesis, we immediately have that A0+B0+C0 =lim inf m0_→_∞ P n Zm+m0 6 `−m0 o =1₋z. (22)

In addition, from (19), we have that

B0 6B62 P−α

e `−m(ρ−α). (23)

In order to upper boundC0, we proceed as follows: C0 =lim inf m0_→_∞ P n Z_m+m0 6 `−m0 |Zm∈ 1−Pe`−m, 1 o ·P Zm ∈ 1−Pe`−m, 1 6lim inf m0_→_∞ P n Z_m+m06 `−m0 |Zm ∈ 1−Pe`−m, 1 o . (24)

By using again that the kernelK is polarizing, we obtain that the last term equals the capacity of a BEC with erasure probability at least1₋Pe`−m. Consequently,

C0 6Pe`−m. (25)

As a result, we conclude that

P Zm ∈0, Pe`−m = A> A0 (a)=1₋z₋B0₋C0(b)>1₋z₋2 P−α e `−m(ρ−α)−Pe`−m, (c) >1₋z₋ 2 P−α e +Pe `−m(ρ−α)_,

where the equality (a) uses (22); the inequality (b) uses (23) and (25); and the inequality (c) uses that α and

ρ∈ (0, 1). This chain of inequalities implies the desired result.

The second step consists in giving an upper bound onE[gα(Zm)]of the form(λ∗α,K)m, where λ∗α,K depends

on the particular kernelK. This is done in Lemma 2, whose statement and proof immediately follow.

Lemma 2. LetK _∈ F`₂×`be an`_{× `}binary kernel. LetZm be the random process defined in(6) with initial

conditionZ₀K =z. Fix α∈ (0, 1)and definegα(z)as in(16). Forz∈ (0, 1), define λα,K(z)as

λα,K(z) , 1 `∑ ` i=1gα(fi,K(z)) gα(z) , (26)

and let λ∗_α,Kbe its supremum, i.e.,

λ∗_α,K, sup

z∈(0,1)

λα,K(z). (27)

Then, for anym>0, we have that

(11)

Proof. We prove the claim by induction. The base stepm=0 follows immediately from the fact that Z0 =z.

To prove the inductive step, first we write

Egα(Zm+1)

=EE[gα(fBm,K(Zm))|Zm],

where the first expectation on the RHS is with respect toZmand the second expectation is with respect toBm.

Then, we have that

EE[gα(fBm,K(Zm))|Zm] =E gα(Zm) 1 `∑ ` i=1gα(fi,K(Zm)) gα(Zm) 6Egα(Zm) sup z∈{0,1} 1 `∑ ` i=1gα(fi,K(z)) gα(z) | {z } λ∗_α_,K ,

which concludes the proof.

The third and final step is to prove that λ∗_α,K concentrates around1/√`, whenK is selected uniformly at random among all `_{× `} non-singular binary matrices. This is done in Theorem 4 that is stated below and whose proof is presented in Appendix A.

Theorem 4 (Concentration of λ∗_α,K). Let K _∈ F`₂×` be a kernel that is selected uniformly at random among all`_{× `}non-singular binary matrices. Fix α∈ (0, 1/16)and define λ∗_α,Kas in(27). Then, there exists`₁(α)

such that, for any` > `₁(α),

P log`(λ∗α,K) 6− 1 2 +5α >1₋ 2 `, (29)

where the probability space is over the choice of the kernelK.

At this point, we are ready to put everything together and give the proof of Theorem 3. Proof of Theorem 3. Pick`sufficiently large and define

α=min δ 12(2+δ), 1 100 . (30)

Then, by Theorem 4, we have that, with high probability over the choice of the kernelK,

λ∗_α,K 6 `−(1/2−5α). (31)

Consequently, asgα(z) 61 for any z∈ (0, 1), by Lemma 2 we have that

E[gα(Zm)] 6 `−m

(1/2−5α)

. (32)

Note that, with high probability, the kernelK is such that none of its column permutations is upper triangular. Then, we can apply Lemma 1 and we deduce that

P{Zm6 Pe`−m} >1−z−c1`−m(1/2−6α), (33)

wherec1 =2Pe−α+Pe. Note that, as α61/100 and Pe61, we have that c1 61+2Pe−0.01. By plugging in

(33) the choice of α given by (30), the thesis immediately follows.

In order to study the scaling between`and δ, first note that α is O(δ)by (30). Furthermore, in order to apply

Theorem 4, we need that` > `₁(α), where`1(α)satisfies (35). Hence,`needs to be of orderexp(1/α1.01),

(12)

Appendix

A. Proof of Theorem 4: Concentration of λ

∗_α_,K

Let us recall that the goal is to show that for most non-singular binary kernelsK∈F`_×`

2 and for` > `1(α),

λα,K(z) 6 `−

1

2+5α ∀_z∈ (_{0, 1}₎_, ₍₃₄₎

where α∈ (0, 1)is fixed and`₁(α)is a large integer that depends only on α.

Our strategy is to split the interval(0, 1)into the three sub-intervals(0, 1/`2),[1/`2, 1₋1/`2], and(1₋ 1/`2_{, 1}₎_{. Then, we will show that (34) holds for each of these sub-intervals. In fact, as we will see, polarization}

is much faster at the tail intervals. Theorem 5 captures this approach.

Theorem 5. Let K _∈ F₂`×` be a kernel that is selected uniformly at random among all `_{× `} non-singular binary matrices. Fix α∈ (0, 1/16)and define λα,K(z)as in(26). Let`1(α)be the smallest integer such that

log`₁(α)

log log`₁(α) >

1

α. (35)

Then, for all_{` > `}₁(α), the following results hold.

1. Near optimal polarization in the middle:

P λα,K(z) < `− 1 2+5α_, ∀_z∈ h1 `2, 1− 1 `2 i >1− 1_`, (36)

2. Faster polarization in the tails:

P λα,K(z) < `− 1 2, ∀z∈ 0, 1 `2 ∪1₋ 1 `2, 1 >1₋1 `, (37)

where the probability spaces are over the choice of the kernelK. The proof of Theorem 4 immediately follows from the result above.

Proof of Theorem 4. By applying the union bound on (36) and (37), we obtain that

P λα,K(z) < `− 1 2+5α_, ∀_z ∈ (_{0, 1}₎ >1−2_`, (38)

which yields the desired result.

(13)

A.1. Successive cancellation decoding of binary erasure channels

Let us recall again that, when the transmission channel is a BEC(z), each polar bit-channel channels is also a BEC whose erasure probability is a polynomial inz. In this section, we give more insight into this fact by first describing the successive cancellation decoding method for BECs, and then establishing a connection between the decodability of theith_{bit-channel and the column spaces of some sub-matrices of}_K.

Let K _∈ F₂`×` be a non-singular binary kernel. Assume that the underlying channel is a BEC(z) and let s be the number of erasures occurred during the transmission. There are a total number of (`_s)distinct and equally-likely erasure patterns, and each of them occurs with probabilityzs(1−z)`−s. The erasure probability at thei-th bit-channel is given by

fi,K(z) = `

∑

s=0

zs(1₋z)`−s # of erasure patterns with s erasures that make uiundecidable.

Fix an erasure pattern withs erasures. For convenience, let the last s coordinates denote its erasure locations. Hence, the received symbols are given as

x`₁−s=uK[:,1:`₋s] , (39)

whereK[:,1:`₋s]denotes the sub-matrix ofK that is formed by removing its last s columns. Furthermore, note

that in the successive cancellation decoding ofui, we assume thatui1−1is known. We can now re-write (39) as

x₁`−s= uK[:,1:`₋s] =ui₁−1 u`_i K_[_:,1:_`₋_s_] =ui₁−1 0_`₋_i₊₁ K_[_:,1:_`₋_s_]+0_i−₁ u` i K[:,1:`₋s] | {z } =u` iK[i:`,1:`−s] , (40)

whereK[i:`,1:`₋s]denotes the sub-matrix ofK that is formed by removing its first i−1 rows as well as its last

s columns. It is now clear that, in order to decode ui, one needs to express the column vector(1, 0,· · · , 0)tas

a linear combination of columns inK[i:`,1:`₋s]. In other words,

ui = decodable ⇐⇒ (1, 0, 0, . . . , 0

| {z }

`₋i

)t_∈ column space ofK[i:`,1:`₋s], (41)

which is also equivalent to the following condition

∃ψ₁i−1 ∈Fi₂−1: (ψ1,· · · , ψi−1, 1, 0, 0,· · · , 0

| {z }

`₋i

)t_∈ column space ofK[:,1:`₋s]. (42)

Letejdenote thej-th element of the canonical basis and define the linear subspace Ej ⊂F`2as

Ej ,spanhe1, e2,· · · , eji.

Therefore, the decodability condition can be further simplified as

ui = decodable ⇐⇒ (Ei\Ei−1)∩ (column space ofK[:,1:`₋s])6=∅. (43)

In the following, we use the condition (43) and we give an explicit formula for the probability thatuiis

(14)

A.2. Average polarization behavior

Fori∈ [`], define the average erasure probabilityFi(z)as

Fi(z) ,EK[fi,K(z)] = ∑K

fi,K(z)

τ`

, (44)

where τ`denotes the number of non-singular`× `binary matrices, i.e.,

τ`= `₋1

∏

j=0

(2`₋2j). (45)

Then, it is easy to verify thatFi(z)represents the erasure probability of thei-th bit-channel given that (i) the

kernel is selected uniformly at random among the`_{× `}nonsingular binary matrices, and (ii) the transmission channel is a BEC(z).

In the following, we analyze the asymptotic behavior of Fi(z) and show that, as ` becomes large, Fi(z)

becomes close to a step function with jump atz_∼i_/`. We then prove some concentration results to show that, with high probability over the choice of the kernelK, fi,K(z) is also close to a sharp step function centered

aroundz∼i_/`.

We first recall thatFi(z)is the probability of observing an erasure at thei-th bit-channel, where there are two

sources of randomness: (i) the selection of the kernel, and (ii) the number and location of the erased received symbols. Let the random variableS denote the number of erased symbols at the receiver. As z is the erasure probability of the transmission channel, we have that

P{S=s} = ` s zs(1−z)`−s. (46)

Since we also average over all`_{× `}non-singular kernels, the location of theses erasures does not affect the average erasure probability of the bit-channels. Hence, without loss of generality, we can assume that the erasures happened at the last coordinates. LetR`₋s⊂ F2`denote the linear span of the first`−s columns of

the kernel. Since the kernel is selected uniformly at random, it is easy to see that_R`₋sis also chosen uniformly

at random from all subspaces of dimension`₋s inF`₂. Recalling the decodability condition in (43), we have that

P{ui =erasure|S =s} =P{R`₋s∩ (Ei\Ei−1) =∅}, (47)

where_R`₋sis a subspace of dimension`−s inF`2that is chosen uniformly at random. The probability space

on the LHS is defined with respect to (i) the location of thes erasures, and (ii) the selection of the random ker-nelK, while the probability space on the RHS is defined with respect to just the selection of random subspace

R`₋s. Now we can rewriteFi(z)as

Fi(z) = `

∑

s=0 P{S=s_}P_{ui =erasure|S=s} = `

∑

s=0 ` s zs(1₋z)`−sp_i_|_s, (48) where we define the average conditional erasure probabilityp_i_|_sas

p_i_|_s,P{R`₋s∩ (Ei\Ei−1) =∅}. (49)

(15)

Lemma 3 (Closed-Form for Average Conditional Erasure Probability). Let p_i_|_s be the average conditional erasure probability defined in(49). Then, for anyi and s,

p_i_|_s= ` `₋s −1 min{`−s,i−1}

∑

t=max{i−s,0} i−1 t `−s−t−1

∏

j=0 2`−2i+j 2`₋s₋₂t+j, (50) wherea b

is the binary Gaussian binomial coefficient.

Proof. Let∆`₋sbe the total number of subspaces inF`2with dimension`−s. Then,

∆`₋s= ` `₋s , `₋s−1

∏

j=0 2`−2j 2`₋s₋₂j. (51)

DefineΓ`_t−s,i as the number of subspaces A of dimension `₋s in F`

2 such that A∩ (Ei \Ei−1) = ∅ and

dim(A∩Ei−1) =t. Equivalently,Γt`−s,irepresents the number of subspacesA of dimension`−s inF`2such

thatdim(A_∩E_i₋₁) =dim(A_∩E_i) = t. Consequently, the integer t in the definition ofΓ`_t−s,i satisfies the following conditions:

max{i−s, 0} 6t6min{` −s, i−1}. (52)

A simple basis counting argument (see [53]) yields that Γ`₋s,i t = i−1 t | {z } number of subspaces inEi−1withdim= t × `₋s−t−1

∏

j=0 2`₋2i+j 2`₋s₋₂t+j | {z }

normalized number of basis extensions fromdim= t to dim= `₋s

. (53)

Then, the desired conditional erasure probability can be written as

p_i_|_s = min{`−s,i−1}

∑

t=max{i−s,0} Γ`₋s,i t ∆`₋s . (54)

The thesis immediately follows from (53) and (54).

Now, we use this closed-form expression to provide lower and upper bounds on the average conditional erasure probabilityp_i_|_sand on the average erasure probabilityFi(z).

Lemma 4 (Lower Bound for Average Conditional Erasure Probability). Let p_i_|_s be the average conditional erasure probability defined in(49). Then, for anyi and s,

p_i_|_s>1₋2−(s−i). (55)

(16)

The remainder of the proof is derived by simple algebra as follows `₋s−1

∏

j=0 2`−2i+j 2`₋₂j > `₋s−1

∏

j=0 2`−2i+j 2` = `₋s−1

∏

j=0 1−2−(`−i)+j>1− `₋s−1

∑

j=0 2−(`−i)+j >1−2−(s−i). (57)

Lemma 5 (Lower Bound for Average Erasure Probability). LetFi(z)be the average erasure probability of the

i-th bit-channel as defined in (44). Fix β, δ∈R+_{and assume that}

z> i `+ d δlog`e ` + βln` 2` 1/2 , (58)

wherelog and ln denote the logarithm in base 2 and e, respectively. Then, we have that

Fi(z) > (1− `−β)(1− `−δ). (59)

Proof. We begin by dropping the firsti+_dδlog`eterms in (48) and applying the lower bound from (55):

Fi(z) = `

∑

s=0 ` s zs(1₋z)`−sp_i_|_s > `

∑

s=i+_dδlog`e ` s zs(1₋z)`−s(1₋2−(s−i)) > (1− `−δ₎ `

∑

s=i+δdlog`e _` s zs(1−z)`−s. (60) Now, we point out that the sum on the RHS of (60) is the tail probability of a binomial distribution with`trials and a success rate ofz. More formally, let X∼ B(`, z). Then, from (60) we immediately obtain that

Fi(z) > (1− `−δ)P{X>i+dδlogè}. (61) Furthermore, P{X>i+_dδlogè} =1−P{X <i+dδlogè} (a) > 1−exp −2 z`− (i+dδlogè) 2 ` ! (b) >1_{− `}−β_, ₍₆₂₎

where in (a) we use Hoeffding’s inequality and in (b) we use (58). By combining (60) with (62), the claim readily follows.

First, we use the closed-form expression in order find a lower bound on the average conditional erasure probability and on the average erasure probability.

Lemma 6 (Upper Bound for Average Conditional Erasure Probability). Let p_i_|_s be the average conditional erasure probability defined in(49). Then, for anyi and s,

p_i_|_s62 2 3

i−s−1

(17)

Proof. Ifs>i−1, then the proof is trivial. Hence, let us assume that s< i−1. We start by proving that the term witht =i₋s is the dominant one in the expression (54) for p_i_|_s. For allt>i₋s, we have that

Γ`₋s,i t Γ`₋s,i t−1 = i₋1 t i−1 t−1 × `−s−t−1

∏

j=0 2`−2i+j 2`₋s₋₂t+j `−s−t

∏

j=0 2`−2i+j 2`₋s₋₂t+j−1 ,

which using simple algebra can be simplified as

(2i−1−2t−1)(2`−s−2t−1) 2t−1₍₂t₋₁₎₍₂`₋₂i+`₋s−t₎ 6 1 2t−1 × 2i−1×2`−s 2t−1_×₂`₋1 = 2i−s−t+1 2t−1 62− t+1 6 1 2. Therefore, we have that, for anyt>i₋s,

Γ`₋s,i t 62−

t−(i−s) _Γ`₋s,i i−s ,

which implies that

p_s_|_i6 Γ `₋s,i i−s ∆`₋s 1+2−1+2−2+_{· · ·} 6 2Γ `₋s,i i−s ∆`₋s . (64)

In a similar fashion, we fix`andi, and we show the exponential decay of the dominant term in p_i_|_s, denoted by ζs,Γ`_i₋−_ss,i/∆`₋s, ass decreases. We again use simple algebra to obtain

ζs ζs+1 = ∆`−s−1 ∆`₋s × i−1 i−s i−1 i−s−1 × `₋i−1

∏

j=0 2`₋2i+j 2`₋s₋₂i−s+j `₋i−1

∏

j=0 2`₋2i+j 2`₋s−1₋₂i−s−1+j = (2 i−1₋₂i−s−1₎₍₂`₋s₋₁₎ (2i−s₋₁₎₍₂`₋₂`₋s−1₎ = 2s₋1 2s+1₋₁ 1₋2−(`−s) 1−2−(i−s) 6 1 2× 1 1−2−(i−s) 6 1 2 × 1 1−1/4 = 2 3. (65)

As a result, we conclude that, for anys<i−1, p_i_|_s (64) 6 2Γ `₋s,i i−s ∆`₋s (65) 6 2ζi−1 2 3 i−s−1 62 2 3 i−s−1 , (66)

which implies the desired result.

Lemma 7 (Upper Bound for Average Erasure Probability). LetFi(z)be the average erasure probability of the

i-th bit-channel as defined in (44). Fix β, δ∈R+_{and assume that}

z< i ` − g(δ) ` − βln` 2` 1/2 , (67)

wherelog and ln denote the logarithm in base 2 and e, respectively, and g(δ) = δlog` +log 6

log 3−1 =O(δlog`). (68)

Then, we have that

(18)

Proof. Let us recall the formulation ofFi(z)from (48) and split the summation into two parts, where a trivial

upper bound is applied to each part: we drop (`_s)zs₍₁₋_z₎`₋s _{for all terms in the summation with}_s ₆ _i₋

g(δ)−1, and we drop p_i_|_sfrom the remaining terms that correspond tos>i−g(δ). More formally, we have

Fi(z) = i−g(δ)₋1

∑

s=0 ` s zs(1−z)`−sp_i_|_s + `

∑

s=i−g(δ) ` s zs(1−z)`−sp_s_|_i < i−g(δ)−1

∑

s=0 p_s_|_i + `

∑

s=i−g(δ) ` s zs(1−z)`−s. (70)

We apply the upper bound in (63) to the first summation, and obtain that

i−g(δ)−1

∑

s=0 p_s_|_i 6 i−g(δ)−1

∑

s=0 2 2 3 i−s−1 6 ∞

∑

s=g(δ) 2 2 3 s =6 2 3 g(δ) = `−δ_. ₍₇₁₎

The second summation is again upper bounded by applying Hoeffding’s inequality on the tail probability of the binomial distributionX∼ B(`, z)with`trials and a success rate ofz as

`

∑

s=i−g(δ) _` s zs(1−z)`−s=P_{X>i−g(δ)} 6P ( X>z` + β`ln` 2 1/2) 6 `−β_. ₍₇₂₎

A.3. Proof of Theorem 5

At this point, we have gathered all the required tools to prove Theorem 5. Our proof consists of two steps. First, we show that the polarization behavior of a random non-singular`_{× `}kernel is given, with high probability, by the functionFi(z)analyzed in the previous subsection. Then, we explain how to relate this fact to an upper

bound on λα,K(z).

As the theorem itself suggests, we split the proof into two parts: the first part takes care of the middle interval and proves (36), while the second one takes care of the tail intervals and proves (37).

Proof of (36). First, we combine the results from Lemma 5 and Lemma 7 to show thatFi(z)roughly behaves

as a step function. In fact, we have that          Fi(z) > (1− `−β)(1− `−δ), ifz > i`+ d δlog`e ` + βln` 2` 1/2 Fi(z) < `−β+ `−δ, ifz < i`− j δlog`+log 6 log 3−1 k ` − βln` 2` 1/2 . (73)

Our strategy is to show that, with high probability over the choice of the kernelK, fi,K(z)is sharp for a fixed

value of i. Then, we will use a union bound to show that fi,K(z)is sharp for all i ∈ [`]. To do so, we set

β = δ = 4.5+log`. Furthermore, we can assume that` > 32, as (35) holds and α < 1/16. It is easy to

verify that  



Fi(z) >1−2`−4.5−log` >1− (2`4+log`)−1, ifz> `i +c`−1/2log`

Fi(z) <2`−4.5−log`< (2`4+log`)−1, ifz6 `i −c`−1/2log`

(19)

where

c= (4.5+log`) + (log 6)(log`)− 1 log 3₋1 ` −1/2₊ 4.5(log`)−1+1 2 log e 1/2 65, _{∀` >}32. (75) Note that there are infinitely many values ofz for which we need fi,K(z)to behave similar to (74). Hence, a

simple union bound would not give us the proof. Fortunately, for alli _{∈ [`]}, fi,K(z)and consequently Fi(z)

are increasing functions ofz. Hence, it suffices to consider only two points in(0, 1)for eachi, one slightly larger thanz =i_/`and one slightly smaller.

Define

ai ,

i

` +c`

−1/2_log_`_.

From (74), we have that

E1− fi,K(ai)

=1−Fi(ai) < (2`4+log`)−1,

where the expectation is taken over all non-singular`_{× `}kernels. From Markov inequality, we deduce that

P fi,K(ai) 61− 1 `2+log` =P1₋ fi,K(ai) > 1 `2+log` 6 EK1− fi,K(ai) 1/`2+log` 6 1 2`2. (76) Define Ai ,K∈F2`×`

K is non-singular and f_i,K(a_i) >1− 1

`2+log` . (77)

Therefore, (76) can be re-written as

P{K_{∈ A}i} >1− 1 2`2. Similarly, set bi , i `−c`− 1/2_log_`_, and define Bi ,K∈F2`×`

K is non-singular and f_i,K(b_i) > 1

`2+log` . (78)

A very similar use of Markov inequality shows that

P{K _{∈ B}_i_{} >}1− 1 2`2. Then, define D = ∩`j=1Aj ∩ ∩`j=1Bj.

By union bound, we obtain that

P{K∈ D} >1− `

∑

i=1 P{K /∈ Ai} − `

∑

i=1 P{K /∈ Bi} >1− 2` 2`2 =1− 1 `. (79)

Assume thatK_{∈ D}throughout the remainder of proof. This implies that, fori_{∈ [`]}, (

fi,K(z) >1− _`2+1log`, forz= _`i +c`−1/2log`

fi,K(z) < _`2+1log`, forz= _`i −c`−1/2log`

(20)

As fi,K(z)is an increasing function ofz, (80) is equivalent to

(

fi,K(z) >1− _`2+1log`, forz> _`i +c`−1/2log`

fi,K(z) < _`2+1log`, forz6 _`i −c`−1/2log`

. (81)

Given these concentration results, we can proceed to the second step of the proof. Let us define T0(z,`) ,z`−c`1/2log`,

T1(z,`) ,z` +c`1/2log`.

(82) Note that, for any z ∈ (1/`2, 1−1/`2), the number of indices i such that fi,K(z) does not satisfy (81) is

upper bounded by T1(z,`)−T0(z,`) =2c`−1/2log`, as i `−c` −1/2_log_{` <} _z_< i `+c`

−1/2_log_`_⇐⇒_z_`₋_c_`1/2_log_{` <}_i_< _z_{` +}_c_`1/2_log_`_. ₍₈₃₎

We can re-write λα,K(z)which was defined earlier in (26) as

λα,K(z) = 1 `

∑

i∈(T0(z,`),T1(z,`)) gα(fi,K(z)) gα(z) + 1 `

∑

i /∈(T0(z,`),T1(z,`)) gα(fi,K(z)) gα(z) (84) By using (81), we have that, for anyi /∈ T0(z,`), T1(z,`),

gα(fi,K(z)) 6gα 1 `2+log` < `−2−log`α . (85)

By combining (85) with the trivial upper bound ofgα(fi,K(z)) 61 for the left summation, we obtain that

λα,K(z) 6 1 `2c`1/2log` gα(z) + ` −2−log`α gα(z) (86) Furthermore, note that, for anyz∈ (1/`2, 1−1/`2),

gα(z) > `−2(1− `−2)

α

. (87)

By combining (86) and (87), we have that

λα,K(z) 6 1 (1− `−2₎α 2c`−1/2+2αlog` + `−αlog` . (88)

As (35) holds with α61/16,`is large enough so that

(1− `−2₎α 62, log` 6 `α_, −αlog` 6−1/2+3α, 4c+26 `2α. (89)

By applying the inequalities in (89) to (88), we finally obtain that

λα,K(z) 64c`−1/2+2αlog` +2`−αlog`64c`−1/2+3α+ 2`−1/2+3α = (4c+2)`−

1

(21)

Proof of (37). The proof of the tail intervals also follows from analyzing the average erasure probabilities. We present the proof mainly for the lower tail, wherez ∈ (0, 1/`2). Similar arguments yield the proof for the upper tail.

We begin by recalling the previously derived upper bound on the average conditional erasure probability in (63): p_i_|_s=P_{R`₋s∩ (Ei\Ei−1) =∅} 63 2 3 i−s , (91)

where the probability space is defined with respect to the selection of a random subspace_R`₋s ⊂ F`2 of

di-mension = `₋s. Once again, let us point out that the above mentioned probability is equal to P{R0 `₋s∩ (Ei\Ei−1) =∅}, whereR0`−sis the linear span of some randomly chosen`−s columns of a random kernel

K_∈F`₂×`.

Let us define the conditional erasure probabilities of a fixed kernelK by

q_i_|_s(K) ,P{R`₋s∩ (Ei\Ei−1) =∅|K} =P{R`₋s(K)∩ (Ei\Ei−1) =∅}, (92)

where_R`₋s(K)is the linear span of`−s columns in K that are selected uniformly at random. Note that in

(92) the kernelK is fixed and the probability is with respect to the selection of the columns of the kernel (i.e., with respect to the location of thes channel erasures). Hence, it is clear that

EK[qi|s(K)] = pi|s63

2 3

i−s

. (93)

Similar to the proof for the middle interval, we provide some concentration results aboutq_i_|_s(K), whenK is selected uniformly at random among the non-singular`_{× `}matrices. Let us first fix the value ofi, and s. Then, by Markov inequality, we have

P ( q_i_|_s(K) >6`2(` +1) 2 3 i−s) 6 1 2`2_{(` +}₁₎, (94)

where the probability is defined with respect to the selection of the kernel. By union bound, for any i _∈

{1, . . . ,`_}ands∈ {0, . . . ,`_}, we deduce that

qi|s(K) 66`2(` +1)

2 3

i−s

, (95)

with probability of at least1₋1/2`.

(22)

Note that `

∑

s=1 ` s 3z 2 s (1−z)`−s=1+ z 2 ` − (1−z)` < (1+z)`_{− (}1₋z)` 62`z(1+z)`−1, (97) where the last inequality in (97) comes from the fact that∀x ∈ (0, 1), there exists ax0 ∈ (1−x, 1+x)such

that (1+x)`_{− (}1₋x)`= (1+x)_{− (}1₋x) ∂(1+x)` ∂x x=x0 =2`x(1+x0)`−162`x(1+x)`−1. (98)

Next, we point out that, for anyz< `−2and any` >2, we have

(1+z)`−1₆ 1+ 1 `2 `−1 6 1+ 1 `2 `2 <exp(1) <3. (99)

Now, we replace (97) and (99) in (96) to obtain that fi,K(z) <36`3(` +1) 2 3 i z638`4 2 3 i z, (100)

where the last inequality holds for_{` >}18.

Finally, we use (100) to derive the following upper bound on λα,K(z)for anyz ∈ (0, 1/`2):

λα,K(z) = 1 ` ∑ ` i=1gα(fi,K(z)) gα(z) = 1 ` `

∑

i fi,K(z) 1− fi,K(z) α z(1₋z)α < 1 ` `

∑

i=1 fi,K(z) α z−α₍₁₋_z₎−α _{< `}4α−1 38α `

∑

i=1 2 3 iα (1−z)−α_. ₍₁₀₁₎ As α61/16, 38 1−z α < 38 1− (1/18)2 1/16 < 3 2. (102) Furthermore, `

∑

i=1 2 3 iα < ∞

∑

i=1 2 3 iα = ∞

∑

j=0 d1/αe

∑

k=1 2 3 (jd1/αe+k)α < ∞

∑

j=0 d1/αe

∑

k=1 2 3 (jd1/αe)α = 1 α ∞

∑

j=0 2 3 (jd1/αe)α 6 1+ 1 α ∞

∑

j=0 2 3 j =3 1+ 1 α < 4 α. (103)

Moreover, from (35) we obtain that

6α−16 `1/4. (104)

(23)

which yields the desired bound on the lower tail.

The proof for the upper tail follows very similar arguments. First, we define hi,K(z) ,1− fi,K(z),

r_i_|_s(K) ,1−q_i_|_s(K). (106)

Next, we use the upper bound on the average conditional erasure probability from (55) to provide an upper bound onEr_i_|_s(K) that is very similar to (93), i.e.,

EK[ri|s(K)] 62−(s−i). (107)

By following steps similar to (94)-(105) and by using that α61/16 and 4α−1 6 `1/4, we show that, for any z_{∈ (}1₋1/`2, 1),

λα,K(z) 6 `−1/2, (108)

with probability at least1−1/(2`) over the choice of the kernel. By combining (105) and (108) and using one last union bound, we conclude that

(24)

References

[1] E. ARIKAN, Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels. IEEE Trans. Inform. Theory 55, 7 (July 2009), 3051–3073.

[2] E. ARIKANand I.E. TELATAR, On the rate of channel polarization. In Proc. of the IEEE Int. Symposium on Inform. Theory(Seoul, South Korea, July 2009), pp. 1493–1495.

[3] E.R. BERLEKAMP, R.J. MCELIECE, and H.C.A.VANTILBORG, On the inherent intractability of certain coding problems, IEEE Trans. Inform. Theory, 24 (May 1978), 384–386.

[4] R.L. DOBRUSHIN, Mathematical problems in the Shannon theory of optimal coding of information. In Proc. 4th Berkeley Symp. Mathematics, Statistics, and Probability(1961), vol. 1, pp. 211–252.

[5] A. ESLAMI and H Pishro-Nik, On finite-length performance of polar codes: Stopping sets, error floor, and concatenated design. IEEE Trans. Commun. 61, 3 (Mar. 2013), 919–929.

[6] A. FAZELIand A. VARDY, On the scaling exponent of binary polarization kernels. In Proc. of the Aller-ton Conf. on Commun., Control, and Computing(Monticello, IL, USA, October 2014), pp. 797–804. [7] G.D. FORNEY, JR., Concatenated Codes. PhD thesis, MIT, 1966.

[8] D. GOLDINand D. BURSHTEIN, Improved bounds on finite length scaling of polar codes, IEEE Trans.

Inform. Theory, vol. 60, no. 11, pp. 6966–6978, November 2014.

[9] V. GURUSWAMIand P. XIA, Polar codes: Speed of polarization and polynomial gap to capacity, Proc.

54-th Annual IEEE Symp. Foundations of Computer Science(FOCS), Berkeley, CA, October 2013; also arxiv.org/abs/1304.4321, preprint of April 16, 2013.

[10] S.H. HASSANI, Polarization and Spatial Coupling: Two Techniques to Boost Performance Ph.D. disser-tation, EPFL, Lausanne, Switzerland, March 2013.

[11] S.H. HASSANI, K. ALISHAHI, and R.L. URBANKE, Finite-length scaling for polar codes, IEEE Trans. Inform. Theory, vol. 60, no. 10, pp. 5875–5898, October 2014.

[12] S.H. HASSANI, R. MORI, T. TANAKA, and R.L. URBANKE, Rate-dependent analysis of the asymptotic behavior of channel polarization. IEEE Trans. Inform. Theory 59, 4 (Apr. 2013), 2267–2276.

[13] S.H. HASSANIand R.L. URBANKE, On the scaling of polar codes I: The behavior of polarized channels, arxiv.org/abs/1001.2766, preprint of January 28, 2010.

[14] S.H. HASSANIand R.L. URBANKE, On the scaling of polar codes II: The behavior of un-polarized chan-nels, arxiv.org/abs/1002.3187, preprint of February 18, 2010.

[15] M. HAYASHI, Information spectrum approach to second-order coding rate in channel coding. IEEE Trans. Inform. Theory 55, 11 (Nov. 2009), 4947–4966.

[16] E. HOF and S. SHAMAI, Secrecy-achieving polar-coding, Proc. IEEE Information Theory Workshop, Dublin, Ireland, August 2010.

(25)

[18] N. HUSSAMI, S.B. KORADA, and R.L. URBANKE, The compound capacity of polar codes, in Proc. 47-th Annual Allerton Conference on Communication, Control, and Computing, pp. 16–21, Monticello, IL, October 2009.

[19] N. HUSSAMI, R.L. URBANKE, and S.B. KORADA, Performance of polar codes for channel and source coding, in Proc. IEEE Intern. Symp. Information Theory, pp. 1488–1492, Seoul, South Korea, June 2009. [20] S.B. KORADA, Polar Codes for Channel and Source Coding, Ph.D. dissertation, EPFL, Lausanne,

Switzerland, May 2009.

[21] S.B. KORADA, A. MONTANARI, E. TELATAR, and R.L. URBANKE, An empirical scaling law for polar codes, in Proc. IEEE Intern. Symp. Information Theory, pp. 884–888, Austin, TX, June 2010.

[22] S.B. KORADAand E. S¸AS¸OGLU˘ , A class of transformations that polarize binary-input memoryless chan-nels, Proc. IEEE Intern. Symp. Information Theory, pp. 1478–1482, Seoul, Korea, June 2009.

[23] S.B. KORADA, E. S¸AS¸OGLU˘ , and R.L. URBANKE, Polar codes: Characterization of exponent, bounds, and constructions, IEEE Trans. Inform. Theory, vol. 56, no. 12, pp. 6253-6264, December 2010.

[24] S.B. KORADA and R.L. URBANKE, Polar codes for Slepian-Wolf, Wyner-Ziv, and Gelfand-Pinsker, Proc. IEEE Information Theory Workshop, Cairo, Egypt, January 2010.

[25] S.B. KORADAand R.L. URBANKE, Polar codes are optimal for lossy source coding, IEEE Trans. Inform.

Theory, vol. 56, no. 4, pp. 1751–1768, April 2010.

[26] O.O. KOYLUOGLUand H. ELGAMAL, Polar coding for secure transmission and key agreement, IEEE Trans. Inform. Forensics and Security, vol. 7, no. 5, pp. 1472–1483, July 2012.

[27] S. KUDEKAR, S. KUMAR, M. MONDELLI, H. D. PFISTER, E. S¸AS¸OGLU˘ , and R. URBANKE, R, Reed-Muller codes achieve capacity on erasure channels, In Proc. of the Annual ACM Symposium on Theory of Computing (STOC)(Boston, MA, USA, June 2016), pp. 658–669.

[28] S. KUDEKAR, S. KUMAR, M. MONDELLI, H. D. PFISTER, E. S¸AS¸OGLU˘ , and R. URBANKE, R,

Reed-Muller codes achieve capacity on erasure channels, IEEE Trans. Inform. Theory 63, 7 (July 2017), 4298– 4316.

[29] S. KUDEKAR, T. J. RICHARDSON, and R. L. URBANKE, Spatially coupled ensembles universally

achieve capacity under belief propagation, IEEE Trans. Inform. Theory 59, 12 (Dec. 2013), 7761–7813. [30] C. LEROUX, A.J. RAYMOND, G. SARKIS, I. TAL, A. VARDY, and W. J. GROSS,, Hardware

implemen-tation of successive-cancellation decoders for polar codes, Journal of Signal Processing Systems, vol. 69, no. 3, pp. 305–315, December 2012.

[31] C. LEROUX, A.J. RAYMOND, G. SARKIS, and W.J. GROSS, A semi-parallel successive cancellation de-coder for polar codes, IEEE Trans. Signal Processing, vol. 61, no. 2, pp. 289–299, January 2013.

[32] C. LEROUX, A.J. RAYMOND, G. SARKIS, I. TAL, A.VARDY, and W.J. GROSS, Hardware

implemen-tation of successive-cancellation decoders for polar codes, J. Signal Processing Systems, vol. 69, no. 3, pp. 305–315, December 2012.

(26)

[34] H. MAHDAVIFAR, M. EL-KHAMY, J. LEE, and I. KANG, Achieving the uniform rate region of general multiple access channels by polar coding, IEEE Trans. on Communications, vol. 64,

[35] H. MAHDAVIFAR, M. EL-KHAMY, J. LEE, and I. KANG, Polar coding for bit-interleaved coded

modu-lation, IEEE Trans. Vehicular Techn., vol. 65, no. 5, pp. 3115–3127, May 2016.

[36] H. MAHDAVIFARand A.VARDY, Achieving the secrecy capacity of wiretap channels using polar codes, IEEE Trans. Inform. Theory, vol. 57, no. 10, pp. 6428–6443, October 2011.

[37] M. MONDELLI, S. H. HASSANI, I. SASON, and R. L. URBANKE, Achieving Marton’s region for broad-cast channels using polar codes, IEEE Trans. Inform. Theory, vol. 61, no. 2, pp. 783–800, February 2015. [38] M. MONDELLI, S.H. HASSANI, and R.L. URBANKE, How to achieve the capacity of asymmetric

chan-nels, submitted to the IEEE Trans. Inform. Theory, September 2014; also arxiv.org/abs/1406.7373. [39] M. MONDELLI, S.H. HASSANI, and R.L. URBANKE, From polar to Reed-Muller codes: A technique to

improve the finite-length performance, IEEE Trans. Commun. 62, 9 (Sept. 2014), 3084–3091.

[40] M. MONDELLI, S.H. HASSANI, and R.L. URBANKE, Construction of polar codes with sublinear com-plexity, In Proc. of the IEEE Int. Symposium on Inform. Theory Aachen, Germany, June 2017, pp. 1853– 1857.

[41] M. MONDELLI, S.H. HASSANI, and R.L. URBANKE, Scaling exponent of list decoders with applica-tions to polar codes, IEEE Trans. Inform. Theory, vol. 61, no. 9, pp. 4838–4851, September 2015. [42] M. MONDELLI, S.H. HASSANI, and R.L. URBANKE, Unified scaling of polar codes: Error exponent,

scaling exponent, moderate deviations, and error floors, IEEE Trans. Inform. Theory, vol. 62, no. 12, pp. 6698–6712, December 2016

[43] A. MONTANARI, Finite size scaling and metastable states of good codes, In Proc. of the Allerton Conf. on Commun., Control, and Computing(Monticello, IL, USA, Oct. 2001).

[44] R. PEDARSANI, S. H. HASSANI, and E. TELATAR, On the construction of polar codes, In Proc. of the IEEE Int. Symposium on Inform. TheorySt. Petersberg, Russia, Aug. 2011, pp. 11–15.

[45] H. D. PFISTER, and R. URBANKE, Near-optimal finite-length scaling for polar codes over large

alpha-bets May 2016. [Online]. Available: http://arxiv.org/abs/1605.01997.

[46] Y. POLYANSKIY, H. V. POOR, and S. VERDU´, Channel coding rate in the finite block-length regime, IEEE Trans. Inform. Theory 56, 5 (May 2010), 2307–2359.

[47] E. S¸AS¸OGLU˘ , Polarization and polar codes, Foundations and Trends in Communications and Information Theory, vol. 8, no. 4, pp. 259–381, October 2012.

[48] C.E. SHANNON, A mathematical theory of communication, Bell Syst. Tech. J., 27, (April/October 1948), 379–423 and 623–656.

[49] V. STRASSEN, Asymptotische absch¨atzungen in Shannon’s informationstheorie, In Trans. 3rd Prague

Conf. Inf. Theory(1962), pp. 689–723.

(27)

[51] I. TALand A.VARDY, List decoding of polar codes, IEEE Trans. Inform. Theory, vol. 61, no. 5, pp. 2213– 2226, May 2015.

[52] J. P. TILLICH, and G. Z ´EMOR, Discrete isoperimetric inequalities and the probability of a decoding error. Combinatorics, Probability and Computing, 9 (September 2000), 465–479.

[53] A. VARDYand Y. BE’ERY, Maximum-likelihood soft decision decoding of BCH codes, IEEE