
On the Entropy of Sums

Mokshay Madiman
Department of Statistics
Yale University
Email: mokshay.madiman@yale.edu

Abstract— It is shown that the entropy of a sum of independent random vectors is a submodular set function, and upper bounds on the entropy of sums are obtained as a result in both discrete and continuous settings. These inequalities complement the lower bounds provided by the entropy power inequalities of Madiman and Barron (2007). As applications, new inequalities for the determinants of sums of positive-definite matrices are presented.

I. INTRODUCTION

Entropies of sums are not as well understood as joint entropies. Indeed, for the joint entropy, there is an elaborate history of entropy inequalities starting with the chain rule of Shannon, whose major developments include works of Han, Shearer, Fujishige, Yeung, Matúš, and others. The earlier part of this work, involving so-called Shannon-type inequalities that use the submodularity of the joint entropy, was synthesized and generalized by Madiman and Tetali [1], who give an array of lower as well as upper bounds for the joint entropy of a collection of random variables, generalizing inequalities of Han [2], Fujishige [3] and Shearer [4]. For a review of the later history, involving so-called non-Shannon inequalities, one may consult, for instance, Matúš [5].

When one considers the entropy of sums of independent random variables instead of joint entropies, the most important inequalities known are the entropy power inequalities, which provide lower bounds on the entropy of sums. In the setting of independent summands, the most general entropy power inequalities known to date are described by Madiman and Barron [6].

In this note, we develop a basic submodularity property of the entropy of sums of independent random variables, and formulate a “chain rule for sums”. Surprisingly, these results are entirely elementary and rely on the classical properties of joint entropy. We also demonstrate, as a consequence, upper bounds on the entropy of sums that complement the entropy power inequalities.

The paper is organized as follows. In Section II, we review the notation and definitions we use. Section III presents the submodularity of the entropy of sums of independent random vectors, which is a central result. Section IV develops general upper bounds on entropies of sums, while Section V reviews known lower bounds and conjectures some additional lower bounds. In Section VI, we apply the entropy inequalities of the previous sections to give information-theoretic proofs of inequalities for the determinants of sums of positive-definite matrices.

II. PRELIMINARIES

Let X1, X2, ..., Xn be a collection of random variables taking values in some linear space, so that addition is a well-defined operation. We assume that the joint distribution has a density f with respect to some reference measure. This allows the definition of the entropy of various random variables depending on X1, ..., Xn. For instance, H(X1, X2, ..., Xn) = −E[log f(X1, X2, ..., Xn)]. There are two familiar canonical cases: (a) the random variables are real-valued and possess a probability density function, or (b) they are discrete. In the former case, H represents the differential entropy, and in the latter case, H represents the discrete entropy. We simply call H the entropy in all cases, and where the distinction matters, the relevant assumption will be made explicit.

Since we wish to consider sums of various subsets of random variables, the following notational conventions and definitions will be useful. Let [n] be the index set {1, 2, ..., n}. We are interested in a collection C of subsets of [n]. For any set s ⊂ [n], Xs stands for the random variable (Xi : i ∈ s), with the indices taken in their increasing order, while

    Ys = Σ_{i∈s} Xi .

For any index i in [n], define the degree of i in C as r(i) = |{t ∈ C : i ∈ t}|. A function α : C → R_+ is called a fractional covering if, for each i ∈ [n], we have Σ_{s∈C: i∈s} αs ≥ 1. A function β : C → R_+ is called a fractional packing if, for each i ∈ [n], we have Σ_{s∈C: i∈s} βs ≤ 1. If α is both a fractional packing and a fractional covering, it is called a fractional partition.
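To make these definitions concrete, here is a small standard-library Python sketch (purely illustrative, not part of the paper) that computes the degrees r(i) for the collection of all two-element subsets of [3] and checks that the constant weights αs = 1/2 form both a fractional covering and a fractional packing, hence a fractional partition.

# Illustrative sketch: degrees, fractional coverings and packings for the
# collection of all 2-element subsets of [3] = {1, 2, 3}.
from itertools import combinations

n = 3
C = [set(s) for s in combinations(range(1, n + 1), 2)]            # {1,2}, {1,3}, {2,3}
degree = {i: sum(1 for s in C if i in s) for i in range(1, n + 1)} # r(i) = 2 for each i

alpha = {frozenset(s): 0.5 for s in C}                             # candidate weights alpha_s = 1/2

def is_fractional_covering(C, w):
    # sum of w_s over the sets containing i must be >= 1 for every index i
    return all(sum(w[frozenset(s)] for s in C if i in s) >= 1 for i in range(1, n + 1))

def is_fractional_packing(C, w):
    # sum of w_s over the sets containing i must be <= 1 for every index i
    return all(sum(w[frozenset(s)] for s in C if i in s) <= 1 for i in range(1, n + 1))

print(degree)                            # {1: 2, 2: 2, 3: 2}
print(is_fractional_covering(C, alpha))  # True
print(is_fractional_packing(C, alpha))   # True -> alpha is a fractional partition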
The mutual information between two jointly distributed random variables X and Y is

    I(X; Y) = H(X) + H(Y) − H(X, Y),

and is a measure of the dependence between X and Y. In particular, I(X; Y) = 0 if and only if X and Y are independent. The following three simple properties of mutual information can be found in elementary texts on information theory such as Cover and Thomas [7]:

• When Y = X + Z, where X and Z are independent, then

    I(X; Y) = H(Y) − H(Z).     (1)

Indeed, I(X; X + Z) = H(X + Z) − H(X + Z | X) = H(X + Z) − H(Z | X) = H(X + Z) − H(Z).
• The mutual information cannot increase when one looks at functions of the random variables (the “data processing inequality”):

    I(f(X); Y) ≤ I(X; Y).

• When Z is independent of (X, Y), then

    I(X; Y) = I(X, Z; Y).     (2)

Indeed,

    I(X, Z; Y) − I(X; Y)
      = H(X, Z) − H(X, Z | Y) − [H(X) − H(X | Y)]
      = H(X, Z) − H(X) − [H(X, Z | Y) − H(X | Y)]
      = H(Z | X) − H(Z | X, Y)
      = 0.

III. SUBMODULARITY

Although the inequality below is a simple consequence of these well known facts, it does not seem to have been noticed before.

Theorem I: [SUBMODULARITY FOR SUMS] If Xi are independent R^d-valued random vectors, then

    H(X1 + X2) + H(X2 + X3) ≥ H(X1 + X2 + X3) + H(X2).

Proof: First note that

    H(X1 + X2) + H(X2 + X3) − H(X1 + X2 + X3) − H(X2)
      = [H(X1 + X2) − H(X2)] − [H(X1 + X2 + X3) − H(X2 + X3)]
      = I(X1 + X2; X1) − I(X1 + X2 + X3; X1),

using (1). Thus we simply need to show that

    I(X1 + X2; X1) ≥ I(X1 + X2 + X3; X1).

Now

    I(X1 + X2 + X3; X1) ≤ I(X1 + X2, X3; X1)     (a)
                        = I(X1 + X2; X1),        (b)

where (a) follows from the data processing inequality, and (b) follows from (2); so the proof is complete.
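Since the differential entropy of a multivariate normal is an explicit function of its covariance matrix (the formula is recalled in Section VI), Theorem I can be spot-checked numerically for Gaussian summands, for which sums are again Gaussian with summed covariances. The following sketch (Python with numpy; an illustration, not part of the paper, with arbitrary dimension and randomly generated covariances) does this.

# Numerical spot-check of Theorem I for independent Gaussian vectors,
# using H(N(0, K)) = 0.5 * log((2*pi*e)^d * det(K)).  Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d = 3

def random_cov(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)   # positive definite

def gaussian_entropy(K):
    d = K.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(K))

for _ in range(1000):
    K1, K2, K3 = (random_cov(d) for _ in range(3))
    # sums of independent Gaussians are Gaussian with summed covariances
    lhs = gaussian_entropy(K1 + K2) + gaussian_entropy(K2 + K3)
    rhs = gaussian_entropy(K1 + K2 + K3) + gaussian_entropy(K2)
    assert lhs >= rhs - 1e-9

print("submodularity held on all sampled Gaussian examples")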
Some remarks about this result are appropriate. First, this result implies that the “rate region” of vectors (R1, ..., Rn) satisfying

    Σ_{i∈s} Ri ≤ H(Ys)

for each s ⊂ [n] is polymatroidal (in particular, it has a non-empty dominant face on which Σ_{i∈[n]} Ri = H(Y[n])). It is not clear how this fact is to be interpreted, since this is not the rate region of any real information-theoretic problem to the author’s knowledge. Second, a consequence of the submodularity of the entropy of sums is an entropy inequality that obeys the partial order constructed using compressions, as introduced by Bollobás and Leader [8]. Since this inequality requires the development of some involved notation, we defer the details to the full version [9] of this paper. Third, an equivalent statement of Theorem I is that for independent random vectors,

    I(X; X + Y) ≥ I(X; X + Y + Z),     (3)

as is clear from the proof of Theorem I.

IV. UPPER BOUNDS

Remarkably, if we wish to use Theorem I to obtain upper bounds for the entropy of a sum, there is a dramatic difference between the discrete and continuous cases, even though this distinction does not appear in Theorem I.

First consider the case where all the random variables are discrete. Shannon’s chain rule for joint entropy says that

    H(X1, X2) = H(X1) + H(X2 | X1),     (4)

where H(X2 | X1) is the conditional entropy of X2 given X1. This rule extends to the consideration of n variables; indeed,

    H(X1, ..., Xn) = Σ_{i=1}^{n} H(Xi | X<i),     (5)

where X<i is used to denote (Xj : j < i). Similarly, one may also talk about a “chain rule for sums” of independent random variables, and this is nothing but the well known identity (1) that is ubiquitous in the analysis of communication channels. Indeed, one may write

    H(X1 + X2) = H(X1) + I(X1 + X2; X2)

and iterate this to get

    H(Σ_{i∈[n]} Xi) = H(Y[n−1]) + I(Y[n]; Yn)
                    = H(Y[n−2]) + I(Y[n−1]; Y_{n−1}) + I(Y[n]; Yn)
                    = ...
                    = Σ_{i∈[n]} I(Y[i]; Yi),

which has the form of a “chain rule”. Here we have used Ys to denote the sum of the components of Xs, and in particular Yi = Y{i} = Xi. Below we also use ≤i for the set of indices less than or equal to i, Y≤i for the corresponding subset sum of the Xi’s, etc.
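To see the chain rule for sums in action in the discrete case, the identity H(X1 + X2) = H(X1) + I(X1 + X2; X2) can be verified directly for two Bernoulli summands, since the distribution of a sum of independent discrete variables is the convolution of their distributions. A small sketch (Python with numpy; the Bernoulli parameters are arbitrary and the snippet is purely illustrative):

# Check H(X1 + X2) = H(X1) + I(X1 + X2; X2) for independent Bernoulli variables.
# Entropies in nats; the pmf of the sum is the convolution of the pmfs.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p1, p2 = 0.3, 0.8
X1 = np.array([1 - p1, p1])          # pmf of X1 on {0, 1}
X2 = np.array([1 - p2, p2])          # pmf of X2 on {0, 1}
S = np.convolve(X1, X2)              # pmf of X1 + X2 on {0, 1, 2}

# Compute I(X1 + X2; X2) from the joint pmf of (X1 + X2, X2); by identity (1)
# this also equals H(X1 + X2) - H(X1).
joint = np.zeros((3, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        joint[x1 + x2, x2] += X1[x1] * X2[x2]
mi = entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

print(np.isclose(entropy(S), entropy(X1) + mi))   # True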
Theorem II: [UPPER BOUND FOR DISCRETE ENTROPY] If Xi are independent discrete random variables, then

    H(X1 + ... + Xn) ≤ Σ_{s∈C} αs H(Σ_{i∈s} Xi),

for any fractional covering α using any collection C of subsets of [n].
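For example, with n = 3, the collection C of all two-element subsets of [3], and the degree covering αs = 1/2 (every index has degree 2), Theorem II reads H(X1 + X2 + X3) ≤ (1/2)[H(X1 + X2) + H(X2 + X3) + H(X1 + X3)]. The following spot-check for random Bernoulli summands (Python with numpy; illustrative only) confirms this numerically.

# Check of Theorem II for three independent Bernoulli variables with the
# collection of all 2-element subsets and alpha_s = 1/2 (a fractional covering).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def pmf_of_sum(pmfs):
    out = np.array([1.0])
    for p in pmfs:
        out = np.convolve(out, p)    # sum of independent variables: convolve pmfs
    return out

rng = np.random.default_rng(1)
for _ in range(1000):
    pmfs = [np.array([1 - q, q]) for q in rng.uniform(0.05, 0.95, size=3)]
    lhs = entropy(pmf_of_sum(pmfs))
    rhs = 0.5 * sum(entropy(pmf_of_sum([pmfs[i], pmfs[j]]))
                    for i, j in [(0, 1), (1, 2), (0, 2)])
    assert lhs <= rhs + 1e-9

print("Theorem II bound held on all sampled Bernoulli examples")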
Theorem II can easily be derived by combining the chain rule for sums described above with Theorem I. Alternatively, it follows from Theorem I because of the general fact that submodularity implies “fractional subadditivity” (see, e.g., [1]).

For any collection C of subsets, [1] introduced the degree covering, given by

    αs = 1 / r_−(s),

where r_−(s) = min_{i∈s} r(i). Specializing Theorem II to this particular fractional covering, we obtain

    H(X1 + ... + Xn) ≤ Σ_{s∈C} (1 / r_−(s)) H(Σ_{i∈s} Xi).

A simple example is the case of the collection Cm, consisting of all subsets of [n] with m elements. For this collection, we obtain

    H(X1 + ... + Xn) ≤ (1 / \binom{n−1}{m−1}) Σ_{s∈Cm} H(Σ_{i∈s} Xi),

since the degree of each index with respect to Cm is \binom{n−1}{m−1}.

Now consider the case where all the random variables are continuous. At first sight, everything discussed above for the discrete case should also go through in this case, since Theorem I holds in general. However this is not true; indeed, the differential entropy of a sum does not even satisfy simple subadditivity! To see why we cannot obtain subadditivity from Theorem I, note that setting X2 = 0 makes Theorem I trivial, because the differential entropy of a constant is −∞ (unlike the discrete entropy of a constant, which is 0). Similarly, the chain rule for sums described above is only partly true: while

    H(Σ_{i∈[n]} Xi) = H(X1) + Σ_{i=2}^{n} I(Y[i]; Yi),     (6)

we cannot equate H(X1) = H(Y1) to I(Y1; Y1), since the latter is ∞.

Still, one finds that (6) is nonetheless useful; it yields the following upper bounds for the differential entropy of a sum.

Theorem III: [UPPER BOUND FOR DIFFERENTIAL ENTROPY] Let X and Xi, i ∈ [n], be independent R^d-valued random vectors with densities. Suppose α is a fractional covering for the collection C of subsets of [n]. Then

    H(X + Σ_{i∈[n]} Xi) ≤ Σ_{s∈C} αs H(X + Σ_{i∈s} Xi) − (Σ_{s∈C} αs − 1) H(X).

Proof: The modified chain rule (6) for sums implies that

    H(X + Ys) = H(X) + Σ_{i∈s} I(Yi; X + Y_{s∩≤i}).

Thus

    Σ_{s∈C} αs H(X + Ys) = Σ_{s∈C} αs [H(X) + Σ_{i∈s} I(Yi; X + Y_{s∩≤i})]
                         ≥ Σ_{s∈C} αs H(X) + Σ_{s∈C} αs Σ_{i∈s} I(Yi; X + Y_{≤i})        (a)
                         = Σ_{s∈C} αs H(X) + Σ_{i∈[n]} I(Yi; X + Y_{≤i}) Σ_{s∋i} αs      (b)
                         ≥ Σ_{s∈C} αs H(X) + Σ_{i∈[n]} I(Yi; X + Y_{≤i})                 (c)
                         = (Σ_{s∈C} αs − 1) H(X) + H(X + Y[n]),

where (a) follows from (3), (b) follows from an interchange of sums, and (c) follows from the definition of a fractional covering; the last identity in the display again uses the chain rule (6) for sums.

Note in particular that if H(X) ≥ 0, then Theorem III implies

    H(X + Σ_{i∈[n]} Xi) ≤ Σ_{s∈C} αs H(X + Σ_{i∈s} Xi),

which is identical to Theorem II except that the random variable X is an additional summand in every sum.

V. LOWER BOUNDS

In this section, we only consider continuous random vectors. An entropy power inequality relates the entropy power of a sum to that of the summands. We make the following conjecture, which generalizes the known entropy power inequalities.

Conjecture I: [FRACTIONAL SUPERADDITIVITY OF ENTROPY POWER] Let X1, ..., Xn be independent R^d-valued random vectors with densities and finite covariance matrices. For any collection C of subsets of [n], let β be a fractional packing. Then

    e^{2H(X1 + ... + Xn)/d} ≥ Σ_{s∈C} βs e^{2H(Σ_{j∈s} Xj)/d}.     (7)

Equality holds if and only if all the Xi are normal with proportional covariance matrices, and β is a fractional partition.

Remark I: A canonical example of a fractional packing is given by the coefficients

    βs = 1 / r_+(s),     (8)

where r_+(s) = max_{i∈s} r(i) and r(i) is the degree of i. Thus Conjecture I implies that

    e^{2H(X1 + ... + Xn)/d} ≥ Σ_{s∈C} (1 / r_+(s)) e^{2H(Σ_{j∈s} Xj)/d}.
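For independent Gaussian vectors the entropy power takes the explicit form e^{2H(N(0,K))/d} = 2πe |K|^{1/d}, so both the conjectured bound and the weaker proven inequality (9) below reduce to determinant inequalities of Minkowski type (this is the observation, made at the end of Section VI, that Theorem V is the Gaussian case of Conjecture I). The following sketch (Python with numpy; an illustrative spot-check, not part of the paper) evaluates the bound of Remark I for random Gaussian covariances with the collection of all two-element subsets of [3], for which r_+(s) = 2.

# For Gaussians, e^{2H/d} = 2*pi*e * det(K)^(1/d); check the fractional
# superadditivity bound with C = all 2-element subsets of [3] and
# beta_s = 1/r_+(s) = 1/2.  Illustrative sketch with random covariances;
# for normals this is a determinant inequality (Theorem V), so it should hold.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, n = 3, 3

def random_cov(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + np.eye(d)

def entropy_power(K):
    return 2 * np.pi * np.e * np.linalg.det(K) ** (1.0 / K.shape[0])

for _ in range(1000):
    K = [random_cov(d) for _ in range(n)]
    lhs = entropy_power(sum(K))
    rhs = sum(0.5 * entropy_power(K[i] + K[j]) for i, j in combinations(range(n), 2))
    assert lhs >= rhs - 1e-9

print("fractional superadditivity held on all sampled Gaussian examples")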
The following weaker statement is the main result of Madiman and Barron [6]: if r_+ is the maximum degree (over all i ∈ [n]), then

    e^{2H(X1 + ... + Xn)/d} ≥ (1 / r_+) Σ_{s∈C} e^{2H(Σ_{j∈s} Xj)/d}.     (9)

It is easy to see that this generalizes the entropy power inequalities of Shannon-Stam [10], [11] and Artstein, Ball, Barthe and Naor [12].

Of course, one may further conjecture that the entropy power is also supermodular for sums, which would be stronger than Conjecture I. In general, it is an interesting question to characterize the class of all real numbers that can arise as entropy powers of sums of a collection of underlying independent random variables. Then the question of supermodularity of entropy power is just the question of whether this class, which one may call the Stam class to honour Stam’s role in the study of entropy power, is contained in the class of polymatroidal vectors.

There has been much progress in recent decades on characterizing the class of all entropy inequalities for the joint distributions of a collection of (dependent) random variables. Major contributions were made by Han and Fujishige in the 1970’s, and by Yeung and collaborators in the 1990’s; we refer to [13] for a history of such investigations; more recent developments and applications are contained in [5] and [14]. The study of the Stam class above is a direct analogue of the study of the entropic vectors by these authors. Let us point out some particularly interesting aspects of this analogy, using Xs to denote (Xi : i ∈ s) and H to denote (with some abuse of notation) the discrete entropy, where (Xi : i ∈ [n]) is a collection of dependent random variables taking values in some discrete space. Fujishige [3] proved that h(s) = H(Xs) is a submodular set function (analogous to the conjecture on the supermodularity of entropy power), and Madiman and Tetali [1] proved that

    Σ_{s∈C} βs H(Xs | X_{s^c}) ≤ h([n]) ≤ Σ_{s∈C} αs h(s),     (10)

the upper bound of which is analogous to the entropy power inequalities for sums suggested in Conjecture I. However, it does not seem to be easy to derive entropy lower bounds for sums and inequalities for joint distributions using a common framework; even the recent innovative treatment of entropy power inequalities by Rioul [15], which partially addresses this issue, requires some delicate analysis in the end to justify the vanishing of higher-order terms in a Taylor expansion. On the other hand, inequalities for joint distributions tend to rely entirely on elementary facts such as chain rules. Nonetheless, the formal similarity begs the question of how far this parallelism extends; in particular, are there “non-Shannon-type” inequalities for entropy powers of sums that impose restrictions beyond those imposed by the conjectured supermodularity?

VI. APPLICATION TO DETERMINANTS

It has long been known that the probabilistic representation of a matrix determinant using multivariate normal probability density functions can be used to prove useful matrix inequalities (see, for instance, Bellman’s classic book [16]). A new variation on the theme of using properties of normal distributions to deduce inequalities for positive-definite matrices was begun by Cover and El Gamal [17], and continued by Dembo, Cover and Thomas [18] and Madiman and Tetali [1]. Their idea was to use the representation of the determinant in terms of the entropy of normal distributions, rather than the integral representation of the determinant in terms of the normal density. They showed that the classical Hadamard, Szasz and more general inequalities relating the determinants of square submatrices followed from properties of the joint entropy function. Our development below also uses the representation of the determinant in terms of entropy, namely, the fact that the entropy of the d-variate normal distribution with covariance matrix K, written N(0, K), is given by

    H(N(0, K)) = (1/2) log((2πe)^d |K|).

The ground is now prepared to present an upper bound on the determinant of a sum of positive matrices.

Theorem IV: [UPPER BOUND ON DETERMINANT OF SUM] Let K and Ki be positive matrices of dimension d. Then the function D(s) = log |Σ_{i∈s} Ki|, defined on the power set of [n], is submodular. Furthermore, for any fractional covering α using any collection of subsets C of [n],

    |K + K1 + ... + Kn| ≤ |K|^{−a} ∏_{s∈C} |K + Σ_{j∈s} Kj|^{αs},

where a = Σ_{s∈C} αs − 1.

Proof: Substituting normals in Theorems I and III gives the result.

As a corollary, one obtains for instance

    |K + K1 + ... + Kn|^{n−1} |K| ≤ ∏_{i∈[n]} |K + Σ_{j≠i} Kj|.
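Theorem IV and its corollary lend themselves to a quick numerical sanity check. The sketch below (Python with numpy; illustrative only, with randomly generated positive-definite matrices and n = 4 chosen arbitrarily) verifies the corollary displayed above.

# Numerical check of the corollary to Theorem IV:
#   |K + K1 + ... + Kn|^(n-1) * |K| <= prod_i |K + sum_{j != i} Kj|
# for random positive-definite matrices (illustrative sketch only).
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 4

def random_pd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + np.eye(d)

det = np.linalg.det
for _ in range(1000):
    K = random_pd(d)
    Ks = [random_pd(d) for _ in range(n)]
    lhs = det(K + sum(Ks)) ** (n - 1) * det(K)
    rhs = np.prod([det(K + sum(Ks) - Ks[i]) for i in range(n)])
    assert lhs <= rhs * (1 + 1e-9)

print("corollary to Theorem IV held on all sampled examples")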
Proposition I:[G ENERALIZED M INKOWSKI DETERMINANT
power inequalities by Rioul [15], which partially addresses
INEQUALITY ] Let K1 , . . . , Kn be d × d positive matrices.
this issue, requires some delicate analysis in the end to justify
Let C be a a collection of subsets of [n] maximum degree r+ .
the vanishing of higher-order terms in a Taylor expansion.
Then
On the other hand, inequalities for joint distributions tend
1
to rely entirely on elementary facts such as chain rules. 1 1 X X d
Nonetheless the formal similarity begs the question of how |K1 + . . . + Kn | d ≥ K j .
r+
j∈s
s∈C
far this parallelism extends; in particular, are there “non-
Shannon-type” inequalities for entropy powers of sums that Equality holds if and only if all the matrices Ki are propor-
impose restrictions beyond those imposed by the conjectured tional to each other, and each index appears in exactly r+
supermodularity? subsets.
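The equality condition of Proposition I can be observed concretely: if the Ki are proportional to a common positive-definite matrix and every index appears in exactly r_+ subsets, the two sides coincide. A small sketch (Python with numpy; the common matrix and proportionality constants are arbitrary) illustrates this for the collection of all two-element subsets of [4], where r_+ = 3.

# Equality case of Proposition I: K_i = c_i * K0 with a common matrix K0, and a
# collection in which every index has the same degree r_+ (here all 2-element
# subsets of [4], so r_+ = 3).  The two sides of the inequality then coincide.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
d, n = 3, 4

A = rng.standard_normal((d, d))
K0 = A @ A.T + np.eye(d)                 # arbitrary common positive-definite matrix
c = rng.uniform(0.5, 2.0, size=n)        # arbitrary proportionality constants
Ks = [c[i] * K0 for i in range(n)]

det = np.linalg.det
r_plus = 3                               # degree of every index in the 2-subsets of [4]
lhs = det(sum(Ks)) ** (1.0 / d)
rhs = sum(det(Ks[i] + Ks[j]) ** (1.0 / d) for i, j in combinations(range(n), 2)) / r_plus

print(np.isclose(lhs, rhs))              # True: equality in Proposition I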
In particular, when C is the collection of singleton sets, r_+ = 1, and one recovers the classical inequality. Specifically, if K1 and K2 are d × d symmetric, positive-definite matrices, then

    |K1 + K2|^{1/d} ≥ |K1|^{1/d} + |K2|^{1/d}.

Many proofs of this inequality exist; see, e.g., [19]. Although the bound in Proposition I is better in general than the classical Minkowski bound of |K1|^{1/d} + ... + |Kn|^{1/d}, mathematically the former can be seen as a consequence of the latter. Indeed, below we prove a more general version of Proposition I.

Theorem V: Let K1, ..., Kn be d × d positive matrices. Then for any fractional packing β using the collection C of subsets of [n],

    |K1 + ... + Kn|^{1/d} ≥ Σ_{s∈C} βs |Σ_{j∈s} Kj|^{1/d}.

Equality holds iff the matrices {Σ_{j∈s} Kj, s ∈ C} are proportional.

Proof: First note that for any fractional partition,

    K1 + ... + Kn = Σ_{s∈C} βs Σ_{j∈s} Kj.

Applying the usual Minkowski inequality gives

    |K1 + ... + Kn|^{1/d} ≥ Σ_{s∈C} |βs Σ_{j∈s} Kj|^{1/d} = Σ_{s∈C} βs |Σ_{j∈s} Kj|^{1/d}.

For equality, we clearly need the matrices

    βs Σ_{j∈s} Kj ,  s ∈ C,

to be proportional.

It is not a priori clear that the bounds obtained using Proposition I (or Theorem V) are better than the direct Minkowski lower bound of

    |K1|^{1/d} + ... + |Kn|^{1/d},

but we observe that they are in general better. Consider the collection Cn−1 of leave-one-out sets, i.e., C = {s ⊂ [n] : |s| = n − 1}. The maximum degree here is r_+ = n − 1, so

    |K1 + ... + Kn|^{1/d} ≥ (1 / (n − 1)) Σ_{s∈Cn−1} |Σ_{j∈s} Kj|^{1/d}.     (11)

Lower bounding each term on the right by iterating this inequality, and using Ck to denote the collection of all sets of size k, one obtains

    |K1 + ... + Kn|^{1/d} ≥ (2 / ((n − 1)(n − 2))) Σ_{s∈Cn−2} |Σ_{j∈s} Kj|^{1/d}.

Repeating this procedure yields the hierarchy of inequalities

    |K1 + ... + Kn|^{1/d} ≥ LB_{n−1} ≥ LB_{n−2} ≥ ... ≥ LB_1,

where LBk is the lower bound given by Proposition I using Ck. Note that the direct Minkowski lower bound, namely LB1, is the worst of the hierarchy.

Note that Theorem V gives evidence towards Conjecture I, since it is simply the special case of Conjecture I for multivariate normals! (Conversely, the inequality (9) may be used to give an information-theoretic proof of Proposition I, which is a special case of Theorem V.)

There are other interesting related entropy and matrix inequalities that we do not have space to cover in this note; these can be found in the full paper [9].

ACKNOWLEDGMENT

The author would like to thank Andrew Barron and Prasad Tetali, who have collaborated with him on related work.

REFERENCES

[1] M. Madiman and P. Tetali, “Information inequalities for joint distributions, with interpretations and applications,” Submitted, 2007.
[2] T. S. Han, “Nonnegative entropy measures of multivariate symmetric correlations,” Information and Control, vol. 36, no. 2, pp. 133–156, 1978.
[3] S. Fujishige, “Polymatroidal dependence structure of a set of random variables,” Information and Control, vol. 39, pp. 55–72, 1978.
[4] F. Chung, R. Graham, P. Frankl, and J. Shearer, “Some intersection theorems for ordered sets and graphs,” J. Combinatorial Theory, Ser. A, vol. 43, pp. 23–37, 1986.
[5] F. Matúš, “Two constructions on limits of entropy functions,” IEEE Trans. Inform. Theory, vol. 53, no. 1, pp. 320–330, 2007.
[6] M. Madiman and A. Barron, “Generalized entropy power inequalities and monotonicity properties of information,” IEEE Trans. Inform. Theory, vol. 53, no. 7, pp. 2317–2329, July 2007.
[7] T. Cover and J. Thomas, Elements of Information Theory. New York: J. Wiley, 1991.
[8] B. Bollobás and I. Leader, “Compressions and isoperimetric inequalities,” J. Combinatorial Theory, Ser. A, vol. 56, no. 1, pp. 47–62, 1991.
[9] M. Madiman, “The entropy of sums of independent random elements,” Preprint, 2008.
[10] C. Shannon, “A mathematical theory of communication,” Bell System Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
[11] A. Stam, “Some inequalities satisfied by the quantities of information of Fisher and Shannon,” Information and Control, vol. 2, pp. 101–112, 1959.
[12] S. Artstein, K. M. Ball, F. Barthe, and A. Naor, “Solution of Shannon’s problem on the monotonicity of entropy,” J. Amer. Math. Soc., vol. 17, no. 4, pp. 975–982 (electronic), 2004.
[13] J. Zhang and R. Yeung, “On characterization of entropy function via information inequalities,” IEEE Trans. Inform. Theory, vol. 44, pp. 1440–1452, 1998.
[14] R. Dougherty, C. Freiling, and K. Zeger, “Networks, matroids and non-Shannon information inequalities,” IEEE Trans. Inform. Theory, vol. 53, no. 6, pp. 1949–1969, June 2007.
[15] O. Rioul, “A simple proof of the entropy-power inequality via properties of mutual information,” Proc. IEEE Intl. Symp. Inform. Theory, Nice, France, pp. 46–50, 2007.
[16] R. Bellman, Introduction to Matrix Analysis. McGraw-Hill, 1960.
[17] T. Cover and A. El Gamal, “An information theoretic proof of Hadamard’s inequality,” IEEE Trans. Inform. Theory, vol. 29, no. 6, pp. 930–931, 1983.
[18] A. Dembo, T. Cover, and J. Thomas, “Information-theoretic inequalities,” IEEE Trans. Inform. Theory, vol. 37, no. 6, pp. 1501–1518, 1991.
[19] R. Bhatia, Matrix Analysis. Springer, 1996.