The Asymptotic Loss of Information for Grouped Data

Journal of Multivariate Analysis 67, 99-127 (1998), Article No. MV981759

Klaus Felsenstein, Technical University Vienna, Vienna, Austria, and Klaus Pötzelberger, University of Economics and Business Administration, Vienna, Austria

Received December 2, 1996; revised March 12, 1998

We study the loss of information (measured in terms of the Kullback-Leibler distance) caused by observing "grouped" data, i.e., observing only a discretized version of a continuous random variable. We analyze the asymptotic behaviour of the loss of information as the partition becomes finer. In the case of a univariate observation, we compute the optimal rate of convergence and characterize asymptotically optimal partitions (into intervals). In the multivariate case we derive the asymptotically optimal regular sequences of partitions. Furthermore, we compute the asymptotically optimal transformation of the data when a sequence of partitions is given. Examples demonstrate the efficiency of the suggested discretizing strategy even for few intervals. © 1998 Academic Press

Key words and phrases: asymptotically optimal discretization; grouped data; Kullback-Leibler distance; optimal quantizer; optimal design.

1. INTRODUCTION

In statistical applications various phenomena concerning discretization or grouping of data arise. It is common that instead of a continuous random variable $X$, a discrete approximation of $X$ is observed. More precisely, assume that the sample space $\mathcal{X} \subseteq \mathbb{R}^n$ is partitioned into measurable subsets $\{H_1, \dots, H_k\} =: \mathcal{H}_k$. Only $H(X)$ is observed, where $H(X)$ denotes the unique $H_i \in \mathcal{H}_k$ with $X \in H_i$. Due to the central importance of this approximation, an extensive literature on various approaches exists. Many papers deal with the correction or adaptation of procedures for grouped data, such as variations of Sheppard's correction (see Dempster and Rubin, 1983). In particular, effects of grouping on the maximum likelihood estimate or on Bayes estimates are discussed in the statistical literature (see Dempster, Laird, and Rubin (1977), Lindley (1950), or Heitjan (1989) for a review).

The design problem of choosing at least asymptotically optimal partitions is discussed in the electrical engineering literature (see Benitz and Bucklew, 1989). We consider the loss of information due to this discretizing process, with emphasis on the asymptotic case ($k \to \infty$). We intend to give an analysis and characterizations of asymptotically optimal partitions. Analogous to the problem of optimal quantization for $L_p$-distances, only in the univariate case does a complete characterization of asymptotically optimal partitions seem possible.

Regular sequences of partitions are of central importance in the analysis. First, optimal regular sequences are optimal among all sequences for $n=1$. Second, they provide a simple mechanism for constructing good partitions in the multivariate case. In practice, the numerical calculation of optimal intervals turns out to be remarkably difficult even for small values of $k$. Therefore, a practicable and schematic procedure that yields at least asymptotically optimal intervals seems to be helpful and necessary. Furthermore, for regular sequences of partitions, the design problem of choosing optimal partitions may be inverted in the following sense: if the partition is given, we find an optimal transformation of the data in order to minimize asymptotically the loss of information. This situation is common and arises, for example, if the data are rounded.

In contrast to the existing literature we do not study the effect of discretizing on a single distribution, but rather on a model, i.e., a set of distributions indexed by a parameter. Our measure of the loss of information is the expected Kullback-Leibler distance of the two posterior densities of the parameter $\theta$ resulting if $X$ or $H(X)$, respectively, is observed. From the Bayesian point of view the Kullback-Leibler distance represents the "natural" method for measuring the difference of the posterior distributions, and it has a decision-theoretical interpretation as loss of information. Note that from a frequentist point of view, our distance is a weighted Kullback-Leibler distance between the conditional and the marginal distribution of $X$ restricted to $H(X)$, with a weight on the parameter space and the partition. Beyond that, our results hold for a general class of loss functions. For example, an asymptotically optimal partition remains optimal if we consider the squared Hellinger distance or a similar I-divergence type of distance measure.

First, we introduce some notation. Let $(\mathcal{X}, \mathcal{B})$ be a measurable space with $\mathcal{X} \subseteq \mathbb{R}^n$ measurable and $\mathcal{B}$ the Borel $\sigma$-algebra. The model consists of a family of conditional distributions of the variable $X$ with densities $f_X(x \mid \theta)$ with respect to the $n$-dimensional Lebesgue measure, and a prior density $\pi(\theta)$ on an arbitrary parameter space $\Theta$. $F_X(H \mid \theta)$ denotes the conditional probability of a measurable $H \subseteq \mathbb{R}^n$. (For univariate observations, i.e., $n=1$, we consider partitions into intervals only.) The corresponding density and probability of $H$ for the marginal distribution of $X$ are $f_X(x)$

and $F_X(H)$, respectively. For a fixed partition $\mathcal{H}_k = \{H_1, \dots, H_k\}$ into disjoint measurable sets $H_i$, the distribution of $H(X)$ is multinomial with parameter $(F_X(H_1 \mid \theta), \dots, F_X(H_k \mid \theta))$. The posterior densities are $\pi(\theta \mid X)$ and $\pi(\theta \mid H(X))$, where $\pi(\theta \mid H(X)) = F_X(H(X) \mid \theta)\,\pi(\theta) / F_X(H(X))$.

In the sequel we consider the squared Hellinger distance and the expected Kullback-Leibler distance. The Hellinger distance $D(f, g)$ of two densities $f, g$ is

$$D(f, g) = \left\{ \tfrac{1}{2} \int \left( \sqrt{f(t)} - \sqrt{g(t)} \right)^2 dt \right\}^{1/2}.$$

We denote the squared Hellinger distance of $\pi(\theta \mid x)$ and $\pi(\theta \mid H(x))$ by $D_k^2(x) := D(\pi(\theta \mid x), \pi(\theta \mid H(x)))^2$ and the expected squared Hellinger distance by $D_k^2 := \int D_k^2(x)\, f_X(x)\, dx$. Given the observation $x$, the Kullback-Leibler distance is

$$I_k(x) = \int \log \left( \frac{\pi(\theta \mid x)}{\pi(\theta \mid H(x))} \right) \pi(\theta \mid x)\, d\theta$$

and the expected Kullback-Leibler distance is $I_k = \int I_k(x)\, f_X(x)\, dx$. If necessary we indicate the dependence of $I_k$ or $D_k^2$ on $\mathcal{H}_k$ by writing $I_k(\mathcal{H}_k)$ or $D_k^2(\mathcal{H}_k)$.

For $n=1$, those partitions whose interval endpoints are quantiles of a distribution play a special role in the analysis of the asymptotic behaviour of partitions.

Definition. Let $G, G_1, \dots, G_n$ denote cumulative distribution functions on $\mathbb{R}$ with quantile functions $Q = G^{-1}, Q_1 = G_1^{-1}, \dots, Q_n = G_n^{-1}$. A sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$, with $\mathcal{H}_k = \{H_1^k, \dots, H_k^k\}$, is called a regular sequence of partitions corresponding to $G$ or to $G_1, \dots, G_n$, respectively, if

(i) for $n=1$, $H_i^k = \left] h_{i-1}^k, h_i^k \right]$ and $h_i^k = Q(i/k)$, $i = 0, \dots, k$;

(ii) for $n>1$, $k_1, \dots, k_n \in \mathbb{N}$ exist with $k = \prod_{i=1}^n k_i$ and $H_i^k = H_{1,j_1} \times H_{2,j_2} \times \cdots \times H_{n,j_n}$ for intervals $H_{r,j_r} = \left] Q_r((j_r - 1)/k_r),\, Q_r(j_r/k_r) \right]$, for some $j_1, \dots, j_n \in \mathbb{N}$ with $j_r \le k_r$.
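For $n=1$ the definition is constructive: the endpoints of a regular partition are simply the $i/k$-quantiles of the generating distribution $G$. The following minimal sketch is not from the paper; SciPy's standard normal quantile function merely serves as an arbitrary example for $Q$:

    from scipy.stats import norm

    def regular_partition(Q, k):
        """Endpoints h_0 < h_1 < ... < h_k of a regular partition of size k.

        Q is the quantile function G^{-1} of the generating distribution;
        the i-th interval of the partition is ]h_{i-1}, h_i].
        """
        return [Q(i / k) for i in range(k + 1)]

    # Example: k = 5 intervals generated by the standard normal c.d.f.
    print(regular_partition(norm.ppf, 5))
    # [-inf, -0.8416..., -0.2533..., 0.2533..., 0.8416..., inf]

By construction each interval carries probability $1/k$ under $G$; the results below determine which $G$ makes such a partition asymptotically optimal.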

It is essential that $Q, Q_1, \dots, Q_n$ do not depend on $k$. For $n > 1$, regular partitions are partitions into rectangles. Rectangles are only one possibility for generalizing intervals to higher dimensions, and partitions into rectangles cannot be expected to be optimal, for instance, among partitions into convex sets or partitions into polyhedra. For $n = 1$, regular partitions enable the construction of an asymptotically optimal sequence of partitions.

Let $I_k^* = \inf_{\mathcal{H}_k} I_k(\mathcal{H}_k)$, where the infimum is taken over all partitions into intervals with size $k$. A sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ is called asymptotically optimal if

$$\limsup_{k \to \infty} I_k(\mathcal{H}_k)/I_k^* = 1. \tag{1}$$

We will show that (1) is equivalent to $\limsup_{k \to \infty} D_k^2(\mathcal{H}_k)/D_k^{2*} = 1$, where $D_k^{2*} = \inf_{\mathcal{H}_k} D_k^2(\mathcal{H}_k)$.

In this paper we obtain the following results. First, we prove that for $n=1$ the optimal rate of convergence for $k \to \infty$ is quadratic in $1/k$, i.e., $I_k^* = O(1/k^2)$. For $n>1$, we show that the optimal rate among regular sequences is $O(1/k^{2/n})$. Furthermore, we characterize these optimal regular sequences of partitions. Sufficient and necessary conditions for the asymptotic optimality of arbitrary sequences of partitions are given for $n=1$. We characterize the distribution $G$ leading to the asymptotically optimal sequence of regular partitions and show that these partitions remain asymptotically optimal among arbitrary partitions. This leads to the optimal transformation of the data if the partition is fixed. We demonstrate by examples the "efficiency" of our method even for small $k$.

To obtain the results mentioned above it is necessary to state regularity conditions. The conditions we use are fulfilled for most of the relevant statistical models. In (regular) cases where our conditions fail it seems possible to modify the regularity conditions appropriately.

For technical reasons we prove the results for the Hellinger distance first. Then we show a relation between the Hellinger distance and the Kullback-Leibler distance that reveals the equal asymptotic behaviour of both distances.

The paper is organized as follows. Regularity conditions are stated and discussed in Section 2. Section 3 gives the results. In Section 4 we present examples and show that asymptotically optimal partitions are good approximations to optimal partitions for finite $k$. In Section 5 the proofs of the results of Section 3 and of the necessary technical lemmas are provided.

2. REGULARITY CONDITIONS

The following regularity conditions are necessary to provide a precise mathematical formulation of the results. Let

$$\Psi_X(x, \theta) := \frac{f_X'(x \mid \theta)}{f_X(x \mid \theta)} - \frac{f_X'(x)}{f_X(x)}$$

with $f_X'(x \mid \theta) = \frac{\partial}{\partial x} f_X(x \mid \theta)$ and $f_X'(x) = \frac{\partial}{\partial x} f_X(x)$. The posterior variance of $\Psi_X(x, \theta)$ is denoted by

$$\sigma_X^2(x) := \int \Psi_X^2(x, \theta)\, \pi(\theta \mid x)\, d\theta.$$

C1. $f_X'(x \mid \theta)$ exists for all $\theta \in \Theta$; $f_X'(x \mid \theta)$ and $f_X'(x)$ are continuous in $x$ and $f_X'(x \mid \theta)$ in $\theta$; and $\int f_X'(x \mid \theta)\, \pi(\theta)\, d\theta = f_X'(x)$.

C2. The support of $X$ given $\theta$ is independent of $\theta$.

C3. $\sigma_X^2(x)$ is continuous, $\int\!\int \Psi_X^2(x, \theta)\, f_X^2(x)\, \pi(\theta \mid x)\, d\theta\, dx < \infty$, and $\Psi_X^2(x, \theta)\, f_X^2(x)\, \pi(\theta \mid x)$ is bounded in $x$ for any fixed $\theta$.

The behavior of the marginal distribution near the boundary of the support (if it is not bounded) ought to be smooth in some sense. To state such "smoothness conditions" we introduce the (left- and right-hand) hazard rates,

$$r_R(x) := \frac{f_X(x)}{1 - F_X(x)}, \qquad r_L(x) := \frac{f_X(x)}{F_X(x)}.$$

Our assumptions concerning the marginal distribution are:

C4. Constants $c_L, c_R, c_0 > 0$ and $\alpha > 0$ exist with $c_L \le c_R$ such that

(i) $f_X(\cdot)$ is increasing on $\left]-\infty, c_L\right[$, decreasing on $\left]c_R, \infty\right[$, and positive on $[c_L, c_R]$;

(ii) for $0 < x < y$ with $F_X^{-1}(y) < c_L$ it holds that

$$\left| \frac{F_X^{-1}(x)}{F_X^{-1}(y)} \right| < c_0 \left| \frac{y}{x} \right|^{\alpha}$$

and for $x < y < 1$ with $F_X^{-1}(x) > c_R$ it holds that

$$\frac{F_X^{-1}(y)}{F_X^{-1}(x)} \le c_0 \left| \frac{1-x}{1-y} \right|^{\alpha};$$

(iii) for any constant $c_1 > 0$ a $c_2 > 0$ exists such that

$$\frac{r_R(y)}{r_R(x)} \le c_2 \quad \text{for } c_R \le x \le y \le x + c_1 x$$

and

$$\frac{r_L(y)}{r_L(x)} \le c_2 \quad \text{for } x - c_1 x \le y \le x \le c_L.$$

Condition (iii) is fulfilled if the hazard rates are monotone (in particular if $r_L$ is decreasing and $r_R$ is increasing).

C5. For any compact interval $K$, functions $\Delta(\delta, \theta)$ and $\Delta(\delta)$ exist such that for $x, \tilde{x} \in K$ with $|x - \tilde{x}| \le \delta$,

$$\left| f_X'(\tilde{x} \mid \theta) - f_X'(x \mid \theta) \right| \le \Delta(\delta, \theta)\, f_X(x \mid \theta),$$

with $\int \Delta^2(\delta, \theta)\, \pi(\theta \mid x)\, d\theta \le \Delta^2(\delta)$ and $\lim_{\delta \to 0} \Delta^2(\delta) = 0$.

C6. The integral $\int \Psi_X^4(x, \theta)\, \pi(\theta \mid x)\, d\theta$ exists and is continuous in $x$.

C7. The integral $\int\!\int f_X(x \mid \theta)\, \Psi_X(x, \theta)^4\, \pi(\theta \mid x)\, f_X^3(x)\, d\theta\, dx$ is finite.

C8. For any $\delta > 0$ and any $x$ in the interior of the support let

$$\rho_\delta(x, \theta) := \sup_{u, v \in U_\delta(x)} \frac{|f_X'(u \mid \theta)|^3}{f_X(u \mid \theta)^{3/2}\, f_X(v \mid \theta)^{1/2}},$$

where $U_\delta(x) = \{ y : |x - y| < \delta \}$. Then for any compact $K$, a $\delta > 0$ exists with $\int_K \int \rho_\delta(x, \theta)\, \pi(\theta)\, d\theta\, dx < \infty$.

We close this section by commenting briefly on the regularity conditions C1-C8. Conditions C1 and C2 guarantee that objects like $\sigma_X^2(x)$, or derivatives of densities, exist and that the order of integration and differentiation may be interchanged. In this sense, they are automatically necessary or at least not problematic. Condition C4 implies that a few intervals in the tails of the distributions do not count asymptotically. Condition C4 can be weakened for specific models and is thus not necessary; however, a condition like C4 will always be needed. Note that C4 holds if the hazard rates $r_R$ and $r_L$ do not oscillate too wildly near $\infty$ or $-\infty$. Conditions C5 and C8 are conditions on the uniform smoothness of $f_X(\cdot \mid \theta)$ on compact subsets of $\mathbb{R}$. They are usually not problematic. Note that in any event we must assume that $f_X(\cdot \mid \theta)$ is differentiable. Conditions C3, C6, and C7 are moment conditions that simplify the proofs of the lemmas. C7 is the only condition that reads differently for $X$ and $Y = F_X(X)$; see the proof of Lemma 4.

3. ASYMPTOTICALLY OPTIMAL PARTITIONS

In this section we state the main results concerning convergence rates and asymptotically optimal partitions for the expected squared Hellinger distance and the expected Kullback-Leibler distance between the two posterior densities. The results for the expected squared Hellinger distance are given for $n=1$ only. For the expected Kullback-Leibler distance, in addition to the case $n=1$, convergence rates and optimal regular partitions are considered for $n>1$.

Let $n=1$. We show that the rate of convergence cannot exceed $O(1/k^2)$, with $k$ the size of the partition. Define, for a partition $\mathcal{H}_k = \{H_1, \dots, H_k\}$ with $H_i = \left] h_{i-1}, h_i \right]$ and $\delta_i = F_X(h_i) - F_X(h_{i-1})$,

$$p_k^*(x) = \frac{\sum_{i=1}^k I_{H_i}(x)\, \delta_i^2\, f_X(x)}{\sum_{i=1}^k \delta_i^3}.$$

The following theorem summarizes the main results on the Hellinger distance.

Theorem 1. Assume that the regularity conditions C1-C6 hold for the joint distribution of $X$ and $\theta$. Then:

(i) Any sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ satisfies

$$\liminf_{k \to \infty} k^2 D_k^2 \ge \frac{1}{96} \left( \int \left( \sigma_X^2(x)\, f_X(x) \right)^{1/3} dx \right)^3.$$

(ii) A sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ is asymptotically optimal if and only if $\sum_{i=1}^k \delta_i^3 = O(1/k^2)$ and if for any compact interval $K \subseteq \mathbb{R}$ with $\int_K (\sigma_X^2(x) f_X(x))^{-2/3}\, dx > 0$,

$$\frac{p_k^*(x)\, I_K(x)}{\int_K p_k^*(x)\, dx} \longrightarrow \frac{(\sigma_X^2(x)\, f_X(x))^{-2/3}\, I_K(x)}{\int_K (\sigma_X^2(x)\, f_X(x))^{-2/3}\, dx} \quad \text{weakly}.$$

(iii) For a regular sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ generated by $g = G'$, where $D_k^2 = O(1/k^2)$ and $\int \sigma_X^2(x) f_X(x)/g^2(x)\, dx < \infty$, we have

$$D_k^2 = \frac{1}{96 k^2} \int \frac{\sigma_X^2(x)\, f_X(x)}{g^2(x)}\, dx + o(1/k^2). \tag{2}$$

(iv) The regular sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ generated by

$$g(x) \propto (\sigma_X^2(x)\, f_X(x))^{1/3} \tag{3}$$

is asymptotically optimal.

Remarks. 1. Conditions C1-C6 imply that $\int (\sigma_X^2(x) f_X(x))^{1/3}\, dx < \infty$, so that $g$ defined by (3) is a density.

2. We prove (2) if $\int \sigma_X^2(x) f_X(x)/g^2(x)\, dx < \infty$ and $\sum_{i=1}^k \delta_i^3 = O(1/k^2)$. It can be shown that a regular sequence of partitions is not of order $O(1/k^2)$ if $\int \sigma_X^2(x) f_X(x)/g^2(x)\, dx = \infty$, in the sense that no Riemann sum converges to a finite limit.

3. Part (iv) claims that the regular sequence corresponding to $g(x) \propto (\sigma_X^2(x) f_X(x))^{1/3}$ is asymptotically optimal among all sequences of partitions, not only among regular sequences.

4. Asymptotically, optimality is a question of the behaviour, with respect to weak convergence, of the empirical distributions of the endpoints $h_i$ if $n=1$: any asymptotically negligible transformation of the endpoints maintains asymptotic optimality. Analogously to optimal quantization with $L_p$-distances, we conjecture that for $n>1$ not only the empirical distribution of the partition matters asymptotically, but also its geometric features.

Whereas the Hellinger distance of two densities always exists, the Kullback-Leibler distance need not. Therefore we require the additional assumptions C7 and C8 for the distributions of $(X, \theta)$ and $Y = F_X(X)$.

Theorem 2. Let $n=1$ and assume that the regularity conditions C1-C8 hold for the joint distribution of $X$ and $\theta$. Then:

(i) For sequences of partitions $(\mathcal{H}_k)$ with $\sum_i (F_X^{-1}(h_i) - F_X^{-1}(h_{i-1}))^3 = O(1/k^2)$ and any compact interval $K$ contained in the interior of the support of $f_X$,

$$\lim_{k \to \infty} \frac{\int_K I_k(x)\, f_X(x)\, dx}{\int_K D_k(x)^2\, f_X(x)\, dx} = 4,$$

provided that $\int_K D_k(x)^2 f_X(x)\, dx > 0$.

(ii) If $I_k = O(1/k^2)$ (for $k \to \infty$) holds for a sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$, then $\lim_{k \to \infty} I_k / D_k^2 = 4$.

(iii) A sequence of partitions is asymptotically optimal for the Kullback-Leibler distance if and only if it is asymptotically optimal for the squared Hellinger distance.

Theorems 1 and 2 imply the following main theorem concerning the asymptotic behavior of the Kullback-Leibler distance.

Theorem 3. Let $n=1$ and assume that the regularity conditions C1-C8 hold for the joint distribution of $X$ and $\theta$. Then:

(i) Any sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ satisfies

$$\liminf_{k \to \infty} k^2 I_k \ge \frac{1}{24} \left( \int (\sigma_X^2(x)\, f_X(x))^{1/3}\, dx \right)^3.$$

(ii) For a regular sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ generated by $g = G'$, where $I_k = O(1/k^2)$ and $\int \sigma_X^2(x) f_X(x)/g^2(x)\, dx < \infty$, we have

$$I_k = \frac{1}{24 k^2} \int \frac{\sigma_X^2(x)\, f_X(x)}{g^2(x)}\, dx + o(1/k^2).$$

(iii) The regular sequence of partitions $(\mathcal{H}_k)_{k=1}^{\infty}$ generated by $g(x) \propto (\sigma_X^2(x)\, f_X(x))^{1/3}$ is asymptotically optimal.
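Theorem 3(iii) translates directly into a numerical recipe: discretize $(\sigma_X^2 f_X)^{1/3}$, normalize it to a density $g$, and cut at the $i/k$-quantiles of its c.d.f. $G$. The sketch below does this on a grid; the grid quadrature and interpolation are our own simplification, not part of the paper:

    import numpy as np

    def optimal_endpoints(x, sigma2, f, k):
        """Interior endpoints h_1 < ... < h_{k-1} of the asymptotically
        optimal partition of Theorem 3(iii).

        x      : increasing, equally spaced grid covering (numerically) the support of f_X
        sigma2 : posterior variance sigma_X^2 evaluated on the grid
        f      : marginal density f_X evaluated on the grid
        k      : number of intervals
        """
        g = (sigma2 * f) ** (1.0 / 3.0)   # unnormalized optimal density, Eq. (3)
        G = np.cumsum(g)
        G = G / G[-1]                     # crude c.d.f. of g on the grid
        # h_i = G^{-1}(i/k), inverted by linear interpolation
        return np.interp(np.arange(1, k) / k, G, x)

For the normal model of Section 4, where $\sigma_X^2$ is constant, this reproduces the quantiles of $N(\mu, 3(d^2 + \sigma^2))$ up to discretization error.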

Finally, we deal with the problem of an optimal transformation $Y = T(X)$ of the data if a regular sequence of partitions is given. This is the most common situation in applications. Take, for instance, rounding in $\left]0, 1\right]$. This corresponds to the regular sequence of partitions $(\left](i-1)/k,\, i/k\right])_{i=1}^k$. It is reasonable to choose the transformation $T$ such that the partitions $(\left](i-1)/k,\, i/k\right])_{i=1}^k$ are asymptotically optimal for $Y = T(X)$. Not unexpectedly, $T = G$ is again optimal, where $G$ is the c.d.f. of $g(x) \propto (\sigma_X^2(x)\, f_X(x))^{1/3}$. Note that if $(\left](i-1)/k,\, i/k\right])_{i=1}^k$ is asymptotically optimal for $Y = T(X)$, then $(\left]T^{-1}((i-1)/k),\, T^{-1}(i/k)\right])_{i=1}^k$ is asymptotically optimal for $X$. Since the $G^{-1}(i/k)$ are asymptotically optimal endpoints of the intervals for $X$, the transformation $T = G$ leads to the optimality of $h_i = i/k$ for $Y$. Therefore the optimal procedure is to transform the data according to $G(X)$ if the values are rounded.

In Proposition 1, $I_k(T)$ denotes the Kullback-Leibler distance between the posterior distribution of $\theta$ given $T(X)$ and the posterior distribution of $\theta$ given $H(T(X))$, where $\mathcal{H}_k$ corresponds to a regular sequence of partitions defined by a c.d.f. $R$.

Proposition 1. Let a regular sequence of partitions corresponding to the c.d.f. $R$ be given, with $R$ invertible. Let $G$ be the c.d.f. with $G'(x) = g(x) \propto (\sigma_X^2(x)\, f_X(x))^{1/3}$. Then $T_0 = R^{-1} \circ G$ is asymptotically optimal, i.e.,

$$\limsup_{k \to \infty} I_k(T_0)/I_k(T) \le 1$$

for all monotone transformations $T$ from the support of $F_X$ onto the support of $R$.
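In code, Proposition 1 amounts to a single preprocessing step before the data are discretized. A hedged sketch (the helper name and the assumption that $G$ and $R^{-1}$ are available as vectorized callables are ours):

    import numpy as np

    def transform_before_grouping(x_data, G, R_inv):
        """Proposition 1: record H(T_0(X)) with T_0 = R^{-1} o G instead of H(X).

        G     : c.d.f. whose density is proportional to (sigma_X^2 f_X)^{1/3}
        R_inv : inverse of the c.d.f. R describing the given regular partition
        """
        return R_inv(G(np.asarray(x_data)))

    # Rounding on ]0,1] corresponds to R(x) = x, so R_inv is the identity and
    # the optimal preprocessing is simply y = G(x); the rounded y is recorded.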

For more than one observation ($n > 1$) we need additional notation. Let $n \in \mathbb{N}$ denote the dimension of the observation and $k \in \mathbb{N}$ the number of rectangles into which $\mathbb{R}^n$ is partitioned by $\mathcal{H}_k$. This partition induces a partition $\mathcal{H}_i^k$ of each component into intervals. Let $k_i$ denote the number of intervals in the $i$th component; then $k = \prod_{i=1}^n k_i$. We assume that, given the parameter $\theta$, the random variables $X_i$ are independent, with p.d.f. $f_{i,X}(\cdot \mid \theta)$ with respect to Lebesgue measure. For $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, let $H(x) = (H(x_1), \dots, H(x_n))$, where $H(x_i)$ is the unique interval in $\mathcal{H}_i^k$ with $x_i \in H(x_i)$. Furthermore, let $z_i = (H(x_1), \dots, H(x_{i-1}), x_{i+1}, \dots, x_n)$ and $x_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$, with obvious modifications for $i=1$ or $i=n$. Now the Kullback-Leibler distance of the posterior densities corresponding to $X$ and $H(X)$ reads

$$I_k^{(n)}(x) = \int \log \left( \frac{\pi(\theta \mid x_1, \dots, x_n)}{\pi(\theta \mid H(x_1), \dots, H(x_n))} \right) \pi(\theta \mid x_1, \dots, x_n)\, d\theta$$

with expectation

$$I_k^{(n)} = \int I_k^{(n)}(x)\, f_X(x)\, dx = \int\!\int \log \left( \frac{\prod_{i=1}^n f_{i,X}(x_i \mid \theta)}{f_X(x_1, \dots, x_n)} \cdot \frac{F_X(H(x))}{\prod_{i=1}^n F_{i,X}(H(x_i) \mid \theta)} \right) \pi(\theta \mid x)\, d\theta\, f_X(x)\, dx.$$

Thus

$$I_k^{(n)} = \sum_{i=1}^n \int I_k(x_i)\, f_{i,X}(x_i)\, dx_i - \int \log \left( \frac{f_X(x)}{\prod_{i=1}^n f_{i,X}(x_i)} \cdot \frac{\prod_{i=1}^n F_{i,X}(H(x_i))}{F_X(H(x))} \right) f_X(x)\, dx$$
$$= \sum_{i=1}^n \left\{ \int I_k(x_i)\, f_{i,X}(x_i)\, dx_i - \int \log \left( \frac{f_{i,X}(x_i \mid z_i)}{f_{i,X}(x_i)} \cdot \frac{F_{i,X}(H(x_i))}{F_{i,X}(H(x_i) \mid z_i)} \right) f(z_i, x_i)\, dz_i\, dx_i \right\}$$
$$= \sum_{i=1}^n \left\{ \int I_k(x_i)\, f_{i,X}(x_i)\, dx_i - \int \tilde{I}_k(x_i)\, f_{i,X}(x_i)\, dx_i \right\},$$

with

$$\tilde{I}_k(x_i) = \int \log \left( \frac{f_{i,X}(z_i \mid x_i)}{f_{i,X}(z_i \mid H(x_i))} \right) f_{i,X}(z_i \mid x_i)\, dz_i, \tag{4}$$

since

$$\frac{f_X(x_1, \dots, x_n)}{F_X(H(x_1), \dots, H(x_n))} \cdot \frac{\prod_{i=1}^n F_{i,X}(H(x_i))}{\prod_{i=1}^n f_{i,X}(x_i)} = \prod_{i=1}^n \left\{ \frac{f_{i,X}(x_i \mid z_i)\, F_{i,X}(H(x_i))}{f_{i,X}(x_i)\, F_{i,X}(H(x_i) \mid z_i)} \right\}.$$

Therefore the expected Kullback-Leibler distance for multivariate observations is the sum of the expected Kullback-Leibler distances for one-dimensional observations minus one-dimensional expected Kullback-Leibler distances in which $z_i$ replaces the parameter $\theta$. Let $\Psi_i(x_i, \theta) = \frac{\partial}{\partial x_i} \log f_{i,X}(x_i \mid \theta)$, let $\sigma^2_{\theta \mid x}(\Psi_i(x_i, \cdot))$ be the variance of $\Psi_i(x_i, \theta)$ with respect to the conditional distribution of $\theta$ given $x$, and let

$$\sigma_{i,X}^{2*}(x_i) = \int \sigma^2_{\theta \mid x}(\Psi_i(x_i, \cdot))\, f(x_{-i} \mid x_i)\, dx_{-i}.$$

Let $g_i^*$ denote the p.d.f. proportional to $(\sigma_{i,X}^{2*}(x_i)\, f_{i,X}(x_i))^{1/3}$. An application of Theorem 3 gives an asymptotic representation of $I_k^{(n)}$.

Theorem 4. Let $n > 1$ be fixed and let $(\mathcal{H}_k)_{k=1}^{\infty}$ be a regular sequence of partitions corresponding to $g_1, \dots, g_n$. If conditions C1-C8 hold for the distributions of $(X_i, \theta)$ and for the distributions of $(X_i, z_i)$ ($1 \le i \le n$), then (i) and (ii) hold.

(i) If $k_i/k_j = O(1)$ for all $1 \le i, j \le n$, then

$$I_k^{(n)} = \sum_{i=1}^n \frac{1}{24 k_i^2} \int \frac{\sigma_{i,X}^{2*}(x_i)\, f_{i,X}(x_i)}{g_i^2(x_i)}\, dx_i + o(1/k^{2/n}). \tag{5}$$

(ii) $(\mathcal{H}_k)_{k=1}^{\infty}$ with $g_i = g_i^*$ is asymptotically optimal among regular sequences if

$$\frac{k_i}{k^{1/n}} \to \frac{c_i^{1/2}}{\prod_{j=1}^n c_j^{1/(2n)}},$$

where $c_i = \left( \int (\sigma_{i,X}^{2*}(x_i)\, f_{i,X}(x_i))^{1/3}\, dx_i \right)^3$. In this case,

$$I_k^{(n)} = \frac{n \prod_{j=1}^n c_j^{1/n}}{24\, k^{2/n}} + o(1/k^{2/n}).$$
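The allocation rule in Theorem 4(ii) says that coordinates with larger $c_i$ deserve more intervals, in proportion to $c_i^{1/2}$. A small sketch of the rule (the rounding to integer counts is our own pragmatic choice, not part of the theorem):

    import numpy as np

    def allocate_intervals(c, k):
        """Per-coordinate interval counts k_i following Theorem 4(ii).

        c : the constants c_i = (integral of (sigma*_{i,X}^2 f_{i,X})^{1/3} dx_i)^3
        k : total number of rectangles; prod(k_i) is approximately k after rounding
        """
        c = np.asarray(c, dtype=float)
        n = len(c)
        ratio = np.sqrt(c) / np.prod(c) ** (1.0 / (2.0 * n))  # c_i^{1/2} / prod_j c_j^{1/(2n)}
        return np.maximum(1, np.rint(k ** (1.0 / n) * ratio).astype(int))

    # Two coordinates, the first four times as "informative": c = (4, 1), k = 100
    print(allocate_intervals([4.0, 1.0], 100))   # [14  7], since k_i ~ sqrt(c_i)

This is the usual Lagrange-multiplier solution of minimizing $\sum_i c_i/k_i^2$ subject to $\prod_i k_i = k$, which yields $k_i \propto c_i^{1/2}$.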

4. EXAMPLES

For purposes of illustration, we apply our discretizing strategy to some special univariate distributions. The required conditions C1 to C8 look technical, but they are fulfilled for most "standard" distributions.

We start with normally distributed observations $X \mid \theta \sim N(\theta, \sigma^2)$ with known variance $\sigma^2$. If the prior distribution is a (conjugate) normal distribution, $\theta \sim N(\mu, d^2)$, the marginal distribution is normal again, $X \sim N(\mu, d^2 + \sigma^2)$. Since

$$\Psi_X(x, \theta) = \frac{x - \mu}{d^2 + \sigma^2} - \frac{x - \theta}{\sigma^2},$$

all posterior moments of $\Psi_X$ exist and

$$\sigma_X^2(x) = \frac{d^2}{\sigma^2 (d^2 + \sigma^2)}$$

is independent of $x$. Therefore C1-C3 and C6 hold. Because of the inequality

$$\frac{2}{t + \sqrt{t^2 + 4}} \le \frac{1 - \Phi(t)}{\varphi(t)} \le \frac{2}{t + \sqrt{t^2 + 8/\pi}}$$

for the density $\varphi(t)$ and the c.d.f. $\Phi(t)$ of the standard normal distribution (see Abramowitz and Stegun, 1972), condition C4 is valid. Concerning condition C5, for any compact interval it is possible to choose $\Delta(\delta, \theta) = \delta\, p_2(\theta)\, e^{\delta p_1(\theta)}$, where $p_2$ is a polynomial of degree 2 and $p_1$ a polynomial of degree 1. Therefore the second moment of $\Delta(\delta, \theta)$ with respect to the posterior distribution exists and vanishes as $\delta \to 0$. Similarly, condition C8 is fulfilled for the normal distribution, since $\rho_\delta(x, \theta)$ has the form $\rho_\delta(x, \theta) = q_x(\theta)\, e^{-\theta^2/2 + \tilde{q}_x(\theta)}$, where $q_x(\theta)$ and $\tilde{q}_x(\theta)$ are polynomials in $\theta$ for fixed $x$ and the degree of $\tilde{q}_x(\theta)$ is 1.

Application of Theorem 2 leads to an asymptotically optimal sequence consisting of the quantiles of $g(x) \propto f_X(x)^{1/3}$, since $\sigma_X^2(x)$ is constant. The optimal quantiles thus come from the normal distribution with mean $\mu$ and variance $3(d^2 + \sigma^2)$.

The following table shows the (numerically calculated) optimal points and the asymptotically optimal quantiles for $d^2 = \sigma^2 = 1$ and $\mu = 0$. The minimal Kullback-Leibler distance is $I_k^*$, and the difference between the Kullback-Leibler distances using the optimal points and using the quantiles is denoted by $\Delta(k)$. There is not much difference, even for a really small number of intervals $k$.

    k   Optimal points                           Quantiles                                I_k*      Delta(k)
    3   -0.636; 0.636                            -1.055; 1.055                            6.7E-2    2.3E-2
    4   -1.36; 0; 1.36                           -1.65; 0; 1.65                           5.5E-2    3.6E-3
    5   -1.72; -0.53; 0.53; 1.72                 -2.06; -0.62; 0.62; 2.06                 3.8E-2    2.6E-3
    6   -2.08; -0.93; 0; 0.93; 2.08              -2.37; -1.06; 0; 1.06; 2.37              2.8E-2    1.7E-3
    7   -2.22; -1.22; -0.39; 0.39; 1.22; 2.22    -2.62; -1.39; -0.44; 0.44; 1.39; 2.62    2.1E-2    1.3E-3
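The quantile column can be reproduced directly, since for this model the asymptotically optimal endpoints are the $i/k$-quantiles of $N(\mu, 3(d^2 + \sigma^2))$. A short verification sketch for $d^2 = \sigma^2 = 1$, $\mu = 0$:

    import numpy as np
    from scipy.stats import norm

    d2, s2, mu = 1.0, 1.0, 0.0
    scale = np.sqrt(3.0 * (d2 + s2))   # optimal g is the N(mu, 3(d^2 + sigma^2)) density
    for k in range(3, 8):
        h = mu + scale * norm.ppf(np.arange(1, k) / k)
        print(k, np.round(h, 3))
    # k = 3 prints [-1.055  1.055], matching the quantile column above.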

The second example deals with a model where the parameter attains only two values, as arises in problems of testing hypotheses. The numerical calculations are simplified, and we demonstrate the efficiency of our method.

The densities on the interval $x \in [0, 1]$ are $f_X(x \mid \theta) = 3x^2$ if $\theta = 0$ and $f_X(x \mid \theta) = 3(1-x)^2$ if $\theta = 1$. The prior probabilities are equal, $\pi(0) = \pi(1) = 1/2$. Then $f_X(x) = \frac{3}{2}(x^2 + (1-x)^2)$ results as the marginal density of $X$, and

$$\Psi_X(x, 0) = \frac{2}{x} - \frac{2(2x-1)}{x^2 + (1-x)^2}, \qquad \Psi_X(x, 1) = -\frac{2}{1-x} - \frac{2(2x-1)}{x^2 + (1-x)^2}.$$

The posterior variance of $\Psi_X(x, \theta)$ is

$$\sigma_X^2(x) = \frac{8}{x^2 + (1-x)^2} - \frac{4(2x-1)^2}{(x^2 + (1-x)^2)^2}.$$

It is easy to show that all required regularity conditions (C1-C8) hold. Therefore the optimal density $g$ has the form

$$g(x) = c\, \frac{1}{(x^2 + (1-x)^2)^{1/3}}$$

with normalizing constant $c = 0.8645$. The following table compares the optimal points to the regular asymptotically optimal quantiles. Again, $I_k^*$ is the optimal Kullback-Leibler distance and $\Delta(k)$ is the difference.

    k   Optimal points                                   Quantiles                                        I_k*      Delta(k)
    3   0.35; 0.65                                       0.345; 0.655                                     4.6E-2    5.52E-5
    4   0.271; 0.5; 0.729                                0.265; 0.5; 0.735                                2.3E-2    3.88E-5
    5   0.221; 0.41; 0.59; 0.779                         0.215; 0.408; 0.592; 0.785                       1.48E-2   2.65E-5
    6   0.187; 0.348; 0.5; 0.652; 0.813                  0.182; 0.345; 0.5; 0.655; 0.818                  1.3E-2    1.86E-5
    7   0.162; 0.303; 0.435; 0.565; 0.697; 0.838         0.157; 0.30; 0.434; 0.566; 0.70; 0.843           7.61E-3   1.35E-5
    8   0.143; 0.269; 0.387; 0.5; 0.613; 0.731; 0.857    0.138; 0.265; 0.384; 0.5; 0.616; 0.735; 0.862    5.85E-3   1.0E-5
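The constant $c$ and the quantile column can be checked numerically; the following sketch (grid quadrature is our own shortcut) recovers $c \approx 0.8645$ and the $k = 3$ endpoints $0.345$ and $0.655$:

    import numpy as np

    x = np.linspace(0.0, 1.0, 200001)
    u = x**2 + (1.0 - x)**2
    g_unnorm = u ** (-1.0 / 3.0)          # g(x) = c / (x^2 + (1-x)^2)^{1/3}
    c = 1.0 / np.trapz(g_unnorm, x)
    print(round(c, 4))                     # 0.8645

    G = c * np.cumsum(g_unnorm) * (x[1] - x[0])   # c.d.f. of g on the grid
    for k in (3, 4, 5):
        print(k, np.round(np.interp(np.arange(1, k) / k, G, x), 3))
    # k = 3 prints [0.345 0.655], matching the quantile column above.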

5. PROOFS

The proofs of Theorems 1-4 are simplified if the random variable $X$ is transformed to the interval $[0, 1]$ by applying $F_X$. Let $Y = F_X(X)$, denote by $f(x \mid \theta)$ its conditional p.d.f., and let $\Psi(x, \theta) = f'(x \mid \theta)/f(x \mid \theta)$; $\sigma^2(x)$ denotes the posterior variance of $\Psi(x, \theta)$. The relations between the non-transformed and the transformed objects are

$$f(x \mid \theta) = \frac{f_X(F_X^{-1}(x) \mid \theta)}{f_X(F_X^{-1}(x))}, \qquad \Psi(x, \theta) = \frac{\Psi_X(F_X^{-1}(x), \theta)}{f_X(F_X^{-1}(x))},$$

and

$$\sigma^2(x) = \int \Psi^2(x, \theta)\, \pi(\theta \mid x)\, d\theta = \frac{\sigma_X^2(F_X^{-1}(x))}{f_X^2(F_X^{-1}(x))}.$$

Recall that the proofs of Theorems 1-3 will be given for the transformed observations, with conditional p.d.f. $f(x \mid \theta)$. We assume that C1-C6 hold for Lemmas 1 and 2 and that C1-C8 hold for Lemmas 4 and 5.

Lemma 1. Let $n=1$. A sequence of partitions exists such that $D_k^2 = O(1/k^2)$ and such that for any $\varepsilon > 0$ a compact interval $K \subset \left]0, 1\right[$ exists with

$$\liminf_{k \to \infty} \int_K D_k^2(x)\, dx \Big/ D_k^2 \ge 1 - \varepsilon. \tag{6}$$

For $x \in \left]h_{i-1}, h_i\right]$, let $\varphi_k(x) := h_i + h_{i-1} - 2x$. In the following we consider two densities, $p_k$ and $p_k^*$:

$$p_k(x) = \frac{\sum_{i=1}^k I_{H_i}(x)\, 3\varphi_k^2(x)}{\sum_{i=1}^k \delta_i^3}, \qquad p_k^*(x) = \frac{\sum_{i=1}^k I_{H_i}(x)\, \delta_i^2}{\sum_{i=1}^k \delta_i^3}.$$

$p_k(x)$ is a probability density satisfying

$$\frac{1}{16} \int_K \int \varphi_k^2(x)\, \Psi^2(x, \theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta\, dx = \frac{\sum_{i=1}^k \delta_i^3}{48} \int_K \sigma^2(x)\, p_k(x)\, dx$$

for compact intervals $K \subseteq \left]0, 1\right[$. Furthermore,

$$\lim_{k \to \infty} \frac{\int_K p_k(x)\, \sigma^2(x)\, dx}{\int_K p_k^*(x)\, \sigma^2(x)\, dx} = 1. \tag{7}$$

In the next lemma we show that $D_k^2(x)$, restricted to compact intervals, is asymptotically proportional to $p_k(x)\, \sigma^2(x)$.

Lemma 2. Let $n=1$. For any compact interval $K \subset \left]0, 1\right[$,

$$\lim_{k \to \infty} \frac{\left( \sum_{i=1}^k \delta_i^3 / 96 \right) \int_K \sigma^2(x)\, p_k(x)\, dx}{\int_K D_k^2(x)\, dx} = 1 \tag{8}$$

and

$$\lim_{k \to \infty} \frac{\left( \sum_{i=1}^k \delta_i^3 / 96 \right) \int_K \sigma^2(x)\, p_k^*(x)\, dx}{\int_K D_k^2(x)\, dx} = 1. \tag{9}$$

$p_k^*(x)$ is used to construct a density with the points $h_i$ as quantiles. We define this density as $g_k(x) \propto 1/\sqrt{p_k^*(x)}$ and get

$$\int_{h_{i-1}}^{h_i} g_k(x)\, dx = c \int_{h_{i-1}}^{h_i} \frac{1}{\delta_i}\, dx = c$$

with some constant $c$ not depending on $i$. Therefore $k \int_{h_{i-1}}^{h_i} g_k(x)\, dx = 1$ and $\int_0^{h_i} g_k(x)\, dx = i/k$ hold.

We may restrict the analysis to sequences $h_i^k$ which become "dense" quickly enough, more precisely to sequences with $\sum_{i=1}^k \delta_i^3 = c_k/k^2$ and a bounded sequence $(c_k)$. Then

$$\frac{\delta_i^3}{\sum_{j=1}^k \delta_j^3} = p_k^*(h_i)\, \delta_i$$

gives $\delta_i = \sqrt{p_k^*(h_i)}\, \sqrt{c_k}/k$, which implies $h_i = \sum_{j \le i} \sqrt{p_k^*(h_j)}\, \sqrt{c_k}/k$. The quantile function $Q = G_k^{-1}$ of the distribution with density $g_k$ and $G_k(x) = \int_0^x g_k(t)\, dt$ leads to

$$h_i = \sum_{j \le i} \sqrt{p_k^*(Q(j/k))}\, \frac{\sqrt{c_k}}{k},$$

so that

$$h_i \approx \sqrt{c_k} \int_0^{i/k} \sqrt{p_k^*(Q(x))}\, dx = \sqrt{c_k} \int_0^{G_k(h_i)} \sqrt{p_k^*(Q(x))}\, dx.$$

The substitution $Q(x) = y$ then gives

$$h_i = \sqrt{c_k} \int_0^{h_i} \sqrt{p_k^*(y)}\, g_k(y)\, dy = \sqrt{c_k} \int_0^{h_i} \sqrt{p_k^*(y)}\, \frac{1/\sqrt{p_k^*(y)}}{\int_0^1 1/\sqrt{p_k^*(t)}\, dt}\, dy = \sqrt{c_k}\, \frac{h_i}{\int_0^1 1/\sqrt{p_k^*(t)}\, dt}.$$

Thus $c_k = \left( \int_0^1 1/\sqrt{p_k^*(x)}\, dx \right)^2$ and

$$\sum_i \delta_i^3 \int_K p_k^*(x)\, \sigma^2(x)\, dx = \frac{1}{k^2} \left( \int_0^1 \frac{1}{\sqrt{p_k^*(x)}}\, dx \right)^2 \int_K p_k^*(x)\, \sigma^2(x)\, dx. \tag{10}$$

Combined with (9), we have for compact $K \subset \left]0, 1\right[$,

$$\lim_{k \to \infty} \frac{(1/96 k^2) \left( \int_0^1 1/\sqrt{p_k^*(x)}\, dx \right)^2 \int_K p_k^*(x)\, \sigma^2(x)\, dx}{\int_K D_k^2(x)\, dx} = 1. \tag{11}$$

Lemma 3. The density function $q(x)$ minimizing

$$\left( \int_K \frac{1}{\sqrt{q(x)}}\, dx \right)^2 \int_K q(x)\, \sigma^2(x)\, dx \tag{12}$$

has the form $q(x) = c\, [\sigma^2(x)]^{-2/3}$ for $x \in K$ with a normalizing constant $c$.

$p^*(x) = q(x)$ $(= c\, (\sigma^2(x))^{-2/3})$ is optimal in the following sense: for any sequence $\tilde{p}_k^*$ different from $q$ and any compact interval $K' \subset \left]0, 1\right[$, a compact $K \supseteq K'$ exists such that

$$\liminf_{k \to \infty} \frac{\left( \int_0^1 1/\sqrt{\tilde{p}_k^*(x)}\, dx \right)^2 \int_K \tilde{p}_k^*(x)\, \sigma^2(x)\, dx}{\left( \int_0^1 1/\sqrt{p^*(x)}\, dx \right)^2 \int_K p^*(x)\, \sigma^2(x)\, dx} > 1.$$

Inserting the optimal $p^*(x) = q(x)$ in (10) gives, according to Lemma 2,

$$\sum_i \delta_i^3 \int p^*(x)\, \sigma^2(x)\, dx = \frac{1}{k^2} \left( \int (\sigma^2(x))^{1/3}\, dx \right)^3.$$

If $(\mathcal{H}_k)_{k=1}^{\infty}$ is a regular partition with $h_i = Q(i/k)$, $Q = G^{-1}$, and $g(x) = G'(x)$, then $g(x) \propto 1/\sqrt{p^*(x)} \propto (\sigma^2(x))^{1/3}$; i.e., regular partitions corresponding to densities proportional to $(\sigma^2(x))^{1/3}$ are asymptotically optimal.

Both the expected squared Hellinger distance and the expected Kullback-Leibler distance are invariant with respect to the transformation $Y = F_X(X)$. We thus have to substitute $f_X(x)\, dx$ for $dx$ and $p^*(F_X(x))\, f_X(x)$ for $p^*(x)$ in the case of a general marginal density $f_X(x)$. We then have

$$p_k^*(F_X(x))\, f_X(x) = \frac{1}{\sum_{i=1}^k \delta_i^3} \sum_{i=1}^k I_{H_i}(x)\, \delta_i^2\, f_X(x).$$

Finally, we combine Lemmas 1-3 in the proof of Theorem 1.

Proof of Theorem 1. (i) This is a reformulation of Lemma 3. (ii) This is again Lemma 3, using the representation (10) and the result (9), because Lemma 1 implies that for at least one sequence (and therefore for the optimal sequence) $D_k^2 = O(1/k^2)$. (iii) This is (9) and (10) under the assumption that $\int \sigma_X^2(x) f_X(x)/g^2(x)\, dx$ exists. (iv) This is a special case of (iii). As the lower bound in (i) is attained asymptotically, this optimal regular sequence is in fact asymptotically optimal among all sequences of partitions.

The proofs of Theorems 2 and 3 are split into the following two lemmas.

Lemma 4. Let $n=1$. A sequence of partitions exists such that $I_k = O(1/k^2)$ and such that for all $\varepsilon > 0$ a compact interval $K \subset \left]0, 1\right[$ exists with

$$\limsup_{k \to \infty} \int_K I_k(x)\, dx \Big/ I_k \ge 1 - \varepsilon.$$

Lemma 5. For each compact interval $K \subset \left]0, 1\right[$, as $\|\delta\| := \sup_{1 \le i \le k} \delta_i \to 0$,

$$\int_K I_k(x)\, dx = 4 \int_K D_k^2(x)\, dx + o\left( \int_K D_k^2(x)\, dx \right).$$

The asymptotic relation between the two distances $D_k^2$ and $I_k$ stated in the preceding lemma leads to optimal sequences of partitions for the Kullback-Leibler distance: a sequence of partitions is asymptotically optimal for the squared Hellinger distance if and only if it is asymptotically optimal for the Kullback-Leibler distance. The proof of Theorem 2 is an application of the results on the squared Hellinger distance together with Lemmas 4 and 5.

Proof of Theorem 4. Let $\tilde{\Psi}_i(x_i, z_i) = \frac{\partial}{\partial x_i} \log f_{i,X}(x_i \mid z_i)$ and let $\tilde{\sigma}_{i,k}^2(x_i)$ be the variance of $\tilde{\Psi}_i(x_i, z_i)$ with respect to the conditional distribution of $z_i$ given $x_i$. Then, according to Theorem 3(ii),

$$24 k_i^2 \int \left( I_k(x_i) - \tilde{I}_k(x_i) \right) f_{i,X}(x_i)\, dx_i \sim \int \frac{\sigma_i^2(x_i) - \tilde{\sigma}_{i,k}^2(x_i)}{g_i^2(x_i)}\, f_{i,X}(x_i)\, dx_i.$$

We have

$$\int \frac{f_{i,X}'(x_i \mid z_i)}{f_{i,X}(x_i \mid z_i)}\, f_{i,X}(z_i \mid x_i)\, dz_i = \int f_{i,X}'(x_i \mid z_i)\, \frac{f_{i,X}(z_i)}{f_{i,X}(x_i)}\, dz_i = \frac{f_{i,X}'(x_i)}{f_{i,X}(x_i)} = \int \frac{f_{i,X}'(x_i \mid \theta)}{f_{i,X}(x_i \mid \theta)}\, \pi(\theta \mid x_i)\, d\theta$$

and

$$\int \Psi_i(x_i, \theta)\, \pi(\theta \mid x_i, z_i)\, d\theta = \int \frac{f_{i,X}'(x_i \mid \theta)}{f_{i,X}(x_i \mid \theta)}\, \pi(\theta \mid x_i, z_i)\, d\theta = \int \frac{f_{i,X}'(x_i \mid \theta)\, f_{i,X}(z_i \mid \theta)\, \pi(\theta)}{f_{i,X}(x_i, z_i)}\, d\theta = \frac{f_{i,X}'(x_i, z_i)}{f_{i,X}(x_i, z_i)} = \frac{f_{i,X}'(x_i \mid z_i)}{f_{i,X}(x_i \mid z_i)},$$

so that

$$\sigma_i^2(x_i) - \tilde{\sigma}_{i,k}^2(x_i) = \int \left( \frac{f_{i,X}'(x_i \mid \theta)}{f_{i,X}(x_i \mid \theta)} \right)^2 \pi(\theta \mid x_i)\, d\theta - \int \left( \frac{f_{i,X}'(x_i \mid z_i)}{f_{i,X}(x_i \mid z_i)} \right)^2 f_{i,X}(z_i \mid x_i)\, dz_i$$
$$= E_{\theta \mid x_i} \Psi_i^2(x_i, \theta) - E_{z_i \mid x_i} \left( E_{\theta \mid z_i, x_i} \Psi_i(x_i, \theta) \right)^2 = E_{z_i \mid x_i} \left[ E_{\theta \mid z_i, x_i} \Psi_i^2(x_i, \theta) - \left( E_{\theta \mid z_i, x_i} \Psi_i(x_i, \theta) \right)^2 \right]$$
$$= E_{z_i \mid x_i} \mathrm{Var}_{\theta \mid z_i, x_i}(\Psi_i(x_i, \theta)) = \int \mathrm{Var}(\Psi_i(x_i, \theta) \mid H(x_1), \dots, H(x_{i-1}), x_i, \dots, x_n)\, f_{i,X}(x_{-i} \mid x_i)\, dx_{-i}.$$

For $k \to \infty$, $\mathrm{Var}(\Psi_i(x_i, \theta) \mid H(x_1), \dots, H(x_{i-1}), x_i, \dots, x_n)$ converges to $\mathrm{Var}(\Psi_i(x_i, \theta) \mid x_1, \dots, x_n)$ for all $(x_1, \dots, x_n)$. Furthermore,

$$\sigma_i^2(x_i) - \tilde{\sigma}_{i,k}^2(x_i) \le \sigma_i^2(x_i).$$

The theorem on dominated convergence thus implies

$$24 k_i^2 \int \left( I_k(x_i) - \tilde{I}_k(x_i) \right) f_{i,X}(x_i)\, dx_i \longrightarrow \int \frac{\sigma_{i,X}^{2*}(x_i)}{g_i^2(x_i)}\, f_{i,X}(x_i)\, dx_i,$$

and (i) follows immediately. Choosing $g_i = g_i^*$ and minimizing (5) over $(k_1, \dots, k_n)$ with $\prod_{i=1}^n k_i = k$ yields (ii).

Finally, we provide the proofs of the technical lemmas. We will show that the special sequence of intervals with endpoints

$$q_i := c_H \int_0^{i/k} x^3 (1-x)^3 f_X(F_X^{-1}(x))\, dx, \qquad c_H^{-1} = \int_0^1 x^3 (1-x)^3 f_X(F_X^{-1}(x))\, dx, \tag{13}$$

gives the convergence rate $D_k^2 = O(1/k^2)$ for the transformed model $Y = F_X(X)$.

Lemma 6. For fixed $\varepsilon > 0$ define $M_+ = \{ i : q_{i-1} \ge F_X(1/\varepsilon) \}$ and $M_- = \{ i \ge 2 : q_i \le F_X(-1/\varepsilon) \}$. Conditions C1-C4 imply

$$\limsup_{k \to \infty}\ \sup_{i \in M_+} \frac{F_X^{-1}(q_i) - F_X^{-1}(q_{i-1})}{F_X^{-1}(q_{i-1})} < \infty \tag{14}$$

and

$$\limsup_{k \to \infty}\ \sup_{i \in M_-} \frac{F_X^{-1}(q_{i-1}) - F_X^{-1}(q_i)}{F_X^{-1}(q_i)} < \infty. \tag{15}$$

Furthermore, a constant $c_3 > 0$ exists with

$$\limsup_{k \to \infty}\ \sup_{2 \le i \le k-1}\ \sup_{u, v \in [q_{i-1}, q_i]} \frac{f_X(F_X^{-1}(u))}{f_X(F_X^{-1}(v))} \le c_3. \tag{16}$$

Proof. Let $H_i = \left]q_{i-1}, q_i\right]$ and let $c_R \ge F_X^{-1}(1/2)$. If $F_X^{-1}(H_i) \subseteq [c_L, c_R]$, the assertion is implied by the uniform continuity of $F_X^{-1}$ and $f_X \circ F_X^{-1}$ on compact intervals. We assume $F_X^{-1}(H_i) \subseteq \left]c_R, \infty\right[$ and $F_X^{-1}((i-1)/k) > c_R$. For $x, y$ with $c_R \le F_X^{-1}(x) \le F_X^{-1}(y) \le F_X^{-1}(x)(1 + c_1)$, condition C4 gives

$$\frac{f_X(F_X^{-1}(x))}{f_X(F_X^{-1}(y))} = \frac{r_R(F_X^{-1}(x))\, (1-x)}{r_R(F_X^{-1}(y))\, (1-y)} \ge \frac{1}{c_2} \cdot \frac{1-x}{1-y}. \tag{17}$$

Furthermore, we have

$$\frac{F_X^{-1}((i + 1/2)/k)}{F_X^{-1}((i-1)/k)} \le c_0 \left| \frac{1 - (i-1)/k}{1 - (i + 1/2)/k} \right|^{\alpha} \le c_4^{\alpha} < \infty,$$

leading to

$$\frac{f_X(F_X^{-1}((i-1)/k))}{f_X(F_X^{-1}((i+1/2)/k))} \le 4 c_2. \tag{18}$$

Let $(1 - q_{i-1})/(1 - q_i) = 1 + A_i$ with

$$A_i = \frac{\int_{(i-1)/k}^{i/k} x^3 (1-x)^3 f_X(F_X^{-1}(x))\, dx}{\int_{i/k}^{1} x^3 (1-x)^3 f_X(F_X^{-1}(x))\, dx}.$$

Since the numerator of $A_i$ is at most $(1/k) \sup_{x \in [(i-1)/k,\, i/k]} x^3(1-x)^3 f_X(F_X^{-1}(x))$ and the denominator is at least $(1/2k) \inf_{x \in [i/k,\, (i+1/2)/k]} x^3(1-x)^3 f_X(F_X^{-1}(x))$, it follows from (18) that $A_i \le 512 (4 c_2)$ and $(1 - q_{i-1})/(1 - q_i) \le 1 + 2048 c_2$. Thus

$$\left| \frac{F_X^{-1}(q_i)}{F_X^{-1}(q_{i-1})} \right| \le c_0 \left| \frac{1 - q_{i-1}}{1 - q_i} \right|^{\alpha} \le c_0 \left| 1 + 2048 c_2 \right|^{\alpha}$$

and (14) is verified. Monotonicity of the density on $[c_R, \infty[$ and (18) give

$$\sup_{u, v \in [q_{i-1}, q_i]} \frac{f_X(F_X^{-1}(u))}{f_X(F_X^{-1}(v))} = \frac{f_X(F_X^{-1}(q_{i-1}))}{f_X(F_X^{-1}(q_i))} \le c_2\, \frac{1 - q_{i-1}}{1 - q_i} \le c_2 (1 + 2048 c_2).$$

For $F_X^{-1}(H_i) \subseteq \left]-\infty, c_L\right[$ we find an upper bound in the same way, and (16) and (15) are proved.

Proof of Lemma 1. The endpoints of the intervals are chosen as $h_i = q_i$, defined in (13). Let $H_i = \left]h_{i-1}, h_i\right]$ for $1 \le i \le k$ and $x \in H_i$. Then

$$D_k^2(x) = \frac{1}{2} \int \left( \sqrt{f(x \mid \theta)} - \sqrt{F(H_i \mid \theta)/\delta_i} \right)^2 \pi(\theta)\, d\theta.$$

Let $x_{\theta,i} \in H_i$ be such that $f(x_{\theta,i} \mid \theta) = F(H_i \mid \theta)/\delta_i$. For $x \in H_i$ we write

$$\sqrt{f(x \mid \theta)} - \sqrt{F(H_i \mid \theta)/\delta_i} = (x - x_{\theta,i})\, \frac{f'(\tilde{x}_{\theta,i} \mid \theta)}{2 \sqrt{f(\tilde{x}_{\theta,i} \mid \theta)}}$$

for some $\tilde{x}_{\theta,i} \in H_i$. Thus

$$D_k^2 = d_1 + \frac{1}{8} \sum_{i=2}^{k-1} \int_{H_i} \int (x - x_{\theta,i})^2\, \frac{f'(\tilde{x}_{\theta,i} \mid \theta)^2}{f(\tilde{x}_{\theta,i} \mid \theta)}\, \pi(\theta)\, d\theta\, dx + d_k,$$

where

$$d_1 = \frac{1}{2} \int_0^{h_1} \int \left( \sqrt{\pi(\theta \mid x)} - \sqrt{\pi(\theta \mid H_1)} \right)^2 d\theta\, dx, \qquad d_k = \frac{1}{2} \int_{h_{k-1}}^1 \int \left( \sqrt{\pi(\theta \mid x)} - \sqrt{\pi(\theta \mid H_k)} \right)^2 d\theta\, dx.$$

The terms $d_1$ and $d_k$ are $o(1/k^4)$, since for sufficiently large $k$

$$1 - h_{k-1} = c_H \int_{1 - 1/k}^{1} x^3 (1-x)^3 f_X(F_X^{-1}(x))\, dx \le \frac{c_H}{8}\, f_X(F_X^{-1}(1 - 1/k)) \int_{1-1/k}^1 (1-x)^3\, dx = \frac{c_H}{32}\, f_X(F_X^{-1}(1 - 1/k))\, \frac{1}{k^4} = o(1/k^4)$$

and analogously $d_1 = o(1/k^4)$. Thus

$$D_k^2 \le o(1/k^4) + \frac{1}{8} \sum_{i=2}^{k-1} \delta_i^2 \int_{H_i} \int \frac{f'(\tilde{x}_{\theta,i} \mid \theta)^2}{f(\tilde{x}_{\theta,i} \mid \theta)}\, \pi(\theta)\, d\theta\, dx.$$

An $\hat{x}_{\theta,i} \in [(i-1)/k,\, i/k]$ exists with $h_i - h_{i-1} \le (c_H/k)\, f_X(F_X^{-1}(\hat{x}_{\theta,i}))$. Therefore we have

$$D_k^2 \le o(1/k^4) + \frac{1}{8}\, c_H^3\, \frac{1}{k^3} \sum_{i=2}^{k-1} \int f_X(F_X^{-1}(\hat{x}_{\theta,i}))^3\, \frac{f'(\tilde{x}_{\theta,i} \mid \theta)^2}{f(\tilde{x}_{\theta,i} \mid \theta)}\, \pi(\theta)\, d\theta.$$

Let $x_{\theta,i}^* \in [(i-1)/k,\, i/k]$ maximize $|f'(x \mid \theta)|/\sqrt{f(x \mid \theta)}$. For all except the first and the last interval, Lemma 6 provides a constant $c_3$ with

$$f_X(F_X^{-1}(\hat{x}_{\theta,i}))^3\, \frac{f'(\tilde{x}_{\theta,i} \mid \theta)^2}{f(\tilde{x}_{\theta,i} \mid \theta)} \le c_3^3\, f_X(F_X^{-1}(x_{\theta,i}^*))^3\, \frac{f'(x_{\theta,i}^* \mid \theta)^2}{f(x_{\theta,i}^* \mid \theta)}.$$

Then

$$D_k^2 \le o(1/k^4) + \frac{1}{8 k^2}\, c_3^3\, c_H^3 \left\{ \frac{1}{k} \sum_{i=2}^{k-1} \int \frac{f'(x_{\theta,i}^* \mid \theta)^2}{f(x_{\theta,i}^* \mid \theta)}\, f_X(F_X^{-1}(x_{\theta,i}^*))^3\, \pi(\theta)\, d\theta \right\}. \tag{19}$$

The term in braces is a Riemann sum converging to

$$\int\!\int \frac{f'(x \mid \theta)^2}{f(x \mid \theta)}\, f_X(F_X^{-1}(x))^3\, dx\, \pi(\theta)\, d\theta,$$

which is finite by condition C3. If we consider only the subsequence with $k = 2^m$ for $m \in \mathbb{N}$, the Riemann sum is nonincreasing and therefore

$$\limsup_{m \to \infty} 2^{2m} D_{2^m}^2 \le \frac{c_3^3 c_H^3}{8} \int\!\int f_X(F_X^{-1}(x))^3\, \frac{f'(x \mid \theta)^2}{f(x \mid \theta)}\, dx\, \pi(\theta)\, d\theta.$$

To complete the proof, for $k \in \mathbb{N}$ let $m$ be the largest integer with $2^m \le k$; then $k/2^m \le 2$. We have just proved that there exists a sequence of partitions where $\mathcal{H}_{2^m}$ consists of $2^m$ intervals and $D_{2^m}^2 = O(1/2^{2m})$. Therefore there exists a sequence $(\mathcal{H}_k)_{k=1}^{\infty}$ of partitions into intervals with

$$\limsup_{k \to \infty} k^2 D_k^2 \le 2^2\, \limsup_{m \to \infty} 2^{2m} D_{2^m}^2 \le 4C,$$

with $C = (c_3^3 c_H^3/8) \int\!\int f_X(F_X^{-1}(x))^3\, (f'(x \mid \theta)^2/f(x \mid \theta))\, dx\, \pi(\theta)\, d\theta$. Equation (6) is an immediate consequence of the construction of the sequence $(\mathcal{H}_k)_{k=1}^{\infty}$.

Proof of Lemma 2. Let $\varphi(x) = h_i + h_{i-1} - 2x$; then

$$\frac{F(H_i \mid \theta)}{\delta_i\, f(x \mid \theta)} = 1 + \frac{\varphi(x)\, \Psi(x, \theta)}{2} + R(x, \theta) \quad \text{if } x \in H_i,$$

with

$$R(x, \theta) = \frac{\int_{h_{i-1}}^{h_i} (y - x) \left( f'(\tilde{x} \mid \theta) - f'(x \mid \theta) \right) dy}{\delta_i\, f(x \mid \theta)}, \qquad \tilde{x} = \tilde{x}(x, y), \quad |\tilde{x} - x| \le |y - x|.$$

If C5 holds, a function $\Delta(\delta)$ exists with $|R(x, \theta)| \le \delta_i\, \Delta(\delta_i)$. We use the representation

$$\left( \sqrt{f(x \mid \theta)} - \sqrt{F(H_i \mid \theta)/\delta_i} \right)^2 = \frac{\varphi^2(x)\, \Psi^2(x, \theta)\, f(x \mid \theta)}{16} + S(x, \theta) + T(x, \theta) \tag{20}$$

with

$$S(x, \theta) = \frac{R(x, \theta)\, f(x \mid \theta) \left[ \varphi(x)\, \Psi(x, \theta)\, f(x \mid \theta) + R(x, \theta)\, f(x \mid \theta) \right]}{\left( \sqrt{f(x \mid \theta)} + \sqrt{F(H_i \mid \theta)/\delta_i} \right)^2}$$

and

$$T(x, \theta) = \frac{\varphi^2(x)\, \Psi^2(x, \theta)\, f(x \mid \theta)}{4} \left\{ \frac{f(x \mid \theta)}{\left( \sqrt{f(x \mid \theta)} + \sqrt{F(H_i \mid \theta)/\delta_i} \right)^2} - \frac{1}{4} \right\}.$$

Let $\|\delta\| = \sup \delta_i$ and $c_j(x) = \int |\Psi|^j\, \pi(\theta \mid x)\, d\theta$. The supremum of $c_j(x)$ on $K$ is denoted by $c_j^*$, and $c_4 = \int\!\int \Psi^4\, \pi(\theta \mid x)\, d\theta\, dx$. Then

$$\left| \int S(x, \theta)\, \pi(\theta)\, d\theta \right| \le \delta_i^2\, \sqrt{c_2^*}\, \Delta(\|\delta\|) + \delta_i^2\, \Delta^2(\|\delta\|).$$

Since

$$\frac{f(x \mid \theta)}{\left( \sqrt{f(x \mid \theta)} + \sqrt{F(H_i \mid \theta)/\delta_i} \right)^2} - \frac{1}{4} = \left( \frac{\sqrt{f(x \mid \theta)}}{\sqrt{f(x \mid \theta)} + \sqrt{F(H_i \mid \theta)/\delta_i}} - \frac{1}{2} \right) \left( \frac{\sqrt{f(x \mid \theta)}}{\sqrt{f(x \mid \theta)} + \sqrt{F(H_i \mid \theta)/\delta_i}} + \frac{1}{2} \right),$$

where the second factor is at most $3/2$ and

$$\left| \frac{\sqrt{f(x \mid \theta)}}{\sqrt{f(x \mid \theta)} + \sqrt{F(H_i \mid \theta)/\delta_i}} - \frac{1}{2} \right| \le \frac{1}{2} \left| 1 - \frac{F(H_i \mid \theta)}{\delta_i\, f(x \mid \theta)} \right|,$$

the inequality

$$|T(x, \theta)| \le \frac{3}{16}\, \delta_i^2\, \Psi^2(x, \theta)\, f(x \mid \theta) \left| 1 - \frac{F(H_i \mid \theta)}{\delta_i\, f(x \mid \theta)} \right|$$

holds. We have

$$\int_{h_{i-1}}^{h_i} \int \Psi^4(x, \theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta\, dx \le \delta_i\, c_4^*. \tag{21}$$

Application of Jensen's inequality leads to

$$\int_{h_{i-1}}^{h_i} \int \left( 1 - \frac{F(H_i \mid \theta)}{\delta_i\, f(x \mid \theta)} \right)^2 f(x \mid \theta)\, \pi(\theta)\, d\theta\, dx \le \frac{1}{\delta_i} \int_{h_{i-1}}^{h_i} \int_{h_{i-1}}^{h_i} \int \left( \sqrt{f(x \mid \theta)} - \sqrt{f(y \mid \theta)} \right)^2 \pi(\theta)\, d\theta\, dy\, dx \le 2 \delta_i \sup_{x, y \in H_i} D^2(\pi(\cdot \mid x), \pi(\cdot \mid y)).$$

The last factor is $o(1)$ for $\|\delta\| \to 0$. Schwarz's inequality gives

$$\int_{h_{i-1}}^{h_i} \int |T(x, \theta)|\, \pi(\theta)\, d\theta\, dx \le \frac{3}{16}\, \delta_i^2 \left( \int_{h_{i-1}}^{h_i} \int \Psi^4 f\, \pi\, d\theta\, dx \right)^{1/2} \left( \int_{h_{i-1}}^{h_i} \int \left( 1 - \frac{F(H_i \mid \theta)}{\delta_i f(x \mid \theta)} \right)^2 f\, \pi\, d\theta\, dx \right)^{1/2} \le \frac{3}{16}\, \delta_i^2\, \sqrt{c_4^*\, \delta_i}\, \sqrt{\delta_i}\, o(1) = \frac{3}{16} \sqrt{c_4^*}\, \delta_i^3\, o(1).$$

Thus

$$\left| \int_K \int \left( S(x, \theta) + T(x, \theta) \right) \pi(\theta)\, d\theta\, dx \right| = o\left( \sum_{i=1}^k \delta_i^3 \right). \tag{22}$$

Equation (8) then follows from (20), (7), and (22).

Proof of Lemma 3. We define the density $g(x) = q^{-1/2}(x)/\int_K q^{-1/2}(t)\, dt$. Then Jensen's inequality gives

$$\left( \int_K \frac{1}{\sqrt{q(x)}}\, dx \right)^2 \int_K q(x)\, \sigma^2(x)\, dx = \int_K \frac{\sigma^2(x)}{g^2(x)}\, dx = \int_K \left( \frac{\sigma^{2/3}(x)}{g(x)} \right)^3 g(x)\, dx \ge \left( \int_K \frac{\sigma^{2/3}(x)}{g(x)}\, g(x)\, dx \right)^3 = \left( \int_K \sigma^{2/3}(x)\, dx \right)^3,$$

with equality if and only if $g \propto \sigma^{2/3}$. Therefore the optimal $q$ is proportional to $(\sigma^2)^{-2/3}$.

Proof of Lemma 4. The proof is similar to the proof of Lemma 1, so we give the deviating parts only. Let $x_{\theta,i} \in H_i$ with $f(x_{\theta,i} \mid \theta) = F(H_i \mid \theta)/\delta_i$. Note that $\log(1+t) \le t$, so that

$$I_k = \sum_{i=1}^k \int_{H_i} \int \log \left( \frac{f(x \mid \theta)}{f(x_{\theta,i} \mid \theta)} \right) f(x \mid \theta)\, \pi(\theta)\, d\theta\, dx$$
$$\le 2 \sum_{i=1}^k \int_{H_i} \int \left( \sqrt{\frac{f(x \mid \theta)}{f(x_{\theta,i} \mid \theta)}} - 1 \right) f(x \mid \theta)\, \pi(\theta)\, d\theta\, dx$$
$$= 2 \sum_{i=1}^k \int_{H_i} \int \left( \sqrt{\frac{f(x \mid \theta)}{f(x_{\theta,i} \mid \theta)}} - 1 \right) \left( f(x \mid \theta) - f(x_{\theta,i} \mid \theta) \right) \pi(\theta)\, d\theta\, dx + 2 \sum_{i=1}^k \int_{H_i} \int \left( \sqrt{\frac{f(x \mid \theta)}{f(x_{\theta,i} \mid \theta)}} - 1 \right) f(x_{\theta,i} \mid \theta)\, \pi(\theta)\, d\theta\, dx$$
$$=: 2 \bar{I}_k + 2 \hat{I}_k,$$

where

$$\hat{I}_k = \sum_{i=1}^k \int_{H_i} \int \left( \sqrt{f(x \mid \theta)\, f(x_{\theta,i} \mid \theta)} - f(x_{\theta,i} \mid \theta) \right) \pi(\theta)\, d\theta\, dx = -D_k^2.$$

Furthermore,

$$\bar{I}_k = \sum_{i=1}^k \int_{H_i} \int \left( \sqrt{\frac{f(x \mid \theta)}{f(x_{\theta,i} \mid \theta)}} - 1 \right) \left( \sqrt{f(x \mid \theta)} - \sqrt{f(x_{\theta,i} \mid \theta)} \right) \left( \sqrt{f(x \mid \theta)} + \sqrt{f(x_{\theta,i} \mid \theta)} \right) \pi(\theta)\, d\theta\, dx$$
$$= \sum_{i=1}^k \int_{H_i} \int \left( \sqrt{f(x \mid \theta)} - \sqrt{f(x_{\theta,i} \mid \theta)} \right)^2 \left( 1 + \frac{\sqrt{f(x \mid \theta)}}{\sqrt{f(x_{\theta,i} \mid \theta)}} \right) \pi(\theta)\, d\theta\, dx$$
$$= 2 D_k^2 + \sum_{i=1}^k \int_{H_i} \int \frac{\left( \sqrt{f(x \mid \theta)} - \sqrt{f(x_{\theta,i} \mid \theta)} \right)^3}{\sqrt{f(x_{\theta,i} \mid \theta)}}\, \pi(\theta)\, d\theta\, dx$$
$$\le 2 D_k^2 + \left( \sum_{i=1}^k \int_{H_i} \int \left( \sqrt{f(x \mid \theta)} - \sqrt{f(x_{\theta,i} \mid \theta)} \right)^4 \pi(\theta)\, d\theta\, dx \right)^{1/2} \left( \sum_{i=1}^k \int_{H_i} \int \frac{f(x \mid \theta)}{f(x_{\theta,i} \mid \theta)}\, \pi(\theta)\, d\theta\, dx \right)^{1/2}.$$

Again, we choose the endpoints of the intervals as

$$h_i := \tilde{c}_H \int_0^{i/k} x^3 (1-x)^3 f_X^{3/2}(F_X^{-1}(x))\, dx, \qquad \tilde{c}_H^{-1} = \int_0^1 x^3 (1-x)^3 f_X^{3/2}(F_X^{-1}(x))\, dx.$$

Analogously to Lemma 1, it can be seen that

$$\int_{H_i} \int \left( \sqrt{f(x \mid \theta)} - \sqrt{f(x_{\theta,i} \mid \theta)} \right)^4 \pi(\theta)\, d\theta\, dx = o(1/k^4)$$

for $i = 1$ and $i = k$. For $2 \le i \le k-1$ an $\tilde{x}_{\theta,i} \in H_i$ exists such that

$$\sqrt{f(x \mid \theta)} - \sqrt{f(x_{\theta,i} \mid \theta)} = (x - x_{\theta,i})\, \frac{f'(\tilde{x}_{\theta,i} \mid \theta)}{2 \sqrt{f(\tilde{x}_{\theta,i} \mid \theta)}}$$

holds. The arguments concerning the convergence of the Riemann sum used in the proof of Lemma 1 apply here as well. Thus

$$I_k \le 2 D_k^2 + O(1/k^2)$$

if

$$\int\!\int \frac{f'(x \mid \theta)^4}{f(x \mid \theta)^2}\, \pi(\theta)\, f_X(F_X^{-1}(x))^7\, d\theta\, dx < \infty,$$

which is condition C7 for the transformed model.

Proof of Lemma 5. Let $B = \sqrt{F(H_i \mid \theta)/(\delta_i f(x \mid \theta))} = \sqrt{f(x_{\theta,i} \mid \theta)/f(x \mid \theta)}$. Then

$$D_k(x)^2 = \frac{1}{2} \int (B - 1)^2\, \pi(\theta \mid x)\, d\theta$$

and

$$\log B = -(1 - B) - (1 - B)^2 \int_0^1 \frac{v}{1 + v(B - 1)}\, dv = -(1 - B) - \frac{(1 - B)^2}{2} - (1 - B)^3 \int_0^1 \frac{v^2}{1 + v(B - 1)}\, dv,$$

so that, since $I_k(x) = -2 \int \log(B)\, \pi(\theta \mid x)\, d\theta$,

$$I_k(x) = 4 D_k(x)^2 + R_k(x)$$

with

$$R_k(x) = 2 \int (1 - B)^3 \int_0^1 \frac{v^2}{1 + v(B - 1)}\, dv\; f(x \mid \theta)\, \pi(\theta)\, d\theta.$$

Now

$$\left| \int_0^1 \frac{v^2}{1 + v(B - 1)}\, dv \right| \le \frac{1}{B};$$

hence it is sufficient to prove that $\int |1 - B|^3 B^{-1}\, \pi(\theta \mid x)\, d\theta = o(\sum_i \delta_i^3)$ uniformly in $x \in K$. Now

$$f(x \mid \theta)\, \frac{|1 - B|^3}{B} = \frac{\left| \sqrt{f(x \mid \theta)} - \sqrt{f(x_{\theta,i} \mid \theta)} \right|^3}{\sqrt{f(x_{\theta,i} \mid \theta)}} = |x - x_{\theta,i}|^3\, \frac{1}{8}\, \frac{|f'(\tilde{x}_{\theta,i} \mid \theta)|^3}{f(\tilde{x}_{\theta,i} \mid \theta)^{3/2}\, \sqrt{f(x_{\theta,i} \mid \theta)}} \le \delta_i^3\, \frac{1}{8}\, \rho_{\|\delta\|}(x, \theta)$$

with $\tilde{x}_{\theta,i}$ defined as in Lemma 4. Condition C8 implies that for small $\|\delta\|$ the integral $\int \rho_{\|\delta\|}(x, \theta)\, \pi(\theta)\, d\theta$ is finite for any $x \in K$, so that for $x \in H_i$

$$\int_{H_i} \int |1 - B|^3\, B^{-1}\, \pi(\theta \mid x)\, d\theta\, dx \le \delta_i^3\, \frac{1}{8} \int_{H_i} \int \rho_{\|\delta\|}(x, \theta)\, \pi(\theta)\, d\theta\, dx = o(\delta_i^3)$$

uniformly in $x \in K$.

REFERENCES

M. Abramowitz and I. Stegun, "Handbook of Mathematical Functions," U.S. Department of Commerce, Washington, DC, 1972.
G. R. Benitz and J. A. Bucklew, Asymptotically optimal quantizers for detection of i.i.d. data, IEEE Trans. Inform. Theory 35 (1989), 316-325.
W. Cochran, Errors of measurements in statistics, Technometrics 10 (1968), 637-666.
D. R. Cox, Note on grouping, J. Amer. Statist. Assoc. 52 (1957), 543-547.
A. Dempster, N. Laird, and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. Roy. Statist. Soc. Ser. B 39 (1977), 1-38.
A. Dempster and D. Rubin, Rounding error in regression: The appropriateness of Sheppard's corrections, J. Roy. Statist. Soc. Ser. B 45 (1983), 51-59.
D. Heitjan, Inference from grouped continuous data: A review (with discussion), Statist. Sci. 4 (1989), 164-183.
D. Heitjan and D. Rubin, Ignorability and coarse data, Ann. Statist. 19 (1991), 2244-2253.
D. Lindley, Grouping corrections and maximum likelihood equations, Proc. Cambridge Philos. Soc. 46 (1950), 106-110.