A Thesis
Submitted For the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
Ambedkar Dukkipati
March 2006
Abstract
Kullback-Leibler relative-entropy or KL-entropy of P with respect to R, defined as $\int_X \ln\frac{dP}{dR}\,dP$,
where P and R are probability measures on a measurable space (X, M), plays a basic role in the
definitions of classical information measures. It overcomes a shortcoming of Shannon entropy, whose discrete-case definition cannot be extended naturally to the nondiscrete case. Further,
entropy and other classical information measures can be expressed in terms of KL-entropy and
hence properties of their measure-theoretic analogs will follow from those of measure-theoretic
KL-entropy. An important theorem in this respect is the Gelfand-Yaglom-Perez (GYP) Theorem
which equips KL-entropy with a fundamental definition and can be stated as: measure-theoretic
KL-entropy equals the supremum of KL-entropies over all measurable partitions of X. In this
thesis we provide the measure-theoretic formulations for ‘generalized’ information measures, and
state and prove the corresponding GYP-theorem – the ‘generalizations’ being in the sense of Rényi
and nonextensive, both of which are explained below.
Kolmogorov-Nagumo average or quasilinear mean of a vector $x = (x_1, \ldots, x_n)$ with respect to a pmf $p = (p_1, \ldots, p_n)$ is defined as $\langle x \rangle_\psi = \psi^{-1}\!\left(\sum_{k=1}^{n} p_k\, \psi(x_k)\right)$, where ψ is an arbitrary
continuous and strictly monotone function. Replacing linear averaging in Shannon entropy with
Kolmogorov-Nagumo averages (KN-averages) and further imposing the additivity constraint – a
characteristic property of the underlying information associated with a single event, which is logarithmic – leads to the definition of α-entropy or Rényi entropy. This is the first formal and well-known
generalization of Shannon entropy. Using this recipe of Rényi’s generalization, one can prepare
only two information measures: Shannon and Rényi entropy. Indeed, using this formalism Rényi
characterized these additive entropies in terms of axioms of KN-averages. On the other hand, if
one generalizes the information of a single event in the definition of Shannon entropy, by replacing the logarithm with the so-called q-logarithm, defined as $\ln_q x = \frac{x^{1-q} - 1}{1-q}$, one gets
what is known as Tsallis entropy. Tsallis entropy is also a generalization of Shannon entropy
but it does not satisfy the additivity property. Instead, it satisfies pseudo-additivity of the form
$x \oplus_q y = x + y + (1-q)xy$, and hence it is also known as nonextensive entropy. One can apply
Rényi’s recipe in the nonextensive case by replacing the linear averaging in Tsallis entropy with
KN-averages and thereby imposing the constraint of pseudo-additivity. A natural question that
arises is what are the various pseudo-additive information measures that can be prepared with this
recipe? We prove that Tsallis entropy is the only one. Here, we mention that one of the impor-
tant characteristics of this generalized entropy is that while canonical distributions resulting from
‘maximization’ of Shannon entropy are exponential in nature, in the Tsallis case they result in
power-law distributions.
The concept of maximum entropy (ME), originally from physics, has been promoted to a gen-
eral principle of inference primarily by the works of Jaynes and (later on) Kullback. This connects
information theory and statistical mechanics via the principle: the states of thermodynamic equi-
librium are states of maximum entropy, and further connects to statistical inference via the rule: select the probability distribution that maximizes the entropy. The two fundamental principles related to the concept of maximum entropy are Jaynes' maximum entropy principle, which involves maximizing Shannon entropy, and Kullback's minimum entropy principle, which involves minimizing
relative-entropy, with respect to appropriate moment constraints.
Though relative-entropy is not a metric, in cases involving distributions resulting from relative-
entropy minimization, one can bring forth certain geometrical formulations. These are reminiscent
of squared Euclidean distance and satisfy an analogue of Pythagoras' theorem. This property
is referred to as Pythagoras’ theorem of relative-entropy minimization or triangle equality and
plays a fundamental role in geometrical approaches to statistical estimation theory like informa-
tion geometry. In this thesis we state and prove the equivalent of Pythagoras’ theorem in the
nonextensive formalism. For this purpose we study relative-entropy minimization in detail and
present some results.
Finally, we demonstrate the use of power-law distributions, resulting from ME-prescriptions
of Tsallis entropy, in evolutionary algorithms. This work is motivated by the recently proposed
generalized simulated annealing algorithm based on Tsallis statistics.
To sum up, in light of their well-known axiomatic and operational justifications, this thesis
establishes some results pertaining to the mathematical significance of generalized measures of
information. We believe that these results represent an important contribution towards the ongoing
research on understanding the phenomena of information.
To
Bhirava Swamy and Bharati who infected me with a disease called Life
and to
all my Mathematics teachers who taught me how to extract sweetness from it.
-------------
. . . lie down in a garden and extract from the disease,
especially if it’s not a real one, as much sweetness as
possible. There’s a lot of sweetness in it.
Acknowledgements
No one deserves more thanks for the success of this work than my advisers Prof. M. Narasimha
Murty and Dr. Shalabh Bhatnagar. I wholeheartedly thank them for their guidance.
I thank Prof. Narasimha Murty for his continued support throughout my graduate student
years. I always looked to him for advice – academic or non-academic. He has always been a very patient critic of my research approach and results; without his trust and guidance this
thesis would not have been possible. I feel that I am more disciplined, simple and punctual after
working under his guidance.
The opportunity to watch Dr. Shalabh Bhatnagar in action (particularly during discussions)
has fashioned my way of thought in problem solving. He has been a valuable adviser, and I hope
my three and a half years of working with him have left me with at least a few of his qualities.
I am thankful to the Chairman, Department of CSA for all the support.
I am privileged to have learned mathematics from great teachers: Prof. Vittal Rao, Prof. Adi Murty and Prof. A. V. Gopala Krishna. I thank them for instilling in me the rigour of mathematics.
Special thanks are due to Prof. M. A. L. Thathachar for having taught me.
I thank Dr. Christophe Vignat for his criticisms and encouraging advice on my papers.
I wish to thank the CSA staff, Ms. Lalitha, Ms. Meenakshi and Mr. George, for their great help with administrative work. I am thankful to all my labmates: Dr. Vishwanath, Asharaf, Shahid,
Rahul, Dr. Vijaya, for their help. I also thank my institute friends Arjun, Raghav, Ranjna.
I will never forget the time I spent with Asit, Aneesh, Gunti, Ravi. Special thanks to my music
companions, Raghav, Hari, Kripa, Manas, Niki. Thanks to all IISc Hockey club members and my
running mates, Sai, Aneesh, Sunder. I thank Dr. Sai Jagan Mohan for correcting my drafts.
Special thanks are due to Vinita who corrected many of my drafts of papers, this thesis, all the
way from DC and WI. Thanks to Vinita, Moski and Madhulatha for their care.
I am forever indebted to my sister Kalyani for her prayers. My special thanks are due to my
sister Sasi and her husband and to my brother Karunakar and his wife. Thanks to my cousin
Chinni for her special care. The three great new women in my life: my nieces Sanjana (3 years),
Naomika (2 years), Bhavana (3 months) who will always be dear to me. I reserve my special love
for my nephew (new born).
I am indebted to my father for keeping his promise that he will continue to guide me even
though he had to go to unreachable places. I owe everything to my mother for taking care of every
need of mine. I dedicate this thesis to my parents and to my teachers.
Contents

Abstract
Acknowledgements
Notations

1 Prolegomenon
1.1 Summary of Results
1.2 Essentials
1.2.1 What is Entropy?
1.2.2 Why to maximize entropy?
1.3 A reader's guide to the thesis

3.2 Measure-Theoretic Definitions of Generalized Information Measures
3.3 Maximum Entropy and Canonical Distributions
3.4 ME-prescription for Tsallis Entropy
3.4.1 Tsallis Maximum Entropy Distribution
3.4.2 The Case of Normalized q-expectation values
3.5 Measure-Theoretic Definitions Revisited
3.5.1 On Measure-Theoretic Definitions of Generalized Relative-Entropies
3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy
3.6 Gelfand-Yaglom-Perez Theorem in the General Case

6 Conclusions
6.1 Contributions of the Dissertation
6.2 Future Directions
6.3 Concluding Thought

Bibliography
Notations
R+ [0, ∞)
#E Cardinality of a set E
1 Prolegomenon
Abstract
This chapter serves as an introduction to the thesis. The purpose is to motivate the discussion on generalized information measures and their maximum entropy prescriptions by introducing, in broad brush-strokes, a picture of information theory and its relation with statistical mechanics and statistics. It also contains a road-map of the thesis, which should serve as a reader's guide.
“… (axiomatic approach) or, preferably, by both.”

The above passage is quoted from a critical survey on information measures by Csiszár (1974), which summarizes the significance of information measures and the scope of generalizing them. Now we shall see the details.
Let X be a discrete r.v. taking values $x_1, \ldots, x_n$ with probabilities $p(x_k) = p_k$, $k = 1, \ldots, n$. Then, Shannon entropy can be written as the expectation of a function of X as follows. Define a function H which assigns to each value $x_k$ that X takes the value $-\ln p(x_k) = -\ln p_k$, for $k = 1, \ldots, n$. The quantity $-\ln p_k$ is known as the information associated with the single event $x_k$ with probability $p_k$, also known as Hartley information (Aczél & Daróczy, 1975). From this one can infer that the Shannon entropy expression is an average of Hartley information. This interpretation of Shannon entropy, as an average of the information associated with a single event, is central to Rényi's generalization.
Rényi entropies were introduced into mathematics by Alfred Rényi (1960). The
original motivation was strictly formal. The basic idea behind Rényi’s generalization
is that any putative candidate for an entropy should be a mean, and thereby he used a well-known idea in mathematics: the linear mean, though most widely used, is not the only possible way of averaging; one can define a mean with respect to an arbitrary function. Here one should be aware that, to define a ‘meaningful’ generalized mean, one has to restrict the choice of functions to continuous and monotone functions (Hardy, Littlewood, & Pólya, 1934).
Following the above idea, once we replace the linear mean with generalized means,
we have a set of information measures each corresponding to a continuous and mono-
tone function. Can we call every such entity an information measure? Rényi (1960)
postulated that an information measure should satisfy the additivity property, which Shannon entropy itself does. The important consequence of this constraint is that it restricts the choice of function in a generalized mean to linear and exponential functions: if we choose a linear function, we get back Shannon entropy; if we choose an exponential function, we obtain the well-known and much-studied generalization of Shannon entropy
$$S_\alpha(p) = \frac{1}{1-\alpha}\,\ln\sum_{k=1}^{n} p_k^\alpha\,,$$
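As a quick numerical illustration (a sketch of ours, not part of the thesis; the helper names are our own), the following Python fragment evaluates this quantity for a sample pmf and checks that it approaches the Shannon entropy as α → 1:

    import numpy as np

    def shannon_entropy(p):
        """Shannon entropy S(p) = -sum p_k ln p_k (natural logarithm)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # convention: 0 ln 0 = 0
        return -np.sum(p * np.log(p))

    def renyi_entropy(p, alpha):
        """Renyi entropy S_alpha(p) = ln(sum p_k^alpha) / (1 - alpha), alpha != 1."""
        p = np.asarray(p, dtype=float)
        return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

    p = np.array([0.5, 0.25, 0.125, 0.125])
    print(shannon_entropy(p))             # ~1.2130
    print(renyi_entropy(p, 0.999))        # approaches the Shannon value as alpha -> 1
    print(renyi_entropy(p, 2.0))          # collision entropy, smaller than the Shannon value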
entropy, Havrda and Charvát (1967) observed that for operational purposes it seems more natural to consider the simpler expression $\sum_{k=1}^{n} p_k^\alpha$ as an information measure instead of Rényi entropy (up to a constant factor). Characteristics of this information measure were studied by Daróczy (1970) and Forte and Ng (1973), and it has been shown that this quantity permits simpler postulational characterizations (for a summary of the discussion see (Csiszár, 1974)).
While generalized information measures, after Rényi’s work, continued to be of
interest to many mathematicians, it was in 1988 that they came to attention in Physics
when Tsallis reinvented the above-mentioned Havrda and Charvát entropy (up to a constant factor), and specified it in the form (Tsallis, 1988)
$$S_q(p) = \frac{1 - \sum_k p_k^q}{q-1}\,.$$
Though this expression looks somewhat similar to the Rényi entropy and retrieves
Shannon entropy in the limit q → 1, Tsallis entropy has the remarkable, albeit not
yet understood, property that in the case of independent experiments, it is not addi-
tive. Hence, statistical formalism based on Tsallis entropy is also termed nonextensive
statistics.
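To make this non-additivity concrete, the following sketch (ours; the pmfs are illustrative) evaluates $S_q$ on the joint pmf of two independent variables and checks the pseudo-additive composition $S_q(X\times Y) = S_q(X) + S_q(Y) + (1-q)\,S_q(X)\,S_q(Y)$ mentioned in the abstract, together with the recovery of Shannon entropy as q → 1:

    import numpy as np

    def tsallis_entropy(p, q):
        """Tsallis entropy S_q(p) = (1 - sum p_k^q) / (q - 1), q != 1."""
        p = np.asarray(p, dtype=float)
        return (1.0 - np.sum(p ** q)) / (q - 1.0)

    px = np.array([0.7, 0.3])
    py = np.array([0.4, 0.4, 0.2])
    pxy = np.outer(px, py).ravel()        # joint pmf of independent X and Y

    q = 1.5
    lhs = tsallis_entropy(pxy, q)
    rhs = tsallis_entropy(px, q) + tsallis_entropy(py, q) \
          + (1.0 - q) * tsallis_entropy(px, q) * tsallis_entropy(py, q)
    print(np.isclose(lhs, rhs))           # True: pseudo-additivity for independent X, Y

    # q -> 1 recovers the ordinary (additive) Shannon entropy
    print(tsallis_entropy(px, 1.0001), -np.sum(px * np.log(px)))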
Next, we discuss what information measures have to do with statistics.
Probabilities are unobservable quantities in the sense that one cannot determine the val-
ues of these corresponding to a random experiment by simply an inspection of whether
the events do, in fact, occur or not. Assessing the probability of the occurrence of some
event or of the truth of some hypothesis is the important question one runs up against
in any application of probability theory to the problems of science or practical life.
Although the mathematical formalism of probability theory serves as a powerful tool
when analyzing such problems, it cannot, by itself, answer this question. Indeed, the
formalism is silent on this issue, since its goal is just to provide theorems valid for
all probability assignments allowed by its axioms. Hence, recourse is necessary to
an additional rule which tells us in which case one ought to assign which values to
probabilities.
In 1957, Jaynes proposed a rule to assign numerical values to probabilities in cir-
cumstances where certain partial information is available. Jaynes showed, in particu-
lar, how this rule, when applied to statistical mechanics, leads to the usual canonical
distributions in an extremely simple fashion. The concept he used was ‘maximum
entropy’.
With his maximum entropy principle, Jaynes re-derived Gibbs-Boltzmann statistical mechanics à la information theory in his two papers (Jaynes, 1957a, 1957b). This
principle states that the states of thermodynamic equilibrium are states of maximum
entropy. Formally, let $p_1, \ldots, p_n$ be the probabilities that a particle in a system has energies $E_1, \ldots, E_n$ respectively; then the well-known Gibbs-Boltzmann distribution
$$p_k = \frac{e^{-\beta E_k}}{Z}, \qquad k = 1, \ldots, n,$$
can be deduced by maximizing the Shannon entropy functional $-\sum_{k=1}^{n} p_k \ln p_k$ with respect to the constraint of known expected energy $\sum_{k=1}^{n} p_k E_k = U$, along with the normalizing constraint $\sum_{k=1}^{n} p_k = 1$. Z is called the partition function and can be specified as
$$Z = \sum_{k=1}^{n} e^{-\beta E_k}.$$
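A minimal numerical sketch (ours; the energy levels and β are purely illustrative) of this prescription:

    import numpy as np

    E = np.array([0.0, 1.0, 2.0, 3.0])    # illustrative energy levels
    beta = 0.75                            # illustrative inverse temperature

    Z = np.sum(np.exp(-beta * E))          # partition function
    p = np.exp(-beta * E) / Z              # Gibbs-Boltzmann distribution

    print(p, p.sum())                      # pmf, sums to 1
    print(np.sum(p * E))                   # mean energy U fixed by this choice of beta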
Though use of maximum entropy has its historical roots in physics (e.g., Elsasser,
1937) and economics (e.g., Davis, 1941), later on, Jaynes showed that a general method
of statistical inference could be built upon this rule, which subsumes the techniques of
statistical mechanics as a mere special case. The principle of maximum entropy states
that, of all the distributions p that satisfy the constraints, one should choose the distri-
bution with largest entropy. In the above formulation of Gibbs-Boltzmann distribution
one can view the mean energy constraint and normalizing constraints as the only avail-
able information. Also, this principle is a natural extension of Laplace’s famous prin-
ciple of insufficient reason, which postulates that the uniform distribution is the most
satisfactory representation of our knowledge when we know nothing about the random
variate except that each probability is nonnegative and the sum of the probabilities is
unity; it is easy to show that Shannon entropy is maximum for uniform distribution.
The maximum entropy principle is used in many fields, ranging from physics (for example, Bose-Einstein and Fermi-Dirac statistics can be cast as if derived from the maximum entropy principle) and chemistry to image reconstruction and stock market analysis, and recently in machine learning.
While Jaynes was developing his maximum entropy principle for statistical inference problems, a more general principle was proposed by Kullback (1959, pp. 37), which is known as the minimum entropy principle. This principle comes into the picture in problems where inductive inference is used to update from a prior probability distribution to a posterior distribution whenever new information becomes available. This
principle states that, given a prior distribution r, of all the distributions p that satisfy the constraints, one should choose the distribution with the least Kullback-Leibler relative-entropy
$$I(p\|r) = \sum_{k=1}^{n} p_k \ln\frac{p_k}{r_k}.$$
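For concreteness, a few lines (our own illustration) computing this discrete relative-entropy; they also exhibit its nonnegativity and its asymmetry in the two arguments:

    import numpy as np

    def kl_entropy(p, r):
        """Kullback-Leibler relative-entropy I(p||r) = sum p_k ln(p_k / r_k)."""
        p, r = np.asarray(p, float), np.asarray(r, float)
        mask = p > 0                       # convention: 0 ln(0/r) = 0
        return np.sum(p[mask] * np.log(p[mask] / r[mask]))

    p = np.array([0.6, 0.3, 0.1])
    r = np.array([1/3, 1/3, 1/3])
    print(kl_entropy(p, r))                # >= 0, zero only when p == r
    print(kl_entropy(r, p))                # a different value: I is not symmetric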
Power-law Distributions
Despite the great success of the standard ME-principle, it is a well known fact that
there are many relevant probability distributions in nature which are not easily deriv-
able from Jaynes-Shannon prescription: Power-law distributions constitute an interest-
ing example. If one sticks to the standard logarithmic entropy, ‘awkward constraints’
are needed in order to obtain power-law type distributions (Tsallis et al., 1995). Does
Jaynes ME-principle suggest in a natural way the possibility of incorporating alter-
native entropy functionals to the variational principle? It seems that if one replaces
Shannon entropy with its generalization, ME-prescriptions ‘naturally’ result in power-
law distributions.
Power-law distributions can be obtained by optimizing Tsallis entropy under appropriate constraints. The distribution thus obtained is termed the q-exponential distribution. The associated q-exponential function of x is $e_q(x) = [1 + (1-q)x]_{+}^{\frac{1}{1-q}}$, with the notation $[a]_+ = \max\{0, a\}$, and it converges to the ordinary exponential function in the limit $q \to 1$. Hence the formalism of Tsallis offers continuity between the Boltzmann-Gibbs distribution and power-law distributions, governed by the nonextensive parameter q. The Boltzmann-Gibbs distribution is a special case of the power-law distribution of the Tsallis prescription: as we let $q \to 1$, we recover the exponential.
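A small sketch (ours) of the q-exponential function: it reduces to the ordinary exponential as q → 1, while for q > 1 its tail decays as a power law rather than exponentially:

    import numpy as np

    def q_exp(x, q):
        """q-exponential e_q(x) = [1 + (1-q) x]_+^(1/(1-q)); ordinary exp at q = 1."""
        if np.isclose(q, 1.0):
            return np.exp(x)
        base = np.maximum(1.0 + (1.0 - q) * x, 0.0)   # [.]_+ = max{0, .}
        return base ** (1.0 / (1.0 - q))

    x = -5.0
    print(q_exp(x, 1.0), q_exp(x, 1.001))  # nearly equal: q -> 1 limit
    # For q = 1.5 the tail decays like t^(-1/(q-1)) = t^(-2), a power law
    for t in [10.0, 100.0, 1000.0]:
        print(t, q_exp(-t, 1.5))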
Here, we take up an important real-world example, where significance of power-
law distribution can be demonstrated.
The importance of power-law distributions in the domain of computer science was
first precipitated in 1999 in the study of connectedness of World Wide Web (WWW).
Using a Web crawler, Barabási and Albert (1999) mapped the connectedness of the
Web. To their surprise, the web did not have an even distribution of connectivity (so-
called “random connectivity”). Instead, a very few network nodes (called “hubs”)
were far more connected than other nodes. In general, they found that the probability
p(k) that a node in the network connects with k other nodes was, in a given network,
proportional to k −γ , where the degree exponent γ is not universal and depends on
the detail of network structure. Pictorial depiction of random networks and scale-free
networks is given in Figure 1.1.
Here we wish to point out that, using the q-exponential function, p(k) can be rewritten as $p(k) = e_q(-k/\kappa)$, where $q = 1 + \frac{1}{\gamma}$ and $\kappa = (q-1)k_0$. This implies that the Barabási-Albert solution optimizes the Tsallis entropy (Abe & Suzuki, 2004).
One more interesting example is the distribution of scientific articles in journals
(Naranan, 1970). If the journals are divided into groups, each containing the same
number of articles on a given subject, then the numbers of journals in the succeeding groups form a geometrical progression.

Figure 1.1: Structure of Random and Scale-Free Networks
Tsallis nonextensive formalism has been applied to analyze various phenomena which exhibit power-laws, for example stock markets (Queirós et al., 2005), citations
of scientific papers (Tsallis & de Albuquerque, 2000), scale-free network of earth-
quakes (Abe & Suzuki, 2004), models of network packet traffic (Karmeshu & Sharma,
2006) etc. To a great extent, the success of Tsallis proposal is attributed to the ubiquity
of power law distributions in nature.
Until now we have considered information measures in the discrete case, where the
number of configurations is finite. Is it possible to extend the definitions of information
measures to non-discrete cases, or to even more general cases? For example can we
write Shannon entropy in the continuous case, naively, as
$$S(p) = -\int p(x)\,\ln p(x)\,dx$$
for a probability density p(x)? It turns out that in the above continuous case, entropy
functional poses a formidable problem if one interprets it as an information measure.
Information measures extended to abstract spaces are important not only for math-
ematical reasons, the resultant generality and rigor could also prove important for even-
tual applications. Even in communication problems discrete memoryless sources and
channels are not always adequate models for real-world signal sources or communica-
tion and storage media. Metric spaces of functions, vectors and sequences as well as
random fields naturally arise as models of source and channel outcomes (Cover, Gacs,
& Gray, 1989). The by-products of general rigorous definitions have the potential for
proving useful new properties, for providing insight into their behavior and for finding
formulas for computing such measures for specific processes.
Immediately after Shannon published his ideas, the problem of extending the defi-
nitions of information measures to abstract spaces was addressed by well-known math-
ematicians of the time, Kolmogorov (1956, 1957) (for an excellent review on Kol-
mogorov’s contributions to information theory see (Cover et al., 1989)), Dobrushin
(1959), Gelfand (1956, 1959), Kullback (Kullback, 1959), Pinsker (1960a, 1960b),
Yaglom (1956, 1959), Perez (1959), Rényi (1960), Kallianpur (1960), etc.
We now examine why extending the Shannon entropy to the non-discrete case is a nontrivial problem. Firstly, probability densities mostly carry a physical dimension (say, probability per length), which gives the entropy functional the unit of ‘ln cm’, which seems somewhat odd. Also, in contrast to its discrete-case counterpart, this expression is not invariant under a reparametrization of the domain, e.g., by a change of units. Further, S may now become negative, and is bounded neither from above nor below, so that new problems of definition appear, cf. (Hardy et al., 1934, pp. 126).
These problems are clarified if one considers how to construct an entropy for a
continuous probability distribution starting from the discrete case. A natural approach
is to consider the limit of the finite discrete entropies corresponding to a sequence of
finite partitions of an interval (on which entropy is defined) whose norms tend to zero.
Unfortunately, this approach does not work, because this limit is infinite for all con-
tinuous probability distributions. Such divergence is also obtained–and explained–if
one adopts the well-known interpretation of the Shannon entropy as the least expected
number of yes/no questions needed to identify the value of x, since in general it takes
an infinite number of such questions to identify a point in the continuum (of course,
this interpretation supposes that the logarithm in entropy functional has base 2).
To overcome the problems posed by the definition of the entropy functional in the continuum, the solution suggested was to consider the expression in the discrete case (cf. Gelfand et al., 1956; Kolmogorov, 1957; Kullback, 1959)
$$S(p|\mu) = -\sum_{k=1}^{n} p(x_k)\,\ln\frac{p(x_k)}{\mu(x_k)},$$
where µ(xk ) are positive weights determined by some ‘background measure’ µ. Note
that the above entropy functional S(p|µ) is the negative of Kullback-Leibler relative-
entropy or KL-entropy when we consider that µ(x k ) are positive and sum to one.
Now, one can show that the present entropy functional, which is defined in terms of
KL-entropy, however has a natural extension to the continuous case (Topsøe, 2001,
Theorem 5.2). This is because, if one now partitions the real line in increasingly finer
subsets, the probabilities corresponding to p and the background weights correspond-
ing to µ are both split simultaneously and the logarithm of their ratio will generally
not diverge.
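The following sketch (ours; the choice of a standard normal density on [−5, 5] and a uniform background is purely illustrative) shows this numerically: the plain discrete entropy of the bin probabilities grows roughly like ln n as the partition is refined, while the relative-entropy with respect to the background weights converges:

    import numpy as np
    from scipy.stats import norm

    for n in [10, 100, 1000, 10000]:
        edges = np.linspace(-5.0, 5.0, n + 1)
        p = np.diff(norm.cdf(edges))                 # bin probabilities of N(0,1)
        p = p / p.sum()
        mu = np.full(n, 1.0 / n)                     # uniform background weights
        mask = p > 0
        discrete_entropy = -np.sum(p[mask] * np.log(p[mask]))    # diverges ~ ln n
        kl = np.sum(p[mask] * np.log(p[mask] / mu[mask]))        # converges
        print(n, discrete_entropy, kl)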
This is how KL-entropy plays an important role in the definitions of information measures extended to the continuum. Based on the above ideas one can extend the information measures to a measure space (X, M, µ); here µ is exactly the same as that which appeared in the above definition of the entropy functional S(p|µ) in the discrete case. The entropy functionals in both the discrete and continuous cases can be retrieved by appropriately choosing the reference measure µ. Such a definition of information measures on measure spaces can be used in ME-prescriptions, which are consistent with the prescriptions obtained when their discrete counterparts are used.
One can find the continuum and measure-theoretic aspects of entropy functionals
in the information theory text of Guiaşu (1977). A concise and very good discussion
on ME-prescriptions of continuous entropy functionals can be found in (Uffink, 1995).
One can see from the above discussions that the two generalizations of Shannon en-
tropy, Rényi and Tsallis, originated or developed from different fields. Though Rényi’s
generalization originated in information theory, it has been studied in statistical me-
chanics (e.g., Bashkirov, 2004) and statistics (e.g., Morales et al., 2004). Similarly,
Tsallis generalization was mainly studied in statistical mechanics when it was pro-
posed, but, now, Shannon-Khinchin axioms have been extended to Tsallis entropy (Su-
yari, 2004a) and applied to statistical inference problems (e.g., Tsallis, 1998). This
elicits no surprise because from the above discussion one can see that information
theory is naturally connected to statistical mechanics and statistics.
The study of the mathematical properties and applications of generalized infor-
mation measures and, further, new formulations of the maximum entropy principle
based on these generalized information measures constitute a currently growing field
of research. It is in this line of inquiry that this thesis presents some results pertain-
ing to mathematical properties of generalized information measures and their ME-
prescriptions, including the results related to measure-theoretic formulations of the
same.
Finally, note that Rényi and Tsallis generalizations can be ‘naturally’ applied to
Kullback-Leibler relative entropy to define generalized relative-entropy measures, which
are extensively studied in the literature. Indeed, the major results that we present in
this thesis are related to these generalized relative-entropies.
One can view Rényi's formalism as a tool which can be used to generalize information measures and thereby characterize them using axioms of Kolmogorov-Nagumo averages (KN-averages). For example, one can apply Rényi's recipe in the nonextensive case by replacing the linear averaging in Tsallis entropy with KN-averages and thereby impose the constraint of pseudo-additivity. A natural question that arises is: what are the various pseudo-additive information measures that one can prepare with this recipe? In this thesis we prove that only Tsallis entropy is possible in this case, using which we characterize Tsallis entropy based on axioms of KN-averages.
Owing to the probabilistic settings for information theory, it is natural that more gen-
eral definitions of information measures can be given on measure spaces. In this thesis
we develop measure-theoretic formulations for generalized information measures and
present some related results.
One can give measure-theoretic definitions for Rényi and Tsallis entropies along
similar lines as Shannon entropy. One can also show that, as is the case with Shannon
entropy, these measure-theoretic definitions are not natural extensions of their discrete
analogs. In this context we present two results: (i) we prove that, as in the case of
classical ‘relative-entropy’, generalized relative-entropies, whether Rényi or Tsallis,
can be extended naturally to the measure-theoretic case, and (ii) we show that, ME-
prescriptions of measure-theoretic Tsallis entropy are consistent with the discrete case.
Another important result that we present in this thesis is the Gelfand-Yaglom-Perez
(GYP) theorem for Rényi relative-entropy, which can be easily extended to Tsallis
relative-entropy. GYP-theorem for Kullback-Leibler relative-entropy is a fundamental
theorem which plays an important role in extending discrete case definitions of various
classical information measures to the measure-theoretic case. It also provides a means
to compute relative-entropy and study its behavior.
Recently, power-law distributions have been used in simulated annealing, which is claimed to perform better than classical simulated annealing. In this thesis we demonstrate the use of power-law distributions in evolutionary algorithms (EAs). The proposed algorithm uses the Tsallis generalized canonical distribution, which is a one-parameter generalization of the Boltzmann distribution, to weigh the configurations in the selection mechanism. We provide some simulation results in this regard.
1.2 Essentials
This section details some heuristic explanations for the logarithmic nature of Hart-
ley and Shannon entropies. We also discuss some notations and why the concept of
“maximum entropy” is important.
The logarithmic nature of Hartley and Shannon information measures, and their ad-
ditivity properties can be explained by heuristic arguments. Here we give one such
explanation (Rényi, 1960).
To characterize an element of a set of size n we need $\log_2 n$ units of information, where a unit is a bit. The important feature of the logarithmic information measure is its additivity: if a set E is a disjoint union of m sets of n elements each, $E_1, \ldots, E_m$, then we can specify an element of this mn-element set E in two steps: first we need $\log_2 m$ bits of information to describe which of the sets $E_1, \ldots, E_m$, say $E_k$, contains the element, and we then need $\log_2 n$ further bits of information to tell which element of this set $E_k$ is the considered one. The information needed to characterize an element of E is the ‘sum’ of the two partial informations. Indeed, $\log_2 nm = \log_2 n + \log_2 m$.
The next step is due to Shannon (1948). He pointed out that Hartley's formula is valid only if the elements of E are equiprobable; if their probabilities are not equal, the situation changes and we arrive at the formula (2.15). If all the probabilities are equal to $\frac{1}{n}$, Shannon's formula (2.15) reduces to Hartley's formula: $S(p) = \log_2 n$.
Shannon's formula has the following heuristic motivation. Let E be the disjoint union of the sets $E_1, \ldots, E_n$ having $N_1, \ldots, N_n$ elements respectively ($\sum_{k=1}^{n} N_k = N$). Let us suppose that we are interested only in knowing the subset $E_k$ to which a given element of E belongs. Suppose that the elements of E are equiprobable. The information characterizing an element of E consists of two parts: the first specifies the subset $E_k$ containing this particular element and the second locates it within $E_k$. The amount of the second piece of information is $\log_2 N_k$ (by Hartley's formula); thus it depends on the index k. To specify an element of E we need $\log_2 N$ bits of information and, as we have seen, it is composed of the information specifying $E_k$ – its amount will be denoted by $H_k$ – and of the information within $E_k$. According to the principle of additivity, we have $\log_2 N = H_k + \log_2 N_k$, or $H_k = \log_2\frac{N}{N_k}$. It is plausible to define the information needed to identify the subset $E_k$ which the considered element
belongs to as the weighted average of the informations $H_k$, where the weights are the probabilities that the element belongs to the $E_k$'s. Thus,
$$S = \sum_{k=1}^{n} \frac{N_k}{N}\, H_k,$$
from which we obtain the Shannon entropy expression using the above interpretation $H_k = \log_2\frac{N}{N_k}$ and the notation $p_k = \frac{N_k}{N}$.
Now we note one more important idea behind the Shannon entropy. We frequently
come across Shannon entropy being treated as both a measure of uncertainty and of
information. How is this rendered possible?
If X is the underlying random variable, then S(p) is also written as S(X) though it
does not depend on the actual values of X. With this, one can say that S(X) quantifies
how much information we gain, on an average, when we learn the value of X. An
alternative view is that the entropy of X measures the amount of uncertainty about
X before we learn its value. These two views are complementary; we can either view
entropy as a measure of our uncertainty before we learn the value of X, or as a measure
of how much information we have gained after we learn the value of X.
Following this, one can see that Shannon entropy for the most ‘certain’ distribution $(0, \ldots, 1, \ldots, 0)$ returns the value 0, and for the most ‘uncertain’ distribution $(\frac{1}{n}, \ldots, \frac{1}{n})$ returns the value $\ln n$. Further, one can show the inequality
$$0 \leq S(p) \leq \ln n,$$
for any probability distribution p. The inequality $S(p) \geq 0$ is easy to verify. Let us prove that for any probability distribution $p = (p_1, \ldots, p_n)$ we have
$$S(p) = S(p_1, \ldots, p_n) \leq S\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = \ln n. \quad (1.1)$$
Here, we shall see the proof. I One way of showing this property is by using the Jensen inequality for real-valued continuous functions. Let f(x) be a real-valued continuous concave function defined on the interval [a, b]. Then for any $x_1, \ldots, x_n \in [a, b]$ and any set of non-negative real numbers $\lambda_1, \ldots, \lambda_n$ such that $\sum_{k=1}^{n} \lambda_k = 1$, we have
$$\sum_{k=1}^{n} \lambda_k f(x_k) \leq f\!\left(\sum_{k=1}^{n} \lambda_k x_k\right). \quad (1.2)$$
Applying (1.2) with $f = \ln$, $\lambda_k = p_k$ and $x_k = \frac{1}{p_k}$ gives $S(p) = \sum_{k=1}^{n} p_k \ln\frac{1}{p_k} \leq \ln\left(\sum_{k=1}^{n} p_k\,\frac{1}{p_k}\right) = \ln n$, and hence the result.
Alternatively, one can use Lagrange's method to maximize entropy subject to the normalization condition of the probability distribution, $\sum_{k=1}^{n} p_k = 1$. In this case the Lagrangian is
$$L \equiv -\sum_{k=1}^{n} p_k \ln p_k - \lambda\left(\sum_{k=1}^{n} p_k - 1\right),$$
and setting $\frac{\partial L}{\partial p_k} = 0$ gives
$$-(1 + \ln p_k) - \lambda = 0, \qquad k = 1, \ldots, n,$$
which gives
$$p_1 = p_2 = \ldots = p_n = \frac{1}{n}. \quad (1.3)$$
The second derivative, $\frac{\partial^2 L}{\partial p_k^2} = -\frac{1}{p_k}$, is always negative definite, so that the values from (1.3) determine a maximum value, which, because of the concavity property, is also the global maximum value. Hence the result. J
$$p_k = \frac{1}{n}, \qquad k = 1, \ldots, n.$$
We can restate the principle as: the uniform distribution is the most satisfactory representation of our knowledge when we know nothing about the random variate except that each probability is nonnegative and the sum of the probabilities is unity. This rule, of course, refers to the meaning of the concept of probability, and is therefore subject to debate and controversy. We will not discuss this here; one can refer to (Uffink, 1995) for a list of objections to this principle reported in the literature.
Now, having Shannon entropy as a measure of uncertainty (information), can we generalize the principle of insufficient reason and say that, with the available information, we can always choose the distribution which maximizes the Shannon entropy? This is what is known as Jaynes' maximum entropy principle, which states that of all the probability distributions that satisfy given constraints, one should choose the distribution which maximizes Shannon entropy. That is, if our state of knowledge is appropriately represented by a set of expectation values, then the “best”, least unbiased probability distribution is the one that (i) reflects just what we know, without “inventing” unavailable pieces of knowledge, and, additionally, (ii) maximizes ignorance: the truth, all the truth, nothing but the truth. This is the rationale behind the maximum entropy principle.
Now we shall examine this principle in detail. Let us assume that some information
about the random variable X is given which can be modeled as a constraint on the set
of all possible probability distributions. It is assumed that this constraint exhaustively
specifies all relevant information about X. The principle of maximum entropy is then
the prescription to choose that probability distribution p for which the Shannon entropy
is maximal under the given constraint.
Here we take a simple and often studied type of constraint, i.e., the case where the expectation of X is given. Say we have the constraint
$$\sum_{k=1}^{n} x_k\, p_k = U.$$
Maximizing the Shannon entropy subject to this constraint and the normalization $\sum_{k=1}^{n} p_k = 1$ (by the Lagrange multiplier method, as before) gives the stationarity condition
$$\ln p_k = -\lambda - \beta x_k,$$
where λ and β are the Lagrange multipliers.
The Lagrange parameter λ can be specified by the normalizing constraint. Finally, the maximum entropy distribution can be written as
$$p_k = \frac{e^{-\beta x_k}}{\sum_{k=1}^{n} e^{-\beta x_k}},$$
where β is determined by the expectation constraint.
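In a small numerical sketch (ours; the values of X and U are illustrative), β can be obtained by root finding so that the mean constraint is met:

    import numpy as np
    from scipy.optimize import brentq

    x = np.array([1.0, 2.0, 3.0, 4.0])    # values taken by X (illustrative)
    U = 1.7                                # prescribed expectation

    def maxent_pmf(beta):
        w = np.exp(-beta * x)
        return w / w.sum()

    # choose beta so that the mean constraint sum_k x_k p_k = U is met
    beta = brentq(lambda b: np.dot(x, maxent_pmf(b)) - U, -50.0, 50.0)
    p = maxent_pmf(beta)
    print(beta, p, np.dot(x, p))           # np.dot(x, p) ~ 1.7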
The commonly used notation in the thesis is given in the beginning of the chapters.
When we write down the proofs of some results which are not specified in the Theorem/Lemma environment, we denote the beginning and ending of proofs by I and J respectively. Otherwise, the ends of proofs that are part of the above are identified by □. Some additional explanations within the results are included in the footnotes.
To avoid proliferation of symbols we use the same notation for different concepts
if this does not cause ambiguity; the correspondence should be clear from the con-
text. For example whether it is a maximum entropy distribution or minimum relative-
entropy distribution we use the same symbols for Lagrange multipliers.
Roadmap
Apart from this chapter this thesis contains five other chapters. We now briefly outline
a summary of each chapter.
In Chapter 2, we present a brief introduction of generalized information measures
and their properties. We discuss how generalized means play a role in the information
measures and present a result related to generalized means and Tsallis generalization.
In Chapter 3, we discuss various aspects of information measures defined on
measure spaces. We present measure-theoretic definitions for generalized information
measures and present important results.
In Chapter 4, we discuss the geometrical aspects of relative-entropy minimization
and present an important result for Tsallis relative-entropy minimization.
In Chapter 5, we apply power-law distributions to selection mechanism in evolu-
tionary algorithms and test their novelty by simulations.
Finally, in Chapter 6, we summarize the contributions of this thesis, and discuss
possible future directions.
2 KN-averages and Entropies:
Rényi’s Recipe
Abstract
This chapter builds the background for this thesis and introduces Rényi and Tsallis
(nonextensive) generalizations of classical information measures. It also presents
a significant result on the relation between Kolmogorov-Nagumo averages and nonex-
tensive generalization, which can also be found in (Dukkipati, Murty, & Bhatnagar,
2006b).
averages in Tsallis entropy with KN-averages and thereby imposing the constraint of
pseudo-additivity. A natural question that arises is what are the pseudo-additive infor-
mation measures that one can prepare with this recipe? We prove that Tsallis entropy is
the only possible measure in this case, which allows us to characterize Tsallis entropy
using axioms of KN-averages.
As one can see from the above discussion, Hartley information measure (Hartley,
1928) of a single stochastic event plays a fundamental role in the Rényi and Tsallis
generalizations. Generalization of Rényi involves the generalization of linear average
in Shannon entropy, where as, in the case of Tsallis, it is the generalization of the
Hartley function; while Rényi’s is considered to be the additive generalization, Tsal-
lis is non-additive. These generalizations can be extended to Kullback-Leibler (KL)
relative-entropy too; indeed, many results presented in this thesis are related to gener-
alized relative entropies.
First, we discuss the important properties of classical information measures, Shan-
non and KL, in § 2.1. We discuss Rényi’s generalization in § 2.2, where we discuss
the Hartley function and properties of quasilinear means. Nonextensive generalization
of Shannon entropy and relative-entropy is presented in detail in § 2.3. Results on the
uniqueness of Tsallis entropy under Rényi’s recipe and characterization of nonexten-
sive information measures are presented in § 2.4 and § 2.5 respectively.
2.1.1 Shannon Entropy
The convention that $0 \ln 0 = 0$ is followed, which can be justified by the fact that $\lim_{x\to 0} x \ln x = 0$. This formula was discovered independently by Wiener (1948); hence, it is also known as the Shannon-Wiener entropy.
Note that the entropy functional (2.1) is determined completely by the pmf p of r.v
X, and does not depend on the actual values that X takes. Hence, entropy functional
is often denoted as a function of pmf alone as S(p) or S(p 1 , . . . , pn ); we use all these
notations, interchangeably, depending on the context. The logarithmic function in (2.1)
can be taken with respect to an arbitrary base greater than unity. In this thesis, we
always use the base e unless otherwise mentioned.
Shannon entropy of a Bernoulli variate is known as the Shannon entropy function, which is defined as follows. Let X be a Bernoulli variate with pmf $(p, 1-p)$ where $0 < p < 1$. The Shannon entropy of X, or Shannon entropy function, is defined as
$$s(p) = -p \ln p - (1-p)\ln(1-p). \quad (2.2)$$
s(p) attains its maximum value at $p = \frac{1}{2}$. Later in this chapter we use this function to compare the Shannon entropy functional with the generalized information measures, Rényi and Tsallis, graphically.
Also, the Shannon entropy function is of basic importance, as Shannon entropy can be expressed through it as follows:
$$S(p_1, \ldots, p_n) = (p_1+p_2)\,s\!\left(\frac{p_2}{p_1+p_2}\right) + (p_1+p_2+p_3)\,s\!\left(\frac{p_3}{p_1+p_2+p_3}\right) + \ldots + (p_1+\ldots+p_n)\,s\!\left(\frac{p_n}{p_1+\ldots+p_n}\right) = \sum_{k=2}^{n} (p_1+\ldots+p_k)\,s\!\left(\frac{p_k}{p_1+\ldots+p_k}\right). \quad (2.3)$$
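A quick numerical check of this decomposition (our own illustration, with s the Shannon entropy function (2.2)):

    import numpy as np

    def s(p):
        """Shannon entropy function of a Bernoulli variate: -p ln p - (1-p) ln(1-p)."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * np.log(p) - (1 - p) * np.log(1 - p)

    p = [0.4, 0.3, 0.2, 0.1]
    lhs = -sum(pk * np.log(pk) for pk in p)                 # S(p1,...,pn)
    partial = np.cumsum(p)                                  # p1, p1+p2, ...
    rhs = sum(partial[k] * s(p[k] / partial[k]) for k in range(1, len(p)))
    print(np.isclose(lhs, rhs))                             # True: identity (2.3)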
$S(p) \geq 0$ for any pmf $p = (p_1, \ldots, p_n)$, and it assumes its minimum value, $S(p) = 0$, for a degenerate distribution, i.e., $p(x_0) = 1$ for some $x_0 \in X$ and $p(x) = 0$ for all $x \in X$, $x \neq x_0$. If p is not degenerate then S(p) is strictly positive. For any probability distribution $p = (p_1, \ldots, p_n)$ we have
$$S(p) = S(p_1, \ldots, p_n) \leq S\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = \ln n, \quad (2.4)$$
i.e., Shannon entropy is maximized by the uniform distribution. For independent random variables X and Y, Shannon entropy is additive, $S(X \times Y) = S(X) + S(Y)$, where $X \times Y$ denotes the joint r.v. of X and Y. When X and Y are not necessarily independent, then¹
$$S(X \times Y) \leq S(X) + S(Y),$$
i.e., the entropy of the joint experiment is less than or equal to the sum of the uncertainties of the two experiments. This is called the subadditivity property.
Many sets of axioms for Shannon entropy have been proposed. Shannon (1948)
has originally given a characterization theorem of the entropy introduced by him. A
more general and exact one is due to Hinčin (1953), generalized by Faddeev (1986).
The most intuitive and compact axioms are given by Khinchin (1956), which are
known as the Shannon-Khinchin axioms. Faddeev’s axioms can be obtained as a spe-
cial case of Shannon-Khinchin axioms cf. (Guiaşu, 1977, pp. 9, 63).
Here we list the Shannon-Khinchin axioms. Consider the sequence of functions $S(1), S(p_1, p_2), \ldots, S(p_1, \ldots, p_n), \ldots$, where, for every n, the function $S(p_1, \ldots, p_n)$ is defined on the set
$$\mathcal{P} = \left\{ (p_1, \ldots, p_n) \;\middle|\; p_i \geq 0,\ \sum_{i=1}^{n} p_i = 1 \right\}.$$
¹This follows from the fact that $S(X \times Y) = S(X) + S(Y|X)$, and conditional entropy $S(Y|X) \leq S(Y)$, where
$$S(Y|X) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\,\ln p(y_j|x_i).$$
Consider the following axioms:
[SK1] continuity: For any n, the function S(p 1 , . . . , pn ) is continuous and symmetric
with respect to all its arguments,
[SK2] expandability: For every n, we have
S(p1 , . . . , pn , 0) = S(p1 , . . . , pn ) ,
where c is any positive constant. Proof of this uniqueness theorem for Shannon entropy
can be found in (Khinchin, 1956) or in (Guiaşu, 1977, Theorem 1.1, pp. 9).
where one would assume that whenever $r_k = 0$, the corresponding $p_k = 0$, and $0 \ln\frac{0}{0} = 0$. Following Rényi (1961), if p and r are pmfs of the same r.v. X, the relative-entropy
is sometimes synonymously referred to as the information gain about X achieved if p
can be used instead of r. KL-entropy as a distance measure on the space of all pmfs
of X is not a metric, since it is not symmetric, i.e., I(pkr) 6= I(rkp), and it does not
satisfy the triangle inequality.
KL-entropy is an important concept in information theory, since other information-
theoretic quantities including entropy and mutual information may be formulated as
special cases. For continuous distributions in particular, it overcomes the difficulties
with continuous version of entropy (known as differential entropy); its definition in
nondiscrete cases is a natural extension of the discrete case. These aspects constitute
the major discussion of Chapter 3 of this thesis.
Among the properties of KL-entropy, the property that I(pkr) ≥ 0 and I(pkr) = 0
if and only if p = r is fundamental in the theory of information measures, and is known
as the Gibbs inequality or divergence inequality (Cover & Thomas, 1991, pp. 26). This
property follows from Jensen’s inequality.
$I(p\|r)$ is a convex function of both p and r. Further, it is convex in the pair (p, r), i.e., if $(p_1, r_1)$ and $(p_2, r_2)$ are two pairs of pmfs, then (Cover & Thomas, 1991, pp. 30)
$$I\bigl(\lambda p_1 + (1-\lambda)p_2 \,\big\|\, \lambda r_1 + (1-\lambda) r_2\bigr) \leq \lambda\, I(p_1\|r_1) + (1-\lambda)\, I(p_2\|r_2), \qquad 0 \leq \lambda \leq 1.$$
Similar to Shannon entropy, KL-entropy is additive too in the following sense. Let
X1 , X2 ∈ X and Y1 , Y2 ∈ Y be such that X1 and Y1 are independent, and X2 and Y2
are independent, respectively, then
One has to note that the above relation between KL and Shannon entropies differs in
the nondiscrete cases, which we discuss in detail in Chapter 3.
S = K ln W , (2.13)
One can give a more general definition of the Hartley information measure described above as follows. Define a function $H : \{x_1, \ldots, x_n\} \to \mathbb{R}$ on the values taken by the r.v. $X \in \mathcal{X}$ with corresponding pmf $p = (p_1, \ldots, p_n)$ as (Aczél & Daróczy, 1975)
$$H(x_k) = \ln\frac{1}{p_k}, \qquad \forall k = 1, \ldots, n. \quad (2.14)$$
H is also known as information content or entropy of a single event (Aczél & Daróczy,
1975) and plays an important role in all classical measures of information. It can be
interpreted either as a measure of how unexpected the given event is, or as measure
of the information yielded by the event; and it has been called surprise by Watanabe
(1969), and unexpectedness by Barlow (1990).
The Hartley function satisfies: (i) H is nonnegative: $H(x_k) \geq 0$; (ii) H is additive: $H(x_i, x_j) = H(x_i) + H(x_j)$, where $H(x_i, x_j) = \ln\frac{1}{p_i p_j}$; (iii) H is normalized: $H(x_k) = 1$ whenever $p_k = \frac{1}{e}$ (in the case of logarithm with base 2, the same is satisfied for $p_k = \frac{1}{2}$). These properties are both necessary and sufficient (Aczél & Daróczy, 1975, Theorem 0.2.5).
Now, Shannon entropy (2.1) can be written as the expectation of the Hartley function as
$$S(X) = \langle H \rangle = \sum_{k=1}^{n} p_k\, H_k, \quad (2.15)$$
2.2.2 Kolmogorov-Nagumo Averages or Quasilinear Means
and x0 is greater than some and less than others of the x k unless all xk are zero.
The implication of Theorem 2.1 is that the mean $\langle\,\cdot\,\rangle_\psi$ is determined when the function ψ is given. One may ask whether the converse is true: if $\langle X \rangle_{\psi_1} = \langle X \rangle_{\psi_2}$ for all $X \in \mathcal{X}$, is $\psi_1$ necessarily the same function as $\psi_2$? Before answering this question, we shall give the following definition.
DEFINITION 2.1 Continuous and strictly monotone functions $\psi_1$ and $\psi_2$ are said to be KN-equivalent if $\langle X \rangle_{\psi_1} = \langle X \rangle_{\psi_2}$ for all $X \in \mathcal{X}$.
³Kolmogorov (1930) and Nagumo (1930) first characterized the quasilinear mean for a vector $(x_1, \ldots, x_n)$ as $\langle x \rangle_\psi = \psi^{-1}\!\left(\sum_{k=1}^{n} \frac{1}{n}\,\psi(x_k)\right)$, where ψ is a continuous and strictly monotone function.
de Finetti (1931) extended their result to the case of simple (finite) probability distributions. The version
of the quasilinear mean representation theorem referred to in § 2.5 is due to Hardy et al. (1934), which
followed closely the approach of de Finetti. Aczél (1948) proved a characterization of the quasilinear
mean using functional equations. Ben-Tal (1977) showed that quasilinear means are ordinary arithmetic
means under suitably defined addition and scalar multiplication operations. Norries (1976) did a survey
of quasilinear means and its more restrictive forms in Statistics, and a more recent survey of general-
ized means can be found in (Ostasiewicz & Ostasiewicz, 2000). Applications of quasilinear means can
be found in economics (e.g., Epstein & Zin, 1989) and decision theory (e.g., Kreps & Porteus, 1978)).
Recently Czachor and Naudts (2002) studied generalized thermostatistics based on quasilinear means.
Note that when we compare two means, it is to be understood that the underlying probabilities are the same. Now, the following theorem characterizes KN-equivalent functions.
THEOREM 2.2 In order that two continuous and strictly monotone functions $\psi_1$ and $\psi_2$ be KN-equivalent, it is necessary and sufficient that
$$\psi_1 = \alpha\,\psi_2 + \beta, \quad (2.18)$$
THEOREM 2.3 Let ψ be a KN-function and c be a real constant; then $\langle X + c \rangle_\psi = \langle X \rangle_\psi + c$, i.e.,
$$\psi^{-1}\!\left(\sum_{k=1}^{n} p_k\,\psi(x_k + c)\right) = \psi^{-1}\!\left(\sum_{k=1}^{n} p_k\,\psi(x_k)\right) + c.$$
Proofs of Theorems 2.1, 2.2 and 2.3 can be found in the book on inequalities
by Hardy et al. (1934).
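The following sketch (ours) computes a quasilinear mean for the exponential function $\psi(x) = e^{(1-\alpha)x}$, one of the functions for which the translation property of Theorem 2.3 holds, and verifies that property numerically:

    import numpy as np

    def quasilinear_mean(x, p, psi, psi_inv):
        """Kolmogorov-Nagumo average <x>_psi = psi^{-1}( sum_k p_k psi(x_k) )."""
        x, p = np.asarray(x, float), np.asarray(p, float)
        return psi_inv(np.sum(p * psi(x)))

    alpha = 0.5
    psi = lambda t: np.exp((1 - alpha) * t)
    psi_inv = lambda t: np.log(t) / (1 - alpha)

    x = np.array([1.0, 2.0, 5.0])
    p = np.array([0.2, 0.5, 0.3])
    c = 3.0

    m = quasilinear_mean(x, p, psi, psi_inv)
    m_shifted = quasilinear_mean(x + c, p, psi, psi_inv)
    print(m, m_shifted, np.isclose(m_shifted, m + c))   # <x + c>_psi = <x>_psi + c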
Rényi (1960) employed these generalized averages in the definition of Shannon
entropy to generalize the same.
In the definition of Shannon entropy (2.15), if the standard mean of the Hartley function H is replaced with the quasilinear mean (2.16), one can obtain a generalized measure of information of the r.v. X with respect to a KN-function ψ as
$$S_\psi(X) = \psi^{-1}\!\left(\sum_{k=1}^{n} p_k\,\psi\!\left(\ln\frac{1}{p_k}\right)\right) = \psi^{-1}\!\left(\sum_{k=1}^{n} p_k\,\psi(H_k)\right), \quad (2.19)$$
property (postulate)? The answer is that insisting on additivity allows, by Theorem 2.3, only two classes of ψ's – linear and exponential functions. We formulate these arguments formally as follows.
If we impose the constraint of additivity on $S_\psi$, i.e., for any $X, Y \in \mathcal{X}$,
$$S_\psi(X \times Y) = S_\psi(X) + S_\psi(Y), \quad (2.20)$$
then the only admissible choices of ψ are linear and exponential functions; the exponential choice yields the Rényi entropy
$$S_\alpha(p) = \frac{1}{1-\alpha}\,\ln\sum_{k=1}^{n} p_k^\alpha, \quad (2.22)$$
where the KN-function ψ is chosen in (2.19) as $\psi(x) = e^{(1-\alpha)x}$, whose choice is motivated by Theorem 2.3. If we choose ψ as a linear function in the quasilinear entropy (2.19), what we get is Shannon entropy. The right side of (2.22) makes sense⁴ as a measure of information whenever $\alpha \neq 1$ and $\alpha > 0$, cf. (Rényi, 1960).
Rényi entropy is a one-parameter generalization of Shannon entropy in the sense
that Sα (p) → S(p) as α → 1. Hence, Rényi entropy is referred to as entropy of order
α, whereas Shannon entropy is referred to as entropy of order 1. The Rényi entropy
can also be seen as an interpolation formula connecting the Shannon (α = 1) and
Hartley (α = 0) entropies.
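The following sketch (ours) makes Rényi's recipe concrete: taking the quasilinear mean (2.19) of the Hartley information with $\psi(x) = e^{(1-\alpha)x}$ reproduces the Rényi entropy, while a linear ψ gives back Shannon entropy:

    import numpy as np

    def quasilinear_entropy(p, psi, psi_inv):
        """S_psi(X) = psi^{-1}( sum_k p_k psi(ln(1/p_k)) ), cf. (2.19)."""
        p = np.asarray(p, float)
        H = np.log(1.0 / p)                 # Hartley information of single events
        return psi_inv(np.sum(p * psi(H)))

    p = np.array([0.5, 0.2, 0.2, 0.1])
    alpha = 2.0

    # exponential KN-function psi(x) = exp((1-alpha) x)
    S_psi = quasilinear_entropy(p, lambda x: np.exp((1 - alpha) * x),
                                   lambda y: np.log(y) / (1 - alpha))
    S_renyi = np.log(np.sum(p ** alpha)) / (1 - alpha)
    print(np.isclose(S_psi, S_renyi))       # True: Renyi entropy of order alpha

    # linear psi recovers Shannon entropy
    S_lin = quasilinear_entropy(p, lambda x: x, lambda y: y)
    print(np.isclose(S_lin, -np.sum(p * np.log(p))))   # True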
Among the basic properties of Rényi entropy, $S_\alpha$ is positive. This follows from Jensen's inequality, which gives $\sum_{k=1}^{n} p_k^\alpha \leq 1$ in the case $\alpha > 1$, while in the case $0 < \alpha < 1$ it gives $\sum_{k=1}^{n} p_k^\alpha \geq 1$; in both cases we have $S_\alpha(p) \geq 0$.
Sα is strictly concave with respect to p for 0 < α ≤ 1. For α > 1, Rényi
entropy is neither pure convex nor pure concave. This is a simple consequence of
the fact that both ln x and xα (α < 1) are concave functions, while x α is convex for
α > 1 (see (Ben-Bassat & Raviv, 1978) for proofs and a detailed discussion).
⁴For negative α, however, $S_\alpha(p)$ has disadvantageous properties; namely, it tends to infinity if any $p_k$ tends to 0. This means that it is too sensitive to small probabilities. (This property can also be formulated in the following way: if we add a new event of probability 0 to a probability distribution, which does not change the distribution, $S_\alpha(p)$ becomes infinite.) The case α = 0 must also be excluded because it yields an expression not depending on the probability distribution $p = (p_1, \ldots, p_n)$.
A notable property of Sα (p) is that it is a monotonically decreasing function of α
for any pmf p. This can be verified as follows. I We can calculate the derivative of
$S_\alpha(p)$ with respect to α as
$$\frac{dS_\alpha(p)}{d\alpha} = \frac{1}{1-\alpha}\sum_{k=1}^{n}\frac{p_k^\alpha}{\sum_{j=1}^{n} p_j^\alpha}\,\ln p_k + \frac{1}{(1-\alpha)^2}\,\ln\sum_{k=1}^{n} p_k^\alpha = \frac{1}{(1-\alpha)^2}\left\{\sum_{k=1}^{n}\frac{p_k^\alpha}{\sum_{j=1}^{n} p_j^\alpha}\,\ln p_k^{1-\alpha} - \ln\sum_{k=1}^{n}\frac{p_k^\alpha}{\sum_{j=1}^{n} p_j^\alpha}\,p_k^{1-\alpha}\right\}. \quad (2.23)$$
One should note here that the vector of positive real numbers $\left(\frac{p_1^\alpha}{\sum_{j=1}^{n} p_j^\alpha}, \ldots, \frac{p_n^\alpha}{\sum_{j=1}^{n} p_j^\alpha}\right)$ represents a pmf. (Indeed, distributions of this form are known as escort distributions (Abe, 2003) and play an important role in ME-prescriptions of Tsallis entropy. We discuss these aspects in Chapter 3.) Denoting the mean of a vector $x = (x_1, \ldots, x_n)$ with respect to this pmf, i.e., the escort distribution of p, by $\langle\langle x \rangle\rangle_\alpha$, we can write (2.23) in an elegant form, which further gives the result
$$\frac{dS_\alpha(p)}{d\alpha} = \frac{1}{(1-\alpha)^2}\left[\langle\langle \ln p^{1-\alpha} \rangle\rangle_\alpha - \ln\langle\langle p^{1-\alpha} \rangle\rangle_\alpha\right] \leq 0. \quad (2.24)$$
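A numerical illustration (ours) of the escort distribution and of the monotonic decrease of $S_\alpha(p)$ in α implied by (2.24):

    import numpy as np

    def escort(p, alpha):
        """Escort distribution: p_k^alpha / sum_j p_j^alpha."""
        w = np.asarray(p, float) ** alpha
        return w / w.sum()

    def renyi_entropy(p, alpha):
        return np.log(np.sum(np.asarray(p, float) ** alpha)) / (1.0 - alpha)

    p = np.array([0.5, 0.3, 0.15, 0.05])
    print(escort(p, 2.0), escort(p, 2.0).sum())        # a pmf again

    alphas = [0.25, 0.5, 0.75, 1.5, 2.0, 3.0]
    values = [renyi_entropy(p, a) for a in alphas]
    print(values)
    print(all(values[i] >= values[i + 1] for i in range(len(values) - 1)))  # decreasing in alpha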
Like Shannon entropy, Rényi entropy is additive: for independent X and Y,
$$S_\alpha(X \times Y) = S_\alpha(X) + S_\alpha(Y). \quad (2.26)$$
Similar to the Shannon entropy function (2.2), one can define the entropy function in the case of Rényi as
$$s_\alpha(p) = \frac{1}{1-\alpha}\,\ln\left[p^\alpha + (1-p)^\alpha\right], \qquad p \in [0,1], \quad (2.27)$$
which is the Rényi entropy of a Bernoulli random variable.
Figure 2.1 shows the plot of Shannon entropy function (2.2) compared to Rényi
entropy function (2.27) for various values of entropic index α.
Figure 2.1: The Shannon entropy function s(p) compared with the Rényi entropy function $s_\alpha(p)$ for $\alpha = 0.8, 1.2, 1.5$.
Rényi entropy does have a reasonable operational significance even if not one com-
parable with that of Shannon entropy cf. (Csiszár, 1974). As regards the axiomatic ap-
proach, Rényi (1961) did suggest a set of postulates characterizing his entropies, but it involved the rather artificial procedure of considering incomplete pdfs ($\sum_{k=1}^{n} p_k \leq 1$) as well. This shortcoming has been eliminated by Daróczy (1970). Recently, a slightly
different set of axioms is given by (Jizba & Arimitsu, 2004b).
Despite its formal origin, Rényi entropy proved important in a variety of practical
applications in coding theory (Campbell, 1965; Aczél & Daróczy, 1975; Lavenda,
1998), statistical inference (Arimitsu & Arimitsu, 2000, 2001), quantum mechan-
ics (Maassen & Uffink, 1988), chaotic dynamics systems (Halsey, Jensen, Kadanoff,
Procaccia, & Shraiman, 1986) etc. Rényi entropy is also used in neural networks
(Kamimura, 1998). Thermodynamic properties of systems with multi-fractal struc-
tures have been studied by extending the notion of Gibbs-Shannon entropy into a more
general framework - Rényi entropy (Jizba & Arimitsu, 2004a).
Entropy of order 2, i.e., Rényi entropy for α = 2,
$$S_2(p) = -\ln\sum_{k=1}^{n} p_k^2, \quad (2.28)$$
for pmfs p and r. Properties of this generalized relative-entropy can be found in (Rényi,
1970, Chapter 9).
We conclude this section with the note that though it is considered that the first
formal generalized measure of information is due to Rényi, the idea of considering
some generalized measure did not start with Rényi. Bhattacharyya (1943, 1946) and
Jeffreys (1948) dealt with the quantity
$$I_{1/2}(p\|r) = -2\ln\sum_{k=1}^{n}\sqrt{p_k\, r_k} = I_{1/2}(r\|p), \quad (2.30)$$
expression an information quantity. In this respect, Rényi’s review paper (Rényi, 1965)
is particularly instructive.
Now we discuss the important, non-additive generalization of Shannon entropy.
It follows that
$$p_k = \left(\frac{\lambda(1-q)}{q}\right)^{\frac{1}{q-1}},$$
which is clearly negative definite for q > 0 (positive definite for q < 0). J One can
recall that Rényi entropy (2.22) is concave only for 0 < α < 1.
Also, one can prove that for two pmfs p and r, and for a real number $0 \leq \lambda \leq 1$, we have, for $q > 0$,
$$S_q(\lambda p + (1-\lambda) r) \geq \lambda\, S_q(p) + (1-\lambda)\, S_q(r),$$
which results from Jensen's inequality and the concavity of $\frac{x^q}{1-q}$.
What separates Tsallis entropy from Shannon and Rényi entropies is that it is not additive. The entropic index q in (2.31) characterizes the degree of nonextensivity reflected in the pseudo-additivity property: for independent X and Y,
$$S_q(X \times Y) = S_q(X) + S_q(Y) + (1-q)\, S_q(X)\, S_q(Y).$$
Similar to the Shannon entropy function (2.2), one can define the Tsallis entropy function
$$s_q(p) = \frac{1}{q-1}\left(1 - p^q - (1-p)^q\right), \qquad p \in [0,1]. \quad (2.35)$$
Figure 2.2 shows the plots of the Shannon entropy function (2.2) and the Tsallis entropy function (2.35) for various values of the entropic index q.

Figure 2.2: The Shannon entropy function s(p) compared with the Tsallis entropy function $s_q(p)$ for $q = 0.8$ and $q = 1.2$.
It is worth mentioning here that the derivation of Tsallis entropy using the Lorentz
addition by Amblard and Vignat (2005) gives insights into the boundedness of Tsallis
entropy. In this thesis we will not go into these details.
The first set of axioms for Tsallis entropy is given by dos Santos (1997), which
were later improved by Abe (2000). The most concise set of axioms are given by Su-
yari (2004a), which are known as Generalized Shannon-Khinchin axioms. A simpli-
34
0.9
0.8
0.7 q=0.8
0.6 q=1.2
0.4
0.3
0.2
Shannon
Tsallis
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
p
fied proof of this uniqueness theorem for Tsallis entropy is given by (Furuichi, 2005).
In these axioms, Shannon additivity (2.8) is generalized to
n
X
pi1 pimi
Sq (p11 , . . . , pnmn ) = Sq (p1 , . . . , pn ) + pqi Sq ,..., , (2.36)
pi pi
i=1
under the same conditions (2.7); remaining axioms are the same as in Shannon-Khinchin
axioms.
Now we turn our attention to the nonextensive generalization of relative-entropy.
The definition of Kullback-Leibler relative-entropy (2.9) and the nonextensive entropic
functional (2.31) naturally lead to the generalization (Tsallis, 1998)
q−1
pk
n
X −1
rk
Iq (pkr) = pk , (2.37)
q−1
k=1
Iq (pkr) ≥ 0 if q > 0
= 0 if q = 0
≤ 0 if q < 0 . (2.38)
35
For q 6= 0, the equalities hold if and only if p = r. (2.38) can be verified as follows.
1−x1−q
I Consider the function f (x) = 1−q . We have f 00 (x) > 0 for q > 0 and hence it
is convex. By Jensen’s inequality we obtain
1−q
r !1−q
X 1 − pkk
n
1
n
X rk
Iq (pkr) = pk ≥ 1− pk
1−q 1−q pk
k=1 k=1
=0 . (2.39)
For q < 0 we have f 00 (x) < 0 and hence we have the reverse inequality by Jensen’s
inequality for concave functions. J
Further, for q > 0, Iq (pkr) is a convex function of p and r, and for q < 0 it
is concave, which can be proved using Jensen’s inequality cf. (Borland, Plastino, &
Tsallis, 1998).
Tsallis relative-entropy satisfies the pseudo-additivity property of the form (Fu-
ruichi et al., 2004)
The mathematical basis for Tsallis statistics comes from the q-deformed expressions
for the logarithm (q-logarithm) and the exponential function (q-exponential) which
were first defined in (Tsallis, 1994), in the context of nonextensive thermostatistics.
The q-logarithm is defined as
x1−q − 1
lnq x = (x > 0, q ∈ R) , (2.41)
1−q
36
and the q-exponential is defined as
( 1
x
eq = [1 + (1 − q)x] 1−q if 1 + (1 − q)x ≥ 0 (2.42)
0 otherwise.
We have limq→1 lnq x = ln x and limq→1 exq = ex . These two functions are related by
ln x
eq q = x . (2.43)
x ⊕q y = x + y + (1 − q)xy , (2.47)
37
and Tsallis relative-entropy (2.37) as
n
X rk
Iq (pkr) = − pk lnq . (2.50)
pk
k=1
These representations are very important for deriving many results related to nonex-
tensive generalizations as we are going to consider in the later chapters.
H e k ) = lnq 1 ,
e k = H(x k = 1, . . . n . (2.51)
pk
Now, Tsallis entropy (2.31) can be defined as the expectation of the q-Hartley function
e as5
H
D E
e .
Sq (X) = H (2.52)
e k = lnq 1 = φq (Hk ) ,
H
pk
where
e(1−q)x − 1
φq (x) = = lnq (ex ) . (2.53)
1−q
Note that the function φq is KN-equivalent to e(1−q)x (by Theorem 2.2), the KN-
function used in Rényi entropy. Hence Tsallis entropy is related to Rényi entropies
as
38
0.9
Renyi
q<1 Tsallis
0.8
0.7
0.6
0.4
R
0.3
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.7
Renyi
Tsallis
q>1
0.6
0.5
Sq (p) & Sq (p)
T
0.4
0.3
R
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
where SqT and SqR denote the Tsallis and Rényi entropies respectively with a real num-
ber q as a parameter. (2.54) implies that Tsallis and Rényi entropies are monotonic
functions of each other and, as a result, both must be maximized by the same probabil-
ity distribution. In this thesis, we consider only ME-prescriptions related to nonexten-
sive entropies. Discussion on ME of Rényi entropy can be found in (Bashkirov, 2004;
Johnson & Vignat, 2005; Costa, Hero, & Vignat, 2002).
Comparisons of Rényi entropy function (2.27) with Tsallis entropy function (2.35)
are shown graphically in Figure 2.3 for two cases of entropic index, corresponding to
0 < q < 1 and q > 1 respectively. Now a natural question that arises is whether
one could generalize Tsallis entropy using Rényi’s recipe, i.e. by replacing the linear
average in (2.52) by KN-averages and imposing the condition of pseudo-additivity. It
is equivalent to determining the KN-function ψ for which the so called q-quasilinear
39
entropy defined as
" n #
D E X
Seψ (X) = H
e = ψ −1 ek
pk ψ H , (2.55)
ψ
k=1
e k = H(x
where H e k ), ∀k = 1, . . . n, satisfies the pseudo-additivity property.
T HEOREM 2.4 Let X, Y ∈ X be two independent random variables. Let ψ be any KN-function.
Then
hX ⊕q Y iψ = hXiψ ⊕q hY iψ (2.56)
Proof Let p and r be the p.m.fs of random variables X, Y ∈ X respectively. The proof of
sufficiency is simple and follows from
n X
X n
hX ⊕q Y iψ = hX ⊕q Y i = pi rj (xi ⊕q yj )
i=1 j=1
Xn X n
= pi rj (xi + yj + (1 − q)xi yj )
i=1 j=1
Xn n
X n
X n
X
= pi xi + rj yj + (1 − q) pi xi rj yj .
i=1 j=1 i=1 j=1
Since (2.57) must hold for arbitrary p.m.fs p, r and for arbitrary numbers x 1 , . . . , xn
and y1 , . . . , yn , one can choose yj = c for all j. Then (2.57) yields
n
! n
!
X X
ψ −1 pk ψ (xi ⊕q c) = ψ −1 pk ψ (xi ) ⊕q c . (2.58)
i=1 i=1
40
That is, ψ should satisfy
and
for any X ∈ X and any constants d, c. From Theorem 2.3, condition (2.61) is satisfied
only when ψ is linear or exponential.
To complete the theorem, we have to show that KN-averages do not satisfy condi-
tion (2.62) when ψ is exponential. For a particular choice of ψ(x) = e (1−α)x , assume
that
where
n
!
1 X
hdXiψ1 = ln pk e(1−α)dxk ,
1−α
k=1
and
n
!
d X
dhXiψ1 = ln pk e(1−α)xk .
1−α
k=1
hXiψ = hXiψ0 ,
41
One can observe that the above proof avoids solving functional equations as in the
case of the proof of Theorem 2.3 (e.g., Aczél & Daróczy, 1975). Instead, it makes
use of Theorem 2.3 itself and other basic properties of KN-averages. The following
corollary is an immediate consequence of Theorem 2.4.
C OROLLARY 2.1 The q-quasilinear entropy Seψ (defined as in (2.55)) with respect to a KN-function
ψ satisfies pseudo-additivity if and only if Seψ is Tsallis entropy.
Proof Let X, Y ∈ X be two independent random variables and let p, r be their corresponding
pmfs. By the pseudo-additivity constraint, ψ should satisfy
Equivalently, we need
Xn X n
ψ −1 pi rj ψ He p ⊕q H
er
i j
i=1 j=1
!
n
X n
X
= ψ −1 ep
pi ψ H ⊕q ψ −1 e jr ,
rj ψ H
i
i=1 j=1
e p and H
where H e r represent the q-Hartley functions corresponding to probability dis-
tributions p and r respectively. That is, ψ should satisfy
e p ⊕q H
hH e r i = hH
e p i ⊕q hH
e ri .
ψ ψ ψ
42
Hartley Information q−Hartley Information
KN−average KN−average
additivity pseudo−additivity
Shannon Entropy ’
Renyi Entropy Tsallis Entropy
Figure 2.4: Rényi’s Recipe for Additive and Pseudo-additive Information Measures
43
T HEOREM 2.5 Let FI be the set of all cumulative distribution functions defined on some interval
I of the real line R. A functional κ : F I → R satisfies the following axioms:
Proof of the above characterization can be found in (Hardy et al., loc. cit.). Mod-
ified axioms for the quasilinear mean can be found in (Chew, 1983; Fishburn, 1986;
Ostasiewicz & Ostasiewicz, 2000). Using this characterization of the quasilinear mean,
Rényi gave the following characterization for additive information measures.
The proof of above theorem is straight forward by using Theorem (2.3); for details
see (Rényi, 1960).
Now we give the following characterization theorem for nonextensive entropies.
44
T HEOREM 2.7 Let X ∈ X be a random variable. An information measure defined as a (general-
ized) mean κ of q-Hartley function of X is Tsallis entropy if and only if
The above theorem is a direct consequence of Theorems 2.4 and 2.5. This charac-
terization of Tsallis entropy only replaces the additivity constraint in the characteriza-
tion of Shannon entropy given by Rényi (1960) with pseudo-additivity, which further
does not make use of the postulate κ(X) + κ(−X) = 0. (This postulate is needed
to distinguish Shannon entropy from Rényi entropy). This is possible because Tsallis
entropy is unique by means of KN-averages and under pseudo-additivity.
From the relation between Rényi and Tsallis information measures (2.54), pos-
sibly, generalized averages play a role – though not very well understood till now –
in describing the operational significance of Tsallis entropy. Here, one should men-
tion the work of Czachor and Naudts (2002), who studied the KN-average based ME-
prescriptions of generalized information measures (constraints with respect to which
one would maximize entropy are defined in terms of quasilinear means). In this regard,
results presented in this chapter have mathematical significance in the sense that they
further the relation between nonextensive entropic measures and generalized averages.
45
3 Measures and Entropies:
Gelfand-Yaglom-Perez Theorem
Abstract
R
The measure-theoretic KL-entropy defined as X ln dP dR dP , where P and R are
probability measures on a measurable space (X, M), plays a basic role in the defi-
nitions of classical information measures. A fundamental theorem in this respect is
the Gelfand-Yaglom-Perez Theorem (Pinsker, 1960b, Theorem 2.4.2) which equips
measure-theoretic KL-entropy with a fundamental definition and can be stated as,
Z X m
dP P (Ek )
ln dP = sup P (Ek ) ln ,
X dR R(Ek )
k=1
where supremum is taken over all the measurable partitions {Ek }m k=1 . In this chap-
ter, we state and prove the GYP-theorem for Rényi relative-entropy of order greater
than one. Consequently, the result can be easily extended to Tsallis relative-entropy.
Prior to this, we develop measure-theoretic definitions of generalized information
measures and discuss the maximum entropy prescriptions. Some of the results pre-
sented in this chapter can also be found in (Dukkipati, Bhatnagar, & Murty, 2006b,
2006a).
Shannon’s measure of information was developed essentially for the case when the
random variable takes a finite number of values. However in the literature, one often
encounters an extension of Shannon entropy in the discrete case (2.1) to the case of a
one-dimensional random variable with density function p in the form (e.g., Shannon
& Weaver, 1949; Ash, 1965)
Z +∞
S(p) = − p(x) ln p(x) dx .
−∞
46
Inspite of these short comings, one can still use the continuous entropy functional
in conjunction with the principle of maximum entropy where one wants to find a proba-
bility density function that has greater uncertainty than any other distribution satisfying
a set of given constraints. Thus, one is interested in the use of continuous measure as
a measure of relative and not absolute uncertainty. This is where one can relate maxi-
mization of Shannon entropy to the minimization of Kullback-Leibler relative-entropy
cf. (Kapur & Kesavan, 1997, pp. 55). On the other hand, it is well known that the
continuous version of KL-entropy defined for two probability density functions p and
r,
Z +∞
p(x)
I(pkr) = p(x) ln dx ,
−∞ r(x)
One can see from the above definition that the concept of “the entropy of a pdf” is a
misnomer as there is always another measure µ in the background. In the discrete case
considered by Shannon, µ is the cardinality measure 1 (Shannon & Weaver, 1949, pp.
19); in the continuous case considered by both Shannon and Wiener, µ is the Lebesgue
measure cf. (Shannon & Weaver, 1949, pp. 54) and (Wiener, 1948, pp. 61, 62). All
entropies are defined with respect to some measure µ, as Shannon and Wiener both
emphasized in (Shannon & Weaver, 1949, pp.57, 58) and (Wiener, 1948, pp.61, 62)
respectively.
This case was studied independently by Kallianpur (1960) and Pinsker (1960b),
and perhaps others were guided by the earlier work of Kullback and Leibler (1951),
where one would define entropy in terms of Kullback-Leibler relative-entropy. In
this respect, the Gelfand-Yaglom-Perez theorem (GYP-theorem) (Gelfand & Yaglom,
1959; Perez, 1959; Dobrushin, 1959) plays an important role as it equips measure-
theoretic KL-entropy with a fundamental definition. The main contribution of this
chapter is to prove GYP-theorem for Rényi relative-entropy of order α > 1, which can
be extended to Tsallis relative-entropy.
1
Counting or cardinality measure µ on a measurable space (X, ), where X is a finite set and
= 2X , is defined as µ(E) = #E, ∀E ∈ .
47
Before proving GYP-theorem for Rényi relative-entropy, we study the measure-
theoretic definitions of generalized information measures in detail, and discuss the
corresponding ME-prescriptions. We show that as in the case of relative-entropy,
the measure-theoretic definitions of generalized relative-entropies, Rényi and Tsallis,
are natural extensions of their respective discrete definition. We also show that ME-
prescriptions of measure-theoretic Tsallis entropy are consistent with that of discrete
case, which is true for measure-theoretic Shannon-entropy.
We review the measure-theoretic formalisms for classical information measures in
§ 3.1 and extend these definitions to generalized information measures in § 3.2. In
§ 3.3 we present the ME-prescription for Shannon entropy followed by prescriptions
for Tsallis entropy in § 3.4. We revisit measure-theoretic definitions of generalized
entropy functionals in § 3.5 and present some results. Finally, Gelfand-Yaglom-Perez
theorem in the general case is presented in § 3.6.
48
satisfies
Z b
p(x) ≥ 0, ∀x ∈ [a, b] and p(x) dx = 1 .
a
In trying to define entropy in the continuous case, the expression of Shannon entropy
in the discrete case (2.1) was automatically extended to continuous case by replacing
the sum in the discrete case with the corresponding integral. We obtain, in this way,
Boltzmann’s H-function (also known as differential entropy in information theory),
Z b
S(p) = − p(x) ln p(x) dx . (3.1)
a
1
p(x) = , x ∈ [a, b] .
b−a
S(p) = ln(b − a) .
On the other hand, let us consider a finite partition of the interval [a, b] which is com-
posed of n equal subintervals, and let us attach to this partition the finite discrete
uniform probability distribution whose corresponding entropy will be, of course,
Sn (p) = ln n .
Obviously, if n tends to infinity, the discrete entropy S n (p) will tend to infinity too,
and not to ln(b − a); therefore S(p) is not the limit of S n (p), when n tends to infinity.
J Further, one can observe that ln(b − a) is negative when b − a < 1.
Thus, strictly speaking, continuous entropy (3.1) cannot represent a measure of
uncertainty since uncertainty should in general be positive. We are able to prove the
“nice” properties only for the discrete entropy, therefore, it qualifies as a “good” mea-
sure of information (or uncertainty) supplied by a random experiment 2 . We cannot
2
One importent property that Shannon entropy exhibits in the continuous case is the entropy power
inequality, which can be stated as follows. Let X and Y are continuous independent random variables
with entropies S(X) and S(Y ) then we have e2S(X+Y ) ≥ e2S(X) + e2S(Y ) with equality if and only if
X and Y are Gaussian variables or one of them is determenistic. The entropy power inequality is derived
by Shannon (1948). Only few and partial versions of it have been proved in the discrete case.
49
extend the so called nice properties to the “continuous entropy” because it is not the
limit of a suitably defined sequence of discrete entropies.
Also, in physical applications, the coordinate x in (3.1) represents an abscissa, a
distance from a fixed reference point. This distance x has the dimensions of length.
Since the density function p(x) specifies the probabilities of an event of type [c, d) ⊂
Rd
[a, b] as c p(x) dx and probabilities are dimensionless, one has to assign the dimen-
sions (length)−1 to p(x). Now for 0 ≤ z < 1, one has the series expansion
1 1
− ln(1 − z) = z + z 2 + z 3 + . . . . (3.2)
2 3
It is thus necessary that the argument of the logarithmic function in (3.1) be dimen-
sionless. Hence the formula (3.1) is then seen to be dimensionally incorrect, since the
argument of the logarithm on its right hand side has the dimensions of a probability
density (Smith, 2001). Although, Shannon (1948) used the formula (3.1), he did note
its lack of invariance with respect to changes in the coordinate system.
In the context of maximum entropy principle, Jaynes (1968) addressed this prob-
lem and suggested the formula,
Z b
0 p(x)
S (p) = − p(x) ln dx , (3.3)
a m(x)
in the place of (3.1), where m(x) is a prior function. Note that when m(x) is also a
probability density function, (3.3) is nothing but the relative-entropy. However, if we
choose m(x) = c, a constant (e.g., Zellner & Highfield, 1988), we get
S 0 (p) = S(p) + ln c ,
where S(p) refers to the continuous entropy (3.1). Thus, maximization of S 0 (p) is
equivalent to maximization of S(p). Further discussion on estimation of probability
density functions by maximum entropy method can be found in (Lazo & Rathie, 1978;
Zellner & Highfield, 1988; Ryu, 1993).
Prior to that, Kullback and Leibler (1951) too suggested that in the measure-
theoretic definition of entropy, instead of examining the entropy corresponding only
to the given measure, we have to compare the entropy inside a whole class of mea-
sures.
Let (X, M, µ) be a measure space, where µ need not be a probability measure unless
otherwise specified. Symbols P , R will denote probability measures on measurable
50
space (X, M) and p, r denote M-measurable functions on X. An M-measurable func-
R
tion p : X → R+ is said to be a probability density function (pdf) if X p(x) dµ(x) =
R
1 or X p dµ = 1 (henceforth, the argument x will be omitted in the integrals if this
does not cause ambiguity).
In this general setting, Shannon entropy S(p) of pdf p is defined as follows (Athreya,
1994).
D EFINITION 3.1 Let (X, M, µ) be a measure space and the M-measurable function p : X → R + be
a pdf. Then, Shannon entropy of p is defined as
Z
S(p) = − p ln p dµ , (3.4)
X
Entropy functional S(p) defined in (3.4) can be referred to as entropy of the prob-
ability measure P that is induced by p, that is defined according to
Z
P (E) = p(x) dµ(x) , ∀E ∈ M . (3.5)
E
This reference is consistent3 because the probability measure P can be identified a.e
by the pdf p.
Further, the definition of the probability measure P in (3.5), allows us to write
entropy functional (3.4) as,
Z
dP dP
S(p) = − ln dµ , (3.6)
X dµ dµ
51
D EFINITION 3.2 Let (X, M) be a measurable space. Let P and R be two probability measures on
(X, M). Kullback-Leibler relative-entropy KL-entropy of P relative to R is defined
as
Z
dP
ln dP if P R ,
X dR
I(P kR) = (3.7)
+∞ otherwise.
The divergence inequality I(P kR) ≥ 0 and I(P kR) = 0 if and only if P = R
can be shown in this case too. KL-entropy (3.7) also can be written as
Z
dP dP
I(P kR) = ln dR . (3.8)
X dR dR
Let the σ-finite measure µ on (X, M) be such that P R µ. Since µ is
σ-finite, from Radon-Nikodym theorem, there exist non-negative M-measurable func-
tions p : X → R+ and r : X → R+ unique µ-a.e, such that
Z
P (E) = p dµ , ∀E ∈ M , (3.9a)
E
and
Z
R(E) = r dµ , ∀E ∈ M . (3.9b)
E
The pdfs p and r in (3.9a) and (3.9b) (they are indeed pdfs) are Radon-Nikodym deriva-
dP
tives of probability measures P and R with respect to µ, respectively, i.e., p = dµ and
dR
r= dµ . Now one can define relative-entropy of pdf p w.r.t r as follows 5 .
As we have mentioned earlier, KL-entropy (3.10) exists if the two densities are
absolutely continuous with respect to one another. On the real line, the same definition
can be written with respect to the Lebesgue measure
Z
p(x)
I(pkr) = p(x) ln dx ,
r(x)
5
This follows from the chain rule for Radon-Nikodym derivative:
−1
dP a.e dP dR
= .
dR dµ dµ
52
which exists if the densities p(x) and r(x) share the same support. Here, in the sequel
we use the convention
a
ln 0 = −∞, ln = +∞ forany a ∈ R, 0.(±∞) = 0. (3.11)
0
Now, we turn to the definition of entropy functional on a measure space. Entropy
functional in (3.6) is defined for a probability measure that is induced by a pdf. By
the Radon-Nikodym theorem, one can define Shannon entropy for any arbitrary µ-
continuous probability measure as follows.
D EFINITION 3.4 Let (X, M, µ) be a σ-finite measure space. Entropy of any µ-continuous probability
measure P (P µ) is defined as
Z
dP
S(P ) = − ln dP . (3.12)
X dµ
Note that when µ is not a probability measure, the divergence inequality I(P kµ) ≥ 0
need not be satisfied.
A note on the σ-finiteness of measure µ in the definition of entropy functional. In
the definition of entropy functional we assumed that µ is a σ-finite measure. This con-
dition was used by Ochs (1976), Csiszár (1969) and Rosenblatt-Roth (1964) to tailor
the measure-theoretic definitions. For all practical purposes and for most applications,
this assumption is satisfied (see (Ochs, 1976) for a discussion on the physical inter-
pretation of measurable space (X, M) with σ-finite measure µ for an entropy measure
of the form (3.12), and of the relaxation of the σ-finiteness condition). The more uni-
versal definitions of entropy functionals, by relaxing the σ-finiteness condition, are
studied by Masani (1992a, 1992b).
53
3.1.3 Interpretation of Discrete and Continuous Entropies in terms of KL-entropy
and
n
X
P : Pk = P ({xk }) ≥ 0, k = 1, . . . , n, Pk = 1 . (3.14b)
k=1
S(P ) = Sn (P ) − ln n , (3.15)
and
Z Z b
P : P (x) ≥ 0, x ∈ [a, b], 3 P (E) = P (x) dx, ∀E ∈ M, P (x) dx = 1 .
E a
(3.16b)
54
Note that the abuse of notation in the above specification of probability measures µ
and P , where we have used the same symbols for both measures and pdfs, is in order
to have the notation consistent with the discrete case analysis given above. The proba-
bility measure P is absolutely continuous with respect to the probability measure µ, if
µ(x) = 0 on a set of a positive Lebesgue measure implies that P (x) = 0 on the same
set. The Radon-Nikodym derivative of the probability measure P with respect to the
probability measure µ will be
dP P (x)
(x) = .
dµ µ(x)
We emphasize here that this relation can only be understood with the above (abuse
of) notation explained. Then the measure-theoretic entropy S(P ) in this case can be
written as
Z b
P (x)
S(P ) = − P (x) ln dx .
a µ(x)
1
If we take referential probability measure µ as a uniform distribution, i.e. µ(x) = b−a ,
x ∈ [a, b], then we obtain
where S[a,b] (P ) denotes the Shannon entropy (3.1) of pdf P (x) and S(P ) denotes the
measure-theoretic entropy (3.12) reduced to the continuous case, with the probability
measures µ and P specified as in (3.16a) and (3.16b) respectively.
Hence, one can conclude that measure theoretic entropy S(P ) defined for a proba-
bility measure P on the measure space (X, M, µ), is equal to both Shannon entropy in
the discrete and continuous case up to an additive constant, when the reference mea-
sure µ is chosen as a uniform probability distribution. On the other hand, one can see
that measure-theoretic KL-entropy, in the discrete and continuous cases corresponds
to its discrete and continuous definitions.
Further, from (3.13) and (3.15), we can write Shannon entropy in terms of Kullback-
Leibler relative-entropy as
Thus, Shannon entropy appears as being (up to an additive constant) the variation
of information when we pass from the initial uniform probability distribution to new
P
probability distribution given by P k ≥ 0, nk=1 Pk = 1, as any such probability
distribution is obviously absolutely continuous with respect to the uniform discrete
55
probability distribution. Similarly, from (3.13) and (3.17) the relation between Shan-
non entropy and relative-entropy in continuous case can be obtained, and we can write
Boltzmann H-function in terms of relative-entropy as
56
we give definitions of information measures for pdfs, we also use the corresponding
definitions of probability measures as well, wherever convenient or required – with
R
the understanding that P (E) = E p dµ, and the converse holding as a result of the
dP
Radon-Nikodym theorem, with p = dµ . In both the cases we have P µ.
With these notations we move on to the measure-theoretic definitions of general-
ized information measures. First we consider the Rényi generalizations. The measure-
theoretic definition of Rényi entropy is as follows.
The same can also be defined for any µ-continuous probability measure P as
Z α−1
1 dP
Sα (P ) = ln dP . (3.21)
1−α X dµ
D EFINITION 3.6 Let p, r : X → R+ be two pdfs on a measure space (X, M, µ). The Rényi relative-
entropy of p relative to r is defined as
Z
1 p(x)α
Iα (pkr) = ln α−1
dµ(x) , (3.22)
α−1 X r(x)
Sα (P ) = Iα (P kµ) . (3.24)
57
D EFINITION 3.7 Tsallis entropy of a pdf p on (X, M, µ) is defined as
Z R
1 1 − X p(x)q dµ(x)
Sq (p) = p(x) lnq dµ(x) = , (3.25)
X p(x) q−1
The q-logarithm lnq is defined as in (2.41). The same can be defined for µ-
continuous probability measure P , and can be written as
Z −1
dP
Sq (P ) = lnq dP . (3.26)
X dµ
D EFINITION 3.8 Let (X, M, µ) be a measure space. Let p, r : X → R + be two probability density
functions. The Tsallis relative-entropy of p relative to r is defined as
Z R p(x)q
r(x) X r(x)q−1 dµ(x) − 1
Iq (pkr) = − p(x) lnq dµ(x) = (3.27)
X p(x) q−1
Sq (P ) = Iq (P kµ) . (3.29)
58
R
with the normalizing constraint X dP = 1. (From now on we assume that any set of
constraints on probability distributions implicitly includes this constraint, which will
therefore not be mentioned in the sequel.)
To maximize the entropy (3.4) with respect to the constraints (3.30), the solution
is calculated via the Lagrangian:
Z Z
dP
L(x, λ, β) = − ln (x) dP (x) − λ dP (x) − 1
X dµ X
M
X Z
− βm um (x) dP (x) − hum i , (3.31)
m=1 X
59
3.4 ME-prescription for Tsallis Entropy
As we mentioned earlier, the great success of Tsallis entropy is attributed to the power-
law distributions that result from the ME-prescriptions of Tsallis entropy. But there are
subtleties involved in the choice of constraints one would choose for ME prescriptions
of these entropy functionals. The issue of what kind of constraints one should use in
the ME-prescriptions is still a part of the major discussion in the nonextensive formal-
ism (Ferri et al., 2005; Abe & Bagci, 2005; Wada & Scarfone, 2005).
In the nonextensive formalism, maximum entropy distributions are derived with
respect to the constraints that are different from (3.30), and are inadequate for han-
dling the serious mathematical difficulties that result for instance, those of unwanted
divergences etc. cf. (Tsallis, 1988). To handle these difficulties constraints of the form
Z
um (x)p(x)q dµ(x) = hum iq , m = 1, . . . , M (3.38)
X
are proposed by Curado and Tsallis (1991). The averages of the form hu m iq are re-
ferred to as q-expectations.
To calculate the maximum Tsallis entropy distribution with respect to the constraints
(3.38), the Lagrangian can be written as
Z Z
1
L(x, λ, β) = lnq dP (x) − λ dP (x) − 1
X p(x) X
XM Z
− βm p(x)q−1 um (x) dP (x) − hum iq . (3.39)
m=1 X
XM
1
lnq −λ− βm um (x)p(x)q−1 = 0 . (3.40)
p(x) m=1
60
R
The denominator in (3.41) can be calculated using the normalizing constraint X dP =
1. Finally, Tsallis maximum entropy distribution can be written as
h PM i 1
1−q
1 − (1 − q) m=1 βm um (x)
p(x) = , (3.42)
Zq
Z " M
X
# 1−q
1
Tsallis maximum entropy distribution (3.42) can be expressed in terms of the q-expectation
function (2.42) as
M
− m=1 βm um (x)
eq
p(x) = . (3.44)
Zq
Note that in order to guarantee that pdf p in (3.42) is non-negative real for any
x ∈ X, it is necessary to supplement it with an appropriate prescription for treating
h P i
negative values of the quantity 1 − (1 − q) M m=1 βm um (x) . That is, we need a
prescription for the value of p(x) when
" M
#
X
1 − (1 − q) βm um (x) < 0 . (3.45)
m=1
The simplest possible prescription, and the one usually adopted, is to set p(x) = 0
whenever the inequality (3.45) holds (Tsallis, 1988; Curado & Tsallis, 1991). This
rule is known as the Tsallis cut-off condition. Simple extensions of Tsallis cut-off
conditions are proposed in (Teweldeberhan et al., 2005) by defining an alternate q-
exponential function. In this thesis, we consider only the usual Tsallis cut-off con-
dition mentioned above. Note that by expressing Tsallis maximum entropy distribu-
tion (3.42) in terms of the q-exponential function, as in (3.44), we have assumed Tsallis
cut-off condition implicitly. In summary, when we refer to Tsallis maximum entropy
distribution we mean the following
M
− m=1 βm um (x) h PM i
eq
if 1 − (1 − q) β u
m=1 m m (x) >0
p(x) = Zq (3.46)
0 otherwise.
61
The corresponding thermodynamic equations are as follows (Curado & Tsallis, 1991):
∂
lnq Zq = −hum iq , m = 1, . . . M, (3.48)
∂βm
∂Sq
= βm , m = 1, . . . M. (3.49)
∂hum iq
Constraints of the form (3.38) had been used for some time in the nonextensive ME-
prescriptions, but because of problems in justifying it on physical grounds (for example
q-expectation of a constant need not be a constant and hence they are not expecta-
tions in the true sense) the constraints of the following form were proposed in (Tsallis,
Mendes, & Plastino, 1998)
R
um (x)p(x)q dµ(x)
XR
q
= hhum iiq , m = 1, . . . , M . (3.50)
X p(x) dµ(x)
Here hhum iiq can be considered as the expectation of u m with respect to the modified
probability measure P(q) (it is indeed a probability measure) defined as
Z −1 Z
q
P(q) (E) = p(x) dµ(x) p(x)q dµ(x) , ∀E ∈ M . (3.51)
X E
The modified probability measure P(q) is known as the escort probability measure (Tsal-
lis et al., 1998).
Now, the variational principle for Tsallis entropy maximization with respect to
constraints (3.50) can be written as
Z Z
1
L(x, λ, β) = lnq dP (x) − λ dP (x) − 1
X p(x) X
M
X Z
(q) q−1
− βm p(x) um (x) − hhum iiq dP (x) , (3.52)
m=1 X
62
(q)
where the parameters βm can be defined in terms of true Lagrange parameters β m as
(q) βm
βm =Z , m = 1, . . . , M. (3.53)
p(x)q dµ(x)
X
where
P
Z M
β m u m (x) − hhu m ii
m=1 q
Zq = expq − R
q
dµ(x) . (3.56)
X X p(x) dµ(x)
Sq = lnq Zq , (3.57)
∂Sq
= βm , m = 1, . . . M , (3.59)
∂hhum iiq
where
M
X
lnq Zq = lnq Zq − βm hhum iiq . (3.60)
m=1
63
limiting process cf. (Topsøe, 2001, Theorem 5.2). In this section, we show that this
fact is true for generalized relative-entropies too. Rényi relative-entropy on continuous
valued space R and its equivalence with the discrete case is studied by Rényi (1960),
Jizba and Arimitsu (2004b). Here, we present the result in the measure-theoretic case
and conclude that measure-theoretic definitions of both Tsallis and Rényi relative-
entropies are equivalent to their respective entities.
We also present a result pertaining to ME of measure-theoretic Tsallis entropy. We
prove that ME of Tsallis entropy in the measure-theoretic case is consistent with the
discrete case.
Here we show that generalized relative-entropies in the discrete case can be naturally
extended to measure-theoretic case, in the sense that measure-theoretic definitions can
be defined as limits of sequences of finite discrete entropies of pmfs which approximate
the pdfs involved. We refer to any such sequence of pmfs as “approximating sequence
of pmfs of a pdf”. To formalize these aspects we need the following lemma.
L EMMA 3.1 Let p be a pdf defined on measure space (X, M, µ). Then there exists a sequence
of simple functions {fn } (approximating sequence of simple functions of p) such
that limn→∞ fn = p and each fn can be written as
Z
1
fn (x) = p dµ , ∀x ∈ En,k , k = 1, . . . m(n) , (3.61)
µ(En,k ) En,k
m(n)
where {En,k }k=1 , is the measurable partition of X corresponding to f n (the nota-
tion m(n) indicates that m varies with n). Further each f n satisfies
Z
fn dµ = 1 . (3.62)
X
(3.63)
64
Each fn is indeed a simple function and can be written as
n2 n −1 Z ! Z
X 1 1
fn = p dµ χEn,k + p dµ χFn , (3.64)
µ(En,k ) En,k µ(Fn ) Fn
k=0
k k+1
where En,k = p−1 2n , 2n , k = 0, . . . , n2n −1 and Fn = p−1 ([n, ∞)). Also, for
any measurable set E ∈ M, χE : X → {0, 1} denotes its indicator or characteristic
function. Note that {En,0 , . . . , En,n2n −1 , Fn } is indeed a measurable partition of X,
R R
for any n. Since E p dµ < ∞ for any E ∈ M, we have En,k p dµ = 0 whenever
R
µ(En,k ) = 0, for k = 0, . . . n2n − 1. Similarly Fn p dµ = 0 whenever µ(Fn ) = 0.
Now we show that limn→∞ fn = p, point-wise.
Since p is a pdf, we have p(x) < ∞. Then ∃ n ∈ Z + 3 p(x) ≤ n. Also ∃ k ∈ Z+ ,
k k+1 k k+1
0 ≤ k ≤ n2n − 1 3 2n ≤ p(x) < 2n and 2n ≤ fn (x) < 2n . This implies
1
0 ≤ |p − fn | < 2n as required.
(Note that this lemmma holds true even if p is not a pdf. This follows from, if
p(x) = ∞, for some x ∈ X, then x ∈ Fn for all n, and therefore fn (x) ≥ n for all n;
hence limn→∞ fn (x) = ∞ = p(x).)
Finally, we have
Z n(m) " Z # Z
X 1 1
fn dµ = p dµ µ(En,k ) + p dµ µ(Fn )
X µ(En,k ) En,k µ(Fn ) Fn
k=1
n(m) Z Z
X
= p dµ + p dµ
k=1 En,k Fn
Z
= p dµ = 1 .
X
65
for any n. Note that in (3.65) the function f n χEn,k is a constant function by the con-
struction (Lemma 3.1) of fn . We have
m(n) m(n) Z Z
X X
p̃n,k = p dµ = p dµ = 1 , (3.66)
k=1 k=1 En,k X
and hence p̃n is indeed a pmf. We call {p̃n } as the approximating sequence of pmfs of
pdf p.
Now we present our main theorem, where we assume that p and r are bounded. The
assumption of boundedness of p and r simplifies the proof. However, the result can be
extended to an unbounded case. (See (Rényi, 1959) analysis of Shannon entropy and
relative-entropy on R in the unbounded case.)
T HEOREM 3.1 Let p and r be pdfs, which are bounded and defined on a measure space (X, M, µ).
Let p̃n and r̃n be approximating sequences of pmfs of p and r respectively. Let
Iα denote the Rényi relative-entropy as in (3.22) and I q denote the Tsallis relative-
entropy as in (3.27). Then
and
respectively.
Proof It is enough to prove the result for either Tsallis or Rényi since each one of them is a
monotone and continuous functions of the other. Hence we write down the proof for
the case of Rényi and we use the entropic index α in the proof.
Corresponding to pdf p, let {fn } be the approximating sequence of simple func-
tions such that limn→∞ fn = p as in Lemma 3.1. Let {gn } be the approximating se-
quence of simple functions for r such that lim n→∞ gn = r. Corresponding to simple
functions fn and gn there exists a common measurable partition 6 {En,1 , . . . En,m(n) }
such that fn and gn can be written as
m(n)
X
fn (x) = (an,k )χEn,k (x) , an,k ∈ R+ , ∀k = 1, . . . m(n) , (3.69a)
k=1
6
Let ϕ and φ be two simple functions defined on (X, ). Let {E1 , . . . En } and {F1 , . . . , Fm } be the
measurable partitions corresponding to ϕ and φ respectively. Then the collection defined as {Ei ∩ Fj |i =
1, . . . n, j = 1, . . . m} is a common measurable partition for ϕ and φ.
66
m(n)
X
gn (x) = (bn,k )χEn,k (x) , bn,k ∈ R+ , ∀k = 1, . . . m(n) , (3.69b)
k=1
m(n) α
1 X an,k
Sα (p̃n kr̃n ) = ln µ(En,k ) . (3.71)
α−1 bα−1
k=1 n,k
since we have7
Z m(n) α
X an,k
fn (x)α
α−1 dµ(x) = α−1 µ(En,k ) . (3.73)
X gn (x) k=1
b n,k
m(n)
(gn )α−1 (x) = bα−1
n,k χEn,k (x) .
k=1
Further,
m(n)
fnα aα
n,k
(x) = χEn,k (x) .
gnα−1 k=1
bα−1
n,k
67
In this case, the Lebesgue dominated convergence theorem (Rudin, 1964, pp.26,
Theorem 1.34) gives that,
Z Z
fnα pα
lim α−1 dµ = α−1
dµ , (3.75)
n→∞ X gn X r
68
3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy
By (2.46) we get
n
X
Sq (P ) = Pkq [lnq µk − lnq Pk ] . (3.84)
k=1
69
which can be rearranged as
PM
m=1 β m u m (x) − hhu m ii q
(Zq )1−q p(x) = 1 − (1 − q) R q
p(x)q .
X p(x) dµ(x)
By integrating both sides in the above equation, and by using (3.50) we get (3.86). J
Now, (3.86) can be written in its discrete form as
n
X Pkq 1−q
= (Zq ) . (3.87)
µq−1
k=1 k
which is a constant.
Hence, by (3.85) and (3.88), one can conclude that with respect to a particular in-
stance of ME, measure-theoretic Tsallis entropy S q (P ) defined for a probability mea-
sure P on the measure space (X, M, µ) is equal to discrete Tsallis entropy up to an
additive constant, when the reference measure µ is chosen as a uniform probability dis-
tribution. There by, one can further conclude that with respect to a particular instance
of ME, measure-theoretic Tsallis entropy is consistent with its discrete definition.
The same result can be shown in the case of q-expectation values too.
70
(X, M) be a measurable space and Π denote the set of all measurable partitions of X.
We denote a measurable partition π ∈ Π as π = {E k }m m
k=1 , i.e, ∪k=1 Ek = X and
Ei ∩ Ej = ∅, i 6= j, i, j = 1, . . . m. We denote the set of all simple functions on
(X, M) by L+ +
0 , and the set of all nonnegative M-measurable functions by L . The set
of all µ-integrable functions, where µ is a measure defined on (X, M), is denoted by
L1 (µ). Rényi relative-entropy Iα (P kR) refers to (3.23), which can be written as
Z
1
Iα (P kR) = ln ϕα dR , (3.89)
α−1 X
dP
where ϕ ∈ L1 (R) is defined as ϕ = dR .
L EMMA 3.2 Let P and R be probability measures on the measurable space (X, M) such that
dP
P R. Let ϕ = dR . Then for any E ∈ M and α > 1 we have
Z
P (E)α
≤ ϕα dR . (3.92)
R(E)α−1 E
R
Proof Since P (E) = E ϕ dR, ∀E ∈ M, by Hölder’s inequality we have
Z Z 1 Z 1− 1
α α
α
ϕ dR ≤ ϕ dR dR .
E E E
That is
Z
1
α α(1− α )
P (E) ≤ R(E) ϕα dR ,
E
71
We now present our main result in a special case as follows.
L EMMA 3.3 Let P and R be two probability measures such that P R. Let ϕ = dP
dR ∈ L+
0.
Then for any 0 < α < ∞, we have
m
X P (Ek )α
1
Iα (P kR) = ln , (3.93)
α−1 R(Ek )α−1
k=1
where {Ek }m
k=1 ∈ Π is the measurable partition corresponding to ϕ.
Pm
Proof The simple function ϕ ∈ L+
0 can be written as ϕ(x) = k=1 ak χEk (x), ∀x ∈ X,
R
where ak ∈ R, k = 1, . . . m. Now we have P (Ek ) = Ek ϕ dR = ak R(Ek ), and
hence
P (Ek )
ak = , ∀k = 1, . . . m. (3.94)
R(Ek )
Pm
We also have ϕα (x) = α
k=1 ak χEk (x), ∀x ∈ X and hence
Z m
X
α
ϕ dR = aαk R(Ek ) . (3.95)
X k=1
T HEOREM 3.2 Let (X, M) be a measurable space and Π denote the set of all measurable partitions
of X. Let P and R be two probability measures. Then for any α > 1, we have
m
1 X P (Ek )α
sup ln if P R ,
α−1
{Ek }m ∈Π α − 1 R(E k )
Iα (P kR) = k=1 k=1 (3.96)
+∞ otherwise.
Proof If P is not absolutely continuous with respect R, there exists E ∈ M such that P (E) >
0 and R(E) = 0. Since {E, X − E} ∈ Π, Iα (P kR) = +∞.
Now, we assume that P R. It is clear that it is enough to prove that
Z Xm
P (Ek )α
ϕα dR = sup α−1 , (3.97)
X {Ek }m
k=1 ∈Π k=1
R(E k )
72
dP
where ϕ = dR . From Lemma 3.2, for any measurable partition {E k }m
k=1 ∈ Π, we
have
Xm Xm Z Z
P (Ek )α α
α−1 ≤ ϕ dR = ϕα dR ,
k=1
R(Ek ) k=1 Ek X
and hence
Xm Z
P (Ek )α
sup α−1 ≤ ϕα dR . (3.98)
{Ek }m
k=1 ∈Π k=1
R(E k ) X
Now we shall obtain the reverse inequality to prove (3.97) . Thus, we now show
m
X Z
P (Ek )α
sup α−1 ≥ ϕα dR . (3.99)
{Ek }m
k=1 ∈Π k=1
R(Ek ) X
0 ≤ ϕ 1 ≤ ϕ2 ≤ . . . ≤ ϕ (3.100)
Now, ϕn ∈ L+ α α α α α
0 , ϕn ≤ ϕn+1 ≤ ϕ , 1 ≤ n < ∞ and lim n→∞ ϕn = ϕ for any α > 0.
Hence from Lebesgue monotone convergence theorem we have
Z Z
α
lim ϕn dR = ϕα dR . (3.103)
n→∞ X X
73
Now for any φ ∈ L+ α
0 such that 0 ≤ φ ≤ ϕ we have
Z Z
φ dR ≤ ϕα dR
X X
and hence
Z Z
sup α
φ dR | 0 ≤ φ ≤ ϕ , φ ∈ L+
0 ≤ ϕα dR . (3.106)
X
R
Now we show the reverse inequality of (3.106). If X ϕα dR < +∞, from (3.105)
given any > 0 one can find 0 ≤ n0 < ∞ such that
Z Z
ϕα dR < φn0 dR +
X X
and hence
Z Z
α α +
ϕ dR < sup φ dR | 0 ≤ φ ≤ ϕ , φ ∈ L0 + . (3.107)
X X
and
Z
α
sup φ dR | 0 ≤ φ ≤ ϕ , φ ∈ L+
0 >N . (3.110)
X
Since (3.109) and (3.110) are true for any N > 0, we have
Z Z
α α +
ϕ dR = sup φ dR | 0 ≤ φ ≤ ϕ , φ ∈ L0 = +∞ (3.111)
X X
R
and hence (3.108) is verified in the case of X ϕα dR = +∞. Now (3.106) and (3.108)
verify the claim that (3.103) implies (3.104). Finally (3.104) together with Lemma 3.3
proves (3.97) and hence the theorem.
Now from the fact that Rényi and Tsallis relative-entropies ((3.23) and (3.28) re-
spectively) are monotone and continuous functions of each other, the GYP-theorem
presented in the case of Rényi is valid for the Tsallis case too, whenever q > 1.
However, the GYP-theorem is yet to be stated for the case when entropic index
0 < α < 1 ( 0 < q < 1 in the case of Tsallis). Work on this problem is ongoing.
74
4 Geometry and Entropies:
Pythagoras’ Theorem
Abstract
Kullback-Leibler relative-entropy, in cases involving distributions resulting from
relative-entropy minimization, has a celebrated property reminiscent of squared Eu-
clidean distance: it satisfies an analogue of the Pythagoras’ theorem. And hence,
this property is referred to as Pythagoras’ theorem of relative-entropy minimization
or triangle equality and plays a fundamental role in geometrical approaches of statis-
tical estimation theory like information geometry. We state and prove the equivalent
of Pythagoras’ theorem in the generalized nonextensive formalism as the main result
of this chapter. Before presenting this result we study the Tsallis relative-entropy
minimization and present some differences with the classical case. This work can
also be found in (Dukkipati et al., 2005b; Dukkipati, Murty, & Bhatnagar, 2006a).
75
inference about an unknown probability distribution when there exists a prior estimate
of the distribution and new information in the form of constraints on expected val-
ues (Shore, 1981b). Formally, one can state this principle as: given a prior distribution
r, of all the probability distributions that satisfy the given moment constraints, one
should choose the posterior p with the least relative-entropy. The prior distribution
r can be a reference distribution (uniform, Gaussian, Lorentzian or Boltzmann etc.)
or a prior estimate of p. The principle of Jaynes maximum entropy is a special case
of minimization of relative-entropy under appropriate conditions (Shore & Johnson,
1980).
Many properties of relative-entropy minimization just reflect well-known prop-
erties of relative-entropy but there are surprising differences as well. For example,
relative-entropy does not generally satisfy a triangle relation involving three arbitrary
probability distributions. But in certain important cases involving distributions that
result from relative-entropy minimization, relative-entropy results in a theorem com-
parable to the Pythagoras’ theorem cf. (Csiszár, 1975) and ( Čencov, 1982, § 11). In
this geometrical interpretation, relative-entropy plays the role of squared distance and
minimization of relative-entropy appears as the analogue of projection on a sub-space
in a Euclidean geometry. This property is also known as triangle equality (Shore,
1981b).
The main aim of this chapter is to study the possible generalization of Pythagoras’
theorem to the nonextensive case. Before we take up this problem, we present the
properties of Tsallis relative-entropy minimization and present some differences with
the classical case. In the representation of such a minimum entropy distribution, we
highlight the use of the q-product (q-deformed version of multiplication), an operator
that has been introduced recently to derive the mathematical structure behind the Tsal-
lis statistics. Especially, q-product representation of Tsallis minimum relative-entropy
distribution will be useful for the derivation of the equivalent of triangle equality for
Tsallis relative-entropy.
Before we conclude this introduction on geometrical ideas of relative-entropy min-
imization, we make a note on the other geometric approaches that will not be consid-
ered in this thesis. One approach is that of Rao (1945), where one looks at the set of
probability distributions on a sample space as a differential manifold and introduce a
Riemannian geometry on this manifold. This approach is pioneered by Čencov (1982)
and Amari (1985) who have shown the existence of a particular Riemannian geometry
which is useful in understanding some questions of statistical inference. This Rieman-
nian geometry turns out to have some interesting connections with information theory
76
and as shown by Campbell (1985), with the minimum relative-entropy. In this ap-
proach too, the above mentioned Pythagoras’ Theorem plays an important role (Amari
& Nagaoka, 2000, pp.72).
The other idea involves the use of Hausdorff dimension (Billingsley, 1960, 1965)
to understand why minimizing relative-entropy should provide useful results. This
approach was begun by Eggleston (1952) for a special case of maximum entropy and
was developed by Campbell (1992). For an excellent review on various geometrical
aspects associated with minimum relative-entropy one can refer to (Campbell, 2003).
The structure of the chapter is organized as follows. We present the necessary
background in § 4.1, where we discuss properties of relative-entropy minimization in
the classical case. In § 4.2, we present the ME prescriptions of Tsallis relative-entropy
and discuss its differences with the classical case. Finally, the derivation of Pythagoras’
theorem in the nonextensive case is presented in § 4.3.
Regarding the notation, we use the same notation as in Chapter 3, and we write all
our mathematical formulations on the measure space (X, M, µ). All the assumptions
we made in Chapter 3 (see § 3.2) are valid here too. Also, though results presented in
this chapter do not involve major measure theoretic concepts, we write all the integrals
with respect to the measure µ, as a convention; these integrals can be replaced by
summations in the discrete case or Lebesgue integrals in the continuous case.
77
for relative-entropy minimization below, we use the definition of relative-entropy (3.7)
for probability measures P and R, corresponding to pdfs p and r respectively (refer to
Definitions 3.2 and 3.3). This correspondence between probability measures P and R
with pdfs p and r, respectively, will not be described again in the sequel.
To minimize the relative-entropy (4.2) with respect to the constraints (4.1), the La-
grangian turns out to be
Z Z
dP
L(x, λ, β) = ln (x) dP (x) + λ dP (x) − 1
X dR X
M
X Z
+ βm um (x) dP (x) − hum i , (4.3)
m=1 X
XM
dP
ln (x) + λ + βm um (x) = 0 ,
dR m=1
dP
Finally, from (4.4) the posterior distribution p(x) = dµ given by Kullback’s minimum
dR
entropy principle can be written in terms of the prior r(x) = dµ as
M
r(x)e− m=1 βm um (x)
p(x) = , (4.5)
Zb
where
Z M
Zb = r(x)e− m=1 βm um (x)
dµ(x) (4.6)
X
78
Properties of relative-entropy minimization have been studied extensively and pre-
sented by Shore (1981b). Here we briefly mention a few.
The principle of maximum entropy is equivalent to relative-entropy minimization
in the special case of discrete spaces and uniform priors, in the sense that, when the
prior is a uniform distribution with finite support W (over E ⊂ X), the minimum
entropy distribution turns out to be
M
e− m=1 βm um (x)
p(x) = Z , (4.7)
M
e− m=1 βm um (x)
dµ(x)
E
which is in fact, a maximum entropy distribution (3.33) of Shannon entropy with re-
spect to the constraints (4.1).
The important relations to relative-entropy minimization are as follows. Minimum
relative-entropy, I, can be calculated as
M
X
I = − ln Zb − βm hum i , (4.8)
m=1
T HEOREM 4.1 Let r be the prior, p be the probability distribution that minimizes the relative-
entropy subject to a set of constraints
Z
um (x)p(x) dµ(x) = hum i , m = 1, . . . , M , (4.11)
X
79
Proof We have
Z
l(x)
I(lkr) = l(x) ln dµ(x)
X r(x)
Z Z
l(x) p(x)
= l(x) ln dµ(x) + l(x) ln dµ(x)
X p(x) X r(x)
Z
p(x)
= I(lkp) + l(x) ln dµ(x) (4.13)
X r(x)
From the minimum entropy distribution (4.5) we have
X M
p(x) b .
ln =− βm um (x) − ln Z (4.14)
r(x)
m=1
XM
= I(lkp) − βm hum i − ln Zb (By hypothesis)
m=1
= I(lkp) + I(pkr) . (By (4.8))
since I(lkp) ≥ 0 for every pair of pdfs, with equality if and only if l = p. A pictorial
depiction of the triangle equality (4.12) is shown in Figure 4.1.
r p
80
For a study of relative-entropy minimization without the use of Lagrange multiplier
technique and corresponding geometrical aspects, one can refer to (Csiszár, 1975).
Triangle equality of relative-entropy minimization not only plays a fundamental
role in geometrical approaches of statistical estimation theory ( Čencov, 1982) and
information geometry (Amari, 1985, 2001) but is also important for applications in
which relative-entropy minimization is used for purposes of pattern classification and
cluster analysis (Shore & Gray, 1982).
81
where λ and βm , m = 1, . . . M are Lagrange multipliers. Now set
dL
=0 . (4.19)
dP
XM
r(x)
lnq − λ − p(x)q−1 βm um (x) = 0 ,
p(x) m=1
x1−q −1
which can be rearranged by using the definition of q-logarithm ln q x = 1−q as
h PM i 1
r(x)1−q − (1 − q)
1−q
m=1 βm um (x)
p(x) = 1 .
(λ(1 − q) + 1) 1−q
R
Specifying the Lagrange parameter λ via the normalization X p(x) dµ(x) = 1, one
can write Tsallis minimum relative-entropy distribution as (Borland et al., 1998)
" M
# 1−q
1
X
r(x)1−q − (1 − q) βm um (x)
m=1
p(x) = , (4.20)
cq
Z
Z " M
X
# 1−q
1
cq =
Z r(x)1−q − (1 − q) βm um (x) dµ(x) . (4.21)
X m=1
Note that the generalized relative-entropy distribution (4.20) is not of the form of its
classical counterpart (4.5) even if we replace the exponential with the q-exponential.
But one can express (4.20) in a form similar to the classical case by invoking q-
deformed binary operation called q-product.
In the framework of q-deformed functions and operators discussed in Chapter 2, a
new multiplication, called q-product defined as
1
x1−q + y 1−q − 1 1−q if x, y > 0,
x ⊗q y ≡ x 1−q + y 1−q − 1 > 0 (4.22)
0 otherwise.
82
is first introduced in (Nivanen et al., 2003) and explicitly defined in (Borges, 2004) for
satisfying the following equations:
The q-product recovers the usual product in the limit q → 1 i.e., lim q→1 (x ⊗q y) =
xy. The fundamental properties of the q-product ⊗ q are almost the same as the usual
product, and the distributive law does not hold in general, i.e.,
a(x ⊗q y) 6= ax ⊗q y (a, x, y ∈ R) .
Further properties of the q-product can be found in (Nivanen et al., 2003; Borges,
2004).
One can check the mathematical validity of the q-product by recalling the expres-
sion of the exponential function ex
x n
ex = lim 1+ . (4.25)
n→∞ n
Replacing the power on the right side of (4.25) by n times the q-product ⊗ q :
n
x ⊗q = x ⊗ q . . . ⊗ q x , (4.26)
| {z }
n times
where
Z M
cq = − βm um (x)
Z r(x) ⊗q eq m=1
dµ(x). (4.29)
X
83
Later in this chapter we see that this representation is useful in establishing properties
of Tsallis relative-entropy minimization and corresponding thermodynamic equations.
It is important to note that the distribution in (4.20) could be a (local/global) min-
imum only if q > 0 and the Tsallis cut-off condition (3.46) specified by Tsallis maxi-
mum entropy distribution is extended to the relative-entropy case i.e., p(x) = 0 when-
h P i
ever r(x)1−q − (1 − q) M β
m=1 m m u (x) < 0. The latter condition is also required
for the q-product representation of the generalized minimum entropy distribution.
4.2.3 Properties
1 X
− (1 − q) βm um (x)
W 1−q m=1
p(x) = ,
Z " M
X
# 1−q
1
1
− (1 − q) βm um (x) dµ(x)
E W 1−q
m=1
or
M
−W 1−q m=1 βm um (x)
eq
p(x) = Z M
. (4.31)
−W 1−q βm um (x)
eq m=1
dµ(x)
E
By comparing (4.30) or (4.31) with Tsallis maximum entropy distribution (3.44), one
can conclude (formally one can verify this by the thermodynamic equations of Tsal-
lis entropy (3.37)) that minimizing relative-entropy is not equivalent 1 to maximizing
1
For fixed q-expected values hum iq , the two distributions, (4.31) and (3.44) are equal, but the values
of corresponding Lagrange multipliers are different when q 6= 1 (while in the classical case they remain
same). Further, (4.31) offers the relation between the Lagrange parameters in these two cases. Let
(S)
βm , m = 1, . . . M be the Lagrange parameters corresponding to the generalized maximum entropy
(I)
distribution while βm , m = 1, . . . M correspond to generalized minimum entropy distribution with
(S) (I)
uniform prior. Then, we have the relation βm = W 1−q βm , m = 1, . . . M .
84
entropy when the prior is a uniform distribution. The key observation here is that W
appears in (4.31) unlike in (3.44).
In this case, one can calculate minimum relative-entropy I q as
M
X
cq −
Iq = − lnq Z βm hum iq . (4.32)
m=1
By (4.17) and expanding lnq p(x) one can write Iq in its final form as in (4.32). J
It is easy to verify the following thermodynamic equations for the minimum Tsallis
relative-entropy:
∂ cq = −hum i , m = 1, . . . M,
lnq Z q (4.34)
∂βm
∂Iq
= −βm , m = 1, . . . M, (4.35)
∂hum iq
85
4.2.4 The Case of Normalized q-Expectations
(q)
where the parameters βm can be defined in terms of the true Lagrange parameters
βm as
(q) βm
βm =Z , m = 1, . . . , M. (4.38)
p(x)q dµ(x)
X
where
PM 1
Z βm um (x) − hhum iiq
1−q
c= m=1
Z q r(x)1−q − (1 − q) R q
dµ(x) .
X X p(x) dµ(x)
Now, the minimum entropy distribution (4.39) can be expressed using the q-product
(4.22) as
P
M
1 m=1 β m u m (x) − hhu m ii q
p(x) = r(x) ⊗q expq R q
. (4.40)
c
Z X p(x) dµ(x)
q
c ,
Iq = − lnq Z (4.41)
q
86
∂Iq
= −βm , m = 1, . . . M, (4.43)
∂hhum iiq
where
M
lnq Z c − X β hhu ii .
cq = lnq Z (4.44)
q m m q
m=1
Significance of the triangle equality is evident in the following scenario. Let r be the
prior estimate of the unknown probability distribution l, about which, the information
in the form of constraints
Z
um (x)l(x) dµ(x) = hum i , m = 1, . . . M (4.45)
X
from which one can draw the following conclusions. By (4.15), the minimum relative-
entropy posterior estimate of l is not only logically consistent, but also closer to l, in the
87
relative-entropy sense, that is the prior r. Moreover, the difference I(lkr) − I(lkp) is
exactly the relative-entropy I(pkr) between the posterior and the prior. Hence, I(pkr)
can be interpreted as the amount of information provided by the constraints that is not
inherent in r.
Additional justification to use minimum relative-entropy estimate of p with respect
to the constraints (4.46) is provided by the following expected value matching prop-
erty (Shore, 1981b). To explain this concept we restate our above estimation problem
as follows.
For fixed functions um , m = 1, . . . M , let the actual unknown distribution l satisfy
Z
um (x)l(x) dµ(x) = hwm i , m = 1, . . . M, (4.48)
X
The proof is as follows (Shore, 1981b). I Proceeding as in the proof of Theorem 4.1,
we have
M
X Z
I(lkp) = I(lkr) + βm l(x)um (x) dµ(x) + ln Zb
m=1 X
XM
= I(lkr) + βm hwm i + ln Zb (By (4.48)) (4.50)
m=1
Since the variation of I(lkp) with respect to hu m i results in the variation of I(lkp)
with respect to βm for any m = 1, . . . , M , to find the minimum of I(lkp) one can
solve
∂
Iq (lkp) = 0 , m = 1, . . . M ,
∂βm
which gives the solution as in (4.49). J
This property of expectation matching states that, for a distribution p of the form
(4.5), I(lkp) is the smallest when the expected values of p match those of l. In partic-
ular, p is not only the distribution that minimizes I(pkr) but also minimizes I(lkp).
88
We now restate the Theorem 4.1 which summarizes the above discussion.
T HEOREM 4.2 Let r be the prior distribution, and p be the probability distribution that minimizes
the relative-entropy subject to a set of constraints
Z
um (x)p(x) dµ(x) = hum i , m = 1, . . . , M. (4.51)
X
Then
By the above interpretation of triangle equality and analogy with the compara-
ble situation in Euclidean geometry, it is natural to call p, as defined by (4.5) as the
projection of r on the plane described by (4.52). Csiszár (1975) has introduced a gen-
eralization of this notion to define the projection of r on any convex set E of probability
distributions. If p ∈ E satisfies the equation
From the above discussion, it is clear that to derive the triangle equality of Tsallis
relative-entropy minimization, one should first deduce the equivalent of expectation
matching property in the nonextensive case.
We state below and prove the Pythagoras theorem in nonextensive framework
(Dukkipati, Murty, & Bhatnagar, 2006a).
89
T HEOREM 4.3 Let r be the prior distribution, and p be the probability distribution that minimizes
the Tsallis relative-entropy subject to a set of constraints
Z
um (x)p(x)q dµ(x) = hum iq , m = 1, . . . , M. (4.56)
X
Then
hwm iq
hum iq = , m = 1, . . . M. (4.58)
1 − (1 − q)Iq (lkp)
Proof First we deduce the equivalent of expectation matching property in the nonextensive
case. That is, we would like to find the values of hu m iq for which Iq (lkp) is minimum.
We write the following useful relations before we proceed to the derivation.
We can write the generalized minimum entropy distribution (4.28) as
M M
ln r(x)
eq q ⊗q eq − m=1 βm um (x)
eq − m=1 βm um (x)+lnq r(x)
p(x) = = , (4.60)
cq
Z cq
Z
ln x
by using the relations eq q = x and exq ⊗q eyq = ex+y
q . Further by using
90
one can verify that
Z
Iq (pkr) = − p(x)q lnq r(x) dµ(x) − Sq (p) . (4.63)
X
1
R
and by the expression of Tsallis entropy S q (l) = q−1 1− X l(x)q dµ(x) , we have
M
X
Iq (lkp) = Iq (lkr) + cq − (1 − q) lnq Z
βm hwm iq + lnq Z cq Iq (lkp) . (4.67)
m=1
∂
Iq (lkp) = 0 . (4.68)
∂βm
By using thermodynamic equation (4.34), solution of (4.68) provides us with the
expectation matching property in the nonextensive case as
hwm iq
hum iq = , m = 1, . . . M . (4.69)
1 − (1 − q)Iq (lkp)
91
In the limit q → 1 the above equation gives hu m i1 = hwm i1 which is the expectation
matching property in the classical case.
Now, to derive the triangle equality for Tsallis relative-entropy minimization, we substitute the expression for $\langle w_m \rangle_q$ given by (4.69) into (4.67). After some algebra one arrives at (4.59).
Note that the limit $q \to 1$ in (4.59) gives the triangle equality in the classical case (4.54). Two important special cases arise out of (4.59).
In the case of normalized q-expectation values too, the Tsallis relative-entropy satisfies a nonextensive triangle equality, with conditions modified from those of the q-expectation case.
THEOREM 4.4 Let r be the prior distribution, and p be the probability distribution that minimizes the Tsallis relative-entropy subject to the set of constraints
$$\frac{\int_X u_m(x)\, p(x)^q\, d\mu(x)}{\int_X p(x)^q\, d\mu(x)} = \langle\langle u_m \rangle\rangle_q\,, \quad m = 1, \ldots, M. \qquad (4.72)$$
Then the nonextensive triangle equality (4.74) holds, provided condition (4.75) is satisfied.
Proof. From the Tsallis minimum entropy distribution p in the case of normalized q-expected values (4.40), we have
$$\ln_q r(x) - \ln_q p(x) = \ln_q \widehat{Z}_q + (1-q)\, \ln_q p(x)\, \ln_q \widehat{Z}_q
  + \frac{\sum_{m=1}^{M} \beta_m \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right)}{\int_X p(x)^q\, d\mu(x)}\,. \qquad (4.76)$$
and
$$\int_X l(x)^q\, d\mu(x) = (1-q)\, S_q(l) + 1\,,$$
we obtain
$$I_q(l\|p) = I_q(l\|r) + \ln_q \widehat{Z}_q - (1-q)\, \ln_q \widehat{Z}_q\, I_q(l\|p)
  + \frac{\int_X l(x)^q\, d\mu(x)}{\int_X p(x)^q\, d\mu(x)} \sum_{m=1}^{M} \beta_m \left( \langle\langle w_m \rangle\rangle_q - \langle\langle u_m \rangle\rangle_q \right)\,. \qquad (4.80)$$
Finally using (4.41) and (4.75) we have the nonextensive triangle equality (4.74).
Note that in this case the minimum of $I_q(l\|p)$ is not guaranteed. Also, the condition (4.75) for the nonextensive triangle equality here is the same as the expectation value matching property in the classical case.
Finally, the nonextensive Pythagoras theorem is yet another remarkable and consistent generalization exhibited by the Tsallis formalism.
5 Power-laws and Entropies:
Generalization of Boltzmann Selection
Abstract
The great success of the Tsallis formalism is largely due to the power-law distributions that result from the ME-prescriptions of its entropy functional. In this chapter we provide an experimental demonstration of the use of power-law distributions in evolutionary algorithms by generalizing Boltzmann selection to the Tsallis case. The proposed algorithm uses the Tsallis canonical distribution, instead of the Gibbs-Boltzmann distribution, to weigh the configurations for 'selection'. This work is motivated by the recently proposed generalized simulated annealing algorithm based on Tsallis statistics. The results in this chapter can also be found in (Dukkipati, Murty, & Bhatnagar, 2005a).
The central step of an enormous variety of problems (in physics, chemistry, statistics, engineering, economics) is the minimization of an appropriate energy or cost function. (For example, the energy function in the traveling salesman problem is the length of the path.) If the cost function is convex, any gradient descent method easily solves the problem. But if the cost function is nonconvex, the solution requires more sophisticated methods, since a gradient descent procedure could easily trap the system in a local minimum. Consequently, various algorithmic strategies have been developed over the years for making this important problem increasingly tractable. Among the various methods developed to solve hard optimization problems, the most popular ones are simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) and evolutionary algorithms (Bounds, 1987).
Evolutionary computation comprises techniques for obtaining near-optimal solutions to hard optimization problems in physics (e.g., Sutton, Hunter, & Jan, 1994) and engineering (Holland, 1975). These methods are based largely on ideas from biological evolution and are similar to simulated annealing, except that, instead of exploring the search space with a single point at each instant, they deal with a population, a multiset of points from the search space, in order to avoid getting trapped in local optima during the process of optimization. Though evolutionary algorithms are not traditionally analyzed in the Monte Carlo framework, a few researchers (e.g., Cercueil & Francois, 2001; Cerf, 1996a, 1996b) have analyzed them in this framework.
A typical evolutionary algorithm is a two-step process: selection and variation. Selection consists of replicating individuals in the population based on probabilities (selection probabilities) that are assigned to the individuals on the basis of a "fitness" measure defined by the objective function. A stochastic perturbation of individuals while replicating is called variation.
Selection is a central concept in evolutionary algorithms. There are several selection mechanisms in evolutionary algorithms, among which Boltzmann selection has an important place because of the deep connection between the behavior of complex systems in thermal equilibrium at finite temperature and multivariate optimization (Nulton & Salamon, 1988). In these systems, each configuration is weighted by its Gibbs-Boltzmann probability factor $e^{-E/T}$, where E is the energy of the configuration and T is the temperature. Finding the low-temperature state of a system when the energy can be computed amounts to solving an optimization problem. This connection has been used to devise the simulated annealing algorithm (Kirkpatrick et al., 1983). Similarly, in the selection process of evolutionary algorithms, where one would select "better" configurations, one can use the same technique to weigh the individuals, i.e., the Gibbs-Boltzmann factor. This is called Boltzmann selection, which is nothing but defining the selection probabilities in the form of the Boltzmann canonical distribution.
Classical simulated annealing, as proposed by Kirkpatrick et al. (1983), extended the well-known procedure of Metropolis et al. (1953) for equilibrium Gibbs-Boltzmann statistics: a new configuration is accepted with the probability
$$p = \min\left\{1,\; e^{-\beta \Delta E}\right\}\,, \qquad (5.1)$$
where $\beta = \frac{1}{T}$ is the inverse temperature parameter and $\Delta E$ is the change in the energy.
The annealing consists in decreasing the temperature gradually. Geman and Geman
(1984) showed that if the temperature decreases as the inverse logarithm of time, the
system will end in a global minimum.
On the other hand, in the generalized simulated annealing procedure proposed by Tsallis and Stariolo (1996) the acceptance probability is generalized to
$$p = \min\left\{1,\; \left[1 - (1-q)\,\beta \Delta E\right]^{\frac{1}{1-q}}\right\}\,, \qquad (5.2)$$
for some q. The term $[1 - (1-q)\beta\Delta E]^{\frac{1}{1-q}}$ comes from the Tsallis distribution in Tsallis statistics (see § 3.4), and $q \to 1$ in (5.2) retrieves the acceptance probability of the classical case. This method is shown to be faster than both classical simulated annealing and fast simulated annealing (Stariolo & Tsallis, 1995; Tsallis, 1988). The algorithm has been used successfully in many applications (Yu & Mo, 2003; Moret et al., 1998; Penna, 1995; Andricioaei & Straub, 1996, 1997).
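For concreteness, here is a minimal sketch (not from the thesis) of the two acceptance rules (5.1) and (5.2). The cut-off of the q-exponential to zero for non-positive arguments is an assumption commonly made in generalized simulated annealing implementations, and the function names are illustrative.

```python
import math
import random

def accept_classical(delta_E: float, beta: float) -> bool:
    """Metropolis acceptance rule (5.1): accept with probability min{1, exp(-beta*dE)}."""
    if delta_E <= 0.0:                      # downhill moves are always accepted
        return True
    return random.random() < math.exp(-beta * delta_E)

def accept_generalized(delta_E: float, beta: float, q: float) -> bool:
    """Tsallis-Stariolo acceptance rule (5.2):
    accept with probability min{1, [1 - (1-q)*beta*dE]^(1/(1-q))}.
    A non-positive base is treated as zero acceptance probability (an assumed
    cut-off convention); q -> 1 recovers the classical rule."""
    if abs(q - 1.0) < 1e-12:
        return accept_classical(delta_E, beta)
    base = 1.0 - (1.0 - q) * beta * delta_E
    p = 0.0 if base <= 0.0 else min(1.0, base ** (1.0 / (1.0 - q)))
    return random.random() < p
```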
The above-described use of power-law distributions in simulated annealing motivates us to incorporate the Tsallis canonical probability distribution into the selection step of evolutionary algorithms and to test its effectiveness.
Before we present the proposed algorithm and simulation results, we also present an information-theoretic justification of the Boltzmann distribution in the selection mechanism (Dukkipati et al., 2005a). In fact, in evolutionary algorithms Boltzmann selection is viewed just as an exponential scaling for proportionate selection (de la Maza & Tidor, 1993) (where the selection probabilities of configurations are inversely proportional to their energies (Holland, 1975)). We show that by using the Boltzmann distribution in the selection mechanism one implicitly satisfies Kullback's minimum relative-entropy principle.
We assume that the size of the population at any time is finite and need not be constant.
In the first step, the initial population $P_0$ is chosen with random configurations. At each time step t, the population undergoes the following procedure:
$$P_t \;\xrightarrow{\text{selection}}\; P_t' \;\xrightarrow{\text{variation}}\; P_{t+1}\,.$$
[Figure 5.1 flowchart: start, initialize population, evaluate "fitness", apply selection, check stop criterion (no: repeat, yes: end).]
and the selection probabilities $\{p_t(\omega_k)\}_{k=1}^{N_t}$ satisfy the condition $\sum_{k=1}^{N_t} p_t(\omega_k) = 1$.
The general structure of evolutionary algorithms is shown in Figure 5.1; for further
details refer to (Fogel, 1994; Back, Hammel, & Schwefel, 1997).
According to Boltzmann selection, the selection probabilities are defined as
$$p_t(\omega_k) = \frac{e^{-\beta E_k}}{\sum_{j=1}^{N_t} e^{-\beta E_j}}\,, \qquad (5.3)$$
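For illustration, a minimal sketch (not from the thesis) of computing the selection probabilities (5.3) from a list of energies follows. Subtracting the minimum energy before exponentiation is a numerical-stability detail assumed here; it leaves the probabilities unchanged.

```python
import math
from typing import List

def boltzmann_selection_probabilities(energies: List[float], beta: float) -> List[float]:
    """Selection probabilities (5.3): p_k = exp(-beta*E_k) / sum_j exp(-beta*E_j).
    Shifting all energies by the minimum leaves the ratios unchanged and avoids
    numerical overflow (an implementation detail, not part of the definition)."""
    e_min = min(energies)
    weights = [math.exp(-beta * (E - e_min)) for E in energies]
    total = sum(weights)
    return [w / total for w in weights]

# Lower-energy configurations receive higher selection probability.
print(boltzmann_selection_probabilities([1.0, 2.0, 3.0], beta=1.0))
```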
et al., 2004), which characterizes Boltzmann selection from first principles. Given a population $P_t = \{\omega_k\}_{k=1}^{N_t}$, the simplest probability distribution on $\Omega$ which represents $P_t$ is
$$\xi_t(\omega) = \frac{\nu_t(\omega)}{N_t}\,, \quad \forall \omega \in \Omega\,, \qquad (5.4)$$
$$p(\omega) = \begin{cases} \cdots & \\ 0 & \text{otherwise.} \end{cases} \qquad (5.6)$$
One can estimate the frequencies of configurations after the selection, $\nu_{t+1}$, as
$$\nu_{t+1}(\omega) = \nu_t(\omega)\, \frac{e^{-\beta E(\omega)}}{\sum_{\omega' \in P_t} \nu_t(\omega')\, e^{-\beta E(\omega')}}\; N_{t+1}\,, \qquad (5.7)$$
where $N_{t+1}$ is the population size after the selection. Further, the probability distribution which represents the population $P_{t+1}$ can be estimated as
$$\xi_{t+1}(\omega) = \frac{\xi_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega' \in P_t} \xi_t(\omega')\, e^{-\beta E(\omega')}}\,. \qquad (5.8)$$
One can observe that (5.8) resembles the minimum relative-entropy distribution that we derived in § 4.1.1 (see (4.5)). This motivates one to investigate the possible connection of Boltzmann selection with Kullback's relative-entropy principle.
Given the distribution $\xi_t$, which represents the population $P_t$, we would like to estimate the distribution $\xi_{t+1}$ that represents the population $P_{t+1}$. In this context one can view $\xi_t$ as a prior estimate of $\xi_{t+1}$. The available constraints for $\xi_{t+1}$ are
$$\sum_{\omega \in \Omega} \xi_{t+1}(\omega) = 1\,, \qquad (5.9a)$$
$$\sum_{\omega \in \Omega} \xi_{t+1}(\omega)\, E(\omega) = \langle E \rangle_{t+1}\,, \qquad (5.9b)$$
where $\langle E \rangle_{t+1}$ is the expected value of the function E with respect to $\xi_{t+1}$. At this stage let us assume that $\langle E \rangle_{t+1}$ is a given quantity; this will be explained later.
In this setup, the Kullback minimum relative-entropy principle gives the estimate for $\xi_{t+1}$. That is, one should choose $\xi_{t+1}$ in such a way that it minimizes the relative-entropy
$$I(\xi_{t+1}\|\xi_t) = \sum_{\omega \in \Omega} \xi_{t+1}(\omega) \ln \frac{\xi_{t+1}(\omega)}{\xi_t(\omega)} \qquad (5.10)$$
with respect to the constraints (5.9a) and (5.9b). The corresponding Lagrangian can be written as
$$L \equiv -I(\xi_{t+1}\|\xi_t) - (\lambda - 1)\left(\sum_{\omega \in \Omega} \xi_{t+1}(\omega) - 1\right) - \beta\left(\sum_{\omega \in \Omega} E(\omega)\, \xi_{t+1}(\omega) - \langle E \rangle_{t+1}\right)\,,$$
and
$$\frac{\partial L}{\partial \xi_{t+1}(\omega)} = 0 \;\Longrightarrow\; \xi_{t+1}(\omega) = e^{\ln \xi_t(\omega) - \lambda - \beta E(\omega)}\,.$$
By (5.9a) we get
$$\xi_{t+1}(\omega) = \frac{\xi_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega' \in \Omega} \xi_t(\omega')\, e^{-\beta E(\omega')}}\,, \qquad (5.11)$$
which is the selection equation (5.8) that we have derived from the Boltzmann selection mechanism. The Lagrange multiplier β is the inverse temperature parameter in Boltzmann selection.
The above justification is incomplete without explaining the relevance of the constraint (5.9b) in this context. Note that the inverse temperature parameter β in (5.11) is determined using constraint (5.9b). Thus we have
$$\frac{\sum_{\omega \in \Omega} E(\omega)\, \xi_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega \in \Omega} \xi_t(\omega)\, e^{-\beta E(\omega)}} = \langle E \rangle_{t+1}\,. \qquad (5.12)$$
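As a small numerical check of this justification (not part of the thesis): the Boltzmann-selection update (5.8)/(5.11) is compared against randomly perturbed distributions that still satisfy the constraints (5.9a) and (5.9b); all of them have a larger relative-entropy (5.10) to the prior estimate, i.e., the update is indeed the minimum relative-entropy estimate. The five-point example and its numbers are purely illustrative.

```python
import math
import random

def kl(p, q):
    """Discrete relative-entropy I(p||q) = sum_w p(w) ln(p(w)/q(w)), as in (5.10)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def boltzmann_update(xi, E, beta):
    """Reweighting (5.8)/(5.11): the new estimate is proportional to xi(w)*exp(-beta*E(w))."""
    w = [x * math.exp(-beta * e) for x, e in zip(xi, E)]
    s = sum(w)
    return [v / s for v in w]

def project_out(d, basis):
    """Remove from d its components along each vector of an orthogonal basis."""
    for b in basis:
        c = sum(di * bi for di, bi in zip(d, b)) / sum(bi * bi for bi in b)
        d = [di - c * bi for di, bi in zip(d, b)]
    return d

# Toy setting: five configurations with energies E, and a prior estimate xi_t.
xi_t = [0.2, 0.2, 0.2, 0.2, 0.2]
E = [3.0, 1.0, 4.0, 1.5, 5.0]
beta = 0.7

xi_next = boltzmann_update(xi_t, E, beta)
mean_E = sum(p * e for p, e in zip(xi_next, E))   # this plays the role of <E>_{t+1}
base = kl(xi_next, xi_t)

# Feasible perturbations: directions orthogonal to the all-ones vector and to E keep
# both constraints (5.9a) and (5.9b) satisfied; all of them increase the relative-entropy.
ones = [1.0] * len(E)
c = sum(E) / len(E)
E_centred = [e - c for e in E]                    # orthogonal to the all-ones vector
random.seed(0)
for _ in range(1000):
    d = project_out([random.gauss(0.0, 1.0) for _ in E], [ones, E_centred])
    eps = 0.25 * min(xi_next) / (max(abs(di) for di in d) + 1e-12)
    xi_other = [x + eps * di for x, di in zip(xi_next, d)]
    assert kl(xi_other, xi_t) >= base - 1e-12
print("Boltzmann-selection update coincides with the minimum relative-entropy estimate.")
```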
where $\beta_0$ is any constant and α > 1. The effectiveness of this annealing schedule has been demonstrated using simulations in (Dukkipati et al., 2004). Similar to the practice in generalized simulated annealing (Andricioaei & Straub, 1997), in our algorithm q tends towards 1 as the temperature decreases during annealing.
The generalized evolutionary algorithm based on Tsallis statistics is given in Fig-
ure 5.2.
Algorithm 1 Generalized evolutionary algorithm
  P(0) ← initialize with configurations chosen randomly from the search space
  Initialize β and q
  for t = 1 to T do
      for all ω ∈ P(t) do                                   (Selection)
          Calculate  p(ω) = [1 − (1 − q) β E(ω)]^{1/(1−q)} / Z_q
          Copy ω into P'(t) with probability p(ω), with replacement
      end for
      for all ω ∈ P'(t) do                                  (Variation)
          Perform variation with a specified probability
      end for
      Update β according to the annealing schedule
      Update q according to its schedule
      P(t + 1) ← P'(t)
  end for
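A runnable Python sketch of the above procedure follows. It is only an illustration under stated assumptions, not the code used for the experiments reported below: the roulette-wheel reading of the selection step, the bit-flip variation operator, the toy "count the zero bits" energy, the cut-off of negative Tsallis weights to zero, and the simple multiplicative β schedule with a linear q schedule are all choices made for the sketch.

```python
import math
import random

def tsallis_weight(E: float, beta: float, q: float) -> float:
    """Tsallis canonical weight [1 - (1-q)*beta*E]^(1/(1-q)); negative bases are cut
    off to zero (an assumption of this sketch), and q -> 1 recovers exp(-beta*E)."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(-beta * E)
    base = 1.0 - (1.0 - q) * beta * E
    return 0.0 if base <= 0.0 else base ** (1.0 / (1.0 - q))

def generalized_ea(energy, n_bits=15, pop_size=50, generations=100,
                   beta0=1.0, alpha=1.01, q0=1.5, p_mut=0.05, seed=0):
    """Sketch of an evolutionary algorithm with Tsallis selection."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    beta, q = beta0, q0
    best = min(pop, key=energy)
    for t in range(1, generations + 1):
        # Selection: resample the population with Tsallis selection probabilities.
        weights = [tsallis_weight(energy(ind), beta, q) for ind in pop]
        if sum(weights) == 0.0:
            weights = [1.0] * pop_size           # degenerate case: fall back to uniform
        pop = [rng.choices(pop, weights=weights)[0][:] for _ in range(pop_size)]
        # Variation: independent bit-flip mutation.
        for ind in pop:
            for i in range(n_bits):
                if rng.random() < p_mut:
                    ind[i] ^= 1
        best = min(pop + [best], key=energy)
        # Schedules: increase beta multiplicatively; move q linearly towards 1.
        beta *= alpha
        q = q0 + (1.0 - q0) * t / generations
    return best

# Toy energy to minimize: the number of zero bits ("one-max" as a minimization problem).
best = generalized_ea(lambda ind: ind.count(0))
print(best, "energy:", best.count(0))
```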
The algorithms were compared on the following standard test functions (a plain-code transcription is given after the list).
• Ackley's function:
$$E_1(\vec{x}) = -20 \exp\left(-0.2\sqrt{\tfrac{1}{l}\textstyle\sum_{i=1}^{l} x_i^2}\right) - \exp\left(\tfrac{1}{l}\textstyle\sum_{i=1}^{l} \cos(2\pi x_i)\right) + 20 + e\,,$$
• Rastrigin's function:
$$E_2(\vec{x}) = lA + \sum_{i=1}^{l} \left( x_i^2 - A\cos(2\pi x_i) \right)\,,$$
where A = 10 and −5.12 ≤ x_i ≤ 5.12,
• Griewangk's function:
$$E_3(\vec{x}) = \sum_{i=1}^{l} \frac{x_i^2}{4000} - \prod_{i=1}^{l} \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1\,,$$
where −600 ≤ x_i ≤ 600.
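The three benchmark functions transcribe directly into code; the sketch below (not the experimental code) follows the formulas above, with the dimension l implicit in the length of the input vector. All three functions attain the value 0 at the origin.

```python
import math
from typing import Sequence

def ackley(x: Sequence[float]) -> float:
    """E1: -20*exp(-0.2*sqrt(mean(x_i^2))) - exp(mean(cos(2*pi*x_i))) + 20 + e."""
    l = len(x)
    s1 = sum(xi * xi for xi in x) / l
    s2 = sum(math.cos(2.0 * math.pi * xi) for xi in x) / l
    return -20.0 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20.0 + math.e

def rastrigin(x: Sequence[float], A: float = 10.0) -> float:
    """E2: l*A + sum(x_i^2 - A*cos(2*pi*x_i)), with -5.12 <= x_i <= 5.12."""
    return len(x) * A + sum(xi * xi - A * math.cos(2.0 * math.pi * xi) for xi in x)

def griewangk(x: Sequence[float]) -> float:
    """E3: sum(x_i^2/4000) - prod(cos(x_i/sqrt(i))) + 1, with -600 <= x_i <= 600."""
    s = sum(xi * xi for xi in x) / 4000.0
    p = 1.0
    for i, xi in enumerate(x, start=1):
        p *= math.cos(xi / math.sqrt(i))
    return s - p + 1.0

print(ackley([0.0] * 15), rastrigin([0.0] * 15), griewangk([0.0] * 15))  # all ~0 at the optimum
```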
Parameters for the algorithms were set so as to compare their performance under identical conditions. Each $x_i$ is encoded with 5 bits and l = 15, i.e., the search space is of size $2^{75}$. The population size is n = 350. For all the experiments, the probability of uniform crossover is 0.8 and the probability of mutation is below 0.1. We limited each algorithm to 100 iterations and report plots of the behavior of the process averaged over 20 runs.
As we mentioned earlier, for Boltzmann selection we have used the Cauchy annealing schedule (see (5.14)), in which we set $\beta_0 = 200$ and α = 1.01. For Tsallis selection too, we have used the same annealing schedule with identical parameters. In our preliminary simulations, q was kept constant and tested with various values. We then adopted a strategy from generalized simulated annealing in which one chooses an initial value $q_0$ and decreases q linearly to the value 1. This schedule of q gave better performance than keeping it constant. We report results for various values of $q_0$.
Figure 5.3: Performance of the evolutionary algorithm with Tsallis selection for various values of $q_0$ ($q_0$ = 3, 2, 1.5, 1.01) on the Ackley test function; best fitness versus generations.
From various simulations, we observed that when the problem size is small (for example, smaller values of l) all the selection mechanisms perform equally well. Boltzmann selection is effective when we increase the problem size. For Tsallis selection, we performed simulations with various values of $q_0$. Figure 5.3 shows the performance for the Ackley function for $q_0$ = 3, 2, 1.5 and 1.01, respectively, from which one can see
that the choice of $q_0$ is very important for the evolutionary algorithm with Tsallis selection, and that it varies with the problem at hand.
Figures 5.4, 5.5 and 5.6 show, for the three test functions respectively, the comparison of evolutionary algorithms based on Tsallis selection, Boltzmann selection and proportionate selection (best fitness versus generations). We report only the best behavior over the various values of $q_0$. From these simulation results, we conclude that the evolutionary algorithm based on the Tsallis canonical distribution, with an appropriate value of $q_0$, outperforms those based on Boltzmann and proportionate selection.
6 Conclusions
Abstract
In this concluding chapter we summarize the results of the Dissertation, with an emphasis on novelties and on new problems suggested by this research.
Information theory based on the Shannon entropy functional has found applications that cut across a myriad of fields because of its established mathematical significance, i.e., its elegant mathematical properties. Shannon (1956) too emphasized that "the hard core of information theory is, essentially, a branch of mathematics" and "a thorough understanding of the mathematical foundation . . . is surely a prerequisite to other applications." Given that "the hard core of information theory is a branch of mathematics," one could expect formal generalizations of information measures to take place, just as would be the case for any other mathematical concept.
At the outset of this Dissertation we noted from (Rényi, 1960; Csiszár, 1974) that generalizations of information measures should be indicated by their operational significance (pragmatic approach) and by a set of natural postulates characterizing them (axiomatic approach). In the literature ranging from mathematics to physics, and from information theory to machine learning, one can find various operational and axiomatic justifications of the generalized information measures. In this thesis, we investigated some properties of generalized information measures and of their maximum and minimum entropy prescriptions pertaining to their mathematical significance.
Rényi's formalism offers generalizations and characterizations of information measures in terms of axioms of quasilinear means. In Chapter 2, we studied this technique for nonextensive entropy and showed that Tsallis entropy is unique under Rényi's recipe. Assuming that any putative candidate for an entropy should be a mean (Rényi, 1961), and in light of attempts to study ME-prescriptions of information measures where constraints are specified using KN-averages (e.g., Czachor & Naudts, 2002), the results presented in this thesis further establish the relation between entropy functionals and generalized means.
Measure-theoretic formulations
GYP-theorem
The detailed study of Tsallis relative-entropy minimization in the case of normalized q-expected values, and the computation of the corresponding minimum relative-entropy distribution (where one has to address the self-referential nature of the probabilities) based on the Tsallis et al. (1998) and Martínez et al. (2000) formalisms for Tsallis entropy maximization, is currently under investigation. Considering the various fields to which Tsallis generalized statistics has been applied, studies of applications of Tsallis relative-entropy minimization to various inference problems are of particular relevance.
In machine learning, Θ is usually a space of statistical models, $\{p(x;\theta) : \theta \in \Theta\}$ in the generative case or $\{p(y|x;\theta) : \theta \in \Theta\}$ in the discriminative case. Learning algorithms select a model θ ∈ Θ based on the training examples $\{x_k\}_{k=1}^{n} \subset X$ or $\{(x_k, y_k)\}_{k=1}^{n} \subset X \times Y$, depending on whether the generative or the discriminative case is considered.
Applying differential geometry, the mathematical theory of geometry on smooth, locally Euclidean spaces, to spaces of probability distributions, and hence to statistical models, is a fundamental technique in information geometry. Information plays two roles in it: Kullback-Leibler relative-entropy features as a measure of divergence, and Fisher information takes the role of curvature.
The ME-principle enters information geometry for two reasons. One is the Pythagoras theorem of relative-entropy minimization. The other is due to the work of Amari (2001), who showed that ME distributions are exactly the ones with minimal interaction between their variables, that is, those closest to independence. This result plays an important role in geometric approaches to machine learning.
Now, equipped with the nonextensive Pythagoras theorem in the generalized case of Tsallis, it is interesting to ask what geometry results when we use generalized information measures, and what role the entropic index plays in that geometry.
Another open problem in generalized information measures is the kind of constraints one should use for the ME-prescriptions. At present, ME-prescriptions for Tsallis entropy come in three flavors, corresponding to the kind of constraints one uses to derive the canonical distribution: conventional expectations (Tsallis, 1988), q-expectation values (Curado-Tsallis, 1991), and normalized q-expectation values (Tsallis-Mendes-Plastino, 1998). The question of which constraints to use remains open and has so far been addressed only in the context of thermodynamics.
Boghosian (1996) suggested that the entropy functional and the constraints one uses should be regarded as axioms, whose validity is to be decided solely by the conclusions to which they lead and, ultimately, by comparison with experiment. A practical study along these lines, in problems of estimating probability distributions using ME of Tsallis entropy, might throw some light on this question.
Moving on to another problem, we have noted that Tsallis entropy can be written as a Kolmogorov-Nagumo function of Rényi entropy. We have also seen that the same function is KN-equivalent to the function used in the generalized averaging of Hartley information to derive Rényi entropy. This suggests the possibility that generalized averages play a role in describing the operational significance of Tsallis entropy, an explanation for which still eludes us.
Finally, though the Rényi information measure offers a very natural, and perhaps conceptually the cleanest, setting for the generalization of entropy, and while the generalization of Tsallis entropy too can be put in a somewhat formal setting with q-generalizations of functions, we still do not fully understand the complete relevance, in the operational, axiomatic, and mathematical senses, of the entropic indices α in Rényi and q in Tsallis. This is easily the most challenging problem before us.
Bibliography
Aarts, E., & Korst, J. (1989). Simulated Annealing and Boltzmann Machines–A
Stochastic Approach to Combinatorial Optimization and Neural Computing.
Wiley, New York.
Abe, S., & Suzuki, N. (2004). Scale-free network of earthquakes. Europhysics Letters,
65(4), 581–586.
Abe, S. (2000). Axioms and uniqueness theorem for Tsallis entropy. Physics Letters
A, 271, 74–79.
Aczél, J. (1948). On mean values. Bull. Amer. Math. Soc., 54, 392–400.
Aczél, J., & Daróczy, Z. (1975). On Measures of Information and Their Characteri-
zation. Academic Press, New York.
Agmon, N., Alhassid, Y., & Levine, R. D. (1979). An algorithm for finding the distri-
bution of maximal entropy. Journal of Computational Physics, 30, 250–258.
Amari, S., & Nagaoka, H. (2000). Methods of Information Geometry, Vol. 191 of
Translations of Mathematical Monographs. Oxford University Press, Oxford.
Andricioaei, I., & Straub, J. E. (1996). Generalized simulated annealing algorithms us-
ing Tsallis statistics: Application to conformational optimization of a tetrapep-
tide. Physical Review E, 53(4), 3055–3058.
Andricioaei, I., & Straub, J. E. (1997). On Monte Carlo and molecular dynamics meth-
ods inspired by Tsallis statistics: Methodology, optimization, and application to
atomic clusters. J. Chem. Phys., 107(21), 9117–9124.
Arimitsu, T., & Arimitsu, N. (2000). Tsallis statistics and fully developed turbulence.
J. Phys. A: Math. Gen., 33(27), L235.
Athreya, K. B. (1994). Entropy maximization. IMA preprint series 1231, Institute for
Mathematics and its Applications, University of Minnesota, Minneapolis.
Back, T., Hammel, U., & Schwefel, H.-P. (1997). Evolutionary computation: Com-
ments on the history and current state. IEEE Transactions on Evolutionary Com-
putation, 1(1), 3–17.
Bashkirov, A. G. (2004). Maximum Rényi entropy principle for systems with power-
law hamiltonians. Physical Review Letters, 93, 130601.
Ben-Bassat, M., & Raviv, J. (1978). Rényi’s entropy and the probability of error. IEEE
Transactions on Information Theory, IT-24(3), 324–331.
Bhattacharyya, A. (1946). On some analogues of the amount of information and their
use in statistical estimation. Sankhya, 8, 1–14.
Billingsley, P. (1965). Ergodic Theory and Information. John Wiley & Sons, Toronto.
Borland, L., Plastino, A. R., & Tsallis, C. (1998). Information gain within nonexten-
sive thermostatistics. Journal of Mathematical Physics, 39(12), 6490–6501.
Bounds, D. G. (1987). New optimization methods from physics and biology. Nature,
329, 215.
Campbell, L. L. (1985). The relation between information theory and the differential
geometry approach to statistics. Information Sciences, 35(3), 195–210.
Caticha, A., & Preuss, R. (2004). Maximum entropy and Bayesian data analysis:
Entropic prior distributions. Physical Review E, 70, 046127.
Cercueil, A., & Francois, O. (2001). Monte Carlo simulation and population-based op-
timization. In Proceedings of the 2001 Congress on Evolutionary Computation
(CEC2001), pp. 191–198. IEEE Press.
Cerf, R. (1996a). The dynamics of mutation-selection algorithms with large population
sizes. Ann. Inst. H. Poincaré, 32, 455–508.
Costa, J. A., Hero, A. O., & Vignat, C. (2002). A characterization of the multivariate
distributions maximizing Rényi entropy. In Proceedings of IEEE International
Symposium on Information Theory(ISIT), pp. 263–263. IEEE Press.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley, New
York.
Cover, T. M., Gacs, P., & Gray, R. M. (1989). Kolmogorov’s contributions to infor-
mation theory and algorithmic complexity. The Annals of Probability, 17(3),
840–865.
Davis, H. (1941). The Theory of Econometrics. Principia Press, Bloomington, IN.
de Finetti, B. (1931). Sul concetto di media. Giornale di Istituto Italiano dei Attuarii,
2, 369–396.
de la Maza, M., & Tidor, B. (1993). An analysis of selection procedures with particular
attention paid to proportional and Boltzmann selection. In Forrest, S. (Ed.),
Proceedings of the Fifth International Conference on Genetic Algorithms, pp.
124–131 San Mateo, CA. Morgan Kaufmann Publishers.
Dukkipati, A., Bhatnagar, S., & Murty, M. N. (2006b). On measure theoretic defini-
tions of generalized information measures and maximum entropy prescriptions.
arXiv:cs.IT/0601080. (Submitted to Physica A).
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2004). Cauchy annealing sched-
ule: An annealing schedule for Boltzmann selection scheme in evolutionary
algorithms. In Proceedings of the IEEE Congress on Evolutionary Computa-
tion(CEC), Vol. 1, pp. 55–62. IEEE Press.
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2005a). Information theoretic justi-
fication of Boltzmann selection and its generalization to Tsallis case. In Pro-
ceedings of the IEEE Congress on Evolutionary Computation(CEC), Vol. 2, pp.
1667–1674. IEEE Press.
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2006a). Nonextensive triangle equality
and other properties of Tsallis relative-entropy minimization. Physica A, 361,
124–138.
Ebanks, B., Sahoo, P., & Sander, W. (1998). Characterizations of Information Mea-
sures. World Scientific, Singapore.
Epstein, L. G., & Zin, S. E. (1989). Substitution, risk aversion and the temporal behav-
ior of consumption and asset returns: A theoretical framework. Econometrica,
57, 937–970.
Ferri, G. L., Martı́nez, S., & Plastino, A. (2005). The role of constraints in Tsallis’
nonextensive treatment revisited. Physica A, 347, 205–220.
Forte, B., & Ng, C. T. (1973). On a characterization of the entropies of type β. Utilitas
Math., 4, 193–205.
Furuichi, S. (2005). On uniqueness theorem for Tsallis entropy and Tsallis relative
entropy. IEEE Transactions on Information Theory, 51(10), 3638–3645.
Furuichi, S., Yanagi, K., & Kuriyama, K. (2004). Fundamental properties of Tsallis
relative entropy. Journal of Mathematical Physics, 45, 4868–4877.
Gelfand, I. M., Kolmogorov, N. A., & Yaglom, A. M. (1956). On the general definition
of the amount of information. Dokl. Akad. Nauk USSR, 111(4), 745–748. (In
Russian).
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6(6),
721–741.
Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 24(2), 158–171.
Good, I. J. (1963). Maximum entropy for hypothesis formulation, especially for mul-
tidimensional contingency tables. Ann. Math. Statist., 34, 911–934.
Grendár jr, M., & Grendár, M. (2001). Maximum entropy: Clearing up mysteries.
Entropy, 3(2), 58–63.
Halsey, T. C., Jensen, M. H., Kadanoff, L. P., Procaccia, I., & Shraiman, B. I. (1986).
Fractal measures and their singularities: The characterization of strange sets.
Physical Review A, 33, 1141–1151.
Hobson, A. (1969). A new theorem of information theory. J. Stat. Phys., 1, 383–391.
Hobson, A. (1971). Concepts in Statistical Mechanics. Gordon and Breach, New York.
Ireland, C., & Kullback, S. (1968). Contingency tables with given marginals.
Biometrika, 55, 179–188.
Jaynes, E. T. (1957b). Information theory and statistical mechanics ii. Physical Review,
108(4), 171–190.
Jizba, P., & Arimitsu, T. (2004a). Observability of Rényi’s entropy. Physical Review
E, 69, 026128.
Jizba, P., & Arimitsu, T. (2004b). The world according to Rényi: thermodynamics of
fractal systems. Annals of Physics, 312, 17–59.
Johnson, O., & Vignat, C. (2005). Some results concerning maximum Rényi entropy
distributions. math.PR/0507400.
Johnson, R., & Shore, J. (1983). Comments on and correction to 'Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy'. IEEE Transactions on Information Theory, 29(6), 942–943.
Kapur, J. N. (1994). Measures of Information and their Applications. Wiley, New
York.
Kapur, J. N., & Kesavan, H. K. (1997). Entropy Optimization Principles with Appli-
cations. Academic Press.
Karmeshu, & Sharma, S. (2006). Queue length distribution of network packet traffic: Tsallis entropy maximization with fractional moments. IEEE Communications Letters, 10(1), 34–36.
Kreps, D. M., & Porteus, E. L. (1978). Temporal resolution of uncertainty and dynamic
choice theory. Econometrica, 46, 185–200.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Ann. Math.
Stat., 22, 79–86.
Lavenda, B. H. (1998). The analogy between coding theory and multifractals. Journal
of Physics A: Math. Gen., 31, 5651–5660.
Mahnig, T., & Mühlenbein, H. (2001). A new adaptive Boltzmann selection schedule
sds. In Proceedings of the Congress on Evolutionary Computation (CEC’2001),
pp. 183–190. IEEE Press.
Martı́nez, S., Nicolás, F., Pennini, F., & Plastino, A. (2000). Tsallis’ entropy maxi-
mization procedure revisited. Physica A, 286, 489–502.
Mead, L. R., & Papanicolaou, N. (1984). Maximum entropy in the problem of mo-
ments. Journal of Mathematical Physics, 25(8), 2404–2417.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equa-
tion of state calculation by fast computing machines. Journal of Chemical
Physics, 21, 1087–1092.
Morales, D., Pardo, L., Pardo, M. C., & Vajda, I. (2004). Rényi statistics for testing
composite hypotheses in general exponential models. Journal of Theoretical
and Applied Statistics, 38(2), 133–147.
Moret, M. A., Pascutti, P. G., Bisch, P. M., & Mundim, K. C. (1998). Stochastic molec-
ular optimization using generalized simulated annealing. J. Comp. Chemistry,
19, 647.
Mühlenbein, H., & Schlierkamp-Voosen, D. (1993). Predictive models for the breeder
genetic algorithm. Evolutionary Computation, 1(1), 25–49.
Nagumo, M. (1930). Über eine klasse von mittelwerte. Japanese Journal of Mathe-
matics, 7, 71–79.
Nivanen, L., Méhauté, A. L., & Wang, Q. A. (2003). Generalized algebra within a
nonextensive statistics. Rep. Math. Phys., 52, 437–434.
Norries, N. (1976). General means and statistical theory. The American Statistician,
30, 1–12.
Ormoneit, O., & White, H. (1999). An efficient algorithm to compute maximum en-
tropy densities. Econometric Reviews, 18(2), 127–140.
Ostasiewicz, S., & Ostasiewicz, W. (2000). Means and their applications. Annals of
Operations Research, 97, 337–355.
Prügel-Bennett, A., & Shapiro, J. (1994). Analysis of genetic algorithms using statis-
tical mechanics. Physical Review Letters, 72(9), 1305–1309.
Rényi, A. (1959). On the dimension and entropy of probability distributions. Acta
Math. Acad. Sci. Hung., 10, 193–215. (reprinted in (Turán, 1976), pp. 320-342).
Rényi, A. (1960). Some fundamental questions of information theory. MTA III. Oszt.
Közl., 10, 251–282. (reprinted in (Turán, 1976), pp. 526-552).
Rényi, A. (1965). On the foundations of information theory. Rev. Inst. Internat. Stat.,
33, 1–14. (reprinted in (Turán, 1976), pp. 304-317).
Shore, J. E., & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, IT-26(1), 26–37. (See (Johnson & Shore, 1983) for comments and corrections.)
Shore, J. E., & Gray, R. M. (1982). Minimum cross-entropy pattern classification and
cluster analysis. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 4(1), 11–18.
Sutton, P., Hunter, D. L., & Jan, N. (1994). The ground state energy of the ±j spin
glass from the genetic algorithm. Journal de Physique I France, 4, 1281–1285.
Suyari, H., & Tsukada, M. (2005). Law of error in Tsallis statistics. IEEE Transactions
on Information Theory, 51(2), 753–757.
Teweldeberhan, A. M., Plastino, A. R., & Miller, H. G. (2005). On the cut-off prescrip-
tions associated with power-law generalized thermostatistics. Physics Letters A,
343, 71–78.
Tikochinsky, Y., Tishby, N. Z., & Levine, R. D. (1984). Consistent inference of prob-
abilities for reproducible experiments. Physical Review Letters, 52, 1357–1360.
Topsøe, F. (2001). Basic concepts, identities and inequalities - the toolkit of informa-
tion theory. Entropy, 3, 162–190.
Tsallis, C. (1994). What are the numbers that experiments provide?. Quimica Nova,
17, 468.
Tsallis, C., & de Albuquerque, M. P. (2000). Are citations of scientific papers a case
of nonextensivity?. Eur. Phys. J. B, 13, 777–780.
Tsallis, C., Levy, S. V. F., Souza, A. M. C., & Maynard, R. (1995). Statistical-
mechanical foundation of the ubiquity of lévy distributions in nature. Physical
Review Letters, 75, 3589–3593.
Tsallis, C., Mendes, R. S., & Plastino, A. R. (1998). The role of constraints within
generalized nonextensive statistics. Physica A, 261, 534–554.
Turán, P. (Ed.). (1976). Selected Papers of Alfréd Rényi. Akademia Kiado, Budapest.
Uffink, J. (1996). The constraint rule of the maximum entropy principle. Studies in
History and Philosophy of Modern Physics, 27, 47–79.
Vignat, C., Hero, A. O., & Costa, J. A. (2004). About closedness by convolution of
the Tsallis maximizers. Physica A, 340, 147–152.
Wada, T., & Scarfone, A. M. (2005). Connections between Tsallis' formalism employing the standard linear average energy and ones employing the normalized q-average energy. Physics Letters A, 335, 351–362.
Watanabe, S. (1969). Knowing and Guessing. Wiley.
Wehrl, A. (1991). The many facets of entropy. Reports on Mathematical Physics, 30,
119–129.
Yu, Z. X., & Mo, D. (2003). Generalized simulated annealing algorithm applied in the
ellipsometric inversion problem. Thin Solid Films, 425, 108.