
Probabilistic Latent Factor Induction and

Statistical Factor Analysis

A Comparison of Methods

Stefan Conrady, stefan.conrady@conradyscience.com

Dr. Lionel Jouffe, jouffe@bayesia.com

April 7, 2011

Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting

Table of Contents

Introduction
About the Authors
  Stefan Conrady
  Lionel Jouffe

Key Concepts from Information Theory
  Entropy
  Chain Rule Theorem
  Conditional Entropy
  Mutual Information
  Relative Entropy (Kullback-Leibler Divergence)
  Example 1
  Example 2

Comparison of Methods
  Approach
  Notation
  Key Terminology
  Data Set

Probabilistic Latent Factor Induction with BayesiaLab
  Data Import
  Variable Clustering
  Latent Factor Induction

Statistical Factor Analysis
  Factor Analysis with STATISTICA

Conclusion

References

Contact Information
  Conrady Applied Science, LLC
  Bayesia SAS

Copyright


Introduction

Bayesian networks have been gaining prominence among scientists over the past decade, and the new insights generated by this powerful research approach can now be found in studies that circulate well beyond the academic communities. As a result, many practitioners and managerial decision-makers encounter more and more references to Bayesian networks in all kinds of scientific and business research, ranging from biostatistics to marketing analytics.

It is not surprising that the new Bayesian network paradigm prompts comparisons to more conventional methods. In the field of market research, for instance, long-established methods, such as factor analysis, remain in daily use today. Given that there exists a direct counterpart to factor analysis in the Bayesian network framework, we want to highlight similarities as well as fundamental differences. The objective of this paper is to present both methods side-by-side and thus help researchers to correctly compare and interpret the respective results. More specifically, we want to establish the semantic equivalences between the traditional statistical factor analysis approach and BayesiaLab's method based on Bayesian networks, which we refer to as Probabilistic Latent Factor Induction.

Factor Analysis is a statistical method used to describe variability among observed variables in terms of a potentially
lower number of unobserved variables called factors. It is possible, for example, that variations in three or four observed variables mainly reflect the variations in a single unobserved variable, or in a reduced number of unobserved
variables. The observed variables can be seen as manifestations of abstract underlying (and unobserved) dimensions or
(latent) factors.

Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other applied sciences that deal with a large number of variables in their data.

Probabilistic Latent Factor Induction is a workflow within the BayesiaLab software package, which has the same objective as traditional factor analysis, i.e. variable reduction, but works entirely within the framework of Bayesian networks and is based on principles derived from information theory.

It is important to point out that this comparison is not meant to favor one approach over the other (and to declare a winner and a loser), although it is clearly in the authors' interest to promote Bayesian networks in general and BayesiaLab in particular. Rather, this paper should serve as a reference for research practitioners and for those who use research results in their decision-making processes, so that they can correctly interpret insights generated with either approach.


About the Authors

Stefan Conrady
Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting
firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied
Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America.

Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing, and analytics, working at Daimler and BMW Group in Europe, North America, and Asia. Prior to establishing his own firm, he headed the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe
Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia's strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.


Key Concepts from Information Theory


Before we proceed to the direct comparison of methods, it is important to establish several key concepts relating to the
knowledge representation in Bayesian networks.

Entropy
The concept of entropy provides the underpinning for all structural learning and analysis algorithms in BayesiaLab.
Entropy measures the uncertainty inherent in the distribution of a random variable.

The entropy H(X) of a random variable X is defined as:

H(X) = − Σ_{x∈X} p(x) log2 p(x),

where x ranges over the states that variable X can take. Note that the logarithm is taken to base 2, so the entropy is expressed in bits.

An example can perhaps illustrate this: If variable X represents the outcome of a coin toss, X can have one of two states, Heads and Tails, i.e. the set of potential outcomes is X={Heads, Tails}. Given that the coin toss is fair, the probabilities of Heads and Tails will both be 0.5, i.e. p(Heads)=0.5 and p(Tails)=0.5.

We can now compute the entropy H(Xfair), based on these values:

H(Xfair) = − p(Heads) log2 p(Heads) − p(Tails) log2 p(Tails)
         = − 0.5 log2 0.5 − 0.5 log2 0.5 = 0.5 + 0.5 = 1 bit

This means our uncertainty prior to a fair coin toss is equivalent to an entropy value of 1 bit, which is the maximum
entropy due to the uniform distribution of the variable with two states.

If we had a biased coin instead with p(Heads)=0.7 and p(Tails)=0.3, it is intuitive to think that the uncertainty would be
lower as one state of the coin toss will be more probable and, indeed, computing the entropy H(Xbiased) yields a lower
value.

H(Xbiased) = − 0.7 log2 0.7 − 0.3 log2 0.3 ≈ 0.881 bits

To complete this idea, we can also plot H(X) as a function of the bias, p(Heads)=1−p(Tails), with p(Heads)∈[0,1], i.e. ranging from impossible, p(Heads)=0, to certain, p(Heads)=1.


Figure: Entropy H(X) as a function of p(Heads), rising from 0 at p(Heads)=0 to its maximum of 1 bit at p(Heads)=0.5 and falling back to 0 at p(Heads)=1.

Clearly, anything other than a perfectly fair coin reduces the entropy and thus our uncertainty regarding the outcome of
the coin toss.
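Readers who wish to verify these values can do so with a few lines of code. The following minimal Python sketch (purely illustrative and not part of BayesiaLab) reproduces the entropy of the fair and the biased coin.

import math

def entropy(probs):
    """Entropy H(X) in bits for a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(entropy([0.7, 0.3]))  # biased coin -> approx. 0.881 bits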

Chain Rule Theorem


The chain rule for joint entropy states that the total uncertainty about the value of X and Y is equal to the uncertainty
about X plus the (average) uncertainty about Y once you know X.

H(X,Y) = H(X) + H(Y|X)

The proof of this theorem follows:

H(X,Y) = − Σ_{y∈Y} Σ_{x∈X} p(x,y) log2 p(x,y)

       = − Σ_{y∈Y} Σ_{x∈X} p(x,y) log2 [p(y|x) p(x)]

       = − Σ_{y∈Y} Σ_{x∈X} p(x,y) log2 p(y|x) − Σ_{y∈Y} Σ_{x∈X} p(x,y) log2 p(x)

       = − Σ_{y∈Y} Σ_{x∈X} p(x,y) log2 p(y|x) − Σ_{x∈X} p(x) log2 p(x)

       = H(Y|X) + H(X)

Conditional Entropy
Perhaps the single most important concept for computations in BayesiaLab is conditional entropy. Conditional entropy
refers to the entropy of a random variable when we have information on another variable.

The conditional entropy H(Y|X) is defined as


H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)

       = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log2 p(y|x)

       = − Σ_{x∈X} Σ_{y∈Y} p(x,y) log2 p(y|x)

The conditional entropy of Y given X is thus the expected entropy of Y, averaged over the values of X.

Mutual Information
The mutual information I(X,Y) measures how much (on average) the observation of random variable Y tells us about X, i.e. by how much the entropy of X is reduced if we have information on Y.

I(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

Note that the mutual information is a symmetric measure: it reflects the reduction in uncertainty about X from knowing Y as well as the reduction in uncertainty about Y from knowing X.
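The chain rule and the symmetry of the mutual information can also be checked numerically. The sketch below uses a small, arbitrary joint distribution over two binary variables; the numbers are illustrative only and are not taken from the Perfume Study.

import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative joint distribution p(x, y) over two binary variables.
p_xy = {('x0', 'y0'): 0.30, ('x0', 'y1'): 0.20,
        ('x1', 'y0'): 0.10, ('x1', 'y1'): 0.40}
p_x = {'x0': 0.50, 'x1': 0.50}   # marginal distribution of X
p_y = {'y0': 0.40, 'y1': 0.60}   # marginal distribution of Y

H_xy = H(list(p_xy.values()))
H_x, H_y = H(list(p_x.values())), H(list(p_y.values()))

# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x).
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

print(abs(H_xy - (H_x + H_y_given_x)) < 1e-12)  # chain rule: H(X,Y) = H(X) + H(Y|X)
print(H_x + H_y - H_xy)                         # mutual information I(X,Y) = H(X) + H(Y) - H(X,Y)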

Relative Entropy (Kullback-Leibler Divergence)


A closely related concept is the relative entropy, also referred to as the Kullback-Leibler Divergence (DKL) and sometimes, somewhat loosely, called cross entropy. The Kullback-Leibler Divergence is a measure of the difference between two probability distributions p and q.

For probability distributions p and q of a discrete random variable X, their K–L divergence is defined to be

DKL(p(X) || q(X)) = Σ_{x∈X} p(x) log2 [p(x) / q(x)]

In words, it is the expected value, taken with respect to p, of the logarithmic difference between the probability distributions p(X) and q(X). In contrast to the mutual information, the relative entropy is not symmetric.

Example 1
We once again use tossing coins as an example. By default, we would expect any given coin to be fair and assume a model q(Heads)=q(Tails)=0.5. As it turns out, in repeated coin tosses we observe a probability of p(Heads)=0.75 and of p(Tails)=0.25. We can now use the Kullback-Leibler Divergence to establish the “distance” or “distortion” between the originally assumed distribution q(x) and the observed distribution p(x).

DKL(p(X) || q(X)) = Σ_{x∈X} p(x) log2 [p(x) / q(x)]

                  = p(Heads) log2 [p(Heads) / q(Heads)] + p(Tails) log2 [p(Tails) / q(Tails)]

                  = 0.75 log2 (0.75 / 0.5) + 0.25 log2 (0.25 / 0.5)

                  = 0.188722 bits
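As a quick check of this result, a Kullback-Leibler Divergence can be computed in a few lines. The Python sketch below is illustrative only and reproduces the value for the biased coin.

import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions given as lists of
    probabilities over the same states."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Observed biased coin p versus the assumed fair coin q.
print(kl_divergence([0.75, 0.25], [0.5, 0.5]))  # approx. 0.1887 bits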


Example 2
For another illustration we use an example from the field of meteorology. More specifically, we look at the rainfall in two cities in the state of Victoria, Australia. We used daily rainfall data measured at Geelong Airport and at Melbourne Tullamarine Airport, which are approximately 80 kilometers apart, over the entire calendar year of 2010. Given the proximity of those locations, one would generally expect similar weather. Perhaps the Geelong weather isn't reported in the Melbourne newspapers, and so a traveler wants to use the Melbourne weather as a proxy. However, the actual weather station observations tell us that there is rain in Melbourne on 40.3% of the days, whereas Geelong sees rainfall on 47.4% of the days in the year.

We can now compute the Kullback-Leibler Divergence for these two distributions, where pGeelong(x) and pMelbourne(x) stand for the Geelong and Melbourne rain probability distributions, respectively.

DKL(pGeelong(X) || pMelbourne(X)) = Σ_{x∈X} pGeelong(x) log2 [pGeelong(x) / pMelbourne(x)]

  = pGeelong(No Rain) log2 [pGeelong(No Rain) / pMelbourne(No Rain)] + pGeelong(Rain) log2 [pGeelong(Rain) / pMelbourne(Rain)]

  = 0.526 log2 (0.526 / 0.597) + 0.474 log2 (0.474 / 0.403) = 0.0148958 bits

DKL(pMelbourne(X) || pGeelong(X)) = Σ_{x∈X} pMelbourne(x) log2 [pMelbourne(x) / pGeelong(x)]

  = pMelbourne(No Rain) log2 [pMelbourne(No Rain) / pGeelong(No Rain)] + pMelbourne(Rain) log2 [pMelbourne(Rain) / pGeelong(Rain)]

  = 0.597 log2 (0.597 / 0.526) + 0.403 log2 (0.403 / 0.474) = 0.0147077 bits
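The same calculation, spelled out in code for the rainfall example, also makes the asymmetry of the relative entropy explicit: the two directions give slightly different values. The sketch below is illustrative only.

import math

# State order: [No Rain, Rain]
p_geelong   = [0.526, 0.474]
p_melbourne = [0.597, 0.403]

d_g_m = sum(p * math.log2(p / q) for p, q in zip(p_geelong, p_melbourne))
d_m_g = sum(q * math.log2(q / p) for p, q in zip(p_geelong, p_melbourne))
print(d_g_m)  # D_KL(Geelong || Melbourne)   approx. 0.0149 bits
print(d_m_g)  # D_KL(Melbourne || Geelong)   approx. 0.0147 bits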

BayesiaLab's primary metric, the Arc Force, is directly proportional to the relative entropy and describes the strength of the directed link between two variables. More specifically, it measures the difference between the joint probability distributions represented by the network with and without the particular arc.


Comparison of Methods

Approach
We believe that we can best facilitate a comparison of statistical factor analysis and probabilistic latent factor induction by working through an example. We draw upon the familiar dataset from the previously presented case study from the perfume industry, hereafter referred to as the “Perfume Study.” 1

We begin our tutorial with the Data Import process for BayesiaLab, although it is not meant to be at the core of the
comparison. It is important though to spell out the data pre-processing steps in BayesiaLab, as they highlight some of
the fundamental differences between probabilistic and statistical approaches.

Once the data preparation is complete, we first present the probabilistic latent factor induction workflow with
BayesiaLab and then provide an example of a statistical factor analysis. For the statistical factor analysis, we will use
STATISTICA 10 as the software platform, although most steps are fairly generic and could be reproduced with a number of other statistical software packages as well.

Notation
To clearly distinguish between natural language, software-specific functions and study-specific variable names, the following notation is used:

• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.

• Names of attributes, variables, nodes, and factors are italicized.

• At appropriate points in the text, grey boxes highlight parallels between the two presented methods:

Probabilistic Latent Factor Induction | Statistical Factor Analysis

Key Terminology
• “Observed” and “manifest” are used interchangeably and describe those random variables that have been measured by the researcher.

• The terms “latent” or “unobserved” are used interchangeably in the context of hidden concepts or factors, which
cannot be measured, but can potentially be extracted or induced. In our context, the term factor stands exclusively for
latent variables. Consequently, the terms “factor”, “factor variable”, “latent variable” and “unobserved variable” are
equivalent.

1 Conrady and Jouffe (2010)


Data Set
The Perfume Study is based on a monadic consumer survey about a range of fragrances, which was conducted in France. In this example we use survey responses from 1,321 women who evaluated a total of 11 fragrances on a wide range of attributes:

• 27 ratings on fragrance-related attributes, such as “sweet”, “flowery”, “feminine”, etc., measured on a 1-to-10 scale.

• 12 ratings on projected imagery related to someone who would be wearing the respective fragrance, e.g. “is sexy”, “is modern”, measured on a 1-to-10 scale.

• 1 variable for Intensity, reflecting the perceived intensity of the fragrance, measured on a 1-to-5 scale.

• 1 variable for Purchase Intent, measured on a 1-to-6 scale.

• 1 nominal variable, Product, for product identification purposes.


Probabilistic Latent Factor Induction with BayesiaLab

Data Import
To start the process with BayesiaLab, we first import the data set, which is formatted as a CSV file.2 With Data>Open
Data Source>Text File, we start the Data Import wizard, which immediately provides a preview of the data file.

The table displayed in the Data Import wizard shows the individual variables as columns and the survey responses as rows. There are a number of options available, e.g. for sampling; however, sampling is not necessary in our example given the relatively small size of the database.

Clicking the Next button prompts a data type analysis, which provides BayesiaLab's best guess regarding the data type of each variable.

Furthermore, the Information box provides a brief summary regarding the number of records, the number of missing
values, filtered states, etc.3

2 CSV stands for “comma-separated values”, a common format for text-based data files.
3 There are no missing values in our database and filtered states are not applicable in this survey.


For this example, we will need to override the default data type for the Product variable, as each value is a nominal
product identifier rather than a numerical scale value. We can change the data type by highlighting the Product variable
and clicking the Discrete check box, which changes the color of the Product column to red.

We will also define Purchase Intent and Intensity as discrete variables, as the default number of states of these variables
is already adequate for our purposes.4

The next screen provides options as to how to treat any missing values. In our case, there are no missing values so the
corresponding panel is grayed-out.

Clicking the small upside-down triangle next to the variable names brings up a window with key statistics of the
selected variable, in this case Fresh.

4 The desired number of variable states is largely a function of the analyst’s judgment.


The next step is the Discretization and Aggregation dialogue, which allows the analyst to determine the type of discretization to be performed on all continuous variables.5 For this survey, and given the number of observations, it is appropriate to reduce the number of states from the original 10 states (1 through 10) to a smaller number. One could, for instance, bin the 1-to-10 ratings into low, mid, and high, or apply any other method deemed appropriate by the analyst.

The screenshot shows the dialogue for the Manual selection of discretization steps, which allows binning thresholds to be selected by point-and-click.

5 BayesiaLab requires discrete distributions for all variables.


Note

For choosing discretization algorithms beyond this example, the following rule of thumb may be helpful:

• For supervised learning, choose Decision Tree.

• For unsupervised learning, choose, in the order of priority, K-Means, Equal Distances or Equal Frequencies.

For this particular example, we select Equal Distances with 5 intervals for all continuous variables. This was the
analyst’s choice in order to be consistent with prior research.
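Conceptually, the Equal Distances option corresponds to equal-width binning. The following Python sketch illustrates the idea on a hypothetical column of 1-to-10 ratings; it is an approximation of the concept, not BayesiaLab's actual implementation.

import numpy as np

ratings = np.array([1, 3, 4, 6, 7, 7, 9, 10, 2, 8])  # hypothetical 1-to-10 ratings

# Equal-width ("Equal Distances") binning into 5 intervals spanning the observed range.
edges = np.linspace(ratings.min(), ratings.max(), num=6)  # 5 bins -> 6 edges
states = np.digitize(ratings, edges[1:-1], right=True)    # discrete state 0..4 per record
print(edges)
print(states)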

Clicking Select All Continuous followed by Finish completes the import process, and the 49 variables (columns) from our database are now shown as blue nodes in the Graph Panel, which is the main window for network editing. By default, all variables are represented as nodes. This initial view represents a fully unconnected Bayesian network.


In the above graph, two variables play fundamentally different roles. The values of Product represent categories, and Purchase Intent is the overall target variable, i.e. the dependent variable of the Perfume Study. Thus both will be excluded from the factor generation process.

While correlation and covariance are the central measures for statistical factor analysis, learning Bayesian networks with BayesiaLab (and thus probabilistic factor induction) is based on measures from information theory, such as the Kullback-Leibler Divergence, which was introduced in the first chapter.

The Kullback-Leibler Divergence can be obtained after learning an initial Bayesian network with one of BayesiaLab’s
unsupervised learning algorithms. “Unsupervised” implies that the learning algorithm searches for an overall representation of the joint distribution of the underlying data rather than the characterization of an individual target variable.

In our example, we use BayesiaLab’s EQ algorithm to obtain a Bayesian network.


As the initial view of the network is not easily readable, we can use one of BayesiaLab's numerous built-in layout algorithms, of which the Force Directed Layout is perhaps the most commonly used. It can be invoked via View>Automatic Layout>Force Directed Layout or, alternatively, through the keyboard shortcut “p”.

The resulting network will look similar to the following screenshot.


Completed Bayesian Network upon EQ Learning

With the network established, we can now further examine the probabilistic relationships between the nodes, which are represented as arcs.6 By selecting Analysis>Graphic>Arc Force, we can show the probabilistic strength of the arcs, which is visualized by their thickness.

6 “Arcs” are directed links or edges between nodes, which appear as arrows in the graph.


Network with Arc Force

The numeric values of the Arc Force can be shown by selecting View>Display Arc Comments. In the network shown
below, the Arc Force values are presented in yellow boxes attached to each arc.


Network with Arc Force

Arc Force | Covariance


In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure
for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance
matrix play the equivalent role.


Variable Clustering
With Arc Force established as the key measure across the entire network, BayesiaLab can determine clusters of variables that are “close” in a probabilistic sense. This can be initiated from the menu via Analysis>Graphic>Variable Clustering.

The clustering algorithm is iterative and starts with the two variables whose connecting arc has the strongest Arc Force. The following sequence of screenshots illustrates this algorithm conceptually in “slow motion,” as the analyst would not see these individual steps in the actual workflow.

As a starting point, every manifest variable is treated as a distinct cluster, so we begin with 47 clusters. Using the Kullback-Leibler Divergence as a measure, the “closest” variables are then merged into one concept. As a result, we first obtain 46 clusters, then 45, etc., as shown in the array of dendrograms below. BayesiaLab proposes to conclude this algorithm upon finding 15 clusters. However, the analyst has the ability to override this automatic selection. As the choice of clusters appears to be generally compatible with our interpretation of the variable names, we accept this recommendation.
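As a rough analogue outside of BayesiaLab, the same idea can be illustrated with a generic agglomerative (hierarchical) clustering of variables. The sketch below uses a correlation-based dissimilarity on random placeholder data; BayesiaLab's own algorithm operates on the Arc Force, i.e. on the Kullback-Leibler Divergence, so this is a conceptual illustration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder ratings matrix: rows = respondents, columns = manifest variables.
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(200, 8)).astype(float)

# Dissimilarity between variables, derived here from the absolute correlation
# (a stand-in for BayesiaLab's Arc-Force-based notion of "closeness").
corr = np.corrcoef(X, rowvar=False)
dissimilarity = 1.0 - np.abs(corr)

# Condensed distance vector (upper triangle), then average-linkage clustering.
iu = np.triu_indices_from(dissimilarity, k=1)
Z = linkage(dissimilarity[iu], method="average")

# Cut the dendrogram into, for example, 3 variable clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # cluster id assigned to each of the 8 variables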


Sequence of Dendrograms (47, 46, 45, 44, …, 16, 15 clusters)

Because of the importance of this process, we will also show it from another angle, i.e. by looking at sequential views of
the graph.


Step 0 - 47 Clusters

Step 1 - 46 Clusters: Pleasure merged with Corresponds

The strongest Arc Force exists between Pleasure and Corresponds and BayesiaLab will form an interim concept from
them. The next-highest Arc Force then determines whether another variable is merged with the first concept or whether
a new concept is created. In our case, Radiant and In Love are combined as a new concept.


Step 2 - 45 Clusters: Radiant merged with In Love

In the third step, we see Sensual and Romantic merged into a new latent concept, and so on.

Step 3 - 44 Clusters: Sensual merged with Romantic

Upon completion of this process, BayesiaLab forms variable/node clusters from these common concepts and color-codes
them accordingly.


Network with Color-Coded Variable Clusters

By clicking the Validate Clustering button, we can now formally finalize the new latent factor variables. The new latent factors are shown in the following table with their associated observed variables. By default, they are given the name “Factor” plus a numeric suffix.


Latent Factor Induction


Upon definition of the new latent factor variables, we now want to make them
available for modeling purposes. Although these latent factors exist as new concepts
and are conceptually linked to the manifest variables, the factors do not yet have
any values or states.

This will now happen in the Multiple Clustering process, which creates discrete
states for each latent factor variable by performing data clustering over the linked
manifest variables.

More specifically, the states of each latent factor will be created in such a way that
they best summarize the joint probability distribution defined by the manifest variables. Factor 0 and its linked manifest variables are shown below.

Subnetwork for Factor 0


The following Monitors display the marginal probability distributions of the variables associated with Factor 0; Factor 0 itself and its states are highlighted in red. We can see that 5 states were created for Factor 0, labelled C1 through C5, and each has an expected value, which is shown in parentheses. For instance, state C2 has an expected value of 9.21. That means, given that C2 is observed, the mean value of the manifest variables, weighted by their relation with C2, is equal to 9.21. In other words, C2 corresponds to high ratings with regard to those 5 dimensions.
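To make the idea behind the Multiple Clustering step more concrete, the sketch below creates discrete states for a hypothetical factor by clustering respondents over the factor's manifest variables and reports a mean rating per state. This is a simplified, generic stand-in (k-means on random placeholder data), not BayesiaLab's actual probabilistic clustering, and the expected values BayesiaLab reports are additionally weighted by each variable's relation with the state.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder ratings for the manifest variables linked to one latent factor
# (rows = respondents, columns = that factor's manifest variables).
rng = np.random.default_rng(1)
ratings = rng.integers(1, 11, size=(300, 5)).astype(float)

# Derive a small number of discrete factor states by clustering the respondents.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(ratings)
states = kmeans.labels_  # one state per respondent, analogous to C1..C5

# A simple "expected value" per state: the mean rating over the linked manifest variables.
for s in range(5):
    print(s, ratings[states == s].mean())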

By selecting specific states of Factor 0 in the Monitor Panel, we can see the conditional distributions of the manifest
variables. The states C2 and C3 are displayed for reference below. They can be easily interpreted by looking at the associated values, e.g. state C2 appears to reflect high ratings of the manifest variables, whereas state C3 captures very low
ratings.

A more general analysis of the relationships between manifest variables and latent factors can be obtained through
Analysis>Reports>Relationship Analysis:

This chart summarizes the values of key clustering measures, such as the Kullback-Leibler Divergence, for every manifest variable associated with Factor 0. For reference only, it also includes Pearson's Correlation Coefficient R.
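Such a report can be approximated outside of BayesiaLab by computing, for each manifest variable, the mutual information with the discretized factor states. The sketch below does this on random placeholder data with scikit-learn; the variable names are borrowed from the study for readability only.

import numpy as np
from sklearn.metrics import mutual_info_score

# Placeholder discretized data: factor states and two manifest variables, 5 states each.
rng = np.random.default_rng(2)
factor_states = rng.integers(0, 5, size=500)
manifest = {"Pleasure": rng.integers(0, 5, size=500),
            "Corresponds": rng.integers(0, 5, size=500)}

# Mutual information between the factor and each manifest variable
# (scikit-learn reports it in nats); it plays a role comparable to a factor loading.
for name, values in manifest.items():
    print(name, mutual_info_score(factor_states, values))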


Relationship Analysis | Factor Loadings


This summary of clustering measures in the Relationship Analysis allows an interpretation very similar to the one provided by factor loadings.

It is also possible to visualize the mean values of the manifest variables (x-axis) along with the Mutual Information (y-
axis, left panel) and the Standardized Total Effect (y-axis, right panel) for the latent factor variable.

Although we have now defined new factor variables, we have not yet seen the original survey responses expressed in terms of the new factor variables. For instance, every respondent record has a value for Active, Fulfilled, Trust, etc., as these variables were observed and recorded in the survey, but how do we find the values (or states) of the new latent factors for each respondent record?

Actually, at the conclusion of the Multiple Clustering process, BayesiaLab has already introduced the new factors into the original network. By means of BayesiaLab's imputation process, which is based on maximum likelihood, they were added as new nodes to the graph and also saved as new columns (or fields) in the database.


Latent Factors Introduced into Network

Factor Induction | Saving Factor Scores


Introducing the new latent factors into the network is equivalent to adding the factor scores to the original observation matrix.

We can easily verify that each new factor has a value for each respondent record. We start Inference>Interactive Inference, which allows us to scroll through the survey records and view the values of any variable, including the values of the new latent factors.


For instance, survey record #0 is expressed as state C4 in terms of Factor 0. The states of the manifest variables are
shown for reference.

Record #8, for example, is assigned to state C3:

Now we have the entire set of respondent records re-expressed in terms of 15 latent factors, which allows us to use
them for all kinds of modeling purposes.


Given the importance of latent factors for interpretation, we will assign descriptive labels to each of them. BayesiaLab
can visually aid in this process by showing the latent factors and their relationships to the original manifest variables.
This means we will simply learn a new network that includes both factor variables and manifest variables.


Network including Latent Factors and Manifest Variables

The emerging network structure clearly lends itself to defining descriptive labels, which are applied to the factors in the
following graph.7

7 See Conrady and Jouffe (2010) for a more detailed explanation of the interpretation process.


Network including Latent Factors and Manifest Variables plus Factor Labels

It is important to reiterate that the latent factors generated here are not orthogonal, which means that probabilistic relationships exist between the factors. For illustration purposes, we can highlight the latent factors and exclude the manifest variables from the display. In addition, the following graph also displays the Arc Force between the latent factors, providing further confirmation that they are not independent.


Network with Latent Factors and Arc Forces


Statistical Factor Analysis


Perhaps the most common approach for extracting factors from a set of observed variables is Principal Components Analysis (PCA), and it is frequently treated as a synonym for factor analysis.8 For our purposes, we look at PCA as a prototypical tool for factor extraction, which lends itself to comparison with the latent factor induction with BayesiaLab presented earlier.

Principal Component Analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a
set of observations, represented by matrix X, of possibly correlated variables into a set of values of uncorrelated variables called principal components, to be represented by a new matrix Y. The goal of this transformation is to minimize
redundancy (measured by covariance) and to maximize the signal (measured by variance).

This transformation is defined in such a way that the first principal component has the highest possible variance, i.e.
accounting for as much of the variability in the data as possible. In turn, each succeeding component has the next-
highest variance while being orthogonal to (uncorrelated with) the preceding components.

Conceptual Illustration of Principal Component Vectors

More formally, PCA creates a re-expression of the original data set on the basis of a new set of orthonormal vectors,
replacing the original set of “naive” basis vectors, which resulted from the choice of measurements.9

In matrix notation, this can be expressed as follows:

PX = Y

8 There are differences between PCA and the more general concept of factor analysis, but explaining those goes beyond
the scope of this paper.
9 Any observed variable automatically establishes a basis vector. Measuring 47 variables would thus result in a 47-
dimensional coordinate system.


with X being the matrix of original observations and P being a yet-to-be-determined orthonormal matrix that transforms X into Y. Interpreting this geometrically, P performs a rotation and a stretch to generate Y. The rows of P, {p1,…,pm}, are
the new set of basis vectors for expressing the columns of X. Writing out the explicit dot products may better illustrate
this.

       ⎛ p1 ⎞
PX  =  ⎜ ⋮  ⎟ ( x1 ⋯ xn )
       ⎝ pm ⎠

       ⎛ p1 · x1  ⋯  p1 · xn ⎞
Y   =  ⎜    ⋮     ⋱     ⋮    ⎟
       ⎝ pm · x1  ⋯  pm · xn ⎠

This provides us with the general framework, but we have yet to determine what matrix P should be.

This is the point where we need to introduce the concept of the covariance matrix (Cx). It is defined as

CX = (1 / (n − 1)) X Xᵀ

• CX is a square and symmetric m × m matrix.

• The elements on the diagonal of CX represent the variance of the observed variables.

• The off-diagonal elements of CX represent the covariance between observed variables.

As a result, CX captures the covariances between all possible pairs of observed variables.

This obviously relates to our objective of minimizing redundancy (measured by covariance) and maximizing the signal
(measured by variance) of the target matrix Y. The optimum achievement of these goals would imply a diagonal covariance matrix of Y, i.e. with all off-diagonal elements being zero, and our objective thus translates into stipulating that CY
must be diagonal. Fortunately, linear algebra provides several tools for diagonalizing a matrix.

More formally, the objective becomes finding some orthonormal matrix P where Y=PX such that CY is diagonalized.
The rows of P are then the principal components.

Without providing further detail, the solution is:

• The principal components of X are the eigenvectors of the covariance matrix CX (equivalently, of X Xᵀ); they are the rows of P.

• The ith diagonal value of CY is the variance of X along pi.
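A minimal numerical sketch of this recipe follows, using NumPy and random placeholder data in place of the survey. Note that the paper writes PX = Y with observations as the columns of X; the code below uses the more common rows-as-observations layout, which is equivalent up to transposition.

import numpy as np

# Placeholder observation matrix: rows = respondents, columns = observed variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(1321, 6))

# Covariance matrix of the column-centered data.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# Eigendecomposition: eigenvectors are the principal components,
# eigenvalues are the variances along them (sorted in descending order).
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Re-express the observations in the new basis.
Y = Xc @ eigenvectors
print(eigenvalues)                        # variances of the principal components
print(np.cov(Y, rowvar=False).round(6))   # C_Y is (numerically) diagonal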


Factor Analysis with STATISTICA


Upon loading the survey data into STATISTICA, the respondent records will be presented as a data table, with the variable names shown as column headers and case numbers shown as row headers.10 This represents our observation matrix
X.

Observation Matrix X

As a starting point of the PCA process, we can display CX, the covariance matrix of X:

10 We will skip a detailed description of the data import steps, as they are fairly generic, and readers may use any of a wide array of statistical programs.


Covariance Matrix

Arc Force | Covariance


In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure
for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance
matrix play the equivalent role.

As expected, there is a high amount of covariance, i.e. redundancy, between many of the observed variables. To get a
better sense of the magnitude of these pairwise relationships, it helps to display the correlation matrix for reference:


Correlation Matrix

STATISTICA, like many other statistical software packages, has built-in routines that can perform the computation of the matrix P of principal components automatically. There are several methods available for solving the PCA, including the approach based on the eigenvectors of the covariance matrix, which was shown earlier.

Regardless of the computational method used, the solution of the PCA provides as many eigenvalues as there are observed variables. The sum of all eigenvalues equals the number of observed variables (when the analysis is based on the correlation matrix, i.e. on standardized variables), in our case 47. This allows us to determine the share of variance attributable to each factor. For instance, the first factor has an eigenvalue of 29.6, which means that it accounts for 29.6/47=62.98% of the variance. Proceeding down the list, the eigenvalues decline in value and, correspondingly, so does their contribution to the total variance.


List of Eigenvalues

Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain, as the overall objective of this exercise is variable reduction. The precise number of factors to be retained is ultimately an arbitrary decision of the analyst, but factors with eigenvalues greater than 1 are typically considered candidates. A scree plot11 is commonly used to illustrate the eigenvalues of the extracted factors. Sometimes this provides a visual indication of a natural cutoff point between higher and lower eigenvalues. Here such a distinction cannot be made easily, so we defer to the rule of thumb and retain the factors with eigenvalues greater than 1.

11 The name “scree plot” is a metaphorical expression, as “scree” is the term for the accumulation of broken rock at the
base of mountain cliffs. In the scree plot we want to distinguish the substantial eigenvalues from the “rubble” at the
bottom.
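The bookkeeping behind this rule of thumb is straightforward. The sketch below uses the first eigenvalue reported above (29.6) together with invented values for the remaining ones, normalized so that the 47 eigenvalues sum to 47; only the first value is taken from the text.

import numpy as np

leading = np.array([29.6, 3.1, 2.2, 1.6, 1.2, 1.05])   # 29.6 from the text, the rest invented
rest = np.full(41, (47 - leading.sum()) / 41)           # spread the remaining variance evenly
eigenvalues = np.concatenate([leading, rest])            # 47 eigenvalues summing to 47

share = eigenvalues / eigenvalues.sum()
print(share[0])                         # first factor: approx. 0.63 of the total variance
retained = eigenvalues > 1.0            # "eigenvalue greater than 1" rule of thumb
print(int(retained.sum()), "factors retained")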


Scree Plot

In the next step we turn to the interpretation of the extracted factors. The table below shows the factor loadings, which
are the correlations of each observed variable with the extracted factors.

Factor Loadings


Given the high eigenvalue of factor 1, it is not surprising that many variables are highly correlated with it. In our particular case, however, this correlation is mostly negative, which may be counterintuitive for interpretation purposes.

It is common practice to rotate factors in order to aid in the interpretation process. Intuitively speaking, the rotation is typically chosen in such a way that the principal factor, i.e. factor 1, aligns with what is commonly understood as the “positive x-axis.”

Such a factor rotation, for which several methods exist, was also performed with STATISTICA and the results appear in
the table below. In addition, factor loadings higher than 0.7 are highlighted.

Loadings on Rotated Factors

Relationship Analysis | Factor Loadings


The summary of clustering measures in BayesiaLab's Relationship Analysis allows an interpretation very similar to the one provided by factor loadings.

The analyst can now use these factor loadings to assign meaningful names to each factor. Some are quite obvious in their characterization, such as factor 3, which could be called “pleasant,” or factor 4, which is quite obviously “classical.” It is also interesting to see that only one variable, i.e. Intensity, has a high loading on factor 2. This implies that
perhaps Intensity is a standalone concept with little redundancy. At the other extreme, many variables have high loadings on factor 1, which makes identifying a distinct underlying concept more difficult.

Without completing this interpretation process, we turn to the “reduction” part by introducing the extracted factors as variables into the original data set, i.e. replacing 47 variables with 6 variables. This is often referred to as “saving factor scores,” the factor scores being the values of the original observations in the new coordinate system created by the extracted factors. Our observations now have coordinates in a 6-dimensional coordinate system rather than in one with 47 dimensions.
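In matrix terms, saving factor scores amounts to projecting each (centered) observation onto the retained components. The sketch below does this for random placeholder data and unrotated principal components; the regression-based scores that STATISTICA saves for rotated factors differ in detail, but the dimensionality reduction from 47 to 6 coordinates per respondent is the same.

import numpy as np

# Placeholder standardized observations: rows = respondents, columns = observed variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(1321, 47))
Xc = X - X.mean(axis=0)

# Columns of `components`: the 6 eigenvectors with the largest eigenvalues.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
components = eigenvectors[:, np.argsort(eigenvalues)[::-1][:6]]

# "Saving factor scores": one 6-dimensional coordinate vector per respondent.
scores = Xc @ components
print(scores.shape)  # (1321, 6)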

Factor Scores

Latent Factor Induction | Saving Factor Scores


Introducing the latent factors into the network is equivalent to adding the factor scores to the original observation
matrix.

We now have the ability to create a wide range of models, for instance, modeling Purchase Intent as a function of the 6 new factors. This will undoubtedly be easier to interpret than a model that includes all 47 original observed variables.


Conclusion
Although fundamentally different in their underlying frameworks, statistical factor analysis and probabilistic latent factor induction have many parallels that lend themselves to direct comparative interpretation. Given these parallels, analysts familiar
with either domain should find it easy to translate their research workflow from one framework into the other. Equally,
end users of research results, who may be less familiar with the underlying computations, should be in a position to
interpret the findings from both methods in a very similar manner.


References
Conrady, Stefan, and Lionel Jouffe. “Driver Analysis & Product Optimization: A Case Study from the Perfume Industry.” December 1, 2010. http://www.conradyscience.com/index.php/driver-analysis.
Cover, T. M., and J. A. Thomas. “Entropy, Relative Entropy and Mutual Information.” Elements of Information Theory (1991): 12–49.
Kachigan, Sam Kash. Multivariate Statistical Analysis: A Conceptual Introduction. 2nd ed. Radius Press, 1991.
MacKay, David J. C. Information Theory, Inference and Learning Algorithms. 1st ed. Cambridge University Press,
2003.
Shlens, J. “A tutorial on principal component analysis.” Systems Neurobiology Laboratory, University of California at
San Diego (2005).
StatSoft, Inc. “Electronic Statistics Textbook.” Electronic Statistics Textbook, 2011. http://www.statsoft.com/textbook/.


Contact Information

Conrady Applied Science, LLC


312 Hamlet’s End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com

Copyright
© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.

Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:

• You may print or download this document for your personal and noncommercial use only.

• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady
Applied Science, LLC and Bayesia SAS as the source of the material.

• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may
you transmit it or store it in any other website or other form of electronic retrieval system.

