Abstract

Printer identification based on a printed document has many desirable forensic applications. In the electrophotographic (EP) process, quasiperiodic banding artifacts can be used as an effective intrinsic signature. However, in text-only document analysis, the absence of large midtone areas makes it difficult to capture suitable signals for banding detection. Frequency domain analysis based on the projection signals of individual characters does not provide enough resolution for proper printer identification. Advanced pattern recognition techniques and knowledge about the print mechanism can help us devise an appropriate method to detect these signatures. We can get reliable intrinsic signatures from multiple projections to build a classifier that identifies the printer. Projections from individual characters can be viewed as a high dimensional data set. In order to create a highly effective pattern recognition tool, this high dimensional projection data has to be represented in a low dimensional space. The dimension reduction can be performed by some well known pattern recognition techniques, and a classifier can then be built on the reduced dimension data set. A popular choice is the Gaussian mixture model, where each printer is represented by a Gaussian distribution. The distributions of all the printers help us determine the mixing coefficients for the projections from an unknown printer. Finally, the decision making algorithm can vote for the correct printer. In this paper we describe different classification algorithms to identify an unknown printer. We present experiments based on several different EP printers in our printer bank, and we compare the classification results of the different classifiers.*

* This research is supported by a grant from the National Science Foundation, under award number 0219893.

Introduction

In our previous work, we described the intrinsic and extrinsic features that can be used for printer identification [1]. Our intrinsic feature extraction method is based on frequency domain analysis of the one dimensional projected signal. If there are a sufficient number of samples in the projected signal, the Fourier transform gives us the correct banding frequency. When we work with a text-only document, our objective is to get the banding frequency from the projected signals of individual letters. In this situation, there are not enough samples per projection to give high frequency domain resolution, and significant overlap between spectra from different printers makes this approach difficult to use as an effective classification method. The printer identification or classification task is closely related to various pattern identification and pattern recognition techniques. The intrinsic features are the patterns used to recognize an unknown printer. The basic idea is to create a classifier that can utilize the intrinsic signatures from a document to make a proper identification. A Gaussian mixture model (GMM) or a tree based classifier is suitable for the classification part; the initial dimension reduction is performed by principal component analysis.

Principal component analysis (PCA) is often used as a dimension-reducing technique within some other type of analysis [2]. Classical PCA is a linear transform that maps the data into a lower dimensional space while preserving as much of the data variance as possible. In the case of intrinsic feature extraction, PCA can be used to reduce the dimension of the projected signal. A proper number of components can be chosen to discriminate between different printers. These components are the features used by the classifier (GMM or tree classifier).

The Gaussian mixture model defines the overall data set as a combination of several different Gaussian distributions. The parameters of the model are determined by the training data. Once the parameters are selected, the model is used to predict the printer, based on projections from the unknown printer.

Binary tree structured classifiers are formed by repeated splits of the original data set. Tree classifiers are also suitable for the printer identification problem; the leaves of a properly grown and pruned tree should represent the different printers.

In Fig. 1, the printer identification process is described briefly with the help of a flowchart. The printed document is scanned, and the scanned image is used to get the one dimensional projected signal from individual characters [1]. PCA provides the lower dimensional feature space, and the GMM or tree classifier works on that feature space to correctly identify the unknown printer.

In the following sections of the paper, the dimension reduction technique by PCA, classification by GMM, and the tree growing and pruning algorithm of binary tree classifiers are explained. Experimental results related to these algorithms are also provided.

[Figure 1: Steps followed in printer characterization. Flowchart: printed document from printer X → scanning → digitized version of the document → deskewing, segmentation, recognition → segmented characters → PCA → reduced dimension feature space → either (tree growing & pruning → tree classifier → tree structure) or (modeling → GMM → majority vote) → printer X identified.]

Principal Component Analysis (PCA)

The theory behind principal component analysis is described in detail by Jolliffe [2], Fukunaga [3], and Webb [4]. In this section the fundamentals of PCA are described.

An n-dimensional vector X can be represented by the summation of n linearly independent vectors:

    X = \sum_{i=1}^{n} y_i \phi_i = \Phi Y,    (1)

where y_i is the i-th principal component and the \phi_i's are the basis vectors obtained from the eigenvectors of the covariance matrix of X. Using only m < n of the \phi_i's, the vector X can be approximated as

    X(m) = \sum_{i=1}^{m} y_i \phi_i + \sum_{i=m+1}^{n} b_i \phi_i,    (2)

where the b_i's are constants. The coefficients b_i and the vectors \phi_i are to be determined so that X is best approximated. If the first m y_i's are calculated, the resulting error is

    \Delta X(m) = X - X(m) = \sum_{i=m+1}^{n} (y_i - b_i) \phi_i.    (3)

The set of m eigenvectors of the covariance matrix of X that corresponds to the m largest eigenvalues minimizes this error over all choices of m orthonormal basis vectors. The expansion of a random vector in the eigenvectors of its covariance matrix is also called the discrete version of the Karhunen-Loève expansion [3].

Method of Canonical Variates

In the printer identification problem, we have additional information about the class labels from the training data. We can get training samples from different known printers, and the class labels represent the different printer models. This additional class label information provides the optimal linear discrimination between classes with a linear projection of the data. This is known as the method of canonical variates [5]. Here the basis vectors are obtained from the between-class and within-class covariance matrices. Due to singularity in the covariance matrix, this method has to be implemented with the help of simultaneous diagonalization [6]. This version of PCA is described in detail by Webb [4]. We have to diagonalize two symmetric matrices S_W and S_B simultaneously:

1. The within-class covariance matrix S_W is whitened. The same transformation is applied to the between-class covariance matrix S_B. Let us denote the transformed S_B by S̃_B.

2. An orthonormal transformation is applied to diagonalize S̃_B.

The complete mathematical formulation can be found in Webb [4] and Fukunaga [6].
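As a concrete illustration of the PCA reduction in Eqs. (1)-(3), the projection data can be mapped onto the eigenvectors of its sample covariance matrix with a few lines of NumPy. This is only a sketch; the function and variable names are ours, not from the authors' implementation, and the random data stands in for real 1D projection signals.

```python
import numpy as np

def pca_reduce(X, m):
    """Project the rows of X onto the m eigenvectors of the sample
    covariance matrix with the largest eigenvalues (cf. Eqs. 1-3)."""
    mu = X.mean(axis=0)
    Xc = X - mu                           # center the data
    cov = np.cov(Xc, rowvar=False)        # n x n covariance matrix
    evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    keep = np.argsort(evals)[::-1][:m]    # indices of the m largest
    Phi = evecs[:, keep]                  # basis vectors phi_i
    Y = Xc @ Phi                          # principal components y_i
    return Y, Phi, mu

# illustrative stand-in: 200 projection signals of length 50
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
Y, Phi, mu = pca_reduce(X, m=2)
```

Each row of `Y` is then a two-dimensional feature vector of the kind fed to the GMM or tree classifier in the later sections.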
[Figure 2: Principal component analysis using the 1D projected signal. Flowchart: test page → 1-D projections → PCA → printer 1, printer 2, …]

[Figure: scatter plot of the projected signals in the plane of the first two principal components; legend: LJ4050, LJ1000, LJ1200, Samsung 1450, Okipage, Test (LJ4050).]
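The two-step simultaneous diagonalization used for the method of canonical variates can also be sketched directly. This sketch assumes the within-class covariance matrix is nonsingular (positive definite); the cited references treat the singular case. The function name and interface are illustrative, not the authors' code.

```python
import numpy as np

def canonical_variates(Sw, Sb, m):
    """Simultaneously diagonalize the within-class (Sw) and
    between-class (Sb) covariance matrices.
    Step 1: whiten Sw and apply the same transform to Sb.
    Step 2: orthonormal transform diagonalizing the transformed Sb."""
    dw, Uw = np.linalg.eigh(Sw)       # Sw assumed positive definite
    W = Uw / np.sqrt(dw)              # whitening: W.T @ Sw @ W = I
    Sb_t = W.T @ Sb @ W               # transformed between-class matrix
    db, Ub = np.linalg.eigh(Sb_t)     # eigenvalues in ascending order
    keep = np.argsort(db)[::-1][:m]   # m most discriminative directions
    return W @ Ub[:, keep]            # columns are the basis vectors
```

Projecting the data onto the returned columns gives features in which the within-class scatter is the identity and the between-class scatter is diagonal.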
    \mu_j^{k+1} = \frac{\sum_{i=1}^{N} P^k(j \mid x_i)\, x_i}{\sum_{i=1}^{N} P^k(j \mid x_i)},    (8)

    (\sigma_j^{k+1})^2 = \frac{1}{d} \, \frac{\sum_{i=1}^{N} P^k(j \mid x_i)\, \lVert x_i - \mu_j^{k+1} \rVert^2}{\sum_{i=1}^{N} P^k(j \mid x_i)}.    (9)

The initialization of the model is performed by the k-means algorithm. First, a rough clustering is done, and the number of clusters is determined by the number of printer models in the training set. Then each data point is assigned to the closest cluster center. The initial prior probabilities, means, and variances are calculated from these clusters. After that, the iterative part of the algorithm starts, using Eqs. (7)-(9).

Experimental Results

The feature space is obtained by PCA as described in the previous section. The model is based on only the first two principal components (d = 2). Model initialization is performed by seven iterations of the k-means algorithm. The initial parameters are used by the iterative EM algorithm to get the means and variances of the different classes. The EM algorithm is terminated either after a fixed number of iterations or by a threshold on the amount of change in the parameters between successive iterations. The classification is done by majority vote. For each projection from the unknown printer, the posterior probabilities of all the classes are determined, and the unknown projection is assigned to the class with the highest probability. This operation is performed for all the projections from the unknown printer. The class with the highest number of votes represents the model of the unknown printer.

In Table 1, the classification results for five different printer models are presented. In the printer bank, we have two printers of each model [1]. One printer is used for training, and the other printer is used for testing. The initial training data set is created from forty projections from each printer model. Then a sixth printer is added as an unknown printer to check the performance of the classifier. Forty projections from each printer are used during training and testing. For example, when the Laserjet 4050 is tested, the classifier predicts that all 40 projections are from the Laserjet 4050 class, i.e. the classification is 100% correct. The Samsung ML-1450 and Okipage 14e are also identified correctly. Due to the close banding characteristics of the Laserjet 1200 and Laserjet 1000, the correct classification rate for these models is decreased. This result confirms the outcome of the banding analysis [1] and PCA.

Classification and Regression Tree (CART)

A classification tree is a multistage decision process that uses subsets of features at different levels of the tree. The construction of the tree involves three steps:

1. Splitting rule selection for each internal node.
2. Terminal node determination.
3. Class label assignment to terminal nodes.

The approach we have used in our experiments is based on CART [11]. The Gini criterion is used as the splitting rule, and a terminal node is declared when it has pure class membership. In the printer identification problem, the class labels of the training printers are known, and the unknown printer can be given a temporary label. After the classification process, the class label of the unknown printer can be determined.

Experimental Results

The CART algorithm also works in the reduced dimension feature space generated by PCA. At each node the impurity function is defined by the Gini criterion. The algorithm tries to minimize the node impurity by selecting the optimum split. When the node impurity is zero, a terminal node is reached. A complete tree grown in this manner is overfitted, so the tree is pruned by the cross-validation method.
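The M-step updates of Eqs. (8) and (9) and the majority-vote decision rule can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the variances are spherical per component as in Eq. (9), and since Eq. (7) is not reproduced in this excerpt we assume it is the standard prior-probability update.

```python
import numpy as np

def em_step(X, pri, mu, var):
    """One EM iteration for a GMM with spherical per-component
    variances, following the updates in Eqs. (8) and (9)."""
    N, d = X.shape
    K = len(pri)
    # E-step: posteriors P^k(j | x_i); the common (2*pi)^(-d/2)
    # factor cancels in the row normalization below.
    resp = np.empty((N, K))
    for j in range(K):
        sq = ((X - mu[j]) ** 2).sum(axis=1)
        resp[:, j] = pri[j] * var[j] ** (-d / 2) * np.exp(-0.5 * sq / var[j])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step
    Nj = resp.sum(axis=0)
    pri = Nj / N                                 # assumed Eq. (7): prior update
    mu = (resp.T @ X) / Nj[:, None]              # Eq. (8): mean update
    var = np.array([(resp[:, j] * ((X - mu[j]) ** 2).sum(axis=1)).sum()
                    / (d * Nj[j]) for j in range(K)])  # Eq. (9): variance update
    return pri, mu, var, resp

def majority_vote(resp):
    """Assign each projection to its most probable class,
    then report the class receiving the most votes."""
    votes = resp.argmax(axis=1)
    return int(np.bincount(votes).argmax())
```

In practice the E- and M-steps are iterated from the k-means initialization until the parameter change falls below a threshold, and `majority_vote` is applied to the posteriors of all projections from the unknown printer.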
[Figure: scatter plot of the training and test projections in the plane of the first two principal components (axes: 1st principal component, 2nd principal component); legend: LJ4050, LJ1000, LJ1200, Oki, SS1450, TEST-LJ4050; the first and second splits of the tree classifier are marked.]

[Figure 5: Binary tree structure for classifying an unknown printer. Five printer models are used in the training process. PC1 and PC2 denote the first and second principal components; the internal node tests include PC2 < 1.89 (first split), PC2 < −0.85 (second split), PC1 < −1.16, PC2 < −3.5, and PC1 < −3.79, and the leaves are labeled LJ4050, Test LJ4050, LJ1000, LJ1200, Okipage, and SS 1450.]
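The Gini-based split selection behind trees like the one in Fig. 5 can be illustrated with a small sketch. The helper names are hypothetical, and the exhaustive threshold search shown here is only one simple way to realize the Gini splitting rule described above, not the authors' code.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) pair that
    minimizes the weighted Gini impurity of the two children."""
    n, d = X.shape
    best = (np.inf, None, None)
    for f in range(d):
        for t in np.unique(X[:, f])[:-1]:   # candidate thresholds
            left = X[:, f] <= t
            w = left.mean()                 # fraction sent left
            score = w * gini(y[left]) + (1 - w) * gini(y[~left])
            if score < best[0]:
                best = (score, f, t)
    return best  # (weighted impurity, feature index, threshold)
```

Splitting recursively until each node has a single class label, then pruning by cross-validation, yields a tree of the form shown in Fig. 5; production implementations typically place thresholds at midpoints between consecutive feature values rather than at the values themselves.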