COMPARISON BETWEEN Cross likelihood ratio based speaker clustering using eigenvoice models AND Partially Supervised Speaker Clustering FOR SPEAKER SEGMENTATION AND CLUSTERING
1. INTRODUCTION
Clustering and segmentation are used principally by television and radio studios in order to archive radio shows and the audio of TV shows. Some studios have too many hours of recordings to process manually, and for this reason these algorithms are used. This document compares two papers on speaker segmentation and clustering: the first is "Cross likelihood ratio based speaker clustering using eigenvoice models" and the other is "Partially Supervised Speaker Clustering".

2. FIRST DOCUMENT
The first document, "Cross likelihood ratio based speaker clustering using eigenvoice models", describes the cross likelihood ratio (CLR) criterion using eigenvoice modeling. In the baseline system, the audio is first passed through a speech activity detection stage, which separates the audio into speech and non-speech regions. Speaker segmentation is then performed to partition the speech regions into homogeneous speaker segments; this is followed by a Viterbi resegmentation stage, which aims to refine the segment boundary locations. The set of speaker segments is then passed to the speaker clustering stages of the system, which aim to merge the segments containing the utterances produced by the same speaker. There are two clustering stages. The initial clustering stage merges only the closest speaker segments and is terminated early, resulting in a set of underclustered nodes, which is passed into the second clustering stage for further clustering using more complex models. At the end of the initial clustering stage, the segment boundaries are refined once more via Viterbi resegmentation. The refined segments are then passed into the second clustering stage, which completes the clustering process using the CLR as the decision criterion. In this clustering stage, the initial clusters have considerably more data than the individual speaker segments passed into the first clustering stage. The CLR between each pair of clusters is then calculated and agglomerative clustering is performed, iteratively merging the closest pair of clusters until no more suitable merge candidates can be found. The second clustering stage produces the final diarization output, consisting of a relative, show-internal set of speaker labels and their corresponding start and end times.
This document demonstrates that by incorporating eigenvoice modeling into the CLR framework, it is possible to capitalize on the advantages of each technique to produce a robust speaker clustering system which outperforms traditional approaches using Gaussian mixture model (GMM) based modeling.

3. SECOND DOCUMENT
The second document, "Partially Supervised Speaker Clustering", describes another way to do the clustering and segmentation of speech. The figure below explains the algorithm used.
The first step is feature extraction: the process of identifying the most important cues in the measured data, while removing unimportant ones, for a specific task or purpose based on domain knowledge. In order to account for the temporal dynamics of the spectrum, the basic mel-frequency cepstral coefficient (MFCC) or perceptual linear predictive (PLP) features are usually augmented by their first-order derivatives (i.e. the delta coefficients) and second-order derivatives (i.e. the acceleration coefficients). Higher-order derivatives may be used too, although that is rarely seen. Using an independent training data set, it is possible to simultaneously characterize the temporal dynamics of the spectrum and maximize the discriminative power of the augmented acoustic features based on a discriminative learning framework.
Then there is utterance representation: as its name suggests, this is the task of compactly encoding the acoustic features of an utterance. The rationale behind this representation is that in speaker clustering the linguistic content of the speech signal is considered irrelevant and is normally disregarded. The acoustic features of an utterance are modeled by an m-component GMM, defined as a weighted sum of m component Gaussian densities. The number of free parameters of a GMM depends on the feature dimension and the number of Gaussian components. A relatively new utterance representation that has emerged in the speaker recognition area is the GMM mean supervector representation, which is obtained by concatenating the mean vectors of the Gaussian components of a GMM trained on the acoustic features of a particular utterance.
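The supervector construction can be sketched in a few lines of pure Python. As an illustrative simplification (not the paper's method), a few k-means iterations stand in for full GMM/EM training, and random vectors stand in for real MFCC frames; the point is the final step of concatenating the m component means into one high-dimensional vector.

```python
# Minimal sketch of the GMM mean supervector idea. Hard-assignment k-means
# is used here as a stand-in for GMM training (an assumption for brevity).
import random

def component_means(frames, m=4, iters=10, seed=0):
    """Estimate m component mean vectors from a list of feature frames."""
    rng = random.Random(seed)
    means = rng.sample(frames, m)                 # initial centres
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for f in frames:
            # assign each frame to its nearest component mean
            j = min(range(m),
                    key=lambda k: sum((a - b) ** 2 for a, b in zip(f, means[k])))
            groups[j].append(f)
        for k, g in enumerate(groups):
            if g:
                means[k] = [sum(col) / len(g) for col in zip(*g)]
    return means

def mean_supervector(frames, m=4):
    # concatenate the m mean vectors -> one (m * d)-dimensional point
    return [x for mean in component_means(frames, m) for x in mean]

rng = random.Random(1)
utterance = [[rng.gauss(0, 1) for _ in range(13)] for _ in range(200)]  # 200 fake 13-dim "MFCC" frames
sv = mean_supervector(utterance)
print(len(sv))   # 52 = 4 components x 13 dimensions
```

Each utterance thus becomes a single fixed-length point, regardless of how many frames it contains, which is what makes vector-based distance metrics applicable.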
In each plot of the paper's figure, the different speakers are shown in different colors; for each speaker, there are about 150 utterances, denoted by small dots. As one can see, the data points belonging to the same speaker tend to cluster together, and the utterances from the same speaker tend to yield a cluster of GMM mean supervectors that scatter in a particular direction of the GMM mean supervector space.
The third stage of the speaker clustering pipeline is the distance metric. The distance metrics that can be used for speaker clustering are closely tied to the particular choice of utterance representation: two popular categories, likelihood-based distance metrics and vector-based distance metrics, are widely used for the two corresponding utterance representations, respectively. The distance metric should represent some measure of the distance between two statistical models. A well-known likelihood-based distance metric extensively used for speaker clustering is the Bayesian information criterion (BIC), which indicates how well a model fits the utterance. For the GMM mean supervector representation, since an utterance can be represented as a single data point in a high-dimensional vector space, the most often used distance metric is the Euclidean distance. In the GMM mean supervector space, data points belonging to the same speaker tend to cluster together, so the Euclidean distance is a reasonable choice for speaker clustering in that space. However, the authors observe that the data points show very strong directional scattering patterns: the directions of the data points seem to be more informative and indicative than their magnitudes. This observation motivated the authors to advocate the cosine distance metric instead of the Euclidean distance metric for speaker clustering in the GMM mean supervector space.
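The contrast between the two metrics can be shown with a small sketch on toy vectors (illustrative values, not real supervectors): doubling a vector's magnitude moves it far away in Euclidean terms but leaves its cosine distance unchanged.

```python
# Toy illustration: cosine distance ignores vector magnitude,
# while Euclidean distance does not.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]          # same direction as u, twice the magnitude
w = [3.0, -1.0, 0.5]         # a different direction

print(round(euclidean(u, v), 3))        # 3.742 -> "far" for Euclidean
print(round(cosine_distance(u, v), 3))  # 0.0   -> identical direction
print(cosine_distance(u, w) > cosine_distance(u, v))  # True
```

Under the cosine metric, two supervectors scattered along the same direction count as the same speaker even when their magnitudes differ, which matches the directional patterns observed in the plots.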
The cosine distance metric measures the angle between two vectors in the space and is invariant to the norms of the vectors.
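Both papers ultimately rely on bottom-up (agglomerative) merging: repeatedly join the closest pair of clusters until no pair is close enough. A minimal sketch of that loop follows, using cosine distance on per-cluster centroids; the stopping threshold and the centroid averaging are illustrative assumptions, not either paper's exact criterion (the first paper merges on the CLR).

```python
# Minimal sketch of agglomerative speaker clustering over supervector-like
# points. Threshold and centroid averaging are illustrative assumptions.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return 1.0 - dot / (na * nb)

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def agglomerate(points, threshold=0.1):
    """Iteratively merge the closest pair of clusters until no pair is
    closer than the threshold; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cosine_distance(centroid([points[k] for k in clusters[i]]),
                                    centroid([points[k] for k in clusters[j]]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:        # no suitable merge candidates left
            break
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Two "speakers": points scattered along two different directions.
pts = [[1, 0.1], [2, 0.15], [3, 0.2],     # direction ~ (1, 0)
       [0.1, 1], [0.2, 2.1], [0.15, 3]]   # direction ~ (0, 1)
print(sorted(sorted(c) for c in agglomerate(pts)))  # [[0, 1, 2], [3, 4, 5]]
```

Swapping in a different distance function (BIC between fitted models, or the CLR of the first paper) changes the merge criterion without changing the structure of the loop.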