
An Introduction to MOS Scoring and Quality Measurements
White Paper
December 2008

SwissQual® AG
Allmendweg 8
CH-4528 Zuchwil
Switzerland

Internet: http://www.swissqual.com
Office: +41 32 686 65 65
Fax: +41 32 686 65 66

Part Number: 12-070-200709-3


Copyright © 2000 - 2008 SwissQual AG. All rights reserved.

No part of this publication may be copied, distributed, transmitted, transcribed, stored in a retrieval system,
or translated into any human or computer language without the prior written permission of SwissQual AG.

SwissQual has made every effort to ensure that any instructions contained in the document are
accurate and free of errors and omissions. SwissQual will, if necessary, explain issues which may not be
covered by the documents. SwissQual’s liability for any errors in the documents is limited to the correction of
errors and the aforementioned advisory services.

When you refer to a SwissQual technology or product, you must acknowledge the respective text or logo
trademark somewhere in your text.

SwissQual®, Seven.Five®, SQuad®, QualiPoc®, NetQual®, VQuad as well as the following logos are
registered trademarks of SwissQual AG.

Diversity™, NQDI™, VMon™, NiNA™, NiNA+™, NQView™, NQComm™, NQTM™, QualiWatch-M™,
QualiWatch-S™, NQAgent™, NQWeb™, QPControl™, SystemInspector™ are trademarks of SwissQual
AG.

SwissQual acknowledges the following trademarks for company names and products:

Adobe®, Adobe Acrobat®, and Adobe Postscript® are trademarks of Adobe Systems Incorporated.

Apple is a trademark of Apple Computer, Inc.

DIMENSION®, LATITUDE®, and OPTIPLEX® are registered trademarks of Dell Inc.

ELEKTROBIT® is a registered trademark of Elektrobit Group Plc.

Google® is a registered trademark of Google Inc.

Intel®, Intel Itanium®, Intel Pentium®, and Intel Xeon™ are trademarks or registered trademarks of Intel
Corporation.

INTERNET EXPLORER®, SMARTPHONE®, TABLET® are registered trademarks of Microsoft Corporation.

Java™ is a U.S. trademark of Sun Microsystems, Inc.

Linux® is a registered trademark of Linus Torvalds.

Microsoft®, Microsoft Windows®, Microsoft Windows NT®, and Windows Vista® are either registered
trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

NOKIA® is a registered trademark of Nokia Corporation.

Oracle® is a registered US trademark of Oracle Corporation, Redwood City, California.

SAMSUNG® is a registered trademark of Samsung Corporation.

SIERRA WIRELESS® is a registered trademark of Sierra Wireless, Inc.

TEMS® is a registered trademark of TELEFONAKTIEBOLAGET LM ERICSSON.

TRIMBLE® is a registered trademark of Trimble Navigation Limited.

U-BLOX® is a registered trademark of u-blox Holding AG.

UNIX® is a registered trademark of The Open Group.


An Introduction to MOS Scoring and Quality Measurements White Paper
© 2000 - 2008 SwissQual AG

Contents
1 How do you measure “quality”?
   What is the mean opinion score?
2 How to get objective measures for quality?
   How do you interpret a scatter-plot?
   Application of objective quality measures
   Full-reference vs. no-reference assessment
3 Sample and content dependencies
   Requirements on content
   Content dependency and cultural behavior
4 Long term measurement campaigns
   Collection of Results
   Interpretation of Statistical Analysis
5 Conclusion


1 How do you measure “quality”?
"Beauty in things exists merely in the mind which contemplates them."

David Hume's Essays, Moral and Political, 1742.

It is difficult to have an absolute measure for “quality”. It is not the same as trying to
measure a physical quantity such as weight, length or brightness.

To produce a valid measure of “quality” you need to take into account the perception,
cultural attitudes, preferences and expectations of human beings. And, in any given group
of people, these vary. Quality perception is driven by experience and expectation;
basically, it is the gap between what you expect and what you get.

People can also become habituated to certain levels of “quality”. For example, when
mobile phone technology was first deployed, people found the audio characteristics of the
codec difficult to listen to. Over time they “tuned in” to the characteristics of the codec
and began to find the quality more acceptable.

So how do you measure quality?

Because of the wide variation in human beings and their perception of “quality”, you
need to use a statistically significant number of people and ask them their opinions. To
get valid and meaningful results, human subjects are asked to judge, and give a score to,
samples of speech, music, video, game playing or whatever you are trying to measure.
These samples are experienced by the human subjects under identical and tightly
controlled conditions.

The scores derived in this way are truly valid only for the specific conditions in which the
tests were conducted and for the specific questions that the human subjects were asked.
To produce a truly useful measure of “quality”, the conditions should reflect real life as
closely as possible.

Note: In technical applications, quality measurement is performed to characterize a
transmission or processing system. Simplified, the quality of a certain speech or video
sample is interpreted as the quality of the system under test. This assumption is valid
under given circumstances, as explained in section 3 ‘Sample and content
dependencies’.


What is the mean opinion score?


Traditionally, so-called “Mean Opinion Scores” (MOS) are derived by carrying out
listening or visual tests with groups of human subjects large enough to constitute a
statistically significant sample. Each subject is individually asked for a “quality
score” for a pre-recorded speech or video sample a few seconds in length. The average of
all individual scores (the mean opinion score) is then the value accepted as indicating the
quality of that short sample. Usually, the score is given on a 1 to 5 scale with verbal
categories in the native language of the subjects. Very often so-called “Absolute
Category Ratings” are used:

Score  English    German         French       Spanish
5      excellent  ausgezeichnet  excellente   excelente
4      good       gut            bonne        buena
3      fair       ordentlich     assez bonne  regular
2      poor       dürftig        médiocre     mediocre
1      bad        schlecht       mauvaise     mala

Each subject’s score is, as previously discussed, driven by their individual experience,
expectation and preferences. In practice there is always a variation in the scores awarded
by different individuals. The score is also subject to the short-term nature of the test and
– of course – accidently wrong scoring. Consequently, the MOS is the average of a
distribution of individual scores.
One consequence of this is that, in practice, a statistically significant group of people will
never, as individuals, all award an ‘excellent’ score, whatever the quality of the sample
under test. Some subjects will lack confidence in their own perception, some will be
hyper-critical, and still more will award a less than perfect score purely through
accident or because of mental distraction during the test. The highest MOS value reached
in subjective tests, therefore, is around 4.5.
At the other end of the scale we have a corresponding but slightly different effect. This
difference is caused by the fact that the lower end of “quality” is much broader (it can be
even ‘worse than bad’), whereas the upper end cannot be so easily extended (the quality
cannot be “better” than speech or video which has not been distorted in any way).
To constitute statistical significance, a group of subjects should consist of at least 24
persons. In scientific papers MOS scores are accompanied by their standard deviation to
give some basic information about the width of the distribution of the individual scores. An
additional value that is often given along with the MOS is the 95% confidence interval. It
shows a range around the MOS within which – by statistical means – the true MOS of the
whole population will fall with 95% probability. It gives an impression of how
close the MOS is to the ‘true quality’. Logically, this confidence interval becomes
smaller as the group of subjects in the test increases. In well designed traditional tests this
range of uncertainty is less than 0.2 (MOS).
The term MOS is, by itself, only a generic term. It is meaningless without a
further specification of the kind of quality perception it describes. For example, a MOS
can be obtained for Listening Quality as well as for Visual Quality.
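As an illustration, the MOS, standard deviation and 95% confidence interval described above might be computed as in the following sketch. The individual scores and the normal-approximation factor 1.96 are assumptions for the example, not a SwissQual procedure:

```python
import math
import statistics

def mos_with_confidence(scores, z=1.96):
    """Return (MOS, sample standard deviation, 95% confidence half-width)
    for a list of individual 1..5 opinion scores.
    Uses the normal approximation, reasonable for groups of ~24 or more."""
    mos = statistics.mean(scores)
    sd = statistics.stdev(scores)                 # sample standard deviation
    half_width = z * sd / math.sqrt(len(scores))  # 95% confidence half-width
    return mos, sd, half_width

# Hypothetical scores from a group of 24 subjects
scores = [4, 5, 4, 4, 3, 4, 5, 4, 4, 3, 4, 4,
          5, 4, 3, 4, 4, 5, 4, 4, 3, 4, 4, 4]
mos, sd, hw = mos_with_confidence(scores)
print(f"MOS = {mos:.2f} +/- {hw:.2f} (sd = {sd:.2f})")
```

With 24 subjects the half-width comes out below the 0.2 to 0.25 range mentioned in the text; doubling the group size would shrink it by roughly a factor of 1.4.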


2 How to get objective measures for quality?
Objective measures for quality differ from the traditional empirical approach to
measuring physical phenomena. They simply estimate, or predict, quality as it would be
perceived by a large group of human observers. Therefore, for the development and
calibration of objective measures like VQuad, VMon or SQuad, a large number (more than
5,000) of subjectively scored samples is needed.
The objective measure is based on sophisticated psycho-acoustic, psycho-visual, and
perceptive models carrying out signal processing in a way that mimics the human
auditory and visual system. All the signal analysis and the comparison to the undistorted
original signal – if available – lead to a quality value that is finally mapped to the
common 5 to 1 scale again.
Thus, objective approaches do not measure quality in the traditional sense. They try to
predict the score that would be obtained from scoring by a sufficient number of individuals.
The performance, or quality, of those objective measures is usually described by means of
correlation coefficients and residual prediction errors. The correlation of the subjective
and objective scores should usually be well above 90% for the data used.

How do you interpret a scatter-plot?


A common way of presenting the accuracy of objective quality predictions is a so-called
scatter-plot, where subjective and objective data are plotted on the x- and y-axis of a
diagram. A good objective measure is indicated by a very narrow distribution along
the 45° line in this diagram (like a ‘pearl-chain’).
The following example plot illustrates this:
Figure 2-1: NiNA+ Listening Quality values for noise-free speech transmissions. NiNA+
MOS scores are plotted against auditory test results (MOS, 1.0 to 5.0 on both axes) for
ITU-T Suppl. 23 Experiment 1, American English; r = 0.905.


Obviously, the objective method (NiNA+ in this example) predicts scores that lie close to,
and are nearly symmetrically grouped along, the 45° line. However, a few points stand out
as lying well away from the target.
Since the objective scores are drawn on the y-axis, all points above the line indicate that
the objective predictor (here: NiNA+) delivers a more optimistic score by comparison
with the observers in the subjective test. Vice versa, points below the line show that the
measure is more pessimistic.
For a more general view, the correlation coefficient and/or the root mean square
error (RMSE, based on the differences between the subjectively derived MOS and its
objective prediction) are usually given as well.
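The two accuracy figures can be sketched as follows; the subjective and objective score pairs are invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(xs, ys):
    """Root mean square prediction error."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Invented subjective MOS values and their objective predictions
subjective = [1.8, 2.4, 3.1, 3.6, 4.2]
objective  = [2.0, 2.3, 3.3, 3.5, 4.1]
print(f"r = {pearson_r(subjective, objective):.3f}, "
      f"RMSE = {rmse(subjective, objective):.3f}")
```

A high correlation with a low RMSE corresponds to the narrow ‘pearl-chain’ along the 45° line in the scatter-plot.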
In voice quality tests, a so-called ‘per-condition’ analysis is also often performed. Here all
scores for a particular condition (e.g. a given codec setting) will be averaged across the
individually used speakers and sentences in the experiment. This averaging minimises
dependencies on individual characteristics and focuses more on the system under test.
Usually, the accuracy values of the objective measure increase with condition averaging.
This is due to the ‘averaging out’ of smaller prediction errors and a more confident
subjective MOS through the consideration of many more individual scores.
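Per-condition averaging as described above amounts to grouping scores by condition before taking the mean. A minimal sketch, with hypothetical condition labels and scores:

```python
from collections import defaultdict

def per_condition_mos(samples):
    """Average all scores belonging to the same test condition, across
    speakers and sentences. `samples` is a list of (condition, score)
    pairs; returns a {condition: mean score} mapping."""
    buckets = defaultdict(list)
    for condition, score in samples:
        buckets[condition].append(score)
    return {c: sum(v) / len(v) for c, v in buckets.items()}

# Hypothetical scores: two codec settings, several speakers/sentences each
samples = [("AMR 12.2", 4.1), ("AMR 12.2", 3.9), ("AMR 12.2", 4.0),
           ("AMR 4.75", 3.0), ("AMR 4.75", 3.2), ("AMR 4.75", 3.1)]
print(per_condition_mos(samples))
```

The per-condition means suppress speaker- and sentence-specific variation, so the remaining differences reflect the conditions (here, codec settings) themselves.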

Application of objective quality measures


If you want to manage the delivered quality of your mobile network, you need to be able
to accurately assess that quality. One method of assessing the service quality of a
telecommunications network is by assessing the quality of a signal transmitted through
the network.
In the case of objective quality evaluation, several approaches are available for assessing
the quality. The primary distinction is between ‘no reference’ methods (often called
‘non-intrusive’ or ‘single-ended’) and ‘full reference’ methods (often called ‘intrusive’ or
‘double-ended’).
• ‘No reference’: evaluation and rating is conducted based on the
“received” signal alone (single-ended method; might be a test call to an
answering machine or even live monitoring).
• ‘Full reference’: a reference signal is transmitted. The received signal is
evaluated and rated based on the “known reference” (double-ended and
intrusive method; requires a ‘test call’).
Both methods predict the mean opinion score (MOS), the score that would be obtained by
performing a subjective test. The basic relationship between subjective /objective
assessments and full- / no-reference models is shown in Figure 2-2.
Usually, the accuracy of a no-reference approach is lower than that of a full-reference
approach due to the missing reference signal for detailed comparison. However, the
accuracy may be sufficient for a basic classification of the quality delivered and for the
detection of consistently poor quality links. No-reference models have a wider
application range because there is no need for special facilities at the far-end of the
communications link under test.


Figure 2-2: Subjective versus objective quality assessment. A ‘reference’ signal is sent
through the system under test and received as the ‘processed’ signal. A human viewer
judges it against an internal reference (‘expectation’); full-reference methods
(double-ended) require the reference signal, whereas no-reference methods
(single-ended) do not.

Full-reference vs. no-reference assessment


For objective quality testing, the methods can be used in several application scenarios:
• No-Reference – Single-ended measurement: A test connection is
established to any answering station, which plays back a voice/video
signal that is unknown at the receiving side (e.g. from a streaming server
or a live TV application). However, a test connection has to be established
and causes additional load on the network under test and its resources.
• No-Reference – Non-intrusive in-service monitoring: Assessment of
signals in real applications such as IPTV or telephony by parallel
monitoring of active connections in the core network. Here a no-reference
approach is also used. No dedicated test connection is required; often the
test is done on the equipment of “friendly users” as they go about their
everyday business. However, this means that the measurement point can
be anywhere in the network. Any degradation beyond this measurement
point remains unconsidered. In addition, the characteristics of the specific
device being used may affect the measurement.
• Full-Reference – Double-ended measurement: Both ends of the connection
are under control and a defined voice, data or video sequence will be
transmitted over the test connection. These applications require a
controlled answering station or server with known pre-stored sequences at
the far-end side. Logically, a test connection has to be established and
resources are taken from the network. Although there is the disadvantage
of having to intervene at the source of the signal and the network to be
tested, the advantages may compensate for that. The advantage of the
‘double-ended’ method is that the input signal, or “reference” signal, is
known, allowing for very accurate and detailed analysis of voice or video
quality impairments. Each change in the signal during its transmission can
be detected and evaluated for its impact on perceived quality by
applying models of human perception. Full-reference methods are also
applicable for optimisation processes in laboratories as well as in real
networks. They are capable of measuring even minimal degradations of
the signal and can be applied to compare various transmission scenarios.

3 Sample and content dependencies
Requirements on content
A participant in a quality test has no knowledge of the transmission or processing
system under test. He or she is simply asked to assess the quality of the presented
sample. Consequently, the achieved quality depends not only on the degradations
inserted by the system under test (e.g. a transmission channel), but also on the source
sample itself, as well as any interactions between that specific sample and the
transmission system.
Obviously, a bias of the quality values due to the content used should be minimised and
the achieved quality values should accurately characterise the system under test.
For that reason the content that is used as a sample should meet a number of criteria:
• The content should represent the service to be tested (i.e. human voice for
telephony, head and shoulder video for video telephony).
• The source material should be free of any distortions or characteristics which can
be interpreted as distortions (e.g. clips with ‘special effects’ like blinking objects,
stretched faces etc.)
• Usually, a natural source (e.g. a voice recording of a native speaker in a quiet
studio, ‘normal’ video content such as people, landscapes …) is a good choice.
• The observer in a quality test should be familiar with the presented content (i.e.
voice should be presented in the mother tongue of the listener; video content
should cover common scenes…).
• In each case any source material with a highly emotional content should be
avoided.
Of course, those general requirements can only minimise content effects. To accurately
characterise a transmission system, the content must be carefully chosen. For voice
samples the spoken texts should reflect the phonological characteristics of that language
(if possible in a short excerpt). In the video domain typical genres should be defined and
used. Often, more than one type of content will need to be transmitted in order to get a
range of different quality levels, or to perform averaging over the achieved quality values.


Content dependency and cultural behavior


If a system is tested using a subjective quality test in different labs, the scores will often
turn out slightly differently, even though the system under test is identical (see also
section 1).
Often differences in ‘language’ are cited as a cause of these differences. This is not
entirely accurate. Although samples using different languages may well produce different
results, these are not usually due to the language itself but are instead due to other factors.
For example, a language that has a significantly higher occurrence of voiced parts may be
processed differently by an encoder than a language consisting mainly of unvoiced parts.
However, the chosen sample itself, with its individual structure, usually has more
influence. For example, there might be characteristics in the original unprocessed sample
that are not perceptible to a user but which might affect certain transmission systems (e.g.
thresholds for pause muting).
Perhaps the main reasons for the differences in the scores for different languages are the
cultural attitudes of the test group. These might cause small differences in the
interpretation of the quality labels, different experience in daily life and therefore
different expectations of quality as well as culturally driven ‘scaling’. Some cultural
groups may be more generous and will give good scores as long as there is no major
degradation in quality; whilst, on the other hand, there might be cultural groups that aren’t
as generous and who will give good scores only for absolutely perfect quality.
Modification of the awarded scores caused by context and cultural effects cannot be
reproduced by an objective predictor. However, as previously mentioned, the qualitative
rank-order normally remains unaffected and the predictor will rank the samples in the
same order in which they would be ranked by a subjective test.

4 Long term measurement campaigns
Collection of Results
In a benchmarking or optimisation system, typically a large number of single
measurements will be obtained and aggregated to provide a good overall view of the
network. For example, we can imagine a test system that collects voice quality values
across different networks and technologies. A simple aggregation or even averaging will
not, however, lead to useful results.
It is a good idea to divide the data into different sub-selections describing the specific
scenarios that you want to evaluate. These selections could be a choice of the individual
operators serving the geographic area and/or technologies. In a deeper view, the codec
or codec rate used can be a selection criterion, as can whether HalfRate or
FullRate channels are used and – of course – geographical regions, dates, times and even
the type of voice sample being used.
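Such sub-selections amount to grouping each measurement by its metadata before aggregating. A minimal sketch, where the field names and values are invented for the example:

```python
from collections import defaultdict
from statistics import mean

def aggregate(measurements, keys):
    """Group measurements by the given metadata keys and average the
    MOS within each group. Each measurement is a dict of metadata plus
    a 'mos' value."""
    groups = defaultdict(list)
    for m in measurements:
        groups[tuple(m[k] for k in keys)].append(m["mos"])
    return {k: mean(v) for k, v in groups.items()}

# Invented single measurements with metadata
data = [
    {"operator": "Op1", "channel": "FullRate", "mos": 4.0},
    {"operator": "Op1", "channel": "HalfRate", "mos": 3.2},
    {"operator": "Op2", "channel": "FullRate", "mos": 3.9},
    {"operator": "Op1", "channel": "HalfRate", "mos": 3.0},
]
print(aggregate(data, ("operator", "channel")))
```

Changing the `keys` tuple (e.g. adding a region or codec field) yields a different sub-selection of the same raw data without re-measuring anything.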


By separation into those sub-sets it is much easier to see the true reasons for differences
in quality. They might, for example, be seen to be caused by a higher ratio of HalfRate for
a certain provider or in a special region.

Interpretation of Statistical Analysis


The common average is just one way of aggregating results. It is unfortunately also often
not the most useful method for quality analysis. Other aggregations are based on the
statistical distribution of the individual quality scores.
The following examples make use of a typical collection of Listening Quality scores in a
cellular network. First, the frequency of occurrence of particular score values is shown
(Figure 4-1):
Figure 4-1: Distribution of MOS-LQ scores in two example networks. The histogram
shows the frequency of occurrence of MOS-LQ values (in 0.1-wide bins from 1.0 to 5.0)
for Network A and Network B, with frequencies ranging up to about 50%.


Here we see that the hypothetical ‘Network A’ (blue) generally is achieving very good
quality scores in the range MOS-LQ > 3.8. However there are also some measurements
showing bad quality with MOS-LQ < 3.0. By contrast we can look at a second
hypothetical ‘Network B’ (red), where all MOS-LQ > 3.3.
The simple calculated average MOS-LQ across all samples measured produces almost the
same result for both networks (A: 3.69 and B: 3.77). However, it is obvious that the
networks behave differently and, depending on the reason for the analysis, different
means should be applied and – of course – different actions should be taken to improve
the network.
In ‘Network A’ we could imagine that a very high quality might be reached under good
conditions. It appears that the voice codec is running at a high rate and no disturbing
components are in the network, but some clips are scored very low. This may be caused
by a high handover rate, low coverage in some cases or strong interference. Along with
additional values based on the voice signal and/or L1 to L3 analysis, the actual causes can
be detected easily. By doing radio network optimisation this Network A can clearly be
improved.
In ‘Network B’ we have a different problem. There are neither severe problems in
dedicated locations nor increased hand-over rates. However, it looks as if there is a
systematic problem, such as the constant use of a lower bit-rate or a wrongly adjusted
VQE-system[1] or PSTN gateway. A deeper analysis of the core network settings may help
improve the quality here in ‘Network B’.
So how can a statistical analysis help you to understand these phenomena and how can
they be easily described?
To begin with, we can look at the median[2] value instead of the average (mean). Whilst the
average value is better on ‘Network B’, the median value rates ‘Network A’ more highly.
Another method of analysis might be to look at the ratio of GoB (good-or-better, i.e. all
scores above 3.5) or PoW (poor-or-worse, i.e. 2.5 or lower). The peak value may also be
useful (at least in a visual observation of the distribution function), as well as the
90% percentile value[3].
The following table shows how these different statistical methods compare for the
chosen example:

                      Network A   Network B
Average (Mean)           3.69        3.77
Median                   3.95        3.84
GoB (MOS-LQ > 3.5)        90%         94%
PoW (MOS-LQ < 2.5)         7%          0%
90% percentile           4.01        3.89

You can see that the average value, as well as the GoB and PoW values, are heavily
affected by the occurrence of very low values, as in ‘Network A’. The “unproblematic”
measurements are scored very highly, hence the core network settings seem to be well
chosen. This can be seen best in the 90% percentile or the median (which is the same as
the 50% percentile). By fixing the problematic cases the overall quality becomes
excellent, since the core network is already set up well.
On the other hand, ‘Network B’ displays a systematically lower maximum quality. Although
it performs well on average, it can be shown (e.g. by the 90% percentile) that a certain
quality is never exceeded and that this threshold is lower than that of the competing
network. With some optimisation in the core network, this reachable quality could be
increased and the resulting quality could become excellent, given the good radio
performance exhibited by the network.
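The statistics used in the table above can be sketched as follows. The scores are invented, the GoB and PoW thresholds follow the definitions given in the text, and the percentile is a simple order-statistic approximation rather than any particular interpolation method:

```python
import statistics

def quality_stats(scores):
    """Aggregate a list of MOS-LQ values into the statistics
    discussed above."""
    ordered = sorted(scores)
    n = len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "GoB": sum(s > 3.5 for s in ordered) / n,   # good-or-better
        "PoW": sum(s <= 2.5 for s in ordered) / n,  # poor-or-worse
        # crude 90% percentile: a value below which 90% of scores lie
        "p90": ordered[min(n - 1, int(0.9 * n))],
    }

# Invented MOS-LQ measurements
scores = [4.1, 3.9, 4.0, 2.1, 3.8, 4.2, 3.7, 1.9, 4.0, 3.9]
print(quality_stats(scores))
```

As in the ‘Network A’ example, the two low outliers drag the mean well below the median, while GoB and PoW make the split between good and bad measurements explicit.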
This is a simple example. However you can see that by dividing data collections into
technologies, codec types and rates, Uplink/Downlink, regions and similar separators,
almost all data collections can be analysed by using a variety of statistical tools. This
delivers a much richer view of the network and a more useful set of measures for network
optimisation than could be obtained by using simple averaging of measurements alone.

[1] VQE stands for Voice Quality Enhancement and covers noise suppression, echo cancellation and other
similar components.
[2] The median determines the value at which 50% of the scores lie below and the other 50% lie above.
[3] A percentile value determines a value below which a certain proportion of scores lie. For example, 90% of
all scores lie below the 90% percentile value.


5 Conclusion
The prediction of MOS values allows for a deep insight into a network’s behavior from
the user’s perspective. It delivers valuable information for both optimisation and
benchmarking and complements traditional L1 to L3 measurements.
However, MOS is only one way – albeit the most popular way – of describing a
network’s quality. Based on the transmitted media stream much more information for
problem diagnostics can be obtained.
SwissQual’s measurement suite, SQuad™, delivers MOS values and diagnostics for
speech-based services, whereas VQuad™ does the same for video services.

_____________
