SwissQual® AG
Allmendweg 8
CH-4528 Zuchwil
Switzerland
Internet: http://www.swissqual.com
Office: +41 32 686 65 65
Fax: +41 32 686 65 66
No part of this publication may be copied, distributed, transmitted, transcribed, stored in a retrieval system,
or translated into any human or computer language without the prior written permission of SwissQual AG.
SwissQual has made every effort to ensure that the instructions contained in this document are
adequate and free of errors and omissions. SwissQual will, if necessary, explain issues that are not
covered by the document. SwissQual’s liability for any errors in the document is limited to the correction of
errors and the aforementioned advisory services.
When you refer to a SwissQual technology or product, you must acknowledge the respective text or logo
trademark somewhere in your text.
SwissQual®, Seven.Five®, SQuad®, QualiPoc®, NetQual®, VQuad as well as the following logos are
registered trademarks of SwissQual AG.
SwissQual acknowledges the following trademarks for company names and products:
Adobe®, Adobe Acrobat®, and Adobe Postscript® are trademarks of Adobe Systems Incorporated.
Intel®, Intel Itanium®, Intel Pentium®, and Intel Xeon™ are trademarks or registered trademarks of Intel
Corporation.
Microsoft®, Microsoft Windows®, Microsoft Windows NT®, and Windows Vista® are either registered
trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
Contents
1 How do you measure “quality”? ...................................................................................... 4
What is the mean opinion score? ........................................................................................ 5
2 How to get objective measures for quality? ................................................................... 6
How do you interpret a scatter-plot?.................................................................................... 6
Application of objective quality measures ........................................................................... 7
Full-reference vs. no-reference assessment ....................................................................... 8
3 Sample and content dependencies ................................................................................. 9
Requirements on content .................................................................................................... 9
Content dependency and cultural behavior ....................................................................... 10
4 Long term measurement campaigns............................................................................. 10
Collection of Results .......................................................................................................... 10
Interpretation of Statistical Analysis................................................................................... 11
5 Conclusion ....................................................................................................................... 13
An Introduction to MOS Scoring and Quality Measurements White Paper
© 2000 - 2008 SwissQual AG

1 How do you measure “quality”?
It is difficult to have an absolute measure for “quality”. It is not the same as trying to
measure a physical quantity such as weight, length or brightness.
To produce a valid measure of “quality” you need to take into account the perception,
cultural attitudes, preferences and expectations of human beings. And, in any given group
of people, these vary. Quality perception is driven by experience and expectation;
basically, it is the gap between what you expect and what you get.
People can also become habituated to certain levels of “quality”. For example, when
mobile phone technology was first deployed, people found the audio characteristics of the
codec difficult to listen to. Over time they “tuned in” to the characteristics of the codec
and began to find the quality more acceptable.
Because of the wide variation in human beings and their perception of “quality”, you
need to use a statistically significant number of people and ask them their opinions. To
get valid and meaningful results, human subjects are asked to judge, and give a score to,
samples of speech, music, video, game playing or whatever you are trying to measure.
These samples are experienced by the human subjects under identical and tightly
controlled conditions.
The scores derived in this way are truly valid only for the specific conditions in which the
tests were conducted and for the specific questions that the human subjects were asked.
To produce a truly useful measure of “quality”, the conditions should reflect real life as
closely as possible.
What is the mean opinion score?
Each subject’s score is, as previously discussed, driven by their individual experience,
expectation and preferences. In practice there is always a variation in the scores awarded
by different individuals. The score is also subject to the short-term nature of the test and
– of course – accidental mis-scoring. Consequently, the MOS is the average of a
distribution of individual scores.
One consequence of this is that, in practice, a statistically significant group of people will
never, as individuals, all award an ‘excellent’ score, whatever the quality of the sample
under test. Some subjects will lack confidence in their own perception, some will be
hyper-critical and still more will award a less than perfect score purely through
accident or because of mental distraction during the test. The highest MOS value reached
in subjective tests, therefore, is around 4.5.
At the other end of the scale we have a corresponding but slightly different effect. This
difference is caused by the fact that the lower end of “quality” is much broader (it can be
even ‘worse than bad’), whereas the upper end cannot be so easily extended (the quality
cannot be “better” than speech or video which has not been distorted in any way).
To constitute statistical significance, a group of subjects should consist of at least 24
persons. In scientific papers, MOS scores are accompanied by their standard deviation to
give some basic information about the width of the distribution of the individual scores. An
additional value that is often given along with the MOS is the 95% confidence interval. It
shows a range around the MOS within which, by statistical means, the true MOS of the
whole worldwide population will fall with 95% probability. It gives an impression of how
close the MOS is to the ‘true quality’. Logically, this confidence interval becomes
smaller as the group of subjects in the test increases. In well-designed traditional tests this
range of uncertainty is less than 0.2 (MOS).
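As a sketch of how such a confidence interval is obtained, the following Python snippet computes the MOS, the standard deviation and the 95% confidence interval from a set of individual scores. The scores themselves are purely hypothetical, standing in for a 24-subject listening test:

```python
import math

def mos_with_ci(scores, z=1.96):
    """Return (MOS, std deviation, 95% CI half-width) for a list of
    individual opinion scores on the 1-5 ACR scale."""
    n = len(scores)
    mos = sum(scores) / n
    # Sample standard deviation of the individual scores
    var = sum((s - mos) ** 2 for s in scores) / (n - 1)
    std = math.sqrt(var)
    # 95% confidence interval half-width (normal approximation)
    ci = z * std / math.sqrt(n)
    return mos, std, ci

# Hypothetical scores from 24 subjects rating one sample
scores = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 3, 4,
          4, 5, 4, 3, 4, 4, 4, 3, 5, 4, 4, 4]
mos, std, ci = mos_with_ci(scores)
print(f"MOS = {mos:.2f} +/- {ci:.2f} (95% CI), std = {std:.2f}")
```

Doubling the number of subjects shrinks the interval by a factor of roughly the square root of two, which is why larger test panels yield a MOS closer to the ‘true quality’.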
The term MOS is by definition only a generic term. It is meaningless without a
further specification of the kind of quality perception it describes. For example, a MOS
can be obtained for Listening Quality as well as for Visual Quality.
2 How to get objective measures for quality?
[Figure: scatter plot of objective NiNA+ scores (y-axis, MOS) against auditory test results (x-axis, MOS), both on a scale from 1.0 to 5.0]
Figure 2-1: NiNA+ Listening Quality values for noise-free speech transmissions
Obviously, the objective method (NiNA+ in this example) predicts scores that lie close to,
and nearly symmetrically grouped along, the 45° line. However, a few points stand out as
lying well away from the target.
Since the objective scores are drawn on the y-axis, all points above the line indicate that
the objective predictor (here: NiNA+) delivers a more optimistic score by comparison
with the observers in the subjective test. Vice versa, points below the line show that the
measure is more pessimistic.
For a more general view, the correlation coefficient and/or the root mean square
error (RMSE, based on the differences between the subjectively derived MOS and its
objective prediction) are usually given as well.
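As an illustration, both figures of merit can be computed as follows. The two score lists are hypothetical per-sample values, not results from an actual validation test:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root mean square error between subjective MOS and objective prediction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical per-sample scores: subjective MOS vs. objective prediction
subjective = [1.8, 2.4, 3.1, 3.6, 4.2]
objective = [2.0, 2.3, 3.3, 3.5, 4.0]
print(f"r = {pearson_r(subjective, objective):.3f}, "
      f"RMSE = {rmse(subjective, objective):.3f}")
```

A high correlation with a low RMSE corresponds to the tight, symmetric grouping along the 45° line described above; outliers inflate the RMSE in particular.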
In voice quality tests, a so-called ‘per-condition’ analysis is also often performed. Here all
scores for a particular condition (e.g. a given codec setting) will be averaged across the
individually used speakers and sentences in the experiment. This averaging minimises
dependencies on individual characteristics and focuses more on the system under test.
Usually, the accuracy of the objective measure improves with condition averaging.
This is due to the ‘averaging out’ of smaller prediction errors and a more confident
subjective MOS through the consideration of many more individual scores.
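A per-condition analysis of this kind can be sketched as follows; the condition names, speakers and scores are purely illustrative:

```python
from collections import defaultdict

# Hypothetical per-sample results: (condition, speaker, score)
results = [
    ("AMR 12.2 kbit/s", "spk1", 3.9), ("AMR 12.2 kbit/s", "spk2", 4.1),
    ("AMR 12.2 kbit/s", "spk3", 4.0),
    ("AMR 4.75 kbit/s", "spk1", 3.1), ("AMR 4.75 kbit/s", "spk2", 3.3),
    ("AMR 4.75 kbit/s", "spk3", 2.9),
]

# Collect all scores belonging to one condition, regardless of speaker
per_condition = defaultdict(list)
for condition, _speaker, score in results:
    per_condition[condition].append(score)

# Average across speakers and sentences to get the per-condition MOS
for condition, scores in per_condition.items():
    print(f"{condition}: {sum(scores) / len(scores):.2f}")
```

Averaging over speakers and sentences in this way suppresses the influence of any single voice or utterance, so the remaining differences reflect the system under test.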
Full-reference vs. no-reference assessment
[Figure: a human viewer judges quality against an internal reference (‘expectation’); objective methods either require a reference signal (double-ended, full-reference) or require no reference signal (single-ended, no-reference)]
By separating the results into such sub-sets it is much easier to see the true reasons for
differences in quality. They might, for example, be caused by a higher ratio of Half Rate
codec usage for a certain provider or in a particular region.
[Figure: histogram of the MOS-LQ distribution for Network B, with 0.1-wide bins from 1.0 to 5.0 on the x-axis and the share of calls (0% to 35%) on the y-axis]
VQE-system [1] or PSTN gateway. A deeper analysis of the core network settings may help
improve the quality here in ‘Network B’.
So how can a statistical analysis help you to understand these phenomena and how can
they be easily described?
To begin with, we can look at the median [2] value instead of the average (mean). Whilst the
average value is better on ‘Network B’, the median value rates ‘Network A’ more highly.
Another method of analysis might be to look at the ratio of GoB (good-or-better, i.e. all
scores above 3.5) or PoW (poor-or-worse, i.e. 2.5 or lower). The peak value may also be
useful here (at least in a visual observation of the distribution function), as well as
the 90th percentile value [3].
The following table shows how these different statistical measures compare for the
chosen example:
Network A Network B
You can see that the average value, as well as the GoB and PoW values, is heavily affected by
the occurrence of very low values, as in ‘Network A’. The “unproblematic” measurements
score very highly, so the core network settings seem to be well chosen. This can be
seen best in the 90th percentile or the median (which is the same as the 50th percentile).
By fixing the problematic cases the overall quality becomes excellent, since the core
network is already well set up.
‘Network B’, on the other hand, displays a systematically low quality. Although it
performs well on average, it can be shown (e.g. by the 90th percentile) that a certain
quality level is never exceeded and that this threshold is lower than that of the competing
network. With some optimisation in the core network, this reachable quality could be
increased and the resulting quality could become excellent, given the good radio
performance exhibited by the network.
This is a simple example. However, you can see that by dividing data collections by
technology, codec type and rate, uplink/downlink, region and similar separators,
almost all data collections can be analysed using a variety of statistical tools. This
delivers a much richer view of the network and a more useful set of measures for network
optimisation than could be obtained by simple averaging of measurements alone.
[1] VQE stands for Voice Quality Enhancement and covers noise suppression, echo cancellation and other
similar components.
[2] The median is the value below which 50% of the scores lie and above which the other 50% lie.
[3] A percentile value is the value below which a certain proportion of the scores lie. For example, 90% of all
scores lie below the 90th percentile value.
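The statistical measures discussed above (mean, median, GoB/PoW ratios, percentiles) can be sketched in a few lines of Python. The per-call MOS-LQ values here are randomly generated and purely illustrative, not measurements from a real network:

```python
import random

def percentile(scores, p):
    """Value below which p% of the scores lie (nearest-rank method)."""
    s = sorted(scores)
    k = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[k]

def summarise(scores):
    """Mean, median, GoB/PoW ratios and 90th percentile of a score set."""
    n = len(scores)
    return {
        "mean": sum(scores) / n,
        "median": percentile(scores, 50),
        "GoB": sum(1 for s in scores if s > 3.5) / n,   # good-or-better
        "PoW": sum(1 for s in scores if s <= 2.5) / n,  # poor-or-worse
        "p90": percentile(scores, 90),
    }

random.seed(0)
# Purely illustrative per-call MOS-LQ samples for one network
network = [round(random.uniform(1.0, 4.6), 2) for _ in range(500)]
stats = summarise(network)
print({k: round(v, 2) for k, v in stats.items()})
```

Running the same summary per sub-set (technology, codec, region) reproduces the kind of comparison made between ‘Network A’ and ‘Network B’ above.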
5 Conclusion
The prediction of MOS values allows for a deep insight into a network’s behavior from
the user’s perspective. It delivers valuable information for both optimisation and
benchmarking and complements traditional L1 to L3 measurements.
However, MOS is only one way – albeit the most popular way – of describing a
network’s quality. Much more information for problem diagnostics can be obtained
from the transmitted media stream.
SwissQual’s measurement suite, SQuad™, delivers MOS values and diagnostics for
speech-based services, whereas VQuad™ does the same for video services.