
The dangers of Likert scale data

February 18, 2014 · advanced stuff · interval data, Likert, ordinal data

Imagine that you want to compare two products A and B and you ask the opinions of 100 users
via a survey. The table below shows a summary of the survey and the responses. The numbers
under product A and product B show the number of people who gave each of the responses on
the left-hand side.

This is known as a Likert scale and this post will give some thoughts on how to analyse these
data.

The first thing worth mentioning is that there is a simple form of analysis that is relatively
uncontentious. This is to say that 60% of people were very satisfied or quite satisfied with
product A, whereas only 45% of people were similarly very satisfied or quite satisfied with
product B. On the one hand this is simple. However, can we use this analysis to say that product
A is better than product B? Note one problem straight away: 20% of people were very
dissatisfied or quite dissatisfied with product A, whereas only 15% of people were similarly
very dissatisfied or quite dissatisfied with product B. It seems that product A tends to polarise
opinion, and it is not clear what conclusions can be drawn.

However, quite often we assign numbers to the categories (such as 5 = very satisfied, 4 = quite
satisfied, 3 = neutral, 2 = quite dissatisfied, and 1 = very dissatisfied). When this is done we
can produce a number for each participant's response and average these numbers to produce the
mean values shown in the figure above. According to this, the average response to product A is
3.6 and to product B is 3.5. Can we now use these numbers to make the following two
statements: (1) that product A is better than product B (since 3.6 is bigger than 3.5), and (2) that
both products A and B are well received by the participants (since 3.6 and 3.5 are both bigger
than 3)? What I want to do in this post is discuss the validity of these statements by considering
several aspects of Likert scales.
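The original table is not reproduced in this post's text, so the counts below are hypothetical: they are chosen only so that they match the summary statistics quoted above (60%/45% in the top two categories, 20%/15% in the bottom two, and means of 3.6 and 3.5 with 100 respondents each). A minimal sketch of the coding-and-averaging step, under that assumption:

```python
# Hypothetical response counts, chosen to be consistent with the summary
# statistics quoted in the text; the original table is not reproduced here.
# Keys are the assigned codes: 5 = very satisfied ... 1 = very dissatisfied.
counts_a = {5: 30, 4: 30, 3: 20, 2: 10, 1: 10}  # product A, n = 100
counts_b = {5: 25, 4: 20, 3: 40, 2: 10, 1: 5}   # product B, n = 100

def likert_mean(counts):
    """Mean score, implicitly treating the 1-5 codes as interval data."""
    n = sum(counts.values())
    return sum(score * k for score, k in counts.items()) / n

print(likert_mean(counts_a))  # 3.6
print(likert_mean(counts_b))  # 3.5
```

Note that `likert_mean` already embodies the contentious assumption discussed next: multiplying the category codes by their counts treats the codes as if they were measurements.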

Is it valid to average the numbers?

There is a long-running dispute about whether it is valid to average the scores to produce the
mean values as in the table above. To explore this we need to introduce two types of data. The
first type is ordinal data: data that capture only the order in which things stand. The Likert scale
presented in the table above strictly produces ordinal or rank data. Imagine that three people,
Alan, Brian and Clive, run a race in which Alan wins, Brian is second, and Clive is third.
Knowing the order in which they finished is fine, but it doesn't tell us whether Alan finished
well ahead of the other two or whether, for example, Alan and Brian were involved in a close
finish with Clive a long way behind. If, however, we know how many seconds they took to
complete the race (Alan = 40 seconds, Brian = 41 seconds, and Clive = 52 seconds) we know
much more about the race: it turns out that Clive was a long way behind the other two. The race
times, in seconds, are called interval data. With interval data the differences between the
numbers are meaningful, whereas with ordinal (rank) data they are not.
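The race example can be put in code form; the point is that the ranks alone cannot distinguish a close finish from a runaway win:

```python
# Ordinal (rank) data: only the finishing order is known.
ranks = {"Alan": 1, "Brian": 2, "Clive": 3}

# Interval data: race times in seconds; here the differences are meaningful.
times = {"Alan": 40, "Brian": 41, "Clive": 52}

# The ranks show identical one-place gaps, but the times tell the real story:
print(times["Brian"] - times["Alan"])   # 1 second: a close finish
print(times["Clive"] - times["Brian"])  # 11 seconds: a long way behind
```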

The problem with a Likert scale is that the scale [of very satisfied, quite satisfied, neutral, quite
dissatisfied, very dissatisfied, for example] produces ordinal data. We know that very satisfied is
better than quite satisfied and quite satisfied is better than neutral, but is the difference between
very satisfied and quite satisfied the same as the difference between quite satisfied and neutral?
Why am I worrying about this? Because when we assign numbers to the scale (the 1-5 numbers)
and then average the responses we are implicitly making the assumption that the scale items are
evenly spaced. We are treating the ordinal data as interval data. How can we be sure that the
participants treated the scale in this way? Would it have made a difference if we had used
satisfied and dissatisfied instead of quite satisfied and quite dissatisfied respectively? It would
seem, then, that it is wrong to calculate means from Likert scales. A post by Achilleas
Kostoulas, a PhD student at the University of Manchester, states categorically that it is wrong to
compute means from Likert scale data. I choose this example because it is simply and elegantly
explained, not because I necessarily agree entirely with his view. It is also worth reading the
article by Elaine Allen and Christopher Seaman in Quality
Progress (2007) who also take the view that Likert scale data should not be treated as interval
data. Interestingly they also suggest some other techniques that don’t suffer from the ‘ordinal-
data’ problem; for example, using slider bars to get a response on a continuous scale. However,
before you give up detailed analyses of Likert scale data I would urge you to read the paper by
Susan Jamieson called Likert scales: how to (ab)use them in Medical Education (2004: 38, 1217-
1218). Although Jamieson is also broadly against treating Likert scale data as interval data
she does present the other side of the argument. In another paper, in Advances in Health Sciences
Education, Norman (2010, 15 (5), 625-632) argues that the concerns about Likert scales are not
serious and we should happily use means and other parametric statistics.

How much bigger do two averages need to be for an effect?

In the table at the start of this article product A and B receive scores of 3.6 and 3.5 respectively.
The paragraphs above explain that calculating these means may not be valid. However, assuming
that we do calculate means in this way, how different would the mean scores for product A and
B need to be for us to conclude that A was better than B? I have come across students (normally
in vivas) who would simply state that A is better than B because 3.6 > 3.5. To those students I
then would say, would you still take that view if instead of 3.6 and 3.5 it was 3.51 and 3.5? What
if it is 3.50001 and 3.5? Would they still maintain that A is better than B? It is clear that we need
to consider variance and noise and carry out a proper statistical test to conclude whether 3.6 is
significantly greater than 3.5. The appropriate test is Student's t-test, and anyone can be taught
to perform one using Microsoft Excel in a matter of minutes. In the example at the start of this
article it turns out that there is no statistically significant difference: we cannot conclude that
product A is received better than product B.
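As a sketch, assuming per-respondent scores consistent with the summary quoted above (means 3.6 and 3.5, n = 100 each; the real raw data are not shown in this post), the two-sample Student's t-statistic can be computed with the Python standard library alone:

```python
import math
import statistics

# Hypothetical raw scores, expanded from counts consistent with the text
# (means 3.6 and 3.5); the actual survey responses are not reproduced here.
scores_a = [5] * 30 + [4] * 30 + [3] * 20 + [2] * 10 + [1] * 10
scores_b = [5] * 25 + [4] * 20 + [3] * 40 + [2] * 10 + [1] * 5

def two_sample_t(x, y):
    """Two-sample Student's t-statistic (pooled, equal-variance form)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se

t = two_sample_t(scores_a, scores_b)
print(f"t = {t:.2f}")
```

With these illustrative data |t| comes out well below the two-tailed 5% critical value of about 1.97 for 198 degrees of freedom, matching the conclusion above that the difference is not significant. (Excel's T.TEST with type 2, two-sample equal variance, gives the corresponding p-value directly.)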

However, can we conclude that both products are received favourably? Again, we need a
statistical test. It turns out that in this case, both 3.6 and 3.5 are statistically greater than 3 and we
can at least conclude that products A and B are received favourably. However, there is the caveat
that this assumes that we can treat the Likert scale data as interval data in the first place.
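Testing whether each mean is significantly greater than the neutral point of 3 is a one-sample t-test. A sketch using the same hypothetical scores as above (the real responses are not shown in this post):

```python
import math
import statistics

# Hypothetical scores consistent with the quoted means of 3.6 and 3.5.
scores_a = [5] * 30 + [4] * 30 + [3] * 20 + [2] * 10 + [1] * 10
scores_b = [5] * 25 + [4] * 20 + [3] * 40 + [2] * 10 + [1] * 5

def one_sample_t(x, mu):
    """One-sample t-statistic against a hypothesised mean mu."""
    n = len(x)
    return (statistics.mean(x) - mu) / (statistics.stdev(x) / math.sqrt(n))

print(f"A vs 3: t = {one_sample_t(scores_a, 3):.2f}")
print(f"B vs 3: t = {one_sample_t(scores_b, 3):.2f}")
# Both statistics comfortably exceed the ~1.98 critical value for df = 99,
# consistent with concluding that both products are received favourably.
```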

Other considerations
An interesting question is whether we should use 5-point scales at all. Would we get different
results if we used a 7-, 9- or 11-point scale? I have found one website that suggests that a 7-point
scale is better than a 5-point scale but not by much. A paper by Dawes in the International
Journal of Market Research (2008: 50 (1)) looked at 5-, 7- and 10-point scales and concluded
that the results from a 10-point scale would be different from those of a 5- or 7-point scale
(after suitable normalisation).

Odd-number scales (with a neutral point) are almost always used. However, a paper by Garland
(Marketing Bulletin, 1991: 2, 66-70) suggests that using a four-point scale (removing the
neutral point) might reduce the social desirability bias that comes from respondents wanting to
please the interviewer. I am not sure what current thinking is on this matter, though, and I
would normally use odd-number scales.

I am not providing any definitive views on these points but rather raising awareness of issues. If
you want to use a Likert scale then these are issues you need to familiarise yourself with.

My view

I will confess to having treated Likert scale data as interval data and carried out parametric
statistics (statistics that use statistical parameters such as standard deviations). However, deep
down I know it is wrong. I am coming to the view that the best thing is not to use a Likert scale
at all. I think people often use this sort of scale because it seems simple. There are ways to
analyse data like these statistically, and I would refer readers to categorical judgement, a
well-used psychophysical technique. My colleague Ronnier Luo at Leeds University has used
this technique extensively for decades. However, it is far from simple to analyse the results. I
think there are better ways of obtaining information: using slider bars, allowing users to
indicate their view on a continuous scale between two extremes (e.g. between very satisfied and
very dissatisfied), is probably better, and I will encourage my students to use this technique in
the future.
