Running head: TEST REVIEW 1

Test Review: TOEFL iBT, IELTS

Yuanyuan Sun

Colorado State University



According to Bachman and Palmer (2010), people use language assessments to

collect information to help make decisions both inside and outside the classroom. Some

decisions may be very significant and serious, since they have consequences for the

individuals, institutions, organizations or even societies. Suryaningsih (2014) stated that the

higher the stakes of the test, the stronger the urge to engage in specific test preparation

practices. As an English learner and future English teacher, I have personally experienced

the power of high-stakes assessment to influence people's lives. Nowadays in China, there

are many English learners who want to pursue higher education in the U.S. To be

qualified candidates, they have to obtain scores on standardized and widely recognized

English language proficiency tests. In addition, they have to consider whether they have the

language ability to be academically successful in the United States (Arcuino, 2013).

For this paper, I review two large-scale language proficiency tests that are used for

admission purposes: the Test of English as a Foreign Language Internet-Based Test (TOEFL

iBT), administered by Educational Testing Service (ETS), and the International English

Language Testing System (IELTS) from the University of Cambridge Local Examinations

Syndicate (UCLES). Since they are both high-stakes proficiency tests, I want to focus on how

they are designed and how well they measure language ability. Moreover, I want to

investigate whether the information offered by the publishers demonstrates that the tests are

effective, reliable and valid to help with enrollment and admission decisions in higher

education in the U.S.


Test of English as a Foreign Language Internet-Based Test (TOEFL iBT)*

Publisher: TOEFL Services, Educational Testing Service (ETS), P.O. Box

6151, Princeton, NJ 08541-6151, USA; Phone: 609-921-9000; Fax:


Publication Date: Late 2005

Target Population: Nonnative speakers of English who want to study in higher

education institutions in North America

Cost: Fees vary by country. See

Currently $195 in the U.S.; 1761 in China


*Information in the overview is from Stoynoff (2009).

The TOEFL was first introduced in 1964, and it was a five-part multiple-choice test assessing

the English proficiency of non-native speakers of the language for academic placement in

U.S. colleges and universities. It was replaced by the three-section (Listening, Structure and

Written Expression, and Reading Comprehension) test in 1976. In 1986, the Test of Written

English was included in some but not all TOEFL administrations as a separate component. The computer-

based TOEFL was launched in 1998. After years of validation activities supporting the

design, score interpretation, and uses of the TOEFL, ETS released the iBT as the latest

version of the TOEFL. The official guide to the TOEFL test (ETS, 2012) points out three key

features of the iBT: it measures the speaking, listening, reading and writing skills that are important

for effective communication, particularly in academic settings; it reflects language use in

integrated tasks combining multiple skills in real academic settings; and it represents the most

popular current practices in language teaching and learning, such as integrated language

skills and the communicative approach. An extended description of the TOEFL iBT is

provided in Table 1 below.

Table 1

Extended Description of TOEFL iBT

Test Purpose: The TOEFL iBT test is designed to evaluate the English proficiency of
nonnative English speakers, and it is primarily used to measure
international students' English proficiency in an academic environment,
such as a university, where instruction is in English.
(TOEFL iBT test framework, 2010)
Test Structure: The entire test, with 4 sections, takes about 4 hours. It is administered via
the secure Internet network of testing centers.
Reading section (60-100 min) measures test takers' ability to understand
university-level academic materials. The section has 3-4 passages of
approximately 700 words each, with 12-14 questions per passage. There
are 3 question formats: traditional multiple-choice questions with four
choices and a single correct answer, questions asking test takers to insert
a sentence at the one of four places where it fits best in a passage, and
"reading to learn" questions with more than four choices and more than
one correct answer.
Listening section (60-90 min) assesses test takers' ability to understand
academic lectures and long conversations, including skills such as making
inferences. The section has 4-6 lectures, each 3-5 minutes long with 6
questions, and 2-3 conversations, each about 3 minutes long with 5
questions. Pictures accompany each listening passage to show test takers
the setting and the roles of the speakers. Note-taking is allowed. There are
four question formats: traditional multiple-choice questions, multiple-choice
questions with more than one answer, questions asking test takers to order
events or steps, and questions asking them to match objects or text to
categories in a chart.
Speaking section (20 min) includes six tasks in academic settings. The
first two independent speaking tasks require test takers to respond to
general topics familiar to them. Two integrated tasks require reading and
listening, and another two require listening, before speaking in response.
Writing section (50 min) includes two tasks. Integrated writing requires
test takers to read and listen to opinions on an academic topic while taking
notes and then write a summary. Independent writing asks for an essay that
states, explains and supports test takers' personal opinions on an issue.
(The official guide to the TOEFL test, 2012)
Scoring of Test: Each section is scored on a scale of 0-30. The total score ranges from 0 to 120.
There are two types of scoring in the reading and listening sections.
Questions with a single correct answer are scored on a correct/incorrect
basis. Questions with more than one correct answer allow partial credit.
Speaking tasks are rated by 3-6 different certified raters on a scale of
0-4 according to the rubrics. Raters look for features of delivery,
language use and topic development in responses. Writing tasks are rated
on the 0-5 rubrics by 2 certified raters. Raters focus on
development, organization, content and language use to evaluate the
quality of writing. The average of the scores on the speaking and
writing tasks is converted to a scaled score of 0 to 30.
The final score report includes detailed performance feedback.
(The official guide to the TOEFL test, 2012)
Statistical Distribution of the Scores for Normed Group: Sawaki, Stricker, and
Oranje (2009) reported a survey conducted from 2003 to 2004, with participants
from 31 countries accounting for about 80% of the 2001-2002 TOEFL testing
volume. Summary statistics for TOEFL iBT scores (study sample and TOEFL
2002-2003 candidates) are listed below:
             Mean     Std. Dev.
Reading      17.04    6.99
Listening    16.98    6.95
Speaking     16.97    6.98
Writing      16.05    6.67
Total        67.04    24.58
Standard Error of Measurement: Reliability and Comparability of TOEFL iBT Scores
(2011) provides a table showing the SEM for total and section scores, as below:
             SEM
Reading      3.35
Listening    3.20
Speaking     1.62
Writing      2.76
Total        5.64
Evidence for Reliability: ETS used data from operational tests in 2007 to produce
section and total score reliability estimates. The reliability estimates for the
Reading, Listening, Speaking, Writing and Total scores are 0.85, 0.85, 0.88, 0.72
and 0.94, respectively. As the data show, though the reliability of the Writing
score is comparatively lower, the reliability estimates for the Reading,
Listening, Speaking and Total scores are relatively high. As ETS claimed in the
report, for making high-stakes decisions such as admission to graduate or
undergraduate school, the total score, which reflects all four skills, is the
most important to consider, and its reliability is the highest in the data.
Meanwhile, ETS's report also supports high alternate-form reliability, since
Zhang's research in 2008 (cited in Reliability and Comparability, 2011)
shows high correlations between the scores of 12,000 test repeaters (0.77 for the
listening and writing sections, 0.78 for reading, 0.84 for speaking, and
0.91 for the total test score) on two iBT tests they took within a month.
(Reliability and Comparability, 2011)
Evidence for Validity: Stoynoff (2009) listed some evidence to justify the TOEFL iBT's
construct validity. The iBT listening section, compared to the CBT, includes
more and longer spoken discourse and new item types (complex tasks) to
assess important listening constructs. The writing construct is also defined
more broadly in the iBT to include pragmatic considerations, complex
tasks with integrated skills, and additional rhetorical functions, which
help to increase the authenticity of tasks. Stoynoff (2009) also presents
research that Rosenfeld and his colleagues conducted in 22 North
American universities in 2001. Their corpus analysis data, informing the
design and content of input used to assess listening comprehension in
undergraduate and graduate courses, indicated that the content of the iBT
test approximates what could be encountered in the target language use
domain.
In the publication Validity Evidence Supporting the Interpretation and
Use of TOEFL iBT Scores (2008), produced by ETS, the evidence for
valid score interpretation and use is provided. ETS claims that the TOEFL
iBT score is a reliable indicator of whether a student has sufficient
English-language proficiency for study at an English-medium college or
university. This is shown by the relationship between TOEFL iBT test
scores and other measures or criteria of language proficiency, such as
self-assessment, academic placement and local institutional tests for
international teaching assistants. The study statistics can be found in the
publication, all supporting the TOEFL iBT as a valid indicator of learners'
proficiency. For example, regarding self-assessment, over 2,000 test
takers in 2003-2004 completed a questionnaire that indicated
how well the test takers agreed with a series of "can do" statements
covering a mix of simple and complex speaking tasks, such as "My
instructor understands me when I ask a question in English" and "I can
talk about facts or theories I know well and explain them in English." The
results showed that test takers with higher scores were more likely to
indicate that they could do more complex tasks.
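
The SEM values in Table 1 can be read as an approximate uncertainty band around a reported score. The sketch below is a minimal illustration, assuming the common classical-test-theory reading that an observed score plus or minus one SEM covers roughly 68% of the scores a test taker would obtain on repeated testing. The function name score_band and the clamping to the reporting scale are illustrative choices, not part of the ETS report.

```python
# SEM values as listed in Table 1 (Reliability and Comparability of TOEFL iBT Scores, 2011).
SEM = {"Reading": 3.35, "Listening": 3.20, "Speaking": 1.62, "Writing": 2.76, "Total": 5.64}

# Reporting scales from Table 1: 0-30 per section, 0-120 for the total.
SCALE_MAX = {"Reading": 30, "Listening": 30, "Speaking": 30, "Writing": 30, "Total": 120}


def score_band(section, observed, n_sem=1.0):
    """Return (low, high): the observed score +/- n_sem * SEM, clamped to the scale.

    Hypothetical helper for illustration; n_sem=1.0 is an assumed default.
    """
    half_width = n_sem * SEM[section]
    low = max(0.0, observed - half_width)
    high = min(float(SCALE_MAX[section]), observed + half_width)
    return round(low, 2), round(high, 2)


print(score_band("Total", 80))     # band around a total score of 80
print(score_band("Speaking", 29))  # upper end is clamped to the 30-point scale
```

Read this way, a total score of 80 is best treated as "somewhere between about 74 and 86", which suggests caution when comparing applicants whose totals differ by only a few points.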

International English Language Testing System (IELTS)*

*Abbreviated from O'Sullivan (2005), with updated information from the IELTS Handbook (2007).

Publisher: University of Cambridge ESOL Examinations, the British Council,

and IDP: 1 Hills Road Cambridge, CB1 2EU United Kingdom; Tel:

44 1223 553355; Fax: 44 1223 460278; Email:

Publication Date: 1989 (introduced as ELTS in 1980-1981)

Target Population: Students for whom English is not a first language and who wish to

work or attend university in an English-speaking country

Cost: Fees vary greatly by location of test center; see

Currently the cost ranges from $215 to $240 in the U.S.; 1850 in



According to Chalhoub-Deville and Turner (2000), IELTS is administered by UCLES and

provides two test routes. IELTS Academic is for people seeking higher education at the

undergraduate or graduate level, and IELTS General Training is for people who plan to work,

receive work-related training or need language proficiency evidence for immigration

purposes in an English-speaking environment. The test routes share the same listening and

speaking tests. An extended review for the IELTS Academic test is provided in Table 2:

Table 2

Extended Description of IELTS Academic

Test Purpose: The test assesses whether test-takers are ready to study or train
in an English-speaking environment, and reflects the features of
language used in academic study at an undergraduate or
postgraduate level. It is also designed for admission purposes to
undergraduate or postgraduate courses.
(IELTS Handbook, 2007)
Test Structure: IELTS Academic includes Listening, Reading, Writing and
Speaking tests. Paper-based IELTS tests are offered by all test
centers. IELTS is approximately 2 h and 45 min long in total.
The listening test (30 min) has 4 sections (2 about social needs,
2 about situations in educational or training contexts) with 40
questions including: multiple choice; short answer; sentence
completion; note/summary/flow-chart/table completion;
labelling a diagram; classification; matching. Listening excerpts,
including conversations and monologues, are played only once.
The reading test (60 min) has 3 reading passages (2,000-2,750
words in total) with 40 questions including multiple choice,
short-answer questions, sentence completion,
note/summary/flow-chart/table completion, labelling a diagram,
matching headings for identified paragraphs/sections of the text,
identification of writer's views/claims, identification of
information in the text, classification, and matching lists/phrases.
Texts are of general interest, taken from magazines, journals,
books and newspapers.
The writing test (60 min) includes 2 tasks. Task 1 asks test-takers
to describe information in a graph/table/chart/diagram. Task 2
requires test-takers to discuss an argument, view or problem
concerning an issue of general interest.
The speaking test (11-14 min) is an interview including 3 parts
for test-takers to complete. Part 1 is a basic introduction and
answering spoken questions on familiar topics. Part 2 is speaking
on a topic based on written input (general instruction and
content-focused prompt). Part 3 is an interactive discussion on the
Part 2 topic.
(IELTS Handbook, 2007)
Scoring of Test: Test-takers are assessed on a Band Scale from 1-9. Each test
component is assessed individually, and the individual scores
are then averaged and rounded to get an overall band score
reported in whole and half bands.
One mark is awarded for each correct answer in the 40-item
listening and reading tests. Scores out of 40 are then translated
into the IELTS 9-band scale based on a confidential Band Score
conversion table that is produced for each version of the
listening and reading test.
In the writing test, Task 2 carries more weight than Task 1. Writing
responses are graded by certificated IELTS examiners. Raters
grade according to detailed written performance descriptors at
the nine bands. The criteria are coherence and cohesion, lexical
resource, grammatical range and accuracy, and task achievement
(Task 1)/task response (Task 2).
The speaking test is assessed by certificated examiners based on
detailed spoken performance descriptors at the nine bands. The
criteria are: fluency and coherence, lexical resource,
grammatical range and accuracy, pronunciation.
A final Test Report Form reports both individual test Band scores
and an Overall Band Score. Test takers are also provided with a
descriptive statement of language proficiency.
(IELTS Handbook, 2007)
Statistical Distribution of the Scores for Normed Group: The Test performance
2015 report released on the official IELTS website offers statistics for the
IELTS listening and reading modules, listed below (writing and speaking
statistics are not reported on the site):
Module              Mean    SD
Listening           6.10    1.3
Academic Reading    6.02    1.2
Standard Error of Measurement: Test performance 2015 on the official IELTS website
reported SEM statistics for the IELTS listening and reading modules used in 2015.
SEM statistics for writing and speaking are not reported:
Module              SEM
Listening           0.37
Academic Reading    0.38
Evidence for Reliability: Reliability estimates were reported for the Listening and
Reading test modules in the online report of test performance 2015,
as listed below:
Module              Alpha
Listening           0.92
Academic Reading    0.90
The Cronbach's alpha values indicate high reliability estimates
for the internal consistency of the 40-item tests. According to
O'Sullivan (2005), the evidence above seems to indicate that IELTS
measures the abilities it claims to be measuring quite validly and
reliably. The official test performance report doesn't report the
reliability of the writing and speaking modules in the same
manner as for listening and reading, because they are not item-based.
The publisher of IELTS also claimed in the same report on the
official website that the reliability of rating, particularly
rater reliability, is assured by the face-to-face training
and certification of examiners, and that all examiners must undergo
a retraining and recertification process every two years. The
publisher also claimed that the "jagged profile" system,
combined with targeted sample monitoring such as routinely
rechecking recordings of the Speaking tests, helps to
monitor and identify disqualified examiners or possibly faulty
ratings, which helps to maintain the global reliability of IELTS.
Evidence for Validity: O'Sullivan (2005) pointed out that since IELTS seems to
test the four skills with equal importance, there may be positive
washback on the language teaching and learning of test takers.
Additionally, test takers and many teachers seem to agree that a
wide-ranging test of language ability shows a great degree of
face validity. Moreover, the opportunities IELTS provides
let test takers display their full range of language
ability in situations that approximate the academic, training
or work environment in the real world.
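
The overall band calculation described under Scoring of Test in Table 2 can be sketched as follows. This is a minimal illustration, assuming the rounding convention that an average ending in .25 is rounded up to the next half band and one ending in .75 up to the next whole band. Table 2 states only that the averaged score is rounded and reported in whole and half bands, so the exact tie rule here is an assumption, and overall_band is a hypothetical helper name.

```python
import math


def overall_band(listening, reading, writing, speaking):
    """Average four component bands and round to the nearest half band.

    Assumption for illustration: ties (averages ending in .25 or .75) round up.
    """
    mean = (listening + reading + writing + speaking) / 4
    # Doubling turns half bands into integers; floor(x + 0.5) rounds half up.
    return math.floor(mean * 2 + 0.5) / 2


print(overall_band(6.5, 6.5, 5.0, 7.0))  # component mean is 6.25
print(overall_band(6.0, 6.5, 6.0, 6.0))  # component mean is 6.125
```

Because component bands are multiples of 0.5, their mean is always a multiple of 0.125, so the doubling step works with exact floating-point values.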


The teaching context I envision is an EFL language course provided by a private

Intensive English Program in China. The student population is 10 to 20 nonnative English

speakers whose L1 is Chinese, and they are preparing for a language proficiency test so that

they can pursue admission to higher education institutions in the U.S. Students may have

different education backgrounds, but all of them have at least obtained a junior high school

diploma, and they are working on enrollment in undergraduate or graduate school in the U.S.

Furthermore, students in this class may have been assessed by the IEP to be placed into the

same class. Their English proficiency levels may vary but most students range from

intermediate to advanced. Since all the students are learning English for admission to higher

education, which may very possibly influence their future lives, they are very motivated to go

to class and achieve class objectives. The class materials offered by the IEP should target the

specific English proficiency test. Among the many large-scale proficiency tests available in China, as the

instructor, I am expected to recommend the proficiency test that best serves students' interests.

TOEFL and IELTS are the most popular and widely accepted English

proficiency tests in China. According to Arcuino (2013), TOEFL scores are recognized by

over 8,500 agencies and educational institutions in over 130 countries, while the IELTS is

administered in over 135 countries and the scores are accepted by over 7,000 educational

institutions around the world. The IEPs which prepare the student group I described for

English proficiency tests usually choose TOEFL or IELTS to set their curriculum.

Considering the students' goal of obtaining English language proficiency scores to be granted

admission to universities in the U.S., either assessment should work. However,

considering the context and the test review, I think TOEFL is more appropriate than IELTS.

First of all, in terms of reliability, according to Stoynoff (2009), ETS uses multiple

raters to evaluate examinees' performance on the constructed-response Speaking and

Writing tasks, and examiners' assessments are monitored daily. All these procedures help to

maintain the consistency of both iBT section scores and total scores. On the other hand,

though IELTS claims to have sufficient rater reliability, Chalhoub-Deville and Turner

(2000) stated that the information to support the reliability of IELTS is lacking and limited,

and IELTS publications need to provide more documentation of rater reliability, the

reliability of the instrument, and the ensuing scores.
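
The reliability and SEM figures reported in the review tables are linked in classical test theory by the relation SEM = SD x sqrt(1 - reliability). The sketch below applies that standard formula to the IELTS values in Table 2 as a consistency check. It is an illustration only, under the assumption that the published SEMs were derived in this way from the same samples, which the test performance report as summarized here does not state; sem_from_reliability is a hypothetical helper name.

```python
import math

# (SD, Cronbach's alpha, reported SEM) as listed in Table 2 for 2015.
IELTS_2015 = {
    "Listening": (1.3, 0.92, 0.37),
    "Academic Reading": (1.2, 0.90, 0.38),
}


def sem_from_reliability(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


for module, (sd, alpha, reported) in IELTS_2015.items():
    estimate = sem_from_reliability(sd, alpha)
    print(f"{module}: estimated SEM {estimate:.2f}, reported SEM {reported:.2f}")
```

For both IELTS modules the estimate agrees with the reported SEM to two decimal places. The same check is less informative for Table 1, because the TOEFL standard deviations come from the 2003-2004 study sample while the reliability estimates come from 2007 operational data.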

In terms of validity, both proficiency tests incorporate integrated skills into their tasks.

Sawaki et al. (2009) indicate that integrated tasks in TOEFL iBT reflect the complex

context of language use in academic settings. The integrated tasks require students to

process language in multiple modalities to fulfill specific purposes such as reading a

textbook to prepare for a lecture, listening to the lecture and writing down key points of the

lecture for future reference (Sawaki et al., 2009). This would benefit students greatly once

they enter a real academic setting in the U.S. As for IELTS, however, as shown in the test

review table above, many tasks focus on topics of general interest. Additionally, according

to Chalhoub-Deville and Turner (2000), IELTS has been more commonly used in the UK and

Australia. Research seems to be needed to determine whether the scores obtained from IELTS are

appropriate measures of academic language use in North America. So far, it is

difficult to be certain of what the ensuing test scores mean and how they should be used in

academic contexts in the U.S. Since the students in my classroom aim to get into

and succeed in U.S. classrooms, TOEFL would be the better choice in this case.

According to Stoynoff (2009), one noteworthy feature of the IELTS shows better

authenticity and interactiveness than TOEFL: the speaking test takes

the form of an interview with an interlocutor, in contrast to the semi-direct assessment

method used in TOEFL iBT. However, TOEFL iBT also uses pictures accompanying tasks,

which, to some degree, may help with students' performance and engage them in authentic



Arcuino, C. L. T. (2013). The relationship between the test of English as a foreign language

(TOEFL), the international English language testing system (IELTS) scores and

academic success of international master's students (Doctoral dissertation, Colorado

State University). Available from Dissertations & Theses @ Colorado State University;

ProQuest Dissertations & Theses Global: Literature & Language; ProQuest

Dissertations & Theses Global: Social Sciences. (1413309058). Retrieved from


Bachman, L., & Palmer, A. (2010). Language assessment in practice. New York, NY:

Oxford University Press.

Chalhoub-Deville, M., & Turner, C. E. (2000). What to look for in ESL admission tests:

Cambridge certificate exams, IELTS, and TOEFL. System, 28(4), 523-539.

ETS. (2012). The official guide to the TOEFL test (4th ed.). New York, NY: McGraw-Hill


IELTS Handbook. (2007). Pasadena, CA: IELTS International.

O'Sullivan, B. (2005). International English Language Testing System (IELTS). In S.

Stoynoff & C. Chapelle (Eds.), ESOL tests and testing (pp. 73-78). Alexandria, VA: TESOL.

Reliability and comparability of TOEFL iBT scores. (2011). Insight: TOEFL iBT Research,

1(3), 1-8.

Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL internet-

based test. Language Testing, 26(1), 5-30.

Stoynoff, S. (2009). Recent developments in language assessment and the case of four large-

scale tests of ESOL ability. Language Teaching, 42(1), 1-40.

Stoynoff, S., & Chapelle, C. (2005). ESOL tests and testing. Alexandria, VA: TESOL.

Suryaningsih, H. (2014). Students' perceptions of international English language testing

system (IELTS) and test of English as a foreign language (TOEFL) tests (Doctoral

dissertation, Indiana University of Pennsylvania). Available from ProQuest

Dissertations & Theses Global: Literature & Language; ProQuest Dissertations &

Theses Global: Social Sciences. (1534350812). Retrieved from https://search-proquest-

TOEFL iBT test framework and test development. (2010). Insight: TOEFL iBT Research,

1(1), 1-9.

Validity evidence supporting the interpretation and use of TOEFL iBT scores. (2008).

Insight: TOEFL iBT Research, 1(4), 1-11.

UCLES. (n.d.) Test performance 2015. Retrieved from