The purpose of the Beck Depression Inventory-II (BDI-II), created by Aaron T. Beck, is to

assess the severity of depression. It is a self-administered, 21-item multiple-choice test in which

each item is scored from 0 to 3. Higher total scores indicate more severe depression. Overall, the BDI-II

is a good revision compared with its two previous versions and is in better accordance with the

Standards for Educational and Psychological Testing (AERA et al., 2014). It meets most of the

standards with regard to its purpose, instructions, scoring, reliability, and validity. However, it does not

fully comply with some standards regarding its sampling and testing environment. To

improve the test, we suggest that the BDI-II provide more guidance for cases in which participants

score at or near the cutoffs and revise items to be more applicable to younger participants.


The Beck Depression Inventory (BDI) is a self-report inventory that is widely used for

measuring the severity of depression. The purpose of the inventory, as stated by Beck when he first

developed the scale, was to provide a quantitative measurement of the depth of depression (Beck,

1967). More specifically, the inventory score can be used for screening, clinical diagnosis, and

psychopathology research. Stating these purposes complies with Standards 4.1 and 4.2 (AERA

et al., 2014), and the purposes are essential to our assessment of the test's construct validity.

There are multiple versions of this inventory. The first version, which is known as the

original BDI, was published in 1961 in the Archives of General Psychiatry (Beck, Ward, Mendelson, Mock,

& Erbaugh, 1961). The first revision, which aimed to improve the wording of items, began in 1971

and was published as the BDI-IA in 1978 by Guilford Press (Beck, Rush, Shaw, & Emery,
1979). The second revision, known as the BDI-II, was published in 1996 by The

Psychological Corporation and contained substantive revisions that addressed some problems of

previous versions (Beck, Steer, & Brown, 1996). In addition to these three versions, Beck also

designed the BDI for Primary Care (BDI-PC), a short screening scale that produces only a

binary outcome. Our review focuses on the BDI-II and is primarily based on the 38-page manual

published in 1996 by The Psychological Corporation (Beck, Steer, & Brown, 1996).

Development of BDI

The BDI, as an effective and dependable measure of depression, has been widely used in

clinical settings as well as in outcome studies of psychotherapy and antidepressant treatment

(Piotrowski & Keller, 1989; Piotrowski & Lubin, 1990). It has gone through a series of changes

in the diagnostic context and the instrument itself since its first publication in 1961 (Beck, Ward,

Mendelson, Mock, & Erbaugh, 1961).

The first revision began because some items of the original BDI failed to meet the

standards for anti-discrimination and raised gender inequality issues (Santor, Ramsay, & Zuroff,

1994). Therefore, a conceptual reassessment and a psychometrically informed revision of the

instrument were required. The BDI-IA, commonly referred to in the literature as

simply the BDI, was published in 1979. This version was similar to the original, except that

some items were reworded to avoid double-negative statements and the time frame was extended to

the past week, including today (Beck & Steer, 1987). This version did not achieve a proper

reassessment and left users dissatisfied. It failed to expand the scope of depression

symptoms covered by the BDI and to meet the contemporary diagnostic criteria for depression (Arbisi &
Farmer, 2001). In terms of both professional standing and popularity, the BDI-IA did not surpass the

original BDI, and a thorough revision of the BDI was still needed.

The BDI-II, a long-awaited overhaul of the classic BDI, was published in 1996. It not

only reflected contemporary diagnostic thinking but also overhauled the instrument's established standards. The BDI-II

retained the format of the original version: 21 items, each with four options

scored from 0 to 3. However, it dropped items (N) Body Image, (O) Work Inhibition, (S)

Weight Loss, and (T) Somatic Preoccupation from the BDI. Meanwhile, it added items on agitation,

worthlessness, loss of energy, and concentration difficulty. Furthermore, this version combined

increases and decreases in appetite in a single item, and likewise the symptoms of hypersomnia

and hyposomnia. The BDI-II asks respondents to rate their condition over the past two weeks,

which differs from the one-week rating period in the BDI. These changes are all

consistent with the DSM-IV (APA et al., 2014).

The BDI has been revised multiple times in response to new research data and significant

changes in the domain represented, which follows Standard 4.24 (AERA et al., 2014).

Additionally, the inventory is marked as revised only when it has been changed in a significant

way, which complies with Standard 4.25.

Test description

A. Administration time and target population

The test is a paper questionnaire that can be self-administered or administered verbally by a trained

administrator; administration is usually completed within 10 minutes. The target

population of the BDI-II includes adults and adolescents aged 13 years and older.
B. Items and scoring

The BDI-II consists of 21 self-report questions based on symptoms of depression listed in the

American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders,

Fourth Edition (DSM-IV), including (1) sadness, (2) pessimism, (3) past failure, (4) loss of

pleasure, (5) guilty feelings, (6) punishment feelings, (7) self-dislike, (8) self-criticalness, (9)

suicidal thoughts or wishes, (10) crying, (11) agitation, (12) loss of interest, (13) indecisiveness,

(14) worthlessness, (15) loss of energy, (16) changes in sleeping pattern, (17) irritability, (18)

changes in appetite, (19) concentration difficulty, (20) tiredness or fatigue, and (21) loss of

interest in sex. The BDI-II is a single-scale test with no subtests.

Test takers are asked to choose only one of the four statements, scored from 0 to 3, for

each question based on how they have felt over the past two weeks, and to choose the

statement with the highest score if two or more statements apply equally well. The instructions are

clear and are printed on the test form, which follows Standard 5.5, stating that instructions

should clearly indicate how to make responses and should cover any equipment likely to

be unfamiliar to test takers (AERA et al., 2014).

Total scores range from 0 to 63 and fall into four ranges: 0-13 is minimal depression, 14-19

is mild depression, 20-28 is moderate depression, and 29-63 is severe depression. The scoring

procedure is straightforward and can be carried out easily by test takers and interpreters. Beck also

states that patients scoring high on the depression inventory have had life experiences during the

developmental period that predispose them to react to stress later by the appearance of, or

exacerbation of, depressive symptomatology (Beck, 1961). These scoring instructions and this
statement explain the interpretation of the test scores, which is in compliance with Standard 3.11,

Standard 3.22 and Standard 5.8 (AERA et al., 2014).
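
The scoring rule described above (21 items, each scored 0 to 3, summed and compared against four ranges) can be illustrated with a minimal Python sketch. The function names and the example responses are our own hypothetical choices for illustration and are not part of the BDI-II manual; only the score ranges come from the text above.

```python
from typing import Sequence

# Severity ranges as listed in the text above: (lowest score, highest score, label).
SEVERITY_RANGES = [
    (0, 13, "minimal"),
    (14, 19, "mild"),
    (20, 28, "moderate"),
    (29, 63, "severe"),
]


def total_score(item_scores: Sequence[int]) -> int:
    """Sum 21 item responses, each of which must be 0, 1, 2, or 3."""
    if len(item_scores) != 21:
        raise ValueError("expected exactly 21 item responses")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item response must be 0, 1, 2, or 3")
    return sum(item_scores)


def severity_label(total: int) -> str:
    """Map a total score (0 to 63) to the severity range it falls in."""
    for low, high, label in SEVERITY_RANGES:
        if low <= total <= high:
            return label
    raise ValueError("total score must be between 0 and 63")


# Hypothetical responses: fourteen items answered 1, seven answered 0.
responses = [1] * 14 + [0] * 7
total = total_score(responses)
print(total, severity_label(total))  # 14 mild
```

Note that a one-point difference at a boundary (13 versus 14, for example) changes the label, which is why interpretation of scores near the cutoffs deserves care.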

Beck states in the manual for the BDI-II that the items and categories in this inventory

were primarily based on clinical experience. He selected a group of attitudes and symptoms that

appeared to be specific to depressed patients through systematic observation and recording of

these patients. Based on this selection, he constructed an inventory composed of attitudes and

symptoms that were consistent with descriptions of depression contained in the psychiatric

literature (Beck et al., 1996). However, in the manuals and documents we reviewed, there

is no specific documentation or evidence that explains why and how these items were included in

the inventory. The absence of documentation of the procedures used to develop and select items violates

Standard 4.7 (AERA et al., 2014). Additionally, the BDI-II, which was developed on the basis of

empirical relationships, lacks cross-validation to rule out the possibility that some items were

selected by chance. The absence of such statistical cross-validation violates Standard 4.11 (AERA et

al., 2014).

C. Materials and Testing Environment

The complete BDI-II kit is available for purchase from Pearson for about 133 USD and

includes 25 record forms and a manual, as required by Standard 6.1. The forms are available in

English and Spanish. No alternate form is provided or mentioned in the manual. The manual

contains specific procedures for administration and scoring, a short section on interpretation, and the

psychometric properties of the instrument. Beck suggests in the manual that scores are best

interpreted by professionals with appropriate clinical training and experience to reduce the

suicide risk associated with depression (Beck et al., 1996). Therefore, the test is "moderately
difficult to interpret based on the clinician's experience and interpretation considerations in the

manual" (Community-University Partnership for the Study of Children, Youth, and Families,

2011). Moreover, the manual says nothing about the testing environment, so noise and

disruption may occur during testing, which threatens compliance with Standard 5.4 (AERA

et al., 2014).

D. Norms and Samples

Two samples were collected to evaluate BDI-II scores. The first normative sample

included 500 outpatients diagnosed according to Diagnostic and Statistical Manual of Mental Disorders, Third Edition,

Revised (DSM-III-R) or Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV)

criteria at rural and suburban institutions in the US. The patient sample was used to analyze

psychometric characteristics of the test. The sample was 91% White, 4% African

American, 4% Asian American, and 1% Hispanic; 63% were women and 37% were men. The second

sample consisted of 120 college students, of whom 56% were

female and 44% male. The student sample served as a comparative normal group. The

creation of samples follows Standard 4.8 (AERA et al., 2014) as part of the test review process, but it

violates the requirement of Standard 4.9, which states that the sample(s) should be as

representative as possible of the population: the samples included only outpatients and college

students, which fail to represent the whole target population, and the psychometric data on the

BDI-II are mixed (Sharp & Lipsky, 2002).

The BDI-II has been used for many years with high reliability. Scores from several
samples are used as reliability estimates, and we examine the compliance of these estimates with
Standard 2.1 through Standard 2.20 of the Standards for Educational and Psychological Testing.
A. Internal Consistency
Estimates of internal consistency are evaluated with Cronbach's alpha. In the
normative samples, the BDI-II coefficient alphas are 0.92 for the outpatients (n=500) in the
sample referred to in the manual, and 0.93 for the college students (n=120) (Beck, Steer, &
Brown, 1996). Both the clinical and nonclinical coefficient alphas are above 0.80 and consistent with
the coefficient alpha of 0.91 reported for psychiatric samples (Beck, Steer, & Brown, 1996),
indicating high internal consistency.
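
For readers unfamiliar with the statistic, the following is a minimal sketch of how Cronbach's alpha is computed from a respondents-by-items table, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name and the toy data are hypothetical and serve only as illustration; they are not BDI-II data, and we do not know which software the test developers used.

```python
from statistics import pvariance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a table with one row per respondent, one column per item."""
    n_items = len(responses[0])
    if n_items < 2 or any(len(row) != n_items for row in responses):
        raise ValueError("need at least two items and rows of equal length")
    # Variance of each item (column) across respondents.
    item_variances = [
        pvariance([row[j] for row in responses]) for j in range(n_items)
    ]
    # Variance of the respondents' total scores.
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Toy data (hypothetical): four respondents answer three items identically,
# so the items are perfectly consistent with one another.
toy = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(round(cronbach_alpha(toy), 2))  # 1.0
```

Population variance is used for both the items and the totals; using sample variance for both would give the same ratio and therefore the same alpha.
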
Many replication studies corroborate these findings. In a study of African American
suicide attempters, the coefficient alpha of the BDI-II was 0.94 (Joe, Woolley,
Brown, Ghahramanlou-Holloway, & Beck, 2008). This is comparable to the estimates for psychiatric
samples (coefficient alpha 0.91; Beck, Steer, & Brown, 1996) and for primary care African
American patients (coefficient alpha 0.90; Dutton, Grothe, Whitehead, Kendra, & Brantley,
2014). Another study, of the Italian version of the Beck Depression Inventory-II, shows that
internal consistency values ranged from 0.80 to 0.87 (Ghisi, Flebus, Montano, Sanavio, &
Sica, 2006), which is also acceptable for research. These studies also found no
significant differences between test takers from various cultural backgrounds, supporting
the BDI-II's internal consistency reliability across ethnic groups.
However, there are also some flaws concerning internal consistency. First, the sampling
procedure used to select examinees is not described in the normative data, in partial violation of
Standard 2.4. Second, the test is designed for a wide age range, but depression levels or
interpretations may vary across age groups. Reliability data should be provided for
each age group, and their absence violates Standard 2.12 (AERA et al., 2014).

B. Test-Retest Reliability
The BDI-II test-retest reliability estimates are calculated with the intra-class correlation
coefficient, which measures both rank agreement and level agreement. A study by Beck and colleagues of 26
outpatients who had been referred for depression demonstrated high test-retest agreement. The patients
were tested at their first and second therapy sessions, one week apart (Beck, Steer, & Brown, 1996), and the
resulting correlation was 0.93.
The test developers comply with Standard 2.3, which requires reliability data for the
sample when the test interpretation emphasizes differences, by reporting the estimate of test-retest
reliability. Additionally, the cut-score interpretations of the BDI-II are used to make categorical
decisions. Repeated measurements are consistent, but estimates of the percentage of examinees who would be
classified the same way on repeated administrations are not provided, so the test does not completely meet Standard 2.15 (AERA et al.,
2014). Moreover, carry-over effects between the two administrations could affect estimates of test-retest
reliability, and this issue is not addressed in the BDI-II.
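
To illustrate why a high test-retest correlation does not by itself settle the classification question raised above, the sketch below computes both a correlation between two administrations and the proportion of examinees placed in the same severity range both times. It is a simplified illustration under stated assumptions: it uses Pearson's r as a stand-in rather than the intra-class correlation, the session scores are invented, and the function names are our own. Only the severity ranges are taken from the BDI-II.

```python
from math import sqrt
from typing import Sequence

# Upper bound of each BDI-II severity range, in ascending order.
CUTOFFS = [(13, "minimal"), (19, "mild"), (28, "moderate"), (63, "severe")]


def severity(total: int) -> str:
    """Return the severity range that a total score (0 to 63) falls in."""
    if not 0 <= total <= 63:
        raise ValueError("total score must be between 0 and 63")
    for upper, label in CUTOFFS:
        if total <= upper:
            return label
    raise AssertionError("unreachable")


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classification_agreement(first: Sequence[int], second: Sequence[int]) -> float:
    """Proportion of examinees placed in the same severity range both times."""
    same = sum(severity(a) == severity(b) for a, b in zip(first, second))
    return same / len(first)


# Invented scores for six examinees at two sessions (not real data).
session_1 = [12, 15, 22, 30, 8, 19]
session_2 = [14, 16, 21, 27, 7, 20]
print(round(pearson_r(session_1, session_2), 2))
print(classification_agreement(session_1, session_2))  # 0.5
```

In this invented example the two sets of scores are strongly correlated, yet only half of the examinees receive the same severity label at both sessions, because several scores sit next to a cutoff.
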

C. Inter-Rater Reliability
Inter-rater reliability estimates are not established because the BDI-II is a self-report test.
Therefore, it neither complies with nor violates Standard 2.10 (AERA et al., 2014).

Overall, the test developers use consistent techniques for calculating internal
consistency and test-retest reliability, and thus comply with Standard 2.5 (AERA et al., 2014).
However, the flaws discussed above should also be taken into consideration. As a result,
although the reliability appears to be high, there are still some concerns about its application.


The overall validity of the BDI-II is good, as it conforms more closely and specifically to

the diagnostic criteria for depression. The validity of the BDI-II can be evaluated from

three aspects: construct validity, content validity, and criterion validity.

A. Construct Validity

Beck et al. (1996) reported that the BDI-II displayed construct validity with respect to

clinically rated depression. Steer et al. (1997) also confirmed the BDI-II's construct validity

by testing the correlation between the BDI-II and the SCL-90-R, an instrument often

employed for assessing self-reported depression and anxiety (Derogatis, 1983). The study was

conducted among 210 adults who had not participated in Beck et al.'s (1996) normative study and

found that the BDI-II was highly positively correlated (r = .89, p < .001) with scores on the
depression subscale of the SCL-90-R. This high correlation indicates that the BDI-II is able to

identify and measure depressive symptoms well, and that its construct validity is good.

However, there are some threats to the construct validity. For example, the cut scores may

increase the likelihood of underdiagnosis or overdiagnosis when a score is close to the cutoffs,

yet the BDI-II lacks instructions on how to handle scores near the cutoffs, which does not comply

with Standard 1.2 and Standard 1.12 (AERA et al., 2014).

B. Content Validity

The content validity of the BDI-II is adequate, as it is composed of questions relating to

both somatic and cognitive symptoms of depression. However, its content coverage

became narrower than that of the former version (Wang et al., 2013): the BDI-I

reflected six of the nine criteria for DSM-based depression, while the BDI-II was

improved specifically to indicate DSM-based depression. This raises the concern that

its sensitivity to a broader concept of depression may have been affected. There are also

some items that are inappropriate when the test takers are adolescents. For example, item 21,

which asks about interest in sex, is unlikely to be a suitable question for adolescents, since many of them have probably

not yet become sexually active (Osman et al., 2014), which violates Standard 1.5 (AERA et al.,

2014).

C. Criterion Validity

The criterion validity of the BDI-II is supported by its close relationship with other

measures of depression. The Hamilton Rating Scale for Depression (HRSD) (Hamilton, 1960) is

one such instrument and was initially considered a gold standard for

rating depression in clinical research. The BDI-II manual (Beck et al., 1996) reported a

correlation of 0.68 between the BDI-II and the HRSD. Subsequent research also confirmed that the
BDI-II correlates strongly with other related self-report instruments (Arnau, Meagher, Norris, &

Bramson, 2001; Osman et al., 1997; Steer & Clark, 1997). These papers presented adequate

statistical results, such as correlation coefficients, means, and standard deviations,

showing that the BDI-II adheres to Standard 1.15 (AERA et al., 2014).

Summary and Recommendation

The Beck Depression Inventory (BDI, BDI-IA, BDI-II) is a self-administered, 21-question

multiple-choice test that was created by Aaron T. Beck. The purpose of the test is to assess the

severity of depression. The BDI-II was published in 1996 as a revision of the two previous editions,

the BDI and BDI-IA. Each of the 21 items is scored on a scale from 0 to 3 and assesses a

different symptom based on how participants have felt over the past two weeks.

The BDI-II meets most of the Standards for Educational and Psychological Testing

(AERA et al., 2014). The purpose of the BDI-II is stated clearly: the test provides a

quantitative measurement of the depth of depression and can be used for screening, diagnosis,

and research, which follows Standards 4.1 and 4.2. The BDI-II meets Standards 4.24 and

4.25 because the test was revised when new data were collected and the changes were significant. The

instructions of the BDI-II meet Standard 5.5 by explaining clearly and in detail how to

respond to the assessment. The scoring of the BDI-II fulfills Standard 3.22 and Standard 5.8 by having

a straightforward scoring system, and the interpretation of scores also meets Standard 3.11.

The BDI-II meets Standard 6.1 by including 25 record forms and a manual. Two samples were

presented in the study, which is in accordance with Standard 4.8. As for the reliability of the

BDI-II, it meets Standard 2.5 by using consistent techniques. The BDI-II follows Standard 2.3 by

reporting test-retest reliability. The criterion validity of the BDI-II is in accordance with Standard
1.15 by presenting sufficient statistical measurements.

The BDI-II does, however, have some shortcomings worth noting.

First, several important pieces of information are missing from the manual and documents.

Specifically, the manual does not describe the testing environment or explain how and why

the items were developed and selected, although this information is critical for reliability and

validity. As for the testing environment, a noisy and disruptive setting could have a

negative effect on the test taker's mood, which may lead to a biased score. Failing to provide

guidance on the testing environment violates Standard 5.4, and such guidance is

recommended for higher reliability. As for the procedure of item development, since the BDI-II is

primarily based on clinical experience, documentation of the procedure is critical for reviewing

construct validity. Additionally, statistical cross-validation is necessary for ruling out the

possibility that some items were selected by chance. The lack of documentation of the procedure

and of cross-validation violates Standards 4.7 and 4.11, and the BDI-II could adhere more closely to these

standards by explaining the logic behind choosing these 21 questions instead of

other items.

Second, the test design is not appropriate for the whole target population, and two

issues are worth noting: the sampling and the item content. As for the sampling, the

BDI-II samples were collected only from outpatients and college students and are therefore not

representative of the target population, which partially violates Standards 2.4, 2.12, and 4.9.

An improvement can be made by collecting additional samples from groups within the target population, such as

teenagers, middle-aged adults, and the elderly. As for the item content, the BDI-II retains an

item on loss of interest in sex, which may be inapplicable to younger participants. To avoid
this problem, the item could be changed to something more relevant to teenagers, such as loss of

interest in academic engagement (Jaycox, Stein, Paddock, Miles,

Chandra, Meredith, & Burnam, 2009).

Third, the BDI-II lacks instructions on the cutoff scores, which does not comply with

Standards 1.2 and 1.12, because the cutoffs may cause inaccurate screening results and the

manual fails to include specific instructions on this point. The lack of instructions on cutoff scores also

hurts reliability: because the sample is not representative, the base rate may not be

appropriate to apply to general populations. The cutoff scores should be applied

cautiously, and the manual should provide more guidance for cases in which participants

score at or near the cutoffs.

Last, since the BDI-II is a self-report test, it is subject to some inherent limitations. Test takers

can easily exaggerate or minimize their scores. A patient completing the test in a hospital setting

could have a higher score than a patient completing the questionnaire at home, because the latter may

feel more relaxed in a familiar setting, so the score is biased. To avoid biased screening scores,

the BDI-II can be used together with other depression measures, such as the Hamilton Rating Scale

for Depression (HRSD) and the Patient Health Questionnaire (PHQ), to assess for a more accurate


In conclusion, in spite of the limitations mentioned above, we consider the BDI-II a

good measure of the intensity of depression that is in accordance with many standards in the

Standards for Educational and Psychological Testing. Considering the nature and limitations of

self-report tests, we recommend that it be used as a screening device rather than a diagnostic

tool, and that test scores be interpreted in detail by expert clinicians.
