
Encyclopedia of Psychometrics

PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Fri, 19 Apr 2013 20:34:01 UTC

Contents
Articles
Accuracy and precision · Activity vector analysis · Adaptive comparative judgement · Anchor test · Assessment centre · Assessment day · Base rate · Bias in Mental Testing · Bipolar spectrum diagnostic scale · Borderline intellectual functioning · Choice set · Citizen survey · Classical test theory · Cluster analysis (in marketing) · Cognitive Process Profile · Common-method variance · Computer-Adaptive Sequential Testing · Computerized adaptive testing · Computerized classification test · Congruence coefficient · Conjoint analysis · Correction for attenuation · Counternull · Criterion-referenced test · Cronbach's alpha · Cutscore · Descriptive statistics · Dot cancellation test · Elementary cognitive task · Equating · Factor analysis · Figure rating scale · Fuzzy concept · G factor (psychometrics)

Francis Galton · Group size measures · Guttman scale · High-stakes testing · Historiometry · House-Tree-Person test · Idiographic image · Intelligence quotient · Internal consistency · Intra-rater reliability · IPPQ · Item bank · Item response theory · Jenkins activity survey · Jensen box · Kuder–Richardson Formula 20 · Latent variable · Law of comparative judgment · Likert scale · Linear-on-the-fly testing · Frederic M. Lord · Measurement invariance · Mediation (statistics) · Mental age · Mental chronometry · Missing completely at random · Moderated mediation · Moderation (statistics) · Multidimensional scaling · Multiple mini interview · Multistage testing · Multitrait-multimethod matrix · Neo-Piagetian theories of cognitive development · NOMINATE (scaling method) · Non-response bias · Norm-referenced test · Normal curve equivalent · Objective test


Online assessment · Operational definition · Operationalization · Opinion poll · Optimal discriminant analysis · Pairwise comparison · Pathfinder network · Perceptual mapping · Person-fit analysis · Phrase completions · Point-biserial correlation coefficient · Polychoric correlation · Polynomial conjoint measurement · Polytomous Rasch model · Progress testing · Projective test · Prometric · Psychological statistics · Psychometric function · Psychometrics of racism · Quantitative marketing research · Quantitative psychology · Questionnaire construction · Rasch model · Rasch model estimation · Rating scale · Rating scales for depression · Reliability (psychometrics) · Repeatability · Reproducibility · Riddle scale · Risk Inclination Formula · Risk Inclination Model · Role-based assessment · Scale (social sciences) · Self-report inventory · Semantic differential · Sequential probability ratio test


SESAMO · Situational judgement test · Psychometric software · Spearman–Brown prediction formula · Standard-setting study · Standards for Educational and Psychological Testing · Stanford–Binet Intelligence Scales · Stanine · Statistical hypothesis testing · Statistical inference · Survey methodology · Sten scores · Structural equation modeling · Lewis Terman · Test (assessment) · Test score · Theory of conjoint measurement · Thurstone scale · Thurstonian model · Torrance Tests of Creative Thinking · William H. Tucker · Validity (statistics) · Values scales · Vestibulo emotional reflex · Visual analogue scale · Youth Outcome Questionnaire · Attribute Hierarchy Method · Differential item functioning · Psychometrics · Vineland Adaptive Behavior Scale


References
Article Sources and Contributors
Image Sources, Licenses and Contributors

Article Licenses
License

Accuracy and precision



In the fields of science, engineering, industry, and statistics, the accuracy[1] of a measurement system is the degree of closeness of measurements of a quantity to that quantity's actual (true) value. The precision[1] of a measurement system, also called reproducibility or repeatability, is the degree to which repeated measurements under unchanged conditions show the same results. Although the two words reproducibility and repeatability can be synonymous in colloquial use, they are deliberately contrasted in the context of the scientific method.

Accuracy indicates proximity of measurement results to the true value, precision to the repeatability, or reproducibility of the measurement

A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy: the result would be a consistent yet inaccurate string of results from the flawed experiment. Eliminating the systematic error improves accuracy but does not change precision. A measurement system is designated valid if it is both accurate and precise. Related terms include bias (non-random or directed effects caused by a factor or factors unrelated to the independent variable) and error (random variability).

The terminology is also applied to indirect measurements, that is, values obtained by a computational procedure from observed data. In addition to accuracy and precision, measurements may also have a measurement resolution, which is the smallest change in the underlying physical quantity that produces a response in the measurement.

In the case of full reproducibility, such as when rounding a number to a representable floating point number, the word precision has a meaning not related to reproducibility. For example, in the IEEE 754-2008 standard it means the number of bits in the significand, so it is used as a measure of the relative accuracy with which an arbitrary number can be represented.
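As a minimal illustration of the point about systematic error (a Python sketch with invented numbers, not data from any source), averaging more readings from a biased instrument makes the result more repeatable but no closer to the true value:

```python
import random

random.seed(0)
TRUE_VALUE = 10.0   # assumed true value of the quantity
BIAS = 0.8          # assumed systematic error of the hypothetical instrument
NOISE_SD = 0.05     # assumed random error of a single reading

def mean_of_readings(n):
    """Average n simulated readings from the biased instrument."""
    readings = [TRUE_VALUE + BIAS + random.gauss(0, NOISE_SD) for _ in range(n)]
    return sum(readings) / n

# Larger samples give a more precise (repeatable) mean, but the
# systematic error keeps it about 0.8 away from the true value of 10.0.
for n in (5, 50, 5000):
    print(n, round(mean_of_readings(n), 3))
```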

Accuracy versus precision: the target analogy


Accuracy is the degree of veracity, while in some contexts precision may mean the degree of reproducibility. Accuracy depends on how the data are collected, and is usually judged by comparing several measurements from the same or different sources.[citation needed]

The analogy used here to explain the difference between accuracy and precision is the target comparison. In this analogy, repeated measurements are compared to arrows that are shot at a target. Accuracy describes the closeness of arrows to the bullseye at the target center. Arrows that strike closer to the bullseye are considered more accurate. The closer a system's measurements are to the accepted value, the more accurate the system is considered to be.

High accuracy, but low precision.


To continue the analogy, if a large number of arrows are shot, precision would be the size of the arrow cluster. (When only one arrow is shot, precision is the size of the cluster one would expect if this were repeated many times under the same conditions.) When all arrows are grouped tightly together, the cluster is considered precise since they all struck close to the same spot, even if not necessarily near the bullseye. The measurements are precise, though not necessarily accurate. However, it is not possible to reliably achieve accuracy in individual measurements without precision: if the arrows are not grouped close to one another, they cannot all be close to the bullseye. (Their average position might be an accurate estimation of the bullseye, but the individual arrows are inaccurate.) See also circular error probable for application of precision to the science of ballistics.

High precision, but low accuracy.

Quantification
Ideally a measurement device is both accurate and precise, with measurements all close to and tightly clustered around the known value. The accuracy and precision of a measurement process is usually established by repeatedly measuring some traceable reference standard. Such standards are defined in the International System of Units (abbreviated SI from French: Système international d'unités) and maintained by national standards organizations such as the National Institute of Standards and Technology in the United States.

This also applies when measurements are repeated and averaged. In that case, the term standard error is properly applied: the precision of the average is equal to the known standard deviation of the process divided by the square root of the number of measurements averaged. Further, the central limit theorem shows that the probability distribution of the averaged measurements will be closer to a normal distribution than that of individual measurements.

With regard to accuracy we can distinguish:
- the difference between the mean of the measurements and the reference value, the bias. Establishing and correcting for bias is necessary for calibration.
- the combined effect of that and precision.

A common convention in science and engineering is to express accuracy and/or precision implicitly by means of significant figures. Here, when not explicitly stated, the margin of error is understood to be one-half the value of the last significant place. For instance, a recording of 843.6 m, or 843.0 m, or 800.0 m would imply a margin of 0.05 m (the last significant place is the tenths place), while a recording of 8,436 m would imply a margin of error of 0.5 m (the last significant digits are the units). A reading of 8,000 m, with trailing zeroes and no decimal point, is ambiguous; the trailing zeroes may or may not be intended as significant figures. To avoid this ambiguity, the number could be represented in scientific notation: 8.0 × 10³ m indicates that the first zero is significant (hence a margin of 50 m) while 8.000 × 10³ m indicates that all three zeroes are significant, giving a margin of 0.5 m. Similarly, it is possible to use a multiple of the basic measurement unit: 8.0 km is equivalent to 8.0 × 10³ m; it indicates a margin of 0.05 km (50 m). However, reliance on this convention can lead to false precision errors when accepting data from sources that do not obey it.

Precision is sometimes stratified into:
- Repeatability: the variation arising when all efforts are made to keep conditions constant by using the same instrument and operator, and repeating during a short time period; and
- Reproducibility: the variation arising using the same measurement process among different instruments and operators, and over longer time periods.
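As a worked illustration of the standard-error statement above (the numbers are invented for the example): if the process standard deviation is known to be 0.4 mm and 16 measurements are averaged, the precision of the average is

\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.4\ \text{mm}}{\sqrt{16}} = 0.1\ \text{mm}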


Terminology of ISO 5725


A shift in the meaning of these terms appeared with the publication of the ISO 5725 series of standards. According to ISO 5725-1, the terms trueness and precision are used to describe the accuracy of a measurement. Trueness refers to the closeness of the mean of the measurement results to the "correct" value and precision refers to the closeness of agreement within individual results. Therefore, according to the ISO standard, the term "accuracy" refers to both trueness and precision. The standard also avoids the use of the term bias, because it has different connotations outside the fields of science and engineering, as in medicine and law.[2] The terms "accuracy" and "trueness" were again redefined in 2008, with a slight shift in their exact meanings, in the BIPM "International Vocabulary of Metrology", items 2.13 and 2.14.[1]

According to ISO 5725-1, accuracy consists of trueness (proximity of measurement results to the true value) and precision (repeatability or reproducibility of the measurement).
Accuracy according to BIPM and ISO 5725

Low accuracy, good trueness, poor precision


Low accuracy, poor trueness, good precision

In binary classification
Accuracy is also used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition.
                       Condition (as determined by the gold standard)
                       True                    False
Test       Positive    True positive           False positive         Row: Positive predictive value (precision)
outcome    Negative    False negative          True negative          Row: Negative predictive value
           Column:     Sensitivity (recall)    Specificity            Overall: Accuracy
                                               (complement: fall-out)

That is, the accuracy is the proportion of true results (both true positives and true negatives) in the population. It is a parameter of the test.

On the other hand, precision or positive predictive value is defined as the proportion of the true positives against all the positive results (both true positives and false positives)

An accuracy of 100% means that the measured values are exactly the same as the given values. Also see Sensitivity and specificity. Accuracy may be determined from sensitivity and specificity, provided the prevalence is known, using the equation

Accuracy = Sensitivity × Prevalence + Specificity × (1 − Prevalence)

The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. It may be better to avoid the accuracy metric in favor of other metrics such as precision and recall.[citation needed] In situations where the minority class is more important, F-measure may be more appropriate, especially in situations with very skewed class imbalance.

Another useful performance measure is the balanced accuracy, which avoids inflated performance estimates on imbalanced datasets. It is defined as the arithmetic mean of sensitivity and specificity, or the average accuracy obtained on either class:

balanced accuracy = (sensitivity + specificity) / 2

If the classifier performs equally well on either class, this term reduces to the conventional accuracy (i.e., the number of correct predictions divided by the total number of predictions). In contrast, if the conventional accuracy is above chance only because the classifier takes advantage of an imbalanced test set, then the balanced accuracy, as appropriate, will drop to chance.[3] A closely related chance-corrected measure is

Informedness = 2 × balanced accuracy − 1 = sensitivity + specificity − 1,

while a direct approach to debiasing and renormalizing accuracy is Cohen's kappa, whilst Informedness has been shown to be a Kappa-family debiased renormalization of Recall.[4] Informedness and Kappa have the advantage that chance level is defined to be 0, and they have the form of a probability. Informedness has the stronger property that it is the probability that an informed decision is made (rather than a guess), when positive. When negative this is still true for the absolute value of Informedness, but the information has been used to force an incorrect response.
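A minimal Python sketch of the quantities discussed in this section, computed from the four confusion-matrix counts (the counts in the example call are invented to show the effect of class imbalance):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity, specificity, balanced accuracy and
    informedness from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)              # positive predictive value
    sensitivity = tp / (tp + fn)            # recall
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    informedness = sensitivity + specificity - 1   # chance-corrected
    return accuracy, precision, sensitivity, specificity, balanced_accuracy, informedness

# Imbalanced test set: 10 positives, 90 negatives; the classifier finds
# only one positive. Conventional accuracy looks good (0.90) while
# balanced accuracy is barely above chance (about 0.54).
print(classification_metrics(tp=1, fp=1, fn=9, tn=89))
```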

In psychometrics and psychophysics


In psychometrics and psychophysics, the term accuracy is used interchangeably with validity and constant error. Precision is a synonym for reliability and variable error. The validity of a measurement instrument or psychological test is established through experiment or correlation with behavior. Reliability is established with a variety of statistical techniques, classically through an internal consistency test like Cronbach's alpha to ensure sets of related questions have related responses, and then comparison of those related questions between reference and target populations.[citation needed]

In logic simulation
In logic simulation, a common mistake in evaluation of accurate models is to compare a logic simulation model to a transistor circuit simulation model. This is a comparison of differences in precision, not accuracy. Precision is measured with respect to detail and accuracy is measured with respect to reality.[5][6]

In information systems
The concepts of accuracy and precision have also been studied in the context of databases, information systems and their sociotechnical context. The necessary extension of these two concepts on the basis of theory of science suggests that they (as well as data quality and information quality) should be centered on accuracy, defined as the closeness to the true value, seen as the degree of agreement of readings or of calculated values of one and the same conceived entity, measured or calculated by different methods, in the context of maximum possible disagreement.[7]


References
[1] JCGM 200:2008 International vocabulary of metrology - Basic and general concepts and associated terms (VIM) (http://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2008.pdf)
[2] BS ISO 5725-1: "Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions", p. 1 (1994)
[3] K.H. Brodersen, C.S. Ong, K.E. Stephan, J.M. Buhmann (2010). The balanced accuracy and its posterior distribution (http://www.icpr2010.org/pdfs/icpr2010_WeBCT8.62.pdf). Proceedings of the 20th International Conference on Pattern Recognition, 3121-3124.
[5] John M. Acken, Encyclopedia of Computer Science and Technology, Vol 36, 1997, pages 281-306
[6] 1990 Workshop on Logic-Level Modelling for ASICS, Mark Glasser, Rob Mathews, and John M. Acken, SIGDA Newsletter, Vol 20, Number 1, June 1990
[7] Ivanov, K. (1972). "Quality-control of information: On the concept of accuracy of information in data banks and in management information systems" (http://www.informatik.umu.se/~kivanov/diss-avh.html).

External links
BIPM - Guides in metrology (http://www.bipm.org/en/publications/guides/) - Guide to the Expression of Uncertainty in Measurement (GUM) and International Vocabulary of Metrology (VIM) "Beyond NIST Traceability: What really creates accuracy" (http://img.en25.com/Web/Vaisala/NIST-article. pdf) - Controlled Environments magazine Precision and Accuracy with Three Psychophysical Methods (http://www.yorku.ca/psycho) Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results, Appendix D.1: Terminology (http://physics.nist.gov/Pubs/guidelines/appd.1.html) Accuracy and Precision (http://digipac.ca/chemical/sigfigs/contents.htm) Accuracy vs Precision (http://www.youtube.com/watch?v=_LL0uiOgh1E&feature=youtube_gdata_player) a brief, clear video by Matt Parker

Activity vector analysis


Activity vector analysis (AVA) is a psychometric questionnaire designed to measure four personality factors or vectors: aggressiveness, sociability, emotional control and social adaptability.[1] It is used as an employment test. The AVA was developed by the psychologist Walter V. Clarke in 1942, based on work by Prescott Lecky, William Marston and others.[2]

References
[1] Edwin A. Locke, Charles L. Hulin, 'A review and evaluation of the validity studies of activity vector analysis', Personnel Psychology, Volume 15, Issue 1, pages 25–42, March 1962 | http://onlinelibrary.wiley.com/doi/10.1111/j.1744-6570.1962.tb01844.x/abstract
[2] http://www.bizet.com/ava.php?pg=history_ava | Retrieved 2012-03-03

Adaptive comparative judgement



Adaptive Comparative Judgement is a technique borrowed from psychophysics which is able to generate reliable results for educational assessment; as such it is an alternative to traditional exam script marking. In this approach, judges are presented with pairs of student work and are asked to choose which of the two is better. By means of an iterative and adaptive algorithm, a scaled distribution of student work can then be obtained without reference to criteria.

Introduction
Traditional exam script marking began in Cambridge in 1792 when, with undergraduate numbers rising, the importance of properly ranking students was growing. So in 1792 the new Proctor of Examinations, William Farish, introduced marking, a process in which every examiner gives a numerical score to each response by every student, and the overall total mark puts the students in the final rank order. Francis Galton (1869) noted that, in an unidentified year about 1863, the Senior Wrangler scored 7,634 out of a maximum of 17,000, while the Second Wrangler scored 4,123. (The Wooden Spoon scored only 237.) Prior to 1792, a team of Cambridge examiners convened at 5pm on the last day of examining, reviewed the 19 papers each student had sat and published their rank order at midnight. Marking solved the problem of rising numbers and prevented unfair personal bias, and its introduction was a step towards modern objective testing, the format it is best suited to. But the technology of testing that followed, with its major emphasis on reliability and the automatisation of marking, has been an uncomfortable partner for some areas of educational achievement: assessing writing or speaking, and other kinds of performance, needs something more qualitative and judgemental.

The technique of Adaptive Comparative Judgement is an alternative to marking. It returns to the pre-1792 idea of sorting papers according to their quality, but retains the guarantee of reliability and fairness. It is by far the most reliable way known to score essays or more complex performances. It is much simpler than marking, and has been preferred by almost all examiners who have tried it. The real appeal of Adaptive Comparative Judgement lies in how it can re-professionalise the activity of assessment and how it can re-integrate assessment with learning.

History
Thurstone's Law of Comparative Judgement
"There is no such thing as absolute judgement" (Laming, 2004).[1]

The science of comparative judgement began with Louis Leon Thurstone of the University of Chicago. A pioneer of psychophysics, he proposed several ways to construct scales for measuring sensation and other psychological properties. One of these was the Law of comparative judgment (Thurstone, 1927a, 1927b),[2][3] which defined a mathematical way of modeling the chance that one object will beat another in a comparison, given values for the quality of each. This is all that is needed to construct a complete measurement system. A variation on his model (see Pairwise comparison and the BTL model) states that the difference between their quality values is equal to the log of the odds that object A will beat object B:

q_A - q_B = \ln \frac{P(\text{A beats B})}{1 - P(\text{A beats B})}
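For example (values invented for illustration), a quality difference of one unit implies

q_A - q_B = 1 \;\Rightarrow\; P(\text{A beats B}) = \frac{e^{1}}{1 + e^{1}} \approx 0.73,

so A would be expected to win roughly 73% of such comparisons, while a difference of zero gives exactly 50%.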

Before the availability of modern computers, the mathematics needed to calculate the values of each object's quality meant that the method could only be used with small sets of objects, and its application was limited. For Thurstone, the objects were generally sensations, such as intensity, or attitudes, such as the seriousness of crimes, or statements of opinions. Social researchers continued to use the method, as did market researchers, for whom the objects might be different hotel room layouts, or variations on a proposed new biscuit.

In the 1970s and 1980s Comparative Judgement appeared, almost for the first time in educational assessment, as a theoretical basis or precursor for the new Latent Trait or Item Response Theories (Andrich, 1978). These models are now standard, especially in item banking and adaptive testing systems.

Re-introduction in education
The first published paper using Comparative Judgement in education was Pollitt & Murray (1994), essentially a research paper concerning the nature of the English proficiency scale assessed in the speaking part of Cambridge's CPE exam. The objects were candidates, represented by 2-minute snippets of video recordings from their test sessions, and the judges were Linguistics post-graduate students with no assessment training. The judges compared pairs of video snippets, simply reporting which they thought the better student, and were then clinically interviewed to elicit the reasons for their decisions.

Pollitt then introduced Comparative Judgement to the UK awarding bodies, as a method for comparing the standards of A Levels from different boards. Comparative judgement replaced their existing method, which required direct judgement of a script against the official standard of a different board. For the first two or three years of this Pollitt carried out all of the analyses for all the boards, using a program he had written for the purpose. It immediately became the only experimental method used to investigate exam comparability in the UK; the applications for this purpose from 1996 to 2006 are fully described in Bramley (2007).[4]

In 2004 Pollitt presented a paper at the conference of the International Association for Educational Assessment titled "Let's Stop Marking Exams", and another at the same conference in 2009 titled "Abolishing Marksism". In each paper the aim was to convince the assessment community that there were significant advantages to using Comparative Judgement in place of marking for some types of assessment. In 2010 he presented a paper at the Association for Educational Assessment Europe, "How to Assess Writing Reliably and Validly", which presented evidence of the extraordinarily high reliability that has been achieved with Comparative Judgement in assessing primary school pupils' skill in first-language English writing.

Adaptive Comparative Judgement


Comparative Judgement becomes a viable alternative to marking when it is implemented as an adaptive web-based assessment system. In this, the 'scores' (the model parameter for each object) are re-estimated after each 'round' of judgements in which, on average, each object has been judged one more time. In the next round, each script is compared only to another whose current estimated score is similar, which increases the amount of statistical information contained in each judgement. As a result, the estimation procedure is more efficient than random pairing, or any other pre-determined pairing system like those used in classical comparative judgement applications. As with computer-adaptive testing, this adaptivity maximises the efficiency of the estimation procedure, increasing the separation of the scores and reducing the standard errors. The most obvious advantage is that this produces significantly enhanced reliability, compared to assessment by marking, with no loss of validity.
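The following Python sketch shows one simple way such quality scores can be estimated from pairwise judgements, by gradient ascent on the Bradley–Terry–Luce log-likelihood. It is only an illustration of the general idea, not the estimation routine or adaptive pairing rule of any particular system mentioned here; the learning rate, number of passes, and toy data are assumptions.

```python
import math

def fit_btl(n_items, judgements, lr=0.1, epochs=200):
    """Estimate BTL quality scores from (winner, loser) index pairs."""
    q = [0.0] * n_items
    for _ in range(epochs):
        for winner, loser in judgements:
            # Probability that the winner beats the loser under current scores
            p = 1.0 / (1.0 + math.exp(-(q[winner] - q[loser])))
            # Gradient step on the log-likelihood of this single judgement
            q[winner] += lr * (1.0 - p)
            q[loser] -= lr * (1.0 - p)
    mean = sum(q) / n_items
    return [score - mean for score in q]   # centre the scale at zero

# Toy data: script 2 wins most of its comparisons, script 0 loses most.
judgements = [(2, 0), (2, 1), (1, 0), (2, 0), (1, 0)]
print(fit_btl(3, judgements))
```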

Current Comparative Judgement projects


e-scape

The first application of Comparative Judgement to the direct assessment of students was in a project called e-scape, led by Prof. Richard Kimbell of London University's Goldsmiths College (Kimbell & Pollitt, 2008).[5] The development work was carried out in collaboration with a number of awarding bodies in a Design & Technology course. Kimbell's team developed a sophisticated and authentic project in which students were required to develop, as far as a prototype, an object such as a children's pill dispenser in two three-hour supervised sessions. The web-based judgement system was designed by Karim Derrick and Declan Lynch from TAG Developments, a part of Sherston Software, and based on the MAPS (software) assessment portfolio system. Goldsmiths, TAG Developments and Pollitt ran three trials, increasing the sample size from 20 to 249 students, and developing both the judging system and the assessment system. There are three pilots, involving Geography and Science as well as the original in Design & Technology.

Primary school writing

In late 2009 TAG Developments and Pollitt trialled a new version of the system for assessing writing. A total of 1000 primary school scripts were evaluated by a team of 54 judges in a simulated national assessment context. The reliability of the resulting scores after each script had been judged 16 times was 0.96, considerably higher than in any other reported study of similar writing assessment. Further development of the system has shown that reliability of 0.93 can be reached after about 9 judgements of each script, when the system is no more expensive than single marking but still much more reliable.

Several projects are underway at present, in England, Scotland, Ireland, Israel, Singapore and Australia. They range from primary school to university in context, and include both formative and summative assessment, from writing to Mathematics. The basic web system is now available on a commercial basis from TAG Developments (http://www.tagdevelopments.com), and can be modified to suit specific needs.

References
[1] * Laming, D R J (2004) Human judgment : the eye of the beholder. London, Thomson. [2] Thurstone, L L (1927a). Psychophysical analysis. American Journal of Psychology, 38, 368-389. Chapter 2 in Thurstone, L.L. (1959). The measurement of values. University of Chicago Press, Chicago, Illinois. [3] Thurstone, L L (1927b). The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 21, 384-400. Chapter 7 in Thurstone, L.L. (1959). The measurement of values. University of Chicago Press, Chicago, Illinois [4] Bramley, T (2007) Paired comparison methods. In Newton, P, Baird, J, Patrick, H, Goldstein, H, Timms, P and Wood, A (Eds). Techniques for monitoring the comparability of examination standards. London, QCA. [5] Kimbell R, A and Pollitt A (2008) Coursework assessment in high stakes examinations: authenticity, creativity, reliability Third international Rasch measurement conference. Perth: Western Australia: January.

APA, AERA and NCME (1999) Standards for Educational and Psychological Testing. Galton, F (1855) Hereditary genius : an inquiry into its laws and consequences. London : Macmillan. Kimbell, R A, Wheeler A, Miller S, and Pollitt A (2007) e-scape portfolio assessment (e-solutions for creative assessment in portfolio environments) phase 2 report. TERU Goldsmiths, University of London ISBN 978-1-904158-79-0 Pollitt, A (2004) Lets stop marking exams. Annual Conference of the International Association for Educational Assessment, Philadelphia, June. Available at http://www.camexam.co.uk publications. Pollitt, A, (2009) Abolishing Marksism, and rescuing validity. Annual Conference of the International Association for Educational Assessment, Brisbane, September. Available at http://www.camexam.co.uk publications. Pollitt, A, & Murray, NJ (1993) What raters really pay attention to. Language Testing Research Colloquium, Cambridge. Republished in Milanovic, M & Saville, N (Eds), Studies in Language Testing 3: Performance Testing, Cognition and Assessment, Cambridge University Press, Cambridge.

External links
E-scape


Anchor test
In psychometrics, an anchor test is a common set of test items administered in combination with two or more alternative forms of the test with the aim of establishing the equivalence of the test scores on the alternative forms. The purpose of the anchor test is to provide a baseline for an equating analysis between different forms of a test.[1]

References
[1] Kolen, M.J., & Brennan, R.L. (1995). Test Equating. New York: Springer.

Assessment centre
An assessment centre is a place at which a person, such as a member of staff, is assessed to determine their suitability for particular roles, especially management or military command. The candidates' personality and aptitudes are determined by a variety of techniques including interviews, examinations and psychometric testing.

History
Assessment centres were first created in World War II to select officers. Examples include the Admiralty Interview Board of the Royal Navy and the War Office Selection Board of the British Army.[1] AT&T created a building for recruitment of staff in the 1950s. This was called The Assessment Centre, and it was influential on subsequent personnel methods in other businesses.[2] Other companies use this method to recruit for their graduate programmes by assessing the personality and intellect of potential employees who are fresh out of university and have no work history. The big four accountancy firms conduct assessment centre days to recruit their trainees. 68% of employers in the UK and USA now use some form of assessment centre as part of their recruitment/promotion process.[3][4]

References
[3] www.assessmentcentrehq.com


Assessment day
An assessment day is usually used in the context of recruitment. On this day, job applicants are invited to an assessment centre where a combination of more than one objective selection technique is used to measure suitability for a job. These techniques include exercises such as e-tray, in-tray, presentation, group exercise, attending a conference call, role play, and personality questionnaires. Most large companies now use this method to recruit fresh talent into their graduate programmes. There are many consultancies that focus on preparing candidates for these assessment days; for example, Green Turn is a well-known consultancy that trains applicants for the assessment days of the big 4 accountancy firms.

History
Assessment centres were first created in World War II to select officers. Examples include the Admiralty Interview Board of the Royal Navy and the War Office Selection Board of the British Army.[1] AT&T created a building for recruitment of staff in the 1950s. This was called The Assessment Centre and this was influential on subsequent personnel methods in other businesses.[2]

References

Base rate
In probability and statistics, base rate generally refers to the (base) class probabilities unconditioned on featural evidence, frequently also known as prior probabilities. In plainer words, if it were the case that 1% of the public were "medical professionals", and 99% of the public were not "medical professionals", then the base rate of medical professionals is simply 1%. In science, particularly medicine, the base rate is critical for comparison. It may at first seem impressive that 1000 people beat their winter cold while using 'Treatment X', until we look at the entire 'Treatment X' population and find that the base rate of success is actually only 1/100 (i.e. 100 000 people tried the treatment, but the other 99 000 people never really beat their winter cold). The treatment's effectiveness is clearer when such base rate information (i.e. "1000 people... out of how many?") is available. Note that controls may likewise offer further information for comparison; maybe the control groups, who were using no treatment at all, had their own base rate success of 5/100. Controls thus indicate that 'Treatment X' actually makes things worse, despite that initial proud claim about 1000 people.

Overview
Mathematician Keith Devlin provides an illustration of the risks of committing, and the challenges of avoiding, the base rate fallacy. He asks us to imagine that there is a type of cancer that afflicts 1% of all people. A doctor then says there is a test for that cancer which is about 80% reliable. He also says that the test provides a positive result for 100% of people who have the cancer, but that it also results in a 'false positive' for 20% of people who do not actually have the cancer. Now, if we test positive, we may be tempted to think it is 80% likely that we have the cancer. Devlin explains that, in fact, our odds are less than 5%. What is missing from the jumble of statistics is the most relevant base rate information. We should ask the doctor "Out of the number of people who test positive at all (this is the base rate group that we care about), how many end up actually having the cancer?".[1]

Naturally, in assessing the probability that a given individual is a member of a particular class, we must account for other information besides the base rate. In particular, we must account for featural evidence. For example, when we see a person wearing a white doctor's coat and stethoscope, and prescribing medication, we have evidence which may allow us to conclude that the probability of this particular individual being a "medical professional" is considerably greater than the category base rate of 1%. The normative method for integrating base rates (prior probabilities) and featural evidence (likelihoods) is given by Bayes' rule. A large number of psychological studies have examined a phenomenon called base-rate neglect, in which category base rates are not integrated with featural evidence in the normative manner.
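Using Devlin's numbers, Bayes' rule makes the 'less than 5%' figure explicit: with a prevalence of 1%, a 100% detection rate, and a 20% false-positive rate,

P(\text{cancer} \mid \text{positive}) = \frac{1.0 \times 0.01}{1.0 \times 0.01 + 0.2 \times 0.99} = \frac{0.01}{0.208} \approx 4.8\%.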


References
[1] http://www.edge.org/responses/what-scientific-concept-would-improve-everybodys-cognitive-toolkit


Bias in Mental Testing


Bias in Mental Testing
Author: Arthur R. Jensen
Publisher: Free Press
Publication date: 1980
Pages: 786
ISBN: 0-029-16430-3

Bias in Mental Testing is a book by Arthur Jensen about the idea of bias in IQ tests.

Background
In 1969, Arthur Jensen's article "How Much Can We Boost IQ and Scholastic Achievement?" initiated an immense controversy because of its suggestion that the reason for the difference in average IQ between African Americans and White Americans might involve genetic as well as cultural factors. One argument against this idea was that IQ tests are culturally biased against African Americans, and that any observed difference in average IQ must therefore be an artifact of the tests themselves. In the 1970s Jensen began researching the idea of test bias, and soon decided it would be beneficial to write a book reviewing the matter. Although he at first intended the book to be rather short, over the course of writing it he came to realize that the topic deserved a much more in-depth analysis, and the book eventually grew into something much larger.[1]

Summary
The book is based on the fact that the average IQ of African Americans had been consistently found to lie approximately 15 points lower than that of White Americans, and the accusation made by some psychologists that IQ tests are therefore culturally biased against African Americans. The book does not address the question whether the cause of the IQ gap is genetic or environmental, but only whether the tests themselves are valid.[2]

The book presents several arguments that IQ tests are not biased. African Americans' lower average performance on IQ tests cannot be because of differences in vocabulary, because African Americans have slightly better performance on verbal tests than on nonverbal tests. The IQ difference also cannot be because the tests depend on White culture, or that Whites inevitably do better on tests designed by Whites. In fact, Blacks perform better on tests that are culturally loaded than they do on tests designed to not include cultural references unfamiliar to Blacks, and Japanese children tend to outscore White children by an average of six points. Nor can the difference be a reflection of socioeconomic status, because when Black and White children are tested who are at the same socioeconomic level, the difference between their average IQs is still twelve points.[2]

The book also presents evidence that IQ tests work the same way for all English-speaking Americans born in the United States, regardless of race. One is that IQ tests have been very successful in predicting performance for all Americans in school, work, and the armed forces. Another is that the race and sex of the person administering a test does not significantly affect how African Americans perform on it. The ranking in difficulty of test items on IQ tests is the same for both groups, and so is the overall shape of the graph showing the number of people achieving each score, except that the curve is centered slightly lower for Blacks than it is for Whites.[2]

Based on this data, Jensen concludes that tests which show a difference in average IQ between races are showing something real, rather than an artifact of the tests themselves. He argues that in competition for college admission and jobs, IQ tests have the potential to be more fair than many of the alternatives, because they can judge ability in a way that is colorblind instead of relying on the judgement of an interviewer.[2]


Reception and impact


The journal Behavioral and Brain Sciences devoted an issue to Bias in Mental Testing in 1981, publishing 28 reviews of the book.[3] The 1984 book Perspectives on Bias in Mental Testing was written in response to the book. It is a collection of chapters by several authors on the topic of test bias, although not all of them respond directly to Jensen's book. Some of these chapters are supportive of Jensen's conclusions, while others give competing viewpoints.[4]

One criticism of the book argues that while Jensen's data shows test bias is not a sufficient explanation for the black/white IQ gap, it does not support his conclusion that no test bias exists at all. Lorrie A. Shepard writes, "Bias in the tests cannot explain away the observed difference between blacks and whites. But the evidence reviewed here does not support the conclusion that there is absolutely no bias nor the dismissing of the bias issue as a worthy scientific question."[5]

Bias in Mental Testing has been the subject of over 200 book reviews, and has been listed by the journal Current Contents as a citation classic.[1] It is also described as the definitive text on the topic of bias in IQ tests.[6][7] The content of the reviews has ranged from technical criticisms to ad hominem attacks and extravagant praise.[3]

A 1999 literature review re-examined the conclusions of Bias in Mental Testing using new data. It concluded that empirical evidence strongly supported Jensen's conclusion that mental tests are equally valid measures of ability for all English-speaking people born in the United States. The review further argued that misinformation about bias in IQ tests is very pervasive, and thus it is important for the empirical data in this field to be clearly conveyed to the public.[3]

References
[1] This Week's Citation Classic (http://garfield.library.upenn.edu/classics1987/A1987K668400001.pdf). Current Contents number 46, November 16, 1987
[2] The Return of Arthur Jensen (http://www.time.com/time/magazine/article/0,9171,947407,00.html). Time magazine, Sept. 24, 1979
[3] Robert T. Brown, Cecil R. Reynolds, and Jean S. Whitaker. "Bias in Mental Testing since Bias in Mental Testing". School Psychology Quarterly, Vol 14(3), 1999, 208-238.
[4] Book Review: Perspectives on Bias in Mental Testing. Cecil R. Reynolds and Robert T. Brown. Applied Psychological Measurement, March 1985, vol. 9 no. 1, 99-107.
[5] Shepard, Lorrie A. "The Case for Bias in Tests of Achievement and Scholastic Aptitude." In Arthur Jensen: Consensus and Controversy, edited by Sohan and Celia Modgil. The Falmer Press, 1987. Page 189.
[6] Brody, Nathan. Intelligence: Second edition. Academic Press, 1992. Page 287.
[7] John R. Graham and Jack A. Naglieri. Handbook of Psychology. John Wiley & Sons, 2003. Page 58.


Bipolar spectrum diagnostic scale


The Bipolar spectrum diagnostic scale (BSDS) is a psychiatric screening rating scale for bipolar disorder.[1] It was developed by Ronald Pies, and was later refined and tested by S. Nassir Ghaemi and colleagues. The BSDS arose from Pies's experience as a psychopharmacology consultant, where he was frequently called on to manage cases of "treatment-resistant depression". The English version of the scale consists of 19 question items and two sections. The scale was validated in its original version and demonstrated a high sensitivity. In general, however, instruments for the screening of bipolar disorder, including the BSDS, have low sensitivity and limited diagnostic validity.

References
[1] Psychiatric Times. Clinically Useful Psychiatric Scales: Bipolar Spectrum Diagnostic Scale (http://www.psychiatrictimes.com/clinical-scales/bsds/). Retrieved March 9, 2009.

Borderline intellectual functioning



Borderline intellectual functioning, also called borderline mental retardation, is a categorization of intelligence wherein a person has below average cognitive ability (generally an IQ of 70-85),[1] but the deficit is not as severe as mental retardation (70 or below). It is sometimes called below average IQ (BAIQ). This is technically a cognitive impairment; however, this group is not sufficiently mentally disabled to be eligible for specialized services.[2] Additionally, the DSM-IV-TR codes borderline intellectual functioning as V62.89,[3] which is generally not a billable code, unlike the codes for mental retardation. During school years, individuals with borderline intellectual functioning are often "slow learners."[2] Although a large percentage of this group fails to complete high school and can often achieve only a low socioeconomic status, most adults in this group blend in with the rest of the population.[2] Persons who fall into this categorization have a relatively normal expression of affect for their age, although their ability to think abstractly is rather limited. Reasoning displays a preference for concrete thinking. They are usually able to function day to day without assistance, including holding down a simple job and the basic responsibilities of maintaining a dwelling.

References
[2] The Best Test Preparation for the Advanced Placement Examination in Psychology, Research & Education Association. (2003), p. 99

Further reading
Gillberg, Christopher (1995). Clinical child neuropsychiatry. Cambridge: Cambridge University Press. pp. 47–48. ISBN 0-521-54335-5.
Harris, James C. (2006). Intellectual disability: understanding its development, causes, classification, evaluation, and treatment. New York: Oxford University Press. ISBN 0-19-517885-8.


Choice set
A choice set is one scenario, also known as a treatment, provided for evaluation by respondents in a choice experiment. Responses are collected and used to create a choice model. Respondents are usually provided with a series of differing choice sets for evaluation. The choice set is generated from an experimental design and usually involves two or more alternatives being presented together.

Example of a choice set


A choice set has the following elements

Alternatives
A number of hypothetical alternatives, Car A and Car B in this example. There may be one or more Alternatives including the 'None' Alternative.

Attributes
The attributes of the alternatives ideally are mutually exclusive and independent. When this is not possible, attributes are nested.

Example produced using SurveyEngine

Levels
Each Attribute has a number of possible levels that the attributes may range over. The specific levels that are shown are driven by an experimental design. Levels are discrete, even in the case that the attribute is a scalar such as price. In this case, the levels are discretized evenly along the range of allowable values.

Choice task
The respondent is asked to complete a choice task, usually indicating which of the alternatives they prefer. In this example, the choice task is 'forced'; an 'unforced' choice would also allow the respondents to select 'Neither'. The choice task response is used as the dependent variable in the resulting choice model.
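As a rough illustration of how the elements above fit together, a single choice set could be represented in code as follows (a Python sketch; the attribute names and levels are invented placeholders, not taken from the pictured example):

```python
# One choice set: two alternatives described by the same attributes,
# each attribute set to one discrete level drawn from the experimental design.
choice_set = {
    "alternatives": {
        "Car A": {"price": 20000, "fuel": "petrol", "warranty_years": 3},
        "Car B": {"price": 24000, "fuel": "hybrid", "warranty_years": 5},
    },
    "choice_task": "Which car do you prefer?",  # forced choice: no 'Neither' option
}

# The respondent's answer becomes the dependent variable in the choice model
# estimated across many such sets.
response = "Car B"
```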


Citizen survey
A citizen survey is a kind of opinion poll which typically asks the residents of a specific jurisdiction for their perspectives on local issues, such as the quality of life in the community, their level of satisfaction with local government, or their political leanings. Such a survey can be conducted by mail, telephone, Internet, or in person. Citizen surveys were advanced by Harry Hatry[1] of the Urban Institute, who believed resident opinions to be as necessary to the actions of local government managers and elected officials as customer surveys are to business executives. Local government officials use the data from citizen surveys to assist them in allocating resources for maximum community benefit and forming strategic plans for community programs and policies. Many private firms and universities also conduct their own citizen surveys for similar purposes. In 1991, the International City and County Manager's Association (ICMA)[2] published a book by Thomas Miller and Michelle Miller Kobayashi titled Citizen Surveys: How To Do Them, How To Use Them, and What They Mean, that directed local government officials in the basic methods for conducting citizen surveys. The book was revised and republished in 2000. In 2001, ICMA partnered with Miller and Kobayashi's organization National Research Center, Inc.,[3] to bring The National Citizen Survey, a low-cost survey service, to local governments. National Research Center, Inc. maintains a database of over 500 jurisdictions representing more than 40 million Americans, allowing local governments to compare their cities' results with similar communities nearby or across the nation.

References
[1] Selected Research - http://www.urban.org/expert.cfm?ID=HarryPHatry
[2] Untitled Document (http://www.icma.org)
[3] National Research Center - Specializing in Performance Measurement and Evaluation (http://www.n-r-c.com)

Classical test theory


Classical test theory is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers. Generally speaking, the aim of classical test theory is to understand and improve the reliability of psychological tests. Classical test theory may be regarded as roughly synonymous with true score theory. The term "classical" refers not only to the chronology of these models but also contrasts with the more recent psychometric theories, generally referred to collectively as item response theory, which sometimes bear the appellation "modern" as in "modern latent trait theory". Classical test theory as we know it today was codified by Novick (1966) and described in classic texts such as Lord & Novick (1968) and Allen & Yen (1979/2002). The description of classical test theory below follows these seminal publications.

History
Classical Test Theory was born only after the following three achievements or ideas were conceptualized: first, a recognition of the presence of errors in measurements; second, a conception of that error as a random variable; and third, a conception of correlation and how to index it. In 1904, Charles Spearman was responsible for figuring out how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction.[1] Spearman's finding is thought by some to be the beginning of Classical Test Theory (Traub, 1997). Others who had an influence on the Classical Test Theory framework include George Udny Yule, Truman Lee Kelley, those involved in making the Kuder-Richardson Formulas, Louis Guttman, and, most recently, Melvin Novick, not to mention others over the next quarter century after Spearman's initial findings.


Definitions
Classical test theory assumes that each person has a true score, T, that would be obtained if there were no errors in measurement. A person's true score is defined as the expected number-correct score over an infinite number of independent administrations of the test. Unfortunately, test users never observe a person's true score, only an observed score, X. It is assumed that observed score = true score plus some error:

X = T + E

Classical test theory is concerned with the relations between the three variables X, T, and E in the population.

These relations are used to say something about the quality of test scores. In this regard, the most important concept is that of reliability. The reliability of the observed test scores X, which is denoted as \rho^2_{XT}, is defined as the ratio of true score variance \sigma^2_T to the observed score variance \sigma^2_X:

\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_X}

Because the variance of the observed scores can be shown to equal the sum of the variance of true scores and the variance of error scores, this is equivalent to

\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}

This equation, which formulates a signal-to-noise ratio, has intuitive appeal: The reliability of test scores becomes higher as the proportion of error variance in the test scores becomes lower and vice versa. The reliability is equal to the proportion of the variance in the test scores that we could explain if we knew the true scores. The square root of the reliability is the correlation between true and observed scores.
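A small numerical illustration (the variances are invented for the example): if the true-score variance is 8 and the error variance is 2, then

\rho^2_{XT} = \frac{8}{8 + 2} = 0.80, \qquad \sqrt{0.80} \approx 0.89,

so 80% of the observed-score variance is attributable to true scores, and true and observed scores correlate about 0.89.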

Evaluating tests and scores: Reliability


Reliability cannot be estimated directly since that would require one to know the true scores, which according to classical test theory is impossible. However, estimates of reliability can be obtained by various means. One way of estimating reliability is by constructing a so-called parallel test. The fundamental property of a parallel test is that it yields the same true score and the same observed score variance as the original test for every individual. If we have parallel tests x and x', then this means that

T = T'

and

\sigma^2_E = \sigma^2_{E'}

Under these assumptions, it follows that the correlation between parallel test scores is equal to reliability (see Lord & Novick, 1968, Ch. 2, for a proof).

Using parallel tests to estimate reliability is cumbersome because parallel tests are very hard to come by. In practice the method is rarely used. Instead, researchers use a measure of internal consistency known as Cronbach's α. Consider a test consisting of k items X_j, j = 1, …, k. The total test score is defined as the sum of the individual item scores, so that for individual i

X_i = \sum_{j=1}^{k} X_{ij}

Then Cronbach's alpha equals

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} \sigma^2_{X_j}}{\sigma^2_X}\right)

Cronbach's α can be shown to provide a lower bound for reliability under rather mild assumptions. Thus, the reliability of test scores in a population is always higher than the value of Cronbach's α in that population. Thus, this method is empirically feasible and, as a result, it is very popular among researchers. Calculation of Cronbach's α is included in many standard statistical packages such as SPSS and SAS. As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of reliability. Reliability is supposed to say something about the general quality of the test scores in question. The general idea is that, the higher reliability is, the better. Classical test theory does not say how high reliability is supposed to be. Too high a value for α, say over .9, indicates redundancy of items. Around .8 is recommended for personality research, while .9+ is desirable for individual high-stakes testing.[2] These 'criteria' are not based on formal arguments, but rather are the result of convention and professional practice. The extent to which they can be mapped to formal principles of statistical inference is unclear.
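A minimal Python sketch of this computation (using NumPy; the toy 0/1 item matrix is invented for illustration):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_persons x k_items) matrix of item scores."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)       # variance of each item
    total_variance = X.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Toy example: 5 test-takers, 4 dichotomously scored items
scores = [[1, 1, 1, 0],
          [1, 0, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0]]
print(round(cronbach_alpha(scores), 3))
```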

Evaluating items: P and item-total correlations


Reliability provides a convenient index of test quality in a single number, reliability. However, it does not provide any information for evaluating single items. Item analysis within the classical approach often relies on two statistics: the P-value (proportion) and the item-total correlation (point-biserial correlation coefficient). The P-value represents the proportion of examinees responding in the keyed direction, and is typically referred to as item difficulty. The item-total correlation provides an index of the discrimination or differentiating power of the item, and is typically referred to as item discrimination. In addition, these statistics are calculated for each response of the oft-used multiple choice item, which are used to evaluate items and diagnose possible issues, such as a confusing distractor. Such valuable analysis is provided by specially-designed psychometric software.
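A short Python sketch of these two item statistics for dichotomously scored items (NumPy assumed; excluding the item from the total before correlating is one common convention, not necessarily the one any particular program uses):

```python
import numpy as np

def item_analysis(item_scores):
    """P-value (difficulty) and corrected item-total correlation for each item."""
    X = np.asarray(item_scores, dtype=float)   # shape: (n_persons, k_items)
    total = X.sum(axis=1)
    results = []
    for j in range(X.shape[1]):
        p_value = X[:, j].mean()                       # proportion answering correctly
        rest_score = total - X[:, j]                   # total score without item j
        r_item_total = np.corrcoef(X[:, j], rest_score)[0, 1]   # point-biserial correlation
        results.append({"item": j, "p": p_value, "r_it": r_item_total})
    return results
```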

Alternatives
Classical test theory is an influential theory of test scores in the social sciences. In psychometrics, the theory has been superseded by the more sophisticated models in Item Response Theory (IRT) and Generalizability theory (G-theory). However, IRT is not included in standard statistical packages like SPSS and SAS, whereas these packages routinely provide estimates of Cronbach's α. Specialized psychometric software is necessary for IRT or G-theory. However, general statistical packages often do not provide a complete classical analysis (Cronbach's α is only one of many important statistics), and in many cases, specialized software for classical analysis is also necessary.

Shortcomings of Classical Test Theory


One of the most important and well-known shortcomings of Classical Test Theory is that examinee characteristics and test characteristics cannot be separated: each can only be interpreted in the context of the other. Another shortcoming lies in the definition of reliability that exists in Classical Test Theory, which states that reliability is "the correlation between test scores on parallel forms of a test".[3] The problem with this is that there are differing opinions of what parallel tests are. Various reliability coefficients provide either lower bound estimates of reliability or reliability estimates with unknown biases. A third shortcoming involves the standard error of measurement. The problem here is that, according to Classical Test Theory, the standard error of measurement is assumed to be the same for all examinees. However, as Hambleton explains in his book, scores on any test are unequally precise measures for examinees of different ability, thus making the assumption of equal errors of measurement for all examinees implausible (Hambleton, Swaminathan, Rogers, 1991, p. 4). A fourth and final shortcoming of Classical Test Theory is that it is test oriented, rather than item oriented. In other words, Classical Test Theory cannot help us make predictions of how well an individual or even a group of examinees might do on a test item.[4]


Notes
[1] Traub, R. (1997). Classical Test Theory in Historical Perspective. Educational Measurement: Issues and Practice, 16 (4), 8-14. doi:10.1111/j.1745-3992.1997.tb00603.x
[3] Hambleton, R., Swaminathan, H., Rogers, H. (1991). Fundamentals of Item Response Theory. Newbury Park, California: Sage Publications, Inc.
[4] Hambleton, R., Swaminathan, H., Rogers, H. (1991). Fundamentals of Item Response Theory. Newbury Park, California: Sage Publications, Inc.

References
Allen, M.J., & Yen, W. M. (2002). Introduction to Measurement Theory. Long Grove, IL: Waveland Press. Novick, M.R. (1966) The axioms and principal results of classical test theory Journal of Mathematical Psychology Volume 3, Issue 1, February 1966, Pages 1-18 Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading MA: Addison-Welsley Publishing Company

Further reading
Gregory, Robert J. (2011). Psychological Testing: History, Principles, and Applications (Sixth ed.). Boston: Allyn & Bacon. ISBN 978-0-205-78214-7. Lay summary (http://www.pearsonhighered.com/bookseller/product/Psychological-Testing-History-Principles-and-Applications-6E/9780205782147.page) (7 November 2010).
Hogan, Thomas P.; Brooke Cannon (2007). Psychological Testing: A Practical Introduction (Second ed.). Hoboken (NJ): John Wiley & Sons. ISBN 978-0-471-73807-7. Lay summary (http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP000675.html) (21 November 2010).

External links
International Test Commission article on Classical Test Theory (http://www.intestcom.org/Publications/ORTA/Classical+test+theory.php)

Cluster analysis (in marketing)




Cluster analysis is a class of statistical techniques that can be applied to data that exhibit natural groupings. Cluster analysis sorts through the raw data and groups them into clusters. A cluster is a group of relatively homogeneous cases or observations. Objects in a cluster are similar to each other. They are also dissimilar to objects outside the cluster, particularly objects in other clusters. In marketing, cluster analysis is used for:
segmenting the market and determining target markets
product positioning and new product development
selecting test markets (see: experimental techniques)

Examples
The diagram below illustrates the results of a survey that studied drinkers' perceptions of spirits (alcohol). Each point represents the results from one respondent. The research indicates there are four clusters in this market. The axes represent two traits of the market. In more complex cluster analyses, more than two dimensions may be used.

Illustration of clusters

Another example is the vacation travel market. Recent research has identified three clusters or market segments:
1) The demanders - they want exceptional service and expect to be pampered.
2) The escapists - they want to get away and just relax.
3) The educationalists - they want to see new things, go to museums, go on a safari, or experience new cultures.
Cluster analysis, like factor analysis and multidimensional scaling, is an interdependence technique: it makes no distinction between dependent and independent variables. The entire set of interdependent relationships is examined. It is similar to multidimensional scaling in that both examine inter-object similarity by examining the complete set of interdependent relationships. The difference is that multidimensional scaling identifies underlying dimensions, while cluster analysis identifies clusters. Cluster analysis is the obverse of factor analysis. Whereas factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters.



Procedure
1. Formulate the problem - select the variables to which you wish to apply the clustering technique.
2. Select a distance measure - there are various ways of computing distance (a short code sketch of these measures appears after this list):
Squared Euclidean distance - the sum of the squared differences in value for each variable
Manhattan distance - the sum of the absolute differences in value for each variable
Chebyshev distance - the maximum absolute difference in values for any variable
Mahalanobis (or correlation) distance - this measure uses the correlation structure between the variables as part of the distance computation. This is an important measure since it is unit invariant (can figuratively compare apples to oranges)
3. Select a clustering procedure (see below).
4. Decide on the number of clusters.
5. Map and interpret clusters - draw conclusions - illustrative techniques like perceptual maps, icicle plots, and dendrograms are useful.
6. Assess reliability and validity - various methods:
repeat the analysis using a different distance measure
repeat the analysis using a different clustering technique
split the data randomly into two halves and analyze each part separately
repeat the analysis several times, deleting one variable each time
repeat the analysis several times, using a different order each time
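A minimal sketch of the distance measures named in step 2, computed with plain NumPy; the two observation vectors and the covariance matrix used for the Mahalanobis distance are invented for illustration.

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0])
y = np.array([4.0, 1.0, 3.0])

squared_euclidean = np.sum((x - y) ** 2)     # sum of squared differences
manhattan         = np.sum(np.abs(x - y))    # sum of absolute differences
chebyshev         = np.max(np.abs(x - y))    # maximum absolute difference

# Mahalanobis distance needs the (inverse) covariance matrix of the variables;
# the matrix below is an assumed sample covariance, not real data.
cov = np.array([[1.0, 0.2, 0.0],
                [0.2, 1.5, 0.3],
                [0.0, 0.3, 2.0]])
diff = x - y
mahalanobis = np.sqrt(diff @ np.linalg.inv(cov) @ diff)

print(squared_euclidean, manhattan, chebyshev, round(mahalanobis, 3))
```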

Clustering procedures
There are several types of clustering methods (a code sketch contrasting the two families follows this list):

Non-hierarchical clustering (also called k-means clustering) - first determine a cluster center, then group all objects that are within a certain distance. Examples:
Sequential Threshold method - first determine a cluster center, then group all objects that are within a predetermined threshold from the center; one cluster is created at a time
Parallel Threshold method - several cluster centers are determined simultaneously, then objects that are within a predetermined threshold from the centers are grouped
Optimizing Partitioning method - first a non-hierarchical procedure is run, then objects are reassigned so as to optimize an overall criterion

Hierarchical clustering - objects are organized into a hierarchical structure as part of the procedure. Examples:
Divisive clustering - start by treating all objects as if they are part of a single large cluster, then divide the cluster into smaller and smaller clusters
Agglomerative clustering - start by treating each object as a separate cluster, then group them into bigger and bigger clusters. Examples:
Centroid methods - clusters are generated that maximize the distance between the centers of clusters (a centroid is the mean value for all the objects in the cluster)
Variance methods - clusters are generated that minimize the within-cluster variance, for example:
Ward's Procedure - clusters are generated that minimize the squared Euclidean distance to the center mean
Linkage methods - cluster objects based on the distance between them. Examples:
Single Linkage method - cluster objects based on the minimum distance between them (also called the nearest neighbour rule)
Complete Linkage method - cluster objects based on the maximum distance between them (also called the furthest neighbour rule)
Average Linkage method - cluster objects based on the average distance between all pairs of objects (one member of the pair must be from a different cluster)
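A minimal sketch contrasting the two families above on invented two-dimensional data: hierarchical (agglomerative, Ward's procedure) clustering via SciPy, and non-hierarchical k-means via scikit-learn. The data, the chosen number of clusters, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose groups of 20 points each
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in (0.0, 3.0, 6.0)])

# Agglomerative clustering with Ward's criterion (minimizes within-cluster variance)
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# Non-hierarchical (k-means) clustering
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Ward cluster sizes:   ", np.bincount(hier_labels)[1:])
print("k-means cluster sizes:", np.bincount(km_labels))
```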


References
Sheppard, A. G. (1996). "The sequence of factor analysis and cluster analysis: Differences in segmentation and dimensionality through the use of raw and factor scores". Tourism Analysis, 1 (Inaugural Volume): 49-57.

Cognitive Process Profile


The Cognitive Process Profile (CPP) is an automated simulation exercise that externalises and tracks thinking processes to evaluate:
a person's preferred cognitive style
a suitable work environment (according to Elliott Jaques's stratified systems theory)
personal strengths and development areas
learning potential
the judgement and strategic capability of adults in the work environment

Unlike conventional psychometric ability and IQ tests, which primarily measure crystallised ability in specific content domains, the CPP measures information processing tendencies and capabilities. It also measures 'fluid intelligence' and 'learning potential', by tracking information processing in unfamiliar and fuzzy environments. The CPP predicts cognitive performance in complex, dynamic and vague (or VUCA) work contexts such as professional, strategic and executive environments. It was developed by Dr S M Prinsloo, founder of Cognadev, and released in 1994. Since then it has been translated into several languages and applied internationally for the purposes of leadership assessment, succession planning, selection and development, team compilation as well as personal and team development within the corporate environment.



References
Thompson, D. (2008) Themes of Measurement and Prediction, in Business Psychology in Practice (ed P. Grant), Whurr Publishers Ltd, London, UK. Print ISBN 978-1-86156-476-4 Online ISBN 978-0-470-71328-0

External links
Cognadev developer of the CPP [1]

Further reading
Jaques, Elliott. (1988). Requisite Organisations. Cason Hall & Co, Arlington, VA. ISBN 1-886436-03-7
Beer, Stafford. The Viable System Model: Its Provenance, Development, Methodology and Pathology. The Journal of the Operational Research Society, Vol. 35, No. 1 (Jan., 1984), pp. 7-25.

References
[1] http://www.cognadev.com/products.aspx?pid=1/

Common-method variance
In applied statistics (e.g., in the social sciences and psychometrics), common-method variance (CMV) is the spurious "variance that is attributable to the measurement method rather than to the constructs the measures represent"[] or, equivalently, the "systematic error variance shared among variables measured with and introduced as a function of the same method and/or source".[] Studies affected by CMV or common-method bias suffer from false correlations and run the risk of reporting incorrect research results.[]

Remedies
Ex-ante remedies
Several ex ante remedies exist that help to avoid or minimize possible common method variance. Important remedies have been collected by Chang et al. (2010).[]

Ex-post remedies
Using simulated data sets, Richardson et al. (2009) investigate three ex post techniques to test for common method variance: the correlational marker technique, the confirmatory factor analysis (CFA) marker technique, and the unmeasured latent method construct (ULMC) technique. Only the CFA marker technique turns out to provide some value.[] A comprehensive example of this technique has been demonstrated by Williams et al. (2010).[]

References

Computer-Adaptive Sequential Testing




Computer-adaptive sequential testing (CAST) is another term for multistage testing. A CAST test is a type of computer-adaptive test or computerized classification test that uses pre-defined groups of items called testlets rather than operating at the level of individual items.[1] CAST is a term introduced by psychometricians working for the National Board of Medical Examiners.[2] In CAST, the testlets are referred to as panels.

References
[1] Luecht, R.M. (2005). Some useful cost-benefit criteria for evaluating computer-based test delivery models and systems. Journal of Applied Testing Technology, 7(2). (http://www.testpublishers.org/Documents/JATT2005_rev_Criteria4CBT_RMLuecht_Apr2005.pdf)
[2] Luecht, R. M. & Nungester, R. J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35, 229-249.

Computerized adaptive testing


Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level. For this reason, it has also been called tailored testing.

How CAT works


CAT successively selects questions for the purpose of maximizing the precision of the exam based on what is known about the examinee from previous questions.[1] From the examinee's perspective, the difficulty of the exam seems to tailor itself to his or her level of ability. For example, if an examinee performs well on an item of intermediate difficulty, he will then be presented with a more difficult question. Or, if he performed poorly, he would be presented with a simpler question. Compared to static multiple choice tests that nearly everyone has experienced, with a fixed set of items administered to all examinees, computer-adaptive tests require fewer test items to arrive at equally accurate scores.[1] (Of course, there is nothing about the CAT methodology that requires the items to be multiple-choice; but just as most exams are multiple-choice, most CAT exams also use this format.)

The basic computer-adaptive testing method is an iterative algorithm with the following steps:[2]
1. The pool of available items is searched for the optimal item, based on the current estimate of the examinee's ability
2. The chosen item is presented to the examinee, who then answers it correctly or incorrectly
3. The ability estimate is updated, based upon all prior answers
4. Steps 1-3 are repeated until a termination criterion is met
Nothing is known about the examinee prior to the administration of the first item, so the algorithm is generally started by selecting an item of medium, or medium-easy, difficulty as the first item. (A code sketch of this loop appears at the end of this section.)

As a result of adaptive administration, different examinees receive quite different tests.[3] The psychometric technology that allows equitable scores to be computed across different sets of items is item response theory (IRT). IRT is also the preferred methodology for selecting optimal items, which are typically selected on the basis of information rather than difficulty per se.[2] In the USA, the Graduate Management Admission Test is currently primarily administered as a computer-adaptive test. A list of active CAT programs is found at International Association for Computerized Adaptive Testing [4], along with a list of current CAT research programs and a near-inclusive bibliography of all published CAT research.

A related methodology called multistage testing (MST) or CAST is used in the Uniform Certified Public Accountant Examination. MST avoids or reduces some of the disadvantages of CAT as described below. See the 2006 special issue of Applied Measurement in Education [5] for more information on MST.
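A minimal sketch of the four-step loop above, not any operational program's actual algorithm. It assumes a Rasch (1PL) item pool with invented difficulties, maximum-information item selection, an EAP (expected a posteriori) ability update over a quadrature grid, and a standard-error-based termination rule; the simulated examinee and all numeric settings are illustrative.

```python
import numpy as np

def prob_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

difficulties = np.linspace(-2.5, 2.5, 40)   # calibrated item pool (assumed)
quad = np.linspace(-4, 4, 81)               # quadrature points for EAP scoring
prior = np.exp(-0.5 * quad ** 2)            # standard normal prior (unnormalized)

true_theta = 1.2                            # simulated examinee
rng = np.random.default_rng(1)

administered, responses = [], []
theta_hat, sem = 0.0, np.inf                # start at average ability

while sem > 0.30 and len(administered) < 20:            # termination criterion
    # 1. pick the unused item with maximum information at the current estimate
    #    (for the Rasch model this is the item whose difficulty is closest to theta_hat)
    remaining = [i for i in range(len(difficulties)) if i not in administered]
    item = min(remaining, key=lambda i: abs(difficulties[i] - theta_hat))
    # 2. administer it (here: simulate the examinee's response)
    u = int(rng.random() < prob_correct(true_theta, difficulties[item]))
    administered.append(item)
    responses.append(u)
    # 3. update the ability estimate by EAP over the quadrature grid
    like = prior.copy()
    for i, r in zip(administered, responses):
        p = prob_correct(quad, difficulties[i])
        like *= p if r else (1 - p)
    like /= like.sum()
    theta_hat = np.sum(quad * like)
    sem = np.sqrt(np.sum((quad - theta_hat) ** 2 * like))   # posterior SD as the SEM
    # 4. repeat until the termination criterion is met

print(len(administered), round(theta_hat, 2), round(sem, 2))
```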


Advantages
Adaptive tests can provide uniformly precise scores for most test-takers.[2] In contrast, standard fixed tests almost always provide the best precision for test-takers of medium ability and increasingly poorer precision for test-takers with more extreme test scores. An adaptive test can typically be shortened by 50% and still maintain a higher level of precision than a fixed version.[1] This translates into a time savings for the test-taker. Test-takers do not waste their time attempting items that are too hard or trivially easy. Additionally, the testing organization benefits from the time savings; the cost of examinee seat time is substantially reduced. However, because the development of a CAT involves much more expense than a standard fixed-form test, a large population is necessary for a CAT testing program to be financially fruitful. Like any computer-based test, adaptive tests may show results immediately after testing. Adaptive testing, depending on the item selection algorithm, may reduce exposure of some items because examinees typically receive different sets of items rather than the whole population being administered a single set. However, it may increase the exposure of others (namely the medium or medium/easy items presented to most examinees at the beginning of the test).[2]

Disadvantages
The first issue encountered in CAT is the calibration of the item pool. In order to model the characteristics of the items (e.g., to pick the optimal item), all the items of the test must be pre-administered to a sizable sample and then analyzed. To achieve this, new items must be mixed into the operational items of an exam (the responses are recorded but do not contribute to the test-takers' scores), called "pilot testing," "pre-testing," or "seeding."[2] This presents logistical, ethical, and security issues. For example, it is impossible to field an operational adaptive test with brand-new, unseen items;[6] all items must be pretested with a large enough sample to obtain stable item statistics. This sample may be required to be as large as 1,000 examinees.[6] Each program must decide what percentage of the test can reasonably be composed of unscored pilot test items. Although adaptive tests have exposure control algorithms to prevent overuse of a few items,[2] the exposure conditioned upon ability is often not controlled and can easily become close to 1. That is, it is common for some items to become very common on tests for people of the same ability. This is a serious security concern because groups sharing items may well have a similar functional ability level. In fact, a completely randomized exam is the most secure (but also least efficient). Review of past items is generally disallowed. Adaptive tests tend to administer easier items after a person answers incorrectly. Supposedly, an astute test-taker could use such clues to detect incorrect answers and correct them. Or, test-takers could be coached to deliberately pick wrong answers, leading to an increasingly easier test. After tricking the adaptive test into building a maximally easy exam, they could then review the items and answer them correctlypossibly achieving a very high score. Test-takers frequently complain about the inability to review.[7] Because of the sophistication, the development of a CAT has a number of prerequisites.[8] The large sample sizes (typically hundreds of examinees) required by IRT calibrations must be present. Items must be scorable in real time if a new item is to be selected instantaneously. Psychometricians experienced with IRT calibrations and CAT simulation research are necessary to provide validity documentation. Finally, a software system capable of true IRT-based CAT must be available.



CAT components
There are five technical components in building a CAT (the following is adapted from Weiss & Kingsbury, 1984[1]). This list does not include practical issues, such as item pretesting or live field release.
1. Calibrated item pool
2. Starting point or entry level
3. Item selection algorithm
4. Scoring procedure
5. Termination criterion

Calibrated Item Pool


A pool of items must be available for the CAT to choose from.[1] The pool must be calibrated with a psychometric model, which is used as a basis for the remaining four components. Typically, item response theory is employed as the psychometric model.[1] One reason item response theory is popular is because it places persons and items on the same metric (denoted by the Greek letter theta), which is helpful for issues in item selection (see below).

Starting Point
In CAT, items are selected based on the examinee's performance up to a given point in the test. However, the CAT is obviously not able to make any specific estimate of examinee ability when no items have been administered. So some other initial estimate of examinee ability is necessary. If some previous information regarding the examinee is known, it can be used,[1] but often the CAT just assumes that the examinee is of average ability - hence the first item often being of medium difficulty.

Item Selection Algorithm


As mentioned previously, item response theory places examinees and items on the same metric. Therefore, if the CAT has an estimate of examinee ability, it is able to select an item that is most appropriate for that estimate.[6] Technically, this is done by selecting the item with the greatest information at that point.[1] Information is a function of the discrimination parameter of the item, as well as the conditional variance and pseudoguessing parameter (if used).
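A minimal sketch of the selection criterion described above: Fisher information for a three-parameter logistic (3PL) item evaluated at the current ability estimate, with the most informative item chosen. The small item pool and the ability estimate are invented values, not real calibrations.

```python
import numpy as np

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    q = 1 - p
    return a ** 2 * (q / p) * ((p - c) / (1 - c)) ** 2

pool = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.5, 0.7, 0.15)]   # (a, b, c) per item
theta_hat = 0.4

infos = [item_information(theta_hat, *params) for params in pool]
best = int(np.argmax(infos))   # the item offering the most information at theta_hat
print(infos, best)
```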

Scoring Procedure
After an item is administered, the CAT updates its estimate of the examinee's ability level. If the examinee answered the item correctly, the CAT will likely estimate their ability to be somewhat higher, and vice versa. This is done by using the item response function from item response theory to obtain a likelihood function of the examinee's ability. Two methods for this are called maximum likelihood estimation and Bayesian estimation. The latter assumes an a priori distribution of examinee ability, and has two commonly used estimators: expectation a posteriori and maximum a posteriori. Maximum likelihood is equivalent to a Bayes maximum a posteriori estimate if a uniform (f(x)=1) prior is assumed.[6] Maximum likelihood is asymptotically unbiased, but cannot provide a theta estimate for a nonmixed (all correct or incorrect) response vector, in which case a Bayesian method may have to be used temporarily.[1]
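A minimal sketch of the scoring step under stated assumptions (Rasch items with invented difficulties, an all-correct response vector, a coarse grid of theta values): the maximum-likelihood estimate is taken as the grid argmax and simply runs to the edge of the grid because the likelihood keeps increasing for a nonmixed response vector, while the EAP estimate under a standard normal prior remains finite.

```python
import numpy as np

b = np.array([-1.0, 0.0, 1.0])      # Rasch item difficulties (invented)
responses = np.array([1, 1, 1])     # nonmixed (all correct) response vector
grid = np.linspace(-4, 4, 161)

p = 1.0 / (1.0 + np.exp(-(grid[:, None] - b[None, :])))
loglik = np.sum(np.where(responses == 1, np.log(p), np.log(1 - p)), axis=1)

ml_estimate = grid[np.argmax(loglik)]            # hits the edge of the grid (+4): no finite maximum
posterior = np.exp(loglik) * np.exp(-0.5 * grid ** 2)
posterior /= posterior.sum()
eap_estimate = np.sum(grid * posterior)          # finite, pulled toward 0 by the prior

print(ml_estimate, round(eap_estimate, 2))
```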



Termination Criterion
The CAT algorithm is designed to repeatedly administer items and update the estimate of examinee ability. This will continue until the item pool is exhausted unless a termination criterion is incorporated into the CAT. Often, the test is terminated when the examinee's standard error of measurement falls below a certain user-specified value, hence the statement above that an advantage is that examinee scores will be uniformly precise or "equiprecise."[1] Other termination criteria exist for different purposes of the test, such as if the test is designed only to determine if the examinee should "Pass" or "Fail" the test, rather than obtaining a precise estimate of their ability.[1][9]

Other issues
Pass-Fail CAT
In many situations, the purpose of the test is to classify examinees into two or more mutually exclusive and exhaustive categories. This includes the common "mastery test" where the two classifications are "pass" and "fail," but also includes situations where there are three or more classifications, such as "Insufficient," "Basic," and "Advanced" levels of knowledge or competency. The kind of "item-level adaptive" CAT described in this article is most appropriate for tests that are not "pass/fail" or for pass/fail tests where providing good feedback is extremely important. Some modifications are necessary for a pass/fail CAT, also known as a computerized classification test (CCT).[9] For examinees with true scores very close to the passing score, computerized classification tests will result in long tests, while those with true scores far above or below the passing score will have the shortest exams. For example, a new termination criterion and scoring algorithm must be applied that classifies the examinee into a category rather than providing a point estimate of ability. There are two primary methodologies available for this.

The more prominent of the two is the sequential probability ratio test (SPRT).[10][11] This formulates the examinee classification problem as a hypothesis test that the examinee's ability is equal to either some specified point above the cutscore or another specified point below the cutscore. Note that this is a point hypothesis formulation rather than a composite hypothesis formulation,[12] which is more conceptually appropriate. A composite hypothesis formulation would be that the examinee's ability is in the region above the cutscore or the region below the cutscore.

A confidence interval approach is also used, where after each item is administered, the algorithm determines the probability that the examinee's true score is above or below the passing score.[13][14] For example, the algorithm may continue until the 95% confidence interval for the true score no longer contains the passing score. At that point, no further items are needed because the pass-fail decision is already 95% accurate, assuming that the psychometric models underlying the adaptive testing fit the examinee and test. This approach was originally called "adaptive mastery testing"[13] but it can be applied to non-adaptive item selection and classification situations of two or more cutscores (the typical mastery test has a single cutscore).[14]

As a practical matter, the algorithm is generally programmed to have a minimum and a maximum test length (or a minimum and maximum administration time). Otherwise, it would be possible for an examinee with ability very close to the cutscore to be administered every item in the bank without the algorithm making a decision. The item selection algorithm utilized depends on the termination criterion. Maximizing information at the cutscore is more appropriate for the SPRT because it maximizes the difference in the probabilities used in the likelihood ratio.[15] Maximizing information at the ability estimate is more appropriate for the confidence interval approach because it minimizes the conditional standard error of measurement, which decreases the width of the confidence interval needed to make a classification.[14]
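A minimal sketch of the SPRT decision rule described above for a pass/fail CCT, assuming a Rasch model. The two hypothesized ability points bracketing the cutscore, the error rates, and the administered items and responses are all invented values for illustration.

```python
import numpy as np

def rasch_p(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta_lo, theta_hi = -0.3, 0.3           # specified points below and above the cutscore (0.0)
alpha, beta = 0.05, 0.05                 # tolerated classification error rates
upper = np.log((1 - beta) / alpha)       # decision thresholds on the log-likelihood ratio
lower = np.log(beta / (1 - alpha))

b_admin = np.array([-0.5, 0.2, 0.1, 0.4, -0.1])   # difficulties of items given so far
u_admin = np.array([1, 1, 0, 1, 1])               # scored responses

p_hi, p_lo = rasch_p(theta_hi, b_admin), rasch_p(theta_lo, b_admin)
llr = np.sum(u_admin * np.log(p_hi / p_lo) + (1 - u_admin) * np.log((1 - p_hi) / (1 - p_lo)))

if llr >= upper:
    decision = "pass"
elif llr <= lower:
    decision = "fail"
else:
    decision = "continue testing"
print(round(llr, 2), decision)
```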



Practical Constraints of Adaptivity


ETS researcher Martha Stocking has quipped that most adaptive tests are actually barely adaptive tests (BATs) because, in practice, many constraints are imposed upon item choice. For example, CAT exams must usually meet content specifications;[2] a verbal exam may need to be composed of equal numbers of analogies, fill-in-the-blank and synonym item types. CATs typically have some form of item exposure constraints,[2] to prevent the most informative items from being over-exposed. Also, on some tests, an attempt is made to balance surface characteristics of the items such as gender of the people in the items or the ethnicities implied by their names. Thus CAT exams are frequently constrained in which items they may choose, and for some exams the constraints may be substantial and require complex search strategies (e.g., linear programming) to find suitable items. A simple method for controlling item exposure is the "randomesque" or strata method. Rather than selecting the most informative item at each point in the test, the algorithm randomly selects the next item from the next five or ten most informative items. This can be used throughout the test, or only at the beginning.[2] Another method is the Sympson-Hetter method,[16] in which a random number is drawn from U(0,1), and compared to a ki parameter determined for each item by the test user. If the random number is greater than ki, the next most informative item is considered.[2] Wim van der Linden and colleagues[17] have advanced an alternative approach called shadow testing which involves creating entire shadow tests as part of selecting items. Selecting items from shadow tests helps adaptive tests meet selection criteria by focusing on globally optimal choices (as opposed to choices that are optimal for a given item).
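A minimal sketch of the "randomesque" exposure-control method described above: instead of always taking the single most informative item, the next item is chosen at random from the k most informative candidates. The array of item information values stands in for a real calibration and is invented here.

```python
import numpy as np

rng = np.random.default_rng(7)
info_at_theta = rng.uniform(0.1, 1.0, size=50)   # stand-in for item information at the current estimate

k = 5
top_k = np.argsort(info_at_theta)[::-1][:k]      # indices of the k most informative items
next_item = rng.choice(top_k)                    # administer one of them at random
print(top_k, next_item)
```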

References
[1] Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375. [2] Thissen, D., & Mislevy, R.J. (2000). Testing Algorithms. In Wainer, H. (Ed.) Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates. [3] Green, B.F. (2000). System design and operation. In Wainer, H. (Ed.) Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates. [4] http:/ / www. iacat. org/ [5] http:/ / www. leaonline. com/ toc/ ame/ 19/ 3 [6] Wainer, H., & Mislevy, R.J. (2000). Item response theory, calibration, and estimation. In Wainer, H. (Ed.) Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates. [7] http:/ / edres. org/ scripts/ cat/ catdemo. htm [8] http:/ / www. fasttestweb. com/ ftw-docs/ CAT_Requirements. pdf [9] Lin, C.-J. & Spray, J.A. (2000). Effects of item-selection criteria on classification testing with the sequential probability ratio test. (Research Report 2000-8). Iowa City, IA: ACT, Inc. [10] Wald, A. (1947). Sequential analysis. New York: Wiley. [11] Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press. [12] Weitzman, R. A. (1982). Sequential testing for selection. Applied Psychological Measurement, 6, 337-351. [13] Kingsbury, G.G., & Weiss, D.J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press. [14] Eggen, T. J. H. M, & Straetmans, G. J. J. M. (2000). Computerized adaptive testing for classifying examinees into three categories. Educational and Psychological Measurement, 60, 713-734. [15] Spray, J. A., & Reckase, M. D. (1994). The selection of test items for decision making with a computerized adaptive test. Paper presented at the Annual Meeting of the National Council for Measurement in Education (New Orleans, LA, April 57, 1994). [16] Sympson, B.J., & Hetter, R.D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the annual conference of the Military Testing Association, San Diego. [17] For example: van der Linden, W. J., & Veldkamp, B. P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273291.



Additional sources
Drasgow, F., & Olson-Buchanan, J. B. (Eds.). (1999). Innovations in computerized assessment. Hillsdale, NJ: Erlbaum. Van der Linden, W. J., & Glas, C.A.W. (Eds.). (2000). Computerized adaptive testing: Theory and practice. Boston, MA: Kluwer. Wainer, H. (Ed.). (2000). Computerized adaptive testing: A Primer (2nd Edition). Mahwah, NJ: ELawrence Erlbaum Associates. Weiss, D.J. (Ed.). (1983). New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.

Further reading
"First Adaptive Test: Binet's IQ Test" (http://iacat.org/node/442), International Association for Computerized Adaptive Testing (IACAT) Sands, William A. (Ed); Waters, Brian K. (Ed); McBride, James R. (Ed), Computerized adaptive testing: From inquiry to operation (http://psycnet.apa.org/books/10244/), Washington, DC, US: American Psychological Association. (1997). xvii 292 pp. doi: 10.1037/10244-000 Zara, Anthony R., "Using Computerized Adaptive Testing to Evaluate Nurse Competence for Licensure: Some History and Forward Look" (http://www.springerlink.com/content/mh6p73432451g446/), Advances in Health Sciences Education, Volume 4, Number 1 (1999), 39-48, DOI: 10.1023/A:1009866321381

External links
International Association for Computerized Adaptive Testing (http://www.iacat.org)
Concerto: Open-source CAT Platform (http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm)
CAT Central (http://www.psych.umn.edu/psylabs/catcentral/) by David J. Weiss
Frequently Asked Questions about Computer-Adaptive Testing (CAT) (http://www.carla.umn.edu/assessment/CATfaq.html). Retrieved April 15, 2005.
An On-line, Interactive, Computer Adaptive Testing Tutorial (http://edres.org/scripts/cat/catdemo.htm) by Lawrence L. Rudner. November 1998. Retrieved April 15, 2005.
Special issue: An introduction to multistage testing. (http://www.leaonline.com/toc/ame/19/3) Applied Measurement in Education, 19(3).
Computerized Adaptive Tests (http://www.ericdigests.org/pre-9213/tests.htm) - from the Education Resources Information Center Clearinghouse on Tests Measurement and Evaluation, Washington, DC

Computerized classification test




A computerized classification test (CCT) refers to, as its name would suggest, a test that is administered by computer for the purpose of classifying examinees. The most common CCT is a mastery test where the test classifies examinees as "Pass" or "Fail," but the term also includes tests that classify examinees into more than two categories. While the term may generally be considered to refer to all computer-administered tests for classification, it is usually used to refer to tests that are interactively administered or of variable-length, similar to computerized adaptive testing (CAT). Like CAT, variable-length CCTs can accomplish the goal of the test (accurate classification) with a fraction of the number of items used in a conventional fixed-form test.

A CCT requires several components:
1. An item bank calibrated with a psychometric model selected by the test designer
2. A starting point
3. An item selection algorithm
4. A termination criterion and scoring procedure

The starting point is not a topic of contention; research on CCT primarily investigates the application of different methods for the other three components. Note: The termination criterion and scoring procedure are separate in CAT, but the same in CCT because the test is terminated when a classification is made. Therefore, there are five components that must be specified to design a CAT. An introduction to CCT is found in Thompson (2007)[1] and a book by Parshall, Spray, Kalohn and Davey (2006).[2] A bibliography of published CCT research is found below.

How a CCT Works


A CCT is very similar to a CAT. Items are administered one at a time to an examinee. After the examinee responds to the item, the computer scores it and determines if the examinee is able to be classified yet. If they are, the test is terminated and the examinee is classified. If not, another item is administered. This process repeats until the examinee is classified or another ending point is satisfied (all items in the bank have been administered, or a maximum test length is reached).

Psychometric Model
Two approaches are available for the psychometric model of a CCT: classical test theory (CTT) and item response theory (IRT). Classical test theory assumes a state model because it is applied by determining item parameters for a sample of examinees determined to be in each category. For instance, several hundred "masters" and several hundred "nonmasters" might be sampled to determine the difficulty and discrimination for each, but doing so requires that you be able to easily identify a distinct set of people that are in each group. IRT, on the other hand, assumes a trait model; the knowledge or ability measured by the test is a continuum. The classification groups will need to be more or less arbitrarily defined along the continuum, such as the use of a cutscore to demarcate masters and nonmasters, but the specification of item parameters assumes a trait model. There are advantages and disadvantages to each. CTT offers greater conceptual simplicity. More importantly, CTT requires fewer examinees in the sample for calibration of item parameters to be used eventually in the design of the CCT, making it useful for smaller testing programs. See Frick (1992)[3] for a description of a CTT-based CCT. Most CCTs, however, utilize IRT. IRT offers greater specificity, but the most important reason may be that the design of a CCT (and a CAT) is expensive, and is therefore more likely done by a large testing program with extensive resources. Such a program would likely use IRT.



Starting point
A CCT must have a specified starting point to enable certain algorithms. If the sequential probability ratio test is used as the termination criterion, it implicitly assumes a starting ratio of 1.0 (equal probability of the examinee being a master or nonmaster). If the termination criterion is a confidence interval approach, a specified starting point on theta must be specified. Usually, this is 0.0, the center of the distribution, but it could also be randomly drawn from a certain distribution if the parameters of the examinee distribution are known. Also, previous information regarding an individual examinee, such as their score the last time they took the test (if re-taking) may be used.

Item Selection
In a CCT, items are selected for administration throughout the test, unlike the traditional method of administering a fixed set of items to all examinees. While this is usually done by individual item, it can also be done in groups of items known as testlets (Luecht & Nungester, 1998;[4] Vos & Glas, 2000[5]). Methods of item selection fall into two categories: cutscore-based and estimate-based. Cutscore-based methods (also known as sequential selection) maximize the information provided by the item at the cutscore, or cutscores if there is more than one, regardless of the ability of the examinee. Estimate-based methods (also known as adaptive selection) maximize information at the current estimate of examinee ability, regardless of the location of the cutscore. Both work efficiently, but the efficiency depends in part on the termination criterion employed. Because the sequential probability ratio test only evaluates probabilities near the cutscore, cutscore-based item selection is more appropriate. Because the confidence interval termination criterion is centered around the examinee's ability estimate, estimate-based item selection is more appropriate. This is because the test will make a classification when the confidence interval is small enough to be completely above or below the cutscore (see below). The confidence interval will be smaller when the standard error of measurement is smaller, and the standard error of measurement will be smaller when there is more information at the theta level of the examinee.
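A minimal sketch contrasting the two selection rules described above, using the Rasch item information function. The item pool, the cutscore, and the current ability estimate are invented values for illustration.

```python
import numpy as np

def rasch_info(theta, b):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1 - p)

difficulties = np.linspace(-2, 2, 21)
cutscore, theta_hat = 0.0, 1.1

sequential_pick = int(np.argmax(rasch_info(cutscore, difficulties)))   # cutscore-based selection
adaptive_pick   = int(np.argmax(rasch_info(theta_hat, difficulties)))  # estimate-based selection
print(sequential_pick, adaptive_pick)
```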

Termination criterion
There are three termination criteria commonly used for CCTs. Bayesian decision theory methods offer great flexibility by presenting an infinite choice of loss/utility structures and evaluation considerations, but also introduce greater arbitrariness. A confidence interval approach calculates a confidence interval around the examinee's current theta estimate at each point in the test, and classifies the examinee when the interval falls completely within a region of theta that defines a classification. This was originally known as adaptive mastery testing (Kingsbury & Weiss, 1983), but does not necessarily require adaptive item selection, nor is it limited to the two-classification mastery testing situation. The sequential probability ratio test (Reckase, 1983) defines the classification problem as a hypothesis test that the examinee's theta is equal to a specified point above the cutscore or a specified point below the cutscore.
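A minimal sketch of the confidence interval ("adaptive mastery testing") termination rule: classify once the interval around the ability estimate no longer contains the cutscore. The ability estimate and standard error are assumed to come from the scoring step; the values here are invented.

```python
theta_hat, sem, cutscore = 0.62, 0.25, 0.0
z = 1.96   # 95% interval

lower, upper = theta_hat - z * sem, theta_hat + z * sem
if lower > cutscore:
    decision = "classify as master (pass)"
elif upper < cutscore:
    decision = "classify as nonmaster (fail)"
else:
    decision = "administer another item"
print(decision)
```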



References
[1] Thompson, N. A. (2007). A Practitioner's Guide for Variable-length Computerized Classification Testing. Practical Assessment Research & Evaluation, 12(1). (http://pareonline.net/getvn.asp?v=12&n=1)
[2] Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2006). Practical considerations in computer-based testing. New York: Springer.
[3] Frick, T. (1992). Computerized Adaptive Mastery Tests as Expert Systems. Journal of Educational Computing Research, 8(2), 187-213.
[4] Luecht, R. M., & Nungester, R. J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35, 229-249.
[5] Vos, H.J. & Glas, C.A.W. (2000). Testlet-based adaptive mastery testing. In van der Linden, W.J., and Glas, C.A.W. (Eds.) Computerized Adaptive Testing: Theory and Practice.

A bibliography of CCT research


Armitage, P. (1950). Sequential analysis with more than two alternative hypotheses, and its relation to discriminant function analysis. Journal of the Royal Statistical Society, 12, 137-144.
Braun, H., Bejar, I.I., and Williamson, D.M. (2006). Rule-based methods for automated scoring: Application in a licensing context. In Williamson, D.M., Mislevy, R.J., and Bejar, I.I. (Eds.) Automated scoring of complex tasks in computer-based testing. Mahwah, NJ: Erlbaum.
Dodd, B. G., De Ayala, R. J., & Koch, W. R. (1995). Computerized adaptive testing with polytomous items. Applied Psychological Measurement, 19, 5-22.
Eggen, T. J. H. M. (1999). Item selection in adaptive testing with the sequential probability ratio test. Applied Psychological Measurement, 23, 249-261.
Eggen, T. J. H. M, & Straetmans, G. J. J. M. (2000). Computerized adaptive testing for classifying examinees into three categories. Educational and Psychological Measurement, 60, 713-734.
Epstein, K. I., & Knerr, C. S. (1977). Applications of sequential testing procedures to performance testing. Paper presented at the 1977 Computerized Adaptive Testing Conference, Minneapolis, MN.
Ferguson, R. L. (1969). The development, implementation, and evaluation of a computer-assisted branched test for a program of individually prescribed instruction. Unpublished doctoral dissertation, University of Pittsburgh.
Frick, T. W. (1989). Bayesian adaptation during computer-based tests and computer-guided exercises. Journal of Educational Computing Research, 5, 89-114.
Frick, T. W. (1990). A comparison of three decisions models for adapting the length of computer-based mastery tests. Journal of Educational Computing Research, 6, 479-513.
Frick, T. W. (1992). Computerized adaptive mastery tests as expert systems. Journal of Educational Computing Research, 8, 187-213.
Huang, C.-Y., Kalohn, J.C., Lin, C.-J., and Spray, J. (2000). Estimating Item Parameters from Classical Indices for Item Pool Development with a Computerized Classification Test. (Research Report 2000-4). Iowa City, IA: ACT, Inc.
Jacobs-Cassuto, M.S. (2005). A Comparison of Adaptive Mastery Testing Using Testlets With the 3-Parameter Logistic Model. Unpublished doctoral dissertation, University of Minnesota, Minneapolis, MN.
Jiao, H., & Lau, A. C. (2003). The Effects of Model Misfit in Computerized Classification Test. Paper presented at the annual meeting of the National Council of Educational Measurement, Chicago, IL, April 2003.
Jiao, H., Wang, S., & Lau, C. A. (2004). An Investigation of Two Combination Procedures of SPRT for Three-category Classification Decisions in Computerized Classification Test. Paper presented at the annual meeting of the American Educational Research Association, San Antonio, April 2004.
Kalohn, J. C., & Spray, J. A. (1999). The effect of model misspecification on classification decisions made using a computerized test. Journal of Educational Measurement, 36, 47-59.
Kingsbury, G.G., & Weiss, D.J. (1979). An adaptive testing strategy for mastery decisions. Research report 79-05. Minneapolis: University of Minnesota, Psychometric Methods Laboratory.

Kingsbury, G.G., & Weiss, D.J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.
Lau, C. A. (1996). Robustness of a unidimensional computerized testing mastery procedure with multidimensional testing data. Unpublished doctoral dissertation, University of Iowa, Iowa City, IA.
Lau, C. A., & Wang, T. (1998). Comparing and combining dichotomous and polytomous items with SPRT procedure in computerized classification testing. Paper presented at the annual meeting of the American Educational Research Association, San Diego.
Lau, C. A., & Wang, T. (1999). Computerized classification testing under practical constraints with a polytomous model. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
Lau, C. A., & Wang, T. (2000). A new item selection procedure for mixed item type in computerized classification testing. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, Louisiana.
Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement, 14, 367-386.
Lin, C.-J. & Spray, J.A. (2000). Effects of item-selection criteria on classification testing with the sequential probability ratio test. (Research Report 2000-8). Iowa City, IA: ACT, Inc.
Linn, R. L., Rock, D. A., & Cleary, T. A. (1972). Sequential testing for dichotomous decisions. Educational & Psychological Measurement, 32, 85-95.
Luecht, R. M. (1996). Multidimensional Computerized Adaptive Testing in a Certification or Licensure Context. Applied Psychological Measurement, 20, 389-404.
Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.
Rudner, L. M. (2002). An examination of decision-theory adaptive testing procedures. Paper presented at the annual meeting of the American Educational Research Association, April 15, 2002, New Orleans, LA.
Sheehan, K., & Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement, 16, 65-76.
Spray, J. A. (1993). Multiple-category classification using a sequential probability ratio test (Research Report 93-7). Iowa City, Iowa: ACT, Inc.
Spray, J. A., Abdel-fattah, A. A., Huang, C., and Lau, C. A. (1997). Unidimensional approximations for a computerized test when the item pool and latent space are multidimensional (Research Report 97-5). Iowa City, Iowa: ACT, Inc.
Spray, J. A., & Reckase, M. D. (1987). The effect of item parameter estimation error on decisions made using the sequential probability ratio test (Research Report 87-17). Iowa City, IA: ACT, Inc.
Spray, J. A., & Reckase, M. D. (1994). The selection of test items for decision making with a computerized adaptive test. Paper presented at the Annual Meeting of the National Council for Measurement in Education (New Orleans, LA, April 5-7, 1994).
Spray, J. A., & Reckase, M. D. (1996). Comparison of SPRT and sequential Bayes procedures for classifying examinees into two categories using a computerized test. Journal of Educational & Behavioral Statistics, 21, 405-414.
Thompson, N.A. (2006). Variable-length computerized classification testing with item response theory. CLEAR Exam Review, 17(2).
Vos, H. J. (1998). Optimal sequential rules for computer-based instruction. Journal of Educational Computing Research, 19, 133-154.


Vos, H. J. (1999). Applications of Bayesian decision theory to sequential mastery testing. Journal of Educational and Behavioral Statistics, 24, 271-292.
Wald, A. (1947). Sequential analysis. New York: Wiley.
Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.
Weissman, A. (2004). Mutual information item selection in multiple-category classification CAT. Paper presented at the Annual Meeting of the National Council for Measurement in Education, San Diego, CA.
Weitzman, R. A. (1982a). Sequential testing for selection. Applied Psychological Measurement, 6, 337-351.
Weitzman, R. A. (1982b). Use of sequential testing to prescreen prospective entrants into military service. In D. J. Weiss (Ed.), Proceedings of the 1982 Computerized Adaptive Testing Conference. Minneapolis, MN: University of Minnesota, Department of Psychology, Psychometric Methods Program, 1982.


External links
Measurement Decision Theory (http://edres.org/mdt/) by Lawrence Rudner CAT Central (http://www.psych.umn.edu/psylabs/catcentral/) by David J. Weiss

Congruence coefficient
In multivariate statistics, the congruence coefficient is an index of the similarity between factors that have been derived in a factor analysis. It was introduced in 1948 by Cyril Burt who referred to it as unadjusted correlation. It is also called Tucker's congruence coefficient after Ledyard Tucker who popularized the technique. Its values range between -1 and +1. It can be used to study the similarity of extracted factors across different samples of, for example, test takers who have taken the same test.[1][2][3] Generally, a congruence coefficient of 0.90 is interpreted as indicating a high degree of factor similarity, while a coefficient of 0.95 or higher indicates that the factors are virtually identical. Alternatively, a value in the range 0.850.94 has been seen as corresponding to a fair similarity, with values higher than 0.95 indicating that the factors can be considered to be equal.[1][2]

Definition
Let X and Y be column vectors of factor loadings for two different samples. The formula for the congruence coefficient, or r_c, is then[2]

$$r_c = \frac{\sum_i x_i y_i}{\sqrt{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)}}.$$

The congruence coefficient can also be defined as the cosine of the angle between factor axes based on the same set of variables (e.g., tests) obtained for two samples (see Cosine similarity). For example, with perfect congruence the angle between the factor axes is 0 degrees, and the cosine of 0 is 1.[2]
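A minimal sketch of the computation for two columns of factor loadings; the loading values are invented for illustration.

```python
import numpy as np

x = np.array([0.71, 0.65, 0.58, 0.70, 0.40])   # loadings from sample 1
y = np.array([0.68, 0.60, 0.62, 0.73, 0.35])   # loadings from sample 2

# Sum of cross-products over the square root of the product of sums of squares;
# equivalently the cosine of the angle between the two loading vectors.
r_c = np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
print(round(r_c, 3))
```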



Comparison with Pearson's r


The congruence coefficient is preferred to Pearson's r as a measure of factor similarity, because the latter may produce misleading results. The computation of the congruence coefficient is based on the deviations of factor loadings from zero, whereas r is based on the deviations from the mean of the factor loadings.[2]

References
[1] Lorenzo-Seva, U. & ten Berge, J.M.F. (2006). Tucker's Congruence Coefficient as a Meaningful Index of Factor Similarity. Methodology, 2, 57-64.
[2] Jensen, A.R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger, pp. 99-100.
[3] Hervé, A. (2007). RV Coefficient and Congruence Coefficient. (http://wwwpub.utdallas.edu/~herve/Abdi-RV2007-pretty.pdf) In Neil Salkind (Ed.), Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage.

Conjoint analysis
See also: Conjoint analysis (in marketing), Conjoint analysis (in healthcare), IDDEA, Rule Developing Experimentation, Value based pricing.

Conjoint analysis, also called multi-attribute compositional models or stated preference analysis, is a statistical technique that originated in mathematical psychology. Today it is used in many of the social sciences and applied sciences including marketing, product management, and operations research. It is not to be confused with the theory of conjoint measurement.

Methodology
Conjoint analysis requires research participants to make a series of trade-offs. Analysis of these trade-offs will reveal the relative importance of component attributes. To improve the predictive ability of this analysis, research participants should be grouped into similar segments based on objectives, values and/or other factors. The exercise can be administered to survey respondents in a number of different ways. Traditionally it is administered as a ranking exercise and sometimes as a rating exercise (where the respondent awards each trade-off scenario a score indicating appeal). In more recent years it has become common practice to present the trade-offs as a choice exercise (where the respondent simply chooses the most preferred alternative from a selection of competing alternatives - particularly common when simulating consumer choices) or as a constant sum allocation exercise (particularly common in pharmaceutical market research, where physicians indicate likely shares of prescribing, and each alternative in the trade-off is the description of a real or hypothetical therapy). Analysis is traditionally carried out with some form of multiple regression, but more recently the use of hierarchical Bayesian analysis has become widespread, enabling fairly robust statistical models of individual respondent decision behaviour to be developed. When there are many attributes, experiments with Conjoint Analysis include problems of information overload that affect the validity of such experiments. The impact of these problems can be avoided or reduced by using Hierarchical Information Integration.[1]
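A minimal sketch of the traditional regression-based analysis mentioned above: part-worth utilities estimated by ordinary least squares from dummy-coded attribute levels. The two attributes, their levels, and the respondent's preference ratings are invented, and real studies typically use many more profiles and respondents (or hierarchical Bayesian estimation).

```python
import numpy as np

# Each row is a profile of two 2-level attributes, coded as 0/1 dummies:
# column 1 = "price is low", column 2 = "brand is A" (full factorial of 4 profiles)
X = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
], dtype=float)
X_design = np.column_stack([np.ones(len(X)), X])   # add an intercept

preference = np.array([9.0, 7.0, 5.0, 2.0])        # one respondent's ratings of the profiles

coef, *_ = np.linalg.lstsq(X_design, preference, rcond=None)
intercept, partworth_price_low, partworth_brand_a = coef
print(partworth_price_low, partworth_brand_a)      # estimated part-worth of each attribute level
```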



Example
A real estate developer is interested in building a high rise apartment complex near an urban Ivy League university. To ensure the success of the project, a market research firm is hired to conduct focus groups with current students. Students are segmented by academic year (freshman, upper classmen, graduate studies) and amount of financial aid received. Study participants are given a series of index cards. Each card has 6 attributes to describe the potential building project (proximity to campus, cost, telecommunication packages, laundry options, floor plans, and security features offered). The estimated cost to construct the building described on each card is equivalent. Participants are asked to order the cards from least to most appealing. This forced ranking exercise will indirectly reveal the participants' priorities and preferences. Multi-variate regression analysis may be used to determine the strength of preferences across target market segments.

References

Correction for attenuation


Correction for attenuation is a statistical procedure, due to Spearman (1904), to "rid a correlation coefficient from the weakening effect of measurement error" (Jensen, 1998), a phenomenon also known as regression dilution. In measurement and statistics, it is also called disattenuation. The correlation between two sets of parameters or measurements is estimated in a manner that accounts for measurement error contained within the estimates of those parameters.

Background
Correlations between parameters are diluted or weakened by measurement error. Disattenuation provides for a more accurate estimate of the correlation between the parameters by accounting for this effect.

Derivation of the formula


Let $\beta$ and $\theta$ be the true values of two attributes of some person or statistical unit. These values are regarded as random variables by virtue of the statistical unit being selected randomly from some population. Let $\hat{\beta}$ and $\hat{\theta}$ be estimates of $\beta$ and $\theta$ derived either directly by observation-with-error or from application of a measurement model, such as the Rasch model. Also, let

$$\hat{\beta} = \beta + \epsilon_\beta , \qquad \hat{\theta} = \theta + \epsilon_\theta ,$$

where $\epsilon_\beta$ and $\epsilon_\theta$ are the measurement errors associated with the estimates $\hat{\beta}$ and $\hat{\theta}$.

The correlation between the two sets of estimates is

$$\operatorname{corr}(\hat{\beta},\hat{\theta}) = \frac{\operatorname{cov}(\hat{\beta},\hat{\theta})}{\sqrt{\operatorname{var}[\hat{\beta}]\,\operatorname{var}[\hat{\theta}]}} = \frac{\operatorname{cov}(\beta+\epsilon_\beta,\ \theta+\epsilon_\theta)}{\sqrt{\operatorname{var}[\beta+\epsilon_\beta]\,\operatorname{var}[\theta+\epsilon_\theta]}},$$

which, assuming the errors are uncorrelated with each other and with the estimates, gives

$$\operatorname{corr}(\hat{\beta},\hat{\theta}) = \frac{\operatorname{cov}(\beta,\theta)}{\sqrt{(\operatorname{var}[\beta]+\operatorname{var}[\epsilon_\beta])(\operatorname{var}[\theta]+\operatorname{var}[\epsilon_\theta])}} = \rho\,\sqrt{R_\beta R_\theta},$$

where $\rho$ is the correlation between the true values, and $R_\beta$ is the separation index of the set of estimates of $\beta$, which is analogous to Cronbach's alpha; that is, in terms of Classical test theory, $R_\beta$ is analogous to a reliability coefficient. Specifically, the separation index is given as follows:

$$R_\beta = \frac{\operatorname{var}[\beta]}{\operatorname{var}[\hat{\beta}]} = \frac{\operatorname{var}[\beta]}{\operatorname{var}[\beta] + \operatorname{var}[\epsilon_\beta]},$$

where the mean squared standard error of person estimate gives an estimate of the variance of the errors, $\epsilon_\beta$. The standard errors are normally produced as a by-product of the estimation process (see Rasch model estimation).

The disattenuated estimate of the correlation between the two sets of parameters or measures is therefore

$$\rho = \frac{\operatorname{corr}(\hat{\beta},\hat{\theta})}{\sqrt{R_\beta R_\theta}}.$$

That is, the disattenuated correlation is obtained by dividing the correlation between the estimates by the square root of the product of the separation indices of the two sets of estimates. Expressed in terms of Classical test theory, the correlation is divided by the square root of the product of the reliability coefficients of two tests.

Given two random variables $X$ and $Y$, with correlation $r_{xy}$, and a known reliability for each variable, $r_{xx}$ and $r_{yy}$, the correlation between $X$ and $Y$ corrected for attenuation is

$$r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}}.$$

How well the variables are measured affects the correlation of X and Y. The correction for attenuation tells you what the correlation would be if you could measure X and Y with perfect reliability.

If $X$ and $Y$ are taken to be imperfect measurements of underlying variables $X'$ and $Y'$ with independent errors, then $r_{x'y'}$ measures the true correlation between $X'$ and $Y'$.
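A minimal worked example of the correction; the observed correlation and the two reliabilities below are invented values.

```python
import math

r_xy = 0.40                # observed correlation between the two measures
r_xx, r_yy = 0.70, 0.80    # reliabilities of X and Y

r_true = r_xy / math.sqrt(r_xx * r_yy)
print(round(r_true, 3))    # estimated error-free correlation (about 0.53)
```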

References
Jensen, A.R. (1998). The g Factor: The Science of Mental Ability. Praeger, Connecticut, USA. ISBN 0-275-96103-6
Spearman, C. (1904). "The Proof and Measurement of Association between Two Things". The American Journal of Psychology, 15 (1), 72-101. JSTOR 1412159 [1]

External links
Disattenuating correlations [2]
Disattenuation of correlation and regression coefficients: Jason W. Osborne [3]

References
[1] http://www.jstor.org/stable/1412159
[2] http://www.rasch.org/rmt/rmt101g.htm
[3] http://pareonline.net/getvn.asp?v=8&n=11

Counternull


In statistics, and especially in the statistical analysis of psychological data, the counternull is a statistic used to aid the understanding and presentation of research results. It revolves around the effect size, which is the mean magnitude of some effect divided by the standard deviation.[1] The counternull value is the effect size that is just as well supported by the data as the null hypothesis.[2] In particular, when results are drawn from a distribution that is symmetrical about its mean, the counternull value is exactly twice the observed effect size. The null hypothesis is a hypothesis set up to be tested against an alternative. Thus the counternull is an alternative hypothesis that, when used to replace the null hypothesis, generates the same p-value as had the original null hypothesis of no difference.[3] Some researchers contend that reporting the counternull, in addition to the p-value, serves to counter two common errors of judgment:[] assuming that failure to reject the null hypothesis at the chosen level of statistical significance means that the observed size of the "effect" is zero; and assuming that rejection of the null hypothesis at a particular p-value means that the measured "effect" is not only statistically significant, but also scientifically important. These arbitrary statistical thresholds create a discontinuity, causing unnecessary confusion and artificial controversy.[4] Other researchers prefer confidence intervals as a means of countering these common errors.[5]

References
[4] Pasher (2002), p. 348: "The reject/fail-to-reject dichotomy keeps the field awash in confusion and artificial controversy."

Further reading
Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods, 1, 331-340


Criterion-referenced test
A criterion-referenced test is one that provides for translating test scores into a statement about the behavior to be expected of a person with that score or their relationship to a specified subject matter. Most tests and quizzes that are written by school teachers can be considered criterion-referenced tests. The objective is simply to see whether the student has learned the material. Criterion-referenced assessment can be contrasted with norm-referenced assessment and ipsative assessment. Criterion-referenced testing was a major focus of psychometric research in the 1970s.[1]

Definition of criterion
A common misunderstanding regarding the term is the meaning of criterion. Many, if not most, criterion-referenced tests involve a cutscore, where the examinee passes if their score exceeds the cutscore and fails if it does not (often called a mastery test). The criterion is not the cutscore; the criterion is the domain of subject matter that the test is designed to assess. For example, the criterion may be "Students should be able to correctly add two single-digit numbers," and the cutscore may be that students should correctly answer a minimum of 80% of the questions to pass. The criterion-referenced interpretation of a test score identifies the relationship to the subject matter. In the case of a mastery test, this does mean identifying whether the examinee has "mastered" a specified level of the subject matter by comparing their score to the cutscore. However, not all criterion-referenced tests have a cutscore, and the score can simply refer to a person's standing on the subject domain.[2] Again, the ACT is an example of this; there is no cutscore, it simply is an assessment of the student's knowledge of high-school level subject matter. Because of this common misunderstanding, criterion-referenced tests have also been called standards-based assessments by some education agencies,[3] as students are assessed with regards to standards that define what they "should" know, as defined by the state.[4]

Comparison of criterion-referenced and norm-referenced tests


Sample scoring for the history question: What caused World War II?

Student #1: "WWII was caused by Hitler and Germany invading Poland."
- Criterion-referenced assessment: This answer is correct.
- Norm-referenced assessment: This answer is worse than Student #2's answer, but better than Student #3's answer.

Student #2: "WWII was caused by multiple factors, including the Great Depression and the general economic situation, the rise of nationalism, fascism, and imperialist expansionism, and unresolved resentments related to WWI. The war in Europe began with the German invasion of Poland."
- Criterion-referenced assessment: This answer is correct.
- Norm-referenced assessment: This answer is better than Student #1's and Student #3's answers.

Student #3: "WWII was caused by the assassination of Archduke Ferdinand."
- Criterion-referenced assessment: This answer is wrong.
- Norm-referenced assessment: This answer is worse than Student #1's and Student #2's answers.

Both terms criterion-referenced and norm-referenced were originally coined by Robert Glaser.[5] Unlike a criterion-referenced test, a norm-referenced test indicates whether the test-taker did better or worse than other people who took the test. For example, if the criterion is "Students should be able to correctly add two single-digit numbers," then reasonable test questions would ask the student to add pairs of single-digit numbers. A criterion-referenced test would report the student's performance strictly according to whether the individual student correctly answered these questions. A

norm-referenced test would report primarily whether this student correctly answered more questions compared to other students in the group. Even when testing similar topics, a test which is designed to accurately assess mastery may use different questions than one which is intended to show relative ranking. This is because some questions are better at reflecting actual achievement of students, and some test questions are better at differentiating between the best students and the worst students. (Many questions will do both.) A criterion-referenced test will use questions which were correctly answered by students who know the specific material. A norm-referenced test will use questions which were correctly answered by the "best" students and not correctly answered by the "worst" students (e.g. Cambridge University's pre-entry 'S' paper). Some tests can provide useful information about both actual achievement and relative ranking. The ACT provides both a ranking and an indication of what level is considered necessary for likely success in college.[6] Some argue that the term "criterion-referenced test" is a misnomer, since it can refer to the interpretation of the score as well as the test itself.[7] In the previous example, the same score on the ACT can be interpreted in a norm-referenced or criterion-referenced manner.


Relationship to high-stakes testing


Many criterion-referenced tests are also high-stakes tests, where the results of the test have important implications for the individual examinee. Examples of this include high school graduation examinations and licensure testing where the test must be passed to work in a profession, such as to become a physician or attorney. However, being a high-stakes test is not specifically a feature of a criterion-referenced test. It is instead a feature of how an educational or government agency chooses to use the results of the test.

Examples
Driving tests are criterion-referenced tests, because their goal is to see whether the test taker is skilled enough to be granted a driver's license, not to see whether one test taker is more skilled than another test taker. Citizenship tests are usually criterion-referenced tests, because their goal is to see whether the test taker is sufficiently familiar with the new country's history and government, not to see whether one test taker is more knowledgeable than another test taker.

References
[2] QuestionMark Glossary (http://www.questionmark.com/us/glossary.htm)
[3] Assessing the Assessment of Outcomes Based Education (http://www.apapdc.edu.au/archive/ASPA/conference2000/papers/art_3_9.htm) by Dr Malcolm Venter. Cape Town, South Africa. "OBE advocates a criterion-based system, which means getting rid of the bell curve, phasing out grade point averages and comparative grading".
[4] Homeschool World (http://www.home-school.com/exclusive/standards.html): "The Education Standards Movement Spells Trouble for Private and Home Schools"
[6] Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York: Harper & Row.


Cronbach's alpha
In statistics, Cronbach's α (alpha) is a coefficient of internal consistency. It is commonly used as an estimate of the reliability of a psychometric test for a sample of examinees. It was first named alpha by Lee Cronbach in 1951, as he had intended to continue with further coefficients. The measure can be viewed as an extension of the Kuder–Richardson Formula 20 (KR-20), which is an equivalent measure for dichotomous items. Alpha is not robust against missing data. Several other Greek letters have been used by later researchers to designate other measures used in a similar context.[1] Somewhat related is the average variance extracted (AVE). This article discusses the use of α in psychology, but Cronbach's alpha statistic is widely used in the social sciences, business, nursing, and other disciplines. The term item is used throughout this article, but items could be anything (questions, raters, indicators) of which one might ask to what extent they "measure the same thing." Items that are manipulated are commonly referred to as variables.

Definition
Suppose that we measure a quantity which is a sum of $K$ components ($K$ items or testlets): $X = Y_1 + Y_2 + \cdots + Y_K$. Cronbach's $\alpha$ is defined as

$$\alpha = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} \sigma^2_{Y_i}}{\sigma^2_X}\right)$$

where $\sigma^2_X$ is the variance of the observed total test scores, and $\sigma^2_{Y_i}$ the variance of component $i$ for the current sample of persons.[2]

Alternatively, Cronbach's $\alpha$ can also be defined as

$$\alpha = \frac{K\,\bar{c}}{\bar{v} + (K-1)\,\bar{c}}$$

where $K$ is as above, $\bar{v}$ the average variance of each component (item), and $\bar{c}$ the average of all covariances between the components across the current sample of persons (that is, without including the variances of each component).

The standardized Cronbach's alpha can be defined as

$$\alpha_{\text{standardized}} = \frac{K\,\bar{r}}{1 + (K-1)\,\bar{r}}$$

where $K$ is as above and $\bar{r}$ the mean of the $K(K-1)/2$ non-redundant correlation coefficients (i.e., the mean of an upper triangular, or lower triangular, correlation matrix).

Cronbach's $\alpha$ is related conceptually to the Spearman–Brown prediction formula. Both arise from the basic classical test theory result that the reliability of test scores can be expressed as the ratio of the true-score and total-score (error plus true score) variances:

$$\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}$$
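As an illustration of the first formula above, the following sketch computes alpha directly from an item-score matrix; the five-person, four-item data are invented solely for the example:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_persons, K_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each component
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the total test scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of 5 persons to 4 items
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [4, 5, 4, 4],
          [1, 2, 2, 1],
          [3, 3, 4, 3]]
print(round(cronbach_alpha(scores), 3))  # -> 0.948 for these made-up data
```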

The theoretical value of alpha varies from zero to 1, since it is the ratio of two variances. However, depending on the estimation procedure used, estimates of alpha can take on any value less than or equal to 1, including negative values, although only positive values make sense.[3] Higher values of alpha are more desirable. Some professionals,[4] as a rule of thumb, require a reliability of 0.70 or higher (obtained on a substantial sample) before they will use an instrument. Obviously, this rule should be applied with caution when α has been computed from items that systematically violate its assumptions. Furthermore, the appropriate degree of reliability depends upon the use of the instrument. For example, an instrument designed to be used as part of a battery of tests may be intentionally designed to be as short as possible, and therefore somewhat less reliable. Other

situations may require extremely precise measures with very high reliabilities. In the extreme case of a two-item test, the Spearman–Brown prediction formula is more appropriate than Cronbach's alpha.[5] This has resulted in a wide variance of test reliability. In the case of psychometric tests, most fall within the range of 0.75 to 0.83, with at least one claiming a Cronbach's alpha above 0.90 (Nunnally 1978, pages 245–246).


Internal consistency
Cronbach's alpha will generally increase as the intercorrelations among test items increase, and is thus known as an internal consistency estimate of reliability of test scores. Because intercorrelations among test items are maximized when all items measure the same construct, Cronbach's alpha is widely believed to indirectly indicate the degree to which a set of items measures a single unidimensional latent construct. However, the average intercorrelation among test items is affected by skew just like any other average. Thus, whereas the modal intercorrelation among test items will equal zero when the set of items measures several unrelated latent constructs, the average intercorrelation among test items will be greater than zero in this case. Indeed, several investigators have shown that alpha can take on quite high values even when the set of items measures several unrelated latent constructs.[6][][7][8][9][10]As a result, alpha is most appropriately used when the items measure different substantive areas within a single construct. When the set of items measures more than one construct, coefficient omega_hierarchical is more appropriate.[][] Alpha treats any covariance among items as true-score variance, even if items covary for spurious reasons. For example, alpha can be artificially inflated by making scales which consist of superficial changes to the wording within a set of items or by analyzing speeded tests. A commonly accepted rule of thumb for describing internal consistency using Cronbach's alpha is as follows,[11][12] however, a greater number of items in the test can artificially inflate the value of alpha[6] and so this rule of thumb should be used with caution:
Cronbach's alpha | Internal consistency
α ≥ 0.9 | Excellent
0.8 ≤ α < 0.9 | Good
0.7 ≤ α < 0.8 | Acceptable
0.6 ≤ α < 0.7 | Questionable
0.5 ≤ α < 0.6 | Poor
α < 0.5 | Unacceptable

Generalizability theory
Cronbach and others generalized some basic assumptions of classical test theory in their generalizability theory. If this theory is applied to test construction, then it is assumed that the items that constitute the test are a random sample from a larger universe of items. The expected score of a person in the universe is called the universe score, analogous to a true score. The generalizability is defined analogously as the variance of the universe scores divided by the variance of the observable scores, analogous to the concept of reliability in classical test theory. In this theory, Cronbach's alpha is an unbiased estimate of the generalizability. For this to be true the assumptions of essential τ-equivalence or parallelness are not needed. Consequently, Cronbach's alpha can be viewed as a measure of how well the sum score on the selected items captures the expected score in the entire domain, even if that domain is heterogeneous.


Intra-class correlation
Cronbach's alpha is said to be equal to the stepped-up consistency version of the intra-class correlation coefficient, which is commonly used in observational studies. But this is only conditionally true. In terms of variance components, this condition is, for item sampling: if and only if the value of the item (rater, in the case of rating) variance component equals zero. If this variance component is negative, alpha will underestimate the stepped-up intra-class correlation coefficient; if this variance component is positive, alpha will overestimate this stepped-up intra-class correlation coefficient.

Factor analysis
Cronbach's alpha also has a theoretical relation with factor analysis. As shown by Zinbarg, Revelle, Yovel and Li,[] alpha may be expressed as a function of the parameters of the hierarchical factor analysis model which allows for a general factor that is common to all of the items of a measure in addition to group factors that are common to some but not all of the items of a measure. Alpha may be seen to be quite complexly determined from this perspective. That is, alpha is sensitive not only to general factor saturation in a scale but also to group factor saturation and even to variance in the scale scores arising from variability in the factor loadings. Coefficient omega_hierarchical[][] has a much more straightforward interpretation as the proportion of observed variance in the scale scores that is due to the general factor common to all of the items comprising the scale.

Notes
[3] Ritter, N. (2010). "Understanding a widely misunderstood statistic: Cronbach's alpha". Paper presented at Southwestern Educational Research Association (SERA) Conference 2010, New Orleans, LA (ED526237). [6] Cortina, J.M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98104. [11] George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference. 11.0 update (4th ed.). Boston: Allyn & Bacon. [12] Kline, P. (1999). The handbook of psychological testing (2nd ed.). London: Routledge

Further reading

Allen, M.J., & Yen, W. M. (2002). Introduction to Measurement Theory. Long Grove, IL: Waveland Press.
Bland J.M., Altman D.G. (1997). Statistics notes: Cronbach's alpha (http://www.bmj.com/cgi/content/full/314/7080/572). BMJ 1997;314:572.
Cronbach, Lee J., and Richard J. Shavelson. (2004). My Current Thoughts on Coefficient Alpha and Successor Procedures. Educational and Psychological Measurement 64, no. 3 (June 1): 391–418. doi:10.1177/0013164404266386 (http://dx.doi.org/10.1177/0013164404266386).


Cutscore
A cutscore, also known as a passing score or passing point, is a single point on a score continuum that differentiates between classifications along the continuum. The most common cutscore, that many are familiar with, is a score that differentiates between the classifications of "pass" and "fail" on a professional or educational test.

Setting a cutscore
Many tests with low stakes set cutscores arbitrarily; for example, an elementary school teacher may require students to correctly answer 60% of the items on a test to pass. However, for a high-stakes test with a cutscore to be legally defensible and meet the Standards for Educational and Psychological Testing, the cutscore must be set with a formal standard-setting study or equated to another form of the test.

Descriptive statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data.[1] Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory.[2] Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities. Descriptive statistics is also a set of brief descriptive coefficients that summarizes a given data set that represents either the entire population or a sample. The measures that describe the data set are measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values, kurtosis and skewness.[3]

Use in statistical analysis


Descriptive statistics provides simple summaries about the sample and about the observations that have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs. These summaries may either form the basis of the initial description of the data as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a particular investigation. For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a player or a team. This number is the number of shots made divided by the number of shots taken. For example, a player who shoots 33% is making approximately one shot in every three. The percentage summarizes or describes multiple discrete events. Consider also the grade point average. This single number describes the general performance of a student across the range of their course experiences.[] The use of descriptive and summary statistics has an extensive history and, indeed, the simple tabulation of populations and of economic data was the first way the topic of statistics appeared. More recently, a collection of summarisation techniques has been formulated under the heading of exploratory data analysis: an example of such a technique is the box plot. In the business world, Descriptive statistics provide a useful summary of security returns when performing empirical and analytical analysis, as they provide a historical account of return behavior. Although past information is useful in

any analysis, one should always consider the expectations of future events.[3]


Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its central tendency (including the mean, median, and mode) and dispersion (including the range and quantiles of the data-set, and measures of spread such as the variance and standard deviation). The shape of the distribution may also be described via indices such as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in graphical or tabular format, including histograms and stem-and-leaf display.
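For illustration, the following sketch computes the univariate summaries just listed for a small made-up sample, using Python's standard statistics module:

```python
import statistics as st

# A small invented sample, used only to illustrate common univariate summaries.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:", st.mean(data))            # central tendency: 5
print("median:", st.median(data))        # 4.5
print("mode:", st.mode(data))            # 4
print("range:", max(data) - min(data))   # dispersion: 7
print("variance:", st.variance(data))    # sample variance
print("std dev:", st.stdev(data))        # sample standard deviation
```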

Bivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to describe the relationship between pairs of variables. In this case, descriptive statistics include:
- Cross-tabulations and contingency tables
- Graphical representation via scatterplots
- Quantitative measures of dependence
- Descriptions of conditional distributions

The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not only a simple descriptive analysis: it also describes the relationship between two different variables.[4] Quantitative measures of dependence include correlation (such as Pearson's r when both variables are continuous, or Spearman's rho if one or both are not) and covariance (which reflects the scale the variables are measured on). The slope, in regression analysis, also reflects the relationship between variables. The unstandardised slope indicates the unit change in the criterion variable for a one unit change in the predictor. The standardised slope indicates this change in standardised (z-score) units. When data are highly skewed, analysts often transform them, for example with a logarithm, so that the sample better represents the distribution being described; a logarithmic transformation makes graphs more symmetrical and closer to a normal distribution, and is commonly used when analysing data in molecular biology.[5]
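As a sketch of these bivariate summaries, the following example computes Pearson's r, Spearman's rho, the covariance, and the unstandardised regression slope for a small invented set of paired observations (NumPy and SciPy are assumed to be available):

```python
import numpy as np
from scipy import stats

# Invented paired observations, used only to illustrate bivariate summaries.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

print("Pearson r:", stats.pearsonr(x, y)[0])      # linear association
print("Spearman rho:", stats.spearmanr(x, y)[0])  # rank-based association
print("covariance:", np.cov(x, y)[0, 1])          # scale-dependent measure of dependence

slope, intercept = np.polyfit(x, y, 1)
print("unstandardised slope:", slope)             # unit change in y per unit change in x
```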

References
[1] Mann, Prem S. (1995) Introductory Statistics, 2nd Edition, Wiley. ISBN 0-471-31009-3
[2] Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN 0-19-850994-4
[3] Investopedia, Descriptive Statistics Terms (http://www.investopedia.com/terms/d/descriptive_statistics.asp#axzz2DxCoTnMM)
[4] Earl R. Babbie, The Practice of Social Research, 12th edition, Wadsworth Publishing, 2009, ISBN 0-495-59841-0, pp. 436–440
[5] Todd G. Nick, "Descriptive Statistics", p. 47

External links
Descriptive Statistics Lecture: University of Pittsburgh Supercourse: http://www.pitt.edu/~super1/lecture/lec0421/index.htm


Dot cancellation test


The Dot cancellation test or Bourdon-Wiersma test is a commonly used test of combined visual perception and vigilance.[1][] The test has been used in the evaluation of stroke where subjects were instructed to cross out all groups of 4 dots on an A4 paper. The numbers of uncrossed groups of 4 dots, groups of dots other than 4 crossed, and the time spent (maximum, 15 minutes) were taken into account.[] The Group-Bourdon test, a modification of the Bourdon-Wiersma, is one of a number of psychometric tests which trainee train drivers in the UK are required to pass.[2][3]

References

Further reading


Grewel, F. (October 1953). "The Bourdon-Wiersma test". Folia psychiatrica, neurologica et neurochirurgica Neerlandica 56 (5): 694–703.

Elementary cognitive task


An elementary cognitive task (ECT) is any of a range of basic tasks which require only a small number of mental processes and which have easily specified correct outcomes.[1] Although ECTs may be cognitively simple there is evidence that performance on such tasks correlates well with other measures of general intelligence such as Raven's Progressive Matrices.[2] For example, correcting for attenuation, the correlation between IQ test scores and ECT performance is about 0.5.[3] The term was proposed by John Bissell Carroll in 1980, who posited that all test performance could be analyzed and broken down to building blocks called ECTs. Test batteries such as Microtox were developed based on this theory and have shown utility in the evaluation of test subjects under the influence of carbon monoxide or alcohol.[4]

Examples
- Memory span
- Reaction time

References
[1] Human Cognitive Abilities: A Survey of Factor-Analytic Studies. By John Bissell Carroll. 1993, Cambridge University Press. ISBN 0-521-38712-4, p. 11
[2] Arthur R. Jensen. Process differences and individual differences in some cognitive tasks. Intelligence, Volume 11, Issue 2, April–June 1987, Pages 107–136
[3] J. Grudnik and J. Kranzler, Meta-analysis of the relationship between intelligence and inspection time, Intelligence 29 (2001), pp. 523–535.


Equating
Test equating traditionally refers to the statistical process of determining comparable scores on different forms of an exam.[1] It can be accomplished using either classical test theory or item response theory. In item response theory, equating is the process of equating the units and origins of two scales on which the abilities of students have been estimated from results on different tests. The process is analogous to equating degrees Fahrenheit with degrees Celsius by converting measurements from one scale to the other. The determination of comparable scores is a by-product of equating that results from equating the scales obtained from test results.

Why is equating necessary?


Suppose that Dick and Jane both take a test to become licensed in a certain profession. Because the high stakes (you get to practice the profession if you pass the test) may create a temptation to cheat, the organization that oversees the test creates two forms. If we know that Dick scored 60% on form A and Jane scored 70% on form B, do we know for sure which one has a better grasp of the material? What if form A is composed of very difficult items, while form B is relatively easy? Equating analyses are performed to address this very issue, so that scores are as fair as possible.

Equating in item response theory


In item response theory, person locations are estimated on a scale; i.e. locations are estimated in relation to a unit and origin. It is common in educational assessment to employ tests in order to assess different groups of students with the intention of establishing a common scale by equating the origins, and sometimes units, of the scales obtained from response data from the different tests. The process is referred to as equating or test equating.

In item response theory, two different kinds of equating are horizontal and vertical equating.[2] Vertical equating refers to the process of equating tests administered to groups of students with different abilities, such as students in different grades (years of schooling).[3] Horizontal equating refers to the equating of tests administered to groups with similar abilities; for example, two tests administered to students in the same grade in two consecutive calendar years. Different tests are used to avoid practice effects. In terms of item response theory, equating is just a special case of the more general process of scaling, applicable when more than one test is used. In practice, though, scaling is often implemented separately for different tests and then the scales subsequently equated. A distinction is often made between two methods of equating; common person and common item equating. Common person equating involves the administration of two tests to a common group of persons. The mean and standard deviation of the scale locations of the groups on the two tests are equated using a linear transformation. Common item equating involves the use of a set of common items referred to as the anchor test embedded in two different tests. The mean item location of the common items is equated.
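A minimal sketch of common item equating along these lines: the anchor-item difficulties and person estimates below are hypothetical, and the two forms are linked simply by equating the mean location of the anchor items (a mean-shift, appropriate for a Rasch-type scale where only the origin differs):

```python
# Hypothetical anchor-item difficulties, as estimated separately from each form.
anchor_form_a = [-0.50, 0.10, 0.80]   # anchor items calibrated with Form A data
anchor_form_b = [-0.20, 0.35, 1.15]   # the same items calibrated with Form B data

# Shift that equates the mean location of the common (anchor) items.
shift = (sum(anchor_form_a) / len(anchor_form_a)) - (sum(anchor_form_b) / len(anchor_form_b))

# Applying the shift places Form B person (and item) estimates on the Form A scale.
form_b_person_estimates = [-1.0, 0.2, 1.4]
on_form_a_scale = [theta + shift for theta in form_b_person_estimates]
print(round(shift, 3), [round(t, 3) for t in on_form_a_scale])  # -> -0.3, [-1.3, -0.1, 1.1]
```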

Figure 1: Test characteristic curves showing the relationship between total score and person location for two different tests in relation to a common scale. In this example a total of 37 on Assessment 1 equates to a total of 34.9 on Assessment 2 as shown by the vertical line


Classical approaches to equating


In classical test theory, mean equating simply adjusts the distribution of scores so that the mean of one form is comparable to the mean of the other form. While mean equating is attractive because of its simplicity, it lacks flexibility, namely accounting for the possibility that the standard deviations of the forms differ.[1] Linear equating adjusts so that the two forms have a comparable mean and standard deviation. There are several types of linear equating that differ in the assumptions and mathematics used to estimate parameters. The Tucker and Levine Observed Score methods estimate the relationship between observed scores on the two forms, while the Levine True Score method estimates the relationship between true scores on the two forms.[1] Equipercentile equating determines the equating relationship as one where a score could have an equivalent percentile on either form. This relationship can be nonlinear. Unlike with item response theory, equating based on classical test theory is somewhat distinct from scaling. Equating is a raw-to-raw transformation in that it estimates a raw score on Form B that is equivalent to each raw score on the base Form A. Any scaling transformation used is then applied on top of, or with, the equating.
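For illustration, the following sketch applies mean equating and a simple linear (mean-and-standard-deviation) equating to invented raw-score distributions; it is a generic sketch rather than any particular published procedure such as the Tucker or Levine methods, which involve additional assumptions:

```python
import numpy as np

# Hypothetical raw-score distributions on two forms, for illustration only.
form_a = np.array([52, 60, 61, 65, 70, 72, 75, 80], dtype=float)
form_b = np.array([48, 55, 58, 60, 66, 68, 70, 76], dtype=float)

def mean_equate(score_b):
    # Adjust a Form B score only for the difference in form means.
    return score_b + (form_a.mean() - form_b.mean())

def linear_equate(score_b):
    # Match both the mean and the standard deviation of the two forms.
    z = (score_b - form_b.mean()) / form_b.std(ddof=1)
    return form_a.mean() + z * form_a.std(ddof=1)

print(mean_equate(63), linear_equate(63))
```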

References
[1] Kolen, M.J., & Brennan, R.L. (1995). Test Equating. New York: Springer.
[2] Baker, F. (1983). Comparison of ability metrics obtained under two latent trait theory procedures. Applied Psychological Measurement, 7, 97–110.
[3] Baker, F. (1984). Ability metric transformations involved in vertical equating under item response theory. Applied Psychological Measurement, 8(3), 261–271.

External links
Equating and the SAT (http://www.collegeboard.com/student/testing/sat/scores/understanding/equating.html)
Equating and AP Tests (http://collegeboard.com/student/testing/ap/exgrd_set.html)
IRTEQ: Windows Application that Implements IRT Scaling and Equating (http://www.umass.edu/remp/software/irteq/)


Factor analysis
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. In other words, it is possible, for example, that variations in three or four observed variables mainly reflect the variations in fewer unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modeled as linear combinations of the potential factors, plus "error" terms. The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. Computationally this technique is equivalent to low rank approximation of the matrix of observed variables. Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other applied sciences that deal with large quantities of data.

Factor analysis is related to principal component analysis (PCA), but the two are not identical. Latent variable models, including factor analysis, use regression modelling techniques to test hypotheses producing error terms, while PCA is a descriptive statistical technique.[] There has been significant controversy in the field over the equivalence or otherwise of the two techniques (see exploratory factor analysis versus principal components analysis).[citation needed]

Statistical model
Definition
Suppose we have a set of $p$ observable random variables, $x_1, \dots, x_p$ with means $\mu_1, \dots, \mu_p$.

Suppose for some unknown constants $l_{ij}$ and $k$ unobserved random variables $F_j$, where $i \in \{1, \dots, p\}$ and $j \in \{1, \dots, k\}$, with $k < p$, we have

$$x_i - \mu_i = l_{i1}F_1 + \cdots + l_{ik}F_k + \varepsilon_i .$$

Here, the $\varepsilon_i$ are independently distributed error terms with zero mean and finite variance, which may not be the same for all $i$. Let $\operatorname{Var}(\varepsilon_i) = \psi_i$, so that we have

$$\operatorname{Cov}(\varepsilon) = \operatorname{Diag}(\psi_1, \dots, \psi_p) = \Psi \quad\text{and}\quad \operatorname{E}(\varepsilon) = 0 .$$

In matrix terms, we have

$$x - \mu = LF + \varepsilon .$$

If we have $n$ observations, then we will have the dimensions $x_{p \times n}$, $L_{p \times k}$, and $F_{k \times n}$. Each column of $x$ and $F$ denotes values for one particular observation, and the matrix $L$ does not vary across observations.

Also we will impose the following assumptions on $F$:
1. $F$ and $\varepsilon$ are independent.
2. $\operatorname{E}(F) = 0$.
3. $\operatorname{Cov}(F) = I$ (to make sure that the factors are uncorrelated).

Any solution of the above set of equations following the constraints for $F$ is defined as the factors, and $L$ as the loading matrix.

Suppose $\operatorname{Cov}(x - \mu) = \Sigma$. Then note that from the conditions just imposed on $F$, we have

$$\operatorname{Cov}(x - \mu) = \operatorname{Cov}(LF + \varepsilon),$$

or

$$\Sigma = L \operatorname{Cov}(F) L^T + \operatorname{Cov}(\varepsilon),$$

or

$$\Sigma = LL^T + \Psi .$$

Note that for any orthogonal matrix $Q$, if we set $L' = LQ$ and $F' = Q^T F$, the criteria for being factors and factor loadings still hold. Hence a set of factors and factor loadings is identical only up to orthogonal transformation.

Example
The following example is for expository purposes, and should not be taken as being realistic. Suppose a psychologist proposes a theory that there are two kinds of intelligence, "verbal intelligence" and "mathematical intelligence", neither of which is directly observed. Evidence for the theory is sought in the examination scores from each of 10 different academic fields of 1000 students. If each student is chosen randomly from a large population, then each student's 10 scores are random variables. The psychologist's theory may say that for each of the 10 academic fields, the score averaged over the group of all students who share some common pair of values for verbal and mathematical "intelligences" is some constant times their level of verbal intelligence plus another constant times their level of mathematical intelligence, i.e., it is a combination of those two "factors". The numbers for a particular subject, by which the two kinds of intelligence are multiplied to obtain the expected score, are posited by the theory to be the same for all intelligence level pairs, and are called "factor loadings" for this subject. For example, the theory may hold that the average student's aptitude in the field of taxonomy is {10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}. The numbers 10 and 6 are the factor loadings associated with taxonomy. Other academic subjects may have different factor loadings. Two students having identical degrees of verbal intelligence and identical degrees of mathematical intelligence may have different aptitudes in taxonomy because individual aptitudes differ from average aptitudes. That difference is called the "error", a statistical term that means the amount by which an individual differs from what is average for his or her levels of intelligence (see errors and residuals in statistics). The observable data that go into factor analysis would be 10 scores of each of the 1000 students, a total of 10,000 numbers. The factor loadings and levels of the two kinds of intelligence of each student must be inferred from the data.

Mathematical model of the same example


In the example above, for $i = 1, \dots, 1000$ the $i$th student's scores are

$$x_{1,i} - \mu_1 = l_{1,1} v_i + l_{1,2} m_i + \varepsilon_{1,i}$$
$$\vdots$$
$$x_{10,i} - \mu_{10} = l_{10,1} v_i + l_{10,2} m_i + \varepsilon_{10,i}$$

where
- $x_{k,i}$ is the $i$th student's score for the $k$th subject,
- $\mu_k$ is the mean of the students' scores for the $k$th subject (assumed to be zero, for simplicity, in the example as described above, which would amount to a simple shift of the scale used),
- $v_i$ is the $i$th student's "verbal intelligence",
- $m_i$ is the $i$th student's "mathematical intelligence",
- $l_{k,j}$ are the factor loadings for the $k$th subject, for $j = 1, 2$,
- $\varepsilon_{k,i}$ is the difference between the $i$th student's score in the $k$th subject and the average score in the $k$th subject of all students whose levels of verbal and mathematical intelligence are the same as those of the $i$th student.

In matrix notation, we have $X - M = LF + E$, where
- $N$ is 1000 students,
- $X$ is a $10 \times 1000$ matrix of observable random variables,
- $M$ is built from a $10 \times 1$ column vector of unobservable constants (in this case "constants" are quantities not differing from one individual student to the next, and "random variables" are those assigned to individual students; the randomness arises from the random way in which the students are chosen); note that $M$ is an outer product of that vector with a $1 \times 1000$ row vector of ones, yielding a $10 \times 1000$ matrix,
- $L$ is a $10 \times 2$ matrix of factor loadings (unobservable constants, ten academic topics, each with two intelligence parameters that determine success in that topic),
- $F$ is a $2 \times 1000$ matrix of unobservable random variables (two intelligence parameters for each of 1000 students),
- $E$ is a $10 \times 1000$ matrix of unobservable random variables.

Observe that doubling the scale on which "verbal intelligence" (the first component in each column of $F$) is measured, and simultaneously halving the factor loadings for verbal intelligence, makes no difference to the model. Thus, no generality is lost by assuming that the standard deviation of verbal intelligence is 1. Likewise for mathematical intelligence. Moreover, for similar reasons, no generality is lost by assuming the two factors are uncorrelated with each other. The "errors" are taken to be independent of each other. The variances of the "errors" associated with the 10 different subjects are not assumed to be equal. Note that, since any rotation of a solution is also a solution, this makes interpreting the factors difficult. See disadvantages below. In this particular example, if we do not know beforehand that the two types of intelligence are uncorrelated, then we cannot interpret the two factors as the two different types of intelligence. Even if they are uncorrelated, we cannot tell which factor corresponds to verbal intelligence and which corresponds to mathematical intelligence without an outside argument. The values of the loadings $L$, the averages $\mu_k$, and the variances of the "errors" must be estimated given the observed data $X$ and $F$ (the assumption about the levels of the factors is fixed for a given $F$).
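A small simulation can make the model concrete. The sketch below generates data from a two-factor model with invented loadings and then fits an (unrotated) factor analysis with scikit-learn; the recovered loadings are identified only up to rotation, as noted above:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# All numbers here are invented; the structure mirrors the two-factor example.
rng = np.random.default_rng(0)
n_students, n_subjects, n_factors = 1000, 10, 2

L = rng.uniform(0.3, 0.9, size=(n_subjects, n_factors))    # "true" factor loadings
F = rng.standard_normal((n_students, n_factors))            # verbal and mathematical factors
noise = rng.standard_normal((n_students, n_subjects)) * 0.5 # unique "error" terms
X = F @ L.T + noise                                          # observed scores (means taken as zero)

fa = FactorAnalysis(n_components=n_factors)
fa.fit(X)
print(fa.components_.shape)                   # (2, 10): estimated loadings, up to rotation
print(np.round(fa.noise_variance_[:3], 2))    # estimated unique ("error") variances
```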


Practical implementation
Type of factor analysis
Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and group items that are part of unified concepts.[] The researcher makes no "a priori" assumptions about relationships among factors.[] Confirmatory factor analysis (CFA) is a more complex approach that tests the hypothesis that the items are associated with specific factors.[] CFA uses structural equation modeling to test a measurement model whereby loading on the factors allows for evaluation of relationships between observed variables and unobserved variables.[] Structural equation modeling approaches can accommodate measurement error, and are less restrictive than least-squares estimation.[] Hypothesized models are tested against actual data, and the analysis would demonstrate loadings of observed variables on the latent variables (factors), as well as the correlation between the latent variables.[]

Types of factoring
Principal component analysis (PCA): PCA is a widely used method for factor extraction, which is the first phase of EFA.[] Factor weights are computed in order to extract the maximum possible variance, with successive factoring continuing until there is no further meaningful variance left.[] The factor model must then be rotated for analysis.[] Canonical factor analysis, also called Rao's canonical factoring, is a different method of computing the same model as PCA, which uses the principal axis method. Canonical factor analysis seeks factors which have the highest canonical correlation with the observed variables. Canonical factor analysis is unaffected by arbitrary rescaling of the data. Common factor analysis, also called principal factor analysis (PFA) or principal axis factoring (PAF), seeks the least number of factors which can account for the common variance (correlation) of a set of variables.

Image factoring: based on the correlation matrix of predicted variables rather than actual variables, where each variable is predicted from the others using multiple regression. Alpha factoring: based on maximizing the reliability of factors, assuming variables are randomly sampled from a universe of variables. All other methods assume cases to be sampled and variables fixed. Factor regression model: a combinatorial model of factor model and regression model; or alternatively, it can be viewed as the hybrid factor model,[] whose factors are partially known.


Terminology
Factor loadings: The factor loadings, also called component loadings in PCA, are the correlation coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the squared factor loading is the percent of variance in that indicator variable explained by the factor. To get the percent of variance in all the variables accounted for by each factor, add the sum of the squared factor loadings for that factor (column) and divide by the number of variables. (Note the number of variables equals the sum of their variances as the variance of a standardized variable is 1.) This is the same as dividing the factor's eigenvalue by the number of variables.

Interpreting factor loadings: By one rule of thumb in confirmatory factor analysis, loadings should be .7 or higher to confirm that independent variables identified a priori are represented by a particular factor, on the rationale that the .7 level corresponds to about half of the variance in the indicator being explained by the factor. However, the .7 standard is a high one and real-life data may well not meet this criterion, which is why some researchers, particularly for exploratory purposes, will use a lower level such as .4 for the central factor and .25 for other factors, or call loadings above .6 "high" and those below .4 "low". In any event, factor loadings must be interpreted in the light of theory, not by arbitrary cutoff levels. In oblique rotation, one gets both a pattern matrix and a structure matrix. The structure matrix is simply the factor loading matrix as in orthogonal rotation, representing the variance in a measured variable explained by a factor on both a unique and common contributions basis. The pattern matrix, in contrast, contains coefficients which just represent unique contributions. The more factors, the lower the pattern coefficients as a rule since there will be more common contributions to variance explained. For oblique rotation, the researcher looks at both the structure and pattern coefficients when attributing a label to a factor.

Communality: The sum of the squared factor loadings for all factors for a given variable (row) is the variance in that variable accounted for by all the factors, and this is called the communality. The communality measures the percent of variance in a given variable explained by all the factors jointly and may be interpreted as the reliability of the indicator.

Spurious solutions: If the communality exceeds 1.0, there is a spurious solution, which may reflect too small a sample or the researcher having too many or too few factors.

Uniqueness of a variable: Uniqueness is the variability of a variable minus its communality.

Eigenvalues/characteristic roots: The eigenvalue for a given factor measures the variance in all the variables which is accounted for by that factor. The ratio of eigenvalues is the ratio of explanatory importance of the factors with respect to the variables. If a factor has a low eigenvalue, then it is contributing little to the explanation of variances in the variables and may be ignored as redundant with more important factors. Eigenvalues measure the amount of variation in the total sample accounted for by each factor.

Extraction sums of squared loadings: Initial eigenvalues and eigenvalues after extraction (listed by SPSS as "Extraction Sums of Squared Loadings") are the same for PCA extraction, but for other extraction methods, eigenvalues after extraction will be lower than their initial counterparts. SPSS also prints "Rotation Sums of Squared Loadings" and even for PCA, these eigenvalues will differ from initial and extraction eigenvalues, though their total will be the same.

Factor scores (also called component scores in PCA): the scores of each case (row) on each factor (column). To compute the factor score for a given case for a given factor, one takes the case's standardized score on each variable, multiplies by the corresponding factor loading of the variable for the given factor, and sums these products. Computing factor scores allows one to look for factor outliers. Also, factor scores may be used as variables in subsequent modeling.
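As a sketch of the coarse factor-score computation just described (standardize each variable, weight by its loading, and sum), with invented data and loadings:

```python
import numpy as np

# Hypothetical data: 4 cases measured on 3 variables, and their loadings on one factor.
X = np.array([[3.0, 4.0, 2.0],
              [2.0, 1.0, 3.0],
              [5.0, 4.0, 4.0],
              [4.0, 3.0, 2.0]])
loadings_factor1 = np.array([0.8, 0.7, 0.2])   # loadings of the three variables on factor 1

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized scores for each variable
factor1_scores = Z @ loadings_factor1              # one factor score per case (row)
print(np.round(factor1_scores, 2))
```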


Criteria for determining the number of factors


Using one or more of the methods below, the researcher determines an appropriate range of solutions to investigate. Methods may not agree. For instance, the Kaiser criterion may suggest five factors and the scree test may suggest two, so the researcher may request 3-, 4-, and 5-factor solutions and discuss each in terms of their relation to external data and theory.

Comprehensibility: A purely subjective criterion would be to retain those factors whose meaning is comprehensible to the researcher. This is not recommended.[citation needed]

Kaiser criterion: The Kaiser rule is to drop all components with eigenvalues under 1.0, this being the eigenvalue equal to the information accounted for by an average single item. The Kaiser criterion is the default in SPSS and most statistical software but is not recommended when used as the sole cut-off criterion for estimating the number of factors, as it tends to overextract factors.[1]

Variance explained criteria: Some researchers simply use the rule of keeping enough factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal emphasizes parsimony (explaining variance with as few factors as possible), the criterion could be as low as 50%.

Scree plot: The Cattell scree test plots the components as the X axis and the corresponding eigenvalues as the Y axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow. This rule is sometimes criticised for being amenable to researcher-controlled "fudging". That is, as picking the "elbow" can be subjective because the curve has multiple elbows or is a smooth curve, the researcher may be tempted to set the cut-off at the number of factors desired by his or her research agenda.

Horn's Parallel Analysis (PA): A Monte-Carlo based simulation method that compares the observed eigenvalues with those obtained from uncorrelated normal variables. A factor or component is retained if the associated eigenvalue is bigger than the 95th percentile of the distribution of eigenvalues derived from the random data. PA is one of the most recommendable rules for determining the number of components to retain,[citation needed] but only a few programs include this option.[2]

Before dropping a factor below one's cut-off, however, the researcher should check its correlation with the dependent variable. A very small factor can have a large correlation with the dependent variable, in which case it should not be dropped.
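For illustration, the sketch below applies the Kaiser rule and a simplified version of Horn's parallel analysis to made-up data containing one induced common factor; real applications would use the researcher's own data and typically more replications:

```python
import numpy as np

# Invented data: 300 cases on 6 variables, with one common factor induced
# among the first three variables.
rng = np.random.default_rng(1)
n, p = 300, 6
X = rng.standard_normal((n, p))
X[:, :3] += rng.standard_normal((n, 1))

# Eigenvalues of the correlation matrix, largest first.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
print("Kaiser rule keeps:", int((eigvals > 1).sum()), "factor(s)")

# Simplified parallel analysis: 95th percentile of eigenvalues from random normal data.
random_eigs = np.array([
    np.linalg.eigvalsh(np.corrcoef(rng.standard_normal((n, p)), rowvar=False))[::-1]
    for _ in range(200)
])
threshold = np.percentile(random_eigs, 95, axis=0)
print("Parallel analysis keeps:", int((eigvals > threshold).sum()), "factor(s)")
```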

Rotation methods
The unrotated output maximises the variance accounted for by the first and subsequent factors, and forcing the factors to be orthogonal. This data-compression comes at the cost of having most items load on the early factors, and usually, of having many items load substantially on more than one factor. Rotation serves to make the output more understandable, by seeking so-called "Simple Structure": A pattern of loadings where items load most strongly on one factor, and much more weakly on the other factors. Rotations can be orthogonal or oblique (allowing the factors to correlate). Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the effect of differentiating the original variables by extracted factor. Each factor will tend to have either large or small loadings of any particular variable. A

varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option. However, the orthogonality (i.e., independence) of factors is often an unrealistic assumption. Oblique rotations are inclusive of orthogonal rotation, and for that reason, oblique rotations are a preferred method.[3] Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed to explain each variable. This type of rotation often generates a general factor on which most variables are loaded to a high or medium degree. Such a factor structure is usually not helpful to the research purpose. Equimax rotation is a compromise between Varimax and Quartimax criteria. Direct oblimin rotation is the standard method when one wishes a non-orthogonal (oblique) solution, that is, one in which the factors are allowed to be correlated. This will result in higher eigenvalues but diminished interpretability of the factors. See below. Promax rotation is an alternative non-orthogonal (oblique) rotation method which is computationally faster than the direct oblimin method and therefore is sometimes used for very large datasets.
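As a brief illustration of orthogonal rotation, recent versions of scikit-learn's FactorAnalysis accept a rotation argument; the iris measurements below merely stand in for a small multivariate data set, and the varimax-rotated loadings should show a pattern closer to "simple structure" than the unrotated ones:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data  # a convenient multivariate data set, used only as an example

unrotated = FactorAnalysis(n_components=2, random_state=0).fit(X)
rotated = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(X)

print(unrotated.components_.round(2))  # loadings before rotation
print(rotated.components_.round(2))    # loadings after varimax rotation
```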


Factor analysis in psychometrics


History
Charles Spearman pioneered the use of factor analysis in the field of psychology and is sometimes credited with the invention of factor analysis. He discovered that school children's scores on a wide variety of seemingly unrelated subjects were positively correlated, which led him to postulate that a general mental ability, or g, underlies and shapes human cognitive performance. His postulate now enjoys broad support in the field of intelligence research, where it is known as the g theory. Raymond Cattell expanded on Spearman's idea of a two-factor theory of intelligence after performing his own tests and factor analysis. He used a multi-factor theory to explain intelligence. Cattell's theory addressed alternate factors in intellectual development, including motivation and psychology. Cattell also developed several mathematical methods for adjusting psychometric graphs, such as his "scree" test and similarity coefficients. His research led to the development of his theory of fluid and crystallized intelligence, as well as his 16 Personality Factors theory of personality. Cattell was a strong advocate of factor analysis and psychometrics. He believed that all theory should be derived from research, which supports the continued use of empirical observation and objective testing to study human intelligence.

Applications in psychology
Factor analysis is used to identify "factors" that explain a variety of results on different tests. For example, intelligence research found that people who get a high score on a test of verbal ability are also good on other tests that require verbal abilities. Researchers explained this by using factor analysis to isolate one factor, often called crystallized intelligence or verbal intelligence, which represents the degree to which someone is able to solve problems involving verbal skills. Factor analysis in psychology is most often associated with intelligence research. However, it also has been used to find factors in a broad range of domains such as personality, attitudes, beliefs, etc. It is linked to psychometrics, as it can assess the validity of an instrument by finding if the instrument indeed measures the postulated factors.


Advantages
Reduction of number of variables, by combining two or more variables into a single factor. For example, performance at running, ball throwing, batting, jumping and weight lifting could be combined into a single factor such as general athletic ability. Usually, in an item by people matrix, factors are selected by grouping related items. In the Q factor analysis technique, the matrix is transposed and factors are created by grouping related people: For example, liberals, libertarians, conservatives and socialists, could form separate groups. Identification of groups of inter-related variables, to see how they are related to each other. For example, Carroll used factor analysis to build his Three Stratum Theory. He found that a factor called "broad visual perception" relates to how good an individual is at visual tasks. He also found a "broad auditory perception" factor, relating to auditory task capability. Furthermore, he found a global factor, called "g" or general intelligence, that relates to both "broad visual perception" and "broad auditory perception". This means someone with a high "g" is likely to have both a high "visual perception" capability and a high "auditory perception" capability, and that "g" therefore explains a good part of why someone is good or bad in both of those domains.

Disadvantages
"...each orientation is equally acceptable mathematically. But different factorial theories proved to differ as much in terms of the orientations of factorial axes for a given solution as in terms of anything else, so that model fitting did not prove to be useful in distinguishing among theories." (Sternberg, 1977[]). This means all rotations represent different underlying processes, but all rotations are equally valid outcomes of standard factor analysis optimization. Therefore, it is impossible to pick the proper rotation using factor analysis alone. Factor analysis can be only as good as the data allows. In psychology, where researchers often have to rely on less valid and reliable measures such as self-reports, this can be problematic. Interpreting factor analysis is based on using a "heuristic", which is a solution that is "convenient even if not absolutely true".[4] More than one interpretation can be made of the same data factored the same way, and factor analysis cannot identify causality.

Exploratory factor analysis versus principal components analysis


While exploratory factor analysis and principal component analysis are treated as synonymous techniques in some fields of statistics, this has been criticised (e.g. Fabrigar et al., 1999;[] Suhr, 2009[]). In factor analysis, the researcher makes the assumption that an underlying causal model exists, whereas PCA is simply a variable reduction technique.[] Researchers have argued that the distinctions between the two techniques may mean that there are objective benefits for preferring one over the other based on the analytic goal.

Arguments contrasting PCA and EFA


Fabrigar et al. (1999)[] address a number of reasons used to suggest that principal components analysis is equivalent to factor analysis: 1. It is sometimes suggested that principal components analysis is computationally quicker and requires fewer resources than factor analysis. Fabrigar et al. suggest that the ready availability of computer resources have rendered this practical concern irrelevant.[] 2. PCA and factor analysis can produce similar results. This point is also addressed by Fabrigar et al.; in certain cases, whereby the communalities are low (e.g., .40), the two techniques produce divergent results. In fact, Fabrigar et al. argue that in cases where the data correspond to assumptions of the common factor model, the results of PCA are inaccurate results.[] 3. There are certain cases where factor analysis leads to 'Heywood cases'. These encompass situations whereby 100% or more of the variance in a measured variable is estimated to be accounted for by the model. Fabrigar et al. suggest that these cases are actually informative to the researcher, indicating a misspecified model or a violation

of the common factor model. The lack of Heywood cases in the PCA approach may mean that such issues pass unnoticed.[] 4. Researchers gain extra information from a PCA approach, such as an individual's score on a certain component; such information is not yielded from factor analysis. However, as Fabrigar et al. contend, the typical aim of factor analysis, i.e. to determine the factors accounting for the structure of the correlations between measured variables, does not require knowledge of factor scores and thus this advantage is negated.[] It is also possible to compute factor scores from a factor analysis.


Variance versus covariance


Factor analysis takes into account the random error that is inherent in measurement, whereas PCA fails to do so. This point is exemplified by Brown (2009),[] who indicated that, in respect to the correlation matrices involved in the calculations: "In PCA, 1.00s are put in the diagonal meaning that all of the variance in the matrix is to be accounted for (including variance unique to each variable, variance common among variables, and error variance). That would, therefore, by definition, include all of the variance in the variables. In contrast, in EFA, the communalities are put in the diagonal meaning that only the variance shared with other variables is to be accounted for (excluding variance unique to each variable and error variance). That would, therefore, by definition, include only variance that is common among the variables." Brown (2009), Principal components analysis and exploratory factor analysis Definitions, differences and choices For this reason, Brown (2009) recommends using factor analysis when theoretical ideas about relationships between variables exist, whereas PCA should be used if the goal of the researcher is to explore patterns in their data.

Differences in procedure and results


The differences between principal components analysis and factor analysis are further illustrated by Suhr (2009). PCA results in principal components that account for a maximal amount of variance in the observed variables, whereas FA accounts for the common variance in the data. PCA inserts ones on the diagonal of the correlation matrix, whereas FA adjusts the diagonal of the correlation matrix with the unique factors. PCA minimizes the sum of squared perpendicular distances to the component axis, whereas FA estimates factors that influence responses on observed variables. The component scores in PCA represent a linear combination of the observed variables weighted by eigenvectors, whereas the observed variables in FA are linear combinations of the underlying and unique factors. Finally, in PCA the components yielded are uninterpretable in the sense that they do not represent underlying constructs, whereas in FA the underlying constructs can be labeled and readily interpreted, given an accurate model specification.
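The diagonal contrast can be made concrete with a short sketch in R (R and its factanal function are mentioned under Implementation below). This is illustrative only and assumes nothing beyond base R: the data are simulated with a single common factor, and the first principal component of the correlation matrix (ones on the diagonal) is compared with a one-factor factanal() solution, which estimates communalities.

```r
# Illustrative sketch only: one simulated common factor drives four items.
set.seed(42)
common <- rnorm(500)
x <- sapply(c(.8, .7, .6, .5),
            function(l) l * common + rnorm(500, sd = sqrt(1 - l^2)))
colnames(x) <- paste0("item", 1:4)

R <- cor(x)
e <- eigen(R)                                       # PCA analyses R with 1s on the diagonal

pc1 <- e$vectors[, 1] * sign(sum(e$vectors[, 1]))   # eigenvector sign is arbitrary
pca_loadings <- pc1 * sqrt(e$values[1])             # first principal component "loadings"

fa_fit <- factanal(x, factors = 1)                  # common factor model: communalities estimated

round(pca_loadings, 2)                              # typically a little higher than ...
round(loadings(fa_fit)[, 1], 2)                     # ... the factor loadings, since PCA also
                                                    # absorbs unique and error variance
```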


Factor analysis in marketing


The basic steps are:
1. Identify the salient attributes consumers use to evaluate products in this category.
2. Use quantitative marketing research techniques (such as surveys) to collect data from a sample of potential customers concerning their ratings of all the product attributes.
3. Input the data into a statistical program and run the factor analysis procedure. The computer will yield a set of underlying attributes (or factors).
4. Use these factors to construct perceptual maps and other product positioning devices.

Information collection
The data collection stage is usually done by marketing research professionals. Survey questions ask the respondent to rate a product sample or descriptions of product concepts on a range of attributes. Anywhere from five to twenty attributes are chosen. They could include things like: ease of use, weight, accuracy, durability, colourfulness, price, or size. The attributes chosen will vary depending on the product being studied. The same question is asked about all the products in the study. The data for multiple products is coded and input into a statistical program such as R, SPSS, SAS, Stata, STATISTICA, JMP, and SYSTAT.

Analysis
The analysis will isolate the underlying factors that explain the data, using a matrix of associations.[5] Factor analysis is an interdependence technique: the complete set of interdependent relationships is examined, and there is no specification of dependent variables, independent variables, or causality. Factor analysis assumes that all the rating data on different attributes can be reduced to a few important dimensions. This reduction is possible because some attributes may be related to each other. The rating given to any one attribute is partially the result of the influence of other attributes. The statistical algorithm deconstructs the rating (called a raw score) into its various components and reconstructs the partial scores into underlying factor scores. The degree of correlation between the initial raw score and the final factor score is called a factor loading.
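A rough sketch of this pipeline in R might look as follows; the attribute names, the two latent dimensions and all loadings are invented for illustration, and a real study would of course use collected survey ratings rather than simulated ones.

```r
# Illustrative sketch: simulated ratings of six hypothetical attributes by 300
# respondents, generated from two made-up latent dimensions.
set.seed(1)
dim1 <- rnorm(300)
dim2 <- rnorm(300)
ratings <- cbind(
  durability = .8 * dim1 + rnorm(300, sd = .5),
  accuracy   = .7 * dim1 + rnorm(300, sd = .5),
  weight     = .6 * dim1 + rnorm(300, sd = .5),
  price      = .8 * dim2 + rnorm(300, sd = .5),
  size       = .7 * dim2 + rnorm(300, sd = .5),
  ease       = .6 * dim2 + rnorm(300, sd = .5)
)

fit <- factanal(ratings, factors = 2, rotation = "varimax", scores = "regression")
print(fit$loadings, cutoff = .3)   # factor loadings: which attributes cluster together
head(fit$scores)                   # each respondent's estimated factor scores
```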

Advantages
Both objective and subjective attributes can be used provided the subjective attributes can be converted into scores. Factor analysis can identify latent dimensions or constructs that direct analysis may not. It is easy and inexpensive.

Disadvantages
Usefulness depends on the researchers' ability to collect a sufficient set of product attributes. If important attributes are excluded or neglected, the value of the procedure is reduced. If sets of observed variables are highly similar to each other and distinct from other items, factor analysis will assign a single factor to them; this may obscure factors that represent more interesting relationships. Naming factors may require knowledge of theory, because seemingly dissimilar attributes can correlate strongly for unknown reasons.


Factor analysis in physical sciences


Factor analysis has also been widely used in physical sciences such as geochemistry, ecology, and hydrochemistry.[6] In groundwater quality management, it is important to relate the spatial distribution of different chemical parameters to different possible sources, which have different chemical signatures. For example, a sulfide mine is likely to be associated with high levels of acidity, dissolved sulfates and transition metals. These signatures can be identified as factors through R-mode factor analysis, and the location of possible sources can be suggested by contouring the factor scores.[7] In geochemistry, different factors can correspond to different mineral associations, and thus to mineralisation.[8]

Factor analysis in microarray analysis


Factor analysis can be used to summarize high-density oligonucleotide DNA microarray data at the probe level for Affymetrix GeneChips. In this case, the latent variable corresponds to the RNA concentration in a sample.[9]

Implementation
Factor analysis has been implemented in several statistical analysis programs since the 1980s: SAS, BMDP and SPSS.[10] It is also implemented in the R programming language (with the factanal function) and in OpenOpt. Rotations are implemented in the GPArotation R package.
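A minimal sketch of how these pieces could be combined, assuming the GPArotation package is installed: an unrotated factanal() solution is passed to GPArotation's oblimin() function as an example of an oblique rotation (the simulated data and loadings are arbitrary).

```r
# Sketch only; assumes the GPArotation package is installed (it is not part of base R).
library(GPArotation)

set.seed(2)
f1 <- rnorm(400); f2 <- rnorm(400)
x <- cbind(.8 * f1, .7 * f1, .6 * f1, .8 * f2, .7 * f2, .6 * f2) +
     matrix(rnorm(400 * 6, sd = .5), ncol = 6)

fit <- factanal(x, factors = 2, rotation = "none")   # unrotated maximum-likelihood solution
rot <- oblimin(unclass(loadings(fit)))               # oblique rotation from GPArotation

rot$loadings   # rotated loadings
rot$Phi        # estimated correlation between the two factors
```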

References
[4] Richard B. Darlington (2004).
[5] Ritter, N. (2012). A comparison of distribution-free and non-distribution-free methods in factor analysis. Paper presented at the Southwestern Educational Research Association (SERA) Conference 2012, New Orleans, LA (ED529153).

Further reading
Child, Dennis (2006). The Essentials of Factor Analysis (http://books.google.com/books?id=rQ2vdJgohH0C) (3rd ed.). Continuum International. ISBN 978-0-8264-8000-2.
Fabrigar, L.R.; Wegener, D.T.; MacCallum, R.C.; Strahan, E.J. (September 1999). "Evaluating the use of exploratory factor analysis in psychological research" (http://psycnet.apa.org/journals/met/4/3/272/). Psychological Methods 4 (3): 272-299. doi:10.1037/1082-989X.4.3.272 (http://dx.doi.org/10.1037/1082-989X.4.3.272).
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association. ISBN 1591470935.

External links
Factor Analysis. Retrieved July 23, 2004, from http://www2.chass.ncsu.edu/garson/pa765/factor.htm
Raymond Cattell. Retrieved July 22, 2004, from http://www.indiana.edu/~intell/rcattell.shtml
Exploratory Factor Analysis - A Book Manuscript by Tucker, L. & MacCallum, R. (1993). Retrieved June 8, 2006, from http://www.unc.edu/~rcm/book/factornew.htm
Garson, G. David, "Factor Analysis," from Statnotes: Topics in Multivariate Analysis. Retrieved on April 13, 2009 from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm
Factor Analysis at 100 (http://www.fa100.info/index.html) - conference material
FARMS - Factor Analysis for Robust Microarray Summarization, an R package (http://www.bioinf.jku.at/software/farms/farms.html)


Figure rating scale


The figure rating scale is a psychometric scale developed in the 1950s as a tool to determine body dissatisfaction in women, men, and children.[1] The scale presents nine silhouettes, ranging from very thin to very large, and the participant is asked to select the silhouette that best indicates his or her current body size and the one that best indicates his or her ideal body size (IBS).[2]

Trends in research
Studies of body dissatisfaction have shown that women have a tendency to pick a smaller IBS than current body size.[3] Discrepancies between the two selections indicate body dissatisfaction, which can lead to eating disorders or depression.

References
[1] Grogan, S. (2009). Routledge: New York.
[2] International Journal of Eating Disorders (http://www3.interscience.wiley.com/journal/112417746/abstract?CRETRY=1&SRETRY=0)
[3] Cororve Fingeret, M., Gleaves, D., & Pearson, C. (2004). On the methodology of body image assessment: the use of figural rating scales to evaluate body dissatisfaction and the ideal body standards of women. Body Image, 2, 207-212.

Fuzzy concept
A fuzzy concept is a concept of which the meaningful content, value, or boundaries of application can vary considerably according to context or conditions, instead of being fixed once and for all.[1] This generally means the concept is vague, lacking a fixed, precise meaning, without however being meaningless altogether.[2] It has a meaning, or multiple meanings (it has different semantic associations), but these can become clearer only through further elaboration and specification, including a closer definition of the context in which they are used. Fuzzy concepts "lack clarity and are difficult to test or operationalize".[3]

In logic, fuzzy concepts are often regarded as concepts which, in their application or formally speaking, are neither completely true nor completely false, or which are partly true and partly false; they are ideas which require further elaboration, specification or qualification to understand their applicability (the conditions under which they truly make sense). In mathematics and statistics, a fuzzy variable (such as "the temperature", "hot" or "cold") is a value which could lie in a probable range defined by quantitative limits or parameters, and which can be usefully described with imprecise categories (such as "high", "medium" or "low").

In mathematics and computer science, the gradations of applicable meaning of a fuzzy concept are described in terms of quantitative relationships defined by logical operators. Such an approach is sometimes called "degree-theoretic semantics" by logicians and philosophers,[4] but the more usual term is fuzzy logic or many-valued logic. The basic idea is that a real number is assigned to each statement written in a language, within a range from 0 to 1, where 1 means that the statement is completely true and 0 means that the statement is completely false, while values between 0 and 1 indicate that the statements are "partly true", to a given, quantifiable extent. This makes it possible to analyze a distribution of statements for their truth-content, identify data patterns, make inferences and predictions, and model how processes operate.

Fuzzy reasoning (i.e. reasoning with graded concepts) has many practical uses.[5] It is nowadays widely used in the programming of vehicle and transport electronics, household appliances, video games, language filters, robotics, and various kinds of electronic equipment used for pattern recognition, surveying and monitoring (such as radars). Fuzzy reasoning is also used in artificial intelligence and virtual intelligence research.[6] "Fuzzy risk scores" are used by project managers and portfolio managers to express risk assessments.[7]
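The idea of graded truth values can be sketched in a few lines of R; the membership function below (a piecewise-linear ramp for "hot") and the min/max connectives are just one common convention among many in fuzzy logic, and the temperature thresholds are arbitrary.

```r
# Degree (between 0 and 1) to which a temperature counts as "hot": a simple
# piecewise-linear membership function; the 20-30 degree thresholds are arbitrary.
hot  <- function(t) pmin(1, pmax(0, (t - 20) / 10))
cold <- function(t) 1 - hot(t)                 # fuzzy negation

hot(c(15, 22, 26, 35))                         # 0.0 0.2 0.6 1.0 -> graded truth values

# One common convention for fuzzy connectives: AND = minimum, OR = maximum.
fuzzy_and <- function(a, b) pmin(a, b)
fuzzy_or  <- function(a, b) pmax(a, b)

fuzzy_and(hot(26), cold(26))                   # "hot AND cold" at 26 degrees -> 0.4
fuzzy_or(hot(26), cold(26))                    # "hot OR cold"  at 26 degrees -> 0.6
```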


Origins and etymology


The intellectual origins of the idea of fuzzy logic have been traced to a diversity of famous and less well-known thinkers, including Plato, Georg Wilhelm Friedrich Hegel, Karl Marx, Friedrich Engels, Friedrich Nietzsche, Jan Łukasiewicz, Alfred Tarski, Stanisław Jaśkowski[8] and Donald Knuth.[9] However, usually the Iranian computer scientist Lotfi A. Zadeh is credited with inventing the specific idea of a "fuzzy concept" in his seminal 1965 paper on fuzzy sets, because he gave a formal mathematical presentation of the phenomenon.[10] In fact, the German scholar Dieter Klaua also published a paper on fuzzy sets in the same year, but he used a different terminology (he referred to "many-valued sets").[11]

Radim Belohlavek explains: "There exists strong evidence, established in the 1970s in the psychology of concepts... that human concepts have a graded structure in that whether or not a concept applies to a given object is a matter of degree, rather than a yes-or-no question, and that people are capable of working with the degrees in a consistent way. This finding is intuitively quite appealing, because people say "this product is more or less good" or "to a certain degree, he is a good athlete", implying the graded structure of concepts. In his classic paper, Zadeh called the concepts with a graded structure fuzzy concepts and argued that these concepts are a rule rather than an exception when it comes to how people communicate knowledge. Moreover, he argued that to model such concepts mathematically is important for the tasks of control, decision making, pattern recognition, and the like. Zadeh proposed the notion of a fuzzy set that gave birth to the field of fuzzy logic..."[12]

Hence, a concept is regarded as "fuzzy" by logicians if the defining characteristics of the concept apply to it only "to a certain degree or extent" (or with a certain magnitude of likelihood), or if the fuzzy concept itself consists of a fuzzy set. The fact that a concept is fuzzy does not prevent its use in logical reasoning; it merely affects the type of reasoning which can be applied (see fuzzy logic).

The idea of fuzzy concepts was subsequently applied in the philosophical, sociological and linguistic analysis of human behaviour. In a 1973 paper, George Lakoff, for example, analyzed hedges in the interpretation of the meaning of categories.[13] Charles Ragin and others have applied the idea to sociological analysis.[14]

In a more general sociological or journalistic sense, a "fuzzy concept" has come to mean a concept which is meaningful but inexact, implying that it does not exhaustively or completely define the meaning of the phenomenon to which it refers - often because it is too abstract. To specify the relevant meaning more precisely, additional distinctions, conditions and/or qualifiers would be required. For example, in a handbook of sociology we find a statement such as "The theory of interaction rituals contains some gaps that need to be filled and some fuzzy concepts that need to be differentiated."[15] The idea is that if finer distinctions are introduced, then the fuzziness or vagueness would be eliminated.
The main reason why the term is now often used in describing human behaviour is that human interaction has many characteristics which are difficult to quantify and measure precisely, among other things because they are interactive and reflexive (the observers and the observed mutually influence the meaning of events).[16] Those human characteristics can be usefully expressed only in an approximate way (see reflexivity (social theory)).[17] Newspaper stories frequently contain fuzzy concepts, which are readily understood and used, even though they are far from exact. Thus, many of the meanings which people ordinarily use to negotiate their way through life in reality turn out to be "fuzzy concepts". While people often do need to be exact about some things (e.g. money or time), many areas of their lives involve expressions which are far from exact.


Uncertainty
Fuzzy concepts can generate uncertainty because they are imprecise (especially if they refer to a process in motion, or a process of transformation where something is "in the process of turning into something else"). In that case, they do not provide a clear orientation for action or decision-making ("what does X really mean or imply?"); reducing fuzziness, perhaps by applying fuzzy logic, would generate more certainty. However, this is not necessarily always so.[18] A concept, even though it is not fuzzy at all, and even though it is very exact, could equally well fail to capture the meaning of something adequately. That is, a concept can be very precise and exact, but not - or insufficiently - applicable or relevant in the situation to which it refers. In this sense, a definition can be "very precise", but "miss the point" altogether. A fuzzy concept may indeed provide more security, because it provides a meaning for something when an exact concept is unavailable - which is better than not being able to denote it at all. A concept such as God, although not easily definable, can for instance provide security to the believer.

Language
Ordinary language, which uses symbolic conventions and associations which are often not logical, inherently contains many fuzzy concepts - "knowing what you mean" in this case depends on knowing the context or being familiar with the way in which a term is normally used, or what it is associated with. This can be easily verified, for instance, by consulting a dictionary, a thesaurus or an encyclopedia, which show the multiple meanings of words, or by observing the behaviours involved in ordinary relationships which rely on mutually understood meanings.

To communicate, receive or convey a message, an individual somehow has to bridge his own intended meaning and the meanings which are understood by others, i.e. the message has to be conveyed in a way that it will be socially understood, preferably in the intended manner. Thus, people might state: "you have to say it in a way that I understand". This may be done instinctively, habitually or unconsciously, but it usually involves a choice of terms, assumptions or symbols whose meanings may often not be completely fixed, but which depend among other things on how the receiver of the message responds to it, or the context. In this sense, meaning is often "negotiated" or "interactive" (or, more cynically, manipulated). This gives rise to many fuzzy concepts. But even using ordinary set theory and binary logic to reason something out, logicians have discovered that it is possible to generate statements which are, logically speaking, not completely true or which imply a paradox,[20] even though in other respects they conform to logical rules.

Psychology
The origin of fuzzy concepts is partly due to the fact that the human brain does not operate like a computer (see also Chinese room).[21] While computers use strict binary logic gates, the brain does not; i.e., it is capable of making all kinds of neural associations according to all kinds of ordering principles (or fairly chaotically) in associative patterns which are not logical but nevertheless meaningful. For example, a work of art can be meaningful without being logical. Something can be meaningful although we cannot name it, or we might only be able to name it and nothing else. The human brain can also interpret the same phenomenon in several different but interacting frames of reference, at the same time, or in quick succession, without there necessarily being an explicit logical connection between the frames. In part, fuzzy concepts arise also because learning or the growth of understanding involves a transition from a vague awareness, which cannot orient behaviour greatly, to clearer insight, which can orient behaviour. For example, the Dutch theologian Kees de Groot explores the imprecise notion that psychotherapy is like an "implicit religion",

defined as a "fuzzy concept" (it all depends on what one means by "psychotherapy" and "religion").[22]

Some logicians argue that fuzzy concepts are a necessary consequence of the reality that any kind of distinction we might like to draw has limits of application. At a certain level of generality, a distinction works fine. But if we pursue its application in a very exact and rigorous manner, or overextend its application, it appears that the distinction simply does not apply in some areas or contexts, or that we cannot fully specify how it should be drawn. An analogy might be that zooming a telescope, camera, or microscope in and out reveals that a pattern which is sharply focused at a certain distance disappears at another distance (or becomes blurry).

Faced with any large, complex and continually changing phenomenon, any short statement made about that phenomenon is likely to be "fuzzy", i.e. it is meaningful, but - strictly speaking - incorrect and imprecise. It will not really do justice to the reality of what is happening with the phenomenon. A correct, precise statement would require a lot of elaborations and qualifiers. Nevertheless, the "fuzzy" description turns out to be a useful shorthand that saves a lot of time in communicating what is going on ("you know what I mean").

In psychophysics it has been discovered that the perceptual distinctions we draw in the mind are often more sharply defined than they are in the real world. Thus, the brain actually tends to "sharpen up" our perceptions of differences in the external world. Between black and white, we are able to detect only a limited number of shades of gray, or colour gradations. If there are more gradations and transitions in reality than our conceptual distinctions can capture, then it could be argued that how those distinctions actually apply must necessarily become vaguer at some point. If, for example, one wants to count and quantify distinct objects using numbers, one needs to be able to distinguish between those separate objects, but if this is difficult or impossible, then, although this may not invalidate a quantitative procedure as such, quantification is not really possible in practice; at best, we may be able to assume or infer indirectly a certain distribution of quantities.

Finally, in interacting with the external world, the human mind may often encounter new, or partly new, phenomena or relationships which cannot (yet) be sharply defined given the background knowledge available, and by known distinctions, associations or generalizations.

"Crisis management plans cannot be put 'on the fly' after the crisis occurs. At the outset, information is often vague, even contradictory. Events move so quickly that decision makers experience a sense of loss of control. Often denial sets in, and managers unintentionally cut off information flow about the situation" - L. Paul Bremer, "Corporate governance and crisis management", in: Directors & Boards, Winter 2002.

It can also be argued that fuzzy concepts are generated by a certain sort of lifestyle or way of working which evades definite distinctions, makes them impossible or inoperable, or which is in some way chaotic. To obtain concepts which are not fuzzy, it must be possible to test out their application in some way. But in the absence of any relevant clear distinctions, or when everything is "in a state of flux" or in transition, it may not be possible to do so, so that the amount of fuzziness increases.

Applications
Fuzzy concepts often play a role in the creative process of forming new concepts to understand something. In the most primitive sense, this can be observed in infants who, through practical experience, learn to identify, distinguish and generalise the correct application of a concept, and relate it to other concepts.[23] However, fuzzy concepts may also occur in scientific, journalistic, programming and philosophical activity, when a thinker is in the process of clarifying and defining a newly emerging concept which is based on distinctions which, for one reason or another, cannot (yet) be more exactly specified or validated. Fuzzy concepts are often used to denote complex phenomena, or to describe something which is developing and changing, which might involve shedding some old meanings and acquiring new ones. In politics, it can be highly important and problematic how exactly a conceptual distinction is drawn, or indeed whether a distinction is drawn at all; distinctions used in administration may be deliberately sharpened, or kept

fuzzy, due to some political motive or power relationship. A politician may be deliberately vague about some things, and very clear and explicit about others. The "fuzzy area" can also refer simply to a residual number of cases which cannot be allocated to a known and identifiable group, class or set.

In translation work, fuzzy concepts are analyzed for the purpose of good translation. A concept in one language may not have quite the same meaning or significance in another language, or it may not be feasible to translate it literally, or at all. Some languages have concepts which do not exist in another language, raising the problem of how one would most easily render their meaning. In computer-assisted translation, a technique called fuzzy matching is used to find the most likely translation of a piece of text, using previously translated texts as a basis.

In information services, fuzzy concepts are frequently encountered because a customer or client asks a question about something which could be interpreted in many different ways, or a document is transmitted of a type or meaning which cannot be easily allocated to a known type or category, or to a known procedure. It might take considerable inquiry to "place" the information, or establish in what framework it should be understood.

In the legal system, it is essential that rules are interpreted and applied in a standard way, so that the same cases and the same circumstances are treated equally. Otherwise one would be accused of arbitrariness, which would not serve the interests of justice. Consequently, lawmakers aim to devise definitions and categories which are sufficiently precise that they are not open to different interpretations. For this purpose, it is critically important to remove fuzziness, and differences of interpretation are typically resolved through a court ruling based on evidence. Alternatively, some other procedure is devised which permits the correct distinction to be discovered and made.

In statistical research, it is an aim to measure the magnitudes of phenomena. For this purpose, phenomena have to be grouped and categorized so that distinct and discrete counting units can be defined. It must be possible to allocate all observations to mutually exclusive categories so that they are properly quantifiable. Survey observations do not spontaneously transform themselves into countable data; they have to be identified, categorized and classified in such a way that they are not counted twice or more. Again, for this purpose it is a requirement that the concepts used are exactly defined, and not fuzzy. There could be a margin of error, but the amount of error must be kept within tolerable limits, and preferably its magnitude should be known.

In hypnotherapy, fuzzy language is deliberately used for the purpose of trance induction. Hypnotic suggestions are often couched in a somewhat vague, general or ambiguous language requiring interpretation by the subject. The intention is to distract and shift the conscious awareness of the subject away from external reality to his own internal state. In response to the somewhat confusing signals he gets, the awareness of the subject spontaneously tends to withdraw inward, in search of understanding or escape.[24]

In biology, protein complexes with multiple structural forms are called fuzzy complexes. The different conformations can result in different, even opposite, functions. The conformational ensemble is modulated by the environmental conditions.
Post-translational modifications or alternative splicing can also impact the ensemble and thereby the affinity or specificity of interactions.

In theology, an attempt is made to define more precisely the meaning of spiritual concepts, which refer to how human beings construct the meaning of human existence and, often, the relationship people have with a supernatural world. Many spiritual concepts and beliefs are fuzzy, to the extent that, although abstract, they often have a highly personalized meaning, or involve personal interpretation of a type that is not easy to define in a cut-and-dried way.

In meteorology, where changes and effects of complex interactions in the atmosphere are studied, the weather reports often use fuzzy expressions indicating a broad trend, likelihood or level. The main reason is that the forecast can rarely be totally exact for any given location.

In phenomenology, which studies the structure of subjective experience, an important insight is that how someone experiences something can be influenced both by the influence of the thing being experienced itself and by


how the person responds to it. Thus, the actual experience the person has is shaped by an "interactive object-subject relationship". To describe this experience, fuzzy categories are often necessary, since it is often impossible to predict or describe with great exactitude what the interaction will be, and how it is experienced.

It could be argued that many concepts used fairly universally in daily life (e.g. "love" or "God" or "health" or "social") are inherently or intrinsically fuzzy concepts, to the extent that their meaning can never be completely and exactly specified with logical operators or objective terms, and can have multiple interpretations, which are in part exclusively subjective. Yet despite this limitation, such concepts are not meaningless. People keep using the concepts, even if they are difficult to define precisely. It may also be possible to specify one personal meaning for the concept, without however placing restrictions on a different use of the concept in other contexts (as when, for example, one says "this is what I mean by X" in contrast to other possible meanings). In ordinary speech, concepts may sometimes also be uttered purely randomly; for example a child may repeat the same idea in completely unrelated contexts, or an expletive term may be uttered arbitrarily. A feeling or sense is conveyed, without it being fully clear what it is about.

Fuzzy concepts can be used deliberately to create ambiguity and vagueness, as an evasive tactic, or to bridge what would otherwise be immediately recognized as a contradiction of terms. They might be used to indicate that there is definitely a connection between two things, without giving a complete specification of what the connection is, for some or other reason. This could be due to a failure or refusal to be more precise. But it could also be a prologue to a more exact formulation of a concept, or to a better understanding.

Fuzzy concepts could also simply be a practical method to describe something of which a complete description would be an unmanageably large undertaking, or very time-consuming; thus, a simplified indication of what is at issue is regarded as sufficient, although it is not exact. There is also such a thing as an "economy of distinctions", meaning that it is not helpful or efficient to use more detailed definitions than are really necessary for a given purpose. The provision of "too many details" could be disorienting and confusing, instead of being enlightening, while a fuzzy term might be sufficient to provide an orientation. The reason for using fuzzy concepts can therefore be purely pragmatic, if it is not feasible for practical purposes to provide "all the details" about the meaning of a shared symbol or sign. Thus people might say "I realize this is not exact, but you know what I mean" - they assume practically that stating all the details is not required for the purpose of the communication.


Analysis
In mathematical logic, computer programming, philosophy and linguistics, fuzzy concepts can be analyzed and defined more accurately or comprehensively, by describing or modelling the concepts using the terms of fuzzy logic. More generally, techniques that can be used include:

- concretizing the concept by finding specific examples, illustrations or cases to which it applies;
- specifying a range of conditions to which the concept applies (for example, in computer programming of a procedure);
- classifying or categorizing all or most cases or uses to which the concept applies (taxonomy);
- probing the assumptions on which a concept is based, or which are associated with its use (critical thought);
- identifying operational rules for the use of the concept, which cover all or most cases;
- allocating different applications of the concept to different but related sets (e.g. using Boolean logic);
- examining how probable it is that the concept applies, statistically or intuitively;
- examining the distribution or distributional frequency of (possibly different) uses of the concept;
- devising some other kind of measure or scale of the degree to which the concept applies;
- specifying a series of logical operators (an inferential system or algorithm) which captures all or most cases to which the concept applies;
- mapping or graphing the applications of the concept using some basic parameters;

- applying a meta-language which includes fuzzy concepts in a more inclusive categorical system which is not fuzzy;
- reducing or restating fuzzy concepts in terms which are simpler or similar, and which are not fuzzy or less fuzzy;
- relating the fuzzy concept to other concepts which are not fuzzy or less fuzzy, or simply replacing the fuzzy concept altogether with another, alternative concept which is not fuzzy yet "works exactly the same way".

An operationalization diagram is one method of clarifying fuzzy concepts.

In this way, we can obtain a more exact understanding of the use of a fuzzy concept, and possibly decrease the amount of fuzziness. It may not be possible to specify all the possible meanings or applications of a concept completely and exhaustively, but if it is possible to capture the majority of them, statistically or otherwise, this may be useful enough for practical purposes.

A process of defuzzification is said to occur when fuzzy concepts can be logically described in terms of (the relationships between) fuzzy sets, which makes it possible to define variations in the meaning or applicability of concepts as quantities. Effectively, qualitative differences may then be described more precisely as quantitative variations, or quantitative variability (assigning a numerical value then denotes the magnitude of variation).

The difficulty that can occur in judging the fuzziness of a concept can be illustrated with the question "Is this one of those?". If it is not possible to clearly answer this question, that could be because "this" (the object) is itself fuzzy and evades definition, or because "one of those" (the concept of the object) is fuzzy and inadequately defined. Thus, the source of fuzziness may be in the nature of the reality being dealt with, the concepts used to interpret it, or the way in which the two are being related by a person. It may be that the personal meanings which people attach to something are quite clear to the persons themselves, but that it is not possible to communicate those meanings to others except as fuzzy concepts.
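As a very small illustration, assuming the common centroid (membership-weighted average) method of defuzzification, graded judgements over a set of candidate values can be collapsed into a single crisp number:

```r
# Centroid ("weighted average") defuzzification: a fuzzy judgement over candidate
# values is collapsed into one crisp number. All values here are invented.
values     <- c(0, 25, 50, 75, 100)            # candidate scores
membership <- c(0.0, 0.2, 0.7, 1.0, 0.3)       # degree to which each value applies

crisp <- sum(values * membership) / sum(membership)
crisp                                          # about 65.9
```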


References
[1] Susan Haack, Deviant logic, fuzzy logic: beyond the formalism. Chicago: University of Chicago Press, 1996.
[2] Richard Dietz & Sebastiano Moruzzi (eds.), Cuts and clouds. Vagueness, Its Nature, and Its Logic. Oxford University Press, 2009.
[3] Ann Markusen, "Fuzzy Concepts, Scanty Evidence, Policy Distance: The Case for Rigour and Policy Relevance in Critical Regional Studies." In: Regional Studies, Volume 37, Issue 6-7, 2003, pp. 701-717.
[4] Roy T. Cook, A dictionary of philosophical logic. Edinburgh University Press, 2009, p. 84.
[5] Kazuo Tanaka, An Introduction to Fuzzy Logic for Practical Applications. Springer, 1996; Constantin Zopounidis, Panos M. Pardalos & George Baourakis, Fuzzy Sets in Management, Economics and Marketing. Singapore: World Scientific Publishing Co., 2001.
[7] Irem Dikmen, M. Talat Birgonal and Sedat Han, "Using fuzzy risk assessment to rate cost overrun risk in international construction projects." International Journal of Project Management, Vol. 25, No. 5, July 2007, pp. 494-505.
[8] Susan Haack notes that Stanisław Jaśkowski provided axiomatizations of many-valued logics in: Jaśkowski, "On the rules of supposition in formal logic." Studia Logica, No. 1, 1934. (http://www.logik.ch/daten/jaskowski.pdf) See Susan Haack, Philosophy of Logics. Cambridge University Press, 1978, p. 205.
[9] Priyanka Kaushal, Neeraj Mohan and Parvinder S. Sandhu, "Relevancy of Fuzzy Concept in Mathematics". International Journal of Innovation, Management and Technology, Vol. 1, No. 3, August 2010. (http://ijimt.org/papers/58-M450.pdf)

Fuzzy concept
[10] Lotfi A. Zadeh, "Fuzzy sets". In: Information and Control, Vol. 8, June 1965, pp. 338-353. (http://www-bisc.cs.berkeley.edu/Zadeh-1965.pdf)
[11] Siegfried Gottwald, "Shaping the logic of fuzzy set theory". In: Cintula, Petr et al. (eds.), Witnessed years. Essays in honour of Petr Hájek. London: College Publications, 2009, pp. 193-208. (http://www.uni-leipzig.de/~logik/gottwald/Hajek09.pdf)
[12] Radim Belohlavek, "What is a fuzzy concept lattice? II", in: Sergei O. Kuznetsov et al. (eds.), Rough sets, fuzzy sets, data mining and granular computing. Berlin: Springer Verlag, 2011, pp. 19-20. (http://belohlavek.inf.upol.cz/publications/BeVy_Wifcl.pdf)
[13] George Lakoff, "Hedges: A Study in Meaning Criteria and the Logic of Fuzzy Concepts." Journal of Philosophical Logic, Vol. 2, 1973, pp. 458-508. (http://georgelakoff.files.wordpress.com/2011/01/hedges-a-study-in-meaning-criteria-and-the-logic-of-fuzzy-concepts-journal-of-philosophical-logic-2-lakoff-19731.pdf)
[14] Charles Ragin, Redesigning Social Inquiry: Fuzzy Sets and Beyond. University of Chicago Press, 2008; Shaomin Li, "Measuring the fuzziness of human thoughts: An application of fuzzy sets to sociological research". The Journal of Mathematical Sociology, Volume 14, Issue 1, 1989, pp. 67-84.
[15] Jörg Rössel and Randall Collins, "Conflict theory and interaction rituals. The microfoundations of conflict theory." In: Jonathan H. Turner (ed.), Handbook of Sociological Theory. New York: Springer, 2001, p. 527.
[16] Loïc Wacquant, "The fuzzy logic of practical sense." In: Pierre Bourdieu and Loïc Wacquant, An invitation to reflexive sociology. London: Polity Press, 1992, chapter I, section 4.
[17] Ph. Manning, "Fuzzy Description: Discovery and Invention in Sociology." In: History of the Human Sciences, Vol. 7, No. 1, 1994, pp. 117-123.
[18] Masao Mukaidono, Fuzzy logic for beginners. Singapore: World Scientific Publishing, 2001.
[20] Patrick Hughes & George Brecht, Vicious Circles and Infinity. An anthology of Paradoxes. Penguin Books, 1978.
[21] See further Radim Belohlavek & George J. Klir (eds.), Concepts and Fuzzy Logic. MIT Press, 2011; John R. Searle, "Minds, brains and programs". The Behavioral and Brain Sciences, Vol. 3, No. 3, 1980, pp. 417-457.
[22] C.N. de Groot, "Sociology of religion looks at psychotherapy." Recherches sociologiques (Louvain-la-Neuve, Belgium), Vol. 29, No. 2, 1998, pp. 3-17, at p. 4. (http://arno.uvt.nl/show.cgi?fid=76988)
[23] Philip J. Kelman & Martha E. Arterberry, The cradle of knowledge: development of perception in infancy. Cambridge, Mass.: The MIT Press, 2000.
[24] Ronald A. Havens (ed.), The wisdom of Milton H. Erickson, Volume I: hypnosis and hypnotherapy. New York: Irvington Publishers, 1992, p. 106; Joseph O'Connor & John Seymour (eds.), Introducing neuro-linguistic programming. London: Thorsons, 1995, p. 116f.


External links
James F. Brule, Fuzzy systems tutorial (http://www.austinlinks.com/Fuzzy/tutorial.html)
"Fuzzy Logic", Stanford Encyclopedia of Philosophy (http://plato.stanford.edu/entries/logic-fuzzy/)


G factor (psychometrics)

The g factor (short for "general factor") is a construct developed in psychometric investigations of cognitive abilities. It is a variable that summarizes positive correlations among different cognitive tasks, reflecting the fact that an individual's performance at one type of cognitive task tends to be comparable to his or her performance at other kinds of cognitive tasks. The g factor typically accounts for 40 to 50 percent of the variance in IQ test performance, and IQ scores are frequently regarded as estimates of individuals' standing on the g factor.[1] The terms IQ, general intelligence, general cognitive ability, general mental ability, or simply intelligence are often used interchangeably to refer to the common core shared by cognitive tests.[2] The existence of the g factor was originally proposed by the English psychologist Charles Spearman in the early years of the 20th century. He observed that children's performance ratings across seemingly unrelated school subjects were positively correlated, and reasoned that these correlations reflected the influence of an underlying general mental ability that entered into performance on all kinds of mental tests. Spearman suggested that all mental

performance could be conceptualized in terms of a single general ability factor, which he labeled g, and a large number of narrow task-specific ability factors. Today's factor models of intelligence typically represent cognitive abilities as a three-level hierarchy, where there are a large number of narrow factors at the bottom of the hierarchy, a handful of broad, more general factors at the intermediate level, and at the apex a single factor, referred to as the g factor, which represents the variance common to all cognitive tasks.

Traditionally, research on g has concentrated on psychometric investigations of test data, with a special emphasis on factor analytic approaches. However, empirical research on the nature of g has also drawn upon experimental cognitive psychology and mental chronometry, brain anatomy and physiology, quantitative and molecular genetics, and primate evolution.[3] While the existence of g as a statistical regularity is well-established and uncontroversial, there is no consensus as to what causes the positive correlations between tests.

Behavioral genetic research has established that the construct of g is highly heritable. It has a number of other biological correlates, including brain size. It is also a significant predictor of individual differences in many social outcomes, particularly in education and the world of work. The most widely accepted contemporary theories of intelligence incorporate the g factor.[4] However, critics of g have contended that an emphasis on g is misplaced and entails a devaluation of other important abilities.


Mental testing and g


Spearman's correlation matrix for six measures of school performance. All the correlations are positive, a phenomenon referred to as the positive manifold. The bottom row shows the g loading of each performance measure.[5]

              Classics  French  English  Math   Pitch  Music
Classics         -
French          .83       -
English         .78      .67      -
Math            .70      .67     .64      -
Pitch           .66      .65     .54     .45      -
Music           .63      .57     .51     .51     .40     -
g               .958     .882    .803    .750    .673   .646

Subtest intercorrelations in a sample of Scottish subjects who completed the WAIS-R battery show the same pattern: every correlation is positive. The subtests are Vocabulary, Similarities, Information, Comprehension, Picture arrangement, Block design, Arithmetic, Picture completion, Digit span, Object assembly, and Digit symbol, with g loadings ranging from roughly .48 to .83.[6]

Mental tests may be designed to measure different aspects of cognition. Specific domains assessed by tests include mathematical skill, verbal fluency, spatial visualization, and memory, among others. However, individuals who excel at one type of test tend to excel at other kinds of tests, too, while those who do poorly on one test tend to do so on all tests, regardless of the tests' contents.[7] The English psychologist Charles Spearman was the first to describe this phenomenon.[8] In a famous research paper published in 1904[9], he observed that children's performance measures across seemingly unrelated school subjects were positively correlated. This finding has since been replicated numerous times. The consistent finding of universally positive correlation matrices of mental test results (or the "positive manifold"), despite large differences in tests' contents, has been described as "arguably the most replicated result in all psychology."[10] Zero or negative correlations between tests suggest the presence of sampling error or restriction of the range of ability in the sample studied.[11] Using factor analysis or related statistical methods, it is possible to compute a single common factor that can be regarded as a summary variable characterizing the correlations between all the different tests in a test battery. Spearman referred to this common factor as the general factor, or simply g. (By convention, g is always printed as a lower case italic.) Mathematically, the g factor is a source of variance among individuals, which entails that one cannot meaningfully speak of any one individual's mental abilities consisting of g or other factors to any specified degrees. One can only speak of an individual's standing on g (or other factors) compared to other individuals in a relevant population.[12][13][11] Different tests in a test battery may correlate with (or "load onto") the g factor of the battery to different degrees. These correlations are known as g loadings. An individual test taker's g factor score, representing his or her relative standing on the g factor in the total group of individuals, can be estimated using the g loadings. Full-scale IQ scores from a test battery will usually be highly correlated with g factor scores, and they are often regarded as estimates of g. For example, the correlations between g factor scores and full-scale IQ scores from Wechsler's tests have been found to be greater than .95.[14][11][1] The terms IQ, general intelligence, general cognitive ability, general mental ability, or simply intelligence are frequently used interchangeably to refer to the common core shared by cognitive tests.[2] The g loadings of mental tests are always positive and usually range between .10 and .90, with a mean of about .60 and a standard deviation of about .15. Raven's Progressive Matrices is among the tests with the highest g loadings, around .80. Tests of vocabulary and general information are also typically found to have high g loadings.[15][16] However, the g loading of the same test may vary somewhat depending on the composition of the test battery.[17] The complexity of tests and the demands they place on mental manipulation are related to the tests' g loadings. For example, in the forward digit span test the subject is asked to repeat a sequence of digits in the order of their presentation after hearing them once at a rate of one digit per second. 
The backward digit span test is otherwise the same except that the subject is asked to repeat the digits in the reverse order to that in which they were presented. The backward digit span test is more complex than the forward digit span test, and it has a significantly higher g loading. Similarly, the g loadings of arithmetic computation, spelling, and word reading tests are lower than those of arithmetic problem solving, text composition, and reading comprehension tests, respectively.[18][19] Test difficulty and g loadings are distinct concepts that may or may not be empirically related in any specific situation. Tests that have the same difficulty level, as indexed by the proportion of test items that are failed by test takers, may exhibit a wide range of g loadings. For example, tests of rote memory have been shown to have the same level of difficulty but considerably lower g loadings than many tests that involve reasoning.[20][21]
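As a rough sketch of how such loadings can be obtained, the correlations from Spearman's table above can be analysed in R; here the first principal component is used as a simple stand-in for g (a common-factor extraction, e.g. with factanal(), would give somewhat lower values):

```r
# Spearman's correlations between the six school performance measures (from the
# table above); the first principal component serves as a simple stand-in for g.
subjects <- c("Classics", "French", "English", "Math", "Pitch", "Music")
R <- matrix(1, 6, 6, dimnames = list(subjects, subjects))
R["Classics", 2:6] <- c(.83, .78, .70, .66, .63)
R["French",   3:6] <- c(.67, .67, .65, .57)
R["English",  4:6] <- c(.64, .54, .51)
R["Math",     5:6] <- c(.45, .51)
R["Pitch",      6] <- .40
R[lower.tri(R)] <- t(R)[lower.tri(R)]          # mirror the upper triangle

e <- eigen(R)
g_load <- e$vectors[, 1] * sqrt(e$values[1])
g_load <- g_load * sign(sum(g_load))           # eigenvector sign is arbitrary
round(setNames(g_load, subjects), 2)           # roughly comparable to the tabled g loadings
```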


Theories of g
While the existence of g as a statistical regularity is well-established and uncontroversial among experts, there is no consensus as to what causes the positive intercorrelations. Several explanations have been proposed.[22]

Mental energy or efficiency


Charles Spearman reasoned that correlations between tests reflected the influence of a common causal factor, a general mental ability that enters into performance on all kinds of mental tasks. However, he thought that the best indicators of g were those tests that reflected what he called the eduction of relations and correlates, which included abilities such as deduction, induction, problem solving, grasping relationships, inferring rules, and spotting differences and similarities. Spearman hypothesized that g was equivalent with "mental energy". However, this was more of a metaphorical explanation, and he remained agnostic about the physical basis of this energy, expecting that future research would uncover the exact physiological nature of g.[23] Following Spearman, Arthur Jensen maintained that all mental tasks tap into g to some degree. According to Jensen, the g factor represents a "distillate" of scores on different tests rather than a summation or an average of such scores, with factor analysis acting as the distillation procedure.[24] He argued that g cannot be described in terms of the item characteristics or information content of tests, pointing out that very dissimilar mental tasks may have nearly equal g loadings. David Wechsler similarly contended that g is not an ability at all but rather some general property of the brain. Jensen hypothesized that g corresponds to individual differences in the speed or efficiency of the neural processes associated with mental abilities.[25] He also suggested that given the associations between g and elementary cognitive tasks, it should be possible to construct a ratio scale test of g that uses time as the unit of measurement.[26]

Sampling theory
The so-called sampling theory of g, originally developed by E.L. Thorndike and Godfrey Thomson, proposes that the existence of the positive manifold can be explained without reference to a unitary underlying capacity. According to this theory, there are a number of uncorrelated mental processes, and all tests draw upon different samples of these processes. The intercorrelations between tests are caused by an overlap between processes tapped by the tests.[27][28] Thus, the positive manifold arises due to a measurement problem, an inability to measure more fine-grained, presumably uncorrelated mental processes.[13] It has been shown that it is not possible to distinguish statistically between Spearman's model of g and the sampling model; both are equally able to account for intercorrelations among tests.[29] The sampling theory is also consistent with the observation that more complex mental tasks have higher g loadings, because more complex tasks are expected to involve a larger sampling of neural elements and therefore have more of them in common with other tasks.[30] Some researchers have argued that the sampling model invalidates g as a psychological concept, because the model suggests that g factors derived from different test batteries simply reflect the shared elements of the particular tests contained in each battery rather than a g that is common to all tests. Similarly, high correlations between different batteries could be due to them measuring the same set of abilities rather than the same ability.[31] Critics have argued that the sampling theory is incongruent with certain empirical findings. Based on the sampling theory, one might expect that related cognitive tests share many elements and thus be highly correlated. However, some closely related tests, such as forward and backward digit span, are only modestly correlated, while some seemingly completely dissimilar tests, such as vocabulary tests and Raven's matrices, are consistently highly correlated. Another problematic finding is that brain damage frequently leads to specific cognitive impairments rather than a general impairment one might expect based on the sampling theory.[32][13]
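The sampling account can be illustrated with a small simulation in the spirit of Thomson's demonstrations: a pool of mutually independent elementary processes is generated, each hypothetical test is scored as the sum of a random subset of them, and the test scores nevertheless correlate positively because the subsets overlap. The pool and subset sizes below are arbitrary.

```r
# Thomson-style sampling illustration: independent "bonds" (elementary processes);
# each hypothetical test samples an overlapping subset of them. Sizes are arbitrary.
set.seed(7)
n_people <- 1000
n_bonds  <- 100
bonds <- matrix(rnorm(n_people * n_bonds), n_people, n_bonds)

make_test <- function() {
  picked <- sample(n_bonds, 40)                # each test draws on 40 of the 100 bonds
  rowSums(bonds[, picked])
}
tests <- replicate(6, make_test())             # six hypothetical tests

round(cor(tests), 2)                           # all intercorrelations come out positive,
                                               # with no common factor in the generating process
```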


Mutualism
The "mutualism" model of g proposes that cognitive processes are initially uncorrelated, but that the positive manifold arises during individual development due to mutual beneficial relations between cognitive processes. Thus there is no single process or capacity underlying the positive correlations between tests. During the course of development, the theory holds, any one particularly efficient process will benefit other processes, with the result that the processes will end up being correlated with one another. Thus similarly high IQs in different persons may stem from quite different initial advantages that they had.[33][13] Critics have argued that the observed correlations between the g loadings and the heritability coefficients of subtests are problematic for the mutualism theory.[34]

Factor structure of cognitive abilities


Factor analysis is a family of mathematical techniques that can be used to represent correlations between intelligence tests in terms of a smaller number of variables known as factors. The purpose is to simplify the correlation matrix by using hypothetical underlying factors to explain the patterns in it. When all correlations in a matrix are positive, as they are in the case of IQ, factor analysis will yield a general factor common to all tests. The general factor of IQ tests is referred to as the g factor, and it typically accounts for 40 to 50 percent of the variance in IQ test batteries.[35]

Charles Spearman developed factor analysis in order to study correlations between tests. Initially, he developed a model of intelligence in which variations in all intelligence test scores are explained by only two kinds of variables: first, factors that are specific to each test (denoted s); and second, a g factor that accounts for the positive correlations across tests. This is known as Spearman's two-factor theory. Later research based on more diverse test batteries than those used by Spearman demonstrated that g alone could not account for all correlations between tests. Specifically, it was found that even after controlling for g, some tests were still correlated with each other. This led to the postulation of group factors that represent variance that groups of tests with similar task demands (e.g., verbal, spatial, or numerical) have in common in addition to the shared g variance.[36]

An illustration of Spearman's two-factor intelligence theory. Each small oval is a hypothetical mental test. The blue areas correspond to test-specific variance (s), while the purple areas represent the variance attributed to g.


Through factor rotation, it is, in principle, possible to produce an infinite number of different factor solutions that are mathematically equivalent in their ability to account for the intercorrelations among cognitive tests. These include solutions that do not contain a g factor. Thus factor analysis alone cannot establish what the underlying structure of intelligence is. In choosing between different factor solutions, researchers have to examine the results of factor analysis together with other information about the structure of cognitive abilities.[37]

An illustration of John B. Carroll's three stratum theory, an influential contemporary model of cognitive abilities. The broad abilities recognized by the model are fluid intelligence (Gf), crystallized intelligence (Gc), general memory and learning (Gy), broad visual perception (Gv), broad auditory perception (Gu), broad retrieval ability (Gr), broad cognitive speediness (Gs), and processing speed (Gt). Carroll regarded the broad abilities as different "flavors" of g.

There are many psychologically relevant reasons for preferring factor solutions that contain a g factor. These include the existence of the positive manifold, the fact that certain kinds of tests (generally the more complex ones) have consistently larger g loadings, the substantial invariance of g factors across different test batteries, the impossibility of constructing test batteries that do not yield a g factor, and the widespread practical validity of g as a predictor of individual outcomes. The g factor, together with group factors, best represents the empirically established fact that, on average, overall ability differences between individuals are greater than differences among abilities within individuals, while a factor solution with orthogonal factors without g obscures this fact. Moreover, g appears to be the most heritable component of intelligence.[38] Research utilizing the techniques of confirmatory factor analysis has also provided support for the existence of g.[37]

A g factor can be computed from a correlation matrix of test results using several different methods. These include exploratory factor analysis, principal components analysis (PCA), and confirmatory factor analysis. Different factor-extraction methods produce highly consistent results, although PCA has sometimes been found to produce inflated estimates of the influence of g on test scores.[39][17]

There is a broad contemporary consensus that cognitive variance between people can be conceptualized at three hierarchical levels, distinguished by their degree of generality. At the lowest, least general level there are a large number of narrow first-order factors; at a higher level, there are a relatively small number (somewhere between five and ten) of broad (i.e., more general) second-order factors, or group factors; and at the apex, there is a single third-order factor, g, the general factor common to all tests.[40][41][42] The g factor usually accounts for the majority of the total common factor variance of IQ test batteries.[43] Contemporary hierarchical models of intelligence include the three stratum theory and the Cattell-Horn-Carroll theory.[44]

"Indifference of the indicator"


Spearman proposed the principle of the indifference of the indicator, according to which the precise content of intelligence tests is unimportant for the purposes of identifying g, because g enters into performance on all kinds of tests. Any test can therefore be used as an indicator of g. Following Spearman, Arthur Jensen more recently argued that a g factor extracted from one test battery will always be the same, within the limits of measurement error, as that extracted from another battery, provided that the batteries are large and diverse.[45] According to this view, every mental test, no matter how distinctive, contains some g. Thus a composite score of a number of different tests will have relatively more g than any of the individual test scores, because the g components cumulate into the composite score, while the uncorrelated non-g components will cancel each other out. Theoretically, the composite score of an infinitely large, diverse test battery would, then, be a perfect measure of g.[46]

In contrast, L.L. Thurstone argued that a g factor extracted from a test battery reflects the average of all the abilities called for by the particular battery, and that g therefore varies from one battery to another and "has no fundamental psychological significance."[47] Along similar lines, John Horn argued that g factors are meaningless because they are not invariant across test batteries, maintaining that correlations between different ability measures arise because it is difficult to define a human action that depends on just one ability.[48][49]

To show that different batteries reflect the same g, one must administer several test batteries to the same individuals, extract g factors from each battery, and show that the factors are highly correlated.[50] Wendy Johnson and colleagues have published two such studies.[51][52] The first found that the correlations between g factors extracted from three different batteries were .99, .99, and 1.00, supporting the hypothesis that g factors from different batteries are the same and that the identification of g is not dependent on the specific abilities assessed. The second study found that g factors derived from four of five test batteries correlated at between .95 and 1.00, while the correlations ranged from .79 to .96 for the fifth battery, the Cattell Culture Fair Intelligence Test (the CFIT). They attributed the somewhat lower correlations with the CFIT battery to its lack of content diversity, as it contains only matrix-type items, and interpreted the findings as supporting the contention that g factors derived from different test batteries are the same provided that the batteries are diverse enough. The results suggest that the same g can be consistently identified from different test batteries.[53][40]
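The logic of such studies can be sketched in a few lines of simulation: give two batteries to the same (here, simulated) people, score everyone on the first factor of each battery, and correlate the two sets of scores. This is only a stand-in for the latent-variable models actually used by Johnson and colleagues; the loadings and sample size below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 5000

# Simulate a latent g plus independent test-specific parts for ten tests, then
# split the tests into two arbitrary "batteries". Loadings are invented values.
g = rng.normal(size=n_people)
loadings = rng.uniform(0.4, 0.8, size=10)
scores = np.outer(g, loadings) + rng.normal(size=(n_people, 10)) * np.sqrt(1 - loadings**2)
battery_a, battery_b = scores[:, :5], scores[:, 5:]

def first_factor_scores(X):
    """Score each person on the first principal component of a battery,
    a simple stand-in for a g-factor score."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    _, vecs = np.linalg.eigh(np.corrcoef(Xz, rowvar=False))
    w = vecs[:, -1]
    return Xz @ (w if w.sum() > 0 else -w)

r = np.corrcoef(first_factor_scores(battery_a), first_factor_scores(battery_b))[0, 1]
print("Correlation between g scores from the two batteries:", round(r, 2))
# Observed factor scores from short batteries are imperfect measures of g, so this
# value is attenuated; latent-variable analyses like Johnson et al.'s, which model
# measurement error, yield correlations close to 1 for large, diverse batteries.
```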


Population distribution
The form of the population distribution of g is unknown, because g cannot be measured on a ratio scale. (The distributions of scores on typical IQ tests are roughly normal, but this is achieved by construction, i.e., by appropriate item selection by test developers.) It has been argued that there are nevertheless good reasons for supposing that g is normally distributed in the general population, at least within a range of ±2 standard deviations from the mean. In particular, g can be thought of as a composite variable that reflects the additive effects of a large number of independent genetic and environmental influences, and such a variable should, according to the central limit theorem, follow a normal distribution.[54]
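A small simulation makes the central-limit-theorem argument concrete: a trait built from the sum of many small, independent influences is approximately normal even when each individual influence is far from normal. The number and form of the influences below are arbitrary assumptions for illustration, not a model of real genetic architecture.

```python
import numpy as np

rng = np.random.default_rng(42)
n_people, n_influences = 100_000, 200

# Each influence is a small, skewed, independent contribution (a Bernoulli event
# times an invented effect size); no single influence is normally distributed.
effect_sizes = rng.uniform(0.5, 1.5, size=n_influences)
present = rng.binomial(1, 0.3, size=(n_people, n_influences))
trait = (present * effect_sizes).sum(axis=1)

# Standardize and compare tail frequencies with a standard normal distribution.
z = (trait - trait.mean()) / trait.std()
for cutoff, normal_expectation in [(1, 0.3173), (2, 0.0455), (3, 0.0027)]:
    print(f"P(|z| > {cutoff}): simulated {np.mean(np.abs(z) > cutoff):.4f}, "
          f"normal {normal_expectation}")
```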

Spearman's law of diminishing returns


A number of researchers have suggested that the proportion of variation accounted for by g may not be uniform across all subgroups within a population. Spearman's law of diminishing returns (SLDR), also termed the ability differentiation hypothesis, predicts that the positive correlations among different cognitive abilities are weaker among more intelligent subgroups of individuals. More specifically, SLDR predicts that the g factor will account for a smaller proportion of individual differences in cognitive test scores at higher scores on the g factor. SLDR was originally proposed by Charles Spearman,[55] who reported that the average correlation between 12 cognitive ability tests was .466 in 78 normal children, and .782 in 22 "defective" children. Detterman and Daniel rediscovered this phenomenon in 1989.[56] They reported that for subtests of both the WAIS and the WISC, subtest intercorrelations decreased monotonically with ability group, ranging from an average intercorrelation of approximately .7 among individuals with IQs less than 78 to approximately .4 among individuals with IQs greater than 122.[57]

SLDR has been replicated in a variety of child and adult samples who have been measured using broad arrays of cognitive tests. The most common approach has been to divide individuals into multiple ability groups using an observable proxy for their general intellectual ability, and then either to compare the average intercorrelation among the subtests across the different groups, or to compare the proportion of variation accounted for by a single common factor in the different groups.[58] However, as both Deary et al. (1996)[58] and Tucker-Drob (2009)[59] have pointed out, dividing the continuous distribution of intelligence into an arbitrary number of discrete ability groups is less than ideal for examining SLDR. Tucker-Drob (2009)[59] extensively reviewed the literature on SLDR and the various methods by which it had been previously tested, and proposed that SLDR could be most appropriately captured by fitting a common factor model that allows the relations between the factor and its indicators to be nonlinear. He applied such a factor model to nationally representative data on children and adults in the United States and found consistent evidence for SLDR. For example, Tucker-Drob (2009) found that a general factor accounted for approximately 75% of the variation in seven different cognitive abilities among very low IQ adults, but only approximately 30% of the variation in the same abilities among very high IQ adults.
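The grouping approach described above can be sketched as follows: simulate test scores in which the g loadings weaken at higher ability (the SLDR pattern), split people on an observable ability proxy, and compare the average subtest intercorrelation in the resulting groups. The loading function and all numbers here are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_tests = 50_000, 8

g = rng.normal(size=n_people)
# Illustrative nonlinear factor model: the g loading shrinks as g increases,
# which is the pattern SLDR (ability differentiation) predicts.
loading = np.clip(0.85 - 0.20 * g, 0.30, 0.95)[:, None]
scores = loading * g[:, None] + np.sqrt(1 - loading**2) * rng.normal(size=(n_people, n_tests))

proxy = scores.mean(axis=1)                                # observable ability proxy
bottom = proxy < np.quantile(proxy, 0.25)
top = proxy > np.quantile(proxy, 0.75)

def mean_intercorrelation(X):
    """Average off-diagonal correlation among the subtests."""
    R = np.corrcoef(X, rowvar=False)
    return (R.sum() - len(R)) / (len(R) * (len(R) - 1))

print("Mean subtest intercorrelation, bottom quartile:", round(mean_intercorrelation(scores[bottom]), 2))
print("Mean subtest intercorrelation, top quartile:   ", round(mean_intercorrelation(scores[top]), 2))
# Note: selecting groups on an observed proxy also restricts range within each
# group, one reason this simple grouping approach is considered less than ideal.
```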


Practical validity
The practical validity of g as a predictor of educational, economic, and social outcomes is more far-ranging and universal than that of any other known psychological variable. The validity of g increases with the complexity of the task being predicted.[60][61]

A test's practical validity is measured by its correlation with performance on some criterion external to the test, such as college grade-point average or a rating of job performance. The correlation between test scores and a measure of some criterion is called the validity coefficient. One way to interpret a validity coefficient is to square it to obtain the proportion of variance accounted for by the test. For example, a validity coefficient of .30 corresponds to 9 percent of variance explained. This approach has, however, been criticized as misleading and uninformative, and several alternatives have been proposed. One arguably more interpretable approach is to look at the percentage of test takers in each test score quintile who meet some agreed-upon standard of success. For example, if the correlation between test scores and performance is .30, the expectation is that 67 percent of those in the top quintile will be above-average performers, compared to 33 percent of those in the bottom quintile.[62][63]
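The quintile interpretation can be checked with a quick simulation, assuming (as the example in the text does) that test scores and performance are bivariate normal with a correlation of .30.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
r = 0.30                                    # the validity coefficient from the example

test = rng.normal(size=n)
performance = r * test + np.sqrt(1 - r**2) * rng.normal(size=n)   # bivariate normal

top_quintile = test > np.quantile(test, 0.80)
bottom_quintile = test < np.quantile(test, 0.20)

print("Above-average performers in the top quintile:   ",
      f"{(performance[top_quintile] > 0).mean():.0%}")
print("Above-average performers in the bottom quintile:",
      f"{(performance[bottom_quintile] > 0).mean():.0%}")
# Prints roughly 67% and 33%, matching the interpretation given in the text.
```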

Academic achievement
The predictive validity of g is most conspicuous in the domain of scholastic performance. This is apparently because g is closely linked to the ability to learn novel material and understand concepts and meanings.[64] In elementary school, the correlation between IQ and grades and achievement scores is between .60 and .70. At more advanced educational levels, more students from the lower end of the IQ distribution drop out, which restricts the range of IQs and results in lower validity coefficients. In high school, college, and graduate school the validity coefficients are .50–.60, .40–.50, and .30–.40, respectively. The g loadings of IQ scores are high, but it is possible that some of the validity of IQ in predicting scholastic achievement is attributable to factors measured by IQ independent of g. According to research by Robert L. Thorndike, 80 to 90 percent of the predictable variance in scholastic performance is due to g, with the rest attributed to non-g factors measured by IQ and other tests.[65]

Achievement test scores are more highly correlated with IQ than school grades are. This may be because grades are more influenced by the teacher's idiosyncratic perceptions of the student.[66] In a longitudinal English study, g scores measured at age 11 correlated with all 25 subject tests of the national GCSE examination taken at age 16. The correlations ranged from .77 for the mathematics test to .42 for the art test. The correlation between g and a general educational factor computed from the GCSE tests was .81.[67]

Research suggests that the SAT, widely used in college admissions, is primarily a measure of g. A correlation of .82 has been found between g scores computed from an IQ test battery and SAT scores. In a study of 165,000 students at 41 U.S. colleges, SAT scores were found to correlate at .47 with first-year college grade-point average after correcting for range restriction in SAT scores (when course difficulty is held constant, i.e., if all students attended the same set of classes, the correlation rises to .55).[62][68]
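Corrections for range restriction of the kind mentioned in the SAT study are commonly made with a standard formula (Thorndike's Case II). The sketch below shows the formula with invented numbers; it is not the actual computation or data from the cited study.

```python
import math

def correct_for_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.
    sd_ratio is the SD of test scores in the full applicant pool divided by
    the SD in the restricted (selected) sample, so it is greater than 1."""
    u = sd_ratio
    return (r_restricted * u) / math.sqrt(1 - r_restricted**2 + (r_restricted * u) ** 2)

# Invented numbers: an observed correlation of .35 in a selected group whose
# test-score spread is 70% of the full applicant population's spread.
print(round(correct_for_range_restriction(0.35, 1 / 0.7), 2))
```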


Job attainment and performance


There is a high correlation of .90 to .95 between the prestige rankings of occupations, as rated by the general population, and the average general intelligence scores of people employed in each occupation. At the level of individual employees, the association between job prestige and g is lower; one large U.S. study reported a correlation of .65 (.72 corrected for attenuation). The mean level of g thus increases with perceived job prestige. It has also been found that the dispersion of general intelligence scores is smaller in more prestigious occupations than in lower-level occupations, suggesting that higher-level occupations have minimum g requirements.[69][70]

Research indicates that tests of g are the best single predictors of job performance, with an average validity coefficient of .55 across several meta-analyses of studies based on supervisor ratings and job samples. The average meta-analytic validity coefficient for performance in job training is .63.[71] The validity of g in the highest-complexity jobs (professional, scientific, and upper management jobs) has been found to be greater than in the lowest-complexity jobs, but g has predictive validity even for the simplest jobs. Research also shows that specific aptitude tests tailored for each job provide little or no increase in predictive validity over tests of general intelligence. It is believed that g affects job performance mainly by facilitating the acquisition of job-related knowledge. The predictive validity of g is greater than that of work experience, and increased experience on the job does not decrease the validity of g.[69][72]

Income
The correlation between income and g, as measured by IQ scores, averages about .40 across studies. The correlation is higher at higher levels of education and it increases with age, stabilizing when people reach their highest career potential in middle age. Even when education, occupation and socioeconomic background are held constant, the correlation does not vanish.[73]

Other correlates
The g factor is reflected in many social outcomes. Many social behavior problems, such as dropping out of school, chronic welfare dependency, accident proneness, and crime, are negatively correlated with g independent of social class of origin.[74] Health and mortality outcomes are also linked to g, with higher childhood test scores predicting better health and lower mortality in adulthood (see Cognitive epidemiology).[75]

Genetic and environmental determinants


Heritability is the proportion of phenotypic variance in a trait in a population that can be attributed to genetic factors. The heritability of g has been estimated to fall between 40 and 80 percent using twin, adoption, and other family study designs as well as molecular genetic methods. It has been found to increase linearly with age. For example, a large study involving more than 11,000 pairs of twins from four countries reported the heritability of g to be 41 percent at age nine, 55 percent at age twelve, and 66 percent at age seventeen. Other studies have estimated that the heritability is as high as 80 percent in adulthood, although it may decline in old age. Most of the research on the heritability of g has been conducted in the USA and Western Europe, but studies in Russia (Moscow), the former East Germany, Japan, and rural India have yielded heritability estimates similar to those of Western studies.[76][77][78][40]

Behavioral genetic research has also established that the shared (or between-family) environmental effects on g are strong in childhood, but decline thereafter and are negligible in adulthood. This indicates that the environmental effects that are important to the development of g are unique and not shared between members of the same family.[77]

The genetic correlation is a statistic that indicates the extent to which the same genetic effects influence two different traits. If the genetic correlation between two traits is zero, the genetic effects on them are independent, whereas a correlation of 1.0 means that the same set of genes explains the heritability of both traits (regardless of how high or low the heritability of each is). Genetic correlations between specific mental abilities (such as verbal ability and spatial ability) have been consistently found to be very high, close to 1.0. This indicates that genetic variation in cognitive abilities is almost entirely due to genetic variation in whatever g is. It also suggests that what is common among cognitive abilities is largely caused by genes, and that independence among abilities is largely due to environmental effects. Thus it has been argued that when genes for intelligence are identified, they will be "generalist genes", each affecting many different cognitive abilities.[77][79][80]

The g loadings of mental tests have been found to correlate with their heritabilities, with correlations ranging from moderate to perfect in various studies. Thus the heritability of a mental test is usually higher the larger its g loading is.[34]

Much research points to g being a highly polygenic trait influenced by a large number of common genetic variants, each having only small effects. Another possibility is that heritable differences in g are due to individuals having different "loads" of rare, deleterious mutations, with genetic variation among individuals persisting due to mutation–selection balance.[80][81] A number of candidate genes have been reported to be associated with intelligence differences, but the effect sizes have been small and almost none of the findings have been replicated. No individual genetic variants have been conclusively linked to intelligence in the normal range so far. Many researchers believe that very large samples will be needed to reliably detect individual genetic polymorphisms associated with g.[40][81] However, while genes influencing variation in g in the normal range have proven difficult to find, a large number of single-gene disorders with mental retardation among their symptoms have been discovered.[82]

Several studies suggest that tests with larger g loadings are more affected by inbreeding depression lowering test scores. There is also evidence that tests with larger g loadings are associated with larger positive heterotic effects on test scores. Inbreeding depression and heterosis suggest the presence of genetic dominance effects for g.[83]
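As a very rough illustration of how twin designs yield the kinds of heritability estimates cited earlier in this section, the classical Falconer approximation contrasts identical (MZ) and fraternal (DZ) twin correlations. The studies referenced above use more elaborate model fitting, and the twin correlations below are invented for illustration only.

```python
def falconer_estimates(r_mz, r_dz):
    """Classical Falconer approximations from twin correlations: additive genetic
    variance (heritability), shared environment, and non-shared environment."""
    a2 = 2 * (r_mz - r_dz)      # heritability
    c2 = r_mz - a2              # shared (between-family) environment
    e2 = 1 - r_mz               # non-shared environment plus measurement error
    return a2, c2, e2

# Invented twin correlations for adult IQ-like scores.
a2, c2, e2 = falconer_estimates(r_mz=0.75, r_dz=0.40)
print(f"heritability ~ {a2:.2f}, shared environment ~ {c2:.2f}, non-shared ~ {e2:.2f}")
```

With these invented inputs the estimates echo the qualitative pattern described above: substantial heritability in adulthood and a negligible shared-environment component.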


Neuroscientific findings
g has a number of correlates in the brain. Studies using magnetic resonance imaging (MRI) have established that g and total brain volume are moderately correlated (r ≈ .3–.4). External head size has a correlation of approximately .2 with g. MRI research on brain regions indicates that the volumes of frontal, parietal and temporal cortices, and the hippocampus are also correlated with g, generally at .25 or more, while the correlations, averaged over many studies, with overall grey matter and overall white matter have been found to be .31 and .27, respectively. Some but not all studies have also found positive correlations between g and cortical thickness. However, the underlying reasons for these associations between the quantity of brain tissue and differences in cognitive abilities remain largely unknown.[2]

Most researchers believe that intelligence cannot be localized to a single brain region, such as the frontal lobe. It has been suggested that intelligence could be characterized as a small-world network. For example, high intelligence could be dependent on unobstructed transfer of information between the involved brain regions along white matter fibers. Brain lesion studies have found small but consistent associations indicating that people with more white matter lesions tend to have lower cognitive ability. Research utilizing NMR spectroscopy has discovered somewhat inconsistent but generally positive correlations between intelligence and white matter integrity, supporting the notion that white matter is important for intelligence.[2]

Some research suggests that, aside from the integrity of white matter, its organizational efficiency is also related to intelligence. The hypothesis that brain efficiency has a role in intelligence is supported by functional MRI research showing that more intelligent people generally process information more efficiently, i.e., they use fewer brain resources for the same task than less intelligent people.[2] Small but relatively consistent associations with intelligence test scores also include brain activity, as measured by EEG records or event-related potentials, and nerve conduction velocity.[84][85]


Other biological associations


Height is correlated with intelligence (r ≈ .2), but this correlation has not generally been found within families (i.e., among siblings), suggesting that it results from cross-assortative mating for height and intelligence. Myopia is known to be associated with intelligence, with a correlation of around .2 to .25, and this association has been found within families, too.[86]

There is some evidence that a g factor underlies the abilities of nonhuman animals, too. Several studies suggest that a general factor accounts for a substantial percentage of covariance in cognitive tasks given to such animals as rats, mice, and rhesus monkeys.[87][85]

Group similarities and differences


Cross-cultural studies indicate that the g factor can be observed whenever a battery of diverse, complex cognitive tests is administered to a human sample. The factor structure of IQ tests has also been found to be consistent across sexes and ethnic groups in the U.S. and elsewhere.[85] The g factor has been found to be the most invariant of all factors in cross-cultural comparisons. For example, when the g factors computed from an American standardization sample of Wechsler's IQ battery and from large samples who completed the Japanese translation of the same battery were compared, the congruence coefficient was .99, indicating virtual identity. Similarly, the congruence coefficient between the g factors obtained from white and black standardization samples of the WISC battery in the U.S. was .995, and the variance in test scores accounted for by g was highly similar for both groups.[88]

Most studies suggest that there are negligible differences in the mean level of g between the sexes, and that sex differences in cognitive abilities are to be found in narrower domains. For example, males generally outperform females in spatial tasks, while females generally outperform males in verbal tasks. Another difference that has been found in many studies is that males show more variability in both general and specific abilities than females, with proportionately more males at both the low end and the high end of the test score distribution.[89]

Consistent differences between racial and ethnic groups in g have been found, particularly in the U.S. A 2001 meta-analysis of millions of subjects indicated that there is a 1.1 standard deviation gap in the mean level of g between white and black Americans, favoring the former. The mean score of Hispanic Americans was found to be .72 standard deviations below that of non-Hispanic whites.[90] In contrast, Americans of East Asian descent generally slightly outscore white Americans.[91] Several researchers have suggested that the magnitude of the black-white gap in cognitive ability tests is dependent on the magnitude of the test's g loading, with tests showing higher g loadings producing larger gaps (see Spearman's hypothesis).[92] It has also been claimed that racial and ethnic differences similar to those found in the U.S. can be observed globally.[93]
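The congruence coefficients quoted above (e.g., .99 and .995) are Tucker's congruence coefficients, which are simply the cosine of the angle between two vectors of factor loadings. The loading vectors in this sketch are made up for illustration; values near 1 indicate virtually identical factors.

```python
import numpy as np

def congruence_coefficient(x, y):
    """Tucker's congruence coefficient between two vectors of factor loadings."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

# Hypothetical g loadings of the same six subtests in two standardization samples.
loadings_sample_1 = [0.78, 0.71, 0.65, 0.80, 0.60, 0.73]
loadings_sample_2 = [0.75, 0.74, 0.62, 0.82, 0.63, 0.70]
print(round(congruence_coefficient(loadings_sample_1, loadings_sample_2), 3))
# Values near 1 indicate that the two factors are virtually identical.
```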


Relation to other psychological constructs


Elementary cognitive tasks
Elementary cognitive tasks (ECTs) also correlate strongly with g. ECTs are, as the name suggests, simple tasks that apparently require very little intelligence, but still correlate strongly with more exhaustive intelligence tests. Determining whether a light is red or blue and determining whether there are four or five squares drawn on a computer screen are two examples of ECTs. The answers to such questions are usually provided by quickly pressing buttons. Often, in addition to buttons for the two options provided, a third button is held down from the start of the test. When the stimulus is given to the subject, they remove their hand from the starting button and move it to the button of the correct answer. This allows the examiner to determine how much time was spent thinking about the answer to the question (reaction time, usually measured in small fractions of a second), and how much time was spent on physical hand movement to the correct button (movement time). Reaction time correlates strongly with g, while movement time correlates less strongly.[94] ECT testing has allowed quantitative examination of hypotheses concerning test bias, subject motivation, and group differences. By virtue of their simplicity, ECTs provide a link between classical IQ testing and biological inquiries such as fMRI studies.

An illustration of the Jensen box, an apparatus for measuring choice reaction time.

Working memory
One theory holds that g is identical or nearly identical to working memory capacity. Among other evidence for this view, some studies have found factors representing g and working memory to be perfectly correlated. However, in a meta-analysis the correlation was found to be considerably lower.[95] One criticism that has been made of studies that identify g with working memory is that "we do not advance understanding by showing that one mysterious concept is linked to another."[96]

Piagetian tasks
Psychometric theories of intelligence aim at quantifying intellectual growth and identifying ability differences between individuals and groups. In contrast, Jean Piaget's theory of cognitive development seeks to understand qualitative changes in children's intellectual development. Piaget designed a number of tasks to verify hypotheses arising from his theory. The tasks were not intended to measure individual differences, and they have no equivalent in psychometric intelligence tests.[97][98] For example, in one of the best-known Piagetian conservation tasks a child is asked if the amount of water in two identical glasses is the same. After the child agrees that the amount is the same, the investigator pours the water from one of the glasses into a glass of different shape so that the amount appears different although it remains the same. The child is then asked if the amount of water in the two glasses is the same or different.

Notwithstanding the different research traditions in which psychometric tests and Piagetian tasks were developed, the correlations between the two types of measures have been found to be consistently positive and generally moderate in magnitude. A common general factor underlies them. It has been shown that it is possible to construct a battery consisting of Piagetian tasks that is as good a measure of g as standard IQ tests.[99][100]


Personality
The traditional view in psychology is that there is no meaningful relationship between personality and intelligence, and that the two should be studied separately. Intelligence can be understood in terms of what an individual can do, or what his or her maximal performance is, while personality can be thought of in terms of what an individual will typically do, or what his or her general tendencies of behavior are. Research has indicated that correlations between measures of intelligence and personality are small, and it has thus been argued that g is a purely cognitive variable that is independent of personality traits. In a 2007 meta-analysis the correlations between g and the "Big Five" personality traits were found to be as follows:

conscientiousness -.04
agreeableness .00
extraversion .02
openness .22
emotional stability .09

The same meta-analysis found a correlation of .20 between self-efficacy and g.[101][102][103] Some researchers have argued that the associations between intelligence and personality, albeit modest, are consistent. They have interpreted correlations between intelligence and personality measures in two main ways. The first perspective is that personality traits influence performance on intelligence tests. For example, a person may fail to perform at a maximal level on an IQ test due to his or her anxiety and stress-proneness. The second perspective considers intelligence and personality to be conceptually related, with personality traits determining how people apply and invest their cognitive abilities, leading to knowledge expansion and greater cognitive differentiation.[101][104]

Creativity
Some researchers believe that there is a threshold level of g below which socially significant creativity is rare, but that otherwise there is no relationship between the two. It has been suggested that this threshold is at least one standard deviation above the population mean. Above the threshold, personality differences are believed to be important determinants of individual variation in creativity.[105][106]

Others have challenged the threshold theory. While not disputing that opportunity and personal attributes other than intelligence, such as energy and commitment, are important for creativity, they argue that g is positively associated with creativity even at the high end of the ability distribution. The longitudinal Study of Mathematically Precocious Youth has provided evidence for this contention. It has shown that individuals identified by standardized tests as intellectually gifted in early adolescence accomplish creative achievements (for example, securing patents or publishing literary or scientific works) at several times the rate of the general population, and that even within the top 1 percent of cognitive ability, those with higher ability are more likely to make outstanding achievements. The study has also suggested that the level of g acts as a predictor of the level of achievement, while specific cognitive ability patterns predict the realm of achievement.[107][108]


Challenges to g
Gf-Gc theory
Raymond Cattell, a student of Charles Spearman's, rejected the unitary g factor model and divided g into two broad, relatively independent domains: fluid intelligence (Gf) and crystallized intelligence (Gc). Gf is conceptualized as a capacity to figure out novel problems, and it is best assessed with tests with little cultural or scholastic content, such as Raven's matrices. Gc can be thought of as consolidated knowledge, reflecting the skills and information that an individual acquires and retains throughout his or her life. Gc is dependent on education and other forms of acculturation, and it is best assessed with tests that emphasize scholastic and cultural knowledge.[109][44][2] Gf can be thought to primarily consist of current reasoning and problem solving capabilities, while Gc reflects the outcome of previously executed cognitive processes.[110]

The rationale for the separation of Gf and Gc was to explain individuals' cognitive development over time. While Gf and Gc have been found to be highly correlated, they differ in the way they change over a lifetime. Gf tends to peak at around age 20, slowly declining thereafter. In contrast, Gc is stable or increases across adulthood. A single general factor has been criticized as obscuring this bifurcated pattern of development. Cattell argued that Gf reflected individual differences in the efficiency of the central nervous system. Gc was, in Cattell's thinking, the result of a person "investing" his or her Gf in learning experiences throughout life.[44][2][111][31]

Cattell, together with John Horn, later expanded the Gf-Gc model to include a number of other broad abilities, such as Gq (quantitative reasoning) and Gv (visual-spatial reasoning). While all the broad ability factors in the extended Gf-Gc model are positively correlated and thus would enable the extraction of a higher order g factor, Cattell and Horn maintained that it would be erroneous to posit that a general factor underlies these broad abilities. They argued that g factors computed from different test batteries are not invariant and would give different values of g, and that the correlations among tests arise because it is difficult to test just one ability at a time.[112][113][2]

However, several researchers have suggested that the Gf-Gc model is compatible with a g-centered understanding of cognitive abilities. For example, John B. Carroll's three-stratum model of intelligence includes both Gf and Gc together with a higher-order g factor. Based on factor analyses of many data sets, some researchers have also argued that Gf and g are one and the same factor and that g factors from different test batteries are substantially invariant provided that the batteries are large and diverse.[44][114][115]

Theories of uncorrelated abilities


Several theorists have proposed that there are intellectual abilities that are uncorrelated with each other. Among the earliest was L.L. Thurstone, who created a model of primary mental abilities representing supposedly independent domains of intelligence. However, Thurstone's tests of these abilities were found to produce a strong general factor. He argued that the lack of independence among his tests reflected the difficulty of constructing "factorially pure" tests that measured just one ability. Similarly, J.P. Guilford proposed a model of intelligence that comprised up to 180 distinct, uncorrelated abilities, and claimed to be able to test all of them. Later analyses have shown that the factorial procedures Guilford presented as evidence for his theory did not provide support for it, and that the test data that he claimed provided evidence against g did in fact exhibit the usual pattern of intercorrelations after correction for statistical artifacts.[116][117]

More recently, Howard Gardner has developed the theory of multiple intelligences. He posits the existence of eight different and independent domains of intelligence, such as linguistic, spatial, musical, and bodily-kinesthetic intelligences, and contends that individuals who fail in some of them may excel in others. According to Gardner, tests and schools traditionally emphasize only linguistic and logical abilities while neglecting other forms of intelligence. While popular among educationalists, Gardner's theory has been much criticized by psychologists and psychometricians. One criticism is that the theory does violence to both scientific and everyday usages of the word "intelligence." Several researchers have argued that not all of Gardner's intelligences fall within the cognitive sphere.

For example, Gardner contends that a successful career in professional sports or popular music reflects bodily-kinesthetic intelligence and musical intelligence, respectively, even though one might usually talk of athletic and musical skills, talents, or abilities instead. Another criticism of Gardner's theory is that many of his purportedly independent domains of intelligence are in fact correlated with each other. Responding to empirical analyses showing correlations between the domains, Gardner has argued that the correlations exist because of the common format of tests and because all tests require linguistic and logical skills. His critics have in turn pointed out that not all IQ tests are administered in the paper-and-pencil format, that aside from linguistic and logical abilities, IQ test batteries also contain measures of, for example, spatial abilities, and that elementary cognitive tasks (for example, inspection time and reaction time) that do not involve linguistic or logical reasoning correlate with conventional IQ batteries, too.[118][119][67][120]

Robert Sternberg, working with various colleagues, has also suggested that intelligence has dimensions independent of g. He argues that there are three classes of intelligence: analytic, practical, and creative. According to Sternberg, traditional psychometric tests measure only analytic intelligence, and should be augmented to test creative and practical intelligence as well. He has devised several tests to this effect. Sternberg equates analytic intelligence with academic intelligence, and contrasts it with practical intelligence, defined as an ability to deal with ill-defined real-life problems. Tacit knowledge is an important component of practical intelligence, consisting of knowledge that is not explicitly taught but is required in many real-life situations. Assessing creativity independent of intelligence tests has traditionally proved difficult, but Sternberg and colleagues have claimed to have created valid tests of creativity, too. The validation of Sternberg's theory requires that the three abilities tested are substantially uncorrelated and have independent predictive validity. Sternberg has conducted many experiments which he claims confirm the validity of his theory, but several researchers have disputed this conclusion. For example, in his reanalysis of a validation study of Sternberg's STAT test, Nathan Brody showed that the predictive validity of the STAT, a test of three allegedly independent abilities, was solely due to a single general factor underlying the tests, which Brody equated with the g factor.[121][122]


Other criticisms
Perhaps the most famous critique of the construct of g is that of the paleontologist and biologist Stephen Jay Gould, presented in his 1981 book The Mismeasure of Man. He argued that psychometricians have fallaciously reified the g factor as a physical thing in the brain, even though it is simply the product of statistical calculations (i.e., factor analysis). He further noted that it is possible to produce factor solutions of cognitive test data that do not contain a g factor yet explain the same amount of information as solutions that yield a g. According to Gould, there is no rationale for preferring one factor solution to another, and factor analysis therefore does not lend support to the existence of an entity like g. More generally, Gould criticized the g theory for abstracting intelligence as a single entity and for ranking people "in a single series of worthiness", arguing that such rankings are used to justify the oppression of disadvantaged groups.[123][37]

Many researchers have criticized Gould's arguments. For example, they have rejected the accusation of reification, maintaining that the use of extracted factors such as g as potential causal variables whose reality can be supported or rejected by further investigations constitutes a normal scientific practice that in no way distinguishes psychometrics from other sciences. Critics have also suggested that Gould did not understand the purpose of factor analysis, and that he was ignorant of relevant methodological advances in the field. While different factor solutions may be mathematically equivalent in their ability to account for intercorrelations among tests, solutions that yield a g factor are psychologically preferable for several reasons extrinsic to factor analysis, including the phenomenon of the positive manifold, the fact that the same g can emerge from quite different test batteries, the widespread practical validity of g, and the linkage of g to many biological variables.[38][124][37]

John Horn and John McArdle have argued that the modern g theory, as espoused by, for example, Arthur Jensen, is unfalsifiable, because the existence of a common factor follows tautologically from positive correlations among tests. They contrasted the modern hierarchical theory of g with Spearman's original two-factor theory, which was readily falsifiable (and indeed was falsified).[31]


Notes
[1] [2] [3] [4] [5] Kamphaus et al. 2005 Deary et al. 2010 Jensen 1998, 545 Neisser et al. 1996 Adapted from Jensen 1998, 24. The correlation matrix was originally published in Spearman 1904, and it is based on the school performance of a sample of English children. While this analysis is historically important and has been highly influential, it does not meet modern technical standards. See Mackintosh 2011, 44ff. and Horn & McArdle 2007 for discussion of Spearman's methods. [6] Adapted from Chabris 2007, Table 19.1. [7] Gottfredson 1998 [8] Deary 2001, 12 [9] Spearman 1904 [10] Deary 2000, 6 [11] Jensen 1992 [12] Jensen 1998, 28 [13] van deer Maas et al. 2006 [14] Jensen 1998, 26, 3639 [15] Jensen 1998, 26, 3639, 8990 [16] Jensen 2002 [17] Floyd et al. 2009 [18] Jensen 1980, 213 [19] Jensen 1992 [20] Jensen 1980, 213 [21] Jensen 1998, 94 [22] Hunt 2011, 94 [23] Jensen 1998, 1819, 3536, 38. The idea of a general, unitary mental ability was introduced to psychology by Herbert Spencer and Francis Galton in the latter half of the 19th century, but their work was largely speculative, with little empirical basis. [24] Jensen 2002 [25] Jensen 1998, 9192, 95 [26] Jensen 2000 [27] Mackintosh 2011, 157 [28] Jensen 1998, 117 [29] Bartholomew et al. 2009 [30] Jensen 1998, 120 [31] Horn & McArdle 2007 [32] Jensen 1998, 120121 [33] Mackintosh 2011, 157158 [34] Rushton & Jensen 2010 [35] Mackintosh 2011, 4445 [36] Jensen 1998, 18, 3132 [37] Carroll 1995 [38] Jensen 1982 [39] Jensen 1998, 73 [40] Deary 2012 [41] Mackintosh 2011, 57 [42] Jensen 1998, 46 [43] Carroll 1997. The total common factor variance consists of the variance due to the g factor and the group factors considered together. The variance not accounted for by the common factors, referred to as uniqueness, comprises subtest-specific variance and measurement error. [44] Davidson & Kemp 2011 [45] Mackintosh 2011, 151 [46] Jensen 1998, 31 [47] [48] [49] [50] Mackintosh 2011, 151153 McGrew 2005 Kvist & Gustafsson 2008 Hunt 2011, 94

[51] Johnson et al. 2004 [52] Johnson et al. 2008 [53] Mackintosh 2011, 150153. See also Keith et al. 2001 where the g factors from the CAS and WJ III test batteries were found to be statistically indistinguishable. [54] Jensen 1998, 88, 101103 [55] Spearman 1927 [56] Detterman & Daniel 1989 [57] Deary & Pagliari 1991 [58] Deary et al. 1996 [59] Tucker-Drob 2009 [60] Jensen 1998, 270 [61] Gottfredson 2002 [62] Sackett et al. 2008 [63] Jensen 1998, 272, 301 [64] Jensen 1998, 270 [65] Jensen 1998, 279280 [66] Jensen 1998, 279 [67] Brody 2006 [68] Frey & Detterman 2003 [69] Schmidt & Hunter 2004 [70] Jensen 1998, 292293 [71] Schmidt & Hunter 2004. These validity coefficients have been corrected for measurement error in the dependent variable (i.e., job or training performance) and for range restriction but not for measurement error in the independent variable (i.e., measures of g). [72] Jensen 1998, 270 [73] Jensen 1998, 568 [74] Jensen 1998, 271 [75] Gottfredson 2007 [76] Deary et al. 2006 [77] Plomin & Spinath 2004 [78] Haworth et al. 2010 [79] Kovas & Plomin 2006 [80] Penke et al. 2007 [81] Chabris et al. 2012 [82] Plomin 2003 [83] Jensen 1998, 189197 [84] Mackintosh 2011, 134138 [85] Chabris 2007 [86] Jensen 1998, 146, 149150 [87] Jensen 1998, 164165 [88] Jensen 1998, 8788 [89] Mackintosh 2011, 360373 [90] Roth et al. 2001 [91] Hunt 2011, 421 [92] Jensen 1998, 369399 [93] Lynn 2003 [94] Jensen 1998, 213 [95] Ackerman et al. 2005 [96] Mackintosh 2011, 158 [97] Weinberg 1989 [98] Lautrey 2002 [99] Humphreys et al. 1985 [100] Weinberg 1989 [101] von Stumm et al. 2011 [102] Jensen 1998, 573 [103] Judge et al. 2007 [104] von Stumm et al. 2009 [105] Jensen 1998, 577 [106] Eysenck 1995 [107] Lubinski 2009


[108] [109] [110] [111] [112] [113] [114] [115] [116] [117] [118] [119] [120] [121] [122] [123] [124] Robertson et al. 2010 Jensen 1998, 122123 Sternberg et al. 1981 Jensen 1998, 123 Jensen 1998, 124 McGrew 2005 Jensen 1998, 125 Mackintosh 2011, 152153 Jensen 1998, 7778, 115117 Mackintosh 2011, 52, 239 Jensen 1998, 128132 Deary 2001, 1516 Mackintosh 2011, 236237 Hunt 2011, 120130 Mackintosh 2011, 223235 Gould 1996, 5657 Korb 1994


References
Ackerman, P. L., Beier, M. E., & Boyle, M. O. (2005). Working memory and intelligence: The same or different constructs? Psychological Bulletin, 131, 3060. Bartholomew, D.J., Deary, I.J., & Lawn, M. (2009). A New Lease of Life for Thomsons Bonds Model of Intelligence. (http://www.psy.ed.ac.uk/people/iand/Bartholomew (2009) Psych Review thomson intelligence.pdf) Psychological Review, 116, 567579. Brody, N. (2006). Geocentric theory: A valid alternative to Gardner's theory of intelligence. In Schaler J. A. (Ed.), Howard Gardner under fire: The rebel psychologist faces his critics. Chicago: Open Court. Carroll, J.B. (1995). Reflections on Stephen Jay Gould's The Mismeasure of Man (1981): A Retrospective Review. (http://www.psych.utoronto.ca/users/reingold/courses/intelligence/cache/carroll-gould.html) Intelligence, 21, 121134. Carroll, J.B. (1997). Psychometrics, Intelligence, and Public Perception. (http://www.iapsych.com/wj3ewok/ LinkedDocuments/carroll1997.pdf) Intelligence, 24, 2552. Chabris, C.F. (2007). Cognitive and Neurobiological Mechanisms of the Law of General Intelligence. (http:// www.wjh.harvard.edu/~cfc/Chabris2007a.pdf) In Roberts, M. J. (Ed.) Integrating the mind: Domain general versus domain specific processes in higher cognition. Hove, UK: Psychology Press. Chabris, C.F., Hebert, B.M, Benjamin, D.J., Beauchamp, J.P., Cesarini, D., van der Loos, M.J.H.M., Johannesson, M., Magnusson, P.K.E., Lichtenstein, P., Atwood, C.S., Freese, J., Hauser, T.S., Hauser, R.M., Christakis, N.A., and Laibson, D. (2012). "Most Reported Genetic Associations with General Intelligence Are Probably False Positives" (http://coglab.wjh.harvard.edu/~cfc/Chabris2012a-FalsePositivesGenesIQ.pdf). Psychological Science 23 (11): 13141323. Davidson, J.E. & Kemp, I.A. (2011). Contemporary models of intelligence. In R.J. Sternberg & S.B. Kaufman (Eds.), The Cambridge Handbook of Intelligence. New York, NY: Cambridge University Press. Deary, I.J. (2012). Intelligence. Annual Review of Psychology, 63, 453482. Deary, I.J. (2001). Intelligence. A Very Short Introduction. Oxford: Oxford University Press. Deary I.J. (2000). Looking Down on Human Intelligence: From Psychometrics to the Brain. Oxford, England: Oxford University Press. Deary, I.J., & Pagliari, C. (1991). The strength of g at different levels of ability: Have Detterman and Daniel rediscovered Spearmans law of diminishing returns? Intelligence, 15, 247250. Deary, I.J., Egan, V., Gibson, G.J., Brand, C.R., Austin, E., & Kellaghan, T. (1996). Intelligence and the differentiation hypothesis. Intelligence, 23, 105132. Deary, I.J., Spinath, F.M. & Bates, T.C. (2006). Genetics of intelligence. Eur J Hum Genet, 14, 690700.

G factor (psychometrics) Deary, I.J., Penke, L., & Johnson, W. (2010). The neuroscience of human intelligence differences (http://www. larspenke.eu/pdfs/Deary_Penke_Johnson_2010_-_Neuroscience_of_intelligence_review.pdf). Nature Reviews Neuroscience, 11, 201211. Detterman, D.K., & Daniel, M.H. (1989). Correlations of mental tests with each other and with cognitive variables are highest for low-IQ groups. Intelligence, 13, 349359. Eysenck, H.J. (1995). Creativity as a product of intelligence and personality. In Saklofske, D.H. & Zeidner, M. (Eds.), International Handbook of Personality and Intelligence (pp. 231247). New York, NY, US: Plenum Press. Floyd, R. G., Shands, E. I., Rafael, F. A., Bergeron, R., & McGrew, K. S. (2009). The dependability of general-factor loadings: The effects of factor-extraction methods, test battery composition, test battery size, and their interactions. (http://www.iapsych.com/kmpubs/floyd2009b.pdf) Intelligence, 37, 453465. Frey, M. C.; Detterman, D. K. (2003). "Scholastic Assessment or g? The Relationship Between the Scholastic Assessment Test and General Cognitive Ability" (http://www.psychologicalscience.org/pdf/ps/frey.pdf). Psychological Science 15 (6): 373378. doi: 10.1111/j.0956-7976.2004.00687.x (http://dx.doi.org/10.1111/j. 0956-7976.2004.00687.x). PMID 15147489 (http://www.ncbi.nlm.nih.gov/pubmed/15147489). Gottfredson, L. S. (1998, Winter). The general intelligence factor. Scientific American Presents, 9(4), 2429. Gottfredson, L. S. (2002). g: Highly general and highly practical. Pages 331380 in R. J. Sternberg & E. L. Grigorenko (Eds.), The general factor of intelligence: How general is it? Mahwah, NJ: Erlbaum. Gottfredson, L.S. (2007). Innovation, fatal accidents, and the evolution of general intelligence. (http://www. udel.edu/educ/gottfredson/reprints/2007evolutionofintelligence.pdf) In M. J. Roberts (Ed.), Integrating the mind: Domain general versus domain specific processes in higher cognition (pp. 387425). Hove, UK: Psychology Press. Gottfredson, L.S. (2011). Intelligence and social inequality: Why the biological link? (http://www.udel.edu/ educ/gottfredson/reprints/2011SocialInequality.pdf) Pp. 538575 in T. Chamorro-Premuzic, A. Furhnam, & S. von Stumm (Eds.), Handbook of Individual Differences. Wiley-Blackwell. Gould, S.J. (1996, Revised Edition). The Mismeasure of Man. New York: W. W. Norton & Company. Haworth, C.M.A. et al. (2010). The heritability of general cognitive ability increases linearly from childhood to young adulthood. Mol Psychiatry, 15, 11121120. Horn, J. L. & McArdle, J.J. (2007). Understanding human intelligence since Spearman. In R. Cudeck & R. MacCallum, (Eds.). Factor Analysis at 100 years (pp. 205247). Mahwah, NJ: Lawrence Erlbaum Associates, Inc. Humphreys, L.G., Rich, S.A. & Davey, T.C. (1985). A Piagetian Test of General Intelligence. Developmental Psychology, 21, 872877. Hunt, E.B. (2011). Human Intelligence. Cambridge, UK: Cambridge University Press. Jensen, A.R. (1980). Bias in Mental Testing. New York: The Free Press. Jensen, A.R. (1982). The Debunking of Scientific Fossils and Straw Persons. (http://www.debunker.com/texts/ jensen.html) Contemporary Education Review, 1, 121135. Jensen, A.R. (1992). Understanding g in terms of information processing. Educational Psychology Review, 4, 271308. Jensen, A.R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger. ISBN 0-275-96103-6 Jensen, A.R. (2000). A Nihilistic Philosophy of Science for a Scientific Psychology? (http://www.cogsci.ecs. 
soton.ac.uk/cgi/psyc/newpsy?11.088) Psycoloquy, 11, Issue 088, Article 49. Jensen, A.R. (2002). Psychometric g: Definition and substantiation. In R.J. Sternberg & E.L. Grigorenko (Eds.), General factor of intelligence: How general is it? (pp. 39–54). Mahwah, NJ: Erlbaum. Johnson, W., Bouchard, T.J., Krueger, R.F., McGue, M. & Gottesman, I.I. (2004). Just one g: Consistent results from three test batteries. Intelligence, 32, 95–107.


G factor (psychometrics) Johnson, W., te Nijenhuis, J. & Bouchard Jr., T. (2008). Still just 1 g: Consistent results from five test batteries. Intelligence, 36, 8195. Judge, T. A., Jackson, C. L., Shaw, J. C., Scott, B. A., and Rich, B. L. (2007). Self-efficacy and work-related performance: The integral role of individual differences. Journal of Applied Psychology, 92, 107127. Kamphaus, R.W., Winsor, A.P., Rowe, E.W., & Kim, S. (2005). A history of intelligence test interpretation. In D.P. Flanagan and P.L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (2nd Ed.) (pp. 2338). New York: Guilford. Kane, M. J., Hambrick, D. Z., & Conway, A. R. A. (2005). Working memory capacity and fluid intelligence are strongly related constructs: Comment on Ackerman, Beier, and Boyle (2004). Psychological Bulletin, 131, 6671. Keith, T.Z., Kranzler, J.H., and Flanagan, D.P. (2001). What does the Cognitive Assessment System (CAS) measure? Joint confirmatory factor analysis of the CAS and the Woodcock-Johnson Tests of Cognitive Ability (3rd Edition). School Psychology Review, 30, 89119. Korb, K. B. (1994). Stephen Jay Gould on intelligence. Cognition, 52, 111123. Kovas, Y. & Plomin, R. (2006). Generalist genes: implications for the cognitive sciences. TRENDS in Cognitive Sciences, 10, 198203. Kvist, A. & Gustafsson, J.-E. (2008). The relation between fluid intelligence and the general factor as a function of cultural background: A test of Cattell's Investment theory. Intelligence 36, 422436. Lautrey, J. (2002). Is there a general factor of cognitive development? In Sternberg, R.J. & Grigorenko, E.L. (Eds.), The general factor of intelligence: How general is it? Mahwah, NJ: Erlbaum. Lubinski, D. (2009). Exceptional Cognitive Ability: The Phenotype. Behavior Genetics, 39, 350358, DOI: 10.1007/s10519-009-9273-0. Lynn, R. (2003). The Geography of Intelligence. In Nyborg, H. (ed.), The Scientific Study of General Intelligence: Tribute to Arthur R. Jensen (pp. 126146). Oxford: Pergamon. Mackintosh, N.J. (2011). IQ and Human Intelligence. Oxford, UK: Oxford University Press. McGrew, K.S. (2005). The Cattell-Horn-Carroll Theory of Cognitive Abilities: Past, Present, and Future. Contemporary Intellectual Assessment: Theories, Tests, and Issues. (pp. 136181) New York, NY, US: Guilford Press Flanagan, Dawn P. (Ed); Harrison, Patti L. (Ed), (2005). xvii, 667 pp. Neisser, U., Boodoo, G., Bouchard Jr., T.J., Boykin, A.W., Brody, N., Ceci, S.J., Halpern, D.F., Loehlin, J.C. & Perloff, R. (1996). "Intelligence: Knowns and Unknowns". American Psychologist, 51, 77101 Oberauer, K., Schulze, R., Wilhelm, O., & S, H.-M. (2005). Working memory and intelligence their correlation and their relation: A comment on Ackerman, Beier, and Boyle (2005). Psychological Bulletin, 131, 6165. Penke, L., Denissen, J.J.A., and Miller, G.F. (2007). The Evolutionary Genetics of Personality (http:// matthewckeller.com/Penke_EvoGenPersonality_2007.pdf). European Journal of Personality, 21, 549587. Plomin, R. (2003). Genetics, genes, genomics and g. Molecular Psychiatry, 8, 15. Plomin, R. & Spinath, F.M. (2004). Intelligence: genetics, genes, and genomics. J Pers Soc Psychol, 86, 112129. Robertson, K.F., Smeets, S., Lubinski, D., & Benbow, C.P. (2010). Beyond the Threshold Hypothesis: Even Among the Gifted and Top Math/Science Graduate Students, Cognitive Abilities, Vocational Interests, and Lifestyle Preferences Matter for Career Choice, Performance, and Persistence. 
Current Directions in Psychological Science, 19, 346351. Roth, P.L., Bevier, C.A., Bobko, P., Switzer, F.S., III, & Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297330. Rushton, J.P. & Jensen, A.R. (2010). The rise and fall of the Flynn Effect as a reason to expect a narrowing of the BlackWhite IQ gap. Intelligence, 38, 213219. doi:10.1016/j.intell.2009.12.002. Sackett, P.R., Borneman, M.J., and Connelly, B.S. (2008). High-Stakes Testing in Higher Education and Employment. Appraising the Evidence for Validity and Fairness. American Psychologist, 63, 215227.


G factor (psychometrics) Schmidt, F.L. & Hunter, J. (2004). General Mental Ability in the World of Work: Occupational Attainment and Job Performance (http://www.unc.edu/~nielsen/soci708/cdocs/Schmidt_Hunter_2004.pdf). Journal of Personality and Social Psychology, 86, 162173. Spearman, C.E. (1904). "'General intelligence', Objectively Determined And Measured" (http://www.psych. umn.edu/faculty/waller/classes/FA2010/Readings/Spearman1904.pdf). American Journal of Psychology, 15, 201293. Spearman, C.E. (1927). The Abilities of Man. London: Macmillan. Sternberg, R. J., Conway, B. E., Ketron, J. L. & Bernstein, M. (1981). Peoples conception of intelligence. Journal of Personality and Social Psychology, 41, 3755. von Stumm, S., Chamorro-Premuzic, T., Quiroga, M.., and Colom, R. (2009). Separating narrow and general variances in intelligence-personality associations. Personality and Individual Differences, 47, 336341. von Stumm, S., Chamorro-Premuzic, T., Ackerman, P. L. (2011). Re-visiting intelligence-personality associations: Vindicating intellectual investment. In T. Chamorro-Premuzic, S. von Stumm, & A. Furnham (eds.), Handbook of Individual Differences. Chichester, UK: Wiley-Blackwell. Tucker-Drob, E.M. (2009). Differentiation of cognitive abilities across the life span. Developmental Psychology, 45, 10971118. van der Maas, H. L. J., Dolan, C. V., Grasman, R. P. P. P., Wicherts, J. M., Huizenga, H. M., & Raaijmakers, M. E. J. (2006). A dynamical model of general intelligence: The positive manifold of intelligence by mutualism. (http://wicherts.socsci.uva.nl/maas2006.pdf) Psychological Review, 13, 842860. Weinberg, R.A. (1989). Intelligence and IQ. Landmark Issues and Great Debates. American Psychologist, 44, 98104.


External links
The General Intelligence Factor by Linda S. Gottfredson (http://www.udel.edu/educ/gottfredson/reprints/1998generalintelligencefactor.pdf)

Francis Galton


Sir Francis Galton

Born: February 16, 1822, Birmingham, England
Died: 17 January 1911 (aged 88), Haslemere, Surrey, England
Residence: England
Nationality: English
Fields: Anthropology and polymathy
Institutions: Meteorological Council; Royal Geographical Society; King's College London
Alma mater: Cambridge University
Academic advisors: William Hopkins
Notable students: Karl Pearson
Known for: Eugenics; The Galton board; Regression toward the mean; Standard deviation; Weather map
Notable awards: Linnean Society of London's Darwin–Wallace Medal (1908); Copley Medal (1910)

Sir Francis Galton, FRS (16 February 1822 – 17 January 1911), cousin of Douglas Strutt Galton and half-cousin of Charles Darwin, was an English Victorian polymath: anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, psychometrician, and statistician. He was knighted in 1909.

Galton produced over 340 papers and books. He also created the statistical concept of correlation and widely promoted regression toward the mean. He was the first to apply statistical methods to the study of human differences and inheritance of intelligence, and introduced the use of questionnaires and surveys for collecting data on human communities, which he needed for genealogical and biographical works and for his anthropometric studies. He was a pioneer in eugenics, coining the term itself and the phrase "nature versus nurture". His book Hereditary Genius (1869) was the first social scientific attempt to study genius and greatness.[1]

As an investigator of the human mind, he founded psychometrics (the science of measuring mental faculties) and differential psychology, and formulated the lexical hypothesis of personality. He devised a method for classifying fingerprints that proved useful in forensic science. He also conducted research on the power of prayer, concluding that it had no effect, based on its null effects on the longevity of those prayed for.[2] As the initiator of scientific meteorology, he devised the first weather map, proposed a theory of anticyclones, and was the first to establish a complete record of short-term climatic phenomena on a European scale.[3] He also invented the Galton Whistle for testing differential hearing ability.[4]


Biography
Early life
Galton was born at "The Larches", a large house in the Sparkbrook area of Birmingham, England, built on the site of "Fair Hill", the former home of Joseph Priestley, which the botanist William Withering had renamed. He was Charles Darwin's half-cousin, sharing the common grandparent Erasmus Darwin. His father was Samuel Tertius Galton, son of Samuel "John" Galton. The Galtons were famous and highly successful Quaker gun-manufacturers and bankers, while the Darwins were distinguished in medicine and science. Both families boasted Fellows of the Royal Society and members who loved to invent in their spare time. Both Erasmus Darwin and Samuel Galton were founding members of the famous Lunar Society of Birmingham, whose members included Boulton, Watt, Wedgwood, Priestley, Edgeworth, and other distinguished scientists and industrialists. Likewise, both families were known for their literary talent: Erasmus Darwin composed lengthy technical treatises in verse; Galton's aunt Mary Anne Galton wrote on aesthetics and religion, and her notable autobiography detailed the unique environment of her childhood, populated by Lunar Society members.

Galton was by many accounts a child prodigy: he was reading by the age of two, at age five he knew some Greek, Latin and long division, and by the age of six he had moved on to adult books, including Shakespeare for pleasure, and poetry, which he quoted at length (Bulmer 2003, p. 4). Later in life, Galton would propose a connection between genius and insanity based on his own experience. He stated, "Men who leave their mark on the world are very often those who, being gifted and full of nervous power, are at the same time haunted and driven by a dominant idea, and are therefore within a measurable distance of insanity."[5]

Galton attended King Edward's School, Birmingham, but chafed at the narrow classical curriculum and left at 16.[6] His parents pressed him to enter the medical profession, and he studied for two years at Birmingham General Hospital and King's College, London Medical School. He followed this up with mathematical studies at Trinity College, University of Cambridge, from 1840 to early 1844.[7]

Portrait of Galton by Octavius Oakley, 1840

According to the records of the United Grand Lodge of England, it was in February 1844 that Galton became a freemason at the so-called Scientific lodge, held at the Red Lion Inn in Cambridge, progressing through the three masonic degrees as follows: Apprentice, 5 Feb 1844; Fellow Craft, 11 March 1844; Master Mason, 13 May 1844. A curious note in the record states: "Francis Galton Trinity College student, gained his certificate 13 March 1845".[8]

One of Galton's masonic certificates from Scientific lodge can be found among his papers at University College, London.[9]

A severe nervous breakdown altered Galton's original intention to try for honours. He elected instead to take a "poll" (pass) B.A. degree, like his half-cousin Charles Darwin (Bulmer 2003, p. 5). (Following the Cambridge custom, he was awarded an M.A. without further study, in 1847.) He then briefly resumed his medical studies. The death of his father in 1844 had left him financially independent but emotionally destitute,[10] and he terminated his medical studies entirely, turning to foreign travel, sport and technical invention.

In his early years Galton was an enthusiastic traveller, and made a notable solo trip through Eastern Europe to Constantinople, before going up to Cambridge. In 1845 and 1846 he went to Egypt and travelled down the Nile to Khartoum in the Sudan, and from there to Beirut, Damascus and down the Jordan. In 1850 he joined the Royal Geographical Society, and over the next two years mounted a long and difficult expedition into then little-known South West Africa (now Namibia). He wrote a successful book on his experience, "Narrative of an Explorer in Tropical South Africa". He was awarded the Royal Geographical Society's gold medal in 1853 and the Silver Medal of the French Geographical Society for his pioneering cartographic survey of the region (Bulmer 2003, p. 16). This established his reputation as a geographer and explorer. He proceeded to write the best-selling The Art of Travel, a handbook of practical advice for the Victorian on the move, which went through many editions and is still in print.

In January 1853 Galton met Louisa Jane Butler (1822–1897) at his neighbour's home and they were married on 1 August 1853. The union of 43 years proved childless.[11][12]


Middle years
Galton was a polymath who made important contributions in many fields of science, including meteorology (the anti-cyclone and the first popular weather maps), statistics (regression and correlation), psychology (synaesthesia), biology (the nature and mechanism of heredity), and criminology (fingerprints). Much of this was influenced by his penchant for counting or measuring. Galton prepared the first weather map published in The Times (1 April 1875, showing the weather from the previous day, 31 March), now a standard feature in newspapers worldwide.[13] He became very active in the British Association for the Advancement of Science, presenting many papers on a wide variety of topics at its meetings from 1858 to 1899 (Bulmer 2010, p. 29). He was the general secretary from 1863 to 1867, president of the Geographical section in 1867 and 1872, and president of the Anthropological Section in 1877 and 1885. He was active on the council of the Royal Geographical Society for over forty years, in various committees of the Royal Society, and on the Meteorological Council.

Louisa Jane Butler

James McKeen Cattell, a student of Wilhelm Wundt who had been reading Galton's articles, decided he wanted to study under him. He eventually built a professional relationship with Galton, measuring subjects and working together on research.[14] In 1888, Galton established a lab in the science galleries of the South Kensington Museum. In Galton's lab, participants could be measured in order to gain knowledge of their strengths and weaknesses. Galton also used these data for his own research. He would typically charge people a small fee for his services.[15]

During this time, Galton wrote a controversial letter to the Times titled 'Africa for the Chinese', where he argued that the Chinese, as a race capable of high civilization and (in his opinion) only temporarily stunted by the recent failures of Chinese dynasties, should be encouraged to immigrate to Africa and displace the supposedly inferior aboriginal blacks.[16]


Heredity and eugenics


The publication by his cousin Charles Darwin of The Origin of Species in 1859 was an event that changed Galton's life.[17] He came to be gripped by the work, especially the first chapter on "Variation under Domestication" concerning the breeding of domestic animals. Galton devoted much of the rest of his life to exploring variation in human populations and its implications, at which Darwin had only hinted. In so doing, he established a research programme which embraced multiple aspects of human variation, from mental characteristics to height; from facial images to fingerprint patterns. This required inventing novel measures of traits, devising large-scale collection of data using those measures, and in the end, the discovery of new statistical techniques for describing and understanding the data.

Galton in his later years

Galton was interested at first in the question of whether human ability was hereditary, and proposed to count the number of the relatives of various degrees of eminent men. If the qualities were hereditary, he reasoned, there should be more eminent men among the relatives than among the general population. To test this, he invented the methods of historiometry. Galton obtained extensive data from a broad range of biographical sources, which he tabulated and compared in various ways. This pioneering work was described in detail in his book Hereditary Genius in 1869.[1] Here he showed, among other things, that the numbers of eminent relatives dropped off when going from first-degree to second-degree relatives, and from second-degree to third-degree. He took this as evidence of the inheritance of abilities.

Galton recognized the limitations of his methods in these two works, and believed the question could be better studied by comparisons of twins. His method envisaged testing to see if twins who were similar at birth diverged in dissimilar environments, and whether twins dissimilar at birth converged when reared in similar environments. He again used the method of questionnaires to gather various sorts of data, which were tabulated and described in a paper, "The History of Twins", in 1875. In so doing he anticipated the modern field of behavior genetics, which relies heavily on twin studies. He concluded that the evidence favored nature rather than nurture. He also proposed adoption studies, including trans-racial adoption studies, to separate the effects of heredity and environment.

Galton recognised that cultural circumstances influenced the capability of a civilization's citizens, and their reproductive success. In Hereditary Genius, he envisaged a situation conducive to resilient and enduring civilisation as follows:

"The best form of civilization in respect to the improvement of the race, would be one in which society was not costly; where incomes were chiefly derived from professional sources, and not much through inheritance; where every lad had a chance of showing his abilities, and, if highly gifted, was enabled to achieve a first-class education and entrance into professional life, by the liberal help of the exhibitions and scholarships which he had gained in his early youth; where marriage was held in as high honour as in ancient Jewish times; where the pride of race was encouraged (of course I do not refer to the nonsensical sentiment of the present day, that goes under that name); where the weak could find a welcome and a refuge in celibate monasteries or sisterhoods, and lastly, where the better sort of emigrants and refugees from other lands were invited and welcomed, and their descendants naturalized." (p. 362)[1]

Galton invented the term eugenics in 1883 and set down many of his observations and conclusions in a book, Inquiries into Human Faculty and Its Development.[18] He believed that a scheme of 'marks' for family merit should be defined, and early marriage between families of high rank be encouraged by provision of monetary incentives. He pointed out some of the tendencies in British society, such as the late marriages of eminent people and the paucity of their children, which he thought were dysgenic. He advocated encouraging eugenic marriages by supplying able couples with incentives to have children. On October 29, 1901, Galton chose to address eugenic issues when he delivered the second Huxley lecture at the Royal Anthropological Institute.[14]

The Eugenics Review, the journal of the Eugenics Education Society, commenced publication in 1909. Galton, the Honorary President of the society, wrote the foreword for the first volume.[14] The First International Congress of Eugenics was held in July 1912, after Galton's death; Winston Churchill and Charles Eliot were among the attendees.[14]


Empirical test of pangenesis and Lamarckism


Galton conducted wide-ranging inquiries into heredity which led him to challenge Charles Darwin's hypothetical theory of pangenesis. Darwin had proposed as part of this hypothesis that certain particles, which he called "gemmules", moved throughout the body and were also responsible for the inheritance of acquired characteristics. Galton, in consultation with Darwin, set out to see if they were transported in the blood. In a long series of experiments from 1869 to 1871, he transfused the blood between dissimilar breeds of rabbits and examined the features of their offspring.[19] He found no evidence of characters transmitted in the transfused blood (Bulmer 2003, pp. 116–118).

Darwin challenged the validity of Galton's experiment, giving his reasons in an article published in Nature, where he wrote: "Now, in the chapter on Pangenesis in my Variation of Animals and Plants under Domestication I have not said one word about the blood, or about any fluid proper to any circulating system. It is, indeed, obvious that the presence of gemmules in the blood can form no necessary part of my hypothesis; for I refer in illustration of it to the lowest animals, such as the Protozoa, which do not possess blood or any vessels; and I refer to plants in which the fluid, when present in the vessels, cannot be considered as true blood." He goes on to admit: "Nevertheless, when I first heard of Mr. Galton's experiments, I did not sufficiently reflect on the subject, and saw not the difficulty of believing in the presence of gemmules in the blood."[20]

Galton explicitly rejected the idea of the inheritance of acquired characteristics (Lamarckism), and was an early proponent of "hard heredity" through selection alone. He came close to rediscovering Mendel's particulate theory of inheritance, but was prevented from making the final breakthrough in this regard because of his focus on continuous, rather than discrete, traits (now known as polygenic traits). He went on to found the biometric approach to the study of heredity, distinguished by its use of statistical techniques to study continuous traits and population-scale aspects of heredity. This approach was later taken up enthusiastically by Karl Pearson and W.F.R. Weldon; together, they founded the highly influential journal Biometrika in 1901. (R.A. Fisher would later show how the biometrical approach could be reconciled with the Mendelian approach.) The statistical techniques that Galton invented (correlation and regression; see below) and phenomena he established (regression to the mean) formed the basis of the biometric approach and are now essential tools in all the social sciences.


Innovations in statistics and psychological theory


Historiometry
The method used in Hereditary Genius has been described as the first example of historiometry. To bolster these results, and to attempt to make a distinction between 'nature' and 'nurture' (he was the first to apply this phrase to the topic), he devised a questionnaire that he sent out to 190 Fellows of the Royal Society. He tabulated characteristics of their families, such as birth order and the occupation and race of their parents. He attempted to discover whether their interest in science was 'innate' or due to the encouragement of others. The studies were published as a book, English Men of Science: Their Nature and Nurture, in 1874. In the end, it promoted the nature-versus-nurture question, though it did not settle it, and provided some fascinating data on the sociology of scientists of the time.

The Lexical Hypothesis
Sir Francis was the first scientist to recognize what is now known as the Lexical Hypothesis.[] This is the idea that the most salient and socially relevant personality differences in people's lives will eventually become encoded into language. The hypothesis further suggests that by sampling language, it is possible to derive a comprehensive taxonomy of human personality traits.

The questionnaire
Galton's inquiries into the mind involved detailed recording of people's subjective accounts of whether and how their minds dealt with phenomena such as mental imagery. In order to better elicit this information, he pioneered the use of the questionnaire. In one study, he asked his fellow members of the Royal Society of London to describe mental images that they experienced. In another, he collected in-depth surveys from eminent scientists for a work examining the effects of nature and nurture on the propensity toward scientific thinking.[21]

Variance and standard deviation
Core to any statistical analysis is the concept that measurements vary: they have both a central tendency, or mean, and a spread around this central value: variance. In the late 1860s, Galton conceived of a measure to quantify normal variation: the standard deviation.[22]

Galton was a keen observer. In 1906, visiting a livestock fair, he stumbled upon an intriguing contest. An ox was on display, and the villagers were invited to guess the animal's weight after it was slaughtered and dressed. Nearly 800 participated, but not one person hit the exact mark: 1,198 pounds. Galton stated that "the middlemost estimate expresses the vox populi, every other estimate being condemned as too low or too high by a majority of the voters",[23] and calculated this value (in modern terminology, the median) as 1,207 pounds. To his surprise, this was within 0.8% of the weight measured by the judges. Soon afterwards, he acknowledged[24] that the mean of the guesses, at 1,197 pounds, was even more accurate.[25][26]

Experimental derivation of the normal distribution
Studying variation, Galton invented the quincunx, a pachinko-like device, also known as the bean machine, as a tool for demonstrating the law of error and the normal distribution (Bulmer 2003, p. 4).

Bivariate normal distribution
He also discovered the properties of the bivariate normal distribution and its relationship to regression analysis.

Correlation and regression
After examining forearm and height measurements, Galton introduced the concept of correlation in 1888 (Bulmer 2003, pp. 191–196).
Correlation is the term used by Aristotle in his studies of animal classification, and later and most notably by Georges Cuvier in Histoire des progrès des sciences naturelles depuis 1789 jusqu'à ce jour (5 volumes, 1826–1836). Correlation originated in the study of correspondence as described in the study of morphology. See R.S. Russell, Form and Function. He was not the first to describe the mathematical relationship represented by the correlation coefficient, but he rediscovered this relationship and demonstrated its application in the study of heredity, anthropology, and psychology.[21] Galton's later statistical study of the probability of extinction of surnames led to the concept of Galton–Watson stochastic processes (Bulmer 2003, pp. 182–184). This is now a core of modern statistics and regression. Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas. He is responsible for the choice of r (for reversion or regression) to represent the correlation coefficient.[21] In the 1870s and 1880s he was a pioneer in the use of the normal distribution to fit histograms of actual tabulated data.

Theories of perception
Galton went beyond measurement and summary to attempt to explain the phenomena he observed. Among such developments, he proposed an early theory of ranges of sound and hearing, and collected large quantities of anthropometric data from the public through his popular and long-running Anthropometric Laboratory, which he established in 1884 and where he studied over 9,000 people.[14] It was not until 1985 that these data were analyzed in their entirety.

Differential psychology
Galton's study of human abilities ultimately led to the foundation of differential psychology and the formulation of the first mental tests. He was interested in measuring humans in every way possible. This included measuring their ability to make sensory discriminations, which he assumed was linked to intellectual prowess. Galton suggested that "individual differences in general ability are reflected in performance on relatively simple sensory capacities and in speed of reaction to a stimulus, variables that could be objectively measured by tests of sensory discrimination and reaction time" (Jensen, Arthur R. (April 2002). "Galton's Legacy to Research on Intelligence" [27]. Journal of Biosocial Science, 34(2), 145–172). He also measured how quickly people reacted, which he later linked to internal wiring that he believed ultimately limited intelligence. Throughout his research Galton assumed that people who reacted faster were more intelligent than others.

Composite photography
Galton also devised a technique called "composite portraiture" (produced by superimposing multiple photographic portraits of individuals' faces registered on their eyes) to create an average face (see averageness). In the 1990s, a hundred years after his discovery, much psychological research has examined the attractiveness of these faces, an aspect that Galton had remarked on in his original lecture. Others, including Sigmund Freud in his work on dreams, picked up Galton's suggestion that these composites might represent a useful metaphor for an ideal or a concept of a "natural kind" (see Eleanor Rosch), such as Jewish men, criminals, or patients with tuberculosis: many portraits combined onto the same photographic plate, yielding a blended whole, or composite, that he hoped could generalize the facial appearance of his subjects into an average or central type.[4][28] (See also the entry on modern physiognomy under Physiognomy.)
This work began in the 1880s while the Jewish scholar Joseph Jacobs studied anthropology and statistics with Francis Galton. Jacobs asked Galton to create a composite photograph of a Jewish type.[29] One of Jacobs' first publications that used Galton's composite imagery was "The Jewish Type, and Galton's Composite Photographs", Photographic News, 29 (April 24, 1885): 268–269. Galton hoped his technique would aid medical diagnosis, and even criminology through the identification of typical criminal faces. However, his technique did not prove useful and fell into disuse, despite much further work on it, including by the photographers Lewis Hine, John L. Lovell, and Arthur Batut.


Fingerprints
In a Royal Institution paper in 1888 and three books (Finger Prints, 1892; Decipherment of Blurred Finger Prints, 1893; and Fingerprint Directories, 1895),[30] Galton estimated the probability of two persons having the same fingerprint and studied the heritability and racial differences in fingerprints. He wrote about the technique (inadvertently sparking a controversy between Herschel and Faulds that was to last until 1917), identifying common patterns in fingerprints and devising a classification system that survives to this day. The method of identifying criminals by their fingerprints had been introduced in the 1860s by Sir William James Herschel in India, and their potential use in forensic work was first proposed by Dr Henry Faulds in 1880, but Galton was the first to place the study on a scientific footing, which assisted its acceptance by the courts (Bulmer 2003, p. 35). Galton pointed out that there were specific types of fingerprint patterns. He described and classified them into eight broad categories: 1: plain arch, 2: tented arch, 3: simple loop, 4: central pocket loop, 5: double loop, 6: lateral pocket loop, 7: plain whorl, and 8: accidental.[]

Final years
In an effort to reach a wider audience, Galton worked on a novel entitled Kantsaywhere from May until December 1910. The novel described a utopia organized by a eugenic religion, designed to breed fitter and smarter humans. His unpublished notebooks show that this was an expansion of material he had been composing since at least 1901. He offered it to Methuen for publication, but they showed little enthusiasm. Galton wrote to his niece that it should be either smothered or superseded. His niece appears to have burnt most of the novel, offended by the love scenes, but large fragments survived.[31]

Honours and impact


Over the course of his career Galton received many major awards, including the Copley Medal of the Royal Society (1910). He received in 1853 the highest award from the Royal Geographical Society, one of two gold medals awarded that year, for his explorations and map-making of southwest Africa. He was elected a member of the prestigious Athenaeum Club in 1855 and made a Fellow of the Royal Society in 1860. His autobiography also lists the following:[32]

Silver Medal, French Geographical Society (1854)
Gold Medal of the Royal Society (1886)
Officier de l'Instruction Publique, France (1891)
D.C.L., Oxford (1894)
Sc.D. (Honorary), Cambridge (1895)
Huxley Medal, Anthropological Institute (1901)
Elected Hon. Fellow, Trinity College, Cambridge (1902)
Darwin Medal, Royal Society (1902)
Linnean Society of London's Darwin–Wallace Medal (1908)

Galton was knighted in 1909. His statistical heir Karl Pearson, first holder of the Galton Chair of Eugenics at University College London (now Galton Chair of Genetics), wrote a three-volume biography of Galton, in four parts, after his death (Pearson 1914, 1924, 1930). The eminent psychometrician Lewis Terman estimated that his childhood IQ was on the order of 200, based on the fact that he consistently performed mentally at roughly twice his chronological age (Forrest 1974). (This follows the original definition of IQ as mental age divided by chronological age, rather than the modern definition based on the standard distribution and standard deviation.) The flowering plant genus Galtonia was named in his honour.


Major Works
Galton, F. (1869). Hereditary Genius [33]. London: Macmillan.
Galton, F. (1883). Inquiries into Human Faculty and Its Development [34]. London: J.M. Dent & Company.

References
[1] Galton, F. (1869). Hereditary Genius (http://galton.org/books/hereditary-genius/). London: Macmillan.
[2] http://www.abelard.org/galton/galton.htm
[3] Francis Galton (1822–1911) from Eric Weisstein's World of Scientific Biography (http://scienceworld.wolfram.com/biography/Galton.html)
[4] Galton, Francis (1883). Inquiries into Human Faculty and Its Development (http://www.galton.org/books/human-faculty/index.html). London: J.M. Dent & Co.
[5] Pearson, K. (1914). The life, letters and labours of Francis Galton (4 vols.). Cambridge: Cambridge University Press.
[6] Oxford Dictionary of National Biography, accessed 31 January 2010
[8] 'Scientific Lodge No. 105 Cambridge' in Membership Records: Foreign and Country Lodges, Nos. 17-145, 1837-1862. London: Library and Museum of Freemasonry (manuscript)
[9] M. Merrington and J. Golden (1976). A List of the Papers and Correspondence of Sir Francis Galton (1822-1911) held in The Manuscripts Room, The Library, University College London. The Galton Laboratory, University College London (typescript), at Section 88 on p. 10
[10] citation?
[11] Life of Francis Galton by Karl Pearson, Vol. 2: image 0320 (http://galton.org/cgi-bin/searchImages/search/pearson/vol2/pages/vol2_0320.htm)
[12] http://www.stanford.edu/group/auden/cgi-bin/auden/individual.php?pid=I7570&ged=auden-bicknell.ged
[13] http://www.galton.org/meteorologist.html
[14] Gillham, Nicholas Wright (2001). A Life of Sir Francis Galton: From African Exploration to the Birth of Eugenics. Oxford University Press. ISBN 0-19-514365-5.
[15] Hergenhahn, B.R. (2008). An Introduction to the History of Psychology. Colorado: Wadsworth Pub.
[16] http://galton.org/letters/africa-for-chinese/AfricaForTheChinese.htm
[17] Forrest, D.W. (1974). Francis Galton: the life and work of a Victorian genius. Elek, London. p. 84
[18] Inquiries into Human Faculty and Its Development by Francis Galton (http://galton.org/books/human-faculty/)
[19] Science Show 25/11/00: Sir Francis Galton (http://www.abc.net.au/rn/science/ss/stories/s216074.htm)
[20] http://darwin-online.org.uk/content/frameset?itemID=F1751&viewtype=side&pageseq=1
[21] Clauser, Brian E. (2007). The Life and Labors of Francis Galton: A Review of Four Recent Books About the Father of Behavioral Statistics. 32(4), pp. 440–444.
[22] http://www.sciencetimeline.net/1866.htm
[23] Galton, F., "Vox Populi" (http://galton.org/essays/1900-1911/galton-1907-vox-populi.pdf), Nature, March 7, 1907, accessed 2012-07-25
[24] "The Ballot Box" (http://galton.org/cgi-bin/searchImages/galton/search/essays/pages/galton-1907-ballot-box_1.htm), Nature, March 28, 1907, accessed 2012-07-25
[25] adamsmithlives.blogs.com posting (http://adamsmithlives.blogs.com/thoughts/2007/10/experts-and-inf.html)
[27] http://journals2.scholarsportal.info.myaccess.library.utoronto.ca/tmp/2802204478791895184.pdf
[28] Galton, F. (1878). Composite portraits (http://www.galton.org/essays/1870-1879/galton-1879-jaigi-composite-portraits.pdf). Journal of the Anthropological Institute of Great Britain and Ireland, 8, 132–142.
[29] Daniel Akiva Novak. Realism, photography, and nineteenth-century (http://books.google.com/books?id=UeiMt7Yzb1MC&pg=PA100&lpg=PA100&dq=Francis+Galton+jewish+boys&source=bl&ots=Hj6o5LrTjj&sig=R4e5tBliXpezKQhnX2hgG1YGwjg&hl=en&ei=S-QBSo7oBpbisgOluOz8BQ&sa=X&oi=book_result&ct=result&resnum=1). Cambridge University Press, 2008. ISBN 0-521-88525-6
[30] Conklin, Barbara Gardner, Robert Gardner, and Dennis Shortelle. Encyclopedia of Forensic Science: A Compendium of Detective Fact and Fiction. Westport, Conn.: Oryx, 2002. Print.
[31] Life of Francis Galton by Karl Pearson, Vol. 3a: image 470 (http://www.mugu.com/browse/galton/search/pearson/vol3a/pages/vol3a_0470.htm)
[33] http://galton.org/books/hereditary-genius/
[34] http://www.galton.org/books/human-faculty/index.html


Further reading
Brookes, Martin (2004). Extreme Measures: The Dark Visions and Bright Ideas of Francis Galton. Bloomsbury.
Bulmer, Michael (2003). Francis Galton: Pioneer of Heredity and Biometry. Johns Hopkins University Press. ISBN 0-8018-7403-3
Cowan, Ruth Schwartz (1985, 1969). Sir Francis Galton and the Study of Heredity in the Nineteenth Century. Garland (1985). Originally Cowan's Ph.D. dissertation, Johns Hopkins University (1969).
Ewen, Stuart and Elizabeth Ewen (2006; 2008). "Nordic Nightmares," pp. 257–325 in Typecasting: On the Arts and Sciences of Human Inequality. Seven Stories Press. ISBN 978-1-58322-735-0
Forrest, D.W. (1974). Francis Galton: The Life and Work of a Victorian Genius. Taplinger. ISBN 0-8008-2682-5
Galton, Francis (1909). Memories of My Life (http://books.google.com/?id=MvAIAAAAIAAJ&pg=PA3&dq=Samuel+"John"+Galton). New York: E. P. Dutton and Company.
Gillham, Nicholas Wright (2001). A Life of Sir Francis Galton: From African Exploration to the Birth of Eugenics. Oxford University Press. ISBN 0-19-514365-5
Pearson, Karl (1914, 1924, 1930). "The life, letters and labours of Francis Galton (3 vols.)" (http://galton.org)
Daniëlle Posthuma, Eco J. C. De Geus, Wim F. C. Baaré, Hilleke E. Hulshoff Pol, René S. Kahn & Dorret I. Boomsma (2002). "The association between brain volume and intelligence is of genetic origin". Nature Neuroscience 5 (2): 83–84. doi: 10.1038/nn0202-83 (http://dx.doi.org/10.1038/nn0202-83). PMID 11818967 (http://www.ncbi.nlm.nih.gov/pubmed/11818967)
Quinche, Nicolas. Crime, Science et Identité. Anthologie des textes fondateurs de la criminalistique européenne (1860–1930). Genève: Slatkine, 2006, 368 p., passim.
Stigler, S. M. (2010). "Darwin, Galton and the Statistical Enlightenment". Journal of the Royal Statistical Society: Series A (Statistics in Society) 173 (3): 469–482. doi: 10.1111/j.1467-985X.2010.00643.x (http://dx.doi.org/10.1111/j.1467-985X.2010.00643.x).

External links
Galton's Complete Works (http://galton.org) at Galton.org (including all his published books, all his published scientific papers, and popular periodical and newspaper writing, as well as other previously unpublished work and biographical material).
Works by Francis Galton (http://www.gutenberg.org/author/Francis+Galton) at Project Gutenberg
The Galton Machine or Board demonstrating the normal distribution (http://www.youtube.com/watch?v=9xUBhhM4vbM)
Portraits of Galton (http://www.npg.org.uk/live/search/person.asp?LinkID=mp01715) from the National Portrait Gallery (United Kingdom)
The Galton laboratory homepage (http://www.gene.ucl.ac.uk/) (originally The Francis Galton Laboratory of National Eugenics) at University College London
O'Connor, John J.; Robertson, Edmund F., "Francis Galton" (http://www-history.mcs.st-andrews.ac.uk/Biographies/Gillham.html), MacTutor History of Mathematics archive, University of St Andrews.
Biography and bibliography (http://vlp.mpiwg-berlin.mpg.de/people/data?id=per78) in the Virtual Laboratory of the Max Planck Institute for the History of Science
History and Mathematics (http://urss.ru/cgi-bin/db.pl?cp=&page=Book&id=53184&lang=en&blang=en&list=Found)
Human Memory – University of Amsterdam (http://memory.uva.nl/testpanel/gc/en/) website with a test based on the work of Galton
An 8-foot-tall (2.4 m) Probability Machine (named Sir Francis Galton) comparing stock market returns to the randomness of the beans dropping through the quincunx pattern (http://www.youtube.com/watch?v=AUSKTk9ENzg) from Index Funds Advisors IFA.com (http://www.ifa.com)

Catalogue of the Galton papers held at UCL Archives (http://archives.ucl.ac.uk/DServe/dserve.exe?dsqServer=localhost&dsqIni=Dserve.ini&dsqApp=Archive&dsqCmd=Show.tcl&dsqDb=Catalog&dsqPos=2&dsqSearch=((text)='galton'))
"Composite Portraits", by Francis Galton, 1878 (as published in the Journal of the Anthropological Institute of Great Britain and Ireland, volume 8) (http://www.galton.org/essays/1870-1879/galton-1879-jaigi-composite-portraits.pdf)
"Enquiries into Human Faculty and its Development", book by Francis Galton, 1883 (http://www.galton.org/books/human-faculty/text/galton-1883-human-faculty-v4.pdf)


Group size measures


Many animals, including humans, tend to live in groups, herds, flocks, bands, packs, shoals, or colonies (hereafter: groups) of conspecific individuals. The size of these groups, as expressed by the number of participant individuals, is an important aspect of their social environment. Group size tends to be highly variable even within the same species, so we often need statistical measures to quantify group size and statistical tests to compare these measures between two or more samples. Unfortunately, group size measures are notoriously hard to handle statistically, since group size values typically exhibit an aggregated (right-skewed) distribution: most groups are small, few are large, and a very few are very large.

A group acts as a social environment of individuals: a flock of nine Common Cranes.

Statistical measures of group size roughly fall into two categories.


Outsider's view of group size


Group size is the number of individuals within a group;
Mean group size, i.e. the arithmetic mean of group sizes averaged across groups;
Confidence interval for mean group size;
Median group size, i.e. the median of group sizes calculated across groups;
Confidence interval for median group size.

Insider's view of group size


As Jarman (1974) pointed out, average individuals live in groups larger than average, simply because the groups smaller than average contain fewer individuals than the groups larger than average (except in the unrealistic case when all groups are of equal size). Therefore, when we wish to characterize a typical (average) individual's social environment, we should not apply the outsider's view of group size. Reiczigel et al. (2008) proposed the following measures:

Crowding is the number of individuals within a group (equal to group size: 1 for a solitary individual, 2 for both individuals in a group of 2, etc.);
Mean crowding, i.e. the arithmetic mean of crowding values averaged across individuals (this was called "typical group size" in Jarman's 1974 terminology);
Confidence interval for mean crowding.

Colony size measures for rooks breeding in Normandy. The distribution of colonies (vertical axis above) and the distribution of individuals (vertical axis below) across the size classes of colonies (horizontal axis). The number of individuals is given in pairs. Animal group size data tend to exhibit aggregated (right-skewed) distributions, i.e. most groups are small, a few are large, and a very few are very large. Note that average individuals live in colonies larger than the average colony size. (Data from Normandy, 1999-2000 (smoothed), Debout, 2003)

Statistical methods
Due to the aggregated (right-skewed) distribution of group members among groups, the application of parametric statistics would be misleading. Another problem arises when analyzing crowding values. Crowding data consist of nonindependent values, or ties, which show multiple and simultaneous changes due to a single biological event. (Say, all group members' crowding values change simultaneously whenever an individual joins or leaves.) The paper by Reiczigel et al. (2008) discusses the statistical problems associated with group size measures (calculating confidence intervals, 2-sample tests, etc.) and offers a free statistical toolset (Flocker 1.1) to handle them in a user-friendly manner.
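As a concrete illustration of the two views, here is a minimal Python sketch; the group sizes are invented for illustration and the calculation is not taken from the Flocker toolset or the cited papers.

# Outsider's vs. insider's view of group size (illustrative numbers only).
group_sizes = [1, 1, 2, 3, 5, 20]

# Outsider's view: average over groups.
mean_group_size = sum(group_sizes) / len(group_sizes)                # about 5.33

# Insider's view: each individual's crowding equals the size of its own
# group, so averaging over individuals weights every group by its size.
total_individuals = sum(group_sizes)
mean_crowding = sum(n * n for n in group_sizes) / total_individuals  # 13.75

print(mean_group_size, mean_crowding)

In this made-up sample the typical individual lives in a group of nearly 14 animals even though the average group contains only about 5, which is exactly the discrepancy Jarman's argument describes.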


Literature
Debout, G. (2003). Le corbeau freux (Corvus frugilegus) nicheur en Normandie: recensement 1999 & 2000. Cormoran, 13, 115–121.
Jarman, P.J. (1974). The social organisation of antelope in relation to their ecology. Behaviour, 48, 215–268.
Reiczigel, J., Lang, Z., Rózsa, L., & Tóthmérész, B. (2008). Measures of sociality: two different views of group size. [1] Animal Behaviour, 75, 715–721.

External links
Flocker 1.1, a statistical toolset to analyze group size measures (with all the above-mentioned calculations available) [2]

Gallery

An Aphid colony

European Paper Wasp colony

Bluestripe snapper schooling.

Flamingos

Gannet colony

Common Coots

Great Woodswallows allopreening.

Red-billed Quelea flock

Wolf pack hunting

African Wild Dogs

Elephant seals

Vicuñas

Bottlenose dolphins

African buffalo herd

Sheep flock


References
[1] http://www.zoologia.hu/list/AnimBehav.pdf
[2] http://www.zoologia.hu/flocker/flocker.html

Guttman scale
In statistical surveys conducted by means of structured interviews or questionnaires, a subset of the survey items having binary (e.g., YES or NO) answers forms a Guttman scale (named after Louis Guttman) if they can be ranked in some order so that, for a rational respondent, the response pattern can be captured by a single index on that ordered scale. In other words, on a Guttman scale, items are arranged in an order so that an individual who agrees with a particular item also agrees with items of lower rank-order. For example, a series of items could be (1) "I am willing to be near ice cream"; (2) "I am willing to smell ice cream"; (3) "I am willing to eat ice cream"; and (4) "I love to eat ice cream". Agreement with any one item implies agreement with the lower-order items. This contrasts with topics studied using a Likert scale or a Thurstone scale.

The concept of a Guttman scale likewise applies to series of items in other kinds of tests, such as achievement tests, that have binary outcomes. For example, a test of math achievement might order questions based on their difficulty and instruct the examinee to begin in the middle. The assumption is that if the examinee can successfully answer items of that difficulty (e.g., summing two 3-digit numbers), s/he would be able to answer the earlier questions (e.g., summing two 2-digit numbers). Some achievement tests are organized in a Guttman scale to reduce the duration of the test.

By designing surveys and tests such that they contain Guttman scales, researchers can simplify the analysis of the outcome of surveys and increase their robustness. Guttman scales also make it possible to detect and discard randomized answer patterns, as may be given by uncooperative respondents.

A hypothetical, perfect Guttman scale consists of a unidimensional set of items that are ranked in order of difficulty from least extreme to most extreme position. For example, a person scoring a "7" on a ten-item Guttman scale will agree with items 1–7 and disagree with items 8, 9, and 10. An important property of Guttman's model is that a person's entire set of responses to all items can be predicted from their cumulative score because the model is deterministic.

A well-known example of a Guttman scale is the Bogardus Social Distance Scale. Another example is the original Beaufort wind force scale, assigning a single number to observed conditions of the sea surface ("Flat", ..., "Small waves", ..., "Sea heaps up and foam begins to streak", ...), which was in fact a Guttman scale. The observation "Flat = YES" implies "Small waves = NO".

Deterministic model
An important objective in Guttman scaling is to maximize the reproducibility of response patterns from a single score. A good Guttman scale should have a coefficient of reproducibility (the percentage of original responses that could be reproduced by knowing the scale scores used to summarize them) above .85. Other commonly used metrics for assessing the quality of a Guttman scale are Menzel's coefficient of scalability and the coefficient of homogeneity (Loevinger, 1948; Cliff, 1977; Krus and Blackman, 1988). To maximize unidimensionality, misfitting items are re-written or discarded.
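To make the calculation concrete, the following short Python sketch counts errors as deviations from the response pattern predicted by each respondent's total score and turns them into a coefficient of reproducibility; the response matrix is a toy example, not data from the cited studies, and real scalogram analyses may count errors somewhat differently.

# Coefficient of reproducibility for a toy response matrix (1 = agree,
# 0 = disagree), with items already ordered from least to most extreme.
responses = [
    [1, 1, 1, 0, 0],   # perfect scale type, score 3
    [1, 1, 0, 0, 0],   # perfect scale type, score 2
    [1, 0, 1, 0, 0],   # two deviations from the predicted pattern
    [1, 1, 1, 1, 1],   # perfect scale type, score 5
]

errors = 0
for row in responses:
    score = sum(row)
    predicted = [1] * score + [0] * (len(row) - score)
    errors += sum(1 for observed, expected in zip(row, predicted) if observed != expected)

reproducibility = 1 - errors / (len(responses) * len(responses[0]))
print(reproducibility)   # 0.9 here; at least .85 is the usual rule of thumb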


Stochastic models
Guttman's deterministic model is brought within a probabilistic framework in item response theory models, and especially Rasch measurement. The Rasch model requires a probabilistic Guttman structure when items have dichotomous responses (e.g. right/wrong). In the Rasch model, the Guttman response pattern is the most probable response pattern for a person when items are ordered from least difficult to most difficult (Andrich, 1985). In addition, the Polytomous Rasch model is premised on a deterministic latent Guttman response subspace, and this is the basis for integer scoring in the model (Andrich, 1978, 2005). Analysis of data using item response theory requires comparatively longer instruments and larger datasets to scale item and person locations and evaluate the fit of data to model. In practice, actual data from respondents do not closely match Guttman's deterministic model. Several probabilistic models of Guttman implicatory scales were developed by Krus (1977) and Krus and Bart (1974).
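The following minimal Python sketch (with illustrative item difficulties and a single illustrative person location, not estimated values) shows this property for the dichotomous Rasch model: because response probabilities decrease with item difficulty, the single most probable response pattern for a person is a Guttman pattern.

# Dichotomous Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b)).
import math
from itertools import product

difficulties = [-1.5, -0.5, 0.5, 1.5]    # items ordered from easiest to hardest
theta = 0.25                             # one person's location (illustrative)

def p_correct(b):
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def pattern_probability(pattern):
    prob = 1.0
    for x, b in zip(pattern, difficulties):
        p = p_correct(b)
        prob *= p if x == 1 else 1 - p
    return prob

best = max(product([0, 1], repeat=len(difficulties)), key=pattern_probability)
print(best)   # (1, 1, 0, 0): a Guttman pattern is the most probable for this person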

Applications
The Guttman scale is used mostly when researchers want to design short questionnaires with good discriminating ability. The Guttman model works best for constructs that are hierarchical and highly structured such as social distance, organizational hierarchies, and evolutionary stages.

Unfolding models
A class of unidimensional models that contrast with Guttman's model are unfolding models. These models also assume unidimensionality, but posit that the probability of endorsing an item depends on the distance between the item's standing on the unidimensional trait and the standing of the respondent: the closer the respondent is to the item, the more likely endorsement becomes. For example, items like "I think immigration should be reduced" on a scale measuring attitude towards immigration would be unlikely to be endorsed both by those favoring open policies and by those favoring no immigration at all. Such an item might be endorsed by someone in the middle of the continuum. Some researchers feel that many attitude items fit this unfolding model, while most psychometric techniques are based on correlation or factor analysis and thus implicitly assume a linear relationship between the trait and the response probability. The effect of using these techniques would be to include only the most extreme items, leaving attitude instruments with little precision to measure the trait standing of individuals in the middle of the continuum.
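As an illustration only, the sketch below uses a simple squared-distance response function in Python, one of many possible unfolding forms and not the specific model of any cited author, to show why a middling item is endorsed more by moderate respondents than by respondents at either extreme.

# Ideal-point ("unfolding") response: endorsement falls off with the distance
# between the respondent's position and the item's position (illustrative form).
import math

def endorse_probability(person, item, width=1.0):
    return math.exp(-((person - item) / width) ** 2)

item_position = 0.0                      # e.g., "immigration should be reduced"
for person in (-2.0, 0.0, 2.0):          # restrictionist, moderate, open-borders
    print(person, round(endorse_probability(person, item_position), 3))
# Both extremes endorse this middling item far less than the moderate respondent.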

Example
Here is an example of a Guttman scale, the Bogardus Social Distance Scale:

(Least extreme)
1. Are you willing to permit immigrants to live in your country?
2. Are you willing to permit immigrants to live in your community?
3. Are you willing to permit immigrants to live in your neighbourhood?
4. Are you willing to permit immigrants to live next door to you?
5. Would you permit your child to marry an immigrant?
(Most extreme)

E.g., agreement with item 3 implies agreement with items 1 and 2.


References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 357–374.
Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. Brandon-Tuma (Ed.), Sociological Methodology. San Francisco: Jossey-Bass. (Chapter 2, pp. 33–80.)
Andrich, D. (2005). The Rasch model explained. In Sivakumar Alagumalai, David D. Curtis, and Njora Hungi (Eds.), Applied Rasch Measurement: A Book of Exemplars. Springer-Kluwer. Chapter 3, 308–328.
Cliff, N. (1977). A theory of consistency of ordering generalizable to tailored testing. Psychometrika, 42, 375–399.
Gordon, R. (1977). Unidimensional Scaling of Social Variables: Concepts and Procedures. New York: The Free Press.
Guttman, L. (1950). The basis for scalogram analysis. In Stouffer et al., Measurement and Prediction. The American Soldier Vol. IV. New York: Wiley.
Kenny, D.A., & Rubin, D.C. (1977). Estimating chance reproducibility in Guttman scaling. Social Science Research, 6, 188–196.
Krus, D.J. (1977). Order analysis: an inferential model of dimensional analysis and scaling. Educational and Psychological Measurement, 37, 587–601. (Request reprint.) [1]
Krus, D.J., & Bart, W.M. (1974). An ordering theoretic method of multidimensional scaling of items. Educational and Psychological Measurement, 34, 525–535.
Krus, D.J., & Blackman, H.S. (1988). Test reliability and homogeneity from perspective of the ordinal test theory. Applied Measurement in Education, 1, 79–88. (Request reprint.) [2]
Loevinger, J. (1948). The technic of homogeneous tests compared with some aspects of scale analysis and factor analysis. Psychological Bulletin, 45, 507–529.
Robinson, J.P. (1972). Toward a More Appropriate Use of Guttman Scaling. Public Opinion Quarterly, Vol. 37(2) (Summer, 1973), pp. 260–267.
Schooler, C. (1968). A Note of Extreme Caution on the Use of Guttman Scales. American Journal of Sociology, Vol. 74(3) (Nov. 1968), 296–301.

External links
Guttman scaling description [3]

References
[1] http://www.visualstatistics.net/Scaling/Order%20Analysis/Order%20Analysis.htm
[2] http://www.visualstatistics.net/Scaling/Homogeneity/Homogeneity.htm
[3] http://www.socialresearchmethods.net/kb/scalgutt.htm


High-stakes testing
A high-stakes test is a test with important consequences for the test taker.[1] Passing has important benefits, such as a high school diploma, a scholarship, or a license to practice a profession. Failing has important disadvantages, such as being forced to take remedial classes until the test can be passed, not being allowed to drive a car, or not being able to find employment. The use and misuse of high-stakes tests is a controversial topic in public education, especially in the United States, where they have become especially popular in recent years, used not only to assess students but also in attempts to increase teacher accountability.[2]

A driving test is a high-stakes test: Without passing the test, the test taker cannot obtain a driver's license.

Definitions
In common usage, a high-stakes test is any test that has major consequences or is the basis of a major decision.[1][][] Under a more precise definition, a high-stakes test is any test that: is a single, defined assessment; has a clear line drawn between those who pass and those who fail; and has direct consequences for passing or failing (something "at stake").[]

High-stakes testing is not synonymous with high-pressure testing. An American high school student might feel pressure to perform well on the SAT-I college aptitude exam. However, SAT scores do not directly determine admission to any college or university, and there is no clear line drawn between those who pass and those who fail, so it is not formally considered a high-stakes test.[3][4] On the other hand, because SAT-I scores are given significant weight in the admissions process at some schools, many people believe that it has consequences for doing well or poorly and is therefore a high-stakes test under the simpler, common definition.[5][6]

The stakes
High stakes are not a characteristic of the test itself, but rather of the consequences placed on the outcome. For example, no matter what test is used (written multiple choice, oral examination, performance test), a medical licensing test must be passed to practice medicine. The perception of the stakes may vary. For example, college students who wish to skip an introductory-level course are often given exams to see whether they have already mastered the material and can be passed to the next level. Passing the exam can reduce tuition costs and time spent at university. A student who is anxious to have these benefits may consider the test to be a high-stakes exam. Another student, who places no importance on the outcome, so long as he is placed in a class that is appropriate to his skill level, may consider the same exam to be a low-stakes test.[]

The phrase "high stakes" is derived directly from a gambling term. In gambling, a stake is the quantity of money or other goods that is risked on the outcome of some specific event. A high-stakes game is one in which, in the player's personal opinion, a large quantity of money is being risked. The term is meant to imply that implementing such a system introduces uncertainty and potential losses for test takers,[citation needed] who must pass the exam to "win," instead of being able to obtain the goal through other means.[citation needed]

Examples of high-stakes tests and their "stakes" include:
Driver's license tests and the legal ability to drive
Theater auditions and the part in the performance
College entrance examinations in some countries, such as Japan's Common first-stage exam, and admission to a high-quality university
Many job interviews or drug tests and being hired
High school exit examinations and high-school diplomas
Progression from one grade to another in primary and secondary school
No Child Left Behind tests and school funding and ratings
Ph.D. oral exams and the dissertation
Professional licensing and certification examinations (such as bar exams, FAA written tests, and medical exams) and the license or certification being sought
The Test of English as a Foreign Language (TOEFL) and recognition as a speaker of English (if a minimum score is required, but not if it is used merely for information, normally in work and school placement contexts)


Stakeholders
A high-stakes system may be intended to benefit people other than the test-taker. For professional certification and licensure examinations, the purpose of the test is to protect the general public from incompetent practitioners. The individual stakes of the medical student and the medical school are, hopefully, balanced against the social stakes of possibly allowing an incompetent doctor to practice medicine.[7] A test may be "high-stakes" based on consequences for others beyond the individual test-taker.[] For example, an individual medical student who fails a licensing exam will not be able to practice his or her profession. However, if enough students at the same school fail the exam, then the school's reputation and accreditation may be in jeopardy. Similarly, testing under the U.S.'s No Child Left Behind Act has no direct negative consequences for failing students,[8] but potentially serious consequences for their schools, including loss of accreditation, funding, teacher pay, teacher employment, or changes to the school's management.[9] The stakes are therefore high for the school, but low for the individual test-takers.

Assessments used in high-stakes testing


Any form of assessment can be used as a high-stakes test. Many times, an inexpensive multiple-choice test is chosen for convenience. A high-stakes assessment may also involve answering open-ended questions or a practical, hands-on section. For example, a typical high-stakes licensing exam for a medical nurse determines whether the nurse can insert an I.V. line by watching the nurse actually do this task. These assessments are called authentic assessments or performance tests.[] Some high-stakes tests may be standardized tests (in which all examinees take the test under reasonably equal conditions), with the expectation that standardization affords all examinees a fair and equal opportunity to pass.[] Some high-stakes tests are non-standardized, such as a theater audition. As with other tests, high-stakes tests may be criterion-referenced or norm-referenced.[] For example, a written driver's license examination typically is criterion-referenced, with an unlimited number of potential drivers able to pass if they correctly answer a certain percentage of questions. On the other hand, essay portions of some bar exams are often norm-referenced, with the worst essays failed and the best essays passed, without regard for the overall quality of the essays.


Criticism
High-stakes tests are often criticized for the following reasons:

The test does not correctly measure the individual's knowledge or skills. For example, a test might purport to be a general reading-skills test, but it might actually determine whether or not the examinee has read a specific book.

The test may not measure what the critic wants measured. For example, a test might accurately measure whether a law student has acquired fundamental knowledge of the legal system, but the critic might want students to be tested on legal ethics instead of legal knowledge.

Testing causes stress for some students. Critics suggest that since some people perform poorly under the pressure associated with tests, any test is likely to be less representative of their actual standard of achievement than a non-test alternative.[] This is called test anxiety or performance anxiety.

High-stakes tests are often given as a single long exam. Some critics prefer continuous assessment instead of one larger test. For example, the American Psychological Association (APA) opposes high school exit examinations, saying, "Any decision about a student's continued education, such as retention, tracking, or graduation, should not be based on the results of a single test, but should include other relevant and valid information."[] Since the stakes are related to consequences, not method, however, short tests can also be high-stakes.

High-stakes testing creates more incentive for cheating.[] Because cheating on a single critical exam may be easier than either learning the required material or earning credit through attendance, diligence, or many smaller tests, more examinees who do not actually have the necessary knowledge or skills, but who are effective cheaters, may pass. Also, some people who would otherwise pass the test but are not confident enough of themselves might decide to additionally secure the outcome by cheating, get caught, and often face even worse consequences than just failing. Additionally, if the test results are used to determine the teachers' pay or continued employment, or to evaluate the school, then school personnel may fraudulently alter student test papers to artificially inflate student performance.[]

Sometimes a high-stakes test is tied to a controversial reward. For example, some people may want a high-school diploma to represent the verified acquisition of specific skills or knowledge, and therefore use a high-stakes assessment to deny a diploma to anyone who cannot perform the necessary skills.[] Others may want a high school diploma to represent primarily a certificate of attendance, so that a student who faithfully attended school but cannot read or write will still get the social benefits of graduation.[citation needed] This use of tests to deny a high school diploma, and thereby access to most jobs and higher education for a lifetime, is controversial even when the test itself accurately identifies students who do not have the necessary skills. Criticism is usually framed as over-reliance on a single measurement[10] or in terms of social justice, if the absence of skill is not entirely the test taker's fault, as in the case of a student who cannot read because of unqualified teachers, or an elderly person with advanced dementia who can no longer pass a driving exam due to loss of cognitive function.[]

Tests can penalize test takers who do not have the necessary skills through no fault of their own. An absence of skill may not be the test taker's fault, but high-stakes tests measure only skill proficiency, regardless of whether the test takers had an equal opportunity to learn the material.[][][11] Additionally, wealthy students may use private tutoring or test preparation programs to improve their scores. Some affluent parents pay thousands of dollars to prepare their children for tests.[12] Critics see this as being unfair to students who cannot afford additional educational services.

High-stakes tests reveal that some examinees do not know the required material, or do not have the necessary skills. While failing these people may have many public benefits, the consequences of repeated failure can be very high for the individual. For example, a person who fails a practical driving exam will not be able to drive a car legally, which means they cannot drive to work and may lose their job if alternative transportation options are not available. The person may suffer social embarrassment when his acquaintances discover that his lack of skill resulted in the loss of his driver's license. In the context of high school exit exams, poorly performing school districts have formally opposed high-stakes testing after low test results, which accurately and publicly exposed the districts' failures, proved to be politically embarrassing,[13] and have criticized high-stakes tests for identifying students who lack the required knowledge.[]


References
[2] Rosemary Sutton & Kelvin Seifert (2009). Educational Psychology, 2nd Edition: Chapter 1: The Changing Teaching Profession and You. pp 14 (http://www.saylor.org/site/wp-content/uploads/2012/06/Educational-Psychology.pdf)
[7] Mehrens, W.A. (1995). "Legal and Professional Bases for Licensure Testing." In Impara, J.C. (Ed.), Licensure testing: Purposes, procedures, and practices, pp. 33-58. Lincoln, NE: Buros Institute.

Further reading
Featherston, Mark Davis, 2011. "High-Stakes Testing Policy in Texas: Describing the Attitudes of Young College Graduates." (http://ecommons.txstate.edu/arp/350) Applied Research Projects, Texas State University-San Marcos.

Historiometry
Historiometry is the historical study of human progress or individual personal characteristics, using statistics to analyze references to geniuses,[1] their statements, behavior and discoveries in relatively neutral texts. Historiometry combines techniques from cliometrics, which studies the history of economics, and from psychometrics, the psychological study of an individual's personality and abilities.

Origins
Historiometry started in the early 19th century with studies on the relationship between age and achievement by Belgian mathematician Adolphe Quetelet in the careers of prominent French and English playwrights,[2][3] but it was Sir Francis Galton, a pioneering English eugenicist, who popularized historiometry in his 1869 work, Hereditary Genius.[4] It was further developed by Frederick Adams Woods (who coined the term historiometry[5][6]) at the beginning of the 20th century.[7] The psychologist Paul E. Meehl also published several papers on historiometry later in his career, mainly in the area of medical history, although he usually referred to it as cliometric metatheory.[8][9] Historiometry was the first field to study genius using scientific methods.[1]

Francis Galton, one of the pioneers of historiometry.


Current research
Prominent current historiometry researchers include Dean Keith Simonton and Charles Murray.[] Dean Keith Simonton defines historiometry as a quantitative method of statistical analysis for retrospective data. In Simonton's work the raw data come from psychometric assessment of famous personalities, often already deceased, in an attempt to assess creativity, genius and talent development.[10] Charles Murray's Human Accomplishment is one example of this approach to quantifying the impact of individuals on technology, science and the arts. It tracks the most important achievements across time, and for the different peoples of the world, and provides a thorough discussion of the methodology used, together with an assessment of its reliability and accuracy.[]

Examples of research
Since historiometry deals with subjective personal traits such as creativity, charisma or openness, most studies compare scientists, artists or politicians. The study Human Accomplishment by Charles Murray classifies, for example, Einstein and Newton as the most important physicists and Michelangelo as the top-ranking western artist.[] As another example, several studies have compared the charisma and even the IQ of presidents and presidential candidates of the United States of America.[11][] The latter study classifies John Quincy Adams as the most intelligent US president, with an estimated IQ between 165 and 175.[]

Critique
Since historiometry is based on indirect information like historical documents and relies heavily on statistics, the results of these studies are questioned by some researchers, mainly because of concerns about over-interpretation of the estimated results.[12][13] The previously mentioned study of the intellectual capacity of US presidents, by Dean Keith Simonton, attracted considerable media attention and criticism, mainly because it ranked the former US president George W. Bush second to last among all US presidents since 1900.[][14] The IQ of G.W. Bush was estimated as between 111.1 and 138.5, with an average of 125,[] exceeding only that of president Warren Harding, who is regarded as a failed president,[] with an average IQ of 124. Although controversial and imprecise (due to gaps in available data), the approach used by Simonton to generate his results was regarded as "reasonable" by fellow researchers.[15] In the media, the study was sometimes compared with the U.S. Presidents IQ hoax, a hoax that circulated via email in mid-2001, which suggested that G.W. Bush had the lowest IQ of all US presidents.[16]

References
[1] A Reflective Conversation with Dean Keith Simonton, North American Journal of Psychology, 2008, Vol. 10, No. 3, 595-602.

External links
History and Mathematics (http://urss.ru/cgi-bin/db.pl?cp=&page=Book&id=53184&lang=en&blang=en&list=Found)


House-Tree-Person test
The House-Tree-Person test (HTP) is a projective test designed to measure aspects of a person's personality. The test can also be used to assess brain damage and general mental functioning. The test is a diagnostic tool for clinical psychologists, educators, and employers. The subject receives a short, unclear instruction (the stimulus) to draw a house, a tree, and the figure of a person. Once the subject is done, he is asked to describe the pictures that he has drawn. The assumption is that when the subject is drawing he is projecting his inner world onto the page. The administrator of the test uses tools and skills that have been established for the purpose of investigating the subject's inner world through the drawings.

4-year-old's drawing of a person

Generally this test is administered as part of a series of personality and intelligence tests, like the Rorschach, TAT (or CAT for children), Bender, and Wechsler tests. The examiner integrates the results of these tests, creating a basis for evaluating the subject's personality from a cognitive, emotional, intra- and interpersonal perspective. The test and its method of administration have been criticized for having substantial weaknesses in validity, but a number of researchers in the past few decades have found positive results as regards its validity for specific populations. [citation needed]

History
The HTP was designed by John Buck and was originally based on the Goodenough scale of intellectual functioning. It was developed in 1948 and updated in 1969. Buck included both qualitative and quantitative measurements of intellectual ability in the HTP (V). A 350-page manual was written by Buck to instruct the test-giver on proper grading of the HTP, which is more subjective than quantitative.[] In contrast, Zoltán Vass published a more sophisticated approach, based on system analysis (SSCA, Seven-Step Configuration Analysis [1]).

Administering the test


The HTP is given to persons above the age of three and takes approximately 150 minutes to complete, depending on the subject's level of mental functioning. During the first phase, the test-taker is asked to draw the house, the tree, and the person, and the test-giver asks questions about each picture. There are 60 questions originally designed by Buck, but art therapists and trained test-givers can also design their own questions or ask follow-up questions. This phase is done with a crayon.[] During the second phase of the HTP, the test-taker draws the same pictures with a pencil or pen. Again the test-giver asks similar questions about the drawings. Note: some mental health professionals administer only phase one or phase two, and may change the writing instrument as desired. Variations of the test may ask the person to draw one person of each sex, or put all drawings on the same page.[]

Examples of follow-up questions:

After the House: Who lives here? Is the occupant happy? What goes on inside the house? What's it like at night? Do people visit the house? What else do the people in the house want to add to the drawing?[]

After the Tree: What kind of tree is this? How old is the tree? What season is it? Has anyone tried to cut it down? What else grows nearby? Who waters this tree? Trees need sunshine to live, so does it get enough sunshine?[]

After the Person is drawn: Who is the person? How old is the person? What do they like and dislike doing? Has anyone tried to hurt them? Who looks out for them?[]


Interpretation of results
By virtue of being a projective test, the results of the HTP are subjective and open to interpretation by the administrator of the exam.[] The subjective analysis of the test taker's responses and drawings aims to make inferences about personality traits and past experiences. The subjective nature of this aspect of the HTP, as with other qualitative tests, has little empirical evidence to support its reliability or validity. This test, however, is still considered an accurate measure of brain damage and is used in the assessment of schizophrenic patients who also suffer from brain damage.[] In addition, the quantitative measure of intelligence for the House-Tree-Person has been shown to correlate highly with the WAIS and other well-established intelligence tests.[]

References
[1] http://www.freado.com/read/11970/a-psychological-interpretation-of-drawings-and-paintings

Idiographic image
In the field of clinical sciences, an idiographic image (from Greek ídios + graphikós, meaning "to describe a peculiarity") is the representation of a result obtained through a study or research method whose subject matter is specific cases, i.e. a portrayal which avoids nomothetic generalizations. "Diagnostic formulation follows an idiographic criterion, while diagnostic classification follows a nomothetic criterion".[1] In the fields of psychiatry, psychology and clinical psychopathology, the idiographic criterion is a method (also called the historical method) which involves evaluating past experiences and selecting and comparing information about a specific individual or event. An example of an idiographic image is a report, diagram or health history showing medical, psychological and pathological features which make the subject under examination unique. "Where there is no prior detailed presentation of clinical data, the summary should present sufficient relevant information to support the diagnostic and aetiological components of the formulation. The term diagnostic formulation is preferable to diagnosis, because it emphasises that matters of clinical concern about which the clinician proposes aetiological hypotheses and targets of intervention include much more than just diagnostic category assignment, though this is usually an important component".[2] The expression idiographic image appeared for the first time in 1996 in the SESAMO research method Manual.[3] The term was coined to mean that the report of the test provided an anamnestic account containing the family, relational and health history of the subject and providing semiological data regarding both the psychosexual and the social-affective profile. These profiles were useful to the clinician in formulating pathogenetic and pathognomonic hypotheses.[4]


Bibliography
[1] Battacchi M.W., (1990), Trattato enciclopedico di psicologia dell'età evolutiva, Piccin, Padova. ISBN 88-299-0206-3
[2] Shields R., Emergency psychiatry. Review of psychiatry. Australian and New Zealand Journal of Psychiatry, 37, 4, 498-499, 2003. (http://member.melbpc.org.au/~rshields/psychiatricformulation.html)
[3] Boccadoro L., (1996) SESAMO: Sexuality Evaluation Schedule Assessment Monitoring. Approccio differenziale al profilo idiografico psicosessuale e socioaffettivo. O.S., Firenze. IT\ICCU\CFI\0327719
[4] Boccadoro L., Carulli S., (2008) The place of the denied love. Sexuality and secret psychopathologies (Abstract English, Spanish, Italian) (http://sexology.interfree.it/abstract_english.html). Edizioni Tecnoprint, Ancona. ISBN 978-88-95554-03-7

External links
Glossario di Sessuologia clinica (http://sexology.it/glossario_sessuologia.html) - Glossary of clinical sexology (in Italian)

Intelligence quotient

An example of one kind of IQ test item, modeled after items in the Raven's Progressive Matrices test.
An intelligence quotient, or IQ, is a score derived from one of several standardized tests designed to assess intelligence. The abbreviation "IQ" comes from the German term Intelligenz-Quotient, originally coined by psychologist William Stern. When modern IQ tests are devised, the mean (average) score within an age group is set to 100 and the standard deviation (SD) almost always to 15, although this was not always so historically.[] Thus, the intention is that approximately 95% of the population scores within two SDs of the mean, i.e. has an IQ between 70 and 130. IQ scores have been shown to be associated with such factors as morbidity and mortality,[3] parental social status,[4] and, to a substantial degree, biological parental IQ. While the heritability of IQ has been investigated for nearly a century, there is still debate about the significance of heritability estimates[5][6] and the mechanisms of inheritance.[] IQ scores are used as predictors of educational achievement, special needs, job performance and income. They are also used to study IQ distributions in populations and the correlations between IQ and other variables. The average IQ scores for many populations have been rising at an average rate of three points per decade since the early 20th century, a phenomenon called the Flynn effect. It is disputed whether these changes in scores reflect real changes in intellectual abilities.

History
Early history
The first large-scale mental test may have been the imperial examination system in China. According to psychologist Robert Sternberg, the ancient Chinese game known in the West as the tangram was designed to evaluate a person's intelligence, along with the game jiulianhuan or nine linked rings.[] Sternberg states that it is considered "the earliest psychological test in the world," although one made for entertainment rather than analysis.[] Modern mental testing began in France in the 19th century. It contributed to separating mental retardation from mental illness and reducing the neglect, torture, and ridicule heaped on both groups.[7] Englishman Francis Galton coined the terms psychometrics and eugenics, and developed a method for measuring intelligence based on nonverbal sensory-motor tests. It was initially popular, but was abandoned after the discovery that it had no relationship to outcomes such as college grades.[7][8] French psychologist Alfred Binet, together with psychologists Victor Henri and Théodore Simon, after about 15 years of development, published the Binet-Simon test in 1905, which focused on verbal abilities. It was intended to identify mental retardation in school children.[7] The score on the Binet-Simon scale would reveal the child's mental age. For example, a six-year-old child who passed all the tasks usually passed by six-year-olds, but nothing beyond, would have a mental age that exactly matched his chronological age, 6.0 (Fancher, 1985). In Binet's view, there were limitations with the scale and he stressed what he saw as the remarkable diversity of intelligence and the subsequent need to study it using qualitative, as opposed to quantitative, measures (White, 2000). American psychologist Henry H. Goddard published a translation of it in 1910. The eugenics movement in the USA seized on it as a means to give them credibility in diagnosing mental retardation, and thousands of American women, most of them poor African Americans, were forcibly sterilized based on their scores on IQ tests, often without their consent or knowledge.[9] American psychologist Lewis Terman at Stanford University revised the Binet-Simon scale, which resulted in the Stanford-Binet Intelligence Scales (1916). It became the most popular test in the United States for decades.[7][10][11][12]


General factor (g)


The many different kinds of IQ tests use a wide variety of methods. Some tests are visual, some are verbal, some tests only use abstract-reasoning problems, and some tests concentrate on arithmetic, spatial imagery, reading, vocabulary, memory or general knowledge. The psychologist Charles Spearman in 1904 made the first formal factor analysis of correlations between the tests. He found a single common factor explained the positive correlations among tests. This is an argument still accepted in principle by many psychometricians. Spearman named it g for "general factor" and labelled the smaller, specific factors or abilities for specific areas s. In any collection of IQ tests, by definition the test that best measures g is the one that has the highest correlations with all the others. Most of these g-loaded tests typically involve some form of abstract reasoning. Therefore, Spearman and others have regarded g as the (perhaps genetically determined) real essence of intelligence. This is still a common but not universally accepted view. Other factor analyses of the data, with different results, are possible. Some psychometricians regard g as a statistical artifact. One of the best measures of g is Raven's Progressive Matrices which is a test of visual reasoning.[][7]
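The idea of a single common factor can be illustrated with a small numerical sketch. The following Python fragment is only an illustration: the correlation matrix is invented, and a principal-components shortcut stands in for the full factor-analytic methods that Spearman and later psychometricians actually used.

import numpy as np

# Correlation matrix for four hypothetical subtests (values invented for illustration).
R = np.array([
    [1.00, 0.60, 0.55, 0.50],
    [0.60, 1.00, 0.50, 0.45],
    [0.55, 0.50, 1.00, 0.40],
    [0.50, 0.45, 0.40, 1.00],
])

# Loadings on the largest common component are proportional to the leading
# eigenvector of R, scaled by the square root of its eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(R)                     # ascending order
loadings = np.abs(eigenvectors[:, -1]) * np.sqrt(eigenvalues[-1])

print("loadings on the common factor:", np.round(loadings, 2))
print("share of total variance explained:", round(eigenvalues[-1] / len(R), 2))

A subtest with a higher loading correlates more strongly with the shared factor, which is the sense in which one test can be said to be more g-loaded than another.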

The War Years


During World War I, a way was needed to evaluate and assign recruits. This led to the rapid development of several mental tests. The testing generated controversy and much public debate. Nonverbal or "performance" tests were developed for those who could not speak English or were suspected of malingering.[7] After the war, positive publicity on army psychological testing helped to make psychology a respected field.[13] Subsequently, there was an increase in jobs and funding in psychology.[14] Group intelligence tests were developed and became widely used in schools and industry.[] L.L. Thurstone argued for a model of intelligence that included seven unrelated factors (verbal comprehension, word fluency, number facility, spatial visualization, associative memory, perceptual speed, and inductive reasoning). While not widely used, it influenced later theories.[7] David Wechsler produced the first version of his test in 1939. It gradually became more popular and overtook the Binet in the 1960s. It has been revised several times, as is common for IQ tests, to incorporate new research. One explanation is that psychologists and educators wanted more information than the single score from the Binet; Wechsler's 10+ subtests provided this. Another is that the Binet focused on verbal abilities, while the Wechsler also included nonverbal abilities. The Binet has also been revised several times and is now similar to the Wechsler in several aspects, but the Wechsler continues to be the most popular test in the United States.[7]

Cattell-Horn-Carroll theory
Raymond Cattell (1941) proposed two types of cognitive abilities in a revision of Spearman's concept of general intelligence. Fluid intelligence (Gf) was hypothesized as the ability to solve novel problems by using reasoning, and crystallized intelligence (Gc) was hypothesized as a knowledge-based ability that was very dependent on education and experience. In addition, fluid intelligence was hypothesized to decline with age, while crystallized intelligence was largely resistant to decline. The theory was almost forgotten, but was revived by his student John L. Horn (1966), who later argued Gf and Gc were only two among several factors, and who eventually identified 9 or 10 broad abilities. The theory continued to be called Gf-Gc theory.[7] John B. Carroll (1993), after a comprehensive reanalysis of earlier data, proposed the Three Stratum theory, which is a hierarchical model with three levels. The bottom stratum consists of narrow abilities that are highly specialized (e.g., induction, spelling ability). The second stratum consists of broad abilities. Carroll identified eight second-stratum abilities. Carroll accepted Spearman's concept of general intelligence, for the most part, as a representation of the uppermost, third stratum.[][] More recently (1999), a merging of the Gf-Gc theory of Cattell and Horn with Carroll's Three-Stratum theory has led to the Cattell-Horn-Carroll theory. It has greatly influenced many of the current broad IQ tests.[7] It is argued that this reflects much of what is known about intelligence from research. A hierarchy of factors is used; g is at the top. Under it are 10 broad abilities that in turn are subdivided into 70 narrow abilities. The broad abilities are:[7]

Fluid intelligence (Gf): the broad ability to reason, form concepts, and solve problems using unfamiliar information or novel procedures.

Crystallized intelligence (Gc): the breadth and depth of a person's acquired knowledge, the ability to communicate one's knowledge, and the ability to reason using previously learned experiences or procedures.

Quantitative reasoning (Gq): the ability to comprehend quantitative concepts and relationships and to manipulate numerical symbols.

Reading and writing ability (Grw): basic reading and writing skills.

Short-term memory (Gsm): the ability to apprehend and hold information in immediate awareness, and then use it within a few seconds.

Long-term storage and retrieval (Glr): the ability to store information and fluently retrieve it later in the process of thinking.

Visual processing (Gv): the ability to perceive, analyze, synthesize, and think with visual patterns, including the ability to store and recall visual representations.

Auditory processing (Ga): the ability to analyze, synthesize, and discriminate auditory stimuli, including the ability to process and discriminate speech sounds that may be presented under distorted conditions.

Processing speed (Gs): the ability to perform automatic cognitive tasks, particularly when measured under pressure to maintain focused attention.

Decision/reaction time/speed (Gt): the immediacy with which an individual can react to stimuli or a task (typically measured in seconds or fractions of seconds; it is not to be confused with Gs, which typically is measured in intervals of 2 to 3 minutes). See Mental chronometry.

Modern tests do not necessarily measure all of these broad abilities. For example, Gq and Grw may be seen as measures of school achievement and not IQ.[7] Gt may be difficult to measure without special equipment. g was earlier often subdivided into only Gf and Gc, which were thought to correspond to the nonverbal or performance subtests and verbal subtests in earlier versions of the popular Wechsler IQ test. More recent research has shown the situation to be more complex.[7] Modern comprehensive IQ tests no longer give only a single score. Although they still give an overall score, they now also give scores for many of these more restricted abilities, identifying particular strengths and weaknesses of an individual.[7]


Other theories
J.P. Guilford's Structure of Intellect (1967) model used three dimensions which, when combined, yielded a total of 120 types of intelligence. It was popular in the 1970s and early 1980s, but faded due to both practical problems and theoretical criticisms.[7] Alexander Luria's earlier work on neuropsychological processes led to the PASS theory (1997). It argued that only looking at one general factor was inadequate for researchers and clinicians who worked with learning disabilities, attention disorders, mental retardation, and interventions for such disabilities. The PASS model covers four kinds of processes (planning process, attention/arousal process, simultaneous processing, and successive processing). The planning processes involve decision making, problem solving, and performing activities, and require goal setting and self-monitoring. The attention/arousal process involves selectively attending to a particular stimulus, ignoring distractions, and maintaining vigilance. Simultaneous processing involves the integration of stimuli into a group and requires the observation of relationships. Successive processing involves the integration of stimuli into serial order. The planning and attention/arousal components come from structures located in the frontal lobe, and the simultaneous and successive processes come from structures located in the posterior region of the cortex.[][][] It has influenced some recent IQ tests, and been seen as a complement to the Cattell-Horn-Carroll theory described above.[7]


Modern tests
Well-known modern IQ tests include Raven's Progressive Matrices, Wechsler Adult Intelligence Scale, Wechsler Intelligence Scale for Children, Stanford-Binet, Woodcock-Johnson Tests of Cognitive Abilities, and Kaufman Assessment Battery for Children. Approximately 95% of the population have scores within two standard deviations (SD) of the mean. If one SD is 15 points, as is common in almost all modern tests, then 95% of the population are within a range of 70 to 130, and 98% are below 131. Alternatively, two-thirds of the population have IQ scores within one SD of the mean, i.e. within the range 85-115. IQ scales are ordinally scaled.[15][16][17][18] While one standard deviation is 15 points, and two SDs are 30 points, and so on, this does not imply that mental ability is linearly related to IQ, such that IQ 50 means half the cognitive ability of IQ 100. In particular, IQ points are not percentage points. The correlation between IQ test results and achievement test results is about 0.7.[7][19]
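Because modern IQ scores are scaled to a normal distribution with a mean of 100 and an SD of 15, the population shares quoted above can be reproduced directly from the normal curve. A minimal sketch using only the Python standard library (the specific cut-off scores are arbitrary examples):

from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # deviation-IQ convention

print(f"share between 70 and 130: {iq.cdf(130) - iq.cdf(70):.1%}")   # about 95%
print(f"share between 85 and 115: {iq.cdf(115) - iq.cdf(85):.1%}")   # about 68%, i.e. two-thirds
print(f"percentile rank of IQ 130: {iq.cdf(130):.1%}")               # roughly the 98th percentile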

Mental age vs. modern method


German psychologist William Stern proposed a method of scoring children's intelligence tests in 1912. He calculated what he called an Intelligenz-Quotient score, or IQ, as the quotient of the 'mental age' (the age group which scored such a result on average) of the test-taker and the 'chronological age' of the test-taker, multiplied by 100. Terman used this system for the first version of the Stanford-Binet Intelligence Scales.[] This method has several problems, such as the fact that it cannot be used to score adults. Wechsler introduced a different procedure for his test that is now used by almost all IQ tests.[20] The IQs of a large enough population are calculated so they conform to a normal distribution with a mean of 100 and a standard deviation of 15. When an IQ test is constructed, a standardization sample representative of the general population takes the test. The median result is defined to be equivalent to 100 IQ points. In almost all modern tests, a standard deviation of the results is defined to be equivalent to 15 IQ points. When a subject takes an IQ test, the result is ranked compared to the results of the standardization sample and the subject is given an IQ score equal to those with the same test result in the standardization sample.

The values of 100 and 15 were chosen to produce scores roughly similar to those of the older type of test. Likely as a part of the rivalry between the Binet and the Wechsler, the Binet until 2003 used 16 for one SD, causing considerable confusion. Today, almost all tests use 15 for one SD. Modern scores are sometimes referred to as "deviation IQs," while older-method, age-specific scores are referred to as "ratio IQs."[7][21]
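The contrast between the two scoring methods can be sketched as follows. The norm-group mean and SD below are invented for illustration, and the deviation calculation assumes a normally distributed standardization sample rather than the empirical ranking an actual test publisher would use.

from statistics import NormalDist

def ratio_iq(mental_age, chronological_age):
    # Stern's original quotient: mental age divided by chronological age, times 100.
    return mental_age / chronological_age * 100

def deviation_iq(raw_score, norm_mean, norm_sd):
    # Modern method: find the score's standing in the norm group, then map that
    # standing onto a scale with mean 100 and SD 15.
    percentile = NormalDist(norm_mean, norm_sd).cdf(raw_score)
    return NormalDist(100, 15).inv_cdf(percentile)

# A ten-year-old performing like a typical twelve-year-old (ratio method).
print(ratio_iq(mental_age=12, chronological_age=10))                # 120.0

# Raw score of 34 against an invented norm group with mean 30 and SD 5.
print(round(deviation_iq(raw_score=34, norm_mean=30, norm_sd=5)))   # 112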


Reliability and validity


Psychometricians generally regard IQ tests as having high statistical reliability.[citation needed] A high reliability implies that, although test-takers may have varying scores when taking the same test on differing occasions, and may have varying scores when taking different IQ tests at the same age, the scores generally agree with one another and across time. A test-taker's score on any one IQ test is surrounded by an error band that shows, to a specified degree of confidence, what the test-taker's true score is likely to be. For modern tests, the standard error of measurement is about three points; in other words, the odds are about two out of three that a person's true IQ is in the range from three points above to three points below the test IQ. Another description is that there is a 95% chance the true IQ is in the range from four to five points above to four to five points below the test IQ, depending on the test in question (a brief numerical sketch of this error band follows the table below). Clinical psychologists generally regard them as having sufficient statistical validity for many clinical purposes.[7][22][23]

IQ scores can differ to some degree for the same individual on different IQ tests (age 12-13 years). (IQ score table data and pupil pseudonyms adapted from description of KABC-II norming study cited in Kaufman 2009.[7])

Pupil    KABC-II  WISC-III  WJ-III
Asher    90       95        111
Brianna  125      110       105
Colin    100      93        101
Danica   116      127       118
Elpha    93       105       93
Fritz    106      105       105
Georgi   95       100       90
Hector   112      113       103
Imelda   104      96        97
Jose     101      99        86
Keoku    81       78        75
Leo      116      124       102
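The error band described in the paragraph above follows from the classical-test-theory relation between reliability and the standard error of measurement. A minimal sketch, with an illustrative reliability value chosen so that the SEM comes out near the three points quoted above:

import math

def standard_error_of_measurement(sd, reliability):
    # Classical test theory: SEM = SD * sqrt(1 - reliability).
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=15, reliability=0.96)   # illustrative reliability
print(round(sem, 1))                                           # 3.0 points

observed = 110
print(observed - sem, observed + sem)   # roughly a two-in-three band: 107.0 to 113.0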

Flynn effect
Since the early 20th century, raw scores on IQ tests have increased in most parts of the world.[][][] When a new version of an IQ test is normed, the standard scoring is set so performance at the population median results in a score of IQ 100. The phenomenon of rising raw score performance means if test-takers are scored by a constant standard scoring rule, IQ test scores have been rising at an average rate of around three IQ points per decade. This phenomenon was named the Flynn effect in the book The Bell Curve after James R. Flynn, the author who did the most to bring this phenomenon to the attention of psychologists.[][]
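Since the effect amounts to roughly three points per decade, the score inflation expected from an outdated norm can be estimated with a trivial calculation; the 20-year figure below is an arbitrary example, not drawn from the text.

years_since_norming = 20
flynn_rate = 3 / 10               # about three IQ points per decade
print(years_since_norming * flynn_rate)   # 6.0 points of inflation relative to current norms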

Researchers have been exploring the issue of whether the Flynn effect is equally strong on performance of all kinds of IQ test items, whether the effect may have ended in some developed nations, whether or not there are social subgroup differences in the effect, and what possible causes of the effect might be.[] Flynn's observations have prompted much new research in psychology and "demolish some long-cherished beliefs, and raise a number of other interesting issues along the way."[]


IQ and age
IQ can change to some degree over the course of childhood.[24] However, in one longitudinal study, the mean IQ scores of tests at ages 17 and 18 were correlated at r=.86 with the mean scores of tests at ages five, six and seven, and at r=.96 with the mean scores of tests at ages 11, 12 and 13.[25] IQ scores for children are relative to children of a similar age. That is, a child of a certain age does not do as well on the tests as an older child or an adult with the same IQ. But, relative to persons of a similar age, or other adults in the case of adults, they do equally well if the IQ scores are the same.[25] To convert a child's IQ score into an adult score, a rescaling calculation is made in which the number 16 is used to indicate the age at which the IQ supposedly reaches its peak.[26] For decades practitioners' handbooks and textbooks on IQ testing have reported IQ declines with age after the beginning of adulthood. However, later researchers pointed out this phenomenon is related to the Flynn effect and is in part a cohort effect rather than a true aging effect. A variety of studies of IQ and aging have been conducted since the norming of the first Wechsler Intelligence Scale drew attention to IQ differences in different age groups of adults. The current consensus is that fluid intelligence generally declines with age after early adulthood, while crystallized intelligence remains intact. Both cohort effects (the birth year of the test-takers) and practice effects (test-takers taking the same form of IQ test more than once) must be controlled to gain accurate data. It is unclear whether any lifestyle intervention can preserve fluid intelligence into older ages.[27] The exact peak age of fluid intelligence or crystallized intelligence remains elusive. Cross-sectional studies usually show that fluid intelligence in particular peaks at a relatively young age (often in early adulthood), while longitudinal data mostly show that intelligence is stable until mid-adulthood or later. Subsequently, intelligence seems to decline slowly.[]

Genetics and environment


Environmental and genetic factors play a role in determining IQ. Their relative importance has been the subject of much research and debate.

Heritability
Heritability is defined as the proportion of variance in a trait which is attributable to genotype within a defined population in a specific environment. A number of points must be considered when interpreting heritability.[28] Heritability measures the proportion of 'variation' in a trait that can be attributed to genes, and not the proportion of a trait caused by genes. The value of heritability can change if the impact of environment (or of genes) in the population is substantially altered. A high heritability of a trait does not mean environmental effects, such as learning, are not involved. Since heritability increases during childhood and adolescence, one should be cautious in drawing conclusions regarding the role of genetics and environment from studies where the participants are not followed until they are adults. Studies have found the heritability of IQ in adult twins to be 0.7 to 0.8 and in child twins to be about 0.45 in the Western world.[25][29][30] It may seem reasonable to expect that genetic influences on traits like IQ should become less important as one gains experience with age. However, the opposite occurs. Heritability measures in infancy are as low as 0.2, around 0.4 in middle childhood, and as high as 0.8 in adulthood.[] One proposed explanation is that people with different genes tend to reinforce the effects of those genes, for example by seeking out different environments.[25] Debate is ongoing about whether these heritability estimates are too high due to not adequately considering various factors, such as that the environment may be relatively more important in families with low socioeconomic status, or the effect of the maternal (fetal) environment. Recent research suggests that the molecular genetics of psychology and social science requires approaches that go beyond the examination of candidate genes.[31]
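One simple way to see how twin data translate into a heritability figure is Falconer's formula, a standard first approximation that is not described in the text above; the twin correlations used here are invented, chosen only to be roughly consistent with the adult estimates quoted above.

def falconer(r_mz, r_dz):
    # Falconer's approximation from identical (MZ) and fraternal (DZ) twin correlations.
    h2 = 2 * (r_mz - r_dz)    # additive genetic share of variance
    c2 = 2 * r_dz - r_mz      # shared (family) environment
    e2 = 1 - r_mz             # non-shared environment and measurement error
    return h2, c2, e2

print(falconer(r_mz=0.85, r_dz=0.45))   # (0.8, 0.05, 0.15): heritability of about 0.8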


Shared family environment


Family members have aspects of environments in common (for example, characteristics of the home). This shared family environment accounts for 0.25 to 0.35 of the variation in IQ in childhood. By late adolescence, it is quite low (zero in some studies). The effect for several other psychological traits is similar. These studies have not looked at the effects of extreme environments, such as in abusive families.[25][32][][33]

Non-shared family environment and environment outside the family


Although parents treat their children differently, such differential treatment explains only a small amount of nonshared environmental influence. One suggestion is that children react differently to the same environment due to different genes. More likely influences may be the impact of peers and other experiences outside the family.[25][]

Individual genes
A very large proportion of the over 17,000 human genes are thought to have an impact on the development and functionality of the brain.[34] A number of individual genes have been reported to be associated with IQ; examples include CHRM2, microcephalin, and ASPM. However, Deary and colleagues (2009) argued that none of these findings has been replicated,[35] a conclusion supported by Chabris et al. (2012).[36] Recently, FNBP1L polymorphisms, specifically the SNP rs236330, have been associated with normally varying intelligence differences in adults[] and in children.[37]

Gene-environment interaction
David Rowe reported an interaction of genetic effects with socioeconomic status (SES), such that heritability was high in high-SES families, but much lower in low-SES families.[] This has been replicated in infants,[38] children,[39] and adolescents[40] in the US, though not outside the US; for instance, a reverse result was reported in the UK.[] Dickens and Flynn (2001) have argued that genes for high IQ initiate environment-shaping feedback, as genetic effects cause bright children to seek out more stimulating environments that further increase IQ. In their model, environmental effects decay over time (the model could be adapted to include possible factors, like nutrition in early childhood, that may cause permanent effects). The Flynn effect can be explained by a generally more stimulating environment for all people. The authors suggest that programs aiming to increase IQ would be most likely to produce long-term IQ gains if they caused children to persist in seeking out cognitively demanding experiences.[][41]


Interventions
In general, educational interventions, such as those described below, have shown short-term effects on IQ, but long-term follow-up is often missing. For example, in the US very large intervention programs such as the Head Start Program have not produced lasting gains in IQ scores. More intensive, but much smaller, projects such as the Abecedarian Project have reported lasting effects, often on socioeconomic status variables, rather than IQ.[25] A placebo-controlled double-blind experiment found that vegetarians who took 5 grams of creatine per day for six weeks showed a significant improvement on two separate tests of fluid intelligence, Raven's Progressive Matrices and the backward digit span test from the WAIS. The treatment group was able to repeat longer sequences of numbers from memory and had higher overall IQ scores than the control group. The researchers concluded that "supplementation with creatine significantly increased intelligence compared with placebo."[42] A subsequent study found that creatine supplements improved cognitive ability in the elderly.[43] However, a study on young adults (0.03 g/kg/day for six weeks, e.g., 2 g/day for a 150-pound individual) failed to find any improvements.[44] Recent studies have shown that training in using one's working memory may increase IQ. A study on young adults published in April 2008 by a team from the Universities of Michigan and Bern supports the possibility of the transfer of fluid intelligence from specifically designed working memory training.[45][] Further research will be needed to determine the nature, extent, and duration of the proposed transfer. Among other questions, it remains to be seen whether the results extend to other kinds of fluid intelligence tests than the matrix test used in the study, and if so, whether, after training, fluid intelligence measures retain their correlation with educational and occupational achievement or if the value of fluid intelligence for predicting performance on other tasks changes. It is also unclear whether the training effects are durable over extended periods of time.[46]

Music and IQ
Musical training in childhood has been found to correlate with higher than average IQ.[] A 2004 study indicated that 6-year-old children who received musical training (voice or piano lessons) had an average increase in IQ of 7.0 points, while children who received alternative training (i.e. drama) or no training had an average increase in IQ of only 4.3 points (which may be a consequence of the children entering grade school), as indicated by full scale IQ. Children were tested using the Wechsler Intelligence Scale for Children-Third Edition, the Kaufman Test of Educational Achievement, and the Parent Rating Scale of the Behavioral Assessment System for Children.[] Listening to classical music has also been reported to increase IQ, specifically spatial ability. In 1994 Frances Rauscher and Gordon Shaw reported that college students who listened to 10 minutes of Mozart's Sonata for Two Pianos showed an increase in IQ of 8 to 9 points on the spatial subtest of the Stanford-Binet Intelligence Scale.[47] The phenomenon was dubbed the Mozart effect. Multiple attempted replications (e.g.[48]) have shown that this is at best a short-term effect (lasting no longer than 10 to 15 minutes), and is not related to IQ increase.[49]

Music lessons
In 2004, Schellenberg devised an experiment to test his hypothesis that music lessons can enhance the IQ of children. He had a sample of 144 six-year-old children who were put into 4 groups: keyboard lessons, vocal lessons, drama lessons, or no lessons at all, for 36 weeks. The children's IQ was measured both before and after the lessons had taken place using the Wechsler Intelligence Scale for Children-Third Edition, the Kaufman Test of Educational Achievement, and the Parent Rating Scale of the Behavioral Assessment System for Children. All four groups had increases in IQ, most likely resulting from the entrance into grade school. The notable difference of the two music groups compared to the two control groups was a slightly higher increase in IQ. The children in the control groups on average had an increase in IQ of 4.3 points, while the increase in IQ of the music groups was 7.0 points. Though the increases in IQ were not dramatic, one can still conclude that music lessons do have a positive effect for children, if taken at a young age. It is hypothesized that improvements in IQ occur after music lessons because the lessons encourage multiple experiences which generate progression in a wide range of abilities for the children. Testing this hypothesis, however, has proven difficult.[50] Another test, also performed by Schellenberg, examined the effects of musical training in adulthood. He had two groups of adults, one group who were musically trained and another group who were not. He administered tests of intelligence quotient and emotional intelligence to the trained and non-trained groups and found that the trained participants had an advantage in IQ over the untrained subjects even with gender, age, and environmental factors (e.g. income, parents' education) held constant. The two groups, however, scored similarly on the emotional intelligence test. The test results (like the previous results) show that there is a positive correlation between musical training and IQ, but it is not evident that musical training has a positive effect on emotional intelligence.[51]


IQ and brain anatomy


Several neurophysiological factors have been correlated with intelligence in humans, including the ratio of brain weight to body weight and the size, shape and activity level of different parts of the brain. Specific features that may affect IQ include the size and shape of the frontal lobes, the amount of blood and chemical activity in the frontal lobes, the total amount of gray matter in the brain, the overall thickness of the cortex and the glucose metabolic rate.

Health and IQ
Health is important in understanding differences in IQ test scores and other measures of cognitive ability. Several factors can lead to significant cognitive impairment, particularly if they occur during pregnancy and childhood when the brain is growing and the bloodbrain barrier is less effective. Such impairment may sometimes be permanent, sometimes be partially or wholly compensated for by later growth. [citation needed] Developed nations have implemented several health policies regarding nutrients and toxins known to influence cognitive function. These include laws requiring fortification of certain food products and laws establishing safe levels of pollutants (e.g. lead, mercury, and organochlorides). Improvements in nutrition, and in public policy in general, have been implicated in worldwide IQ increases. [citation needed] Cognitive epidemiology is a field of research that examines the associations between intelligence test scores and health. Researchers in the field argue that intelligence measured at an early age is an important predictor of later health and mortality differences.

Social outcomes
Intelligence is a better predictor of educational and work success than any other single score.[] Some measures of educational aptitude are essentially IQ tests; for instance, Frey and Detterman (2004) reported a correlation of 0.82 between g (the general intelligence factor) and SAT scores,[52] and another study found a correlation of 0.81 between g and GCSE scores.[] Correlations between IQ scores (general cognitive ability) and achievement test scores are reported to be 0.81 by Deary and colleagues, with the explained variance ranging "from 58.6% in Mathematics and 48% in English to 18.1% in Art and Design".[]


School performance
The American Psychological Association's report "Intelligence: Knowns and Unknowns" states that wherever it has been studied, children with high scores on tests of intelligence tend to learn more of what is taught in school than their lower-scoring peers. The correlation between IQ scores and grades is about .50. This means that the explained variance is 25%. Achieving good grades depends on many factors other than IQ, such as "persistence, interest in school, and willingness to study" (p. 81).[25] It has been found that the correlation of IQ with school performance depends on the IQ measurement used. For undergraduate students, the Verbal IQ as measured by the WAIS-R has been found to correlate significantly (0.53) with the GPA of the last 60 hours. In contrast, the Performance IQ correlation with the same GPA was only 0.22 in the same study.[53]
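The "explained variance" figures quoted in this and the following sections are simply the squares of the reported correlations, as the short calculation below illustrates (the correlation values are those cited in the surrounding text):

# Explained variance is the square of the correlation coefficient.
for label, r in [("IQ and grades", 0.50),
                 ("IQ and achievement tests", 0.70),
                 ("IQ and income (Jensen's figure)", 0.40),
                 ("IQ and crime", -0.20)]:
    print(f"{label}: r = {r:+.2f}, explained variance = {r ** 2:.0%}")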

Job performance
According to Schmidt and Hunter, "for hiring employees without previous experience in the job the most valid predictor of future performance is general mental ability."[] The validity of IQ as a predictor of job performance is above zero for all work studied to date, but varies with the type of job and across different studies, ranging from 0.2 to 0.6.[] The correlations were higher when the unreliability of measurement methods was controlled for.[25] While IQ is more strongly correlated with reasoning and less so with motor function,[54] IQ-test scores predict performance ratings in all occupations.[] That said, for highly qualified activities (research, management) low IQ scores are more likely to be a barrier to adequate performance, whereas for minimally-skilled activities, athletic strength (manual strength, speed, stamina, and coordination) is more likely to influence performance.[] It is largely through the quicker acquisition of job-relevant knowledge that higher IQ mediates job performance. In establishing a causal direction to the link between IQ and work performance, longitudinal studies by Watkins and others suggest that IQ exerts a causal influence on future academic achievement, whereas academic achievement does not substantially influence future IQ scores.[55] Treena Eileen Rohde and Lee Anne Thompson write that general cognitive ability, but not specific ability scores, predicts academic achievement, with the exception that processing speed and spatial ability predict performance on the SAT math beyond the effect of general cognitive ability.[56] The US military has minimum enlistment standards at about the IQ 85 level. There have been two experiments with lowering this to 80, but in both cases these men could not master soldiering well enough to justify their costs.[57] Some US police departments have set a maximum IQ score for new officers (for example, 125 in New London, CT), under the argument that those with overly high IQs will become bored and exhibit high turnover in the job. This policy has been challenged as discriminatory, but upheld by at least one US District court.[58] The American Psychological Association's report "Intelligence: Knowns and Unknowns" states that since the explained variance is 29%, other individual characteristics such as interpersonal skills, aspects of personality, etc. are probably of equal or greater importance, but at this point there are no equally reliable instruments to measure them.[25]

Income
While it has been suggested that "in economic terms it appears that the IQ score measures something with decreasing marginal value. It is important to have enough of it, but having lots and lots does not buy you that much",[59][60] large-scale longitudinal studies indicate an increase in IQ translates into an increase in performance at all levels of IQ: i.e., ability and job performance are monotonically linked at all IQ levels.[61] Charles Murray, coauthor of The Bell Curve, found that IQ has a substantial effect on income independently of family background.[62] The link from IQ to wealth is much less strong than that from IQ to job performance. Some studies indicate that IQ is unrelated to net worth.[63][64]

The American Psychological Association's 1995 report Intelligence: Knowns and Unknowns stated that IQ scores accounted for (explained variance) about a quarter of the social status variance and one-sixth of the income variance. Statistical controls for parental SES eliminate about a quarter of this predictive power. Psychometric intelligence appears as only one of a great many factors that influence social outcomes.[25] Some studies claim that IQ only accounts for (explains) a sixth of the variation in income because many studies are based on young adults, many of whom have not yet reached their peak earning capacity, or even their education. On page 568 of The g Factor, Arthur Jensen claims that although the correlation between IQ and income averages a moderate 0.4 (one sixth or 16% of the variance), the relationship increases with age, and peaks at middle age when people have reached their maximum career potential. In the book A Question of Intelligence, Daniel Seligman cites an IQ-income correlation of 0.5 (25% of the variance). A 2002 study[65] further examined the impact of non-IQ factors on income and concluded that an individual's location, inherited wealth, race, and schooling are more important as factors in determining income than IQ.


IQ and crime
The American Psychological Association's 1995 report Intelligence: Knowns and Unknowns stated that the correlation between IQ and crime was -0.2. It was -0.19 between IQ scores and number of juvenile offenses in a large Danish sample; with social class controlled, the correlation dropped to -0.17. A correlation of 0.20 means that the explained variance is less than 4%. It is important to realize that the causal links between psychometric ability and social outcomes may be indirect. Children with poor scholastic performance may feel alienated. Consequently, they may be more likely to engage in delinquent behavior, compared to other children who do well.[25] In his book The g Factor (1998), Arthur Jensen cited data which showed that, regardless of race, people with IQs between 70 and 90 have higher crime rates than people with IQs below or above this range, with the peak range being between 80 and 90. The 2009 Handbook of Crime Correlates stated that reviews have found that around eight IQ points, or 0.5 SD, separate criminals from the general population, especially for persistent serious offenders. It has been suggested that this simply reflects that "only dumb ones get caught" but there is similarly a negative relation between IQ and self-reported offending. That children with conduct disorder have lower IQ than their peers "strongly argues" for the theory.[66] A study of the relationship between US county-level IQ and US county-level crime rates found that higher average IQs were associated with lower levels of property crime, burglary, larceny rate, motor vehicle theft, violent crime, robbery, and aggravated assault. These results were not "confounded by a measure of concentrated disadvantage that captures the effects of race, poverty, and other social disadvantages of the county."[67]

Other correlations with IQ


In addition, IQ and its correlation to health, violent crime, gross state product, and government effectiveness are the subject of a 2006 paper in the publication Intelligence. The paper breaks down IQ averages by U.S. states using the federal government's National Assessment of Educational Progress math and reading test scores as a source.[68] The American Psychological Association's 1995 report Intelligence: Knowns and Unknowns stated that the correlations for most "negative outcome" variables are typically smaller than 0.20, which means that the explained variance is less than 4%.[25] Tambs et al.[69] found that occupational status, educational attainment, and IQ are individually heritable; and further found that "genetic variance influencing educational attainment ... contributed approximately one-fourth of the genetic variance for occupational status and nearly half the genetic variance for IQ." In a sample of U.S. siblings, Rowe et al.[70] report that the inequality in education and income was predominantly due to genes, with shared environmental factors playing a subordinate role.

Intelligence quotient A recent USA study connecting political views and intelligence has shown that the mean adolescent intelligence of young adults who identify themselves as "very liberal" is 106.4, while that of those who identify themselves as "very conservative" is 94.8.[71] Two other studies conducted in the UK reached similar conclusions.[72][73] There are also other correlations such as those between religiosity and intelligence and fertility and intelligence.


Real-life accomplishments

Average adult combined IQs associated with real-life accomplishments by various tests (WAIS-R, 1987; KAIT, 2000; K-BIT, 1992):[74][75]

MDs, JDs, or PhDs: 125+
College graduates: 112-115 (depending on the test)
1-3 years of college: 104-110
Clerical and sales workers: 100-105
High school graduates, skilled workers (e.g., electricians, cabinetmakers): 97-100
1-3 years of high school (completed 9-11 years of school): 90-95
Semi-skilled workers (e.g. truck drivers, factory workers): 90-95
Elementary school graduates (completed eighth grade): 90
Elementary school dropouts (completed 0-7 years of school): 80-85
Have 50/50 chance of reaching high school: 75

Average IQ of various occupational groups:[76]


Professional and technical: 112
Managers and administrators: 104
Clerical workers, sales workers, skilled workers, craftsmen, and foremen: 101
Semi-skilled workers (operatives, service workers, including private household): 92
Unskilled workers: 87


Type of work that can be accomplished:[74]


Adults can harvest vegetables, repair furniture: 60
Adults can do domestic work: 50

There is considerable variation within and overlap between these categories. People with high IQs are found at all levels of education and occupational categories. The biggest difference occurs for low IQs with only an occasional college graduate or professional scoring below 90.[7]

Group differences
Among the most controversial issues related to the study of intelligence is the observation that intelligence measures such as IQ scores vary between ethnic and racial groups and sexes. While there is little scholarly debate about the existence of some of these differences, their causes remain highly controversial both within academia and in the public sphere.

Sex
Most IQ tests are constructed so that there are no overall score differences between females and males. Because environmental factors affect brain activity and behavior, it can be difficult for researchers to assess whether any differences that are found are innate. Areas where differences have been found include verbal and mathematical ability.

Race
The 1996 Task Force investigation on Intelligence sponsored by the American Psychological Association concluded that there are significant variations in IQ across races.[25] The problem of determining the causes underlying this variation relates to the question of the contributions of "nature and nurture" to IQ. Psychologists such as Alan S. Kaufman[77] and Nathan Brody[78] and statisticians such as Bernie Devlin[79] argue that there are insufficient data to conclude that this is because of genetic influences. One of the most notable researchers arguing for a strong genetic influence on these average score differences is Arthur Jensen. In contrast, other researchers such as Richard Nisbett argue that environmental factors can explain all of the average group differences.[80]

Public policy
In the United States, certain public policies and laws regarding military service,[81][82] education, public benefits,[83] capital punishment,[84] and employment incorporate an individual's IQ into their decisions. However, in the case of Griggs v. Duke Power Co. in 1971, for the purpose of minimizing employment practices that disparately impacted racial minorities, the U.S. Supreme Court banned the use of IQ tests in employment, except when linked to job performance via a job analysis. Internationally, certain public policies, such as improving nutrition and prohibiting neurotoxins, have as one of their goals raising, or preventing a decline in, intelligence. A diagnosis of mental retardation is in part based on the results of IQ testing. Borderline intellectual functioning is a categorization in which a person has below-average cognitive ability (an IQ of 71-85), but the deficit is not as severe as mental retardation (70 or below). In the United Kingdom, the eleven-plus exam, which incorporated an intelligence test, was used from 1945 to decide, at eleven years old, which type of school a child should attend. It has been much less used since the widespread introduction of comprehensive schools.


Criticism and views


Relation between IQ and intelligence
IQ is the most researched approach to intelligence and by far the most widely used in practical settings. However, although IQ attempts to measure some notion of intelligence, it may fail to act as an accurate measure of "intelligence" in its broadest sense. IQ tests examine only particular areas embodied by the broadest notion of "intelligence", failing to account for certain other areas associated with "intelligence", such as creativity or emotional intelligence. Critics such as Keith Stanovich do not dispute the stability of IQ test scores or the fact that they predict certain forms of achievement rather effectively. They do argue, however, that to base a concept of intelligence on IQ test scores alone is to ignore many important aspects of mental ability.[4][85]

Criticism of g
Some scientists dispute IQ entirely. In The Mismeasure of Man (1996), paleontologist Stephen Jay Gould criticized IQ tests and argued that they were used for scientific racism. He argued that g was a mathematical artifact and criticized "...the abstraction of intelligence as a single entity, its location within the brain, its quantification as one number for each individual, and the use of these numbers to rank people in a single series of worthiness, invariably to find that oppressed and disadvantaged groups - races, classes, or sexes - are innately inferior and deserve their status" (pp. 24-25). Psychologist Peter Schönemann was also a persistent critic of IQ, calling it "the IQ myth". He argued that g is a flawed theory and that the high heritability estimates of IQ are based on false assumptions.[86] Psychologist Arthur Jensen rejected the criticism by Gould and also argued that even if g were replaced by a model with several intelligences, this would change the situation less than expected: all tests of cognitive ability would continue to be highly correlated with one another, and there would still be a black-white gap on cognitive tests.[2]

Test bias
The American Psychological Association's report Intelligence: Knowns and Unknowns stated that in the United States IQ tests as predictors of social achievement are not biased against African Americans since they predict future performance, such as school achievement, similarly to the way they predict future performance for Caucasians.[25] However, IQ tests may well be biased when used in other situations. A 2005 study stated that "differential validity in prediction suggests that the WAIS-R test may contain cultural influences that reduce the validity of the WAIS-R as a measure of cognitive ability for Mexican American students,"[87] indicating a weaker positive correlation relative to sampled white students. Other recent studies have questioned the culture-fairness of IQ tests when used in South Africa.[88][89] Standard intelligence tests, such as the Stanford-Binet, are often inappropriate for children with autism; the alternative of using developmental or adaptive skills measures are relatively poor measures of intelligence in autistic children, and may have resulted in incorrect claims that a majority of children with autism are mentally retarded.[90]


Outdated methodology
A 2006 article stated that contemporary psychological research often did not reflect substantial recent developments in psychometrics and "bears an uncanny resemblance to the psychometric state of the art as it existed in the 1950s."[91]

"Intelligence: Knowns and Unknowns"


In response to the controversy surrounding The Bell Curve, the American Psychological Association's Board of Scientific Affairs established a task force in 1995 to write a report on the state of intelligence research which could be used by all sides as a basis for discussion, "Intelligence: Knowns and Unknowns". The full text of the report is available through several websites.[25][92] In this paper the representatives of the association regret that IQ-related works are frequently written with a view to their political consequences: "research findings were often assessed not so much on their merits or their scientific standing as on their supposed political implications". The task force concluded that IQ scores do have high predictive validity for individual differences in school achievement. They confirm the predictive validity of IQ for adult occupational status, even when variables such as education and family background have been statistically controlled. They stated that individual differences in intelligence are substantially influenced by both genetics and environment. The report stated that a number of biological factors, including malnutrition, exposure to toxic substances, and various prenatal and perinatal stressors, result in lowered psychometric intelligence under at least some conditions. The task force agrees that large differences do exist between the average IQ scores of blacks and whites, saying: The cause of that differential is not known; it is apparently not due to any simple form of bias in the content or administration of the tests themselves. The Flynn effect shows that environmental factors can produce differences of at least this magnitude, but that effect is mysterious in its own right. Several culturally based explanations of the Black/ White IQ differential have been proposed; some are plausible, but so far none has been conclusively supported. There is even less empirical support for a genetic interpretation. In short, no adequate explanation of the differential between the IQ means of Blacks and Whites is presently available. The APA journal that published the statement, American Psychologist, subsequently published eleven critical responses in January 1997, several of them arguing that the report failed to examine adequately the evidence for partly genetic explanations.

Dynamic assessment
A notable and increasingly influential[93][94] alternative to the wide range of standard IQ tests originated in the writings of psychologist Lev Vygotsky (1896-1934) during his most mature and highly productive period of 1932-1934. The notion of the zone of proximal development, which he introduced in 1933, roughly a year before his death, served as the banner for his proposal to diagnose development as both the level of actual development, measured by the child's independent problem solving, and the level of proximal, or potential, development, measured in the situation of moderately assisted problem solving by the child.[95] The maximum level of complexity and difficulty of problem that the child is capable of solving under some guidance indicates the level of potential development. The difference between this higher level of potential development and the lower level of actual development indicates the zone of proximal development. Combining the two indexes, the level of actual development and the zone of proximal development, according to Vygotsky, provides a significantly more informative indicator of psychological development than the assessment of the level of actual development alone.[96][97] The ideas on the zone of development were later developed in a number of psychological and educational theories and practices, most notably under the banner of dynamic assessment, which focuses on the testing of learning and developmental potential[98][99][100] (for instance, in the work of Reuven Feuerstein and his associates,[101] who has criticized standard IQ testing for its putative assumption or acceptance of "fixed and immutable" characteristics of intelligence or cognitive functioning). Grounded in the developmental theories of Vygotsky and Feuerstein, who recognized that human beings are not static entities but are always in states of transition and transactional relationships with the world, dynamic assessment also received considerable support in the recent revisions of cognitive developmental theory by Joseph Campione, Ann Brown, and John D. Bransford and in theories of multiple intelligences by Howard Gardner and Robert Sternberg.[102]

High IQ societies
There are social organizations, some international, which limit membership to people who have scores as high as or higher than the 98th percentile on some IQ test or equivalent. Mensa International is perhaps the best known of these. There are other groups requiring a score above the 98th percentile.

Reference charts
IQ reference charts are tables suggested by test publishers to divide the range of intelligence scores into various categories.

References
Notes
[1] http:/ / icd9cm. chrisendres. com/ index. php?srchtype=procs& srchtext=94. 01& Submit=Search& action=search [2] http:/ / www. nlm. nih. gov/ medlineplus/ ency/ article/ 001912. htm [4] Intelligence: Knowns and Unknowns (http:/ / www. gifted. uconn. edu/ siegle/ research/ Correlation/ Intelligence. pdf) (Report of a Task Force established by the Board of Scientific Affairs of the American Psychological Association, Released August 7, 1995a slightly edited version was published in American Psychologist: ) [7] IQ Testing 101, Alan S. Kaufman, 2009, Springer Publishing Company, ISBN 0-8261-0629-3 ISBN 978-0-8261-0629-2 [9] Larson, Edward J. (1995). Sex, Race, and Science: Eugenics in the Deep South. Baltimore: Johns Hopkins University Press. pp. 74. [20] S.E. Embretson & S.P.Reise: Item response theory for psychologists, 2000. "...for many other psychological tests, normal distributions are achieved by normalizing procedures. For example, intelligence tests..." Found on: http:/ / books. google. se/ books?id=rYU7rsi53gQC& pg=PA29& lpg=PA29& dq=%22intelligence+ tests%22+ normalize& source=bl& ots=ZAIQEgaa6Q& sig=q-amDaZqx7Ix6mMkvIDMnj9M9O0& hl=sv& ei=lEEJTNqSIYWMOPqLuRE& sa=X& oi=book_result& ct=result& resnum=7& ved=0CEIQ6AEwBg#v=onepage& q& f=false [28] International Journal of Epidemiology, Volume 35, Issue 3, June 2006. See reprint of Leowontin's 1974 article "The analysis of variance and the analysis of causes" and 2006 commentaries: http:/ / ije. oxfordjournals. org/ content/ 35/ 3. toc [31] (http:/ / www. wjh. harvard. edu/ ~cfc/ Chabris2012a-FalsePositivesGenesIQ. pdf) [36] C. F. Chabris, B. M. Hebert, D. J. Benjamin, J. P. Beauchamp, D. Cesarini, M. J. H. M. van der Loos, M. Johannesson, P. K. E. Magnusson, P. Lichtenstein, C. S. Atwood, J. Freese, T. S. Hauser, R. M. Hauser, N. A. Christakis and D. I. Laibson. (2011). Most reported genetic associations with general intelligence are probably false positives. Psychological Science [37] B. Benyamin, B. Pourcain, O. S. Davis, G. Davies, N. K. Hansell, M. J. Brion, R. M. Kirkpatrick, R. A. Cents, S. Franic, M. B. Miller, C. M. Haworth, E. Meaburn, T. S. Price, D. M. Evans, N. Timpson, J. Kemp, S. Ring, W. McArdle, S. E. Medland, J. Yang, S. E. Harris, D. C. Liewald, P. Scheet, X. Xiao, J. J. Hudziak, E. J. de Geus, C. Wellcome Trust Case Control, V. W. Jaddoe, J. M. Starr, F. C. Verhulst, C. Pennell, H. Tiemeier, W. G. Iacono, L. J. Palmer, G. W. Montgomery, N. G. Martin, D. I. Boomsma, D. Posthuma, M. McGue, M. J. Wright, G. Davey Smith, I. J. Deary, R. Plomin and P. M. Visscher. (2013). Childhood intelligence is heritable, highly polygenic and associated with FNBP1L. Mol Psychiatry [38] E. M. Tucker-Drob, M. Rhemtulla, K. P. Harden, E. Turkheimer and D. Fask. (2011). Emergence of a Gene x Socioeconomic Status Interaction on Infant Mental Ability Between 10 Months and 2 Years. Psychological Science, 22, 125-33 (http:/ / dx. doi. org/ 10. 1177/ 0956797610392926) [40] K. P. Harden, E. Turkheimer and J. C. Loehlin. (2005). Genotype environment interaction in adolescents' cognitive ability. Behavior Genetics, 35, (http:/ / dx. doi. org/ 804-804) [48] C. Stough, B. Kerkin, T. C. Bates and G. Mangan. (1994). Music and spatial IQ. Personality & Individual Differences, 17, (http:/ / dx. doi. org/ 695) [49] C. F. Chabris. (1999). Prelude or requiem for the 'Mozart effect'? Nature, 400, author reply 827-828 (http:/ / dx. doi. org/ 826-827;) [57] Gottfredson, L. S. (2006). 
Social consequences of group differences in cognitive ability (Consequencias sociais das diferencas de grupo em habilidade cognitiva). In C. E. Flores-Mendoza & R. Colom (Eds.), Introducau a psicologia das diferencas individuais (pp. 433-456). Porto Allegre, Brazil: ArtMed Publishers. [58] ABC News, "Court OKs Barring High IQs for Cops", http:/ / abcnews. go. com/ US/ story?id=95836 [59] Detterman and Daniel, 1989.

[64] http:/ / www. sciencedaily. com/ releases/ 2007/ 04/ 070424204519. htm [66] Handbook of Crime Correlates; Lee Ellis, Kevin M. Beaver, John Wright; 2009; Academic Press [70] Rowe, D. C., W. J. Vesterdal, and J. L. Rodgers, "The Bell Curve Revisited: How Genes and Shared Environment Mediate IQ-SES Associations," University of Arizona, 1997 [74] Kaufman 2009, p.126. [76] Kaufman 2009, p.132. [85] The Waning of I.Q. (http:/ / select. nytimes. com/ 2007/ 09/ 14/ opinion/ 14brooks. html) by David Brooks, The New York Times [86] Psychometrics of Intelligence. K. Kemp-Leonard (ed.) Encyclopedia of Social Measurement, 3, 193-201: (http:/ / www2. psych. purdue. edu/ ~phs/ pdf/ 89. pdf) [93] Mindes, G. Assessing young children (http:/ / books. google. ca/ books?id=x41LAAAAYAAJ& q=dynamic+ assessment+ popularity#search_anchor). Merrill/Prentice Hall, 2003, p. 158 [94] Haywood, H. Carl & Lidz, Carol Schneider. Dynamic Assessment in Practice: Clinical And Educational Applications (http:/ / books. google. ca/ books?id=xQekS_oqGzoC& q=rapid+ growth+ of+ interest+ + in+ this+ topic#v=snippet& q=rapid growth of interest in this topic& f=false). Cambridge University Press, 2006, p. 1 [95] Vygotsky, L.S. (19332-34/1997). The Problem of Age (http:/ / www. marxists. org/ archive/ vygotsky/ works/ 1934/ problem-age. htm). in The Collected Works of L. S. Vygotsky, Volume 5, 1998, pp. 187-205 [96] Chaiklin, S. (2003). "The Zone of Proximal Development in Vygotsky's analysis of learning and instruction." In Kozulin, A., Gindis, B., Ageyev, V. & Miller, S. (Eds.) Vygotsky's educational theory and practice in cultural context. 39-64. Cambridge: Cambridge University [97] Zaretskii,V.K. (2009). The Zone of Proximal Development What Vygotsky Did Not Have Time to Write. Journal of Russian and East European Psychology, vol. 47, no. 6, NovemberDecember 2009, pp. 7093 [98] Sternberg, R.S. & Grigorenko, E.L. (2001). All testing is dynamic testing. Issues in Education, 7(2), 137-170 [99] Sternberg, R.J. & Grigorenko, E.L. (2002). Dynamic testing: The nature and measurement of learning potential. Cambridge (UK): University of Cambridge [100] Haywood, C.H. & Lidz, C.S. (2007). Dynamic assessment in practice: Clinical and educational applications. New York: Cambridge University Press [101] Feuerstein, R., Feuerstein, S., Falik, L & Rand, Y. (1979; 2002). Dynamic assessments of cognitive modifiability. ICELP Press, Jerusalem: Israel [102] Dodge, Kenneth A. Foreword, xiii-xv. In Haywood, H. Carl & Lidz, Carol Schneider. Dynamic Assessment in Practice: Clinical And Educational Applications. Cambridge University Press, 2006, p.xiii-xiv


Further reading Carroll, J.B. (1993). Human cognitive abilities: A survey of factor-analytical studies. New York: Cambridge University Press. ISBN0-521-38275-0. Lahn, Bruce T.; Ebenstein, Lanny (2009). "Let's celebrate human genetic diversity". Nature 461 (7265): 7268. doi: 10.1038/461726a (http://dx.doi.org/10.1038/461726a). PMID 19812654 (http://www.ncbi.nlm.nih. gov/pubmed/19812654). Coward, W. Mark; Sackett, Paul R. (1990). "Linearity of ability^performance relationships: A reconfirmation". Journal of Applied Psychology 75 (3): 297300. doi: 10.1037/0021-9010.75.3.297 (http://dx.doi.org/10.1037/ 0021-9010.75.3.297). Duncan, J.; Seitz, RJ; Kolodny, J; Bor, D; Herzog, H; Ahmed, A; Newell, FN; Emslie, H (2000). "A Neural Basis for General Intelligence". Science 289 (5478): 45760. doi: 10.1126/science.289.5478.457 (http://dx.doi.org/ 10.1126/science.289.5478.457). PMID 10903207 (http://www.ncbi.nlm.nih.gov/pubmed/10903207). Duncan, John; Burgess, Paul; Emslie, Hazel (1995). "Fluid intelligence after frontal lobe lesions". Neuropsychologia 33 (3): 2618. doi: 10.1016/0028-3932(94)00124-8 (http://dx.doi.org/10.1016/ 0028-3932(94)00124-8). PMID 7791994 (http://www.ncbi.nlm.nih.gov/pubmed/7791994). Flynn, James R. (1999). "Searching for justice: The discovery of IQ gains over time" (http://www.stat. columbia.edu/~gelman/stuff_for_blog/flynn.pdf). American Psychologist 54 (1): 520. doi: 10.1037/0003-066X.54.1.5 (http://dx.doi.org/10.1037/0003-066X.54.1.5). Frey, Meredith C.; Detterman, Douglas K. (2004). "Scholastic Assessment org?". Psychological Science 15 (6): 3738. doi: 10.1111/j.0956-7976.2004.00687.x (http://dx.doi.org/10.1111/j.0956-7976.2004.00687.x). PMID 15147489 (http://www.ncbi.nlm.nih.gov/pubmed/15147489). Gale, C. R; Deary, I. J; Schoon, I.; Batty, G D.; Batty, G D. (2006). "IQ in childhood and vegetarianism in adulthood: 1970 British cohort study" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1790759). BMJ 334 (7587): 245. doi: 10.1136/bmj.39030.675069.55 (http://dx.doi.org/10.1136/bmj.39030.675069.55). PMC

Intelligence quotient 1790759 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1790759). PMID 17175567 (http://www.ncbi. nlm.nih.gov/pubmed/17175567). Gottfredson, L (1997). "Why g matters: The complexity of everyday life" (http://www.udel.edu/educ/ gottfredson/reprints/1997whygmatters.pdf). Intelligence 24 (1): 79132. doi: 10.1016/S0160-2896(97)90014-3 (http://dx.doi.org/10.1016/S0160-2896(97)90014-3). Gottfredson, Linda S. (1998). "The general intelligence factor" (http://www.udel.edu/educ/gottfredson/ reprints/1998generalintelligencefactor.pdf) (PDF). Scientific American Presents 9 (4): 2429. Gottfredson, L.S. (2005). "Suppressing intelligence research: Hurting those we intend to help." (http://www. udel.edu/educ/gottfredson/reprints/2005suppressingintelligence.pdf) (PDF). In Wright, R.H. and Cummings, N.A (Eds.). Destructive trends in mental health: The well-intentioned path to harm. New York: Taylor and Francis. pp.155186. ISBN0-415-95086-4. Gottfredson, L.S. (2006). "Social consequences of group differences in cognitive ability (Consequencias sociais das diferencas de grupo em habilidade cognitiva)" (http://www.udel.edu/educ/gottfredson/reprints/ 2004socialconsequences.pdf) (PDF). In Flores-Mendoza, C.E. and Colom, R. (Eds.). Introduo psicologia das diferenas individuais. Porto Alegre, Brazil: ArtMed Publishers. pp.155186. ISBN85-363-0621-1. Gould, S.J. (1996). In W. W. Norton & Co. The Mismeasure of Man: Revised and Expanded Edition. New-York: Penguin. ISBN0-14-025824-8.


Gray, Jeremy R.; Chabris, Christopher F.; Braver, Todd S. (2003). "Neural mechanisms of general fluid intelligence". Nature Neuroscience 6 (3): 31622. doi: 10.1038/nn1014 (http://dx.doi.org/10.1038/nn1014). PMID 12592404 (http://www.ncbi.nlm.nih.gov/pubmed/12592404). Gray, Jeremy R.; Thompson, Paul M. (2004). "Neurobiology of intelligence: science and ethics". Nature Reviews Neuroscience 5 (6): 47182. doi: 10.1038/nrn1405 (http://dx.doi.org/10.1038/nrn1405). PMID 15152197 (http://www.ncbi.nlm.nih.gov/pubmed/15152197). Haier, R; Jung, R; Yeo, R; Head, K; Alkire, M (2005). "The neuroanatomy of general intelligence: sex matters". NeuroImage 25 (1): 3207. doi: 10.1016/j.neuroimage.2004.11.019 (http://dx.doi.org/10.1016/j.neuroimage. 2004.11.019). PMID 15734366 (http://www.ncbi.nlm.nih.gov/pubmed/15734366). Harris, J.R. (1998). The Nurture Assumption: why children turn out the way they do. New York (NY): Free Press. ISBN0-684-84409-5. Hunt, Earl (2001). "Multiple Views of Multiple Intelligence". PsycCRITIQUES 46 (1): 57. doi: 10.1037/002513 (http://dx.doi.org/10.1037/002513). Jensen, A.R. (1979). Bias in mental testing. New York (NY): Free Press. ISBN0-02-916430-3. Jensen, A.R. (1979). The g Factor: The Science of Mental Ability. Wesport (CT): Praeger Publishers. ISBN0-275-96103-6. Jensen, A.R. (2006). Clocking the Mind: Mental Chronometry and Individual Differences. Elsevier. ISBN0-08-044939-5. Kaufman, Alan S. (2009). IQ Testing 101. New York (NY): Springer Publishing. ISBN978-0-8261-0629-2. Klingberg, Torkel; Forssberg, Hans; Westerberg, Helena (2002). "Training of Working Memory in Children With ADHD". Journal of Clinical and Experimental Neuropsychology (Neuropsychology, Development and Cognition: Section A) 24 (6): 78191. doi: 10.1076/jcen.24.6.781.8395 (http://dx.doi.org/10.1076/jcen.24.6.781.8395). PMID 12424652 (http://www.ncbi.nlm.nih.gov/pubmed/12424652). McClearn, G. E.; Johansson, B; Berg, S; Pedersen, NL; Ahern, F; Petrill, SA; Plomin, R (1997). "Substantial Genetic Influence on Cognitive Abilities in Twins 80or More Years Old". Science 276 (5318): 15603. doi: 10.1126/science.276.5318.1560 (http://dx.doi.org/10.1126/science.276.5318.1560). PMID 9171059 (http:/ /www.ncbi.nlm.nih.gov/pubmed/9171059). Mingroni, M (2004). "The secular rise in IQ: Giving heterosis a closer look". Intelligence 32 (1): 6583. doi: 10.1016/S0160-2896(03)00058-8 (http://dx.doi.org/10.1016/S0160-2896(03)00058-8).

Intelligence quotient Murray, C. (1998). Income Inequality and IQ (http://www.aei.org/docLib/20040302_book443.pdf) (PDF). Washington (DC): AEI Press. ISBN0-8447-7094-9. Noguera, P.A (2001). "Racial politics and the elusive quest for excellence and equity in education" (http://www. inmotionmagazine.com/er/pnrp1.html). Motion Magazine. Article # ER010930002. Plomin, R.; DeFries, J.C.; Craig, I.W.; McGuffin, P (2003). Behavioral genetics in the postgenomic era. Washington (DC): American Psychological Association. ISBN1-55798-926-5. Plomin, R.; DeFries, J.C.; McClearn, G.E.; McGuffin, P (2000). Behavioral genetics (4th ed.). New York (NY): Worth Publishers. ISBN0-7167-5159-3. Rowe, D.C.; Vesterdal, W.J.; Rodgers, J.L. (1997). The Bell Curve Revisited: How Genes and Shared Environment Mediate IQ-SES Associations.Wikipedia:Verifiability Schoenemann, P Thomas; Sheehan, Michael J; Glotzer, L Daniel (2005). "Prefrontal white matter volume is disproportionately larger in humans than in other primates". Nature Neuroscience 8 (2): 24252. doi: 10.1038/nn1394 (http://dx.doi.org/10.1038/nn1394). PMID 15665874 (http://www.ncbi.nlm.nih.gov/ pubmed/15665874). Shaw, P.; Greenstein, D.; Lerch, J.; Clasen, L.; Lenroot, R.; Gogtay, N.; Evans, A.; Rapoport, J. et al. (2006). "Intellectual ability and cortical development in children and adolescents". Nature 440 (7084): 6769. doi: 10.1038/nature04513 (http://dx.doi.org/10.1038/nature04513). PMID 16572172 (http://www.ncbi.nlm. nih.gov/pubmed/16572172). Tambs, Kristian; Sundet, Jon Martin; Magnus, Per; Berg, Kre (1989). "Genetic and environmental contributions to the covariance between occupational status, educational attainment, and IQ: A study of twins". Behavior Genetics 19 (2): 20922. doi: 10.1007/BF01065905 (http://dx.doi.org/10.1007/BF01065905). PMID 2719624 (http://www.ncbi.nlm.nih.gov/pubmed/2719624). Thompson, Paul M.; Cannon, Tyrone D.; Narr, Katherine L.; Van Erp, Theo; Poutanen, Veli-Pekka; Huttunen, Matti; Lnnqvist, Jouko; Standertskjld-Nordenstam, Carl-Gustaf et al. (2001). "Genetic influences on brain structure". Nature Neuroscience 4 (12): 12538. doi: 10.1038/nn758 (http://dx.doi.org/10.1038/nn758). PMID 11694885 (http://www.ncbi.nlm.nih.gov/pubmed/11694885). Wechsler, D. (1997). Wechsler Adult Intelligence Scale (3rd ed.). San Antonia (TX): The Psychological Corporation. Wechsler, D. (2003). Wechsler Intelligence Scale for Children (4th ed.). San Antonia (TX): The Psychological Corporation. Weiss, Volkmar (2009). "National IQ means transformed from Programme for International Student Assessment (PISA) Scores" (http://mpra.ub.uni-muenchen.de/14600/). The Journal of Social, Political and Economic Studies 31 (1): 7194.


External links
Human Intelligence: biographical profiles, current controversies, resources for teachers (http://www. intelltheory.com/) Classics in the History of Psychology (http://psychclassics.yorku.ca/)


Internal consistency
In statistics and research, internal consistency is typically a measure based on the correlations between different items on the same test (or the same subscale on a larger test). It measures whether several items that propose to measure the same general construct produce similar scores. For example, if a respondent expressed agreement with the statements "I like to ride bicycles" and "I've enjoyed riding bicycles in the past", and disagreement with the statement "I hate bicycles", this would be indicative of good internal consistency of the test.

Cronbach's alpha
Internal consistency is usually measured with Cronbach's alpha, a statistic calculated from the pairwise correlations between items. Alpha can be no greater than one; in practice it typically falls between zero and one, although negative values can occur when items are poorly related or mis-keyed. A commonly accepted rule of thumb for interpreting internal consistency is as follows:[1]
Cronbach's alpha: internal consistency
α ≥ 0.9: Excellent
0.9 > α ≥ 0.8: Good
0.8 > α ≥ 0.7: Acceptable
0.7 > α ≥ 0.6: Questionable
0.6 > α ≥ 0.5: Poor
0.5 > α: Unacceptable

Very high reliabilities (0.95 or higher) are not necessarily desirable, as this indicates that the items may be entirely redundant.[2] The goal in designing a reliable instrument is for scores on similar items to be related (internally consistent), but for each to contribute some unique information as well.

An alternative way of thinking about internal consistency is that it is the extent to which all of the items of a test measure the same latent variable. The advantage of this perspective over the notion of a high average correlation among the items of a test (the perspective underlying Cronbach's alpha) is that the average item correlation is affected by skewness in the distribution of item correlations, just as any other average is. Whereas the modal item correlation is zero when the items of a test measure several unrelated latent variables, the average item correlation in such cases will be greater than zero. Thus, although the ideal of measurement is for all items of a test to measure the same latent variable, alpha has been demonstrated many times to attain quite high values even when the set of items measures several unrelated latent variables.[3][4][5][6][7][8] The hierarchical "coefficient omega" may be a more appropriate index of the extent to which all of the items in a test measure the same latent variable.[9][10] Several different measures of internal consistency are reviewed by Revelle & Zinbarg (2009).[11]
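As a worked sketch of how alpha is calculated in practice, the following Python snippet applies the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The five-person, four-item data matrix is invented purely for illustration.

import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses of 5 people to 4 Likert-type items
data = [[4, 5, 4, 4],
        [2, 2, 3, 2],
        [5, 4, 5, 5],
        [3, 3, 2, 3],
        [4, 4, 4, 5]]
print(round(cronbach_alpha(data), 2))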


References
[1] George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference. 11.0 update (4th ed.). Boston: Allyn & Bacon. [2] Streiner, D. L. (2003) Starting at the beginning: an introduction to coefficient alpha and internal consistency, Journal of Personality Assessment, 80, 99-103 [3] Cortina. J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98104. [4] Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297334. [5] Green, S. B., Lissitz, R.W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37, 827838. [6] Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14, 5774. [7] Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350353. [8] Zinbarg, R., Yovel, I., Revelle, W. & McDonald, R. (2006). Estimating generalizability to a universe of indicators that all have an attribute in common: A comparison of estimators for . Applied Psychological Measurement, 30, 121144. [9] McDonald, R. P. (1999). Test theory: A unified treatment. Psychology Press. ISBN 0-8058-3075-8 [10] Zinbarg, R., Revelle, W., Yovel, I. & Li, W. (2005). Cronbachs , Revelles , and McDonalds : Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70, 123133. [11] Revelle, W., Zinbarg, R. (2009) "Coefficients Alpha, Beta, Omega, and the glb: Comments on Sijtsma", Psychometrika, 74(1), 145154. (http:/ / dx. doi. org/ 10. 1007/ s11336-008-9102-z)

External links
http://www.wilderdom.com/personality/L3-2EssentialsGoodPsychologicalTest.html

Intra-rater reliability
In statistics, intra-rater reliability is the degree of agreement among multiple repetitions of a diagnostic test performed by a single rater.[1][2]

References
[1] Stroke Engine glossary (McGill Faculty of Medicine) (http:/ / www. medicine. mcgill. ca/ strokengine-assess/ definitions-en. html) [2] Outcomes database glossary (http:/ / www. outcomesdatabase. org/ show/ category/ id/ 8)


IPPQ
The iOpener People and Performance Questionnaire (iPPQ) is a psychometric tool, designed to assess workplace happiness and wellbeing. It is designed and administered by iOpener Ltd, a management consultancy firm based in Oxford, UK.

Happiness at work
Despite a large body of positive psychological research into the relationship between happiness and productivity,[1][2][3] and the development of corporate psychometric tools to assess factors such as personality profile and feedback (e.g., 360-degree feedback), the two fields of study had not previously been combined to produce a psychometric tool specifically designed to measure happiness in the workplace. The iPPQ is presented by its publisher as the first and only example of this type of tool to date.

Research
The tool was developed following the development of a model of workplace happiness[4] and research into the relationships between employee happiness, overtime, sick leave and intention to stay or leave,[5] conducted by Dr Laurel Edmunds and Jessica Pryce-Jones. In addition to the academic articles cited above, iOpener's research into happiness at work has received widespread press coverage from publications including The Sunday Times,[6] Jobsite,[7] Legal Week[8] and Construction Today.[9]

References
[1] Carr, A.: "Positive Psychology: The Science of Happiness and Human Strengths" Hove, Brunner-Routledge 2004 [2] Isen, A.; Positive Affect and Decision-making. In M. Lewis and J. Haviland Jones (eds), "Handbook of Emotions" (2nd edition), pp. 417-436. New York, Guilford Press 2000 [3] Buss, D. The Evolution of Happiness, "American Psychologist" Vol. 55 (2000) pp. 15-23 [4] Dutton V.M., Edmunds L.D.: A model of workplace happiness, Selection & Development Review, Vol. 23, No.1, 2007 [5] Relationships between employee happiness, overtime, sick leave and intention to stay or leave, Selection & Development Review, Vol. 24, No.2, 2008 (http:/ / www. iopener. co. uk/ wsc_content/ download/ sdr2008paper. pdf) [6] Make sure people are happy in their job, The Sunday Times 25/06/08 (http:/ / business. timesonline. co. uk/ tol/ business/ career_and_jobs/ recruiter_forum/ article3998244. ece) [7] How to be Happy at Work, Jobsite 02/04/2009 (http:/ / www. jobsite. co. uk/ cgi-bin/ bulletin_search. cgi?act=da& aid=1782) [8] The pursuit of happiness, Legal Week 13/11/2008 (http:/ / www. legalweek. com/ Articles/ 1180002/ The+ pursuit+ of+ happiness. html) [9] Increaseing Employee Morale, Construction Today 15/10/2008 (http:/ / www. ct-europe. com/ article-page. php?contentid=6290& issueid=218)

External links
iOpener homepage (http://www.iopener.com/) Take the iPPQ online for free (http://www.smart-survey.co.uk/v.asp?i=5427fbrin)


Item bank
An item bank is a term for a repository of test items that belong to a testing program, as well as all information pertaining to those items. In most applications of testing and assessment, the items are of multiple choice format, but any format can be used. Items are pulled from the bank and assigned to test forms for publication either as a paper-and-pencil test or some form of e-assessment.

Types of information
An item bank will not only include the text of each item, but also extensive information regarding test development and psychometric characteristics of the items. Examples of such information include:[1]
Item author
Date written
Item status (e.g., new, pilot, active, retired)
Angoff ratings
Correct answer
Item format
Classical test theory statistics
Item response theory statistics
User-defined fields

Item banking software


Because an item bank is essentially a simple database, it can be stored in database software or even a spreadsheet such as Microsoft Excel. However, there are several dozen commercially available software programs specifically designed for item banking. The advantages these provide are specific to assessment. For example, items are presented on the computer screen as they would appear to a test examinee, and item response theory parameters can be translated into item response functions or information functions. Additionally, there are functionalities for publication, such as formatting a set of items to be printed as a paper-and-pencil test. Some item bankers also have test administration functionalities, such as being able to deliver e-assessment or process "bubble" answer sheets.
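A minimal sketch of such a repository follows, assuming nothing about any particular commercial product: the table layout, field names, and the single sample item below are invented for illustration only, using Python's built-in sqlite3 module.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE items (
        item_id      INTEGER PRIMARY KEY,
        stem         TEXT,   -- the question text
        item_format  TEXT,   -- e.g. multiple choice
        correct_key  TEXT,
        author       TEXT,
        date_written TEXT,
        status       TEXT,   -- new, pilot, active, retired
        p_value      REAL,   -- classical difficulty (proportion correct)
        irt_a        REAL,   -- IRT discrimination
        irt_b        REAL    -- IRT difficulty
    )
""")
conn.execute(
    "INSERT INTO items VALUES (?,?,?,?,?,?,?,?,?,?)",
    (1, "2 + 2 = ?", "multiple choice", "B", "A. Author",
     "2013-01-15", "active", 0.85, 1.1, -1.3),
)
# Pull all active items that are easier than average (b < 0) for a draft form
rows = conn.execute(
    "SELECT item_id, stem FROM items WHERE status = 'active' AND irt_b < 0"
).fetchall()
print(rows)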

References
[1] Vale, C.D. (2004). Computerized item banking. In Downing, S.D., & Haladyna, T.M. (Eds.) The Handbook of Test Development. Routledge.


Item response theory


In psychometrics, item response theory (IRT), also known as latent trait theory, strong true score theory, or modern mental test theory, is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. Unlike simpler alternatives for creating scales, such as simply summing questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, the assumption in Likert scaling that "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments"[1] (p. 197). By contrast, item response theory treats the difficulty of each item (the item characteristic curves, or ICCs) as information to be incorporated in scaling items. It is based on the application of related mathematical models to testing data. Because it is generally regarded as superior to classical test theory, it is the preferred method for developing scales, especially when optimal decisions are demanded, as in so-called high-stakes tests, e.g., the Graduate Record Examination (GRE) and Graduate Management Admission Test (GMAT).

The name item response theory is due to the focus of the theory on the item, as opposed to the test-level focus of classical test theory. Thus IRT models the response of each examinee of a given ability to each item in the test. The term item is generic, covering all kinds of informative items. They might be multiple choice questions that have incorrect and correct responses, but are also commonly statements on questionnaires that allow respondents to indicate level of agreement (a rating or Likert scale), or patient symptoms scored as present/absent, or diagnostic information in complex systems.

IRT is based on the idea that the probability of a correct/keyed response to an item is a mathematical function of person and item parameters. The person parameter is construed as (usually) a single latent trait or dimension. Examples include general intelligence or the strength of an attitude. Parameters on which items are characterized include their difficulty (known as "location" for their location on the difficulty range), discrimination (slope or correlation), representing how steeply the rate of success of individuals varies with their ability, and a pseudo-guessing parameter, characterising the (lower) asymptote at which even the least able persons will score due to guessing (for instance, 25% for pure chance on a four-option multiple choice item).

Overview
The concept of the item response function was around before 1950. The pioneering work of IRT as a theory occurred during the 1950s and 1960s. Three of the pioneers were the Educational Testing Service psychometrician Frederic M. Lord,[2] the Danish mathematician Georg Rasch, and Austrian sociologist Paul Lazarsfeld, who pursued parallel research independently. Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich. IRT did not become widely used until the late 1970s and 1980s, when personal computers gave many researchers access to the computing power necessary for IRT. Among other things, the purpose of IRT is to provide a framework for evaluating how well assessments work, and how well individual items on assessments work. The most common application of IRT is in education, where psychometricians use it for developing and refining exams, maintaining banks of items for exams, and equating for the difficulties of successive versions of exams (for example, to allow comparisons between results over time).[3] IRT models are often referred to as latent trait models. The term latent is used to emphasize that discrete item responses are taken to be observable manifestations of hypothesized traits, constructs, or attributes, not directly observed, but which must be inferred from the manifest responses. Latent trait models were developed in the field of sociology, but are virtually identical to IRT models. IRT is generally regarded as an improvement over classical test theory (CTT). For tasks that can be accomplished using CTT, IRT generally brings greater flexibility and provides more sophisticated information. Some applications, such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical

test theory. Another advantage of IRT over CTT is that the more sophisticated information IRT provides allows a researcher to improve the reliability of an assessment.

IRT entails three assumptions:
1. A unidimensional trait, denoted by θ;
2. Local independence of items;
3. The response of a person to an item can be modeled by a mathematical item response function (IRF).

The trait θ is further assumed to be measurable on a scale (the mere existence of a test assumes this), typically set to a standard scale with a mean of 0.0 and a standard deviation of 1.0. 'Local independence' means that items are not related except for the fact that they measure the same trait, which is equivalent to the assumption of unidimensionality, but is presented separately because multidimensionality can be caused by other issues. The topic of dimensionality is often investigated with factor analysis, while the IRF is the basic building block of IRT and is the center of much of the research and literature.


The item response function


The IRF gives the probability that a person with a given ability level will answer correctly. Persons with lower ability have less of a chance, while persons with high ability are very likely to answer correctly; for example, students with higher math ability are more likely to get a math item correct. The exact value of the probability depends, in addition to ability, on a set of item parameters for the IRF.

Three parameter logistic model


For example, in the three parameter logistic (3PL) model, the probability of a correct response to an item i is:

p_i(θ) = c_i + (1 - c_i) / (1 + e^{-a_i(θ - b_i)})

where θ is the person (ability) parameter and a_i, b_i, and c_i are the item parameters. The item parameters simply determine the shape of the IRF and in some cases have a direct interpretation. The figure accompanying the original article depicts an example 3PL ICC with an overlaid conceptual explanation of the parameters. The item parameters can be interpreted as changing the shape of the standard logistic function:

P(t) = 1 / (1 + e^{-t})

In brief, the parameters are interpreted as follows (dropping subscripts for legibility); b is most basic, hence listed first:
b: difficulty, item location; the half-way point between c (min) and 1 (max), which is also where the slope is maximized.
a: discrimination, scale, slope; the maximum slope is a(1 - c)/4.
c: pseudo-guessing, chance; the asymptotic minimum of the curve.

If c = 0, these simplify to p(b) = 1/2 and p'(b) = a/4, meaning that b equals the 50% success level (difficulty), and a (divided by four) is the maximum slope (discrimination), which occurs at the 50% success level.

Further, the logit (log odds) of a correct response is a(θ - b) (assuming c = 0): in particular, if ability θ equals difficulty b, there are even odds (1:1, so logit 0) of a correct answer; the greater the ability is above (or below) the difficulty, the more (or less) likely a correct response, with discrimination a determining how rapidly the odds increase or decrease with ability.

In words, the standard logistic function has an asymptotic minimum of 0 (c = 0), is centered around 0 (b = 0, P(0) = 1/2), and has maximum slope 1/4 (a = 1). The parameter b shifts the horizontal scale, the parameter a stretches the horizontal scale, and the parameter c compresses the vertical scale from [0, 1] to [c, 1]. This is elaborated below.

The parameter b_i represents the item location which, in the case of attainment testing, is referred to as the item difficulty. It is the point on θ where the IRF has its maximum slope, and where its value is half-way between the minimum value of c_i and the maximum value of 1. The example item is of medium difficulty, since b_i = 0.0 is near the center of the distribution. Note that this model scales the item's difficulty and the person's trait onto the same continuum. Thus, it is valid to talk about an item being about as hard as Person A's trait level, or of a person's trait level being about the same as Item Y's difficulty, in the sense that successful performance of the task involved with an item reflects a specific level of ability.

The item parameter a_i represents the discrimination of the item: that is, the degree to which the item discriminates between persons in different regions on the latent continuum. This parameter characterizes the slope of the IRF where the slope is at its maximum. The example item has a_i = 1.0, which discriminates fairly well; persons with low ability do indeed have a much smaller chance of correctly responding than persons of higher ability.

For items such as multiple choice items, the parameter c_i is used in an attempt to account for the effects of guessing on the probability of a correct response. It indicates the probability that very low ability individuals will get this item correct by chance, mathematically represented as a lower asymptote. A four-option multiple choice item might have an IRF like the example item; there is a 1/4 chance of an extremely low ability candidate guessing the correct answer, so c_i would be approximately 0.25. This approach assumes that all options are equally plausible, because if one option made no sense, even the lowest ability person would be able to discard it, so IRT parameter estimation methods take this into account and estimate c_i based on the observed data.[4]
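For readers who want to experiment with the function, a minimal Python sketch of the 3PL IRF follows. The parameter values mirror the example item discussed above (b = 0, a = 1, c = 0.25) and are illustrative only; setting c = 0 gives the 2PL, and additionally holding a fixed across items corresponds to a 1PL-style curve.

import numpy as np

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.
    With c=0 this is the 2PL; with c=0 and a common a it matches the 1PL."""
    theta = np.asarray(theta, dtype=float)
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

abilities = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
# Example item: medium difficulty (b=0), good discrimination (a=1),
# four-option multiple choice, so roughly c=0.25
print(irf_3pl(abilities, a=1.0, b=0.0, c=0.25).round(3))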

IRT models
Broadly speaking, IRT models can be divided into two families: unidimensional and multidimensional. Unidimensional models require a single trait (ability) dimension θ. Multidimensional IRT models model response data hypothesized to arise from multiple traits. However, because of the greatly increased complexity, the majority of IRT research and applications utilize a unidimensional model. IRT models can also be categorized based on the number of scored responses. The typical multiple choice item is dichotomous; even though there may be four or five options, it is still scored only as correct/incorrect (right/wrong). Another class of models applies to polytomous outcomes, where each response has a different score value.[5][6] A common example of this is Likert-type items, e.g., "Rate on a scale of 1 to 5."
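As one concrete illustration of a polytomous model (the specific model sketched here, Samejima's graded response model, is not named in the text above and is offered only as an example), category probabilities for an ordered-response item can be built from a set of 2PL-style boundary curves:

import numpy as np

def graded_response_probs(theta, a, thresholds):
    """Category probabilities for one graded-response item.
    thresholds must be increasing; returns P(X = k) for k = 0..K."""
    theta = float(theta)
    # Cumulative P(X >= k) for k = 1..K, bracketed by 1 (k = 0) and 0 (k = K+1)
    cum = [1.0] + [1.0 / (1.0 + np.exp(-a * (theta - t))) for t in thresholds] + [0.0]
    cum = np.array(cum)
    return cum[:-1] - cum[1:]

# Illustrative 5-category Likert item: a = 1.3, thresholds at -1.5, -0.5, 0.5, 1.5
print(graded_response_probs(0.0, 1.3, [-1.5, -0.5, 0.5, 1.5]).round(3))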

Number of IRT parameters


Dichotomous IRT models are described by the number of parameters they make use of.[7] The 3PL is named so because it employs three item parameters. The two-parameter model (2PL) assumes that the data have no guessing, but that items can vary in terms of location (b_i) and discrimination (a_i). The one-parameter model (1PL) assumes that guessing is a part of the ability and that all items that fit the model have equivalent discriminations, so that items are only described by a single parameter (b_i). This results in one-parameter models having the property of specific objectivity, meaning that the rank of the item difficulty is the same for all respondents independent of ability, and that the rank of the person ability is the same for items independently of difficulty. Thus, one-parameter models are sample independent, a property that does not hold for two-parameter and three-parameter models. Additionally, there is theoretically a four-parameter model (4PL), with an upper asymptote denoted by d_i, where 1 - c_i in the 3PL is replaced by d_i - c_i. However, this is rarely used. Note that the alphabetical order of the item parameters does not match their practical or psychometric importance; the location/difficulty (b_i) parameter is clearly most important because it is included in all three models. The 1PL uses only b_i, the 2PL uses b_i and a_i, the 3PL adds c_i, and the 4PL adds d_i.

The 2PL is equivalent to the 3PL model with c_i = 0, and is appropriate for testing items where guessing the correct answer is highly unlikely, such as fill-in-the-blank items ("What is the square root of 121?"), or where the concept of guessing does not apply, such as personality, attitude, or interest items (e.g., "I like Broadway musicals. Agree/Disagree"). The 1PL assumes not only that guessing is not present (or irrelevant), but that all items are equivalent in terms of discrimination, analogous to a common factor analysis with identical loadings for all items. Individual items or individuals might have secondary factors, but these are assumed to be mutually independent and collectively orthogonal.

Logistic and normal IRT models


An alternative formulation constructs IRFs based on the normal probability distribution; these are sometimes called normal ogive models. For example, the formula for a two-parameter normal-ogive IRF is:

P_i(θ) = Φ((θ - b_i) / σ_i)

where Φ is the cumulative distribution function (cdf) of the standard normal distribution. The normal-ogive model derives from the assumption of normally distributed measurement error and is theoretically appealing on that basis. Here b_i is, again, the difficulty parameter. The discrimination parameter is σ_i, the standard deviation of the measurement error for item i, comparable to 1/a_i. One can estimate a normal-ogive latent trait model by factor-analyzing a matrix of tetrachoric correlations between items.[8] This means it is technically possible to estimate a simple IRT model using general-purpose statistical software.

With rescaling of the ability parameter, it is possible to make the 2PL logistic model closely approximate the cumulative normal ogive. Typically, the 2PL logistic and normal-ogive IRFs differ in probability by no more than 0.01 across the range of the function. The difference is greatest in the distribution tails, however, which tend to have more influence on results.

The latent trait/IRT model was originally developed using normal ogives, but this was considered too computationally demanding for the computers at the time (1960s). The logistic model was proposed as a simpler alternative, and has enjoyed wide use since. More recently, however, it was demonstrated that, using standard polynomial approximations to the normal cdf,[9] the normal-ogive model is no more computationally demanding than logistic models.[10]
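That close agreement can be checked numerically. The sketch below (using SciPy's normal cdf; the scaling constant 1.702 is the conventional choice for this comparison) reports the largest absolute difference between a standard normal ogive and a rescaled logistic curve:

import numpy as np
from scipy.stats import norm

theta = np.linspace(-4, 4, 2001)
ogive = norm.cdf(theta)                          # normal-ogive IRF with b=0, sigma=1
logistic = 1.0 / (1.0 + np.exp(-1.702 * theta))  # rescaled 2PL logistic curve
print(np.abs(ogive - logistic).max())            # roughly 0.01 or less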

The Rasch model


The Rasch model is often considered to be the 1PL IRT model. However, proponents of Rasch modeling prefer to view it as a completely different approach to conceptualizing the relationship between data and the theory.[11] Like other statistical modeling approaches, IRT emphasizes the primacy of the fit of a model to observed data,[12] while the Rasch model emphasizes the primacy of the requirements for fundamental measurement, with adequate data-model fit being an important but secondary requirement to be met before a test or research instrument can be claimed to measure a trait.[13] Operationally, this means that the IRT approaches include additional model parameters to reflect the patterns observed in the data (e.g., allowing items to vary in their correlation with the latent trait), whereas the Rasch approach requires both the data fit the Rasch model and that test items and examinees

conform to the model, before claims regarding the presence of a latent trait can be considered valid. Therefore, under Rasch models, misfitting responses require diagnosis of the reason for the misfit, and may be excluded from the data set if substantive explanations can be made that they do not address the latent trait.[14] Thus, the Rasch approach can be seen to be a confirmatory approach, as opposed to exploratory approaches that attempt to model the observed data. As in any confirmatory analysis, care must be taken to avoid confirmation bias. The presence or absence of a guessing or pseudo-chance parameter is a major and sometimes controversial distinction. The IRT approach includes a left asymptote parameter to account for guessing in multiple choice examinations, while the Rasch model does not because it is assumed that guessing adds randomly distributed noise to the data. As the noise is randomly distributed, it is assumed that, provided sufficient items are tested, the rank-ordering of persons along the latent trait by raw score will not change, but will simply undergo a linear rescaling. Three-parameter IRT, by contrast, achieves data-model fit by selecting a model that fits the data,[15] at the expense of sacrificing specific objectivity. In practice, the Rasch model has at least two principal advantages in comparison to the IRT approach. The first advantage is the primacy of Rasch's specific requirements,[16] which (when met) provide fundamental person-free measurement (where persons and items can be mapped onto the same invariant scale).[17] Another advantage of the Rasch approach is that estimation of parameters is more straightforward in Rasch models due to the presence of sufficient statistics, which in this application means a one-to-one mapping of raw number-correct scores to Rasch estimates.[18]


Analysis of model fit


As with any use of mathematical models, it is important to assess the fit of the data to the model. If item misfit with any model is diagnosed as due to poor item quality, for example confusing distractors in a multiple-choice test, then the items may be removed from that test form and rewritten or replaced in future test forms. If, however, a large number of misfitting items occur with no apparent reason for the misfit, the construct validity of the test will need to be reconsidered and the test specifications may need to be rewritten. Thus, misfit provides invaluable diagnostic tools for test developers, allowing the hypotheses upon which test specifications are based to be empirically tested against data. There are several methods for assessing fit, such as a chi-square statistic, or a standardized version of it. Two and three-parameter IRT models adjust item discrimination, ensuring improved data-model fit, so fit statistics lack the confirmatory diagnostic value found in one-parameter models, where the idealized model is specified in advance. Data should not be removed on the basis of misfitting the model, but rather because a construct relevant reason for the misfit has been diagnosed, such as a non-native speaker of English taking a science test written in English. Such a candidate can be argued to not belong to the same population of persons depending on the dimensionality of the test, and, although one parameter IRT measures are argued to be sample-independent, they are not population independent, so misfit such as this is construct relevant and does not invalidate the test or the model. Such an approach is an essential tool in instrument validation. In two and three-parameter models, where the psychometric model is adjusted to fit the data, future administrations of the test must be checked for fit to the same model used in the initial validation in order to confirm the hypothesis that scores from each administration generalize to other administrations. If a different model is specified for each administration in order to achieve data-model fit, then a different latent trait is being measured and test scores cannot be argued to be comparable between administrations.


Information
One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error). And traditionally, it is measured using a single index defined in various ways, such as the ratio of true and observed score variance. This index is helpful in characterizing a test's average reliability, for example in order to compare two tests. But IRT makes it clear that precision is not uniform across the entire range of test scores. Scores at the edges of the test's range, for example, generally have more error associated with them than scores closer to the middle of the range. Item response theory advances the concept of item and test information to replace reliability. Information is also a function of the model parameters. For example, according to Fisher information theory, the item information supplied in the case of the 1PL for dichotomous response data is simply the probability of a correct response multiplied by the probability of an incorrect response, or,

The standard error of estimation (SE) at a given trait level is the reciprocal of the square root of the test information at that level:

SE(θ) = 1 / √I(θ)

Thus more information implies less error of measurement. For other models, such as the two- and three-parameter models, the discrimination parameter plays an important role in the function. The item information function for the two-parameter model is

I(θ) = a_i² p_i(θ) q_i(θ)

The item information function for the three-parameter model is

I(θ) = a_i² [(p_i(θ) − c_i)² / (1 − c_i)²] [q_i(θ) / p_i(θ)][19]

In general, item information functions tend to look bell-shaped. Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range. Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range. Because of local independence, item information functions are additive. Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely.

Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference between IRT and CTT. IRT findings reveal that the CTT concept of reliability is a simplification. In the place of reliability, IRT offers the test information function, which shows the degree of precision at different values of theta (θ). These results allow psychometricians to (potentially) carefully shape the level of reliability for different ranges of ability by including carefully chosen items. For example, in a certification situation in which a test can only be passed or failed, where there is only a single "cutscore," and where the actual passing score is unimportant, a very efficient test can be developed by selecting only items that have high information near the cutscore. These items generally correspond to items whose difficulty is about the same as that of the cutscore.
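To make the additive nature of information concrete, the following is a minimal Python sketch (not part of the original article) that computes item information under the three-parameter model for a few hypothetical items, sums it into a test information function, and converts it to a conditional standard error; all parameter values are invented for illustration.

    import numpy as np

    def p_3pl(theta, a, b, c):
        """Probability of a correct response under the 3PL model."""
        return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

    def item_information(theta, a, b, c):
        """3PL item information; reduces to a^2*p*q when c = 0 and to p*q when a = 1."""
        p = p_3pl(theta, a, b, c)
        return a**2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

    # Hypothetical item parameters (a, b, c) -- illustrative only.
    items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.20), (1.5, 1.0, 0.25)]

    theta = np.linspace(-3, 3, 13)
    # Local independence lets item information functions be summed into test information.
    test_info = sum(item_information(theta, a, b, c) for a, b, c in items)
    se = 1 / np.sqrt(test_info)  # conditional standard error of estimation

    for t, i, s in zip(theta, test_info, se):
        print(f"theta={t:+.1f}  I={i:.3f}  SE={s:.3f}")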


Scoring
The person parameter represents the magnitude of the latent trait of the individual, which is the human capacity or attribute measured by the test.[20] It might be a cognitive ability, physical ability, skill, knowledge, attitude, personality characteristic, etc. The estimate of the person parameter - the "score" on a test with IRT - is computed and interpreted in a very different manner from traditional scores like number or percent correct. The individual's total number-correct score is not the actual score, but is rather based on the IRFs, leading to a weighted score when the model contains item discrimination parameters. The estimate is actually obtained by multiplying together the item response functions for each item to obtain a likelihood function, the highest point of which is the maximum likelihood estimate of θ. This highest point is typically estimated with IRT software using the Newton-Raphson method.[21] While scoring is much more sophisticated with IRT, for most tests, the (linear) correlation between the theta estimate and a traditional score is very high; often it is .95 or more. A graph of IRT scores against traditional scores shows an ogive shape, implying that the IRT estimates separate individuals at the borders of the range more than in the middle.

An important difference between CTT and IRT is the treatment of measurement error, indexed by the standard error of measurement. All tests, questionnaires, and inventories are imprecise tools; we can never know a person's true score, but rather only have an estimate, the observed score. There is some amount of random error which may push the observed score higher or lower than the true score. CTT assumes that the amount of error is the same for each examinee, but IRT allows it to vary.[22] Also, nothing about IRT refutes human development or improvement or assumes that a trait level is fixed. A person may learn skills, knowledge or even so-called "test-taking skills" which may translate to a higher true score. In fact, a portion of IRT research focuses on the measurement of change in trait level.[23]
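The following sketch (not part of the original article) illustrates the kind of Newton-Raphson iteration described above, using a hypothetical two-parameter (2PL) test; the item parameters and the response pattern are invented for illustration, and all-correct or all-incorrect patterns would need special handling because their likelihood has no finite maximum.

    import numpy as np

    a = np.array([1.0, 1.4, 0.7, 1.1])    # hypothetical discrimination parameters
    b = np.array([-1.0, 0.0, 0.5, 1.5])   # hypothetical difficulty parameters
    u = np.array([1, 1, 0, 0])            # observed item responses (1 = correct)

    def p_2pl(theta):
        return 1 / (1 + np.exp(-a * (theta - b)))

    theta = 0.0                                # starting value
    for _ in range(25):
        p = p_2pl(theta)
        grad = np.sum(a * (u - p))             # first derivative of the log-likelihood
        hess = -np.sum(a**2 * p * (1 - p))     # second derivative
        step = grad / hess
        theta -= step                          # Newton-Raphson update
        if abs(step) < 1e-6:
            break

    print(f"Maximum likelihood estimate of theta: {theta:.3f}")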

A comparison of classical and item response theories


Classical test theory (CTT) and IRT are largely concerned with the same problems but are different bodies of theory and entail different methods. Although the two paradigms are generally consistent and complementary, there are a number of points of difference: IRT makes stronger assumptions than CTT and in many cases provides correspondingly stronger findings; primarily, characterizations of error. Of course, these results only hold when the assumptions of the IRT models are actually met. Although CTT has led to important practical results, the model-based nature of IRT affords many advantages over analogous CTT findings. CTT test scoring procedures have the advantage of being simple to compute (and to explain), whereas IRT scoring generally requires relatively complex estimation procedures.

IRT provides several improvements in scaling items and people. The specifics depend upon the IRT model, but most models scale the difficulty of items and the ability of people on the same metric. Thus the difficulty of an item and the ability of a person can be meaningfully compared. Another improvement provided by IRT is that the parameters of IRT models are generally not sample- or test-dependent, whereas true score is defined in CTT in the context of a specific test. Thus IRT provides significantly greater flexibility in situations where different samples or test forms are used. These IRT findings are foundational for computerized adaptive testing.

It is worth also mentioning some specific similarities between CTT and IRT which help to understand the correspondence between concepts. First, Lord[24] showed that under the assumption that θ is normally distributed, discrimination in the 2PL model is approximately a monotonic function of the point-biserial correlation. In particular:

a_i ≅ ρ_it / √(1 − ρ_it²)

where ρ_it is the point-biserial correlation of item i. Thus, if the assumption holds, where there is a higher discrimination there will generally be a higher point-biserial correlation.

Another similarity is that while IRT provides for a standard error of each estimate and an information function, it is also possible to obtain an index for a test as a whole which is directly analogous to Cronbach's alpha, called the separation index. To do so, it is necessary to begin with a decomposition of an IRT estimate into a true location and error, analogous to the decomposition of an observed score into a true score and error in CTT. Let

θ̂ = θ + ε

where θ is the true location and ε is the error associated with an estimate. Then SE(θ̂) is an estimate of the standard deviation of ε for a person with a given weighted score, and the separation index is obtained as follows:

R_θ = var[θ] / var[θ̂] = (var[θ̂] − var[ε]) / var[θ̂]

where the mean squared standard error of the person estimates gives an estimate of the variance of the errors, ε, across persons. The standard errors are normally produced as a by-product of the estimation process. The separation index is typically very close in value to Cronbach's alpha.[25]

IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT.
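As a worked illustration of the separation index just described (a sketch, not taken from the source), the following Python fragment computes it from hypothetical person estimates and their standard errors.

    import numpy as np

    # Hypothetical person location estimates and their standard errors.
    theta_hat = np.array([-1.2, -0.4, 0.1, 0.6, 1.3, 2.0])
    se = np.array([0.45, 0.38, 0.35, 0.36, 0.40, 0.50])

    var_estimates = np.var(theta_hat, ddof=1)   # observed variance of the estimates
    error_variance = np.mean(se**2)             # mean squared standard error

    separation_index = (var_estimates - error_variance) / var_estimates
    print(f"Separation index (analogue of Cronbach's alpha): {separation_index:.3f}")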

References
[1] A. van Alphen, R. Halfens, A. Hasman and T. Imbos. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196-201.
[2] ETS Research Overview (http://www.ets.org/portal/site/ets/menuitem.c988ba0e5dd572bada20bc47c3921509/?vgnextoid=26fdaf5e44df4010VgnVCM10000022f95190RCRD&vgnextchannel=ceb2be3a864f4010VgnVCM10000022f95190RCRD)
[3] Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Press.
[7] Thissen, D. & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test Scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
[8] K. G. Jöreskog and D. Sörbom (1988). PRELIS 1 user's manual, version 1. Chicago: Scientific Software, Inc.
[9] Abramowitz, M., & Stegun, I.A. (1972). Handbook of Mathematical Functions. Washington, DC: U.S. Government Printing Office.
[11] Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In Keats, J.A., Taft, R., Heath, R.A., & Lovibond, S. (Eds.), Mathematical and Theoretical Systems. Amsterdam: Elsevier Science Publishers, North Holland, pp. 7-16.
[12] Steinberg, J. (2000). Frederic Lord, Who Devised Testing Yardstick, Dies at 87. New York Times, February 10, 2000.
[16] Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
[18] Fischer, G.H. & Molenaar, I.W. (1995). Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer.
[19] de Ayala, R.J. (2009). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press. (6.12), p. 144.
[20] Lazarsfeld, P.F., & Henry, N.W. (1968). Latent Structure Analysis. Boston: Houghton Mifflin.
[23] Hall, L.A., & McDonald, J.L. (2000). Measuring Change in Teachers' Perceptions of the Impact that Staff Development Has on Teaching (http://eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED441789&ERICExtSearch_SearchType_0=no&accno=ED441789). Paper presented at the Annual Meeting of the American Educational Research Association (New Orleans, LA, April 24-28, 2000).
[24] Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.


Additional reading
Many books have been written that address item response theory or contain IRT or IRT-like models. This is a partial list, focusing on texts that provide more depth.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Erlbaum. This book summarizes much of Lord's IRT work, including chapters on the relationship between IRT and classical methods, fundamentals of IRT, estimation, and several advanced topics. Its estimation chapter is now dated in that it primarily discusses the joint maximum likelihood method rather than the marginal maximum likelihood method implemented by Darrell Bock and his colleagues.

Embretson, Susan E.; Reise, Steven P. (2000). Item Response Theory for Psychologists (http://books.google.com/books?id=rYU7rsi53gQC). Psychology Press. ISBN 978-0-8058-2819-1. This book is an accessible introduction to IRT, aimed, as the title says, at psychologists.

Baker, Frank (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD. This introductory book is by one of the pioneers in the field, and is available online at http://edres.org/irt/baker/

Baker, Frank B.; Kim, Seock-Ho (2004). Item Response Theory: Parameter Estimation Techniques (http://books.google.com/books?id=y-Q_Q7pasJ0C) (2nd ed.). Marcel Dekker. ISBN 978-0-8247-5825-7. This book describes various item response theory models and furnishes detailed explanations of algorithms that can be used to estimate the item and ability parameters. Portions of the book are available online as limited preview at Google Books.

van der Linden, Wim J.; Hambleton, Ronald K., eds. (1996). Handbook of Modern Item Response Theory (http://books.google.com/books?id=aytUuwl4ku0C). Springer. ISBN 978-0-387-94661-0. This book provides a comprehensive overview of various popular IRT models. It is well suited for persons who have already gained a basic understanding of IRT.

de Boeck, Paul; Wilson, Mark (2004). Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach (http://books.google.com/books?id=pDeLy5L14mAC). Springer. ISBN 978-0-387-40275-8. This volume gives an integrated introduction to item response models, mainly aimed at practitioners, researchers and graduate students.

Fox, Jean-Paul (2010). Bayesian Item Response Modeling: Theory and Applications (http://books.google.com/books?id=BZcPc4ffSTEC). Springer. ISBN 978-1-4419-0741-7. This book discusses the Bayesian approach towards item response modeling. The book will be useful for persons (who are familiar with IRT) with an interest in analyzing item response data from a Bayesian perspective.


External links
"HISTORY OF ITEM RESPONSE THEORY (up to 1982)" (http://www.uic.edu/classes/ot/ot540/history.html), University of Illinois at Chicago
A Simple Guide to the Item Response Theory (PDF) (http://www.creative-wisdom.com/computer/sas/IRT.pdf)
Psychometric Software Downloads (http://www.umass.edu/remp/main_software.html)
flexMIRT IRT Software (http://flexMIRT.VPGCentral.com)
IRT Tutorial (http://work.psych.uiuc.edu/irt/tutorial.asp)
IRT Tutorial FAQ (http://sites.google.com/site/benroydo/irt-tutorial)
An introduction to IRT (http://edres.org/irt/)
The Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html)
IRT Command Language (ICL) computer program (http://www.b-a-h.com/software/irt/icl/)
IRT Programs from SSI, Inc. (http://www.ssicentral.com/irt/index.html)
IRT Programs from Assessment Systems Corporation (http://assess.com/xcart/home.php?cat=37)
IRT Programs from Winsteps (http://www.winsteps.com)
Latent Trait Analysis and IRT Models (http://www.john-uebersax.com/stat/lta.htm)
Rasch analysis (http://www.rasch-analysis.com/)
Free IRT software (http://www.john-uebersax.com/stat/papers.htm)
IRT Packages in R (http://cran.r-project.org/web/views/Psychometrics.html)

Jenkins activity survey


The Jenkins Activity Survey (JAS) is one of the most widely used methods of assessing Type A behavior. It is a psychometric survey of behavior and attitude designed to identify persons showing signs of Type A behavior. The test is multiple choice and self-administered. It was published in 1974 by C. David Jenkins, Stephen Zyzanski, and Ray Rosenman. The terms Type A and Type B personality were originally described in the work of Rosenman and Friedman in 1959. The JAS was developed in an attempt to duplicate the clinical assessment of the Type A behavior pattern by employing an objective psychometric procedure. Individuals displaying a Type A behavior pattern are characterized by extremes of competitiveness, striving for achievement and personal recognition, aggressiveness, haste, impatience, explosiveness and loudness in speech, characteristics which the JAS attempts to measure.

External links
Further information [1]

References
[1] http://www.cps.nova.edu/~cpphelp/JAS.html


Jensen box
The Jensen box was developed by University of California, Berkeley psychologist Arthur Jensen as a standard apparatus for measuring choice reaction time, especially in relationship to differences in intelligence.[1] Since Jensen created this approach, correlations between simple and choice reaction time and intelligence have been demonstrated in many hundreds of studies. Perhaps the best was reported by Ian Deary and colleagues, in a population-based cohort study of 900 individuals, demonstrating correlations of IQ with simple and choice reaction time of 0.3 and 0.5 respectively, and of 0.26 with the degree of variation between trials shown by an individual.

The Jensen box.

The standard box is around 20 inches wide and 12 deep, with a sloping face on which 8 buttons are arrayed in a semicircle, with a 'home' key in the lower center. Above each response button lies a small LED which can be illuminated, and the box contains a loudspeaker to play alerting sounds. Following Hick's law,[2] reaction times (RTs) slow as a function of the log2 of the number of choices presented. Thus responses are fastest when all but one button is covered, and slowest when all 8 responses are available. Several parameters can be extracted: the mean 1-choice RT gives simple reaction time; the slope of the function across 1, 2, 4, and 8 lights gives the rate of information processing; and the variance or standard deviation in RTs can be extracted to give a measure of response variability within subjects. Finally, the time to lift off the home button and the time to hit the response button can be measured separately, and these are typically thought of as assessing decision time and movement time, though in the standard paradigm subjects can shift decision time into the movement phase by lifting off the home button while the location computation is still incomplete. Masking the stimulus light can eliminate this artifact. Simple reaction time correlates around .4 with general ability, and there is some evidence that the slope of responding does also, so long as access to the stimulus is controlled.
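As an illustration of how the Hick's-law parameters mentioned above can be extracted, here is a small Python sketch (not from the original article) that fits a straight line to hypothetical mean reaction times across the 1-, 2-, 4- and 8-choice conditions; the RT values are invented.

    import numpy as np

    n_choices = np.array([1, 2, 4, 8])
    mean_rt = np.array([320.0, 355.0, 388.0, 425.0])   # hypothetical mean RTs in ms

    bits = np.log2(n_choices)                          # information content per Hick's law
    slope, intercept = np.polyfit(bits, mean_rt, 1)    # least-squares fit: RT = intercept + slope * bits

    print(f"Estimated simple RT (intercept at 0 bits): {intercept:.1f} ms")
    print(f"Rate of information processing (slope): {slope:.1f} ms per bit")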

References
[1] A. R. Jensen. (1987). Individual differences in the Hick paradigm. In Speed of information-processing and intelligence. P. A. Vernon and et al., Norwood, NJ, USA, Ablex Publishing Corp, 101-175.


Kuder–Richardson Formula 20
In statistics, the Kuder–Richardson Formula 20 (KR-20), first published in 1937,[1] is a measure of internal consistency reliability for measures with dichotomous choices. It is analogous to Cronbach's α (alpha), except Cronbach's α is also used for non-dichotomous (continuous) measures.[2] A high KR-20 coefficient (e.g., >0.90) indicates a homogeneous test. Values can range from 0.00 to 1.00 (sometimes expressed as 0 to 100), with high values indicating that the examination is likely to correlate with alternate forms (a desirable characteristic). The KR-20 may be affected by the difficulty of the test, the spread in scores and the length of the examination. In the case when scores are not tau-equivalent (for example, when the test does not consist of homogeneous items but rather of items of increasing difficulty), the KR-20 is an indication of the lower bound of internal consistency (reliability). The KR-20 formula can't be used when multiple-choice questions involve partial credit, and it requires detailed item analysis.[3]

The formula is

ρ_KR20 = [K / (K − 1)] · [1 − (Σ p_i q_i) / σ_X²]

where K is the number of test items (i.e., the length of the test), p_i is the proportion of correct responses to test item i, q_i is the proportion of incorrect responses to test item i (so that p_i + q_i = 1), and the variance for the denominator is

σ_X² = Σ (x − x̄)² / n

where n is the total sample size. If it is important to use unbiased estimators, then the sum of squares should be divided by the degrees of freedom (n − 1) and the probabilities are multiplied by n / (n − 1).
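A minimal Python sketch of the computation (not part of the original article) may help; the 0/1 score matrix below is invented, and the population variance (division by n) is used to match the formula above.

    import numpy as np

    # Hypothetical dichotomous item scores: rows = examinees, columns = items.
    scores = np.array([
        [1, 1, 0, 1, 0],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 1, 0, 1],
        [1, 1, 0, 1, 1],
    ])

    k = scores.shape[1]            # number of items
    p = scores.mean(axis=0)        # proportion correct for each item
    q = 1 - p
    total = scores.sum(axis=1)     # total score for each examinee
    var_total = total.var()        # population variance (divides by n)

    kr20 = (k / (k - 1)) * (1 - (p * q).sum() / var_total)
    print(f"KR-20 = {kr20:.3f}")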

Since Cronbach's α was published in 1951, there has been no known advantage to KR-20 over Cronbach's α. KR-20 can be seen as a special case of the Cronbach formula, which has the further advantage of handling both dichotomous and continuous variables.

References
[1] Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151-160.
[2] Cortina, J. M. (1993). What Is Coefficient Alpha? An Examination of Theory and Applications. Journal of Applied Psychology, 78(1), 98-104.
[3] http://chemed.chem.purdue.edu/chemed/stats.html (as of 3/27/2013)

External links
Statistical analysis of multiple choice exams (http://chemed.chem.purdue.edu/chemed/stats.html)
Quality of assessment chapter in Illinois State Assessment handbook (1995) (http://www.gower.k12.il.us/Staff/ASSESS/4_ch2app.htm)


Latent variable
In statistics, latent variables (as opposed to observable variables), are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. Latent variable models are used in many disciplines, including psychology, economics, machine learning/artificial intelligence, bioinformatics, natural language processing, management and the social sciences. Sometimes latent variables correspond to aspects of physical reality, which could in principle be measured, but may not be for practical reasons. In this situation, the term hidden variables is commonly used (reflecting the fact that the variables are "really there", but hidden). Other times, latent variables correspond to abstract concepts, like categories, behavioral or mental states, or data structures. The terms hypothetical variables or hypothetical constructs may be used in these situations. One advantage of using latent variables is that it reduces the dimensionality of data. A large number of observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data. In this sense, they serve a function similar to that of scientific theories. At the same time, latent variables link observable ("sub-symbolic") data in the real world to symbolic data in the modeled world. Latent variables, as created by factor analytic methods, generally represent 'shared' variance, or the degree to which variables 'move' together. Variables that have no correlation cannot result in a latent construct based on the common factor model.[1]

Examples of latent variables


Economics
Examples of latent variables from the field of economics include quality of life, business confidence, morale, happiness and conservatism: these are all variables which cannot be measured directly. By linking these latent variables to other, observable variables, the values of the latent variables can be inferred from measurements of the observable variables. Quality of life is a latent variable which cannot be measured directly, so observable variables are used to infer quality of life. Observable variables used to measure quality of life include wealth, employment, environment, physical and mental health, education, recreation and leisure time, and social belonging.

Psychology
The "Big Five personality traits" have been inferred using factor analysis.
extraversion
spatial ability
wisdom - Two of the more predominant means of assessing wisdom include wisdom-related performance and latent variable measures.


Common methods for inferring latent variables


Hidden Markov models
Factor analysis
Principal component analysis
Latent semantic analysis and probabilistic latent semantic analysis
EM algorithms

Bayesian algorithms and methods


Bayesian statistics is often used for inferring latent variables.
Latent Dirichlet allocation
The Chinese restaurant process is often used to provide a prior distribution over assignments of objects to latent categories.
The Indian buffet process is often used to provide a prior distribution over assignments of latent binary features to objects.

References

Law of comparative judgment


The law of comparative judgment was conceived by L. L. Thurstone. In modern day terminology, it is more aptly described as a model that is used to obtain measurements from any process of pairwise comparison. Examples of such processes are the comparison of perceived intensity of physical stimuli, such as the weights of objects, and comparisons of the extremity of an attitude expressed within statements, such as statements about capital punishment. The measurements represent how we perceive objects, rather than being measurements of actual physical properties. This kind of measurement is the focus of psychometrics and psychophysics. In somewhat more technical terms, the law of comparative judgment is a mathematical representation of a discriminal process, which is any process in which a comparison is made between pairs of a collection of entities with respect to magnitudes of an attribute, trait, attitude, and so on. The theoretical basis for the model is closely related to item response theory and the theory underlying the Rasch model, which are used in Psychology and Education to analyse data from questionnaires and tests.

Background
Thurstone published a paper on the law of comparative judgment in 1927. In this paper he introduced the underlying concept of a psychological continuum for a particular 'project in measurement' involving the comparison between a series of stimuli, such as weights and handwriting specimens, in pairs. He soon extended the domain of application of the law of comparative judgment to things that have no obvious physical counterpart, such as attitudes and values (Thurstone, 1929). For example, in one experiment, people compared statements about capital punishment to judge which of each pair expressed a stronger positive (or negative) attitude. The essential idea behind Thurstone's process and model is that it can be used to scale a collection of stimuli based on simple comparisons between stimuli two at a time: that is, based on a series of pairwise comparisons. For example, suppose that someone wishes to measure the perceived weights of a series of five objects of varying masses. By having people compare the weights of the objects in pairs, data can be obtained and the law of comparative judgment applied to estimate scale values of the perceived weights. This is the perceptual counterpart to the physical weight of the objects. That is, the scale represents how heavy people perceive the objects to be based on

the comparisons. Although Thurstone referred to it as a law, as stated above, in terms of modern psychometric theory the 'law' of comparative judgment is more aptly described as a measurement model. It represents a general theoretical model which, applied in a particular empirical context, constitutes a scientific hypothesis regarding the outcomes of comparisons between some collection of objects. If data agree with the model, it is possible to produce a scale from the data.


Relationships to pre-existing psychophysical theory


Thurstone showed that in terms of his conceptual framework, Weber's law and the so-called Weber-Fechner law, which are generally regarded as one and the same, are independent, in the sense that one may be applicable but not the other to a given collection of experimental data. In particular, Thurstone showed that if Fechner's law applies and the discriminal dispersions associated with stimuli are constant (as in Case 5 of the LCJ outlined below), then Weber's law will also be verified. He considered that the Weber-Fechner law and the LCJ both involve a linear measurement on a psychological continuum whereas Weber's law does not. Weber's law essentially states that how much people perceive physical stimuli to change depends on how big a stimulus is. For example, if someone compares a light object of 1 kg with one slightly heavier, they can notice a relatively small difference, perhaps when the second object is 1.2 kg. On the other hand, if someone compares a heavy object of 30 kg with a second, the second must be quite a bit larger for a person to notice the difference, perhaps when the second object is 36 kg. People tend to perceive differences that are proportional to the size rather than always noticing a specific difference irrespective of the size. The same applies to brightness, pressure, warmth, loudness and so on. Thurstone stated Weber's law as follows: "The stimulus increase which is correctly discriminated in any specified proportion of attempts (except 0 and 100 per cent) is a constant fraction of the stimulus magnitude" (Thurstone, 1959, p. 61). He considered that Weber's law said nothing directly about sensation intensities at all. In terms of Thurstone's conceptual framework, the association posited between perceived stimulus intensity and the physical magnitude of the stimulus in the Weber-Fechner law will only hold when Weber's law holds and the just noticeable difference (JND) is treated as a unit of measurement. Importantly, this is not simply given a priori (Michell, 1997, p. 355), as is implied by purely mathematical derivations of the one law from the other. It is, rather, an empirical question whether measurements have been obtained; one which requires justification through the process of stating and testing a well-defined hypothesis in order to ascertain whether specific theoretical criteria for measurement have been satisfied. Some of the relevant criteria were articulated by Thurstone, in a preliminary fashion, including what he termed the additivity criterion. Accordingly, from the point of view of Thurstone's approach, treating the JND as a unit is justifiable provided only that the discriminal dispersions are uniform for all stimuli considered in a given experimental context. Similar issues are associated with Stevens' power law. In addition, Thurstone employed the approach to clarify other similarities and differences between Weber's law, the Weber-Fechner law, and the LCJ. An important clarification is that the LCJ does not necessarily involve a physical stimulus, whereas the other 'laws' do. Another key difference is that Weber's law and the LCJ involve proportions of comparisons in which one stimulus is judged greater than another whereas the so-called Weber-Fechner law does not.


The general form of the law of comparative judgment


The most general form of the LCJ is

S_i − S_j = z_ij √(σ_i² + σ_j² − 2 r_ij σ_i σ_j)

in which:
S_i is the psychological scale value of stimulus i;
z_ij is the sigma (standard normal deviate) corresponding with the proportion of occasions on which the magnitude of stimulus i is judged to exceed the magnitude of stimulus j;
σ_i is the discriminal dispersion of stimulus i;
r_ij is the correlation between the discriminal deviations of stimuli i and j.
The discriminal dispersion of a stimulus i is the dispersion of fluctuations of the discriminal process for a uniform repeated stimulus about its modal value. Thurstone (1959, p. 20) used the term discriminal process to refer to the "psychological values of psychophysics"; that is, the values on a psychological continuum associated with a given stimulus.

Case 5 of the law of comparative judgment


Thurstone specified five particular cases of the 'law', or measurement model. An important case of the model is Case 5, in which the discriminal dispersions are specified to be uniform and uncorrelated. This form of the model can be represented as follows:

S_i − S_j = z_ij σ √2

where σ is the common discriminal dispersion. In this case of the model, the difference S_i − S_j can be inferred directly from the proportion of occasions on which i is judged greater than j, if it is hypothesised that the differences are distributed according to some density function, such as the normal distribution or the logistic function. In order to do so, it is necessary to let σ√2 = 1, which is in effect an arbitrary choice of the unit of measurement. Letting p_ij be the proportion of occasions on which i is judged greater than j, if, for example, it is hypothesised that the differences are normally distributed, then it would be inferred that

S_i − S_j = Φ⁻¹(p_ij)

where Φ⁻¹ is the inverse of the cumulative standard normal distribution. When a simple logistic function is employed instead of the normal density function, then the model has the structure of the Bradley-Terry-Luce model (BTL model) (Bradley & Terry, 1952; Luce, 1959). In turn, the Rasch model for dichotomous data (Rasch, 1960/1980) is identical to the BTL model after the person parameter of the Rasch model has been eliminated, as is achieved through statistical conditioning during the process of Conditional Maximum Likelihood estimation. With this in mind, the specification of uniform discriminal dispersions is equivalent to the requirement of parallel Item Characteristic Curves (ICCs) in the Rasch model. Accordingly, as shown by Andrich (1978), the Rasch model should, in principle, yield essentially the same results as those obtained from a Thurstone scale. Like the Rasch model, when applied in a given empirical context, Case 5 of the LCJ constitutes a mathematized hypothesis which embodies theoretical criteria for measurement.
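As an illustration only (not part of the original article), the following Python sketch applies the classical Case 5 procedure to a small hypothetical matrix of pairwise proportions: the proportions are converted to unit normal deviates and the column means give the scale values, with the origin fixed arbitrarily at the first stimulus.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical proportions: P[i, j] = proportion of judgments placing stimulus j above stimulus i.
    P = np.array([
        [0.50, 0.70, 0.85, 0.95],
        [0.30, 0.50, 0.65, 0.80],
        [0.15, 0.35, 0.50, 0.70],
        [0.05, 0.20, 0.30, 0.50],
    ])

    Z = norm.ppf(P)                  # unit normal deviates, per Case 5
    scale_values = Z.mean(axis=0)    # column means give the scale values
    scale_values -= scale_values[0]  # anchor the first stimulus at zero (arbitrary origin)

    print("Case 5 scale values:", np.round(scale_values, 3))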


Applications
One important application involving the law of comparative judgment is the widely used Analytic Hierarchy Process, a structured technique for helping people deal with complex decisions. It uses pairwise comparisons of tangible and intangible factors to construct ratio scales that are useful in making important decisions.[1]

References
Andrich, D. (1978b). Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2, 449-460.
Bradley, R.A. and Terry, M.E. (1952). Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika, 39, 324-345.
Krus, D.J., & Kennedy, P.H. (1977). Normal scaling of dominance matrices: The domain-referenced model. Educational and Psychological Measurement, 37, 189-193. (http://www.visualstatistics.net/Scaling/Domain Referenced Scaling/Domain-Referenced Scaling.htm)
Luce, R.D. (1959). Individual Choice Behaviours: A Theoretical Analysis. New York: J. Wiley.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355-383.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
Thurstone, L.L. (1927). A law of comparative judgement. Psychological Review, 34, 273-286.
Thurstone, L.L. (1929). The Measurement of Psychological Value. In T.V. Smith and W.K. Wright (Eds.), Essays in Philosophy by Seventeen Doctors of Philosophy of the University of Chicago. Chicago: Open Court.
Thurstone, L.L. (1959). The Measurement of Values. Chicago: The University of Chicago Press.

External links
"The Measurement of Psychological Value" (http://www.brocku.ca/MeadProject/Thurstone/Thurstone_1929a.html)
How to Analyze Paired Comparisons (tutorial on using Thurstone's Law of Comparative Judgement) (http://www.ee.washington.edu/research/guptalab/publications/PairedComparisonTutorialTsukidaGuptaUWTechReport2011.pdf)
L.L. Thurstone psychometric laboratory (http://www.unc.edu/depts/quantpsy/thurstone/history.htm)


Likert scale
A Likert scale (pronounced /ˈlɪkərt/[1]) is a psychometric scale commonly involved in research that employs questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, or more accurately the Likert-type scale, even though the two are not synonymous. The scale is named after its inventor, psychologist Rensis Likert.[2] Likert distinguished between a scale proper, which emerges from collective responses to a set of items (usually eight or more), and the format in which responses are scored along a range. Technically speaking, a Likert scale refers only to the former. The difference between these two concepts has to do with the distinction Likert made between the underlying phenomenon being investigated and the means of capturing variation that points to the underlying phenomenon.[3]

When responding to a Likert questionnaire item, respondents specify their level of agreement or disagreement on a symmetric agree-disagree scale for a series of statements. Thus, the range captures the intensity of their feelings for a given item.[4] A scale can be created as the simple sum of questionnaire responses over the full range of the scale. In so doing, Likert scaling assumes that distances on each item are equal. Importantly, "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments"[5] (p. 197). By contrast, modern test theory treats the difficulty of each item (the ICCs) as information to be incorporated in scaling items.

Sample question presented using a five-point Likert item


An important distinction must be made between a Likert scale and a Likert item. The Likert scale is the sum of responses on several Likert items. Because Likert items are often accompanied by a visual analog scale (e.g., a horizontal line, on which a subject indicates his or her response by circling or checking tick-marks), the items are sometimes called scales themselves. This is the source of much confusion; it is better, therefore, to reserve the term Likert scale to apply to the summed scale, and Likert item to refer to an individual item.

A Likert item is simply a statement which the respondent is asked to evaluate according to any kind of subjective or objective criteria; generally the level of agreement or disagreement is measured. It is considered symmetric or "balanced" because there are equal amounts of positive and negative positions.[6] Often five ordered response levels are used, although many psychometricians advocate using seven or nine levels; a recent empirical study[7] found that a 5- or 7-point scale may produce slightly higher mean scores relative to the highest possible attainable score, compared to those produced from a 10-point scale, and this difference was statistically significant. In terms of the other data characteristics, there was very little difference among the scale formats in terms of variation about the mean, skewness or kurtosis.

The format of a typical five-level Likert item, for example, could be:
1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree

A Likert scale pertaining to Wikipedia can be calculated using these five Likert items.

Likert scaling is a bipolar scaling method, measuring either positive or negative response to a statement. Sometimes an even-point scale is used, where the middle option of "Neither agree nor disagree" is not available. This is sometimes called a "forced choice" method, since the neutral option is removed.[8] The neutral option can be seen as an easy option to take when a respondent is unsure, and so whether it is a true neutral option is questionable. A 1987 study found negligible differences between the use of "undecided" and "neutral" as the middle option in a 5-point Likert scale.[9]

Likert scales may be subject to distortion from several causes. Respondents may avoid using extreme response categories (central tendency bias); agree with statements as presented (acquiescence bias); or try to portray themselves or their organization in a more favorable light (social desirability bias). Designing a scale with balanced keying (an equal number of positive and negative statements) can obviate the problem of acquiescence bias, since acquiescence on positively keyed items will balance acquiescence on negatively keyed items, but central tendency and social desirability are somewhat more problematic.


Scoring and analysis


After the questionnaire is completed, each item may be analyzed separately or in some cases item responses may be summed to create a score for a group of items. Hence, Likert scales are often called summative scales. Whether individual Likert items can be considered as interval-level data, or whether they should be treated as ordered-categorical data is the subject of considerable disagreement in the literature,[10][11] with strong convictions on what are the most applicable methods. This disagreement can be traced back, in many respects, to the extent to which Likert items are interpreted as being ordinal data. There are two primary considerations in this discussion. First, Likert scales are arbitrary. The value assigned to a Likert item has no objective numerical basis, either in terms of measure theory or scale (from which a distance metric can be determined). The value assigned to each Likert item is simply determined by the researcher designing the survey, who makes the decision based on a desired level of detail. However, by convention Likert items tend to be assigned progressive positive integer values. Likert scales typically range from 2 to 10 with 5 or 7 being the most common. Further, this progressive structure of the scale is such that each successive Likert item is treated as indicating a better response than the preceding value. (This may differ in cases where reverse ordering of the Likert Scale is needed). The second, and possibly more important point, is whether the distance between each successive item category is equivalent, which is inferred traditionally. For example, in the above five-point Likert item, the inference is that the distance between category 1 and 2 is the same as between category 3 and 4. In terms of good research practice, an equidistant presentation by the researcher is important; otherwise a bias in the analysis may result. For example, a four-point Likert item with categories "Poor", "Average", "Good", and "Very Good" is unlikely to have all equidistant categories since there is only one category that can receive a below average rating. This would arguably bias any result in favor of a positive outcome. On the other hand, even if a researcher presents what he or she believes are equidistant categories, it may not be interpreted as such by the respondent. A good Likert scale, as above, will present a symmetry of categories about a midpoint with clearly defined linguistic qualifiers. In such symmetric scaling, equidistant attributes will typically be more clearly observed or, at least, inferred. It is when a Likert scale is symmetric and equidistant that it will behave more like an interval-level measurement. So while a Likert scale is indeed ordinal, if well presented it may nevertheless approximate an interval-level measurement. This can be beneficial since, if it was treated just as an ordinal scale, then some valuable information could be lost if the distance between Likert items were not available for consideration. The important idea here is that the appropriate type of analysis is dependent on how the Likert scale has been presented. Given the Likert Scale's ordinal basis, summarizing the central tendency of responses from a Likert scale by using either the median or the mode is best, with spread measured by quartiles or percentiles.[12] Non-parametric tests should be preferred for statistical inferences, such as chi-squared test, MannWhitney test, Wilcoxon signed-rank

test, or Kruskal–Wallis test. While some commentators[13] consider that parametric analysis is justified for a Likert scale using the Central Limit Theorem, this should be reserved for when the Likert scale has suitable symmetry and equidistance so that an interval-level measurement can be approximated and reasonably inferred. Responses to several Likert questions may be summed, providing that all questions use the same Likert scale and that the scale is a defensible approximation to an interval scale, in which case they may be treated as interval data measuring a latent variable. If the summed responses fulfill these assumptions, parametric statistical tests such as the analysis of variance can be applied. These can be applied only when 4 to 8 Likert questions (preferably closer to 8) are summed.[14]

Data from Likert scales are sometimes converted to binomial data by combining all agree and disagree responses into two categories of "accept" and "reject". The chi-squared, Cochran Q, or McNemar test are common statistical procedures used after this transformation. Consensus-based assessment (CBA) can be used to create an objective standard for Likert scales in domains where no generally accepted or objective standard exists. Consensus-based assessment can also be used to refine or even validate generally accepted standards.
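As a rough illustration of the scoring and analysis choices described above (a sketch, not from the source), the following Python fragment sums hypothetical five-point Likert items into scale scores, reports medians and quartiles, and compares two groups with a non-parametric test; the response data are invented.

    import numpy as np
    from scipy.stats import mannwhitneyu

    # Hypothetical responses to five 5-point Likert items (rows = respondents).
    group_a = np.array([[4, 5, 4, 3, 4], [3, 4, 4, 4, 5], [5, 5, 4, 4, 4], [2, 3, 3, 4, 3]])
    group_b = np.array([[2, 3, 2, 3, 2], [3, 3, 2, 2, 3], [4, 3, 3, 3, 2], [2, 2, 3, 3, 3]])

    # The Likert *scale* score is the sum over the Likert *items*.
    scores_a = group_a.sum(axis=1)
    scores_b = group_b.sum(axis=1)

    print("Group A median:", np.median(scores_a), "quartiles:", np.percentile(scores_a, [25, 75]))
    print("Group B median:", np.median(scores_b), "quartiles:", np.percentile(scores_b, [25, 75]))

    # Non-parametric comparison of the two groups, as suggested for ordinal data.
    stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    print(f"Mann-Whitney U = {stat}, p = {p_value:.3f}")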


Level of measurement
The five response categories are often believed to represent an interval level of measurement. But this can only be the case if the intervals between the scale points correspond to empirical observations in a metric sense. Reips and Funke (2008)[15] show that this criterion is much better met by a visual analogue scale. In fact, phenomena may even appear which call the ordinal scale level of Likert scales into question. For example, in a set of items A, B, C rated with a Likert scale, circular relations like A > B, B > C and C > A can appear. This violates the axiom of transitivity for the ordinal scale.

Rasch model
Likert scale data can, in principle, be used as a basis for obtaining interval level estimates on a continuum by applying the polytomous Rasch model, when data can be obtained that fit this model. In addition, the polytomous Rasch model permits testing of the hypothesis that the statements reflect increasing levels of an attitude or trait, as intended. For example, application of the model often indicates that the neutral category does not represent a level of attitude or trait between the disagree and agree categories. Again, not every set of Likert scaled items can be used for Rasch measurement. The data has to be thoroughly checked to fulfill the strict formal axioms of the model.

Pronunciation
Rensis Likert, the developer of the scale, pronounced his name 'lick-urt' with a short "i" sound.[16][17] It has been claimed that Likert's name "is among the most mispronounced in [the] field",[18] as many people pronounce it with a diphthong "i" sound ('lie-kurt').

References
[3] Carifio, James and Rocco J. Perla. (2007) Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes. Journal of Social Sciences 3 (3): 106-116 [5] A. van Alphen, R. Halfens, A. Hasman and T. Imbos. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing. 20, 196-201 [8] Allen, Elaine and Seaman, Christopher (2007). "Likert Scales and Data Analyses". Quality Progress 2007, 64-65. [9] Armstrong, Robert (1987). "The midpoint on a Five-Point Likert-Type Scale". Perceptual and Motor Skills: Vol 64, pp359-362. [10] Jamieson, Susan (2004). Likert Scales: How to (Ab)use Them, Medical Education, Vol. 38(12), pp.1217-1218

[11] Norman, Geoff (2010). Likert scales, levels of measurement and the laws of statistics. Advances in Health Science Education. Vol 15(5) pp625-632 [12] Jamieson, Susan (2004) [13] Norman, Geoff (2010) [14] Carifio and Perla, 2007, Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes. Journal of Social Sciences 3 (3): 106-116.


External links
Carifio (2007). "Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes" (http://www.comp.dit.ie/dgordon/Courses/ResearchMethods/likertscales.pdf). Retrieved September 19, 2011.
Trochim, William M. (October 20, 2006). "Likert Scaling" (http://www.socialresearchmethods.net/kb/scallik.php). Research Methods Knowledge Base, 2nd Edition. Retrieved April 30, 2009.
Uebersax, John S. (2006). "Likert Scales: Dispelling the Confusion" (http://www.john-uebersax.com/stat/likert.htm). Retrieved August 17, 2009.
"A search for the optimum feedback scale" (http://www.getfeedback.net/kb/Choosing-the-optimium-feedback-scale). Getfeedback.
Correlation scatter-plot matrix for ordered-categorical data (http://www.r-statistics.com/2010/04/correlation-scatter-plot-matrix-for-ordered-categorical-data/) - On the visual presentation of correlation between Likert scale variables.
Net stacked distribution of Likert data (http://www.organizationview.com/net-stacked-distribution-a-better-way-to-visualize-likert-data/) - A method of visualizing Likert data to highlight differences from a central neutral value.

Linear-on-the-fly testing
Linear-on-the-fly testing, often referred to as LOFT, is a method of delivering educational or professional examinations. Competing methods include traditional linear fixed-form delivery and computerized adaptive testing. LOFT is a compromise between the two, in an effort to maintain the equivalence of the set of items that each examinee sees, which is found in fixed-form delivery, while attempting to reduce item exposure and enhance test security.

Fixed-form delivery, which most people are familiar with, entails the testing organization determining one or several fixed sets of items to be delivered together. For example, suppose the test contains 100 items, and the organization wishes for two forms. Two forms are published with a fixed set of 100 items each, some of which should overlap to enable equating. All examinees that take the test are given one of the two forms. If this exam is high volume, meaning that there is a large number of examinees, the security of the examination could be in jeopardy. Many of the test items would become well known in the population of examinees. To offset this, more forms would be needed; if there were eight forms, not as many examinees would see each item.

LOFT takes this to an extreme, and attempts to construct a unique exam for each candidate, within the given constraints of the testing program. Rather than publishing a fixed set of items, a large pool of items is delivered to the computer on which the examinee is taking the exam. Also delivered is a computer program to pseudo-randomly select items so that every examinee will receive a test that is equivalent with respect to content and statistical characteristics,[1] although composed of a different set of items. This is usually done with item response theory.
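The following Python sketch (not from the original article) shows the flavor of on-the-fly form assembly under a simple content blueprint; the item pool, content areas and counts are invented, and a production system would additionally balance statistical targets such as test information computed from IRT item parameters.

    import random

    # Hypothetical item pool: (item_id, content_area, IRT difficulty b).
    pool = [
        ("A1", "algebra", -0.8), ("A2", "algebra", 0.1), ("A3", "algebra", 0.9),
        ("G1", "geometry", -0.5), ("G2", "geometry", 0.4), ("G3", "geometry", 1.2),
        ("S1", "statistics", -0.2), ("S2", "statistics", 0.6), ("S3", "statistics", 1.0),
    ]

    # Hypothetical content blueprint: items to draw from each area.
    blueprint = {"algebra": 2, "geometry": 2, "statistics": 2}

    def assemble_form(pool, blueprint, seed):
        """Pseudo-randomly assemble one examinee's form so that it satisfies the blueprint."""
        rng = random.Random(seed)   # seeding per examinee makes the draw reproducible and auditable
        form = []
        for area, count in blueprint.items():
            candidates = [item for item in pool if item[1] == area]
            form.extend(rng.sample(candidates, count))
        rng.shuffle(form)
        return form

    # Each examinee receives a different, but blueprint-equivalent, set of items.
    print(assemble_form(pool, blueprint, seed=1))
    print(assemble_form(pool, blueprint, seed=2))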


References
[1] Luecht, R.M. (2005). Some Useful Cost-Benefit Criteria for Evaluating Computer-based Test Delivery Models and Systems. Journal of Applied Testing Technology, 7(2). (http://www.testpublishers.org/Documents/JATT2005_rev_Criteria4CBT_RMLuecht_Apr2005.pdf)

Frederic M. Lord
Frederic M. Lord (November 12, 1912, Hanover, NH - February 5, 2000) was a psychometrician for Educational Testing Service. He was the source of much of the seminal research on item response theory,[1] including two important books: Statistical Theories of Mental Test Scores (1968, with Melvin Novick, and two chapters by Allan Birnbaum), and Applications of Item Response Theory to Practical Testing Problems (1980). Lord has been called the "Father of Modern Testing."[2]

References
[1] ETS Research Overview (http://www.ets.org/portal/site/ets/menuitem.c988ba0e5dd572bada20bc47c3921509/?vgnextoid=26fdaf5e44df4010VgnVCM10000022f95190RCRD&vgnextchannel=ceb2be3a864f4010VgnVCM10000022f95190RCRD)
[2] NCME News: Frederic Lord, Father of Modern Testing, Dies at 87 (http://www.ncme.org/news/newsdetail.cfm?ID=21&ArchView=y)

Measurement invariance
Measurement invariance or measurement equivalence is a statistical property of measurement that indicates that the same construct is being measured across some specified groups. For example, measurement invariance can be used to study whether a given measure is interpreted in a conceptually similar manner by respondents representing different genders or cultural backgrounds. Violations of measurement invariance may preclude meaningful interpretation of measurement data. Tests of measurement invariance are increasingly used in fields such as psychology to supplement evaluation of measurement quality rooted in classical test theory.[1] Measurement invariance is relevant in the context of latent variables. Measurement invariance is supported if relationships between manifest indicator variables and the latent construct are the same across groups. Measurement invariance is usually tested in the framework of multiple-group confirmatory factor analysis.[2]

References
[1] Vandenberg, Robert J. & Lance, Charles E. (2000). A Review and Synthesis of the Measurement Invariance Literature: Suggestions, Practices, and Recommendations for Organizational Research. Organizational Research Methods, 3, 4-70.
[2] Chen, Fang Fang, Sousa, Karen H., and West, Stephen G. (2005). Testing Measurement Invariance of Second-Order Factor Models. Structural Equation Modeling, 12, 471-492.


Mediation (statistics)
In statistics, a mediation model is one that seeks to identify and explicate the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third explanatory variable, known as a mediator variable. Rather than hypothesizing a direct causal relationship between the independent variable and the dependent variable, a mediational model hypothesizes that the independent variable influences the mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables.[1] In other words, mediating relationships occur when a third variable plays an important role in governing the relationship between the other two variables.

A simple statistical mediation model.

Researchers are now focusing their studies on better understanding known findings. Mediation analyses are employed to understand a known relationship by exploring the underlying mechanism or process by which one variable (X) influences another variable (Y). For example, a cause X of some variable (Y) presumably precedes Y in time and has a generative mechanism that accounts for its impact on Y.[2] Thus, if gender is thought to be the cause of some characteristic, one assumes that other social or biological mechanisms are present in the concept of gender that can explain how gender-associated differences arise. The explicit inclusion of such a mechanism is called a mediator.

Baron and Kenny's (1986) Steps for Mediation


Baron and Kenny (1986)[3] laid out several requirements that must be met to form a true mediation relationship. They are outlined below using a real-world example. See the diagram above for a visual representation of the overall mediating relationship to be explained.

Step 1: Regress the dependent variable on the independent variable. In other words, confirm that the independent variable is a significant predictor of the dependent variable.

Independent Variable → Dependent Variable

β11 is significant.

Step 2: Regress the mediator on the independent variable. In other words, confirm that the independent variable is a significant predictor of the mediator. If the mediator is not associated with the independent variable, then it couldn't possibly mediate anything.

Independent Variable → Mediator

β21 is significant.

Step 3: Regress the dependent variable on both the mediator and independent variable. In other words, confirm that the mediator is a significant predictor of the dependent variable, while controlling for the independent variable.

This step involves demonstrating that when the mediator and the independent variable are used simultaneously to predict the dependent variable, the previously significant path between the independent and dependent variable (Step 1) is now greatly reduced, if not nonsignificant. In other words, if the mediator were to be removed from the relationship, the relationship between the independent and dependent variables would be noticeably reduced.


β32 is significant. β31 should be smaller in absolute value than the original effect of the independent variable (β11 above).

Example

The following example, drawn from Howell (2009),[4] explains each step of Baron and Kenny's requirements to understand further how a mediation effect is characterized. Step 1 and step 2 use simple regression analysis, whereas step 3 uses multiple regression analysis.

Step 1: How you were parented (i.e., independent variable) predicts how confident you feel about parenting your own children (i.e., dependent variable).

How you were parented → Confidence in own parenting abilities.

Step 2: How you were parented (i.e., independent variable) predicts your feelings of competence and self-esteem (i.e., mediator).

How you were parented → Feelings of competence and self-esteem.

Step 3: Your feelings of competence and self-esteem (i.e., mediator) predict how confident you feel about parenting your own children (i.e., dependent variable), while controlling for how you were parented (i.e., independent variable).

Such findings would lead to the conclusion that your feelings of competence and self-esteem mediate the relationship between how you were parented and how confident you feel about parenting your own children.

Note: If step 1 does not yield a significant result, one may still have grounds to move to step 2. Sometimes there is actually a significant relationship between independent and dependent variables but, because of small sample sizes or other extraneous factors, there may not be enough power to detect the effect that actually exists (see Shrout & Bolger, 2002[5] for more information).
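A compact sketch of the three regression steps (not drawn from the article) is given below using simulated data and the statsmodels library; the variable names echo the parenting example but the numbers are artificial.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    parenting = rng.normal(size=n)                          # independent variable (X)
    competence = 0.5 * parenting + rng.normal(size=n)       # mediator (M)
    confidence = 0.4 * competence + 0.1 * parenting + rng.normal(size=n)  # dependent variable (Y)

    # Step 1: regress Y on X (total effect).
    step1 = sm.OLS(confidence, sm.add_constant(parenting)).fit()

    # Step 2: regress M on X.
    step2 = sm.OLS(competence, sm.add_constant(parenting)).fit()

    # Step 3: regress Y on both M and X; the coefficient on X is the direct effect.
    exog3 = sm.add_constant(np.column_stack([competence, parenting]))
    step3 = sm.OLS(confidence, exog3).fit()

    print("Step 1, effect of X on Y:", round(step1.params[1], 3))
    print("Step 2, effect of X on M:", round(step2.params[1], 3))
    print("Step 3, effect of M on Y:", round(step3.params[1], 3),
          "| direct effect of X:", round(step3.params[2], 3))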

Direct Versus Indirect Mediation Effects


In the diagram shown above, the indirect effect is the product of path coefficients "A" and "B". The direct effect is the coefficient "C". The total effect measures the extent to which the dependent variable changes when the independent variable increases by one unit. In contrast, the indirect effect measures the extent to which the dependent variable changes when the independent variable is held fixed and the mediator variable changes to the level it would have attained had the independent variable increased by one unit.[][6] In linear systems, the total effect is equal to the sum of the direct and indirect effects (C + AB in the model above). In nonlinear models, the total effect is not generally equal to the sum of the direct and indirect effects, but to a modified combination of the two.[6]
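Restating the identity from the paragraph above for the linear case, with C the direct path coefficient and A and B the two legs of the indirect path:

```latex
\text{total effect} \;=\; \underbrace{C}_{\text{direct effect}} \;+\; \underbrace{A \times B}_{\text{indirect effect}}
```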


Full versus partial mediation


A mediator variable can either account for all or some of the observed relationship between two variables.

Full mediation

Maximum evidence for mediation, also called full mediation, would occur if inclusion of the mediation variable drops the relationship between the independent variable and dependent variable (see pathway c in the diagram above) to zero. This rarely, if ever, occurs. The most likely event is that c becomes a weaker, yet still significant, path with the inclusion of the mediation effect.

Partial mediation

Partial mediation maintains that the mediating variable accounts for some, but not all, of the relationship between the independent variable and dependent variable. Partial mediation implies that there is not only a significant relationship between the mediator and the dependent variable, but also some direct relationship between the independent and dependent variable.

In order for either full or partial mediation to be established, the reduction in variance explained by the independent variable must be significant as determined by one of several tests, such as the Sobel test.[] The effect of an independent variable on the dependent variable can become nonsignificant when the mediator is introduced simply because a trivial amount of variance is explained (i.e., not true mediation). Thus, it is imperative to show a significant reduction in variance explained by the independent variable before asserting either full or partial mediation.

It is possible to have statistically significant indirect effects in the absence of a total effect.[] This can be explained by the presence of several mediating paths that cancel each other out, and become noticeable when one of the cancelling mediators is controlled for. This implies that the terms 'partial' and 'full' mediation should always be interpreted relative to the set of variables that are present in the model.

In all cases, the operation of "fixing a variable" must be distinguished from that of "controlling for a variable," which has been inappropriately used in the literature.[][7] The former stands for physically fixing, while the latter stands for conditioning on, adjusting for, or adding to the regression model. The two notions coincide only when all error terms (not shown in the diagram) are statistically uncorrelated. When errors are correlated, adjustments must be made to neutralize those correlations before embarking on mediation analysis (see Bayesian Networks).

Sobel's Test
As mentioned above, Sobel's test[] is calculated to determine if the relationship between the independent variable and dependent variable has been significantly reduced after inclusion of the mediator variable. In other words, this test assesses whether a mediation effect is significant: it compares the relationship between the independent variable and the dependent variable with the same relationship after the mediation factor is included. The Sobel test is more accurate than the Baron and Kenny steps explained above; however, it has low statistical power. As such, large sample sizes are required in order to have sufficient power to detect significant effects. This is because the key assumption of Sobel's test is the assumption of normality. Because Sobel's test evaluates a given sample against the normal distribution, small sample sizes and skewness of the sampling distribution can be problematic (see normal distribution for more details). Thus, the general rule of thumb suggested by MacKinnon et al. (2002)[8] is that a sample size of 1000 is required to detect a small effect, a sample size of 100 is sufficient to detect a medium effect, and a sample size of 50 is required to detect a large effect.
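For illustration, the Sobel statistic can be computed from the two regression coefficients and their standard errors. A minimal sketch in Python follows; the numeric inputs are hypothetical, a denotes the coefficient of the independent variable predicting the mediator, and b denotes the coefficient of the mediator predicting the dependent variable while controlling for the independent variable.

```python
# A minimal sketch of the Sobel z statistic. Here a is the coefficient of the
# independent variable predicting the mediator (standard error se_a), and b is
# the coefficient of the mediator predicting the dependent variable while
# controlling for the independent variable (standard error se_b).
# The numeric inputs below are hypothetical.
from math import sqrt
from scipy.stats import norm

def sobel_test(a, se_a, b, se_b):
    """Return the Sobel z statistic and its two-tailed p value."""
    se_indirect = sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    z = (a * b) / se_indirect
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

z, p = sobel_test(a=0.50, se_a=0.10, b=0.40, se_b=0.12)
print(f"Sobel z = {z:.2f}, p = {p:.4f}")
```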


Preacher & Hayes (2004) Bootstrap Method


The bootstrapping method provides some advantages over the Sobel test, primarily an increase in power. The Preacher and Hayes bootstrapping method is a non-parametric test (see non-parametric statistics for a discussion of why non-parametric tests have more power). As such, the bootstrap method does not violate assumptions of normality and is therefore recommended for small sample sizes. Bootstrapping involves repeatedly and randomly sampling observations with replacement from the data set to compute the desired statistic in each resample. Hundreds or thousands of bootstrap resamples provide an approximation of the sampling distribution of the statistic of interest. Hayes offers a macro <http://www.afhayes.com/> that calculates bootstrapping directly within SPSS, a computer program used for statistical analyses. This method provides point estimates and confidence intervals by which one can assess the significance or nonsignificance of a mediation effect. Point estimates reveal the mean over the number of bootstrapped samples, and if zero does not fall within the resulting confidence interval, one can confidently conclude that there is a significant mediation effect to report.
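A minimal sketch of a percentile bootstrap for the indirect effect is given below, in the spirit of the Preacher and Hayes approach but not their SPSS/SAS macro; the data are simulated and the variable names are hypothetical.

```python
# A minimal sketch of a percentile bootstrap for the indirect effect (a*b), in
# the spirit of the Preacher and Hayes approach but not their SPSS/SAS macro.
# The data are simulated and the variable names are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100
x = rng.normal(size=n)                      # independent variable
m = 0.5 * x + rng.normal(size=n)            # mediator
y = 0.4 * m + 0.1 * x + rng.normal(size=n)  # dependent variable

def indirect_effect(x, m, y):
    a = sm.OLS(m, sm.add_constant(x)).fit().params[1]                        # X -> M
    b = sm.OLS(y, sm.add_constant(np.column_stack([m, x]))).fit().params[1]  # M -> Y, controlling for X
    return a * b

boot = []
for _ in range(5000):
    idx = rng.integers(0, n, size=n)        # resample cases with replacement
    boot.append(indirect_effect(x[idx], m[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])   # 95% percentile confidence interval
print(f"indirect effect = {indirect_effect(x, m, y):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```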

Significance of mediation
As outlined above, there are a few different options one can choose from to evaluate a mediation model. Bootstrapping[9][10] is becoming the most popular method of testing mediation because it does not require the normality assumption to be met, and because it can be effectively utilized with smaller sample sizes (N < 25). However, mediation continues to be most frequently determined using the logic of Baron and Kenny[11] or the Sobel test. It is becoming increasingly difficult to publish tests of mediation based purely on the Baron and Kenny method or tests that make distributional assumptions such as the Sobel test. Thus, it is important to consider the options when choosing which test to conduct.[]

Approaches to Mediation
While the concept of mediation as defined within psychology is theoretically appealing, the methods used to study mediation empirically have been challenged by statisticians and epidemiologists[][7][12] and interpreted formally.[6]

(1) Experimental-Causal-Chain Design

An experimental-causal-chain design is used when the proposed mediator is experimentally manipulated. Such a design implies that one manipulates some controlled third variable that one has reason to believe could be the underlying mechanism of a given relationship.

(2) Measurement-of-Mediation Design

A measurement-of-mediation design can be conceptualized as a statistical approach. Such a design implies that one measures the proposed intervening variable and then uses statistical analyses to establish mediation. This approach does not involve manipulation of the hypothesized mediating variable, but only involves measurement.

See Spencer et al. (2005)[13] for a discussion of the approaches mentioned above.

Criticisms of Mediation Measurement


Experimental approaches to mediation must be carried out with caution. First, it is important to have strong theoretical support for the exploratory investigation of a potential mediating variable. A criticism of a mediation approach rests on the ability to manipulate and measure a mediating variable. Thus, one must be able to manipulate the proposed mediator in an acceptable and ethical fashion, and one must be able to measure the intervening process without interfering with the outcome. The manipulation of the mediator must also be shown to have construct validity.

One of the most common criticisms of the measurement-of-mediation approach is that it is ultimately a correlational design. Consequently, it is possible that some other third variable, independent from the proposed mediator, could be responsible for the proposed effect. However, researchers have worked hard to provide counterevidence to this criticism. Specifically, the following counterarguments have been put forward:[2]

(1) Temporal precedence. For example, if the independent variable precedes the dependent variable in time, this would provide evidence suggesting a directional, and potentially causal, link from the independent variable to the dependent variable.

(2) Nonspuriousness and/or no confounds. For example, should one identify other third variables and show that they do not alter the relationship between the independent variable and the dependent variable, one would have a stronger argument for the mediation effect. See other third variables below.

Mediation can be an extremely useful and powerful statistical test; however, it must be used properly. It is important that the measures used to assess the mediator and the dependent variable are theoretically distinct and that the independent variable and mediator cannot interact. Should there be an interaction between the independent variable and the mediator, one would have grounds to investigate moderation.


Other Third Variables


(1) Confounding: Another model that is often tested is one in which competing variables in the model are alternative potential mediators or an unmeasured cause of the dependent variable. An additional variable in a causal model may obscure or confound the relationship between the independent and dependent variables. Potential confounders are variables that may have a causal impact on both the independent variable and the dependent variable. They include common sources of measurement error (as discussed above) as well as other influences shared by both the independent and dependent variables. In experimental studies, there is a special concern about aspects of the experimental manipulation or setting that may account for study effects, rather than the motivating theoretical factor. Any of these problems may produce spurious relationships between the independent and dependent variables as measured. Ignoring a confounding variable may bias empirical estimates of the causal effect of the independent variable.

(2) Suppression: A suppressor variable increases the predictive validity of another variable when it is included in a regression equation. For example, higher intelligence scores (X) cause a decrease in errors made at work on an assembly line (Y). However, an increase in intelligence (X) may also relate to an increase in boredom while at work (Z), thereby introducing an element of carelessness that results in a higher percentage of errors made on the job. Such a suppressor variable will lead to an increase in the magnitude of the relationship between two variables. In general, the omission of suppressors or confounders will lead to either an underestimation or an overestimation of the effect of X on Y, thereby either reducing or artificially inflating the magnitude of the relationship between two variables.

(3) Moderators: Other important third variables are moderators. Moderators are variables that can make the relationship between two variables either stronger or weaker. Such variables further characterize interactions in regression by affecting the direction and/or strength of the relationship between X and Y. A moderating relationship can be thought of as an interaction: it occurs when the relationship between variables A and B depends on the level of C. See moderation for further discussion.


Mediator Variable
A mediator variable (or mediating variable, or intervening variable) in statistics is a variable that describes how, rather than when, effects will occur by accounting for the relationship between the independent and dependent variables. A mediating relationship is one in which the path relating A to C is mediated by a third variable (B).

For example, a mediating variable explains the actual relationship between the following variables. Most people will agree that older drivers (up to a certain point) are better drivers. Thus:

Aging → Better driving

But what is missing from this relationship is a mediating variable that is actually causing the improvement in driving: experience. The mediated relationship would look like the following:

Aging → Increased experience driving a car → Better driving

Mediating variables are often contrasted with moderating variables, which pinpoint the conditions under which an independent variable exerts its effects on a dependent variable.

Moderated Mediation
Mediation and moderation can co-occur in statistical models. It is possible to mediate moderation and to moderate mediation. Moderated mediation occurs when the effect of the treatment A on the mediator B, and/or the partial effect of B on C, depends on levels of another variable (D). Essentially, in moderated mediation, mediation is first established, and then one investigates whether the mediation effect that describes the relationship between the independent variable and the dependent variable is moderated by different levels of another variable (i.e., a moderator). This definition has been outlined by Muller, Judd, and Yzerbyt (2005)[] and Preacher, Rucker, and Hayes (2007).[14]

Mediated Moderation
Mediated moderation is a variant of both moderation and mediation. This is where there is initially overall moderation, and the direct effect of the moderator variable on the outcome is mediated either at the A path in the diagram, between the independent variable and the moderating variable, or at the B path, between the moderating variable and the dependent variable.

[Figure: A simple statistical moderation model.]

The main difference between mediated moderation and moderated mediation is that for the former there is initial moderation and this effect is mediated, whereas for the latter there is no moderation but the effect of either the treatment on the mediator (path A) is moderated or the effect of the mediator on the outcome (path B) is moderated.[]

In order to establish mediated moderation, one must first establish moderation, meaning that the direction and/or the strength of the relationship between the independent and dependent variables (path C) differs depending on the level of a third variable (the moderator variable). Researchers next look for the presence of mediated moderation when they have a theoretical reason to believe that there is a fourth variable that acts as the mechanism or process that causes the relationship between the independent variable and the moderator (path A) or between the moderator and the dependent variable (path B).

Example

The following is a published example of mediated moderation in psychological research.[15] Participants were presented with an initial stimulus (a prime) that made them think of morality or made them think of might. They then participated in the Prisoner's Dilemma Game (PDG), in which participants pretend that they and their partner in crime have been arrested, and they must decide whether to remain loyal to their partner or to compete with their partner and cooperate with the authorities. The researchers found that prosocial individuals were affected by the morality and might primes, whereas proself individuals were not. Thus, social value orientation (proself vs. prosocial) moderated the relationship between the prime (independent variable: morality vs. might) and the behaviour chosen in the PDG (dependent variable: competitive vs. cooperative).

The researchers next looked for the presence of a mediated moderation effect. Regression analyses revealed that the type of prime (morality vs. might) mediated the moderating relationship of participants' social value orientation on PDG behaviour. Prosocial participants who experienced the morality prime expected their partner to cooperate with them, so they chose to cooperate themselves. Prosocial participants who experienced the might prime expected their partner to compete with them, which made them more likely to compete with their partner and cooperate with the authorities. In contrast, participants with a proself social value orientation always acted competitively.

Models of Mediated Moderation

There are five possible models of mediated moderation, as illustrated in the diagrams below.[]

1. In the first model the independent variable also mediates the relationship between the moderator and the dependent variable.
2. The second possible model of mediated moderation involves a new variable which mediates the relationship between the independent variable and the moderator (the A path).
3. The third model of mediated moderation involves a new mediator variable which mediates the relationship between the moderator and the dependent variable (the B path).
4. Mediated moderation can also occur when one mediating variable affects both the relationship between the independent variable and the moderator (the A path) and the relationship between the moderator and the dependent variable (the B path).
5. The fifth and final possible model of mediated moderation involves two new mediator variables, one mediating the A path and the other mediating the B path.


First option: independent variable mediates the B path.

Second option: fourth variable mediates the A path.

Third option: fourth variable mediates the B path.

Fourth option: fourth variable mediates both the A path and the B path.

Fifth option: fourth variable mediates the A path and a fifth variable mediates the B path.


Regression Equations for Moderated Mediation and Mediated Moderation


Muller, Judd, and Yzerbyt (2005)[] outline three fundamental models that underlie both moderated mediation and mediated moderation. Mo represents the moderator variable(s), Me represents the mediator variable(s), and εᵢ represents the measurement error of each regression equation.

Step 1: Moderation of the relationship between the independent variable (X) and the dependent variable (Y), also called the overall treatment effect (path C in the diagram).

Y = β₄₀ + β₄₁X + β₄₂Mo + β₄₃XMo + ε₄

To establish overall moderation, the β₄₃ regression weight must be significant (first step for establishing mediated moderation). Establishing moderated mediation requires that there be no moderation effect, so the β₄₃ regression weight must not be significant.

Step 2: Moderation of the relationship between the independent variable and the mediator (path A).

Me = β₅₀ + β₅₁X + β₅₂Mo + β₅₃XMo + ε₅

If the β₅₃ regression weight is significant, the moderator affects the relationship between the independent variable and the mediator.

Step 3: Moderation of both the relationship between the independent and dependent variables (path A) and the relationship between the mediator and the dependent variable (path B).

Y = β₆₀ + β₆₁X + β₆₂Mo + β₆₃XMo + β₆₄Me + β₆₅MeMo + ε₆

If both β₅₃ in step 2 and β₆₄ in step 3 are significant, the moderator affects the relationship between the independent variable and the mediator (path A). If both β₅₁ in step 2 and β₆₅ in step 3 are significant, the moderator affects the relationship between the mediator and the dependent variable (path B). Either or both of the conditions above may be true.

References
Notes
[1] MacKinnon, D. P. (2008). Introduction to Statistical Mediation Analysis. New York: Erlbaum.
[2] Cohen, J.; Cohen, P.; West, S. G.; Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
[3] Baron, R. M. and Kenny, D. A. (1986). "The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations", Journal of Personality and Social Psychology, Vol. 51(6), pp. 1173-1182.
[4] Howell, D. C. (2009). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Learning.
[5] Shrout, P. E., & Bolger, N. (2002). Mediation in experimental and nonexperimental studies: New procedures and recommendations. Psychological Methods, 7(4), 422-445.
[6] Pearl, J. (2001). "Direct and indirect effects" (http://ftp.cs.ucla.edu/pub/stat_ser/R273-U.pdf). Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 411-420.
[7] Kaufman, J. S., MacLehose, R. F., Kaufman, S. (2004). A further critique of the analytic strategy of adjusting for covariates to identify biologic mediation. Epidemiology Innovations and Perspectives, 1:4.
[11] "Mediation" (http://davidakenny.net/cm/mediate.htm). davidakenny.net. Retrieved April 25, 2012.
[12] Bullock, J. G., Green, D. P., Ha, S. E. (2010). Yes, but what's the mechanism? (Don't expect an easy answer). Journal of Personality & Social Psychology, 98(4): 550-558.
[13] Spencer, S. J., Zanna, M. P., & Fong, G. T. (2005). Establishing a causal chain: why experiments are often more effective than mediational analyses in examining psychological processes. Attitudes and Social Cognition, 89(6): 845-851.
[14] Preacher, K. J., Rucker, D. D. & Hayes, A. F. (2007). Assessing moderated mediation hypotheses: Strategies, methods, and prescriptions. Multivariate Behavioral Research, 42, 185-227.

Bibliography

Preacher, Kristopher J.; Hayes, Andrew F. (2004). "SPSS and SAS procedures for estimating indirect effects in simple mediation models" (http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html). Behavior Research Methods, Instruments, and Computers 36 (4): 717-731. doi: 10.3758/BF03206553 (http://dx.doi.org/10.3758/BF03206553)
Preacher, Kristopher J.; Hayes, Andrew F. (2008). "Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models" (http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html). Behavior Research Methods 40 (3): 879-891. doi: 10.3758/BRM.40.3.879 (http://dx.doi.org/10.3758/BRM.40.3.879). PMID 18697684 (http://www.ncbi.nlm.nih.gov/pubmed/18697684)
Preacher, K. J.; Zyphur, M. J.; Zhang, Z. (2010). "A general multilevel SEM framework for assessing multilevel mediation". Psychological Methods 15 (3): 209-233. doi: 10.1037/a0020141 (http://dx.doi.org/10.1037/a0020141). PMID 20822249 (http://www.ncbi.nlm.nih.gov/pubmed/20822249)
Baron, R. M. and Kenny, D. A. (1986). "The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations", Journal of Personality and Social Psychology, Vol. 51(6), pp. 1173-1182.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York, NY: Academic Press.
Hayes, A. F. (2009). "Beyond Baron and Kenny: Statistical mediation analysis in the new millennium" (http://www.informaworld.com/smpp/ftinterface~db=all~content=a917285720~fulltext=713240930). Communication Monographs 76 (4): 408-420. doi: 10.1080/03637750903310360 (http://dx.doi.org/10.1080/03637750903310360).
Howell, D. C. (2009). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Learning.
MacKinnon, D. P.; Lockwood, C. M. (2003). "Advances in statistical methods for substance abuse prevention research" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2843515). Prevention Science 4 (3): 155-171. doi: 10.1023/A:1024649822872 (http://dx.doi.org/10.1023/A:1024649822872). PMC 2843515 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2843515). PMID 12940467 (http://www.ncbi.nlm.nih.gov/pubmed/12940467).
Preacher, K. J.; Kelley, K. (2011). "Effect sizes measures for mediation models: Quantitative strategies for communicating indirect effects". Psychological Methods 16 (2): 93-115. doi: 10.1037/a0022658 (http://dx.doi.org/10.1037/a0022658). PMID 21500915 (http://www.ncbi.nlm.nih.gov/pubmed/21500915).
Rucker, D. D., Preacher, K. J., Tormala, Z. L. & Petty, R. E. (2011). "Mediation analysis in social psychology: Current practices and new recommendations". Social and Personality Psychology Compass, 5/6, 359-371.
Sobel, M. E. (1982). "Asymptotic confidence intervals for indirect effects in structural equation models". Sociological Methodology 13: 290-312. doi: 10.2307/270723 (http://dx.doi.org/10.2307/270723).
Spencer, S. J.; Zanna, M. P.; Fong, G. T. (2005). "Establishing a causal chain: why experiments are often more effective than mediational analyses in examining psychological processes". Attitudes and Social Cognition 89 (6): 845-851.


External links
Summary of mediation methods at PsychWiki (http://www.psychwiki.com/wiki/Mediation) Example of Causal Mediation Using Propensity Scores (http://methodology.psu.edu/ra/causal/example) The Methodology Center, Penn State University SPSS and SAS macros for observed variable moderation, mediation, and conditional process modeling (http:// www.afhayes.com/introduction-to-mediation-moderation-and-conditional-process-analysis.html) Andrew F. Hayes, Ohio State University


Mental age
Mental age is a concept related to intelligence, expressed as the age at which a child is performing intellectually. The mental age of the child that is tested is the same as the average age at which normal children achieve a particular score.[1] However, a mental age result on an intelligence test does not mean that children function at their "mental age level" in all aspects of life. For instance, a gifted six-year-old child can still in some ways function as a three-year-old child.[2] Mental age was once considered a controversial concept.[3]

Mental age and IQ


Originally, the differences between mental age and chronological age were used to compute the intelligence quotient, or IQ. This was computed using the ratio method, with the following formula:

IQ = (mental age / chronological age) × 100

No matter what the child's chronological age, if the mental age is the same as the chronological age, then the IQ will equal 100.[4] An IQ of 100 thus indicates a child of average intellectual development. For a gifted child, the mental age is above the chronological age, and the IQ is higher than 140; for a mentally retarded child, the mental age is below the chronological age, and the IQ is below 70.[5]
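As a worked illustration of the ratio method with hypothetical numbers, a child with a mental age of 10 and a chronological age of 8 would obtain:

```latex
\mathrm{IQ} \;=\; \frac{\text{mental age}}{\text{chronological age}} \times 100 \;=\; \frac{10}{8} \times 100 \;=\; 125
```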

References
[1] http://www.apa.org/research/action/glossary.aspx#m
[2] L. K. Silverman, 1997. The construct of asynchronous development. Peabody Journal of Education, Vol. 72, Issue 3/4.
[3] Thurstone, L. L. The Mental Age Concept. (http://www.brocku.ca/MeadProject/Thurstone/Thurstone_1926.html) Psychological Review 33 (1926): 268-278.
[4] http://users.ipfw.edu/abbott/120/IntelligenceTests.html
[5] http://users.ipfw.edu/abbott/120/IntelligenceTests.html


Mental chronometry
Mental chronometry is the use of response time in perceptual-motor tasks to infer the content, duration, and temporal sequencing of cognitive operations. Mental chronometry is one of the core paradigms of experimental and cognitive psychology, and has found application in various disciplines including cognitive psychophysiology/cognitive neuroscience and behavioral neuroscience to elucidate mechanisms underlying cognitive processing. Mental chronometry is studied using the measurements of reaction time (RT). Reaction time is the elapsed time between the presentation of a sensory stimulus and the subsequent behavioral response. In psychometric psychology it is considered to be an index of speed of processing.[1] That is, it indicates how fast the thinker can execute the mental operations needed by the task at hand. In turn, speed of processing is considered an index of processing efficiency. The behavioral response is typically a button press but can also be an eye movement, a vocal response, or some other observable behavior.

Types
Response time is the sum of reaction time plus movement time. Usually the focus in research is on reaction time. There are four basic means of measuring it:

Simple reaction time is the time required for an observer to respond to the presence of a stimulus. For example, a subject might be asked to press a button as soon as a light or sound appears. Mean RT for college-age individuals is about 160 milliseconds to detect an auditory stimulus, and approximately 190 milliseconds to detect a visual stimulus.[2] The mean reaction times for sprinters at the Beijing Olympics were 166 ms for males and 189 ms for females, but in one out of 1,000 starts they can achieve 109 ms and 121 ms, respectively.[3] Interestingly, that study concluded that longer female reaction times are an artifact of the measurement method used; a suitable lowering of the force threshold on the starting blocks for women would eliminate the sex difference.

Recognition or Go/No-Go reaction time tasks require that the subject press a button when one stimulus type appears and withhold a response when another stimulus type appears. For example, the subject may have to press the button when a green light appears and not respond when a blue light appears.

Choice reaction time (CRT) tasks require distinct responses for each possible class of stimulus. For example, the subject might be asked to press one button if a red light appears and a different button if a yellow light appears. The Jensen box is an example of an instrument designed to measure choice reaction time.

Discrimination reaction time involves comparing pairs of simultaneously presented visual displays and then pressing one of two buttons according to which display appears brighter, longer, heavier, or greater in magnitude on some dimension of interest.

Due to momentary attentional lapses, there is a considerable amount of variability in an individual's response time, which does not tend to follow a normal (Gaussian) distribution. To control for this, researchers typically require a subject to perform multiple trials, from which a measure of the 'typical' response time can be calculated. Taking the mean of the raw response time is rarely an effective method of characterizing the typical response time, and alternative approaches (such as modeling the entire response time distribution) are often more appropriate.[4]


The evolution of mental chronometry methodology


Abū Rayhān al-Bīrūnī
Psychologists have developed and refined mental chronometry over the past 100 years. According to Muhammad Iqbal, the Persian scientist Abū Rayhān al-Bīrūnī (973-1048) was the first person to describe the concept of reaction time: "Not only is every sensation attended by a corresponding change localized in the sense-organ, which demands a certain time, but also, between the stimulation of the organ and consciousness of the perception an interval of time must elapse, corresponding to the transmission of stimulus for some distance along the nerves."[5]

Galton and differential psychology


Sir Francis Galton is typically credited as the founder of differential psychology, which seeks to determine and explain the mental differences between individuals. He was the first to use rigorous reaction time tests with the express intention of determining averages and ranges of individual differences in mental and behavioral traits in humans. Galton hypothesized that differences in intelligence would be reflected in variation of sensory discrimination and speed of response to stimuli, and he built various machines to test different measures of this, including reaction time to visual and auditory stimuli. His tests involved a selection of over 10,000 men, women and children from the London public.[1]

Donders' experiment
The first scientist to measure reaction time in the laboratory was Franciscus Donders (1869). Donders found that simple reaction time is shorter than recognition reaction time, and that choice reaction time is longer than both.[2] Donders also devised a subtraction method to analyze the time it took for mental operations to take place.[6] By subtracting simple reaction time from choice reaction time, for example, it is possible to calculate how much time is needed to make the connection. This method provides a way to investigate the cognitive processes underlying simple perceptual-motor tasks, and formed the basis of subsequent developments.[6]

[Figure: Donders (1868): method of subtraction.]

Although Donders' work paved the way for future research in mental chronometry tests, it was not without its drawbacks. His insertion method was based on the assumption that inserting a particular complicating requirement into an RT paradigm would not affect the other components of the test. This assumption - that the incremental effect on RT was strictly additive - was not able to hold up to later experimental tests, which showed that the insertions were able to interact with other portions of the RT paradigm. Despite this, Donders' theories are still of interest and his ideas are still used in certain areas of psychology, which now have the statistical tools to use them more accurately.[1]


Hick's Law
W. E. Hick (1952) devised a CRT experiment which presented a series of nine tests in which there are n equally possible choices. The experiment measured the subject's reaction time based on the number of possible choices during any given trial. Hick showed that the individual's reaction time increased by a constant amount as a function of available choices, or the "uncertainty" involved in which reaction stimulus would appear next. Uncertainty is measured in "bits", which are defined as the quantity of information that reduces uncertainty by half in information theory. In Hick's experiment, the reaction time is found to be a function of the binary logarithm of the number of available choices (n). This phenomenon is called "Hick's Law" and is said to be a measure of the "rate of gain of information." The law is usually expressed by the formula RT = a + b·log₂(n), where a and b are constants representing the intercept and slope of the function, and n is the number of alternatives.[7] The Jensen Box is a more recent application of Hick's Law.[1] Hick's Law has interesting modern applications in marketing, where restaurant menus and web interfaces (among other things) take advantage of its principles in striving to achieve speed and ease of use for the consumer.[8]
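A small illustration of the formula in Python follows; the intercept and slope values are hypothetical placeholders rather than empirical estimates.

```python
# A small illustration of Hick's Law, RT = a + b * log2(n). The intercept a and
# slope b below are hypothetical placeholder values, not empirical estimates.
import math

def predicted_rt(n_alternatives, a=0.20, b=0.15):
    """Predicted reaction time (in seconds) for n equally likely alternatives."""
    return a + b * math.log2(n_alternatives)

for n in (1, 2, 4, 8):
    print(n, round(predicted_rt(n), 3))
# Each doubling of the number of alternatives adds one bit of uncertainty,
# so the predicted RT increases by the constant amount b.
```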

Sternbergs memory-scanning task


Sternberg (1966) devised an experiment wherein subjects were told to remember a set of unique digits in short-term memory. Subjects were then given a probe stimulus in the form of a digit from 0-9. The subject then answered as quickly as possible whether the probe was in the previous set of digits or not. The size of the initial set of digits determined the reaction time of the subject. The idea is that as the size of the set of digits increases the number of processes that need to be completed before a decision can be made increases as well. So if the subject has 4 items in short-term memory (STM), then after encoding the information from the probe stimulus the subject needs to compare the probe to each of the 4 items in memory and then make a decision. If there were only 2 items in the initial set of digits, then only 2 processes would be needed. The data from this study found that for each additional item added to the set of digits, about 38 milliseconds were added to the response time of the subject. This supported the idea that a subject did a serial exhaustive search through memory rather than a serial self-terminating search.[9] Sternberg (1969) developed a much-improved method for dividing reaction time into successive or serial stages, called the additive factor method.[10]
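The per-item scanning rate is simply the slope of a line fitted to mean reaction time as a function of set size. A minimal sketch in Python, using hypothetical mean RTs chosen to mimic the roughly 38 ms-per-item increase:

```python
# A minimal sketch of estimating the memory-scanning slope from mean reaction
# times at several set sizes. The RT values are hypothetical, chosen to mimic
# the roughly 38 ms-per-item increase reported by Sternberg (1966).
import numpy as np

set_size = np.array([1, 2, 4, 6])            # digits held in short-term memory
mean_rt_ms = np.array([430, 468, 545, 620])  # hypothetical mean reaction times

slope, intercept = np.polyfit(set_size, mean_rt_ms, 1)
print(f"scanning rate approx {slope:.1f} ms per item, intercept approx {intercept:.0f} ms")
```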

Shepard and Metzlers mental rotation task


Shepard and Metzler (1971) presented a pair of three-dimensional shapes that were identical or mirror-image versions of one another. Reaction time to determine whether they were identical or not was a linear function of the angular difference between their orientation, whether in the picture plane or in depth. They concluded that the observers performed a constant-rate mental rotation to align the two objects so they could be compared.[11] Cooper and Shepard (1973) presented a letter or digit that was either normal or mirror-reversed, and presented either upright or at angles of rotation in units of 60 degrees. The subject had to identify whether the stimulus was normal or mirror-reversed. Response time increased roughly linearly as the orientation of the letter deviated from upright (0 degrees) to inverted (180 degrees), and then decreases again until it reaches 360 degrees. The authors concluded that the subjects mentally rotate the image the shortest distance to upright, and then judge whether it is normal or mirror-reversed.[12]

Sentence-picture verification
Mental chronometry has been used in identifying some of the processes associated with understanding a sentence. This type of research typically revolves around the differences in processing four types of sentences: true affirmative (TA), false affirmative (FA), false negative (FN), and true negative (TN). A picture can be presented with an associated sentence that falls into one of these four categories. The subject then decides if the sentence matches the picture or does not. The type of sentence determines how many processes need to be performed before a decision can be made.

According to the data from Clark and Chase (1972) and Just and Carpenter (1971), the TA sentences are the simplest and take the least time, followed by the FA, FN, and TN sentences.[13][14]


Mental chronometry and models of memory


Hierarchical network models of memory were largely discarded due to some findings related to mental chronometry. The TLC model proposed by Collins and Quillian (1969) had a hierarchical structure indicating that recall speed in memory should be based on the number of levels in memory traversed in order to find the necessary information. But the experimental results did not agree. For example, a subject will reliably answer that a robin is a bird more quickly than he will answer that an ostrich is a bird despite these questions accessing the same two levels in memory. This led to the development of spreading activation models of memory (e.g., Collins & Loftus, 1975), wherein links in memory are not organized hierarchically but by importance instead.[15][16]

Posners letter matching studies


Posner (1978) used a series of letter-matching studies to measure the mental processing time of several tasks associated with recognition of a pair of letters. The simplest task was the physical match task, in which subjects were shown a pair of letters and had to identify whether the two letters were physically identical or not. The next task was the name match task where subjects had to identify whether two letters had the same name. The task involving the most cognitive processes was the rule match task in which subjects had to determine whether the two letters presented both were vowels or not vowels. The physical match task was the most simple; subjects had to encode the letters, compare them to each other, and make a decision. When doing the name match task subjects were forced to add a cognitive step before making a decision: they had to search memory for the names of the letters, and then compare those before deciding. In the rule based task they had to also categorize the letters as either vowels or consonants before making their choice. The time taken to perform the rule match task was longer than the name match task which was longer than the physical match task. Using the subtraction method experimenters were able to determine the approximate amount of time that it took for subjects to perform each of the cognitive processes associated with each of these tasks.[17]

Mental chronometry and cognitive development


There is extensive recent research using mental chronometry for the study of cognitive development. Specifically, various measures of speed of processing were used to examine changes in the speed of information processing as a function of age. Kail (1991) showed that speed of processing increases exponentially from early childhood to early adulthood.[18] Studies of reaction times in young children of various ages are consistent with common observations of children engaged in activities not typically associated with chronometry.[1] This includes speed of counting, reaching for things, repeating words, and other developing vocal and motor skills that develop quickly in growing children.[19] Once reaching early maturity, there is then a long period of stability until speed of processing begins declining from middle age to senility (Salthouse, 2000).[20] In fact, cognitive slowing is considered a good index of broader changes in the functioning of the brain and intelligence. Demetriou and colleagues, using various methods of measuring speed of processing, showed that it is closely associated with changes in working memory and thought (Demetriou, Mouyi, & Spanoudis, 2009). These relations are extensively discussed in the neo-Piagetian theories of cognitive development.[] During senescence, RT deteriorates (as does fluid intelligence), and this deterioration is systematically associated with changes in many other cognitive processes, such as executive functions, working memory, and inferential processes.[] In the theory of Andreas Demetriou,[21] one of the neo-Piagetian theories of cognitive development, change in speed of processing with age, as indicated by decreasing reaction time, is one of the pivotal factors of cognitive development.


Mental chronometry and cognitive ability


Researchers have reported medium-sized correlations between reaction time and measures of intelligence: there is thus a tendency for individuals with higher IQ to be faster on reaction time tests. Research into this link between mental speed and general intelligence (perhaps first proposed by Charles Spearman) was re-popularised by Arthur Jensen, and the "Choice Reaction Apparatus" associated with his name became a common standard tool in reaction time-IQ research. The strength of the RT-IQ association is a subject of research. Several studies have reported an association between simple reaction time and intelligence of around r = −.31, with a tendency for larger associations between choice reaction time and intelligence (r = −.49).[22] Much of the theoretical interest in reaction time was driven by Hick's Law, relating the slope of reaction time increases to the complexity of the decision required (measured in units of uncertainty popularised by Claude Shannon as the basis of information theory). This promised to link intelligence directly to the resolution of information even in very basic information tasks. There is some support for a link between the slope of the reaction time curve and intelligence, as long as reaction time is tightly controlled.[]

Standard deviations of reaction times have been found to be more strongly correlated with measures of general intelligence (g) than mean reaction times. The reaction times of low-g individuals are more spread out than those of high-g individuals.[] The cause of the relationship is unclear. It may reflect more efficient information processing, better attentional control, or the integrity of neuronal processes.


Other factors
Research has shown that reaction times may be improved by chewing gum: "The results showed that chewing gum was associated with greater alertness and a more positive mood. Reaction times were quicker in the gum condition, and this effect became bigger as the task became more difficult." [23]

Application of mental chronometry in biological psychology/cognitive neuroscience


With the advent of the functional neuroimaging techniques of PET and fMRI, psychologists started to modify their mental chronometry paradigms for functional imaging (Posner, 2005). Although psycho(physio)logists have been using electroencephalographic measurements for decades, the images obtained with PET have attracted great interest from other branches of neuroscience, popularizing mental chronometry among a wider range of scientists in recent years. Mental chronometry is utilized by having participants perform reaction time tasks while neuroimaging identifies the parts of the brain involved in the underlying cognitive processes.[24] In the 1950s, the use of microelectrode recordings of single neurons in anaesthetized monkeys allowed research to look at physiological processes in the brain and supported the idea that people encode information serially.

[Figure: Regions of the brain involved in a number comparison task, derived from EEG and fMRI studies. The regions represented correspond to those showing effects of the notation used for the numbers (pink and hatched), distance from the test number (orange), choice of hand (red), and errors (purple). From the article "Timing the Brain: Mental Chronometry as a Tool in Neuroscience".]

In the 1960s, these methods were used extensively in humans: researchers recorded the electrical potentials of the human brain using scalp electrodes while a reaction task was being conducted using digital computers. What they found was that there was a connection between the observed electrical potentials and the motor and sensory stages of information processing. For example, researchers found in the recorded scalp potentials that the frontal cortex was being activated in association with motor activity. These findings can be connected to Donders' idea of the subtractive method for the sensory and motor stages involved in reaction tasks. In the 1970s and early 1980s, the development of signal processing tools for EEG translated into a revival of research using this technique to assess the timing and the speed of mental processes. For example, high-profile research showed how reaction time on a given trial correlated with the latency (delay between stimulus and response) of the P300 wave,[25] or how the timecourse of the EEG reflected the sequence of cognitive processes involved in perceptual processing.[26] With the invention of functional magnetic resonance imaging (fMRI), techniques were used to measure activity through electrical event-related potentials in a study when subjects were asked to identify whether a presented digit was above or below five.

According to Sternberg's additive theory, the stages involved in performing this task include: encoding, comparing against the stored representation for five, selecting a response, and then checking for error in the response.[27] The fMRI image presents the specific locations where these stages occur in the brain while performing this simple mental chronometry task. In the 1980s, neuroimaging experiments allowed researchers to detect the activity in localized brain areas by injecting radionuclides and using positron emission tomography (PET) to detect them. fMRI has also been used to detect the precise brain areas that are active during mental chronometry tasks. Many studies have shown that a small number of widely distributed brain areas are involved in performing these cognitive tasks.


References
[1] Jensen, A. R. (2006). Clocking the mind: Mental chronometry and individual differences. Amsterdam: Elsevier. (ISBN 978-0-08-044939-5)
[2] Kosinski, R. J. (2008). A literature review on reaction time, Clemson University. (http://biae.clemson.edu/bpc/bp/Lab/110/reaction.htm#Type of Stimulus)
[4] Whelan, R. (2008). Effective analysis of reaction time data. The Psychological Record, 58, 475-482. (http://opensiuc.lib.siu.edu/cgi/viewcontent.cgi?article=1077&context=tpr)
[6] Donders, F. C. (1869). On the speed of mental processes. In W. G. Koster (Ed.), Attention and Performance II. Acta Psychologica, 30, 412-431. (Original work published in 1868.)
[7] Hick's Law at Encyclopedia.com (http://www.encyclopedia.com/doc/1O87-Hickslaw.html). Originally from Colman, A. (2001). A Dictionary of Psychology. Retrieved February 28, 2009.
[8] W. Lidwell, K. Holden and J. Butler: Universal Principles of Design. Rockport, Gloucester, MA, 2003.
[12] Cooper, L. A., & Shepard, R. N. (1973). Chronometric studies of the rotation of mental images. New York: Academic Press.
[17] Posner, M. I. (1978). Chronometric explorations of mind. Hillsdale, NJ: Erlbaum, 1978.
[21] Demetriou, A., Mouyi, A., & Spanoudis, G. (2010). The development of mental processing. Nesselroade, J. R. (2010). Methods in the study of life-span human development: Issues and answers. In W. F. Overton (Ed.), Biology, cognition and methods across the life-span. Volume 1 of the Handbook of life-span development (pp. 36-55), Editor-in-chief: R. M. Lerner. Hoboken, NJ: Wiley.
[23] Smith, A. (2009). Effects of chewing gum on mood, learning, memory and performance of an intelligence test. Nutritional Neuroscience, 12(2), 81.

Further reading
Luce, R.D. (1986). Response Times: Their Role in Inferring Elementary Mental Organization. New York: Oxford University Press. ISBN0-19-503642-5. Meyer, D.E.; Osman, A.M.; Irwin, D.E.; Yantis, S. (1988). "Modern mental chronometry". Biological Psychology 26 (13): 367. doi: 10.1016/0301-0511(88)90013-0 (http://dx.doi.org/10.1016/0301-0511(88)90013-0). PMID 3061480 (http://www.ncbi.nlm.nih.gov/pubmed/3061480). Townsend, J.T.; Ashby, F.G. (1984). Stochastic Modeling of Elementary Psychological Processes. Cambridge, UK: Cambridge University Press. ISBN0-521-27433-8. Weiss, V; Weiss, H (2003). "The golden mean as clock cycle of brain waves" (http://www.v-weiss.de/chaos. html). Chaos, Solitons and Fractals 18 (4): 643652. Bibcode: 2003CSF....18..643W (http://adsabs.harvard. edu/abs/2003CSF....18..643W). doi: 10.1016/S0960-0779(03)00026-2 (http://dx.doi.org/10.1016/ S0960-0779(03)00026-2).


External links
Reaction Time Test (http://www.humanbenchmark.com/tests/reactiontime/index.php) - Measuring Mental Chronometry on the Web Historical Introduction to Cognitive Psychology (http://www.mtsu.edu/~sschmidt/Cognitive/intro/intro. html) Timing the Brain: Mental Chronometry as a Tool in Neuroscience (http://biology.plosjournals.org/perlserv/ ?request=get-document&doi=10.1371/journal.pbio.0030051) Sample Chronometric Test on the web (http://cognitivelabs.com/mydna_speedtestno.htm)

Missing completely at random


In statistical analysis, data-values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random.[] When data are MCAR, the analyses performed on the data are unbiased; however, data are rarely MCAR.[]

Missing at random (MAR) is an alternative, and occurs when the missingness is related to a particular variable, but it is not related to the value of the variable that has missing data.[] An example of this is accidentally omitting an answer on a questionnaire.

Not missing at random (NMAR) is data that is missing for a specific reason (i.e., the value of the variable that is missing is related to the reason it is missing).[] An example of this is if certain questions on a questionnaire tend to be skipped deliberately by participants with certain characteristics.

References

Further reading


Heitjan, D. F.; Basu, S. (1996). "Distinguishing "Missing at Random" and "Missing Completely at Random"". The American Statistician 50 (3): 207213. doi: 10.2307/2684656 (http://dx.doi.org/10.2307/2684656). JSTOR 2684656 (http://www.jstor.org/stable/2684656). Weiner, I. B., Freedheim, D.K., Velicer, W. F., Schinka, J. A., & Lerner, R. M. (2003). Handbook of Psychology. John Wiley and Sons: USA Little, Roderick J. A.; Rubin, Donald B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley. ISBN0-471-18386-5.


Moderated mediation
In statistics, moderation and mediation can occur together in the same model.[1] Moderated mediation, also known as conditional indirect effects,[2] occurs when the treatment effect of an independent variable A on an outcome variable C via a mediator variable B differs depending on levels of a moderator variable D. Specifically, either the effect of A on B and/or the effect of B on C depends on the level of D.

Muller, Judd, & Yzerbyt (2005) model


Muller, Judd, and Yzerbyt (2005) were the first to provide a comprehensive definition of this process.[1] The following regression equations are fundamental to their model of moderated mediation, where A = independent variable, C = outcome variable, B = mediator variable, and D = moderator variable.

C = β₄₀ + β₄₁A + β₄₂D + β₄₃AD + ε₄

This equation assesses moderation of the overall treatment effect of A on C.

B = β₅₀ + β₅₁A + β₅₂D + β₅₃AD + ε₅

This equation assesses moderation of the treatment effect of A on the mediator B.

C = β₆₀ + β₆₁A + β₆₂D + β₆₃AD + β₆₄B + β₆₅BD + ε₆

This equation assesses moderation of the effect of the mediator B on C, as well as moderation of the residual treatment effect of A on C.

This fundamental equality exists among these equations: β₄₃ − β₆₃ = β₆₄β₅₃ + β₆₅β₅₁

In order to have moderated mediation, there must be an overall treatment effect of A on the outcome variable C (β₄₁), which does not depend on the moderator (β₄₃ = 0). Either the treatment effect of A on the mediator B depends on the moderator (β₅₃ ≠ 0) and/or the effect of the mediator B on the outcome variable C depends on the moderator (β₆₅ ≠ 0). At least one of the products on the right side of the above equality must not equal 0 (i.e., either β₅₃ ≠ 0 and β₆₄ ≠ 0, or β₆₅ ≠ 0 and β₅₁ ≠ 0). As well, since there is no overall moderation of the treatment effect of A on the outcome variable C (β₄₃ = 0), this means that β₆₃ cannot equal 0. In other words, the residual direct effect of A on the outcome variable C, controlling for the mediator, is moderated.

Additions by Preacher, Rucker, and Hayes (2007)


In addition to the three manners proposed by Muller and colleagues in which moderated mediation can occur, Preacher, Rucker, and Hayes (2007) proposed that the independent variable A itself can moderate the effect of the mediator B on the outcome variable C. They also proposed that a moderator variable D could moderate the effect of A on B, while a different moderator E moderates the effect of B on C.[2]

Differences between moderated mediation and mediated moderation


Moderated mediation relies on the same underlying models (specified above) as mediated moderation. The main difference between the two processes is whether there is overall moderation of the treatment effect of A on the outcome variable C. If there is, then there is mediated moderation. If there is no overall moderation of A on C, then there is moderated mediation.[1]


Testing for moderated mediation


In order to test for moderated mediation, some recommend examining a series of models, sometimes called a piecemeal approach, and looking at the overall pattern of results.[1] This approach is similar to the Baron and Kenny method for testing mediation by analyzing a series of three regressions.[3] These researchers claim that a single overall test would be insufficient to analyze the complex processes at play in moderated mediation, and would not allow one to differentiate between moderated mediation and mediated moderation. Bootstrapping has also been suggested as a method of estimating the sampling distributions of a moderated mediation model in order to generate confidence intervals.[2] This method has the advantage of not requiring that any assumptions be made about the shape of the sampling distribution. Preacher, Rucker and Hayes also discuss an extension of simple slopes analysis for moderated mediation. Under this approach, one must choose a limited number of key conditional values of the moderator that will be examined. As well, one can use the JohnsonNeyman technique to determine the range of significant conditional indirect effects.[2] Preacher, Rucker, and Hayes (2007) have created an SPSS macro that provides bootstrapping estimations as well as JohnsonNeyman results.
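A minimal sketch of the bootstrap idea for conditional indirect effects is given below in Python; it probes the indirect effect at a few chosen values of the moderator. The data are simulated, the variable names are hypothetical, and this illustrates the general logic rather than the published SPSS macro of Preacher, Rucker, and Hayes (2007).

```python
# A minimal sketch of probing a conditional indirect effect at chosen values of
# the moderator with a percentile bootstrap. The data are simulated and the
# variable names are hypothetical; this illustrates the general logic, not the
# published SPSS macro of Preacher, Rucker, and Hayes (2007).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)                          # independent variable A
d = rng.normal(size=n)                          # moderator D
b = 0.4 * a + 0.3 * a * d + rng.normal(size=n)  # mediator B: A -> B moderated by D
c = 0.5 * b + 0.1 * a + rng.normal(size=n)      # outcome C

def conditional_indirect(a, d, b, c, d_value):
    """Indirect effect of A on C through B when the moderator equals d_value."""
    m1 = sm.OLS(b, sm.add_constant(np.column_stack([a, d, a * d]))).fit()  # B ~ A + D + A*D
    m2 = sm.OLS(c, sm.add_constant(np.column_stack([b, a, d]))).fit()      # C ~ B + A + D
    a_path = m1.params[1] + m1.params[3] * d_value  # conditional A -> B slope
    b_path = m2.params[1]                           # B -> C slope
    return a_path * b_path

for d_value in (-1.0, 0.0, 1.0):                    # e.g. 1 SD below, at, and above the mean
    boot = []
    for _ in range(2000):
        idx = rng.integers(0, n, size=n)            # resample cases with replacement
        boot.append(conditional_indirect(a[idx], d[idx], b[idx], c[idx], d_value))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"D = {d_value:+.1f}: 95% bootstrap CI for the indirect effect [{lo:.3f}, {hi:.3f}]")
```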

References
[1] Muller, D., Judd, C. M., & Yzerbyt, V. Y. (2005). When moderation is mediated and mediation is moderated. Journal of Personality and Social Psychology, 89, 852863. [2] Preacher, K. J., Rucker, D. D., & Hayes, A. F. (2007) Addressing moderated mediation hypotheses: Theory, Methods, and Prescriptions. Multivariate Behavioral Research, 42, 185227. [3] Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 11731182.

External links
SPSS and SAS macros for testing conditional indirect effects (http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html)


Moderation (statistics)
In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. The effect of a moderating variable is characterized statistically as an interaction; that is, a qualitative (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation.[1]

Example
Moderation analysis in the behavioral sciences involves the use of linear multiple regression analysis or causal modelling. To quantify the effect of a moderating variable in multiple regression analyses, regressing random variable Y on X, an additional term is added to the model. This term is the interaction between X and the proposed moderating variable. Thus, for a response Y, a predictor variable x1 and a moderating variable x2:

Y = b0 + b1x1 + b2x2 + b3(x1 × x2) + ε

In this case, the role of x2 as a moderating variable is accomplished by evaluating b3, the parameter estimate for the interaction term. See linear regression for discussion of statistical evaluation of parameter estimates in regression analyses.
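A minimal illustration of this model, with simulated data used purely for demonstration, follows; the coefficient labelled x1:x2 corresponds to b3.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data solely for illustration: y depends on x1, x2 and their product.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 0.5 * df["x1"] + 0.3 * df["x2"] + 0.4 * df["x1"] * df["x2"] + rng.normal(size=200)

model = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(model.params["x1:x2"], model.pvalues["x1:x2"])  # b3: the moderation (interaction) effect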

Multicollinearity in moderated regression


In moderated regression analysis, a new interaction predictor (x1 × x2) is calculated. However, the new interaction term will be correlated with the two main-effect terms used to calculate it. This is the problem of multicollinearity in moderated regression. Multicollinearity tends to cause coefficients to be estimated with higher standard errors and hence greater uncertainty.
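The sketch below, using simulated data, illustrates why the raw product term overlaps with its components and how mean-centering (discussed later in this article) reduces that overlap; the variable names are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(loc=5.0, size=1000)          # non-centered predictor
x2 = rng.normal(loc=3.0, size=1000)
raw_product = x1 * x2
centered_product = (x1 - x1.mean()) * (x2 - x2.mean())

print(np.corrcoef(x1, raw_product)[0, 1])       # typically large
print(np.corrcoef(x1, centered_product)[0, 1])  # typically near zero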

Post-hoc probing of interactions


Like simple main effect analysis in ANOVA, in post-hoc probing of interactions in regression we examine the simple slope of one independent variable at specific values of the other independent variable. Below is an example of probing two-way interactions. In what follows, the regression equation with two variables A and B and an interaction term A*B will be considered.[2]

Two categorical independent variables


If both of the independent variables are categorical variables, we can analyze the results of the regression for one independent variable at a specific level of the other independent variable. For example, suppose that both A and B are single dummy coded (0,1) variables, and that A represents ethnicity (0 = European Americans, 1 = East Asians) and B represents the condition in the study (0 = control, 1 = experimental). Then the interaction effect shows whether the effect of condition on the dependent variable Y is different for European Americans and East Asians and whether the effect of ethnic status is different for the two conditions. The coefficient of A shows the ethnicity effect on Y for the control condition, while the coefficient of B shows the effect of imposing the experimental condition for European American participants. To probe if there is any significant difference between European Americans and East Asians in the experimental condition, we can simply run the analysis with the condition variable reverse-coded (0 = experimental, 1 = control), so that the coefficient for ethnicity represents the ethnicity effect on Y in the experimental condition. In a similar vein, if we want to see whether the treatment has an effect for East Asian participants, we can reverse code the ethnicity variable (0 = East Asians, 1 = European Americans).
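A hedged sketch of this reverse-coding trick follows; it assumes a DataFrame df with 0/1 columns A (ethnicity) and B (condition) and an outcome Y, and simply re-fits the model with the condition dummy flipped.

import statsmodels.formula.api as smf

def simple_effects(df):
    original = smf.ols("Y ~ A + B + A:B", data=df).fit()
    df2 = df.assign(B_rev=1 - df["B"])                  # 0 = experimental, 1 = control
    recoded = smf.ols("Y ~ A + B_rev + A:B_rev", data=df2).fit()
    # original.params["A"]: ethnicity effect in the control condition
    # recoded.params["A"]:  ethnicity effect in the experimental condition
    return original.params["A"], recoded.params["A"]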

One categorical and one continuous independent variable


If the first independent variable is a categorical variable (e.g. gender) and the second is a continuous variable (e.g. scores on the Satisfaction With Life Scale (SWLS)), then b1 represents the difference in the dependent variable between males and females when life satisfaction is zero. However, a zero score on the Satisfaction With Life Scale is meaningless as the range of the score is from 7 to 35. This is where centering comes in. If we subtract the mean of the SWLS score for the sample from each participant's score, the mean of the resulting centered SWLS score is zero. When the analysis is run again, b1 now represents the difference between males and females at the mean level of the SWLS score of the sample.

Cohen et al. (2003) recommended using the following to probe the simple effect of gender on the dependent variable (Y) at three levels of the continuous independent variable: high (one standard deviation above the mean), moderate (at the mean), and low (one standard deviation below the mean).[3] If the scores of the continuous variable are not standardized, one can just calculate these three values by adding or subtracting one standard deviation of the original scores; if the scores of the continuous variable are standardized, one can calculate the three values as follows: high = the standardized score minus 1, moderate (mean = 0), low = the standardized score plus 1. Then one can explore the effects of gender on the dependent variable (Y) at high, moderate, and low levels of the SWLS score. As with two categorical independent variables, b2 represents the effect of the SWLS score on the dependent variable for females. By reverse coding the gender variable, one can get the effect of the SWLS score on the dependent variable for males. (A brief sketch of this probing procedure appears after the coding discussion below.)

Coding in moderated regression

When treating categorical variables such as ethnic groups and experimental treatments as independent variables in moderated regression, one needs to code the variables so that each code variable represents a specific setting of the categorical variable. There are three basic ways of coding: dummy-variable coding, effects coding, and contrast coding. Below is an introduction to these coding systems.[4][5]

Dummy coding is used when one has a reference group or one condition in particular (e.g. a control group in the experiment) that is to be compared to each of the other experimental groups. In this case, the intercept is the mean of the reference group, and each of the unstandardized regression coefficients is the difference in the dependent variable between one of the treatment groups and the mean of the reference group (or control group). This coding system is similar to ANOVA analysis, and is appropriate when researchers have a specific reference group and want to compare each of the other groups with it.

Effects coding is used when one does not have a particular comparison or control group and does not have any planned orthogonal contrasts. The intercept is the grand mean (the mean of all the conditions). The regression coefficient is the difference between the mean of one group and the mean of all the group means (e.g. the mean of group A minus the mean of all groups). This coding system is appropriate when the groups represent natural categories.

Contrast coding is used when one has a series of orthogonal contrasts or group comparisons that are to be investigated. In this case, the intercept is the unweighted mean of the individual group means. The unstandardized regression coefficient represents the difference between the unweighted mean of the means of one group (A) and the unweighted mean of another group (B), where A and B are two sets of groups in the contrast. This coding system is appropriate when researchers have an a priori hypothesis concerning the specific differences among the group means.
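The following sketch illustrates the centering-and-probing procedure described above for the categorical-by-continuous case, with hypothetical columns gender (0/1), swls and y; re-centering the moderator at the mean plus or minus one standard deviation makes the gender coefficient the simple effect at that level.

import statsmodels.formula.api as smf

def probe_gender_by_swls(df):
    sd = df["swls"].std()
    results = {}
    for label, shift in {"low": -sd, "mean": 0.0, "high": sd}.items():
        # Re-centering at (mean + shift) makes the gender coefficient the
        # male-female difference at that level of life satisfaction.
        probe = df.assign(swls_c=df["swls"] - df["swls"].mean() - shift)
        fit = smf.ols("y ~ gender + swls_c + gender:swls_c", data=probe).fit()
        results[label] = fit.params["gender"]
    return results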

Two continuous independent variables


If both of the independent variables are continuous, we can either center or standardize the original scores. There is a subtle difference between centering and standardization: in centering we just center all the continuous independent variables rather than the dependent variable, while in standardization we standardize all the continuous independent variables and the continuous dependent variable. Regarding standardization, suppose that independent variable A represents the participant's score on the Rosenberg self-esteem scale and B represents the participant's score on the Satisfaction With Life Scale. Through standardization, the mean score of each of self-esteem and life satisfaction is zero. Label the standardized scores Zse for self-esteem and Zls for life satisfaction. Coefficient b1 shows the effect of self-esteem on the dependent variable (Y) at the mean level of life satisfaction, and coefficient b2 shows the effect of life satisfaction on the dependent variable at the mean level of self-esteem.

To probe the interaction effect, we need to calculate the three values representing high, moderate, and low levels of each independent variable. We don't need to calculate the moderate level as it is zero and represents the mean of each independent variable after standardization. The high and low levels of each independent variable can be calculated as in the case of one categorical independent variable and one continuous independent variable as discussed above. We can label them ZseHigh, ZseLow, ZlsHigh, and ZlsLow. Then we can create the interaction effects with the calculated values. For example, to get the simple effect of self-esteem on the dependent variable at a high level of life satisfaction, the value of the interaction term would be Zse × ZlsHigh. We evaluate the right side of the regression equation at Zse, ZlsHigh, and Zse × ZlsHigh to get the effect of self-esteem on the dependent variable at a high level of life satisfaction. Similarly, we can do simple slope analysis for the effect of self-esteem on the dependent variable at a low level of life satisfaction or the effect of life satisfaction on the dependent variable at different levels of self-esteem.
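Equivalently, once the model has been estimated, the conditional slope of self-esteem at a given level of life satisfaction is simply b1 + b3·Zls; the coefficient values below are invented solely to show the arithmetic.

b1, b3 = 0.40, 0.15            # hypothetical estimates for Zse and the Zse*Zls interaction
for label, z in {"ZlsHigh": 1.0, "mean": 0.0, "ZlsLow": -1.0}.items():
    print(label, b1 + b3 * z)  # conditional (simple) slope of self-esteem at that level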

Higher level interactions


The principles for two-way interactions apply when we want to explore three-way or higher-level interactions. For instance, if we have a three-way interaction between A, B, and C, the regression equation will be as follows:

Y = b0 + b1A + b2B + b3C + b4(A × B) + b5(A × C) + b6(B × C) + b7(A × B × C) + ε

Spurious higher-order effects


It is worth noting that the reliability of the higher-order terms depends on the reliability of the lower-order terms. For example, if the reliability for variable A is .70, and reliability for variable B is .80, then the reliability for the interaction variable A*B is .70*.80 = .56. In this case, low reliability of the interaction term leads to low power; therefore, we may not be able to find the interaction effects between A and B that actually exist. The solution for this problem is to use highly reliable measures for each independent variable.

Another caveat for interpreting the interaction effects is that when variable A and variable B are highly correlated, then the A*B term will be highly correlated with the omitted variable A²; consequently what appears to be a significant moderation effect might actually be a significant nonlinear effect of A alone. If this is the case, it is worth testing a nonlinear regression model by adding nonlinear terms in individual variables into the moderated regression analysis to see if the interactions remain significant. If the interaction effect A*B is still significant, we will be more confident in saying that there is indeed a moderation effect; however, if the interaction effect is no longer significant after adding the nonlinear term, we will be less certain about the existence of a moderation effect and the nonlinear model will be preferred because it is more parsimonious.


References
[1] Baron, R. M., & Kenny, D. A. (1986). "The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations", Journal of Personality and Social Psychology, 51 (6), 1173–1182 (page 1174)

Hayes, A. F., & Matthes, J. (2009). "Computational procedures for probing interactions in OLS and logistic regression: SPSS and SAS implementations." Behavior Research Methods, Vol. 41, pp. 924–936.

Multidimensional scaling
Multidimensional scaling (MDS) is a set of related statistical techniques often used in information visualization for exploring similarities or dissimilarities in data. MDS is a special case of ordination. An MDS algorithm starts with a matrix of item–item similarities, then assigns a location to each item in N-dimensional space, where N is specified a priori. For sufficiently small N, the resulting locations may be displayed in a graph or visualized with 2D techniques such as scatterplots.

Types
MDS algorithms fall into a taxonomy, depending on the meaning of the input matrix:

Classical multidimensional scaling
Also known as Principal Coordinates Analysis, Torgerson Scaling or Torgerson–Gower scaling. Takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain.

Metric multidimensional scaling
A superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and input matrices of known distances with weights and so on. A useful loss function in this context is called stress, which is often minimized using a procedure called stress majorization.

Non-metric multidimensional scaling
Louis Guttman's smallest space analysis (SSA) is an example of a non-metric MDS procedure. In contrast to metric MDS, non-metric MDS finds both a non-parametric monotonic relationship between the dissimilarities in the item–item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space. The relationship is typically found using isotonic regression.

Generalized multidimensional scaling
An extension of metric multidimensional scaling, in which the target space is an arbitrary smooth non-Euclidean space. In cases where the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the minimum-distortion embedding of one surface into another.


Details
The data to be analyzed is a collection of I objects (colors, faces, stocks, ...) on which a distance function is defined: δij := distance between the i-th and j-th objects. These distances are the entries of the dissimilarity matrix Δ = (δij).

The goal of MDS is, given Δ, to find I vectors x1, ..., xI in R^N such that ||xi − xj|| ≈ δij for all i, j, where ||·|| is a vector norm. In classical MDS, this norm is the Euclidean distance, but, in a broader sense, it may be a metric or arbitrary distance function.[1] In other words, MDS attempts to find an embedding from the I objects into R^N such that distances are preserved. If the dimension N is chosen to be 2 or 3, we may plot the vectors xi to obtain a visualization of the similarities between the objects. Note that the vectors xi are not unique: with the Euclidean distance, they may be arbitrarily translated, rotated, and reflected, since these transformations do not change the pairwise distances ||xi − xj||.

There are various approaches to determining the vectors xi. Usually, MDS is formulated as an optimization problem, where (x1, ..., xI) is found as a minimizer of some cost function, for example,

min over x1, ..., xI of Σ_{i<j} (||xi − xj|| − δij)²

A solution may then be found by numerical optimization techniques. For some particularly chosen cost functions, minimizers can be stated analytically in terms of matrix eigendecompositions.[citation needed]
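For the classical (Torgerson) case, the eigendecomposition route mentioned above can be written in a few lines; the following Python sketch is purely illustrative and is not a substitute for the library implementations listed later in this article.

import numpy as np

def classical_mds(D, n_components=2):
    """D: symmetric matrix of pairwise dissimilarities."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale             # coordinates in n_components dimensions

# Example: distances between three collinear points are recovered up to sign.
D = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
X = classical_mds(D, n_components=1)
print(X[:, 0])  # approximately 1, 0, -1 (or the mirror image)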

Procedure
There are several steps in conducting MDS research:

1. Formulating the problem: What variables do you want to compare? How many variables do you want to compare? More than 20 is often considered cumbersome.[citation needed] Fewer than 8 (4 pairs) will not give valid results.[citation needed] What purpose is the study to be used for?

2. Obtaining input data: Respondents are asked a series of questions. For each product pair, they are asked to rate similarity (usually on a 7-point Likert scale from very similar to very dissimilar). The first question could be for Coke/Pepsi for example, the next for Coke/Hires rootbeer, the next for Pepsi/Dr Pepper, the next for Dr Pepper/Hires rootbeer, etc. The number of questions is a function of the number of brands and can be calculated as Q = N(N − 1)/2, where Q is the number of questions and N is the number of brands. This approach is referred to as the "Perception data: direct approach". There are two other approaches. There is the "Perception data: derived approach" in which products are decomposed into attributes that are rated on a semantic differential scale. The other is the "Preference data approach" in which respondents are asked their preference rather than similarity.

3. Running the MDS statistical program: Software for running the procedure is available in many statistical packages. Often there is a choice between Metric MDS (which deals with interval or ratio level data) and Nonmetric MDS (which deals with ordinal data).

4. Deciding on the number of dimensions: The researcher must decide on the number of dimensions they want the computer to create. The more dimensions, the better the statistical fit, but the more difficult it is to interpret the results.

5. Mapping the results and defining the dimensions: The statistical program (or a related module) will map the results. The map will plot each product (usually in two-dimensional space). The proximity of products to each other indicates either how similar they are or how preferred they are, depending on which approach was used. How the dimensions of the embedding actually correspond to dimensions of system behavior, however, is not necessarily obvious. Here, a subjective judgment about the correspondence can be made (see perceptual mapping).

6. Testing the results for reliability and validity: Compute R-squared to determine what proportion of variance of the scaled data can be accounted for by the MDS procedure. An R-square of 0.6 is considered the minimum acceptable level.[citation needed] An R-square of 0.8 is considered good for metric scaling and 0.9 is considered good for non-metric scaling. Other possible tests are Kruskal's stress, split data tests, data stability tests (i.e., eliminating one brand), and test-retest reliability.

7. Reporting the results comprehensively: Along with the mapping, at least the distance measure (e.g., Sorenson index, Jaccard index) and reliability (e.g., stress value) should be given. It is also very advisable to give the algorithm (e.g., Kruskal, Mather), which is often defined by the program used (sometimes replacing the algorithm report), whether you have given a start configuration or had a random choice, the number of runs, the assessment of dimensionality, the Monte Carlo method results, the number of iterations, the assessment of stability, and the proportional variance of each axis (r-square). A minimal sketch of steps 3 to 6 with one freely available implementation follows this list.
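As a hedged illustration of steps 3 to 6, the sketch below runs a non-metric MDS on a small invented dissimilarity matrix using scikit-learn (exact arguments may vary across library versions) and reports the stress value.

import numpy as np
from sklearn.manifold import MDS

D = np.array([[0, 2, 5, 6],
              [2, 0, 4, 5],
              [5, 4, 0, 1],
              [6, 5, 1, 0]], dtype=float)   # toy dissimilarities for four "brands"

mds = MDS(n_components=2, dissimilarity="precomputed", metric=False,
          random_state=0, n_init=10)
coords = mds.fit_transform(D)                # one 2-D point per brand, for plotting
print(coords)
print(mds.stress_)                           # Kruskal-type stress; lower is better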


Applications
Applications include scientific visualisation and data mining in fields such as cognitive science, information science, psychophysics, psychometrics, marketing and ecology. New applications arise in the scope of autonomous wireless nodes that populate a space or an area. MDS may apply as a real time enhanced approach to monitoring and managing such populations. Furthermore, MDS has been used extensively in geostatistics for modeling the spatial variability of the patterns of an image, by representing them as points in a lower-dimensional space.[2]

Marketing
In marketing, MDS is a statistical technique for taking the preferences and perceptions of respondents and representing them on a visual grid, called perceptual maps.

Comparison and advantages


Potential customers are asked to compare pairs of products and make judgments about their similarity. Whereas other techniques (such as factor analysis, discriminant analysis, and conjoint analysis) obtain underlying dimensions from responses to product attributes identified by the researcher, MDS obtains the underlying dimensions from respondents' judgments about the similarity of products. This is an important advantage.[citation needed] It does not depend on researchers' judgments. It does not require a list of attributes to be shown to the respondents. The underlying dimensions come from respondents' judgments about pairs of products. Because of these advantages, MDS is the most common technique used in perceptual mapping.[citation needed]


Implementations
cmdscale in R
NMS in PC-ORD, Multivariate Analysis of Ecological Data [3]
Orange, a free data mining software suite, module orngMDS [4]
ViSta [5] has implementations of MDS by Forrest W. Young. Interactive graphics allow exploring the results of MDS in detail.
usabiliTEST's Online Card Sorting [6] software uses MDS to plot the data collected from the participants of usability tests.

Bibliography
[1] Kruskal, J. B., and Wish, M. (1978), Multidimensional Scaling, Sage University Paper series on Quantitative Application in the Social Sciences, 07-011. Beverly Hills and London: Sage Publications.
[2] Honarkhah, M and Caers, J, 2010, Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling (http://dx.doi.org/10.1007/s11004-010-9276-7), Mathematical Geosciences, 42: 487–517
[3] http://www.pcord.com
[4] http://www.ailab.si/orange/doc/modules/orngMDS.htm
[5] http://www.uv.es/visualstats/Book
[6] http://www.usabilitest.com/CardSorting

Cox, T.F., Cox, M.A.A. (2001). Multidimensional Scaling. Chapman and Hall.
Coxon, Anthony P.M. (1982). The User's Guide to Multidimensional Scaling. With special reference to the MDS(X) library of Computer Programs. London: Heinemann Educational Books.
Green, P. (January 1975). "Marketing applications of MDS: Assessment and outlook". Journal of Marketing 39 (1): 24–31. doi: 10.2307/1250799 (http://dx.doi.org/10.2307/1250799).
McCune, B. and Grace, J.B. (2002). Analysis of Ecological Communities. Oregon, Gleneden Beach: MjM Software Design. ISBN 0-9721290-0-6.
Torgerson, Warren S. (1958). Theory & Methods of Scaling. New York: Wiley. ISBN 0-89874-722-8.

External links
An elementary introduction to multidimensional scaling (http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm)
NewMDSX: Multidimensional Scaling Software (http://www.newmdsx.com/)
MDS page (http://www.granular.com/MDS/)
MDS in C++ (http://codingplayground.blogspot.com/2009/05/multidimension-scaling.html) by Antonio Gulli
The orngMDS module (http://orange.biolab.si/doc/modules/orngMDS.htm) for MDS from Orange (software)


Multiple mini interview


The multiple mini interview (MMI)[1] is an interview format that uses many short independent assessments, typically in a timed circuit, to obtain an aggregate score of each candidate's soft skills. In 2001, the Michael DeGroote School of Medicine at McMaster University began developing the MMI system, to address two widely recognized problems. First, it has been shown that traditional interview formats or simulations of educational situations do not accurately predict performance in medical school. Secondly, when a licensing or regulatory body reviews the performance of a physician subsequent to patient complaints, the most frequent issues of concern are those of the non-cognitive skills, such as interpersonal skills, professionalism and ethical/moral judgment.

Introduction
Interviews have been used widely for different purposes, including assessment and recruitment. Candidate assessment is normally deemed successful when the scores generated by the measuring tool predict for future outcomes of interest, such as job performance or job retention. Meta-analysis of the human resource literature has demonstrated low to moderate ability of interviews to predict for future job performance.[2] How well a candidate scores on one interview is only somewhat correlated with how well that candidate scores on the next interview. Marked shifts in scores are buffered when collecting many scores on the same candidate, with a greater buffering effect provided by multiple interviews than by multiple interviewers acting as a panel for one interview.[3] The score assigned by an interviewer in the first few minutes of an interview is rarely changed significantly over the course of the rest of the interview, an effect known as the halo effect. Therefore, even very short interviews within an MMI format provide similar ability to differentiate reproducibly between candidates.[4] Ability to reproducibly differentiate between candidates, also known as overall test reliability, is markedly higher for the MMI than for other interview formats.[1] This has translated into higher predictive validity, correlating for future performance much more highly than standard interviews.[5][6][7][8]

History
Aiming to enhance predictive correlations with future performance in medical school, post-graduate medical training, and future performance in practice, McMaster University began research and development of the MMI in 2001. The initial pilot was conducted on 18 graduate students volunteering as medical school candidates. High overall test reliability (0.81) led to a larger study conducted in 2002 on real medical school candidates, many of whom volunteered after their standard interview to stay for the MMI. Overall test reliability remained high,[1] and subsequent follow-up through medical school and on to national licensure examination (Medical Council of Canada [9] Qualifying Examination Parts I and II) revealed the MMI to be the best predictor for subsequent clinical performance,[5][7] professionalism,[6] and ability to communicate with patients and successfully obtain national licensure.[7][8] Since its formal inception at the Michael G. DeGroote School of Medicine at McMaster University in 2004, the MMI subsequently spread as an admissions test across medical schools, and to other disciplines. By 2008, the MMI was being used as an admissions test by the majority of medical schools in Canada, Australia and Israel, as well as other medical schools in the United States and Brunei. This success led to the development of a McMaster spin-off company, APT Inc., to commercialize the MMI system. The MMI was branded as ProFitHR [10] and made available to both the academic and corporate sector.[11] By 2009, the list of other disciplines using the MMI included schools for dentistry, pharmacy, midwifery, physiotherapy and occupational therapy, veterinary medicine, ultrasound technology, nuclear medicine technology, X-ray technology, medical laboratory technology, chiropody, dental hygiene, and postgraduate training programs in dentistry and medicine.


MMI Procedure
1. Interview stations: The domain(s) being assessed at any one station are variable, and normally reflect the objectives of the selecting institution. Examples of domains include the soft skills - ethics, professionalism, interpersonal relationships, ability to manage, communicate, collaborate, as well as perform a task. An MMI interview station takes considerable time and effort to produce; it is composed of several parts, including the stem question, probing questions for the interviewer, and a scoring sheet.

2. Circuit(s) of stations: To reduce costs of the MMI significantly below that of most interviews,[12] the interview stations are kept short (eight minutes or less) and are conducted simultaneously in a circuit as a bell-ringer examination. The preferred number of stations depends to some extent on the characteristics of the candidate group being interviewed, though nine interviews per candidate represents a reasonable minimum.[3] The circuit of interview stations should be within sufficiently close quarters to allow candidates to move from interview room to interview room. Multiple parallel circuits can be run, each circuit with the same set of interview stations, depending upon physical plant limitations.

3. Interviewers: One interviewer per interview station is sufficient.[3] In a typical MMI, each interviewer stays in the same interview throughout, as candidates rotate through. The interviewer thus scores each candidate based upon the same interview scenario throughout the course of the test.

4. Candidates: Each candidate rotates through the circuit of interviews. For example, if each interview station is eight minutes, and there are nine interview stations, it will take the nine candidates being assessed on that circuit 72 minutes to complete the MMI. Each of the candidates begins at a different interview station, rotating to the next interview station at the ringing of the bell (see the sketch following this list).

5. Administrators: Each circuit requires at least one administrator to ensure that the MMI is conducted fairly and on time.
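The timing implied above (nine stations of eight minutes giving a 72-minute circuit) can be illustrated with a toy rotation schedule; this is purely illustrative and not an official MMI tool.

def mmi_schedule(n_stations=9, minutes_per_station=8):
    # Each candidate starts at a different station and shifts one station at each bell.
    for rotation in range(n_stations):
        start = rotation * minutes_per_station
        assignment = {c: (c + rotation) % n_stations for c in range(n_stations)}
        print(f"minute {start:3d}: candidate -> station {assignment}")
    print(f"total time: {n_stations * minutes_per_station} minutes")

mmi_schedule()   # 9 stations x 8 minutes = 72 minutes, as noted above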

Utility of the MMI


The MMI requires less expenditure of resources than standard interview formats.[11] Test security breaches tend not to unduly influence results.[13] Sex of candidate and candidate status as under-represented minority tends not to unduly influence results.[1][14] Preparatory courses taken by the candidate tend not to unduly influence results.[15] The MMI has been validated and tested for over seven years and the product is now available off the shelf.[8]

References
[1] Eva KW, Reiter HI, Rosenfeld J, Norman GR. An admissions OSCE: the multiple mini-interview. Medical Education, 38:314-326 (2004). [2] Barrick MR, Mount MK. The Big 5 personality dimensions and job performance: a meta-analysis. Personnel Psychology 1991, 44:1-26. [3] Eva KW, Reiter HI, Rosenfeld J, Norman GR. The relationship between interviewer characteristics and ratings assigned during a Multiple Mini-Interview. Academic Medicine, 2004 Jun; 79(6):602.9. [4] Dodson M, Crotty B, Prideaux D, Carne R, Ward A, de Leeuw E. The multiple mini-interview: how long is long enough? Med Educ. 2009 Feb;43(2):168-74. [5] Eva KW, Reiter HI, Rosenfeld J, Norman GR. The ability of the Multiple Mini-Interview to predict pre-clerkship performance in medical school. Academic Medicine, 2004, Oct; 79(10 Suppl): S40-2. [6] Reiter HI, Eva KW, Rosenfeld J, Norman GR. Multiple Mini-Interview Predicts for Clinical Clerkship Performance, National Licensure Examination Performance. Med Educ. 2007 Apr;41(4):378-84. [7] Eva KW, Reiter HI, Trinh K, Wasi P, Rosenfeld J, Norman GR. Predictive validity of the multiple mini-interview for selecting medical trainees. Accepted for publication January 2009 in Medical Education. [8] Hofmeister M, Lockyer J, Crutcher R. The multiple mini-interview for selection of international medical graduates into family medicine residency education. Med Educ. 2009 Jun;43(6):573-9. [9] http:/ / www. mcc. ca/ [10] http:/ / www. profithr. com/ [11] www.ProFitHR.com [12] Rosenfeld J, Eva KW, Reiter HI, Trinh K. A Cost-Efficiency Comparison between the Multiple Mini-Interview and Panel-based Admissions Interviews. Advanced Health Science Education Theory Pract. 2008 Mar;13(1):43-58

Multiple mini interview


[13] Reiter HI, Salvatori P, Rosenfeld J, Trinh K, Eva KW. The Impact of Measured Violations of Test Security on Multiple-Mini Interview (MMI). Medical Education, 2006; 40:36-42. [14] Moreau K, Reiter HI, Eva KW. Comparison of Aboriginal and Non-Aboriginal Applicants for Admissions on the Multiple Mini-Interview using Aboriginal and Non-Aboriginal Interviewers. Teaching and Learning in Medicine, 2006; 18:58-61. [15] Griffin B, Harding DW, Wilson IG, Yeomans ND. Does practice make perfect? The effect of coaching and retesting on selection tests used for admission to an Australian medical school. Med J Aust. 2008 Sep 1;189(5):270-3


Multistage testing
Multistage testing is an algorithm-based approach to administering tests. It is very similar to computer-adaptive testing in that items are interactively selected for each examinee by the algorithm, but rather than selecting individual items, groups of items are selected, building the test in stages. These groups are called testlets or panels.[1] While multistage tests could theoretically be administered by a human, the extensive computations required (often using item response theory) mean that multistage tests are administered by computer. The number of stages or testlets can vary. If the testlets are relatively small, such as five items, ten or more could easily be used in a test. Some multistage tests are designed with the minimum of two stages (one stage would be a conventional fixed-form test).[2] In response to the increasing use of multistage testing, the scholarly journal Applied Measurement in Education published a special edition on the topic in 2006.[3]
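As a toy illustration of the idea, the sketch below routes an examinee to an easier or harder second-stage testlet on the basis of a first-stage score; the thresholds and testlet labels are invented, and operational systems typically use item response theory rather than raw cut-offs.

def choose_second_stage(routing_score, max_routing_score=10):
    # Route to a harder or easier testlet depending on first-stage performance.
    if routing_score >= 0.7 * max_routing_score:
        return "hard testlet"
    if routing_score >= 0.4 * max_routing_score:
        return "medium testlet"
    return "easy testlet"

for score in (3, 6, 9):
    print(score, "->", choose_second_stage(score))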

References
[1] Luecht, R. M. & Nungester, R. J. (1998). "Some practical examples of computer-adaptive sequential testing." Journal of Educational Measurement, 35, 229-249.
[2] Castle, R.A. (1997). "The Relative Efficiency of Two-Stage Testing Versus Traditional Multiple Choice Testing Using Item Response Theory in Licensure." Unpublished doctoral dissertation. (http://dwb.unl.edu/Diss/RCastle/ReedCastleDiss.html)
[3] Applied Measurement in Education edition on multistage testing (http://www.leaonline.com/toc/ame/19/3)


Multitrait-multimethod matrix
The multitrait-multimethod (MTMM) matrix is an approach to examining construct validity developed by Campbell and Fiske (1959).[1] There are six major considerations when examining a construct's validity through the MTMM matrix, which are as follows:

1. Evaluation of convergent validity: Tests designed to measure the same construct should correlate highly amongst themselves.

2. Evaluation of discriminant (divergent) validity: The construct being measured by a test should not correlate highly with different constructs.

3. Trait-method unit: Each task or test used in measuring a construct is considered a trait-method unit, in that the variance contained in the measure is part trait and part method. Generally, researchers desire low method-specific variance and high trait variance.

4. Multitrait-multimethod: More than one trait and more than one method must be used to establish (a) discriminant validity and (b) the relative contributions of the trait or method specific variance. This tenet is consistent with the ideas proposed in Platt's concept of strong inference (1964).[2]

5. Truly different methodology: When using multiple methods, one must consider how different the actual measures are. For instance, delivering two self-report measures are not truly different measures, whereas using an interview scale or a psychosomatic reading would be.

6. Trait characteristics: Traits should be different enough to be distinct, but similar enough to be worth examining in the MTMM.


Multitrait
Multiple traits are used in this approach to examine (a) similar or (b) dissimilar traits, as to establish convergent and discriminant validity amongst traits.

Multimethod
Similarly, multiple methods are used in this approach to examine the differential effects (or lack thereof) caused by method specific variance.

Example
The example below provides a prototypical matrix and what the correlations between measures mean. The diagonal line is typically filled in with a reliability coefficient of the measure (e.g. alpha coefficient). Descriptions in brackets [] indicate what is expected when the validity of the construct (e.g., depression or anxiety) and the validities of the measures are all high.


Test | Beck Depression Inv | Hepner Depression Interview | Beck Anxiety Inv | Hepner Anxiety Interview
BDI | (Reliability Coefficient) [close to 1.00] | | |
HDIv | Heteromethod-monotrait [highest of all except reliability] | (Reliability Coefficient) [close to 1.00] | |
BAI | Monomethod-heterotrait [low, less than monotrait] | Heteromethod-heterotrait [lowest of all] | (Reliability Coefficient) [close to 1.00] |
HAIv | Heteromethod-heterotrait [lowest of all] | Monomethod-heterotrait [low, less than monotrait] | Heteromethod-monotrait [highest of all except reliability] | (Reliability Coefficient) [close to 1.00]

In this example the first row and the first column display the trait being assessed (i.e. anxiety or depression) as well as the method of assessing this trait (i.e. interview or survey as measured by fictitious measures). The term heteromethod indicates that in this cell the correlation between two separate methods is being reported. Monomethod indicates the opposite, in that the same method is being used (e.g. interview, interview). Heterotrait indicates that the cell is reporting two supposedly different traits. Monotrait indicates the opposite- that the same trait is being used. In evaluating an actual matrix one wishes to examine the proportion of variance shared amongst traits and methods as to establish a sense of how much method specific variance is induced by the measurement method, as well as provide a look at how unique the trait is, as compared to another trait. That is, for example, the trait should matter more than the specific method of measuring. For example, if a person is measured as being highly depressed by one measure, then another type of measure should also indicate that the person is highly depressed. On the other hand, people who appear highly depressed on the Beck Depression Inventory should not necessarily get high anxiety scores on Beck's Anxiety Inventory. Since the inventories were written by the same person, and are similar in style, there might be some correlation, but this similarity in method should not affect the scores much, so the correlations between these measures of different traits should be low.

Analysis of the MTMM Matrix


A variety of statistical approaches have been used to analyze the data from the MTMM matrix. The standard method from Campbell and Fiske can be implemented using the MTMM.EXE program available at http://gim.med.ucla.edu/FacultyPages/Hays/util.htm. One can also use confirmatory factor analysis[3] due to the complexities in considering all of the data in the matrix. The Sawilowsky I test,[4][5] however, considers all of the data in the matrix with a distribution-free statistical test for trend. The test is conducted by reducing the heterotrait-heteromethod and heterotrait-monomethod triangles, and the validity and reliability diagonals, into a matrix of four levels. Each level consists of the minimum, median, and maximum value. The null hypothesis is that these values are unordered, which is tested against the alternative hypothesis of an increasing ordered trend. The test statistic is found by counting the number of inversions (I). The critical value for alpha = 0.05 is 10, and for alpha = .01 is 14.
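As a rough illustration of the Campbell and Fiske logic (not of any particular published program), the sketch below takes a small invented correlation matrix labelled by (trait, method) pairs, sorts its off-diagonal entries into the cell types described in the example above, and compares their averages; the heteromethod-monotrait (validity) values should be the largest.

import itertools
import numpy as np

# Invented labels and correlations: BDI, BAI (inventory); HDIv, HAIv (interview).
labels = [("dep", "inv"), ("anx", "inv"), ("dep", "int"), ("anx", "int")]
R = np.array([[1.00, 0.30, 0.65, 0.20],
              [0.30, 1.00, 0.25, 0.60],
              [0.65, 0.25, 1.00, 0.28],
              [0.20, 0.60, 0.28, 1.00]])

groups = {"heteromethod-monotrait": [], "monomethod-heterotrait": [],
          "heteromethod-heterotrait": []}
for i, j in itertools.combinations(range(len(labels)), 2):
    same_trait = labels[i][0] == labels[j][0]
    same_method = labels[i][1] == labels[j][1]
    if same_trait and not same_method:
        groups["heteromethod-monotrait"].append(R[i, j])
    elif not same_trait and same_method:
        groups["monomethod-heterotrait"].append(R[i, j])
    else:
        groups["heteromethod-heterotrait"].append(R[i, j])

for name, vals in groups.items():
    print(name, np.mean(vals))   # reliability diagonal (excluded here) should be highest of all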


References
[1] Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
[2] John R. Platt (1964). "Strong inference". Science 146 (3642).
[3] Figueredo, A., Ferketich, S., Knapp, T. (1991). Focus on psychometrics: More on MTMM: The Role of Confirmatory Factor Analysis. Nursing & Health, 14, 387-391.
[4] Sawilowsky, S. (2002). A quick distribution-free test for trend that contributes evidence of construct validity. Measurement and Evaluation in Counseling and Development, 35, 78-88.
[5] Cuzzocrea, J., & Sawilowsky, S. (2009). Robustness to non-independence and power of the I test for trend in construct validity. Journal of Modern Applied Statistical Methods, 8(1), 215-225.

Neo-Piagetian theories of cognitive development



Jean Piaget's theory of cognitive development has been criticized on many grounds. One criticism is concerned with the very nature of development itself. It is suggested that Piaget's theory does not explain why development from stage to stage occurs. The theory is also criticized for ignoring individual differences in cognitive development. That is, the theory does not account for the fact that some individuals move from stage to stage faster than other individuals. Finally, another criticism is concerned with the nature of stages themselves. Research shows that the functioning of a person at a given age may be so variable from domain to domain, such as the understanding of social, mathematical, and spatial concepts, that it is not possible to place the person in a single stage.[1] To remove these weaknesses, a group of researchers, who are known as neo-Piagetian theorists, advanced models that integrate concepts from Piaget's theory with concepts from cognitive and differential psychology.[2][3][4][5]

The Theory of Juan Pascual-Leone


Initially, neo-Piagetian theorists explained cognitive growth along Piagetian stages by invoking information processing capacity as the cause of both development from the one stage to the next and individual differences in developmental rate. Juan Pascual-Leone was the first to advance this approach.[6] Specifically, he argued that human thought is organized in two levels. The first and more basic level is defined by mental power or capacity. That is, this level involves processes that define the volume and kind of information that the individual can process. Working memory is the functional manifestation of mental power. The capacity of working memory is usually specified in reference to the number of information chunks or units that one can keep in mind simultaneously at a given moment. The second level involves mental content as such. That is, it involves concepts and schemes about the physical, the biological, and the social world, and the symbols we use to refer to them, such as words, numbers, and mental images. It also involves the mental operations that we can carry out on them, such as arithmetic operations on numbers, mental rotation on mental images, etc.

Pascual-Leone proposed that the increase of the number of mental units that one can represent simultaneously makes the person able to handle more complex concepts. For instance, one needs to be able to hold two mental units in mind to be able to decide if one number is bigger than another number. To be able to add them, the person needs to be able to hold three units, that is, the two numbers plus the arithmetic operation to be applied, such as addition or subtraction. To be able to understand proportionality, one must be able to keep in mind five units, that is, the two pairs of numbers to be compared and their relation. According to Pascual-Leone, mental power is equal to 1 scheme or unit of information at the age of 2–3 years and it increases by one unit every second year until it reaches its maximum of 7 units at the age of 15 years. He claimed that Piaget's classical stages of pre-operational, intuitive, early concrete, late concrete, transitional from concrete to

formal, early formal, and late formal thought require a mental power of 1, 2, 3, 4, 5, 6, and 7 mental units, respectively. Having a lesser degree of mental power than required by a task makes the solution of this task impossible, because the necessary relations cannot be represented and computed. Thus, each increase in mental power with age opens the way for the construction of concepts and skills up to the new level of capacity. Falling short or exceeding the mental power that is typical of a given age results in slower or faster rates of development, respectively.


The Theory of Robbie Case


Based on Pascual-Leone, several other researchers advanced alternative models of capacity development. Robbie Case rejected the idea that changes in processing capacity can be described as a progression along Pascual-Leone's single line of development.[7] Instead, he maintained that processing capacity development recycles over a succession of four main stages and that each of them is characterized by a different kind of mental structures. These stages correspond to Piaget's main stages of sensorimotor, preoperational, concrete operational and formal operational thought. Each of these four stages involves its own executive control structures that are defined by the medium of representation and the type of relations that are possible at the stage.

Executive control structures


Executive control structures enable the person to: (1) represent the problem situation; (2) specify the objectives of problem solving; (3) conceive of the strategy needed to attain the objectives. Case maintained that there are four types of executive control structures: sensorimotor structures from 1 to 18 months of age (i.e., perceptions and actions such as seeing and grasping); inter-relational structures from 18 months to 5 years of age (i.e., mental representations that stand for actual objects in the environment, such as words or mental images); dimensional structures from 5 to 11 years (i.e., mental representations that are connected together by a consistent relation such that every particular case can be related to every other case, such as the mental number line where every number can be related to every other number); finally, vectorial structures from 11 to 19 years (i.e., relations between the dimensions of the previous stage, such as ratios and proportions which connect two or more dimensions with each other).

Case also argued that development within each of these four main stages evolves along the same sequence of the following four levels of complexity: (1) operational consolidation (when a particular mental unit specific to each of the four main stages above can be contemplated and handled, such as an action in the sensorimotor stage, a word in the relational stage, a number in the dimensional stage, etc.); (2) unifocal coordination (when two such units may be interrelated); (3) bifocal coordination (when three such units may be interrelated); (4) elaborated coordination (when four such units may be interrelated). Thus, structures of increasing complexity can be handled at each of the four levels.

According to Case, this expansion of the capacity of short-term storage space is caused by increasing operational efficiency. That is, the command of the operations that define each kind of executive control structure improves, thereby freeing space for the representation of goals and objectives. For example, counting becomes faster with age, enabling children to keep more numbers in mind. Successive stages are not unrelated, however. That is, the final level of a given stage is at the same time the first level of the following stage. For instance, when the concept of number is well established at the final level of elaborated coordination of the relational stage, it enables children to view numbers as related to each other, and this is equivalent to the first level of operational consolidation of the following dimensional stage. Thus, when the structures of a given stage reach a given level of complexity (which corresponds to the level of elaborated coordination), a new mental structure is created and the cycle starts up from the beginning.


Central conceptual structures


Case recognized that variations may occur in the organization and development of different domains, due to differences in how meaning is organized in each of the domains. Specifically, Case recognized that there are central conceptual structures. These are "networks of semantic nodes and relations that have an extremely broad (but not system-wide) domain of application and that are central to children's functioning in that domain."[8] Case and his colleagues identified central conceptual structures for quantities, space, social behavior, narrative, music, and motor behavior. Each of these structures is supposed to involve a set of core processes and principles which serve to organize a broad array of situations; for example, the concept of more and less for quantities, adjacency and inclusion relationships for space, and actions and intentions for social behavior. Thus, these are very broad structures in which many executive control structures may be constructed, relative to an individual's experiences and needs. For example, in the central conceptual structure that organizes quantities, executive control structures to solve arithmetic problems, to operate balance beams, to represent home locations according to their street address, etc., may be constructed. In short, central conceptual structures function as frames and they provide the basic guiding principles and raw conceptual material for the construction of more locally focused concepts and action plans, when the need for them arises. Learning the core elements of a central conceptual structure opens the way for fast acquisition of a wide array of executive control structures, although this does not generalize to other conceptual structures. It remains limited within the one affected, indicating that there may be variations both within and across individuals in the executive control structures that can be constructed within each central conceptual structure. These variations depend on the environmental support provided to each structure and on the individual's particular preferences and involvement.[9]

The Theory of Graeme S Halford


Graeme S Halford raised a number of objections regarding Case's definition of working memory capacity and its role in cognitive growth. The main objection is that different persons may represent the same problem differently and thus they may analyze the goals and objectives of the problem differently. Therefore, mental capacity cannot be specified in reference to executive functions. Halford proposed an alternative way to analyze the processing demands of problems that is supposed to explain the most crucial component of understanding and problem solving. This is the grasp of the network of relations that minimally and fully define a particular concept or problem.[10] According to Halford, this grasp is built through structure mapping. Structure mapping is analogical reasoning that people use to give meaning to problems by translating the givens of a problem into a representation or mental model that they already have and which allows them to understand the problem. The structure mappings that can be constructed depend upon the relational complexity of the structures they involve. The relational complexity of structures depends on the number of entities or the number of dimensions that are involved in the structure. The processing load of a task corresponds to the number of dimensions, which must be simultaneously represented, if their relations are to be understood. For example, to understand any comparison between two entities (e.g., "larger than", "better than", etc.) one must be able to represent two entities and one relation between them. To understand a transitive relation one must be able to represent at least three entities (e.g., objects A, B, and C) and two relations (e.g., A is taller than B; C is shorter than B); otherwise it would not be possible to mentally arrange the entities in the right order that would reveal the relations between all entities involved. Halford identified four levels of dimensionality. The first is the level of unary relations or element mappings. Mappings at this level are constructed on the basis of a single attribute. For instance, the mental image of an apple is a valid representation of this fruit because it is similar to it. The second is the level of binary relations or relational mappings. At this level two-dimensional concepts of the type "larger than" can be constructed. Thus, two elements connected by a given relation can be considered at this level. The next is the level of system mappings, which requires that three elements or two relations must be considered simultaneously. At this level ternary relations or binary operations can be represented. The example of transitivity, which can be understood at this level, has already

been explained above. The ability to solve simple arithmetic problems, where one term is missing, such as "3 + ? = 8" or "4 ? 2 = 8" also depends on system mappings, because all three known factors given must be considered simultaneously if the missing element or operation is to be specified. At the final level multiple-system mappings can be constructed. At this level quaternary relations or relations between binary operations can be constructed. For example, problems with two unknowns (e.g., 2 ? 2 ? 4 = 4) or problems of proportionality, can be solved. That is, at this level four dimensions can be considered at once. The four levels of structure mappings are thought to be attainable at the age of 1, 3, 5, and 10 years, respectively, and they correspond, in the theory of cognitive development of Piaget, to the sensorimotor, the preoperational, the concrete operational, and the formal operational, or Case's sensorimotor, interrelational, dimensional, and vectorial stage, respectively.


The Theory of Kurt W Fischer


Kurt W. Fischer advanced a theory that integrates Piaget's notion of stages in cognitive development with notions from learning theory and skill construction as explained by the cognitive psychology of the sixties.[11] Fischer's conception of the stages of cognitive development is very similar to that of Case. That is, he describes four major stages or tiers which coincide by and large with Case's major stages. Thinking at each of the tiers operates with a different type of representation. The first is the tier of reflexes, which structures the basic reflexes constructed during the first month of life. The second is the sensorimotor tier, which operates on perceptions and actions. The third is the representational tier, which operates on representations that are descriptive of reality. The fourth is the abstract tier, which operates on abstractions integrating the representations of the second tier. Moreover, like Case, he believes that development within each major stage recycles over the same sequence of four structurally identical levels. That is, at the first level of single sets individuals can construct skills involving only one element of the tier concerned, that is, sensorimotor sets, representational sets, or abstract sets. At the level of mappings they can construct skills involving two elements mapped onto or coordinated with each other, that is, sensorimotor mappings, representational mappings, or abstract mappings. At the level of systems they can construct skills integrating two mappings of the previous level, that is, sensorimotor systems, representational systems, or abstract systems. At the level of systems of systems they can construct skills integrating two systems of the previous level, that is, sensorimotor systems of systems, representational systems of systems, or abstract systems of systems.

However, Fischer's theory differs from the other neo-Piagetian theories in a number of respects. One of them is in the way it explains cognitive change. Specifically, although Fischer does not deny the operation of information processing constraints on development, he emphasizes environmental and social rather than individual factors as causes of development. To explain developmental change he borrowed two classic notions from Lev Vygotsky,[12] that is, internalization and the zone of proximal development. Internalization refers to the processes that enable children to reconstruct and absorb the products of their observations and interactions in a way that makes them their own. That is, it is a process which transforms external, alien skills and concepts into internal, integral ones. The zone of proximal development expresses Vygotsky's idea that at any age the child's potential for understanding and problem solving is not identical to his actual understanding and problem solving ability. Potential ability is always greater than actual ability: the zone of proximal development refers to the range of possibilities that exist between the actual and the potential. Structured social interaction, or scaffolding, and internalization are the processes that gradually allow potential (for understanding and problem solving) to become actual (concepts and skills). Fischer argued that variations in the development and functioning of different mental skills and functions from the one domain to the other may be the rule rather than the exception.
In his opinion these variations are to be attributed to differences in the experience that individuals have with different domains and also to differences in the support that they receive when interacting with the various domains. In addition, he posited that an individual's true level, which functions as a kind of ceiling for all domains, is the level of his potential, which can only be determined under conditions of maximum familiarity and scaffolding.


The Theory of Andreas Demetriou


The models above do not systematically elaborate on the differences between domains, the role of self-awareness in development, or the role of other aspects of processing efficiency, such as speed of processing and cognitive control. In the theory proposed by Andreas Demetriou and his colleagues, all of these factors are systematically studied. According to this theory, the human mind is organized in three functional levels. The first is the level of processing potentials, which involves information processing mechanisms underlying the ability to attend to, select, represent, and operate on information. The other two levels involve knowing processes, one oriented to the environment and another oriented to the self.[3][13][14] This model is graphically depicted in Figure 1.

Processing potentials
Mental functioning at any moment occurs under the constraints of the processing potentials that are available at a given age. Processing potentials are specified in terms of three dimensions: speed of processing, control of processing, and representational capacity. Speed of processing refers to the maximum speed at which a given mental act may be efficiently executed. It is measured in reference to the reaction time on very simple tasks, such as the time needed to recognize an object. Control of processing involves executive functions that enable the person to keep the mind focused on a goal, protect attention from being captured by irrelevant stimuli, shift focus to other relevant information in a timely fashion when required, and inhibit irrelevant or premature responses, so that a strategic plan of action can be made and sustained. Reaction time in situations where one must choose between two or more alternatives is one measure of control of processing; Stroop effect tasks are also good measures of control of processing. Representational capacity refers to the various aspects of mental power or working memory mentioned above.[13]

Figure 1: The general model of the architecture of the developing mind integrating concepts from the theories of Demetriou and Case.

Domain-specific systems of thought


The level oriented to the environment includes representational and understanding processes and functions that specialize in the representation and processing of information coming from different domains of the environment. Six such environment-oriented systems are described: (1) the categorical system enables categorizations of objects or persons on the basis of their similarities and differences. Forming hierarchies of interrelated concepts about class relationships is an example of the domain of this system. For instance, the general class of plants includes the classes of fruits and vegetables, which, in turn, include the classes of apples and lettuce, etc.; (2) the quantitative system deals with quantitative variations and relations in the environment. Mathematical concepts and operations are examples of the domain of this system; (3) the causal system deals with cause-effect relations. Operations such as trial-and-error or isolation of variable strategies that enable a person to decipher the causal relations between things or persons and ensuing causal concepts and attributions belong to this system; (4) the spatial system deals with orientation in space and the imaginal representation of the environment. Our mental maps of our city or the mental images of familiar persons and objects and operations on them, such as mental rotation, belong to this system; (5) the propositional system deals with the truth/falsity and the validity/invalidity of statements or representations about the environment. Different types of logical relationships, such as implication (if ... then) and conjunction (and ... and) belong to this system; (6) the social system deals with the understanding of social relationships and interactions. Mechanisms for monitoring non-verbal communication or skills for manipulating social interactions belong to this system. This system also includes understanding the general moral principles specifying what is acceptable and what
is unacceptable in human relations. Table 1 summarizes the core processes, mental operations, and concepts that are typical of each domain. The domain specificity of these systems implies that mental processes differ from one system to another. Compare, for instance, arithmetic operations in the quantitative system with mental rotation in the spatial system. The first requires the thinker to relate quantities; the second requires the transformation of the orientation of an object in space. Moreover, the different systems require different kinds of symbols to represent and operate on their objects. Compare, for instance, mathematical symbolism in the quantitative system with mental images in the spatial system. Obviously, these differences make it difficult to equate the concepts and operations across the various systems in the mental load they impose on representational capacity, as the models above assume. Case (1992) also recognized that different types of problem domains, such as the domains of social, mathematical, and spatial thought, may each have a different kind of central conceptual structure. That is, concepts and executive control structures differ across domains in the semantic networks that they involve.[15] As a result, development over different concepts within domains may proceed in parallel but may be uneven across domains. In fact, Case and Demetriou worked together to unify their analysis of domains. That is, they suggested that Demetriou's domains may be specified in terms of Case's central conceptual structures.[16]

Table 1: The three levels of organization of each specialized system of thought.


Hypercognition
The third level includes functions and processes oriented to monitoring, representing, and regulating the environment-oriented systems. The input to this level is information arising from the functioning of processing potentials and the environment-oriented systems, for example, sensations, feelings, and conceptions caused by mental activity. The term hypercognition was used to refer to this level and denote the effects that it exerts on the other two levels of the mind. Hypercognition involves two central functions, namely working hypercognition and long-term hypercognition. Working hypercognition is a strong directive-executive function that is responsible for setting and pursuing mental and behavioral goals until they are attained. This function involves processes enabling the person to: (1) set mental and behavioral goals; (2) plan their attainment; (3) evaluate each step's processing demands vis-à-vis the available potentials, knowledge, skills and strategies; (4) monitor planned activities vis-à-vis the goals; and (5) evaluate the outcome attained. These processes operate recursively in such a way that goals and subgoals may be renewed according to the online evaluation of the system's distance from its ultimate objective. These regulatory functions operate under the current structural constraints of the mind that define the current processing potentials.[14][17] Recent research suggests that these processes participate in general intelligence together with processing potentials and the general inferential processes used by the specialized thought domains described above.[18]

Consciousness is an integral part of the hypercognitive system. The very process of setting mental goals, planning their attainment, monitoring action vis-à-vis both the goals and the plans, and regulating real or mental action requires a system that can remember and review and therefore know itself. Therefore, conscious awareness and all ensuing functions, such as a self-concept (i.e., awareness of one's own mental characteristics, functions, and mental states) and a theory of mind (i.e., awareness of others' mental functions and states) are part of the very construction of the system. In fact, long-term hypercognition gradually builds maps or models of mental functions which are continuously updated. These maps are generally accurate representations of the actual organization of cognitive processes in the
domains mentioned above.[14][18][19] When needed, they can be used to guide problem solving and understanding in the future. Optimum performance at any time depends on the interaction between actual problem solving processes specific to a domain and our representations of them. The interaction between the two levels of mind ensures flexibility of behavior, because the self-oriented level provides the possibility for representing alternative environment-oriented representations and actions and thus provides the possibility for planning.[14][18]


Development
All of the processes mentioned above develop systematically with age. Speed of processing increases systematically from early childhood to middle age and then starts to decrease again. For instance, recognizing a very simple object takes about 750 milliseconds at the age of 6 years and only about 450 milliseconds in early adulthood. Control of processing also becomes more efficient and capable of allowing the person to focus on more complex information, hold attention for longer periods of time, and alternate between increasingly larger stacks of stimuli and responses while filtering out irrelevant information. For instance, recognizing a particular stimulus among conflicting information may take about 2000 milliseconds at the age of 6 years and only about 750 milliseconds in early adulthood.[20]

All components of working memory (e.g., executive functions, numerical, phonological and visuospatial storage) increase with age.[13][20] However, the exact capacity of working memory varies greatly depending upon the nature of the information. For example, in the spatial domain, capacity may vary from 3 units at the age of six to 5 units at the age of 12 years. In the domain of mathematical thought, it may vary from about 2 to about 4 units in the same age period. If executive operations are required, the capacity is much more limited, varying from about 1 unit at 6 to about 3 units at 12 years of age. Demetriou proposed the functional shift model to account for these data.[19] This model presumes that when the mental units of a given level reach a maximum degree of complexity, the mind tends to reorganize these units at a higher level of representation or integration so as to make them more manageable. Having created a new mental unit, the mind prefers to work with this rather than the previous units due to its functional advantages. An example in the verbal domain would be the shift from words to sentences, and in the quantitative domain from natural numbers to algebraic representations of numerical relations. The functional shift model explains how new units are created, leading to stage change in the fashion described by Case[7] and Halford.[21]

The specialized domains develop through the life span both in terms of general trends and in terms of the typical characteristics of each domain. In the age span from birth to middle adolescence, the changes are faster in all of the domains. With development, thought in each of the domains becomes able to deal with increasingly more representations. Moreover, representations become increasingly interconnected with each other and they acquire their meaning from their interrelations rather than simply from their relations with concrete objects. As a result, concepts in each of the domains become increasingly defined in reference to rules and general principles bridging more local concepts and creating new, broader, and more abstract concepts. Moreover, understanding and problem solving in each of the domains evolve from global and less integrated to differentiated, but better integrated, mental operations. As a result, planning and operation from alternatives become increasingly part of the person's functioning, as does the ability to efficiently monitor the problem solving process. This offers flexibility in cognitive functioning and problem solving across the whole spectrum of specialized domains. Table 2 summarizes the development of the domains from early childhood to adolescence.


Table 2: Modal characteristics of the specialized domains with development.

In the hypercognitive system, self-awareness and self-regulation, that is, the ability to regulate one's own cognitive activity, develop systematically with age. Specifically, with development, self-awareness of cognitive processes becomes more accurate and shifts from the external and superficial characteristics of problems (e.g., this is about numbers and this is about pictures) to the cognitive processes involved (e.g., the one requires addition and the other requires mental rotation). Moreover, self-representations: (i) involve more dimensions which are better integrated into increasingly more complex structures; (ii) move along a concrete (e.g., I am fast and strong) to abstract (e.g., I am able) continuum, so that they become increasingly more abstract and flexible; and (iii) become more accurate in regard to the actual characteristics and abilities to which they refer (i.e., persons know where they are cognitively strong and where they are weak). The knowledge available at each phase defines the kind of self-regulation that can be effected. Thus, self-regulation becomes increasingly focused, refined, efficient, and strategic. Practically, this implies that our information processing capabilities come under increasing a priori control of our long-term hypercognitive maps and our self-definitions.[17] Moreover, as we move into middle age, intellectual development gradually shifts from the dominance of systems that are oriented to the processing of the environment (such as spatial and propositional reasoning) to systems that require social support, self-understanding, and management (social understanding). Thus, the transition to mature adulthood makes persons intellectually stronger and more self-aware of their strengths.[22]

There are strong developmental relations between the various processes, such that changes at any level of organization of the mind open the way for changes in other levels. Specifically, changes in speed of processing open the way for changes in the various forms of control of processing. These, in turn, open the way for the enhancement of working memory capacity, which subsequently opens the way for development in inferential processes and for the development of the various specialized domains through the reorganization of domain-specific skills, strategies, and knowledge and the acquisition of new ones.[20]

There are top-down effects as well. That is, general inference patterns, such as implication (if ... then inferences) or disjunction (either ... or inferences), are constructed by mapping domain-specific inference patterns onto each other through the hypercognitive process of metarepresentation. Metarepresentation is the primary top-down mechanism of cognitive change: it looks for, codifies, and typifies similarities between mental experiences (past or present) to enhance understanding and problem-solving efficiency. In logical terms, metarepresentation is analogical reasoning applied to mental experiences or operations, rather than to representations of environmental stimuli. For example, if ... then sentences are heard on many different occasions in everyday language: if you are a good child then I will give you a toy; if it rains and you stay out then you get wet; if the glass falls on the floor then it breaks into pieces; etc. When a child realizes that the sequencing of the if ...
then connectives in language is associated with situations in which the event or thing specified by if always comes first and it leads to the event or thing specified by then, this child is actually formulating the inference schema of implication. With development, the schema becomes a reasoning frame for predictions and interpretations of actual events or conversations about them.[3]


Brain and cognitive development


Modern research on the organization and functioning of the brain lends support to this architecture. This research shows that some general aspects of the brain, such as myelination, plasticity, and connectivity of neurons, are related to some dimensions of general intelligence, such as speed of processing and learning efficiency. Moreover, there are brain regions, located mainly in the frontal and parietal cortex, that subserve functions central to all cognitive processing, such as executive control and working memory. Also, there are many neural networks that specialize in the representation of different types of information, such as verbal (temporal lobe of the brain), spatial (occipital lobe of the brain), or quantitative information (parietal lobe of the brain).[3] Moreover, several aspects of neural development are related to cognitive development. For example, increases in the myelination of neuronal axons, which protect the transmission of electrical signalling along the axons from leakage, are related to changes in general processing efficiency. This, in turn, enhances the capacity of working memory, thereby facilitating transition across the stages of cognitive development.[16] It is also assumed that changes within stages of cognitive development are associated with improvements in neuronal connectivity within brain regions, whereas transitions across stages are associated with improvements in connectivity between brain regions.[23]

Dynamic systems theory


In recent years, there has been increasing interest in theories and methods that show promise for capturing and modeling the regularities underlying multiple interacting and changing processes. Dynamic systems theory is one of them. When multiple processes interact in complex ways, they very often appear to behave unsystematically and unpredictably. In fact, however, they are interconnected in systematic ways, such that the condition of one process at a given point of time t (for example, speed of processing) is responsible for the condition of another process (for example, working memory) at a next point of time t + 1, and together they determine the condition of a third process (for example, thought) at a time t + 2, which then influences the conditions of the other two processes at a time t + 3, etc. Dynamic systems theory can reveal and model the dynamic relationships among different processes and specify the forms of development that result from different types of interaction among processes. The aim is to explain the order and systematicity that exist beneath a surface of apparent disorder or "chaos". There is no limitation as to what processes may be involved in this kind of modeling: the processes may belong to any of the levels of mind, such as the level of processing capacity or the level of problem solving skills.

Paul van Geert[24] was the first to show the promise that dynamic systems theory holds for the understanding of cognitive development. Van Geert assumed that the basic growth model is the so-called "logistic growth model", which suggests that the development of mental processes follows an S-like pattern of change. That is, at the beginning, change is very slow and hardly noticeable; after a given point in time, however, it occurs very rapidly, so that the process or ability spurts to a much higher level in a relatively short period of time; finally, as this process approaches its end state, change decelerates until it stabilizes. According to van Geert, logistic growth is a function of three parameters: the present level, the rate of change, and a limit on the level that can be reached that depends on the available resources for the functioning of the process under consideration. The first parameter, the present level, indicates the potential that a process has for further development: the further away a process is from its end state, the greater its potential for change. The second parameter, the rate of change, is an augmenting or multiplying factor applied to the present level. This may come from pressures for change from the environment or from internal drives or motives for improvement. It operates like the interest rate applied to a no-withdrawal savings account; that is, it indicates the rate at which an ability changes in order to approach its end state. The third parameter refers to the resources available for development. For example, the working memory available is a resource for the development of cognitive processes in any domain. Many theorists, including Case,[8] Demetriou,[25] and Fischer,[26] used dynamic systems modeling to investigate and explore the dynamic relations between cognitive processes during development.
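To make the logistic growth model concrete, the following short sketch simulates the discrete-time update it implies. This is only an illustration of the S-shaped pattern described above, not a reproduction of van Geert's fitted models; the parameter values (starting level, rate, and resource limit) are hypothetical and chosen simply to make the curve visible.

# Minimal sketch of logistic growth in discrete time (Python).
# level : current level of the growing ability (the first parameter above)
# rate  : rate of change (the second parameter, like an interest rate)
# limit : resource-dependent ceiling (the third parameter, e.g., working memory)
def logistic_growth(level=0.05, rate=0.4, limit=1.0, steps=30):
    trajectory = [level]
    for _ in range(steps):
        current = trajectory[-1]
        trajectory.append(current + rate * current * (1 - current / limit))
    return trajectory

if __name__ == "__main__":
    for t, value in enumerate(logistic_growth()):
        # crude text plot: change is slow at first, then rapid, then it levels off
        print(f"t={t:2d}  level={value:5.3f}  " + "#" * int(value * 40))

In coupled versions of this model, the limit of one grower depends on the current level of another (for example, a reasoning skill constrained by working memory), which is how the interactions among processes described above can be modeled.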


Relations between theories


The neo-Piagetian theories above are related. Pascual-Leone, Case, and Halford attempt to explain development along the sequence of Piagetian stages and substages. Pascual-Leone aligned this sequence with a single line of development of mental power that goes from one to seven mental units. Case suggested that each of four main stages involves different kinds of mental structures, and he specified the mental load of the successive levels or substages of complexity within each of the main stages. Moreover, he recognized that there may be different central conceptual structures within each level of executive control structures that differ from each other in the concepts and semantic relations involved. Halford attempted to specify the cognitive load of the mental structure that is typical of each of the main stages. Demetriou integrated into the theory the constructs of speed and control of processing, and he formulated the functional shift model, which unifies Pascual-Leone's notion of an underlying common dimension of capacity development with the notion of qualitative changes in mental structure as development progresses along this dimension. Moreover, Demetriou did justice to the role of self-awareness in cognitive development and to the relative autonomy of the development of different domains of thought. Fischer stressed the importance of skill construction processes in building stage-like constructs and emphasized the role of the environment and social support in skill construction. The Model of Hierarchical Complexity formulated by Michael Commons[27] offers a useful language for describing the successive levels of cognitive development while allowing for explicit reference to the particularities of concepts and operations specific to each of the domains. Dynamic systems modeling can capture and express how different processes interact dynamically when developmental hierarchies are built.

Moreover, Demetriou's theory integrated models from cognitive, psychometric, and developmental psychology into an overarching model that describes the architecture of the human mind, its development, and individual differences in regard to both architecture and development. As far as architecture is concerned, it is maintained that both general and specialized capabilities and processes exist, organized hierarchically so that more complex and specialized processes include more simple or general processes. This type of architecture converges with more than a century of psychometric research[28][29] in suggesting that general intelligence or "g" is a very powerful component of human intelligence. This can be reduced to mechanisms underlying processing efficiency, processing capacity, executive control, and working memory, which have been the primary target of research and theory in cognitive psychology and differential psychology. Many scholars argue that fluid intelligence, that is, the general mechanisms underlying learning, problem solving, and the handling of novelty, depends on these processes.[28][29] Also, changes in these very mechanisms seem able to explain, to a considerable extent, the changes in the quality of understanding and problem solving at successive age levels, which is the object of developmental psychology, and individual differences in regard to it.
Thus, an overarching definition of intelligence can be as follows: the more mentally efficient (that is, the faster and more focused on a goal), capable (that is, the more information one can hold in mind at a given moment), foresighted (that is, the more clearly one can specify one's goals and plan how to achieve them), and flexible (that is, the more one can introduce variations in the concepts and mental operations one already possesses) a person is, the more intelligent (both in regard to other individuals and in regard to a general developmental hierarchy) this person is. In psychometric terms, this is tantamount to saying that differences in the processes associated with g cause differences in general inferential and reasoning mechanisms. In developmental terms, this is tantamount to saying that changes in the processes underlying g result in the qualitative transformation of the general structures of thought underlying understanding and reasoning at successive ages, so that more complex and less familiar problems can be solved and more abstract concepts can be constructed. Thus, differences between persons in IQ or in the rate of development result, additively, from differences in all of the processes mentioned here. On the one hand, this theory surpasses Arthur Jensen's[29] theory of general intelligence in that it recognizes the importance of specialized domains in the human mind, which are underestimated in Jensen's theory. On the other hand, by recognizing the role of general processes and showing how specialized competences are constrained by them, it also surpasses Howard
Gardner's theory of multiple intelligences, which underestimates the operation of common processes.[30]


Implications for education


Education and the psychology of cognitive development converge on a number of crucial assumptions. First, the psychology of cognitive development defines human cognitive competence at successive phases of development. That is, it specifies what aspects of the world can be understood at different ages, what kinds of concepts can be constructed, and what types of problems can be solved. Education aims to help students acquire knowledge and develop skills which are compatible with their understanding and problem-solving capabilities at different ages. Thus, knowing the students' level on a developmental sequence provides information on the kind and level of knowledge they can assimilate, which, in turn, can be used as a frame for organizing the subject matter to be taught at different school grades. This is the reason why Piaget's theory of cognitive development was so influential for education, especially mathematics and science education.

In the 1960s and 1970s, school curricula were designed to implement Piaget's ideas in the classroom. For example, in mathematics, teaching must build on the stage sequence of mathematical understanding. Thus, in preschool and early primary (elementary) school, teaching must focus on building the concept of numbers, because number concepts are still unstable and uncoordinated. In the late primary school years operations on numbers must be mastered, because concrete operational thought provides the mental background for this. In adolescence the relations between numbers and algebra can be taught, because formal operational thought allows for the conception and manipulation of abstract and multidimensional concepts. In science teaching, early primary education should familiarize the children with properties of the natural world, late primary education should lead the children to practice exploration and master basic concepts such as space, area, time, weight, volume, etc., and, in adolescence, hypothesis testing, controlled experimentation, and abstract concepts, such as energy, inertia, etc., can be taught.[31]

In the same direction, the neo-Piagetian theories of cognitive development suggest that, in addition to the concerns above, the sequencing of concepts and skills in teaching must take account of the processing and working memory capacities that characterize successive age levels. In other words, the overall structure of the curriculum across time, in any field, must reflect the developmental processing and representational possibilities of the students as specified by all of the theories summarized above. This is necessary because when understanding of the concepts to be taught at a given age requires more than the available capacity, the necessary relations cannot be worked out by the student.[32] In fact, Demetriou has shown that speed of processing and working memory are excellent predictors of school performance.[33]

Second, the psychology of cognitive development involves understanding how cognitive change takes place and recognizing the factors and processes which enable cognitive competence to develop. Education also capitalizes on cognitive change. The transmission of information and the construction of knowledge presuppose effective teaching methods. Effective teaching methods have to enable the student to move from a lower to a higher level of understanding or to abandon less efficient skills for more efficient ones.
Therefore, knowledge of change mechanisms can be used as a basis for designing instructional interventions that will be both subject- and age-appropriate. Comparison of past to present knowledge, reflection on actual or mental actions vis-à-vis alternative solutions to problems, and tagging new concepts or solutions to symbols that help one recall and mentally manipulate them are just a few examples of how mechanisms of cognitive development may be used to facilitate learning.[34] For example, to support metarepresentation and facilitate the emergence of general reasoning patterns from domain-specific processing, teaching must continually raise awareness in students of what may be abstracted from any particular domain-specific learning. Specifically, the student must be led to become aware of the underlying relations that surpass content differences and of the very mental processes used while handling them (for instance, by elaborating on how particular inference schemas, such as implication, operate in different domains).[35][36]

Finally, the psychology of cognitive development is concerned with individual differences in the organization of cognitive processes and abilities, in their rate of change, and in their mechanisms of change. The principles
underlying intra- and inter-individual differences could be educationally useful, because they highlight why the same student is not an equally good learner in different domains, and why different students in the same classroom react differently to the same instructional materials. For instance, differences between same-age students in the same classroom in processing efficiency and working memory may differentiate these students in their understanding and mastery of the concepts or skills taught at a given moment. That is, students falling behind the demands would most probably have problems in grasping the concepts and skills taught. Thus, knowing the students' potentials in this regard would enable the teacher to develop individual examples of the target concepts and skills that cater for the needs of the different students, so that no one is left behind. Also, differences in the developmental condition, experience, familiarity, or interest with respect to the various domains would most certainly cause differences in how students respond to teaching related to them. This is equally true for both differences between students and differences within the same student. In Case's terms, the central conceptual structures available in different domains would not necessarily match the complexity of the executive control structures that are possible based on the students' processing and representational capacity. As a result, teaching would have to accommodate these differences if it is to lead each of the students to the optimum of their possibilities across all domains. Finally, identifying individual differences with regard to the various aspects of cognitive development could be the basis for the development of programs of individualized instruction, which may focus on the gifted student or which may be of a remedial nature.[35][37]

The discussion here about the educational implications of the neo-Piagetian theories of cognitive development, taken as a whole, suggests that these theories provide a frame for designing educational interventions that is more focused and specific than traditional theories of cognitive development, such as the theory of Piaget, or theories of intelligence, such as the theories discussed above. Of course, much research is still needed for the proper application of these theories to the various aspects of education.


References
[1] Greenberg, D. (1987). Chapter 19, Learning (http://books.google.co.il/books?id=es2nOuZE0rAC&pg=PA91&lpg=PA91&dq="Learning"+Greenberg+Free+at+Last+Learning+-+The+Sudbury+Valley+School&source=bl&ots=TkL0NkwkBG&sig=aTvBo6l-92OZUeeW5tPB4-Nr0m8&hl=en&ei=IEn-SorsDJ2wnQOWuvTzCw&sa=X&oi=book_result&ct=result&resnum=8&ved=0CBwQ6AEwBw#v=onepage&q=&f=false), Free at Last, The Sudbury Valley School. The experience of Sudbury model schools shows that a great variety can be found in the minds of children, against Piaget's theory of universal steps in comprehension and general patterns in the acquisition of knowledge: "No two kids ever take the same path. Few are remotely similar. Each child is so unique, so exceptional" (Greenberg, 1987). Retrieved June 26, 2010.
[2] Demetriou, A. (1998). Cognitive development. In A. Demetriou, W. Doise, K. F. M. van Lieshout (Eds.), Life-span developmental psychology (pp. 179-269). London: Wiley.
[3] Demetriou, A., Mouyi, A., & Spanoudis, G. (2010). The development of mental processing. Nesselroade, J. R. (2010). Methods in the study of life-span human development: Issues and answers. In W. F. Overton (Ed.), Biology, cognition and methods across the life-span. Volume 1 of the Handbook of life-span development (pp. 306-343), Editor-in-chief: R. M. Lerner. Hoboken, NJ: Wiley.
[4] Demetriou, A. (2006). Neo-Piagetische Ansätze. In W. Schneider & F. Wilkening (Eds.), Theorien, Modelle und Methoden der Entwicklungspsychologie. Volume of Enzyklopädie der Psychologie (pp. 191-263). Göttingen: Hogrefe-Verlag.
[5] Mora, S. (2007). Cognitive development: Neo-Piagetian perspectives. London: Psychology Press.
[6] Pascual-Leone, J. (1970). A mathematical model for the transition rule in Piaget's developmental stages. Acta Psychologica, 32, 301-345.
[7] Case, R. (1985). Intellectual development. Birth to adulthood. New York: Academic Press.
[8] Case, R., Okamoto, Y., Griffin, S., McKeough, A., Bleiker, C., Henderson, B., & Stephenson, K. M. (1996). The role of central conceptual structures in the development of children's thought. Monographs of the Society for Research in Child Development, 61 (1-2, Serial No. 246).
[9] Case, R. (1992). The mind's staircase: Exploring the conceptual underpinnings of children's thought and knowledge. Hillsdale, NJ: Erlbaum.
[10] Halford, G. S. (1993). Children's understanding: The development of mental models. Hillsdale, NJ: Erlbaum.
[11] Fischer, K. W. (1980). A theory of cognitive development: The control and construction of hierarchies of skills. Psychological Review, 87, 477-531.
[12] Vygotsky, L. S. (1962). Thought and language. Cambridge, MA: MIT Press.
[13] Demetriou, A., Christou, C., Spanoudis, G., & Platsidou, M. (2002). The development of mental processing: Efficiency, working memory, and thinking. Monographs of the Society for Research in Child Development, 67, Serial Number 268.
[14] Demetriou, A., & Kazi, S. (2001). Unity and modularity in the mind and the self: Studies on the relationships between self-awareness, personality, and intellectual development from childhood to adolescence. London: Routledge.



[15] Case, R. (1992a). The mind's staircase: Exploring the conceptual underpinnings of children's thought and knowledge. Hillsdale, NJ: Erlbaum.
[16] Case, R., Demetriou, A., Platsidou, M., & Kazi, S. (2001). Integrating concepts and tests of intelligence from the differential and the developmental traditions. Intelligence, 29, 307-336.
[17] Demetriou, A. (2000). Organization and development of self-understanding and self-regulation: Toward a general theory. In M. Boekaerts, P. R. Pintrich, & M. Zeidner (Eds.), Handbook of self-regulation (pp. 209-251). Academic Press.
[18] Demetriou, A., & Kazi, S. (2006). Self-awareness in g (with processing efficiency and reasoning). Intelligence, 34, 297-317.
[19] Demetriou, A., Efklides, A., & Platsidou, M. (1993). The architecture and dynamics of developing mind: Experiential structuralism as a frame for unifying cognitive developmental theories. Monographs of the Society for Research in Child Development, 58, Serial Number 234.
[20] Demetriou, A., Mouyi, A., & Spanoudis, G. (2008). Modeling the structure and development of g. Intelligence, 5, 437-454.
[21] Halford, G. S. (1993). Children's understanding: The development of mental models. Hillsdale, NJ: Erlbaum.
[22] Demetriou, A., & Bakracevic, K. (2009). Cognitive development from adolescence to middle age: From environment-oriented reasoning to social understanding and self-awareness. Learning and Individual Differences, 19, 181-194.
[23] Thatcher, R. W. (1992). Cyclic cortical reorganization during early childhood. Brain and Cognition, 20, 24-50.
[24] van Geert, P. (1994). Dynamic systems of development: Change between complexity and chaos. New York: Harvester Wheatsheaf.
[25] Demetriou, A., Christou, C., Spanoudis, G., & Platsidou, M. (2002). The development of mental processing: Efficiency, working memory, and thinking. Monographs of the Society for Research in Child Development, 67, Serial Number 268.
[26] Fischer, K. W., & Bidell, T. R. (1998). Dynamic development of psychological structures in action and thought. In R. M. Lerner (Ed.), & W. Damon (Series Ed.), Handbook of child psychology: Vol. 1. Theoretical models of human development (5th ed., pp. 467-561). New York: Wiley.
[27] Commons, M. L., Trudeau, E. J., Stein, S. A., Richards, S. A., & Krause, S. R. (1998). Hierarchical complexity of tasks shows the existence of developmental stages. Developmental Review, 18, 237-278.
[28] Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York: Cambridge University Press.
[29] Jensen, A. R. (1998). The G factor: The science of mental ability. New York: Praeger.
[30] Gardner, H. (1983). Frames of mind. The theory of multiple intelligences. New York: Basic Books.
[31] Furth, H. G., & Wachs, H. (1975). Thinking goes to school: Piaget's theory in practice. Oxford: Oxford University Press.
[32] Demetriou, A., & Valanides, N. (1998). A three-level theory of the developing mind: Basic principles and implications for instruction and assessment. In R. J. Sternberg & W. M. Williams (Eds.), Intelligence, instruction, and assessment (pp. 149-199). Hillsdale, NJ: Lawrence Erlbaum.
[33] Demetriou, A., Spanoudis, G., & Mouyi, A. (2010). A Three-level Model of the Developing Mind: Functional and Neuronal Substantiation. In M. Ferrari and L. Vuletic (Eds.), The Developmental Relations between Mind, Brain, and Education: Essays in honor of Robbie Case. New York: Springer.
[34] Case, R. (1985). Intellectual development: Birth to adulthood. New York: Academic Press.
[35] Demetriou, A., Spanoudis, G., & Mouyi, A. (2010). A Three-level Model of the Developing Mind: Functional and Neuronal Substantiation. In M. Ferrari and L. Vuletic (Eds.), The Developmental Relations between Mind, Brain, and Education: Essays in honor of Robbie Case. New York: Springer.
[36] Demetriou, A., & Raftopoulos, A. (1999). Modeling the developing mind: From structure to change. Developmental Review, 19, 319-368.
[37] Case, R. (1992). The role of central conceptual structures in the development of children's mathematical and scientific thought. In A. Demetriou, M. Shayer, & A. Efklides (Eds.), Neo-Piagetian theories of cognitive development: Implications and applications to education (pp. 52-65). London: Routledge.



NOMINATE (scaling method)


NOMINATE
W-NOMINATE coordinates of members of the 111th House of Representatives.
Inventors: Keith T. Poole [1], University of Georgia [2]; Howard Rosenthal [3], New York University [4]

NOMINATE (an acronym for Nominal Three-Step Estimation) is a multidimensional scaling method developed by political scientists Keith T. Poole and Howard Rosenthal in the early 1980s to analyze preferential and choice data, such as legislative roll-call voting behavior.[5][6] As computing capabilities grew, Poole and Rosenthal developed multiple iterations of their NOMINATE procedure: the original D-NOMINATE method, W-NOMINATE, and most recently DW-NOMINATE (for dynamic, weighted NOMINATE). In 2009, Poole and Rosenthal were named the first recipients of the Society for Political Methodology's Best Statistical Software Award for their development of NOMINATE, a recognition conferred to "individual(s) for developing statistical software that makes a significant research contribution."[7]

Procedure
Though there are important technical differences between these types of NOMINATE scaling procedures,[8] all operate under the same fundamental assumptions. First, alternative choices can be projected onto a basic, low-dimensional (often two-dimensional) Euclidean space. Second, within that space, individuals have utility functions which are bell-shaped (normally distributed) and maximized at their ideal point. Because individuals also have symmetric, single-peaked utility functions which center on their ideal point, ideal points represent individuals' most preferred outcomes. That is, individuals most desire outcomes closest to their ideal point, and will choose/vote probabilistically for the closest outcome. Ideal points can be recovered from observing choices, with individuals exhibiting similar preferences placed closer together than those behaving dissimilarly. It is helpful to compare this procedure to producing maps based on driving distances between cities. For example, Los Angeles is about 1,800 miles from St. Louis; St. Louis is about 1,200 miles from Miami; and Miami is about 2,700 miles from Los Angeles. From these (dis)similarity data, any map of these three cities should place Miami far from Los Angeles, with St. Louis somewhere in between (though a bit closer to Miami than to Los Angeles). Just as cities like Los Angeles and San Francisco would be clustered on a map, NOMINATE places ideologically similar legislators (e.g., liberal Senators Barbara Boxer (D-Calif.) and Al Franken (D-Minn.)) closer to each other, and farther from dissimilar legislators (e.g., conservative Senator Tom Coburn (R-Okla.)), based on the degree of agreement between their roll call voting records. At the heart of the NOMINATE procedures (and other multidimensional scaling methods, such as Poole's Optimal Classification method) are
the algorithms used to arrange individuals and choices in low-dimensional (usually two-dimensional) space. Thus, NOMINATE scores provide "maps" of legislatures.[9]

Using NOMINATE procedures to study congressional roll call voting behavior from the First Congress to the present day, Poole and Rosenthal published Congress: A Political-Economic History of Roll Call Voting[10] in 1997 and the revised edition Ideology and Congress[11] in 2007. Both were landmark works for their development and application of sophisticated measurement and scaling methods in political science. These works also revolutionized the study of American politics and, in particular, Congress. Their methods provided political scientists, for the first time, with quantitative measures of Representatives' and Senators' ideology across chambers and across time.
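The driving-distance analogy above can be illustrated with a short sketch. The code below applies classical metric multidimensional scaling (double-centering of the squared-distance matrix) to the three city distances quoted in the text. It is only an illustration of recovering a map from (dis)similarity data, not Poole and Rosenthal's actual estimator, which fits a parametric utility model to roll call choices.

# Illustrative sketch (Python/NumPy): classical MDS on the city distances above.
import numpy as np

cities = ["Los Angeles", "St. Louis", "Miami"]
D = np.array([              # approximate driving distances in miles
    [   0.0, 1800.0, 2700.0],
    [1800.0,    0.0, 1200.0],
    [2700.0, 1200.0,    0.0],
])

n = len(cities)
J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]        # keep the two largest dimensions
coords = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))

for city, (x, y) in zip(cities, coords):
    print(f"{city:12s}  x={x:8.1f}  y={y:8.1f}")

The recovered coordinates place St. Louis between the two coastal cities, just as NOMINATE places legislators with similar voting records near one another in its two-dimensional space.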


Keith T. Poole (left) and Howard Rosenthal (right), 1984.

Poole and Rosenthal demonstrate that, despite the many complexities of congressional representation and politics, roll call voting in both the House and the Senate can be organized and explained by no more than two dimensions throughout the sweep of American history. The first dimension (horizontal or x-axis) is the familiar left-right (or liberal-conservative) spectrum on economic matters. The second dimension (vertical or y-axis) picks up attitudes on cross-cutting, salient issues of the day (which include or have included slavery, bimetallism, civil rights, regional, and social/lifestyle issues). For the most part, congressional voting is unidimensional, with most of the variation in voting patterns explained by placement along the liberal-conservative first dimension.

Interpreting scores
For illustrative purposes, consider the following plots, which use W-NOMINATE scores to scale members of Congress and use the probabilistic voting model (in which legislators farther from the cutting line between yea and nay outcomes become more likely to vote in the predicted manner) to illustrate some major Congressional votes in the 1990s. Some of these votes, like the House's vote on President Clinton's welfare reform package (the Personal Responsibility and Work Opportunity Act of 1996), are best modeled through the use of the first (economic liberal-conservative) dimension. On the welfare reform vote, nearly all Republicans joined the moderate-conservative bloc of House Democrats in voting for the bill, while opposition was virtually confined to the most liberal Democrats in the House. The errors (those representatives on the wrong side of the cutting line which separates predicted yeas and predicted nays) are generally close to the cutting line, which is what we would expect: a legislator directly on the cutting line is indifferent between voting yea and nay on the measure. All members are shown on the left panel of the plot, while only errors are shown on the right panel:


Economic ideology also dominates the Senate vote on the Balanced Budget Amendment of 1995:

On other votes, however, a second dimension (which has recently come to represent attitudes on cultural and lifestyle issues) is important. For example, roll call votes on gun control routinely split party coalitions, with socially conservative blue dog Democrats joining most Republicans in opposing additional regulation and socially liberal Republicans joining most Democrats in supporting gun control. The addition of the second dimension accounts for these inter-party differences, and the cutting line is more horizontal than vertical (meaning the cleavage is found on the second dimension rather than the first dimension on these votes). This pattern was evident in the 1991 House vote to require waiting periods on handguns:


Political polarization
Political polarization in the United States House of Representatives.

Poole and Rosenthal (beginning with their 1984 article "The Polarization of American Politics"[12]) have also used NOMINATE data to show that, since the 1970s, party delegations in Congress have become ideologically homogeneous and distant from one another (a phenomenon known as "polarization"). Using DW-NOMINATE scores (which permit direct comparisons between members of different Congresses across time), political scientists have demonstrated the expansion of ideological divides in Congress, which has spurred intense partisanship between Republicans and Democrats in recent decades.[13][14][15][16][17][18][19] Contemporary political polarization has had important consequences for American public policy, as Poole and Rosenthal (with fellow political scientist Nolan McCarty) show in their book Polarized America: The Dance of Ideology and Unequal Riches.[20]
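A common way to summarize the polarization trend described above is to track the gap between the party means (or medians) on the first DW-NOMINATE dimension, Congress by Congress. The sketch below shows the calculation only; the scores listed are hypothetical placeholders, not the published values (which are available from voteview.com).

# Hedged sketch (Python): gap between party means on DW-NOMINATE dimension 1.
from collections import defaultdict

records = [
    # (congress, party, first-dimension score) -- hypothetical examples
    (93, "D", -0.30), (93, "D", -0.25), (93, "R", 0.22), (93, "R", 0.28),
    (113, "D", -0.40), (113, "D", -0.38), (113, "R", 0.48), (113, "R", 0.52),
]

scores = defaultdict(list)
for congress, party, score in records:
    scores[(congress, party)].append(score)

def mean(values):
    return sum(values) / len(values)

for congress in sorted({c for c, _, _ in records}):
    gap = mean(scores[(congress, "R")]) - mean(scores[(congress, "D")])
    print(f"Congress {congress}: distance between party means = {gap:.2f}")

A widening distance between the party means across successive Congresses is the pattern that the polarization literature cited above reports.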

Applications
NOMINATE has been used to test, refine, and/or develop wide-ranging theories and models of the United States Congress.[21][22] In Ideology and Congress (pp. 270-271), Poole and Rosenthal agree that their findings are consistent with the "party cartel" model that Cox and McCubbins present in their 1993 book Legislative Leviathan.[23] Keith Krehbiel utilizes NOMINATE scores to determine the ideological rank order of both chambers of Congress in developing his "pivotal politics" theory,[24] as do Gary Cox and Matthew McCubbins in their tests of whether parties in Congress meet the conditions of responsible party government (RPG).[25] NOMINATE scores are also used by popular media outlets like The New York Times and The Washington Post as a measure of the political ideology of political institutions and elected officials or candidates. Political blogger Nate Silver regularly uses DW-NOMINATE scores to gauge the ideological location of major political figures and institutions.[26][27][28][29] NOMINATE procedures and related roll call scaling techniques have also been applied to a number of other legislative bodies besides the United States Congress. These include the United Nations General Assembly,[30] the European Parliament,[31] national assemblies in Latin America,[32] and the French Fourth Republic.[33] Poole and
Rosenthal note in Chapter 11 of Ideology and Congress that most of these analyses produce the finding that roll call voting is organized by only a few dimensions (usually two): "These findings suggest that the need to form parliamentary majorities limits dimensionality."[34]


References
[1] http://polisci.uga.edu/people/profile/dr_keith_poole
[2] http://www.uga.edu/
[3] http://politics.as.nyu.edu/object/HowardRosenthal
[4] http://www.nyu.edu/
[5] Poole, Keith T. and Howard Rosenthal. 1983. "A Spatial Model for Legislative Roll Call Analysis." GSIA Working Paper No. 5-83-84. http://voteview.com/Upside_Down-A_Spatial_Model_for_Legislative_Roll_Call_Analysis_1983.pdf
[6] Poole, Keith T. and Howard Rosenthal. "A Spatial Model For Legislative Roll Call Analysis." American Journal of Political Science, May 1985, 357-384.
[7] The Society for Political Methodology: Awards. http://polmeth.wustl.edu/about.php?page=awards
[8] Description of NOMINATE Data. http://www.voteview.com/page2a.htm
[10] Poole, Keith T. and Howard Rosenthal. 1997. Congress: A Political-Economic History of Roll Call Voting. New York: Oxford University Press.
[11] Poole, Keith T. and Howard Rosenthal. 2007. Ideology and Congress. New Brunswick, NJ: Transaction Publishers. http://www.transactionpub.com/title/Ideology-and%20Congress-978-1-4128-0608-4.html
[12] Poole, Keith T. and Howard Rosenthal. 1984. "The Polarization of American Politics." Journal of Politics 46: 1061-79. http://www.voteview.com/The_Polarization_of_American_Politics_1984.pdf
[13] Theriault, Sean M. 2008. Party Polarization in Congress. Cambridge: Cambridge University Press.
[14] Jacobson, Gary. 2010. A Divider, Not a Uniter: George W. Bush and the American People. New York: Pearson Longman.
[15] Abramowitz, Alan I. 2010. The Disappearing Center: Engaged Citizens, Polarization, and American Democracy. New Haven, CT: Yale University Press.
[16] Levendusky, Matthew. 2009. The Partisan Sort: How Liberals Became Democrats and Conservatives Became Republicans. Chicago: University of Chicago Press.
[17] Baldassarri, Delia, and Andrew Gelman. 2008. "Partisans without Constraint: Political Polarization and Trends in American Public Opinion." American Journal of Sociology 114(2): 408-46.
[18] Fiorina, Morris P., with Samuel J. Abrams and Jeremy C. Pope. 2005. Culture Wars? The Myth of Polarized America. New York: Pearson Longman.
[19] Hetherington, Marc J. 2001. "Resurgent Mass Partisanship: The Role of Elite Polarization." American Political Science Review 95: 619-631.
[20] McCarty, Nolan, Keith T. Poole and Howard Rosenthal. 2006. Polarized America: The Dance of Ideology and Unequal Riches. Cambridge, MA: MIT Press. http://www.voteview.com/polarizedamerica.asp
[21] Kiewiet, D. Roderick and Matthew D. McCubbins. 1991. The Logic of Delegation. Chicago: University of Chicago Press.
[22] Schickler, Eric. 2000. "Institutional Change in the House of Representatives, 1867-1998: A Test of Partisan and Ideological Power Balance Models." American Political Science Review 94: 269-288.
[23] Cox, Gary W. and Matthew D. McCubbins. 1993. Legislative Leviathan. Berkeley: University of California Press.
[24] Krehbiel, Keith. 1998. Pivotal Politics: A Theory of U.S. Lawmaking. Chicago: University of Chicago Press.
[25] Cox, Gary W. and Matthew D. McCubbins. 2005. Setting the Agenda: Responsible Party Government in the U.S. House of Representatives. New York: Cambridge University Press.
[30] Voeten, Erik. 2001. "Outside Options and the Logic of Security Council Action." American Political Science Review 95: 845-858.
[31] Hix, Simon, Abdul Noury, and Gerald Roland. 2006. "Dimensions of Politics in the European Parliament." American Journal of Political Science 50: 494-511.
[32] Morgenstern, Scott. 2004. Patterns of Legislative Politics: Roll-Call Voting in Latin America and the United States. New York: Cambridge University Press.
[33] Rosenthal, Howard and Erik Voeten. 2004. "Analyzing Roll Calls with Perfect Spatial Voting: France 1946-1958." American Journal of Political Science 48: 620-632.
[34] Poole and Rosenthal, Ideology and Congress, p. 295.


External links
"NOMINATE and American Political History: A Primer." A helpful, more extensive introduction to NOMINATE (http://www.voteview.com/nominate_and_political_history_primer.pdf) Jordan Ellenberg, "Growing Apart: The Mathematical Evidence for Congress' Growing Polarization," Slate Magazine, 26 December 2001 (http://www.slate.com/id/2060047) "NOMINATE: A Short Intellectual History" (by Keith T. Poole) (http://www.voteview.com/nominate.pdf) Voteview website, with NOMINATE scores (http://www.voteview.com) Voteview Blog (http://voteview.com/blog/) W-NOMINATE in R: Software and Examples (http://www.voteview.com/wnominate_in_R.htm) Optimal Classification (OC) in R: Software and Examples (http://www.voteview.com/OC_in_R.htm)

Non-response bias
Non-response bias occurs in statistical surveys if the answers of respondents differ from the potential answers of those who did not answer.

Example
If one selects a sample of 1000 managers in a field and polls them about their workload, the managers with a high workload may not answer the survey because they do not have enough time to answer it, and/or those with a low workload may decline to respond for fear that their supervisors or colleagues will perceive them as unnecessary (either immediately, if the survey is non-anonymous, or in the future, should their anonymity be compromised by collusion, "leaks," insufficient procedural precautions, or data-security breaches). Therefore, non-response bias may make the measured value for the workload too low, too high, or, if the effects of the above biases happen to offset each other, "right for the wrong reasons."

Test
There are different ways to test for non-response bias. In e-mail surveys some values are already known from all potential participants (e.g. age, branch of the firm, ...) and can be compared to the values that prevail in the subgroup of those who answered. If there is no significant difference this is an indicator that there might be no non-response bias. In e-mail surveys those who didn't answer can also systematically be phoned and a small number of survey questions can be asked. If their answers don't differ significantly from those who answered the survey, there might be no non-response bias. This technique is sometimes called non-response follow-up. Generally speaking, the lower the response rate, the greater the likelihood of a non-response bias in play.
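The first check described above (comparing characteristics known for all invitees between respondents and non-respondents) can be sketched as follows. This is a minimal illustration assuming SciPy is available; the two age lists are hypothetical, and a non-significant difference is only weak evidence against bias, since it says nothing about variables that were not observed for non-respondents.

# Hedged sketch (Python): compare a known characteristic (age) of respondents
# and non-respondents with Welch's two-sample t-test.
from scipy import stats

respondent_ages    = [34, 41, 29, 50, 38, 45, 31, 47]   # hypothetical
nonrespondent_ages = [36, 52, 44, 39, 48, 51, 42, 55]   # hypothetical

t_stat, p_value = stats.ttest_ind(respondent_ages, nonrespondent_ages,
                                  equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Respondents and non-respondents differ on age: possible bias signal.")
else:
    print("No significant age difference: no evidence of bias on this variable.")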

Related terminology
Response bias is not the opposite of non-response bias; it instead refers to a possible tendency of respondents to give an answer (a) of which they believe the questioner, or society in general, might approve, or (b) that they perceive would help yield a result promoting some desired goal of their own.
A special issue of Public Opinion Quarterly (Volume 70, Issue 5) is devoted to "Nonresponse Bias in Household Surveys": http://poq.oxfordjournals.org/content/70/5.toc


Norm-referenced test
A norm-referenced test (NRT) is a type of test, assessment, or evaluation which yields an estimate of the position of the tested individual in a predefined population, with respect to the trait being measured. This estimate is derived from the analysis of test scores and possibly other relevant data from a sample drawn from the population.[1] That is, this type of test identifies whether the test taker performed better or worse than other test takers, but not whether the test taker knows either more or less material than is necessary for a given purpose. The term normative assessment refers to the process of comparing one test-taker to his or her peers.[1] Norm-referenced assessment can be contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-referenced assessment, the score shows whether or not the test takers performed well or poorly on a given task, but not how that compares to other test takers; in an ipsative system, the test taker is compared to his previous performance.

Other types
As an alternative to normative testing, a test can be ipsative, that is, the individual's assessment is compared with his or her own earlier performance over time.[2][3] By contrast, a test is criterion-referenced when provision is made for translating the test score into a statement about the behavior to be expected of a person with that score. The same test can be used in both ways.[4] Robert Glaser originally coined the terms norm-referenced test and criterion-referenced test.[] Standards-based education reform is based on the belief that public education should establish what every student should know and be able to do.[5] Students should be tested against a fixed yardstick, rather than against each other or sorted into a mathematical bell curve.[6] By requiring that every student pass these new, higher standards, education officials believe that all students will achieve a diploma that prepares them for success in the 21st century.[7]
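
As a small worked illustration of the distinction (invented numbers, not drawn from the article), the same raw score can be reported in a norm-referenced way, as a percentile within a norming sample, or in a criterion-referenced way, against a fixed passing standard:

    # Illustrative contrast between norm-referenced and criterion-referenced reporting.
    norm_sample = [55, 60, 62, 65, 68, 70, 72, 75, 78, 80, 83, 85, 88, 90, 94]  # invented norming scores

    def percentile_rank(score, sample):
        """Norm-referenced: percentage of the norming sample scoring at or below this score."""
        return 100 * sum(s <= score for s in sample) / len(sample)

    def meets_criterion(score, cut_score=80):
        """Criterion-referenced: pass/fail against a fixed standard, regardless of how others did."""
        return score >= cut_score

    raw = 76
    print(f"Percentile rank (norm-referenced): {percentile_rank(raw, norm_sample):.0f}")
    print(f"Meets the fixed standard of 80 (criterion-referenced): {meets_criterion(raw)}")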

Common use
Most state achievement tests are criterion-referenced. In other words, a predetermined level of acceptable performance is developed and students pass or fail depending on whether they achieve this level. Tests that set goals for students based on the average student's performance are norm-referenced tests. Tests that set goals for students based on a set standard (e.g., 80 words spelled correctly) are criterion-referenced tests. Many college entrance exams and nationally used school tests are norm-referenced. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale for Children (WISC) compare individual student performance to the performance of a normative sample. Test-takers cannot "fail" a norm-referenced test, as each test-taker receives a score that compares the individual to others who have taken the test, usually given as a percentile. This is useful when there is a wide range of acceptable scores that differs for each college. By contrast, nearly two-thirds of US high school students will be required to pass a criterion-referenced high school graduation examination. One high fixed score is set at a level adequate for university admission, whether the high school graduate is college bound or not. Each state gives its own test and sets its own passing level: states like Massachusetts show very high pass rates, while in Washington State even average students fail, as do 80 percent of some minority groups. This practice is opposed by many in the education community, such as Alfie Kohn, as unfair to groups and individuals who do not score as high as others.


Advantages and limitations


An obvious disadvantage of norm-referenced tests is that they cannot measure the progress of the population as a whole, only where individuals fall within the whole. Thus, only measurement against a fixed goal can be used to gauge the success of an educational reform program which seeks to raise the achievement of all students against new standards that assess skills beyond choosing among multiple choices. However, while this is attractive in theory, in practice the bar has often been moved in the face of excessive failure rates, and improvement sometimes occurs simply because of familiarity with, and teaching to, the same test. With a norm-referenced test, grade level was traditionally set at the level achieved by the middle 50 percent of scores.[8] By contrast, the National Children's Reading Foundation believes that it is essential to assure that virtually all children read at or above grade level by third grade, a goal which cannot be achieved with a norm-referenced definition of grade level.[9]

Advantages of this type of assessment include that students and teachers alike know what to expect from the test and how it will be conducted and graded. Likewise, every school conducts the exam in the same manner, reducing such inaccuracies as time differences or environmental differences that may distract the students. This also makes these assessments fairly accurate as far as results are concerned, a major advantage for a test.

Critics of criterion-referenced tests point out that judges set bookmarks around items of varying difficulty without considering whether the items actually comply with grade-level content standards or are developmentally appropriate.[10] Thus, the original 1997 sample problems published for the WASL 4th grade mathematics contained items that were difficult for college-educated adults, or easily solved with 10th grade level methods such as similar triangles.[11] The difficulty level of the items themselves, as well as the cut-scores that determine passing levels, is also changed from year to year.[12] Pass rates also vary greatly from the 4th to the 7th and 10th grade graduation tests in some states.[13] One of the limitations of No Child Left Behind is that each state can choose or construct its own test, which cannot be compared to any other state's.[14] A Rand study of Kentucky results found indications of artificial inflation of pass rates which were not reflected in increasing scores on other tests, such as the NAEP or SAT, given to the same student populations over the same period.[15]

Graduation test standards are typically set at a level consistent with that expected of native-born four-year university applicants.[citation needed] An unusual side effect is that while colleges often admit immigrants with very strong math skills who may be deficient in English, there is no such leeway in high school graduation tests, which usually require passing all sections, including language. Thus, it is not unusual for institutions like the University of Washington to admit strong Asian American or Latino students who did not pass the writing portion of the state WASL test, but such students would not even receive a diploma once the testing requirement is in place. Although tests such as the WASL are intended as a minimal bar for high school, 27 percent of 10th graders applying for Running Start in Washington State failed the math portion of the WASL.
These students had applied to take college-level courses in high school, and achieve at a much higher level than average students. The same study concluded that the level of difficulty was comparable to, or greater than, that of tests intended to place students already admitted to college.[16] A norm-referenced test has none of these problems because it does not seek to enforce any expectation of what all students should know or be able to do beyond what actual students demonstrate. Present levels of performance and inequity are taken as fact, not as defects to be removed by a redesigned system. Goals of student performance are not raised every year until all are proficient. Scores are not required to show continuous improvement through Total Quality Management systems. A corresponding disadvantage is that such an assessment measures where students currently stand relative to their peers, rather than against the level at which students should be performing.

A rank-based system only produces data which tell which students perform at an average level, which students do better, and which students do worse. This contradicts the fundamental belief, whether optimistic or simply unfounded, that all will perform at one uniformly high level in a standards-based system if enough incentives and punishments are put into place. This difference in beliefs underlies the most significant differences between a traditional and a standards-based education system.


Examples
IQ tests are norm-referenced tests, because their goal is to see which test taker is more intelligent than the other test takers. Theater auditions and job interviews are norm-referenced tests, because their goal is to identify the best candidate compared to the other candidates, not to determine how many of the candidates meet a fixed list of standards.

References
[1] Assessment Guided Practices (https://fp.auburn.edu/rse/trans_media/08_Publications/06_Transition_in_Action/chap8.htm)
[2] Assessment (http://www.dmu.ac.uk/~jamesa/teaching/assessment.htm)
[3] PDF presentation (http://www.psychology.nottingham.ac.uk/staff/nfr/rolefunction.pdf)
[4] Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York: Harper & Row.

[5] (http://www.isbe.state.il.us/ils/) Illinois Learning Standards
[6] stories 5-01.html (http://www.fairtest.org/nattest/times) Fairtest.org: Times on Testing: "criterion referenced" tests measure students against a fixed yardstick, not against each other.
[7] (http://www.newhorizons.org/spneeds/improvement/bergeson.htm) By the Numbers: Rising Student Achievement in Washington State, by Terry Bergeson: "She continues her pledge ... to ensure all students achieve a diploma that prepares them for success in the 21st century."
[8] (http://www.nctm.org/news/assessment/2004_04nb.htm) NCTM: News & Media: Assessment Issues (Newsbulletin April 2004): "by definition, half of the nation's students are below grade level at any particular moment"
[9] (http://www.readingfoundation.org/about/about_us.asp) National Children's Reading Foundation website
[10] (http://www.leg.wa.gov/pub/billinfo/2001-02/house/2075-2099/2087_hbr.pdf) HOUSE BILL REPORT HB 2087: "A number of critics ... continue to assert that the mathematics WASL is not developmentally appropriate for fourth grade students."
[11] Prof. Don Orlich, Washington State University
[12] (http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=wasl11m&date=20040511) Panel lowers bar for passing parts of WASL, by Linda Shaw, Seattle Times, May 11, 2004: "A blue-ribbon panel voted unanimously yesterday to lower the passing bar in reading and math for the fourth- and seventh-grade exam, and in reading on the 10th-grade test"
[13] (http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=mathtest06m&date=20021206&query=WASL+7th+grade) Study: Math in 7th-grade WASL is hard, by Linda Shaw, Seattle Times, December 06, 2002: "Those of you who failed the math section ... last spring had a harder test than your counterparts in the fourth or 10th grades."
[14] (http://www.state.nj.us/njded/njpep/assessment/naep/index.html) New Jersey Department of Education: "But we already have tests in New Jersey, why have another test? Our statewide test is an assessment that only New Jersey students take. No comparisons should be made to other states, or to the nation as a whole."
[15] (http://www.rand.org/pubs/research_briefs/RB8017/index1.html) Test-Based Accountability Systems (Rand): "NAEP data are particularly important ... Taken together, these trends suggest appreciable inflation of gains on KIRIS. ..."
[16] (http://www.transitionmathproject.org/assetts/docs/highlights/wasl_report.doc) Relationship of the Washington Assessment of Student Learning (WASL) and Placement Tests Used at Community and Technical Colleges, by Dave Pavelchek, Paul Stern and Dennis Olson, Social & Economic Sciences Research Center, Puget Sound Office, WSU: "The average difficulty ratings for WASL test questions fall in the middle of the range of difficulty ratings for the college placement tests."

External links
A webpage (http://www.citrus.kcusd.com/instruction.htm) about instruction that discusses assessment


Normal curve equivalent


In educational statistics, a normal curve equivalent (NCE), developed for the United States Department of Education by the RMC Research Corporation,[1] is a way of standardizing scores received on a test into a 0-100 scale similar to a percentile rank, but preserving the valuable equal-interval properties of a z-score. It is defined as:
NCE = 50 + (49 / qnorm(.99)) * z, or approximately NCE = 50 + 21.063 * z,
where z is the standard score or "z-score", i.e. how many standard deviations above the mean the raw score lies (z is negative if the raw score is below the mean). The reason for the choice of the number 21.06 is to bring about the following result: if the scores are normally distributed (i.e. they follow the "bell-shaped curve"), then
the normal curve equivalent is 99 if the percentile rank of the raw score is 99;
the normal curve equivalent is 50 if the percentile rank of the raw score is 50;
the normal curve equivalent is 1 if the percentile rank of the raw score is 1.
This relationship between normal curve equivalents and percentile ranks does not hold at values other than 1, 50, and 99. It also fails to hold in general if scores are not normally distributed. The number 21.06 was chosen because: it is desired that a score of 99 correspond to the 99th percentile; the 99th percentile of a normal distribution is 2.3263 standard deviations above the mean; 99 is 49 more than 50, and thus 49 points above the mean; and 49 / 2.3263 = 21.06.
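
To make the arithmetic concrete, the short sketch below (an illustrative example, not part of the original article; the function names are invented) implements the conversion in both directions and reproduces the three fixed points:

    # Illustrative NCE computation: NCE = 50 + (49 / qnorm(0.99)) * z
    from statistics import NormalDist

    SCALE = 49 / NormalDist().inv_cdf(0.99)   # approximately 21.063

    def percentile_to_nce(percentile_rank):
        """Convert a percentile rank (between 0 and 100, exclusive) to a normal curve equivalent."""
        z = NormalDist().inv_cdf(percentile_rank / 100)   # z-score for that percentile
        return 50 + SCALE * z

    def nce_to_percentile(nce):
        """Convert a normal curve equivalent back to a percentile rank."""
        return 100 * NormalDist().cdf((nce - 50) / SCALE)

    # The fixed points: percentile ranks 1, 50 and 99 map to NCEs 1, 50 and 99.
    for pr in (1, 50, 99):
        print(pr, round(percentile_to_nce(pr), 2))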

Normal curve equivalents are on an equal-interval scale (see [2] and [3] for examples). This is advantageous compared to percentile rank scales, which suffer from the problem that the difference between any two scores is not the same as that between any other two scores (see below or percentile rank for more information). The major advantage of NCEs over percentile ranks is that NCEs can be legitimately averaged.[4] The Rochester School Department webpage describes how NCE scores change: In a normally distributed population, if all students were to make exactly one year of progress after one year of instruction, then their NCE scores would remain exactly the same and their NCE gain would be zero, even though their raw scores (i.e. the number of questions they answered correctly) increased. Some students will make more than a year's progress in that time and will have a net gain in the NCE score, which means that those students have learned more, or at least have made more progress in the areas tested, than the general population. Other students, while making progress in their skills, may progress more slowly than the general population and will show a net loss in their NCE ranks.

Caution
Careful consideration is required when computing effect sizes using NCEs. NCEs differ from other scores, such as raw and scaled scores, in the magnitude of the effect sizes. Comparison of NCEs typically results in smaller effect sizes, and using the typical ranges for other effect sizes may result in interpretation errors.[5] Excel formula for conversion from Percentile to NCE: =21.06*NORMSINV(PR/100)+50, where PR is the percentile value. Excel formula for conversion from NCE to Percentile: =100*NORMSDIST((NCE-50)/21.06), where NCE is the Normal Curve Equivalent (NCE) value


References
[1] Mertler, C. A. (2002). Using standardized test data to guide instruction and intervention. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation. (ERIC Document Reproduction Service (http://www.eric.ed.gov/) No. ED470589)

Normal curve equivalent (NCE): A normalized standardized score with a mean of 50 and a standard deviation of 21.06 resulting in a near equal interval scale from 0 to 99. The NCE was developed by RMC Research Corporation in 1976 to measure the effectiveness of the Title I Program across the United States and is often used to measure gains over time. (p. 3)
[2] http://www.rochesterschools.com/Webmaster/StaffHelp/rdgstudy/ncurve2.gif
[3] http://www.citrus.kcusd.com/gif/bellcurve.gif
[4] Rochester School Department (http://www.rochesterschools.com/Webmaster/StaffHelp/rdgstudy/nce.html) webpage
[5] McLean, J. E., O'Neal, M. R., & Barnette, J. J. (2000, November). Are all effect sizes created equal? Paper presented at the Annual Meeting of the Mid-South Educational Research Association, Bowling Green, KY. (ERIC Document Reproduction Service (http://www.eric.ed.gov/) No. ED448188)

External links
Norm Scale Calculator (http://www.psychometrica.de/normwertrechner_en.html) (Utility for the Transformation and Visualization of Norm Scores)
Scholastic Testing Service (http://ststesting.com/explainit.html), a glossary of terms related to the bell or normal curve.
UCLA stats: How should I analyze percentile rank data (http://www.ats.ucla.edu/stat/stata/faq/prank.htm), describing how to convert percentile ranks to NCEs with Stata.

Objective test
An objective test is a psychological test that measures an individual's characteristics in a way that is independent of rater bias or the examiner's own beliefs, usually by the administration of a bank of questions that are marked and compared against exacting, completely standardized scoring mechanisms, much in the same way that examinations are administered. Objective tests are often contrasted with projective tests, which are sensitive to rater or examiner beliefs. Projective tests are based on Freudian psychology (psychoanalysis) and seek to expose the unconscious perceptions of people. Objective tests tend to have more validity than projective tests; however, they are still subject to the willingness of the subject to be open about his or her personality, and as such can sometimes represent the true personality of the subject poorly. Projective tests purportedly expose certain aspects of the personality of individuals that are impossible to measure by means of an objective test, and are much more reliable at uncovering "protected" or unconscious personality traits or features.
An objective test is built by following a rigorous protocol which includes the following steps (a simple illustration of one of these steps, item discrimination on a pilot sample, follows this list):
Making decisions on nature, goal, target population, and power.
Creating a bank of questions.
Estimating the validity of the questions, by means of statistical procedures and/or the judgement of experts in the field.
Designing a format of application (a clear, easy-to-answer questionnaire, or an interview, etc.).
Detecting which questions are better in terms of discrimination, clarity, and ease of response, upon application to a pilot sample.
Applying a revised questionnaire or interview to a sample.
Using appropriate statistical procedures to establish norms for the test.
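
As a rough sketch of the pilot-analysis step mentioned above (an illustrative example only; the response data and function names are invented), one common way to flag weakly discriminating questions is to correlate each item with the total score on the remaining items:

    # Illustrative item-discrimination check on a small pilot sample.
    # Each row is one respondent; each column is one item scored 0 (wrong) or 1 (right).
    from statistics import correlation   # available in Python 3.10+

    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ]

    def item_discrimination(data, item):
        """Corrected item-total correlation: the item versus the sum of the other items."""
        item_scores = [row[item] for row in data]
        rest_totals = [sum(row) - row[item] for row in data]
        return correlation(item_scores, rest_totals)

    for i in range(len(responses[0])):
        print(f"item {i}: discrimination = {item_discrimination(responses, i):.2f}")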


References

Online assessment
Online assessment is the process used to measure certain aspects of information for a set purpose where the assessment is delivered via a computer connected to a network. Most often the assessment is some type of educational test. Different types of online assessments contain elements of one or more of the following components, depending on the assessment's purpose: formative, diagnostic, or summative. Instant and detailed feedback, as well as flexibility of location and time, are just two of the many benefits associated with online assessments. There are many resources available that provide online assessments, some free of charge and others that charge fees or require a membership.

Purpose of assessments
Assessments are a vital part of determining student achievement. They are used to determine the knowledge gained by students and to determine if adjustments need to be made to either the teaching or learning process.[1]

Types of online assessments


Online assessment is used primarily to measure cognitive abilities, demonstrating what has been learned after a particular educational event has occurred, such as the end of an instructional unit or chapter. When assessing practical abilities, or to demonstrate learning that has occurred over a longer period of time, an online portfolio (or ePortfolio) is often used. The first element that must be prepared when teaching an online course is assessment. Assessment is used to determine if learning is happening, to what extent, and if changes need to be made.[2]

Independent work
Independent work is work that a student prepares to assist the instructor in determining their learning progress. Some examples are: exercises, papers, portfolios, and exams (multiple choice, true/false, short answer, fill in the blank, open ended/essay, or matching). To truly evaluate, an instructor must use multiple methods. Most students will not complete assignments unless there is an assessment (i.e. motivation). It is the instructor's role to catalyze student motivation. Appropriate feedback is the key to assessment, whether or not the assessment is graded.[3]

Group work
Students are often asked to work in groups, which brings new assessment strategies. Students can be evaluated using a collaborative learning model, in which the learning is driven by the students, and/or a cooperative learning model, where tasks are assigned and the instructor is involved in decisions.[4]

Uses of online assessments


Pre-Testing - Prior to the teaching of a lesson or concept, a student can complete an online pretest to determine their level of knowledge. This form of assessment helps determine a baseline so that when a summative assessment or post-test is given, quantitative evidence is provided showing that learning has occurred.
Formative Assessment - Formative assessment is used to provide feedback during the learning process. In online assessment situations, objective questions are posed, and feedback is provided to the student either during or immediately after the assessment.
Summative Assessment - Summative assessments provide a quantitative grade and are often given at the end of a unit or lesson to determine that the learning objectives have been met.

Practice Testing - With the ever-increasing use of high-stakes testing in the educational arena, online practice tests are used to give students an edge. Students can take these types of assessments multiple times to familiarize themselves with the content and format of the assessment.
Surveys - Online surveys may be used by educators to collect data and feedback on student attitudes, perceptions or other types of information that might help improve the instruction.
Evaluations - This type of survey allows facilitators to collect data and feedback on any type of situation where the course or experience needs justification or improvement.
Performance Testing - The user shows what they know and what they can do. This type of testing is used to show technological proficiency, reading comprehension, math skills, etc. This assessment is also used to identify gaps in student learning.
New technologies, such as the Web, digital video, sound, animations, and interactivity, are providing tools that can make assessment design and implementation more efficient, timely, and sophisticated.


Academic Dishonesty
Academic dishonesty, commonly known as cheating, occurs at all levels of educational institutions. In traditional classrooms, students cheat in various forms, such as using hidden prepared notes that are not permitted, looking at another student's paper during an exam, copying homework from one another, or copying from a book, article, or other media without properly citing the source. Individuals can be dishonest due to a lack of time-management skills, the pursuit of better grades, cultural behavior, or a misunderstanding of plagiarism.[5] Online classroom environments are no exception to the possibility of academic dishonesty, which can easily be seen from a student's perspective as an easy route to a passing grade. Appropriate assignment types, meetings, and projects can prevent academic dishonesty in the online classroom.[6]

Types of Academic Dishonesty


Two common types of academic dishonesty are identity fraud and plagiarism. Identity fraud can occur in the traditional or online classroom, but the chance is higher in online classes due to the lack of proctored exams or instructor-student interaction. In a traditional classroom, instructors have the opportunity to get to know the students, learn their writing styles, or use proctored exams. To prevent identity fraud in an online class, instructors can use proctored exams through the institution's testing center or require students to come in at a certain time for the exam. Correspondence by phone or video conferencing can allow an instructor to become familiar with a student through their voice and appearance. Another option would be to personalize assignments to students' backgrounds or current activities. This allows the student to apply the assignment to their personal life and gives the instructor more assurance that the actual student is completing it. Lastly, an instructor may choose not to weight the assignments heavily so that students do not feel as pressured.[7] Plagiarism is the misrepresentation of another person's work. It is easy to copy and paste from the internet or retype directly from a source. Plagiarism covers not only the exact wording, but also the thought or idea.[8] It is important to learn to properly cite a source when using someone else's work. Various websites are available that check for plagiarism for a fee:[9] www.canexus.com, www.catchitfirst.com, www.ithenticate.com, www.mydropbox.com, www.turnitin.com


References

Operational definition
An operational definition, also called a functional definition,[1][2] defines something (e.g. a variable, term, or object) in terms of the specific process or set of validation tests used to determine its presence and quantity. That is, one defines something in terms of the operations that count as measuring it.[3] The term was coined in the philosophy of science book The Logic of Modern Physics (1927) by Percy Williams Bridgman, and is a part of the process of operationalization. One might use definitions that rely on operations in order to avoid the troubles associated with attempting to define things in terms of some intrinsic essence.
The operational definition of a peanut butter sandwich might be simply "the result of putting peanut butter on a slice of bread with a butter knife and laying a second equally sized slice of bread on top" (image caption).

An example of an operational definition might be defining the weight of an object in terms of the numbers that appear when that object is placed on a weighing scale. The weight, then, is whatever results from following the (weight) measurement procedure, which should be repeatable by anyone. This is in contrast to operationalization that uses theoretical definitions.

Overview
Properties described in this manner must be sufficiently accessible, so that persons other than the definer may independently measure or test for them at will.[citation needed] An operational definition is generally designed to model a theoretical definition. The most operational definition is a process for identification of an object by distinguishing it from its background of empirical experience. The binary version produces either the result that the object exists, or that it doesn't, in the experiential field to which it is applied. The classifier version results in discrimination between what is part of the object and what is not part of it. This is also discussed in terms of semantics, pattern recognition, and operational techniques, such as regression. Operationalize means to put into operation. Operational definitions are also used to define system states in terms of a specific, publicly accessible process of preparation or validation testing, which is repeatable at will. For example, 100 degrees Celsius may be crudely defined by describing the process of heating water at sea level until it is observed to boil. An item like a brick, or even a photograph of a brick, may be defined in terms of how it can be made. Likewise, iron may be defined in terms of the results of testing or measuring it in particular ways. Vandervert (1980/1988) described in scientific detail a simple, every day illustration of an operational definition in terms of making a cake (i.e., its recipe is an operational definition used in a specialized laboratory known as the household kitchen). Similarly, the saying, if it walks like a duck and quacks like a duck, it must be some kind of duck, may be regarded as involving a sort of measurement process or set of tests (see duck test).


Application
Despite the controversial philosophical origins of the concept, particularly its close association with logical positivism, operational definitions have undisputed practical applications. This is especially so in the social and medical sciences, where operational definitions of key terms are used to preserve the unambiguous empirical testability of hypothesis and theory. Operational definitions are also important in the physical sciences.

Philosophy
The Stanford Encyclopedia of Philosophy says the following about Operationalism as written by Richard Boyd:[4] The idea originally arises in the operationalist philosophy of P. W. Bridgman and others. By 1914, Bridgman was dismayed by the abstraction and lack of clarity with which, he argued, many scientific concepts were expressed. Inspired by logical positivism and the phenomenalism of Ernst Mach, in 1914 he declared that the meaning of a theoretical term (or unobservable entity), such as electron density, lay in the operations, physical and mental, performed in its measurement. The goal was to eliminate all reference to theoretical entities by "rationally reconstructing" them in terms of the particular operations of laboratory procedures and experimentation. Hence, the term electron density could be analyzed into a statement of the following form: (*) The electron density of an object, O, is given by the value, x, if and only if P applied to O yields the value x, where P stands for an instrument that scientists take as a procedure for measuring electron density. Operationalism, defined in this way, was rejected even by the logical positivists, due to inherent problems: defining terms operationally necessarily implied the analytic necessity of the definition. The analyticity of operational definitions like (*) is essential to the project of rational reconstruction. Operationalism is not, for example, the idea that electron density is defined as whatever magnitude instruments of the sort P reliably measure. On that conception (*) would represent an empirical discovery about how to measure electron density, but -- since electrons are unobservables -- that's a realist conception not an empiricist one. What the project of rational reconstruction requires is that (*) be true purely as a matter of linguistic stipulation about how the term "electron density" is to be used. Since (*) is supposed to be analytic, it's supposed to be unrevisable. There is supposed to be no such thing as discovering, about P, that some other instrument provides a more accurate value for electron density, or provides values for electron density under conditions where P doesn't function. Here again, thinking that there could be such an improvement in P with respect to electron density requires thinking of electron density as a real feature of the world which P (perhaps only approximately) measures. But that's the realist conception that operationalism is designed rationally to do away with! In actual, and apparently reliable, scientific practice, changes in the instrumentation associated with theoretical terms are routine, and apparently crucial to the progress of science. According to a 'pure' operationalist conception, these sorts of modifications would not be methodologically acceptable, since each definition must be considered to identify a unique 'object' (or class of objects). In practice, however, an 'operationally defined' object is often taken to be that object which is determined by a constellation of different unique 'operational procedures.' Most logical empiricists were not willing to accept the conclusion that operational definitions must be unique (in contradiction to 'established' scientific practice). So they felt compelled to reject operationalism. In the end, it reduces to a reductio ad absurdum, since each measuring instrument must itself be operationally defined, in infinite regress... But this was also a failure of the logical positivist approach generally. 
However, this rejection of operationalism as a general project destined ultimately to define all experiential phenomena uniquely did not mean that operational definitions ceased to have any practical use or that they could not be applied in particular cases.


Science
The special theory of relativity can be viewed as the introduction of operational definitions for simultaneity of events and of distance, that is, as providing the operations needed to define these terms.[] In quantum mechanics the notion of operational definitions is closely related to the idea of observables, that is, definitions based upon what can be measured.[][] Operational definitions are at their most controversial in the fields of psychology and psychiatry, where intuitive concepts, such as intelligence need to be operationally defined before they become amenable to scientific investigation, for example, through processes such as IQ tests. Such definitions are used as a follow up to a theoretical definition, in which the specific concept is defined as a measurable occurrence. John Stuart Mill pointed out the dangers of believing that anything that could be given a name must refer to a thing and Stephen Jay Gould and others have criticized psychologists for doing just that. A committed operationalist would respond that speculation about the thing in itself, or noumenon, should be resisted as meaningless, and would comment only on phenomena using operationally defined terms and tables of operationally defined measurements. A behaviorist psychologist might (operationally) define intelligence as that score obtained on a specific IQ test (e.g., the Wechsler Adult Intelligence Scale test) by a human subject. The theoretical underpinnings of the WAIS would be completely ignored. This WAIS measurement would only be useful to the extent it could be shown to be related to other operationally defined measurements, e.g., to the measured probability of graduation from university.[5] Operational definitions are the foundation of the diagnostic nomenclature of mental disorders (classification of mental disorders) from the DSM-III onward.[6][7]

Business
On October 15, 1970, the West Gate Bridge in Melbourne, Australia collapsed, killing 35 construction workers. The subsequent enquiry found that the failure arose because engineers had specified the supply of a quantity of flat steel plate. The word flat in this context lacked an operational definition, so there was no test for accepting or rejecting a particular shipment or for controlling quality. In his managerial and statistical writings, W. Edwards Deming placed great importance on the value of using operational definitions in all agreements in business. As he said: "An operational definition is a procedure agreed upon for translation of a concept into measurement of some kind." - W. Edwards Deming "There is no true value of any characteristic, state, or condition that is defined in terms of measurement or observation. Change of procedure for measurement (change of operational definition) or observation produces a new number." - W. Edwards Deming

General process
Operational, in a process context, also can denote a working method or a philosophy that focuses principally on cause and effect relationships (or stimulus/response, behavior, etc.) of specific interest to a particular domain at a particular point in time. As a working method, it does not consider issues related to a domain that are more general, such as the ontological, etc. The term can be used strictly within the realm of the interactions of humans with advanced computational systems. In this sense, an AI system cannot be entirely operational (this issue can be used to discuss strong versus weak AI) if learning is involved. Given that one motive for the operational approach is stability, systems that relax the operational factor can be problematic, for several reasons, as the operational is a means to manage complexity. There will be differences in the nature of the operational as it pertains to degrees along the end-user computing axis.

For instance, a knowledge-based engineering system can enhance its operational aspect and thereby its stability through more involvement by the SME (subject-matter expert), thereby opening up issues of limits that are related to being human, in the sense that, many times, computational results have to be taken at face value due to several factors (hence the duck test's necessity arises) that even an expert cannot overcome. The end proof may be the final results (reasonable facsimile by simulation or artifact, working design, etc.) that are not guaranteed to be repeatable, may have been costly to attain (time and money), and so forth. Many domains, with a numerics focus, use limits logic to overcome the duck test necessity with varying degrees of success. Complex situations may require logic to be more non-monotonic than not, raising concerns related to the qualification, frame, and ramification problems.


Examples
Temperature
The thermodynamic definition of temperature, due to Nicolas Léonard Sadi Carnot, refers to heat "flowing" between "infinite reservoirs". This is all highly abstract and unsuited for the day-to-day world of science and trade. In order to make the idea concrete, temperature is defined in terms of operations with the gas thermometer. However, these are sophisticated and delicate instruments, suited only to a national standardization laboratory. For day-to-day use, the International Temperature Scale of 1990 (ITS-90) is used, defining temperature in terms of characteristics of the several specific sensor types required to cover the full range. One such is the electrical resistance of a thermistor, with specified construction, calibrated against operationally defined fixed points.
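
As a deliberately simplified sketch of that idea (not the actual ITS-90 procedure; the resistance values are invented), temperature can be defined operationally as whatever a specified sensor reads after calibration against agreed fixed points, here by linear interpolation between the ice point and the boiling point of water:

    # Simplified operational definition of temperature: interpolate a sensor reading
    # between two agreed fixed points (0 and 100 degrees Celsius). Real scales such as
    # ITS-90 use more fixed points and non-linear interpolation formulas.
    R_ICE = 1200.0    # invented sensor resistance (ohms) at the ice point, 0 C
    R_BOIL = 800.0    # invented sensor resistance (ohms) at the boiling point, 100 C

    def temperature_celsius(resistance_ohms):
        """Temperature is, under this definition, the interpolated position between the fixed points."""
        return 100.0 * (resistance_ohms - R_ICE) / (R_BOIL - R_ICE)

    print(temperature_celsius(1000.0))   # halfway between the fixed points -> 50.0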

Electric current
Electric current is defined in terms of the force between two infinite parallel conductors, separated by a specified distance. This definition is too abstract for practical measurement, so a device known as a current balance is used to define the ampere operationally.

Mechanical hardness
Unlike temperature and electric current, there is no abstract physical concept of the hardness of a material. It is a slightly vague, subjective idea, somewhat like the idea of intelligence. In fact, it leads to three more specific ideas: 1. scratch hardness, measured on Mohs' scale; 2. indentation hardness; and 3. rebound, or dynamic, hardness, measured with a Shore scleroscope. Of these, indentation hardness itself leads to many operational definitions, the most important of which are: 1. the Brinell hardness test, using a 10 mm steel ball; 2. the Vickers hardness test, using a pyramidal diamond indenter; and 3. the Rockwell hardness test, using a diamond cone indenter. In all of these, a process is defined for loading the indenter, measuring the resulting indentation, and calculating a hardness number. Each of these three sequences of measurement operations produces numbers that are consistent with our subjective idea of hardness. The harder the material to our informal perception, the greater the number it will achieve on our respective hardness scales. Furthermore, experimental results obtained using these measurement methods have shown that the hardness number can be used to predict the stress required to permanently deform steel, a characteristic that fits in well with our idea of resistance to permanent deformation. However, there is not always a simple relationship between the various hardness scales. Vickers and Rockwell hardness numbers exhibit qualitatively different behaviour when used to describe some materials and phenomena.
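
As one concrete instance of such a procedure, the sketch below applies the standard Brinell formula, BHN = 2F / (pi * D * (D - sqrt(D^2 - d^2))), to an invented measurement (the load, ball size, and indentation diameter are illustrative, not taken from the article):

    # Illustrative Brinell hardness number: a specified load, a specified indenter,
    # a measured indentation diameter and a fixed formula together operationally define "hardness".
    import math

    def brinell_hardness(load_kgf, ball_diameter_mm, indentation_diameter_mm):
        """BHN = 2F / (pi * D * (D - sqrt(D**2 - d**2)))."""
        D, d = ball_diameter_mm, indentation_diameter_mm
        return (2 * load_kgf) / (math.pi * D * (D - math.sqrt(D**2 - d**2)))

    # Invented example: 3000 kgf load, 10 mm steel ball, 4.0 mm indentation diameter.
    print(f"BHN = {brinell_hardness(3000, 10, 4.0):.0f}")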


The constellation Virgo


The constellation Virgo is a specific constellation of stars in the sky, hence the process of forming Virgo cannot be an operational definition, since it is historical and not repeatable. Nevertheless, the process whereby we locate Virgo in the sky is repeatable, so in this way, Virgo is operationally defined. In fact, Virgo can have any number of definitions (although we can never prove that we are talking about the same Virgo), and any number may be operational.

Duck typing
In advanced modeling, with the requisite computational support such as knowledge-based engineering, mappings must be maintained between a real-world object, its abstracted counterparts as defined by the domain and its experts, and the computer models. Mismatches between domain models and their computational mirrors can raise issues that are apropos to this topic. Techniques that allow the flexible modeling required for many hard problems must resolve issues of identity, type, etc. which then lead to methods, such as duck typing.

Theoretical vs operational definition


Weight
Theoretical definition: a measurement of gravitational force acting on an object.
Operational definition: a result of measurement of an object on a newton spring scale.

References and notes


Vandervert, L. (1988). Operational definitions made simple, useful, and lasting. In M. Ware & C. Brewer (Eds.), Handbook for teaching statistics and research methods (pp. 132-134). Hillsdale, NJ: Lawrence Erlbaum Associates. (Original work published 1980)
[1] Adanza, Estela G. (1995) Research methods: Principles and Applications (http://books.google.com/books?id=yNmTHbQiPEUC&pg=PA21), p. 21
[2] Sevilla, Consuelo G. et al. (1992) Research methods (http://books.google.com/books?id=SK18tR3vTucC&pg=PA20), revised edition, p. 20

Further reading
Ballantyne, Paul F. History and Theory of Psychology Course, in Langfeld, H.S. (1945) Introduction to the Symposium on Operationism. Psyc. Rev. 32, 241-243. (http://www.comnet.ca/~pballan/operationism(1945). htm) Bohm, D. (1996). On dialog. N.Y.: Routledge. Boyd, Richard. On the Current Status of the Issue of Scientific Realism in Erkenntnis. 19: 45-90. Bridgman, P. W. The way things are. Cambridge: Harvard University Press. (1959) Carnap, R. The Elimination of Metaphysics Through Logical Analysis of Language in Ayer, A.J. 1959. Churchland, Patricia, Neurophilosophy Toward a unified science of the mind/brain, MIT Press (1986). Churchland, Paul., A Neurocomputational Perspective The Nature of Mind and the Structure of Science, MIT Press (1989). Dennett, Daniel C. Consciousness Explained, Little, Brown & Co.. 1992. Depraz, N. (1999). "The phenomenological reduction as praxis." Journal of Consciousness Studies, 6(2-3), 95-110. Hardcastle, G. L. (1995). "S.S. Stevens and the origins of operationism." Philosophy of Science, 62, 404-424. Hermans, H. J. M. (1996). "Voicing the self: from information processing to dialogical interchange." Psychological Bulletin, 119(1), 31-50.

Hyman, Bronwen, U of Toronto, and Shephard, Alfred H., U of Manitoba, "Zeitgeist: The Development of an Operational Definition", The Journal of Mind and Behavior, 1(2), pps. 227-246 (1980) Leahy, Thomas H., Virginia Commonwealth U, The Myth of Operationism, ibid, pps. 127-144 (1980) Ribes-Inesta, Emilio "What Is Defined In Operational Definitions? The Case Of Operant Psychology," Behavior and Philosophy, 2003. (http://www.findarticles.com/p/articles/mi_qa3814/is_200301/ai_n9222880) Roepstorff, A. & Jack, A. (2003). "Editorial introduction, Special Issue: Trusting the Subject? (Part 1)." Journal of Consciousness Studies, 10(9-10), v-xx. Roepstorff, A. & Jack, A. (2004). "Trust or Interaction? Editorial introduction, Special Issue: Trusting the Subject? (Part 2)." Journal of Consciousness Studies, 11(7-8), v-xxii. Stevens, S. S. Operationism and logical positivism, in M. H. Marx (Ed.), Theories in contemporary psychology (pp. 47-76). New York: MacMillan. (1963) Thomson Waddsworth, eds., Learning Psychology: Operational Definitions Research Methods Workshops (http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/res_methd/op_def/op_def_01.html)


Operationalization
In social science and humanities, operationalization is the process of defining a fuzzy concept so as to make the concept clearly distinguishable or measurable and to understand it in terms of empirical observations. In a wider sense it refers to the process of specifying the extension of a concept describing what is and is not a part of that concept. Operationalization often means creating operational definitions and theoretical definitions.

Theory
Early operationalism
An example of operationally defining "personal space".[citation needed] (image caption)

Operationalization is used to specifically refer to the scientific practice of operationally defining, where even the most basic concepts are defined through the operations by which we measure them. This comes from the philosophy of science book The Logic of Modern Physics (1927), by Percy Williams Bridgman, whose methodological position is called operationalism.[1] Bridgman's theory was criticized: because we measure "length" in various ways (e.g. it is impossible to use a measuring rod if we want to measure the distance to the Moon), "length" logically isn't one concept but many,[citation needed] with each concept defined by the measuring operations used. Another example is the radius of a sphere, which obtains different values depending on the way it is measured (say, in metres and in millimeters). Bridgman said the concept is defined by the measurement, so the criticism is that we would end up with endless concepts, each defined by the things that measured the concept.[citation needed] Bridgman notes that in the theory of relativity we see how a concept like "duration" can split into multiple different concepts. As part of the process of refining a physical theory, it may be found that what was one concept is, in fact, two or more distinct concepts. However, Bridgman proposes that if we only stick to operationally defined concepts, this will never happen.


Operationalization
The practical 'operational definition' is generally understood as relating to the theoretical definitions that describe reality through the use of theory. The importance of careful operationalization can perhaps be more clearly seen in the development of General Relativity. Einstein discovered that there were two operational definitions of "mass" being used by scientists: inertial, defined by applying a force and observing the acceleration, from Newton's Second Law of Motion; and gravitational, defined by putting the object on a scale or balance. Previously, no one had paid any attention to the different operations used because they always produced the same results,[citation needed] but the key insight of Einstein was to posit the Principle of Equivalence that the two operations would always produce the same result because they were equivalent at a deep level, and work out the implications of that assumption, which is the General Theory of Relativity. Thus, a breakthrough in science was achieved by disregarding different operational definitions of scientific measurements and realizing that they both described a single theoretical concept. Einstein's disagreement with the operationalist approach was criticized by Bridgman[2] as follows: "Einstein did not carry over into his general relativity theory the lessons and insights he himself has taught us in his special theory." (p.335).

Operationalization in the social sciences


Operationalization is often used in the social sciences as part of the scientific method and psychometrics.

Anger example
For example, a researcher may wish to measure the concept "anger." Its presence, and the depth of the emotion, cannot be directly measured by an outside observer because anger is intangible. Rather, other measures are used by outside observers, such as facial expression, choice of vocabulary, loudness and tone of voice.

An operationalization diagram, used to illustrate obscure or ambiguous concepts in an academic paper. This particular example is tailored to use in the field of Political Science. (image caption)

If a researcher wants to measure the depth of "anger" in various persons, the most direct operation would be to ask them a question, such as "are you angry", or "how angry are you?". This operation is problematic, however, because it depends upon the definition of the individual. Some people might be subjected to a mild annoyance, and become slightly angry, but describe themselves as "extremely angry," whereas others might be subjected to a severe provocation, and become very angry, but describe themselves as "slightly angry." In addition, in many circumstances it is impractical to ask subjects whether they are angry.

Since one of the measures of anger is loudness, the researcher can operationalize the concept of anger by measuring how loudly the subject speaks compared to their normal tone. However, this must assume that loudness is a uniform measure. Some might respond verbally while others might respond physically. This makes anger difficult to operationalize as a single variable.
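
A toy sketch of such an operationalization (purely illustrative; the threshold and decibel readings are invented) reduces "anger" to a repeatable comparison of measured loudness against the speaker's own baseline, and in doing so also exposes the weakness discussed above: the threshold is arbitrary, and loudness is not a uniform response.

    # Toy operationalization: "anger" defined as speaking markedly louder than one's own baseline.
    def anger_score(baseline_db, observed_db):
        """Return the increase in loudness over the speaker's normal tone, in decibels."""
        return observed_db - baseline_db

    def is_angry(baseline_db, observed_db, threshold_db=10):
        """Crude operational rule: 'angry' means speaking at least 10 dB above one's baseline."""
        return anger_score(baseline_db, observed_db) >= threshold_db

    print(is_angry(baseline_db=55, observed_db=68))   # True under this definition
    print(is_angry(baseline_db=55, observed_db=60))   # False under this definition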


Economics objections
One of the main critics of operationalism in social science argues that "the original goal was to eliminate the subjective mentalistic concepts that had dominated earlier psychological theory and to replace them with a more operationally meaningful account of human behavior. But, as in economics, the supporters ultimately ended up "turning operationalism inside out".[3] "Instead of replacing 'metaphysical' terms such as 'desire' and 'purpose'" they "used it to legitimize them by giving them operational definitions." Thus in psychology, as in economics, the initial, quite radical operationalist ideas eventually came to serve as little more than a "reassurance fetish"[4] for mainstream methodological practice."[5]

Tying operationalization to conceptual frameworks


The above discussion links operationalization to measurement of concepts. Many scholars have worked to operationalize concepts like job satisfaction, prejudice, anger, etc. Scale and index construction are forms of operationalization. Operationalization is part of the empirical research process. Take, for example, an empirical research question: Does job satisfaction influence job turnover? Both job satisfaction and job turnover need to be measured. The concepts and their relationship are important; operationalization occurs within a larger framework of concepts. When there is a large empirical research question or purpose, the conceptual framework that organizes the response to the question must be operationalized before the data collection can begin. If a scholar constructs a questionnaire based on a conceptual framework, they have operationalized the framework. Most serious empirical research should involve operationalization that is transparent and linked to a conceptual framework. To use an oversimplified example, the hypothesis "job satisfaction reduces job turnover" is one way to connect (or frame) two concepts: job satisfaction and job turnover. The process of moving from the idea of job satisfaction to the set of questionnaire items that form a job satisfaction scale is operationalization. For most of us, operationalization outside the larger issue of a research question and conceptual framework is just not very interesting. In the field of Public Administration, Shields and Tajalli (2006) have identified five kinds of conceptual frameworks (working hypothesis, descriptive categories, practical ideal type, operations research, and formal hypothesis). They explain and illustrate how each of these conceptual frameworks can be operationalized. They also show how to make conceptualization and operationalization more concrete by demonstrating how to form conceptual framework tables that are tied to the literature and operationalization tables that lay out the specifics of how to operationalize the conceptual framework (measure the concepts).[6] To see examples of research projects that use conceptual framework and operationalization tables, see http://ecommons.txstate.edu/arp/


Notes
[1] The basic operationalist thesis (which can be considered a variation on the positivist theme) was that all theoretical terms must be defined via the operations by which one measured them; see Crowther-Heyck, Hunter (2005), Herbert A. Simon: The Bounds of Reason in Modern America, JHU Press, p. 65 (http://books.google.com/books?id=LV1rnS9NBjkC&pg=PA65).
[2] P.W. Bridgman, Einstein's Theories and the Operational Point of View, in: P.A. Schilpp, ed., Albert Einstein: Philosopher-Scientist, Open Court, La Salle, Ill., Cambridge University Press, 1982, Vol. 2, pp. 335-354.
[3] Green (2001), Operationalism Again: What Did Bridgman Say? What Did Bridgman Need?, Theory and Psychology 11 (2001), p. 49.
[4] Koch, Sigmund (1992), Psychology's Bridgman vs. Bridgman's Bridgman: An Essay in Reconstruction, Theory and Psychology, vol. 2, no. 3 (1992), p. 275.
[5] Wade Hands (2004), "On operationalisms and economics" (December 2004) (http://www.redorbit.com/news/science/112364/on_operationalisms_and_economics/)

Bibliography
Bridgman, P.W. (1927). The Logic of Modern Physics.

Opinion poll
An opinion poll, sometimes simply referred to as a poll, is a survey of public opinion drawn from a particular sample. Opinion polls are usually designed to represent the opinions of a population by asking a sample a series of questions and then extrapolating the answers to the larger population, in ratios or within confidence intervals.
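
For instance, the margin of error usually reported alongside a poll's headline percentage comes from a confidence interval for a sample proportion. The sketch below (illustrative only, with an invented sample; it assumes simple random sampling) computes a 95% interval:

    # Illustrative 95% confidence interval for a polled proportion under simple random sampling.
    import math

    def poll_confidence_interval(successes, sample_size, z=1.96):
        """Return (proportion, margin of error, lower bound, upper bound) for a 95% interval."""
        p = successes / sample_size
        margin = z * math.sqrt(p * (1 - p) / sample_size)
        return p, margin, p - margin, p + margin

    # Invented example: 520 of 1,000 respondents favour a candidate.
    p, moe, low, high = poll_confidence_interval(520, 1000)
    print(f"{p:.1%} +/- {moe:.1%} (95% CI: {low:.1%} to {high:.1%})")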

History
The first known example of an opinion poll was a local straw poll conducted by The Harrisburg Pennsylvanian in 1824, showing Andrew Jackson leading John Quincy Adams by 335 votes to 169 in the contest for the United States Presidency. Since Jackson won the popular vote in that state and the whole country, such straw votes gradually became more popular, but they remained local, usually city-wide phenomena. In 1916, the Literary Digest embarked on a national survey (partly as a circulation-raising exercise) and correctly predicted Woodrow Wilson's election as president. Mailing out millions of postcards and simply counting the returns, the Digest correctly predicted the victories of Warren Harding in 1920, Calvin Coolidge in 1924, Herbert Hoover in 1928, and Franklin Roosevelt in 1932. Then, in 1936, its 2.3 million "voters" constituted a huge sample; however, they were generally more affluent Americans who tended to have Republican sympathies. The Literary Digest was ignorant of this new bias. The week before election day, it reported that Alf Landon was far more popular than Roosevelt. At the same time, George Gallup conducted a far smaller, but more scientifically based survey, in which he polled a demographically representative sample. Gallup correctly predicted Roosevelt's landslide victory. The Literary Digest soon went out of business, while polling started to take off. Elmo Roper was another American pioneer in political forecasting using scientific polls.[] He predicted the reelection of President Franklin D. Roosevelt three times, in 1936, 1940, and 1944. Louis Harris had been in the field of public opinion since 1947, when he joined the Elmo Roper firm, later becoming a partner. In September 1938 Jean Stoetzel, after having met Gallup, created IFOP, the Institut Français d'Opinion Publique, as the first European survey institute in Paris and started political polls in summer 1939 with the question "Why die for Danzig?", looking for popular support or dissent with this question asked by appeasement politician and future collaborationist Marcel Déat. Gallup launched a subsidiary in the United Kingdom that, almost alone, correctly predicted Labour's victory in the 1945 general election, unlike virtually all other commentators, who expected a victory for the Conservative Party, led by Winston Churchill.

The Allied occupation powers helped to create survey institutes in all of the Western occupation zones of Germany in 1947 and 1948 to better steer denazification. By the 1950s, various types of polling had spread to most democracies.


Sample and polling methods


Opinion polls for many years were conducted through telecommunications or in person-to-person contact. Methods and techniques vary, though they are widely accepted in most areas. Verbal, ballot, and processed types can be conducted efficiently, contrasted with other types of surveys, systematics, and complicated matrices beyond previous orthodox procedures.[citation needed] Opinion polling developed into popular applications through popular thought, although response rates for some surveys have declined. The following has also led to differentiating results:[] Some polling organizations, such as Angus Reid Public Opinion, YouGov and Zogby, use Internet surveys, where a sample is drawn from a large panel of volunteers and the results are weighted to reflect the demographics of the population of interest. In contrast, popular web polls draw on whoever wishes to participate rather than a scientific sample of the population, and are therefore not generally considered professional.

Voter polling questionnaire on display at the Smithsonian Institution

Polls can be used in the public relations field as well. In the early 1920s public relations experts described their work as a two-way street. Their job would be to present the misinterpreted interests of large institutions to the public. They would also gauge the typically ignored interests of the public through polls.

Benchmark polls
A benchmark poll is generally the first poll taken in a campaign. It is often taken before a candidate announces their bid for office but sometimes it happens immediately following that announcement after they have had some opportunity to raise funds. This is generally a short and simple survey of likely voters. A benchmark poll serves a number of purposes for a campaign, whether it is a political campaign or some other type of campaign. First, it gives the candidate a picture of where they stand with the electorate before any campaigning takes place. If the poll is done prior to announcing for office the candidate may use the poll to decide whether or not they should even run for office. Secondly, it shows them where their weaknesses and strengths are in two main areas. The first is the electorate. A benchmark poll shows them what types of voters they are sure to win, those who they are sure to lose, and everyone in-between those two extremes. This lets the campaign know which voters are persuadable so they can spend their limited resources in the most effective manner. Second, it can give them an idea of what messages, ideas, or slogans are the strongest with the electorate.[1]

Brushfire polls
Brushfire polls are polls taken during the period between the benchmark poll and tracking polls. The number of brushfire polls taken by a campaign is determined by how competitive the race is and how much money the campaign has to spend. These polls usually focus on likely voters and the length of the survey varies with the number of messages being tested. Brushfire polls are used for a number of purposes. First, it lets the candidate know if they have made any progress on the ballot, how much progress has been made, and in what demographics they have been making or losing ground. Secondly, it is a way for the campaign to test a variety of messages, both positive and negative, on themselves and their opponent(s). This lets the campaign know what messages work best with certain demographics and what messages should be avoided. Campaigns often use these polls to test possible attack messages that their opponent

may use and potential responses to those attacks. The campaign can then spend some time preparing an effective response to any likely attacks. Thirdly, this kind of poll can be used by candidates or political parties to convince primary challengers to drop out of a race and support a stronger candidate.


Tracking polls
A tracking poll is a poll repeated at intervals generally averaged over a trailing window.[] For example, a weekly tracking poll uses the data from the past week and discards older data. A caution is that estimating the trend is more difficult and error-prone than estimating the level: intuitively, if one estimates the change, the difference between two numbers X and Y, then one has to contend with the error in both X and Y; it is not enough to simply take the difference, as the change may be random noise. For details, see t-test. A rough guide is that if the change in measurement falls outside the margin of error, it is worth attention.
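To make the point concrete, the sampling error of a change combines the errors of the two underlying estimates. The sketch below illustrates this with the usual normal-approximation margin of error for proportions; the function names and the example figures are illustrative only.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a single proportion estimate."""
    return z * math.sqrt(p * (1 - p) / n)

def margin_of_error_of_change(p1, n1, p2, n2, z=1.96):
    """Margin of error for the change p2 - p1 between two independent waves.
    The standard errors add in quadrature, so the change is noisier than either level."""
    se1 = math.sqrt(p1 * (1 - p1) / n1)
    se2 = math.sqrt(p2 * (1 - p2) / n2)
    return z * math.sqrt(se1**2 + se2**2)

# Two weekly waves of about 1,000 respondents, with support moving from 46% to 48%
print(margin_of_error(0.48, 1000))                        # ~0.031 (about 3 points)
print(margin_of_error_of_change(0.46, 1000, 0.48, 1000))  # ~0.044, so a 2-point shift is within noise
```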

Potential for inaccuracy


Polls based on samples of populations are subject to sampling error which reflects the effects of chance and uncertainty in the sampling process. The uncertainty is often expressed as a margin of error. The margin of error is usually defined as the radius of a confidence interval for a particular statistic from a survey. One example is the percent of people who prefer product A versus product B. When a single, global margin of error is reported for a survey, it refers to the maximum margin of error for all reported percentages using the full sample from the survey. If the statistic is a percentage, this maximum margin of error can be calculated as the radius of the confidence interval for a reported percentage of 50%. Others suggest that a poll with a random sample of 1,000 people has a margin of sampling error of 3% for the estimated percentage of the whole population. A 3% margin of error means that if the same procedure is used a large number of times, 95% of the time the true population average will be within the sample estimate plus or minus 3%. The margin of error can be reduced by using a larger sample; however, if a pollster wishes to reduce the margin of error to 1% they would need a sample of around 10,000 people.[2] In practice, pollsters need to balance the cost of a large sample against the reduction in sampling error, and a sample size of around 500–1,000 is a typical compromise for political polls. (Note that to get complete responses it may be necessary to include thousands of additional participants.)[3] Another way to reduce the margin of error is to rely on poll averages. This makes the assumption that the procedure is similar enough between many different polls and uses the sample size of each poll to create a polling average.[4] An example of a polling average can be found here: 2008 Presidential Election polling average [5]. Another source of error stems from faulty demographic models by pollsters who weight their samples by particular variables such as party identification in an election. For example, if you assume that the breakdown of the US population by party identification has not changed since the previous presidential election, you may underestimate a victory or a defeat of a particular party candidate that saw a surge or decline in its party registration relative to the previous presidential election cycle. Over time, a number of theories and mechanisms have been offered to explain erroneous polling results. Some of these reflect errors on the part of the pollsters; many of them are statistical in nature. Others blame the respondents for not giving candid answers (e.g., the Bradley effect, the Shy Tory Factor); these can be more controversial.
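The sample sizes quoted above follow directly from the worst-case (50%) margin-of-error approximation referred to in footnote [2]; a minimal sketch:

```python
import math

def max_margin_of_error(n, z=1.96):
    """Worst-case (p = 0.5) margin of error for a simple random sample of size n."""
    return z * math.sqrt(0.25 / n)

for n in (500, 1000, 10000):
    print(n, round(100 * max_margin_of_error(n), 1))  # ~4.4%, ~3.1%, ~1.0%
```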

Nonresponse bias
Since some people do not answer calls from strangers, or refuse to answer the poll, poll samples may not be representative samples from a population due to a non-response bias. Because of this selection bias, the characteristics of those who agree to be interviewed may be markedly different from those who decline. That is, the actual sample is a biased version of the universe the pollster wants to analyze. In these cases, bias introduces new errors, one way or the other, that are in addition to errors caused by sample size. Error due to bias does not become

smaller with larger sample sizes, because taking a larger sample size simply repeats the same mistake on a larger scale. If the people who refuse to answer, or are never reached, have the same characteristics as the people who do answer, then the final results should be unbiased. If the people who do not answer have different opinions then there is bias in the results. In terms of election polls, studies suggest that bias effects are small, but each polling firm has its own techniques for adjusting weights to minimize selection bias.[]


Response bias
Survey results may be affected by response bias, where the answers given by respondents do not reflect their true beliefs. This may be deliberately engineered by unscrupulous pollsters in order to generate a certain result or please their clients, but more often is a result of the detailed wording or ordering of questions (see below). Respondents may deliberately try to manipulate the outcome of a poll by e.g. advocating a more extreme position than they actually hold in order to boost their side of the argument or give rapid and ill-considered answers in order to hasten the end of their questioning. Respondents may also feel under social pressure not to give an unpopular answer. For example, respondents might be unwilling to admit to unpopular attitudes like racism or sexism, and thus polls might not reflect the true incidence of these attitudes in the population. In American political parlance, this phenomenon is often referred to as the Bradley effect. If the results of surveys are widely publicized this effect may be magnified - a phenomenon commonly referred to as the spiral of silence.

Wording of questions
It is well established that the wording of the questions, the order in which they are asked and the number and form of alternative answers offered can influence results of polls. For instance, the public is more likely to indicate support for a person who is described by the operator as one of the "leading candidates". This support itself overrides subtle bias for one candidate, as does lumping some candidates in an "other" category or vice versa. Thus comparisons between polls often boil down to the wording of the question. On some issues, question wording can result in quite pronounced differences between surveys.[6][7][8] This can also, however, be a result of legitimately conflicted feelings or evolving attitudes, rather than a poorly constructed survey.[9] A common technique to control for this bias is to rotate the order in which questions are asked. Many pollsters also split-sample. This involves having two different versions of a question, with each version presented to half the respondents. The most effective controls, used by attitude researchers, are: asking enough questions to allow all aspects of an issue to be covered and to control effects due to the form of the question (such as positive or negative wording), the adequacy of the number being established quantitatively with psychometric measures such as reliability coefficients, and analyzing the results with psychometric techniques which synthesize the answers into a few reliable scores and detect ineffective questions. These controls are not widely used in the polling industry.



Coverage bias
Another source of error is the use of samples that are not representative of the population as a consequence of the methodology used, as was the experience of the Literary Digest in 1936. For example, telephone sampling has a built-in error because in many times and places, those with telephones have generally been richer than those without. In some places many people have only mobile telephones. Because pollsters cannot call mobile phones (it is unlawful in the United States to make unsolicited calls to phones where the phone's owner may be charged simply for taking a call), these individuals are typically excluded from polling samples. There is concern that, if the subset of the population without cell phones differs markedly from the rest of the population, these differences can skew the results of the poll. Polling organizations have developed many weighting techniques to help overcome these deficiencies, with varying degrees of success. Studies of mobile phone users by the Pew Research Center in the US, in 2007, concluded that "cell-only respondents are different from landline respondents in important ways, (but) they were neither numerous enough nor different enough on the questions we examined to produce a significant change in overall general population survey estimates when included with the landline samples and weighted according to US Census parameters on basic demographic characteristics."[] This issue was first identified in 2004,[] but came to prominence only during the 2008 US presidential election.[] In previous elections, the proportion of the general population using cell phones was small, but as this proportion has increased, there is concern that polling only landlines is no longer representative of the general population. In 2003, only 2.9% of households were wireless (cellphones only), compared to 12.8% in 2006.[] This results in "coverage error". Many polling organisations select their sample by dialling random telephone numbers; however, in 2008, there was a clear tendency for polls which included mobile phones in their samples to show a much larger lead for Obama, than polls that did not.[][] The potential sources of bias are:[] 1. Some households use cellphones only and have no landline. This tends to include minorities and younger voters; and occurs more frequently in metropolitan areas. Men are more likely to be cellphone-only compared to women. 2. Some people may not be contactable by landline from Monday to Friday and may be contactable only by cellphone. 3. Some people use their landlines only to access the Internet, and answer calls only to their cellphones. Some polling companies have attempted to get around that problem by including a "cellphone supplement". There are a number of problems with including cellphones in a telephone poll: 1. It is difficult to get co-operation from cellphone users, because in many parts of the US, users are charged for both outgoing and incoming calls. That means that pollsters have had to offer financial compensation to gain co-operation. 2. US federal law prohibits the use of automated dialling devices to call cellphones (Telephone Consumer Protection Act of 1991). Numbers therefore have to be dialled by hand, which is more time-consuming and expensive for pollsters. An oft-quoted example of opinion polls succumbing to errors occurred during the UK General Election of 1992. 
Despite the polling organizations using different methodologies, virtually all the polls taken before the vote, and to a lesser extent, exit polls taken on voting day, showed a lead for the opposition Labour party, but the actual vote gave a clear victory to the ruling Conservative party. In their deliberations after this embarrassment the pollsters advanced several ideas to account for their errors, including:

Late swing: Voters who changed their minds shortly before voting tended to favour the Conservatives, so the error was not as great as it first appeared.

Nonresponse bias: Conservative voters were less likely to participate in surveys than in the past and were thus under-represented.

The Shy Tory Factor: The Conservatives had suffered a sustained period of unpopularity as a result of economic difficulties and a series of minor scandals, leading to a spiral of silence in which some Conservative supporters were reluctant to disclose their sincere intentions to pollsters.

The relative importance of these factors was, and remains, a matter of controversy, but since then the polling organizations have adjusted their methodologies and have achieved more accurate results in subsequent election campaigns.[citation needed]


Failures
One of the most widely publicized failures of opinion polling to date in the United States was the prediction that Thomas Dewey would defeat Harry S. Truman in the 1948 US presidential election. Major polling organizations, including Gallup and Roper, indicated a landslide victory for Dewey. In the United Kingdom, most polls failed to predict the Conservative election victories of 1970 and 1992, and Labour's victory in 1974. However, their figures at other elections have been generally accurate.

Influence
Effect on voters
By providing information about voting intentions, opinion polls can sometimes influence the behavior of electors, and in his book The Broken Compass, Peter Hitchens asserts that opinion polls are actually a device for influencing public opinion.[] The various theories about how this happens can be split into two groups: bandwagon/underdog effects, and strategic ("tactical") voting. A bandwagon effect occurs when the poll prompts voters to back the candidate shown to be winning in the poll. The idea that voters are susceptible to such effects is old, stemming at least from 1884; William Safire reported that the term was first used in a political cartoon in the magazine Puck in that year.[10] It has also remained persistent in spite of a lack of empirical corroboration until the late 20th century. George Gallup spent much effort in vain trying to discredit this theory in his time by presenting empirical research. A recent meta-study of scientific research on this topic indicates that from the 1980s onward the Bandwagon effect is found more often by researchers.[11] The opposite of the bandwagon effect is the underdog effect. It is often mentioned in the media. This occurs when people vote, out of sympathy, for the party perceived to be "losing" the elections. There is less empirical evidence for the existence of this effect than there is for the existence of the bandwagon effect.[11] The second category of theories on how polls directly affect voting is called strategic or tactical voting. This theory is based on the idea that voters view the act of voting as a means of selecting a government. Thus they will sometimes not choose the candidate they prefer on ground of ideology or sympathy, but another, less-preferred, candidate from strategic considerations. An example can be found in the United Kingdom general election, 1997. As he was then a Cabinet Minister, Michael Portillo's constituency of Enfield Southgate was believed to be a safe seat but opinion polls showed the Labour candidate Stephen Twigg steadily gaining support, which may have prompted undecided voters or supporters of other parties to support Twigg in order to remove Portillo. Another example is the boomerang effect where the likely supporters of the candidate shown to be winning feel that chances are slim and that their vote is not required, thus allowing another candidate to win. In addition, Mark Pickup in Cameron Anderson and Laura Stephenson's "Voting Behaviour in Canada" outlines three additional "behavioural" responses that voters may exhibit when faced with polling data. The first is known as a "cue taking" effect which holds that poll data is used as a "proxy" for information about the candidates or parties. Cue taking is "based on the psychological phenomenon of using heuristics to simplify a

complex decision" (243).[12] The second, first described by Petty and Cacioppo (1996), is known as "cognitive response" theory. This theory asserts that a voter's response to a poll may not align with their initial conception of the electoral reality. In response, the voter is likely to generate a "mental list" in which they create reasons for a party's loss or gain in the polls. This can reinforce or change their opinion of the candidate and thus affect voting behaviour. Third, the final possibility is a "behavioural response" which is similar to a cognitive response. The only salient difference is that a voter will go and seek new information to form their "mental list," thus becoming more informed of the election. This may then affect voting behaviour. These effects indicate how opinion polls can directly affect political choices of the electorate. But directly or indirectly, other effects can be surveyed and analyzed on all political parties. The form of media framing and party ideology shifts must also be taken under consideration. Opinion polling in some instances is a measure of cognitive bias, which is variably considered and handled appropriately in its various applications.


Effect on politicians
Starting in the 1980s, tracking polls and related technologies began having a notable impact on U.S. political leaders.[] According to Douglas Bailey, a Republican who had helped run Gerald Ford's 1976 presidential campaign, "It's no longer necessary for a political candidate to guess what an audience thinks. He can [find out] with a nightly tracking poll. So it's no longer likely that political leaders are going to lead. Instead, they're going to follow."[]

Regulation
Some jurisdictions around the world restrict the publication of the results of opinion polls in order to prevent possibly erroneous results from affecting voters' decisions. For instance, in Canada, it is prohibited to publish the results of opinion surveys that would identify specific political parties or candidates in the final three days before a poll closes.[] However, most Western democratic nations do not support an outright prohibition on the publication of pre-election opinion polls; most of them have no regulation and some prohibit it only in the final days or hours until the relevant poll closes.[] A survey by Canada's Royal Commission on Electoral Reform reported that the period during which publication of survey results is prohibited differs widely from country to country. Of the 20 countries examined, three prohibit publication during the entire period of campaigns, while others prohibit it for a shorter term such as the polling period or the final 48 hours before a poll closes.[]

Footnotes
[1] Kenneth F. Warren (1992). "In Defense of Public Opinion Polling." Westview Press. pp. 200–1.
[2] An estimate of the margin of error in percentage terms can be gained by the formula 100 divided by the square root of the sample size.
[4] Lynch, Scott M. Introduction to Bayesian Statistics and Estimation for Social Scientists (2007).
[5] http://www.daytodaypolitics.com/polls/presidential_election_Obama_vs_McCain_2008.htm
[8] "Public Agenda Issue Guide: Abortion - Public View - Red Flags" (http://www.publicagenda.org/citizen/issueguides/abortion/publicview/redflags). Public Agenda.
[10] Safire, William, Safire's Political Dictionary, page 42. Random House, 1993.
[11] Irwin, Galen A. and Joop J. M. Van Holsteyn. Bandwagons, Underdogs, the Titanic and the Red Cross: The Influence of Public Opinion Polls on Voters (2000).



External references
Asher, Herbert: Polling and the Public. What Every Citizen Should Know, fourth edition. Washington, D.C.: CQ Press, 1998.
Bourdieu, Pierre, "Public Opinion does not exist" in Sociology in Question, London, Sage (1995).
Bradburn, Norman M. and Seymour Sudman. Polls and Surveys: Understanding What They Tell Us (1988).
Cantril, Hadley. Gauging Public Opinion (1944).
Cantril, Hadley and Mildred Strunk, eds. Public Opinion, 1935-1946 (1951) (http://www.questia.com/PM.qst?a=o&d=98754501), massive compilation of many public opinion polls from US, UK, Canada, Australia, and elsewhere.
Converse, Jean M. Survey Research in the United States: Roots and Emergence 1890-1960 (1987), the standard history.
Crespi, Irving. Public Opinion, Polls, and Democracy (1989) (http://www.questia.com/PM.qst?a=o&d=8971691).
Gallup, George. Public Opinion in a Democracy (1939).
Gallup, Alec M. ed. The Gallup Poll Cumulative Index: Public Opinion, 1935-1997 (1999), lists 10,000+ questions, but no results.
Gallup, George Horace, ed. The Gallup Poll; Public Opinion, 1935-1971, 3 vol (1972), summarizes results of each poll.
Glynn, Carroll J., Susan Herbst, Garrett J. O'Keefe, and Robert Y. Shapiro. Public Opinion (1999) (http://www.questia.com/PM.qst?a=o&d=100501261), textbook.
Lavrakas, Paul J. et al. eds. Presidential Polls and the News Media (1995) (http://www.questia.com/PM.qst?a=o&d=28537852).
Moore, David W. The Superpollsters: How They Measure and Manipulate Public Opinion in America (1995) (http://www.questia.com/PM.qst?a=o&d=8540600).
Niemi, Richard G., John Mueller, Tom W. Smith, eds. Trends in Public Opinion: A Compendium of Survey Data (1989) (http://www.questia.com/PM.qst?a=o&d=28621255).
Oskamp, Stuart and P. Wesley Schultz; Attitudes and Opinions (2004) (http://www.questia.com/PM.qst?a=o&d=104829752).
Robinson, Claude E. Straw Votes (1932).
Robinson, Matthew. Mobocracy: How the Media's Obsession with Polling Twists the News, Alters Elections, and Undermines Democracy (2002).
Rogers, Lindsay. The Pollsters: Public Opinion, Politics, and Democratic Leadership (1949) (http://www.questia.com/PM.qst?a=o&d=89021667).
Traugott, Michael W. The Voter's Guide to Election Polls (http://www.questia.com/PM.qst?a=o&d=71288534), 3rd ed. (2004).
James G. Webster, Patricia F. Phalen, Lawrence W. Lichty; Ratings Analysis: The Theory and Practice of Audience Research. Lawrence Erlbaum Associates, 2000.
Young, Michael L. Dictionary of Polling: The Language of Contemporary Opinion Research (1992) (http://www.questia.com/PM.qst?a=o&d=59669912).

Additional Sources

Walden, Graham R. Survey Research Methodology, 1990-1999: An Annotated Bibliography. Bibliographies and Indexes in Law and Political Science Series. Westport, CT: Greenwood Press, Greenwood Publishing Group, Inc., 2002. xx, 432p.
Walden, Graham R. Public Opinion Polls and Survey Research: A Selective Annotated Bibliography of U.S. Guides and Studies from the 1980s. Public Affairs and Administrative Series, edited by James S. Bowman, vol. 24. New York, NY: Garland Publishing Inc., 1990. xxix, 360p.
Walden, Graham R. Polling and Survey Research Methods 1935-1979: An Annotated Bibliography. Bibliographies and Indexes in Law and Political Science Series, vol. 25. Westport, CT: Greenwood Publishing Group, Inc., 1996. xxx, 581p.


External links
Polls (http://ucblibraries.colorado.edu/govpubs/us/polls.htm) from UCB Libraries GovPubs
The Pew Research Center (http://www.pewresearch.org), nonpartisan "fact tank" providing information on the issues, attitudes and trends shaping America and the world by conducting public opinion polling and social science research
"Use Opinion Research To Build Strong Communication" (http://www.gcastrategies.com/books_articles/article_001_or.php) by Frank Noto
Public Agenda for Citizens (http://www.publicagenda.org/), nonpartisan, nonprofit group that tracks public opinion data in the United States
National Council on Public Polls (http://www.ncpp.org/?q=home), association of polling organizations in the United States devoted to setting high professional standards for surveys
How Will America Vote (http://howwillamericavote.com), aggregates polling data with demographic sub-samples
USA Election Polls (http://www.usaelectionpolls.com), tracks the public opinion polls related to elections in the US
Survey Analysis Tool (http://www.i-marvin.si), based on A. Berkopec, HyperQuick algorithm for discrete hypergeometric distribution, Journal of Discrete Algorithms, Elsevier, 2006 (http://dx.doi.org/10.1016/j.jda.2006.01.001)
"Poll Position - Issue 010 - GOOD" (http://www.good.is/post/poll_position/), track record of pollsters for USA presidential elections in Good magazine, April 23, 2008



Optimal discriminant analysis


Optimal discriminant analysis (ODA) and the related classification tree analysis (CTA) are statistical methods that maximize predictive accuracy. For any specific sample and exploratory or confirmatory hypothesis, optimal discriminant analysis (ODA) identifies the statistical model that yields maximum predictive accuracy, assesses the exact Type I error rate, and evaluates potential cross-generalizability. Optimal discriminant analysis may be applied to >0 dimensions, with the one-dimensional case being referred to as UniODA and the multidimensional case being referred to as MultiODA. Classification tree analysis is a generalization of optimal discriminant analysis to non-orthogonal trees. Classification tree analysis has more recently been called "hierarchical optimal discriminant analysis". Optimal discriminant analysis and classification tree analysis may be used to find the combination of variables and cut points that best separate classes of objects or events. These variables and cut points may then be used to reduce dimensions and to then build a statistical model that optimally describes the data. Optimal discriminant analysis may be thought of as a generalization of Fisher's linear discriminant analysis. Optimal discriminant analysis is an alternative to ANOVA (analysis of variance) and regression analysis, which attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA and regression analysis give a dependent variable that is a numerical variable, while optimal discriminant analysis gives a dependent variable that is a class variable.
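As a rough illustration of the one-dimensional (UniODA) idea, the sketch below brute-forces the cut point that maximizes classification accuracy on a sample. It is only a toy illustration of the accuracy-maximizing principle, not the published ODA algorithm or software; the function name and data are invented.

```python
import numpy as np

def uni_oda_cutpoint(x, y):
    """Toy one-dimensional cut-point search: try every midpoint between adjacent
    sorted values and keep the cut (and direction) that classifies the 0/1 classes
    most accurately. Illustrates the idea of maximizing predictive accuracy only."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    xs = np.sort(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0
    best = (0.0, None, None)          # (accuracy, cut point, class predicted above the cut)
    for cut in candidates:
        for above in (0, 1):
            pred = np.where(x > cut, above, 1 - above)
            accuracy = float(np.mean(pred == y))
            if accuracy > best[0]:
                best = (accuracy, float(cut), above)
    return best

# Invented scores on one predictor and a 0/1 outcome
x = [1.2, 1.9, 2.1, 2.4, 3.3, 3.9, 4.2, 4.8]
y = [0,   0,   0,   1,   0,   1,   1,   1]
print(uni_oda_cutpoint(x, y))   # e.g. (0.875, 2.25, 1): cut near 2.25, 87.5% correct
```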

References
Yarnold, Paul R.; Soltysik, Robert C. (2004). Optimal Data Analysis (http://books.apa.org/books.cfm?id=4316000). American Psychological Association. ISBN 1-55798-981-8.
Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems". Annals of Eugenics 7 (2): 179–188. doi: 10.1111/j.1469-1809.1936.tb02137.x (http://dx.doi.org/10.1111/j.1469-1809.1936.tb02137.x). hdl: 2440/15227 (http://hdl.handle.net/2440/15227).
Martinez, A. M.; Kak, A. C. (2001). "PCA versus LDA" (http://www.ece.osu.edu/~aleix/pami01f.pdf). IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–233. doi: 10.1109/34.908974 (http://dx.doi.org/10.1109/34.908974).
Mika, S. et al. (1999). "Fisher Discriminant Analysis with Kernels" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.9904). IEEE Conference on Neural Networks for Signal Processing IX: 41–48. doi: 10.1109/NNSP.1999.788121 (http://dx.doi.org/10.1109/NNSP.1999.788121).

External links
LDA tutorial using MS Excel (http://people.revoledu.com/kardi/tutorial/LDA/index.html)
IMSL discriminant analysis function DSCRM (http://www.roguewave.com/Portals/0/products/imsl-numerical-libraries/fortran-library/docs/7.0/stat/stat.htm), which has many useful mathematical definitions.



Pairwise comparison
Pairwise comparison generally refers to any process of comparing entities in pairs to judge which of each entity is preferred, or has a greater amount of some quantitative property. The method of pairwise comparison is used in the scientific study of preferences, attitudes, voting systems, social choice, public choice, and multiagent AI systems. In psychology literature, it is often referred to as paired comparison. Prominent psychometrician L. L. Thurstone first introduced a scientific approach to using pairwise comparisons for measurement in 1927, which he referred to as the law of comparative judgment. Thurstone linked this approach to psychophysical theory developed by Ernst Heinrich Weber and Gustav Fechner. Thurstone demonstrated that the method can be used to order items along a dimension such as preference or importance using an interval-type scale.

Overview
If an individual or organization expresses a preference between two mutually distinct alternatives, this preference can be expressed as a pairwise comparison. If the two alternatives are x and y, the following are the possible pairwise comparisons:

The agent prefers x over y: "x > y" or "xPy"
The agent prefers y over x: "y > x" or "yPx"
The agent is indifferent between both alternatives: "x = y" or "xIy"

Probabilistic models
In terms of modern psychometric theory, Thurstone's approach, called the law of comparative judgment, is more aptly regarded as a measurement model. The Bradley–Terry–Luce (BTL) model (Bradley & Terry, 1952; Luce, 1959) is often applied to pairwise comparison data to scale preferences. The BTL model is identical to Thurstone's model if the simple logistic function is used. Thurstone used the normal distribution in applications of the model. The simple logistic function varies by less than 0.01 from the cumulative normal ogive across the range, given an arbitrary scale factor. In the BTL model, the probability that object j is judged to have more of an attribute than object i is:

$$P(j > i) = \sigma(\delta_j - \delta_i) = \frac{1}{1 + e^{-(\delta_j - \delta_i)}},$$

where $\delta_i$ is the scale location of object $i$ and $\sigma$ is the inverse logit (logistic) function. For example, the scale location might represent the perceived quality of a product, or the perceived weight of an object. The BTL is very closely related to the Rasch model for measurement. Thurstone used the method of pairwise comparisons as an approach to measuring perceived intensity of physical stimuli, attitudes, preferences, choices, and values. He also studied implications of the theory he developed for opinion polls and political voting (Thurstone, 1959).
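In code, the BTL choice probability is just the logistic function applied to the difference in scale locations; a minimal sketch with made-up scale values:

```python
import math

def btl_probability(delta_j, delta_i):
    """P(object j is judged to have more of the attribute than object i)
    under the Bradley-Terry-Luce model: the logistic of the difference
    in scale locations."""
    return 1.0 / (1.0 + math.exp(-(delta_j - delta_i)))

# Illustrative scale locations (arbitrary interval-scale units)
print(btl_probability(1.5, 0.5))  # ~0.73
print(btl_probability(0.5, 1.5))  # ~0.27, the complementary judgement
```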



Transitivity
For a given decision agent, if the information, objective, and alternatives used by the agent remain constant, then it is generally assumed that pairwise comparisons over those alternatives by the decision agent are transitive. Most agree upon what transitivity is, though there is debate about the transitivity of indifference. The rules of transitivity are as follows for a given decision agent.

If xPy and yPz, then xPz
If xPy and yIz, then xPz
If xIy and yPz, then xPz
If xIy and yIz, then xIz

This corresponds to (xPy or xIy) being a total preorder, P being the corresponding strict weak order, and I being the corresponding equivalence relation. Probabilistic models require transitivity only within the bounds of errors of estimates of scale locations of entities. Thus, decisions need not be deterministically transitive in order to apply probabilistic models. However, transitivity will generally hold for a large number of comparisons if models such as the BTL can be effectively applied. Using a transitivity test,[1] one can investigate whether a data set of pairwise comparisons contains a higher degree of transitivity than expected by chance.

Argument for intransitivity of indifference


Some contend that indifference is not transitive. Consider the following example. Suppose you like apples and you prefer apples that are larger. Now suppose there exists an apple A, an apple B, and an apple C which have identical intrinsic characteristics except for the following. Suppose B is larger than A, but it is not discernible without an extremely sensitive scale. Further suppose C is larger than B, but this also is not discernible without an extremely sensitive scale. However, the difference in sizes between apples A and C is large enough that you can discern that C is larger than A without a sensitive scale. In psychophysical terms, the size difference between A and C is above the just noticeable difference ('jnd') while the size differences between A and B and B and C are below the jnd. You are confronted with the three apples in pairs without the benefit of a sensitive scale. Therefore, when presented A and B alone, you are indifferent between apple A and apple B; and you are indifferent between apple B and apple C when presented B and C alone. However, when the pair A and C are shown, you prefer C over A.

Preference orders
If pairwise comparisons are in fact transitive with respect to the four mentioned rules, then pairwise comparisons for a list of alternatives (A1, A2, A3, ..., An−1, and An) can take the form:

A1 (>XOR=) A2 (>XOR=) A3 (>XOR=) ... (>XOR=) An−1 (>XOR=) An

For example, if there are three alternatives a, b, and c, then the six possible strict preference orders are a>b>c, a>c>b, b>a>c, b>c>a, c>a>b, and c>b>a; allowing indifference (such as a=b>c or a=b=c) brings the number of possible preference orders to 13.

If the number of alternatives is n, and indifference is not allowed, then the number of possible preference orders for any given n-value is n!. If indifference is allowed, then the number of possible preference orders is the number of total preorders. It can be expressed as a function of n:

$$\sum_{k=1}^{n} k!\, S_2(n,k),$$

where $S_2(n,k)$ is the Stirling number of the second kind.
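A short sketch of that count, computing the Stirling numbers by their standard recurrence (the function names are illustrative):

```python
from math import factorial

def stirling2(n, k):
    """Stirling number of the second kind, S2(n, k), via the standard recurrence."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def preference_orders(n, allow_indifference=False):
    """n! strict orders, or the number of total preorders when indifference is allowed."""
    if not allow_indifference:
        return factorial(n)
    return sum(factorial(k) * stirling2(n, k) for k in range(1, n + 1))

print(preference_orders(3))        # 6 strict orders
print(preference_orders(3, True))  # 13 total preorders
```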

Applications
One important application of pairwise comparisons is the widely used Analytic Hierarchy Process, a structured technique for helping people deal with complex decisions. It uses pairwise comparisons of tangible and intangible factors to construct ratio scales that are useful in making important decisions.[1][]

References
[1] Nikolić D (2012) Non-parametric detection of temporal order across pairwise measurements of time delays. Journal of Computational Neuroscience, 22(1), pp. 5–19. http://www.danko-nikolic.com/wp-content/uploads/2011/09/Nikolic-Transitivity-2007.pdf

" Sloane's A000142 : Factorial numbers (http://oeis.org/A000142)", The On-Line Encyclopedia of Integer Sequences. OEIS Foundation. " Sloane's A000670 : Number of preferential arrangements of n labeled elements (http://oeis.org/A000670)", The On-Line Encyclopedia of Integer Sequences. OEIS Foundation. Y. Chevaleyre, P.E. Dunne, U. Endriss, J. Lang, M. Lematre, N. Maudet, J. Padget, S. Phelps, J.A. Rodrguez-Aguilar, and P. Sousa. Issues in Multiagent Resource Allocation. Informatica, 30:331, 2006.

Further reading
How to Analyze Paired Comparison Data (http://www.ee.washington.edu/research/guptalab/publications/PairedComparisonTutorialTsukidaGuptaUWTechReport2011.pdf)
Bradley, R.A. and Terry, M.E. (1952). Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika, 39, 324–345.
David, H.A. (1988). The Method of Paired Comparisons. New York: Oxford University Press.
Luce, R.D. (1959). Individual Choice Behaviours: A Theoretical Analysis. New York: J. Wiley.
Thurstone, L.L. (1927). A law of comparative judgement. Psychological Review, 34, 278–286.
Thurstone, L.L. (1929). The Measurement of Psychological Value. In T.V. Smith and W.K. Wright (Eds.), Essays in Philosophy by Seventeen Doctors of Philosophy of the University of Chicago. Chicago: Open Court.
Thurstone, L.L. (1959). The Measurement of Values. Chicago: The University of Chicago Press.



Pathfinder network
Several psychometric scaling methods start from proximity data and yield structures revealing the underlying organization of the data. Data clustering and multidimensional scaling are two such methods. Network scaling represents another method based on graph theory. Pathfinder networks are derived from proximities for pairs of entities. Proximities can be obtained from similarities, correlations, distances, conditional probabilities, or any other measure of the relationships among entities. The entities are often concepts of some sort, but they can be anything with a pattern of relationships. In the Pathfinder network, the entities correspond to the nodes of the generated network, and the links in the network are determined by the patterns of proximities. For example, if the proximities are similarities, links will generally connect nodes of high similarity. The links in the network will be undirected if the proximities are symmetrical for every pair of entities. Symmetrical proximities mean that the order of the entities is not important, so the proximity of i and j is the same as the proximity of j and i for all pairs i,j. If the proximities are not symmetrical for every pair, the links will be directed. Here is an example of an undirected Pathfinder network derived from average similarity ratings of a group of biology graduate students. The students rated the similarity of all pairs of the terms shown.

Pathfinder uses two parameters. (1) The q parameter constrains the number of indirect proximities examined in generating the network. The q parameter is an integer value between 2 and n−1, inclusive, where n is the number of nodes or items. (2) The r parameter defines the metric used for computing the distance of paths (cf. the Minkowski distance). The r parameter is a real number between 1 and infinity, inclusive. A network generated with particular values of q and r is called a PFnet(q, r). Both of the parameters have the effect of decreasing the number of links in the network as their values are increased. The network with the minimum number of links is obtained when q = n−1 and r = ∞, i.e., PFnet(n−1, ∞). With ordinal-scale data (see level of measurement), the r parameter should be infinity because the same PFnet would result from any positive monotonic transformation of the proximity data. Other values of r require data measured on a ratio scale. The q parameter can be varied to yield the desired number of links in the network.

Essentially, Pathfinder networks preserve the shortest possible paths given the data so links are eliminated when they are not on shortest paths. The PFnet(n−1, ∞) will be the minimum spanning tree for the links defined by the proximity data if a unique minimum spanning tree exists. In general, the PFnet(n−1, ∞) includes all of the links in any minimum spanning tree. Pathfinder networks are used in the study of expertise, knowledge acquisition, knowledge engineering, citation patterns, information retrieval, and data visualization. The networks are potentially applicable to any problem addressed by network theory.
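For the sparsest network described above, PFnet(n−1, ∞), one workable way to compute it is via minimax path distances, since with r = ∞ a path's weight is the maximum link weight along it. The sketch below is only an illustration of that idea on a small symmetric distance matrix, not the published Pathfinder or MST-Pathfinder implementations cited in the references.

```python
import numpy as np

def pfnet_sparsest(dist):
    """Sketch of PFnet(n-1, infinity) for a symmetric matrix of distances
    (smaller = more proximal). With r = infinity a path's weight is the largest
    link weight along it, so a direct link survives only if no indirect path
    has a smaller maximum weight."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    minimax = d.copy()
    # Floyd-Warshall-style pass computing minimax path distances
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i, k], minimax[k, j])
                if via_k < minimax[i, j]:
                    minimax[i, j] = via_k
    # keep a link only where no indirect path beats the direct distance
    return (d <= minimax) & ~np.eye(n, dtype=bool)

dist = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 2.0],
                 [4.0, 2.0, 0.0]])
print(pfnet_sparsest(dist))   # the 4.0 link is pruned: the two-step path has max weight 2.0
```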


References
Further information on Pathfinder networks and several examples of the application of PFnets to a variety of problems can be found in:

Schvaneveldt, R. W. (Ed.) (1990) Pathfinder Associative Networks: Studies in Knowledge Organization. Norwood, NJ: Ablex. The book is out of print. A copy can be downloaded: pdf [1]

A shorter article summarizing Pathfinder networks:

Schvaneveldt, R. W., Durso, F. T., & Dearholt, D. W. (1989). Network structures in proximity data. In G. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory, Vol. 24 (pp. 249–284). New York: Academic Press. pdf [2]

Three papers describing fast implementations of Pathfinder networks:

Guerrero-Bote, V.; Zapico-Alonso, F.; Espinosa-Calvo, M.; Gomez-Crisostomo, R.; Moya-Anegón, F. (2006). "Binary pathfinder: An improvement to the pathfinder algorithm". Information Processing and Management 42 (6): 1484–1490. doi:10.1016/j.ipm.2006.03.015 [3].
Quirin, A.; Cordón, O.; Santamaría, J.; Vargas-Quesada, B.; Moya-Anegón, F. (2008). "A new variant of the Pathfinder algorithm to generate large visual science maps in cubic time". Information Processing and Management 44 (4): 1611–1623. doi:10.1016/j.ipm.2007.09.005 [4].
Quirin, A.; Cordón, O.; Guerrero-Bote, V. P.; Vargas-Quesada, B.; Moya-Anegón, F. (2008). "A Quick MST-based Algorithm to Obtain Pathfinder Networks". Journal of the American Society for Information Science and Technology 59 (12): 1912–1924. doi:10.1002/asi.20904 [5].

(Quirin et al. is significantly faster, but can only be applied in cases where q = n−1, while Guerrero-Bote et al. can be used for all cases.)

External links
Interlink [6]
Implementation of the MST-Pathfinder algorithm in C++ [7]

References
[1] http://interlinkinc.net/PFBook.zip
[2] http://www.interlinkinc.net/Roger/Papers/Schvaneveldt_Durso_Dearholt_1989.pdf
[3] http://dx.doi.org/10.1016%2Fj.ipm.2006.03.015
[4] http://dx.doi.org/10.1016%2Fj.ipm.2007.09.005
[5] http://dx.doi.org/10.1002%2Fasi.20904
[6] http://www.interlinkinc.net
[7] http://aquirin.ovh.org/research/mstpathfinder.html



Perceptual mapping
Perceptual mapping is a diagrammatic technique used by asset marketers that attempts to visually display the perceptions of customers or potential customers. Typically the position of a product, product line, brand, or company is displayed relative to their competition. Perceptual maps can have any number of dimensions but the most common is two dimensions. The first perceptual map below shows consumer perceptions of various automobiles on the two dimensions of sportiness/conservative and classy/affordable. This sample of consumers felt Porsche was the sportiest and classiest of the cars in the study (top right corner). They felt Plymouth was most practical and conservative (bottom left corner).

Perceptual Map of Competing Products

Cars that are positioned close to each other are seen as similar on the relevant dimensions by the consumer. For example consumers see Buick, Chrysler, and Oldsmobile as similar. They are close competitors and form a competitive grouping. A company considering the introduction of a new model will look for an area on the map free from competitors. Some perceptual maps use different size circles to indicate the sales volume or market share of the various competing products. Displaying consumers' perceptions of related products is only half the story. Many perceptual maps also display consumers' ideal points. These points reflect ideal combinations of the two dimensions as seen by a consumer. The next diagram shows a study of consumers' ideal points in the alcohol/spirits product space. Each dot represents one respondent's ideal combination of the two dimensions. Areas where there is a cluster of ideal points (such as A) indicate a market segment. Areas without ideal points are sometimes referred to as demand voids.

Perceptual Map of Ideal Points and Clusters

A company considering introducing a new product will look for areas with a high density of ideal points. They will also look for areas without competitive rivals. This is best done by placing both the ideal points and the competing products on the same map. Some maps plot ideal vectors instead of ideal points. The map below, displays various aspirin products as seen on the dimensions of effectiveness and gentleness. It also shows two ideal vectors. The slope of the ideal vector indicates the preferred ratio of the two dimensions by those consumers within that segment. This study indicates

there is one segment that is more concerned with effectiveness than harshness, and another segment that is more interested in gentleness than strength.


Perceptual Map of Competing Products with Ideal Vectors

Perceptual maps need not come from a detailed study. There are also intuitive maps (also called judgmental maps or consensus maps) that are created by marketers based on their understanding of their industry. Management uses its best judgment. It is questionable how valuable this type of map is. Often they just give the appearance of credibility to management's preconceptions. When detailed marketing research studies are done methodological problems can arise, but at least the information is coming directly from the consumer. There is an assortment of statistical procedures that can be used to convert the raw data collected in a survey into a perceptual map. Preference regression will produce ideal vectors. Multidimensional scaling will produce either ideal points or competitor positions. Factor analysis, discriminant analysis, cluster analysis, and logit analysis can also be used. Some techniques are constructed from perceived differences between products, others are constructed from perceived similarities. Still others are constructed from cross-price elasticity of demand data from electronic scanners.
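As one illustration of the multidimensional scaling route mentioned above, the sketch below turns a matrix of pairwise dissimilarity judgements into two-dimensional map coordinates. The brand names and ratings are invented, and scikit-learn's MDS is just one of many tools that could be used; this is a sketch under those assumptions, not a prescription for how perceptual maps must be built.

```python
import numpy as np
from sklearn.manifold import MDS

# Invented consumer dissimilarity ratings between four brands (0 = identical)
brands = ["Brand A", "Brand B", "Brand C", "Brand D"]
dissimilarity = np.array([
    [0.0, 2.0, 6.0, 7.0],
    [2.0, 0.0, 5.0, 6.0],
    [6.0, 5.0, 0.0, 1.5],
    [7.0, 6.0, 1.5, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)        # one (x, y) map position per brand
for name, (px, py) in zip(brands, coords):
    print(f"{name}: ({px:.2f}, {py:.2f})")       # brands seen as similar land close together
```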

References

External links


Organization Perception Mapping (http://www.perceptionmapping.com)
Positioning Analysis/Mapping software (http://www.decisionpro.biz)



Person-fit analysis
Person-fit analysis is a technique for determining if the person's results on a given test are valid. The purpose of a person-fit analysis is to detect item-score vectors that are unlikely given a hypothesized test theory model such as item response theory, or unlikely compared with the majority of item-score vectors in the sample. An item-score vector is a list of "scores" that a person gets on the items of a test, where "1" is often correct and "0" is incorrect. For example, if a person took a 10-item quiz and only got the first five correct, the vector would be {1111100000}. In individual decision-making in education, psychology, and personnel selection, it is critically important that test users can have confidence in the test scores used. The validity of individual test scores may be threatened when the examinee's answers are governed by factors other than the psychological trait of interest - factors that can range from something as benign as the examinee dozing off to concerted fraud efforts. Person-fit methods are used to detect item-score vectors where such external factors may be relevant, and as a result, indicate invalid measurement. Unfortunately, person-fit statistics only tell if the set of responses is likely or unlikely, and cannot prove anything. The results of the analysis might look like an examinee cheated, but there is no way to go back to when the test was administered and prove it. This limits its practical applicability on an individual scale. However, it might be useful on a larger scale; if most examinees at a certain test site or with a certain proctor have unlikely responses, an investigation might be warranted.
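One of the simplest non-parametric person-fit indicators is a count of Guttman errors: cases where a person misses an easy item yet answers a harder one correctly. The sketch below is only an illustrative toy (it assumes the items can be ordered from easiest to hardest), not one of the model-based statistics discussed in the references.

```python
def guttman_errors(item_scores, difficulty_order):
    """Count Guttman errors: pairs where a harder item was answered correctly
    while an easier item was missed. Large counts flag unlikely item-score vectors."""
    ordered = [item_scores[i] for i in difficulty_order]  # easiest -> hardest
    errors = 0
    for easy in range(len(ordered)):
        for hard in range(easy + 1, len(ordered)):
            if ordered[easy] == 0 and ordered[hard] == 1:
                errors += 1
    return errors

# Items indexed 0-9, already listed from easiest to hardest
typical = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]    # the {1111100000} vector from above
aberrant = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # misses easy items, solves hard ones
order = list(range(10))
print(guttman_errors(typical, order))    # 0
print(guttman_errors(aberrant, order))   # 25
```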

References
Emons, W.H.M., Sijtsma, K., & Meijer, R.R. (2005). Global, local and graphical person-fit analysis using person response functions. Psychological Methods, 10(1), 101-119.
Emons, W.H.M., Glas, C.A.W., Meijer, R.R., & Sijtsma, K. (2003). Person fit in order-restricted latent class models. Applied Psychological Measurement, 27(6), 459-478.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person-fit. Applied Psychological Measurement, 25, 107-135.



Phrase completions
Phrase completion scales are a type of psychometric scale used in questionnaires. Developed in response to the problems associated with Likert scales, Phrase completions are concise, unidimensional measures that tap ordinal level data in a manner that approximates interval level data.

Overview of the phrase completion method


Phrase completions consist of a phrase followed by an 11-point response key. The phrase introduces part of the concept. Marking a reply on the response key completes the concept. The response key represents the underlying theoretical continuum. Zero (0) indicates the absence of the construct. Ten (10) indicates the theorized maximum amount of the construct. Response keys are reversed on alternate items to mitigate response set bias.

Sample question using the phrase completion method


I am aware of the presence of God or the Divine

Never  0  1  2  3  4  5  6  7  8  9  10  Continually

Scoring and analysis


After the questionnaire is completed, the scores on the individual items are summed to create a total test score for the respondent. Hence, phrase completions, like Likert scales, are often considered to be summative scales.

Level of measurement
The response categories represent an ordinal level of measurement. Ordinal level data, however, varies in terms of how closely it approximates interval level data. By using a numerical continuum as the response key instead of sentiments that reflect intensity of agreement, respondents may be able to quantify their responses in more equal units.

References
Hodge, D. R. & Gillespie, D. F. (2003). Phrase Completions: An alternative to Likert scales. Social Work Research, 27(1), 45-55.
Hodge, D. R. & Gillespie, D. F. (2005). Phrase Completion Scales. In K. Kempf-Leonard (Ed.), Encyclopedia of Social Measurement (Vol. 3, pp. 53-62). San Diego: Academic Press.
Hodge, D. R. & Gillespie, D. F. (2007). Phrase Completion Scales: A Better Measurement Approach than Likert Scales? Journal of Social Service Research, 33(4), 1-12.



Point-biserial correlation coefficient


The point biserial correlation coefficient ($r_{pb}$) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be "naturally" dichotomous, like gender, or an artificially dichotomized variable. In most situations it is not advisable to artificially dichotomize variables. When you artificially dichotomize a variable the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation. The point-biserial correlation is mathematically equivalent to the Pearson (product moment) correlation; that is, if we have one continuously measured variable X and a dichotomous variable Y, $r_{XY} = r_{pb}$. This can be shown by assigning two distinct numerical values to the dichotomous variable. To calculate $r_{pb}$, assume that the dichotomous variable Y has the two values 0 and 1. If we divide the data set into two groups, group 1 which received the value "1" on Y and group 2 which received the value "0" on Y, then the point-biserial correlation coefficient is calculated as follows:

$$r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}},$$

where $s_n$ is the standard deviation used when you have data for every member of the population:

$$s_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2},$$

$M_1$ being the mean value on the continuous variable X for all data points in group 1, and $M_0$ the mean value on the continuous variable X for all data points in group 2. Further, $n_1$ is the number of data points in group 1, $n_0$ is the number of data points in group 2 and $n$ is the total sample size. This formula is a computational formula that has been derived from the formula for $r_{XY}$ in order to reduce steps in the calculation; it is easier to compute than $r_{XY}$. There is an equivalent formula that uses $s_{n-1}$:

$$r_{pb} = \frac{M_1 - M_0}{s_{n-1}}\sqrt{\frac{n_1 n_0}{n(n-1)}},$$

where $s_{n-1}$ is the standard deviation used when you only have data for a sample of the population:

$$s_{n-1} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}.$$

It's important to note that this is merely an equivalent formula. It is not a formula for use in the case where you only have sample data. There is no version of the formula for a case where you only have sample data. The version of the formula using $s_{n-1}$ is useful if you are calculating point-biserial correlation coefficients in a programming language or other development environment where you have a function available for calculating $s_{n-1}$, but don't have a function available for calculating $s_n$. To clarify:

$$s_{n-1} = \sqrt{\frac{n}{n-1}}\, s_n.$$

Glass and Hopkins' book Statistical Methods in Education and Psychology (3rd Edition)[1] contains a correct version of the point biserial formula. Also the square of the point biserial correlation coefficient can be written:

$$r_{pb}^2 = \frac{(M_1 - M_0)^2\, n_1 n_0}{s_n^2\, n^2}.$$
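A small numerical check of the equivalence stated above, using the computational formula with the population standard deviation; the data are invented for illustration:

```python
import numpy as np

def point_biserial(x, y):
    """Point-biserial correlation from the computational formula above;
    y must be coded 0/1 and x is continuous."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    n = len(x)
    m1, m0 = x[y == 1].mean(), x[y == 0].mean()
    n1, n0 = (y == 1).sum(), (y == 0).sum()
    s_n = x.std()                       # population standard deviation (divides by n)
    return (m1 - m0) / s_n * np.sqrt(n1 * n0 / n**2)

# Invented data: a continuous score and a 0/1 grouping variable
x = [12.0, 14.5, 11.2, 16.8, 15.1, 10.4, 17.3, 13.9]
y = [0, 1, 0, 1, 1, 0, 1, 0]
print(point_biserial(x, y))
print(np.corrcoef(x, y)[0, 1])          # identical: r_pb equals the Pearson correlation
```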

We can test the null hypothesis that the correlation is zero in the population. A little algebra shows that the usual formula for assessing the significance of a correlation coefficient, when applied to $r_{pb}$, is the same as the formula for an unpaired t-test and so

$$r_{pb}\sqrt{\frac{n_1 + n_0 - 2}{1 - r_{pb}^2}}$$

follows Student's t-distribution with $(n_1 + n_0 - 2)$ degrees of freedom when the null hypothesis is true. One disadvantage of the point biserial coefficient is that the further the distribution of Y is from 50/50, the more constrained will be the range of values which the coefficient can take. If X can be assumed to be normally distributed, a better descriptive index is given by the biserial coefficient

$$r_b = \frac{M_1 - M_0}{s_n}\cdot\frac{n_1 n_0}{u\, n^2},$$

where u is the ordinate of the normal distribution with zero mean and unit variance at the point which divides the distribution into proportions $n_0/n$ and $n_1/n$. As you might imagine, this is not the easiest thing in the world to calculate and the biserial coefficient is not widely used in practice. A specific case of biserial correlation occurs where X is the sum of a number of dichotomous variables of which Y is one. An example of this is where X is a person's total score on a test composed of n dichotomously scored items. A statistic of interest (which is a discrimination index) is the correlation between responses to a given item and the corresponding total test scores. There are three computations in wide use,[2] all called the point-biserial correlation: (i) the Pearson correlation between item scores and total test scores including the item scores, (ii) the Pearson correlation between item scores and total test scores excluding the item scores, and (iii) a correlation adjusted for the bias caused by the inclusion of item scores in the test scores. Correlation (iii) is

$$r_{i(X-i)} = \frac{r_{iX}\,\sigma_X - \sigma_i}{\sqrt{\sigma_X^2 + \sigma_i^2 - 2\, r_{iX}\,\sigma_i\,\sigma_X}},$$

where $r_{iX}$ is the correlation between the item scores and the total test scores, $\sigma_X$ is the standard deviation of the total test scores and $\sigma_i$ is the standard deviation of the item scores.

A slightly different version of the point biserial coefficient is the rank biserial which occurs where the variable X consists of ranks while Y is dichotomous. We could calculate the coefficient in the same way as where X is continuous but it would have the same disadvantage that the range of values it can take on becomes more constrained as the distribution of Y becomes more unequal. To get round this, we note that the coefficient will have its largest value where the smallest ranks are all opposite the 0s and the largest ranks are opposite the 1s. Its smallest value occurs where the reverse is the case. These values are respectively plus and minus (n1+n0)/2. We can therefore use the reciprocal of this value to rescale the difference between the observed mean ranks on to the interval from plus one to minus one. The result is

$$ r_{rb} = \frac{2\,(M_1 - M_0)}{n_1 + n_0} $$
where M1 and M0 are respectively the means of the ranks corresponding to the 1 and 0 scores of the dichotomous variable. This formula, which simplifies the calculation from the counting of agreements and inversions, is due to Gene V Glass (1966). It is possible to use this to test the null hypothesis of zero correlation in the population from which the sample was drawn. If r_rb is calculated as above then the smaller of

$$ U_1 = n_1 n_0 \, \frac{1 + r_{rb}}{2} $$
and

$$ U_0 = n_1 n_0 \, \frac{1 - r_{rb}}{2} $$

is distributed as Mann–Whitney U with sample sizes n1 and n0 when the null hypothesis is true.
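A small Python sketch (made-up ranks) of Glass's rank biserial formula and its relationship to the Mann–Whitney U statistic; scipy.stats.mannwhitneyu is used only as a cross-check, and the U convention it reports varies across scipy versions:

```python
import numpy as np
from scipy import stats

ranks = np.arange(1, 11)                           # the variable X: ranks 1..10
y     = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1, 1])   # dichotomous variable Y

m1, m0 = ranks[y == 1].mean(), ranks[y == 0].mean()
n1, n0 = (y == 1).sum(), (y == 0).sum()

r_rb = 2 * (m1 - m0) / (n1 + n0)                   # Glass (1966) rank biserial

# Recover the two Mann-Whitney U values from the coefficient
u1 = n1 * n0 * (1 + r_rb) / 2
u0 = n1 * n0 * (1 - r_rb) / 2
u = stats.mannwhitneyu(ranks[y == 1], ranks[y == 0]).statistic
print(r_rb, u1, u0, u)   # u matches u1 (or min(u1, u0), depending on the scipy version)
```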


External links
Point Biserial Coefficient [3] (Keith Calkins, 2005)

Notes
[3] http://www.andrews.edu/~calkins/math/edrm611/edrm13.htm#POINTB

Polychoric correlation
In statistics, polychoric correlation is a technique for estimating the correlation between two theorised normally distributed continuous latent variables, from two observed ordinal variables. Tetrachoric correlation is a special case of the polychoric correlation applicable when both observed variables are dichotomous. These names derive from the polychoric and tetrachoric series, mathematical expansions once, but no longer, used for estimation of these correlations.
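As a rough illustration of the underlying idea (not the algorithm used by the packages listed below), the tetrachoric special case can be estimated by maximising a bivariate-normal likelihood over a 2x2 table of counts. The following Python sketch assumes hypothetical data and uses only scipy; the function name is illustrative:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def tetrachoric(table):
    """Maximum-likelihood tetrachoric correlation from a 2x2 table of counts.
    Rows index variable A (0/1), columns index variable B (0/1)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Thresholds are fixed from the marginal proportions of the dichotomies
    tau_a = norm.ppf(table[0, :].sum() / n)   # P(A = 0)
    tau_b = norm.ppf(table[:, 0].sum() / n)   # P(B = 0)

    def cell_probs(rho):
        mvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
        p00 = mvn.cdf([tau_a, tau_b])
        p0_, p_0 = norm.cdf(tau_a), norm.cdf(tau_b)
        probs = np.array([[p00, p0_ - p00],
                          [p_0 - p00, 1 - p0_ - p_0 + p00]])
        return np.clip(probs, 1e-12, 1.0)

    def neg_loglik(rho):
        return -(table * np.log(cell_probs(rho))).sum()

    return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999), method="bounded").x

# Example with strongly associated dichotomous items (hypothetical counts)
print(tetrachoric([[40, 10], [8, 42]]))
```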

Applications and examples


This technique is frequently applied when analysing items on self-report instruments such as personality tests and surveys that often use rating scales with a small number of response options (e.g., strongly disagree to strongly agree). The smaller the number of response categories, the more a correlation between latent continuous variables will tend to be attenuated. Lee, Poon & Bentler (1995) have recommended a two-step approach to factor analysis for assessing the factor structure of tests involving ordinally measured items. This aims to reduce the effect of statistical artifacts, such as the number of response scales or skewness of variables leading to items grouping together in factors.

Software
polycor package in R by John Fox[1]
psych package in R by William Revelle[2]
PRELIS
POLYCORR program [3]
An extensive list of software for computing the polychoric correlation, by John Uebersax [4]

References
Lee, S.-Y., Poon, W. Y., & Bentler, P. M. (1995). "A two-stage estimation of structural equation models with continuous and polytomous variables". British Journal of Mathematical and Statistical Psychology, 48, 339-358.
Bonett, D. G., & Price, R. M. (2005). "Inferential Methods for the Tetrachoric Correlation Coefficient". Journal of Educational and Behavioral Statistics, 30, 213.

External links
The Tetrachoric and Polychoric Correlation Coefficients [4]


References
[1] http://rss.acs.unt.edu/Rdoc/library/polycor/html/polychor.html
[2] http://cran.r-project.org/web/packages/psych/index.html
[3] http://www.john-uebersax.com/stat/xpc.htm
[4] http://www.john-uebersax.com/stat/tetra.htm

Polynomial conjoint measurement


Polynomial conjoint measurement is an extension of the theory of conjoint measurement to three or more attributes. It was initially developed by the mathematical psychologists David Krantz (1968) and Amos Tversky (1967). The theory was given a comprehensive mathematical exposition in the first volume of Foundations of Measurement (Krantz, Luce, Suppes & Tversky, 1971), which Krantz and Tversky wrote in collaboration with the mathematical psychologist R. Duncan Luce and philosopher Patrick Suppes. Krantz & Tversky (1971) also published a non-technical paper on polynomial conjoint measurement for behavioural scientists in the journal Psychological Review. As with the theory of conjoint measurement, the significance of polynomial conjoint measurement lies in the quantification of natural attributes in the absence of concatenation operations. Polynomial conjoint measurement differs from the two attribute case discovered by Luce & Tukey (1964) in that more complex composition rules are involved.



Krantz's (1968) schema
Most scientific theories involve more than just two attributes, and thus the two-variable case of conjoint measurement has rather limited scope. Moreover, contrary to the theory of n-component conjoint measurement, many attributes are non-additive compositions of other attributes (Krantz, et al., 1971). Krantz (1968) proposed a general schema to ascertain the sufficient set of cancellation axioms for a class of polynomial combination rules he called simple polynomials. A formal definition of this schema is given by Krantz, et al. (1971, p. 328). Informally, the schema argues: a) single attributes are simple polynomials; b) if G1 and G2 are simple polynomials that are disjoint (i.e. have no attributes in common), then G1 + G2 and G1 · G2 are simple polynomials; and c) no polynomials are simple except as given by a) and b).

Let A, P and U be single disjoint attributes. From Krantz's (1968) schema it follows that four classes of simple polynomials in three variables exist, which contain a total of eight simple polynomials:

Additive: A + P + U;
Distributive: (A + P)U, plus 2 others obtained by interchanging A, P and U;
Dual distributive: AP + U, plus 2 others as per above;
Multiplicative: APU.

Krantz's (1968) schema can be used to construct simple polynomials of greater numbers of attributes. For example, if D is a single variable disjoint to A, B, and C then three classes of simple polynomials in four variables are A + B + C + D, D + (B + AC) and D + ABC. This procedure can be employed for any finite number of variables. A simple test is that a simple polynomial can be split into either a product or sum of two smaller, disjoint simple polynomials.

These polynomials can be further split until single variables are obtained. An expression not amenable to splitting in this manner is not a simple polynomial (e.g. AB + BC + AC (Krantz & Tversky, 1971)).
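The splitting test just described can be mechanised. The following Python sketch (using sympy; the helper name and examples are illustrative, not a standard API) recursively checks whether an expression splits into a sum or product of two disjoint simple polynomials:

```python
# A rough sketch of the splitting test for simple polynomials described above.
from itertools import combinations
import sympy as sp

def is_simple_polynomial(expr):
    if expr.is_Symbol:                      # a single attribute is simple
        return True
    if not (expr.is_Add or expr.is_Mul):
        return False
    args = expr.args
    # Try every bipartition of the terms/factors into two disjoint groups
    for r in range(1, len(args)):
        for left in combinations(args, r):
            right = tuple(a for a in args if a not in left)
            g1, g2 = expr.func(*left), expr.func(*right)
            if (g1.free_symbols.isdisjoint(g2.free_symbols)
                    and is_simple_polynomial(g1)
                    and is_simple_polynomial(g2)):
                return True
    return False

A, P, U, B, C = sp.symbols("A P U B C")
print(is_simple_polynomial(A + P + U))          # True  (additive)
print(is_simple_polynomial((A + P) * U))        # True  (distributive)
print(is_simple_polynomial(A * P + U))          # True  (dual distributive)
print(is_simple_polynomial(A*B + B*C + A*C))    # False (not a simple polynomial)
```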


Axioms
Let A, P and U be non-empty and disjoint sets, and let "≿" be a simple order upon A × P × U. Krantz et al. (1971) argued that ⟨A × P × U, ≿⟩ is a polynomial conjoint system if and only if the following axioms hold.

WEAK ORDER. "≿" is a weak order upon A × P × U.
SINGLE CANCELLATION. "≿" satisfies single cancellation (independence) upon A; single cancellation upon P and U is similarly defined.
DOUBLE CANCELLATION. "≿" satisfies double cancellation upon A and P; the condition holds similarly upon the other pairs of attributes.
JOINT SINGLE CANCELLATION. "≿" satisfies joint single cancellation upon A and P considered jointly against U; joint independence is similarly defined for the other attributes.
DISTRIBUTIVE CANCELLATION. Distributive cancellation holds upon A × P × U.
DUAL DISTRIBUTIVE CANCELLATION. Dual distributive cancellation holds upon A × P × U.
SOLVABILITY. "≿" upon A × P × U is solvable.
ARCHIMEDEAN CONDITION. Every strictly bounded standard sequence is finite.

The formal statements of these axioms are given by Krantz et al. (1971).

Representation theorems
The quadruple ⟨A × P × U, ≿⟩ falls into one of the classes of three-variable simple polynomials by virtue of the joint single cancellation axiom.

References
Krantz, D.H. (1968). A survey of measurement theory. In G.B. Danzig & A.F. Veinott (Eds.), Mathematics of the Decision Sciences, part 2 (pp.314-350). Providence, RI: American Mathematical Society. Krantz, D.H.; Luce, R.D; Suppes, P. & Tversky, A. (1971). Foundations of Measurement, Vol. I: Additive and polynomial representations. New York: Academic Press. Krantz, D.H. & Tversky, A. (1971). Conjoint measurement analysis of composition rules in psychology. Psychological Review, 78, 151-169. Luce, R.D. & Tukey, J.W. (1964). Simultaneous conjoint measurement: a new scale type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27. Tversky, A. (1967). A general theory of polynomial conjoint measurement. Journal of Mathematical Psychology, 4, 1-20.


Polytomous Rasch model


The polytomous Rasch model is a generalization of the dichotomous Rasch model. It is a measurement model that has potential application in any context in which the objective is to measure a trait or ability through a process in which responses to items are scored with successive integers. For example, the model is applicable to the use of Likert scales, rating scales, and to educational assessment items for which successively higher integer scores are intended to indicate increasing levels of competence or attainment.

Background and overview


The polytomous Rasch model was derived by Andrich (1978), subsequent to derivations by Rasch (1961) and Andersen (1977), through resolution of relevant terms of a general form of Raschs model into threshold and discrimination parameters. When the model was derived, Andrich focused on the use of Likert scales in psychometrics, both for illustrative purposes and to aid in the interpretation of the model. The model is sometimes referred to as the Rating Scale Model when (i) items have the same number of thresholds and (ii) in turn, the difference between any given threshold location and the mean of the threshold locations is equal or uniform across items. This is, however, a potentially misleading name for the model because it is far more general in its application than to so-called rating scales. The model is also sometimes referred to as the Partial Credit Model, particularly when applied in educational contexts. The Partial Credit Model (Masters, 1982) has an identical mathematical structure but was derived from a different starting point at a later time, and is expressed in a somewhat different form. The Partial Credit Model also allows different thresholds for different items. Although this name for the model is often used, Andrich (2005) provides a detailed analysis of problems associated with elements of Masters' approach, which relate specifically to the type of response process that is compatible with the model, and to empirical situations in which estimates of threshold locations are disordered. These issues are discussed in the elaboration of the model that follows. The model is a general probabilistic measurement model which provides a theoretical foundation for the use of sequential integer scores, in a manner that preserves the distinctive property that defines Rasch models: specifically, total raw scores are sufficient statistics for the parameters of the models. See the main article for the Rasch model for elaboration of this property. In addition to preserving this property, the model permits a stringent empirical test of the hypothesis that response categories represent increasing levels of a latent attribute or trait, hence are ordered. The reason the model provides a basis for testing this hypothesis is that it is empirically possible that thresholds will fail to display their intended ordering. In this more general form of the Rasch model for dichotomous data, the score on a particular item is defined as the count of the number of threshold locations on the latent trait surpassed by the individual. It should be noted, however, that this does not mean that a measurement process entails making such counts in a literal sense; rather, threshold locations on a latent continuum are usually inferred from a matrix of response data through an estimation process such as Conditional Maximum likelihood estimation. In general, the central feature of the measurement process is that individuals are classified into one of a set of contiguous, or adjoining, ordered categories. A response format employed in a given experimental context may achieve this in a number of ways. For example, respondents may choose a category they perceive best captures their level of endorsement of a statement (such as 'strongly agree'), judges may classify persons into categories based on well-defined criteria, or a person may categorise a physical stimulus based on perceived similarity to a set of reference stimuli. 
The polytomous Rasch model specialises to the model for dichotomous data when responses are classifiable into only two categories. In this special case, the item difficulty and (single) threshold are identical. The concept of a threshold is elaborated on in the following section.


The model
Firstly, let X_{ni} be an integer random variable, where m is the maximum score for item i. That is, X_{ni} is a random variable that can take on integer values between 0 and a maximum of m.

In the polytomous Rasch "Partial Credit" model (Masters, 1982), the probability of the outcome X_{ni} = x is

$$ \Pr\{X_{ni}=x\} = \frac{\exp\left(\sum_{k=1}^{x}(\beta_n - \tau_{ki})\right)}{\sum_{j=0}^{m}\exp\left(\sum_{k=1}^{j}(\beta_n - \tau_{ki})\right)}, \qquad x = 0, 1, \dots, m, $$

where \tau_{ki} is the kth threshold location of item i on a latent continuum, \beta_n is the location of person n on the same continuum, and m is the maximum score for the item; the empty sum for x = 0 (or j = 0) is taken to be zero. These equations are the same as

$$ \Pr\{X_{ni}=x\} = \frac{\exp\left(\sum_{k=0}^{x}(\beta_n - \tau_{ki})\right)}{\sum_{j=0}^{m}\exp\left(\sum_{k=0}^{j}(\beta_n - \tau_{ki})\right)}, $$

where the value of \tau_{0i} is chosen for computational convenience, since it cancels from the numerator and denominator.

Similarly, the Rasch "Rating Scale" model (Andrich, 1978) is

$$ \Pr\{X_{ni}=x\} = \frac{\exp\left(\sum_{k=0}^{x}(\beta_n - \delta_i - \tau_{k})\right)}{\sum_{j=0}^{m}\exp\left(\sum_{k=0}^{j}(\beta_n - \delta_i - \tau_{k})\right)}, $$

where \delta_i is the difficulty of item i and \tau_k is the kth threshold of the rating scale, which is in common to all the items; the value of \tau_0 is again chosen for computational convenience.
Applied in a given empirical context, the model can be considered a mathematical hypothesis that the probability of a given outcome is a probabilistic function of these person and item parameters. The graph showing the relation between the probability of a given category and person location is referred to as a Category Probability Curve (CPC). An example of the CPCs for an item with five categories, scored from 0 to 4, is shown in Figure 1 (Figure 1: Rasch category probability curves for an item with five ordered categories). A given threshold partitions the continuum into regions above and below its location. The threshold corresponds with the location on a latent continuum at which it is equally likely a person will be classified into adjacent categories, and therefore to obtain one of two successive scores. The first threshold of item i, \tau_{1i}, is the location on the continuum at which a person is equally likely to obtain a score of 0 or 1, the second threshold is the location at which a person is equally likely to obtain a score of 1 or 2, and so on. In the example shown in Figure 1, the threshold locations are -1.5, -0.5, 0.5, and 1.5 respectively. Respondents may obtain scores in many different ways. For example, where Likert response formats are employed, Strongly Disagree may be assigned 0, Disagree a 1, Agree a 2, and Strongly Agree a 3. In the context of assessment in educational psychology, successively higher integer scores may be awarded according to explicit criteria or descriptions which characterise increasing levels of attainment in a specific domain, such as reading comprehension. The common and central feature is that some process must result in classification of each individual into one of a set of ordered categories that collectively comprise an assessment item.
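A minimal numerical sketch of these category probability curves, written in Python against the Partial Credit form given above; the person locations and the Figure 1 threshold values (-1.5, -0.5, 0.5, 1.5) are assumed for illustration:

```python
import numpy as np

def pcm_category_probs(beta, thresholds):
    """Category probabilities P(X = 0), ..., P(X = m) under the partial credit /
    polytomous Rasch model for a person at location beta."""
    thresholds = np.asarray(thresholds, dtype=float)
    # cumulative sums of (beta - tau_k); the empty sum for a score of 0 is 0
    exponents = np.concatenate(([0.0], np.cumsum(beta - thresholds)))
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

taus = [-1.5, -0.5, 0.5, 1.5]             # threshold locations as in Figure 1
for beta in (-3.0, -1.0, 0.0, 1.0, 3.0):  # a few person locations along the continuum
    print(beta, np.round(pcm_category_probs(beta, taus), 3))
```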


Elaboration of the model


In elaborating on features of the model, Andrich (2005) clarifies that its structure entails a simultaneous classification process, which results in a single manifest response and involves a series of dichotomous latent responses. In addition, the latent dichotomous responses operate within a Guttman structure and associated response space, as characterised below.

Let Y_{ni1}, Y_{ni2}, ..., Y_{nim} be a set of independent dichotomous random variables, one associated with each of the m thresholds of item i. Andrich (1978, 2005) shows that the polytomous Rasch model requires that these dichotomous responses conform with a latent Guttman response subspace

$$ \{(y_1, y_2, \dots, y_m) : y_1 \geq y_2 \geq \dots \geq y_m\}, $$

in which x ones are followed by m - x zeros. For example, in the case of two thresholds, the permissible patterns in this response subspace are

{0,0} (x = 0), {1,0} (x = 1), {1,1} (x = 2),

where the integer score x implied by each pattern (and vice versa) is as shown. The reason this subspace is implied by the model is as follows. Let

$$ p_{nik} = \Pr\{Y_{nik} = 1\} $$

be the probability that Y_{nik} = 1. This function has the structure of the Rasch model for dichotomous data,

$$ p_{nik} = \frac{\exp(\beta_n - \tau_{ki})}{1 + \exp(\beta_n - \tau_{ki})}. $$

Next, consider, in the case of two thresholds, the probability of the latent pattern {1,0} conditional on the pattern lying within the Guttman subspace. It can be shown that this conditional probability is equal to

$$ \frac{\exp(\beta_n - \tau_{1i})}{1 + \exp(\beta_n - \tau_{1i}) + \exp(2\beta_n - \tau_{1i} - \tau_{2i})}, $$

which, in turn, is the probability Pr{X_{ni} = 1} given by the polytomous Rasch model. From the denominator of these equations, it can be seen that the probability in this example is conditional on the response patterns {0,0}, {1,0} or {1,1}. It is therefore evident that, in general, the response subspace defined earlier is intrinsic to the structure of the polytomous Rasch model. This restriction on the subspace is necessary to the justification for integer scoring of responses: i.e. such that the score is simply the count of ordered thresholds surpassed. Andrich (1978) showed that equal discrimination at each of the thresholds is also necessary to this justification.

In the polytomous Rasch model, a score of x on a given item implies that an individual has simultaneously surpassed x thresholds below a certain region on the continuum, and failed to surpass the remaining m - x thresholds above that region. In order for this to be possible, the thresholds must be in their natural order, as shown in the example of Figure 1. Disordered threshold estimates indicate a failure to construct an assessment context in which classifications represented by successive scores reflect increasing levels of the latent trait. For example, consider a situation in which there are two thresholds, and in which the estimate of the second threshold is lower on the continuum than the estimate of the first threshold. If the locations are taken literally, classification of a person into category 1 implies that the person's location simultaneously surpasses the second threshold but fails to surpass the first threshold. In turn, this implies a response pattern {0,1}, a pattern which does not belong to the subspace of patterns that is intrinsic to the structure of the model, as described above.
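A small numerical check of the relationship described above (two thresholds; the person location and threshold values are arbitrary assumed numbers), in Python:

```python
import numpy as np

beta, tau1, tau2 = 0.3, -0.8, 0.6   # assumed person location and two thresholds

def p(tau):
    """Dichotomous Rasch probability of surpassing a single threshold."""
    return np.exp(beta - tau) / (1 + np.exp(beta - tau))

p1, p2 = p(tau1), p(tau2)
# Probabilities of the latent patterns in the Guttman subspace, under independence
guttman = {"00": (1 - p1) * (1 - p2), "10": p1 * (1 - p2), "11": p1 * p2}
cond_10 = guttman["10"] / sum(guttman.values())   # P({1,0} | pattern in subspace)

# Polytomous Rasch model probability of a score of 1
terms = [1.0, np.exp(beta - tau1), np.exp(2 * beta - tau1 - tau2)]
prm_1 = terms[1] / sum(terms)

print(cond_10, prm_1)   # the two values coincide
```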

Polytomous Rasch model When threshold estimates are disordered, the estimates cannot therefore be taken literally; rather the disordering, in itself, inherently indicates that the classifications do not satisfy criteria that must logically be satisfied in order to justify the use of successive integer scores as a basis for measurement. To emphasise this point, Andrich (2005) uses an example in which grades of fail, pass, credit, and distinction are awarded. These grades, or classifications, are usually intended to represent increasing levels of attainment. Consider a person A, whose location on the latent continuum is at the threshold between regions on the continuum at which a pass and credit are most likely to be awarded. Consider also another person B, whose location is at the threshold between the regions at which a credit and distinction are most likely to be awarded. In the example considered by Andrich (2005, p.25), disordered thresholds would, if taken literally, imply that the location of person B (at the pass/credit threshold) is higher than that of person A (at the credit/distinction threshold). That is, taken literally, the disordered threshold locations would imply that a person would need to demonstrate a higher level of attainment to be at the pass/credit threshold than would be needed to be at the credit/distinction threshold. Clearly, this disagrees with the intent of such a grading system. The disordering of the thresholds would, therefore, indicate that the manner in which grades are being awarded is not in agreement with the intention of the grading system. That is, the disordering would indicate that the hypothesis implicit in the grading system - that grades represent ordered classifications of increasing performance is not substantiated by the structure of the empirical data.

255

References
Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Andrich, D. (2005). The Rasch model explained. In Sivakumar Alagumalai, David D. Curtis, and Njora Hungi (Eds.), Applied Rasch Measurement: A book of exemplars. Springer-Kluwer. Chapter 3, 308-328.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
Wright, B.D. & Masters, G.N. (1982). Rating Scale Analysis. Chicago: MESA Press. (Available from the Institute for Objective Measurement.)

External links
Disordered thresholds and item information [1]
Category Disordering and Threshold Disordering [2]
Andrich on disordered thresholds and 'steps' [3]
Directory of Rasch Software - freeware and paid [4]
Institute for Objective Measurement [5]
Rasch analysis [6]
Rasch Model in Stata [7]


References
[1] http://www.rasch.org/rmt/rmt202a.htm
[2] http://www.rasch.org/rmt/rmt131a.htm
[3] http://www.rasch.org/rmt/rmt1239.htm
[4] http://www.rasch.org/software.htm
[5] http://www.rasch.org/
[6] http://www.rasch-analysis.com/
[7] http://www.stata.com/support/faqs/stat/rasch.html

Progress testing
Progress tests are longitudinal, feedback-oriented educational assessment tools for the evaluation of development and sustainability of cognitive knowledge during a learning process. A progress test is a written knowledge exam (usually involving multiple choice questions) that is usually administered to all students in a program at the same time and at regular intervals (usually twice to four times yearly) throughout the entire academic program. The test samples the complete knowledge domain expected of new graduates on completion of their course, regardless of the year level of the student. The differences between students' knowledge levels show in the test scores; the further a student has progressed in the curriculum, the higher the scores. As a result, these scores provide a longitudinal, repeated-measures, curriculum-independent assessment of the objectives (in knowledge) of the entire programme.[1]

History
Since its inception in the late 1970s, independently at both Maastricht University [1] and the University of Missouri–Kansas City [2], the progress test of applied knowledge has been increasingly used in medical and health sciences programs across the globe. Progress tests are now well established in both undergraduate and postgraduate medical education, and are used both formatively and summatively.

Use in academic programs


The progress test is currently used by national progress test consortia in the United Kingdom,[3] Italy, the Netherlands,[4] and Germany (including Austria),[5] and in individual schools in Africa,[6] Saudi Arabia,[7] South East Asia,[8] the Caribbean, Australia, New Zealand, Sweden, Finland, the UK, and the USA.[9] The National Board of Medical Examiners in the USA also provides progress testing in various countries.[10][11] The feasibility of an international approach to progress testing has recently been acknowledged [12] and was first demonstrated by Albano et al.[13] in 1996, who compared test scores across German, Dutch and Italian medical schools. An international consortium has been established in Canada [12][14] involving faculties in Ireland, Australia, Canada, Portugal and the West Indies. The progress test serves several important functions in academic programs. Considerable empirical evidence from medical schools in the Netherlands, Canada, the United Kingdom and Ireland, as well as from postgraduate medical studies and from schools of dentistry and psychology, has shown that the longitudinal feature of the progress test provides a unique and demonstrable measurement of the growth and effectiveness of students' knowledge acquisition throughout their course of study.[1][12][15][16][17][18][19][20][21][22][23] As a result, this information can be consistently used for diagnostic, remedial and prognostic teaching and learning interventions. In the Netherlands, these interventions have been aided by the provision of a web-based results feedback system known as ProF,[24] in which students can compare their results with their peers across different total and subtotal score perspectives, both across and within universities.

Additionally, the longitudinal data can serve as a transparent quality assurance measure for program reviews by providing an evaluation of the extent to which a school is meeting its curriculum objectives.[1][10][25] The test also provides more reliable data for high-stakes assessment decisions by using measures of continuous learning rather than a one-shot method (Schuwirth, 2007). Inter-university progress testing collaborations provide a means of improving the cost-effectiveness of assessments by sharing a larger pool of items, item writers, reviewers, and administrators. The collaborative approach adopted by the Dutch and other consortia has enabled the progress test to become a benchmarking instrument by which to measure the quality of educational outcomes in knowledge. The success of the progress test in these ways has led to consideration of developing an international progress test.[25][26] The benefits for all main stakeholders in a medical or health sciences programme make the progress test an appealing tool to invest resources and time for inclusion in an assessment regime. This attractiveness is demonstrated by its increasingly widespread use in individual medical education institutions and inter-faculty consortia around the world, and by its use for national and international benchmarking practices.


Advantages
Progress tests provide a rich source of information: the comprehensive nature in combination with the cross-sectional and longitudinal design offers a wealth of information both for individual learners and for curriculum evaluations.[1]
Progress testing fosters knowledge retention: the repeated testing of the same comprehensive domain of knowledge means that there is no point testing facts that could be remembered if studied the night before. Long-term knowledge and knowledge retention are fostered because item content remains relevant long after the knowledge has been learned.
Progress testing removes the need for resit examinations: every new test occasion is a renewed opportunity to demonstrate growth of knowledge.
Progress testing allows early detection of high achievers: some learners perform (far) beyond the expected level of their phase in training (e.g. they might have had relevant previous training) and, depending on their performance, individual and accelerated pathways through the curriculum could be offered.
Progress testing brings stability in assessment procedures: curriculum changes, and changes in content, have no consequence for the progress test provided the end outcomes are unchanged.
Progress testing provides excellent benchmarking opportunities: progress tests are not limited to a single school nor to PBL curricula, and evaluations can easily be done to compare graduates and the effectiveness of different curriculum approaches.

Disadvantages
Naturally, there are disadvantages. The required resources for test development and scoring, and the need for a central organization, are two very important ones. Scoring,[27] psychometric procedures [28] for reducing test difficulty variation, and standard-setting procedures [29] are more complex in progress testing. Finally, progress tests do not work well in heterogeneous programs with early specialization (as in many health sciences programs). In more homogeneous programs, such as most medical programs, they work well and pay off in terms of driving learning and the use of resources.


International programs using progress testing


Information from 2010 onwards (this list may not be complete or up to date):
Netherlands Group - five medical faculties in the Netherlands (Groningen, Leiden, Maastricht, Nijmegen and VU Amsterdam) and, additionally, Ghent University in Belgium use the test
McMaster University, Canada, including the undergraduate programme and the physician assistant programme
Limerick University
Charite, Germany (Berlin Regel, Berlin reform, Witten, Aachen, Bochum, LMU Munich, Koln, Munster, Hannover, Mannheim, Regensburg; Austria: Graz, Innsbruck)
NBME 1 (Barts, St. Georges London, Leeds and Queens University, Belfast, UK)
NBME 2 (University of South Florida and Case Western Reserve University)
Southern Illinois University, Vanderbilt, University of New Mexico, Penn State, Texas Tech, Medical College of Georgia, University of Minnesota
University of Manchester School of Medicine, UK
Peninsula College of Medicine and Dentistry, UK
Swansea University, College of Medicine, UK (graduate entry)
University of Tampere, Finland
The College of Medicine at King Saud bin Abdulaziz University for Health Sciences (KSAU-HS), Saudi Arabia
Karaganda State Medical University, Kazakhstan
Otago Medical School, New Zealand
Sao Paulo City Medical School (UNICID), Brazil
University of Indonesia, Medical School
Catholic University of Mozambique
Pretoria, South Africa
CMIRA Program, Syrian-Lebanese Hospital Institute for Education and Research, Brazil
Source:[9]

References
[1] van der Vleuten CPM, Verwijnen GM, Wijnen WHFW. 1996. Fifteen years of experience with progress testing in a problem-based learning curriculum. Medical Teacher 18(2):103110. [2] Arnold L, Willoughby TL. 1990. The quarterly profile examination. Academic Medicine 65(8):515516. [3] Swanson, D. B., Holtzman, K. Z., Butler, A., Langer, M. M., Nelson, M. V., Chow, J. W. M., et al. (2010). Collaboration across the pond: The multi-school progress testing project. Medical Teacher, 32, 480-485. [4] Schuwirth, L., Bosman, G., Henning, R. H., Rinkel, R., & Wenink, A. C. G. (2010). Collaboration on progress testing in medical schools in the Netherlands. Medical Teacher, 32, 476-479. [5] Nouns, Z. M., & Georg, W. (2010). Progress testing in German speaking countries. Medical Teacher, 32, 467-470. [6] Aarts, R., Steidel, K., Manuel, B. A. F., & Driessen, E. W. (2010). Progress testing in resource-poor countries: A case from Mozambique. Medical Teacher, 32, 461-463. [7] Al Alwan, I., Al-Moamary, M., Al-Attas, N., Al Kushi, A., ALBanyan, E., Zamakhshary, M., et al. (2011). The progress test as a diagnostic tool for a new PBL curriculum. Education for Health(December, Article No. 493). [8] Mardiastuti, H. W., & Werdhani, R. A. (2011). Grade point average, progress test, and try outs's test as tools for curriculum evaluation and graduates' performance prediciton at the national baord examination. Journal of Medicine and Medical Sciences, 2(12), 1302-1305. [9] Freeman, A., van der Vleuten, C., Nouns, Z., & Ricketts, C. (2010). Progress testing internationally. Medical Teacher, 32, 451-455. [10] De Champlain, A., Cuddy, M. M., Scoles, P. V., Brown, M., Swanson, D. B., Holtzman, K., et al. (2010). Progress testing in clinical science education: Results of a pilot project between the National Board of Medical Examiners and a US medical School. Medical Teacher, 32, 503-508. [11] International Foundations of Medicine (2011). Retrieved 20 July 2011, from http:/ / www. nbme. org/ Schools/ iFoM/ index. html

[12] Finucane, P., Flannery, D., Keane, D., & Norman, G. (2010). Cross-institutional progress testing: Feasibility and value to a new medical school. Medical Education, 44, 184-186. [13] Albano, M. G., Cavallo, F., Hoogenboom, R., Magni, F., Majoor, G., Manenti, F., et al. (1996). An international comparison of knowledge levels of medical students: The Maastricht progress test. Medical Education, 30, 239-245. [14] International Partnership for Progress Testing (2011). Retrieved 18 July 2011, from http:/ / ipptx. org/ [15] Bennett, J., Freeman, A., Coombes, L., Kay, L., & Ricketts, C. (2010). Adaptation of medical progress testing to a dental setting. Medical Teacher, 32, 500-502. [16] Boshuizen, H. P. A., van der Vleuten, C. P. M., Schmidt, H., & Machiels-Bongaerts, M. (1997). Measuring knowledge and clinical reasoning skills in a problem-based curriculum. Medical Education, 31, 115-121. [17] Coombes, L., Ricketts, C., Freeman, A., & Stratford, J. (2010). Beyond assessment: Feedback for individuals and institutions based on the progress test. Medical Teacher, 32, 486-490. [18] Dijksterhuis, M. G. K., Scheele, F., Schuwirth, L. W. T., Essed, G. G. M., & Nijhuis, J. G. (2009). Progress testing in postgraduate medical education. Medical Teacher, 31, e464-e468. [19] Freeman, A., & Ricketts, C. (2010). Choosing and designing knowledge assessments: Experience at a new medical school. Medical Teacher, 32, 578-581. [20] Schaap, L., Schmidt, H., & Verkoeijen, P. J. L. (2011). Assessing knowledge growth in a psychology curriculum: which students improve most? Assessment & Evaluation in Higher Education, 1-13. [21] van der Vleuten, C. P. M., Verwijnen, G. M., & Wijnen, W. H. F. W. (1996). Fifteen years of experience with progress testing in a problem-based learning curriculum. Medical Teacher, 18(2), 103-109. [22] van Diest, R., van Dalen, J., Bak, M., Schruers, K., van der Vleuten, C., Muijtjens, A. M. M., et al. (2004). Growth of knowledge in psychiatry and behavioural sciences in a problem-based learning curriculum. Medical Education, 38, 1295-1301. [23] Verhoeven, B. H., Verwijnen, G. M., Scherpbier, A. J. J. A., & van der Vleuten, C. P. M. (2002). Growth of medical knowledge. Medical Education, 36, 711-717. [24] Muijtjens, A. M. M., Timmermans, I., Donkers, J., Peperkamp, R., Medema, H., Cohen-Schotanus, J., et al. (2010). Flexible electronic feedback using the virtues of progress testing. Medical Teacher, 32, 491-495. [25] Verhoeven, B. H., Snellen-Balendong, H. A. M., Hay, I. T., Boon, J. M., Van Der Linde, M. J., Blitz-Lindeque, J. J., et al. (2005). The versatility of progress testing assessed in an international context: a start for benchmarking global standardization? Medical Teacher, 27(6), 514-520. [26] Schauber, S., & Nouns, Z. B. (2010). Using the cumulative deviation method for cross-institutional benchmarking in the Berlin progress test. Medical Teacher, 32, 471-475. [27] Muijtjens AM, Mameren HV, Hoogenboom RJ, Evers JL, van der Vleuten CP. 1999. The effect of a dont know option on test scores: Number-right and formula scoring compared. Medical Education 33(4):267275. [28] Shen L. 2000. Progress testing for postgraduate medical education: A four year experiment of American College of Osteopathic Surgeons Resident Examinations. Advances in Health Sciences Education: Theory and Practice 5(2):117129 [29] Verhoeven BH, Snellen-Balendong HA, Hay IT, Boon JM, van der Linde MJ, Blitz-Lindeque JJ, Hoogenboom RJI, Verwijnen GM, Wijnen WHFW, Scherpbier AJJA, et al. 2005. 
The versatility of progress testing assessed in an international context: A start for benchmarking global standardization? Medical Teacher 27(6):514520


External links
Progress Test Medicine, Universitätsmedizin Berlin (http://ptm.charite.de/en/)
Interuniversity Progress Test Medicine, the Netherlands (http://www.ivtg.nl/en/node/69)
Academic Medicine (http://journals.lww.com/academicmedicine/pages/default.aspx) (Subscription)
Advances in Health Sciences Education (http://www.springer.com/education/journal/10459) (Subscription)
Medical Education (http://www.mededuc.com/) (Subscription)
Medical Teacher (http://www.medicalteacher.org/) (Subscription)


Projective test
Diagnostics classification: MeSH D011386 [1]

In psychology, a projective test is a personality test designed to let a person respond to ambiguous stimuli, presumably revealing hidden emotions and internal conflicts. This is sometimes contrasted with a so-called "objective test", in which responses are analyzed according to a universal standard (for example, a multiple choice exam). The responses to projective tests are content analyzed for meaning rather than being based on presuppositions about meaning, as is the case with objective tests. Projective tests have their origins in psychoanalytic psychology, which argues that humans have conscious and unconscious attitudes and motivations that are beyond or hidden from conscious awareness.

Theory
The general theoretical position behind projective tests is that whenever a specific question is asked, the response will be consciously-formulated and socially determined. These responses do not reflect the respondent's unconscious or implicit attitudes or motivations. The respondent's deep-seated motivations may not be consciously recognized by the respondent or the respondent may not be able to verbally express them in the form demanded by the questioner. Advocates of projective tests stress that the ambiguity of the stimuli presented within the tests allow subjects to express thoughts that originate on a deeper level than tapped by explicit questions. Projective tests lost some of their popularity during the 1980s and 1990s in part because of the overall loss of popularity of the psychoanalytic method and theories. Despite this, they are still used quite frequently.

Projective Hypothesis
This holds that an individual puts structure on an ambiguous situation in a way that is consistent with their own conscious and unconscious needs. It is an indirect method: the testee is talking about something other than himself or herself.
Reduces the temptation to fake
Doesn't depend as much on verbal abilities
Taps both conscious and unconscious traits
Focus is on the clinical perspective rather than the normative, though norms have been developed over the years [2]

Common variants
Rorschach
The best known and most frequently used projective test is the Rorschach inkblot test, in which a subject is shown a series of ten irregular but symmetrical inkblots, and asked to explain what they see.[] The subject's responses are then analyzed in various ways, noting not only what was said, but the time taken to respond, which aspect of the drawing was focused on, and how single responses compared to other responses for the same drawing. For example, if someone consistently sees the images as threatening and frightening, the tester might infer that the subject may suffer from paranoia.

Projective test

261

Holtzman Inkblot Test


This is a variation of the Rorschach test. Its main differences lie in its objective scoring criteria as well as limiting subjects to one response per inkblot (to avoid variable response productivity). Different variables such as reaction time are scored for an individual's response upon seeing an inkblot.[3]

Thematic apperception test


Another popular projective test is the Thematic Apperception Test (TAT) in which an individual views ambiguous scenes of people, and is asked to describe various aspects of the scene; for example, the subject may be asked to describe what led up to this scene, the emotions of the characters, and what might happen afterwards. The examiner then evaluates these descriptions, attempting to discover the conflicts, motivations and attitudes of the respondent. In the answers, the respondent "projects" their unconscious attitudes and motivations into the picture, which is why these are referred to as "projective tests."

Draw-A-Person test
The Draw-A-Person test requires the subject to draw a person. The results are based on a psychodynamic interpretation of the details of the drawing, such as the size, shape and complexity of the facial features, clothing and background of the figure. As with other projective tests, the approach has very little demonstrated validity, and there is evidence that therapists may attribute pathology to individuals who are merely poor artists. A similar class of techniques is kinetic family drawing.

Criticisms of drawing tests
Among the plausible but empirically untrue relations that have been claimed:
Large size = emotional expansiveness or acting out
Small size = emotional constriction, withdrawal, or timidity
Erasures around male buttocks; long eyelashes on males = homoeroticism
Overworked lines = tension, aggression
Distorted or omitted features = conflicts related to that feature
Large or elaborate eyes = paranoia [4]

Animal Metaphor Test


The Animal Metaphor test consists of a series of creative and analytical prompts in which the person filling out the test is asked to create a story and then interpret its personal significance. Unlike conventional projective tests, the Animal Metaphor Test works as both a diagnostic and therapeutic battery. Unlike the Rorschach test and TAT, the Animal Metaphor is premised on self-analysis via self-report questions. The test combines facets of art therapy, cognitive behavioral therapy, and insight therapy, while also providing a theoretical platform of behavioral analysis. The test has been used widely as a clinical tool, as an educational assessment, and in human resource selection. The test is accompanied by an inventory, The Relational Modality Evaluation Scale, a self-report measure that targets individuals' particular ways of resolving conflict and ways of dealing with relational stress. These tests were developed by Dr. Albert J Levis at the Center for the Study of Normative Behavior in Hamden, CT, a clinical training and research center.

Projective test

262

Sentence completion test


Sentence completion tests require the subject to complete sentence "stems" with their own words. The subject's response is considered to be a projection of their conscious and/or unconscious attitudes, personality characteristics, motivations, and beliefs.

Picture Arrangement Test


Created by Silvan Tomkins, this psychological test consists of 25 sets of 3 pictures which the subject must arrange into a sequence that they "feel makes the best sense". The reliability of this test has been disputed, however. For example, patients suffering from schizophrenia have been found to score as more "normal" than patients with no such mental disorders.[5]

Other picture tests include:
Thompson version, CAT (animals) and CAT-H (humans)
Senior AT
Blacky Pictures Test - dogs
Picture Story Test - adolescents
Education Apperception Test - attitudes towards learning
Michigan Picture Test - children 8-14
TEMAS - Hispanic children
Make-A-Picture Story - make own pictures from figures, 6 years and up [2]

Word Association Test


Word association testing is a technique developed by Carl Jung to explore complexes in the personal unconscious. Jung came to recognize the existence of groups of thoughts, feelings, memories, and perceptions, organized around a central theme, that he termed psychological complexes. This discovery was related to his research into word association, a technique whereby words presented to patients elicit other word responses that reflect related concepts in the patient's psyche, thus providing clues to their unique psychological make-up.[6][7][8]

Graphology
A lesser-known projective test is graphology or handwriting analysis. Clinicians who assess handwriting to derive tentative information about the writer's personality attend to and analyze the writing's organization on the page, movement style and use of distinct letterforms.[9]

Statistical debate
From the perspective of statistical validity, psychometrics and positivism, criticisms of projective tests and depth psychology tests usually centre on the well-known discrepancy between statistical validity and clinical validity:[10] the tests rely heavily on clinical judgement, lack statistical reliability and statistical validity, and many have no standardized criteria to which results may be compared, although this is not always the case. These tests are used frequently, though the scientific evidence is sometimes debated. There have been many empirical studies based on projective tests (including the use of standardized norms and samples), particularly on the more established tests. The criticism of a lack of scientific evidence to support them, combined with their continued popularity, has been referred to as the "projective paradox". Responding to the statistical criticism of his projective test, Leopold Szondi said that his test actually discovers "fate and existential possibilities hidden in the inherited familial unconscious and the personal unconscious, even those hidden because never lived through or because they have been rejected. Is any statistical method able to span, understand and integrate mathematically all these possibilities? I deny this categorically."[11]

Projective test

263

Concerns with Projective Tests


Assumptions
The more unstructured the stimuli, the more examinees reveal about their personality.
Projection is greater to stimulus material that is similar to the examinee.
Every response provides meaning for personality analysis.
There is an "unconscious"; subjects are unaware of what they disclose.

Situation Variables
Age of examiner
Specific instructions
Subtle reinforcement cues
Setting and privacy [12]

Terminology
The terms "objective test" and "projective test" have recently come under criticism in the Journal of Personality Assessment. The more descriptive "rating scale or self-report measures" and "free response measures" are suggested, rather than the terms "objective tests" and "projective tests," respectively.[13]

Uses in marketing
Projective techniques, including TATs, are used in qualitative marketing research, for example to help identify potential associations between brand images and the emotions they may provoke. In advertising, projective tests are used to evaluate responses to advertisements. The tests have also been used in management to assess achievement motivation and other drives, in sociology to assess the adoption of innovations, and in anthropology to study cultural meaning. The application of responses is different in these disciplines than in psychology, because the responses of multiple respondents are grouped together for analysis by the organisation commissioning the research, rather than interpreting the meaning of the responses given by a single subject.

References
[1] http:/ / www. nlm. nih. gov/ cgi/ mesh/ 2011/ MB_cgi?field=uid& term=D011386 [2] Projective Methods for Personality Assessment. (n.d.). Retrieved November 21, 2012, from http:/ / www. neiu. edu/ ~mecondon/ proj-lec. htm. [3] Gamble, K. R. (1972). The holtzman inkblot technique. Psychological Bulletin, 77(3), 172-194. [4] Projective Tests. (n.d.) Retrieved November 21, 2012 from http:/ / web. psych. ualberta. ca/ ~chrisw/ L12ProjectiveTests/ L12ProjectiveTests. pdf [5] Piotrowski, Z. (1958-01-01). The Tomkins-Horn Picture Arrangement Test. The journal of nervous and mental disease, 126(1), 106. [6] Merriam-Webster. (n.d.). Retrieved November 21, 2012, from http:/ / www. merriam-webster. com/ dictionary/ word-association%20test [7] Spiteri, S. P. (n.d.). "Word association testing and thesaurus construction." Retrieved November 21,2012, from Dalhousie University, School of Library and Information Studies website: http:/ / libres. curtin. edu. au/ libres14n2/ Spiteri_final. htm [8] Schultz, D. P., & Schultz, S. E. (2000). "The history of modern psychology." Seventh edition. Harcourt College Publishers. [9] Poizner, Annette (2012). Clinical Graphology: An Interpretive Manual for Mental Health Practitioners. Springfield, Illinois: Charles C Thomas Publishers. [10] Leopold Szondi (1960) Das zweite Buch: Lehrbuch der Experimentellen Triebdiagnostik. Huber, Bern und Stuttgart, 2nd edition. Ch.27, From the Spanish translation, B)II Las condiciones estadisticas, p.396. Quotation: [11] Szondi (1960) Das zweite Buch: Lehrbuch der Experimentellen Triebdiagnostik. Huber, Bern und Stuttgart, 2nd edition. Ch.27, From the Spanish translation, B)II Las condiciones estadisticas, p.396 [12] Shatz, Phillip. (n.d.) "Projective personality testing: Psychological testing." Retrieved November 21, 2012, from Staint Joseph's University: Department of Psychology Web site: http:/ / schatz. sju. edu/ intro/ 1001lowfi/ personality/ projectiveppt/ sld001. htm

[13] Meyer, Gregory J. and Kurtz, John E.(2006) 'Advancing Personality Assessment Terminology: Time to Retire "Objective" and "Projective" As Personality Test Descriptors', Journal of Personality Assessment, 87: 3, 223 225


Footnotes
Theodor W. Adorno, et al. (1964). The Authoritarian Personality. New York: John Wiley & Sons. Lawrence Soley & Aaron Lee Smith (2008). Projective Techniques for Social Science and Business Research. Milwaukee: The Southshore Press.


Prometric
Type: Subsidiary
Founded: 1990
Headquarters: Baltimore
Parent: Educational Testing Service
Website: Prometric [1]

Prometric is a U.S. company in the test administration industry. Prometric operates a test center network composed of over 10,000 sites in 160 countries. Many examinations are administered at Prometric sites including those from Nationwide Mortgage Licensing System and Registry, Microsoft, IBM, Apple, the Common Admission Test (CAT) of the IIMs, the European Personnel Selection Office, the Medical College Admission Test, USMLE, the Diplomate of National Board-Common Entrance Test of National Board of Examinations, the Uniform Certified Public Accountant Examination, Architect Registration Examination, and the USPTO registration examination. Prometric's corporate headquarters are located in Canton (Baltimore, Maryland) in the United States.

History
Prometric's computerized testing centers were originally founded by Drake International in 1990 under the name Drake Prometric.[2] In 1995, Drake Prometric L.P. was sold to Sylvan Learning in a cash and stock deal worth approximately $44.5 million.[3] The acquired business was renamed Sylvan Prometric, then sold to Thomson Corporation in 2000.[4] The Thomson Corporation announced its desire to sell Prometric in the fall of 2006, and Educational Testing Service announced its plans to acquire it.[5] On Monday, October 15, 2007, Educational Testing Service (ETS) closed its acquisition of Prometric from the Thomson Corporation.[6] Prometric is currently a wholly owned, independently operated subsidiary of ETS, allowing ETS to maintain non-profit status.

Business
Prometric sells a range of services, including test development, test delivery, and data management capabilities. Prometric delivers and administers tests to approximately 500 clients in the academic, professional, government, corporate and information technology markets. While there are 3000 Prometric test centers across the world,[7] including every U.S. state and territory (except Wake Island), whether a particular test can be taken outside the U.S. depends on the testing provider. For example, despite the fact that Prometric test centers exist worldwide, some exams are only offered in the country where the client program exists. The locations where a test is offered, as well as specific testing procedures for the day of the exam, are dictated by the client. In 2009, the company was involved in a controversy due to widespread technical problems on one of India's MBA entrance exams, the Common Admission Test.[8] While Prometric claims that the problems were due to common viruses,[9] this claim was disputed since these tests were not internet-based and were rather offered on local area networks within India, where the virus was pre-existent.[10] Due to this controversy Prometric allowed 8000 students to reappear for the examination.[11]


International
In the Republic of Ireland, Prometric's local subsidiary is responsible for administering the Driver Theory Test.[12]

References
[1] [2] [3] [4] [5] http:/ / www. prometric. com Drake International early years (http:/ / celebratewithdrake. com/ entrepreneurial) Sylvan to acquire test firm (http:/ / articles. baltimoresun. com/ 1995-07-22/ business/ 1995203052_1_sylvan-drake-financial-targets) Thomson Acquires Prometric (http:/ / www. encyclopedia. com/ doc/ 1G1-58958755. html) ETS news ETS to Acquire Prometric (http:/ / www. etsemea-customassessments. org/ cas-en/ media/ press-releases/ ets-to-acquire-thomson-prometric/ ) [6] (http:/ / thomsonreuters. com/ content/ press_room/ corp/ corp_news/ 217831) [7] QAI India Ltd Announces A Partnership with Prometric (http:/ / www. newswiretoday. com/ news/ 38336/ ) [8] Online CAT Puts Prometric in Mousetrap (http:/ / news. ciol. com/ News/ News-Reports/ Online-CAT-puts-Prometric-in-mousetrap/ 301109128324/ 0/ ) [9] Time of India - Viruses Cause CAT Failure (http:/ / timesofindia. indiatimes. com/ india/ IIM-A-names-2-viruses-that-caused-CAT-chaos/ articleshow/ 5286411. cms) [10] CAT Server Crash: Prometric's Virus Theory Rubbished (http:/ / businesstechnology. in/ tools/ news/ 2009/ 11/ 30/ CAT-server-crash-Prometric-s-virus-theory-rubbished. html) [11] Retest for 8000 students (http:/ / www. catiim. in/ notice_17122009. html) [12] http:/ / www. theorytest. ie/

External links
Prometric website (http://www.prometric.com/)

Psychological statistics

Psychological statistics is the application of statistics to psychology. Some of the more common applications include:
1. psychometrics
2. learning theory
3. perception
4. human development
5. abnormal psychology
6. personality tests
7. psychological tests

Some of the more commonly used statistical tests in psychology are listed below; a short worked illustration follows the list.

Parametric tests:
Student's t-test
analysis of variance (ANOVA)
ANCOVA (analysis of covariance)
MANOVA (multivariate analysis of variance)
regression analysis
linear regression
hierarchical linear modelling
correlation
Pearson product-moment correlation coefficient
Spearman's rank correlation coefficient

Non-parametric tests:
chi-square
Mann–Whitney U
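A minimal sketch (with made-up data) of how several of these tests can be run with scipy.stats in Python:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(50, 10, 30)   # hypothetical scores, condition A
group_b = rng.normal(55, 10, 30)   # hypothetical scores, condition B
group_c = rng.normal(60, 10, 30)   # hypothetical scores, condition C

print(stats.ttest_ind(group_a, group_b))             # Student's t-test (independent samples)
print(stats.f_oneway(group_a, group_b, group_c))     # one-way ANOVA
print(stats.pearsonr(group_a, group_b))              # Pearson product-moment correlation
print(stats.spearmanr(group_a, group_b))             # Spearman's rank correlation
print(stats.mannwhitneyu(group_a, group_b))          # Mann-Whitney U (non-parametric)
print(stats.chi2_contingency([[20, 15], [10, 25]]))  # chi-square on a contingency table
```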


References
Cohen, B.H. (2007) Explaining Psychological Statistics, 3rd Edition, Wiley. ISBN 978-0-470-00718-1 Howell, D. (2009) Statistical Methods for Psychology, International Edition, Wadsworth. ISBN 0-495-59785-6

External links
Charles McCreery's tutorials on chi-square, probability and Bayes' theorem for Oxford University psychology students [1]
Matthew Rockloff's tutorials on t-tests, correlation and ANOVA [2]

References
[1] http://www.celiagreen.com/charlesmccreery.html
[2] http://psychologyaustralia.homestead.com/index.htm

Psychometric function
A psychometric function describes the relationship between a parameter of a physical stimulus and the subjective responses of the subject. The psychometric function is a special case of the generalized linear model (GLM): the probability of a response is related to a linear combination of predictors by means of a sigmoid link function (e.g. probit, logit, etc.). Depending on the number of alternative choices, psychophysical experimental paradigms are classified as simple forced choice (also known as the yes-no task), two-alternative forced choice (2AFC), and n-alternative forced choice. The number of alternatives in the experiment determines the lower asymptote of the function.

Two different types of psychometric plots are in common use. One plots the percentage of correct responses (or a similar value) on the y-axis and the physical parameter on the x-axis. If the stimulus parameter is very far towards one end of its possible range, the person will always be able to respond correctly. Towards the other end of the range, the person never perceives the stimulus properly and therefore the probability of correct responses is at chance level. In between, there is a transition range where the subject has an above-chance rate of correct responses, but does not always respond correctly. The inflection point of the sigmoid function, or the point at which the function reaches the middle between the chance level and 100%, is usually taken as the sensory threshold.

The second type plots the proportion of "yes" responses on the y-axis, and therefore will have a sigmoidal shape covering the range [0, 1], rather than merely [0.5, 1]; we move from a subject being certain that the stimulus was not of the particular type requested to certainty that it was. This second way of plotting psychometric functions is often preferable, as it is more easily amenable to principled quantitative analysis using tools such as probit analysis (fitting of cumulative Gaussian distributions). However, it also has important drawbacks. First, the threshold estimation is based only on p(yes), namely on "Hit" in signal detection theory terminology. Second, and consequently, it is not bias-free or criterion-free. Third, the threshold is identified with p(yes) = .5, which is just a conventional and arbitrary choice.
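As a brief sketch of how such a function can be fitted in practice (hypothetical yes/no data, a cumulative Gaussian link, and scipy for the fitting; none of these choices are prescribed by the article):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Hypothetical detection data: stimulus intensities and observed proportion of "yes" responses
intensity = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
p_yes     = np.array([0.05, 0.15, 0.40, 0.65, 0.90, 0.97])

def psychometric(x, mu, sigma):
    """Cumulative Gaussian (probit) psychometric function; mu is the 50% point."""
    return norm.cdf(x, loc=mu, scale=sigma)

(mu, sigma), _ = curve_fit(psychometric, intensity, p_yes, p0=[1.5, 0.5])
print("estimated threshold, i.e. p(yes) = 0.5 at intensity:", round(mu, 3))

# For an n-alternative forced-choice task the lower asymptote is 1/n, e.g. for 2AFC:
def psychometric_2afc(x, mu, sigma):
    return 0.5 + 0.5 * norm.cdf(x, loc=mu, scale=sigma)
```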

A common example is visual acuity testing with an eye chart. The person sees symbols of different sizes (the size is the relevant physical stimulus parameter) and has to decide which symbol it is. Usually, there is one line on the chart where a subject can identify some, but not all, symbols. This is equal to the transition range of the psychometric function and the sensory threshold corresponds to visual acuity. (Strictly speaking, a typical optometric measurement does not exactly yield the sensory threshold due to biases in the standard procedure.)


Psychometrics of racism
Psychometrics of racism is an emerging field that aims to measure the incidence and impacts of racism on the psychological well-being of people of all races. At present, there are few instruments that attempt to capture the experience of racism in all of its complexity.[1]

Self-reported inventories
The Schedule of Racist Events (SRE) is a questionnaire for assessing the frequency of racial discrimination in the lives of African Americans, created in 1998 by Hope Landrine and Elizabeth A. Klonoff. The SRE is an 18-item self-report inventory that assesses the frequency of specific racist events in the past year and in one's entire life, and measures to what extent this discrimination was stressful.[2]

Other psychometric tools for assessing the impacts of racism include:[3]
- The Racism Reaction Scale (RRS)
- Perceived Racism Scale (PRS)
- Index of Race-Related Stress (IRRS)
- Racism and Life Experience Scale-Brief Version (RaLES-B)
- Telephone-Administered Perceived Racism Scale (TPRS)[4]

Physiological metrics
In a summary of recent research, Jules P. Harrell, Sadiki Hall, and James Taliaferro describe how a growing body of research has explored the impact of encounters with racism or discrimination on physiological activity. Several of the studies suggest that higher blood pressure levels are associated with a tendency not to recall or report occurrences identified as racist and discriminatory; the studies describe an association rather than a demonstrated causal direction. Investigators have also reported that physiological arousal is associated with laboratory analogues of ethnic discrimination and mistreatment.[5]

References
[1] The perceived racism scale: a multidimensional assessment of the experience of white racism among African Americans. (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8882844&dopt=Citation)
[2] The Schedule of Racist Events: A Measure of Racial Discrimination and a Study of Its Negative Physical and Mental Health Consequences. (http://eric.ed.gov/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel=RecordDetails&ERICExtSearch_SearchValue_0=EJ528856&ERICExtSearch_SearchType_0=eric_accno&objectId=0900000b8002502e)
[3] Assessing the Stressful Effects of Racism: A Review of Instrumentation (http://jbp.sagepub.com/cgi/content/abstract/24/3/269)
[4] Development and Reliability of a Telephone-Administered Perceived Racism Scale (TPRS): A Tool for Epidemiological Use (http://apt.allenpress.com/aptonline/?request=get-abstract&issn=1049-510X&volume=011&issue=02&page=0251)
[5] Physiological Responses to Racism and Discrimination: An Assessment of the Evidence (http://www.ajph.org/cgi/content/abstract/93/2/243)

Quantitative marketing research



Quantitative marketing research is the application of quantitative research techniques to the field of marketing. It has roots in both the positivist view of the world, and the modern marketing viewpoint that marketing is an interactive process in which both the buyer and seller reach a satisfying agreement on the "four Ps" of marketing: Product, Price, Place (location) and Promotion. As a social research method, it typically involves the construction of questionnaires and scales. People who respond (respondents) are asked to complete the survey. Marketers use the information so obtained to understand the needs of individuals in the marketplace, and to create strategies and marketing plans.

Typical general procedure


Simply put, there are five major and important steps involved in the research process:
1. Defining the problem.
2. Research design.
3. Data collection.
4. Data analysis.
5. Report writing and presentation.

A brief discussion of these steps:
1. Problem audit and problem definition - What is the problem? What are the various aspects of the problem? What information is needed?
2. Conceptualization and operationalization - How exactly do we define the concepts involved? How do we translate these concepts into observable and measurable behaviours?
3. Hypothesis specification - What claim(s) do we want to test?
4. Research design specification - What type of methodology to use? Examples: questionnaire, survey.
5. Question specification - What questions to ask? In what order?
6. Scale specification - How will preferences be rated?
7. Sampling design specification - What is the total population? What sample size is necessary for this population? What sampling method to use? Examples: probability sampling (cluster sampling, stratified sampling, simple random sampling, multistage sampling, systematic sampling) and nonprobability sampling (convenience sampling, judgement sampling, purposive sampling, quota sampling, snowball sampling, etc.).
8. Data collection - Use mail, telephone, internet, mall intercepts.
9. Codification and re-specification - Make adjustments to the raw data so they are compatible with statistical techniques and with the objectives of the research. Examples: assigning numbers, consistency checks, substitutions, deletions, weighting, dummy variables, scale transformations, scale standardization.
10. Statistical analysis - Perform various descriptive and inferential techniques (see below) on the raw data. Make inferences from the sample to the whole population. Test the results for statistical significance.
11. Interpret and integrate findings - What do the results mean? What conclusions can be drawn? How do these findings relate to similar research?
12. Write the research report - The report usually has headings such as: 1) executive summary; 2) objectives; 3) methodology; 4) main findings; 5) detailed charts and diagrams. Present the report to the client in a 10-minute presentation. Be prepared for questions.

The design step may involve a pilot study in order to discover any hidden issues. The codification and analysis steps are typically performed by computer, using statistical software. The data collection steps can in some instances be automated, but often require significant manpower to undertake. Interpretation is a skill mastered only by experience. A small sketch of the codification step follows.
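As a hedged sketch of the codification and re-specification step (the column names, coding choices, and data below are invented for illustration, not taken from the text):

```python
# Illustrative codification sketch with made-up survey responses.
import pandas as pd

raw = pd.DataFrame({
    "respondent": [1, 2, 3, 4],
    "region": ["north", "south", "south", "east"],   # categorical answer
    "satisfaction": [7, 9, None, 6],                  # 1-10 rating with one missing value
})

# Consistency check / substitution: fill the missing rating with the median.
raw["satisfaction"] = raw["satisfaction"].fillna(raw["satisfaction"].median())

# Dummy variables for the categorical question.
coded = pd.get_dummies(raw, columns=["region"], prefix="region")

# Scale standardization: convert the rating to z-scores.
coded["satisfaction_z"] = (
    (coded["satisfaction"] - coded["satisfaction"].mean()) / coded["satisfaction"].std()
)

print(coded)
```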


Statistical analysis
The data acquired for quantitative marketing research can be analysed by almost any of the range of techniques of statistical analysis, which can be broadly divided into descriptive statistics and statistical inference. An important set of techniques is that related to statistical surveys. In any instance, an appropriate type of statistical analysis should take account of the various types of error that may arise, as outlined below.

Reliability and validity


Research should be tested for reliability, generalizability, and validity. Generalizability is the ability to make inferences from a sample to the population.

Reliability is the extent to which a measure will produce consistent results.
- Test-retest reliability checks how similar the results are if the research is repeated under similar circumstances. Stability over repeated measures is assessed with the Pearson coefficient.
- Alternative forms reliability checks how similar the results are if the research is repeated using different forms.
- Internal consistency reliability checks how well the individual measures included in the research are converted into a composite measure. Internal consistency may be assessed by correlating performance on two halves of a test (split-half reliability). The value of the Pearson product-moment correlation coefficient is adjusted with the Spearman–Brown prediction formula to correspond to the correlation between two full-length tests. A commonly used measure is Cronbach's α (alpha), which is equivalent to the mean of all possible split-half coefficients. Reliability may be improved by increasing the number of items in the measure. A minimal computation of these coefficients is sketched below.

Validity asks whether the research measured what it intended to.
- Content validation (also called face validity) checks how well the content of the research relates to the variables to be studied; it seeks to answer whether the research questions are representative of the variables being researched. It is a demonstration that the items of a test are drawn from the domain being measured.
- Criterion validation checks how meaningful the research criteria are relative to other possible criteria. When the criterion is collected later, the goal is to establish predictive validity.
- Construct validation checks what underlying construct is being measured. There are three variants of construct validity: convergent validity (how well the research relates to other measures of the same construct), discriminant validity (how poorly the research relates to measures of opposing constructs), and nomological validity (how well the research relates to other variables as required by theory).
- Internal validation, used primarily in experimental research designs, checks the relation between the dependent and independent variables (i.e. did the experimental manipulation of the independent variable actually cause the observed results?).
- External validation checks whether the experimental results can be generalized.

Validity implies reliability: a valid measure must be reliable. Reliability does not necessarily imply validity, however: a reliable measure need not be valid.
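The following minimal sketch (with a made-up matrix of item scores; not data from the text) shows how the split-half coefficient, the Spearman–Brown step-up, and Cronbach's alpha relate to one another:

```python
# Minimal sketch, assuming a small invented matrix of item scores
# (rows = respondents, columns = items).
import numpy as np

scores = np.array([
    [4, 5, 4, 5, 3, 4],
    [2, 3, 2, 2, 3, 2],
    [5, 5, 4, 5, 5, 4],
    [3, 2, 3, 3, 2, 3],
    [4, 4, 5, 4, 4, 5],
], dtype=float)

# Split-half reliability: correlate totals on odd vs. even items...
half_1 = scores[:, ::2].sum(axis=1)
half_2 = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half_1, half_2)[0, 1]

# ...then step up to full test length with the Spearman-Brown formula.
r_full = 2 * r_half / (1 + r_half)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score).
k = scores.shape[1]
item_var = scores.var(axis=0, ddof=1).sum()
total_var = scores.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_var / total_var)

print(f"split-half r = {r_half:.2f}, Spearman-Brown = {r_full:.2f}, alpha = {alpha:.2f}")
```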


Types of errors
Random sampling errors:
- sample too small
- sample not representative
- inappropriate sampling method used
- random errors

Research design errors:
- bias introduced
- measurement error
- data analysis error
- sampling frame error
- population definition error
- scaling error
- question construction error

Interviewer errors:
- recording errors
- cheating errors
- questioning errors
- respondent selection error

Respondent errors:
- non-response error
- inability error
- falsification error

Hypothesis errors:
- type I error (also called alpha error): the study results lead to the rejection of the null hypothesis even though it is actually true
- type II error (also called beta error): the study results lead to the acceptance (non-rejection) of the null hypothesis even though it is actually false

List of related topics


- List of marketing topics
- List of management topics
- List of economics topics
- List of finance topics
- List of accounting topics


References
Bradburn, Norman M. and Seymour Sudman. Polls and Surveys: Understanding What They Tell Us (1988)
Converse, Jean M. Survey Research in the United States: Roots and Emergence 1890-1960 (1987), the standard history
Glynn, Carroll J., Susan Herbst, Garrett J. O'Keefe, and Robert Y. Shapiro. Public Opinion (1999) [1], textbook
Oskamp, Stuart and P. Wesley Schultz. Attitudes and Opinions (2004) [2]
Webster, James G., Patricia F. Phalen, and Lawrence W. Lichty. Ratings Analysis: The Theory and Practice of Audience Research. Lawrence Erlbaum Associates, 2000
Young, Michael L. Dictionary of Polling: The Language of Contemporary Opinion Research (1992) [3]

References
[1] http://www.questia.com/PM.qst?a=o&d=100501261
[2] http://www.questia.com/PM.qst?a=o&d=104829752
[3] http://www.questia.com/PM.qst?a=o&d=59669912

Quantitative psychology
The American Psychological Association defines quantitative psychology as "the study of methods and techniques for the measurement of human attributes, the statistical and mathematical modeling of psychological processes, the design of research studies, and the analysis of psychological data".[1] Quantitative psychology specializes in measurement, methodology, and the research design and analysis relevant to data in the social sciences.[2]

Research in quantitative psychology develops psychological theory in relation to mathematics and statistics. Psychological research requires the elaboration of existing methods and the development of new concepts, so that quantitative psychology involves much more than "applications" of statistics and mathematics.[1][3]

Quantitative psychology has two major subfields, psychometrics and mathematical psychology. Research in psychometrics develops methods of practice and analysis of psychological measurement, for example, developing a questionnaire to test memory and methods of analyzing data from that questionnaire.[4] Research in mathematical psychology develops novel mathematical models that describe psychological processes.[5]

Quantitative psychology is served by several scientific organizations. These include the Psychometric Society, Division 5 of the American Psychological Association (Evaluation, Measurement and Statistics), the Society of Multivariate Experimental Psychology, and the European Society for Methodology. Associated disciplines include statistics, mathematics, educational measurement, educational statistics, sociology, and political science. Several scholarly journals reflect the efforts of scientists in these areas, notably Psychometrika, Multivariate Behavioral Research, Structural Equation Modeling and Psychological Methods.

In August 2005, the APA expressed the need for more quantitative psychologists in the field: for every PhD awarded in the subject, there were about 2.5 quantitative psychologist position openings.[6] Currently, 23 American universities offer Ph.D. programs in quantitative psychology within their psychology departments (and additional universities offer programs that focus on but do not necessarily encompass the field).[7] There is also a comparable number of Master's-level programs in quantitative psychology in the US.[8]


References
[1] Quantitative Psychology (http://www.apa.org/research/tools/quantitative/index.aspx)
[2] Quantitative Psychology, UCLA Psychology Department: Home (http://www.psych.ucla.edu/graduate/areas-of-study/quantitative-psychology)
[3] Quantitative Psychology For Measuring The Human Attributes (http://www.psychoid.net/quantitative-psychology-for-measuring-the.html)
[4] Psychometrics
[5] Mathematical Psychology
[6] Report of the Task Force for Increasing the Number of Quantitative Psychologists (http://www.apa.org/research/tools/quantitative/quant-task-force-report.pdf), page 1. American Psychological Association. Retrieved February 15, 2012.
[7] Introduction to Quantitative Psychology (http://www.apa.org/research/tools/quantitative/index.aspx#review), page 2. American Psychological Association. Retrieved February 15, 2012.
[8] Graduate Studies in Psychology (http://www.apa.org/pubs/books/4270096.aspx)

External links
APA Division 5: Evaluation, Measurement and Statistics (http://www.apa.org/divisions/div5/) The Psychometric Society (http://www.psychometrika.org/) The Society of Multivariate Experimental Psychology (http://www.smep.org/) The European Society for Methodology (http://www.smabs.org/) Society for Mathematical Psychology (http://www.cogs.indiana.edu/socmathpsych/)

Questionnaire construction
A questionnaire is a series of questions asked to individuals to obtain statistically useful information about a given topic.[1] When properly constructed and responsibly administered, questionnaires become a vital instrument by which statements can be made about specific groups or people or entire populations. Questionnaires are frequently used in quantitative marketing research and social research. They are a valuable method of collecting a wide range of information from a large number of individuals, often referred to as respondents. Adequate questionnaire construction is critical to the success of a survey. Inappropriate questions, incorrect ordering of questions, incorrect scaling, or bad questionnaire format can make the survey valueless, as it may not accurately reflect the views and opinions of the participants. A useful method for checking a questionnaire and making sure it is accurately capturing the intended information is to pretest among a smaller subset of target respondents.

Questionnaire construction issues


- Know how (and whether) you will use the results of your research before you start. If, for example, the results won't influence your decision, or you can't afford to implement the findings, or the cost of the research outweighs its usefulness, then save your time and money; don't bother doing the research.
- The research objectives and frame of reference should be defined beforehand, including the questionnaire's context of time, budget, manpower, intrusion and privacy.
- How (randomly or not) and from where (your sampling frame) you select the respondents will determine whether you will be able to generalize your findings to the larger population.
- The nature of the expected responses should be defined and retained for interpretation of the responses, be it preferences (of products or services), facts, beliefs, feelings, descriptions of past behavior, or standards of action.

- Unneeded questions are an expense to the researcher and an unwelcome imposition on the respondents. All questions should contribute to the objective(s) of the research.
- If you "research backwards" and determine what you want to say in the report (i.e., Package A is more/less preferred by X% of the sample vs. Package B, and y% compared to Package C), then even though you don't know the exact answers yet, you will be certain to ask all the questions you need - and only the ones you need - in such a way (metrics) to write your report.
- The topics should fit the respondents' frame of reference. Their background may affect their interpretation of the questions. Respondents should have enough information or expertise to answer the questions truthfully.
- The type of scale, index, or typology to be used shall be determined. The level of measurement you use will determine what you can do with and conclude from the data. If the response option is yes/no, then you will only know how many or what percent of your sample answered yes/no. You cannot, however, conclude what the average respondent answered.
- The types of questions (closed, multiple-choice, open) should fit the statistical data analysis techniques available and your goals.
- Questions and prepared responses to choose from should be neutral as to intended outcome. A biased question or questionnaire encourages respondents to answer one way rather than another.[2] Even questions without bias may leave respondents with expectations.
- The order or "natural" grouping of questions is often relevant. Previous questions may bias later questions.
- The wording should be kept simple: no technical or specialized words. The meaning should be clear. Ambiguous words, equivocal sentence structures and negatives may cause misunderstanding, possibly invalidating questionnaire results. Double negatives should be reworded as positives.
- If a survey question actually contains more than one issue, the researcher will not know which one the respondent is answering. Care should be taken to ask one question at a time.
- The list of possible responses should be collectively exhaustive. Respondents should not find themselves with no category that fits their situation. One solution is to use a final category for "other ________".
- The possible responses should also be mutually exclusive. Categories should not overlap. Respondents should not find themselves in more than one category, for example in both the "married" category and the "single" category; there may be a need for separate questions on marital status and living situation.
- Writing style should be conversational, yet concise and accurate and appropriate to the target audience.
- Many people will not answer personal or intimate questions. For this reason, questions about age, income, marital status, etc. are generally placed at the end of the survey. This way, even if the respondent refuses to answer these "personal" questions, he/she will have already answered the research questions.
- "Loaded" questions evoke emotional responses and may skew results.
- Presentation of the questions on the page (or computer screen) and use of white space, colors, pictures, charts, or other graphics may affect respondents' interest or distract from the questions. Numbering of questions may be helpful.
- Questionnaires can be administered by research staff, by volunteers or self-administered by the respondents. Clear, detailed instructions are needed in either case, matching the needs of each audience.


Methods of collection


Method: Postal
Benefits/Cautions:
- Low cost-per-response.
- Mail is subject to postal delays, which can be substantial when posting to remote areas or under unpredictable events such as natural disasters.
- Survey participants can choose to remain anonymous.
- It is not labour intensive.

Method: Telephone
Benefits/Cautions:
- Questionnaires can be conducted swiftly.
- Rapport with respondents.
- High response rate.
- Be careful that your sampling frame (i.e., where you get the phone numbers from) doesn't skew your sample. For example, if you select the phone numbers from a phone book, you are necessarily excluding people who only have a mobile phone, those who requested an unpublished phone number, and individuals who have recently moved to the area, because none of these people will be in the book.
- Telephone interviews are more prone to social desirability biases than other modes, so they are generally not suitable for sensitive topics.[3][4]

Method: Electronic
Benefits/Cautions:
- This method has a low ongoing cost, and on most surveys costs nothing for the participants and little for the surveyors. However, initial set-up costs can be high for a customised design, due to the effort required in developing the back-end system or programming the questionnaire itself.
- Questionnaires can be conducted swiftly, without postal delays.
- Survey participants can choose to remain anonymous, though they risk being tracked through cookies, unique links and other technology.
- It is not labour intensive.
- Questions can be more detailed, as opposed to the limits of paper or telephones.[citation needed]
- This method works well if your survey contains several branching questions. Help or instructions can be dynamically displayed with the question as needed, and automatic sequencing means the computer can determine the next question, rather than relying on respondents to correctly follow skip instructions.
- Not all of the sample may be able to access the electronic form, and therefore results may not be representative of the target population.

Method: Personally administered
Benefits/Cautions:
- Questions can be more detailed and obtain a lot of comprehensive information, as opposed to the limits of paper or telephones. However, respondents are often limited by their working memory: specially designed visual cues (such as prompt cards) may help in some cases.
- Rapport with respondents is generally higher than in other modes.
- Typically a higher response rate than other modes.
- Can be extremely expensive and time consuming to train and maintain an interviewer panel. Each interview also has a marginal cost associated with collecting the data.
- Usually a convenience (vs. a statistical or representative) sample, so you cannot generalize your results. However, use of rigorous selection methods (e.g. those used by national statistical organisations) can result in a much more representative sample.

Types of questions
1. Contingency questions - A question that is answered only if the respondent gives a particular response to a previous question. This avoids asking questions of people to whom they do not apply (for example, asking men if they have ever been pregnant).
2. Matrix questions - Identical response categories are assigned to multiple questions. The questions are placed one under the other, forming a matrix with response categories along the top and a list of questions down the side. This is an efficient use of page space and of respondents' time.
3. Closed ended questions - Respondents' answers are limited to a fixed set of responses. Most scales are closed ended. Other types of closed ended questions include:
   - Yes/no questions - The respondent answers with a "yes" or a "no".
   - Multiple choice - The respondent has several options from which to choose.
   - Scaled questions - Responses are graded on a continuum (for example: rate the appearance of the product on a scale from 1 to 10, with 10 being the most preferred appearance). Examples of types of scales include the Likert scale, semantic differential scale, and rank-order scale (see scale for a complete list of scaling techniques).
4. Open ended questions - No options or predefined categories are suggested. The respondent supplies their own answer without being constrained by a fixed set of possible responses. Examples of types of open ended questions include:
   - Completely unstructured - For example, "What is your opinion on questionnaires?"
   - Word association - Words are presented and the respondent mentions the first word that comes to mind.
   - Sentence completion - Respondents complete an incomplete sentence. For example, "The most important consideration in my decision to buy a new house is . . ."
   - Story completion - Respondents complete an incomplete story.
   - Picture completion - Respondents fill in an empty conversation balloon.
   - Thematic apperception test - Respondents explain a picture or make up a story about what they think is happening in the picture.


Question sequence
- Questions should flow logically from one to the next.
- The researcher must ensure that the answer to a question is not influenced by previous questions.
- Questions should flow from the more general to the more specific.
- Questions should flow from the least sensitive to the most sensitive.
- Questions should flow from factual and behavioral questions to attitudinal and opinion questions.
- Questions should flow from unaided to aided questions.
- According to the three-stage theory (also called the sandwich theory), initial questions should be screening and rapport questions. Then, in the second stage, you ask all the product-specific questions. In the last stage, you ask demographic questions.

See also
- Computer-assisted telephone interviewing
- Computer-assisted personal interviewing
- Automated computer telephone interviewing
- Official statistics
- Bureau of Labor Statistics
- Questionnaires
- Questionnaire construction
- Paid survey
- Data mining
- NIPO Software
- DIY research
- SPSS
- Marketing
- Marketing research
- Scale
- Statistical survey
- Quantitative marketing research
- How to make a questionnaire


Lists of related topics


- List of marketing topics
- List of management topics
- List of economics topics
- List of finance topics
- List of accounting topics
- List of library management topics

References
[1] Merriam-Webster's Online Dictionary, s.v. "questionnaire," http://www.merriam-webster.com/dictionary/questionnaire (accessed May 21, 2008)
[2] Timothy R. Graeff, 2005. "Response Bias," Encyclopedia of Social Measurement, pp. 411-418. ScienceDirect. (http://www.sciencedirect.com/science/article/pii/B0123693985000372)
[3] Frauke Kreuter, Stanley Presser, and Roger Tourangeau, 2008. "Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity", Public Opinion Quarterly, 72(5): 847-865, first published online January 26, 2009.
[4] Allyson L. Holbrook, Melanie C. Green and Jon A. Krosnick, 2003. "Telephone versus Face-to-Face Interviewing of National Probability Samples with Long Questionnaires: Comparisons of Respondent Satisficing and Social Desirability Response Bias". Public Opinion Quarterly, 67(1): 79-125.

External links
How to ask questions for better survey response (http://www.sensorpro.net/SurveyGuidelines.pdf) (SensorPro)

Rasch model
Rasch models are used for analyzing categorical data from assessments to measure variables such as abilities, attitudes, and personality traits. For example, they may be used to estimate a student's reading ability from answers to questions on a reading assessment, or the extremity of a person's attitude to capital punishment from responses on a questionnaire. Rasch models are particularly used in psychometrics, the field concerned with the theory and technique of psychological and educational measurement. In addition, they are increasingly being used in other areas, including the health profession and market research because of their general applicability. The mathematical theory underlying Rasch models is a special case of item response theory and, more generally, a special case of a generalized linear model. However, there are important differences in the interpretation of the model parameters and its philosophical implications [1] that separate proponents of the Rasch model from the item response modeling tradition. A central aspect of this divide relates to the role of specific objectivity [2], a defining property of the Rasch model according to Georg Rasch, as a requirement for successful measurement. Application of the models provides diagnostic information regarding how well the criterion is met. Application of the models can also provide information about how well items or questions on assessments work to measure the ability or trait. Prominent advocates of Rasch models include Benjamin Drake Wright, David Andrich and Erling Andersen.


Overview
The Rasch model for measurement
In the Rasch model, the probability of a specified response (e.g. right/wrong answer) is modeled as a function of person and item parameters. Specifically, in the simple Rasch model, the probability of a correct response is modeled as a logistic function of the difference between the person and item parameter. The mathematical form of the model is provided later in this article. In most contexts, the parameters of the model pertain to the level of a quantitative trait possessed by a person or item. For example, in educational tests, item parameters pertain to the difficulty of items while person parameters pertain to the ability or attainment level of people who are assessed. The higher a person's ability relative to the difficulty of an item, the higher the probability of a correct response on that item. When a person's location on the latent trait is equal to the difficulty of the item, there is by definition a 0.5 probability of a correct response in the Rasch model. The purpose of applying the model is to obtain measurements from categorical response data. Estimation methods are used to obtain estimates from matrices of response data based on the model (Linacre, 1999). A Rasch model is a model in one sense in that it represents the structure which data should exhibit in order to obtain measurements from the data; i.e. it provides a criterion for successful measurement. Beyond data, Rasch's equations model relationships we expect to obtain in the real world. For instance, education is intended to prepare children for the entire range of challenges they will face in life, and not just those that appear in textbooks or on tests. By requiring measures to remain the same (invariant) across different tests measuring the same thing, Rasch models make it possible to test the hypothesis that the particular challenges posed in a curriculum and on a test coherently represent the infinite population of all possible challenges in that domain. A Rasch model is therefore a model in the sense of an ideal or standard that provides a heuristic fiction serving as a useful organizing principle even when it is never actually observed in practice. The perspective or paradigm underpinning the Rasch model is distinctly different from the perspective underpinning statistical modelling. Models are most often used with the intention of describing a set of data. Parameters are modified and accepted or rejected based on how well they fit the data. In contrast, when the Rasch model is employed, the objective is to obtain data which fit the model (Andrich, 2004; Wright, 1984, 1999). The rationale for this perspective is that the Rasch model embodies requirements which must be met in order to obtain measurement, in the sense that measurement is generally understood in the physical sciences. A useful analogy for understanding this rationale is to consider objects measured on a weighing scale. Suppose the weight of an object A is measured as being substantially greater than the weight of an object B on one occasion, then immediately afterward the weight of object B is measured as being substantially greater than the weight of object A. A property we require of measurements is that the resulting comparison between objects should be the same, or invariant, irrespective of other factors. This key requirement is embodied within the formal structure of the Rasch model. Consequently, the Rasch model is not altered to suit data. 
Instead, the method of assessment should be changed so that this requirement is met, in the same way that a weighing scale should be rectified if it gives different comparisons between objects upon separate measurements of the objects. Data analysed using the model are usually responses to conventional items on tests, such as educational tests with right/wrong answers. However, the model is a general one, and can be applied wherever discrete data are obtained with the intention of measuring a quantitative attribute or trait.


Scaling
When all test-takers have an opportunity to attempt all items on a single test, each total score on the test maps to a unique estimate of ability, and the greater the total, the greater the ability estimate. Total scores do not have a linear relationship with ability estimates. Rather, the relationship is non-linear, as shown in Figure 1. The total score is shown on the vertical axis, while the corresponding person location estimate is shown on the horizontal axis.

[Figure 1: Test characteristic curve showing the relationship between total score on a test and person location estimate.]

For the particular test on which the test characteristic curve (TCC) shown in Figure 1 is based, the relationship is approximately linear throughout the range of total scores from about 10 to 33. The shape of the TCC is generally somewhat sigmoid, as in this example. However, the precise relationship between total scores and person location estimates depends on the distribution of items on the test. The TCC is steeper in ranges on the continuum in which there are a number of items, such as in the range on either side of 0 in Figures 1 and 2.

In applying the Rasch model, item locations are often scaled first, based on methods such as those described below. This part of the process of scaling is often referred to as item calibration. In educational tests, the smaller the proportion of correct responses, the higher the difficulty of an item and hence the higher the item's scale location. Once item locations are scaled, the person locations are measured on the scale. As a result, person and item locations are estimated on a single scale, as shown in Figure 2.
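To make the non-linear relationship concrete, the following sketch (with arbitrary illustrative item difficulties, not those underlying Figure 1) computes the expected total score, i.e. the test characteristic curve, at several person locations:

```python
# Sketch of a test characteristic curve: the expected total score is the sum of
# Rasch item probabilities at each ability level. Item difficulties are made up.
import numpy as np

deltas = np.linspace(-2.0, 2.0, 20)       # 20 item difficulties spread across the scale
abilities = np.linspace(-4.0, 4.0, 9)     # person locations at which to evaluate the TCC

def expected_total_score(beta, deltas):
    p = 1.0 / (1.0 + np.exp(-(beta - deltas)))
    return p.sum()

for b in abilities:
    print(f"ability {b:+.1f} -> expected total score {expected_total_score(b, deltas):5.2f}")
```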

Interpreting scale locations


For dichotomous data such as right/wrong answers, by definition, the location of an item on a scale corresponds with the person location at which there is a 0.5 probability of a correct response to the question. In general, the probability of a person responding correctly to a question with difficulty lower than that person's location is greater than 0.5, while the probability of responding correctly to a question with difficulty greater than the person's location is less than 0.5.

[Figure 2: Graph showing histograms of the person distribution (top) and item distribution (bottom) on a scale.]

The Item Characteristic Curve (ICC) or Item Response Function (IRF) shows the probability of a correct response as a function of the ability of persons. A single ICC is shown and explained in more detail in relation to Figure 4 in this article (see also the item response function). The leftmost ICCs in Figure 3 are the easiest items; the rightmost items in the same figure are the most difficult items.

When responses of a person are listed according to item difficulty, from lowest to highest, the most likely pattern is a Guttman pattern or vector; i.e. {1,1,...,1,0,0,0,...,0}. However, while this pattern is the most probable given the structure of the Rasch model, the model requires only probabilistic Guttman response patterns; that is, patterns which tend toward the Guttman pattern. It is unusual for responses to conform strictly to the pattern because there are many possible patterns. It is unnecessary for responses to conform strictly to the pattern in order for data to fit the Rasch model.

[Figure 3: ICCs for a number of items. ICCs are coloured to highlight the change in the probability of a successful response for a person with ability location at the vertical line. The person is likely to respond correctly to the easiest items (with locations to the left and higher curves) and unlikely to respond correctly to difficult items (locations to the right and lower curves).]

Each ability estimate has an associated standard error of measurement, which quantifies the degree of uncertainty associated with the ability estimate. Item estimates also have standard errors. Generally, the standard errors of item estimates are considerably smaller than the standard errors of person estimates because there are usually more response data for an item than for a person. That is, the number of people attempting a given item is usually greater than the number of items attempted by a given person. Standard errors of person estimates are smaller where the slope of the ICC is steeper, which is generally through the middle range of scores on a test. Thus, there is greater precision in this range, since the steeper the slope, the greater the distinction between any two points on the line.

Statistical and graphical tests are used to evaluate the correspondence of data with the model. Certain tests are global, while others focus on specific items or people. Certain tests of fit provide information about which items can be used to increase the reliability of a test by omitting or correcting problems with poor items. In Rasch measurement the person separation index is used instead of reliability indices; however, the person separation index is analogous to a reliability index. The separation index is a summary of the genuine separation as a ratio to separation including measurement error. As mentioned earlier, the level of measurement error is not uniform across the range of a test, but is generally larger for more extreme scores (low and high).


Features of the Rasch model


The class of models is named after Georg Rasch, a Danish mathematician and statistician who advanced the epistemological case for the models based on their congruence with a core requirement of measurement in physics; namely the requirement of invariant comparison. This is the defining feature of the class of models, as is elaborated upon in the following section. The Rasch model for dichotomous data has a close conceptual relationship to the law of comparative judgment (LCJ), a model formulated and used extensively by L. L. Thurstone (cf Andrich, 1978b), and therefore also to the Thurstone scale. Prior to introducing the measurement model he is best known for, Rasch had applied the Poisson distribution to reading data as a measurement model, hypothesizing that in the relevant empirical context, the number of errors made by a given individual was governed by the ratio of the text difficulty to the person's reading ability. Rasch referred to this model as the multiplicative Poisson model. Rasch's model for dichotomous data i.e. where responses are classifiable into two categories is his most widely known and used model, and is the main focus here. This model has the form of a simple logistic function.

The brief outline above highlights certain distinctive and interrelated features of Rasch's perspective on social measurement, which are as follows:
1. He was concerned principally with the measurement of individuals, rather than with distributions among populations.
2. He was concerned with establishing a basis for meeting a priori requirements for measurement deduced from physics and, consequently, did not invoke any assumptions about the distribution of levels of a trait in a population.
3. Rasch's approach explicitly recognizes that it is a scientific hypothesis that a given trait is both quantitative and measurable, as operationalized in a particular experimental context.

Thus, congruent with the perspective articulated by Thomas Kuhn in his 1961 paper The function of measurement in modern physical science, measurement was regarded both as being founded in theory, and as being instrumental to detecting quantitative anomalies incongruent with hypotheses related to a broader theoretical framework. This perspective is in contrast to that generally prevailing in the social sciences, in which data such as test scores are directly treated as measurements without requiring a theoretical foundation for measurement. Although this contrast exists, Rasch's perspective is actually complementary to the use of statistical analysis or modelling that requires interval-level measurements, because the purpose of applying a Rasch model is to obtain such measurements.

Applications of Rasch models are described in a wide variety of sources, including Alagumalai, Curtis & Hungi (2005), Bezruczko (2005), Bond & Fox (2007), Fisher & Wright (1994), Masters & Keeves (1999), and the Journal of Applied Measurement.


Invariant comparison and sufficiency


The Rasch model for dichotomous data is often regarded as an item response theory (IRT) model with one item parameter. However, rather than being a particular IRT model, proponents of the model regard it as a model that possesses a property which distinguishes it from other IRT models. Specifically, the defining property of Rasch models is their formal or mathematical embodiment of the principle of invariant comparison. Rasch summarised the principle of invariant comparison as follows: The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which other stimuli within the considered class were or might also have been compared. Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for the comparison; and it should also be independent of which other individuals were also compared, on the same or some other occasion (Rasch, 1961, p. 332). Rasch models embody this principle because their formal structure permits algebraic separation of the person and item parameters, in the sense that the person parameter can be eliminated during the process of statistical estimation of item parameters. This result is achieved through the use of conditional maximum likelihood estimation, in which the response space is partitioned according to person total scores. The consequence is that the raw score for an item or person is the sufficient statistic for the item or person parameter. That is to say, the person total score contains all information available within the specified context about the individual, and the item total score contains all information with respect to item, with regard to the relevant latent trait. The Rasch model requires a specific structure in the response data, namely a probabilistic Guttman structure. In somewhat more familiar terms, Rasch models provide a basis and justification for obtaining person locations on a continuum from total scores on assessments. Although it is not uncommon to treat total scores directly as measurements, they are actually counts of discrete observations rather than measurements. Each observation represents the observable outcome of a comparison between a person and item. Such outcomes are directly analogous to the observation of the rotation of a balance scale in one direction or another. This observation would indicate that one or other object has a greater mass, but counts of such observations cannot be treated directly as

measurements. Rasch pointed out that the principle of invariant comparison is characteristic of measurement in physics using, by way of example, a two-way experimental frame of reference in which each instrument exerts a mechanical force upon solid bodies to produce acceleration. Rasch (1960/1980, pp. 112-3) stated of this context: "Generally: If for any two objects we find a certain ratio of their accelerations produced by one instrument, then the same ratio will be found for any other of the instruments". It is readily shown that Newton's second law entails that such ratios are inversely proportional to the ratios of the masses of the bodies.


The mathematical form of the Rasch model for dichotomous data


Let $X_{ni} \in \{0, 1\}$ be a dichotomous random variable where, for example, $X_{ni}=1$ denotes a correct response and $X_{ni}=0$ an incorrect response to a given assessment item. In the Rasch model for dichotomous data, the probability of the outcome $X_{ni}=1$ is given by:

$$\Pr\{X_{ni}=1\} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)},$$

where $\beta_n$ is the ability of person $n$ and $\delta_i$ is the difficulty of item $i$. Thus, in the case of a dichotomous attainment item, $\Pr\{X_{ni}=1\}$ is the probability of success upon interaction between the relevant person and assessment item. It is readily shown that the log odds, or logit, of a correct response by a person to an item, based on the model, is equal to $\beta_n - \delta_i$.

It can be shown that the log odds of a correct response by a person to one item, conditional on a correct response to one of two items, is equal to the difference between the item locations. For example,

$$\log\left(\frac{\Pr\{X_{n1}=1 \mid r_n=1\}}{\Pr\{X_{n2}=1 \mid r_n=1\}}\right) = \delta_2 - \delta_1,$$

where $r_n$ is the total score of person $n$ over the two items, which implies a correct response to one or other of the items (Andersen, 1977; Rasch, 1960; Andrich, 2010). Hence, the conditional log odds does not involve the person parameter $\beta_n$, which can therefore be eliminated by conditioning on the total score $r_n$. That is, by partitioning the responses according to raw scores and calculating the log odds of a correct response, an estimate of $\delta_2 - \delta_1$ is obtained without involvement of $\beta_n$. More generally, a number of item parameters can be estimated iteratively through application of a process such as conditional maximum likelihood estimation (see Rasch model estimation). While more involved, the same fundamental principle applies in such estimations.

The ICC of the Rasch model for dichotomous data is shown in Figure 4. The grey line maps a person with a location of approximately 0.2 on the latent continuum to the probability of the outcome $X_{ni}=1$ for items with different locations on the latent continuum. The location of an item is, by definition, that location at which the probability that $X_{ni}=1$ is equal to 0.5.

[Figure 4: ICC for the Rasch model showing the comparison between observed and expected proportions correct for five class intervals of persons.]

In Figure 4, the black circles represent the actual or observed proportions of persons within class intervals for which the outcome was observed. For example, in the case of an assessment item used in the context of educational psychology, these could represent the proportions of persons who answered the item correctly. Persons are ordered by the estimates of their locations on the latent continuum and classified into class intervals on this basis in order to graphically inspect the accordance of observations with the model. There is a close conformity of the data with the model. In addition to graphical inspection of data, a range of statistical tests of fit are used to evaluate whether departures of observations from the model can be attributed to random effects alone, as required, or whether there are systematic departures from the model.
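The model probability and log odds above can be computed directly; in the following minimal sketch the person and item parameter values are invented for illustration:

```python
# Minimal sketch of the dichotomous Rasch model; beta (person ability) and
# delta (item difficulty) values are made-up illustrations.
import numpy as np

def rasch_probability(beta, delta):
    """Probability of a correct response: exp(beta - delta) / (1 + exp(beta - delta))."""
    return np.exp(beta - delta) / (1.0 + np.exp(beta - delta))

beta = 0.2                                      # person location on the latent continuum
deltas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # item difficulties

p = rasch_probability(beta, deltas)
logit = np.log(p / (1 - p))                     # recovers beta - delta for each item

for d, prob, lo in zip(deltas, p, logit):
    print(f"difficulty {d:+.1f}: P(correct) = {prob:.3f}, log odds = {lo:+.2f}")
```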


The polytomous form of the Rasch model


The polytomous Rasch model, which is a generalisation of the dichotomous model, can be applied in contexts in which successive integer scores represent categories of increasing level or magnitude of a latent trait, such as increasing ability, motor function, endorsement of a statement, and so forth. The polytomous response model is, for example, applicable to the use of Likert scales, grading in educational assessment, and scoring of performances by judges.
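As a hedged sketch of the polytomous case (using the common partial-credit parameterisation with item thresholds; the threshold values and function name below are assumptions for illustration, not taken from the text):

```python
# Sketch of polytomous (partial credit) Rasch category probabilities for one item.
# Thresholds delta_ik and the person location are illustrative values only.
import numpy as np

def partial_credit_probs(beta, thresholds):
    """Category probabilities for scores 0..m given person location beta and
    item thresholds (delta_i1, ..., delta_im)."""
    # Cumulative sums of (beta - delta_ik); the score-0 term is defined as 0.
    numerators = np.exp(np.concatenate(([0.0], np.cumsum(beta - np.asarray(thresholds)))))
    return numerators / numerators.sum()

probs = partial_credit_probs(beta=0.5, thresholds=[-1.0, 0.0, 1.5])
print(probs, probs.sum())  # probabilities for scores 0, 1, 2, 3; they sum to 1
```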

Other considerations
A criticism of the Rasch model is that it is overly restrictive or prescriptive because it does not permit each item to have a different discrimination. A criticism specific to the use of multiple choice items in educational assessment is that there is no provision in the model for guessing because the left asymptote always approaches a zero probability in the Rasch model. These variations are available in models such as the two and three parameter logistic models (Birnbaum, 1968). However, the specification of uniform discrimination and zero left asymptote are necessary properties of the model in order to sustain sufficiency of the simple, unweighted raw score. Verhelst & Glas (1995) derive Conditional Maximum Likelihood (CML) equations for a model they refer to as the One Parameter Logistic Model (OPLM). In algebraic form it appears to be identical with the 2PL model, but OPLM contains preset discrimination indexes rather than 2PL's estimated discrimination parameters. As noted by these authors, though, the problem one faces in estimation with estimated discrimination parameters is that the discriminations are unknown, meaning that the weighted raw score "is not a mere statistic, and hence it is impossible to use CML as an estimation method" (Verhelst & Glas, 1995, p. 217). That is, sufficiency of the weighted "score" in the 2PL cannot be used according to the way in which a sufficient statistic is defined. If the weights are imputed instead of being estimated, as in OPLM, conditional estimation is possible and some of the properties of the Rasch model are retained (Verhelst, Glas & Verstralen, 1995; Verhelst & Glas, 1995). In OPLM, the values of the discrimination index are restricted to between 1 and 15. A limitation of this approach is that in practice, values of discrimination indexes must be preset as a starting point. This means some type of estimation of discrimination is involved when the purpose is to avoid doing so. The Rasch model for dichotomous data inherently entails a single discrimination parameter which, as noted by Rasch (1960/1980, p. 121), constitutes an arbitrary choice of the unit in terms of which magnitudes of the latent trait are expressed or estimated. However, the Rasch model requires that the discrimination is uniform across interactions between persons and items within a specified frame of reference (i.e. the assessment context given conditions for assessment).


Notes
[1] Linacre, J.M. (2005). Rasch dichotomous model vs. One-parameter Logistic Model. Rasch Measurement Transactions, 19:3, 1032.
[2] Rasch, G. (1977). On Specific Objectivity: An attempt at formalizing the request for generality and validity of scientific statements. The Danish Yearbook of Philosophy, 14, 58-93.

References and further reading


Alagumalai, S., Curtis, D.D. & Hungi, N. (2005). Applied Rasch Measurement: A book of exemplars. Springer-Kluwer.
Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
Andrich, D. (1978a). A rating formulation for ordered response categories. Psychometrika, 43, 357-74.
Andrich, D. (1978b). Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2, 449-460.
Andrich, D. (1988). Rasch models for measurement. Beverly Hills: Sage Publications.
Andrich, D. (2004). Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care, 42, 1-16.
Andrich, D. (2010). Sufficiency and conditional estimation of person parameters in the polytomous Rasch model. Psychometrika, 75(2), 292-308.
Baker, F. (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD. Available free, with software included, from IRT at Edres.org (http://edres.org/irt/)
Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord, F.M. & Novick, M.R. (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bond, T.G. & Fox, C.M. (2007). Applying the Rasch Model: Fundamental measurement in the human sciences. 2nd Edn (includes Rasch software on CD-ROM). Lawrence Erlbaum.
Fischer, G.H. & Molenaar, I.W. (1995). Rasch models: foundations, recent developments and applications. New York: Springer-Verlag.
Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.
Goldstein, H. & Blinkhorn, S. (1977). Monitoring Educational Standards: an inappropriate model. Bull. Br. Psychol. Soc., 30, 309-311.
Goldstein, H. & Blinkhorn, S. (1982). The Rasch Model Still Does Not Fit. BERJ, 82, 167-170.
Hambleton, R.K. & Jones, R.W. (1993). Comparison of classical test theory and item response theory. Educational Measurement: Issues and Practice, 12(3), 38-47. Available in the ITEMS Series from the National Council on Measurement in Education (http://www.ncme.org/pubs/items.cfm)
Harris, D. (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8, 35-41. Available in the ITEMS Series from the National Council on Measurement in Education (http://www.ncme.org/pubs/items.cfm)
Kuhn, T.S. (1961). The function of measurement in modern physical science. ISIS, 52, 161-193. JSTOR (http://www.jstor.org/stable/228678)
Linacre, J. M. (1999). "Understanding Rasch measurement: Estimation methods for Rasch measures". Journal of Outcome Measurement, 3(4), 382-405.
Masters, G. N., & Keeves, J. P. (Eds.). (1999). Advances in measurement in educational research and assessment. New York: Pergamon.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology, pp. 321-334 in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, IV. Berkeley, California: University of California Press. Available free from Project Euclid (http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.bsmsp/1200512895&page=record)
Verhelst, N.D. and Glas, C.A.W. (1995). The one parameter logistic model. In G.H. Fischer and I.W. Molenaar (Eds.), Rasch Models: Foundations, recent developments, and applications (pp. 215-238). New York: Springer Verlag.
Verhelst, N.D., Glas, C.A.W. and Verstralen, H.H.F.M. (1995). One parameter logistic model (OPLM). Arnhem: CITO.
von Davier, M., & Carstensen, C. H. (2007). Multivariate and Mixture Distribution Rasch Models: Extensions and Applications. New York: Springer.
Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288 (http://www.rasch.org/memo41.htm).
Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104). Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Wright, B.D., & Stone, M.H. (1979). Best Test Design. Chicago, IL: MESA Press.
Wu, M. & Adams, R. (2007). Applying the Rasch model to psycho-social measurement: A practical approach. Melbourne, Australia: Educational Measurement Solutions. Available free from Educational Measurement Solutions (http://www.edmeasurement.com.au/Learning.html)

287

External links
Institute for Objective Measurement Online Rasch Resources (http://www.rasch.org/memos.htm)
Pearson Psychometrics Laboratory, with information about Rasch models (http://www.education.uwa.edu.au/ppl)
Journal of Applied Measurement (http://www.jampress.org)
Journal of Outcome Measurement (all issues available for free downloading) (http://www.jampress.org/JOM.htm)
Berkeley Evaluation & Assessment Research Center (ConstructMap software) (http://bearcenter.berkeley.edu)
Directory of Rasch Software, freeware and paid (http://www.rasch.org/software.htm)
IRT Modeling Lab at U. Illinois Urbana Champ. (http://work.psych.uiuc.edu/irt/)
National Council on Measurement in Education (NCME) (http://www.ncme.org)
Rasch analysis (http://www.rasch-analysis.com/)
Rasch Measurement Transactions (http://www.rasch.org/rmt/contents.htm)
The Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html)


Rasch model estimation


Estimation of a Rasch model involves estimating the parameters of the model from matrices of response data, and various techniques are employed for this purpose. The most common approaches are types of maximum likelihood estimation, such as joint and conditional maximum likelihood estimation. Joint maximum likelihood (JML) equations are efficient but inconsistent for a finite number of items, whereas conditional maximum likelihood (CML) equations give consistent and unbiased item estimates. Person estimates generally have some bias associated with them, although weighted likelihood estimation methods for the estimation of person parameters reduce this bias.

Rasch model
The Rasch model for dichotomous data takes the form:

$$\Pr\{X_{ni}=1\} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}$$

where $\beta_n$ is the ability of person $n$ and $\delta_i$ is the difficulty of item $i$.
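As an illustration, the probability above can be computed directly. The following minimal Python sketch (the function name and example values are illustrative only, not part of the original article) evaluates it for a single person-item pair:

import math

def rasch_probability(beta, delta):
    # Probability of a correct response under the dichotomous Rasch model.
    # beta: ability of the person (logits); delta: difficulty of the item (logits).
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

# Example: a person 1 logit above an item's difficulty succeeds with
# probability exp(1) / (1 + exp(1)), roughly 0.73.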

Joint maximum likelihood


Let $x_{ni}$ denote the observed response for person $n$ on item $i$. The probability of the observed data matrix, which is the product of the probabilities of the individual responses, is given by the likelihood function

$$\Lambda = \frac{\prod_n \prod_i \exp\bigl(x_{ni}(\beta_n - \delta_i)\bigr)}{\prod_n \prod_i \bigl(1 + \exp(\beta_n - \delta_i)\bigr)}.$$

The log-likelihood function is then

$$\log \Lambda = \sum_{n=1}^{N} \beta_n r_n - \sum_{i=1}^{I} \delta_i s_i - \sum_{n=1}^{N}\sum_{i=1}^{I} \log\bigl(1 + \exp(\beta_n - \delta_i)\bigr)$$

where $r_n = \sum_i x_{ni}$ is the total raw score for person $n$, $s_i = \sum_n x_{ni}$ is the total raw score for item $i$, $N$ is the total number of persons and $I$ is the total number of items. Solution equations are obtained by taking partial derivatives with respect to $\beta_n$ and $\delta_i$ and setting the result equal to 0. The JML solution equations are:

$$s_i = \sum_{n=1}^{N} P_{ni}, \qquad i = 1, \dots, I$$

and

$$r_n = \sum_{i=1}^{I} P_{ni}, \qquad n = 1, \dots, N$$

where $P_{ni} = \exp(\beta_n - \delta_i)/\bigl(1 + \exp(\beta_n - \delta_i)\bigr)$. A more accurate estimate of each $\delta_i$ is obtained by multiplying the estimates by $(I-1)/I$.


Conditional maximum likelihood


The conditional likelihood function is defined as

$$\Lambda_c = \prod_n \Pr\{(x_{ni}) \mid r_n\} = \frac{\exp\!\left(-\sum_i s_i \delta_i\right)}{\prod_n \gamma_{r_n}}$$

in which

$$\gamma_r = \sum_{(x_{ni}) \mid r} \exp\!\left(-\sum_i x_{ni}\delta_i\right)$$

is the elementary symmetric function of order $r$, which represents the sum over all combinations of $r$ items. For example, in the case of three items,

$$\gamma_2 = \exp(-\delta_1 - \delta_2) + \exp(-\delta_1 - \delta_3) + \exp(-\delta_2 - \delta_3).$$
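CML estimation requires these elementary symmetric functions for every order $r$. A standard way to obtain them is the recursive "summation" scheme sketched below in Python; this is an illustrative sketch only, and the function and variable names are not taken from the original article or from any particular software package.

import numpy as np

def elementary_symmetric_functions(delta):
    # delta: array of item difficulty parameters
    eps = np.exp(-np.asarray(delta, dtype=float))   # epsilon_i = exp(-delta_i)
    gamma = np.zeros(len(eps) + 1)
    gamma[0] = 1.0
    # Build the coefficients of prod_i (1 + epsilon_i * t);
    # gamma[r] is then the elementary symmetric function of order r.
    for e in eps:
        gamma[1:] = gamma[1:] + e * gamma[:-1]
    return gamma

# For three items, elementary_symmetric_functions([d1, d2, d3])[2]
# reproduces the gamma_2 expression given above.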

Estimation algorithms
An expectation-maximization algorithm is often used in the estimation of the parameters of Rasch models. Algorithms for implementing maximum likelihood estimation commonly employ Newton-Raphson iterations to solve the equations obtained by setting the partial derivatives of the log-likelihood functions equal to 0. Convergence criteria are used to determine when the iterations should cease; for example, the criterion might be that every item estimate changes by less than a certain value, such as 0.001, from one iteration to the next.
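To make the procedure concrete, the sketch below implements a simple alternating Newton-Raphson scheme for the JML equations given earlier. It is a minimal illustration only: it assumes a complete 0/1 response matrix from which persons and items with perfect or zero raw scores have already been removed, and the function and variable names are illustrative rather than drawn from any published program.

import numpy as np

def jml_estimate(X, max_iter=100, tol=0.001):
    # X: persons-by-items matrix of 0/1 responses (no perfect or zero raw scores).
    X = np.asarray(X, dtype=float)
    N, I = X.shape
    r = X.sum(axis=1)                      # person raw scores
    s = X.sum(axis=0)                      # item raw scores
    beta = np.log(r / (I - r))             # starting values from raw-score logits
    delta = np.log((N - s) / s)
    for _ in range(max_iter):
        P = 1.0 / (1.0 + np.exp(delta[None, :] - beta[:, None]))
        info_i = (P * (1 - P)).sum(axis=0)
        new_delta = delta - (s - P.sum(axis=0)) / info_i   # Newton step for items
        new_delta -= new_delta.mean()                      # fix the origin of the scale
        P = 1.0 / (1.0 + np.exp(new_delta[None, :] - beta[:, None]))
        info_n = (P * (1 - P)).sum(axis=1)
        new_beta = beta + (r - P.sum(axis=1)) / info_n     # Newton step for persons
        converged = np.abs(new_delta - delta).max() < tol  # convergence criterion
        beta, delta = new_beta, new_delta
        if converged:
            break
    return beta, delta * (I - 1) / I       # apply the (I-1)/I bias correction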

References
Linacre, J.M. (2004). Estimation methods for Rasch measures. Chapter 2 in E.V. Smith & R.M. Smith (Eds.), Introduction to Rasch Measurement. Maple Grove, MN: JAM Press.
Linacre, J.M. (2004). Rasch model estimation: further topics. Chapter 24 in E.V. Smith & R.M. Smith (Eds.), Introduction to Rasch Measurement. Maple Grove, MN: JAM Press.


Rating scale
Concerning rating scales as systems of educational marks, see articles about education in different countries (named "Education in ..."), for example, Education in Ukraine. Concerning rating scales used in the practice of medicine, see articles about diagnoses, for example, Major depressive disorder.
An example of a common type of rating scale, the "rate this with 1 to 5 stars" model. This example is from Wikipedia's user-survey efforts.

A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative attribute. In the social sciences, common examples are the Likert scale and 1-10 rating scales, in which a person selects the number that best reflects the perceived quality of a product.

Background
A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the rated object, as a measure of some rated attribute.

Types of Rating Scales


All rating scales can be classified into one of three levels of measurement:

1. Some data are measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. One example is a Likert scale:
Statement: e.g. "I could not live without my computer".
Response options: 1. Strongly disagree 2. Disagree 3. Agree 4. Strongly agree
2. Some data are measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. Examples are attitude scales and opinion scales.
3. Some data are measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales revenue, sales volume and market share.

More than one rating scale item is required to measure an attitude or perception, due to the requirement for statistical comparisons between the categories in the polytomous Rasch model for ordered categories.[1] In terms of classical test theory, more than one question is required to obtain an index of internal reliability such as Cronbach's alpha,[2] which is a basic criterion for assessing the effectiveness of a rating scale and, more generally, a psychometric instrument.
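Since Cronbach's alpha figures repeatedly in what follows, a minimal computational sketch may be useful. The Python function below is an illustrative sketch only (the function name and the NumPy dependency are assumptions of this example, not prescribed by the sources cited here):

import numpy as np

def cronbach_alpha(X):
    # X: respondents-by-items matrix of scored responses
    X = np.asarray(X, dtype=float)
    k = X.shape[1]                              # number of items
    item_variances = X.var(axis=0, ddof=1)      # variance of each item
    total_variance = X.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)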


Rating scales used online


Rating scales are used widely online in an attempt to provide indications of consumer opinions of products. Examples of sites that employ rating scales are IMDb, Epinions.com, Internet Book List, Yahoo! Movies, Amazon.com, BoardGameGeek, TV.com and Ratings.net. The Criticker website uses a rating scale from 0 to 100 in order to obtain "personalised film recommendations".

In almost all cases, online rating scales allow only one rating per user per product, though there are exceptions such as Ratings.net, which allows users to rate products in relation to several qualities. Most online rating facilities also provide few or no qualitative descriptions of the rating categories, although again there are exceptions such as Yahoo! Movies, which labels each of the categories between F and A+, and BoardGameGeek, which provides explicit descriptions of each category from 1 to 10. Often, only the top and bottom categories are described, such as on IMDb's online rating facility.

Validity
With each user rating a product only once, for example in a category from 1 to 10, there is no means of evaluating internal reliability using an index such as Cronbach's alpha. It is therefore impossible to evaluate the validity of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent). The degree of validity of an instrument is determined through the application of logic and/or statistical procedures: a measurement procedure is valid to the degree that it measures what it proposes to measure. Another fundamental issue is that online ratings usually involve convenience sampling, much like television polls, i.e. they represent only the opinions of those inclined to submit ratings.

Validity is concerned with different aspects of the measurement process. Each of the following types of validity uses logic, statistical verification or both to determine the degree of validity, and each has special value under certain conditions:
1. Content validity
2. Predictive validity
3. Construct validity

Sampling
Sampling errors can lead to results which have a specific bias, or are only relevant to a specific subgroup. Consider this example: suppose that a film appeals only to a specialist audience, of whom 90% are devotees of its genre and only 10% are people with a general interest in movies. Assume the film is very popular among the audience that views it, and that only those who feel most strongly about the film are inclined to rate the film online; hence the raters are all drawn from the devotees. This combination may lead to very high ratings of the film, which do not generalize beyond the people who actually see the film (or possibly even beyond those who actually rate it).

Qualitative description
Qualitative description of categories improves the usefulness of a rating scale. For example, if only the points 1-10 are given without description, some people may select 10 rarely, whereas others may select the category often. If, instead, "10" is described as "near flawless", the category is more likely to mean the same thing to different people. This applies to all categories, not just the extreme points.

The above issues are compounded when aggregated statistics such as averages are used for lists and rankings of products. User ratings are at best ordinal categorizations. While it is not uncommon to calculate averages or means for such data, doing so cannot be justified, because in calculating averages equal intervals are required to represent the same difference between levels of perceived quality. The key issues with aggregate data based on the kinds of rating scales commonly used online are as follows:

Averages should not be calculated for data of the kind collected.
It is usually impossible to evaluate the reliability or validity of user ratings.
Products are not compared with respect to explicit, let alone common, criteria.
Only users inclined to submit a rating for a product do so.
Data are not usually published in a form that permits evaluation of the product ratings.

More developed methodologies include choice modelling and maximum difference methods, the latter being related to the Rasch model due to the connection between Thurstone's law of comparative judgement and the Rasch model.


References
[1] Andrich, D. (1978). "A rating formulation for ordered response categories". Psychometrika, 43, 357-74. [2] Cronbach, L. J. (1951). "Coefficient alpha and the internal structure of tests". Psychometrika, 16, 297-333.

External links
How to apply Rasch analysis (http://www.rasch-analysis.com/)

Rating scales for depression


A depression rating scale is a psychiatric measuring instrument having descriptive words and phrases that indicate the severity of depression symptoms for a time period.[] When used, an observer may make judgements and rate a person at a specified scale level with respect to identified characteristics. Rather than being used to diagnose depression, a depression rating scale may be used to assign a score to a person's behaviour where that score may be used to determine whether that person should be evaluated more thoroughly for a depressive disorder diagnosis.[] Several rating scales are used for this purpose.[]

Scales completed by researchers


Some depression rating scales are completed by researchers. For example, the Hamilton Depression Rating Scale includes 21 questions with between 3 and 5 possible responses which increase in severity. The clinician must choose the possible responses to each question by interviewing the patient and by observing the patient's symptoms. Designed by psychiatrist Max Hamilton in 1960, the Hamilton Depression Rating Scale is one of the two most commonly used among those completed by researchers assessing the effects of drug therapy.[][1] Alternatively, the Montgomery-Åsberg Depression Rating Scale has ten items to be completed by researchers assessing the effects of drug therapy and is the other of the two most commonly used among such researchers.[][2] Another scale is the Raskin Depression Rating Scale, which rates the severity of the patient's symptoms in three areas: verbal reports, behavior, and secondary symptoms of depression.[]

Scales completed by patients


The two questions on the Patient Health Questionnaire-2 (PHQ-2):[] During the past month, have you often been bothered by feeling down, depressed, or hopeless? During the past month, have you often been bothered by little interest or pleasure in doing things? Some depression rating scales are completed by patients. The Beck Depression Inventory, for example, is a 21-question self-report inventory that covers symptoms such as irritability, fatigue, weight loss, lack of interest in sex, and feelings of guilt, hopelessness or fear of being punished.[] The scale is completed by patients to identify the presence and severity of symptoms consistent with the DSM-IV diagnostic criteria.[3] The Beck Depression Inventory was originally designed by psychiatrist Aaron T. Beck in 1961.[]

The Geriatric Depression Scale (GDS) is another self-administered scale, but in this case it is used for older patients, and for patients with mild to moderate dementia. Instead of presenting a five-category response set, the GDS questions are answered with a simple "yes" or "no".[4][] The Zung Self-Rating Depression Scale is similar to the Geriatric Depression Scale in that the answers are preformatted. In the Zung Self-Rating Depression Scale, there are 20 items: ten positively worded and ten negatively worded. Each question is rated on a scale of 1 through 4 based on four possible answers: "a little of the time", "some of the time", "good part of the time", and "most of the time".[] The Patient Health Questionnaire (PHQ) sets are self-reported depression rating scales. For example, the Patient Health Questionnaire-9 (PHQ-9) is a self-reported, 9-question version of the Primary Care Evaluation of Mental Disorders.[] The Patient Health Questionnaire-2 (PHQ-2) is a shorter version of the PHQ-9 with two screening questions to assess the presence of a depressed mood and a loss of interest or pleasure in routine activities; a positive response to either question indicates that further testing is required.[]


Scales completed by patients and researchers


The Primary Care Evaluation of Mental Disorders (PRIME-MD) is completed by the patient and a researcher. This depression rating scale includes a 27-item screening questionnaire and follow-up clinician interview designed to facilitate the diagnosis of common mental disorders in primary care. Its lengthy administration time has limited its clinical usefulness; it has been replaced by the Patient Health Questionnaire.[]

Usefulness
Screening programs using rating scales to search for candidates for a more in-depth evaluation have been advocated to improve detection of depression, but there is evidence that they do not improve detection rates, treatment, or outcome.[5] There is also evidence that a consensus on the interpretation of rating scales, in particular the Hamilton Rating Scale for Depression, is largely missing, leading to misdiagnosis of the severity of a patient's depression.[6] However, there is evidence that portions of rating scales, such as the somatic section of the PHQ-9, can be useful in predicting outcomes for subgroups of patients like coronary heart disease patients.[7]

Copyrighted vs. Public Domain scales


The Beck Depression Inventory is copyrighted: a fee must be paid for each copy used, and photocopying it is a violation of copyright. There is no evidence that the BDI-II is more valid or reliable than other depression scales,[8] and public domain scales such as the Patient Health Questionnaire Nine Item (PHQ-9) have been studied as useful tools.[9] Other public domain scales include the Clinically Useful Depression Outcome Scale (CUDOS)[10][11] and the Quick Inventory of Depressive Symptoms Self Report 16 Item (QIDS-SR16).[12][13]

References
[8] Zimmerman M. Using scales to monitor symptoms and treatment of depression (measurement based care). In UpToDate, Rose, BD (Ed), UpToDate, Waltham, MA, 2011.
[11] OutcomeTracker (http://www.outcometracker.org/) - Clinically Useful Depression Outcome Scale (CUDOS) official website
[13] Inventory of Depressive Symptomatology (IDS) and Quick Inventory of Depressive Symptomatology (QIDS) (http://www.ids-qids.org/) official website


Reliability (psychometrics)
In psychometrics, reliability is used to describe the overall consistency of a measure. A measure is said to have high reliability if it produces similar results under consistent conditions. For example, measurements of people's height and weight are often extremely reliable.[1][2]

Types
There are several general classes of reliability estimates:

Inter-rater reliability assesses the degree to which test scores are consistent when measurements are taken by different people using the same methods.
Test-retest reliability assesses the degree to which test scores are consistent from one test administration to the next. Measurements are gathered from a single rater who uses the same methods or instruments and the same testing conditions.[2] This includes intra-rater reliability.
Inter-method reliability assesses the degree to which test scores are consistent when there is a variation in the methods or instruments used. This allows inter-rater reliability to be ruled out. When dealing with forms, it may be termed parallel-forms reliability.[3]
Internal consistency reliability assesses the consistency of results across items within a test.[3]

Difference from validity


Reliability does not imply validity. That is, a reliable measure that measures something consistently may not be measuring what you want to measure. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is analogous to precision, while validity is analogous to accuracy.

While reliability does not imply validity, a lack of reliability does place a limit on the overall validity of a test. A test that is not perfectly reliable cannot be perfectly valid, either as a means of measuring attributes of a person or as a means of predicting scores on a criterion. While a reliable test may provide useful valid information, a test that is not reliable cannot possibly be valid.[]

An example often used to illustrate the difference between reliability and validity in the experimental sciences involves a common bathroom scale. If someone who weighs 200 pounds steps on the scale 10 times and gets readings of 15, 250, 95, 140, etc., the scale is not reliable. If the scale consistently reads "150", then it is reliable, but not valid. If it reads "200" each time, then the measurement is both reliable and valid.
Validity and reliability


General model
In practice, testing measures are never perfectly consistent. Theories of test reliability have been developed to estimate the effects of inconsistency on the accuracy of measurement. The basic starting point for almost all theories of test reliability is the idea that test scores reflect the influence of two sorts of factors:[]

1. Factors that contribute to consistency: stable characteristics of the individual or the attribute that one is trying to measure.
2. Factors that contribute to inconsistency: features of the individual or the situation that can affect test scores but have nothing to do with the attribute being measured.

Some of these inconsistencies include:[]

Temporary but general characteristics of the individual: health, fatigue, motivation, emotional strain.
Temporary and specific characteristics of the individual: comprehension of the specific test task, specific tricks or techniques of dealing with the particular test materials, fluctuations of memory, attention or accuracy.
Aspects of the testing situation: freedom from distractions, clarity of instructions, interaction of personality, sex, or race of examiner.
Chance factors: luck in selection of answers by sheer guessing, momentary distractions.

The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores.[] A true score is the replicable feature of the concept being measured. It is the part of the observed score that would recur across different measurement occasions in the absence of error. Errors of measurement are composed of both random error and systematic error; they represent the discrepancies between scores obtained on tests and the corresponding true scores. This conceptual breakdown is typically represented by the simple equation:

Observed test score = true score + errors of measurement

Classical test theory


The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimized. The central assumption of reliability theory is that measurement errors are essentially random. This does not mean that errors arise from random processes. For any individual, an error in measurement is not a completely random event. However, across a large number of individuals, the causes of measurement error are assumed to be so varied that measurement errors act as random variables.[]

If errors have the essential characteristics of random variables, then it is reasonable to assume that errors are equally likely to be positive or negative, and that they are not correlated with true scores or with errors on other tests. It is assumed that:[4]

1. Mean error of measurement = 0
2. True scores and errors are uncorrelated
3. Errors on different measures are uncorrelated

Reliability theory shows that the variance of obtained scores is simply the sum of the variance of true scores plus the variance of errors of measurement:[]

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$$

This equation suggests that test scores vary as the result of two factors:

1. Variability in true scores

2. Variability due to errors of measurement

The reliability coefficient provides an index of the relative influence of true and error scores on attained test scores. In its general form, the reliability coefficient is defined as the ratio of true score variance to the total variance of test scores. Or, equivalently, one minus the ratio of the variance of the error score to the variance of the observed score:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}$$
Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test. Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method comes at the problem of figuring out the source of error in the test somewhat differently.
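Although true-score variance cannot be observed directly, a small worked illustration with hypothetical numbers shows how the coefficient behaves: if $\sigma_T^2 = 8$ and $\sigma_E^2 = 2$, then $\sigma_X^2 = 8 + 2 = 10$, and the reliability coefficient is $\rho_{XX'} = 8/10 = 1 - 2/10 = 0.8$.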

Item response theory


It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better for test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function. The IRT information function is the reciprocal of the squared conditional standard error of measurement at any given test score.
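In symbols, if $I(\theta)$ denotes the test information at trait level $\theta$, then

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}.$$

For the dichotomous Rasch model, each item contributes $P_{ni}(1 - P_{ni})$ to the test information, so precision is greatest for persons whose trait levels are close to the item difficulties and lowest at the extremes of the score range.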

Estimation
The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores. Four practical strategies have been developed that provide workable methods of estimating test reliability.[]

1. Test-retest reliability method: directly assesses the degree to which test scores are consistent from one test administration to the next. It involves:
Administering a test to a group of individuals
Re-administering the same test to the same group at some later time
Correlating the first set of scores with the second
The correlation between scores on the first test and the scores on the retest is used to estimate the reliability of the test using the Pearson product-moment correlation coefficient: see also item-total correlation.

2. Parallel-forms method: the key to this method is the development of alternate test forms that are equivalent in terms of content, response processes and statistical characteristics. For example, alternate forms exist for several tests of general intelligence, and these tests are generally seen as equivalent.[] With the parallel test model it is possible to develop two forms of a test that are equivalent in the sense that a person's true score on form A would be identical to their true score on form B. If both forms of the test were administered to a number of people, differences between scores on form A and form B may be due to errors in measurement only.[] It involves:
Administering one form of the test to a group of individuals
At some later time, administering an alternate form of the same test to the same group of people

Correlating scores on form A with scores on form B
The correlation between scores on the two alternate forms is used to estimate the reliability of the test.
This method provides a partial solution to many of the problems inherent in the test-retest reliability method. For example, since the two forms of the test are different, the carryover effect is less of a problem. Reactivity effects are also partially controlled, although taking the first test may change responses to the second test. However, it is reasonable to assume that the effect will not be as strong with alternate forms of the test as with two administrations of the same test.[] However, this technique has its disadvantages:
It may be very difficult to create several alternate forms of a test
It may also be difficult, if not impossible, to guarantee that two alternate forms of a test are parallel measures

3. Split-half method: this method treats the two halves of a measure as alternate forms. It provides a simple solution to the problem that the parallel-forms method faces: the difficulty in developing alternate forms.[] It involves:
Administering a test to a group of individuals
Splitting the test in half
Correlating scores on one half of the test with scores on the other half of the test
The correlation between these two split halves is used in estimating the reliability of the test. This half-test reliability estimate is then stepped up to the full test length using the Spearman-Brown prediction formula (a brief worked example follows this section). There are several ways of splitting a test to estimate reliability. For example, a 40-item vocabulary test could be split into two subtests, the first one made up of items 1 through 20 and the second made up of items 21 through 40. However, the responses from the first half may be systematically different from responses in the second half due to an increase in item difficulty and fatigue.[] In splitting a test, the two halves would need to be as similar as possible, both in terms of their content and in terms of the probable state of the respondent. The simplest method is to adopt an odd-even split, in which the odd-numbered items form one half of the test and the even-numbered items form the other. This arrangement guarantees that each half will contain an equal number of items from the beginning, middle, and end of the original test.[]

4. Internal consistency: assesses the consistency of results across items within a test. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients.[5] Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, KuderRichardson Formula 20.[5] Although it is the most commonly used measure, there are some misconceptions regarding Cronbach's alpha.[6][7]

These measures of reliability differ in their sensitivity to different sources of error and so need not be equal. Also, reliability is a property of the scores of a measure rather than the measure itself, and reliability estimates are thus said to be sample dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population, because the true variability is different in this second population. (This is true of measures of all types: yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)
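The Spearman-Brown step-up mentioned under the split-half method works as follows. If $r_{hh}$ is the correlation between the two half-tests, the estimated reliability of the full-length test is

$$r_{full} = \frac{2\,r_{hh}}{1 + r_{hh}}.$$

For example (hypothetical figures), a half-test correlation of $r_{hh} = 0.70$ steps up to $r_{full} = 1.40/1.70 \approx 0.82$ for the whole test.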
Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,[5] and other informal means. However, formal psychometric analysis, called item analysis, is considered the most effective way to increase reliability. This analysis consists of computation of item difficulties and item discrimination indices, the latter index involving computation of correlations between the items and the sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.


References
[2] The Marketing Accountability Standards Board (MASB) endorses this definition as part of its ongoing Common Language: Marketing Activities and Metrics Project (http://www.themasb.org/common-language-project/).
[3] Types of Reliability (http://www.socialresearchmethods.net/kb/reltypes.php) The Research Methods Knowledge Base. Last Revised: 20 October 2006
[5] Cortina, J.M., (1993). What Is Coefficient Alpha? An Examination of Theory and Applications. Journal of Applied Psychology, 78(1), 98-104.
[6] Ritter, N. (2010). Understanding a widely misunderstood statistic: Cronbach's alpha. Paper presented at Southwestern Educational Research Association (SERA) Conference 2010, New Orleans, LA (ED526237).

External links
Uncertainty models, uncertainty quantification, and uncertainty processing in engineering (http://www.uncertainty-in-engineering.net)
The relationships between correlational and internal consistency concepts of test reliability (http://www.visualstatistics.net/Statistics/Principal Components of Reliability/PCofReliability.asp)
The problem of negative reliabilities (http://www.visualstatistics.net/Statistics/Reliability Negative/Negative Reliability.asp)

Repeatability
Repeatability or test-retest reliability[1] is the variation in measurements taken by a single person or instrument on the same item and under the same conditions. A less-than-perfect test-retest reliability causes test-retest variability. Such variability can be caused by, for example, intra-individual variability and intra-observer variability. A measurement may be said to be repeatable when this variation is smaller than some agreed limit.

Test-retest variability is used in practice, for example, in the medical monitoring of conditions. In these situations, there is often a predetermined "critical difference", and for differences in monitored values that are smaller than this critical difference, the possibility of pre-test variability as a sole cause of the difference may be considered in addition to, for example, changes in diseases or treatments.[]

Establishment
According to the Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results, the following conditions need to be fulfilled in the establishment of repeatability:

the same measurement procedure
the same observer
the same measuring instrument, used under the same conditions
the same location
repetition over a short period of time.

Repeatability methods were developed by Bland and Altman (1986).[2] If the correlation between separate administrations of the test is high (e.g. 0.7 or higher as in this Cronbach's alpha-internal consistency-table[3]), then it has good test-retest reliability.

The repeatability coefficient is a precision measure which represents the value below which the absolute difference between two repeated test results may be expected to lie with a probability of 95%. The standard deviation under repeatability conditions is part of precision and accuracy.
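Assuming the test-retest differences are approximately normally distributed with zero mean, the repeatability coefficient is commonly computed from the within-subject standard deviation $s_w$ as

$$RC = 1.96 \times \sqrt{2}\, s_w \approx 2.77\, s_w,$$

since the difference between two measurements on the same subject has standard deviation $\sqrt{2}\, s_w$.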


Desirability of repeatability
Test-retest reliability is desirable in measures of constructs that are not expected to change over time. For example, if you use a certain method to measure an adult's height, and then do the same again two years later, you would expect a very high correlation; if the results differed by a great deal, you would suspect that the measure was inaccurate. The same is true for personality traits such as extraversion, which are believed to change only very slowly. In contrast, if you were trying to measure mood, you would expect only moderate test-retest reliability, since people's moods are expected to change from day to day. Very high test-retest reliability would be bad, since it would suggest that you were not picking up on these changes.

Attribute Agreement Analysis for Defect Databases


An attribute agreement analysis is designed to simultaneously evaluate the impact of repeatability and reproducibility on accuracy. It allows the analyst to examine the responses from multiple reviewers as they look at several scenarios multiple times. It produces statistics that evaluate the ability of the appraisers to agree with themselves (repeatability), with each other (reproducibility), and with a known master or correct value (overall accuracy) for each characteristic over and over again.[4]

Psychological testing
Since the same test is administered twice and every test is parallel with itself, differences between scores on the test and scores on the retest should be due solely to measurement error. This sort of argument is quite probably true for many physical measurements. However, this argument is often inappropriate for psychological measurement, since it is often impossible to consider the second administration of a test a parallel measure to the first.[]

The second administration of a psychological test might yield systematically different scores than the first administration for the following reasons:[]

1. The attribute that is being measured may change between the first test and the retest. For example, a reading test that is administered in September to a third grade class may yield different results when retaken in June. We would expect some change in children's reading ability over that span of time, so a low test-retest correlation might reflect real changes in the attribute itself.
2. The experience of taking the test itself can change a person's true score. For example, completing an anxiety inventory could serve to increase a person's level of anxiety.
3. Carryover effects, particularly if the interval between test and retest is short. When retested, people may remember their original answers, which could affect answers on the second administration.


References
[1] Types of Reliability (http://www.socialresearchmethods.net/kb/reltypes.php) The Research Methods Knowledge Base. Last Revised: 20 October 2006
[2] http://www-users.york.ac.uk/~mb55/meas/ba.htm
[3] George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference. 11.0 update (4th ed.). Boston: Allyn & Bacon.
[4] http://www.isixsigma.com/tools-templates/measurement-systems-analysis-msa-gage-rr/attribute-agreement-analysis-defect-databases/

External links
Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results; appendix D (http://physics.nist.gov/Pubs/guidelines/appd.1.html)

Reproducibility
Reproducibility is the ability of an entire experiment or study to be reproduced, either by the original researcher or by someone else working independently. It is one of the main principles of the scientific method. The result values are said to be commensurate if they are obtained (in distinct experimental trials) according to the same reproducible experimental description and procedure. The basic idea can be seen in Aristotle's dictum that there is no scientific knowledge of the individual, where the word used for individual in Greek had the connotation of the idiosyncratic, or wholly isolated occurrence. Thus all knowledge, all science, necessarily involves the formation of general concepts and the invocation of their corresponding symbols in language (cf. Turner). Reproducibility also refers to the degree of agreement between measurements or observations conducted on replicate specimens in different locations by different people, as part of the precision of a test method.[1]

Reproducible data
Reproducibility is one component of the precision of a test method. The other component is repeatability which is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory. Both repeatability and reproducibility are usually reported as a standard deviation. A reproducibility limit is the value below which the difference between two test results obtained under reproducibility conditions may be expected to occur with a probability of approximately 0.95 (95%).[2] Reproducibility is determined from controlled interlaboratory test programs.[3][4]
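Under the normality assumptions usual in interlaboratory studies, the reproducibility limit $R$ is computed from the reproducibility standard deviation $s_R$ in the same way as the repeatability coefficient above:

$$R = 1.96 \times \sqrt{2}\, s_R \approx 2.8\, s_R.$$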

Reproducible research
The term reproducible research refers to the idea that the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper, such as the code, data, etc. necessary for reproduction of the results and building upon the research.[5][6][7] In 2012, a study found that 47 out of 53 medical research papers on the subject of cancer were irreproducible.[8] John P. A. Ioannidis wrote: "While currently there is unilateral emphasis on 'first' discoveries, there should be as much emphasis on replication of discoveries."[9]

While repeatability of scientific experiments is desirable, it is not considered necessary to establish the scientific validity of a theory. For example, the cloning of animals is difficult to repeat, but has been reproduced by various teams working independently, and is a well established research domain. One failed cloning does not mean that the theory is wrong or unscientific. Repeatability is often low in protosciences.


Noteworthy irreproducible results


Hideyo Noguchi became famous for correctly identifying the bacterial agent of syphilis, but also claimed that he could culture this agent in his laboratory. Nobody else has been able to produce this latter result.

In March 1989, University of Utah chemists Stanley Pons and Martin Fleischmann reported the production of excess heat that could only be explained by a nuclear process ("cold fusion"). The report was astounding given the simplicity of the equipment: it was essentially an electrolysis cell containing heavy water and a palladium cathode which rapidly absorbed the deuterium produced during electrolysis. The news media reported on the experiments widely, and it was a front-page item on many newspapers around the world (see science by press conference). Over the next several months others tried to replicate the experiment, but were unsuccessful.

Nikola Tesla claimed as early as 1899 to have used a high frequency current to light gas-filled lamps from over 25 miles (40 km) away without using wires. In 1904 he built Wardenclyffe Tower on Long Island to demonstrate means to send and receive power without connecting wires. The facility was never fully operational and was not completed due to economic problems.[10]

References
[1] ASTM E177
[2] ASTM E177
[3] ASTM E691 Standard Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method
[4] ASTM F1469 Standard Guide for Conducting a Repeatability and Reproducibility Study on Test Equipment for Nondestructive Testing
[5] Sergey Fomel and Jon Claerbout, "Guest Editors' Introduction: Reproducible Research (http://www.rrplanet.com/reproducible-research-librum/viewtopic.php?f=30&t=372)," Computing in Science and Engineering, vol. 11, no. 1, pp. 5-7, Jan./Feb. 2009.
[6] J. B. Buckheit and D. L. Donoho, "WaveLab and Reproducible Research (http://www.rrplanet.com/reproducible-research-librum/viewtopic.php?f=30&t=53)," Dept. of Statistics, Stanford University, Tech. Rep. 474, 1995.
[7] The Yale Law School Round Table on Data and Core Sharing: "Reproducible Research (http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2010.113)", Computing in Science and Engineering, vol. 12, no. 5, pp. 8-12, Sept/Oct 2010.
[8] http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
[9] Is the spirit of Piltdown man alive and well? (http://www.telegraph.co.uk/technology/3342867/Is-the-spirit-of-Piltdown-man-alive-and-well.html)
[10] Cheney, Margaret (1999), Tesla: Master of Lightning, New York: Barnes & Noble Books, ISBN 0-7607-1005-8, p. 107: "Unable to overcome his financial burdens, he was forced to close the laboratory in 1905."

Turner, William (1903), History of Philosophy, Ginn and Company, Boston, MA, Etext (http://www2.nd.edu/Departments//Maritain/etext/hop.htm). See especially: "Aristotle" (http://www2.nd.edu/Departments//Maritain/etext/hop11.htm).
Definition (PDF) (http://www.iupac.org/goldbook/R05305.pdf), by International Union of Pure and Applied Chemistry

External links
Reproducible Research in Computational Science (http://www.csee.wvu.edu/~xinl/source.html)
Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results; appendix D (http://physics.nist.gov/Pubs/guidelines/appd.1.html)
Definition of reproducibility in the IUPAC Gold Book (http://goldbook.iupac.org/R05305.html)
Detailed article on Reproducibility (http://arstechnica.com/journals/science.ars/2006/10/25/5744)
Reproducible Research Planet (http://www.rrplanet.com/)
ReproducibleResearch.net (http://www.reproducibleresearch.net)


Riddle scale
The Riddle scale (also known as Riddle homophobia scale or Riddle scale of homophobia) is a psychometric scale that measures the degree to which a person is or is not homophobic. The scale is frequently used in tolerance education about anti-discriminatory attitudes regarding sexual orientation. It is named after its creator, psychologist Dorothy Riddle.

Overview
The Riddle homophobia scale was developed by Dorothy Riddle in 1973-74 while she was overseeing research for the American Psychological Association Task Force on Gays and Lesbians.[1] The scale was distributed at talks and workshops but was not formally published for a long time; it is cited in the literature either as an (unpublished) conference presentation from 1985[2] or as an article from 1994.[3] At the time it was developed, Riddle's analysis was one of the first modern classifications of attitudes towards homosexuality.[citation needed] In that respect, the scale has served the purpose that Riddle originally had in mind: she devised the scale to explicate the continuum of attitudes toward gays and lesbians and to assess the current and desired institutional culture of an organization or a work place.[4]

Level of measurement
The Riddle scale is an eight-term uni-dimensional Likert-type interval scale with nominal labels and no explicit zero point. Each term is associated with a set of attributes and beliefs; individuals are assigned a position on the scale based on the attributes they exhibit and beliefs they hold. The scale is frequently divided into two parts, the 'homophobic levels of attitude' (first four terms) and the 'positive levels of attitude' (last four terms).[5]

The scale
Repulsion: Homosexuality is seen as a crime against nature. Gays/lesbians are considered sick, crazy, immoral, sinful, wicked, etc. Anything is justified to change them: incarceration, hospitalization, behavior therapy, electroshock therapy, etc.
Pity: Represents heterosexual chauvinism. Heterosexuality is considered more mature and certainly to be preferred. It is believed that any possibility of becoming straight should be reinforced, and those who seem to be born that way should be pitied as less fortunate ("the poor dears").
Tolerance: Homosexuality is viewed as a phase of adolescent development that many people go through and most people grow out of. Thus, lesbians/gays are less mature than straights and should be treated with the protectiveness and indulgence one uses with children who are still maturing. It is believed that lesbians/gays should not be given positions of authority because they are still working through their adolescent behavior.
Acceptance: Still implies that there is something to accept; the existing climate of discrimination is ignored. Characterized by such statements as "You're not lesbian to me, you're a person!" or "What you do in bed is your own business." or "That's fine with me as long as you don't flaunt it!"
Support: People at this level may be uncomfortable themselves, but they are aware of the homophobic climate and the irrational unfairness, and work to safeguard the rights of lesbians and gays.
Admiration: It is acknowledged that being lesbian/gay in our society takes strength. People at this level are willing to truly examine their homophobic attitudes, values, and behaviors.

Appreciation: The diversity of people is considered valuable and lesbians/gays are seen as a valid part of that diversity. People on this level are willing to combat homophobia in themselves and others.
Nurturance: Assumes that gay/lesbian people are indispensable in our society. People on this level view lesbians/gays with genuine affection and delight, and are willing to be their allies and advocates.


Discussion
Riddle's analysis has been credited for pointing out that although 'tolerance' and 'acceptance' can be seen as positive attitudes, they should actually be treated as negative because they can mask underlying fear or hatred (somebody can tolerate a baby crying on an airplane while at the same time wishing that it would stop) or indicate that there is indeed something that we need to accept, and that we are the ones with the power to reject or to accept.[6][7] This observation generalizes to attitude evaluations in other areas besides sexual orientation and is one of the strengths of Riddle's study. Although it deals mostly with adult attitudes towards difference, the model has been positioned in the cognitive developmental tradition of Piaget and Kohlberg's stages of moral development.[8] As a psychometric scale, the Riddle scale has been considered to have acceptable face validity but its exact psychometric properties are unknown.[9][10]

References
[1] Staten Island LGBT history (http://www.silgbtcenter.org/) Staten Island LGBT Community Center, Accessed Dec. 19, 2010.
[2] Riddle, D. I. (1985). Homophobia scale. Opening doors to understanding and acceptance: A facilitator's guide for presenting workshops on lesbian and gay issues. Workshop organized by Kathy Obear and Amy Reynolds, Boston. Unpublished essay.
[3] Riddle, D., (1994). The Riddle scale. Alone no more: Developing a school support system for gay, lesbian and bisexual youth. St Paul: Minnesota State Department.
[4] Peterkin, A., Risdon, C., (2003). Caring for lesbian and gay people: A clinical guide. Toronto: University of Toronto Press, Inc.
[5] Clauss-Ehlers, C. S. (ed), (2010). Encyclopedia of Cross-Cultural School Psychology. New York: Springer.
[6] Blumenfeld, W. J. (2000). How homophobia hurts everyone. Readings for diversity and social justice. New York: Routledge, 267-275.
[7] Ollis, D., (2004). "I'm just a home economics teacher": Does discipline background impact on teachers' ability to affirm and include gender and sexual diversity in secondary school health education programs? AARE Conference, Melbourne 2004.
[8] Hirscheld, S., (2001). Moving beyond the safety zone: A staff development approach to anti-heterosexist education. Fordham Urban Law Journal, 29, 611-641.
[9] Finkel, M. J., Storaasli, R. D., Bandele, A., and Schaefer, V., (2003). Diversity training in graduate school: An exploratory evaluation of the safe zone project. Professional Psychology: Research and Practice, 34, 555-561.
[10] Tucker, E. W., and Potocky-Tripodi, M., (2006). Changing heterosexuals' attitudes toward homosexuals: A systematic review of the empirical literature. Research on Social Work Practice, 16 (2), 176-190.


Risk Inclination Formula


The Risk Inclination Formula (RIF)[3] is a component of the Risk Inclination Model. The RIF uses the Principle of Moments, or Varignon's Theorem ([1][2]), to calculate the first factorial moment of probability in order to define the center point of balance among all confidence weights (i.e., the point of risk equilibration).

The formal derivation of the RIF is divided into three separate calculations: (1) calculation of the first factorial moment, (2) calculation of inclination, and (3) calculation of the Risk Inclination Score.

References
[3] ,

Risk Inclination Model


Risk Inclination (RI) is defined as a mental disposition (i.e., confidence) toward an eventuality (i.e., a predicted state) that has consequences (i.e., either loss or gain). The Risk Inclination Model (RIM) is composed of three constructs: confidence weighting, restricted context, and the Risk Inclination Formula. Each of these constructs connects an outside observer with a respondent's inner state of risk taking toward knowledge certainty.

Confidence weighting
The Confidence Weighting (CW) construct is concerned with indices that connect an outside observer to the respondent's inner state of knowledge certainty toward specific content.[1][2][3][4] Underpinning the CW construct of the Risk Inclination Model is the individual's experience of coherence or rightness,[5] which is used to calibrate the relationship between a respondent's objective and observable measures of risk taking (i.e., weighted indices toward answer selections) and his or her subjective inner feelings of knowledge certainty (i.e., feelings of rightness).

Restricted context
The restricted context (RC) construct is based on Piaget's theory of equilibration[6] and allows the outside observer to measure the way a respondent manages competing inner states of knowledge certainty during the application of confidence weights among items within the restricted Total Point Value (TPV) context of the test. RC sets the parameters within which risk taking toward knowledge certainty occurs. These parameters are important because they allow an observer to scale, and thereby measure, the respondent's inner state of equilibration among related levels of knowledge certainty. Equilibration is defined as a self-regulatory process that reflects the biological drive to produce an optimal state of balance between a person's cognitive structures (i.e., inner state) and their environment.[7]


Risk Inclination Formula


The Risk Inclination Formula (RIF) construct is based upon Varignon's Theorem and quantifies feelings of rightness toward knowledge certainty.[8][9] The RIF uses the Principle of Moments, or Varignon's Theorem, to calculate the first factorial moment of probability in order to define the center point of balance among all confidence weights (i.e., the point of risk equilibration).[10][11] The formal derivation of the RIF is divided into three separate calculations: (1) calculation of the first factorial moment, (2) calculation of inclination, and (3) calculation of the Risk Inclination Score.
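The published formula itself is not reproduced here, but the idea of a "center point of balance" is essentially a first-moment (weighted-mean) calculation. The Python sketch below illustrates only that general idea over hypothetical confidence weights; the function name and data layout are illustrative assumptions and are not taken from the RIF literature.

def balance_point(confidence_weights):
    # confidence_weights: hypothetical weights a respondent assigns to
    # answer positions 1, 2, ..., k; this is an illustration, not the RIF.
    positions = range(1, len(confidence_weights) + 1)
    total = sum(confidence_weights)
    # first moment of the weight distribution = center of balance
    return sum(p * w for p, w in zip(positions, confidence_weights)) / total

# Example: weights [1, 2, 7] balance at (1*1 + 2*2 + 3*7) / 10 = 2.6,
# i.e. the weights lean strongly toward the highest-confidence position.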

References
[8] , [9] Coxeter, H. S. M. and Greitzer, S. L. "Quadrangle; Varignon's theorem" 3.1 in Geometry Revisited. Washington, DC: Math. Assoc. Amer., pp. 5254, 1967.

Role-based assessment
Modern psychological testing can be traced back to 1908 with the introduction of the first successful intelligence test, the Binet-Simon Scale.[1] From the Binet-Simon came the revised version, the Stanford-Binet, which was used in the development of the Army Alpha and Army Beta tests used by the United States military.[2] During World War I, Robert S. Woodworth developed the Woodworth Personal Data Sheet (WPDS) to determine which soldiers were better prepared to handle the stresses of combat. The WPDS signaled a shift in the focus of psychological testing from intellect to personality.[3] By the 1940s, the quantitative measurement of personality traits had become a central theme in psychology, and it has remained so into the 2000s. During this time, numerous variations and versions of personality tests have been created, including the widely used Myers-Briggs, DISC, and Cattell's 16PF Questionnaire.[4]

Role-Based Assessment (RBA) differs significantly from personality testing.[5] Instead of quantifying individual personality factors, RBA's methodology was developed, from its very beginnings, to make qualitative observations of human interaction.[6] In this sense, RBA is a form of behavioral simulation. Understanding the quality of a person's behavior on a team can be a valuable adjunct to other forms of evaluation (such as data on experience, knowledge, skills, and personality) because the ability to successfully cooperate and collaborate with others is fundamental to organizational performance.

Concepts
Coherence
In TGI Role-Based Assessment, Coherence describes a positive and constructive orientation to working with others to achieve common goals, overcome obstacles, and meet organizational needs.[7][8][9]

Role
A person's Role describes their strongest affinity for, or attraction to, serving a certain type of organizational need, e.g., planning for the future vs. executing current tasks vs. preserving and sharing knowledge.[10][11]

Teaming Characteristics
Each RBA report includes a detailed section on Teaming Characteristics, which are derived, in part, from the relationship between a person's level of Coherence and their unique Role (or Roles). As their name suggests, Teaming Characteristics can help managers and coaches to understand how well a person will fit within a team

and/or adapt to their job responsibilities.[12][13]


Historical Development
Dr. Janice Presser began collaborating with Dr. Jack Gerber in 1988 to develop tools and methods for measuring the fundamental elements of human teaming behavior, with a goal of improving individual and team performance. Their work combines decades of research, blending Dr. Presser's earlier work in family and social relationships with Dr. Gerber's Mosaic Figures test, which had been designed to produce qualitative information on how individuals view other people.[14] Three generations of assessments were developed, tested and used in the context of actual business performance. The initial Executive Behavior Assessment was focused on the behavior of persons with broad responsibility for organizational performance. The second iteration, called the Enhanced Executive Behavior Assessment, incorporated metrics on the behavior of executives working in teams. Drs. Presser and Gerber then successfully applied their testing methodology to team contributors outside of the executive ranks, and as development and testing efforts continued, Role-Based Assessment (RBA) emerged.[15] By 1999, RBA was established as a paper-based assessment and was being sold for use in pre-hire screening and organizational development.[16] Drs. Presser and Gerber formed The Gabriel Institute in 2001, with the goal of making RBA available to a greater audience via the Internet.[17] In mid-2009, TGI Role-Based Assessment became generally available as an online assessment instrument. Later in 2009, the Society for Human Resource Management (SHRM) published a two-part white paper by Dr. Presser, which introduced ground-breaking ideas on the measurement and valuation of human synergy in organizations, and an approach to the creation of a strong, positively-oriented human infrastructure.[18][19]

Applications
The most common use of TGI Role-Based Assessment is in pre-hire screening evaluations. RBA's focus on teaming behavior offers a different way, it is claimed, to predict how an individual will fit with company culture and a given team, and how they are likely to respond to specific job requirements.[20] While other pre-hire testing may run the "risk of violating the ADA" (Americans with Disabilities Act), this does not appear to be an issue with Role-Based Assessment.[21] RBA is also claimed to have unique potential for strengthening a human infrastructure. Results from RBA reports can be aggregated, providing quantitative data that is used for analysis and resolution of team performance problems, and to identify and select candidates for promotion.[22]

References
[1] Santrock, John W. (2008). A Topical Approach to Life-Span Development (4th ed.), Concept of Intelligence (283-284). New York: McGraw-Hill.
[2] Fancher, R. (1985). The Intelligence Men: Makers of the IQ Controversy. New York: W.W. Norton & Company.
[4] Personality Theories, Types and Tests. (http://www.businessballs.com/personalitystylesmodels.htm) Businessballs.com. 2009.
[18] SHRM - The Measurement & Valuation of Human Infrastructure: An Introduction to CHI Indicators (http://www.shrm.org/Research/Articles/Articles/Pages/InfrastructureCHI.aspx)
[19] SHRM - The Measurement & Valuation of Human Infrastructure: An Intro. to the New Way to Know (http://www.shrm.org/Research/Articles/Articles/Pages/New Way to Know.aspx)
[20] Edmonds Wickman, Lindsay. Role-Based Assessment: Thinking Inside the Box. (http://talentmgt.com/articles/view/rolebased_assessment_thinking_inside_the_box/3) Talent Management Magazine (October 2008). Media Tec Publishing Inc.
[22] Edmonds Wickman, Lindsay. Role-Based Assessment: Thinking Inside the Box. (http://talentmgt.com/articles/view/rolebased_assessment_thinking_inside_the_box/3) Talent Management Magazine (October 2008). Media Tec Publishing Inc.


External links
University of Pennsylvania Journal of Labor and Employment Law (http://www.law.upenn.edu/journals/jbl/articles/volume9/issue1/Gonzales-Frisbie9U.Pa.J.Lab.&Emp.L.185(2006).pdf)
Innovation America - Put Your Money Where Your Team Is! (http://www.innovationamerica.us/index.php/innovation-daily/3780-put-your-money-where-your-team-is-)
National Association of Seed and Venture Funds (NASVF) - Make Sure People Will Fit Before You Hire Them. (http://www.nasvf.org/index.php?option=com_content&view=article&id=146:make-sure-people-will-fit-nbefore-you-hire-them&catid=5:features&Itemid=38)

Scale (social sciences)


In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities. The level of measurement is the type of data that is measured.

Comparative and non comparative scaling


With comparative scaling, the items are directly compared with each other (example: Do you prefer Pepsi or Coke?). In non-comparative scaling, each item is scaled independently of the others (example: How do you feel about Coke?).

Composite measures
Composite measures of variables are created by combining two or more separate empirical indicators into a single measure. Composite measures measure complex concepts more adequately than single indicators, extend the range of scores available and are more efficient at handling multiple items. In addition to scales, there are two other types of composite measures. Indexes are similar to scales except multiple indicators of a variable are combined into a single measure. The index of consumer confidence, for example, is a combination of several measures of consumer attitudes. A typology is similar to an index except the variable is measured at the nominal level. Indexes are constructed by accumulating scores assigned to individual attributes, while scales are constructed through the assignment of scores to patterns of attributes. While indexes and scales provide measures of a single dimension, typologies are often employed to examine the intersection of two or more dimensions. Typologies are very useful analytical tools and can be easily used as independent variables, although since they are not unidimensional it is difficult to use them as a dependent variable.


Data types
The type of information collected can influence scale construction. Different types of information are measured in different ways.
1. Some data are measured at the nominal level. That is, any numbers used are mere labels: they express no mathematical properties. Examples are SKU inventory codes and UPC bar codes.
2. Some data are measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. An example is a preference ranking.
3. Some data are measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. Examples are attitude scales and opinion scales.
4. Some data are measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include: age, income, price, costs, sales revenue, sales volume, and market share.

Scale construction decisions


What level of data is involved (nominal, ordinal, interval, or ratio)?
What will the results be used for?
Should you use a scale, index, or typology?
What types of statistical analysis would be useful?
Should you use a comparative scale or a noncomparative scale?
How many scale divisions or categories should be used (1 to 10; 1 to 7; -3 to +3)?
Should there be an odd or even number of divisions? (Odd gives a neutral center value; even forces respondents to take a non-neutral position.)
What should the nature and descriptiveness of the scale labels be?
What should the physical form or layout of the scale be? (graphic, simple linear, vertical, horizontal)
Should a response be forced or be left optional?

Comparative scaling techniques


Pairwise comparison scale: a respondent is presented with two items at a time and asked to select one (example: Do you prefer Pepsi or Coke?). This is an ordinal level technique when a measurement model is not applied. Krus and Kennedy (1977) elaborated the paired comparison scaling within their domain-referenced model. The Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952; Luce, 1959) can be applied in order to derive measurements provided the data derived from paired comparisons possess an appropriate structure (see the estimation sketch after this list). Thurstone's Law of comparative judgment can also be applied in such contexts.
Rasch model scaling: respondents interact with items, and comparisons are inferred between items from the responses in order to obtain scale values. Respondents are subsequently also scaled based on their responses to items given the item scale values. The Rasch model has a close relation to the BTL model.
Rank-ordering: a respondent is presented with several items simultaneously and asked to rank them (example: Rate the following advertisements from 1 to 10). This is an ordinal level technique.
Bogardus social distance scale: measures the degree to which a person is willing to associate with a class or type of people. It asks how willing the respondent is to make various associations. The results are reduced to a single score on a scale. There are also non-comparative versions of this scale.
Q-Sort: up to 140 items are sorted into groups based on a rank-order procedure.
Guttman scale: a procedure to determine whether a set of items can be rank-ordered on a unidimensional scale. It utilizes the intensity structure among several indicators of a given variable. Statements are listed in order of importance. The rating is scaled by summing all responses until the first negative response in the list. The Guttman scale is related to Rasch measurement; specifically, Rasch models bring the Guttman approach within a probabilistic framework.
Constant sum scale: a respondent is given a constant sum of money, script, credits, or points and asked to allocate these to various items (example: If you had 100 Yen to spend on food products, how much would you spend on product A, on product B, on product C, etc.). This is an ordinal level technique.
Magnitude estimation scale: in a psychophysics procedure invented by S. S. Stevens, people simply assign numbers to the dimension of judgment. The geometric mean of those numbers usually produces a power law with a characteristic exponent. In cross-modality matching, instead of assigning numbers, people manipulate another dimension, such as loudness or brightness, to match the items. Typically the exponent of the psychometric function can be predicted from the magnitude estimation exponents of each dimension.
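The Bradley-Terry-Luce scaling mentioned in the pairwise comparison entry can be illustrated with a short sketch. The Python fragment below is a minimal example, not taken from any of the cited sources: the items and win counts are invented, and the simple iterative (Zermelo-type) update shown here is only one of several ways the BTL model can be estimated.

```python
import math

# Hypothetical paired-comparison data: wins[i][j] = number of times
# item i was preferred over item j (items and counts are illustrative only).
items = ["Cola A", "Cola B", "Cola C"]
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]

def btl_strengths(wins, iterations=200):
    """Estimate Bradley-Terry-Luce strengths with a simple
    minorization-maximization (Zermelo-type) update."""
    n = len(wins)
    p = [1.0] * n                       # start with equal strengths
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])   # total wins for item i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [v / s for v in new_p]      # normalize so strengths sum to 1
    return p

strengths = btl_strengths(wins)
# Log-strengths behave like interval-level scale values for the items.
for name, strength in zip(items, strengths):
    print(f"{name}: strength = {strength:.3f}, scale value = {math.log(strength):.3f}")
```

Under the BTL model, the estimated probability that one item is preferred over another is its strength divided by the sum of the two strengths, so the fitted scale values can be checked against the observed choice proportions.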


Non-comparative scaling techniques


Continuous rating scale (also called the graphic rating scale): respondents rate items by placing a mark on a line. The line is usually labeled at each end. There are sometimes a series of numbers, called scale points (say, from zero to 100), under the line. Scoring and codification is difficult.
Likert scale: respondents are asked to indicate the amount of agreement or disagreement (from strongly agree to strongly disagree) on a five- to nine-point scale. The same format is used for multiple questions. This categorical scaling procedure can easily be extended to a magnitude estimation procedure that uses the full scale of numbers rather than verbal categories.
Phrase completion scales: respondents are asked to complete a phrase on an 11-point response scale in which 0 represents the absence of the theoretical construct and 10 represents the theorized maximum amount of the construct being measured. The same basic format is used for multiple questions.
Semantic differential scale: respondents are asked to rate an item on various attributes using a 7-point scale. Each attribute requires a scale with bipolar terminal labels.
Stapel scale: a unipolar ten-point rating scale. It ranges from +5 to -5 and has no neutral zero point.
Thurstone scale: a scaling technique that incorporates the intensity structure among indicators.
Mathematically derived scale: researchers infer respondents' evaluations mathematically. Two examples are multidimensional scaling and conjoint analysis.

Scale evaluation
Scales should be tested for reliability, generalizability, and validity. Generalizability is the ability to make inferences from a sample to the population, given the scale you have selected. Reliability is the extent to which a scale will produce consistent results. Test-retest reliability checks how similar the results are if the research is repeated under similar circumstances. Alternative forms reliability checks how similar the results are if the research is repeated using different forms of the scale. Internal consistency reliability checks how well the individual measures included in the scale are converted into a composite measure.
Scales and indexes have to be validated. Internal validation checks the relation between the individual measures included in the scale and the composite scale itself. External validation checks the relation between the composite scale and other indicators of the variable, indicators not included in the scale. Content validation (also called face validity) checks how well the scale measures what it is supposed to measure. Criterion validation checks how meaningful the scale criteria are relative to other possible criteria. Construct validation checks what underlying construct is being measured. There are three variants of construct validity: convergent validity, discriminant validity, and nomological validity (Campbell and Fiske, 1959; Krus and Ney, 1978). The coefficient of reproducibility indicates how well the data from the individual measures included in the scale can be reconstructed from the composite scale.
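Internal consistency reliability is commonly summarized with a coefficient such as Cronbach's alpha. The short Python sketch below only illustrates that one computation; the four-item response matrix is invented, and real scale evaluation would also examine the other forms of reliability and validity listed above.

```python
from statistics import variance

# Hypothetical responses: rows are respondents, columns are the four
# items of a composite scale (e.g., 5-point agreement ratings).
data = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
]

def cronbach_alpha(rows):
    """Internal-consistency reliability of a sum score built from k items."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

print(f"Cronbach's alpha = {cronbach_alpha(data):.3f}")
```

Values closer to 1 indicate that the items vary together and can reasonably be combined into a single composite score.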


Further reading
DeVellis, Robert F. (2003), Scale Development: Theory and Applications [1] (2nd ed.), London: SAGE Publications, ISBN 0-7619-2604-6 (cloth), retrieved 11 August 2010. Paperback ISBN 0-7619-2605-4.
Lodge, Milton (1981), Magnitude Scaling: Quantitative Measurement of Opinions, Beverly Hills & London: SAGE Publications, ISBN 0-8039-1747-3.
McIver, John P. & Carmines, Edward G. (1981), Unidimensional Scaling [2], Beverly Hills & London: SAGE Publications, ISBN 0-8039-1736-8, retrieved 11 August 2010.

References
Bradley, R.A. & Terry, M.E. (1952). Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika, 39, 324-345.
Campbell, D.T. & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Hodge, D.R. & Gillespie, D.F. (2003). Phrase Completions: An alternative to Likert scales. Social Work Research, 27(1), 45-55.
Hodge, D.R. & Gillespie, D.F. (2005). Phrase Completion Scales. In K. Kempf-Leonard (Ed.), Encyclopedia of Social Measurement (Vol. 3, pp. 53-62). San Diego: Academic Press.
Krus, D.J. & Kennedy, P.H. (1977). Normal scaling of dominance matrices: The domain-referenced model. Educational and Psychological Measurement, 37, 189-193 (Request reprint). [3]
Krus, D.J. & Ney, R.G. (1978). Convergent and discriminant validity in item analysis. Educational and Psychological Measurement, 38, 135-137 (Request reprint). [4]
Luce, R.D. (1959). Individual Choice Behaviours: A Theoretical Analysis. New York: J. Wiley.

Lists of related topics


List of marketing topics
List of management topics
List of economics topics

External links
Handbook of Management Scales Multi-item metrics to be used in research, Wikibooks [5]

References
[1] http://books.google.com/books?id=BYGxL6xLokUC&printsec=frontcover&dq=scale+development#v=onepage&q&f=false
[2] http://books.google.com/books?id=oL8xP7EX9XIC&printsec=frontcover&dq=unidimensional+scaling#v=onepage&q&f=false
[3] http://www.visualstatistics.net/Scaling/Domain%20Referenced%20Scaling/Domain-Referenced%20Scaling.htm
[4] http://www.visualstatistics.net/Statistics/Item%20Analysis%20CD%20Validity/Item%20Analysis%20CD%20Validity.htm
[5] http://en.wikibooks.org/wiki/Handbook_of_Management_Scales


Self-report inventory

A self-report inventory is a type of psychological test in which a person fills out a survey or questionnaire with or without the help of an investigator. Self-report inventories often ask direct questions about symptoms, behaviors, and personality traits associated with one or many mental disorders or personality types in order to easily gain insight into a patient's personality or illness. Most self-report inventories can be taken or administered within five to 15 minutes, although some, like the Minnesota Multiphasic Personality Inventory (MMPI), can take up to three hours to fully complete.
There are three major approaches to developing self-report inventories: theory-guided, factor-analytic, and criterion-keyed. Theory-guided inventories are constructed around a theory of personality. Criterion-keyed inventories are based around questions that have been shown to statistically discriminate between a control group and a criterion group.
Questionnaires typically use one of three formats: a Likert scale, true-false, or forced choice. True-false involves questions that the individual denotes as either being true or false about themselves. Forced-choice is a pair of statements that require the individual to choose one as being most representative of themselves.
Self-report inventories can have validity problems. Patients may exaggerate symptoms in order to make their situation seem worse, or they may under-report the severity or frequency of symptoms in order to minimize their problems. Another issue is the social desirability bias.
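As a rough illustration of how a criterion-keyed, true-false format might be scored, the following Python sketch counts how many of a respondent's answers match the empirically keyed direction. The items and the key are invented for the example and do not come from any published inventory.

```python
# Hypothetical criterion-keyed true/false items: for each statement, the key
# records which answer was more common in the criterion group, so a matching
# answer adds one point to the raw scale score.
scoring_key = {
    "I often feel restless.": True,
    "I sleep well most nights.": False,
    "Small problems rarely worry me.": False,
}

responses = {
    "I often feel restless.": True,
    "I sleep well most nights.": False,
    "Small problems rarely worry me.": True,
}

raw_score = sum(1 for item, keyed_answer in scoring_key.items()
                if responses.get(item) == keyed_answer)
print(f"Raw scale score: {raw_score} out of {len(scoring_key)}")
```

In practice the raw score would then be compared with norms for the inventory rather than interpreted on its own.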

Problems with Self-report inventories


The biggest problem with self-report inventories is that patients may exaggerate symptoms in order to make their situation seem worse, or they may under-report the severity or frequency of symptoms in order to minimize their problems. For this reason, self-report inventories should be used only for measuring symptom change and severity and should never be used on their own to diagnose a mental disorder. Clinical discretion is advised for all self-report inventories. Many personality tests, such as the MMPI or the MBTI, add questions that are designed to make it very difficult for a person to exaggerate traits and symptoms. However, these tests suffer from the inherent problems associated with personality theory and testing, in that personality is a fluid concept that can be difficult to define.

Popular Self-Report Inventories


16 PF
Beck Anxiety Inventory
Beck Depression Inventory
Beck Hopelessness Scale
California Psychological Inventory
Eysenck Personality Questionnaire
Geriatric Depression Scale
Hirschfeld Mood Disorder Questionnaire
Kuder Occupational Interest Survey
Major Depression Inventory
Minnesota Multiphasic Personality Inventory
Myers-Briggs Type Indicator
Personality Inventory for Children-2
Revised NEO Personality Inventory
State-Trait Anxiety Inventory

References
Aiken, L.R. (2002). Psychological Testing and Assessment. New York: Allyn & Bacon.
Gregory, R.J. (2007). Psychological Testing: History, Principles, and Applications (5th ed.). Boston: Pearson Education.


Semantic differential
Fig. 1. Modern Japanese version of the semantic differential. The Kanji characters in the background stand for "God" and "Wind" respectively, with the compound reading "Kamikaze". (Adapted from Dimensions of Meaning. Visual Statistics Illustrated at VisualStatistics.net.)
MeSH: D012659 [1]

Semantic differential is a type of rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept.

Osgood's semantic differential was designed to measure the connotative meaning of concepts. The respondent is asked to choose where his or her position lies, on a scale between two bipolar adjectives (for example: "Adequate-Inadequate", "Good-Evil" or "Valuable-Worthless"). Semantic differentials can be used to describe not only persons, but also the connotative meaning of abstract conceptsa capacity used extensively in affect control theory.

Theoretical background
Nominalists and realists
Theoretical underpinnings of Charles E. Osgood's semantic differential have roots in the medieval controversy between the nominalists and realists.[citation needed] Nominalists asserted that only real things are entities and that abstractions from these entities, called universals, are mere words. The realists held that universals have an independent objective existence either in a realm of their own or in the mind of God. Osgood's theoretical work also bears affinity to linguistics and general semantics and relates to Korzybski's structural differential.[citation needed]

Use of adjectives
The development of this instrument provides an interesting insight into the border area between linguistics and psychology. People have been describing each other since they developed the ability to speak. Most adjectives can also be used as personality descriptors. The occurrence of thousands of adjectives in English is an attestation of the subtleties in descriptions of persons and their behavior available to speakers of English. Roget's Thesaurus is an early attempt to classify most adjectives into categories and was used within this context to reduce the number of adjectives to manageable subsets, suitable for factor analysis.


Evaluation, potency, and activity


Osgood and his colleagues performed a factor analysis of large collections of semantic differential scales and found three recurring attitudes that people use to evaluate words and phrases: evaluation, potency, and activity. Evaluation loads highest on the adjective pair 'good-bad'. The 'strong-weak' adjective pair defines the potency factor. Adjective pair 'active-passive' defines the activity factor. These three dimensions of affective meaning were found to be cross-cultural universals in a study of dozens of cultures. This factorial structure makes intuitive sense. When our ancestors encountered a person, the initial perception had to be whether that person represents a danger. Is the person good or bad? Next, is the person strong or weak? Our reactions to a person markedly differ if perceived as good and strong, good and weak, bad and weak, or bad and strong. Subsequently, we might extend our initial classification to include cases of persons who actively threaten us or represent only a potential danger, and so on. The evaluation, potency and activity factors thus encompass a detailed descriptive system of personality. Osgood's semantic differential measures these three factors. It contains sets of adjective pairs such as warm-cold, bright-dark, beautiful-ugly, sweet-bitter, fair-unfair, brave-cowardly, meaningful-meaningless. The studies of Osgood and his colleagues revealed that the evaluative factor accounted for most of the variance in scalings, and related this to the idea of attitudes.[2]
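A small Python sketch can make the evaluation-potency-activity (EPA) scoring concrete. The adjective pairs, their assignment to factors, and the single respondent's ratings below are invented for illustration; operational instruments use larger, validated sets of scales.

```python
from statistics import mean

# Hypothetical bipolar adjective pairs assigned to Osgood's three factors.
# Ratings on a 7-point scale are coded from -3 (left pole) to +3 (right pole).
factor_of_pair = {
    "bad-good": "evaluation",
    "ugly-beautiful": "evaluation",
    "weak-strong": "potency",
    "small-large": "potency",
    "passive-active": "activity",
    "slow-fast": "activity",
}

ratings = {  # one respondent's ratings of a single concept
    "bad-good": 2, "ugly-beautiful": 3,
    "weak-strong": -1, "small-large": 0,
    "passive-active": 1, "slow-fast": 2,
}

profile = {}
for factor in ("evaluation", "potency", "activity"):
    pairs = [pair for pair, f in factor_of_pair.items() if f == factor]
    profile[factor] = mean(ratings[pair] for pair in pairs)

print(profile)  # e.g. {'evaluation': 2.5, 'potency': -0.5, 'activity': 1.5}
```

Averaging within each factor yields a three-number EPA profile for the rated concept, which is the form in which semantic differential data are typically compared across concepts or respondents.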

Usage
The semantic differential is today one of the most widely used scales in the measurement of attitudes. One of the reasons is the versatility of the items. The bipolar adjective pairs can be used for a wide variety of subjects, and as such the scale has been nicknamed "the ever ready battery" of the attitude researcher.[3]

Statistical properties
Five items, or five bipolar pairs of adjectives, have been shown to yield reliable findings, which correlate highly with alternative measures of the same attitude.[4] The biggest problem with this scale is that the properties of the level of measurement are unknown.[5] The most statistically sound approach is to treat it as an ordinal scale, but it can be argued that the neutral response (i.e. the middle alternative on the scale) serves as an arbitrary zero point, and that the intervals between the scale values can be treated as equal, making it an interval scale. A detailed presentation on the development of the semantic differential is provided in the monumental book Cross-Cultural Universals of Affective Meaning.[6] David R. Heise's Surveying Cultures[7] provides a contemporary update with special attention to measurement issues when using computerized graphic rating scales.

Notes
[1] http://www.nlm.nih.gov/cgi/mesh/2011/MB_cgi?field=uid&term=D012659
[2] Himmelfarb (1993) p. 56
[3] Himmelfarb (1993) p. 57
[4] Osgood, Suci and Tannenbaum (1957)
[5] Himmelfarb (1993) p. 57
[6] Osgood, May, and Miron (1975)
[7] Heise (2010)


References
Heise, David R. (2010). Surveying Cultures: Discovering Shared Conceptions and Sentiments. Hoboken, NJ: Wiley.
Himmelfarb, S. (1993). The measurement of attitudes. In A.H. Eagly & S. Chaiken (Eds.), Psychology of Attitudes, 23-88. Thomson/Wadsworth.
Krus, D.J., & Ishigaki, Y. (1992). Kamikaze pilots: The Japanese and the American perspectives. Psychological Reports, 70, 599-602. (Request reprint: http://www.visualstatistics.net/Readings/Kamikaze Pilots/Kamikaze Pilots.html)
Osgood, C.E., May, W.H., and Miron, M.S. (1975). Cross-Cultural Universals of Affective Meaning. Urbana, IL: University of Illinois Press.
Osgood, C.E., Suci, G., & Tannenbaum, P. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.
Snider, J.G., and Osgood, C.E. (1969). Semantic Differential Technique: A Sourcebook. Chicago: Aldine.

External links
Osgood's Semantic Space (http://www.writing.ws/reference/history.htm)
On-line Semantic Differential (http://www.indiana.edu/~socpsy/papers/AttMeasure/attitude..htm)

Sequential probability ratio test


The sequential probability ratio test (SPRT) is a specific sequential hypothesis test, developed by Abraham Wald.[1] Neyman and Pearson's 1933 result inspired Wald to reformulate it as a sequential analysis problem. The Neyman-Pearson lemma, by contrast, offers a rule of thumb for when all the data is collected (and its likelihood ratio known). While originally developed for use in quality control studies in the realm of manufacturing, SPRT has been formulated for use in the computerized testing of human examinees as a termination criterion.[2][3][]

Theory
As in classical hypothesis testing, the SPRT starts with a pair of simple hypotheses about a parameter $\theta$, the null hypothesis and the alternative hypothesis respectively. They must be specified as follows:

$H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$.

The next step is to calculate the cumulative sum of the log-likelihood ratio, $\log \Lambda_i$, as new data arrive:

$S_i = S_{i-1} + \log \Lambda_i$, with $S_0 = 0$ and $\Lambda_i = \frac{f(x_i \mid H_1)}{f(x_i \mid H_0)}$.

The stopping rule is a simple thresholding scheme:

$a < S_i < b$ : continue monitoring (critical inequality)
$S_i \geq b$ : accept $H_1$
$S_i \leq a$ : accept $H_0$

where the thresholds $a$ and $b$ (with $a < 0 < b$) depend on the desired type I and type II errors, $\alpha$ and $\beta$. They may be chosen as follows:

$a \approx \log \frac{\beta}{1 - \alpha}$ and $b \approx \log \frac{1 - \beta}{\alpha}$.

In other words, $\alpha$ and $\beta$ must be decided beforehand in order to set the thresholds appropriately. The numerical value will depend on the application. The reason for using approximation signs is that, in the discrete case, the signal may cross the threshold between samples. Thus, depending on the penalty of making an error and the sampling frequency, one might set the thresholds more aggressively. Of course, the exact bounds may be used in the continuous case.
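The following Python sketch is a minimal, illustrative implementation of the stopping rule above using Wald's approximate thresholds. The Bernoulli example at the bottom (testing p = 0.5 against p = 0.7) and the chosen error rates are assumptions made for the demonstration, not values taken from the text.

```python
import math
import random

def sprt_thresholds(alpha, beta):
    """Wald's approximate thresholds for the cumulative log-likelihood ratio:
    accept H0 at or below a, accept H1 at or above b."""
    a = math.log(beta / (1 - alpha))
    b = math.log((1 - beta) / alpha)
    return a, b

def sprt(observations, log_lr, alpha=0.05, beta=0.05):
    """Run the SPRT; log_lr(x) returns log f(x|H1) - log f(x|H0) for one datum."""
    a, b = sprt_thresholds(alpha, beta)
    s, n = 0.0, 0
    for n, x in enumerate(observations, start=1):
        s += log_lr(x)
        if s <= a:
            return "accept H0", n
        if s >= b:
            return "accept H1", n
    return "continue monitoring", n

# Illustrative Bernoulli test: H0: p = 0.5 versus H1: p = 0.7.
p0, p1 = 0.5, 0.7
def bernoulli_log_lr(x):
    return math.log(p1 if x else 1 - p1) - math.log(p0 if x else 1 - p0)

random.seed(1)
sample = (random.random() < 0.7 for _ in range(1000))  # data generated with p = 0.7
print(sprt(sample, bernoulli_log_lr))
```

Because the data favour H1, the cumulative log-likelihood ratio drifts upward and the test typically accepts H1 after a few dozen observations, far fewer than a comparable fixed-sample test would need.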

Example
A textbook example is parameter estimation of a probability distribution function. Consider the exponential distribution:

$f(x; \theta) = \frac{1}{\theta} e^{-x/\theta}, \quad x \geq 0.$

The hypotheses are simply $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$, with $\theta_1 > \theta_0$. Then the log-likelihood function (LLF) for one sample is

$\log \Lambda(x) = \log\left(\frac{\theta_0}{\theta_1}\right) + x\left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right).$

The cumulative sum of the LLFs for all $x_i$ is

$S_n = n \log\left(\frac{\theta_0}{\theta_1}\right) + \left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) \sum_{i=1}^{n} x_i.$

Accordingly, the stopping rule is

$a < n \log\left(\frac{\theta_0}{\theta_1}\right) + \left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) \sum_{i=1}^{n} x_i < b.$

After re-arranging we finally find

$\frac{a + n \log(\theta_1/\theta_0)}{1/\theta_0 - 1/\theta_1} < \sum_{i=1}^{n} x_i < \frac{b + n \log(\theta_1/\theta_0)}{1/\theta_0 - 1/\theta_1}.$

The thresholds are simply two parallel lines in $n$ with common slope $\log(\theta_1/\theta_0) / (1/\theta_0 - 1/\theta_1)$. Sampling should stop when the sum of the samples makes an excursion outside the continue-sampling region.
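To make the parallel-line boundaries concrete, the short simulation below runs this exponential example in Python. The particular values θ0 = 1, θ1 = 2 and α = β = 0.05 are assumptions chosen for the demonstration; the boundary formulas are the rearranged inequalities above.

```python
import math
import random

# Illustrative values (assumptions, not from the text): test H0: theta = 1.0
# against H1: theta = 2.0 for an exponential mean, with alpha = beta = 0.05.
theta0, theta1 = 1.0, 2.0
alpha = beta = 0.05
a = math.log(beta / (1 - alpha))
b = math.log((1 - beta) / alpha)

slope = math.log(theta1 / theta0) / (1 / theta0 - 1 / theta1)
offset = 1.0 / (1 / theta0 - 1 / theta1)

def continue_band(n):
    """Lower and upper boundaries for the running sum of the observations."""
    return a * offset + slope * n, b * offset + slope * n

random.seed(2)
running_sum, n = 0.0, 0
while True:
    n += 1
    running_sum += random.expovariate(1 / theta1)  # data generated under H1
    low, high = continue_band(n)
    if running_sum <= low:
        print(f"n = {n}: accept H0")
        break
    if running_sum >= high:
        print(f"n = {n}: accept H1")
        break
```

Plotting the running sum against n together with the two boundary lines reproduces the picture described above: sampling continues while the sum stays inside the band and stops at the first excursion outside it.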

Applications
Manufacturing
The test is done on the proportion metric, and tests that a variable p is equal to one of two desired points, p1 or p2. The region between these two points is known as the indifference region (IR). For example, suppose you are performing a quality control study on a factory lot of widgets. Management would like the lot to have 3% or less defective widgets, but 1% or less is the ideal lot that would pass with flying colors. In this example, p1 = 0.01 and p2 = 0.03 and the region between them is the IR because management considers these lots to be marginal and is OK with them being classified either way. Widgets would be sampled one at a time from the lot (sequential analysis) until the test determines, within an acceptable error level, that the lot is ideal or should be rejected.


Testing of human examinees


The SPRT is currently the predominant method of classifying examinees in a variable-length computerized classification test (CCT). The two parameters p1 and p2 are specified by determining a cutscore (threshold) for examinees on the proportion correct metric, and selecting a point above and below that cutscore. For instance, suppose the cutscore is set at 70% for a test. We could select p1 = 0.65 and p2 = 0.75. The test then evaluates the likelihood that an examinee's true score on that metric is equal to one of those two points. If the examinee is determined to be at 75%, they pass, and they fail if they are determined to be at 65%. These points are not specified completely arbitrarily. A cutscore should always be set with a legally defensible method, such as a modified Angoff procedure. Again, the indifference region represents the region of scores that the test designer is OK with going either way (pass or fail). The upper parameter p2 is conceptually the highest level that the test designer is willing to accept for a Fail (because everyone below it has a good chance of failing), and the lower parameter p1 is the lowest level that the test designer is willing to accept for a Pass (because everyone above it has a decent chance of passing). While this definition may seem to be a relatively small burden, consider the high-stakes case of a licensing test for medical doctors: at just what point should we consider somebody to be at one of these two levels?
While the SPRT was first applied to testing in the days of classical test theory, as is applied in the previous paragraph, Reckase (1983) suggested that item response theory be used to determine the p1 and p2 parameters. The cutscore and indifference region are defined on the latent ability (theta) metric, and translated onto the proportion metric for computation. Research on CCT since then has applied this methodology for several reasons:
1. Large item banks tend to be calibrated with IRT
2. This allows more accurate specification of the parameters
3. By using the item response function for each item, the parameters are easily allowed to vary between items.
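A rough sketch of this classification logic in Python is shown below. The cutscore parameters p1 = 0.65 and p2 = 0.75 follow the example above, while the error rates, the maximum test length, and the simulated examinee are assumptions added for the demonstration; the same machinery applies to the widget-lot example in the manufacturing section, with p1 = 0.01 and p2 = 0.03.

```python
import math
import random

p1, p2 = 0.65, 0.75          # indifference region around a 70% cutscore
alpha = beta = 0.05          # assumed nominal error rates
a = math.log(beta / (1 - alpha))    # at or below a -> classify as fail
b = math.log((1 - beta) / alpha)    # at or above b -> classify as pass

def classify(responses, max_items=200):
    """responses yields 0/1 scored answers; returns the decision and items used."""
    s, n = 0.0, 0
    for n, u in enumerate(responses, start=1):
        s += math.log(p2 / p1) if u else math.log((1 - p2) / (1 - p1))
        if s >= b:
            return "pass", n
        if s <= a:
            return "fail", n
        if n >= max_items:
            break
    return "undecided (maximum test length reached)", n

random.seed(3)
true_proportion = 0.85       # hypothetical examinee well above the cutscore
simulated_answers = (1 if random.random() < true_proportion else 0
                     for _ in range(10**6))
print(classify(simulated_answers))
```

Because the simulated examinee is well above the indifference region, the test usually reaches a "pass" decision after a few dozen items; examinees near the cutscore take longer, which is exactly the variable-length behaviour the SPRT is designed to provide.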

References
[2] Ferguson, Richard L. (1969). The development, implementation, and evaluation of a computer-assisted branched test for a program of individually prescribed instruction (http://eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED034406&ERICExtSearch_SearchType_0=no&accno=ED034406). Unpublished doctoral dissertation, University of Pittsburgh.
[3] Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.

Holger Wilker: Sequential-Statistik in der Praxis, BoD, Norderstedt 2012, ISBN 978-3848232529.


SESAMO
SESAMO, an acronym for Sexrelation Evaluation Schedule Assessment Monitoring, is an Italian standardised and validated psychometric questionnaire (see Tab. 1) that examines single and couple life, sexuality, and interpersonal and intimate relationships.[1]

Features
As with many other sexological tests, a female and a male version are available, and both are made up of three sections (see Tab. 2): the first section contains items which investigate areas relating to previous aspects of sexuality; the subjects' social, environmental and personal features; health history; and their BMI (Body Mass Index). After filling in this first section, all subjects are sent to either the second or the third section depending on their affective-relational condition, defined as the single condition or the couple condition respectively. The second section collects items whose research areas relate to present sexuality and motivational aspects. This section is intended for single people, i.e. people lacking a stable sexual-affective relationship with a partner. The third section includes those areas which investigate the subjects' present sexuality and relational aspects within the couple. This section is intended for the dyadic condition, i.e. a sexual-affective relationship which has been going on for at least six months.

Contents
The two versions (male/female) of the questionnaire and their subsections (single/couple) contain 135 items for single males and females, and 173 items for males and females with a partner, respectively. This method allows dysfunctional sexual and relational aspects to be detected in single people and in people with a partner, with two main goals: defining a psychosexual and social-affective profile as an "idiographic image" of the subject,[2] and putting forward hypotheses about the dysfunctional aspects in individual and couple sexuality and their causes.

Tab. 1 - Cronbach's alpha for the SESAMO questionnaire

          Single condition   Couple condition
Male      0.710              0.771
Female    0.696              0.700

Assessment
The assessment essentially aims at those areas concerning previous and present sexuality and, at the same time, it takes into consideration all those elements that, even indirectly, could have affected the development, expression and display of personality, affectivity and relationality (interpersonal and intimate relationships). The questionnaire takes into consideration the following areas (as shown in Tab. 2): social environmental data, psychosexual identity, sphere of pleasure (sex play, paraphilias), previous and present masturbation, previous sexual experiences, affective-relational condition, sexual intercourse, imaginative eroticism, contraception, relational attitude; additional areas are intended only for subjects with a partner: couple interaction, communicativeness within the sexual sphere, roles within the couple and extrarelational sexuality (i.e. sexuality outside the couple).

Tab. 2 - Domains of the SESAMO questionnaire

Section 1 - General part: Social environmental data; Body image; Psychosexual identity; Desire; Sphere of pleasure (paraphilias); Previous masturbation; Previous sexual experiences; Medical anamnesis; Motivation and conflicts.
Section 2 - Single condition: Single situation; Pleasure; Sexual intercourses; Present masturbation; Imaginative eroticism; Contraception; Relational attitude.
Section 3 - Couple condition: Couple interaction; Sexual intercourses; Present masturbation; Imaginative eroticism; Communicativeness sexual sphere; Roles within the couple; Extrarelational sexuality; Sexuality and pregnancy; Contraception.

Total domains: Single condition = 16 domains; Couple condition = 18 domains.

Methodology
The SESAMO_Win methodology is provided with software for administering the questionnaire and creating a multifactorial, multilevel evaluation report. This software analyses and decodes the answers, obtained through direct administration on the computer or entered into the computer from printed forms, and produces an anamnestic report about the subject's sexual and relational condition. Once the administration has been completed, the software does not allow the questionnaire and its respective report to be altered or manipulated. This is necessary for deontological reasons and, above all, to assure its validity in legal appraisals and screenings. The software processes a report for each questionnaire. Each report can be displayed on the computer monitor or printed out. It is also possible to print out the whole report or its single parts.

Anamnestic report
The report is divided into 9 parts:
1. Heading: contains the subject's identification data and some directions for using the information in the report properly (interpretations, inferences and indications provided by the report).
2. Personal data and household: displays a summary of personal data, BMI (Body Mass Index), the starting and finishing time of the administration, the time required to fill in the questionnaire, the composition of the household, the present affective-relational condition and off-the-cuff comments from the subject at the end of the administration.


3. Scoring diagram for each area: a diagram displays a comparative summary of the scores obtained by the subject in each area of analysis (it could be defined as a snapshot of the subject's sexual-relational condition). The right side of the diagram (displaying positive scores) indicates a hypothesis about the degree of discomfort/dysfunction for each area.
4. Critical traits: highlights the most relevant and significant features of the subject's condition and his/her sexual-relational fields. These indications provide relevant hints to be used in prospective in-depth medical, psychological or psychiatric interviews.
5. Narrative report: tells in a narrative and detailed way the subject's sexual-relational history, through the explanations and comments he/she made while completing the questionnaire.
6. Further diagnostic examinations and specialist examinations: gives some brief indications about those focal points which need to be addressed and carefully considered, and suggests prospective specialist examinations and counselling.
7. Parameters for the items and subliminal indexes: this section of the report displays, as well as the topic relative to each question, the indexes of subliminal factors measured on the subject and the significance degree of the answers he/she has chosen for each item:
Go-back index (shows that the subject went back to previous items due to rethinking/rumination);
Try-jump index (reveals an attempt to jump or leave out the answer to an item);
Significance index (or weight) of the answers chosen by the subject for each item;
Latency time index for each item (measured for each answer);
Kinetic reaction index of the subject (emotional motility measured for each item).

Example of a SESAMO sexrelational test diagram

8. The score for each area: displays a descriptive heading of the fields of investigation relative to the subject's affective-relational condition (single or couple); the number of omitted answers for each area (this option is activated only when entering the answers into the computer from a paper questionnaire); the raw points obtained by the subject for each area; and the Z scores (standard scores) for each area and their relative percentile ranks.
9. Completed questionnaire: displays all the answers chosen and entered into the computer by the subject while completing the questionnaire; as well as being a documental report (official certificate), it can be used in personalised close examinations and to obtain the open answers entered through the keyboard by the subject.
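The conversion from raw area scores to standard scores and percentile ranks mentioned in part 8 can be sketched as follows. The normative mean and standard deviation used here are invented placeholders (the real norms come from the instrument's Italian standardization sample), and the percentile conversion assumes a normal distribution, whereas the instrument may use empirical norm tables.

```python
from statistics import NormalDist

# Hypothetical normative values for one SESAMO area (placeholders only).
norm_mean, norm_sd = 12.4, 4.1
raw_score = 18

z = (raw_score - norm_mean) / norm_sd        # standard (Z) score
percentile = NormalDist().cdf(z) * 100       # percentile rank under a normal model
print(f"Z = {z:.2f}, percentile rank = {percentile:.0f}")
```

Scores whose standardized value exceeds a specified threshold are the ones the report flags as indicating possible discomfort or dysfunction in that area.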


Criticism
The disadvantages of this device are the time required for filling in the questionnaire (30-60 minutes) and the fact that the complete report can be elaborated only by the software. A reduced version of the questionnaire has fewer items and can be administered and scored with the paper-and-pencil method. A clinical study that used the brief version reports: "During follow-up each patient received the SESAMO test (Sexuality Evaluation Schedule Assessment Monitoring) in the standard clinical form, with the end point of tracking down the sexual, affective, and relationship profile of each Htx pts [3] [...]. The SESAMO questionnaire is based on topics relative to male and female sexuality in mates situation. Topics are grouped in two section: the first one collects data on former sexuality, health history, and social behavior; the second one looks at the mate's relationship to show any situation revealing sexual worries. The questionnaire gives values based on a survey of 648 people with characteristics quite similar to the Italian population. The clinical test for mates is based on 81 items for males and 85 items for females. The row score for each topic is modified in standard scores. The exceeding of scores over a specified threshold gives concise information for diagnostic purpose".[4]

Notes
[1] Note. The test is available only for professional psychologists and physicians. [2] In psychology, an "idiographic image" (it:Immagine idiografica) is the representation of a study or research whose subjects are specific cases, thus avoiding generalizations. The idiographic method (also called historical method) is a criterion that involves evaluating past experiences, selecting and comparing information about a specific individual or event. [3] Note. Htx pts = cardiotransplanted patients. [4] Basile A. et al., Sexual Disorders After Heart Transplantation. Elsevier Science Inc., New York, Vol. 33, Issue 1, 2001.

Bibliography
Basile Fasolo C., Veglia F., Disturbi sessuali, in Conti L. (1999), Repertorio delle scale di valutazione in psichiatria, S.E.E. Edizioni Medico Scientifiche, Firenze. (http://www.pol-it.org/ital/scale/cap13-3.htm)
Boccadoro L., Carulli S., (2009) Il posto dell'amore negato. Sessualità e psicopatologie segrete (The place of the denied love. Sexuality and secret psychopathologies - Abstract (http://www.sexology.it/abstract_english.html)). Tecnoprint Editrice, Ancona. ISBN 978-88-95554-03-7
Boccadoro L., (2002) Sesamo_win: Sexrelation Evaluation Schedule Assessment Monitoring, Giunti O.S., Florence (Italy). it:SESAMO (test)
Boccadoro L., (1996) SESAMO: Sexuality Evaluation Schedule Assessment Monitoring, Approccio differenziale al profilo idiografico psicosessuale e socioaffettivo, Organizzazioni Speciali, Firenze. IT\ICCU\CFI\0327719 (http://www.giuntios.it/scheda_sesamo_eng.jsp)
Brunetti M., Olivetti Belardinelli M. et al., Hypothalamus, sexual arousal and psychosexual identity in human males: a functional magnetic resonance imaging study. European Journal of Neuroscience, Vol. 27, 11, 2008.
Calabrò R.S., Bramantia P. et al., Topiramate-induced erectile dysfunction. Epilepsy & Behavior, 14, 3, 2009.
Capodieci S. et al., (1999) SESAMO: una nuova metodica per l'assessment sessuorelazionale. In: Cociglio G., et al. (a cura di), La coppia, Franco Angeli, Milano. ISBN 88-464-1491-8
Dessì A., Conte S., Men as well have problems with their body image and with sex. A study on men suffering from eating disorders. Sexologies, 17, 1, 2008.
Dèttore D., (2001) Psicologia e psicopatologia del comportamento sessuale, McGraw-Hill, Milano. ISBN 88-386-2747-9
Ferretti A., Caulo M., Del Gratta C. et al., Dynamics of Male Sexual Arousal: Distinct Components of Brain Activation Revealed by fMRI. Neuroimage, 26, 4, 2005.
Natale V., Albertazzi P., Zini M., Di Micco R., Exploration of cyclical changes in memory and mood in postmenopausal women taking sequential combined oestrogen and progestogen preparations. British Journal of Obstetrics and Gynaecology, Vol. 108, 286-290, 2001.
Ugolini V., Baldassarri F., Valutazione della vita sessuorelazionale in uomini affetti da sterilità attraverso il SESAMO. In Rivista di Sessuologia, vol. 25, n. 4, 2001.
Vignati R. et al., Un nuovo test per l'indagine sessuale. In Journal of Sexological Sciences - Rivista Scienze Sessuologiche, Vol. 11, n. 3, 1998.
Vignati R., La valutazione del disagio nell'approccio ai disturbi sessuorelazionali. PSYCHOMEDIA, 2010. http://www.psychomedia.it/pm/grpind/family/vignati.htm

Situational judgement test



Situational judgment tests (SJTs) or inventories (SJIs) are a type of psychological test which present the test-taker with realistic, hypothetical scenarios and ask the individual to identify the most appropriate response or to rank the responses in the order they feel is most effective.[] SJTs can be presented to test-takers through a variety of modalities, such as booklets, films, or audio recordings.[1] SJTs represent a distinct psychometric approach from the common knowledge-based multiple choice item.[][] They are often used in industrial-organizational psychology applications such as personnel selection. Situational judgment tests tend to measure either behavioral tendency, assessing how an individual would behave in a certain situation, or knowledge, evaluating the effectiveness of possible responses.[] Situational judgment tests could also reinforce the status quo within an organization.[] Unlike most psychological tests, SJTs are not acquired 'off-the-shelf' but are designed as bespoke tools, tailor-made to suit the individual role requirements.[] This is because SJTs are not a type of test with respect to their content, but are a method of designing tests.

Developing a Situational Judgment Test


Developing a situational judgment test begins with conducting a job analysis that includes collecting critical incidents. These critical incidents are used to develop different situations in which the prospective new hire would need to exercise judgment and make a decision. Once these situations are developed, subject matter experts (excellent employees) are asked to suggest effective and less effective solutions to the situation. Then a different group of subject matter experts rates these responses from best to worst, and the test is scored with the highest-ranked options giving the respondent the higher score (or lower, if the test is reverse scored).[2]
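As a rough Python sketch of the expert-keyed scoring this process produces, consider the fragment below. The scenario options and the experts' effectiveness ratings are invented for illustration; an operational SJT would use many more items and a formally derived key.

```python
from statistics import mean

# Hypothetical SJT item: subject-matter-expert effectiveness ratings (1-5)
# for each response option become the scoring key, so choosing an option
# that experts rated higher earns more points.
expert_ratings = {
    "Call a meeting of your team to discuss the problem": [4, 5, 4],
    "Report the problem to the Director of Safety": [3, 3, 4],
    "Shut off the machine immediately": [5, 5, 5],
    "Wait until the next scheduled maintenance": [1, 1, 2],
}

scoring_key = {option: mean(ratings) for option, ratings in expert_ratings.items()}

candidate_choice = "Report the problem to the Director of Safety"
print(f"Item score for this choice: {scoring_key[candidate_choice]:.2f}")
```

Summing such item scores across all scenarios gives the candidate's overall SJT score, which can then be compared with other candidates or with a cut point.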


Validity
The validity of the test corresponds to the types of questions being asked. Knowledge instruction questions correlate more highly with general mental ability, while behavioral tendency questions correlate more highly with personality.[] Key results from one study show that knowledge about interpersonal behavior measured with situational judgment tests was valid for internships (7 years later) as well as job performance (9 years later). Also, students' knowledge of interpersonal behavior showed incremental validity over cognitive factors for predicting academic and post-academic success. This study was also the first to show evidence of the long-term predictive power of interpersonal skill assessed through situational judgment tests.[3]
There are many problems with scoring SJTs. "Attempts to address this issue include expert-novice differences, where an item is scored in the direction favoring the experts after the average ratings of experts and novices on each item are compared; expert judgment, where a team of experts decides the best answer to each question; target scoring, where the test author determines the correct answer; and consensual scoring, where a score is allocated to each option according to the percentage of people choosing that option." [4]
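Consensual scoring, the last approach quoted above, is simple enough to show in a few lines of Python. The option labels and reference-sample counts below are invented for the example.

```python
# Consensual scoring sketch: each option's score is the proportion of a
# reference sample that chose it (counts are illustrative only).
reference_choices = {"A": 12, "B": 55, "C": 23, "D": 10}
total = sum(reference_choices.values())
option_scores = {option: count / total for option, count in reference_choices.items()}

candidate_choice = "B"
print(f"Consensual score for option {candidate_choice}: {option_scores[candidate_choice]:.2f}")
```

A candidate who consistently picks the options most people endorse accumulates a high consensual score, which relates to the earlier observation that SJTs can reinforce the status quo within an organization.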

History
The situational judgment test has been around for over fifty years. The first two that were documented were the How Supervise? test and the Cardall Practical Judgment Test. In 1958, Bruce and Learner introduced the Supervisory Practice Test.[] The Supervisory Practice Test was intended to point out whether or not supervisors could handle certain situations on the job, and is said to effectively identify who could and could not be a supervisor.[] Situational judgment tests were also used during World War II by psychologists in the US military.[] "In the 1950s and 60s, their use was extended to predict, as well as assess, managerial success." [5] However, the situational judgment test did not come into widespread use in the employment field until the early 1990s.[] Today, SJTs are used in many organizations, are promoted by various consulting firms, and are researched by many.[]

Tests to Measure Individual Adaptability in Applied Settings


A thesis submitted to George Mason University in 2010 by Adam M. Grim describes a study to measure individual adaptability in applied settings. An Adaptability Situational Judgment Test (ASJT) was designed to provide a practical and valid selection and assessment instrument that had incremental validity beyond the Big Five personality traits and cognitive ability in predicting supervisor ratings of adaptability.[] "The research contributes to the selection and adaptive performance literatures by demonstrating that it is possible to use a situational judgment test to measure individual adaptability in both military and non-military applied settings."[] The ASJT had similar relationships with all variables of interest in both samples, thus providing support for the generalizability of the measure to both military and business settings. Practical implications and recommendations for future refinements of the ASJT are discussed.[] In addition, the ASJT did not show differential validity, and so provides a selection instrument that would not cause adverse impact or be subject to legal challenge because of predictive bias.[] For this study there were both business and military setting scenarios, which subjects would read before indicating how likely they were to perform each of the listed behaviors related to that scenario.[]


Multiple-choice Examples
These consist of either taking the test on paper or answering written examples online. The online version offers a few advantages, such as faster results and better quality. Whereas traditional multiple-choice questions have only one correct answer, it is often the case that situational judgment tests have multiple correct answers, even though one answer might be preferred by the hiring organization.[]
Example item: You are the leader of a manufacturing team that works with heavy machinery. One of your production operators tells you that one machine in the work area is suddenly malfunctioning and may endanger the welfare of your work team. Rank order the following possible courses of action to effectively address this problem, from most desirable to least desirable:
1. Call a meeting of your team members to discuss the problem.
2. Report the problem to the Director of Safety.
3. Shut off the machine immediately.
4. Individually ask other production operators about problems with their machines.
5. Evacuate your team from the production facility.[]

Video-based Examples
These consist of videos that present different scenarios an employee may face; example scenarios can be found on youtube.com. Scenarios come in many different styles, such as animated people and situations, or a recording of the boss of the company asking the question. The answering process can also differ between tests:
The correct answer could be given.
The individual could be asked to give the most reasonable answer.
The individual could be asked to explain what they would do if they were in that situation.

Advantages over other measures


They show reduced levels of adverse impact, by gender and ethnicity,[6] compared to cognitive ability tests.[][]
They use measures that directly assess job relevant behaviours.[]
They can be administered in bulk, either via pen and paper or on-line.[]
The SJT design process results in higher relevance of content than other psychometric assessments.[][7]
They are therefore more acceptable and engaging to candidates compared to cognitive ability tests, since scenarios are based on real incidents.[]
It is unlikely that practice will enhance candidate performance, as the answers cannot be arrived at logically: a response to a situation may be appropriate in one organisation and inappropriate in another.[]
They can tap into a variety of constructs, ranging from problem solving and decision making to interpersonal skills.[]
Traditional psychometric tests do not account for the interaction between ability, personality and other traits.[]
Conscientiousness can be built into a test as a major factor of individual differences.[8]
They can be used in combination with a knowledge based test to give a better overall picture of a candidate's aptitude for a certain job.[9]


Company Use
Companies using SJTs report the following anecdotal evidence supporting their use. Note: these reports are not supported by peer-reviewed research.
They can highlight employee developmental needs.[]
They are relatively easy and cost-effective to develop, administer and score.[]
Applicant reactions to these tests have been more favorable than to general mental ability tests.

Criticisms
The scenarios in many SJTs tend to be brief; therefore candidates do not become fully immersed in the scenario. This can remove some of the intended realism of the scenario and may reduce the quality and depth of assessment.[]
SJT responses can be transparent, providing more of an index of best practice knowledge in some cases and therefore failing to differentiate between candidates' work-related performance.[]
The response formats in some SJTs do not present a full enough range of responses to the scenario. Candidates can be forced to select actions or responses that do not necessarily fit their behavior. They can find this frustrating, and this can affect the validity of such measures.[10][11][12]
Because of the adaptability of SJTs, arguments persist about whether or not they are a valid measurement of a particular construct (job knowledge), or a measurement tool which can be applied to a variety of different constructs, such as cognitive ability, conscientiousness, agreeableness, or emotional stability.[13]
SJTs are best suited for assessing multiple constructs, and as such, it is difficult to separate the constructs assessed in the test. If one construct is of particular interest, a different measure may be more practical.[14]
Due to the multi-dimensional nature of SJTs, it is problematic to assess reliability through the use of standard measures.[15]

Sample tests
Europa.eu SJT [16] (four questions with answers and scoring example)
Assessmentday.com SJT [17] (four questions)
Abilitus.com SJT [18] (free demo of situational judgement tests - 5 questions in English and French - many practice tests - very useful for EPSO competition)
Practise business situational judgement test [19] (takes 30 minutes with feedback)
Blog on situational judgement SJT [20] (practice SJ tests on iPhone and iPad, samples, hints)
Demo test on situational judgement [21] (methodology, tests and corrected tests)


Notes
[4] http://eprints.usq.edu.au/787/1/Strahan_Fogarty_Machin_APS_Conference_proceedings.pdf
[5] http://eprints.usq.edu.au/787/
[6] Hoare, S., Day, A., & Smith, M. (1998). The development and evaluation of situations inventories. Selection & Development Review, 14(6), 3-8.
[7] Motowildo, S.J., Hanson, M.A., & Crafts, J.L. (1997). Low fidelity simulations. In D.L. Whetzel & G.R. Wheaton (Eds.), Applied Measurement in Industrial Psychology. Palo Alto, CA: Davies-Black.
[8] McDaniel, Michael & Nguyen, Nhung. "Situational Judgement Tests: A Review of Practice and Constructs Assessed" (http://www.people.vcu.edu/~mamcdani/Publications/McDaniel & Nguyen 2001 IJSA.pdf), Blackwell Publishers Ltd, Oxford, March/June 2001. Retrieved on 17 October 2012.
[10] Chan, D., & Schmitt, N. (2005). An agenda for future research on applicants' reactions to selection procedures: A construct-orientated approach. International Journal of Selection and Assessment, 12, 9-23.
[11] Ployhart, R.E., & Harold, C.M. (2004). The applicant attribution-reaction theory (AART): An integrative approach of applicant attributional processing. International Journal of Selection & Assessment, 12, 84-98.
[12] Schmit, M.J., & Ryan, A.M. (1992). Test-taking dispositions: A missing link? Journal of Applied Psychology, 77, 629-637.
[13] McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A., & Braverman, E.P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730-740.
[14] McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A., & Braverman, E.P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730-740.
[15] McDaniel, M.A. & Whetzel, D.L. (2007). Situational Judgement Tests. In D.L. Whetzel & G.R. Wheaton (Eds.), Applied measurement: Industrial psychology in human resources management. Erlbaum, 235-258.
[16] http://europa.eu/epso/discover/prepa_test/sample_test/index_en.htm#chapter2/
[17] http://www.assessmentday.co.uk/situational-judgement-test/
[18] http://www.abilitus.com/
[19] https://www.surveymonkey.com/s/BusinessSituations
[20] http://situationaljudgement.blogspot.be/
[21] http://www.orseu-concours.com/en/run_test.php?test=demo

Psychometric software
Psychometric software is software that is used for psychometric analysis of data from tests, questionnaires, or inventories reflecting latent psychoeducational variables. While some psychometric analyses can be performed with standard statistical software like SPSS, most analyses require specialized tools.[citation needed]

Sources
Because only a few commercial businesses (most notably Assessment Systems Corporation and Scientific Software International) develop specialized psychometric tools, there exist many free tools developed by researchers and educators. Important websites for free psychometric software include:
CASMA at the University of Iowa, USA [1]
REMP at the University of Massachusetts, USA [2]
Software from Brad Hanson [3]
Software from John Uebersax [4]
Software from J. Patrick Meyer [5]
Software directory at the Institute for Objective Measurement [4]


Classical test theory


Classical test theory is an approach to psychometric analysis that has weaker assumptions than item response theory and is more applicable to smaller sample sizes.

CITAS
CITAS (Classical Item and Test Analysis Spreadsheet) is a free Excel workbook designed to provide scoring and statistical analysis of classroom tests. Item responses (ABCD) and keys are typed or pasted into the workbook, and the output automatically populates; unlike other programs, CITAS does not require any "running" or experience in psychometric analysis, making it accessible to school teachers and professors. It is available for free download here [6].

jMetrik
jMetrik [7] is free and open source software for conducting a comprehensive psychometric analysis. It was developed by J. Patrick Meyer at the University of Virginia. Current methods include classical item analysis, differential item functioning (DIF) analysis, confirmatory factor analysis, item response theory, IRT equating, and nonparametric item response theory. The item analysis includes proportion, point-biserial, and biserial statistics for all response options. Reliability coefficients include Cronbach's alpha, Guttman's lambda, the Feldt-Gilmer coefficient, the Feldt-Brennan coefficient, decision consistency indices, the conditional standard error of measurement, and reliability if item deleted. The DIF analysis is based on nonparametric item characteristic curves and the Mantel-Haenszel procedure. DIF effect sizes and ETS DIF classifications are included in the output. Confirmatory factor analysis is limited to the common factor model for congeneric, tau-equivalent, and parallel measures. Fit statistics are reported along with factor loadings and error variances. IRT methods include the Rasch, partial credit, and rating scale models. IRT equating methods include mean/mean, mean/sigma, Haebara, and Stocking-Lord procedures. jMetrik also includes basic descriptive statistics and a graphics facility that produces bar charts, pie charts, histograms, kernel density estimates, and line plots. jMetrik is a pure Java application that runs on 32-bit and 64-bit versions of Windows, Mac, and Linux operating systems. It requires Java 1.6 on the host computer and is available as a free download from www.ItemAnalysis.com [7].

Iteman
Iteman is a commercial program specifically designed for classical test analysis, producing rich text (RTF) reports with graphics, narratives, and embedded tables. It calculates the proportion and point biserial of each item, as well as high/low subgroup proportions, and detailed graphics of item performance. It also calculates typical descriptive statistics, including the mean, standard deviation, reliability, and standard error of measurement, for each domain and the overall tests. It is only available from Assessment Systems Corporation [8].
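To make concrete what these classical item statistics are, the sketch below computes an item's proportion-correct (p) value and its point-biserial correlation with the total score from a 0/1 scored response matrix. It is a generic illustration, not code from Iteman; the function name and data are hypothetical, and operational programs often correlate each item with the total score excluding that item.

```python
import numpy as np

def classical_item_stats(scored):
    """Proportion-correct and point-biserial values for a 0/1 scored matrix.

    scored: rows = examinees, columns = items (1 = correct, 0 = incorrect).
    Returns (p_values, point_biserials), one entry per item.
    """
    scored = np.asarray(scored, dtype=float)
    total = scored.sum(axis=1)        # each examinee's raw total score
    p_values = scored.mean(axis=0)    # item difficulty (proportion correct)
    point_biserials = np.array([
        np.corrcoef(scored[:, j], total)[0, 1]  # Pearson r between item and total
        for j in range(scored.shape[1])
    ])
    return p_values, point_biserials

# Hypothetical data: 5 examinees, 3 items
responses = [[1, 0, 1],
             [1, 1, 1],
             [0, 0, 1],
             [1, 0, 0],
             [1, 1, 1]]
p, r_pb = classical_item_stats(responses)
```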

Lertap
Lertap (Laboratory of Educational Research Test Analysis Program) is a comprehensive software package for classical test analysis developed for use with Microsoft Excel. It includes test, item, and option statistics, classification consistency and mastery test analysis, procedures for cheating detection, and extensive graphics (e.g., trace lines for item options, conditional standard errors of measurement, scree plots, boxplots of group differences, histograms, scatterplots). DIF, differential item functioning, is supported in the Excel 2007, Excel 2010, Excel 2011 (Macintosh), and Excel 2013 versions of Lertap. Mantel-Haenszel methods are used; graphs of results are provided.

Lertap will produce ASCII data files ready for input to Xcalibre and Bilog MG. Several sample datasets for use with Lertap and/or other item and test analysis programs are available [9]; these involve both cognitive tests, and affective (or rating) scales. Technical papers related to the application of Lertap are also available [10]. Lertap was developed by Larry Nelson at Curtin University; commercial versions are available from Assessment Systems Corporation [11].


TAP
TAP (the Test Analysis Program) is a free program for basic classical analysis developed by Gordon Brooks at Ohio University. It is available here [12].

ViSta-CITA
ViSta-CITA (Classical Item and Test Analysis) is a module included in the Visual Statistics System (ViSta) that focuses on graphical-oriented methods applied to psychometric analysis. It is freely available at [13]. It was developed by Ruben Ledesma, J. Gabriel Molina, Pedro M. Valero-Mora, and Forrest W. Young.

Item response theory calibration


Item response theory (IRT) is a psychometric approach which assumes that the probability of a certain response is a direct function of an underlying trait or traits. Various functions have been proposed to model this relationship, and the different calibration packages reflect this. Several software packages have been developed for additional analysis such as equating; they are listed in the next section.
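As an illustration of the kind of response function such packages calibrate, the sketch below evaluates the widely used three-parameter logistic (3PL) model, in which the probability of a correct response depends on the examinee trait level theta and item parameters a (discrimination), b (difficulty), and c (pseudo-guessing). This is a generic textbook formula rather than code from any package listed here, and the parameter values are invented.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL IRT model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, some guessing
print(p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2))  # 0.60
```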

BILOG-MG
BILOG-MG is a software program for IRT analysis of dichotomous (correct/incorrect) data, including fit and differential item functioning. It is commercial, and only available from Scientific Software International [14] or Assessment Systems Corporation [15].

Facets
Facets is a software program for Rasch analysis of rater- or judge-intermediated data, such as essay grades, diving competitions, satisfaction surveys and quality-of-life data. Other applications include rank-order data, binomial trials and Poisson counts. For availability, see Software directory at the Institute for Objective Measurement [4].

flexMIRT
flexMIRT is a multilevel and multiple-group IRT software package for item analysis and test scoring. It fits a variety of unidimensional and multidimensional item response theory models (also known as item factor analysis models) to single-level and multilevel data in any number of groups. It is available from Vector Psychometric Group, LLC [16].


ICL
ICL (IRT Command Language) performs IRT calibrations, including the 1, 2, and 3 parameter logistic models as well as the partial credit model and generalized partial credit model. It can also generate response data. As the name implies, it is completely command code driven, with no graphical user interface. It is available for free download here [17].

jMetrik
jMetrik [7], described above under "Classical test theory", is free and open source software that also performs IRT calibration with the Rasch, partial credit, and rating scale models, along with IRT equating (mean/mean, mean/sigma, Haebara, and Stocking-Lord procedures) and nonparametric item response theory. It is available as a free download from www.ItemAnalysis.com [7].

MULTILOG
MULTILOG is an extension of BILOG to data with polytomous (multiple) responses. It is commercial, and only available from Scientific Software International [14] or Assessment Systems Corporation [18].

PARSCALE
PARSCALE is a program designed specifically for polytomous IRT analysis. It is commercial, and only available from Scientific Software International [14] or Assessment Systems Corporation [19].

PARAM-3PL
PARAM-3PL [20] is a free program for the calibration of the 3-parameter logistic IRT model. It was developed by Lawrence Rudner at the Education Resources Information Center (ERIC). The latest release was version 0.89 in June 2007. It is available from ERIC here [21].

TESTFact
Testfact features [22]:
- Marginal maximum likelihood (MML) exploratory factor analysis and classical item analysis of binary data
- Computes tetrachoric correlations, principal factor solution, classical item descriptive statistics, fractile tables and plots
- Handles up to 10 factors using numerical quadrature: up to 5 for non-adaptive and up to 10 for adaptive quadrature
- Handles up to 15 factors using Monte Carlo integration techniques
- Varimax (orthogonal) and PROMAX (oblique) rotation of factor loadings
- Handles an important form of confirmatory factor analysis known as "bifactor" analysis, in which the factor pattern consists of one main factor plus group factors
- Simulation of responses to items based on user-specified parameters
- Correction for guessing and not-reached items
- Allows imposition of constraints on item parameter estimates
- Handles omitted and not-presented items
- Detailed online HELP documentation includes syntax and annotated examples

WINMIRA 2001
WINMIRA 2001 is a program for analyses with the Rasch model for dichotomous and polytomous ordinal responses, with latent class analysis, and with the Mixture Distribution Rasch model for dichotomous [23] and polytomous item responses.[24] The software provides conditional maximum likelihood (CML) estimation of item parameters, MLE and WLE estimates of person parameters, and person- and item-fit statistics, as well as information criteria (AIC, BIC, CAIC) for model selection. The software also performs a parametric bootstrap procedure for the selection of the number of mixture components. A free student version is available from Matthias von Davier's webpage at http://www.von-davier.com/ [25]; a commercial version is available through ASSESS.COM at [26].

Winsteps
Winsteps is a program designed for analysis with the Rasch model, a one-parameter item response theory model. It differs from the 1PL model in that each individual in the person sample is parameterized for item estimation, and it is prescriptive and criterion-referenced rather than descriptive and norm-referenced in nature.[27] It is commercially available from Winsteps, Inc. [28]. A previous DOS-based version, BIGSTEPS, is also available.

Xcalibre
XCalibre is a commercial program that performs marginal maximum likelihood estimation of both dichotomous (1PL-Rasch, 2PL, 3PL) and all major polytomous IRT models. The interface is point-and-click; no command code required. Its output includes both spreadsheets and a detailed, narrated report document with embedded tables and figures, which can be printed and delivered to subject matter experts for item review. It is only available from Assessment Systems Corporation [29].

IATA
IATA is a software package for analysing psychometric and educational assessment data. The interface is point-and-click, and all functionality is delivered through wizard-style interfaces that are based on different workflows or analysis goals, such as pilot testing or equating. IATA reads and writes csv, Excel and SPSS file formats, and produces exportable graphics for all statistical analyses. Each analysis also includes heuristics suggesting appropriate interpretations of the numerical results. IATA performs factor analysis, (1PL-Rasch, 2PL, 3PL) scaling and calibration, differential item functioning (DIF) analysis, (basic) computer aided test development, equating, IRT-based standard setting, score conditioning, and plausible value generation. It is available for free from Polymetrika International [30].


Additional item response theory software


Because of the complexity of IRT, there exist few software packages capable of calibration. However, many software programs exist for specific ancillary IRT analyses such as equating and scaling. Examples of such software follow.

eqboot
eqboot is an open source syntax-based Java application, developed by J. Patrick Meyer, for conducting IRT equating and computing the bootstrap standard error of equating. The program runs on any 32- or 64-bit operating system that has the Java Runtime Environment (JRE) version 1.6 or higher installed. At present, the program only supports equating with binary items. eqboot will compute equating constants using the mean/mean, mean/sigma, Haebara,[31] and Stocking-Lord[32] procedures. It will also compute the standard error of equating if the user provides a comma delimited file of bootstrapped item parameter estimates from both forms, a comma delimited file of bootstrapped ability estimates for Form X examinees, and a comma delimited file of bootstrapped ability estimates for Form Y examinees. Options allow the user to specify the criterion function for the Haebara and Stocking-Lord methods.[33] In addition, the examinee distribution over which the criterion function is minimized may be set to the observed theta estimates, a histogram of theta estimates, a kernel density estimate of theta estimates, or uniformly spaced values on the theta scale. The software is a free download from www.ItemAnalysis.com [34].

IRTEQ
IRTEQ [35] is a freeware Windows GUI application that implements IRT scaling and equating developed by Kyung (Chris) T. Han. It implements IRT scaling/equating methods that are widely used with the Non-Equivalent Groups Anchor Test design: Mean/Mean,[36] Mean/Sigma,[37] Robust Mean/Sigma,[38] and TCC methods.[39][40] For TCC methods, IRTEQ provides the user with the option to choose various score distributions for incorporation into the loss function. IRTEQ supports various popular unidimensional IRT models: Logistic models for dichotomous responses (with 1, 2, or 3 parameters) and the Generalized Partial Credit Model (GPCM) (including Partial Credit Model (PCM), which is a special case of GPCM) and Graded Response Model (GRM) for polytomous responses. IRTEQ can also equate test scores on the scale of a test to the scale of another test using IRT true score equating.[41]
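As a rough sketch of the simplest of these linking approaches, the code below computes mean/sigma transformation constants from anchor-item difficulties calibrated separately on two forms and notes how they would be applied. This is a generic description of the mean/sigma method, not code from IRTEQ or eqboot; the variable names and numbers are hypothetical.

```python
import statistics

def mean_sigma_constants(b_form_x, b_form_y):
    """Mean/sigma linking: slope A and intercept B that place Form X
    difficulty parameters on the Form Y scale (b_y is approximately A*b_x + B)."""
    A = statistics.pstdev(b_form_y) / statistics.pstdev(b_form_x)
    B = statistics.mean(b_form_y) - A * statistics.mean(b_form_x)
    return A, B

# Hypothetical anchor-item difficulties estimated separately on each form
b_x = [-1.2, -0.4, 0.3, 1.1]
b_y = [-1.0, -0.1, 0.5, 1.4]
A, B = mean_sigma_constants(b_x, b_y)

# Rescaling Form X parameters: difficulties b -> A*b + B, discriminations a -> a/A
```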

ResidPlots-2
ResidPlots-2 [42] is a free program for IRT graphical residual analysis. It was developed by Tie Liang, Kyung (Chris) T. Han, and Ronald K. Hambleton at the University of Massachusetts Amherst.

WinGen
WinGen [43] is a free Windows-based program that generates IRT parameters and item responses. It was developed by Kyung (Chris) T. Han at the University of Massachusetts Amherst.[44]

ST
ST [1] conducts item response theory (IRT) scale transformations for dichotomously scored tests.


POLYST
POLYST [1] conducts IRT scale transformations for dichotomously and polytomously scored tests.

STUIRT
STUIRT [1] conducts IRT scale transformations for mixed-format tests (tests that include some multiple choice items and some polytomous items).

Decision consistency
Decision consistency methods are applicable to criterion-referenced tests such as licensure exams and academic mastery testing.

Iteman
Iteman [8] provides an index of decision consistency as well as a classical estimate of the conditional standard error of measurement at the cutscore, which is often requested for accreditation of a testing program.

jMetrik
jMetrik [7] is free and open source software for conducting a comprehensive psychometric analysis. Detailed information is listed above. jMetrik includes Huynh's decision consistency estimates if cut-scores are provided in the item analysis.

Lertap
Lertap [45] calculates several statistics related to decision and classification consistency, including the Brennan-Kane dependability index, kappa, and an estimate of p(0) derived by using the Peng-Subkoviac adaptation of Huynh's method. More detailed information concerning Lertap is provided above, under 'Classical test theory'.

General statistical analysis software


Software designed for general statistical analysis can often be used for certain types of psychometric analysis. Moreover, code for more advanced types of psychometric analysis is often available.

R
R is a programming environment designed for statistical computing and production of graphics. It is freely available at [46]. Basic R functionality can be extended through installing contributed 'packages', and a list of psychometric related packages may be found at [47].

SPSS
SPSS, originally called the Statistical Package for the Social Sciences, is a commercial general statistical analysis program where the data is presented in a spreadsheet layout and common analyses are menu driven.


S-Plus
S-Plus is a commercial analysis package based on the programming language S.

SAS
SAS is a commercially available package for statistical analysis and manipulation of data. It is also command-based.

References
[1] http://www.education.uiowa.edu/casma/computer_programs.htm
[2] http://www.umass.edu/remp/main_software.html
[3] http://www.b-a-h.com/
[4] http://john-uebersax.com/stat/papers.htm
[5] http://www.jMetrik.com
[6] http://www.assess.com/xcart/product.php?productid=522&cat=19&page=1
[7] http://www.ItemAnalysis.com/
[8] http://www.assess.com/xcart/product.php?productid=541
[9] http://lertap.curtin.edu.au/HTMLHelp/Lrtp59HTML/index.html
[10] http://lertap.curtin.edu.au/Documentation/Techdocs.htm
[11] http://assess.com/xcart/product.php?productid=235&cat=0&page=1
[12] http://oak.cats.ohiou.edu/~brooksg/tap.htm
[13] http://www.uv.es/visualstats/Book/DownloadBook.htm
[14] http://www.ssicentral.com/irt/index.html
[15] http://www.assess.com/xcart/product.php?productid=217
[16] http://flexmirt.vpgcentral.com/
[17] http://www.b-a-h.com/software/irt/icl/index.html
[18] http://www.assess.com/xcart/product.php?productid=244&cat=0&page=1
[19] http://www.assess.com/xcart/product.php?productid=248&cat=0&page=1
[20] http://echo.edres.org:8080/irt/param/
[21] http://edres.org/irt/param
[22] http://www.scienceplus.nl/catalog/testfact?vmcchk=1
[23] Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
[24] von Davier, M., & Rost, J. (1995). Polytomous mixed Rasch models. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 371-382). New York: Springer.
[25] http://www.von-davier.com/
[26] http://www.assess.com/xcart/product.php?productid=269&cat=0&page=1
[27] Rasch dichotomous model vs. One-parameter Logistic Model (http://www.rasch.org/rmt/rmt193h.htm). Rasch Measurement Transactions (http://www.rasch.org/rmt/), 2005, 19:3, p. 1032.
[28] http://www.winsteps.com/
[29] http://www.assess.com/xcart/product.php?productid=569
[30] http://www.polymetrika.org/IATA
[31] Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.
[32] Stocking, M.L., & Lord, F.M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
[33] Kim, S., & Kolen, M. J. (2007). Effects on scale linking of different definitions of criterion functions for the IRT characteristic curve methods. Journal of Educational and Behavioral Statistics, 32, 371-397.
[34] http://www.ItemAnalysis.com
[35] http://www.umass.edu/remp/software/irteq/
[36] Loyd & Hoover, 1980
[37] Marco, 1977
[38] Linn, Levine, Hastings, & Wardrop, 1981
[39] Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.
[40] Stocking, M.L., & Lord, F.M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
[41] Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
[42] http://www.umass.edu/remp/software/residplots/
[43] http://www.umass.edu/remp/software/wingen/

[44] Han, K. T. (2007). WinGen: Windows software that generates IRT parameters and item responses. Applied Psychological Measurement, 31, 457-459.
[45] http://www.lertap.curtin.edu.au/
[46] http://www.r-project.org/
[47] http://cran.r-project.org/web/views/Psychometrics.html


Spearman-Brown prediction formula


The Spearman-Brown prediction formula, also known as the Spearman-Brown prophecy formula, is a formula relating psychometric reliability to test length and used by psychometricians to predict the reliability of a test after changing the test length.[1] The method was published independently by Spearman (1910) and Brown (1910).[2][3]

Calculation
Predicted reliability, $\rho_{xx'}^{*}$, is estimated as:

$$\rho_{xx'}^{*} = \frac{N\rho_{xx'}}{1 + (N-1)\rho_{xx'}}$$

where $N$ is the number of "tests" combined (see below) and $\rho_{xx'}$ is the reliability of the current "test". The formula predicts the reliability of a new test composed by replicating the current test N times (or, equivalently, creating a test with N parallel forms of the current exam). Thus N = 2 implies doubling the exam length by adding items with the same properties as those in the current exam. Values of N less than one may be used to predict the effect of shortening a test.
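A minimal sketch of the calculation, using invented numbers rather than anything from the cited sources: predicting the reliability of a test whose current reliability is 0.70 after doubling or halving its length.

```python
def spearman_brown(reliability, n):
    """Predicted reliability of a test lengthened (or shortened) by a factor n."""
    return n * reliability / (1.0 + (n - 1.0) * reliability)

print(spearman_brown(0.70, 2))    # about 0.82: doubling the test
print(spearman_brown(0.70, 0.5))  # about 0.54: halving the test
```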

Forecasting test length


The formula can also be rearranged to predict the number of replications required to achieve a given degree of reliability:

$$N = \frac{\rho_{xx'}^{*}\,(1 - \rho_{xx'})}{\rho_{xx'}\,(1 - \rho_{xx'}^{*})}$$

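A companion sketch of the rearranged form, again with illustrative numbers: how much longer a test with reliability 0.70 would need to be to reach a reliability of 0.90.

```python
def required_length_factor(current, target):
    """Factor by which test length must change to move reliability from current to target."""
    return target * (1.0 - current) / (current * (1.0 - target))

print(required_length_factor(0.70, 0.90))  # about 3.86: nearly four times as long
```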
Use and related topics


This formula is commonly used by psychometricians to predict the reliability of a test after changing the test length. This relationship is particularly vital to the split-half and related methods of estimating reliability (where this method is sometimes known as the "Step Up" formula).[4] The formula is also helpful in understanding the nonlinear relationship between test reliability and test length. Test length must grow by increasingly larger values as the desired reliability approaches 1.0. If the longer/shorter test is not parallel to the current test, then the prediction will not be strictly accurate. For example, if a highly reliable test was lengthened by adding many poor items then the achieved reliability will probably be much lower than that predicted by this formula. For the reliability of a two-item test, the formula is more appropriate than Cronbach's alpha.[5] Item response theory item information provides a much more precise means of predicting changes in the quality of measurement by adding or removing individual items.[citation needed]


Citations
[2] Stanley, J. (1971). Reliability. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education.
[3] Wainer, H., & Thissen, D. (2001). True score theory: The traditional method. In H. Wainer and D. Thissen (Eds.), Test Scoring. Mahwah, NJ: Lawrence Erlbaum.
[4] Stanley, J. (1971). Reliability. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education.

References
Spearman, Charles C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.

Standard-setting study
A standard-setting study is an official research study conducted by an organization that sponsors tests to determine a cutscore for the test. To be legally defensible in the USA and meet the Standards for Educational and Psychological Testing, a cutscore cannot be arbitrarily determined; it must be empirically justified. For example, the organization cannot merely decide that the cutscore will be 70% correct. Instead, a study is conducted to determine what score best differentiates the classifications of examinees, such as competent vs. incompetent. Standard-setting studies are often performed using focus groups of 5-15 subject matter experts that represent key stakeholders for the test. For example, in setting cut scores for educational testing, experts might be instructors familiar with the capabilities of the student population for the test.

Types of standard-setting studies


Standard-setting studies fall into two categories, item-centered and person-centered. Examples of item-centered methods include the Angoff, Ebel, Nedelsky, and Bookmark methods, while examples of person-centered methods include the Borderline Survey and Contrasting Groups approaches. These are so categorized by the focus of the analysis; in item-centered studies, the organization evaluates items with respect to a given population of persons, and vice versa for person-centered studies.

Item-centered studies
The Angoff approach is very widely used.[1] This method requires the assembly of a group of subject matter experts, who are asked to evaluate each item and estimate the proportion of minimally competent examinees that would correctly answer the item. The ratings are averaged across raters for each item and then summed to obtain a panel-recommended raw cutscore. This cutscore then represents the score which the panel estimates a minimally competent candidate would get. This is of course subject to decision biases such as the overconfidence bias, so calibration with other, more objective, sources of data is preferable. The Bookmark method is another widely used item-centered approach. Items in a test (or a subset of them) are ordered by difficulty, and each expert places a "bookmark" in the sequence at the location of the cutscore.[2][3]
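A rough sketch of the Angoff arithmetic described above: each judge estimates, for every item, the proportion of minimally competent examinees expected to answer correctly; the ratings are averaged across judges per item and the averages are summed to give the panel-recommended raw cutscore. The ratings below are invented for illustration.

```python
# rows = judges, columns = items; each entry is the judged probability that a
# minimally competent examinee answers the item correctly
angoff_ratings = [
    [0.60, 0.75, 0.90, 0.50],   # judge 1
    [0.55, 0.80, 0.85, 0.45],   # judge 2
    [0.65, 0.70, 0.95, 0.55],   # judge 3
]

n_judges = len(angoff_ratings)
item_means = [sum(col) / n_judges for col in zip(*angoff_ratings)]
recommended_cutscore = sum(item_means)

print(item_means)             # per-item averages across judges
print(recommended_cutscore)   # 2.75 out of 4 raw score points
```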


Person-centered studies
Rather than evaluating the items that distinguish competent candidates, person-centered studies evaluate the examinees themselves. While this might seem more appropriate, it is often more difficult because examinees are not a captive population, as is a list of items. For example, if a new test comes out regarding new content (as often happens in information technology tests), the test could be given to an initial sample called a beta sample, along with a survey of professional characteristics. The testing organization could then analyze and evaluate the relationship between the test scores and important statistics, such as skills, education, and experience. The cutscore could be set as the score that best differentiates between those examinees characterized as "passing" and those as "failing."
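One common way to analyse such data (a hedged sketch, not a procedure prescribed by the article) is to choose the cutscore that best separates examinees already classified as "passing" or "failing" on external grounds, for example by maximising classification accuracy, as in the illustration below with made-up scores.

```python
def contrasting_groups_cutscore(passing_scores, failing_scores):
    """Return the cutscore that best separates two externally classified groups."""
    candidates = sorted(set(passing_scores) | set(failing_scores))

    def accuracy(cut):
        correct = sum(s >= cut for s in passing_scores) + sum(s < cut for s in failing_scores)
        return correct / (len(passing_scores) + len(failing_scores))

    return max(candidates, key=accuracy)

# Hypothetical beta-sample scores
passing = [78, 81, 74, 90, 85, 69]
failing = [55, 62, 71, 58, 66]
print(contrasting_groups_cutscore(passing, failing))  # 69 with these data (ties go to the lower score)
```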

References
[1] Zieky, M.J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In Cizek, G.J. (Ed.), Setting Performance Standards (pp. 19-52). Mahwah, NJ: Lawrence Erlbaum Associates.
[2] Lewis, D. M., Mitzel, H. C., & Green, D. R. (June, 1996). Standard Setting: A Bookmark Approach. In D. R. Green (Chair), IRT-Based Standard-Setting Procedures Utilizing Behavioral Anchoring. Paper presented at the 1996 Council of Chief State School Officers National Conference on Large Scale Assessment, Phoenix, AZ.
[3] Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2000). The Bookmark Procedure: Cognitive Perspectives on Standard Setting. In G. J. Cizek (Ed.), Setting Performance Standards: Concepts, Methods, and Perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.

Standards for Educational and Psychological Testing


The Standards for Educational and Psychological Testing is a set of testing standards developed jointly by the American Educational Research Association (AERA), American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Revised significantly from the 1985 version, the 1999 Standards for Educational and Psychological Testing has more in-depth background material in each chapter, a greater number of standards, and a significantly expanded glossary and index. The 1999 version of the Standards reflects changes in United States federal law and measurement trends affecting validity; testing individuals with disabilities or different linguistic backgrounds; and new types of tests as well as new uses of existing tests. The Standards is written for the professional and for the educated layperson and addresses professional and technical issues of test development and use in education, psychology and employment. The Standards are currently under review once again; a revised version is expected to be published some time after 2012.

Overview of organization and content


Part I: Test Construction, Evaluation, and Documentation
1. Validity
2. Reliability and Errors of Measurement
3. Test Development and Revision
4. Scales, Norms, and Score Comparability
5. Test Administration, Scoring, and Reporting
6. Supporting Documentation for Tests


Part II: Fairness in Testing


7. Fairness in Testing and Test Use
8. The Rights and Responsibilities of Test Takers
9. Testing Individuals of Diverse Linguistic Backgrounds
10. Testing Individuals with Disabilities

Part III: Testing Applications


11. The Responsibilities of Test Users
12. Psychological Testing and Assessment
13. Educational Testing and Assessment
14. Testing in Employment and Credentialing
15. Testing in Program Evaluation and Public Policy

Related standards
In 1974, the Joint Committee on Standards for Educational Evaluation was charged with the responsibility of writing a companion volume to the 1974 revision of the Standards for Educational and Psychological Tests. [1] This companion volume was to deal with issues and standards for program and curriculum evaluation in education. In 1975, the Joint Committee began work and ultimately decided to establish three separate sets of standards. These standards include The Personnel Evaluation Standards, The Program Evaluation Standards, and The Student Evaluation Standards.

Notes and references


1. The Standards for Educational and Psychological Testing [2]
2. American Educational Research Association. (1977, September 12). Joint Committee on Standards for Educational Evaluation Update--September 1977. AERA Division H Newsletter. [3]

External links
The Standards for Educational and Psychological Testing [4] (apa.org)
The Standards for Educational and Psychological Testing [5] (aera.net)

References
[1] http://en.wikipedia.org/wiki/Standards_for_Educational_and_Psychological_Testing#endnote_AERAnewsletter
[2] http://www.apa.org/science/standards.html#overview
[3] http://www.wmich.edu/evalctr/jc/AERADivisionHNewsletterSeptember1977.pdf
[4] http://www.apa.org/science/standards.html
[5] http://www.aera.net/publications/Default.aspx?menu_id=46&id=1407


Stanford-Binet Intelligence Scales


Stanford-Binet Intelligence Scales
Diagnostics: ICD-9-CM 94.01; MeSH D013195 [1]

The development of the Stanford-Binet Intelligence Scales initiated the modern field of intelligence testing and was one of the first examples of an adaptive test. The test originated in France and was later revised in the United States. It began with the French psychologist Alfred Binet, whom the French government commissioned to develop a method of identifying intellectually challenged children for placement in special education programs. As Binet indicated, case studies might be more detailed and helpful, but the time required to test many people would be excessive. In 1916, at Stanford University, the psychologist Lewis Terman released a revised examination which became known as the "Stanford-Binet test".

Development
Later, Alfred Binet and the physician Theodore Simon, a student of Binet,[2] collaborated in studying mental retardation in French school children. Between 1905 and 1908, their research at a boys' school in Grange-aux-Belles led to their developing the Binet-Simon tests, which assessed attention, memory, and verbal skill. The test consisted of 30 items of varying difficulty, ranging from the ability to touch one's nose or ear when asked to the ability to draw designs from memory and to define abstract concepts.[2] Binet proposed that a child's intellectual ability increases with age. In June 1905, their test was published as the Binet-Simon Intelligence Test in L'Année Psychologique. In this essay, they described three methods that should be employed to study "inferior states of intelligence": the medical method (anatomical, physiological, and pathological signs of inferior intelligence), the pedagogical method (judging intelligence based on a sum of acquired knowledge), and the psychological method (making direct observations and measurements of intelligence). They claimed that the psychological method is the most direct because it measures intelligence as it is in the present moment by assessing a person's capacity to judge, comprehend, reason, and invent.[3] Binet and Simon's test was fairly accurate at predicting a child's grades at school, and they concluded that intelligence influences how well a child performs at school.[4]

The original tests in the 1905 form include:
1. "Le Regard"
2. Prehension Provoked by a Tactile Stimulus
3. Prehension Provoked by a Visual Perception
4. Recognition of Food
5. Quest of Food Complicated by a Slight Mechanical Difficulty
6. Execution of Simple Commands and Imitation of Simple Gestures
7. Verbal Knowledge of Objects
8. Verbal Knowledge of Pictures
9. Naming of Designated Objects
10. Immediate Comparison of Two Lines of Unequal Lengths
11. Repetition of Three Figures
12. Comparison of Two Weights

13. Suggestibility
14. Verbal Definition of Known Objects
15. Repetition of Sentences of Fifteen Words
16. Comparison of Known Objects from Memory
17. Exercise of Memory on Pictures
18. Drawing a Design from Memory
19. Immediate Repetition of Figures
20. Resemblances of Several Known Objects Given from Memory
21. Comparison of Lengths
22. Five Weights to be Placed in Order
23. Gap in Weights
24. Exercise upon Rhymes
25. Verbal Gaps to be Filled
26. Synthesis of Three Words in One Sentence
27. Reply to an Abstract Question
28. Reversal of the Hands of a Clock
29. Paper Cutting

30. Definitions of Abstract Terms

New forms of the test were published in 1908 and again in 1911, after extensive research using "normal" examinees in addition to examinees who were considered to have mental retardation. In 1912, William Stern created the concept of mental age (MA): an individual's level of mental development relative to others.[2] Binet placed a confidence interval around the scores returned from his tests, both because he thought intelligence was somewhat plastic, and because of the inherent margin of error in psychometric tests.[5]

In 1916, the Stanford University psychologist Lewis Terman released the "Stanford Revision of the Binet-Simon Scale", the "Stanford-Binet" for short. He wrote The Measurement of Intelligence: An Explanation of and a Complete Guide for the Use of the Stanford Revision and Extension of the Binet-Simon Intelligence Scale, which provided English translations for the French items as well as new items. Despite other available translations, Terman is noted for his normative studies and methodological approach. With one of his graduate students at Stanford University, Maud Merrill, Terman created two parallel forms of the Stanford-Binet: Form L (for Lewis) and Form M (for Maud). Then, in the 1950s, Merrill revised the Stanford-Binet and created a new version that included what were considered to be the best test items from Forms L and M. This version was published in 1960 and renormed in 1973.

Soon, the test was so popular that Robert Yerkes, the president of the American Psychological Association, decided to use it in developing the Army Alpha and the Army Beta tests to classify recruits.[citation needed] Thus, a high-scoring recruit might earn an A-grade (high officer material), whereas a low-scoring recruit with an E-grade would be rejected for military service.[5]

The fourth edition of the test, which was published in 1986, converted from Binet's age-scale format to a point-scale format. The age-scale format, which was originally designed to provide a translation of the child's performance to mental age, was arguably inappropriate for more current generations of test-takers. The point scale arranged the tests into subtests, where all items of a type were administered together. The Fifth Edition includes the age-scale format to provide a variety of items at each level and to keep examinees interested.

In 1960, the Stanford-Binet Scale replaced the ratio IQ with the deviation IQ, which compares a child's score with the scores obtained by other children of the same age. The deviation IQ was developed by David Wechsler.[6]

To test the validity of the Stanford-Binet Intelligence Scales, three methods were used:
1. Professional judgement by researchers and examiners of all test items
2. Professional judgement by experts in CHC theory

3. Empirical item analyses[7]

Construct validity was obtained from the analyses of age trends for each of the five factor scores (which included both growth and decline), intercorrelations of tests, factors, and IQs, and evidence for general ability.[8]

Timeline
April 1905: Development of the Binet-Simon Test announced at a conference in Rome
June 1905: Binet-Simon Intelligence Test introduced
1908 and 1911: New versions of the Binet-Simon Intelligence Test
1916: Stanford-Binet First Edition by Terman
1937: Second Edition by Terman and Merrill
1973: Third Edition by Merrill
1986: Fourth Edition by Thorndike, Hagen, and Sattler
2003: Fifth Edition by Roid

Scale
Binet's intelligence scale was divided into categories based on IQ score. The original names, which included "moron," "imbecile," and "idiot," among others, are no longer used. These categories were later replaced with words that were more descriptive of a scale of intellectual deficiency, marked from mild to profound deficiency.[9]

Binet Scale of Human Intelligence


IQ Score     Classification (modern term)
Over 145     Genius
130 - 144    Superior
120 - 129    Bright or Smart
110 - 119    High Average
90 - 109     Average or Normal
80 - 89      Low Average
70 - 79      Borderline Impaired
55 - 69      Mildly Impaired
40 - 54      Moderately Impaired
Below 20     Severely to Profoundly Impaired or Delayed (Mental Retardation)

Present use
Since its inception, the Stanford-Binet has been revised several times. Currently, the test is in its fifth edition, which is called the Stanford-Binet Intelligence Scales, Fifth Edition, or SB5. According to the publisher's website, "The SB5 was normed on a stratified random sample of 4,800 individuals that matches the 2000 U.S. Census." By administering the Stanford-Binet test to large numbers of individuals selected at random from different parts of the United States, it has been found that the scores approximate a normal distribution. Successive revisions have made substantial changes in the way the test is presented, introducing more parallel forms and more demonstrative standards. For one, a non-verbal IQ component is now included, whereas in the past there was only a verbal component; the current test has an equal balance of verbal and non-verbal content. It is also more animated than earlier versions, providing test-takers with more colourful artwork, toys and manipulatives. This allows the test to cover a wider age range of test takers.[10]

This test is very useful in assessing the intellectual capabilities of people ranging from young children to young adults. However, the test has come under criticism for not being able to compare people of different age categories, since each category gets a different set of tests. Furthermore, very young children tend to do poorly on the test because they lack the concentration needed to finish it.[11] Current uses for the test include clinical and neuropsychological assessment, educational placement, compensation evaluations, career assessment, adult neuropsychological treatment, forensics, and research on aptitude.[12] Various high-IQ societies also accept this test for admission into their ranks; for example, the Triple Nine Society accepts a minimum qualifying score of 151 for Form L or M, 149 for Form L-M if taken in 1986 or earlier, 149 for SB-IV, and 146 for SB-V; in all cases the applicant must have been at least 16 years old at the date of the test.[13]

References
[1] http://www.nlm.nih.gov/cgi/mesh/2011/MB_cgi?field=uid&term=D013195
[2] Santrock, John W. (2008). A Topical Approach to Life-Span Development (4th ed.), Concept of Intelligence (pp. 283-284). New York: McGraw-Hill.
[3] Binet, Alfred. (1905). L'Année Psychologique, 12, 191-244.
[4] Gilbert, D.T., Schacter, D.L., & Wegner, D.M. (2011). Psychology. New York, NY: Worth Publishers.
[5] Fancher, Raymond E. (1985). The Intelligence Men: Makers of the IQ Controversy. New York (NY): W. W. Norton.
[6] Carlson, N. R. (2010). Psychology, the Science of Behaviour (4th ed.). Upper Saddle River, New Jersey: Pearson Education, Inc. p. 336.
[7] Janzen, Henry, John Obrzut, and Christopher Marusiak. "Stanford-Binet Intelligent Scales." Canadian Journal of School Psychology, 19(1/2) (2004): 235-245. Web. 7 Mar. 2012. <http://journals2.scholarsportal.info.myaccess.library.utoronto.ca/tmp/15371491548931334389.pdf>
[8] Janzen, Henry, John Obrzut, and Christopher Marusiak. "Stanford-Binet Intelligent Scales." Canadian Journal of School Psychology, 19(1/2) (2004): 235-245. Web. 7 Mar. 2012. <http://journals2.scholarsportal.info.myaccess.library.utoronto.ca/tmp/15371491548931334389.pdf>
[10] "Stanford-Binet Intelligence Scales, Fifth Edition Assessment Service Bulletin Number 1" (http://www.assess.nelson.com/pdf/sb5-asb1.pdf)
[11] http://www.minddisorders.com/Py-Z/Stanford-Binet-Intelligence-Scale.html
[12] Riverside Publishing. Stanford-Binet Intelligence Scales, SB5, Fifth Edition. Accessed on December 2, 2011. http://www.riversidepublishing.com/products/sb5/details.html
[13] http://www.triplenine.org/main/admission.asp

Further reading
Becker, K.A. (2003). "History of the Stanford-Binet Intelligence scales: Content and psychometrics." Stanford-Binet Intelligence Scales, Fifth Edition Assessment Service Bulletin No. 1.
Binet, Alfred; Simon, Th. (1916). The development of intelligence in children: The Binet-Simon Scale (http://books.google.com/books?id=jEQSAAAAYAAJ&dq=The development of intelligence in children Binet&pg=PA1#v=onepage&q&f=false). Publications of the Training School at Vineland New Jersey Department of Research No. 11. E. S. Kite (Trans.). Baltimore: Williams & Wilkins. Retrieved 18 July 2010.
Brown, A. L.; French, L. A. (1979). "The Zone of Potential Development: Implications for Intelligence Testing in the Year 2000". Intelligence 3 (3): 255-273.
Fancher, Raymond E. (1985). The Intelligence Men: Makers of the IQ Controversy. New York (NY): W. W. Norton. ISBN 978-0-393-95525-5.
Freides, D. (1972). "Review of Stanford-Binet Intelligence Scale, Third Revision". In Oscar Buros (Ed.), Seventh Mental Measurements Yearbook. Highland Park (NJ): Gryphon Press. pp. 772-773.
Gould, Stephen Jay (1981). The Mismeasure of Man. New York (NY): W. W. Norton. ISBN 978-0-393-31425-0. Lay summary (http://www.nytimes.com/books/97/11/09/home/gould-mismeasure.html) (10 July 2010).
McNemar, Quinn (1942). The revision of the Stanford-Binet Scale. Boston: Houghton Mifflin.
Pinneau, Samuel R. (1961). Changes in Intelligence Quotient Infancy to Maturity: New Insights from the Berkeley Growth Study with Implications for the Stanford-Binet Scales and Applications to Professional Practice. Boston: Houghton Mifflin.

Terman, Lewis Madison; Merrill, Maude A. (1937). Measuring intelligence: A guide to the administration of the new revised Stanford-Binet tests of intelligence. Riverside textbooks in education. Boston (MA): Houghton Mifflin.
Terman, Lewis Madison; Merrill, Maude A. (1960). Stanford-Binet Intelligence Scale: Manual for the Third Revision Form L-M with Revised IQ Tables by Samuel R. Pinneau. Boston (MA): Houghton Mifflin.
Richardson, Nancy (1992). "Stanford-Binet IV, of Course!: Time Marches On! (originally published as Which Stanford-Binet for the Brightest?)" (http://www.davidsongifted.org/db/Articles_id_10128.aspx). Roeper Review 15 (1): 32-34.
Waddell, Deborah D. (1980). "The Stanford-Binet: An Evaluation of the Technical Data Available since the 1972 Restandardization" (http://www.eric.ed.gov/ERICWebPortal/search/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=EJ233903&ERICExtSearch_SearchType_0=no&accno=EJ233903). Journal of School Psychology 18 (3): 203-209. doi:10.1016/0022-4405(80)90060-6 (http://dx.doi.org/10.1016/0022-4405(80)90060-6). Retrieved 29 June 2010.

Stanine
Stanine (STAndard NINE) is a method of scaling test scores on a nine-point standard scale with a mean of five and a standard deviation of two. This method was used for standardized school testing, but has now largely been replaced by other forms of measuring aptitude. Some web sources attribute stanines to the U.S. Army Air Forces during World War II. Psychometric legend has it that a single-digit scale was used because of the compactness of recording the score as one digit, but Thorndike [1] claims that by reducing scores to just nine values, stanines "reduce the tendency to try to interpret small score differences" (p. 131). The earliest known use of stanines was by the U.S. Army Air Forces in 1943.[citation needed]

Test scores are scaled to stanine scores using the following algorithm:
1. Rank results from lowest to highest
2. Give the lowest 4% a stanine of 1, the next 7% a stanine of 2, etc., according to the following table:

Calculating Stanines
Stanine   Result ranking   Standard score
1         4%               below -1.75
2         7%               -1.75 to -1.25
3         12%              -1.25 to -0.75
4         17%              -0.75 to -0.25
5         20%              -0.25 to +0.25
6         17%              +0.25 to +0.75
7         12%              +0.75 to +1.25
8         7%               +1.25 to +1.75
9         4%               above +1.75

The underlying basis for obtaining stanines is that a normal distribution is divided into nine intervals, each of which has a width of 0.5 standard deviations excluding the first and last, which are just the remainder (the tails of the distribution). The mean lies at the centre of the fifth interval. Stanines can be used to convert any test score into a single digit number. This was valuable when paper punch cards were the standard method of storing this kind of information. However, because all stanines are integers, two scores in a single stanine are sometimes further apart than two scores in adjacent stanines. This reduces their value. Today stanines are mostly used in educational assessment.[citation needed] The University of Alberta in Edmonton, Canada used the stanine system until 2003, when it switched to a 4-point scale [2]. In the United States, the Educational Records Bureau (they administer the "ERBs") reports test scores as stanines and percentiles.
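A minimal sketch of the two equivalent assignments described above: by percentile rank using the 4-7-12-17-20-17-12-7-4 percentage bands, or by cutting the standard (z) score at half-standard-deviation intervals. The function names are hypothetical and boundary ties are handled differently by different testing programs.

```python
from bisect import bisect_right

# Cumulative upper bounds of the 4-7-12-17-20-17-12-7-4 percent bands
_CUM = [4, 11, 23, 40, 60, 77, 89, 96, 100]
# z-score boundaries at half-SD intervals centred on the mean
_Z_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]

def stanine_from_percentile(pct_rank):
    """Map a percentile rank (0-100) to a stanine (1-9)."""
    return min(bisect_right(_CUM, pct_rank) + 1, 9)

def stanine_from_z(z):
    """Map a standard score to a stanine (1-9)."""
    return bisect_right(_Z_CUTS, z) + 1

print(stanine_from_z(0.0))   # 5: the mean falls in the middle band
print(stanine_from_z(2.1))   # 9
```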


References
Ballew, Pat. Origins of some arithmetic terms-2 [3]. Retrieved Dec. 26, 2004.
Boydsten, Robert E. (February 27, 2000). Winning My Wings [4].
[1] Thorndike, R. L. (1982). Applied Psychometrics. Boston, MA: Houghton Mifflin.
[2] http://www.registrar.ualberta.ca/ro.cfm?id=184
[3] http://www.pballew.net/arithme3.html#stanine
[4] http://www.avca-sj.org/WINGS32.html

Statistical hypothesis testing


A statistical hypothesis test is a method of making decisions using data from a scientific study. In statistics, a result is called statistically significant if it has been predicted as unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level. The phrase "test of significance" was coined by statistician Ronald Fisher.[1] These tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance; this can help to decide whether results contain enough information to cast doubt on conventional wisdom, given that conventional wisdom has been used to establish the null hypothesis. The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis. Statistical hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis, which may not have pre-specified hypotheses. Statistical hypothesis testing is a key technique of frequentist inference.

Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect based on how likely it would be for a set of observations to occur if the null hypothesis were true. Note that this probability of making an incorrect decision is not the probability that the null hypothesis is true, nor whether any specific alternative hypothesis is true. This contrasts with other possible techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis. One naive Bayesian approach to hypothesis testing is to base decisions on the posterior probability,[2][3] but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory, attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via decision theory and optimal decisions, some of which have desirable properties, yet hypothesis testing is a dominant approach to data analysis in many fields of science.

Extensions to the theory of hypothesis testing include the study of the power of tests, which refers to the probability of correctly rejecting the null hypothesis when a given state of nature exists. Such considerations can be used for the purpose of sample size determination prior to the collection of data.

In a famous example of hypothesis testing, known as the Lady tasting tea example,[4] a female colleague of Fisher claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the number she got correct, but just by chance. The null hypothesis was that the Lady had no such ability. The test statistic was a simple count of the number of successes in selecting the 4 cups. The critical region was the single case of 4 successes of 4 possible, based on a conventional probability criterion (less than 5%; here 1 of 70, or about 1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup,[5] which would be considered a statistically significant result.
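A quick check of the figure quoted above (1 of 70, about 1.4 percent): under the null hypothesis the Lady is guessing, so the chance that she selects exactly the four correct cups out of eight follows the hypergeometric distribution. The short sketch below only reproduces that arithmetic.

```python
from math import comb

# 8 cups, 4 with milk first; the Lady must pick the 4 milk-first cups.
# Under the null hypothesis (pure guessing) every set of 4 cups is equally likely.
p_all_correct = comb(4, 4) * comb(4, 0) / comb(8, 4)
print(p_all_correct)  # 1/70, approximately 0.0143
```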


The testing process


In the statistical literature, statistical hypothesis testing plays a fundamental role.[] The usual line of reasoning is as follows:
1. There is an initial research hypothesis of which the truth is unknown.
2. The first step is to state the relevant null and alternative hypotheses. This is important, as mis-stating the hypotheses will muddy the rest of the process. Specifically, the null hypothesis should be chosen in such a way that it allows us to conclude whether the alternative hypothesis can either be accepted or stays undecided as it was before the test.[6]
3. The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important, as invalid assumptions will mean that the results of the test are invalid.
4. Decide which test is appropriate, and state the relevant test statistic T.
5. Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student's t distribution or a normal distribution.
6. Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.
7. The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected, the so-called critical region, and those for which it is not. The probability of the critical region is α.
8. Compute from the observations the observed value tobs of the test statistic T.
9. Decide to either reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis H0 if the observed value tobs is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.

An alternative process is commonly used:
1. Compute from the observations the observed value tobs of the test statistic T.
2. From the statistic, calculate a probability of the observation under the null hypothesis (the p-value).
3. Reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis if and only if the p-value is less than the significance level (the selected probability) threshold.

The two processes are equivalent.[7] The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results. The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a probability is useful for reporting. The calculations are now trivially performed with appropriate software.

The difference in the two processes applied to the radioactive suitcase example: "The Geiger-counter reading is 10. The limit is 9. Check the suitcase." versus "The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase." The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked.
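As an illustration of these steps (an invented example, not one from the article), the sketch below tests whether a coin is fair on the basis of 20 flips, using the exact binomial distribution, and shows that the critical-region decision and the p-value decision agree.

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value against p > 0.5."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, observed_heads = 20, 16        # hypothetical data
alpha = 0.05                      # significance level

# p-value procedure
p_value = binom_tail(n, observed_heads)
print(p_value < alpha, p_value)   # True, about 0.0059 -> reject H0: p = 0.5

# equivalent critical-region procedure: smallest k with P(X >= k) <= alpha
critical_k = next(k for k in range(n + 1) if binom_tail(n, k) <= alpha)
print(observed_heads >= critical_k, critical_k)   # True, 15
```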
It is important to note the philosophical difference between accepting the null hypothesis and simply failing to reject it. The "fail to reject" terminology highlights the fact that the null hypothesis is assumed to be true from the start of the test; if there is a lack of evidence against it, it simply continues to be assumed true. The phrase "accept the null hypothesis" may suggest it has been proved simply because it has not been disproved, a logical fallacy known as the

argument from ignorance. Unless a test with particularly high power is used, the idea of "accepting" the null hypothesis may be dangerous. Nonetheless the terminology is prevalent throughout statistics, where its meaning is well understood. Alternatively, if the testing procedure forces us to reject the null hypothesis (H0), we can accept the alternative hypothesis (H1) and we conclude that the research hypothesis is supported by the data. This fact expresses that our procedure is based on probabilistic considerations in the sense we accept that using another set of data could lead us to a different conclusion. The processes described here are perfectly adequate for computation. They seriously neglect the design of experiments considerations.[8][9] It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.


Interpretation
If the p-value is less than the required significance level (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the given level of significance. Rejection of the null hypothesis is a conclusion. This is like a "guilty" verdict in a criminal trial: the evidence is sufficient to reject innocence, thus proving guilt. We might accept the alternative hypothesis (and the research hypothesis). If the p-value is not less than the required significance level (equivalently, if the observed test statistic is outside the critical region), then the test has no result. The evidence is insufficient to support a conclusion. (This is like a jury that fails to reach a verdict.) The researcher typically gives extra consideration to those cases where the p-value is close to the significance level. In the Lady tasting tea example (above), Fisher required the Lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. He defined the critical region as that case alone. The region was defined by a probability (that the null hypothesis was correct) of less than 5%. Whether rejection of the null hypothesis truly justifies acceptance of the research hypothesis depends on the structure of the hypotheses. Rejecting the hypothesis that a large paw print originated from a bear does not immediately prove the existence of Bigfoot. Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance, which requires extra steps of logic.

Use and importance


Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing, which can justify conclusions even when no scientific theory exists. In the Lady tasting tea example, it was "obvious" that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the "obvious".
Real-world applications of hypothesis testing include:[] testing whether more men than women suffer from nightmares; establishing authorship of documents; evaluating the effect of the full moon on behavior; determining the range at which a bat can detect an insect by echo; deciding whether hospital carpeting results in more infections; selecting the best means to stop smoking; checking whether bumper stickers reflect car owner behavior; and testing the claims of handwriting analysts.

Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992), in a review of the fundamental paper by Neyman and Pearson (1933), says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future".
Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s).[] Other fields have favored the estimation of parameters (e.g., effect size).


Cautions
"If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed."[] This caution applies to hypothesis tests and alternatives to them. The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong. The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed including: The Clever Hans effect. A horse appeared to be capable of doing simple arithmetic. The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse. The Placebo effect. Pills with no medically active ingredients were remarkably effective. A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy. The book How to Lie with Statistics[10][11] is the most popular book on statistics ever published.[12] It does not much consider hypothesis testing, but its cautions are applicable, including: Many claims are made on the basis of samples too small to convince. If a report does not mention sample size, be doubtful. Hypothesis testing acts as a filter of statistical conclusions; only those results meeting a probability threshold are publishable. Economics also acts as a publication filter; only those results favorable to the author and funding source may be submitted for publication. The impact of filtering on publication is termed publication bias. A related problem is that of multiple testing (sometimes linked to data mining), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported. These are often dealt with by using multiplicity correction procedures that control the family wise error rate (FWER) or the false discovery rate (FDR). Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).

Examples
Analogy: Courtroom trial
A statistical test procedure is comparable to a criminal trial: a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant, and only when there is enough incriminating evidence is the defendant convicted.
At the start of the procedure there are two hypotheses, H0: "the defendant is not guilty" and H1: "the defendant is guilty". The first is called the null hypothesis and is for the time being accepted. The second is called the alternative (hypothesis); it is the hypothesis one hopes to support.
The hypothesis of innocence is only rejected when an error is very unlikely, because one doesn't want to convict an innocent defendant. Such an error is called an error of the first kind (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, the error of the second kind (acquitting a person who committed the crime) is often rather large.
                            H0 is true               H1 is true
                            (truly not guilty)       (truly guilty)
Accept null hypothesis      Right decision           Wrong decision
(acquittal)                                          (Type II error)
Reject null hypothesis      Wrong decision           Right decision
(conviction)                (Type I error)

A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.

Example 1: Philosopher's beans


The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was formalized and popularized.[13]
Few beans of this handful are white. Most beans in this bag are white. Therefore: Probably, these beans were taken from another bag. This is an hypothetical [sic] inference.
The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null hypothesis is the "obvious" difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.
A simple generalization of the example considers a mixed bag of beans and a handful that contains either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged; if the composition of the handful is greatly different from that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or one-tailed test, while the generalization is termed a two-sided or two-tailed test.

Example 2: Clairvoyant card game


A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X.
As we try to find evidence of his clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is, of course: the person is (more or less) clairvoyant.
If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly p. The hypotheses, then, are:
null hypothesis H0: p = 1/4 (just guessing)
and
alternative hypothesis H1: p > 1/4 (true clairvoyant).

When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider him so. But what about 12 hits, or 17 hits? What is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? It is obvious that with the choice c = 25 (i.e., we only accept clairvoyance when all cards are predicted correctly) we're more critical than with c = 10. In the first case almost no test subjects will be recognized to be clairvoyant; in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind (a false positive, or Type I error). With c = 25 the probability of such an error is

P(reject H0 | H0 is valid) = P(X = 25 | p = 1/4) = (1/4)^25 ≈ 10^-15,

and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times. Being less critical, with c = 10, gives

P(reject H0 | H0 is valid) = P(X ≥ 10 | p = 1/4) ≈ 0.07.

Thus, c = 10 yields a much greater probability of a false positive.
Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus:

P(reject H0 | H0 is valid) = P(X ≥ c | p = 1/4) ≤ 0.01.

From all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select c = 13.
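The smallest such c can be checked numerically. The following is a minimal sketch, assuming Python with SciPy; the 25-card setup, the guessing probability of 1/4, and the 1% level are taken from the example above.

```python
# Find the smallest critical value c such that P(X >= c | p = 1/4) <= alpha
# for X ~ Binomial(25, 1/4), i.e. the number of correct guesses by pure chance.
from scipy.stats import binom

n, p_guess, alpha = 25, 0.25, 0.01

for c in range(n + 1):
    # P(X >= c) = 1 - P(X <= c - 1), the binomial survival function at c - 1
    tail = binom.sf(c - 1, n, p_guess)
    if tail <= alpha:
        print("critical value c =", c, "with P(X >= c) =", round(tail, 4))
        break
```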

Example 3: Radioactive suitcase


As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that we would observe 10 counts per minute if the null hypothesis were true. If the null hypothesis predicts (say) on average 9 counts per minute and a standard deviation of 1 count per minute, then we say that the suitcase is compatible with the null hypothesis (this does not guarantee that there is no radioactive material, just that we don't have enough evidence to suggest there is). On the other hand, if the null hypothesis predicts 3 counts per minute and a standard deviation of 1 count per minute, then the suitcase is not compatible with the null hypothesis, and other factors are likely responsible for the measurements.
The test does not directly assert the presence of radioactive material. A successful test asserts that the claim of no radioactive material present is unlikely given the reading (and therefore ...). The double negative (disproving the null hypothesis) of the method is confusing, but using a counter-example to disprove is standard mathematical practice. The attraction of the method is its practicality. We know (from experience) the expected range of counts with only ambient radioactivity present, so we can say that a measurement is unusually large. Statistics just formalizes the intuitive by using numbers instead of adjectives. We probably do not know the characteristics of the radioactive suitcases; we just assume that they produce larger readings.
To slightly formalize intuition: radioactivity is suspected if the Geiger-count with the suitcase is among or exceeds the greatest (5% or 1%) of the Geiger-counts made with ambient radiation alone. This makes no assumptions about the distribution of counts. Many ambient radiation observations are required to obtain good probability estimates for rare events.

The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.
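The threshold comparison described in this example can be sketched in code. This is a minimal illustration assuming Python with NumPy; the 10-count reading comes from the example, while the Poisson background model and the 10,000 simulated ambient readings are illustrative assumptions standing in for the many real background measurements the text calls for.

```python
# Flag the suitcase if its reading is among the greatest 5% of readings
# observed with ambient radiation alone, as in the intuition formalized above.
import numpy as np

rng = np.random.default_rng(1)
ambient = rng.poisson(lam=9, size=10_000)   # assumed background model, illustration only

threshold = np.percentile(ambient, 95)      # cut-off for the greatest 5% of ambient readings
reading = 10                                # the suitcase reading from the example

print("threshold:", threshold, "reading:", reading, "flag suitcase:", reading >= threshold)
```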


Definition of terms
The following definitions are mainly based on the exposition in the book by Lehmann and Romano:[]
Statistical hypothesis: A statement about the parameters describing a population (not a sample).
Statistic: A value calculated from a sample, often to summarize the sample for comparison purposes.
Simple hypothesis: Any hypothesis which specifies the population distribution completely.
Composite hypothesis: Any hypothesis which does not specify the population distribution completely.
Null hypothesis (H0): A simple hypothesis associated with a contradiction to a theory one would like to prove.
Alternative hypothesis (H1): A hypothesis (often composite) associated with a theory one would like to prove.
Statistical test: A procedure whose inputs are samples and whose result is a hypothesis.
Region of acceptance: The set of values of the test statistic for which we fail to reject the null hypothesis.
Region of rejection / Critical region: The set of values of the test statistic for which the null hypothesis is rejected.
Critical value: The threshold value delimiting the regions of acceptance and rejection for the test statistic.
Power of a test (1 − β): The test's probability of correctly rejecting the null hypothesis; the complement of the false negative rate, β. Power is termed sensitivity in biostatistics. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
Size / Significance level of a test (α): For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis: the false positive rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate, (1 − α), is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
p-value: The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.
Statistical significance test: A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence, or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used to describe the modern version, which is now part of statistical hypothesis testing.
Conservative test: A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Exact test: A test in which the significance level or critical value can be computed exactly, i.e., without any approximation. In some contexts this term is restricted to tests applied to categorical data and to permutation tests, in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.
A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalently, maximizes power). The following terms describe tests in terms of such optimality:
Most powerful test: For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
Uniformly most powerful test (UMP): A test with the greatest power for all values of the parameter(s) being tested, contained in the alternative hypothesis.
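The size and power definitions above can be made concrete with a small numerical sketch. This assumes Python with NumPy and SciPy and an illustrative one-sided z-test with known standard deviation; none of the numbers come from the text.

```python
# Size (alpha) and power (1 - beta) of a one-sided z-test of H0: mu = 0
# against H1: mu = mu1 > 0, with known sigma and sample size n.
import numpy as np
from scipy.stats import norm

sigma, n, alpha = 1.0, 25, 0.05
se = sigma / np.sqrt(n)

critical = norm.ppf(1 - alpha, loc=0.0, scale=se)   # reject H0 if the sample mean exceeds this

size = norm.sf(critical, loc=0.0, scale=se)          # P(reject | H0 true) = alpha by construction
mu1 = 0.5
power = norm.sf(critical, loc=mu1, scale=se)          # P(reject | true mean is mu1)

print(round(size, 3), round(power, 3))   # roughly 0.05 and 0.80 for these numbers
```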


Common test statistics


One-sample tests are appropriate when a sample is being compared to the population from a hypothesis. The population characteristics are known from theory or are calculated from the population.
Two-sample tests are appropriate for comparing two samples, typically experimental and control samples from a scientifically controlled experiment.
Paired tests are appropriate for comparing two samples where it is impossible to control important variables. Rather than comparing two sets, members are paired between samples so the difference between the members becomes the sample. Typically the mean of the differences is then compared to zero.
Z-tests are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation. T-tests are appropriate for comparing means under relaxed conditions (less is assumed). Tests of proportions are analogous to tests of means (the 50% proportion).
Chi-squared tests use the same calculations and the same probability distribution for different applications: Chi-squared tests for variance are used to determine whether a normal population has a specified variance; the null hypothesis is that it does. Chi-squared tests of independence are used for deciding whether two variables are associated or are independent; the variables are categorical rather than numeric. Such a test can be used to decide whether left-handedness is correlated with libertarian politics (or not); the null hypothesis is that the variables are independent. The numbers used in the calculation are the observed and expected frequencies of occurrence (from contingency tables). Chi-squared goodness of fit tests are used to determine the adequacy of curves fit to data; the null hypothesis is that the curve fit is adequate. It is common to determine curve shapes to minimize the mean square error, so it is appropriate that the goodness-of-fit calculation sums the squared errors.
F-tests (analysis of variance, ANOVA) are commonly used when deciding whether groupings of data by category are meaningful. If the variance of test scores of the left-handed in a class is much smaller than the variance of the whole class, then it may be useful to study lefties as a group. The null hypothesis is that two variances are the same, so the proposed grouping is not meaningful.
In the list of tests below, the symbols used are defined at the end. Many other tests can be found in other articles.
One-sample z-test: z = (x̄ − μ0) / (σ / √n). Assumptions or notes: (normal population or n > 30) and σ known. (z is the distance from the mean in relation to the standard deviation of the mean.) For non-normal distributions it is possible to calculate a minimum proportion of a population that falls within k standard deviations for any k (see: Chebyshev's inequality).

Two-sample z-test: z = (x̄1 − x̄2 − d0) / √(σ1²/n1 + σ2²/n2). Assumptions or notes: normal populations, independent observations, and σ1 and σ2 are known.

One-sample t-test: t = (x̄ − μ0) / (s / √n), df = n − 1. Assumptions or notes: (normal population or n > 30) and σ unknown.

Paired t-test: t = (d̄ − d0) / (sd / √n), df = n − 1. Assumptions or notes: (normal population of differences or n > 30) and σ unknown, or small sample size n < 30.

Two-sample pooled t-test, equal variances: t = (x̄1 − x̄2 − d0) / (sp √(1/n1 + 1/n2)) with sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2), df = n1 + n2 − 2.[14] Assumptions or notes: (normal populations or n1 + n2 > 40), independent observations, and σ1 = σ2 unknown.

Two-sample unpooled t-test, unequal variances: t = (x̄1 − x̄2 − d0) / √(s1²/n1 + s2²/n2), with df given by the Welch-Satterthwaite equation, df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)].[14] Assumptions or notes: (normal populations or n1 + n2 > 40), independent observations, and σ1 ≠ σ2, both unknown.

One-proportion z-test: z = (p̂ − p0) / √(p0(1 − p0)/n). Assumptions or notes: n·p0 > 10 and n·(1 − p0) > 10, and the data are an SRS (simple random sample); see notes.

Two-proportion z-test, pooled (for d0 = 0): z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)) with pooled proportion p̂ = (x1 + x2)/(n1 + n2). Assumptions or notes: n1·p1 > 5, n1·(1 − p1) > 5, n2·p2 > 5, n2·(1 − p2) > 5, and independent observations; see notes.

Two-proportion z-test, unpooled (for |d0| > 0): z = (p̂1 − p̂2 − d0) / √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2). Assumptions or notes: n1·p1 > 5, n1·(1 − p1) > 5, n2·p2 > 5, n2·(1 − p2) > 5, and independent observations; see notes.

Chi-squared test for variance: χ² = (n − 1)s² / σ0². Assumptions or notes: normal population.

Chi-squared test for goodness of fit: χ² = Σ (observed − expected)² / expected, with df = k − 1 − (number of parameters estimated). One of these must hold: all expected counts are at least 5,[15] or all expected counts are > 1 and no more than 20% of expected counts are less than 5.[16]

Two-sample F test for equality of variances: F = s1² / s2². Assumptions or notes: normal populations; arrange so that s1² ≥ s2², and reject H0 for F > F(α/2, n1 − 1, n2 − 1).[17]

In general, the subscript 0 indicates a value taken from the null hypothesis, H0, which should be used as much as possible in constructing its test statistic. Definitions of other symbols:
α = the probability of Type I error (rejecting a null hypothesis when it is in fact true)
n = sample size
n1 = sample 1 size
n2 = sample 2 size
x̄ = sample mean
μ0 = hypothesized population mean
μ1 = population 1 mean
μ2 = population 2 mean
σ = population standard deviation
σ² = population variance
s = sample standard deviation
Σ = sum (of k numbers)
s² = sample variance
s1 = sample 1 standard deviation
s2 = sample 2 standard deviation
t = t statistic
df = degrees of freedom
d̄ = sample mean of differences
d0 = hypothesized population mean difference
sd = standard deviation of differences
χ² = Chi-squared statistic
F = F statistic
p̂ = x/n = sample proportion, unless specified otherwise
p0 = hypothesized population proportion
p1 = proportion 1
p2 = proportion 2
d0 = hypothesized difference in proportion
min{n1, n2} = minimum of n1 and n2
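Two of the statistics above can be computed directly from their formulas. The following is a minimal sketch, assuming Python with NumPy and SciPy and made-up data; it reproduces the one-sample t statistic and the pooled two-proportion z statistic, each with a two-sided p-value.

```python
# Sketch: two of the test statistics above, computed from made-up data.
import numpy as np
from scipy.stats import t as t_dist, norm

# One-sample t-test: t = (xbar - mu0) / (s / sqrt(n)), df = n - 1
x = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3])
mu0 = 5.0
n = len(x)
t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_t = 2 * t_dist.sf(abs(t_stat), df=n - 1)            # two-sided p-value

# Two-proportion z-test, pooled (for d0 = 0)
x1, n1 = 45, 100          # successes / trials in sample 1
x2, n2 = 30, 100          # successes / trials in sample 2
p_hat = (x1 + x2) / (n1 + n2)
z_stat = (x1 / n1 - x2 / n2) / np.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
p_z = 2 * norm.sf(abs(z_stat))

print(round(t_stat, 3), round(p_t, 3), round(z_stat, 3), round(p_z, 3))
```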


Origins and early controversy


Hypothesis testing is largely the product of Ronald Fisher, and was further developed by Jerzy Neyman, Karl Pearson and (son) Egon Pearson. Ronald Fisher, a genius mathematician and biologist described by Richard Dawkins as "the greatest biologist since Darwin", began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved, and sought to provide a more "objective" approach to inductive inference.[18]
Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs Neyman/Pearson formulation, methods and terminology developed in the early 20th century.
While hypothesis testing was popularized early in the 20th century, evidence of its use can be found much earlier. In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.[19]
Fisher popularized the "significance test". He required a null hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a Type II error. The p-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's faith in the null hypothesis.[] Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's p-value, also meant to determine researcher behaviour, but without requiring any inductive inference by the researcher.[][20] Neither strategy is meant to provide any way of drawing conclusions from a single experiment;[][21][disputed] both strategies were meant to assess the results of experiments that were replicated multiple times.[][disputed]
(A likely originator of the "hybrid" method of hypothesis testing, as well as of the use of "nil" null hypotheses, is E.F. Lindquist in his statistics textbook: Lindquist, E.F. (1940) Statistical Analysis in Educational Research. Boston: Houghton Mifflin.)
Neyman & Pearson considered a different problem (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.
Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing. (The defining paper[] was abstract. Mathematicians have generalized and refined the theory for decades.[]) Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions based on models formulated before data are collected was incompatible with this common scenario faced by scientists, and that attempts to apply this method to scientific research would lead to mass confusion. The dispute between Fisher and Neyman-Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference.[]
Events intervened: Neyman accepted a position in the western hemisphere, breaking his partnership with Pearson and separating the disputants (who had occupied the same building) by much of the planetary diameter. World War II provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy.[22] Some of Neyman's later publications reported p-values and significance levels.[23]
The modern version of hypothesis testing is a hybrid of the two approaches that resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s.[] (But signal detection, for example, still uses the Neyman/Pearson formulation.) Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs.[] This history explains the inconsistent terminology (example: the null hypothesis is never accepted, but there is a region of acceptance).
Sometime around 1940,[] in an apparent effort to provide researchers with a "non-controversial"[] way to have their cake and eat it too, the authors of statistical textbooks began anonymously combining these two strategies by using the p-value in place of the test statistic (or data) to test against the Neyman-Pearson "significance level".[] Thus, researchers were encouraged to infer the strength of their data against some null hypothesis using p-values, while also thinking they were retaining the post-data-collection objectivity provided by hypothesis testing. It then became customary for the null hypothesis, which was originally some realistic research hypothesis, to be used almost solely as a strawman "nil" hypothesis (one where a treatment has no effect, regardless of the context).[24]
A comparison between Fisherian null hypothesis testing and frequentist (Neyman-Pearson) decision theory:
Fisher's null hypothesis testing:
1. Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
2. Report the exact level of significance (e.g., p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses.
3. Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.

Neyman-Pearson decision theory:
1. Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
2. If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Note that accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.
3. The usefulness of the procedure is limited among others to situations where you have a disjunction of hypotheses (e.g., either μ1 = 8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta.

Null hypothesis statistical significance testing vs hypothesis testing


The competing approaches of Fisher and Neyman-Pearson are complementary, but distinct. Significance testing is enhanced and illuminated by hypothesis testing: Hypothesis testing provides the means of selecting the test statistics used in significance testing.[] The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in sample size determination. The two methods remain philosophically distinct.[] They usually (but not always) produce the same mathematical answer. The preferred answer is context dependent.[] While the existing merger of Fisher and Neyman-Pearson theories has been heavily criticized, modifying the merger to achieve Bayesian goals has been considered.[25] Statistical hypothesis testing is little used in its original Neyman-Pearson form but it has been generalized to decision theory which is heavily used. The continuing disagreement over the roles of significance testing and hypothesis testing has produced lasting confusion. The terminology implies that the merger is complete, which is misleading.


Criticism
Criticism of statistical hypothesis testing fills volumes[][26][][][][] citing 300-400 primary references. Much of the criticism can be summarized by the following issues: confusion resulting (in part) from combining the methods of Fisher and Neyman-Pearson, which are conceptually distinct;[27] emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments;[28] and rigidly requiring statistical significance as a criterion for publication, resulting in publication bias.[29] Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused. "[I]t does not tell us what we want to know".[30] Lists of dozens of complaints are available.[][]
Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): while it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis. The continuing controversy concerns the selection of the best statistical practices for the near-term future, given the (often poor) existing practices. Critics would prefer to ban NHST completely, forcing a complete departure from those practices, while supporters suggest a less absolute change.
Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review,[31] medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias,[32] and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively.[33] Textbooks have added some cautions[34] and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Major organizations have not abandoned the use of significance tests, although some have discussed doing so.[31]

Alternatives to significance testing


The numerous criticisms of significance testing do not lead to a single alternative, or even to a unified set of alternatives. A unifying position of critics is that statistics should not lead to an accept-reject conclusion or decision regarding a particular hypothesis, but to a probability or to an estimated value with a confidence interval. There is some consensus that the hybrid testing procedure[] that is commonly used is fundamentally flawed.[disputed] It is unlikely that the controversy surrounding significance testing will be resolved in the near future. Its supposed flaws and unpopularity do not eliminate the need for an objective and transparent means of reaching conclusions regarding studies that produce statistical results. Critics have not unified around an alternative. Other forms of reporting confidence or uncertainty could probably grow in popularity. Some recent work includes reconstruction and defense of Neyman-Pearson testing.[]
One strong critic of significance testing suggested a list of reporting alternatives:[] effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, and meta-analyses for generality. None of these suggested alternatives produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals: "The distinction between the ... approaches is largely one of reporting and interpretation."[]
On one "alternative" there is no disagreement: Fisher himself said,[4] "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." Cohen, an influential critic of significance testing, concurred,[30] "... don't look for a magic alternative to NHST [null hypothesis significance testing] ... It doesn't exist." "... given the problems of statistical induction, we must finally rely, as have the older sciences, on replication." The "alternative" to significance testing is repeated testing. The easiest way to decrease statistical uncertainty is by obtaining more data, whether by increased sample size or by repeated tests.

Nickerson claimed never to have seen the publication of a literally replicated experiment in psychology.[] An indirect approach to replication is meta-analysis.
Bayesian inference is one alternative to significance testing.[citation needed] For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available. The psychologist John K. Kruschke has suggested Bayesian estimation as an alternative to the t-test.[35] Alternatively, two competing models/hypotheses can be compared using Bayes factors.[36] Bayesian methods could be criticized for requiring information that is seldom available in the cases where significance testing is most heavily used.[citation needed] Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to objectively assess the probability that a hypothesis is true based on the data they have collected.[citation needed] Neither Fisher's significance testing nor Neyman-Pearson hypothesis testing can provide this information, and they do not claim to. The probability that a hypothesis is true can only be derived from use of Bayes' theorem, which was unsatisfactory to both the Fisher and Neyman-Pearson camps due to the explicit use of subjectivity in the form of the prior probability.[][37] Fisher's strategy is to sidestep this with the p-value (an objective index based on the data alone) followed by inductive inference, while Neyman-Pearson devised their approach of inductive behaviour.
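One of the reporting alternatives listed above, an effect size accompanied by a confidence interval, can be illustrated with a short sketch. This assumes Python with NumPy and SciPy and made-up data; Cohen's d is used here only as a common example of an effect size, not as the specific measure any cited critic proposed.

```python
# Report an effect size (Cohen's d) and a confidence interval for the mean
# difference, instead of only a reject/fail-to-reject decision.
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(2)
a = rng.normal(10.0, 2.0, 40)       # treatment group (made-up data)
b = rng.normal(9.2, 2.0, 40)        # control group (made-up data)

diff = a.mean() - b.mean()
sp = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
             / (len(a) + len(b) - 2))            # pooled standard deviation
cohens_d = diff / sp

se = sp * np.sqrt(1 / len(a) + 1 / len(b))
df = len(a) + len(b) - 2
half_width = t_dist.ppf(0.975, df) * se           # 95% CI half-width for the mean difference
print(round(cohens_d, 2), (round(diff - half_width, 2), round(diff + half_width, 2)))
```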


Education
Statistics is increasingly being taught in schools, with hypothesis testing being one of the elements taught.[38][39] Many conclusions reported in the popular press (from political opinion polls to medical studies) are based on statistics. An informed public should understand the limitations of statistical conclusions,[40][41][citation needed] and many college fields of study require a course in statistics for the same reason.[40][41][citation needed] An introductory college statistics class places much emphasis on hypothesis testing, perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics,[] but a limited amount of development continues.
The cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received, unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors.[42] While the problem was addressed more than a decade ago,[43] and calls for educational reform continue,[44] students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing.[45] Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject.[]

References
[1] R. A. Fisher (1925). Statistical Methods for Research Workers, Edinburgh: Oliver and Boyd, 1925, p. 43.
[2] Schervish, M (1996) Theory of Statistics, p. 218. Springer ISBN 0-387-94546-6
[4] Originally from Fisher's book Design of Experiments.
[6] Adèr, J.H. (2008). Chapter 12: Modelling. In H.J. Adèr & G.J. Mellenbergh (Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A consultant's companion (pp. 183-209). Huizen, The Netherlands: Johannes van Kessel Publishing
[12] "Over the last fifty years, How to Lie with Statistics has sold more copies than any other statistical text." J. M. Steele. "Darrell Huff and Fifty Years of How to Lie with Statistics" (http://www-stat.wharton.upenn.edu/~steele/Publications/PDF/TN148.pdf). Statistical Science, 20 (3), 2005, 205-209.
[14] NIST handbook: Two-Sample t-Test for Equal Means (http://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm)
[15] Steel, R.G.D, and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences., McGraw Hill, 1960, page 350.
[17] NIST handbook: F-Test for Equality of Two Standard Deviations (http://www.itl.nist.gov/div898/handbook/eda/section3/eda359.htm) (Testing standard deviations the same as testing variances)



[18] Raymond Hubbard, M.J. Bayarri, P Values are not Error Probabilities (http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf). A working paper that explains the difference between Fisher's evidential p-value and the Neyman-Pearson Type I error rate.
[27] "Until we go through the accounts of testing hypotheses, separating [Neyman-Pearson] decision elements from [Fisher] conclusion elements, the intimate mixture of disparate elements will be a continual source of confusion." ... "There is a place for both "doing one's best" and "saying only what is certain," but it is important to know, in each instance, both which one is being done, and which one ought to be done."
[28] "The emphasis given to formal tests of significance throughout [R.A. Fisher's] Statistical Methods ... has caused scientific research workers to pay undue attention to the results of the tests of significance they perform on their data, particularly data derived from experiments, and too little to the estimates of the magnitude of the effects they are investigating." ... "The emphasis on tests of significance and the consideration of the results of each experiment in isolation, have had the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective."
[30] This paper led to the review of statistical practices by the APA. Cohen was a member of the Task Force that did the review.
[31] "Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval." (p 599). The committee used the cautionary term "forbearance" in describing its decision against a ban of hypothesis testing in psychology reporting. (p 603)
[33] Journal of Articles in Support of the Null Hypothesis website: JASNH homepage (http://www.jasnh.com/). Volume 1 number 1 was published in 2002, and all articles are on psychology-related subjects.
[36] Department of Statistics, University of Washington Technical Paper
[38] Mathematics > High School: Statistics & Probability > Introduction (http://www.corestandards.org/the-standards/mathematics/hs-statistics-and-probability/introduction/) Common Core State Standards Initiative (relates to USA students)
[39] College Board Tests > AP: Subjects > Statistics (http://www.collegeboard.com/student/testing/ap/sub_stats.html) The College Board (relates to USA students)
[40] 'Statistical methods and statistical terms are necessary in reporting the mass data of social and economic trends, business conditions, "opinion" polls, the census. But without writers who use the words with honesty and readers who know what they mean, the result can only be semantic nonsense.'
[41] "...the basic ideas in statistics assist us in thinking clearly about the problem, provide some guidance about the conditions that must be satisfied if sound inferences are to be made, and enable us to detect many inferences that have no good logical foundation."
[44] Preprint (http://escholarshare.drake.edu/bitstream/handle/2092/413/WhyWeDon't.pdf)


Further reading
Lehmann E.L. (1992) "Introduction to Neyman and Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses". In: Breakthroughs in Statistics, Volume 1, (Eds Kotz, S., Johnson, N.L.), Springer-Verlag. ISBN 0-387-94037-5 (followed by reprinting of the paper)
Neyman, J.; Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Phil. Trans. R. Soc., Series A 231 (694-706): 289-337. doi: 10.1098/rsta.1933.0009 (http://dx.doi.org/10.1098/rsta.1933.0009).

External links
Hazewinkel, Michiel, ed. (2001), "Statistical hypotheses, verification of" (http://www.encyclopediaofmath.org/index.php?title=p/s087400), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
Wilson González, Georgina; Karpagam Sankaran (September 10, 1997). "Hypothesis Testing" (http://www.cee.vt.edu/ewr/environmental/teach/smprimer/hypotest/ht.html). Environmental Sampling & Monitoring Primer. Virginia Tech.
Bayesian critique of classical hypothesis testing (http://www.cs.ucsd.edu/users/goguen/courses/275f00/stat.html)
Critique of classical hypothesis testing highlighting long-standing qualms of statisticians (http://www.npwrc.usgs.gov/resource/methods/statsig/stathyp.htm)
Dallal GE (2007) The Little Handbook of Statistical Practice (http://www.tufts.edu/~gdallal/LHSP.HTM) (A good tutorial)
References for arguments for and against hypothesis testing (http://core.ecu.edu/psyc/wuenschk/StatHelp/NHST-SHIT.htm)

Statistical Tests Overview (http://www.wiwi.uni-muenster.de/ioeb/en/organisation/pfaff/stat_overview_table.html): How to choose the correct statistical test
An Interactive Online Tool to Encourage Understanding Hypothesis Testing (http://wasser.heliohost.org/?l=en)
A non-mathematical way to understand Hypothesis Testing (http://simplifyingstats.com/data/HypothesisTesting.pdf)


Statistical inference
In statistics, statistical inference is the process of drawing conclusions from data that is subject to random variation, for example, observational errors or sampling variation.[1] More substantially, the terms statistical inference, statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation,[2] such as observational errors, random sampling, or random experimentation.[1] Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.

Introduction
Scope
For the most part, statistical inference makes propositions about populations, using data drawn from the population of interest via some form of random sampling. More generally, data about a random process is obtained from its observed behavior during a finite period of time. Given a parameter or hypothesis about which one wishes to make inference, statistical inference most often uses: a statistical model of the random process that is supposed to generate the data, which is known when randomization has been used; and a particular realization of the random process, i.e., a set of data.
The conclusion of a statistical inference is a statistical proposition.[citation needed] Some common forms of statistical proposition are: an estimate, i.e., a particular value that best approximates some parameter of interest; a confidence interval (or set estimate), i.e., an interval constructed using a dataset drawn from a population so that, under repeated sampling of such datasets, such intervals would contain the true parameter value with the probability at the stated confidence level; a credible interval, i.e., a set of values containing, for example, 95% of posterior belief; rejection of a hypothesis;[3] and clustering or classification of data points into groups.


Comparison to descriptive statistics


Statistical inference is generally distinguished from descriptive statistics. In simple terms, descriptive statistics can be thought of as being just a straightforward presentation of facts, in which modeling decisions made by a data analyst have had minimal influence.

Models/Assumptions
Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw inference.[4] Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.[5]

Degree of models/assumptions
Statisticians distinguish between three levels of modeling assumptions:
Fully parametric: The probability distributions describing the data-generation process are assumed to be fully described by a family of probability distributions involving only a finite number of unknown parameters.[4] For example, one may assume that the distribution of population values is truly Normal, with unknown mean and variance, and that datasets are generated by 'simple' random sampling. The family of generalized linear models is a widely used and flexible class of parametric models.
Non-parametric: The assumptions made about the process generating the data are much weaker than in parametric statistics and may be minimal.[6] For example, every continuous probability distribution has a median, which may be estimated using the sample median or the Hodges-Lehmann-Sen estimator, which has good properties when the data arise from simple random sampling.
Semi-parametric: This term typically implies assumptions 'in between' fully and non-parametric approaches. For example, one may assume that a population distribution has a finite mean. Furthermore, one may assume that the mean response level in the population depends in a truly linear manner on some covariate (a parametric assumption) but not make any parametric assumption describing the variance around that mean (i.e., about the presence or possible form of any heteroscedasticity). More generally, semi-parametric models can often be separated into 'structural' and 'random variation' components. One component is treated parametrically and the other non-parametrically. The well-known Cox model is a set of semi-parametric assumptions.
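The contrast between the parametric and non-parametric choices can be shown with a short sketch of location estimates. This assumes Python with NumPy and made-up heavy-tailed data; the Hodges-Lehmann estimator used here is the one-sample version (the median of pairwise averages).

```python
# Sketch: a parametric and two non-parametric estimates of location for the
# same made-up sample. Under a fully parametric normal model the sample mean
# is the natural estimator; the median and the Hodges-Lehmann estimator need
# much weaker assumptions.
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_t(df=3, size=50) + 5.0       # heavy-tailed data centered near 5

mean_est = x.mean()                            # parametric choice (normal model)
median_est = np.median(x)                      # non-parametric
pairwise_avg = [(a + b) / 2 for i, a in enumerate(x) for b in x[i:]]
hodges_lehmann = np.median(pairwise_avg)       # median of pairwise Walsh averages

print(round(mean_est, 3), round(median_est, 3), round(hodges_lehmann, 3))
```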

Importance of valid models/assumptions


Whatever level of assumption is made, correctly calibrated inference in general requires these assumptions to be correct; i.e., that the data-generating mechanism really has been correctly specified. Incorrect assumptions of 'simple' random sampling can invalidate statistical inference.[7] More complex semi- and fully parametric assumptions are also cause for concern. For example, incorrectly assuming the Cox model can in some cases lead to faulty conclusions.[8] Incorrect assumptions of Normality in the population also invalidate some forms of regression-based inference.[9] The use of any parametric model is viewed skeptically by most experts in sampling human populations: "most sampling statisticians, when they deal with confidence intervals at all, limit themselves to statements about [estimators] based on very large samples, where the central limit theorem ensures that these [estimators] will have distributions that are nearly normal."[] In particular, a normal distribution "would be a totally unrealistic and catastrophically unwise assumption to make if we were dealing with any kind of economic population."[] Here, the central limit theorem states that the distribution of the sample mean "for very large samples" is approximately normally distributed, if the distribution is not heavy-tailed.

Approximate distributions
Given the difficulty in specifying exact distributions of sample statistics, many methods have been developed for approximating these.
With finite samples, approximation results measure how close a limiting distribution approaches the statistic's sample distribution: for example, with 10,000 independent samples the normal distribution approximates (to two digits of accuracy) the distribution of the sample mean for many population distributions, by the Berry-Esseen theorem.[10] Yet for many practical purposes, the normal approximation provides a good approximation to the sample mean's distribution when there are 10 (or more) independent samples, according to simulation studies and statisticians' experience.[10] Following Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify the error of approximation. In this approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback-Leibler distance, Bregman divergence, and the Hellinger distance.[11][12][13]
With indefinitely large samples, limiting results like the central limit theorem describe the sample statistic's limiting distribution, if one exists. Limiting results are not statements about finite samples, and indeed are irrelevant to finite samples.[14][15][16] However, the asymptotic theory of limiting distributions is often invoked for work with finite samples. For example, limiting results are often invoked to justify the generalized method of moments and the use of generalized estimating equations, which are popular in econometrics and biostatistics. The magnitude of the difference between the limiting distribution and the true distribution (formally, the 'error' of the approximation) can be assessed using simulation.[17] The heuristic application of limiting results to finite samples is common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-parameter exponential families).
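The kind of simulation check mentioned above can be sketched briefly. This assumes Python with NumPy and SciPy; the exponential population, the sample size of 10, and the 100,000 replications are illustrative choices, not values from the text.

```python
# Sketch: check by simulation how well the normal distribution approximates the
# sampling distribution of the mean for a skewed population (exponential).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, reps = 10, 100_000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

z = (means - 1.0) / (1.0 / np.sqrt(n))        # standardize: true mean 1, true sd 1
empirical = np.mean(z <= 1.645)                # empirical P(Z <= 1.645)
print(round(empirical, 3), round(norm.cdf(1.645), 3))   # compare with the normal value
```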


Randomization-based models
For a given dataset that was produced by a randomization design, the randomization distribution of a statistic (under the null-hypothesis) is defined by evaluating the test statistic for all of the plans that could have been generated by the randomization design. In frequentist inference, randomization allows inferences to be based on the randomization distribution rather than a subjective model, and this is important especially in survey sampling and design of experiments.[18][19] Statistical inference from randomized studies is also more straightforward than many other situations.[20][21][22] In Bayesian inference, randomization is also of importance: in survey sampling, use of sampling without replacement ensures the exchangeability of the sample with the population; in randomized experiments, randomization warrants a missing at random assumption for covariate information.[23] Objective randomization allows properly inductive procedures.[24][25][26][27] Many statisticians prefer randomization-based analysis of data that was generated by well-defined randomization procedures.[28] (However, it is true that in fields of science with developed theoretical knowledge and experimental control, randomized experiments may increase the costs of experimentation without improving the quality of inferences.[29][30]) Similarly, results from randomized experiments are recommended by leading statistical authorities as allowing inferences with greater reliability than do observational studies of the same phenomena.[31] However, a good observational study may be better than a bad randomized experiment. The statistical analysis of a randomized experiment may be based on the randomization scheme stated in the experimental protocol and does not need a subjective model.[32][33] However, at any time, some hypotheses cannot be tested using objective statistical models, which accurately describe randomized experiments or random samples. In some cases, such randomized studies are uneconomical or unethical.
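The randomization distribution described at the start of this section can be approximated by re-randomizing in code. A minimal sketch, assuming Python with NumPy and made-up data from a small two-group experiment; a Monte Carlo sample of re-randomizations stands in for complete enumeration of all possible assignments.

```python
# Sketch of a randomization (permutation) test: the randomization distribution
# of the difference in group means is built by re-evaluating the statistic
# under re-randomizations of the treatment labels.
import numpy as np

rng = np.random.default_rng(5)
treatment = np.array([5.2, 6.1, 5.8, 6.4, 5.9, 6.3])
control = np.array([5.0, 5.4, 5.1, 5.6, 5.3, 5.2])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])
n_t = len(treatment)

reps = 10_000
null_stats = np.empty(reps)
for i in range(reps):
    perm = rng.permutation(pooled)             # one alternative random assignment
    null_stats[i] = perm[:n_t].mean() - perm[n_t:].mean()

p_value = np.mean(null_stats >= observed)      # one-sided randomization p-value
print(round(observed, 3), round(p_value, 4))
```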

Model-based analysis of randomized experiments
It is standard practice to refer to a statistical model, often a linear model, when analyzing data from randomized experiments. However, the randomization scheme guides the choice of a statistical model. It is not possible to choose an appropriate model without knowing the randomization scheme.[19] Seriously misleading results can be obtained by analyzing data from randomized experiments while ignoring the experimental protocol; common mistakes include forgetting the blocking used in an experiment and confusing repeated measurements on the same experimental unit with independent replicates of the treatment applied to different experimental units.[34]


Modes of inference
Different schools of statistical inference have become established. These schools (or 'paradigms') are not mutually exclusive, and methods which work well under one paradigm often have attractive interpretations under other paradigms. The two main paradigms in use are frequentist and Bayesian inference, which are both summarized below.

Frequentist inference
This paradigm calibrates the production of propositions[clarification needed] by considering (notional) repeated sampling of datasets similar to the one at hand. By considering the dataset's characteristics under repeated sampling, the frequentist properties of any statistical inference procedure can be described, although in practice this quantification may be challenging.
Examples of frequentist inference
P-value
Confidence interval
Frequentist inference, objectivity, and decision theory
One interpretation of frequentist inference (or classical inference) is that it is applicable only in terms of frequency probability; that is, in terms of repeated sampling from a population. However, the approach of Neyman[35] develops these procedures in terms of pre-experiment probabilities. That is, before undertaking an experiment, one decides on a rule for coming to a conclusion such that the probability of being correct is controlled in a suitable way: such a probability need not have a frequentist or repeated sampling interpretation.


Bayesian inference
The Bayesian calculus describes degrees of belief using the 'language' of probability; beliefs are positive, integrate to one, and obey probability axioms. Bayesian inference uses the available posterior beliefs as the basis for making statistical propositions. There are several different justifications for using the Bayesian approach.

Examples of Bayesian inference
Credible intervals for interval estimation
Bayes factors for model comparison

Bayesian inference, subjectivity and decision theory
Many informal Bayesian inferences are based on "intuitively reasonable" summaries of the posterior. For example, the posterior mean, median and mode, highest posterior density intervals, and Bayes factors can all be motivated in this way. While a user's utility function need not be stated for this sort of inference, these summaries do all depend (to some extent) on stated prior beliefs, and are generally viewed as subjective conclusions. (Methods of prior construction which do not require external input have been proposed but not yet fully developed.)
Formally, Bayesian inference is calibrated with reference to an explicitly stated utility, or loss function; the 'Bayes rule' is the one which maximizes expected utility, averaged over the posterior uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision-theoretic sense. Given assumptions, data and utility, Bayesian inference can be made for essentially any problem, although not every statistical inference need have a Bayesian interpretation. Analyses which are not formally Bayesian can be (logically) incoherent; a feature of Bayesian procedures which use proper priors (i.e., those integrable to one) is that they are guaranteed to be coherent. Some advocates of Bayesian inference assert that inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization of posterior beliefs.
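A brief illustration of a credible interval, using a conjugate Beta-Binomial model; the prior and the data are invented for the example, and the choice of a uniform prior is an assumption made for illustration, not a recommendation:

```python
# Sketch: with a Beta(a, b) prior on a proportion and k successes in n trials,
# the posterior is Beta(a + k, b + n - k); a 95% equal-tailed credible interval
# is read off the posterior quantiles.
from scipy import stats

a, b = 1, 1          # uniform prior (illustrative assumption)
k, n = 27, 100       # hypothetical data: 27 successes in 100 trials

posterior = stats.beta(a + k, b + n - k)
credible_interval = posterior.ppf([0.025, 0.975])

print(posterior.mean(), credible_interval)
```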

Other modes of inference (besides frequentist and Bayesian)


Information and computational complexity
Other forms of statistical inference have been developed from ideas in information theory[37] and the theory of Kolmogorov complexity.[38] For example, the minimum description length (MDL) principle selects statistical models that maximally compress the data; inference proceeds without assuming counterfactual or non-falsifiable 'data-generating mechanisms' or probability models for the data, as might be done in frequentist or Bayesian approaches. However, if a 'data-generating mechanism' does exist in reality, then according to Shannon's source coding theorem it provides the MDL description of the data, on average and asymptotically.[39] In minimizing description length (or descriptive complexity), MDL estimation is similar to maximum likelihood estimation and maximum a posteriori estimation (using maximum-entropy Bayesian priors). However, MDL avoids assuming that the underlying probability model is known; the MDL principle can also be applied without assumptions that e.g. the data arose from independent sampling.[39][40] The MDL principle has been applied in communication-coding theory in information theory, in linear regression, and in time-series analysis (particularly for choosing the degrees of the polynomials in autoregressive moving average (ARMA) models).[40]
Information-theoretic statistical inference has been popular in data mining, which has become a common approach for the very large observational and heterogeneous datasets made possible by the computer revolution and the Internet.[38] The evaluation of statistical inferential procedures often uses techniques or criteria from computational complexity theory or numerical analysis.[41][42]
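The sketch below only conveys the flavour of MDL-style model selection and is not Rissanen's construction: a BIC-like penalty stands in for the description length of the model, the Gaussian negative log-likelihood stands in for the code length of the data given the model, and the data and candidate models are hypothetical.

```python
# Sketch: choosing a polynomial degree by a crude two-part "description length"
#   L(model) + L(data | model),
# approximated here by (k/2)*log(n) plus the Gaussian negative log-likelihood
# (up to constants). Purely illustrative.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, x.size)   # hypothetical data

def description_length(degree):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1
    data_part = 0.5 * n * np.log(resid.var())   # code length of data given model
    model_part = 0.5 * k * np.log(n)            # code length of the model itself
    return data_part + model_part

best_degree = min(range(6), key=description_length)
print("selected degree:", best_degree)
```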

Fiducial inference
Fiducial inference was an approach to statistical inference based on fiducial probability, also known as a "fiducial distribution". In subsequent work, this approach has been called ill-defined, extremely limited in applicability, and even fallacious.[43][44] However, this argument is the same as that which shows[45] that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated the application of confidence intervals, it does not necessarily invalidate conclusions drawn from fiducial arguments.

Structural inference
Developing ideas of Fisher and of Pitman from 1938 to 1939,[46] George A. Barnard developed "structural inference" or "pivotal inference",[47] an approach using invariant probabilities on group families. Barnard reformulated the arguments behind fiducial inference on a restricted class of models on which "fiducial" procedures would be well-defined and useful.


Inference topics
The topics below are usually included in the area of statistical inference.
1. Statistical assumptions
2. Statistical decision theory
3. Estimation theory
4. Statistical hypothesis testing
5. Revising opinions in statistics
6. Design of experiments, the analysis of variance, and regression
7. Survey sampling
8. Summarizing statistical data

Notes
[1] Upton, G., Cook, I. (2008) Oxford Dictionary of Statistics, OUP. ISBN 978-0-19-954145-4
[2] Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN 0-19-920613-9 (entry for "inferential statistics")
[3] According to Peirce, acceptance means that inquiry on this question ceases for the time being. In science, all scientific theories are revisable.
[4] Cox (2006), page 2
[6] van der Vaart, A.W. (1998) Asymptotic Statistics. Cambridge University Press. ISBN 0-521-78450-6 (page 341)
[8] Freedman, D.A. (2008) "Survival analysis: An Epidemiological hazard?". The American Statistician, 62: 110–119. (Reprinted as Chapter 11 (pages 169–192) of: Freedman, D.A. (2010) Statistical Models and Causal Inferences: A Dialogue with the Social Sciences (edited by David Collier, Jasjeet S. Sekhon, and Philip B. Stark). Cambridge University Press. ISBN 978-0-521-12390-7)
[9] Berk, R. (2003) Regression Analysis: A Constructive Critique (Advanced Quantitative Techniques in the Social Sciences, v. 11). Sage Publications. ISBN 0-7619-2904-5
[10] Jørgen Hoffmann-Jørgensen's Probability With a View Towards Statistics, Volume I, page 399
[11] Le Cam (1986)
[12] Erik Torgerson (1991) Comparison of Statistical Experiments, volume 36 of Encyclopedia of Mathematics. Cambridge University Press.
[14] Kolmogorov (1963a), page 369: "The frequency concept, based on the notion of limiting frequency as the number of trials increases to infinity, does not contribute anything to substantiate the applicability of the results of probability theory to real practical problems where we have always to deal with a finite number of trials."
[15] "Indeed, limit theorems 'as n tends to infinity' are logically devoid of content about what happens at any particular n. All they can do is suggest certain approaches whose performance must then be checked on the case at hand." Le Cam (1986), page xiv
[16] Pfanzagl (1994): "The crucial drawback of asymptotic theory: What we expect from asymptotic theory are results which hold approximately . . . . What asymptotic theory has to offer are limit theorems." (page ix) "What counts for applications are approximations, not limits." (page 188)
[17] Pfanzagl (1994): "By taking a limit theorem as being approximately true for large sample sizes, we commit an error the size of which is unknown. [. . .] Realistic information about the remaining errors may be obtained by simulations." (page ix)
[18] Neyman, J. (1934) "On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection", Journal of the Royal Statistical Society, 97 (4), 557–625

[19] Hinkelmann and Kempthorne (2008)
[20] ASA Guidelines for a first course in statistics for non-statisticians (available at the ASA website)
[21] David A. Freedman et alia's Statistics.
[22] David S. Moore and George McCabe. Introduction to the Practice of Statistics.
[23] Gelman, Rubin. Bayesian Data Analysis.
[24] Peirce (1877–1878)
[25] Peirce (1883)
[26] David Freedman et alia, Statistics, and David A. Freedman, Statistical Models.
[27] Rao, C.R. (1997) Statistics and Truth: Putting Chance to Work, World Scientific. ISBN 981-02-3111-3
[28] Peirce, Freedman, Moore and McCabe.
[29] Box, G.E.P. and Friends (2006) Improving Almost Anything: Ideas and Essays, Revised Edition, Wiley. ISBN 978-0-471-72755-2
[30] Cox (2006), page 196
[31] ASA Guidelines for a first course in statistics for non-statisticians (available at the ASA website)


[32] Neyman, Jerzy. 1923 [1990]. "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9." Statistical Science 5 (4): 465–472. Trans. Dorota M. Dabrowska and Terence P. Speed.
[33] Hinkelmann & Kempthorne (2008)
[34] Hinkelmann and Kempthorne (2008), Chapter 6.
[35] Neyman, J. (1937) "Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability" (http://links.jstor.org/sici?sici=0080-4614(19370830)236:767<333:OOATOS>2.0.CO;2-6), Philosophical Transactions of the Royal Society of London A, 236, 333–380.
[36] Preface to Pfanzagl.
[37] Soofi (2000)
[38] Hansen & Yu (2001)
[39] Hansen and Yu (2001), page 747.
[40] Rissanen (1989), page 84
[41] Joseph F. Traub, G. W. Wasilkowski, and H. Wozniakowski (1988)
[42] Judin and Nemirovski.
[43] Neyman (1956)
[44] Zabell (1992)
[45] Cox (2006), page 66
[46] Davison, page 12.
[47] Barnard, G.A. (1995) "Pivotal Models and the Fiducial Argument", International Statistical Review, 63 (3), 309–323.

References
Bickel, Peter J.; Doksum, Kjell A. (2001). Mathematical Statistics: Basic and Selected Topics 1 (Second (updated printing 2007) ed.). Pearson Prentice-Hall. ISBN 0-13-850363-X. MR 443141 (http://www.ams.org/mathscinet-getitem?mr=443141).
Cox, D. R. (2006). Principles of Statistical Inference, CUP. ISBN 0-521-68567-2.
Fisher, Ronald (1955) "Statistical methods and scientific induction", Journal of the Royal Statistical Society, Series B, 17, 69–78. (criticism of statistical theories of Jerzy Neyman and Abraham Wald)
Freedman, David A. (2009). Statistical Models: Theory and Practice (http://www.cambridge.org/catalogue/catalogue.asp?isbn=9780521743853) (revised ed.). Cambridge University Press. pp. xiv+442. ISBN 978-0-521-74385-3. MR 2489600 (http://www.ams.org/mathscinet-getitem?mr=2489600).
Hansen, Mark H.; Yu, Bin (June 2001). "Model Selection and the Principle of Minimum Description Length: Review paper". Journal of the American Statistical Association 96 (454): 746–774. doi:10.1198/016214501753168398 (http://dx.doi.org/10.1198/016214501753168398). JSTOR 2670311 (http://www.jstor.org/stable/2670311). MR 1939352 (http://www.ams.org/mathscinet-getitem?mr=1939352).
Hinkelmann, Klaus; Kempthorne, Oscar (2008). Introduction to Experimental Design (http://books.google.com/?id=T3wWj2kVYZgC&printsec=frontcover) (Second ed.). Wiley. ISBN 978-0-471-72756-9.
Kolmogorov, Andrei N. (1963a). "On Tables of Random Numbers". Sankhyā Ser. A 25: 369–375. MR 178484 (http://www.ams.org/mathscinet-getitem?mr=178484).

Kolmogorov, Andrei N. (1963b). "On Tables of Random Numbers". Theoretical Computer Science 207 (2): 387–395. doi:10.1016/S0304-3975(98)00075-9 (http://dx.doi.org/10.1016/S0304-3975(98)00075-9). MR 1643414 (http://www.ams.org/mathscinet-getitem?mr=1643414).
Le Cam, Lucien (1986) Asymptotic Methods of Statistical Decision Theory, Springer. ISBN 0-387-96307-3
Neyman, Jerzy (1956). "Note on an Article by Sir Ronald Fisher". Journal of the Royal Statistical Society, Series B (Methodological) 18 (2): 288–294. JSTOR 2983716 (http://www.jstor.org/stable/2983716). (reply to Fisher 1955)
Peirce, C. S. (1877–1878), "Illustrations of the Logic of Science" (series), Popular Science Monthly, vols. 12–13. Relevant individual papers:
(1878 March), "The Doctrine of Chances", Popular Science Monthly, v. 12, March issue, pp. 604–615 (http://books.google.com/books?id=ZKMVAAAAYAAJ&jtp=604). Internet Archive Eprint (http://www.archive.org/stream/popscimonthly12yoummiss#page/612/mode/1up).
(1878 April), "The Probability of Induction", Popular Science Monthly, v. 12, pp. 705–718 (http://books.google.com/books?id=ZKMVAAAAYAAJ&jtp=705). Internet Archive Eprint (http://www.archive.org/stream/popscimonthly12yoummiss#page/715/mode/1up).
(1878 June), "The Order of Nature", Popular Science Monthly, v. 13, pp. 203–217 (http://books.google.com/books?id=u8sWAQAAIAAJ&jtp=203). Internet Archive Eprint (http://www.archive.org/stream/popularsciencemo13newy#page/203/mode/1up).
(1878 August), "Deduction, Induction, and Hypothesis", Popular Science Monthly, v. 13, pp. 470–482 (http://books.google.com/books?id=u8sWAQAAIAAJ&jtp=470). Internet Archive Eprint (http://www.archive.org/stream/popularsciencemo13newy#page/470/mode/1up).
Peirce, C. S. (1883), "A Theory of Probable Inference", Studies in Logic, pp. 126–181 (http://books.google.com/books?id=V7oIAAAAQAAJ&pg=PA126), Little, Brown, and Company. (Reprinted 1983, John Benjamins Publishing Company, ISBN 90-272-3271-7)
Pfanzagl, Johann; with the assistance of R. Hamböker (1994). Parametric Statistical Theory. Berlin: Walter de Gruyter. ISBN 3-11-013863-8. MR 1291393 (http://www.ams.org/mathscinet-getitem?mr=1291393).
Rissanen, Jorma (1989). Stochastic Complexity in Statistical Inquiry. Series in Computer Science 15. Singapore: World Scientific. ISBN 9971-5-0859-1. MR 1082556 (http://www.ams.org/mathscinet-getitem?mr=1082556).
Soofi, Ehsan S. (December 2000). "Principal Information-Theoretic Approaches (Vignettes for the Year 2000: Theory and Methods, ed. by George Casella)". Journal of the American Statistical Association 95 (452): 1349–1353. JSTOR 2669786 (http://www.jstor.org/stable/2669786). MR 1825292 (http://www.ams.org/mathscinet-getitem?mr=1825292).
Traub, Joseph F.; Wasilkowski, G. W.; Wozniakowski, H. (1988). Information-Based Complexity. Academic Press. ISBN 0-12-697545-0.
Zabell, S. L. (Aug. 1992). "R. A. Fisher and Fiducial Argument". Statistical Science 7 (3): 369–387. doi:10.1214/ss/1177011233 (http://dx.doi.org/10.1214/ss/1177011233). JSTOR 2246073 (http://www.jstor.org/stable/2246073).



Further reading
Casella, G., Berger, R.L. (2001). Statistical Inference. Duxbury Press. ISBN 0-534-24312-6
David A. Freedman (1991). "Statistical Models and Shoe Leather". Sociological Methodology, vol. 21, pp. 291–313.
David A. Freedman (2010). Statistical Models and Causal Inferences: A Dialogue with the Social Sciences. Edited by David Collier, Jasjeet S. Sekhon, and Philip B. Stark. Cambridge University Press.
Kruskal, William (December 1988). "Miracles and Statistics: The Casual Assumption of Independence (ASA Presidential address)". Journal of the American Statistical Association 83 (404): 929–940. JSTOR 2290117 (http://www.jstor.org/stable/2290117).
Lenhard, Johannes (2006). "Models and Statistical Inference: The Controversy between Fisher and Neyman-Pearson". British Journal for the Philosophy of Science, Vol. 57, Issue 1, pp. 69–91.
Lindley, D. (1958). "Fiducial distribution and Bayes' theorem", Journal of the Royal Statistical Society, Series B, 20, 102–107.
Sudderth, William D. (1994). "Coherent Inference and Prediction in Statistics", in Dag Prawitz, Bryan Skyrms, and Dag Westerståhl (eds.), Logic, Methodology and Philosophy of Science IX: Proceedings of the Ninth International Congress of Logic, Methodology and Philosophy of Science, Uppsala, Sweden, August 7–14, 1991. Amsterdam: Elsevier.
Trusted, Jennifer (1979). The Logic of Scientific Inference: An Introduction. London: The Macmillan Press, Ltd.
Young, G.A., Smith, R.L. (2005). Essentials of Statistical Inference, CUP. ISBN 0-521-83971-8

External links
MIT OpenCourseWare (http://dspace.mit.edu/handle/1721.1/45587): Statistical Inference

Survey methodology
A field of applied statistics, survey methodology studies the sampling of individual units from a population and the associated survey data collection techniques, such as questionnaire construction and methods for improving the number and accuracy of responses to surveys. Statistical surveys are undertaken with a view towards making statistical inferences about the population being studied, and this depends strongly on the survey questions used. Polls about public opinion, public health surveys, market research surveys, government surveys and censuses are all examples of quantitative research that use contemporary survey methodology to answer questions about a population. Although censuses do not include a "sample", they do include other aspects of survey methodology, like questionnaires, interviewers, and nonresponse follow-up techniques. Surveys provide important information for many kinds of public-information and research fields, such as marketing research, psychology, health and sociology.[1]
A single survey is made of at least a sample (or the full population in the case of a census), a method of data collection (e.g., a questionnaire) and individual questions or items that become data that can be analyzed statistically. A single survey may focus on different types of topics such as preferences (e.g., for a presidential candidate), opinions (e.g., should abortion be legal?), behavior (smoking and alcohol use), or factual information (e.g., income), depending on its purpose. Since survey research is almost always based on a sample of the population, the success of the research is dependent on the representativeness of the sample with respect to a target population of interest to the researcher. That target population can range from the general population of a given country to specific groups of people within that country, to a membership list of a professional organization, or a list of students enrolled in a school system (see also sampling (statistics) and survey sampling).

Survey methodology as a scientific field seeks to identify principles about the sample design, data collection instruments, statistical adjustment of data, data processing, and final data analysis that can create systematic and random survey errors. Survey errors are sometimes analyzed in connection with survey cost; the trade-off is sometimes framed as improving quality within a cost constraint, or alternatively as reducing costs for a fixed level of quality. Survey methodology is both a scientific field and a profession, meaning that some professionals in the field focus on survey errors empirically and others design surveys to reduce them. For survey designers, the task involves making a large set of decisions about thousands of individual features of a survey in order to improve it.
The most important methodological challenges of a survey methodologist include making decisions on how to:
Identify and select potential sample members.
Contact sampled individuals and collect data from those who are hard to reach (or reluctant to respond).
Evaluate and test questions.
Select the mode for posing questions and collecting responses.
Train and supervise interviewers (if they are involved).
Check data files for accuracy and internal consistency.
Adjust survey estimates to correct for identified errors.


Selecting samples
Survey samples can be broadly divided into two types: probability samples and non-probability samples. Stratified sampling is a method of probability sampling in which sub-populations (strata) within an overall population are identified and each is represented in the selected sample in a controlled, balanced way.
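For example, under proportional allocation the number of sampled units in each stratum is proportional to that stratum's share of the population. A minimal sketch (the stratum sizes are invented for illustration):

```python
# Sketch: proportional allocation of a stratified sample.
population = {"urban": 60000, "suburban": 30000, "rural": 10000}   # hypothetical strata
total_sample = 1000

N = sum(population.values())
# Note: simple rounding can make the allocations sum to slightly more or less
# than total_sample; a production allocation would reconcile the remainder.
allocation = {stratum: round(total_sample * size / N)
              for stratum, size in population.items()}

print(allocation)   # {'urban': 600, 'suburban': 300, 'rural': 100}
```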

Modes of data collection


There are several ways of administering a survey. The choice between administration modes is influenced by several factors, including 1) costs, 2) coverage of the target population, 3) flexibility of asking questions, 4) respondents' willingness to participate and 5) response accuracy. Different methods create mode effects that change how respondents answer, and different methods have different advantages. The most common modes of administration can be summarized as:[2]
Telephone
Mail (post)
Online surveys
Personal in-home surveys
Personal mall or street intercept surveys
Hybrids of the above

Cross-sectional and longitudinal surveys


There is a distinction between one-time (cross-sectional) surveys, which involve a single questionnaire or interview administered to each sample member, and surveys which repeatedly collect information from the same people over time. The latter are known as longitudinal surveys. Longitudinal surveys have considerable analytical advantages, but they are also challenging to implement successfully. Consequently, specialist methods have been developed to select longitudinal samples, to collect data repeatedly, to keep track of sample members over time, to keep respondents motivated to participate, and to process and analyse longitudinal survey data.[3]


Response formats
Usually, a survey consists of a number of questions that the respondent has to answer in a set format. A distinction is made between open-ended and closed-ended questions. An open-ended question asks the respondent to formulate his or her own answer, whereas a closed-ended question has the respondent pick an answer from a given number of options. The response options for a closed-ended question should be exhaustive and mutually exclusive. Four types of response scales for closed-ended questions are distinguished:
Dichotomous, where the respondent has two options
Nominal-polytomous, where the respondent has more than two unordered options
Ordinal-polytomous, where the respondent has more than two ordered options
(Bounded) continuous, where the respondent is presented with a continuous scale

A respondent's answer to an open-ended question can be coded into a response scale afterwards,[2] or analysed using more qualitative methods.
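A small illustrative sketch of these ideas, with a hypothetical ordinal-polytomous item and a deliberately crude keyword rule for coding open-ended answers onto the same scale (the item wording, options and rules are invented, not drawn from the cited sources):

```python
# Sketch: a closed-ended ordinal item (options exhaustive and mutually
# exclusive) and a naive post-hoc coder for open-ended answers.
item = {
    "question": "How satisfied are you with your commute?",
    "type": "ordinal-polytomous",
    "options": ["very dissatisfied", "dissatisfied", "neutral",
                "satisfied", "very satisfied"],
}

def code_open_answer(text):
    """Map a free-text answer onto the closed response scale (very crude)."""
    text = text.lower()
    if "very" in text and "not" not in text:
        return item["options"].index("very satisfied")
    if "not" in text or "hate" in text:
        return item["options"].index("dissatisfied")
    return item["options"].index("neutral")

print(code_open_answer("I'm very happy with it"))   # 4
```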

Advantages and disadvantages


Surveys are good solutions for many research questions but are often not the best solution. Making a wise decision involves understanding the trade-offs of different sources of survey error, costs (and funds available), and the ultimate uses of the statistics to be calculated from the data collected. A general list of advantages and disadvantages of surveys as data collection tools is below, but it is important to consider the specifics of a given situation to determine if a survey is best.

Advantages
They are relatively easy and inexpensive to administer for the simplest of designs. Simply administering a survey does not require a lot of technical expertise, if quality of the data is not a major concern.
If conducted remotely, can reduce or obviate geographical dependence.
Useful in describing the characteristics of a large population, assuming the sampling is valid.
Can be administered remotely via the Web, mobile devices, mail, e-mail, telephone, etc.
Efficient at collecting information from a large number of respondents for a fixed cost compared to other methods.
Statistical techniques can be applied to the survey data to determine validity, reliability, and statistical significance, even when analyzing multiple variables.
Many questions can be asked about a given topic, giving considerable flexibility to the analysis.
Support both between- and within-subjects study designs.
A wide range of information can be collected (e.g., attitudes, values, beliefs, and behaviour).
Compared to qualitative interviewing, standardized survey questions provide all the participants with a standardized stimulus.


Disadvantages
The validity and reliability (i.e., variance and bias) of survey data may depend on the following:
Respondents' motivation, honesty, memory, and ability to respond. Respondents may not be fully aware of their reasons for any given action, making surveys weak methods for studying things that respondents cannot report consciously and accurately. Structured surveys, particularly those with closed-ended questions, may have low validity when researching affective variables.
Self-selection bias. Although the individuals chosen to participate in surveys are usually randomly sampled, errors due to nonresponse may exist (see also chapter 13 of Adèr et al. (2008) for more information on how to deal with nonresponse bias in survey estimates). That is, people who choose to respond to the survey may be different from those who do not respond, thus biasing the estimates. For example, people who are not at home regularly will be more difficult to contact than those who are at home a lot, and thus hard to reach with a face-to-face or telephone survey that uses only landline numbers.
The sampling frame chosen. The overall inference is limited by the sampling frame. For example, polls or surveys that are conducted by calling a random sample of publicly available telephone numbers will not include the responses of people with unlisted telephone numbers or mobile (cell) phone numbers. Even random digit dial sampling frames of landlines have been shown to under-represent certain individuals (and their behaviors), specifically those who only have a cell phone.
Question and questionnaire design. Survey answer choices could lead to vague data sets because at times they are relative only to a personal, abstract notion concerning "strength of choice". For instance, the choice "moderately agree" may mean different things to different subjects, and to anyone interpreting the data for correlation. Even 'yes' or 'no' answers are problematic, because subjects may for instance put "no" if the choice "only once" is not available.

Nonresponse reduction
The following ways have been recommended for reducing nonresponse[4] in telephone and face-to-face surveys:[5]
Advance letter. A short letter is sent in advance to inform the sampled respondents about the upcoming survey. The style of the letter should be personalized but not overdone. First, it announces that a phone call will be made, or that an interviewer wants to make an appointment to do the survey face-to-face. Second, the research topic is described. Last, it allows both an expression of the surveyor's appreciation of cooperation and an opening to ask questions about the survey.
Training. The interviewers are thoroughly trained in how to ask respondents questions, how to work with computers, and how to make schedules for callbacks to respondents who were not reached.
Short introduction. The interviewer should always start with a short introduction about him or herself: name, the institute she or he is working for, the length of the interview and the goal of the interview. It can also be useful to make clear that the interviewer is not selling anything, as this has been shown to lead to a slightly higher response rate.[6]
Respondent-friendly survey questionnaire. The questions asked must be clear, non-offensive and easy to respond to for the subjects under study. Brevity is also often cited as increasing response rate. A 1996 literature review found mixed evidence to support this claim for both written and verbal surveys, concluding that other factors may often be more important.[7] A 2010 study by SurveyMonkey looking at 100,000 of the online surveys they host found that the response rate dropped by about 3% at 10 questions and about 6% at 20 questions, with the dropoff slowing thereafter (for example, only a 10% reduction at 40 questions).[8] Other studies showed that the quality of responses degraded toward the end of long surveys.[9]


Other methods to increase response rates


financial incentives
  paid in advance
  paid at completion
non-monetary incentives
  commodity giveaways (pens, notepads)
  entry into a lottery, draw or contest
  discount coupons
  promise of contribution to charity
preliminary notification
foot-in-the-door techniques
  start with a small inconsequential request
personalization of the request
  address specific individuals
follow-up requests
  multiple requests
emotional appeals
  bids for sympathy
  convince respondents that they can make a difference
guarantee of anonymity
legal compulsion (certain government-run surveys)

Interviewer effects
Survey methodologists have devoted much effort to determining the extent to which interviewee responses are affected by physical characteristics of the interviewer. The main interviewer traits that have been demonstrated to influence survey responses are race,[10] gender[11] and relative body weight (BMI).[12] These interviewer effects are particularly pronounced when questions are related to the interviewer trait. Hence, race of interviewer has been shown to affect responses to measures regarding racial attitudes,[13] interviewer sex to affect responses to questions involving gender issues,[14] and interviewer BMI to affect answers to eating- and dieting-related questions.[15] While interviewer effects have been investigated mainly for face-to-face surveys, they have also been shown to exist for interview modes with no visual contact, such as telephone surveys and video-enhanced web surveys. The explanation typically provided for interviewer effects is social desirability: survey participants may attempt to project a positive self-image in an effort to conform to the norms they attribute to the interviewer asking questions.

Notes
[1] http://whatisasurvey.info/
[2] Mellenbergh, G.J. (2008). Chapter 9: Surveys. In H.J. Adèr & G.J. Mellenbergh (Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A Consultant's Companion (pp. 183–209). Huizen, The Netherlands: Johannes van Kessel Publishing.
[3] Lynn, P. (2009) (Ed.) Methodology of Longitudinal Surveys. Wiley. ISBN 0-470-01871-2
[4] Lynn, P. (2008) "The problem of non-response", chapter 3, 35–55, in International Handbook of Survey Methodology (eds E. de Leeuw, J. Hox & D. Dillman). Erlbaum. ISBN 0-8058-5753-2
[5] Dillman, D.A. (1978) Mail and Telephone Surveys: The Total Design Method. Wiley. ISBN 0-471-21555-4
[6] De Leeuw, E.D. (2001). "I am not selling anything: Experiments in telephone introductions". Kwantitatieve Methoden, 22, 41–48.
[9] http://www.research-live.com/news/news-headlines/respondent-engagement-and-survey-length-the-long-and-the-short-of-it/4002430.article


References
Abramson, J.J. and Abramson, Z.H. (1999). Survey Methods in Community Medicine: Epidemiological Research, Programme Evaluation, Clinical Trials (5th edition). London: Churchill Livingstone/Elsevier Health Sciences. ISBN 0-443-06163-7
Groves, R.M. (1989). Survey Errors and Survey Costs. Wiley. ISBN 0-471-61171-9
Ornstein, M.D. (1998). "Survey Research." Current Sociology 46(4): iii–136.
Shaughnessy, J. J., Zechmeister, E. B., & Zechmeister, J. S. (2006). Research Methods in Psychology (7th ed.). McGraw-Hill Higher Education. ISBN 0-07-111655-9 (pp. 143–192)
Adèr, H. J., Mellenbergh, G. J., & Hand, D. J. (2008). Advising on Research Methods: A Consultant's Companion. Huizen, The Netherlands: Johannes van Kessel Publishing.
Dillman, D.A. (1978) Mail and Telephone Surveys: The Total Design Method. New York: Wiley. ISBN 0-471-21555-4

Further reading
Andres, Lesley (2012). Designing and Doing Survey Research (http://www.uk.sagepub.com/books/Book234957?siteId=sage-uk&prodTypes=any&q=andres). London: Sage.
Leung, Wai-Ching (2001). "Conducting a Survey" (http://archive.student.bmj.com/back_issues/0601/education/187.html), in Student BMJ (British Medical Journal, Student Edition), May 2001.

External links
Surveys (http://www.dmoz.org/Science/Social_Sciences/Methodology/Survey/) at the Open Directory Project
Nonprofit Research Collection on the Use of Surveys in Nonprofit Research (http://www.issuelab.org/closeup/Jan_2009/), published on IssueLab


Sten scores
The results for some scales of some psychometric instruments are returned as sten scores, sten being an abbreviation for 'Standard Ten' and thus closely related to stanine scores.

Definition
A sten score indicates an individual's approximate position (as a range of values) with respect to the population of values and, therefore, to other people in that population. The individual sten scores are defined by reference to a standard normal distribution. Unlike stanine scores, which have a midpoint of five, sten scores have no middle score (the midpoint of the scale is the value 5.5). Like stanines, individual sten scores are demarcated by half standard deviations. Thus, a sten score of 5 includes all standard scores from -0.5 to zero and is centered at -0.25, and a sten score of 4 includes all standard scores from -1.0 to -0.5 and is centered at -0.75. A sten score of 1 includes all standard scores below -2.0. Sten scores of 6-10 "mirror" scores 5-1. The table below shows the standard scores that define stens and the percentage of individuals drawn from a normal distribution that would receive each sten score.

Standard/Z scores, percentages, and sten scores


Sten   Z-score range    Percent
1      below -2.0       2.3%
2      -2.0 to -1.5     4.4%
3      -1.5 to -1.0     9.2%
4      -1.0 to -0.5     15.0%
5      -0.5 to 0        19.2%
6      0 to +0.5        19.2%
7      +0.5 to +1.0     15.0%
8      +1.0 to +1.5     9.2%
9      +1.5 to +2.0     4.4%
10     above +2.0       2.3%

Sten scores (for the entire population of results) have a mean of 5.5 and a standard deviation of 2.[1]

Calculation of sten scores


When the score distribution is approximately normally distributed, sten scores can be calculated by a linear transformation: (1) the scores are first standardized; (2) then multiplied by the desired standard deviation of 2; and finally, (3) the desired mean of 5.5 is added. The resulting decimal value may be used as-is or rounded to an integer. For example, suppose that scale scores are found to have a mean of 23.5, a standard deviation of 4.2, and to be approximately normally distributed. Then sten scores for this scale can be calculated using the formula sten = 2 × (score − 23.5)/4.2 + 5.5. It is also usually necessary to truncate such scores to the 1-10 range, particularly if the scores are skewed.
An alternative method of calculation requires that the scale developer prepare a table to convert raw scores to sten scores by apportioning percentages according to the distribution shown in the table. For example, if the scale developer observes that raw scores 0-3 comprise 2% of the population, then these raw scores will be converted to a sten score of 1, and a raw score of 4 (and possibly 5, etc.) will be converted to a sten score of 2. This procedure is a non-linear transformation that will normalize the sten scores, and usually the resulting stens will only approximate the percentages shown in the table. The 16PF Questionnaire uses this scoring method.[2]
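The linear method described above can be written as a short routine. The mean of 23.5 and standard deviation of 4.2 are the example values used in the text; the rounding and truncation choices below are merely one reasonable option:

```python
# Sketch: linear raw-score-to-sten conversion with truncation to 1-10.
def sten(raw, mean=23.5, sd=4.2):
    z = (raw - mean) / sd              # 1. standardize
    s = z * 2 + 5.5                    # 2.-3. rescale to mean 5.5, sd 2
    return min(10, max(1, round(s)))   # round and truncate to the sten range

print([sten(x) for x in (15, 23.5, 26, 35)])   # e.g. [1, 6, 7, 10]
```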

References
[1] McNab, D. et al. Career Values Scale: Manual & Users' Guide, Psychometrics Publishing, 2005.
[2] Russell, M.T., & Karol, D. (2002). The 16PF Fifth Edition Administrator's Manual. Champaign, IL: Institute for Personality and Ability Testing.


Structural equation modeling


Structural equation modeling (SEM) is a statistical technique for testing and estimating causal relations using a combination of statistical data and qualitative causal assumptions. This definition of SEM was articulated by the geneticist Sewall Wright (1921),[1] the economist Trygve Haavelmo (1943) and the cognitive scientist Herbert A. Simon (1953),[2] and formally defined by Judea Pearl (2000) using a calculus of counterfactuals.
Structural equation models allow both confirmatory and exploratory modeling, meaning they are suited to both theory testing and theory development. Confirmatory modeling usually starts out with a hypothesis that gets represented in a causal model. The concepts used in the model must then be operationalized to allow testing of the relationships between the concepts in the model. The model is tested against the obtained measurement data to determine how well the model fits the data. The causal assumptions embedded in the model often have falsifiable implications which can be tested against the data.[3] With an initial theory, SEM can be used inductively by specifying a corresponding model and using data to estimate the values of free parameters. Often the initial hypothesis requires adjustment in light of model evidence. When SEM is used purely for exploration, this is usually in the context of exploratory factor analysis, as in psychometric design.
Among the strengths of SEM is the ability to construct latent variables: variables which are not measured directly, but are estimated in the model from several measured variables, each of which is predicted to 'tap into' the latent variables. This allows the modeler to explicitly capture the unreliability of measurement in the model, which in theory allows the structural relations between latent variables to be accurately estimated. Factor analysis, path analysis and regression all represent special cases of SEM.
In SEM, the qualitative causal assumptions are represented by the missing variables in each equation, as well as vanishing covariances among some error terms. These assumptions are testable in experimental studies and must be confirmed judgmentally in observational studies.

Equivalent models
In SEM, many models are equivalent in that they predict the same mean vector and covariance matrix. A "cleaned" model representation would be to model the mean and covariance matrix directly. That is, a "Clean Normal Model" (CNM) is a model with a function for every entry of the covariance matrix and mean. In terms of path diagrams, a CNM is the subset of SEMs that only have squares connected by double-headed edges. CNMs are not popular for at least two reasons:
CNMs are very difficult for human readers to interpret. Humans typically like to think of covariances as common sources or causations. Thinking in terms of structural models helps people reason about the data, even though it also entails the danger of over-interpreting regressions as causations. Note that this mere re-representation has had notable successes: the IQ index, for example, although by definition not existent in the real world, has arguably been one of the great successes in psychology, with considerable predictive power for many outcomes.
Structural models help us to integrate variables that we propose do exist, but have not (yet) been measured. We could, for example, build a model in which ion flow in certain brain cells is a latent variable and in that way make a prediction about the covariance of the ion flow with observable variables which, if someone invents a measurement instrument that allows the measurement of this flow in the specific region, can be falsified or confirmed.


Steps in performing SEM analysis


Model specification
When SEM is used as a confirmatory technique, the model must be specified correctly based on the type of analysis that the researcher is attempting to confirm. When building the correct model, the researcher uses two different kinds of variables, namely exogenous and endogenous variables. The distinction between these two types of variables is whether the variable regresses on another variable or not. As in regression, the dependent variable (DV) regresses on the independent variable (IV), meaning that the DV is being predicted by the IV. In SEM terminology, other variables regress on exogenous variables, but exogenous variables never regress on other variables. In a directed graph of the model, an exogenous variable is recognizable as any variable from which arrows only emanate, where the emanating arrows denote which variables that exogenous variable predicts. Any variable that regresses on another variable is defined to be an endogenous variable, even if other variables regress on it. In a directed graph, an endogenous variable is recognizable as any variable receiving an arrow. It is important to note that SEM is more general than regression. In particular, a variable can act as both independent and dependent variable. Two main components of models are distinguished in SEM: the structural model showing potential causal dependencies between endogenous and exogenous variables, and the measurement model showing the relations between latent variables and their indicators. Exploratory and Confirmatory factor analysis models, for example, contain only the measurement part, while path diagrams can be viewed as SEMs that contain only the structural part. In specifying pathways in a model, the modeler can posit two types of relationships: (1) free pathways, in which hypothesized causal (in fact counterfactual) relationships between variables are tested, and therefore are left 'free' to vary, and (2) relationships between variables that already have an estimated relationship, usually based on previous studies, which are 'fixed' in the model. A modeler will often specify a set of theoretically plausible models in order to assess whether the model proposed is the best of the set of possible models. Not only must the modeler account for the theoretical reasons for building the model as it is, but the modeler must also take into account the number of data points and the number of parameters that the model must estimate to identify the model. An identified model is a model where a specific parameter value uniquely identifies the model, and no other equivalent formulation can be given by a different parameter value. A data point is a variable with observed scores, like a variable containing the scores on a question or the number of times respondents buy a car. The parameter is the value of interest, which might be a regression coefficient between the exogenous and the endogenous variable or the factor loading (regression coefficient between an indicator and its factor). If there are fewer data points than the number of estimated parameters, the resulting model is "unidentified", since there are too few reference points to account for all the variance in the model. The solution is to constrain one of the paths to zero, which means that it is no longer part of the model.
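A worked example of the counting side of identification (a standard illustration, not taken from the article): with p observed variables, the sample covariance matrix supplies p(p+1)/2 distinct data points, which must be at least as large as the number of free parameters.

```latex
% With $p$ observed variables (covariances only, no mean structure):
\[
  \text{data points} \;=\; \frac{p(p+1)}{2},
  \qquad\text{e.g. } p = 4 \;\Rightarrow\; \frac{4 \cdot 5}{2} = 10 .
\]
% A one-factor model for these 4 indicators, with the factor variance fixed
% to 1, estimates 4 loadings and 4 error variances, i.e. 8 free parameters,
% leaving 10 - 8 = 2 degrees of freedom; the counting condition for
% identification is therefore met in this example.
```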

Estimation of free parameters


Parameter estimation is done by comparing the actual covariance matrices representing the relationships between variables and the estimated covariance matrices of the best fitting model. This is obtained through numerical maximization of a fit criterion as provided by maximum likelihood estimation, weighted least squares or asymptotically distribution-free methods. This is often accomplished by using a specialized SEM analysis program, of which several exist.
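One commonly used fit criterion of this kind, stated here for reference (it assumes multivariate normality and is not quoted from the article), is the maximum-likelihood discrepancy between the sample covariance matrix S and the model-implied covariance matrix Σ(θ), minimized over the free parameters θ:

```latex
\[
  F_{\mathrm{ML}}(\theta) \;=\;
  \log\lvert\Sigma(\theta)\rvert
  \;+\; \operatorname{tr}\!\bigl(S\,\Sigma(\theta)^{-1}\bigr)
  \;-\; \log\lvert S\rvert \;-\; p ,
\]
% where p is the number of observed variables; the criterion equals zero
% when the model reproduces the sample covariance matrix exactly.
```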


Assessment of model and model fit


Having estimated a model, analysts will want to interpret the model. Estimated paths may be tabulated and/or presented graphically as a path model. The impact of variables is assessed using path tracing rules (see path analysis).
It is important to examine the "fit" of an estimated model to determine how well it models the data. This is a basic task in SEM modeling: forming the basis for accepting or rejecting models and, more usually, accepting one competing model over another. The output of SEM programs includes matrices of the estimated relationships between variables in the model. Assessment of fit essentially calculates how similar the predicted data are to matrices containing the relationships in the actual data. Formal statistical tests and fit indices have been developed for these purposes. Individual parameters of the model can also be examined within the estimated model in order to see how well the proposed model fits the driving theory. Most, though not all, estimation methods make such tests of the model possible. Of course, as in all statistical hypothesis tests, SEM model tests are based on the assumption that the correct and complete relevant data have been modeled. In the SEM literature, discussion of fit has led to a variety of different recommendations on the precise application of the various fit indices and hypothesis tests.
There are differing approaches to assessing fit. Traditional approaches to modeling start from a null hypothesis, rewarding more parsimonious models (i.e. those with fewer free parameters); others, such as AIC, focus on how little the fitted values deviate from a saturated model[citation needed] (i.e. how well they reproduce the measured values), taking into account the number of free parameters used. Because different measures of fit capture different elements of the fit of the model, it is appropriate to report a selection of different fit measures. Some of the more commonly used measures of fit include:
Chi-squared: A fundamental measure of fit used in the calculation of many other fit measures. Conceptually it is a function of the sample size and the difference between the observed covariance matrix and the model covariance matrix.
Akaike information criterion (AIC): A measure of relative model fit; the preferred model is the one with the lowest AIC value. AIC = 2k - 2ln(L), where k is the number of parameters in the statistical model, and L is the maximized value of the likelihood of the model.
Root mean square error of approximation (RMSEA): Another measure of model fit. RMSEA values below .05 are considered to indicate good fit; an RMSEA of .1 or more is often taken to indicate poor fit.
Standardized root mean square residual (SRMR): A popular absolute fit indicator. A good model should have an SRMR smaller than .05.
Comparative fit index (CFI): In examining baseline comparisons, the CFI depends in large part on the average size of the correlations in the data. If the average correlation between variables is not high, then the CFI will not be very high. A CFI value of .90 or higher is desirable.
For each measure of fit, a decision as to what represents a good-enough fit between the model and the data must reflect other contextual factors such as sample size (for instance, very large samples make the chi-squared test overly sensitive[citation needed]), the ratio of indicators to factors, and the overall complexity of the model.
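Two of these quantities are commonly computed from the minimized discrepancy function; the standard formulas below are given for orientation and are not quoted from the article (N is the sample size, df the model degrees of freedom, and F_ML the minimized maximum-likelihood discrepancy):

```latex
\[
  \chi^2 \;=\; (N-1)\,F_{\mathrm{ML}}(\hat\theta),
  \qquad
  \mathrm{RMSEA} \;=\;
  \sqrt{\frac{\max\!\bigl(\chi^2 - df,\; 0\bigr)}{df\,(N-1)}} .
\]
```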


Model modification
The model may need to be modified in order to improve the fit, thereby estimating the most likely relationships between variables. Many programs provide modification indices which report the improvement in fit that results from adding an additional path to the model. Modifications that improve model fit are then flagged as potential changes that can be made to the model. In addition to improvements in model fit, it is important that the modifications also make theoretical sense.

Sample size and power


Where the proposed SEM is the basis for a research hypothesis, ad hoc rules of thumb requiring 10 observations per indicator as a lower bound for the adequacy of sample sizes have been widely used since their original articulation by Nunnally (1967).[4] Being linear in model constructs, these are easy to compute, but have been found to result in sample sizes that are too small. One study found that sample sizes in a particular stream of SEM literature averaged only 50% of the minimum measurements needed to draw the conclusions the studies claimed; overall, 80% of the research articles in the study drew conclusions from insufficient samples.
The complexities which increase information demands in structural model estimation grow with the number of potential combinations of latent variables, while the information supplied for estimation increases with the number of measured parameters times the number of observations in the sample; both relationships are non-linear. Sample size in SEM can be computed through two methods: the first as a function of the ratio of indicator variables to latent variables, and the second as a function of minimum effect, power and significance. Software and methods for computing both have been developed by Westland (2010).
The theory of power equivalence by von Oertzen (2010) formally describes equivalence classes of SEMs with equal power. This allows an analytic trade-off of design parameters of a SEM, including sample size, indicator reliability, or number of indicators, while keeping power constant.

Interpretation and communication


The set of models are then interpreted so that claims about the constructs can be made, based on the best fitting model. Caution should always be taken when making claims of causality even when experimentation or time-ordered studies have been done. The term causal model must be understood to mean: "a model that conveys causal assumptions," not necessarily a model that produces validated causal conclusions. Collecting data at multiple time points and using an experimental or quasi-experimental design can help rule out certain rival hypotheses but even a randomized experiment cannot rule out all such threats to causal inference. Good fit by a model consistent with one causal hypothesis invariably entails equally good fit by another model consistent with an opposing causal hypothesis. No research design, no matter how clever, can help distinguish such rival hypotheses, save for interventional experiments.[] As in any science, subsequent replication and perhaps modification will proceed from the initial finding.


Advanced uses
Invariance
Multiple group modelling: a technique allowing joint estimation of multiple models, each with different sub-groups. Applications include behavior genetics and the analysis of differences between groups (e.g., gender, cultures, test forms written in different languages, etc.).
Latent growth modeling
Hierarchical/multilevel models; item response theory models
Mixture model (latent class) SEM
Alternative estimation and testing techniques
Robust inference
Survey sampling analyses
Multi-method multi-trait models
Structural equation model trees

SEM-specific software
Open source software
R has several contributed packages dealing with SEM.
Commercial packages
AMOS in SPSS
Stata
SAS (software) procedures
Mplus[5]

References
[3] Bollen, K.A., and Long, S.J. (1993) Testing Structural Equation Models. SAGE Focus Edition, vol. 154. ISBN 0-8039-4507-8
[5] http://www.statmodel.com/

Further reading
Bagozzi, R.; Yi, Y. (2012) "Specification, evaluation, and interpretation of structural equation models". Journal of the Academy of Marketing Science, 40 (1), 8–34. doi:10.1007/s11747-011-0278-x (http://dx.doi.org/10.1007/s11747-011-0278-x)
Bartholomew, D.J., and Knott, M. (1999) Latent Variable Models and Factor Analysis. Kendall's Library of Statistics, vol. 7. Arnold Publishers, ISBN 0-340-69243-X
Bentler, P.M. & Bonett, D.G. (1980). "Significance tests and goodness of fit in the analysis of covariance structures". Psychological Bulletin, 88, 588–606.
Bollen, K.A. (1989). Structural Equations with Latent Variables. Wiley, ISBN 0-471-01171-1
Byrne, B. M. (2001) Structural Equation Modeling with AMOS - Basic Concepts, Applications, and Programming. LEA, ISBN 0-8058-4104-0
Goldberger, A. S. (1972). "Structural equation models in the social sciences". Econometrica 40, 979–1001.
Haavelmo, T. (1943) "The statistical implications of a system of simultaneous equations", Econometrica 11, 1–12. Reprinted in D.F. Hendry and M.S. Morgan (Eds.), The Foundations of Econometric Analysis, Cambridge University Press, 477–490, 1995.
Hair, Joe F., G. Tomas M. Hult, Christian M. Ringle, and Marko Sarstedt (2013). A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM). Thousand Oaks: Sage. http://www.sagepub.com/books/Book237345

Hoyle, R.H. (ed.) (1995) Structural Equation Modeling: Concepts, Issues, and Applications. SAGE, ISBN 0-8039-5318-6
Kaplan, D. (2000) Structural Equation Modeling: Foundations and Extensions. SAGE, Advanced Quantitative Techniques in the Social Sciences series, vol. 10, ISBN 0-7619-1407-2
Kline, R. B. (2010) Principles and Practice of Structural Equation Modeling (3rd Edition). The Guilford Press, ISBN 978-1-60623-877-6
Jöreskog, K.; F. Yang (1996). "Non-linear structural equation models: The Kenny-Judd model with interaction effects". In G. Marcoulides and R. Schumacker (eds.), Advanced Structural Equation Modeling: Concepts, Issues, and Applications. Thousand Oaks, CA: Sage Publications.


External links
Ed Rigdon's Structural Equation Modeling Page (http://www2.gsu.edu/~mkteer/): people, software and sites
Structural equation modeling page under David Garson's StatNotes, NCSU (http://www2.chass.ncsu.edu/garson/pa765/structur.htm)
Issues and Opinion on Structural Equation Modeling (http://disc-nt.cba.uh.edu/chin/ais/), SEM in IS research
The causal interpretation of structural equations (or SEM survival kit) by Judea Pearl, 2000 (http://bayes.cs.ucla.edu/BOOK-2K/jw.html)
Structural Equation Modeling Reference List by Jason Newsom (http://www.upa.pdx.edu/IOA/newsom/semrefs.htm): journal articles and book chapters on structural equation models
PLS-SEM book (http://www.pls-sem.com/): online resources and additional information
Path Analysis in AFNI (http://afni.nimh.nih.gov/sscc/gangc/PathAna.html): the open source (GPL) AFNI (http://afni.nimh.nih.gov) package contains SEM code
Handbook of Management Scales (http://en.wikibooks.org/wiki/Handbook_of_Management_Scales), a collection of previously used multi-item scales to measure constructs for SEM


Lewis Terman
Born: January 15, 1877, Johnson County, Indiana
Died: December 21, 1956 (aged 79), Palo Alto, California
Nationality: American
Fields: Psychology
Institutions: Stanford University; Los Angeles Normal School
Alma mater: Clark University; Indiana University Bloomington; Central Normal College

Lewis Madison Terman (January 15, 1877 – December 21, 1956) was an American psychologist, noted as a pioneer in educational psychology in the early 20th century at the Stanford University School of Education. He is best known as the inventor of the Stanford-Binet IQ test and the initiator of the longitudinal study of children with high IQs called the Genetic Studies of Genius.[1] He was a prominent eugenicist and was a member of the Human Betterment Foundation. He also served as president of the American Psychological Association.

Biography
Terman received a B.S., B.Pd. (Bachelor of Pedagogy), and B.A. from Central Normal College in 1894 and 1898, and a B.A. and M.A. from Indiana University Bloomington in 1903. He received his Ph.D. from Clark University in 1905. He worked as a school principal in San Bernardino, California in 1905, and as a professor at Los Angeles Normal School in 1907. In 1910 he joined the faculty of Stanford University as a professor of educational psychology at the invitation of Ellwood Patterson Cubberley and remained associated with the university until his death. He served as chairman of the psychology department from 1922 to 1945. Terman published the Stanford Revision of the Binet-Simon Scale in 1916, and revisions were released in 1937 and 1960.[2] Original work on the test had been completed by Alfred Binet and Théodore Simon of France. Terman promoted his test, known colloquially as the "Stanford-Binet" test, as an aid for the classification of developmentally disabled children. Revisions of the Stanford-Binet are still used today as a general intelligence test for adults and children. The fifth revision of the test is currently in use.

The first mass administration of IQ testing was done with 1.7 million soldiers during World War I, when Terman served in a psychological testing role with the United States military. Terman was able to work with other applied psychologists to categorize army recruits. The recruits were given group intelligence tests which took about an hour to administer. Testing options included Army Alpha, a text-based test, and Army Beta, a picture-based test for nonreaders; 25% could not complete the Alpha test.[3] The examiners scored the tests on a scale ranging from "A" through "E". Recruits who earned scores of "A" would be trained as officers, while those who earned scores of "D" and "E" would never receive officer training. The work of psychologists during the war proved to Americans that intelligence tests could have broader utility. After the war, Terman and his colleagues pressed for intelligence tests to be used in schools to improve the efficiency of growing American schools.
He also administered English-language tests to Spanish-speakers and unschooled African-Americans, concluding:
High-grade or border-line deficiency... is very, very common among Spanish-Indian and Mexican families of the Southwest and also among negroes. Their dullness seems to be racial, or at least inherent in the family stocks from which they come... Children of this group should be segregated into separate classes... They cannot master abstractions but they can often be made into efficient workers... from a eugenic point of view they constitute a grave problem because of their unusually prolific breeding (The Measurement of Intelligence, 1916, pp. 91-92).
Unlike Binet and Simon, whose goal was to identify less able school children in order to aid them with the care they required, Terman proposed using IQ tests to classify children and put them on the appropriate job-track. He believed IQ was inherited and was the strongest predictor of one's ultimate success in life. Terman adopted William Stern's suggestion that mental age/chronological age times 100 be made the intelligence quotient or IQ. (NB: Most modern IQ tests calculate the intelligence quotient differently.)
In 1921, Terman initiated the Genetic Studies of Genius, a long-term study of gifted children. He found that gifted children did not fit the existing stereotypes often associated with them: they were not weak and sickly social misfits, but in fact were generally taller, in better health, better developed physically, and better adapted socially than other children. The children included in his studies were colloquially referred to as "Termites".[4] Terman later joined the Human Betterment Foundation, a Pasadena-based eugenics group founded by E.S. Gosney in 1928 which had as part of its agenda the promotion and enforcement of compulsory sterilization laws in California.
Terman Middle School in Palo Alto, California is named after him and his son. His son Frederick Terman, as provost of Stanford University, greatly expanded the science, statistics and engineering departments, helping to catapult Stanford into the ranks of the world's first-class educational institutions as well as spurring the growth of Silicon Valley.


Thoughts and research on gifted children


Terman's study of genius and gifted children was a lifelong interest.[5] His fascination with the intelligence of children began early in his career, since he was familiar with Alfred Binet's research in this area.[6] Terman followed J. McKeen Cattell's work, which combined the ideas of Wilhelm Wundt and Francis Galton in holding that those who are intellectually superior will have better sensory acuity, strength of grip, sensitivity to pain, and memory for dictated consonants.[7] At Clark University, Terman wrote his doctoral dissertation, entitled Genius and stupidity: a study of some of the intellectual processes of seven "bright" and seven "stupid" boys. He administered Cattell's tests on boys who were considered intelligent versus boys who were considered unintelligent.[8] In 1915, he wrote a paper called "The mental hygiene of exceptional children".[9] He pointed out that though he believed the capacity for intelligence is inherited, those with exceptional intelligence also need exceptional schooling. Terman wrote that "[Bright children] are rarely given tasks which call forth their best ability, and as a result they run the risk of falling into lifelong habits of submaximum efficiency."[6] In other words, nature (heredity)

plays a large role in determining intelligence, but nurture (the environment) is also important in fostering the innate intellectual ability. By his own admission, there was nothing in his own ancestry that would have led anyone to predict that he would have an intellectual career.[10] With Binet's development of IQ tests, it became possible to quickly identify gifted children and study them from their early childhood into adulthood.[6] In his 1922 paper "A New Approach to the Study of Genius", Terman noted that this advancement in testing marked a change in research on geniuses and giftedness.[11] Previously, the research had looked at genius adults and tried to look in retrospect into their early years of childhood. Through these studies on gifted children, Terman hoped to find how to properly educate a gifted child as well as dispel the negative stereotypes that gifted children were conceited, freakish, socially eccentric, and [insane].[12] Terman found his answers in his longitudinal study on gifted children called Genetic Studies of Genius, which had five volumes.[13] The children in this study were called "Termites".[7] The volumes reviewed the follow-ups that Terman conducted throughout their lives. The fifth volume was a 35-year follow-up, and looked at the gifted group during mid-life.[14] The results from this study showed that gifted and genius children were actually in good health and had normal personalities. Few of them demonstrated the previously held negative stereotypes of gifted children. Most of those in the study did well socially and academically and had lower divorce rates later in life.[7] Additionally, those in the gifted group were generally successful in their careers and had received awards recognizing their achievements. Though many of the Termites reached their potential in adulthood, some of the children did not, perhaps because of personal obstacles, insufficient education, or lack of opportunity.[6] Terman died before he completed the fifth volume of Genetic Studies of Genius, but Melita Oden, a colleague, completed the volume and published it.[14] Terman wished for the study to continue after his death, so he selected Robert Richardson Sears, one of the many successful participants in the study as well as a colleague of his, to continue the work.[7] The study is still supported by Stanford University and will continue until the last of the Termites withdraws from the study or dies.


Publications
The Measurement of Intelligence (1916)
The Use of Intelligence Tests (1916)
The Stanford Achievement Test (1923)
Genetic Studies of Genius (1925, 1947, 1959)
Autobiography of Lewis Terman (1930)

Recognition
Stanford University has an endowed professorship in his honor.

References
[1] Sears, R. R. (1957). L. M. Terman, pioneer in mental measurement. Science, 125, 978-979. doi:10.1126/science.125.3255.978
[2] http://www.infoplease.com/ce6/people/A0848220.html
[3] Teigen, En psykologihistorie, page 235
[5] (Vialle, 1994)
[6] Bernreuter, R. G., Miles, C. C., Tinker, M. A., & Young, K. (1942). Studies in personality. New York, NY: McGraw-Hill Book Company.
[7] Seagoe, M. V. (1975). Terman and the gifted. Los Altos, CA: William Kaufmann.
[8] Terman, L. M. (1906). Genius and stupidity: a study of some of the intellectual processes of seven 'bright' and seven 'stupid' boys. Pedagogical Seminary, 13, 307-373.
[9] (Terman, 1915)
[10] Terman, L. M. (1932). Autobiography. In C. Murchison (Ed.), A history of psychology, Vol. II (pp. 297-332). Worcester, MA: Clark University Press.
[11] (Terman, 1922)

[12] Bernreuter, R. G., Miles, C. C., Tinker, M. A., & Young, K. (1942). Studies in personality. New York, NY: McGraw-Hill Book Company. p. 11.
[13] (Minton, 1988)
[14] (Terman, 1959)


Bibliography
Minton, H. L. (1988). Lewis M. Terman: pioneer in psychology testing. New York, NY: New York University Press.
Terman, L. M. (1915). The mental hygiene of exceptional children. Pedagogical Seminary, 22, 529-537.
Terman, L. M. (1922). A new approach to the study of genius. Psychological Review, 29(4), 310-318.
Terman, L. M. (Ed.). (1959). The gifted group at mid-life. Stanford, CA: Stanford University Press.
Vialle, W. (1994). 'Termanal' science? The work of Lewis Terman revisited. Roeper Review, 17(1), 32-38.
Human Intelligence: Lewis Madison Terman (http://www.indiana.edu/~intell/terman.shtml)
Autobiography of Lewis M. Terman (http://psychclassics.asu.edu/Terman/murchison.htm). First published in Murchison, Carl (Ed.) (1930). History of Psychology in Autobiography (Vol. 2, pp. 297-331). Republished by permission of Clark University Press, Worcester, MA.
Memorial Resolution: Lewis Madison Terman (http://histsoc.stanford.edu/pdfmem/TermanL.pdf) via Stanford University
Shurkin, Joel (1992). Terman's Kids: The Groundbreaking Study of How the Gifted Grow Up. Boston, MA: Little, Brown. ISBN 978-0-316-78890-8. Lay summary (http://articles.latimes.com/1992-05-31/books/bk-1247_1_lewis-terman/2) (28 June 2010).

External links
Works by Lewis Madison Terman (http://www.gutenberg.org/author/Lewis_Madison_Terman) at Project Gutenberg
Lewis M. Terman, "The Great Conspiracy or the Impulse Imperious of Intelligence Testers, Psychoanalyzed and Exposed by Mr. Lippmann", New Republic 33 (December 27, 1922): 116-120. (http://historymatters.gmu.edu/d/4960)
"Psychological Predictors of Long Life: An 80-year study discovers traits that help people to live longer." (http://www.psychologytoday.com/blog/looking-in-the-cultural-mirror/201206/psychological-predictors-long-life). Psychology Today. June 5, 2012.
Educational offices: 32nd President of the American Psychological Association, 1923-1924. Preceded by Knight Dunlap; succeeded by Granville Stanley Hall.



Test (assessment)
A test or examination is an assessment intended to measure a test-taker's knowledge, skill, aptitude, physical fitness, or classification in many other topics (e.g., beliefs). A test may be administered orally, on paper, on a computer, or in a confined area that requires a test taker to physically perform a set of skills. Tests vary in style, rigor and requirements. For example, in a closed book test, a test taker is often required to rely upon memory to respond to specific items, whereas in an open book test, a test taker may use one or more supplementary tools such as a reference book or calculator when responding to an item. A test may be administered formally or informally. An example of an informal test would be a reading test administered by a parent to a child. An example of a formal test would be a final examination administered by a teacher in a classroom or an I.Q. test administered by a psychologist in a clinic. Formal testing often results in a grade or a test score.[1] A test score may be interpreted with regard to a norm or criterion, or occasionally both. The norm may be established independently, or by statistical analysis of a large number of participants.

Students take exams in Mahatma Gandhi Seva Ashram, Jaura, India.

Cambodian students taking an exam in order to apply for the Don Bosco Technical School of Sihanoukville in 2008.

A standardized test is any test that is administered and scored in a consistent manner to ensure legal defensibility.[2] Standardized tests are often used in education, professional certification, psychology (e.g., MMPI), the military, and many other fields. A non-standardized test is usually flexible in scope and format, variable in difficulty and significance. Since these tests are usually developed by individual instructors, the format and difficulty of these tests may not be widely adopted or used by other instructors or institutions. A non-standardized test may be used to determine the proficiency level of students, to motivate students to study, and to provide feedback to students. In some instances, a



teacher may develop non-standardized tests that resemble standardized tests in scope, format, and difficulty for the purpose of preparing their students for an upcoming standardized test.[] Finally, the frequency and setting by which non-standardized tests are administered are highly variable and are usually constrained by the duration of the class period. A class instructor may, for example, administer a test on a weekly basis or just twice a semester. Depending on the policy of the instructor or institution, each test may last from only five minutes to an entire class period.

American students in a computer fundamentals class taking a computer-based test

In contrast to non-standardized tests, standardized tests are widely used, fixed in terms of scope, difficulty and format, and are usually significant in consequences. Standardized tests are usually held on fixed dates as determined by the test developer, educational institution, or governing body, and may or may not be administered by the instructor, held within the classroom, or constrained by the classroom period. Although there is little variability between different copies of the same type of standardized test (e.g., SAT or GRE), there is variability between different types of standardized tests. Any test with important consequences for the individual test taker is referred to as a high-stakes test. A test may be developed and administered by an instructor, a clinician, a governing body, or a test provider. In some instances, the developer of the test may not be directly responsible for its administration. For example, Educational Testing Service (ETS), a nonprofit educational testing and assessment organization, develops standardized tests such as the SAT but may not directly be involved in the administration or proctoring of these tests. As with the development and administration of educational tests, the format and level of difficulty of the tests themselves are highly variable and there is no general consensus or invariable standard for test formats and difficulty. Often, the format and difficulty of the test is dependent upon the educational philosophy of the instructor, subject matter, class size, policy of the educational institution, and requirements of accreditation or governing bodies. In general, tests developed and administered by individual instructors are non-standardized whereas tests developed by testing organizations are standardized.

History
Ancient China was the first country in the world to implement a nationwide standardized test, which was called the imperial examination. The main purpose of this examination was to select able candidates for specific governmental positions.[3] The imperial examination was established by the Sui Dynasty in 605 AD and was abolished by the Qing Dynasty 1,300 years later, in 1905. England adopted this examination system in 1806 to select specific candidates for positions in Her Majesty's Civil Service. This examination system was later applied to education and it started to

Students taking a scholarship examination inside a classroom in 1940

influence other parts of the world as it became a prominent standard (e.g. regulations to prevent the markers from knowing the identity of candidates) of delivering standardized tests.

Influence of World Wars on Testing

Both World War I and World War II made many people realize the necessity of standardized testing and the benefits associated with these tests. One main reason people saw the benefits was the Army Alpha and Army Beta tests, which were used during WWI to assess human abilities. Alongside the Army Alpha, the Stanford-Binet Intelligence Scale "added momentum to the testing movement."[4] Soon after, colleges and industry began using tests to help in accepting and hiring people based on test performance. Another reason more tests began to appear was that people were realizing that the distance between secondary education and higher education was widening after WWII. In 1952, the first Advanced Placement (AP) test was administered to begin closing the gap between high schools and colleges.[5]


Modern day use of tests


Education
Some countries such as the United Kingdom and France require all their secondary school students to take a standardized test on individual subjects, such as the General Certificate of Secondary Education (GCSE) (in England) and the Baccalauréat respectively, as a requirement for graduation.[6] These tests are used primarily to assess a student's proficiency in specific subjects such as mathematics, science, or literature. In contrast, high school students in other countries such as the United States may not be required to take a standardized test to graduate. Moreover, students in these countries usually take standardized tests only to apply for a position in a university program and are typically given the option of taking different standardized tests such as the ACT or SAT, which are used primarily to measure a student's reasoning skill.[7][8] High school students in the United States may also take Advanced Placement tests on specific subjects to fulfill university-level credit. Depending on the policies of the test maker or country, administration of standardized tests may be done in a large hall, classroom, or testing center. A proctor or invigilator may also be present during the testing period to provide instructions, to answer questions, or to prevent cheating. Grades or test scores from standardized tests may also be used by universities to determine if a student applicant should be admitted into one of its academic or professional programs. For example, universities in the United Kingdom admit applicants into their undergraduate programs based primarily or solely on an applicant's grades on pre-university qualifications such as the GCE A-levels or Cambridge Pre-U.[][] In contrast, universities in the United States use an applicant's test score on the SAT or ACT as just one of their many admission criteria to determine if an applicant should be admitted into one of its undergraduate programs. The other criteria in this case may include the applicant's grades from high school, extracurricular activities, personal statement, and letters of recommendation.[] Once admitted, undergraduate students in the United Kingdom or United States may be required by their respective programs to take a comprehensive examination as a requirement for passing their courses or for graduating from their respective programs. Standardized tests are sometimes used by certain countries to manage the quality of their educational institutions. For example, the No Child Left Behind Act in the United States requires individual states to develop assessments for students in certain grades. In practice, these assessments typically appear in the form of standardized tests. Test scores of students in specific grades of an educational institution are then used to determine the status of that educational institution, i.e., whether it should be allowed to continue to operate in the same way or to receive funding. Finally, standardized tests are sometimes used to compare proficiencies of students from different institutions or countries. For example, the Organisation for Economic Co-operation and Development (OECD) uses the Programme for International Student Assessment (PISA) to evaluate certain skills and knowledge of students from different participating countries.[]



Licensing and certification


Standardized tests are sometimes used by certain governing bodies to determine if a test taker is allowed to practice a profession, to use a specific job title, or to claim competency in a specific set of skills. For example, a test taker who intends to become a lawyer is usually required by a governing body, such as a governmental bar licensing agency, to pass a bar exam.

Immigration and naturalization


Standardized tests are also used in certain countries to regulate immigration. For example, intended immigrants to Australia are legally required to pass a citizenship test as part of that country's naturalization process.[]

Competitions
Tests are sometimes used as a tool to select participants who have the potential to succeed in a competition such as a sporting event. For example, serious skaters who wish to participate in figure skating competitions in the United States must pass official U.S. Figure Skating tests just to qualify.[]

Group memberships
Tests are sometimes used by a group to select certain types of individuals to join the group. For example, Mensa International is a high I.Q. society that requires individuals to score at the 98th percentile or higher on a standardized, supervised IQ test.[]

Types of tests
Written tests
Written tests are tests that are administered on paper or on a computer. A test taker who takes a written test could respond to specific items by writing or typing within a given space of the test or on a separate form or document. In some tests where knowledge of many constants or technical terms is required to answer questions effectively, such as in chemistry or biology, the test developer may allow every test taker to bring a cheat sheet.

Indonesian students taking a written test

A test developer's choice of which style or format to use when developing a written test is usually arbitrary given that there is no single invariant standard for testing. Be that as it may, certain test styles and formats have become more widely used than others. Below is a list of those formats of test items that are widely used by educators and test developers to construct paper or computer-based tests. These tests may consist of only one type of test item format (e.g., multiple choice test, essay test) or may have a combination of different test item formats (e.g., a test that has multiple choice and essay items).

Multiple choice

In a test that has items formatted as multiple choice questions, a candidate would be given a number of set answers for each question, and the candidate must choose which answer or group of answers is correct. There are two families of multiple choice questions.[] The first family is known as the True/False question and it requires a test taker to choose all answers that are appropriate. The second family is known as the One-Best-Answer question and it requires a test taker to select only one answer from a list.

There are several reasons for using multiple choice questions in tests. In terms of administration, multiple choice questions usually require less time for test takers to answer, are easy to score and grade, provide greater coverage of material, allow for a wide range of difficulty, and can easily diagnose a test taker's difficulty with certain concepts.[] As an educational tool, multiple choice items test many levels of learning as well as a test taker's ability to integrate information, and they provide feedback to the test taker about why distractors were wrong and why correct answers were right. Nevertheless, there are difficulties associated with the use of multiple choice questions. In administrative terms, multiple choice items that are effective usually take a great deal of time to construct.[] As an educational tool, multiple choice items do not allow test takers to demonstrate knowledge beyond the choices provided and may even encourage guessing or approximation due to the presence of at least one correct answer. For instance, a test taker might not work out the exact result explicitly, but knowing the approximate size of the answer, they would choose the option closest to 48. Moreover, test takers may misinterpret these items and, in the process, perceive them to be tricky or picky. Finally, multiple choice items do not test a test taker's attitudes towards learning because correct responses can be easily faked.

Alternative response

True/False questions present candidates with a binary choice: a statement is either true or false. This method presents problems, as depending on the number of questions, a significant number of candidates could get 100% just by guesswork, and should on average get 50% (see the probability sketch after the discussion of essay items below).

Matching type

A matching item is an item that provides a defined term and requires a test taker to match identifying characteristics to the correct term.[]

Completion type

A fill-in-the-blank item provides a test taker with identifying characteristics and requires the test taker to recall the correct term.[] There are two types of fill-in-the-blank tests. The easier version provides a word bank of possible words that will fill in the blanks. For some exams, all words in the word bank are used exactly once. If a teacher wanted to create a test of medium difficulty, they would provide a test with a word bank, but some words may be used more than once and others not at all. The hardest variety of such a test is a fill-in-the-blank test in which no word bank is provided at all. This generally requires a higher level of understanding and memory than a multiple choice test. Because of this, fill-in-the-blank tests [with no word bank] are often feared by students.

Essay

Items such as short answer or essay typically require a test taker to write a response to fulfill the requirements of the item. In administrative terms, essay items take less time to construct.[] As an assessment tool, essay items can test complex learning objectives as well as the processes used to answer the question. The items can also provide a more realistic and generalizable task for a test. Finally, these items make it difficult for test takers to guess the correct answers and require test takers to demonstrate their writing skills as well as correct spelling and grammar. The difficulties with essay items are primarily administrative. For one, these items take more time for test takers to answer.[] When these questions are answered, the answers themselves are usually poorly written because test takers may not have time to organize and proofread their answers.
In turn, it takes more time to score or grade these items. When these items are being scored or graded, the grading process itself becomes subjective as non-test related information may influence the process. Thus, considerable effort is required to minimize the subjectivity of the grading process. Finally, as an assessment tool, essay questions may potentially be unreliable in assessing the entire content of a subject matter.
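To make the guesswork concern about true/false items concrete, the chance of reaching a given score purely by guessing follows a binomial distribution. The sketch below is only illustrative; the 70% pass mark and the test lengths are arbitrary assumptions rather than figures from any particular testing programme.

```python
from math import comb

def prob_at_least(n_items: int, k_correct: int, p: float = 0.5) -> float:
    """Probability of getting at least k_correct of n_items right by guessing,
    treating each guess as an independent trial with success rate p."""
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(k_correct, n_items + 1))

# A short 10-item true/false quiz with a 70% pass mark is fairly easy to pass
# by guessing alone; a 50-item quiz with the same pass mark is not.
print(prob_at_least(10, 7))   # ~0.17
print(prob_at_least(50, 35))  # ~0.003
```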


Mathematical questions

Most mathematics questions, or calculation questions from subjects such as chemistry, physics or economics, employ a style which does not fall into any of the above categories, although some papers, notably the Maths Challenge papers in the United Kingdom, employ multiple choice. Instead, most mathematics questions state a mathematical problem or exercise that requires a student to write a freehand response. Marks are given more for the steps taken than for the correct answer. If the question has multiple parts, later parts may use answers from previous sections, and marks may be granted if an earlier incorrect answer was used but the correct method was followed, and an answer which is correct (given the incorrect input) is returned. Higher level mathematical papers may include variations on true/false, where the candidate is given a statement and asked to verify its validity by direct proof or by stating a counterexample.


Physical fitness tests


A physical fitness test is a test designed to measure physical strength, agility, and endurance. Such tests are commonly employed in educational institutions as part of the physical education curriculum, in medicine as part of diagnostic testing, and as eligibility requirements in fields that focus on physical ability such as the military or police. Throughout the 20th century, scientific evidence emerged demonstrating the usefulness of strength training and aerobic exercise in maintaining overall health, and more agencies began to incorporate standardized fitness testing. In the United States, the President's Council on Youth Fitness was established in 1956 as a way to encourage and monitor fitness in schoolchildren. Common tests[9][10][11] include timed running or the multi-stage fitness test, and the numbers of push-ups, sit-ups/abdominal crunches and pull-ups that the individual can perform. More specialised tests may be used to test the ability to perform a particular job or role.

A Minnesota National Guardsman performs pushups during a physical fitness test.

Performance tests
A performance test is an assessment that requires an examinee to actually perform a task or activity, rather than simply answering questions referring to specific parts. The purpose is to ensure greater fidelity to what is being tested. An example is a behind-the-wheel driving test to obtain a driver's license. Rather than only answering simple multiple-choice items regarding the driving of an automobile, a student is required to actually drive one while being evaluated. Performance tests are commonly used in workplace and professional applications, such as professional certification and licensure. When used for personnel selection, the tests might be referred to as a work sample. A licensure example would be cosmetologists being required to demonstrate a haircut or manicure on a live person. The Group-Bourdon test is one of a number of psychometric tests which trainee train drivers in the UK are required to pass.[12] Some performance tests are simulations. For instance, the assessment to become certified as an ophthalmic technician includes two components, a multiple-choice examination and a computerized skill simulation. The

examinee must demonstrate the ability to complete seven tasks commonly performed on the job, such as retinoscopy, that are simulated on a computer.


Test preparations
From the perspective of a test developer, there is great variability with respect to the time and effort needed to prepare a test. Likewise, from the perspective of a test taker, there is also great variability with respect to the time and effort needed to obtain a desired grade or score on any given test. When a test developer constructs a test, the amount of time and effort is dependent upon the significance of the test itself, the proficiency of the test taker, the format of the test, class size, the deadline of the test, and the experience of the test developer. The process of test construction has been greatly aided in several ways. For one, many test developers were themselves students at one time, and therefore are able to modify or outright adopt test questions from their previous tests. In some countries such as the United States, book publishers often provide teaching packages that include test banks to university instructors who adopt their published books for their courses. These test banks may contain up to four thousand sample test questions that have been peer-reviewed and time tested.[13] The instructor who chooses to use such a test bank would only have to select a fixed number of test questions from it to construct a test. As with test construction, the time needed for a test taker to prepare for a test is dependent upon the frequency of the test, the test developer, and the significance of the test. In general, nonstandardized tests that are short, frequent, and do not constitute a major portion of the test taker's overall course grade or score do not require the test taker to spend great amounts of time preparing.[] Conversely, nonstandardized tests that are long, infrequent, and do constitute a major portion of the test taker's overall course grade or score usually require the test taker to spend great amounts of time preparing. To prepare for a nonstandardized test, test takers may rely upon their reference books, class or lecture notes, the Internet, and past experience. Test takers may also use various learning aids to study for tests, such as flash cards and mnemonics.[14] Test takers may even hire tutors to coach them through the process so that they may increase the probability of obtaining a desired test grade or score. Finally, test takers may rely upon past copies of a test from previous years or semesters to study for a future test. These past tests may be provided by a friend or a group that has copies of previous tests, or by instructors and their institutions.[15] Unlike nonstandardized tests, the time needed by test takers to prepare for standardized tests is less variable and usually considerable. This is because standardized tests are usually uniform in scope, format, and difficulty and often have important consequences with respect to a test taker's future, such as the test taker's eligibility to attend a specific university program or to enter a desired profession. It is not unusual for test takers to prepare for standardized tests by relying upon commercially available books that provide in-depth coverage of the standardized test or compilations of previous tests (e.g., 10 year series in Singapore). In many countries, test takers even enroll in test preparation centers or cram schools that provide extensive or supplementary instruction to help them better prepare for a standardized test.
Finally, in some countries, instructors and their institutions have also played a significant role in preparing test takers for a standardized test.

Cheating on tests
Cheating on a test is the process of using unauthorized means or methods for the purpose of obtaining a desired test score or grade. This may range from using notes during a closed book examination or copying another test taker's answers during an individual test, to sending a paid proxy to take the test. Several common methods have been employed to combat cheating. They include the use of multiple proctors or invigilators during a testing period to monitor test takers. Test developers may construct multiple variants of the same test to be administered to different test takers at the same time. In some cases, instructors themselves may not administer their own tests but will leave the task to other instructors or invigilators, which may mean that the invigilators do not know the candidates, and thus some form of identification may be required. Finally, instructors or

test providers may compare the answers of suspected cheaters on the test themselves to determine if cheating did occur.


Support and criticisms of tests


Despite their widespread use, the validity, quality, or use of tests, particularly standardized tests in education, have continued to be widely supported or criticized. Like the tests themselves, support and criticism of tests are often varied and may come from a variety of sources such as parents, test takers, instructors, business groups, universities, or governmental watchdogs.

Supporters of standardized tests in education often provide the following reasons for promoting testing in education:
Feedback or diagnosis of test taker's performance[]
Fair and efficient[]
Promotes accountability[][]
Prediction and selection[]
Improves performance[]

Critics of standardized tests in education often provide the following reasons for revising or removing standardized tests in education:
Narrows curricular format and encourages teaching to the test[]
Poor predictive quality[16]
Grade inflation of test scores or grades[17][18][19]
Culturally or socioeconomically biased[20]

References
[1] Thissen, D., & Wainer, H. (2001). Test Scoring. Mahwah, NJ: Erlbaum. Page 1, sentence 1.
[2] North Central Regional Educational Laboratory, NCREL.org (http://www.ncrel.org/sdrs/areas/issues/students/earlycld/ea5lk3.htm)
[3] Advanced Level Examination, Chinese Language and Culture, Paper 1A
[4] Kaplan, R. M., & Saccuzzo, D. P. (2009). Psychological Testing. Belmont, CA: Wadsworth
[5] http://www.collegeboard.com/prod_downloads/about/news_info/ap/ap_history_english.pdf
[8] Name changed in 1996.

Further reading
Airasian, P. (1994). Classroom Assessment, Second Edition. NY: McGraw-Hill.
Cangelosi, J. (1990). Designing Tests for Evaluating Student Achievement. NY: Addison-Wesley.
Gronlund, N. (1993). How to Make Achievement Tests and Assessments, 5th edition. NY: Allyn and Bacon.
Haladyna, T. M. & Downing, S. M. (1989). Validity of a Taxonomy of Multiple-Choice Item-Writing Rules. Applied Measurement in Education, 2(1), 51-78.
Monahan, T. (1998). The Rise of Standardized Educational Testing in the U.S.: A Bibliographic Overview (http://torinmonahan.com/papers/testing.pdf).
Ravitch, Diane. The Uses and Misuses of Tests (http://www.dianeravitch.com/uses_and_misuses.pdf), in The Schools We Deserve (New York: Basic Books, 1985), pp. 172-181.
Wilson, N. (1997). Educational standards and the problem of error. Education Policy Analysis Archives, Vol 6 No 10 (http://epaa.asu.edu/epaa/v6n10/)



External links
"About the Joint Committee on Testing Practices" (http://www.apa.org/science/programs/testing/committee. aspx). http://www.apa.org: American Psychological Association. Retrieved 2 Aug 2011. "The Joint Committee on Testing Practices (JCTP) was established in 1985 by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). In 2007 the JCTP disbanded, but JCTP publications are still available and may be obtained by contacting any of the groups listed in the product descriptions shown below."

Test score
A test score is a piece of information, usually a number, that conveys the performance of an examinee on a test. One formal definition is that it is "a summary of the evidence contained in an examinee's responses to the items of a test that are related to the construct or constructs being measured."[1] Test scores are interpreted with a norm-referenced or criterion-referenced interpretation, or occasionally both. A norm-referenced interpretation means that the score conveys meaning about the examinee with regard to their standing among other examinees. A criterion-referenced interpretation means that the score conveys information about the examinee with regard to a specific subject matter, regardless of other examinees' scores.[2]
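As a minimal sketch of the two interpretations (the scores, the ten-person norm group, and the 60% cut-off below are invented for the example, and percentile-rank conventions vary in how ties are handled):

```python
def percentile_rank(score, norm_group):
    """Norm-referenced view: the percentage of the norm group scoring below this examinee."""
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)

def meets_criterion(score, max_score, cutoff=0.60):
    """Criterion-referenced view: whether the examinee reaches a fixed proportion
    of the content domain, regardless of how anyone else performed."""
    return score / max_score >= cutoff

norm_group = [42, 55, 61, 48, 70, 66, 59, 73, 50, 64]
print(percentile_rank(61, norm_group))     # 50.0: standing relative to other examinees
print(meets_criterion(61, max_score=100))  # True: standing relative to the material itself
```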

Types of test scores


There are two types of test scores: raw scores and scaled scores. A raw score is a score without any sort of adjustment or transformation, such as the simple number of questions answered correctly. A scaled score is the result of some transformation applied to the raw score. The purpose of scaled scores is to report scores for all examinees on a consistent scale. Suppose that a test has two forms, and one is more difficult than the other. It has been determined by equating that a score of 65% on form 1 is equivalent to a score of 68% on form 2. Scores on both forms can be converted to a scale so that these two equivalent scores have the same reported score. For example, they could both be a score of 350 on a scale of 100 to 500. Two well-known tests in the United States that have scaled scores are the ACT and the SAT. The ACT's scale ranges from 0 to 36 and the SAT's from 200 to 800 (per section). Ostensibly, these two scales were selected to represent a mean and standard deviation of 18 and 6 (ACT), and 500 and 100 (SAT). The upper and lower bounds were selected because an interval of plus or minus three standard deviations contains more than 99% of a population. Scores outside that range are difficult to measure, and return little practical value. Note that scaling does not affect the psychometric properties of a test; it is something that occurs after the assessment process (and equating, if present) is completed. Therefore, it is not a psychometric issue, but a public relations issue.
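The conversion to a reporting scale described above can be sketched as a simple linear transformation. The sketch below is a simplification: operational programs such as the SAT and ACT derive their conversions from equating rather than from a single formula, and the raw-score mean and standard deviation used here are invented.

```python
def scale_score(raw, raw_mean, raw_sd, target_mean=500, target_sd=100,
                lower=200, upper=800):
    """Linearly map a raw score onto a reporting scale with the chosen mean and
    standard deviation, then clip the result to the scale's bounds."""
    z = (raw - raw_mean) / raw_sd
    scaled = target_mean + target_sd * z
    return int(max(lower, min(upper, round(scaled))))

# A raw score one standard deviation above the raw-score mean lands at 600
# on a 200-800 scale centred on 500 with a standard deviation of 100.
print(scale_score(raw=42, raw_mean=35, raw_sd=7))  # 600
```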

References
[1] Thissen, D., & Wainer, H. (2001). Test Scoring. Mahwah, NJ: Erlbaum. Page 1, sentence 1.
[2] Iowa Testing Programs guide for interpreting test scores (http://www.education.uiowa.edu/itp/itbs/itbs_interp_score.htm)



Theory of conjoint measurement


The theory of conjoint measurement (also known as conjoint measurement or additive conjoint measurement) is a general, formal theory of continuous quantity. It was independently discovered by the French economist Gérard Debreu (1960) and by the American mathematical psychologist R. Duncan Luce and statistician John Tukey (Luce & Tukey 1964). The theory concerns the situation where at least two natural attributes, A and X, non-interactively relate to a third attribute, P. It is not required that A, X or P are known to be quantities. Via specific relations between the levels of P, it can be established that P, A and X are continuous quantities. Hence the theory of conjoint measurement can be used to quantify attributes in empirical circumstances where it is not possible to combine the levels of the attributes using a side-by-side operation or concatenation. The quantification of psychological attributes such as attitudes, cognitive abilities and utility is therefore logically plausible. This means that the scientific measurement of psychological attributes is possible. That is, like physical quantities, a magnitude of a psychological quantity may possibly be expressed as the product of a real number and a unit magnitude. Application of the theory of conjoint measurement in psychology, however, has been limited. It has been argued that this is due to the high level of formal mathematics involved (e.g., Cliff 1992) and that the theory cannot account for the "noisy" data typically discovered in psychological research (e.g., Perline, Wright & Wainer 1979). It has been argued that the Rasch model is a stochastic variant of the theory of conjoint measurement (e.g., Brogden 1977; Embretson & Reise 2000; Fischer 1995; Keats 1967; Kline 1998; Scheiblechner 1999); however, this has been disputed (e.g., Karabatsos, 2001; Kyngdon, 2008). Order restricted methods for conducting probabilistic tests of the cancellation axioms of conjoint measurement have been developed in the past decade (e.g., Karabatsos, 2001; Davis-Stober, 2009). The theory of conjoint measurement is different from, but related to, conjoint analysis, which is a statistical-experiments methodology employed in marketing to estimate the parameters of additive utility functions. Different multi-attribute stimuli are presented to respondents, and different methods are used to measure their preferences about the presented stimuli. The coefficients of the utility function are estimated using alternative regression-based tools.

Historical overview
In the 1930s, the British Association for the Advancement of Science established the Ferguson Committee to investigate the possibility of psychological attributes being measured scientifically. The British physicist and measurement theorist Norman Robert Campbell was an influential member of the committee. In its Final Report (Ferguson, et al., 1940), Campbell and the Committee concluded that because psychological attributes were not capable of sustaining concatenation operations, such attributes could not be continuous quantities. Therefore, they could not be measured scientifically. This had important ramifications for psychology, the most significant of these being the creation in 1946 of the operational theory of measurement by Harvard psychologist Stanley Smith Stevens. Stevens' non-scientific theory of measurement is widely held as definitive in psychology and the behavioural sciences generally (Michell 1999). Whilst the German mathematician Otto Hölder (1901) anticipated features of the theory of conjoint measurement, it was not until the publication of Luce & Tukey's seminal 1964 paper that the theory received its first complete exposition. Luce & Tukey's presentation was algebraic and is therefore considered more general than Debreu's (1960) topological work, the latter being a special case of the former (Luce & Suppes 2002). In the first article of the inaugural issue of the Journal of Mathematical Psychology, Luce & Tukey 1964 proved that via the theory of conjoint measurement, attributes not capable of concatenation could be quantified. N.R. Campbell and the Ferguson Committee were thus proven wrong. That a given psychological attribute is a continuous quantity is a logically coherent and empirically testable hypothesis.

Theory of conjoint measurement Appearing in the next issue of the same journal were important papers by Dana Scott (1964), who proposed a hierarchy of cancellation conditions for the indirect testing of the solvability and Archimedean axioms, and David Krantz (1964) who connected the Luce & Tukey work to that of Hlder (1901). Work soon focused on extending the theory of conjoint measurement to involve more than just two attributes. Krantz 1968 and Amos Tversky (1967) developed what became known as polynomial conjoint measurement, with Krantz 1968 providing a schema with which to construct conjoint measurement structures of three or more attributes. Later, the theory of conjoint measurement (in its two variable, polynomial and n-component forms) received a thorough and highly technical treatment with the publication of the first volume of Foundations of Measurement, which Krantz, Luce, Tversky and philosopher Patrick Suppes cowrote (Krantz et al. 1971). Shortly after the publication of Krantz, et al., (1971), work focused upon developing an "error theory" for the theory of conjoint measurement. Studies were conducted into the number of conjoint arrays that supported only single cancellation and both single and double cancellation (Arbuckle & Larimer 1976; McClelland 1977). Later enumeration studies focused on polynomial conjoint measurement (Karabatsos & Ullrich 2002; Ullrich & Wilson 1993). These studies found that it is highly unlikely that the axioms of the theory of conjoint measurement are satisfied at random, provided that more than three levels of at least one of the component attributes has been identified. Joel Michell (1988) later identified that the "no test" class of tests of the double cancellation axiom was empty. Any instance of double cancellation is thus either an acceptance or a rejection of the axiom. Michell also wrote at this time a non-technical introduction to the theory of conjoint measurement (Michell 1990) which also contained a schema for deriving higher order cancellation conditions based upon Scott's (1964) work. Using Michell's schema, Ben Richards (Kyngdon & Richards, 2007) discovered that some instances of the triple cancellation axiom are "incoherent" as they contradict the single cancellation axiom. Moreover, he identified many instances of the triple cancellation which are trivially true if double cancellation is supported. The axioms of the theory of conjoint measurement are not stochastic; and given the ordinal constraints placed on data by the cancellation axioms, order restricted inference methodology must be used (Iverson & Falmagne 1985). George Karabatsos and his associates (Karabatsos, 2001; Karabatsos & Sheu 2004) developed a Bayesian Markov chain Monte Carlo methodology for psychometric applications. Karabatsos & Ullrich 2002 demonstrated how this framework could be extended to polynomial conjoint structures. Karabatsos (2005) generalised this work with his multinomial Dirichlet framework, which enabled the probabilistic testing of many non-stochastic theories of mathematical psychology. More recently, Clintin Davis-Stober (2009) developed a frequentist framework for order restricted inference that can also be used to test the cancellation axioms. Perhaps the most notable (Kyngdon, 2011) use of the theory of conjoint measurement was in the prospect theory proposed by the Israeli - American psychologists Daniel Kahneman and Amos Tversky (Kahneman & Tversky, 1979). 
Prospect theory was a theory of decision making under risk and uncertainty which accounted for choice behaviour such as the Allais Paradox. David Krantz wrote the formal proof to prospect theory using the theory of conjoint measurement. In 2002, Kahneman received the Nobel Memorial Prize in Economics for prospect theory (Birnbaum, 2008).


Measurement and quantification


The classical / standard definition of measurement
In physics and metrology, the standard definition of measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind (de Boer, 1994/95; Emerson, 2008). For example, the statement "Peter's hallway is 4m long" expresses a measurement of a hitherto unknown length magnitude (the hallway's length) as the ratio of that magnitude to the unit (the metre in this case). The real number "4" is a real number in the strict mathematical sense of this term.

For some other quantities, it is easier or has been convention to estimate ratios between attribute differences. Consider temperature, for example. In the familiar everyday instances, temperature is measured using instruments calibrated in either the Fahrenheit or Celsius scales. What are really being measured with such instruments are the magnitudes of temperature differences. For example, Anders Celsius defined the unit of the Celsius scale to be 1/100th of the difference in temperature between the freezing and boiling points of water at sea level. A midday temperature measurement of 20 degrees Celsius is simply the ratio of the midday temperature to the Celsius unit. Formally expressed, a scientific measurement is:

Q = r × [Q]

where Q is the magnitude of the quantity, r is a real number and [Q] is a unit magnitude of the same kind. This classical/standard definition of measurement does not take into account that measurement in one physical realm is affected by other physical realms, as demonstrated by the Heisenberg uncertainty principle and Einstein's theories of Special and General Relativity. For instance, we know from Boyle's law that measurement of volume is affected by temperature, pressure, etc. A gallon of gasoline measured in winter will expand in volume by summer and vice versa. The definition of temperature in degrees Celsius is itself based upon the boiling temperature of water at sea level, but do we usually account for this in our measurement of temperature? We also know from Einstein's theories that length is not constant for any object in motion, and all objects in the universe are under varying motion. Similarly for time. Therefore it is not possible for any measurement (physical or psychological) to be the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind.

Extensive and intensive quantity


Length is a quantity for which natural concatenation operations exist. That is, we can combine lengths of rigid steel rods, for example, in a side-by-side fashion, such that the additive relations between lengths are readily observed. If we have four 1m lengths of such rods, we can place them end to end to produce a length of 4m. Quantities capable of concatenation are known as extensive quantities and include mass, time, electrical resistance and plane angle. These are known as base quantities in physics and metrology. Temperature is a quantity for which there is an absence of concatenation operations. We cannot pour a volume of water of temperature 40 degrees Celsius into another bucket of water at 20 degrees Celsius and expect to have a volume of water with a temperature of 60 degrees Celsius. Temperature is therefore an intensive quantity. Psychological attributes, like temperature, are considered to be intensive as no way of concatenating such attributes has been found. But this is not to say that such attributes are not quantifiable. The theory of conjoint measurement provides a theoretical means of doing this.

Theory
Consider two natural attributes A and X. It is not known that either A or X is a continuous quantity, or that both of them are. Let a, b, and c represent three independent, identifiable levels of A; and let x, y and z represent three independent, identifiable levels of X. A third attribute, P, consists of the nine ordered pairs of levels of A and X. That is, (a, x), (b, y),..., (c, z) (see Figure 1). The quantification of A, X and P depends upon the behaviour of the relation holding upon the levels of P. These relations are presented as axioms in the theory of conjoint measurement.

Single cancellation or independence axiom

The single cancellation axiom is as follows. The relation upon P satisfies single cancellation if and only if for all a and b in A, and x in X, (a, x) > (b, x) is implied for every w in X such that (a, w) > (b, w). Similarly, for all x and y in X and a in A, (a, x) > (a, y) is implied for every d in A such that (d, x) > (d, y). What this means is that if any two levels, a and b, are ordered, then this order holds irrespective of each and every level of X. The same holds for any two levels, x and y, of X with respect to each and every level of A.

Figure One: Graphical representation of the single cancellation axiom. It can be seen that a > b because (a, x) > (b, x), (a, y) > (b, y) and (a, z) > (b, z).

Single cancellation is so-called because a single common factor of two levels of P cancels out to leave the same ordinal relationship holding on the remaining elements. For example, a cancels out of the inequality (a, x) > (a, y) as it is common to both sides, leaving x > y. Krantz, et al. (1971) originally called this axiom independence, as the ordinal relation between two levels of an attribute is independent of any and all levels of the other attribute. However, given that the term independence causes confusion with statistical concepts of independence, single cancellation is the preferable term. Figure One is a graphical representation of one instance of single cancellation. Satisfaction of the single cancellation axiom is necessary, but not sufficient, for the quantification of attributes A and X. It only demonstrates that the levels of A, X and P are ordered. Informally, single cancellation does not sufficiently constrain the order upon the levels of P to quantify A and X. For example, consider the ordered pairs (a, x), (b, x) and (b, y). If single cancellation holds then (a, x) > (b, x) and (b, x) > (b, y). Hence via transitivity (a, x) > (b, y). The relation between these latter two ordered pairs, informally a left-leaning diagonal, is determined by the satisfaction of the single cancellation axiom, as are all the "left-leaning diagonal" relations upon P.

Double cancellation axiom

Single cancellation does not determine the order of the "right-leaning diagonal" relations upon P. Even though by transitivity and single cancellation it was established that (a, x) > (b, y), the relationship between (a, y) and (b, x) remains undetermined. It could be that either (b, x) > (a, y) or (a, y) > (b, x), and such ambiguity cannot remain unresolved. The double cancellation axiom concerns a class of such relations upon P in which the common terms of two antecedent inequalities cancel out to produce a third inequality. Consider the instance of double cancellation graphically represented by Figure Two.

Figure Two: A Luce-Tukey instance of double cancellation, in which the consequent inequality (broken line arrow) does not contradict the direction of both antecedent inequalities (solid line arrows), so supporting the axiom.
The antecedent inequalities of this particular instance of double cancellation are:


(a, y) > (b, x) and (b, z) > (c, y). Given that:

(a, y) > (b, x) is true if and only if a + y > b + x; and

(b, z) > (c, y) is true if and only if b + z > c + y,

it follows that:

a + y + b + z > b + x + c + y.

Cancelling the common terms (b and y from both sides) results in: a + z > c + x; that is, (a, z) > (c, x). Hence double cancellation can only obtain when A and X are quantities. Double cancellation is satisfied if and only if the consequent inequality does not contradict the antecedent inequalities. For example, if the consequent inequality above was (c, x) > (a, z), or alternatively c + x > a + z, then double cancellation would be violated (Michell 1988) and it could not be concluded that A and X are quantities. Double cancellation concerns the behaviour of the "right-leaning diagonal" relations on P as these are not logically entailed by single cancellation. (Michell 2009) discovered that when the levels of A and X approach infinity, the number of right-leaning diagonal relations is half of the number of total relations upon P. Hence if A and X are quantities, half of the relations upon P are due to ordinal relations upon A and X and half are due to additive relations upon A and X (Michell 2009). The number of instances of double cancellation is contingent upon the number of levels identified for both A and X. If there are n levels of A and m of X, then the number of instances of double cancellation is n! × m!. Therefore, if n = m = 3, then 3! × 3! = 6 × 6 = 36 instances in total of double cancellation. However, all but 6 of these instances are trivially true if single cancellation is true, and if any one of these 6 instances is true, then all of them are true. One such instance is that shown in Figure Two. (Michell 1988) calls this a Luce-Tukey instance of double cancellation. If single cancellation has been tested upon a set of data first and is established, then only the Luce-Tukey instances of double cancellation need to be tested. For n levels of A and m of X, the number of Luce-Tukey double cancellation instances is given by the number of ways of choosing three levels of A times the number of ways of choosing three levels of X, that is, C(n, 3) × C(m, 3). For example, if n = m = 4, then there are 16 such instances; if n = m = 5, there are 100. The greater the number of levels in both A and X, the less probable it is that the cancellation axioms are satisfied at random (Arbuckle & Larimer 1976; McClelland 1977), and the more stringent a test of quantity the application of conjoint measurement becomes.
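Because the cancellation conditions are finitely many ordinal constraints, they can be checked mechanically on a small conjoint array. The sketch below is an illustrative implementation written for this article rather than a published algorithm; it assumes P is a matrix whose cell (i, j) holds the observed value, or rank, of the ordered pair formed by the i-th level of A and the j-th level of X.

```python
import itertools
import numpy as np

def single_cancellation(P):
    """Rows of P index levels of A, columns index levels of X. Single cancellation
    requires the ordering of any two rows to be the same in every column, and the
    ordering of any two columns to be the same in every row."""
    n, m = P.shape
    rows_ok = all(len({np.sign(P[a, x] - P[b, x]) for x in range(m)}) == 1
                  for a, b in itertools.combinations(range(n), 2))
    cols_ok = all(len({np.sign(P[a, x] - P[a, y]) for a in range(n)}) == 1
                  for x, y in itertools.combinations(range(m), 2))
    return rows_ok and cols_ok

def double_cancellation(P):
    """For every choice of levels a, b, c of A and x, y, z of X: if (a, y) >= (b, x)
    and (b, z) >= (c, y), then (a, z) >= (c, x) must hold. For a 3 x 3 array this
    loops over the 36 instances mentioned in the text."""
    n, m = P.shape
    for a, b, c in itertools.permutations(range(n), 3):
        for x, y, z in itertools.permutations(range(m), 3):
            if P[a, y] >= P[b, x] and P[b, z] >= P[c, y] and not P[a, z] >= P[c, x]:
                return False
    return True

# An array generated by an additive rule satisfies both axioms.
P = np.add.outer([1.0, 4.0, 9.0], [2.0, 3.0, 7.0])
print(single_cancellation(P), double_cancellation(P))  # True True
```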

Solvability and Archimedean axioms

The single and double cancellation axioms by themselves are not sufficient to establish continuous quantity. Other conditions must also be introduced to ensure continuity. These are the solvability and Archimedean conditions.

Figure Three: An instance of triple cancellation.

Solvability means that for any three of the elements a, b, x and y, the fourth exists such that the equation a + x = b + y is solved, hence the name of the condition. Solvability essentially is the requirement that each level of P has an element in A and an element in X. Solvability reveals something about the levels of A and X: they are either dense like the real numbers or equally spaced like the integers (Krantz et al. 1971). The Archimedean condition is as follows. Let I be a set of consecutive integers, either finite or infinite, positive or negative. The levels of A form a standard sequence if and only if there exist x and y in X, where x ≠ y, such that for all integers i and i + 1 in I, (a_i, x) = (a_{i+1}, y). What this basically means is that if x is greater than y, for example, there are levels of A which can be found which make two relevant ordered pairs, the levels of P, equal. The Archimedean condition argues that there is no infinitely greatest level of P and hence there is no greatest level of either A or X. This condition is a definition of continuity given by the ancient Greek mathematician Archimedes, who wrote that "Further, of unequal lines, unequal surfaces, and unequal solids, the greater exceeds the less by such a magnitude as, when added to itself, can be made to exceed any assigned magnitude among those which are comparable with one another" (On the Sphere and Cylinder, Book I, Assumption 5). Archimedes recognised that for any two magnitudes of a continuous quantity, one being lesser than the other, the lesser could be multiplied by a whole number such that it equalled the greater magnitude. Euclid stated the Archimedean condition as an axiom in Book V of the Elements, in which Euclid presented his theory of continuous quantity and measurement. As they involve infinitistic concepts, the solvability and Archimedean axioms are not amenable to direct testing in any finite empirical situation. But this does not entail that these axioms cannot be empirically tested at all. Scott's (1964) finite set of cancellation conditions can be used to indirectly test these axioms; the extent of such testing is empirically determined. For example, if both A and X possess three levels, the highest order cancellation axiom within Scott's (1964) hierarchy that indirectly tests solvability and Archimedeaness is double cancellation. With four levels it is triple cancellation (Figure 3). If such tests are satisfied, the construction of standard sequences in differences upon A and X is possible. Hence these attributes may be dense as per the real numbers or equally spaced as per the integers (Krantz et al. 1971). In other words, A and X are continuous quantities.


Relation to the scientific definition of measurement


Satisfaction of the conditions of conjoint measurement means that measurements of the levels of A and X can be expressed as either ratios between magnitudes or ratios between magnitude differences. It is most commonly interpreted as the latter, given that most behavioural scientists consider that their tests and surveys "measure" attributes on so-called "interval scales" (Kline 1998). That is, they believe tests do not identify absolute zero levels of psychological attributes.

Formally, if P, A and X form an additive conjoint structure, then there exist functions $\phi_A$ from A and $\phi_X$ from X into the real numbers such that for a and b in A and x and y in X:

$(a, x) \succeq (b, y)$ if and only if $\phi_A(a) + \phi_X(x) \geq \phi_A(b) + \phi_X(y)$.

If $\phi_A'$ and $\phi_X'$ are two other real-valued functions satisfying the above expression, there exist real-valued constants $\alpha > 0$, $\beta_A$ and $\beta_X$ satisfying:

$\phi_A'(a) = \alpha\,\phi_A(a) + \beta_A$ and $\phi_X'(x) = \alpha\,\phi_X(x) + \beta_X$.

That is, $\phi_A$ and $\phi_X$ are measurements of A and X unique up to affine transformation (i.e. each is an interval scale in Stevens' (1946) parlance). The mathematical proof of this result is given in Krantz et al. (1971, pp. 261–6).

This means that the levels of A and X are magnitude differences measured relative to some kind of unit difference. Each level of P is a difference between the levels of A and X. However, it is not clear from the literature how a unit could be defined within an additive conjoint context. van der Ven (1980) proposed a scaling method for conjoint structures but did not discuss the unit either.

The theory of conjoint measurement is not, however, restricted to the quantification of differences. If each level of P is a product of a level of A and a level of X, then P is a different quantity whose measurement is expressed as a magnitude of A per unit magnitude of X. For example, if A consists of masses and X consists of volumes, then P consists of densities measured as mass per unit of volume. In such cases, it would appear that one level of A and one level of X must be identified as a tentative unit prior to the application of conjoint measurement.

If each level of P is the sum of a level of A and a level of X, then P is the same quantity as A and X. For example, if A and X are lengths, then so must be P. All three must therefore be expressed in the same unit. In such cases, it would appear that a level of either A or X must be tentatively identified as the unit. Hence it would seem that application of conjoint measurement requires some prior descriptive theory of the relevant natural system.
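The uniqueness claim can be illustrated with a small numerical check. The sketch below (Python, standard library only) builds an additive conjoint array from two illustrative scales, applies the same positive multiplier and arbitrary shifts to both, and confirms that the ordering of the cells of P is unchanged; all numbers are invented for illustration and are not drawn from any study cited here.

from itertools import product

# Illustrative (made-up) interval-scale values for three levels of A and of X.
phi_A = {"a1": 0.0, "a2": 1.5, "a3": 4.0}
phi_X = {"x1": 0.0, "x2": 2.0, "x3": 3.0}

def order_on_P(pa, px):
    # Order the cells of P = A x X by the additive representation pa(a) + px(x).
    return sorted(product(pa, px), key=lambda cell: pa[cell[0]] + px[cell[1]])

# The same alpha > 0 for both attributes, arbitrary additive shifts.
alpha, beta_A, beta_X = 2.5, -1.0, 7.0
psi_A = {a: alpha * v + beta_A for a, v in phi_A.items()}
psi_X = {x: alpha * v + beta_X for x, v in phi_X.items()}

# Both representations induce exactly the same order on P, which is what
# "unique up to affine transformation" amounts to in practice.
print(order_on_P(phi_A, phi_X) == order_on_P(psi_A, psi_X))  # True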

Applications of Conjoint Measurement


Empirical applications of the theory of conjoint measurement have been sparse (Cliff 1992; Michell 2009). Levelt, Riemersma & Bunt (1972) applied the theory to the psychophysics of binaural loudness. They found that the double cancellation axiom was rejected. Gigerenzer & Strube (1983) conducted a similar investigation and replicated the findings of Levelt et al. (1972).

Michell (1990) applied the theory to L. L. Thurstone's (1927) theory of paired comparisons, multidimensional scaling, and Coombs' (1964) theory of unidimensional unfolding. He found support for the cancellation axioms only with Coombs' (1964) theory. However, the statistical techniques employed by Michell (1990) in testing Thurstone's theory and multidimensional scaling did not take into consideration the ordinal constraints imposed by the cancellation axioms (van der Linden 1994).

Johnson (2001), Kyngdon (2006), Michell (1994) and Sherman (1993) tested the cancellation axioms upon the interstimulus midpoint orders obtained by use of Coombs' (1964) theory of unidimensional unfolding. In each case Coombs' theory was applied to a set of six statements. These authors found that the axioms were satisfied; however, these were applications biased towards a positive result. With six stimuli, the probability of an interstimulus midpoint order satisfying the double cancellation axioms at random is .5874 (Michell, 1994). This is

not an unlikely event. Kyngdon & Richards (2007) employed eight statements and found that the interstimulus midpoint orders rejected the double cancellation condition.

Perline, Wright & Wainer (1979) applied conjoint measurement to item response data from a convict parole questionnaire and to intelligence test data gathered from Danish troops. They found considerable violation of the cancellation axioms in the parole questionnaire data, but not in the intelligence test data. Moreover, they recorded the supposed "no-test" instances of double cancellation. Interpreted correctly as instances in support of double cancellation (Michell, 1988), the results of Perline, Wright & Wainer (1979) are better than they believed.

Stankov & Cregan (1993) applied conjoint measurement to performance on sequence completion tasks. The columns of their conjoint arrays (X) were defined by the demand placed upon working memory capacity, through increasing numbers of working memory placekeepers in letter series completion tasks. The rows were defined by levels of motivation (A), which consisted of different amounts of time available for completing the test. Their data (P) consisted of completion times and the average number of series correct. They found support for the cancellation axioms; however, their study was biased by the small size of the conjoint arrays (3 × 3 in size) and by statistical techniques that did not take into consideration the ordinal restrictions imposed by the cancellation axioms.

Kyngdon (2011) used Karabatsos' (2001) order-restricted inference framework to test a conjoint matrix of reading item response proportions (P), where examinee reading ability comprised the rows of the conjoint array (A) and the difficulty of the reading items formed the columns of the array (X). The levels of reading ability were identified via raw total test score, and the levels of reading item difficulty were identified by the Lexile Framework for Reading (Stenner et al. 2006). Kyngdon found that satisfaction of the cancellation axioms was obtained only through permutation of the matrix in a manner inconsistent with the putative Lexile measures of item difficulty. Kyngdon also tested simulated ability test response data using polynomial conjoint measurement. The data were generated using Humphry's extended frame of reference Rasch model (Humphry & Andrich 2008). He found support for distributive, single and double cancellation consistent with a distributive polynomial conjoint structure in three variables (Krantz & Tversky 1971).


References
Arbuckle, J.; Larimer, J. (1976). "The number of two-way tables satisfying certain additivity axioms". Journal of Mathematical Psychology 12: 89100. doi:10.1016/0022-2496(76)90036-5 [1]. Birnbaum, M.H. (2008). "New paradoxes of risky decision making". Psychological Review 115 (2): 463501. doi:10.1037/0033-295X.115.2.463 [2]. PMID18426300 [3]. Brogden, H.E. (December 1977). "The Rasch model, the law of comparative judgement and additive conjoint measurement" [4]. Psychometrika 42 (4): 6314. doi:10.1007/BF02295985 [5]. Cliff, N. (1992). "Abstract measurement theory and the revolution that never happened". Psychological Science 3 (3): 186190. doi:10.1111/j.1467-9280.1992.tb00024.x [6]. Coombs, C.H. (1964). A Theory of Data. New York: Wiley.Wikipedia:Citing sources Davis-Stober, C.P. (February 2009). "Analysis of multinomial models under inequality constraints: applications to measurement theory" [7]. Journal of Mathematical Psychology 53 (1): 113. doi:10.1016/j.jmp.2008.08.003 [8]. Debreu, G. (1960). "Topological methods in cardinal utility theory". In Arrow, K.J.; Karlin, S.; Suppes, P. Mathematical Methods in the Social Sciences. Stanford University Press. pp.1626. Embretson, S.E.; Reise, S.P. (2000). Item response theory for psychologists. Erlbaum.Wikipedia:Citing sources Emerson, W.H. (2008). "On quantity calculus and units of measurement". Metrologia 45 (2): 134138. Bibcode:2008Metro..45..134E [9]. doi:10.1088/0026-1394/45/2/002 [10]. Fischer, G. (1995). "Derivations of the Rasch model". In Fischer, G.; Molenaar, I.W. Rasch models: Foundations, recent developments, and applications. New York: Springer. pp.1538. Gigerenzer, G.; Strube, G. (1983). "Are there limits to binaural additivity of loudness?". Journal of Experimental Psychology: Human Perception and Performance 9: 126136. doi:10.1037/0096-1523.9.1.126 [11].

Theory of conjoint measurement Grayson, D.A. (September 1988). "Two-group classification and latent trait theory: scores with monotone likelihood ratio" [12]. Psychometrika 53 (3): 383392. doi:10.1007/BF02294219 [13]. Hlder, O. (1901). "Die Axiome der Quantitt und die Lehre vom Mass". Berichte uber die Verhandlungen der Koeniglich Sachsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematisch-Physikaliche Klasse 53: 146. (Part 1 translated by Michell, J.; Ernst, C. (September 1996). "The axioms of quantity and the theory of measurement" [14]. Journal of Mathematical Psychology 40 (3): 235252. doi:10.1006/jmps.1996.0023 [15]. PMID8979975 [16]. Humphry, S.M.; Andrich, D. (2008). "Understanding the unit in the Rasch model". Journal of Applied Measurement 9 (3): 249264. PMID18753694 [17]. Iverson, G.; Falmagne, J.C. (1985). "Statistical issues in measurement". Mathematical Social Sciences 10 (2): 131153. doi:10.1016/0165-4896(85)90031-9 [18]. Johnson, T. (2001). "Controlling the effect of stimulus context change on attitude statements using Michell's binary tree procedure". Australian Journal of Psychology 53: 2328. doi:10.1080/00049530108255118 [19]. Kahneman, D.; Tversky, A. (1979). "Prospect theory: an analysis of decision under risk". Econometrica 47 (2): 263291. doi:10.2307/1914185 [20]. Karabatsos, G. (2001). "The Rasch model, additive conjoint measurement, and new models of probabilistic measurement theory". Journal of Applied Measurement 2 (4): 389423. PMID12011506 [21]. Karabatsos, G. (February 2005). "The exchangeable multinomial model as an approach for testing axioms of choice and measurement" [22]. Journal of Mathematical Psychology 49 (1): 5169. doi:10.1016/j.jmp.2004.11.001 [23]. Karabatsos, G.; Sheu, C.F. (2004). "Bayesian order constrained inference for dichotomous models of unidimensional non-parametric item response theory". Applied Psychological Measurement 28 (2): 110125. doi:10.1177/0146621603260678 [24]. Karabatsos, G.; Ullrich, J.R. (2002). "Enumerating and testing conjoint measurement models". Mathematical Social Sciences 43 (3): 485504. doi:10.1016/S0165-4896(02)00024-0 [25]. Krantz, D.H. (July 1964). "Conjoint measurement: the Luce Tukey axiomatisation and some extensions" [26]. Journal of Mathematical Psychology 1 (2): 248277. doi:10.1016/0022-2496(64)90003-3 [27]. Krantz, D.H. (1968). "A survey of measurement theory". In Danzig, G.B.; Veinott, A.F. Mathematics of the Decision Sciences: Part 2. Providence, Rhode Island: American Mathematical Society. pp.314350. Keats, J.A. (1967). "Test theory". Annual Review of Psychology 18: 217238. doi:10.1146/annurev.ps.18.020167.001245 [28]. PMID5333423 [29]. Kline, P. (1998). The New Psychometrics: Science, psychology and measurement. London: Routledge.Wikipedia:Citing sources Krantz, D.H.; Luce, R.D; Suppes, P.; Tversky, A. (1971). Foundations of Measurement, Vol. I: Additive and polynomial representations. New York: Academic Press. Krantz, D.H.; Tversky, A. (1971). "Conjoint measurement analysis of composition rules in psychology". Psychological Review 78 (2): 151169. doi:10.1037/h0030637 [30]. Kyngdon, A. (2006). "An empirical study into the theory of unidimensional unfolding". Journal of Applied Measurement 7 (4): 369393. PMID17068378 [31]. Kyngdon, A. (2008). "The Rasch model from the perspective of the representational theory of measurement". Theory & Psychology 18: 89109. doi:10.1177/0959354307086924 [32]. Kyngdon, A. (2011). "Plausible measurement analogies to some psychometric models of test performance". 
British Journal of Mathematical and Statistical Psychology 64 (3): 478497. doi:10.1348/2044-8317.002004 [33]. PMID21973097 [34]. Kyngdon, A.; Richards, B. (2007). "Attitudes, order and quantity: deterministic and direct probabilistic tests of unidimensional unfolding". Journal of Applied Measurement 8 (1): 134. PMID17215563 [35].


Theory of conjoint measurement Levelt, W.J.M.; Riemersma, J.B.; Bunt, A.A. (May 1972). "Binaural additivity of loudness" [36]. British Journal of Mathematical and Statistical Psychology 25 (1): 5168. doi:10.1111/j.2044-8317.1972.tb00477.x [37]. PMID5031649 [38]. Luce, R.D.; Suppes, P. (2002). "Representational measurement theory". In Pashler, H.; Wixted, J. Stevens handbook of experimental psychology: Vol. 4. Methodology in experimental psychology (3rd ed.). New York: Wiley. pp.141. Luce, R.D.; Tukey, J.W. (January 1964). "Simultaneous conjoint measurement: a new scale type of fundamental measurement" [39]. Journal of Mathematical Psychology 1 (1): 127. doi:10.1016/0022-2496(64)90015-X [40]. McClelland, G. (June 1977). ""A note on Arbuckle and Larimer: the number of two way tables satisfying certain additivity axioms"" [41]. Journal of Mathematical Psychology 15 (3): 2925. doi:10.1016/0022-2496(77)90035-9 [42] . Michell, J. (June 1994). "Measuring dimensions of belief by unidimensional unfolding" [43]. Journal of Mathematical Psychology 38 (2): 224273. doi:10.1006/jmps.1994.1016 [44]. Michell, J. (December 1988). "Some problems in testing the double cancellation condition in conjoint measurement" [45]. Journal of Mathematical Psychology 32 (4): 466473. doi:10.1016/0022-2496(88)90024-7 [46] . Michell, J. (1990). An Introduction to the Logic of Psychological Measurement. Hillsdale NJ: Erlbaum.Wikipedia:Citing sources Michell, J. (February 2009). "The psychometricians' fallacy: Too clever by half?" [47]. British Journal of Mathematical and Statistical Psychology 62 (1): 4155. doi:10.1348/000711007X243582 [48]. Perline, R.; Wright, B.D; Wainer, H. (1979). "The Rasch model as additive conjoint measurement". Applied Psychological Measurement 3 (2): 237255. doi:10.1177/014662167900300213 [49]. Scheiblechner, H. (September 1999). "Additive conjoint isotonic probabilistic models (ADISOP)" [50]. Psychometrika 64 (3): 295316. doi:10.1007/BF02294297 [51]. Scott, D. (July 1964). "Measurement models and linear inequalities" [52]. Journal of Mathematical Psychology 1 (2): 233247. doi:10.1016/0022-2496(64)90002-1 [53]. Sherman, K. (April 1994). "The effect of change in context in Coombs's unfolding theory" [54]. Australian Journal of Psychology 46 (1): 4147. doi:10.1080/00049539408259468 [55]. Stankov, L.; Cregan, A. (1993). "Quantitative and qualitative properties of an intelligence test: series completion". Learning and Individual Differences 5 (2): 137169. doi:10.1016/1041-6080(93)90009-H [56]. Stenner, A.J.; Burdick, H.; Sanford, E.E.; Burdick, D.S. (2006). "How accurate are Lexile text measures?". Journal of Applied Measurement 7 (3): 307322. PMID16807496 [57]. Stevens, S.S. (1946). "On the theory of scales of measurement". Science 103 (2684): 667680. Bibcode:1946Sci...103..677S [58]. doi:10.1126/science.103.2684.677 [59]. PMID17750512 [60]. Stober, C.P. (2009). Luce's challenge: Quantitative models and statistical methodology.Wikipedia:Citing sources#What information to include Thurstone, L.L. (1927). "A law of comparative judgement". Psychological Review 34 (4): 278286. doi:10.1037/h0070288 [61]. Tversky, A. (1967). "A general theory of polynomial conjoint measurement" [62] (PDF). Journal of Mathematical Psychology 4: 120. doi:10.1016/0022-2496(67)90039-9 [63]. Ullrich, J.R.; Wilson, R.E. (December 1993). "A note on the exact number of two and three way tables satisfying conjoint measurement and additivity axioms". Journal of Mathematical Psychology 37 (4): 6248. doi:10.1006/jmps.1993.1037 [64]. 
van der Linden, W. (March 1994). "Review of Michell (1990)" [65]. Psychometrika 59 (1): 139142. doi:10.1007/BF02294273 [66]. van der Ven, A.H.G.S. (1980). Introduction to Scaling. New York: Wiley.Wikipedia:Citing sources


External links
Karabatsos' S-Plus programs for testing conjoint axioms (http://tigger.uic.edu/~georgek/HomePage/publications.htm)
Birnbaum's FORTRAN MONANOVA program for testing additivity (http://psych.fullerton.edu/mbirnbaum/programs.htm)
Kyngdon's R programs for enumerating cancellation tests, testing axioms and prospect theory (https://sites.google.com/site/drandrewkyngdon/home)
R statistical computing software (http://www.r-project.org/)


Thurstone scale
In psychology, the Thurstone scale was the first formal technique for measuring an attitude. It was developed by Louis Leon Thurstone in 1928, as a means of measuring attitudes towards religion. It is made up of statements about a particular issue, and each statement has a numerical value indicating how favorable or unfavorable it is judged to be. People check each of the statements to which they agree, and a mean score is computed, indicating their attitude.

Thurstone's method of pair comparisons can be considered a prototype of a normal distribution-based method for scaling dominance matrices. Even though the theory behind this method is quite complex (Thurstone, 1927a), the algorithm itself is straightforward. For the basic Case V, the frequency dominance matrix is translated into proportions and interfaced with the standard scores. The scale is then obtained as a left-adjusted column marginal average of this standard score matrix (Thurstone, 1927b). The underlying rationale for the method, and the basis for the measurement of the "psychological scale separation between any two stimuli", derives from Thurstone's Law of comparative judgment (Thurstone, 1928).

The principal difficulty with this algorithm is its indeterminacy with respect to one-zero proportions, which return z values of plus or minus infinity, respectively. The inability of the pair comparisons algorithm to handle these cases imposes considerable limits on the applicability of the method. The most frequent recourse when 1.00–0.00 frequencies are encountered is their omission. Thus, for example, Guilford (1954, p. 163) recommended not using proportions more extreme than .977 or .023, and Edwards (1957, pp. 41–42) suggested that "if the number of judges is large, say 200 or more, then we might use p_ij values of .99 and .01, but with less than 200 judges, it is probably better to disregard all comparative judgments for which p_ij is greater than .98 or less than .02." Since the omission of such extreme values leaves empty cells in the Z matrix, the averaging procedure for arriving at the scale values cannot be applied, and an elaborate procedure for the estimation of unknown parameters is usually employed (Edwards, 1957, pp. 42–46). An alternative solution to this problem was suggested by Krus and Kennedy (1977).

With later developments in psychometric theory, it has become possible to employ direct methods of scaling such as application of the Rasch model or unfolding models such as the Hyperbolic Cosine Model (HCM) (Andrich & Luo,

1993). The Rasch model has a close conceptual relationship to Thurstone's law of comparative judgment (Andrich, 1978), the principal difference being that it directly incorporates a person parameter. Also, the Rasch model takes the form of a logistic function rather than a cumulative normal function.
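For readers who want to see the Case V algorithm described above in executable form, the following sketch (Python, assuming NumPy and SciPy are available) translates a pairwise frequency matrix into proportions, converts them to standard scores, and averages columns to obtain scale values. The sample frequency matrix is invented, extreme proportions are simply clipped rather than handled by Edwards' estimation procedure, and the function name is of course not part of Thurstone's own presentation.

import numpy as np
from scipy.stats import norm

def thurstone_case_v(freq, clip=0.977):
    # Case V scaling of a pairwise frequency dominance matrix.
    # freq[i, j] = number of judges who preferred stimulus j over stimulus i.
    n_judges = freq + freq.T            # total comparisons per pair
    np.fill_diagonal(n_judges, 1)       # avoid division by zero on the diagonal
    p = freq / n_judges                 # proportion matrix
    np.fill_diagonal(p, 0.5)            # a stimulus "ties" with itself
    p = np.clip(p, 1 - clip, clip)      # guard against 0/1 proportions (Guilford's limits)
    z = norm.ppf(p)                     # standard scores
    scale = z.mean(axis=0)              # column marginal averages
    return scale - scale.min()          # left-adjust so the lowest value is zero

# Hypothetical data: 4 stimuli judged by 50 judges.
freq = np.array([[ 0, 35, 42, 48],
                 [15,  0, 38, 44],
                 [ 8, 12,  0, 30],
                 [ 2,  6, 20,  0]])
print(thurstone_case_v(freq))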


References
Andrich, D. (1978b) Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2, 449-460.
Andrich, D. & Luo, G. (1993) A hyperbolic cosine model for unfolding dichotomous single-stimulus responses. Applied Psychological Measurement, 17, 253-276.
Babbie, E. (2004) The Practice of Social Research, 10th edition. Wadsworth, Thomson Learning Inc. ISBN 0-534-62029-9
Edwards, A. L. (1957) Techniques of attitude scale construction. New York: Appleton-Century-Crofts.
Guilford, J. P. (1954) Psychometric methods. New York: McGraw-Hill.
Krus, D. J., & Kennedy, P. H. (1977) Normal scaling of dominance matrices: The domain-referenced model. Educational and Psychological Measurement, 37, 189-193.
Krus, D. J., Sherman, J. L., & Kennedy, P. H. (1977) Changing values over the last half-century: the story of Thurstone's crime scales. Psychological Reports, 40, 207-211. (http://www.visualstatistics.net/Readings/Thurstone%20Crimes%20Scale/Thurstone%20Crimes%20Scale.htm)
Thurstone, L. L. (1927a) A law of comparative judgment. Psychological Review, 34, 273-286.
Thurstone, L. L. (1927b) The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 21, 384-400.
Thurstone, L. L. (1928) Attitudes can be measured. American Journal of Sociology, 33, 529-554.



Thurstonian model
A Thurstonian model is a latent variable model for describing the mapping of some continuous scale onto discrete, possibly ordered categories of response. In the model, each of these categories of response corresponds to a latent variable whose value is drawn from a normal distribution, independently of the other response variables and with constant variance. Thurstonian models have been used as an alternative to generalized linear models in the analysis of sensory discrimination tasks. They have also been used to model long-term memory in ranking tasks of ordered alternatives, such as the order of the amendments to the US Constitution.[1] Their main advantage over other models of ranking tasks is that they account for non-independence of alternatives.

Definition
Consider a set of m options to be ranked by n independent judges. Such a ranking by judge i can be represented by the ordering vector $r_i = (r_{i1}, r_{i2}, \ldots, r_{im})$.

Rankings are assumed to be derived from real-valued latent variables $z_{ij}$, representing the evaluation of option j by judge i. Rankings $r_i$ are derived deterministically from $z_i$ such that $z_i(r_{i1}) < z_i(r_{i2}) < \ldots < z_i(r_{im})$.

The $z_{ij}$ are assumed to be derived from an underlying ground truth value for each option. In the most general case, they are multivariate-normally distributed:

$z_{ij} = \mu_j + \varepsilon_{ij}$,

where the error vector $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{im})$ is multivariate-normally distributed around 0 with covariance matrix $\Sigma$. In a simpler case, there is a single standard deviation parameter $\sigma_i$ for each judge:

$z_{ij} \sim N(\mu_j, \sigma_i^2)$.
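A short simulation makes the generative model above concrete. The sketch below (Python with NumPy, assumed available) draws latent evaluations for several judges around hypothetical ground-truth values and converts them into rankings; all numbers are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)

m, n = 4, 5                              # m options, n judges
mu = np.array([0.0, 0.4, 1.0, 1.5])      # hypothetical ground-truth values
Sigma = 0.5 * np.eye(m)                  # simple diagonal covariance for the errors

# Latent evaluations: one row per judge, one column per option.
z = rng.multivariate_normal(mu, Sigma, size=n)

# Each judge's ranking is the ordering of options by latent value
# (option indices from lowest to highest evaluation).
rankings = np.argsort(z, axis=1)
print(rankings)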

Inference
The Gibbs-sampler based approach to estimating model parameters is due to Yao and Bockenholt (1999).

Step 1: Given $\mu$, $\Sigma$, and $r_i$, sample $z_i$. The $z_{ij}$ must be sampled from a truncated multivariate normal distribution to preserve their rank ordering. Hajivassiliou's truncated multivariate normal Gibbs sampler can be used to sample efficiently.[2][3]

Step 2: Given $\Sigma$ and $z_i$, sample $\mu$. $\mu$ is sampled from a normal distribution whose mean and covariance are given by the current estimates $\mu^*$ and $\Sigma^*$.

Step 3: Given $\mu$ and $z_i$, sample $\Sigma$. $\Sigma^{-1}$ is sampled from a Wishart posterior, combining a Wishart prior with the data likelihood from the samples $\varepsilon_i = z_i - \mu$.
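Step 1 is the part that needs special care, because the redrawn latent values must preserve the observed ranking. The sketch below (Python with NumPy and SciPy, assumed available) shows one simple way to do this: a single-site Gibbs update that redraws each latent value from a normal distribution truncated to lie between its neighbours in the ranking. This is a simplification with independent errors, not Hajivassiliou's sampler or Yao and Bockenholt's full procedure, and the names and numbers are illustrative.

import numpy as np
from scipy.stats import truncnorm

def resample_latents(z, ranking, mu, sigma):
    # One single-site Gibbs pass over one judge's latent values.
    # z: current latent values (one per option); ranking: option indices ordered
    # from lowest to highest latent value; mu: per-option means; sigma: common s.d.
    z = z.copy()
    for pos, j in enumerate(ranking):
        # The new value must stay between its neighbours in the ranking.
        lower = z[ranking[pos - 1]] if pos > 0 else -np.inf
        upper = z[ranking[pos + 1]] if pos < len(ranking) - 1 else np.inf
        a, b = (lower - mu[j]) / sigma, (upper - mu[j]) / sigma
        z[j] = truncnorm.rvs(a, b, loc=mu[j], scale=sigma)
    return z

# Illustrative use with made-up values for one judge and four options:
mu = np.array([0.0, 0.4, 1.0, 1.5])
z0 = np.array([0.2, -0.1, 1.3, 0.9])
print(resample_latents(z0, np.argsort(z0), mu, sigma=0.7))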


History
Thurstonian models were introduced by Louis Leon Thurstone to describe the law of comparative judgment.[4] Prior to 1999, Thurstonian models were rarely used for modeling tasks involving more than 4 options because of the high-dimensional integration required to estimate parameters of the model. In 1999, Yao and Bockenholt introduced their Gibbs-sampler based approach to estimating model parameters.[]

Applications to sensory discrimination


Thurstonian models have been applied to a range of sensory discrimination tasks, including auditory, taste, and olfactory discrimination, to estimate sensory distance between stimuli that range along some sensory continuum.[5][6]

The Thurstonian approach motivated Frijters' (1979) explanation of Gridgeman's paradox, also known as the paradox of discriminatory nondiscriminators:[7][8] people perform better in a three-alternative forced choice task when told in advance which dimension of the stimulus to attend to. (For example, people are better at identifying which of three drinks is different from the other two when told in advance that the difference will be in degree of sweetness.) This result is accounted for by differing cognitive strategies: when the relevant dimension is known in advance, people can estimate values along that particular dimension. When the relevant dimension is not known in advance, they must rely on a more general, multi-dimensional measure of sensory distance.


Torrance Tests of Creative Thinking


The Torrance Tests of Creative Thinking is a test of creativity.

Description
Building on J. P. Guilford's work and created by Ellis Paul Torrance, the Torrance Tests of Creative Thinking (TTCT) originally involved simple tests of divergent thinking and other problem-solving skills, which were scored on four scales:

Fluency. The total number of interpretable, meaningful, and relevant ideas generated in response to the stimulus.
Flexibility. The number of different categories of relevant responses.
Originality. The statistical rarity of the responses.
Elaboration. The amount of detail in the responses.
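As a rough illustration of how the first three scales above might be operationalised, the sketch below (Python) scores a small set of responses for fluency, flexibility and originality. The responses, category assignments and rarity table are invented for illustration and are not Torrance's actual scoring norms.

# Hypothetical scoring aid for three of the TTCT-style scales described above.
responses = ["paperweight", "doorstop", "build a wall", "grind into dust"]

# Flexibility needs a category for each response (assigned by a rater).
categories = {"paperweight": "weight", "doorstop": "weight",
              "build a wall": "construction", "grind into dust": "transformation"}

# Originality needs the proportion of a norm sample giving each response.
norm_frequency = {"paperweight": 0.20, "doorstop": 0.15,
                  "build a wall": 0.40, "grind into dust": 0.01}

fluency = len(responses)                               # number of relevant ideas
flexibility = len({categories[r] for r in responses})  # number of distinct categories
# One simple originality rule: a point for each statistically rare response (< 5%).
originality = sum(1 for r in responses if norm_frequency[r] < 0.05)

print(fluency, flexibility, originality)  # 4, 3, 1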

The third edition of the TTCT in 1984 eliminated the Flexibility scale from the figural test, but added Resistance to Premature Closure (based on Gestalt psychology) and Abstractness of Titles as two new criterion-referenced scores on the figural test. Torrance called the new scoring procedure Streamlined Scoring. With the five norm-referenced measures that he now had (fluency, originality, abstractness of titles, elaboration and resistance to premature closure), he added 13 criterion-referenced measures, which include: emotional expressiveness, story-telling articulateness, movement or actions, expressiveness of titles, syntheses of incomplete figures, synthesis of lines, synthesis of circles, unusual visualization, extending or breaking boundaries, humor, richness of imagery, colourfulness of imagery, and fantasy.[1]

According to Arasteh and Arasteh (1976), the most systematic assessment of creativity in elementary school children has been conducted by Torrance and his associates (1960a, 1960b, 1960c, 1961, 1962, 1962a, 1963a, 1964), who developed and administered the Minnesota Tests of Creative Thinking (MTCT), later renamed the TTCT, to several thousand school children. Although they have used

many of Guilford's concepts in their test construction, the Minnesota group, in contrast to Guilford, has devised tasks which can be scored for several factors, involving both verbal and non-verbal aspects and relying on senses other than vision. These tests represent a fairly sharp departure from the factor-type tests developed by Guilford and his associates (Guilford, Merrifield and Cox, 1961; Merrifield, Guilford and Gershan, 1963), and they also differ from the battery developed by Wallach and Kogan (1965), which contains measures representing creative tendencies that are similar in nature (Torrance, 1968).

To date, several longitudinal studies have been conducted to follow up the elementary-school-aged students who were first administered the Torrance Tests in 1958 in Minnesota. There was a 22-year follow-up,[2][3][4] a 40-year follow-up,[5] and a 50-year follow-up.[6]

Torrance (1962) grouped the different subtests of the Minnesota Tests of Creative Thinking (MTCT) into three categories:
1. Verbal tasks using verbal stimuli
2. Verbal tasks using non-verbal stimuli
3. Non-verbal tasks


Tasks
A brief description of the tasks used by Torrance is given below:

Unusual Uses
The unusual uses tasks using verbal stimuli are direct modifications of Guilford's Brick Uses test. After preliminary tryouts, Torrance (1962) decided to substitute tin cans and books for bricks. It was believed that children would be able to handle tin cans and books more easily, since both are more available to children than bricks.

Impossibilities task
It was used originally by Guilford and his associates (1951) as a measure of fluency involving complex restrictions and large potential. In a course in personality development and mental hygiene, Torrance experimented with a number of modifications of the basic task, making the restrictions more specific. In this task the subjects are asked to list as many impossibilities as they can.

Consequences task
The consequences task was also used originally by Guilford and his associates (1951). Torrance made several modifications in adapting it. He chose three improbable situations and the children were required to list their consequences.

Just suppose task
It is an adaptation of the consequences type of test, designed to elicit a higher degree of spontaneity and to be more effective with children. As in the consequences task, the subject is confronted with an improbable situation and asked to predict the possible outcomes from the introduction of a new or unknown variable.

Situations task
The situations task was modeled after Guilford's (1951) test designed to assess the ability to see what needs to be done. Subjects were given three common problems and asked to think of as many solutions to these problems as they could. For example, if all schools were abolished, what would you do to try to become educated?

Common problems task
This task is an adaptation of Guilford's (1951) test designed to assess the ability to see defects, needs and deficiencies, and was found to be one of the tests of the factor termed sensitivity to problems. Subjects are instructed that they will be given common situations and asked to think of as many problems

as they can that may arise in connection with these situations, for example, doing homework while going to school in the morning.

Improvement task
This test was adapted from Guilford's (1952) apparatus test, which was designed to assess the ability to see defects and all aspects of sensitivity to problems. In this task the subjects are given a list of common objects and asked to suggest as many ways as they can to improve each object. They are asked not to bother about whether or not it is possible to implement the change thought of.

Mother Hubbard problem
This task was conceived as an adaptation of the situations task for oral administration in the primary grades, and is also useful for older groups. This test has stimulated a number of ideas concerning factors which inhibit the development of ideas.

Imaginative stories task
In this task the child is told to write the most interesting and exciting story he can think of. Topics are suggested (e.g., the dog that did not bark), or the child may use his own ideas.

Cow jumping problem
The cow jumping problem is a companion task to the Mother Hubbard problem and has been administered to the same groups under the same conditions and scored according to similar procedures. The task is to think of all possible things which might have happened when the cow jumped over the moon.


Verbal tasks using nonverbal stimuli


Ask and guess task
The ask and guess task requires the individual first to ask questions about a picture, questions which cannot be answered by just looking at the picture. Next he is asked to make guesses or formulate hypotheses about the possible causes of the event depicted, and then about their consequences, both immediate and remote.

Product improvement task
In this task common toys are used and children are asked to think of as many improvements as they can which would make the toy more fun to play with. Subjects are then asked to think of unusual uses of these toys other than as something to play with.

Unusual uses task
In this task, along with the product improvement task, another task (unusual uses) is used. The child is asked to think of the cleverest, most interesting and most unusual uses of the given toy, other than as a plaything. These uses could be for the toy as it is, or for the toy as changed.

Non-verbal tasks
Incomplete figures task
It is an adaptation of the Drawing Completion Test developed by Kate Franck and used by Barron (1958). On an ordinary sheet of white paper, an area of fifty-four square inches is divided into six squares, each containing a different stimulus figure. The subjects are asked to sketch some novel objects or designs by adding as many lines as they can to the six figures.

Picture construction task or shapes task
In this task the children are given the shape of a triangle or a jelly bean and a sheet of white paper. The children are asked to think of a picture in which the given shape is an integral part. They should paste it wherever they want on the white sheet and add lines with pencil to make any novel picture. They have to think of a name for the picture and write it at the bottom.

Circles and squares task
It was originally designed as a nonverbal test of ideational fluency and flexibility, then modified in such a way as to stress originality and elaboration. Two printed forms are used in the test. In one form, the subject is confronted with a page of forty-two circles and asked to sketch objects or pictures which have circles as a major part. In the alternate form, squares are used instead of circles.

Creative design task
Hendrickson designed this task, which seems to be promising, but the scoring procedures are still being tested and have not yet been perfected. The materials consist of circles and strips of various sizes and colours, a four-page booklet, scissors and glue. Subjects are instructed to construct pictures or designs, making use of all of the coloured circles and strips, within a thirty-minute time limit. Subjects may use one, two, three, or four pages; alter the circles and strips or use them as they are; and add other symbols with pencil or crayon.


References
[2] Torrance, E. P. (1980). Growing Up Creatively Gifted: The 22-Year Longitudinal Study. The Creative Child and Adult Quarterly, 3, 148-158.
[3] Torrance, E. P. (1981a). Predicting the creativity of elementary school children (1958-80) and the teacher who "made a difference." Gifted Child Quarterly, 25, 55-62.
[4] Torrance, E. P. (1981b). Empirical validation of criterion-referenced indicators of creative ability through a longitudinal study. Creative Child and Adult Quarterly, 6, 136-140.
[5] Cramond, B., Matthews-Morgan, J., Bandalos, D., & Zuo, L. (2005). A report on the 40-year follow-up of the Torrance Tests of Creative Thinking: Alive and Well in the New Millennium. Gifted Child Quarterly, 49, 283-291.
[6] Runco, M. A., Millar, G., Acar, S., & Cramond, B. (2011). Torrance Tests of Creative Thinking as Predictors of Personal and Public Achievement: A Fifty-Year Follow-Up. Creativity Research Journal, 22(4), in press.

William H. Tucker
William H. Tucker is a professor of psychology at Rutgers University and the author of several books critical of race science. Tucker received his bachelor's degree from Bates College in 1967, and his master's and doctorate from Princeton University. He joined the faculty at Rutgers University in 1970 and has been there since. Tucker was a Psychometric Fellow for three years at Princeton, a position subsidized by the Educational Testing Service. The majority of Tucker's scholarship has been about psychometrics rather than in it. He currently sits on the advisory board of the Institute for the Study of Academic Racism.[1] He has written critical commentaries on several hereditarian psychologists known for their controversial work on race and intelligence. He has received awards for his research on Cyril Burt and the Pioneer Fund.[2] According to his website, "My research interests concern the use, or more properly the misuse, of social science to support oppressive social policies, especially in the area of race. I seek to explore how scientists in general, and psychologists in particular, have become involved with such issues and what effect their participation has produced."


Publications
Tucker WH (1994a). Fact and Fiction in the Discovery of Sir Cyril Burt's Flaws. Journal of the History of the Behavioral Sciences, 30, 335-347.
Tucker WH (1994b). The Science and Politics of Racial Research [3]. University of Illinois Press. ISBN 0-252-02099-5
Tucker WH (1997). Re-reconsidering Burt: Beyond a reasonable doubt. Journal of the History of the Behavioral Sciences, 33, 145-162.
Tucker WH (2002). The Funding of Scientific Racism: Wickliffe Draper and the Pioneer Fund [4]. University of Illinois Press. ISBN 0-252-02762-0
Tucker WH (2005). The Intelligence Controversy: A Guide to the Debates. ABC-Clio, Inc. ISBN 1-85109-409-1
Tucker WH (2009). The Cattell Controversy: Race, Science, and Ideology. University of Illinois Press.

References
[1] ISAR Advisor Council (http://web.archive.org/web/20060207194059/http://www.ferris.edu/isar/avc.htm), retrieved February 7, 2006.
[2] University of Illinois Press: "Winner of the Anisfield-Wolf Award, 1995. Winner of the Ralph J. Bunche Award, American Political Science Association, 1995. Outstanding Book from the Gustavus Myers Center for the Study of Human Rights in North America."
[3] http://www.press.uillinois.edu/s96/tucker.html
[4] http://www.press.uillinois.edu/epub/books/tucker/toc.html

External links
Bill Tucker homepage (http://crab.rutgers.edu/~btucker/home.html) via Rutgers University.
Does Science Offer Support for Racial Separation? (http://www.ferris.edu/isar/bios/cattell/HPPB/science.htm) via Institute for the Study of Academic Racism.


Validity (statistics)
In science and statistics, validity is the extent to which a concept, conclusion or measurement is well-founded and corresponds accurately to the real world. The word "valid" is derived from the Latin validus, meaning strong. The validity of a measurement tool (for example, a test in education) is considered to be the degree to which the tool measures what it claims to measure. In psychometrics, validity has a particular application known as test validity: "the degree to which evidence and theory support the interpretations of test scores" ("as entailed by proposed uses of tests").[1]

In the area of scientific research design and experimentation, validity refers to whether a study is able to scientifically answer the questions it is intended to answer. In clinical fields, the assessment of the validity of a diagnosis and of various diagnostic tests is extremely important. Because diagnosis shapes treatments, medications, and the patient's life, clinicians need to know that their diagnostic tests are truly testing what they intend to test. It is generally accepted that the concept of scientific validity addresses the nature of reality, and as such is an epistemological and philosophical issue as well as a question of measurement. The use of the term in logic is narrower, relating to the truth of inferences made from premises.

Validity is important because it can help determine what types of tests to use, and help to make sure researchers are using methods that are not only ethical and cost-effective, but that also truly measure the idea or construct in question.

Test validity
Reliability (consistency) and validity (accuracy)
Validity of an assessment is the degree to which it measures what it is supposed to measure. This is not the same as reliability, which is the extent to which a measurement gives results that are consistent. Within validity, the measurement does not always have to be similar, as it does in reliability. When a measure is both valid and reliable, the results appear as in the accompanying figure. However, just because a measure is reliable, it is not necessarily valid (and vice versa). Validity is also dependent on the measurement measuring what it was designed to measure, and not something else instead.[2] Validity (like reliability) is a matter of degree; it is not an all-or-nothing property.

Figure: Validity & Reliability.

There are many different types of validity. An early definition of test validity identified it with the degree of correlation between the test and a criterion. Under this definition, one can show that the reliability of the test and the criterion places an upper limit on the possible correlation between them (the so-called validity coefficient). Intuitively, this reflects the fact that reliability involves freedom from random error and random errors do not correlate with one another. Thus, the less random error in the variables, the higher the possible correlation between them. Under these definitions, a test cannot have high validity unless it also has high reliability. However, the concept of validity has expanded substantially beyond this early definition, and the classical relationship between reliability and validity need not hold for alternative conceptions of reliability and validity.

Within classical test theory, predictive or concurrent validity (the correlation between the predictor and the predicted) cannot exceed the square root of the correlation between two versions of the same measure; that is, reliability limits validity.
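In the notation of classical test theory, this bound follows from the standard correction-for-attenuation identity (a textbook result, not specific to this article). Writing $r_{xy}$ for the observed validity coefficient and $r_{xx'}$, $r_{yy'}$ for the reliabilities of the test and the criterion:

$r_{x_t y_t} = \dfrac{r_{xy}}{\sqrt{r_{xx'}\, r_{yy'}}} \leq 1 \quad \Rightarrow \quad r_{xy} \leq \sqrt{r_{xx'}\, r_{yy'}} \leq \sqrt{r_{xx'}}.$

So, for example, a test with reliability .81 cannot correlate with any criterion by more than .90.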


Construct validity
Construct validity refers to the extent to which operationalizations of a construct (i.e., practical tests developed from a theory) do actually measure what the theory says they do. For example, to what extent is an IQ questionnaire actually measuring "intelligence"? Construct validity evidence involves the empirical and theoretical support for the interpretation of the construct. Such lines of evidence include statistical analyses of the internal structure of the test, including the relationships between responses to different test items. They also include relationships between the test and measures of other constructs. As currently understood, construct validity is not distinct from the support for the substantive theory of the construct that the test is designed to measure. As such, experiments designed to reveal aspects of the causal role of the construct also contribute to construct validity evidence.

Convergent validity
Convergent validity refers to the degree to which a measure is correlated with other measures that it is theoretically predicted to correlate with.

Content validity
Content validity is a non-statistical type of validity that involves "the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured" (Anastasi & Urbina, 1997, p. 114). For example, does an IQ questionnaire have items covering all areas of intelligence discussed in the scientific literature?

Content validity evidence involves the degree to which the content of the test matches a content domain associated with the construct. For example, a test of the ability to add two numbers should include a range of combinations of digits. A test with only one-digit numbers, or only even numbers, would not have good coverage of the content domain. Content-related evidence typically involves subject matter experts (SMEs) evaluating test items against the test specifications.

A test has content validity built into it by careful selection of which items to include (Anastasi & Urbina, 1997). Items are chosen so that they comply with the test specification, which is drawn up through a thorough examination of the subject domain. Foxcroft, Paterson, le Roux & Herbst (2004, p. 49)[3] note that by using a panel of experts to review the test specifications and the selection of items, the content validity of a test can be improved. The experts will be able to review the items and comment on whether the items cover a representative sample of the behaviour domain.

Representation validity
Representation validity, also known as translation validity, is about the extent to which an abstract theoretical construct can be turned into a specific practical test.

Face validity
Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee that the test actually measures phenomena in that domain. Measures may have high validity, but when the test does not appear to be measuring what it is, it has low face validity. Indeed, when a test is subject to faking (malingering), low face validity might make the test more valid. Considering that one may get more honest answers with lower face validity, it is sometimes important to make it appear as though there is low face validity whilst administering the measures.

Face validity is very closely related to content validity. While content validity depends on a theoretical basis for assuming whether a test is assessing all domains of a certain criterion (e.g. does assessing addition skills yield a good measure of mathematical skills? To answer this you have to know what different kinds of arithmetic skills mathematical skills include), face validity relates to whether a test appears to be a good measure or not. This judgment is made on the "face" of the test; thus it can also be made by an amateur. Face validity is a starting point, but should never be assumed to be provably valid for any given purpose, as the "experts" have been wrong before: the Malleus Maleficarum (Hammer of Witches) had no support for its conclusions other than the self-imagined competence of two "experts" in "witchcraft detection", yet it was used as a "test" to condemn and burn at the stake tens of thousands of women as "witches".[4]


Criterion validity
Criterion validity evidence involves the correlation between the test and a criterion variable (or variables) taken as representative of the construct. In other words, it compares the test with other measures or outcomes (the criteria) already held to be valid. For example, employee selection tests are often validated against measures of job performance (the criterion), and IQ tests are often validated against measures of academic performance (the criterion). If the test data and criterion data are collected at the same time, this is referred to as concurrent validity evidence. If the test data are collected first in order to predict criterion data collected at a later point in time, then this is referred to as predictive validity evidence.

Concurrent validity
Concurrent validity refers to the degree to which the operationalization correlates with other measures of the same construct that are measured at the same time. When the measure is compared to another measure of the same type, they will be related (or correlated). Returning to the selection test example, this would mean that the tests are administered to current employees and then correlated with their scores on performance reviews.

Predictive validity
Predictive validity refers to the degree to which the operationalization can predict (or correlate with) other measures of the same construct that are measured at some time in the future. Again, with the selection test example, this would mean that the tests are administered to applicants, all applicants are hired, their performance is reviewed at a later time, and then their scores on the two measures are correlated. This is also when a measurement predicts a relationship between what is being measured and something else, predicting whether or not the other thing will happen in the future. This type of validity is important from a public-view standpoint: does the measure look acceptable to the public or not?

Experimental validity
The validity of the design of experimental research studies is a fundamental part of the scientific method, and a concern of research ethics. Without a valid design, valid scientific conclusions cannot be drawn.

Statistical conclusion validity


Statistical conclusion validity is the degree to which conclusions about the relationship among variables based on the data are correct or reasonable. This began as being solely about whether the statistical conclusion about the relationship of the variables was correct, but there has since been a movement toward reasonable conclusions that draw on quantitative, statistical, and qualitative data.[5]

Statistical conclusion validity involves ensuring the use of adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures. As this type of validity is concerned solely with the relationship that is found among variables, the relationship may be solely a correlation.


Internal validity
Internal validity is an inductive estimate of the degree to which conclusions about causal relationships can be made (e.g. cause and effect), based on the measures used, the research setting, and the whole research design. Good experimental techniques, in which the effect of an independent variable on a dependent variable is studied under highly controlled conditions, usually allow for higher degrees of internal validity than, for example, single-case designs.

Eight kinds of confounding variable can interfere with internal validity (i.e. with the attempt to isolate causal relationships):
1. History, the specific events occurring between the first and second measurements in addition to the experimental variables
2. Maturation, processes within the participants as a function of the passage of time (not specific to particular events), e.g., growing older, hungrier, more tired, and so on
3. Testing, the effects of taking a test upon the scores of a second testing
4. Instrumentation, changes in calibration of a measurement tool or changes in the observers or scorers, which may produce changes in the obtained measurements
5. Statistical regression, operating where groups have been selected on the basis of their extreme scores
6. Selection, biases resulting from differential selection of respondents for the comparison groups
7. Experimental mortality, or differential loss of respondents from the comparison groups
8. Selection-maturation interaction and similar interactions, e.g. in multiple-group quasi-experimental designs

External validity
External validity concerns the extent to which the (internally valid) results of a study can be held to be true for other cases, for example for different people, places or times. In other words, it is about whether findings can be validly generalized. If the same research study were conducted in those other cases, would it get the same results? A major factor in this is whether the study sample (e.g. the research participants) is representative of the general population along relevant dimensions. Other factors jeopardizing external validity are:
1. Reactive or interaction effects of testing, where a pretest might increase the scores on a posttest
2. Interaction effects of selection biases and the experimental variable
3. Reactive effects of experimental arrangements, which would preclude generalization about the effect of the experimental variable upon persons exposed to it in non-experimental settings
4. Multiple-treatment interference, where effects of earlier treatments are not erasable

Ecological validity
Ecological validity is the extent to which research results can be applied to real-life situations outside of research settings. This issue is closely related to external validity but covers the question of to what degree experimental findings mirror what can be observed in the real world (ecology = the science of interaction between an organism and its environment). To be ecologically valid, the methods, materials and setting of a study must approximate the real-life situation that is under investigation.

Ecological validity is partly related to the issue of experiment versus observation. Typically in science there are two domains of research: observational (passive) and experimental (active). The purpose of experimental designs is to test causality, so that you can infer that A causes B or that B causes A. But sometimes ethical and/or methodological restrictions prevent you from conducting an experiment (e.g. how does isolation influence a child's cognitive

functioning?). Then you can still do research, but it is not causal, it is correlational: you can only conclude that A occurs together with B. Both techniques have their strengths and weaknesses.

Relationship to internal validity
At first glance, internal and external validity seem to contradict each other: to get an experimental design you have to control for all interfering variables, which is why experiments are often conducted in a laboratory setting. While gaining internal validity (excluding interfering variables by keeping them constant), you lose ecological or external validity because you establish an artificial laboratory setting. On the other hand, with observational research you cannot control for interfering variables (low internal validity), but you can measure in the natural (ecological) environment, at the place where behavior normally occurs. However, in doing so, you sacrifice internal validity.

The apparent contradiction between internal validity and external validity is, however, only superficial. The question of whether results from a particular study generalize to other people, places or times arises only when one follows an inductivist research strategy. If the goal of a study is to deductively test a theory, one is only concerned with factors which might undermine the rigor of the study, i.e. threats to internal validity.


Diagnostic validity
In clinical fields such as medicine, the validity of a diagnosis, and of associated diagnostic tests or screening tests, may be assessed. In regard to tests, the validity issues may be examined in the same way as for psychometric tests as outlined above, but there are often particular applications and priorities. In laboratory work, the medical validity of a scientific finding has been defined as the 'degree of achieving the objective' - namely, of answering the question which the physician asks.[6] Important requirements in clinical diagnosis and testing are sensitivity and specificity: a test needs to be sensitive enough to detect the relevant problem if it is present (and therefore avoid too many false negative results), but specific enough not to respond to other things (and therefore avoid too many false positive results).[7] In psychiatry there is a particular issue with assessing the validity of the diagnostic categories themselves. In this context:[] content validity may refer to symptoms and diagnostic criteria; concurrent validity may be defined by various correlates or markers, and perhaps also treatment response; predictive validity may refer mainly to diagnostic stability over time; and discriminant validity may involve delimitation from other disorders.
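To make the sensitivity and specificity requirement concrete, the following is a minimal illustrative sketch (it is not from the source article, and the counts used are invented for the example).

# Illustrative sketch only: computing sensitivity and specificity
# from a 2x2 confusion table. The counts below are made-up numbers.

def sensitivity(true_pos, false_neg):
    # proportion of people with the condition that the test detects
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    # proportion of people without the condition that the test clears
    return true_neg / (true_neg + false_pos)

# hypothetical screening results: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives
print(sensitivity(80, 20))   # 0.8 -> relatively few missed cases (false negatives)
print(specificity(90, 10))   # 0.9 -> relatively few false alarms (false positives)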

Robins and Guze proposed in 1970 what were to become influential formal criteria for establishing the validity of psychiatric diagnoses. They listed five criteria:[]
1. distinct clinical description (including symptom profiles, demographic characteristics, and typical precipitants)
2. laboratory studies (including psychological tests, radiology and postmortem findings)
3. delimitation from other disorders (by means of exclusion criteria)
4. follow-up studies showing a characteristic course (including evidence of diagnostic stability)
5. family studies showing familial clustering

These were incorporated into the Feighner Criteria and Research Diagnostic Criteria that have since formed the basis of the DSM and ICD classification systems. Kendler in 1980 distinguished between:[]
antecedent validators (familial aggregation, premorbid personality, and precipitating factors)
concurrent validators (including psychological tests)
predictive validators (diagnostic consistency over time, rates of relapse and recovery, and response to treatment)

Nancy Andreasen (1995) listed several additional validators - molecular genetics and molecular biology, neurochemistry, neuroanatomy, neurophysiology, and cognitive neuroscience - that are all potentially capable of linking symptoms and diagnoses to their neural substrates.[] Kendell and Jablensky (2003) emphasized the importance of distinguishing between validity and utility, and argued that diagnostic categories defined by their syndromes should be regarded as valid only if they have been shown to be discrete entities with natural boundaries that separate them from other disorders.[] Kendler (2006) emphasized that to be useful, a validating criterion must be sensitive enough to validate most syndromes that are true disorders, while also being specific enough to invalidate most syndromes that are not true disorders. On this basis, he argues that a Robins and Guze criterion of "runs in the family" is inadequately specific, because most human psychological and physical traits would qualify - for example, an arbitrary syndrome comprising a mixture of "height over 6 ft, red hair, and a large nose" will be found to "run in families" and be "hereditary", but this should not be considered evidence that it is a disorder. Kendler has further suggested that "essentialist" gene models of psychiatric disorders, and the hope that we will be able to validate categorical psychiatric diagnoses by "carving nature at its joints" solely as a result of gene discovery, are implausible.[8]
In the United States federal court system, the validity and reliability of evidence are evaluated using the Daubert Standard: see Daubert v. Merrell Dow Pharmaceuticals. Perri and Lichtenwald (2010) provide a starting point for a discussion about a wide range of reliability and validity topics in their analysis of a wrongful murder conviction.[9]



External links
Cronbach, L. J.; Meehl, P. E. (1955). "Construct validity in psychological tests" (http://psychclassics.yorku.ca/Cronbach/construct.htm). Psychological Bulletin 52 (4): 281-302. doi: 10.1037/h0040957 (http://dx.doi.org/10.1037/h0040957). PMID 13245896 (http://www.ncbi.nlm.nih.gov/pubmed/13245896).



Values scales
Values scales are psychological inventories used to determine the values that people endorse in their lives. They facilitate the understanding of both work and general values that individuals uphold. In addition, they assess the importance of each value in people's lives and how the individual strives toward fulfillment through work and other life roles, such as parenting.[1] Most scales have been normalized and can therefore be used cross-culturally for vocational, marketing, and counseling purposes, yielding unbiased results.[2] Values scales are used by psychologists, political scientists, economists, and others interested in defining values, determining what people value, and evaluating the ultimate function or purpose of values.[3]

Development
Values scales were first developed by an international group of psychologists whose goal was to create a unique self-report instrument that measured intrinsic and extrinsic values for use in the lab and in the clinic. The psychologists called their project the Work Importance Study (WIS). The original values scale measured the following values, listed in alphabetical order: ability utilization, achievement, advancement, aesthetics, altruism, authority, autonomy, creativity, cultural identity, economic rewards, economic security, life style, personal development, physical activity, physical prowess, prestige, risk, social interaction, social relations, variety, and working conditions. Some of the listed values were intended to be inter-related, but conceptually differentiable.[1] Since the original Work Importance Study, several scientists have supplemented the study by creating their own scale or by deriving and improving the original format. Theorists and psychologists often study values, values scales, and the field surrounding values, otherwise known as axiology.[4] More recent studies continue to update work in the field: Dr. Eda Gurel-Atay published an article in the Journal of Advertising Research in March 2010 providing a glimpse into how social values changed between 1976 and 2007. The paper explained how self-respect has been on the upswing, while a sense of belonging has become less important to individuals.[5]

Contributing Scientists
Rokeach
According to Milton Rokeach, a prominent social psychologist, human values are defined as core conceptions of the desirable within every individual and society. They serve as standards or criteria to guide not only action but also judgment, choice, attitude, evaluation, argument, exhortation, rationalization, and attribution of causality.[6] In his 1979 publication, Rokeach also stated that the consequences of human values would be manifested in all phenomena that social scientists might consider worth investigating. In order for any type of research to be successful, regardless of the field of study, people's underlying values needed to be understood. To allow for this, Rokeach created the Rokeach Value Survey (RVS), which has been in use for more than 30 years. It provides a theoretical perspective on the nature of values in a cognitive framework and consists of two sets of values: 18 instrumental and 18 terminal.[7] Instrumental values are beliefs or conceptions about desirable modes of behavior that are instrumental to the attainment of desirable end points, such as honesty, responsibility, and capability. Terminal values are beliefs or conceptions about ultimate goals of existence that are worth striving for, such as happiness, self-respect, and freedom.[8] The value survey asks subjects to rank the values in order of importance to them.[7] The actual directions are as follows: "Rank each value in its order of importance to you. Study the list and think of how much each value may act as a guiding principle in your life."[9] The Rokeach Value Survey has been criticized because people are often not able to rank each value clearly. Some values may be equally important, while others may be equally unimportant, and so on. Presumably, people are more certain of their most extreme values (i.e. what they love and what they hate) and are less certain of the ones in between. Further, C. J. Clawson and Donald E. Vinson (1977) showed that the Rokeach Value Survey omitted a number of values that a large portion of the population holds.[7]



Schwartz
Shalom H. Schwartz, social psychologist and author of The Structure of Human Values: Origins and Implications and Theory of Basic Human Values, has done research on universal values and how they exist in a wide variety of contexts.[10] Most of his work addresses broad questions about values, such as: how are individuals' priorities affected by social experiences? How do individuals' priorities influence their behavior and choices? And how do value priorities influence ideologies, attitudes, and actions in political, religious, environmental, and other domains? Through his studies, Schwartz concluded that ten types of universal values exist: achievement, benevolence, conformity, hedonism, power, security, self-direction, stimulation, tradition, and universalism. Schwartz also tested the possibility of spirituality as an eleventh universal value, but found that it did not exist in all cultures.[11] Schwartz's value theory and instruments are part of the biennial European Social Survey.


Allport-Vernon-Lindzey
Gordon Allport, a student of German philosopher and psychologist Eduard Spranger,[12] believed that an individual's philosophy is founded upon the values or basic convictions that he holds about what is and is not important in life.[13] Based on Spranger's (1928) view that understanding the individual's value philosophy best captures the essence of a person, Allport and his colleagues, Vernon and Lindzey, created the Allport-Vernon-Lindzey Study of Values. The values scale outlined six major value types: theoretical (discovery of truth), economic (what is most useful), aesthetic (form, beauty, and harmony), social (seeking love of people), political (power), and religious (unity). Forty years after the study's publication in 1931, it was the third most-cited non-projective personality measure.[4] By 1980, the values scale had fallen into disuse due to its archaic content, lack of religious inclusiveness, and dated language. Richard E. Kopelman, et al., recently updated the Allport-Vernon-Lindzey Study of Values. The motivation behind their update was to make the values scale more relevant to today; they believed that the writing was too dated. The updated, copyrighted version was published in 2003 in the Journal of Vocational Behavior (volume 62), an Elsevier Science journal. Today, permission is required for its use.[4]

Hartman
Philosopher Robert S. Hartman, creator of the Science of Value, introduced and identified the concept of systematic values, which he believed were an important addition to the previously studied intrinsic and extrinsic values. He also made an illuminating distinction between what people value and how people value. How people value parallels very closely with systematic values, which Hartman operationally defined as conceptual constructs or cognitive scripts that exist in people's minds. Ideals, norms, standards, rules, doctrines, and logic systems are all examples of systematic values. If someone's cognitive script is repetitively about violent actions, for instance, then that person is more likely to act vengefully and less likely to value peace. With that additional idea in mind, Hartman combined intrinsic, extrinsic, and systematic concepts to create the Hartman Value Profile, also known as the Hartman Value Inventory. The profile consists of two parts. Each part contains 18 paired value-combination items, where nine of these items are positive and nine are negative. The three different types of values - intrinsic, extrinsic, and systematic - can be combined positively or negatively with one another in 18 logically possible ways. Depending on the combination, a certain value is either enhanced or diminished. Once the rankings are completed, the outcome is then compared to the theoretical norm, generating scores for psychological interpretation.[13]



Applications to Psychology
Research surrounding the understanding of values serves as a framework for ideas in many other situations, such as counseling. Psychotherapists, behavioral scientists, and social scientists often deal with the intrinsic, extrinsic, and systematic values of their patients.[14] A primary way to learn about patients is to know what they value, as values are essential keys to personality structures. This knowledge can pinpoint serious problems in living, aid immensely in planning therapeutic regimens, and measure therapeutic progress through applications of values scales over time, especially as social environments and social norms change.[13]

Applications to Business and Marketing


Values are important in the construction of personal morality and as a basis for living life.[2] Recent literature suggests that social values are reflected in a large variety of advertisements and can influence audience reactions to advertising appeals.[15] When a choice is tied to a value, that choice then becomes more attractive to people who share that value. Means-end chain analyses often find that consumers select products with attributes that deliver consequences, which in turn contribute to value fulfillment. In short, people's values resonate in and are observable throughout their daily lives.[7] An illustrative example, presented in the Journal of Advertising Research by Dr. Eda Gurel-Atay, is coffee. People who endorse fun and enjoyment in life may want a cup of coffee for its rich, pleasant taste. Meanwhile, people who value a sense of accomplishment may instead use coffee as a mild stimulant, and people who value warm, loving relationships with others may want a cup of coffee to share in a social manner. Perspective and personal beliefs greatly influence behavior.[5] Clawson and Vinson (1978) further elaborated on this idea by explaining how values are one of the most powerful explanations of, and influences on, consumer behavior.[7] Values scales are helpful in understanding several aspects of consumption areas and consumer behavior, including leisure, media, and gift giving. People who endorse certain values more highly than others engage in certain activities, prefer certain programs or magazines, or give gifts differently than others. Values scales and the study of values could also be of interest to companies looking to build or strengthen their customer relationship management.

References
[1] Super, Donald and Dorothy D. Nevill. Brief Description of Purpose and Nature of Test. Consulting Psychologists Press. 1989: 3-10. Print.
[2] Beatty, Sharon E., et al. Alternative Measurement Approaches to Consumer Values: The List of Values and the Rokeach Value Survey. Psychology and Marketing. 1985: 181-200. Web.
[3] Johnston, Charles S. The Rokeach Value Survey: Underlying Structure and Multidimensional Scaling. The Journal of Psychology. 1995: 583-597. Print.
[4] Kopelman, Richard E., et al. The Study of Values: Construction of the fourth edition. Journal of Vocational Behavior. 2003: 203-220. Print.
[5] Gurel-Atay, Eda. Changes in Social Values in the United States: 1976-2007, Self-Respect is on the Upswing as A Sense of Belonging Becomes Less Important. Journal of Advertising Research. 2010: 57-67. Print.
[6] Rokeach, M. The Nature of Human Values. NY: Free Press. 1979.
[7] Beatty, Sharon E., et al. Alternative Measurement Approaches to Consumer Values: The List of Values and the Rokeach Value Survey. Psychology and Marketing. 1985: 181-200. Web.
[8] Piirto, Jane. I Live in My Own Bubble: The Values of Talented Adolescents. The Journal of Secondary Gifted Education. 2005: 106-118. Web.
[9] Rokeach, M. The Nature of Human Values. NY: Free Press. 1979.
[10] Schwartz, S. H. "Are There Universal Aspects in the Content and Structure of Values?" Journal of Social Issues. 1994: 19-45. Print.
[11] Schwartz, Shalom H. Universals in the Content and Structure of Values: Theoretical Advances and Empirical Tests in 20 Countries. Advances in Experimental Psychology. 1992: 1-65. Print.
[12] Allport, G. W. Becoming: Basic Considerations for a Psychology of Personality. Yale University Press. 1955. Web.
[13] Pomeroy, Leon and Rem B. Edwards. The New Science of Axiological Psychology. New York, NY: 2005. Print.
[14] Hills, M. D. Kluckhohn and Strodtbeck's Values Orientation Theory. Online Readings in Psychology and Culture. 2002. Web.
[15] Piirto, Jane. I Live in My Own Bubble: The Values of Talented Adolescents. The Journal of Secondary Gifted Education. 2005: 106-118. Web.



Vestibulo emotional reflex


Vestibulo-emotional reflex (VER) is a reflexive 3D head movement that stabilizes the vertical equilibrium of the head by producing head-neck muscle movements whose frequency depends on the emotional and psychophysiological state of a person. The vestibulo-emotional reflex[1] is one of the vestibular reflexes linking human physiology and emotions. The physiology and pathology of the vestibular system and vestibular apparatus were researched by Robert Bárány, who received the 1914 Nobel Prize in Physiology or Medicine. The vestibular system, which contributes to human balance and our sense of spatial orientation, is the sensory system that provides the dominant input about movement and equilibrioception. The vertical position of the human head is controlled by the vestibular system by means of the head-neck anatomy.

Biomechanics
A two-month-old child begins to hold its head in a vertical position at the reflex level, at first making visible movements to do so. An adult also performs micromovements to maintain a vertical head position, because it is impossible to keep a heavy object in vertical mechanical balance without movement. The trajectory of 3D head movement is quite complicated[2][3] and is used in research on various vestibular reflexes and in human health diagnostics, because the vestibular system is linked with the sensory system, the nervous system and every part of the human body. Sensory systems code for four aspects of a stimulus: type, intensity, location, and duration. Certain receptors are sensitive to certain types of stimuli (for example, different mechanoreceptors respond best to different kinds of touch stimuli, like sharp or blunt objects). Receptors send impulses in certain patterns to convey information about the intensity of a stimulus (for example, how loud a sound is). The Russian neurophysiologist Nikolai Bernstein devoted most of his life to the physiology of movement. He also coined the term biomechanics, the study of movement through the application of mechanical principles. The principles of biological feedback and discrete movement discovered by Bernstein form one of the bases of the VER, and his calculation of a discrete human movement time of about 0.1 s has been confirmed by video image analysis.

[Figure: VER model. The head moves slowly when a person is calm and still (white head image); it moves quickly and frequently when a person is active, aggressive, anxious or nervous (red head image).]

The vestibular system, like any sensory system, reacts to stimuli. Gravity, however, is a constantly acting stimulus, so vertical head coordination becomes a constantly operating, reflexive process. This is the main physiological difference between vertical head coordination and other sensory processes, which operate only intermittently. This difference places vertical head coordination among continuously measurable physiological processes such as heart rate (HR) measured by ECG, blood pressure, brain activity measured by electroencephalography (EEG), or thermoregulation measured by galvanic skin response (GSR). Biological evolution used vertical head coordination for energy regulation,[4] because natural head movement is an ideal vibration movement with a high energy range. Another example of a natural vibration process used for energy regulation is a dog wagging its tail; humans have no tail, and head movement serves this role instead. Understandably, higher-frequency head movement requires more energy than lower-frequency movement. At the sensory level, this means that the signals sent from the vestibular receptors to the autonomic nervous system, brain and muscles travel with different time delays, depending on the biochemical state of the person. That implies a dependence between emotional state and vestibular head coordination - the vestibulo-emotional reflex.


[Figure: Head-movement vestibulogram signal captured by a low-noise web camera at 640x480 resolution and 30 frames per second.]

VER application
VER provides functional information about a person and could be applied to medical, eHealth, psychological and behavioral testing, lie detection, emotion control, self-regulation and fitness; research on animals is also supported by the different types of vibraimage systems.[5] A vibraimage system transforms biomechanical movement into emotional and physiological data about a person by video image processing. This process can be remote and hidden from the user, which is important for security applications such as aviation security.

References
[Figure: Facial vibraimage with frequency scale]



Visual analogue scale


A visual analogue scale (VAS) is a psychometric response scale which can be used in questionnaires. It is a measurement instrument for subjective characteristics or attitudes that cannot be directly measured. When responding to a VAS item, respondents specify their level of agreement with a statement by indicating a position along a continuous line between two end-points. This continuous (or "analogue") aspect of the scale differentiates it from discrete scales such as the Likert scale. There is evidence showing that visual analogue scales have metrical characteristics superior to those of discrete scales, so a wider range of statistical methods can be applied to the measurements.[1] In practice, computer-analysed VAS responses may be measured using discrete values due to the discrete nature of computer displays. The VAS can be compared to other linear scales such as the Likert scale or Borg scale. The sensitivity and reproducibility of the results are broadly very similar, although the VAS may outperform the other scales in some cases.[2][3] Recent advances in methodologies for Internet-based research[4] include the development and evaluation of visual analogue scales for use in Internet-based questionnaires.[5]
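As a minimal illustration only (not from the source), a computer-administered VAS is typically scored by converting the marked position on the line into a number on a fixed range; the 0-100 range and the pixel coordinates below are assumptions made for the example.

# Minimal sketch: converting a mark on an on-screen VAS line into a score.
# The 0-100 scoring range and pixel coordinates are illustrative assumptions.

def vas_score(click_x, line_start_x, line_end_x, scale_max=100.0):
    # clamp the click to the line, then map its relative position to 0..scale_max
    click_x = min(max(click_x, line_start_x), line_end_x)
    fraction = (click_x - line_start_x) / (line_end_x - line_start_x)
    return round(fraction * scale_max, 1)

# a line drawn from pixel 100 to pixel 500, clicked at pixel 350
print(vas_score(350, 100, 500))  # 62.5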

External links
A description and example of a VAS scale [6]
VAS Generator - a free Web service to create VASs for computerized questionnaires [7]

References
[1] U.-D. Reips and F. Funke (2008) Interval level measurement with visual analogue scales in Internet-based research: VAS Generator.
[2] S. Grant, T. Aitchison, E. Henderson, J. Christie, S. Zare, J. McMurray, and H. Dargie (1999) A comparison of the reproducibility and the sensitivity to change of visual analogue scales, Borg scales, and Likert scales in normal subjects during submaximal exercise.
[3] U.-D. Reips and F. Funke (2008) Interval level measurement with visual analogue scales in Internet-based research: VAS Generator.
[4] U.-D. Reips (2006) Web-based methods. In M. Eid & E. Diener (Eds.), Handbook of multimethod measurement in psychology (pp. 73-85). Washington, DC: American Psychological Association.
[5] U.-D. Reips and F. Funke (2008) Interval level measurement with visual analogue scales in Internet-based research: VAS Generator.
[6] http://www.cebp.nl/vault_public/filesystem/?ID=1478
[7] http://vasgenerator.net



Youth Outcome Questionnaire


The Youth Outcome Questionnaire is a collection of questions designed to collect data regarding the effectiveness of youth therapies.[1] The Y-OQ is a parent-report measure of treatment progress for children and adolescents (ages 4-17) receiving mental health interventions. The Y-OQ-SR is an adolescent self-report measure appropriate for ages 12-18.[2] The psychometric properties of the Youth Outcome Questionnaire Self-Report version (Y-OQ-SR) were examined by a group of researchers at Brigham Young University. They reported a favorable analysis in terms of internal consistency, test-retest reliability, and concurrent validity, and report it to be a valid and reliable self-report measure of psychosocial distress in youth psychotherapy research.[3] The Y-OQ-SR is backed by data based on large samples of youth who were carefully chosen to be representative of both clinical and normal populations.[4] Higher scores indicate greater dysfunction: patients in psychiatric hospitals score about 100, those in outpatient treatment average about 78, and the normal population scores less than 47.[5] The Y-OQ measures six subscales:
Interpersonal Distress (ID) - anxiety, depression, fearfulness, etc.
Somatic (S) - headache, stomach, bowel, dizziness, etc.
Interpersonal Relationships (IR) - attitude, communication and interaction with parents, adults, and peers
Critical Items (CI) - paranoid ideation, suicide, hallucinations, delusions, etc.
Social Problems (SP) - delinquent or aggressive behaviors, breaking social mores
Behavioral Dysfunction (BD) - ability to organize and complete tasks, handle frustration, impulsivity, inattention

The subscale scores can be used to identify and target particularly problematic areas as a focus of treatment and to help with treatment planning.[] These questionnaires have been used in outcome studies for individual teen programs and for groups of therapeutic boarding schools and adventure therapy or wilderness therapy programs. One such study, involving 993 students from 9 schools, was presented at the 114th Annual Convention of the American Psychological Association.[4] Another study from 2001, involving 858 children and their families enrolled in a group of seven wilderness therapy programs for a full year, has been published by the University of Idaho.[6][5]

References
[1] OBHIC Research: "The Youth Outcome Questionnaire," http://www.obhic.com/research/does-wilderness-treatment-work.html. Outdoor Behavioral Healthcare Research Cooperative (OBHRC) at the University of Idaho
[2] http://www.masspartnership.com/provider/outcomesmanagement/Outcomesfiles/Tools/YOQ.pdf
[3] Ridge NW, Warren JS, Burlingame GM, Wells MG, Tumblin KM, Reliability and validity of the youth outcome questionnaire self-report, Brigham Young University (http://www.ncbi.nlm.nih.gov/pubmed/19693961)
[4] Ellen Behrens and Kristin Satterfield, Report of Findings from a Multi-Center Study of Youth Outcomes in Private Residential Treatment (http://www.strugglingteens.com/news/APAReport81206.pdf), presented at the 114th Annual Convention of the American Psychological Association, New Orleans, Louisiana, August 2006
[5] http://www.obhic.com/research/does-wilderness-treatment-work.htm Does Wilderness Treatment Work?
[6] Keith C. Russell, Ph.D., Assessment of Treatment Outcomes in Outdoor Behavioral Healthcare (http://www.cnr.uidaho.edu/wrc/Pdf/Tech_Report_27Final.pdf), University of Idaho-Wilderness Research Center



Attribute Hierarchy Method


The Attribute Hierarchy Method (AHM) is a cognitively based psychometric procedure developed by Jacqueline Leighton, Mark Gierl, and Steve Hunka at the Centre for Research in Applied Measurement and Evaluation [1] (CRAME) at the University of Alberta. The AHM is one form of cognitive diagnostic assessment that aims to integrate cognitive psychology with educational measurement for the purposes of enhancing instruction and student learning.[2] A cognitive diagnostic assessment (CDA) is designed to measure specific knowledge states and cognitive processing skills in a given domain. The results of a CDA yield a profile of scores with detailed information about a student's cognitive strengths and weaknesses. This cognitive diagnostic feedback has the potential to guide instructors, parents and students in their teaching and learning processes. To generate a diagnostic skill profile, examinees' test item responses are classified into a set of structured attribute patterns that are derived from components of a cognitive model of task performance. The cognitive model contains attributes, which are defined as descriptions of the procedural or declarative knowledge needed by an examinee to answer a given test item correctly.[2] The inter-relationships among the attributes are represented using a hierarchical structure so the ordering of the cognitive skills is specified. This model provides a framework for designing diagnostic items based on attributes, which links examinees' test performance to specific inferences about examinees' knowledge and skills.

Differences between the AHM and the Rule Space Method


The AHM differs from Tatsuoka's Rule Space Method (RSM) [3] in its assumption of dependencies among the attributes within the cognitive model. In other words, the AHM was derived from the RSM by assuming that some or all skills may be represented in hierarchical order. Modeling cognitive attributes using the AHM necessitates the specification of a hierarchy outlining the dependencies among the attributes. As such, the attribute hierarchy serves as a cognitive model of task performance designed to represent the inter-related cognitive processes required by examinees to solve test items. This assumption better reflects the characteristics of human cognition, because cognitive processes usually do not work in isolation but function within a network of interrelated competencies and skills.[4] In contrast, the RSM makes no assumptions regarding the dependencies among the attributes. This difference has led to the development of both IRT and non-IRT based psychometric procedures for analyzing test item responses using the AHM. The AHM also differs from the RSM with respect to the identification of the cognitive attributes and the logic underlying the diagnostic inferences made from the statistical analysis.[5]

Identification of the cognitive attributes


The RSM uses a post-hoc approach to the identification of the attributes required to successfully solve each item on an existing test. In contrast, the AHM uses an a priori approach to identifying the attributes and specifying their interrelationships in a cognitive model.

Diagnostic inferences from statistical analysis


The RSM uses statistical pattern classification, in which examinees' observed response patterns are matched to pre-determined response patterns that each correspond to a particular cognitive or knowledge state. Each state represents a set of correct and incorrect rules used to answer test items. The focus of the RSM is the identification of erroneous rules or misconceptions. The AHM, on the other hand, uses statistical pattern recognition, in which examinees' observed response patterns are compared to response patterns that are consistent with the attribute hierarchy. The purpose of statistical pattern recognition is to identify the attribute combinations that the examinee is likely to possess. Hence, the AHM does not identify incorrect rules or misconceptions as the RSM does.



Principled Test Design


The AHM uses a construct-centered approach to test development and analysis. A construct-centered approach emphasizes the central role of the construct in directing test development activities and analysis. The advantage of this approach is that the inferences made about student performance are firmly grounded in the specified construct. Principled test design [6] encompasses three broad stages:
1. cognitive model development
2. test development
3. psychometric analysis
Cognitive model development comprises the first stage in the test design process. During this stage, the cognitive knowledge, processes, and skills are identified and organized into an attribute hierarchy or cognitive model. This stage also encompasses validation of the cognitive model prior to the test development stage. Test development comprises the second stage in the test design process. During this stage, items are created to measure each attribute within the cognitive model while also maintaining any dependencies modeled among the attributes. Psychometric analysis comprises the third stage in the test design process. During this stage, the fit of the cognitive model relative to observed examinee responses is evaluated to ascertain the appropriateness of the model for explaining test performance. Examinee test item responses are then analyzed, and diagnostic skill profiles are created highlighting examinees' cognitive strengths and weaknesses.

Cognitive Model Development


What is a cognitive model?
An AHM analysis must begin with the specification of a cognitive model of task performance. A cognitive model in educational measurement refers to a simplified description of human problem solving on standardized educational tasks, which helps to characterize the knowledge and skills students at different levels of learning have acquired and to facilitate the explanation and prediction of students' performance.[7] These cognitive skills, conceptualized as attributes in the AHM framework, are specified at a small grain size in order to generate specific diagnostic inferences underlying test performance. Attributes include the different procedures, skills, and/or processes that an examinee must possess to solve a test item. These attributes are then structured using a hierarchy so the ordering of the cognitive skills is specified. The cognitive model can be represented by various hierarchical structures. Generally, there are four general forms of hierarchical structures that can easily be expanded and combined to form increasingly complex networks of hierarchies, where the cognitive complexity corresponds to the nature of the problem-solving task. The four hierarchical forms include: a) linear, b) convergent, c) divergent, and d) unstructured.
[Figure: Visual representation of the four general forms of hierarchical structures in a cognitive model]



How are cognitive models created and validated?


Theories of task performance can be used to derive cognitive models of task performance in a subject domain. However, the availability of such theories of task performance and cognitive models in education is limited. Therefore, other means are used to generate cognitive models. One method is a task analysis of representative test items from a subject domain. A task analysis represents a hypothesized cognitive model of task performance, in which the likely knowledge and processes used to solve the test item are specified. A second method involves having examinees think aloud as they solve test items to identify the actual knowledge, processes, and strategies elicited by the task.[8][9] The verbal report collected as examinees talk aloud can contain the relevant knowledge, skills, and procedures used to solve the test item. These knowledge, skills, and procedures become the attributes in the cognitive model, and their temporal sequencing documented in the verbal report provides the hierarchical ordering. A cognitive model derived using a task analysis can be validated and, if required, modified using examinee verbal reports collected from think-aloud studies.

Why is the accuracy of the cognitive model important?


An accurate cognitive model is crucial for two reasons. First, a cognitive model provides the interpretative framework for linking test score interpretations to cognitive skills. That is, the test developer is in a better position to make defensible claims about the student knowledge, skills, and processes that account for test performance. Second, a cognitive model provides a link between cognitive and learning psychology and instruction. Based on an examinee's observed response pattern, detailed feedback about the examinee's cognitive strengths and weaknesses can be provided through a score report. This diagnostic information can then be used to inform instruction tailored to the examinee, with the goals of improving or remediating specific cognitive skills.

An example of a cognitive model


The following hierarchy is an example of a cognitive model of task performance for the knowledge and skills in the areas of ratio, factoring, function, and substitution (called the Ratios and Algebra hierarchy).[10] This hierarchy is divergent and composed of nine attributes, which are described below. If the cognitive model is assumed to be true, then an examinee who has mastered attribute A3 is assumed to have mastered the attributes below it, namely attributes A1 and A2. Conversely, if an examinee has mastered attribute A2, then it is expected that the examinee has mastered attribute A1 but not A3.
A Demonstration of Attributes Required to Solve Items in the Ratios and Algebra Hierarchy
A1 - Represents the most basic arithmetic operation skills
A2 - Includes knowledge about the properties of factors
A3 - Involves the skills of applying the rules of factoring
A4 - Includes the skills required for substituting values into algebraic expressions
A5 - Represents the skills of mapping a graph of a familiar function with its corresponding function
A6 - Deals with the abstract properties of functions, such as recognizing the graphical representation of the relationship between independent and dependent variables
A7 - Requires the skills to substitute numbers into algebraic expressions
A8 - Represents the skills of advanced substitution: algebraic expressions, rather than numbers, need to be substituted into another algebraic expression
A9 - Relates to skills associated with rule understanding and application
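As a minimal sketch (not part of the source article), the prerequisite structure described above can be written down directly; the dictionary below simply encodes each attribute's direct prerequisite as stated in the text, and the helper collects all direct and indirect prerequisites.

# Minimal sketch of the Ratios and Algebra hierarchy described above.
# Each attribute maps to its direct prerequisite(s), as stated in the text.
direct_prerequisites = {
    "A1": [],            # basic arithmetic; no prerequisite
    "A2": ["A1"],
    "A3": ["A2"],
    "A4": ["A1"],
    "A5": ["A4"],
    "A6": ["A5"],
    "A7": ["A4"],
    "A8": ["A7"],
    "A9": ["A4"],
}

def all_prerequisites(attribute):
    # walk the hierarchy upward to collect every direct and indirect prerequisite
    collected = set()
    frontier = list(direct_prerequisites[attribute])
    while frontier:
        current = frontier.pop()
        if current not in collected:
            collected.add(current)
            frontier.extend(direct_prerequisites[current])
    return sorted(collected)

print(all_prerequisites("A8"))  # ['A1', 'A4', 'A7']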

The hierarchy contains two independent branches which share a common prerequisite attribute, A1. Aside from attribute A1, the first branch includes two additional attributes, A2 and A3, and the second branch includes a self-contained sub-hierarchy which includes attributes A4 through A9. Three independent branches compose the sub-hierarchy: attributes A4, A5, A6; attributes A4, A7, A8; and attributes A4, A9.

As a prerequisite attribute, attribute A1 includes the most basic arithmetic operation skills, such as addition, subtraction, multiplication, and division of numbers. Attributes A2 and A3 both deal with factors. In attribute A2, the examinee needs to have knowledge about the property of factors. In attribute A3, the examinee not only requires knowledge of factoring (i.e., attribute A2), but also the skills of applying the rules of factoring. Therefore, attribute A3 is considered a more advanced attribute than A2.

The self-contained sub-hierarchy contains six attributes. Among these attributes, attribute A4 is the prerequisite for all other attributes in the sub-hierarchy. Attribute A4 has attribute A1 as a prerequisite because A4 not only represents basic skills in arithmetic operations (i.e., attribute A1), but it also involves the substitution of values into algebraic expressions, which is more abstract and, therefore, more difficult than attribute A1.

The first branch in the sub-hierarchy deals mainly with functional graph reading. For attribute A5, the examinee must be able to map the graph of a familiar function with its corresponding function. In an item that requires attribute A5 (e.g., item 4), attribute A4 is typically required because the examinee must find random points in the graph and substitute the points into the equation of the function to find a match between the graph and the function. Attribute A6, on the other hand, deals with the abstract properties of functions, such as recognizing the graphical representation of the relationship between independent and dependent variables. The graphs for less familiar functions, such as a function of higher-power polynomials, may be involved. Therefore, attribute A6 is considered to be more difficult than attribute A5 and is placed below attribute A5 in the sub-hierarchy.

The second branch in the sub-hierarchy considers the skills associated with advanced substitution. Attribute A7 requires the examinee to substitute numbers into algebraic expressions. The complexity of attribute A7 relative to attribute A4 lies in the concurrent management of multiple pairs of numbers and multiple equations. Attribute A8 also represents the skills of advanced substitution. However, what makes attribute A8 more difficult than attribute A7 is that algebraic expressions, rather than numbers, need to be substituted into another algebraic expression. The last branch in the sub-hierarchy contains only one additional attribute, A9, related to skills associated with rule understanding and application. It is the rule, rather than the numeric value or the algebraic expression, that needs to be substituted in the item to reach a solution.

429

Cognitive model representation


The Ratio and Algebra attribute hierarchy can also be expressed in matrix form. To begin, the direct relationship among the attributes is specified by a binary adjacency matrix (A) of order (k,k), where k is the number of attributes, such that each element in the A matrix represents the absence (i.e., 0) or presence (i.e., 1) of a direct connection between two attributes. The A matrix for the Ratio and Algebra hierarchy presented is shown below.

Each row and column of the A matrix represents one attribute; the first row and column represent attribute A1 and the last row and column represent attribute A9. The presence of a 1 in a particular row denotes a direct connection between that attribute and the attribute corresponding to the column position. For example, attribute A1 is directly connected to attribute A2 because of the presence of a 1 in the first row (i.e., attribute A1) and the second column (i.e., attribute A2). The positions of 0 in row 1 indicate that A1 is neither directly connected to itself nor to attributes A3 and A5 to A9. The direct and indirect relationships among attributes are specified by the binary reachability matrix (R) of order (k,k), where k is the number of attributes. To obtain the R matrix from the A matrix, Boolean addition and multiplication operations are performed on the adjacency matrix, meaning R = (A + I)^n, where n is the integer required to reach invariance (that is, the smallest n for which (A + I)^n = (A + I)^(n+1) under Boolean arithmetic) and I is the identity matrix. The R matrix for the Ratio and Algebra hierarchy is shown next.
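The following short sketch is illustrative only (it is not code from the source): it builds the adjacency matrix A for the hierarchy described above and derives the reachability matrix R by repeated Boolean multiplication of (A + I) until the result stops changing.

# Illustrative sketch: adjacency matrix A and reachability matrix R
# for the Ratios and Algebra hierarchy, using Boolean matrix arithmetic.
import numpy as np

k = 9  # number of attributes A1..A9
A = np.zeros((k, k), dtype=bool)
# direct connections described in the text (1-based attribute labels)
for parent, child in [(1, 2), (2, 3), (1, 4), (4, 5), (5, 6), (4, 7), (7, 8), (4, 9)]:
    A[parent - 1, child - 1] = True

def boolean_reachability(adjacency):
    # R = (A + I)^n under Boolean arithmetic, multiplied until invariant
    current = adjacency | np.eye(len(adjacency), dtype=bool)
    while True:
        product = current.astype(int) @ current.astype(int)
        nxt = product > 0  # Boolean "addition": any nonzero entry counts as 1
        if np.array_equal(nxt, current):
            return current
        current = nxt

R = boolean_reachability(A)
print(A.astype(int)[0])  # row for A1: [0 1 0 1 0 0 0 0 0]
print(R.astype(int)[0])  # A1 reaches every attribute: [1 1 1 1 1 1 1 1 1]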

430

Similar to the A matrix, each row and column in the R matrix represents one attribute; the first row and column represent attribute A1 and the last row and column represent attribute A9. The first attribute is either directly or indirectly connected to all attributes A1 to A9. This is represented by the presence of 1s in all columns of row 1 (i.e., the row representing attribute A1). In the R matrix, an attribute is considered related to itself, resulting in 1s along the main diagonal. Referring back to the hierarchy, attribute A1 is directly connected to attribute A2 and indirectly to A3 through its connection with A2. Attribute A1 is indirectly connected to attributes A5 to A9 through its connection with A4. The potential pool of items is represented by the incidence matrix (Q) of order (k, p), where k is the number of attributes and p is the number of potential items. This pool of items represents all combinations of the attributes when the attributes are independent of each other. However, this pool of items can be reduced to form the reduced incidence matrix (Qr) by imposing the constraints of the attribute hierarchy as defined by the R matrix. The Qr matrix represents items that capture the dependencies among the attributes defined in the attribute hierarchy. The Qr matrix is formed using Boolean inclusion by determining which columns of the R matrix are logically included in each column of the Q matrix. The Qr matrix is of order (k, i), where k is the number of attributes and i is the reduced number of items resulting from the constraints in the hierarchy. For the Ratio and Algebra hierarchy, the Qr matrix is shown next.

The Qr matrix serves as an important test item development blueprint, where items can be created to measure each specific combination of attributes. In this way, each component of the cognitive model can be evaluated systematically. In this example, a minimum of 9 items is required to measure all the attribute combinations specified in the Qr matrix.
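As an illustrative sketch only (not from the source), the columns of Qr can be enumerated as the attribute combinations that respect the hierarchy, i.e., patterns in which every included attribute is accompanied by all of its prerequisites.

# Illustrative sketch: enumerating hierarchy-consistent attribute combinations,
# i.e., the item patterns that would appear as columns of the Qr matrix.
from itertools import combinations

direct_prerequisites = {
    "A1": [], "A2": ["A1"], "A3": ["A2"], "A4": ["A1"], "A5": ["A4"],
    "A6": ["A5"], "A7": ["A4"], "A8": ["A7"], "A9": ["A4"],
}

def respects_hierarchy(pattern):
    # every attribute in the pattern must bring along its direct prerequisites
    return all(set(direct_prerequisites[a]) <= set(pattern) for a in pattern)

attributes = sorted(direct_prerequisites)
qr_patterns = [
    frozenset(combo)
    for size in range(1, len(attributes) + 1)
    for combo in combinations(attributes, size)
    if respects_hierarchy(combo)
]
print(len(qr_patterns))  # 57 non-empty hierarchy-consistent patterns for this hierarchy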

The expected examinee response patterns can now be generated using the Qr matrix. An expected examinee is conceptualized as a hypothetical examinee who correctly answers items that require cognitive attributes that the examinee has mastered. The expected response matrix (E) is created, using Boolean inclusion, by comparing each row of the attribute pattern matrix (which is the transpose of the Qr matrix) to the columns of the Qr matrix. The expected response matrix is of order (j, i), where j is the number of examinees and i is the reduced number of items resulting from the constraints imposed by the hierarchy. The E matrix for the Ratio and Algebra hierarchy is shown below. If the cognitive model is true, then 58 unique item response patterns should be produced by examinees who write these cognitively based items. A row of 0s is usually added to the E matrix to represent an examinee who has not mastered any attributes. To summarize, if the attribute pattern of the examinee contains the attributes required by the item, then the examinee is expected to answer the item correctly. However, if the examinee's attribute pattern is missing one or more of the cognitive attributes required by the item, the examinee is not expected to answer the item correctly.
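A minimal sketch of this rule (again illustrative, not source code): an expected examinee with a given attribute pattern answers an item correctly exactly when the item's required attributes are a subset of that pattern. The item requirements used below are hypothetical examples.

# Illustrative sketch: building expected item responses from attribute patterns.
# An examinee is expected to answer an item correctly only if the item's
# required attributes are a subset of the attributes the examinee has mastered.

def expected_response(examinee_attributes, item_attributes):
    return 1 if set(item_attributes) <= set(examinee_attributes) else 0

def expected_response_row(examinee_attributes, items):
    # one row of the E matrix: the examinee's expected score on every item
    return [expected_response(examinee_attributes, item) for item in items]

items = [{"A1"}, {"A1", "A2"}, {"A1", "A2", "A3"}, {"A1", "A4"}]  # example item requirements
examinee = {"A1", "A2"}                                           # mastered A1 and A2 only
print(expected_response_row(examinee, items))  # [1, 1, 0, 0]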

431

Role of the cognitive model in item development


The cognitive model in the form of an attribute hierarchy has direct implications for item development. Items that measure each attribute must maintain the hierarchical ordering of the attributes as specified by the cognitive model while also measuring increasingly complex cognitive processes. These item types may be in either multiple-choice or constructed-response format. To date, the AHM has been used with items that are scored dichotomously, where 1 corresponds to a correct answer and 0 corresponds to an incorrect answer. Therefore, a student's test performance can be summarized by a vector of correct and incorrect responses in the form of 1s and 0s. This vector then serves as the input for the psychometric analysis in which the examinee's attribute mastery is estimated.

Approach to item development


The attributes in the cognitive model are specified at a fine grain size in order to yield a detailed cognitive skill profile of the examinee's test performance. This necessitates that many items be created to measure each attribute in the hierarchy. For computer-based tests, automated item generation (AIG) is a promising method for generating, on the fly, multiple items that have similar form and psychometric properties using a common template.[11]

Example of items aligned to the attributes in a hierarchy[10]


Referring back to the pictorial representation of the Ratios and Algebra hierarchy, an item can be constructed to measure the skills described in each of the attributes. For example, attribute A1 includes the most basic arithmetic operation skills, such as addition, subtraction, multiplication, and division of numbers. An item that measures this skill could be the following: examinees are presented with the algebraic expression 4(t + u) + 3 = 19 and asked to solve for (t + u). For this item, examinees need to subtract 3 from 19 and then divide 16 by 4. Attribute A2 represents knowledge about the property of factors. An example of an item that measures this attribute is "If (p + 1)(t - 3) = 0 and p is positive, what is the value of t?" The examinee must know the property that the value of at least one factor must be zero if the product of multiple factors is zero. Once this property is recognized, the examinee would be able to recognize that because p is positive, (t - 3) must be zero to make the value of the whole expression zero, which would finally yield the value of 3 for t. To answer this item correctly, the examinee should have mastered both attributes A1 and A2. Attribute A3 represents not only knowledge of factoring (i.e., attribute A2), but also the skills of applying the rules of factoring. An example of an item that measures this attribute is .

Only after the examinee factors the second expression into the product of the first expression would the calculation of the value of the second expression be apparent. To answer this item correctly, the examinee should have mastered attributes A1, A2, and A3.


Psychometric Analysis
During this stage, statistical pattern recognition is used to identify the attribute combinations that the examinee is likely to possess, based on the observed examinee responses relative to the expected response patterns derived from the cognitive model.

Evaluating model-data fit


Prior to any further analysis, it must be established that the specified cognitive model accurately reflects the cognitive attributes used by the examinees. It is expected that there will be discrepancies, or slips, between the observed response patterns generated by a large group of examinees and the expected response patterns. The fit of the cognitive model relative to the observed response patterns obtained from examinees can be evaluated using the Hierarchical Consistency Index (HCI).[10] The HCI evaluates the degree to which the observed response patterns are consistent with the attribute hierarchy. The HCI for examinee i is defined in terms of the following quantities:

J is the total number of items, Xij is examinee i's score (i.e., 1 or 0) on item j, Sj includes the items that require a subset of the attributes of item j, and Nci is the total number of comparisons for items correctly answered by examinee i. The values of the HCI range from -1 to +1. Values closer to 1 indicate a good fit between the observed response pattern and the expected examinee response patterns generated from the hierarchy. Conversely, low HCI values indicate a large discrepancy between the observed examinee response patterns and the expected examinee response patterns generated from the hierarchy. HCI values above 0.70 indicate good model-data fit.
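A hedged sketch of how such an index can be computed from the quantities defined above (the exact published formula should be checked against Cui and Leighton's work; the function and toy data below are illustrative only): for each correctly answered item j, every item g in Sj answered incorrectly counts as a misfit, and the misfit proportion is rescaled so the result falls between -1 and +1.

# Hedged sketch of a Hierarchical Consistency Index for one examinee.
# responses: dict item -> 0/1 observed score
# subset_items: dict item j -> list of items whose required attributes are a
#               subset of item j's required attributes (the set S_j above)

def hci(responses, subset_items):
    misfits = 0
    comparisons = 0
    for j, x_j in responses.items():
        if x_j == 1:                       # only correctly answered items are checked
            for g in subset_items[j]:
                comparisons += 1           # N_ci counts these comparisons
                if responses[g] == 0:      # easier prerequisite item answered incorrectly
                    misfits += 1
    if comparisons == 0:
        return None                        # undefined when there is nothing to compare
    return 1 - (2 * misfits) / comparisons  # -1 when every comparison misfits, +1 when none do

responses = {"i1": 1, "i2": 1, "i3": 0}
subset_items = {"i1": [], "i2": ["i1"], "i3": ["i1", "i2"]}
print(hci(responses, subset_items))  # 1.0: no hierarchy violations in this toy example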

Why is model-data fit important?


Obtaining good model-data fit provides additional evidence to validate the specified attribute hierarchy, which is required before proceeding to the determination of an examinee's attribute mastery. If the data are not shown to fit the model, then various reasons may account for the large number of discrepancies, including a misspecification of the attributes, an incorrect ordering of the attributes within the hierarchy, items that do not measure the specified attributes, and/or a model that does not reflect the cognitive processes used by a given sample of examinees. Therefore, the cognitive model should be correctly defined and closely aligned with the observed response patterns in order to provide a substantive framework for making inferences about a specific group of examinees' knowledge and skills.[12]

Estimating attribute probabilities


Once it is established that the model fits the data, the attribute probabilities can be calculated. The use of attribute probabilities is important in the psychometric analyses of the AHM because these probabilities provide examinees with specific information about their attribute-level performance as part of the diagnostic reporting process. To estimate the probability that examinees possess specific attributes, given their observed item response pattern, an artificial neural network approach is used.



Brief description of a neural network


The neural network is a type of parallel-processing architecture that transforms any stimulus received by the input units (i.e., stimulus units) into a signal for the output units (i.e., response units) through a series of mid-level hidden units. Each unit in the input layer is connected to each unit in the hidden layer and, in turn, to each unit in the output layer. Generally speaking, a neural network operates in the following steps. To begin, each cell of the input layer receives a value (0 or 1) corresponding to the response values in the exemplar vector. Each input cell then passes the value it receives to every hidden cell. Each hidden cell forms a linearly weighted sum of its inputs, transforms the sum using the logistic function, and passes the result to every output cell. Each output cell, in turn, forms a linearly weighted sum of its inputs from the hidden cells, transforms it using the logistic function, and outputs the result. Because the result is scaled using the logistic transformation, the output values range from 0 to 1 and can be interpreted as the probability that the correct or target value for each output will have a value of 1. The output targets in the response units (i.e., the examinee attributes) are compared to the pattern associated with each stimulus input or exemplar (i.e., the expected response patterns). The solution produced initially with the stimulus and association connection weights is likely to be discrepant, resulting in a relatively large error. However, this discrepant result can be used to modify the connection weights, thereby leading to a more accurate solution and a smaller error term. One popular approach for approximating the weights so that the error term is minimized is a learning algorithm called the generalized delta rule, incorporated in a training procedure called back-propagation of error.[13]
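To make the architecture concrete, here is a minimal forward-pass sketch (illustrative only; the layer sizes and random weights are placeholders, not values from the AHM literature): item responses go in, one logistic probability per attribute comes out.

# Minimal forward-pass sketch of the network described above: item responses in,
# attribute probabilities out, with logistic (sigmoid) hidden and output units.
# The layer sizes and random weights are placeholders, not published values.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_hidden, n_attributes = 4, 5, 9

W_hidden = rng.normal(size=(n_items, n_hidden))       # input-to-hidden weights
W_output = rng.normal(size=(n_hidden, n_attributes))  # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attribute_probabilities(response_pattern):
    hidden = sigmoid(response_pattern @ W_hidden)      # hidden-layer activations
    return sigmoid(hidden @ W_output)                  # one probability per attribute

print(attribute_probabilities(np.array([1.0, 1.0, 0.0, 0.0])).round(2))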

Specification of the neural network


Calculation of attribute probabilities begins by presenting the neural network with the expected examinee response patterns generated in Stage 1, together with their associated attribute patterns derived from the cognitive model (i.e., the transpose of the Qr matrix), until the network learns each association. The result is a set of weight matrices that is used to calculate the probability that an examinee has mastered a particular cognitive attribute based on their observed response pattern.[10] An attribute probability close to 1 indicates that the examinee has likely mastered the cognitive attribute, whereas a probability close to 0 indicates that the examinee has likely not mastered it.
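A compact training sketch under the same caveats (plain gradient descent on a one-hidden-layer logistic network; the learning rate, number of epochs, hidden-layer size and toy data are arbitrary choices, and this is not the original AHM implementation):

# Illustrative training sketch: learn to map expected response patterns (rows of E)
# to their attribute patterns with a one-hidden-layer logistic network.
# Hyperparameters are arbitrary; this is not the original AHM implementation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(E, attribute_patterns, n_hidden=8, lr=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(E.shape[1], n_hidden))
    W2 = rng.normal(scale=0.5, size=(n_hidden, attribute_patterns.shape[1]))
    for _ in range(epochs):
        hidden = sigmoid(E @ W1)                       # forward pass
        output = sigmoid(hidden @ W2)
        output_error = output - attribute_patterns     # back-propagation of error
        hidden_error = (output_error @ W2.T) * hidden * (1 - hidden)
        W2 -= lr * hidden.T @ output_error / len(E)    # generalized-delta-rule style updates
        W1 -= lr * E.T @ hidden_error / len(E)
    return W1, W2

# toy association: two expected response patterns and their attribute patterns
E = np.array([[1.0, 0.0], [1.0, 1.0]])
patterns = np.array([[1.0, 0.0], [1.0, 1.0]])
W1, W2 = train(E, patterns)
print(sigmoid(sigmoid(E @ W1) @ W2).round(2))  # approaches the target attribute patterns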



Reporting The Results


The importance of the reporting process
Score reporting serves a critical function as the interface between the test developer and a diverse audience of test users. A score report must include detailed information, which is often technical in nature, about the meanings and possible interpretations of results that users can make. The Standards for Educational and Psychological Testing[14] clearly define the role of test developers in the reporting process. Standard 5.10 states: "When test score information is released to students, parents, legal representatives, teachers, clients, or the media, those responsible for testing programs should provide appropriate interpretations. The interpretations should describe in simple language what the test covers, what the scores mean, and how the scores will be used."

A Sample Diagnostic Score Report for an Examinee Who Mastered Attributes A1, A4, A5, and A6

Reporting cognitive diagnostic results using the AHM


A key advantage of the AHM is that it supports individualized diagnostic score reporting using the attribute probability results. The score reports produced by the AHM include not only a total score but also detailed information about which cognitive attributes were measured by the test and the degree to which the examinee has mastered each of them. This diagnostic information is directly linked to the attribute descriptions, individualized for each student, and easily presented. Hence, these reports provide specific diagnostic feedback that can direct instructional decisions. To demonstrate how the AHM can be used to report test scores and provide diagnostic feedback, a sample report is presented next.[10] In the sample report, the examinee mastered attributes A1 and A4 to A6. Three performance levels were selected for reporting attribute mastery: non-mastery (attribute probability between 0.00 and 0.35), partial mastery (attribute probability between 0.36 and 0.70), and mastery (attribute probability between 0.71 and 1.00). The results in the score report reveal that the examinee has clearly mastered four attributes: A1 (basic arithmetic operations), A4 (skills required for substituting values into algebraic expressions), A5 (the skills of mapping a graph of a familiar function onto its corresponding function), and A6 (abstract properties of functions). The examinee has not mastered the skills associated with the remaining five attributes.
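A minimal sketch of how attribute probabilities might be mapped onto these three reporting levels is shown below; the cut points follow the example above, but the specific probability values are invented for illustration.

```python
# Map attribute probabilities onto the reporting levels used in the sample report.
def mastery_level(p):
    if p <= 0.35:
        return "non-mastery"
    if p <= 0.70:
        return "partial mastery"
    return "mastery"

# Illustrative attribute probabilities for one examinee (not real data)
attribute_probs = {"A1": 0.92, "A2": 0.21, "A3": 0.30, "A4": 0.88, "A5": 0.76,
                   "A6": 0.81, "A7": 0.12, "A8": 0.28, "A9": 0.33}

for attribute, p in attribute_probs.items():
    print(f"{attribute}: p = {p:.2f} -> {mastery_level(p)}")
```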


Implications of AHM for Cognitive Diagnostic Assessment


Integration of assessment, instruction, and learning
The rise in popularity of cognitive diagnostic assessments can be traced to two sources: assessment developers and assessment users.[15] Assessment developers see great potential for cognitive diagnostic assessments to inform teaching and learning by changing the way current assessments are designed. They also argue that to maximize the educational benefits of assessment, curriculum, instruction, and assessment design should be aligned and integrated. Assessment users, including teachers and other educational stakeholders, are increasingly demanding relevant results from educational assessments; this requires assessments to be aligned with classroom practice if they are to be of maximum instructional value.

The AHM, as a form of cognitive diagnostic assessment, addresses the path between curriculum and assessment design by identifying the knowledge, skills, and processes actually used by examinees to solve problems in a given domain. These cognitive attributes, organized into a cognitive model, become not only a representation of the construct of interest but also the cognitive test blueprint. Items can then be constructed to systematically measure each attribute combination within the cognitive model. The path between assessment design and instruction is also addressed by providing specific, detailed feedback about an examinee's performance in terms of the cognitive attributes mastered. This cognitive diagnostic feedback is provided to students and teachers in the form of a score report. The skills mastery profile, along with adjunct information such as exemplar test items, can be used by the teacher to focus instructional efforts in areas where the student requires additional assistance. Assessment results can also provide feedback to the teacher on the effectiveness of instruction for promoting the learning objectives.

The AHM is a promising method for cognitive diagnostic assessment. A principled test design approach that integrates cognition into test development can support stronger inferences about how students actually think and solve problems. With this knowledge, students can be provided with additional information to guide their learning, leading to improved performance on future educational assessments and problem-solving tasks.
Effects of the results of a Cognitive Diagnostic Assessment


Suggested Reading
Leighton, J. P., & Gierl, M. J. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge, UK: Cambridge University Press.

External links
Center for Research in Applied Measurement and Evaluation [1]

References
[1] http://www.education.ualberta.ca/educ/psych/crame/
[2] Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka's rule-space approach. Journal of Educational Measurement, 41, 205-237.
[3] Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345-354.
[4] Kuhn, D. (2001). Why development does (and does not) occur: Evidence from the domain of inductive reasoning. In J. L. McClelland & R. Siegler (Eds.), Mechanisms of cognitive development: Behavioral and neural perspectives (pp. 221-249). Hillsdale, NJ: Erlbaum.
[5] Gierl, M. J. (2007). Making diagnostic inferences about cognitive attributes using the rule-space model and attribute hierarchy method. Journal of Educational Measurement, 44, 325-340.
[6] Gierl, M. J., & Zhou, J. (2008). Computer adaptive-attribute testing: A new approach to cognitive diagnostic assessment. To appear in the Special Issue of Zeitschrift für Psychologie - Journal of Psychology (Spring, 2008), Adaptive Models of Psychological Testing, Wim J. van der Linden (Guest Editor).
[7] Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to make inferences about examinees' thinking processes. Educational Measurement: Issues and Practice, 26, 3-16.
[8] Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
[9] Leighton, J. P. (2004). Avoiding misconceptions, misuse, and missed opportunities: The collection of verbal reports in educational achievement testing. Educational Measurement: Issues and Practice, 23, 1-10.
[10] Gierl, M. J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees' cognitive skills in algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6). Retrieved October 24, 2008, from http://www.jtla.org.
[11] Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and Assessment, 2(3). Retrieved October 24, 2008, from http://www.jtla.org.
[12] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986b). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
[13] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986a). Learning representations by back-propagating errors. Nature, 323, 533-536.
[14] American Educational Research Association (AERA), American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, D.C.: AERA.
[15] Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 19-60). Cambridge, UK: Cambridge University Press.


Differential item functioning


Differential item functioning (DIF), also referred to as measurement bias, occurs when people from different groups (commonly gender or ethnicity) with the same latent trait (ability/skill) have a different probability of giving a certain response on a questionnaire or test.[1] DIF analysis provides an indication of unexpected behavior of items on a test. An item does not display DIF merely because people from different groups have different probabilities of giving a certain response; it displays DIF if and only if people from different groups with the same underlying true ability have different probabilities of giving that response. Common procedures for assessing DIF are Mantel-Haenszel, item response theory (IRT) based methods, and logistic regression.[2]

Description
DIF refers to differences in the functioning of items across groups, oftentimes demographic, which are matched on the latent trait or, more generally, the attribute being measured by the items or test.[3][4] It is important to note that when examining items for DIF, the groups must be matched on the measured attribute; otherwise inaccurate detection of DIF may result. In order to create a general understanding of DIF or measurement bias, consider the following example offered by Osterlind and Everson (2009).[5] In this case, Y refers to a response to a particular test item, which is determined by the latent construct being measured. The latent construct of interest is referred to as theta (θ), where Y is an indicator of θ that can be arranged in terms of the conditional probability distribution of Y on θ, written f(Y|θ). Therefore, response Y is conditional on the latent trait (θ). Because DIF examines differences in the conditional probabilities of Y between groups, let us label the groups the reference and focal groups. Although the designation does not matter, a typical practice in the literature is to designate the reference group as the group suspected to have an advantage, while the focal group refers to the group anticipated to be disadvantaged by the test.[3] Therefore, given the functional relationship f(Y|θ), and under the assumption that there are identical measurement error distributions for the reference and focal groups, it can be concluded that under the null hypothesis:

f(Y = 1 | θ, G = r) = f(Y = 1 | θ, G = f)

with G corresponding to the grouping variable, "r" the reference group, and "f" the focal group. This equation represents an instance where DIF is not present. In this case, the absence of DIF is determined by the fact that the conditional probability distribution of Y is not dependent on group membership. To illustrate, consider an item with response options 0 and 1, where Y = 0 indicates an incorrect response and Y = 1 indicates a correct response. The probability of correctly responding to the item is the same for members of either group. This indicates that there is no DIF or item bias, because members of the reference and focal group with the same underlying ability or attribute have the same probability of responding correctly. Therefore, there is no bias or disadvantage for one group over the other.

Consider the instance where the conditional probability of Y is not the same for the reference and focal groups. In other words, members of different groups with the same trait or ability level have unequal probability distributions on Y. Once controlling for θ, there is a clear dependency between group membership and performance on an item. For dichotomous items, this suggests that when the focal and reference groups are at the same location on θ, there is a different probability of getting a correct response or endorsing the item. Therefore, the group with the higher conditional probability of correctly responding to the item is the group advantaged by the test item. This suggests that the test item is biased and functions differently for the groups, and therefore exhibits DIF.

It is important to draw the distinction between DIF, or measurement bias, and ordinary group differences. Whereas group differences indicate differing score distributions on Y, DIF explicitly involves conditioning on θ. For instance, consider the following inequality:

P(Y = 1 | G = g) ≠ P(Y = 1)

This indicates that an examinee's score is conditional on grouping, such that having information about group membership changes the probability of a correct response. Therefore, if the groups differ on θ, and performance depends on θ, then the above inequality would seem to suggest item bias even in the absence of DIF. For this reason, it is generally agreed upon in the measurement literature that differences on Y as a function of group membership alone are inadequate for establishing bias.[7][8][9] In fact, differences on θ, or ability, are common between groups and establish the basis for much research. Remember: to establish bias or DIF, groups must be matched on θ and then shown to have differential probabilities on Y as a function of group membership.

Forms of DIF
Uniform DIF is the simplest type of DIF, where the magnitude of conditional dependency is relatively invariant across the latent trait continuum (θ). The item of interest consistently gives one group an advantage across all levels of ability θ.[10] Within an item response theory (IRT) framework this would be evidenced when both item characteristic curves (ICCs) are equally discriminating yet exhibit differences in the difficulty parameters (i.e., a_r = a_f and b_r < b_f), as depicted in Figure 1.[11] Nonuniform DIF, however, presents an interesting case. Rather than a consistent advantage being given to the reference group across the ability continuum, the conditional dependency moves and changes direction at different locations on the continuum.[12] For instance, an item may give the reference group a minor advantage at the lower end of the continuum and a major advantage at the higher end. Also, unlike uniform DIF, an item can simultaneously vary in discrimination for the two groups while also varying in difficulty (i.e., a_r ≠ a_f and b_r < b_f). Even more complex is crossing nonuniform DIF. As demonstrated in Figure 2, this occurs when an item gives an advantage to the reference group at one end of the continuum while favoring the focal group at the other end. Differences in ICCs indicate that examinees from the two groups with identical ability levels have unequal probabilities of correctly responding to an item. When the curves are different but do not intersect, this is evidence of uniform DIF. However, if the ICCs cross at any point along the scale, there is evidence of nonuniform DIF.
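These patterns can be illustrated with a small sketch that evaluates two-parameter logistic ICCs for a reference and a focal group. The parameter values below are invented purely to produce uniform and crossing nonuniform DIF and are not drawn from any real test.

```python
# Illustrative 2PL item characteristic curves for a reference and a focal group.
import numpy as np

def icc_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Uniform DIF: equal discrimination, item harder for the focal group
p_ref_uniform = icc_2pl(theta, a=1.0, b=0.0)
p_foc_uniform = icc_2pl(theta, a=1.0, b=0.5)

# Crossing nonuniform DIF: unequal discrimination, so the curves intersect
p_ref_cross = icc_2pl(theta, a=1.5, b=0.0)
p_foc_cross = icc_2pl(theta, a=0.7, b=0.0)

for t, pr, pf in zip(theta, p_ref_uniform, p_foc_uniform):
    print(f"theta = {t:+.1f}   P(ref) = {pr:.2f}   P(focal) = {pf:.2f}")
```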


Procedures for Detecting DIF


Mantel-Haenszel
A common procedure for detecting DIF is the Mantel-Haenszel (MH) approach.[13] The MH procedure is a chi-squared contingency table based approach which examines differences between the reference and focal groups on all items of the test, one by one.[14] The ability continuum, defined by total test scores, is divided into k intervals, which then serve as the basis for matching members of both groups.[15] A 2 x 2 contingency table is used at each interval of k, comparing both groups on an individual item. The rows of the contingency table correspond to group membership (reference or focal) while the columns correspond to correct or incorrect responses. The general form for a single item at the kth ability interval is:

                    Correct    Incorrect
  Reference group   A_k        B_k
  Focal group       C_k        D_k

where N_k denotes the total number of examinees at the kth interval.

Odds Ratio

The next step in the calculation of the MH statistic is to use data from the contingency table to obtain an odds ratio for the two groups on the item of interest at a particular interval k. This is expressed in terms of p and q, where p represents the proportion correct and q the proportion incorrect for both the reference (R) and focal (F) groups. For the MH procedure, the obtained odds ratio is represented by α, with possible values ranging from 0 to ∞. A value of 1.0 indicates an absence of DIF and thus similar performance by both groups. Values greater than 1.0 suggest that the reference group outperformed, or found the item less difficult than, the focal group. On the other hand, if the obtained value is less than 1.0, this is an indication that the item was less difficult for the focal group.[8] Using variables from the contingency table above, the calculation is as follows:

α = (p_Rk / q_Rk) / (p_Fk / q_Fk)
  = {[A_k / (A_k + B_k)] / [B_k / (A_k + B_k)]} / {[C_k / (C_k + D_k)] / [D_k / (C_k + D_k)]}
  = (A_k / B_k) / (C_k / D_k)
  = (A_k D_k) / (B_k C_k)

The above computation pertains to an individual item at a single ability interval. The population estimate can be extended to reflect a common odds ratio across all ability intervals k for a specific item. The common odds ratio estimator is denoted α_MH and can be computed by the following equation:

α_MH = Σ_k (A_k D_k / N_k) / Σ_k (B_k C_k / N_k)

where N_k represents the total sample size at the kth interval.

The obtained α_MH is often standardized through a log transformation, centering the value around 0.[16] The transformed estimator, MH D-DIF, is computed as follows:

MH D-DIF = -2.35 ln(α_MH)

Thus an obtained value of 0 would indicate no DIF. In examining the equation, it is important to note that the minus sign changes the interpretation of values less than or greater than 0: values less than 0 indicate a reference group advantage, whereas values greater than 0 indicate an advantage for the focal group.
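The α_MH and MH D-DIF calculations can be sketched as follows. The per-interval 2 x 2 tables are invented for illustration; each tuple gives (A_k, B_k, C_k, D_k) as defined above.

```python
# Mantel-Haenszel common odds ratio and MH D-DIF from per-interval 2x2 tables.
import math

# (A_k, B_k, C_k, D_k): reference correct/incorrect, focal correct/incorrect
tables = [
    (30, 20, 25, 25),
    (45, 15, 40, 20),
    (55, 10, 48, 17),
]

numerator = sum(A * D / (A + B + C + D) for A, B, C, D in tables)
denominator = sum(B * C / (A + B + C + D) for A, B, C, D in tables)

alpha_mh = numerator / denominator        # common odds ratio across intervals
mh_d_dif = -2.35 * math.log(alpha_mh)     # log transformation described above

print(f"alpha_MH = {alpha_mh:.3f}, MH D-DIF = {mh_d_dif:.3f}")
```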

Item Response Theory


Item response theory (IRT) is another widely used method for assessing DIF. IRT allows for a critical examination of responses to particular items from a test or measure. As noted earlier, DIF examines the probability of correctly responding to or endorsing an item conditioned on the latent trait or ability. Because IRT examines the monotonic relationship between responses and the latent trait or ability, it is a fitting approach for examining DIF.[17]

Three major advantages of using IRT in DIF detection are:[18]
- Compared to classical test theory, IRT parameter estimates are not as confounded by sample characteristics.
- Statistical properties of items can be expressed with greater precision, which increases the accuracy with which DIF between two groups is interpreted.
- These statistical properties of items can be expressed graphically, improving interpretability and understanding of how items function differently between groups.

In relation to DIF, item parameter estimates are computed and graphically examined via item characteristic curves (ICCs), also referred to as trace lines or item response functions (IRFs). After examination of ICCs and subsequent suspicion of DIF, statistical procedures are implemented to test differences between parameter estimates. ICCs represent mathematical functions of the relationship between position on the latent trait continuum and the probability of giving a particular response.[19] Figure 3 illustrates this relationship as a logistic function. Individuals lower on the latent trait, or with less ability, have a lower probability of getting a correct response or endorsing an item, especially as difficulty increases; those higher on the latent trait or in ability have a greater chance of a correct response or of endorsing an item. For instance, on a depression inventory, highly depressed individuals would have a greater probability of endorsing an item than individuals with lower depression. Similarly, individuals with higher math ability have a greater probability of getting a math item correct than those with lesser ability. Another critical aspect of ICCs is the inflection point. This is the point on the curve where the probability of a particular response is .5 and where the slope is at its maximum.[20] The inflection point indicates where the probability of a correct response or of endorsing an item becomes greater than 50%, except when a c parameter is greater than 0, which places the probability at the inflection point at (1 + c)/2 (a description follows below). The inflection point is determined by the difficulty of the item, which corresponds to values on the ability or latent trait continuum.[21] Therefore, for an easy item, this inflection point may be lower on the ability continuum, while for a difficult item it may be higher on the same scale.
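As a concrete illustration of the logistic ICC and its inflection point, the following minimal sketch evaluates a three-parameter logistic item response function at a few ability values. The parameters a, b, and c are introduced in the next paragraph; the specific values here are invented for illustration.

```python
# Illustrative 3PL item response function and its inflection-point probability.
import numpy as np

def icc_3pl(theta, a, b, c):
    """P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

a, b, c = 1.3, 0.5, 0.2
theta = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])

print(np.round(icc_3pl(theta, a, b, c), 3))
print("probability at the inflection point (theta = b):", (1 + c) / 2)
```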


Before presenting statistical procedures for testing differences of item parameters, it is important to first provide a general understanding of the different parameter estimation models and their associated parameters. These include the one-, two-, and three-parameter logistic (PL) models. All these models assume a single underlying latent trait or ability. All three of these models have an item difficulty parameter, denoted b. For the 1PL and 2PL models, the b parameter corresponds to the inflection point on the ability scale, as mentioned above. In the case of the 3PL model, the probability at the inflection point corresponds to (1 + c)/2, where c is a lower asymptote (discussed below). Difficulty values can, in theory, range from -∞ to +∞; in practice they rarely exceed 3. Higher values are indicative of harder test items, while items exhibiting low b parameters are easy test items.[22] Another parameter that is estimated is a discrimination parameter, designated a. This parameter pertains to an item's ability to discriminate among individuals. The a parameter is estimated in the 2PL and 3PL models; in the case of the 1PL model, this parameter is constrained to be equal between groups. In relation to ICCs, the a parameter is the slope at the inflection point. As mentioned earlier, the slope is maximal at the inflection point. The a parameter, similar to the b parameter, can range from -∞ to +∞, but typical values are less than 2; higher values indicate greater discrimination between individuals.[23] The 3PL model has an additional parameter referred to as a guessing or pseudochance parameter, denoted c. This corresponds to a lower asymptote, which allows for the possibility that an individual may get a moderate or difficult item correct even if they are low in ability. Values for c range between 0 and 1 but typically fall below .3.[24]

When applying statistical procedures to assess for DIF, the a and b parameters (discrimination and difficulty) are of particular interest. However, assume a 1PL model was used, where the a parameters are constrained to be equal for both groups, leaving only the estimation of the b parameters. After examining the ICCs, there is an apparent difference in b parameters for the two groups. Using a method similar to a Student's t-test, the next step is to determine if the difference in difficulty is statistically significant. Under the null hypothesis

H0: b_r = b_f

Lord (1980) provides an easily computed and normally distributed test statistic:

d = (b_r - b_f) / SE(b_r - b_f)

The standard error of the difference between b parameters is calculated by

SE(b_r - b_f) = sqrt( [SE(b_r)]² + [SE(b_f)]² )

Wald Statistic

However, more often than not, a 2PL or 3PL model is more appropriate than fitting a 1PL model to the data, and thus both the a and b parameters should be tested for DIF. Lord (1980) proposed another method for testing differences in both the a and b parameters, where c parameters are constrained to be equal across groups. This test yields a Wald statistic which follows a chi-square distribution. In this case the null hypothesis being tested is H0: a_r = a_f and b_r = b_f.

First, a 2 x 2 covariance matrix of the parameter estimates is calculated for each group; these are represented by S_r and S_f for the reference and focal groups. These covariance matrices are computed by inverting the obtained information matrices. Next, the differences between estimated parameters are put into a 2 x 1 vector, denoted by

V' = (a_r - a_f, b_r - b_f)

Next, the covariance matrix S is estimated by summing S_r and S_f. Using this information, the Wald statistic is computed as follows:

χ² = V'S⁻¹V

which is evaluated at 2 degrees of freedom.

Likelihood-Ratio test

The likelihood-ratio test is another IRT-based method for assessing DIF. This procedure involves comparing the ratio of two models. Under model (M_c) item parameters are constrained to be equal or invariant between the reference and focal groups. Under model (M_v) item parameters are free to vary.[25] The likelihood function under M_c is denoted (L_c) while the likelihood function under M_v is designated (L_v). The items constrained to be equal serve as anchor items for this procedure while items suspected of DIF are allowed to freely vary. By using anchor items and allowing the remaining item parameters to vary, multiple items can be simultaneously assessed for DIF.[26] However, if the likelihood ratio indicates potential DIF, an item-by-item analysis would be appropriate to determine which items, if not all, contain DIF. The likelihood ratio of the two models is computed by

G² = 2 ln[L_v / L_c]

Alternatively, the ratio can be expressed by

G² = -2 ln[L_c / L_v]

where L_v and L_c are inverted and then multiplied by -2 ln. G² approximately follows a chi-square distribution, especially with larger samples. Therefore, it is evaluated by the degrees of freedom that correspond to the number of constraints necessary to derive the constrained model from the freely varying model.[27] For instance, if a 2PL model is used and both a and b parameters are free to vary under M_v and these same two parameters are constrained under M_c, then the ratio is evaluated at 2 degrees of freedom.
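The Wald computation can be sketched as follows, assuming 2PL parameter estimates and their covariance matrices have already been obtained from separate calibrations for the two groups. The numerical values are invented for illustration.

```python
# Lord's Wald test for DIF on a single 2PL item (illustrative values only).
import numpy as np
from scipy.stats import chi2

ref = np.array([1.20, 0.10])   # (a_r, b_r) estimates for the reference group
foc = np.array([1.05, 0.45])   # (a_f, b_f) estimates for the focal group

# 2 x 2 covariance matrices of the estimates (inverted information matrices)
S_r = np.array([[0.020, 0.004],
                [0.004, 0.015]])
S_f = np.array([[0.025, 0.005],
                [0.005, 0.018]])

v = ref - foc                              # difference vector
S = S_r + S_f                              # covariance of the difference
wald = float(v @ np.linalg.inv(S) @ v)     # chi-square statistic, 2 df

print(f"Wald chi-square = {wald:.3f}, p = {chi2.sf(wald, df=2):.4f}")
```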

Logistic Regression
Logistic regression approaches to DIF detection involve running a separate analysis for each item. The independent variables included in the analysis are group membership, an ability matching variable (typically a total score), and an interaction term between the two. The dependent variable of interest is the probability or likelihood of getting a correct response or endorsing an item. Because the outcome of interest is expressed in terms of probabilities, maximum likelihood estimation is the appropriate procedure.[28] This set of variables can then be expressed by the following regression equation:

Y = β0 + β1M + β2G + β3MG

where β0 corresponds to the intercept, or the probability of a response when M and G are equal to 0, and the remaining βs correspond to weight coefficients for each independent variable. The first independent variable, M, is the matching variable used to link individuals on ability, in this case a total test score, similar to that employed by the Mantel-Haenszel procedure. The group membership variable is denoted G and, in the case of regression, is represented through dummy coded variables. The final term, MG, corresponds to the interaction between the two above-mentioned variables. For this procedure, variables are entered hierarchically. Following the structure of the regression equation provided above, variables are entered in the following sequence: matching variable M, grouping variable G, and the interaction variable MG. Determination of DIF is made by evaluating the obtained chi-square statistic with 2 degrees of freedom. Additionally, parameter estimate significance is tested. From the results of the logistic regression, DIF would be indicated if individuals matched on ability have significantly different probabilities of responding to an item, and thus differing logistic regression curves. Conversely, if the curves for both groups are the same, then the item is unbiased and DIF is not present. In terms of uniform and nonuniform DIF, if the intercepts and matching variable parameters for both groups are not equal, then there is evidence of uniform DIF. However, if there is a nonzero interaction parameter, this is an indication of nonuniform DIF.[29]
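A sketch of the hierarchical logistic-regression procedure is shown below. It assumes the statsmodels and scipy packages are available and uses simulated data; the variable names (total, group) and the simulated DIF effect are purely illustrative.

```python
# Logistic-regression DIF screen for one item: compare a matching-only model
# with a model that adds group membership and the interaction term (2 df test).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 500
group = rng.integers(0, 2, n)                 # G: 0 = reference, 1 = focal
total = rng.normal(0.0, 1.0, n)               # M: matching variable (total score)
logit = 1.2 * total - 0.6 * group             # simulated uniform DIF against focal
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

def fit(design):
    return sm.Logit(y, sm.add_constant(design)).fit(disp=0)

m_match = fit(np.column_stack([total]))                          # M only
m_full = fit(np.column_stack([total, group, total * group]))     # M, G, MG

g2 = 2 * (m_full.llf - m_match.llf)           # chi-square with 2 df
print(f"chi-square = {g2:.2f}, p = {chi2.sf(g2, df=2):.4f}")
```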


DIF Considerations
Sample Size
The first consideration pertains to issues of sample size, specifically with regard to the reference and focal groups. Prior to any analyses, information about the number of people in each group is typically known, such as the number of males/females or members of ethnic/racial groups. However, the issue more closely revolves around whether the number of people per group is sufficient for there to be enough statistical power to identify DIF. In some instances, such as ethnicity, there may be evidence of unequal group sizes, such that Whites represent a far larger group sample than each individual ethnic group being represented. Therefore, in such instances, it may be appropriate to modify or adjust the data so that the groups being compared for DIF are in fact equal or closer in size. Dummy coding or recoding is a common practice employed to adjust for disparities in the size of the reference and focal group. In this case, all Non-White ethnic groups can be grouped together in order to have a relatively equal sample size for the reference and focal groups. This would allow for a "majority/minority" comparison of item functioning. If modifications are not made and DIF procedures are carried out, there may not be enough statistical power to identify DIF even if DIF exists between groups.

Another issue that pertains to sample size directly relates to the statistical procedure being used to detect DIF. Aside from sample size considerations of the reference and focal groups, certain characteristics of the sample itself must be met to comply with assumptions of each statistical test utilized in DIF detection. For instance, using IRT approaches may require larger samples than the Mantel-Haenszel procedure. This is important, as investigation of group size may direct one toward using one procedure over another. Within the logistic regression approach, leverage values and outliers are of particular concern and must be examined prior to DIF detection. Additionally, as with all analyses, statistical test assumptions must be met. Some procedures are more robust to minor violations while others are less so. Thus, the distributional nature of sample responses should be investigated prior to implementing any DIF procedures.


Items
The number of items being used for DIF detection must also be considered. No standard exists as to how many items should be used for DIF detection, as this varies from study to study. In some cases it may be appropriate to test all items for DIF, whereas in others it may not be necessary. If only certain items are suspected of DIF, with adequate reasoning, then it may be more appropriate to test those items and not the entire set. However, it is often difficult to simply assume which items may be problematic. For this reason, it is often recommended to examine all test items for DIF simultaneously. This will provide information about all items, shedding light on problematic items as well as those that function similarly for both the reference and focal groups. With regard to statistical tests, some procedures, such as IRT likelihood-ratio testing, require the use of anchor items. Some items are constrained to be equal across groups while items suspected of DIF are allowed to freely vary. In this instance, only a subset would be identified as DIF items while the rest would serve as a comparison group for DIF detection. Once DIF items are identified, the anchor items can also be analyzed by constraining the original DIF items and allowing the original anchor items to freely vary. Thus it seems that testing all items simultaneously may be a more efficient procedure. However, as noted, different methods for selecting DIF items are used depending on the procedure implemented.

Aside from identifying the number of items being used in DIF detection, of additional importance is determining the number of items on the entire test or measure itself. The typical recommendation, as noted by Zumbo (1999), is to have a minimum of 20 items. The reasoning for a minimum of 20 items directly relates to the formation of matching criteria. As noted in earlier sections, a total test score is typically used as a method for matching individuals on ability. The total test score is normally divided into 3-5 ability levels (k), which are then used to match individuals on ability prior to DIF analysis procedures. Using a minimum of 20 items allows for greater variance in the score distribution, which results in more meaningful ability level groups. Although the psychometric properties of the instrument should have been assessed prior to its being utilized, it is important that the validity and reliability of an instrument be adequate. Test items need to accurately tap into the construct of interest in order to derive meaningful ability level groups. Of course, one does not want to inflate reliability coefficients by simply adding redundant items. The key is to have a valid and reliable measure with sufficient items to develop meaningful matching groups. Gadermann et al. (2012),[30] Revelle and Zinbarg (2009),[31] and John and Soto (2007)[32] offer more information on modern approaches to structural validation and more precise and appropriate methods for assessing reliability.

Statistics versus Reasoning


As with all psychological research and psychometric evaluation, statistics play a vital role but should by no means be the sole basis for decisions and conclusions reached. Reasoned judgment is of critical importance when evaluating items for DIF. For instance, depending on the statistical procedure used for DIF detection, differing results may be yielded. Some procedures are more precise while others are less so. For instance, the Mantel-Haenszel procedure requires the researcher to construct ability levels based on total test scores, whereas IRT more effectively places individuals along the latent trait or ability continuum. Thus, one procedure may indicate DIF for certain items while others do not. Another issue is that sometimes DIF may be indicated but there is no clear reason why it exists. This is where reasoned judgment comes into play. The researcher must use common sense to derive meaning from DIF analyses. It is not enough to report that items function differently for groups; there needs to be a theoretical reason for why it occurs. Furthermore, evidence of DIF does not directly translate into unfairness in the test. It is common in DIF studies to identify some items that suggest DIF. This may be an indication of problematic items that need to be revised or omitted, and not necessarily an indication of an unfair test. Therefore, DIF analysis can be considered a useful tool for item analysis, but it is most effective when combined with theoretical reasoning.


Statistical Software
Below are common statistical programs capable of performing the procedures discussed herein. See the list of statistical packages for a comprehensive list of open source, public domain, freeware, and proprietary statistical software.

Mantel-Haenszel procedure:
- SPSS
- SAS
- Stata
- R
- Systat

IRT-based procedures:
- BILOG-MG
- MULTILOG
- PARSCALE
- TESTFACT
- EQSIRT
- R (e.g., the 'mirt' package)
- IRTPRO

Logistic regression:
- SPSS
- SAS
- Stata
- R
- Systat

References
[1] Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. New Jersey: Lawrence Erlbaum.
[2] Zumbo, B. D. (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223-233.
[3] Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 220-256). Westport, CT: American Council on Education.
[4] Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
[5] Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning. Thousand Oaks, CA: Sage Publishing.
[7] Ackerman, T. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 674-691.
[8] Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
[9] Millsap, R. E., & Everson, H. T. (1993). Methodological review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297-334.
[10] Walker, C. (2011). What's the DIF? Why differential item functioning analyses are an important part of instrument development and validation. Journal of Psychoeducational Assessment, 29, 364-376.
[11] Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.
[12] Walker, C. M., Beretvas, S. N., & Ackerman, T. A. (2001). An examination of conditioning variables used in computer adaptive testing for DIF. Applied Measurement in Education, 14, 3-16.
[13] Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.
[14] Marasculio, L. A., & Slaughter, R. E. (1981). Statistical procedures for identifying possible sources of item bias based on 2 x 2 statistics. Journal of Educational Measurement, 18, 229-248.
[15] Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Erlbaum.


[16] Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Erlbaum.
[17] Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11(4), 402-415.
[18] Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
[19] Reise, S. P., Ainsworth, A. T., & Haviland, M. G. (2005). Item response theory: Fundamentals, applications, and promise in psychological research. Current Directions in Psychological Science, 14, 95-101.
[20] Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research, 16, 5-18.
[21] DeMars, C. (2010). Item response theory. New York: Oxford Press.
[22] Harris, D. (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8, 35-41.
[23] Baker, F. B. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation.
[24] Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. Part 5 in F. M. Lord & M. R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
[25] Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group differences: The concept of bias. Psychological Bulletin, 99, 118-128.
[26] IRTPRO: User Guide. (2011). Lincolnwood, IL: Scientific Software International, Inc.
[27] Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
[28] Bock, R. D. (1975). Multivariate statistical methods. New York: McGraw-Hill.
[29] Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.
[30] Gadermann, A. M., Guhn, M., & Zumbo, B. D. (2012). Estimating ordinal reliability for Likert-type and ordinal item response data: A conceptual, empirical, and practical guide. Practical Assessment, Research, & Evaluation, 17(3), 1-13.
[31] Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the GLB: Comments on Sijtsma. Psychometrika, 74(1), 145-154.
[32] John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 461-494). New York, NY: Cambridge University Press.


Psychometrics

Psychometrics is the field of study concerned with the theory and technique of psychological measurement, which includes the measurement of knowledge, abilities, attitudes, personality traits, and educational achievement. The field is primarily concerned with the construction and validation of measurement instruments such as questionnaires, tests, and personality assessments. It involves two major research tasks, namely: (i) the construction of instruments and procedures for measurement; and (ii) the development and refinement of theoretical approaches to measurement. Those who practice psychometrics are known as psychometricians. All psychometricians possess a specific psychometric qualification, and while many are clinical psychologists, others work as human resources or learning and development professionals.


19th century foundation


Psychological testing has come from two streams of thought: one, from Darwin, Galton, and Cattell, on the measurement of individual differences, and the second, from Herbart, Weber, Fechner, and Wundt and their psychophysical measurements of a similar construct. The second set of individuals and their research is what led to the development of experimental psychology and standardized testing.[1]

Victorian stream

Charles Darwin was the inspiration behind Sir Francis Galton, whose work led to the creation of psychometrics. In 1859, Darwin published his book "The Origin of Species", which pertained to individual differences in animals. The book discussed how individual members of a species differ and how they possess characteristics that are more or less adaptive and successful. Those who are more adaptive and successful survive and pass these characteristics on to the next generation, who would be just as or more adaptive and successful. This idea, studied previously in animals, led to Galton's interest in, and study of, human beings and how they differ from one another, and, more importantly, how to measure those differences. Galton wrote a book entitled "Hereditary Genius" about the different characteristics that people possess and how those characteristics make some more "fit" than others. Today these differences, such as sensory and motor functioning (reaction time, visual acuity, and physical strength), are important domains of scientific psychology. Much of the early theoretical and applied work in psychometrics was undertaken in an attempt to measure intelligence. Francis Galton, often referred to as "the father of psychometrics", devised and included mental tests among his anthropometric measures. James McKeen Cattell, who is considered a pioneer of psychometrics, went on to extend Galton's work. Cattell also coined the term mental test, and is responsible for the research and knowledge that ultimately led to the development of modern tests (Kaplan & Saccuzzo, 2010).

German stream

The origin of psychometrics also has connections to the related field of psychophysics. Around the same time that Darwin, Galton, and Cattell were making their discoveries, J. E. Herbart was also interested in "unlocking the mysteries of human consciousness" through the scientific method (Kaplan & Saccuzzo, 2010). Herbart was responsible for creating mathematical models of the mind, which were influential in educational practices for years to come. Following Herbart, E. H. Weber built upon Herbart's work and tried to prove the existence of a psychological threshold, saying that a minimum stimulus was necessary to activate a sensory system. After Weber, G. T. Fechner expanded upon the knowledge he gleaned from Herbart and Weber to devise the law that the strength of a sensation grows as the logarithm of the stimulus intensity. A follower of Weber and Fechner, Wilhelm Wundt is credited with founding the science of psychology. It is Wundt's influence that paved the way for others to develop psychological testing.[1]

20th century
The psychometrician L. L. Thurstone, founder and first president of the Psychometric Society in 1936, developed and applied a theoretical approach to measurement referred to as the law of comparative judgment, an approach that has close connections to the psychophysical theory of Ernst Heinrich Weber and Gustav Fechner. In addition, Spearman and Thurstone both made important contributions to the theory and application of factor analysis, a statistical method developed and used extensively in psychometrics.[citation needed] In the late 1950s, Leopold Szondi made a historical and epistemological assessment of the impact of statistical thinking on psychology during the previous few decades: "in the last decades, the specifically psychological thinking has been almost completely suppressed and removed, and replaced by a statistical thinking. Precisely here we see the cancer of testology and testomania of today."[2]

More recently, psychometric theory has been applied in the measurement of personality, attitudes, beliefs, and academic achievement. Measurement of these unobservable phenomena is difficult, and much of the research and accumulated science in this discipline has been developed in an attempt to properly define and quantify such phenomena. Critics, including practitioners in the physical sciences and social activists, have argued that such definition and quantification is impossibly difficult, and that such measurements are often misused, such as with psychometric personality tests used in employment procedures: "For example, an employer wanting someone for a role requiring consistent attention to repetitive detail will probably not want to give that job to someone who is very creative and gets bored easily."[3]

Figures who made significant contributions to psychometrics include Karl Pearson, Henry F. Kaiser, Carl Brigham, L. L. Thurstone, Georg Rasch, Eugene Galanter, Johnson O'Connor, Frederic M. Lord, Ledyard R Tucker, Arthur Jensen, and David Andrich.

Psychometric, psychometrician and psychometrist appreciation week is the first week in November.


Definition of measurement in the social sciences


The definition of measurement in the social sciences has a long history. A currently widespread definition, proposed by Stanley Smith Stevens (1946), is that measurement is "the assignment of numerals to objects or events according to some rule." This definition was introduced in the paper in which Stevens proposed four levels of measurement. Although widely adopted, this definition differs in important respects from the more classical definition of measurement adopted in the physical sciences, which is that measurement is the numerical estimation and expression of the magnitude of one quantity relative to another (Michell, 1997).

Indeed, Stevens's definition of measurement was put forward in response to the British Ferguson Committee, whose chair, A. Ferguson, was a physicist. The committee was appointed in 1932 by the British Association for the Advancement of Science to investigate the possibility of quantitatively estimating sensory events. Although its chair and other members were physicists, the committee also included several psychologists. The committee's report highlighted the importance of the definition of measurement. While Stevens's response was to propose a new definition, which has had considerable influence in the field, this was by no means the only response to the report. Another, notably different, response was to accept the classical definition, as reflected in the following statement:

Measurement in psychology and physics are in no sense different. Physicists can measure when they can find the operations by which they may meet the necessary criteria; psychologists have but to do the same. They need not worry about the mysterious differences between the meaning of measurement in the two sciences. (Reese, 1943, p. 49)

These divergent responses are reflected in alternative approaches to measurement. For example, methods based on covariance matrices are typically employed on the premise that numbers, such as raw scores derived from assessments, are measurements. Such approaches implicitly entail Stevens's definition of measurement, which requires only that numbers are assigned according to some rule. The main research task, then, is generally considered to be the discovery of associations between scores, and of factors posited to underlie such associations. On the other hand, when measurement models such as the Rasch model are employed, numbers are not assigned based on a rule. Instead, in keeping with Reese's statement above, specific criteria for measurement are stated, and the goal is to construct procedures or operations that provide data that meet the relevant criteria. Measurements are estimated based on the models, and tests are conducted to ascertain whether the relevant criteria have been met.


Instruments and procedures


The first psychometric instruments were designed to measure the concept of intelligence. The best known historical approach involved the Stanford-Binet IQ test, developed originally by the French psychologist Alfred Binet. Intelligence tests are useful tools for various purposes. An alternative conception of intelligence is that cognitive capacities within individuals are a manifestation of a general component, or general intelligence factor, as well as cognitive capacity specific to a given domain. Psychometrics is applied widely in educational assessment to measure abilities in domains such as reading, writing, and mathematics. The main approaches in applying tests in these domains have been Classical Test Theory and the more recent Item Response Theory and Rasch measurement models. These latter approaches permit joint scaling of persons and assessment items, which provides a basis for mapping of developmental continua by allowing descriptions of the skills displayed at various points along a continuum. Such approaches provide powerful information regarding the nature of developmental growth within various domains. Another major focus in psychometrics has been on personality testing. There have been a range of theoretical approaches to conceptualizing and measuring personality. Some of the better known instruments include the Minnesota Multiphasic Personality Inventory, the Five-Factor Model (or "Big 5") and tools such as Personality and Preference Inventory and the Myers-Briggs Type Indicator. Attitudes have also been studied extensively using psychometric approaches. A common method in the measurement of attitudes is the use of the Likert scale. An alternative method involves the application of unfolding measurement models, the most general being the Hyperbolic Cosine Model (Andrich & Luo, 1993).

Theoretical approaches
Psychometricians have developed a number of different measurement theories. These include classical test theory (CTT) and item response theory (IRT).[4][5] An approach which seems mathematically to be similar to IRT but also quite distinctive, in terms of its origins and features, is represented by the Rasch model for measurement. The development of the Rasch model, and the broader class of models to which it belongs, was explicitly founded on requirements of measurement in the physical sciences.[6] Psychometricians have also developed methods for working with large matrices of correlations and covariances. Techniques in this general tradition include: factor analysis,[7] a method of determining the underlying dimensions of data; multidimensional scaling,[8] a method for finding a simple representation for data with a large number of latent dimensions; and data clustering, an approach to finding objects that are like each other. All these multivariate descriptive methods try to distill large amounts of data into simpler structures. More recently, structural equation modeling[9] and path analysis represent more sophisticated approaches to working with large covariance matrices. These methods allow statistically sophisticated models to be fitted to data and tested to determine if they are adequate fits. One of the main deficiencies in various factor analyses is a lack of consensus in cutting points for determining the number of latent factors. A usual procedure is to stop factoring when eigenvalues drop below one because the original sphere shrinks. The lack of the cutting points concerns other multivariate methods, also.[citation needed]
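The eigenvalue-greater-than-one stopping rule mentioned above can be sketched as follows; the data are simulated and the example is illustrative only.

```python
# Count how many eigenvalues of an item correlation matrix exceed 1.0
# (the "eigenvalues drop below one" rule mentioned above).  Simulated data.
import numpy as np

rng = np.random.default_rng(2)
factor = rng.normal(size=(300, 1))
items = factor @ rng.normal(size=(1, 6)) + rng.normal(scale=1.0, size=(300, 6))

R = np.corrcoef(items, rowvar=False)          # item correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
n_factors = int(np.sum(eigenvalues > 1.0))

print("eigenvalues:", np.round(eigenvalues, 2))
print("factors retained:", n_factors)
```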


Key concepts
Key concepts in classical test theory are reliability and validity. A reliable measure is one that measures a construct consistently across time, individuals, and situations. A valid measure is one that measures what it is intended to measure. Reliability is necessary, but not sufficient, for validity. Both reliability and validity can be assessed statistically. Consistency over repeated measures of the same test can be assessed with the Pearson correlation coefficient, and is often called test-retest reliability.[10] Similarly, the equivalence of different versions of the same measure can be indexed by a Pearson correlation, and is called equivalent forms reliability or a similar term.[10] Internal consistency, which addresses the homogeneity of a single test form, may be assessed by correlating performance on two halves of a test, which is termed split-half reliability; the value of this Pearson product-moment correlation coefficient for two half-tests is adjusted with the Spearman-Brown prediction formula to correspond to the correlation between two full-length tests.[10] Perhaps the most commonly used index of reliability is Cronbach's α, which is equivalent to the mean of all possible split-half coefficients. Other approaches include the intra-class correlation, which is the ratio of variance of measurements of a given target to the variance of all targets.

There are a number of different forms of validity. Criterion-related validity can be assessed by correlating a measure with a criterion measure known to be valid. When the criterion measure is collected at the same time as the measure being validated, the goal is to establish concurrent validity; when the criterion is collected later, the goal is to establish predictive validity. A measure has construct validity if it is related to measures of other constructs as required by theory. Content validity is a demonstration that the items of a test are drawn from the domain being measured. In a personnel selection example, test content is based on a defined statement or set of statements of knowledge, skill, ability, or other characteristics obtained from a job analysis.

Item response theory models the relationship between latent traits and responses to test items. Among other advantages, IRT provides a basis for obtaining an estimate of the location of a test-taker on a given latent trait as well as the standard error of measurement of that location. For example, a university student's knowledge of history can be deduced from his or her score on a university test and then be compared reliably with a high school student's knowledge deduced from a less difficult test. Scores derived by classical test theory do not have this characteristic, and assessment of actual ability (rather than ability relative to other test-takers) must be assessed by comparing scores to those of a "norm group" randomly selected from the population. In fact, all measures derived from classical test theory are dependent on the sample tested, while, in principle, those derived from item response theory are not.
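The split-half/Spearman-Brown and Cronbach's alpha computations described above can be sketched on simulated dichotomous item scores; this is an illustrative calculation, not a substitute for a full psychometric analysis.

```python
# Split-half reliability (Spearman-Brown corrected) and Cronbach's alpha.
import numpy as np

rng = np.random.default_rng(3)
ability = rng.normal(size=(200, 1))
# Simulated 0/1 responses to 8 items driven by a common ability plus noise
items = (ability + rng.normal(scale=1.0, size=(200, 8)) > 0).astype(float)

# Split-half: correlate odd and even half-test scores, then apply Spearman-Brown
odd, even = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_half / (1 + r_half)

# Cronbach's alpha from item and total-score variances
k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                         / items.sum(axis=1).var(ddof=1))

print(f"split-half (corrected) = {split_half:.3f}, alpha = {alpha:.3f}")
```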

Standards of quality
The considerations of validity and reliability typically are viewed as essential elements for determining the quality of any test. However, professional and practitioner associations frequently have placed these concerns within broader contexts when developing standards and making overall judgments about the quality of any test as a whole within a given context. A consideration of concern in many applied research settings is whether or not the metric of a given psychological inventory is meaningful or arbitrary.[11]

Testing standards
In this field, the Standards for Educational and Psychological Testing[12] place standards about validity and reliability, along with errors of measurement and related considerations, under the general topic of test construction, evaluation and documentation. The second major topic covers standards related to fairness in testing, including fairness in testing and test use, the rights and responsibilities of test takers, testing individuals of diverse linguistic backgrounds, and testing individuals with disabilities. The third and final major topic covers standards related to testing applications, including the responsibilities of test users, psychological testing and assessment, educational testing and assessment, testing in employment and credentialing, plus testing in program evaluation and public policy.

Evaluation standards
In the field of evaluation, and in particular educational evaluation, the Joint Committee on Standards for Educational Evaluation[13] has published three sets of standards for evaluations. The Personnel Evaluation Standards[14] was published in 1988, The Program Evaluation Standards (2nd edition)[15] was published in 1994, and The Student Evaluation Standards[16] was published in 2003. Each publication presents and elaborates a set of standards for use in a variety of educational settings. The standards provide guidelines for designing, implementing, assessing and improving the identified form of evaluation. Each of the standards has been placed in one of four fundamental categories to promote educational evaluations that are proper, useful, feasible, and accurate. In these sets of standards, validity and reliability considerations are covered under the accuracy topic. For example, the student accuracy standards help ensure that student evaluations will provide sound, accurate, and credible information about student learning and performance.

References
Bibliography
Andrich, D. & Luo, G. (1993). "A hyperbolic cosine model for unfolding dichotomous single-stimulus responses". Applied Psychological Measurement 17 (3): 253–276. doi:10.1177/014662169301700307 [17].
Michell, J. (1997). "Quantitative science and the definition of measurement in psychology". British Journal of Psychology 88 (3): 355–383. doi:10.1111/j.2044-8295.1997.tb02641.x [18].
Michell, J. (1999). Measurement in Psychology. Cambridge: Cambridge University Press.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research; expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
Reese, T.W. (1943). "The application of the theory of physical measurement to the measurement of psychological magnitudes, with three experimental examples". Psychological Monographs 55: 1–89.
Stevens, S. S. (1946). "On the theory of scales of measurement". Science 103 (2684): 677–680. doi:10.1126/science.103.2684.677 [59]. PMID 17750512 [60].
Thurstone, L.L. (1927). "A law of comparative judgment". Psychological Review 34 (4): 278–286. doi:10.1037/h0070288 [61].
Thurstone, L.L. (1929). The Measurement of Psychological Value. In T.V. Smith and W.K. Wright (Eds.), Essays in Philosophy by Seventeen Doctors of Philosophy of the University of Chicago. Chicago: Open Court.
Thurstone, L.L. (1959). The Measurement of Values. Chicago: The University of Chicago Press.
Psychometric Assessments. University of Melbourne. http://www.services.unimelb.edu.au/careers/student/interviews/test.html
Blinkhorn, S.F. (1997). "Past imperfect, future conditional: fifty years of test theory". British Journal of Mathematical and Statistical Psychology 50 (2): 175–185. doi:10.1111/j.2044-8317.1997.tb01139.x [19].


Notes
[1] Kaplan, R.M., & Saccuzzo, D.P. (2010). Psychological Testing: Principles, Applications, and Issues (8th ed.). Belmont, CA: Wadsworth, Cengage Learning.
[2] Leopold Szondi (1960) Das zweite Buch: Lehrbuch der Experimentellen Triebdiagnostik. Huber, Bern und Stuttgart, 2nd edition. Ch. 27. From the Spanish translation, B) II Las condiciones estadisticas, p. 396. Quotation:
[3] Psychometric Assessments (http://www.services.unimelb.edu.au/careers/student/interviews/test.html). University of Melbourne.
[4] Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Erlbaum.
[5] Hambleton, R.K., & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff.
[6] Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research; expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
[7] Thompson, B.R. (2004). Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications. American Psychological Association.
[8] Davison, M.L. (1992). Multidimensional Scaling. Krieger.
[9] Kaplan, D. (2008). Structural Equation Modeling: Foundations and Extensions (2nd ed.). Sage.
[10] Reliability definitions at the University of Connecticut (http://www.gifted.uconn.edu/Siegle/research/Instrument Reliability and Validity/Reliability.htm)
[11] Blanton, H., & Jaccard, J. (2006). Arbitrary metrics in psychology (http://psychology.tamu.edu/Faculty/blanton/bj.2006.arbitrary.pdf). American Psychologist, 61(1), 27-41.
[12] The Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html#overview)
[13] Joint Committee on Standards for Educational Evaluation (http://www.wmich.edu/evalctr/jc/)
[14] Joint Committee on Standards for Educational Evaluation. (1988). The Personnel Evaluation Standards: How to Assess Systems for Evaluating Educators (http://www.wmich.edu/evalctr/jc/PERSTNDS-SUM.htm). Newbury Park, CA: Sage Publications.
[15] Joint Committee on Standards for Educational Evaluation. (1994). The Program Evaluation Standards, 2nd Edition (http://www.wmich.edu/evalctr/jc/PGMSTNDS-SUM.htm). Newbury Park, CA: Sage Publications.
[16] Committee on Standards for Educational Evaluation. (2003). The Student Evaluation Standards: How to Improve Evaluations of Students (http://www.wmich.edu/evalctr/jc/briefing/ses/). Newbury Park, CA: Corwin Press.
[17] http://dx.doi.org/10.1177%2F014662169301700307
[18] http://dx.doi.org/10.1111%2Fj.2044-8295.1997.tb02641.x
[19] http://dx.doi.org/10.1111%2Fj.2044-8317.1997.tb01139.x

Further reading
Borsboom, Denny (2005). Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge: Cambridge University Press. ISBN 978-0-521-84463-5. Lay summary (http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=978-0-521-84463-5) (28 June 2010).
DeVellis, Robert F (2003). Scale Development: Theory and Applications (http://books.google.com/?id=BYGxL6xLokUC&printsec=frontcover&dq=scale+development#v=onepage&q&f=false) (2nd ed.). London: Sage Publications. ISBN 0-7619-2604-6 (cloth); paperback ISBN 0-7619-2605-4. Retrieved 11 August 2010.

External links
APA Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html)
Joint Committee on Standards for Educational Evaluation (http://www.wmich.edu/evalctr/jc/)
The Psychometrics Centre, University of Cambridge (http://www.psychometrics.cam.ac.uk)
Psychometric Society and Psychometrika homepage (http://www.psychometrika.org/)
London Psychometric Laboratory (http://www.psychometriclab.com)
Rasch analysis in psychometrics (http://www.rasch-analysis.com/)
As Test-Taking Grows, Test-Makers Grow Rarer (http://www.nytimes.com/2006/05/05/education/05testers.html?ex=1304481600&en=bec6ba0fec0c3772&ei=5090&partner=rssuserland&emc=rss), May 5, 2006, NY Times. "Psychometrics, one of the most obscure, esoteric and cerebral professions in America, is now also one of the hottest."


Vineland Adaptive Behavior Scale


The Vineland Adaptive Behavior Scale is a psychometric instrument used in child and adolescent psychiatry. It is used especially in the assessment of individuals with mental retardation, pervasive developmental disorder, and other types of developmental delay.[1]

References

Thomasmeeks, Tommy2010, Torres056, Wiki alf, Wvdberg, Zzuuzz, 68 anonymous edits Rasch model Source: http://en.wikipedia.org/w/index.php?oldid=550741868 Contributors: A bit iffy, Amead, Btyner, CBM, Cycologist, D4g0thur, DavidMorgan1950, Den fjttrade ankan, Dti21, Egil, Gjhernandezp, H2otto, Holon, Hongooi, Hubbardaie, JaGa, Jeremykemp, Johndburger, Laminado, Mack2, Mangotree, MathewTownsend, Melcombe, Michael Hardy, Mjb, Nesbit, Niteowlneils, OrenBochman, Physchim62, Pmetric, RichardF, Ricky81682, Robroot, RoyBoy, Salsb, Sanguinity, SchreiberBike, Shadowjams, The Rambling Man, TinJack, Tonyfaull, WPFisherJr, Wholehearted, Winsteps, Wissons, Zapp645, 84 anonymous edits Rasch model estimation Source: http://en.wikipedia.org/w/index.php?oldid=525685150 Contributors: Holon, Kiefer.Wolfowitz, Mack2, Melcombe, Michael Hardy, Schomerus, Winsteps, Zvika, Zybler, 8 anonymous edits Rating scale Source: http://en.wikipedia.org/w/index.php?oldid=551077497 Contributors: Allens, Ameliorate!, Bennose, Berek, Caroline Jarrett, D.M. from Ukraine, DillonLarson, Eastlaw, Eurodog, Holon, Hordaland, J04n, Jack Greenmaven, Jarvik, Jll, Jnestorius, Joe Gazz84, Joshua Bruce Alan Noone, Joshuaali, Jtneill, Manscher, Mattisse, Mboverload, Moisture, Mtevfrog, Noddy1000, Proberts2003, PseudoSudo, Psychonomics, Richard Keatinge, Robbster98, Robk6364, Svenjick, Three-quarter-ten, Turadg, Wesker018, Woohookitty, Xtfer, 46 anonymous edits

459

Article Sources and Contributors


Rating scales for depression Source: http://en.wikipedia.org/w/index.php?oldid=545510996 Contributors: Apokrif, Arcadian, DARTH SIDIOUS 2, Eastlaw, FiachraByrne, Garrondo, Headbomb, Krash42424, Michael Hardy, MichaelExe, Panda11, Petersam, Pigman, Rjwilmsi, Silent Method, Suntag, Wetman123, Xasodfuih, 3 anonymous edits Reliability (psychometrics) Source: http://en.wikipedia.org/w/index.php?oldid=550697847 Contributors: ASQ-Reliability-Div, Amead, Arkuse, Borgx, Bradley326, Bryan Derksen, Calvin 1998, Correogsk, Cpl Syx, Cruise, DJ Nietzsche, DM78, Dawnseeker2000, Dcljr, Discospinster, Ekabhishek, Fgnievinski, Fhimas, Flewis, Fmacgregor, GL, Gregbard, Hgamboa, Holon, Iulus Ascanius, J.delanoy, Jeff3000, Jfitzg, Jon Awbrey, Jovianeye, Jtneill, Kami.inu, L Kensington, Lambiam, Leges sine moribus vanae, Limegreen, Melcombe, Michael Anon, Michael Hardy, Mikael Hggstrm, Mild Bill Hiccup, Msalganik, Mydogategodshat, Nesbit, Nevit, Newton2, Patstuart, Pgk, Pkgx, Pownuk, Qwfp, Rfurlano, Rich Farmbrough, Rlsheehan, Ronz, Sam Hocevar, SimonP, Skagedal, Smack, SocratesJedi, Squids and Chips, Taak, Tayste, Texteditor, Versageek, Wigie, Wiki091005!!, Xasodfuih, Zvika, 101 anonymous edits Repeatability Source: http://en.wikipedia.org/w/index.php?oldid=540070130 Contributors: ArnaudContet, Byron Vickers, Canderson1494, Colar snap, Dcoetzee, EricEnfermero, Griccioppo, Hydrogen Iodide, Itub, Jesse V., Kbh3rd, Mattisse, Melcombe, Mikael Hggstrm, Rfurlano, Rich Farmbrough, Rlsheehan, SchreiberBike, Spalding, Spiel496, Splintercellguy, Squids and Chips, Stephentarr, Stifle, Uncle G, Weirdoalisa, XJaM, 19 anonymous edits Reproducibility Source: http://en.wikipedia.org/w/index.php?oldid=548872709 Contributors: 16@r, 20040302, 2over0, Arashbm, Arc de Ciel, BigDwiki, BrianOfRugby, Bryan Derksen, Byron Vickers, CDN99, CRKingston, Capitalist, Carrionluggage, Charles Matthews, Cookiedog, David Eppstein, David Gerard, Dept of Alchemy, Duoduoduo, Ed Poor, Ee Azer, Enric Naval, Ephery, Fomels, Fwappler, Gerhi, Hephaestos, Hu12, Ino5hiro, Jabowery, Jon Awbrey, Juesch, K, Kotepho, Lee Daniel Crocker, Lightmouse, Lowellian, Magioladitis, Marie Poise, MathMartin, Mattisse, Mayumashu, Melcombe, Michael Hardy, Mthomps, N4nojohn, NeoJustin, Nsda, Olathe, Pcarbonn, PeregrineAY, Pgr94, Previously ScienceApologist, Remember the dot, Richard001, Rjwilmsi, Rlsheehan, Robert Merkel, Shii, Spalding, Squids and Chips, Stevertigo, Tedickey, Tnxman307, Verne Equinox, Vsmith, Wolframite74, XJaM, Xtofe, 48 anonymous edits Riddle scale Source: http://en.wikipedia.org/w/index.php?oldid=548031794 Contributors: Atlantima, Bearcat, CTF83!, Sigma0 1, Smartse, StAnselm, William Avery, 2 anonymous edits Risk Inclination Formula Source: http://en.wikipedia.org/w/index.php?oldid=543529197 Contributors: Bradyjack Risk Inclination Model Source: http://en.wikipedia.org/w/index.php?oldid=550037117 Contributors: Bradyjack, Jackson Peebles, Tentinator, 7 anonymous edits Role-based assessment Source: http://en.wikipedia.org/w/index.php?oldid=467040357 Contributors: DragonflySixtyseven, Eustress, HRman01, JoeBrennan, Jperiquito, Rich Farmbrough, WeijiBaikeBianji, Weirdguy0509, 1 anonymous edits Scale (social sciences) Source: http://en.wikipedia.org/w/index.php?oldid=538911626 Contributors: AndrewHowse, BartlebytheScrivener, Cokoli, Cruise, Destynova, EagleFan, Estel, Falcon Kirtaran, FirstPrinciples, Glenn, Grick, Hanshot1st, Holon, Imersion, Inwind, J. 
Spencer, Janks, Jose Ramos, Karl-Henner, Klemen Kocjancic, Klonimus, Lakinekaki, Lupo, Masih ghaziasgar, Maurreen, Meclee, Michael Hardy, Mydogategodshat, Paddles, Skoban, StAnselm, Stwalkerster, Taak, Tsemii, Wik, Wotnow, Yaris678, 34 anonymous edits Self-report inventory Source: http://en.wikipedia.org/w/index.php?oldid=541755171 Contributors: AdRock, Andycjp, Atlant, Casey Abell, Chealer, Cswrye, Diego Moya, Dlmccaslin, Ewulp, Iulus Ascanius, Jason Quinn, Mattisse, PaulWicks, R'n'B, RLM2007, Rhurik, Sardanaphalus, SlackerMom, Slp1, SteinbDJ, 8 anonymous edits Semantic differential Source: http://en.wikipedia.org/w/index.php?oldid=540830569 Contributors: Arcadian, Bbx, ChrisGualtieri, Clicketyclack, Cruise, David Cruise, Declutter, Diego Moya, Dimitrees, Dreftymac, Drjheise, Dsp13, Feour, Flammifer, Freakofnurture, Futureobservatory, Fvw, Jfdwolff, Kaihsu, Mdd, Mfumerkin, Michael Hardy, Mild Bill Hiccup, Quiddity, Qwfp, Rasputinous, Rich Farmbrough, SimonP, 29 anonymous edits Sequential probability ratio test Source: http://en.wikipedia.org/w/index.php?oldid=544592284 Contributors: Adoniscik, Agricola44, Chaintzean, Et764, Inks.LWC, Iulus Ascanius, Melcombe, Moverly, Rjwilmsi, Salvio giuliano, Silly rabbit, SkeletorUK, WhatamIdoing, Widr, 18 anonymous edits SESAMO Source: http://en.wikipedia.org/w/index.php?oldid=545387516 Contributors: ADM, Ary29, Beland, DGG, FoCuSandLeArN, Galoubet, Lova Falk, M2Ys4U, Mifter, R'n'B, The Anome, Trevinci, Wheelchair Epidemic, WikiPuppies, 6 anonymous edits Situational judgement test Source: http://en.wikipedia.org/w/index.php?oldid=550956330 Contributors: Amgardner13, Bobrayner, Bossrat, Elijah Earl, Eustress, Extraordinary, Gadget850, Gmburn, Hans Adler, Iulus Ascanius, John Bessa, Kawalker14, Laurenjharris, MathewTownsend, Mattisse, Mlmorey, Mogism, Quinn2jm, Rjwilmsi, Sardanaphalus, Smallman12q, Sponto, Tommy.may3ta, YUL89YYZ, 18 anonymous edits Psychometric software Source: http://en.wikipedia.org/w/index.php?oldid=533428774 Contributors: BartlebytheScrivener, Colonies Chris, Cycologist, Edward, Iulus Ascanius, Jpm4qs, Khan.edu, Ktr101, Meyerjp, Nelsonlarryr, Philchalmers, Plastikspork, Practical321, Qwfp, Res2216firestar, Rwwww, Victor Chmara, Winsteps, Wirthrj, 41 anonymous edits SpearmanBrown prediction formula Source: http://en.wikipedia.org/w/index.php?oldid=551045397 Contributors: Amead, Chris Roy, EldKatt, Irbatic, JakeVortex, Melcombe, Michael Hardy, Phil Boswell, Shadowjams, Singingdaisies, Taak, The wub, 13 anonymous edits Standard-setting study Source: http://en.wikipedia.org/w/index.php?oldid=517266284 Contributors: Gary King, Ghaly, Iulus Ascanius, Kmasters0, Laszlo5000, Rufus843, Shevek57, Ynhockey, 9 anonymous edits Standards for Educational and Psychological Testing Source: http://en.wikipedia.org/w/index.php?oldid=436491818 Contributors: Blathnaid, Bobblehead, Iulus Ascanius, Jtneill, RichardF, Wavelength, Zginder, 3 anonymous edits StanfordBinet Intelligence Scales Source: http://en.wikipedia.org/w/index.php?oldid=551040657 Contributors: Adavis444, AgentSmith03, Andy Farrell, Angelic Wraith, Anusan.rasalingam, Arcadian, Asparagus, BadLeprechaun, Barrylb, Bobo192, BozMo, Bueller 007, CapitalR, Chowbok, Chuq, Colin MacLaurin, Coubure, Cresix, Curps, D3, DR04, Danhash, DanielCD, Dasartis, Daveswagon, Deltabeignet, Dominus, Dupz, E. 
Ripley, Eequor, Eliz2707, Evilasiangenius, FSharpMajor, FT2, Feefa, Floaterfluss, Freek Verkerk, FreelanceWizard, FrenchIsAwesome, Gaff, Gazisamah, Gilliam, GraemeL, Heroeswithmetaphors, HiB2Bornot2B, I dream of horses, Iulus Ascanius, IvyIQTest100, Jaane123, Jahiegel, Jammoe, Janviermichelle, Jengod, JerryFriedman, JoeSmack, John, John Nevard, Johnkarp, Joshafina, Jujutacular, Jwwdnbts, Kazvorpal, Kevincof, Kingturtle, Klemen Kocjancic, Ksyrie, Lpgeffen, Luan2012, M4gnum0n, Magioladitis, Maqsarian, Mattisse, Metsamies, Michael Hardy, Mike R, Mmortal03, Mshonle, Neewhom, Omnipaedista, Ospalh, Patman, PhilKnight, Pickle swan, Pmezard, Prospero, Qrc, Quae legit, Redjazz96, RichardF, Ringbang, Ritapruzansky, Rmosler2100, Roastytoast, Ronbo75, Rrburke, SQGibbon, Sadi Carnot, Samwaltz, Stern, Subsolar, T(T), Tdowling, The Thing That Should Not Be, Thecolemanation, Thshdw, Tim bates, Tstrobaugh, Ward3001, WeijiBaikeBianji, WhisperToMe, WikHead, Wikid77, Ynhockey, Zanimum, 140 anonymous edits Stanine Source: http://en.wikipedia.org/w/index.php?oldid=541160376 Contributors: Amead, Bunnyhop11, Caerwine, Errarel, Feinstein, Gaius Cornelius, Gknor, Gpvos, GregorB, Holon, Iulus Ascanius, Lee M, Melcombe, Mfriedma, Michael Hardy, Nbarth, Neutrality, Nunogloop, RichardF, Rwalkerusa, Salix alba, Skagedal, TutterMouse, Walter Grlitz, 15 anonymous edits Statistical hypothesis testing Source: http://en.wikipedia.org/w/index.php?oldid=551173814 Contributors: AbsolutDan, Acroterion, Adamjslund, Adjespers, Adoniscik, Agricola44, Alansohn, Albmont, AlexKepler, Andonic, Andreim27, Andresswift, Andrew73, Andycjp, Arcadian, Arthena, Aua, Aurorion, B, BWoodrum, Badjoby, Bazonka, Bbarkley2, Bdolicki, Benlisquare, Bentogoa, Birge, BlaiseFEgan, Boxplot, Bradford rob, Brougham96, Btyner, Catraeus, Cburnett, Cherkash, Chris the speller, Citruscoconut, Conversion script, Coppertwig, Crasshopper, Crazy george, Cretog8, Cybercobra, Cyc, Cydmab, Czap42, Dailyknowledge, Daniel11, Darkwind, DavidCBryant, DavidSJ, Davidruben, Dcljr, Dcoetzee, Ddxc, DeFaultRyan, Den fjttrade ankan, Dhaluza, Digfarenough, Drakedirect, Edward, Emanuele.olivetti, Epbr123, Eukaryote89, Feinstein, Fgnievinski, FreeT, G716, Gabbe, GargantuanDan, Gary King, Ggchen, Giftlite, GoingBatty, Graham87, Hatshepsut1988, Henrygb, Hu, Hu12, Illia Connell, Ivancho.was.here, J heisenberg, J04n, J36miles, Jackol, Jackzhp, Jake Wartenberg, JamesAM, Jason.grossman, Jcchurch, JimsMaher, Jmnbatista, John Quiggin, John of Reading, Johnkarp, Jollyroger131, JonDePlume, Jprg1966, Juliancolton, Jyeee, K, Kastchei, Kateshortforbob, Kjtobo, Kkddkkdd, Knappsych, Krawi, Lambiam, Larjohn, Larry_Sanger, Leapsword, LeilaniLad, Levineps, Lordmetroid, Magioladitis, MasterMind5991, Materialscientist, Mattisse, Mbhiii, Mcld, Mechnikov, Melcombe, Meritus, Michael C Price, Michael Hardy, Mikelove1, Mortense, MrOllie, Mudd1, MystRivenExile, NYC2TLV, Nbarth, Nijdam, NormDor, Nsaa, Nullhypothesistester, Oleg Alexandrov, Ott2, Patrick, Pdbogen, Pejman47, Penitence, Pete.Hurd, Pewwer42, Pgan002, Philippe, Philippe (WMF), Pixie, Policron, Poor Yorick, Protonk, Pstevens, Psy1235, Qwfp, Radagast83, Reedy, Requestion, Rich Farmbrough, Rjwilmsi, Robbyjo, Robma, Ronz, Rory O'Kane, Ryan.morton, Ryanblak, Salamurai, Sam Blacketer, Schwnj, Shabbychef, Silas S. 
Brown, SiobhanHansa, Sir Paul, Skbkekas, Slakr, Some jerk on the Internet, Someguy1221, Spalding, Speciate, Srich32977, Statlearn, Storm Rider, Strategist333, Sullivan.t.j, Talgalili, Tanath, Tayste, TedDunning, Tedunning, Terry0051, The Anome, The-tenth-zdog, Thomasmeeks, Thosjleep, Thoytpt, Tibbyshep, Tim bates, Tom Lougheed, Trainspotter, TrickyTank, Trift, Tsujimasen, Ultramarine, Ulyssesmsu, Urdutext, Utcursch, Valcust, Varuag doos, Verlainer, Viraltux, Waynechew87, West.andrew.g, Wikid77, Wile E. Heresiarch, Wittygrittydude, Wolverineski, Woohookitty, Xdenizen, Xiphosurus, Yg12, Zheric, Zvika, 559 , anonymous edits Statistical inference Source: http://en.wikipedia.org/w/index.php?oldid=541823581 Contributors: Aagtbdfoua, Ancheta Wis, Arcadian, Bashirra1, Bayes Puppy, Benwing, Bo Jacoby, CRGreathouse, Cherkash, Chris the speller, Christian List, Conversion script, Den fjttrade ankan, Dick Beldin, Douglas Whitaker, Erianna, Eric Kvaalen, G716, GDibyendu, Giftlite, Graham87, Greenleafjacob, Headbomb, Henrygb, Hoary, Hyacinth, Illia Connell, JA(000)Davidson, Jkominek, John of Reading, Jtneill, Jvstone, Kenneth M Burke, Kiefer.Wolfowitz, KlaudiuMihaila, Koavf, Ksyrie, L.tak, Larry Sanger, Larry_Sanger, Lfkrebs, Maher27777, MarkSweep, Mattisse, McPastry, Melcombe, Michael Hardy, Mo ainm, Modeha, Nbarth, Odoncaoa, Oleg Alexandrov, Ph.eyes, Piotrus, PleaseStand, Pollinosisss, Reindra, Rich Farmbrough, Rjwilmsi, RockfanRecords, Rongou, Run54, S2000magician, Scwlong, Shalom Yechiel, Strife911, Tassedethe, The

460

Article Sources and Contributors


Tetrast, Tomi, Vahid232323, Wikipelli, , , 34 anonymous edits Survey methodology Source: http://en.wikipedia.org/w/index.php?oldid=551016241 Contributors: Abrech, AbsolutDan, Adrian 1001, Aesopos, Alangardner1001, Alfie66, Altenmann, Archiegordon, Athaenara, Avenue, Aweet, Banano818, Bdqweb dcwydyu, Beland, Bluestorm310, Boffob, Bomac, Bruceanthro, Bryan Derksen, Buffer34, Calltech, CambridgeBayWeather, Carmen56, Cehoving, Cevalsi, Cherkash, Chris53516, Citing, Cpl Syx, CryptoDerk, Daob, DavidWBrooks, Dekisugi, Demjanich, Den fjttrade ankan, Destynova, Download, Dr Runt, Dr.Bastedo, DrMicro, Dspradau, ECEstats, Ebelknap, Efgn, ElKevbo, Elcobbola, Erckvlp, Evanh2008, Ewc21, Falcon8765, Farmanesh, Felagund, Fieldday-sunday, Finchlove, Flopsy Mopsy and Cottonmouth, Free Software Knight, Frsky, G716, Gaius Cornelius, Gallant.Cassious, Giftlite, Glenn, Gmaxwell, Goh wz, GraemeL, Grafen, Greychris, Gsociology, Gurch, HIED ADHE536, Hamsamich, Henrygb, Hu12, IW.HG, Ioannes Pragensis, IssuesRUs, JLaTondre, Jaitchay, Jaxl, Jcc1, Jeffrey Henning, JenLouise, Jlittlet, John FitzGerald, Johnjohnsonver, Joseph Solis in Australia, Js.oconnor, Jtneill, Jusque, Just plain Bill, Khazar, Kiefer.Wolfowitz, Kodwo, Kuru, Kuteni, Lawrence.larry.huong, LeCire, Lendorien, Life of Riley, LutzPrechelt, Mark, Markhurd, MathewBerry, Mattjans, Maurreen, Meclee, Melchoir, Melcombe, Michael Hardy, Michael Snow, MrOllie, Mundokiir, Mv276, Mydogategodshat, Natanv, NeutralBosnian, Nolanus, Nosplashback, Optigan13, Oseland, Philip Trueman, Piano non troppo, Pinethicket, Piotrus, Pretty Green, Psy MA, Qwfp, R'n'B, R-dogg122, RJASE1, Reedy, Regancy42, Rhondact1, Rich Farmbrough, Richard W.M. Jones, RichardF, Rjensen, Rplal120, Samsara, Schwnj, Seglea, Selge44, Serenity id, Shellym04, Showtime2009, Snigbrook, Socialresearch, SpikeToronto, Streltzer, Stwheel1, Suweller, Taak, Tabbycas, TacoBelly, Thekohser, Therearewaytoomanybooksinhere, Thosjleep, Thumperward, Tomasr, Tomsega, Toytoy, Trippmarxx, Triwbe, Tyler, Ulticrow, Waggers, Webeffect77, Werieth, WikiDan61, Wikiklrsc, Wikilibrarian, Winterstein, Zzuuzz, 306 anonymous edits Sten scores Source: http://en.wikipedia.org/w/index.php?oldid=504219138 Contributors: Amead, Asabjf, Biscuittin, Khazar2, Mycatharsis, SewerCat, XLerate, 4 anonymous edits Structural equation modeling Source: http://en.wikipedia.org/w/index.php?oldid=550045040 Contributors: Acpcmc, Adzinok, AndreasWittenstein, Andrewwilson2, Auto469680, Ayayla, BD2412, Barfooz, Billymac00, BrandSmith1960, Brandmaier, Brocagh, Cdagnino, Chris.westland, Countdrac, Cryptic, Ctacmo, Dasim, EdJohnston, Edhubbard, Ellenmc, Eternityqueen, Eykanal, Fadulj, Flavonoid, Fnielsen, G716, Geced, Gesang75, Glane23, Houdini5000, Ictlogist, Ilikeed, Ioannes Pragensis, JPG-GR, Jacksawyer, JanYv, JorisvS, Jpritikin, Jugander, Kastchei, Kbdank71, Kenneth M Burke, Kiefer.Wolfowitz, Kmarkus, Kvihill, Lgallindo, Maarten Hermans, Melchoir, Melcombe, Michael Hardy, MrOllie, Mycatharsis, Mzabduk, Naq, Nwstephens, Oleg Alexandrov, Oli Filth, Oren0, Perezoso, Pgan002, Ph.eyes, PresN, Qwfp, Rcnatarajan, Rich Farmbrough, Rickdeckard, Ronz, S9901470, Schwnj, Silly rabbit, Some standardized rigour, Stephenh, SwiftAmhe, TastyPoutine, Tayste, Tdslk, Teake, Thringer, Tim bates, Tpb, Triona, Victor Chmara, Vince Wiggins, WikHead, 136 anonymous edits Lewis Terman Source: http://en.wikipedia.org/w/index.php?oldid=544911409 Contributors: 36invisible, AaronSw, Agateller, Allen3, BD2412, Bender235, Brighamhb, Cassmus, Chaiken, Churn and change, CountOlaf, 
DNewhall, Dandv, Decumanus, Deville, Doczilla, Ekabhishek, EricEnfermero, Erinwolfe, Erp, Etacar11, Freek Verkerk, Gregor Strasser, HarikumarR, Ilikeliljon, Japo, JephthahsDaughter, Jokestress, Joseph Solis in Australia, Jrtayloriv, JustAGal, Kiphinton, Koozedine, Kubigula, Littlecheri, Melaen, Missnancydrew, Namiba, Naraht, Nentrex, Nesbit, Paul Magnussen, Ppoulin, Rjwilmsi, Rogerd, Roleren, Satori Son, ScottyBerg, Sheynhertz-Unbayg, Sjcann123, Skagedal, Stern, Tagishsimon, Tassedethe, Temporaluser, The Anome, Tide rolls, Tigr56, Tstrobaugh, Vectorjohn, Veneziano, Victor Chmara, Viriditas, Waacstats, WeijiBaikeBianji, WordyGirl90, Zoicon5, 44 anonymous edits Test (assessment) Source: http://en.wikipedia.org/w/index.php?oldid=550492833 Contributors: 100110100, 4001001A, 888mlee, Addihockey10, Addshore, Adorie, Ahoerstemeier, Ajayexpert, Allens, Amead, Amillar, Andres, Andy, Angela, Apeloverage, Arnestranden, AttishOculus, AzaToth, BD2412, Bobblehead, Bobo192, Bongwarrior, Braydonanderson, Brett, Brick Thrower, C12H22O11, COMPFUNK2, Caesura, Can't sleep, clown will eat me, CanadianLinuxUser, Captain-n00dle, Cerebus123, Chowbok, Chris53516, Cimex, Closedmouth, Cmdrjameson, CommonsDelinker, Crazymonkey1123, Dah31, Danielkueh, Darkwind, David Jay Walker, Dawnseeker2000, Dcoetzee, Dduttaroy, Deathphoenix, Deli nk, Dennis Kwaria, Depressionman, Der Spion, Deville, Doct.proloy, Don421, Dreadstar, Dysprosia, Dzt, EHLUK, EamonnPKeane, Edgar181, Ellisonch, Emilysepencer, Emote, EricWesBrown, Etexzan, Euinmotion, ExplicitImplicity, F.u.c.k all tests and exams!, FanCon, Fangz, Fatema09044(2), Favonian, Fayenatic london, Finalnight, Fiskpinne, Fito, Flamurai, FrancoGG, Gaderffii, Galoubet, Galzigler, Gary13579, Georg Hurtig, Germio, Ginsengbomb, Gmv0419, Goktr001, Graham87, GreatWhiteNortherner, Green caterpillar, Gz33, Haadk, Hellcat fighter, HelloAnnyong, Holon, IncidentalPoint, Interiot, Irfzam, IrishStephen, Isnow, Iss246, Iulus Ascanius, JQF, JaGa, Jamee999, Jellyfish dave, Jeremykemp, Jfitzg, Jiang, JivaGroup, JohnOwens, Jongrover, Jgre, KF, Keilana, Khin007, Kthapelo, Latitude0116, LeaW, LeisureContributer, Leszek Jaczuk, Ligar, LilHelpa, Little1scotty3, Logan, Logical Cowboy, Longhair, Lotje, Luckybuccaneer, LuigiManiac, Lunakeet, MZMcBride, Maheshkumaryadav, Mais oui!, Mark387533, MassimoAr, Mattbuck, Merovingian, Merphant, Michaelbusch, Mikaey, Mikez, Miles, Milton Stanley, Minecraftboy, MrJerry1987, Musical Linguist, Mzyxptlk, Nabla, Naryathegreat, Neonumbers, NerdyScienceDude, Nescafe11, Nikkimaria, Obersachse, Ohnoitsjamie, Paboe, Parvmanish, Patrick, Paul Klenk, Pauld, Pelivani, Peregrine981, Persian Poet Gal, PhiRho, Phil Sandifer, PhilipO, Pine, Pm master, Porusgift, Pramette, Qaddosh, Quarl, R'n'B, RJHall, Ravikumar001, Ready, Redzuny, Rex the first, Rgdboer, RichardF, Robinson0120, Ronz, Rufus843, Ruud Koot, Sandgem Addict, Sandstein, Satori Son, SchreiberBike, SchuminWeb, Seinfreak37, Senbon98, Sendhil, Sfan00 IMG, Shadowjams, Smack, Squids and Chips, Ssscienccce, Supposed, Surachit, Surfdue, Swimboy1, THNswimmer11, TestBanks SolutionManuals, Tests and exams are bullsh1t, Texture, The Thing That Should Not Be, TheSpaceRace, Theo10011, Thincat, ThorRune, Thoreaulylazy, Tide rolls, Tvlwiki, Ujomin, Utcursch, Vapier, Vary, Viriditas, Waldir, Ward3001, Wavelength, Weetoddid, WereSpielChequers, Wernergerman, WhatamIdoing, Whtrbbt93, Wiikipedian, Wikiklrsc, Wikilibrarian, Wikipelli, Willcrys 84, Woohookitty, Xxglennxx, Yegg13, Yintan, ZackMK, , 337 anonymous edits Test score Source: 
http://en.wikipedia.org/w/index.php?oldid=484068171 Contributors: Andy M. Wang, Iulus Ascanius, WereSpielChequers, 3 anonymous edits Theory of conjoint measurement Source: http://en.wikipedia.org/w/index.php?oldid=545965007 Contributors: 3mta3, Axiomguy, Colonies Chris, David Eppstein, Derek farn, Dozonoff, EPM, Giftlite, GregorB, J04n, KConWiki, Kevinmon, LilHelpa, Melcombe, Michael Hardy, Qwerasdfzxcv1234, RDBrown, Redheylin, Rich Farmbrough, Rjwilmsi, SHIMONSHA, Saiwing, Salix alba, Spidey104, Sun Creator, Tabletop, Tedtoal, Tkuvho, 16 anonymous edits Thurstone scale Source: http://en.wikipedia.org/w/index.php?oldid=541190626 Contributors: Alma Pater, Bloodshedder, Cruise, FirstPrinciples, Heida Maria, Holon, Joy, Lakinekaki, Marni, Melcombe, Michael Hardy, Peterl, Piotrus, Taak, Whicky1978, WikiSlasher, 18 anonymous edits Thurstonian model Source: http://en.wikipedia.org/w/index.php?oldid=517524550 Contributors: Bearcat, Challenlur, Chzz, Eeekster, Jab7842, Malcolma, Mathemagicalpsycholinguist, Melcombe, RDBrown, TM1096 Torrance Tests of Creative Thinking Source: http://en.wikipedia.org/w/index.php?oldid=538414038 Contributors: Acadmica Orientlis, Bearcat, Dick Kimball, Doctormatt, Isaacdealey, 1 anonymous edits William H. Tucker Source: http://en.wikipedia.org/w/index.php?oldid=523275100 Contributors: Bearcat, Hitssquad, Jokestress, Mathsci, Mertozoro, Michael Hardy, Nectarflowed, Nobunaga24, Swampyank, T@nn, Waacstats, 2 anonymous edits Validity (statistics) Source: http://en.wikipedia.org/w/index.php?oldid=550772639 Contributors: 1000Faces, Avenue X at Cicero, Bayle Shanks, Ben 2082, Bhny, Bill, Bkwillwm, Black Falcon, Bmistler, Bobo192, Caclark21, Ched, Cmh, Colin, Completely Neuronic, Correogsk, Da flow, Dlohcierekim, Drummond, F0CUS, Falk Lieder, Farmanesh, Finn krogstad, G716, Gap9551, Gareth Griffith-Jones, Giftlite, Gmac3339, Gomm, Grumpyyoungman01, Hgamboa, Holon, Irbatic, Iridescent, JackSchmidt, Jeffkross, Jfitzg, Jmbrowne, John of Reading, Jon Awbrey, Joriki, Jtneill, KatelynJohann, Kookaburra17, Lova Falk, Maartenremijn, Melcombe, MiNombreDeGuerra, Michael Hardy, MitchMcM, Mwolkove, Mydogategodshat, Neitherday, Nesbit, Nevit, Nick Number, Oleg Alexandrov, PGWG, Pile-Up, Piotrus, Plastikspork, Psychlologist, Qwfp, R'n'B, RJFJR, Rennydapooh78, Rjwilmsi, Rogmann, Skagedal, Smack, Taak, Tjwallace87, Tobias Bergemann, Tomgc, Tstrobaugh, Twirligig, Viriditas, Winsteps, Zeiden, 131 , anonymous edits Values scales Source: http://en.wikipedia.org/w/index.php?oldid=532100244 Contributors: AndrewHowse, Aripa, Drmies, GoingBatty, Grafen, Gregbard, Ironholds, Meclee, Michelferrari, Omills1010, Redheylin, TheRiverStyx, 45 anonymous edits Vestibulo emotional reflex Source: http://en.wikipedia.org/w/index.php?oldid=545892066 Contributors: Braintest, John of Reading, Khazar2, Mild Bill Hiccup, Vibrabrain Visual analogue scale Source: http://en.wikipedia.org/w/index.php?oldid=541429303 Contributors: Chris goulet, Makawity, Mcld, Millstream3, Robodoc.at, 4 anonymous edits Youth Outcome Questionnaire Source: http://en.wikipedia.org/w/index.php?oldid=509684159 Contributors: Cdw1952, Derek R Bullamore, Orlady, The Anome, Wavelength, Your Lord and Master Attribute Hierarchy Method Source: http://en.wikipedia.org/w/index.php?oldid=540558612 Contributors: Auntof6, D6, EagleFan, Hollislai, Joel7687, PigFlu Oink, R'n'B, RJFJR, Rjwilmsi, 11 anonymous edits Differential item functioning Source: http://en.wikipedia.org/w/index.php?oldid=544671881 Contributors: Anthonyr723, Bgwhite, Crystallina, 
Doczilla, I smits, Iulus Ascanius, Maryfbrowne, Michael Hardy, Philchalmers, R'n'B, Samw, Tikiwont, 5 anonymous edits Psychometrics Source: http://en.wikipedia.org/w/index.php?oldid=551141576 Contributors: 16@r, 2over0, A1octopus, Abasraz, Afterwriting, Ahoerstemeier, Alexander VII, Alluwanted, Alphabeat, Amead, Amit.amin1984, Anonymi, Antandrus, Barticus88, BartlebytheScrivener, Basawala, Before My Ken, Bhabing, Blassen, Bloomvlad, Bogey4, Borgx, BrotherGeorge, BullRangifer, Calliopejen1, Cassmus, Ceyockey, Chris-gore, Chris53516, Cmarieleahy, Commenzky, Conversion script, Coubure, Cswrye, Dadaist6174, David Shay, Dduttaroy, DerBorg, Dhkaiser, Dmerrill, Dmitri Lytov, Doczilla, Dovid, EPM, Ellywa, Euinmotion, Everyking, Exigentsky, Fluffernutter, Fnielsen, Fryed-peach, Funnyfarmofdoom, Gadfium, Geneffects, Giftlite, Gioto, Gothika11, Hede2000, Hijiri88, Holon, Hunt.topher, Hynas, Innapoy, Iss246, Itschris, Iulus Ascanius, IvyIQTest100, JSRTRP, Jaxl, Jcbutler, Jeffmcneill, Jfitzg, Jim1138, JivaGroup,

461

Article Sources and Contributors


Jkingcastle, John of Reading, Johnkarp, Johnrust, JoshAnish, Jtneill, Jusdafax, Karol Langner, Kennita728, Khan.edu, Kiefer.Wolfowitz, Kingfish, LHOON, Lgallindo, Lindseyandersen, LookingGlass, MER-C, Mangotree, Mark Foskey, Mattisse, Mayadafarouq, Mean as custard, Meekywiki, MegaSloth, Miami33139, Michael Hardy, Michele123, Michellefox, MinerVI, MissionNPOVible, Mmjbhhal, Mr pand, Mrdungx, Mzabduk, Nectarflowed, Nesbit, Nick Wilson, Novangelis, Nparfitt, Oleg Alexandrov, Philip Trueman, PhilipO, Phillip Scavulli, Ppsis, Psychdataguy, Psychpsych, Quintote, Qwfp, R'n'B, RDF-SAS, Rdsmith4, Reinoutr, RichardF, Rjwilmsi, Ronz, Sam1450, Sandstein, Sardanaphalus, Satori Son, Saucepan, Scrapbook, Skagedal, Smasongarrison, Sporti, SpuriousQ, Steve3849, Stevedavies2712, Stevertigo, SummerWithMorons, Taak, Tasuna, Theopolisme, ThreeOfCups, Tim Q. Wells, Tomi, Tomo, Toytoy, Transmissionelement, Trontonian, Twinkle2003, Vaughan, Versageek, Vicarious, Victor Chmara, Waveguy, Wavelength, WeijiBaikeBianji, Whicky1978, Wiki13, Wotnow, Xanzzibar, Yardcock, Yms, Zandperl, Zigger, Ziggurat, , 168 anonymous edits Vineland Adaptive Behavior Scale Source: http://en.wikipedia.org/w/index.php?oldid=497554458 Contributors: Bearcat, Carel.jonkhout, Hopping, Whoisjohngalt

462

License
Creative Commons Attribution-Share Alike 3.0 Unported (http://creativecommons.org/licenses/by-sa/3.0/)
