A COMPENDIUM OF
NEUROPSYCHOLOGICAL
TESTS
Fundamentals of Neuropsychological Assessment
and Test Reviews for Clinical Practice
Fourth Edition
Elisabeth M. S. Sherman, Jing Ee Tan,
and Marianne Hrabok
Oxford University Press is a department of the University of Oxford. It furthers
the University’s objective of excellence in research, scholarship, and education
by publishing worldwide. Oxford is a registered trade mark of Oxford University
Press in the UK and certain other countries.
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America.
© Oxford University Press 2022
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by license, or under terms agreed with the appropriate reproduction
rights organization. Inquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above.
You must not circulate this work in any other form
and you must impose this same condition on any acquirer.
CIP data is on file at the Library of Congress
ISBN 978-0-19-985618-3
This material is not intended to be, and should not be considered, a substitute for medical or other
professional advice. Treatment for the conditions described in this material is highly dependent on
the individual circumstances. And, while this material is designed to offer accurate information with
respect to the subject matter covered and to be current as of the time it was written, research and
knowledge about medical and health issues are constantly evolving and dose schedules for medications
are being revised continually, with new side effects recognized and accounted for regularly. Readers
must therefore always check the product information and clinical procedures with the most up-to-date
published product information and data sheets provided by the manufacturers and the most recent
codes of conduct and safety regulations. The publisher and the authors make no representations or
warranties to readers, express or implied, as to the accuracy or completeness of this material. Without
limiting the foregoing, the publisher and the authors make no representations or warranties as to the
accuracy or efficacy of the drug dosages mentioned in the material. The authors and the publisher do
not accept, and expressly disclaim, any responsibility for any liability, loss, or risk that may be claimed
or incurred as a consequence of the use and/or application of any of the contents of this material.
Printed by Integrated Books International, United States of America
This book is dedicated to the memory of Dr. Esther Strauss, mentor, role model, and friend. Esther was one of the first female
neuropsychologists whom we saw gracefully mix science, scholarship, and family. She was humble and hard-working; she
taught us that the most daunting tasks of scholarship don’t require innate stores of superlative brilliance or rarified knowledge;
they simply require putting one’s head down and getting to work. Over the years, we saw her navigate life with warmth,
humor, and intelligence, and witnessed her dedication to and love of neuropsychology. She died too soon, in 2009, three years
after the last edition of this book was published; her imprint is still there in the words of this book. She is deeply missed.
We also want to acknowledge and remember Dr. Otfried Spreen. Otfried was a pioneer in neuropsychology who helped shape
neuropsychology as we know it today through successive generations of students, academics, and clinicians who relied on his
writings and scholarly work as roadmaps on how to understand and best practice neuropsychology. The very first edition of this
book was a compilation of tests used at the University of Victoria Neuropsychology Laboratory at a time when few commercial
tests existed and neuropsychologists relied on researchers for normative data. We hope that the current edition lives up to
Otfried’s initial vision of a useful compilation of tests for practicing clinicians.
CONTENTS
Preface
1. PSYCHOMETRICS IN NEUROPSYCHOLOGICAL ASSESSMENT
2. VALIDITY AND RELIABILITY IN NEUROPSYCHOLOGICAL ASSESSMENT: NEW PERSPECTIVES
3. PERFORMANCE VALIDITY, SYMPTOM VALIDITY, AND MALINGERING CRITERIA
4. PREMORBID ESTIMATION
   National Adult Reading Test (NART)
   Oklahoma Premorbid Intelligence Estimate-IV (OPIE-IV)
   Test of Premorbid Functioning (TOPF)
5. INTELLIGENCE
   Kaufman Brief Intelligence Test, Second Edition (KBIT-2)
   Raven's Progressive Matrices
   Reynolds Intellectual Assessment Scales, Second Edition (RIAS-2) and Reynolds Intellectual Screening Test, Second Edition (RIST-2)
   Test of Nonverbal Intelligence, Fourth Edition (TONI-4)
   Wechsler Abbreviated Scale of Intelligence, Second Edition (WASI-II)
   Wechsler Adult Intelligence Scale—Fourth Edition (WAIS-IV)
   Woodcock-Johnson IV Tests of Cognitive Abilities (WJ IV COG)
6. NEUROPSYCHOLOGICAL BATTERIES AND RELATED SCALES
   CNS Vital Signs (CNS VS)
   Kaplan Baycrest Neurocognitive Assessment (KBNA)
   Neuropsychological Assessment Battery (NAB)
   Repeatable Battery for the Assessment of Neuropsychological Status (RBANS Update)
   Ruff Neurobehavioral Inventory (RNBI)
7. DEMENTIA SCREENING
   7 Minute Screen (7MS)
   Alzheimer's Disease Assessment Scale-Cognitive (ADAS-Cog)
   Clinical Dementia Rating (CDR)
   Dementia Rating Scale-2 (DRS-2)
   General Practitioner Assessment of Cognition (GPCOG)
   Mini-Mental State Examination (MMSE), Mini-Mental State Examination, 2nd Edition (MMSE-2), and Modified Mini-Mental State Examination (3MS)
   Montreal Cognitive Assessment (MoCA)
8. ATTENTION
   Brief Test of Attention (BTA)
   Conners Continuous Performance Test 3rd Edition (CPT 3)
   Integrated Visual and Auditory Continuous Performance Test, Second Edition (IVA-2)
   Paced Auditory Serial Addition Test (PASAT)
   Ruff 2 & 7 Selective Attention Test (2 & 7 Test)
   Symbol Digit Modalities Test (SDMT)
   Test of Everyday Attention (TEA)
   Test of Variables of Attention (T.O.V.A.)
9. EXECUTIVE FUNCTIONING
   Behavior Rating Inventory of Executive Function—Adult Version (BRIEF-A)
   Behavioural Assessment of the Dysexecutive Syndrome (BADS)
   Category Test (CAT)
   Clock Drawing Test (CDT)
   Cognitive Estimation Test (CET)
   Delis-Kaplan Executive Function System (D-KEFS)
   Design Fluency Test
   Dysexecutive Questionnaire (DEX)
   Five-Point Test
   Frontal Systems Behavior Scale (FrSBe)
   Hayling and Brixton Tests
   Ruff Figural Fluency Test (RFFT)
   Stroop Test (Stroop)
   Trail Making Test (TMT)
   Verbal Fluency Test
   Wisconsin Card Sorting Test (WCST)
10. MEMORY
   Benton Visual Retention Test Fifth Edition (BVRT-5)
   Brief Visuospatial Memory Test—Revised (BVMT-R)
   California Verbal Learning Test—Second Edition (CVLT-II)
   Continuous Visual Memory Test (CVMT)
   Hopkins Verbal Learning Test—Revised (HVLT-R)
   Rey Auditory Verbal Learning Test (RAVLT)
   Rey-Osterrieth Complex Figure Test (RCFT)
   Rivermead Behavioural Memory Test—Third Edition (RBMT-3)
   Selective Reminding Test (SRT)
   Tactual Performance Test (TPT)
   Warrington Recognition Memory Test (WRMT)
   Wechsler Memory Scale—Fourth Edition (WMS-IV)
11. LANGUAGE
   Boston Diagnostic Aphasia Examination Third Edition (BDAE-3)
   Boston Naming Test, Second Edition (BNT-2)
   Multilingual Aphasia Examination Third Edition (MAE)
   Token Test
12. VISUAL-SPATIAL SKILLS
   Benton Facial Recognition Test (FRT)
   Hooper Visual Organization Test (HVOT)
   Judgment of Line Orientation (JLO)
13. SENSORY FUNCTION
   Bells Cancellation Test
   Finger Localization
   University of Pennsylvania Smell Identification Test (UPSIT)
14. MOTOR FUNCTION
   Finger Tapping Test (FTT)
   Grip Strength
   Grooved Pegboard Test
   Purdue Pegboard Test
15. PERFORMANCE VALIDITY
   b Test
   Dot Counting Test (DCT)
   Medical Symptom Validity Test (MSVT)
   Non-Verbal Medical Symptom Validity Test (NV-MSVT)
   Rey Fifteen-Item Test (FIT)
   Test of Memory Malingering (TOMM)
   Victoria Symptom Validity Test (VSVT)
   Word Choice
   Word Memory Test (WMT)
16. SYMPTOM VALIDITY
   Minnesota Multiphasic Personality Inventory-2 (MMPI-2)
   Minnesota Multiphasic Personality Inventory-2 Restructured Form (MMPI-2-RF)
   Personality Assessment Inventory (PAI)
   Structured Inventory of Malingered Symptomatology (SIMS)
Credits
List of Acronyms
Test Index
Subject Index
PREFACE
KNOW YOUR TOOLS
How well do you know your tools? Although most of us
have a fairly good grasp of the main advantages and limita-
tions of the tests we use, if we dig below the surface, we see
that this knowledge can at times be quite shallow. For ex-
ample, how many neuropsychologists know the test-retest
reliability coefficients for all the tests in their battery or can
describe the sensitivity and specificity of their tests? This is
not because the information is lacking (although this is also
at times a problem), and it isn’t because the information is
difficult to find. Indeed, most of the information one could
ever want on neuropsychological tests can be found on the
office shelves of practicing neuropsychologists, in the test
manuals of the tests we most frequently use. The rest can be
easily obtained via literature searches or online. A working
knowledge of neuropsychological tests is hampered by the
most common of modern-day afflictions: lack of time, too
many priorities, and, for want of a better term, information
overload.
Understanding the tests we use requires enough time to
read test manuals and to regularly survey the research liter-
ature for pertinent information as it arises. However, there
are simply too many manuals and too many studies for the
average neuropsychologist to stay up to date on the strengths
and weaknesses of every test used. The reality is that many
tests have lengthy manuals several hundred pages long,
and some tests are associated with literally hundreds, even
thousands, of research studies. The longer the neuropsy-
chological battery, the higher the stack of manuals and the
more voluminous the research. A thorough understanding
of every test’s psychometric properties and research base, in
addition to expert competency in administration, scoring,
and interpretation, requires hours and hours of time,
which for most practicing neuropsychologists is simply not
feasible.
Our own experience bears this out. As is always the case
prior to launching a revision of the Compendium, there were a large number of tests to review since the previous edition, a task compounded by the release of several major test batteries and complex scales such as the Wechsler Adult Intelligence Scale, Fourth Edition (WAIS-IV), Wechsler Memory Scale, Fourth Edition (WMS-IV), Advanced Clinical Solutions (ACS), and the Minnesota Multiphasic Personality Inventory-2 Restructured Form (MMPI-2-RF). As an example, the ACS has an online
manual that is almost 400 pages long, in addition to an ad-
ministration and scoring manual of more than 150 pages; the
MMPI-2-RF has multiple test manuals and entire books ded-
icated to its use. In parallel, since the previous edition of this
book, there was an exponential increase in the number of re-
search studies involving neuropsychological tests. As authors
and practicing clinicians, we were elated at the amount of
new scholarship on neuropsychological assessment, yet dis-
mayed as our offices became stacked with paperwork and
our virtual libraries and online cloud storage repeatedly
reached maximum storage capacity. The sheer volume of lit-
erature that we reviewed for this book was staggering, and
completing this book was the most challenging professional
task we have encountered. Our wish for this book is that our
efforts will have been worth it. At the very least, we hope that
the time we spent on this book will save the readers some
time of their own.
The essential goal for this book was to create a clinical
reference that would provide, in a relatively easy-to-read,
searchable format, major highlights of the most commonly
used neuropsychological tests in the form of comprehensive,
empirically based critical reviews. To do this, we balanced
between acting as clinicians and acting as researchers: we
were researchers when we reviewed the details of the
scientific literature for each test, and we were clinicians
when providing commentary on tests, focusing as much on
the practicalities of the test as on the scientific literature. As
every neuropsychologist knows, there are some exquisitely
researched tests that are terrible to use in clinical practice
because they are too long, too cumbersome, or too com-
plicated, and this was essential to convey to the readership
so that the book could be of practical utility to everyday
clinicians like ourselves.
In addition to the core focus on test reviews, the book
was also designed to provide an overview of foundational
psychometric concepts relevant to neuropsychological
practice including overviews of models of test validity and
basics of reliability which have been updated since the
previous edition. As well, woven throughout the text is a
greater emphasis on performance validity and symptom
validity in each review, as well as updated criteria for malin-
gered neurocognitive dysfunction. The current edition of
this book presents a needed updating based on the past
several years of research on malingering and performance
validity in neuropsychology.
“Know Your Tools” continues to be the guiding
principle behind this edition of the Compendium of
Neuropsychological Tests. We hope that after reading this
book, users will gain a greater understanding of critical is-
sues relevant to the broader practice of neuropsycholog-
ical assessment, a strong working knowledge of the specific
strengths and weaknesses of the tests they use, and, most
importantly, an enhanced understanding of clinical neuro-
psychological assessment grounded in clinical practice and
research evidence.
CHANGES COMPARED TO PRIOR
EDITIONS
Users will notice several changes from the previous edition.
Arguably the biggest change is the exclusive focus on adult
tests and norms. Excluding pediatric tests and norms was necessary to keep the book from ballooning to absurd proportions. As some of us have combined adult and
pediatric practices, this was a painful albeit necessary deci-
sion. Fortunately, pediatric neuropsychological tests are al-
ready well covered elsewhere (e.g., Baron, 2018).
Since its first publication in 1991, the Compendium
of Neuropsychological Tests has been an essential reference
text to guide the reader through the maze of literature on
tests and to inform clinicians and researchers of the psycho-
metric properties of their instruments so that they can make
informed choices and sound interpretations. The goals of
the fourth edition of the Compendium remain the same, al-
though admittedly, given the continued expansion of the
field, our coverage is necessarily selective; in the end, we had
to make very hard decisions about which tests to include and
which tests to omit. Ultimately, the choice of which tests
to include rested on practice surveys indicating the tests
most commonly used in the field; we selectively chose those
with at least a 10% utilization rate based on surveys. Several
surveys were key in making these decisions (Dandachi-
FitzGerald, Ponds, & Merten, 2013; LaDuke, Barr, Brodale,
& Rabin, 2017; Martin, Schroeder, & Odland, 2015;
Rabin, Paolillo, & Barr, 2016; Young, Roper, & Arentsen,
2016). As well, a small number of personal or sentimental
favorites made it to the final edition, including some dear to
Esther and Otfried. All the reviews were extensively revised
and updated, and many new tests were added, in particular
a number of new cognitive screening tests for dementia, as
well as additional performance and symptom validity tests
not covered in the prior edition. We can therefore say fairly
confidently that the book does indeed include most of the
neuropsychological tests used by most neuropsychologists.
Nevertheless, we acknowledge that some readers may
find their favorite test missing from the book. For example,
we did not cover computerized concussion assessment
batteries or some specialized computerized batteries such
as the Cambridge Neuropsychological Test Automated
Battery (CANTAB). To our great regret, this was impos-
sible for both practical and logistical reasons. These reasons
included but were not limited to a lower rate of usage in
the field according to survey data, but also the need to
avoid more weekday evenings, early mornings, weekends,
and holidays with research papers to review for this book,
a regular albeit inconvenient habit in our lives for the last
several years. Hopefully the reviews of computerized as-
sessment batteries already in the literature will compensate
for this necessary omission; a few did manage to slip into
the book as well, such as the review of the CNS Vital Signs
(CNS VS).
Because of the massive expansion of research studies on
tests, most reviews also had to be expanded. To make room
for these longer reviews, some of the general introductory
chapters were not carried over from the prior edition, as
most of the information is available in other books and re-
sources (e.g., Lezak, Howieson, Bigler, & Tranel, 2012). We
retained the chapter on psychometrics and gave validity
and reliability their own chapter to better cover changing
models in the field. We also retained the chapter on per-
formance validity, symptom validity, and malingering given
their critical importance in assessment.
In this edition, we also elected not to include any scales
covering the assessment of psychopathology, unless they also
functioned as symptom validity scales. Psychopathology
scales are not specific to neuropsychological assessment and
are reviewed in multiple other sources, including several
books. We retained some scales and questionnaires meas-
uring neuropsychological constructs such as executive func-
tion, however. Last, for this edition, we included a look-up
box at the beginning of each review outlining the main
features of each test. We hope that this change will make it
easier for readers to locate critical information and to com-
pare characteristics across measures.
ORGANIZATION OF THE BOOK
The first chapter in this volume presents basic psychometric
concepts in neuropsychological assessment and provides
an overview of critical issues to consider in evaluating tests
for clinical use. The second chapter presents new ways of
looking at validity and reliability as well as psychometric
and practical principles involved in evaluating validity
and reliability evidence. (Note the important table in this
chapter entitled, “Top 10 Reasons for Not Using Tests,”
a personal favorite courtesy of Susan Urbina [2014].)
Chapter 3 presents an overview of malingering, including
updated malingering criteria.
Chapters 4 to 16 address the specific domains of premorbid estimation, intelligence, neuropsychological batteries and related scales, dementia screening, attention, executive functioning, memory, language, visual-spatial skills, sensory function, motor function, performance validity, and symptom validity. Tests are assigned in a rational
manner to each of the separate domains—with the implicit
understanding that there exists considerable commonality
and overlap across tests measuring purportedly discrete
domains. This is especially true of tests measuring attention
and of those measuring executive functioning.
To promote clarity, each test review follows a fixed format
and includes Domain, Age Range, Administration Time,
Scoring Format, Reference, Description, Administration,
Scoring, Demographic Effects, Normative Data, Evidence
for Reliability, Evidence for Validity, Performance/
Symptom Validity, and Comment. In each review, we take
the bird’s-eye view while grounding our impressions in the
nitty-gritty of the scientific research; we have also tried
to highlight clinical issues relevant to a wide variety of
examinees and settings, with emphasis on diversity.
CAUTIONS AND CAVEATS
First, a book of this scope and complexity will unfortunately—
and necessarily—contain errors. As well, it is possible
that in shining a spotlight on a test’s limitations, we have
inadvertently omitted or distorted some information sup-
portive of its strengths and assets. For that, we apologize in
advance. We encourage readers to inform us of omissions,
misinterpretations, typographical errors, and inadvertent
scientific or clinical blunders so that we can correct them in
the next edition.
Second, while this book presents relevant research on
tests, it is not intended as an exhaustive survey of neuropsycho-
logical test research, and as such, will not include every rele-
vant or most up-to-date research study for each test profiled.
Our aim is to provide a general overview of research studies
while retaining mention of some older studies as historical
background, particularly for some of the older measures
included in the book. The reader is encouraged to use the
book as a jumping-off point for more detailed reading and
exploration of research relevant to neuropsychological tests.
Third, neuropsychology as a field still has a consider-
able way to go in terms of addressing inclusivity and diver-
sity, particularly with regard to ethnicity and gender. Many
older tests and references have ignored diversity altogether
or have used outdated terms or ways of classifying and
describing people. As much as possible we have attempted
to address this, but our well-meaning efforts will necessarily
fall short.
We also want to make it explicit that norms based on
ethnicity/race including the ones in this book are not to
be interpreted as reflecting physical/biological/genetic
differences and that the selection of which norms to use
should be a decision based on what is best for the particular
patient’s clinical situation. We acknowledge the Position
Statement on Use of Race as a Factor in Neuropsychological
Test Norming and Performance Prediction by the American
Academy of Clinical Neuropsychology (AACN), as
follows:
The field of neuropsychology recognizes that environ-
mental influences play the predominant role in cre-
ating racial disparities in test performance. Rather
than attributing racial differences in neuropsycholog-
ical test scores to genetic or biological predispositions,
neuropsychology highlights environmental factors to
explain group differences including underlying socio-
economic influences; access to nutritional, preventative
healthcare, and educational resources; the psycholog-
ical and medical impact of racism and discrimination;
the likelihood of exposure to environmental toxins
and pollutants; as well as measurement error due to
biased expectations about the performance of histor-
ically marginalized groups and enculturation into
the groups on which tests were validated. The above
is only a partial list of factors leading to differences in
performance among so-called racial groups, but none
of these factors, including those not enumerated here,
is thought to reflect any biological predisposition that
is inherent to the group in question. Race, therefore,
is often a proxy for factors that are attributable to in-
equity, injustice, bias, and discrimination. (https://
theaacn.org/wp-content/uploads/2021/11/AACN-
Position-Statement-on-Race-Norms.pdf )
ACKNOWLEDGMENTS
We first acknowledge the immense contribution to the field
of neuropsychology by Otfried Spreen and Esther Strauss,
who first had the idea that neuropsychology needed a com-
pendium for its tests and norms. They created the first
Compendium in 1991 and were authors for the subsequent
editions in 1998, with Elisabeth Sherman joining them as
an additional author in the 2006 edition. Both Otfried
and Esther sadly passed away after the 2006 edition was
published, leaving a large void in the field. We hope that this
book does justice to their aim in creating the Compendium
and that the fourth edition continues their legacy of pro-
viding the field of neuropsychology with the essential refer-
ence text on neuropsychological tests and testing.
We express our gratitude to the numerous authors
whose published work has provided the basis for our
reviews and who provided additional information, clarifi-
cation, and helpful comments. Thank you to Travis White
at Psychological Assessment Resources, David Shafer at
Pearson, Jamie Whitaker at Houghton Mifflin Harcourt,
and Paul Green for graciously providing us with test
materials for review, and to all the other test authors and
publishers who kindly provided us with materials. We are
indebted to them for their generous support.
We also wish to thank those who served as ad hoc
reviewers for some test reviews. Special thanks to Glenn
Larrabee, Jim Holdnack, and Brian Brooks who provided
practical and scholarly feedback on some of the reviews and
to Kevin Bianchini and Grant Iverson for some spirited
discussions and resultant soul-searching on malingering.
Thanks also to Amy Kovacs at Psychological Assessment
Resources and Joseph Sandford at BrainTrain for checking
some of the reviews for factual errors. An immense debt of
gratitude is owed to Shauna Thompson, M.Ed., for her in-
valuable help at almost every stage of this book and espe-
cially for the heavy lifting at the very end that got this book
to print.
Finally, we thank our families for their love and un-
derstanding during the many hours, days, months, and
years it took to write this book. Elisabeth wishes to thank
Michael Brenner, who held up the fort while the book
went on, and on, and on; she also dedicates this book to
her three reasons: Madeleine, Tessa, and Lucas. Special
thanks to Tessa in particular for her flawless editing and
reference work.
Jing wishes to thank Sheldon Tay, who showered her
with love and encouragement through the evenings and
weekends she spent writing, and for rearranging his life
around her writing schedule.
Marianne extends gratitude to Jagjit, for support, love,
dedication, humor, and his “can do” attitude that sustained
her during this book; to their children Avani, Saheli, and
Jorah, for continuous light and inspiration; to her Mom,
who spent many hours of loving, quality time with her
grandkids so Marianne could focus on writing; and to her
family for support and believing in her always.
REFERENCES
Baron, I. S. (2018). Neuropsychological evaluation of the child: Domains,
methods, and case studies (2nd ed.). New York: Oxford University
Press.
Dandachi-FitzGerald, B., Ponds, R. W. H. M., & Merten, T. (2013).
Symptom validity and neuropsychological assessment: A survey of
practices and beliefs of neuropsychologists in six European countries.
Archives of Clinical Neuropsychology, 28(8), 771–783. https://doi.
org/10.1093/arclin/act073
LaDuke, C., Barr, W., Brodale, D. L., & Rabin, L. A. (2017). Toward
generally accepted forensic assessment practices among clinical
neuropsychologists: A survey of professional practice and common
test use. Clinical Neuropsychologist, 1–20. https://doi.org/10.1080/
13854046.2017.1346711
Lezak, M. D., Howieson, D. B., Bigler, E. D., & Tranel, D. (2012).
Neuropsychological assessment (5th ed.). New York: Oxford
University Press.
Martin, P. K., Schroeder, R. W., & Odland, A. P. (2015). Neuropsychologists' validity testing beliefs and practices: A survey of North
American professionals. Clinical Neuropsychologist, 29(6), 741–776.
https://doi.org/10.1080/13854046.2015.1087597
Rabin, L. A., Paolillo, E., & Barr, W. B. (2016). Stability in test-usage
practices of clinical neuropsychologists in the United States and
Canada over a 10-year period: A follow-up survey of INS and NAN
members. Archives of Clinical Neuropsychology, 31(3), 206–230.
https://doi.org/10.1093/arclin/acw007
Rabin, L., Spadaccini, A., Brodale, D., Charcape, M., & Barr, W. (2014).
Utilization rates of computerized tests and test batteries among
clinical neuropsychologists in the US and Canada. Professional
Psychology: Research and Practice, 45, 368–377.
Young, J. C., Roper, B. L., & Arentsen, T. J. (2016). Validity testing and
neuropsychology practice in the VA healthcare system: Results from
recent practitioner survey. Clinical Neuropsychologist, 30(4), 497–514.
https://doi.org/10.1080/13854046.2016.1159730
1 | PSYCHOMETRICS IN NEUROPSYCHOLOGICAL ASSESSMENT
Daniel J. Slick and Elisabeth M. S. Sherman
OVERVIEW
The process of neuropsychological assessment depends to
a large extent on the reliability and validity of neuropsy-
chological tests. Unfortunately, not all neuropsycholog-
ical tests are created equal, and, like any other product,
published tests vary in terms of their “quality,” as defined in
psychometric terms such as reliability, measurement error,
temporal stability, sensitivity, specificity, and predictive
validity and with respect to the care with which test items
are derived and normative data are obtained. In addition
to commercially available tests, numerous tests developed
primarily for research purposes have found their way into
clinical usage; these vary considerably with regard to psy-
chometric properties. With few exceptions, when tests orig-
inate from clinical research contexts, there is often validity
data but little else, which makes estimating measurement
precision and stability of test scores a challenge.
Regardless of the origins of neuropsychological tests, their
competent use in clinical practice demands a good working
knowledge of test standards and of the specific psychometric
characteristics of each test used. This includes familiarity with
the Standards for Educational and Psychological Testing
(American Educational Research Association [AERA] et al.,
2014) and a working knowledge of basic psychometrics.
Texts such as those by Nunnally and Bernstein (1994) and
Urbina (2014) outline some of the fundamental psycho-
metric prerequisites for competent selection of tests and in-
terpretation of obtained scores. Other neuropsychologically
focused texts such as Mitrushina et al. (2005), Lezak et al.
(2012), Baron (2018), and Morgan and Ricker (2018) also
provide guidance. This chapter is intended to provide a broad
overview of some important psychometric concepts and
properties of neuropsychological tests that should be consid-
ered when critically evaluating tests for clinical usage.
THE NORMAL CURVE
Within general populations, the frequency distributions
of a large number of physical, biological, and psychological
attributes approximate a bell-shaped curve, as shown in
Figure 1–1. This normal curve or normal distribution,
so named by Karl Pearson, is also known as the Gaussian
or Laplace-Gauss distribution, after the 18th-century
mathematicians who first defined it. It should be noted
that Pearson later stated that he regretted his choice of
“normal” as a descriptor for the normal curve because it had
“the disadvantage of leading people to believe that all other
distributions of frequency are in one sense or another ‘ab-
normal.’ That belief is, of course, not justifiable” (Pearson,
1920, p. 25).
The normal distribution is central to many commonly
used statistical and psychometric models and analytic
methods (e.g., classical test theory) and is very often the
implicitly or explicitly assumed population distribution
for psychological constructs and test scores, though this as-
sumption is not always correct.
DEFINITION AND CHARACTERISTICS
The normal distribution has a number of specific properties.
It is unimodal, perfectly symmetrical, and asymptotic at the
tails. With respect to scores from measures that are normally
distributed, the ordinate, or height of the curve at any point
along the x (test score) axis, is the proportion of persons
within the sample who obtained a given score. The ordinates
for a range of scores (i.e., between two points on the x axis)
may also be summed to give the proportion of persons who
obtained a score within the specified range. If a specified
normal curve accurately reflects a population distribution,
then ordinate values are also equivalent to the probability
of observing a given score or range of scores when randomly
sampling from the population. Thus, the normal curve may
also be referred to as a probability distribution.
Figure 1–1 The normal curve.
The normal curve is mathematically defined as follows:
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / 2\sigma^2}$ [1]
Where:
x = measurement values (test scores)
μ = the mean of the test score distribution
σ = the standard deviation of the test score distribution
π = the constant pi (3.14 . . . )
e = the base of natural logarithms (2.71 . . . )
f (x) = the height (ordinate) of the curve for any given
test score
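As a quick numerical check, Equation [1] can be evaluated directly; the sketch below is our illustration rather than anything from the text, with SciPy's built-in normal density used only for comparison:

```python
import math

from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Equation [1]: the ordinate of the normal curve at score x."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Ordinate at the mean of an IQ-metric distribution (M = 100, SD = 15)
print(normal_pdf(100, mu=100, sigma=15))  # ~0.0266
print(norm.pdf(100, loc=100, scale=15))   # same value from SciPy
```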
RELEVANCE FOR ASSESSMENT
As noted previously, because it is a frequency distribu-
tion, the area under any given segment of the normal curve
indicates the frequency of observations or cases within
that interval. From a practical standpoint, this provides
psychologists with an estimate of the “normality” or “ab-
normality” of any given test score or range of scores (i.e.,
whether it falls in the center of the bell shape, where the
majority of scores lie, or instead at either of the tail ends,
where few scores can be found).
STANDARDIZED SCORES
An individual examinee’s raw score on a test has little value
on its own and only takes on clinical meaning by com-
paring it to the raw scores obtained by other examinees in
appropriate normative or reference samples. When reference
sample data are normally distributed, then raw scores may
be standardized or converted to a metric that denotes rank
relative to the participants comprising the reference sample.
To convert raw scores to standardized scores, scores may be
linearly transformed or “standardized” in several ways. The
simplest standard score is the z score, which is obtained by
subtracting the sample mean score from an obtained score
and dividing the result by the sample standard deviation, as
shown below:
$z = (x - \bar{X})\,/\,SD$ [2]
Where:
x = measurement value (test score)
X̄ = the mean of the test score distribution
SD = the standard deviation of the test score distribution
The resulting distribution of z scores has a mean of 0 and
a standard deviation (SD) of 1, regardless of the metric of
raw scores from which it was derived. For example, given a
mean of 25 and an SD of 5, a raw score of 20 translates into
a z score of −1.00. In addition to the z score, linear trans-
formation can be used to produce other standardized scores
that have the same properties. The most common of these
are T scores (mean [M] = 50, SD = 10) and the scaled scores (M = 10, SD = 3) and standard scores (M = 100, SD = 15) used in most IQ tests. It must be remembered that z scores, T scores,
and all other standardized scores are derived from samples;
although these are often treated as population values, any
limitations of generalizability due to reference sample com-
position or testing circumstances must be taken into con-
sideration when standardized scores are interpreted.
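A minimal sketch of these linear transformations in Python (our illustration, not code from the text; it reuses the worked example of M = 25 and SD = 5):

```python
import numpy as np

def standardize(raw, sample_mean, sample_sd):
    """Convert raw scores to z, T (M = 50, SD = 10), and standard (M = 100, SD = 15) scores."""
    z = (np.asarray(raw, dtype=float) - sample_mean) / sample_sd
    return z, 50 + 10 * z, 100 + 15 * z

# Worked example from the text: a raw score of 20 with M = 25 and SD = 5
z, t, ss = standardize([20], sample_mean=25, sample_sd=5)
print(z[0], t[0], ss[0])  # -1.0  40.0  85.0
```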
THE MEANING OF STANDARDIZED TEST SCORES
As well as facilitating translation of raw scores to estimated
population ranks, standardization of test scores, by virtue of
conversion to a common metric, facilitates comparison of
scores across measures—as long as critical assumptions are
met, including that raw score distributions of tests being com-
pared are approximately normal. In addition, if standardized
scores are to be compared, they should be derived from similar
samples or, more ideally, from the same sample. A T score of
50 on a test normed on a population of university students
does not have the same meaning as an “equivalent” T score
on a test normed on a population of older adults. When
comparing standardized scores, one must also take into con-
sideration both the reliability of the two measures and their
intercorrelation before determining if a significant difference
exists (see Crawford & Garthwaite, 2002). In some cases (e.g.,
tests with low precision), relatively large disparities between
standardized scores may not actually reflect reliable differences
and therefore may not be clinically meaningful. Furthermore,
statistically significant or reliable differences between test
scores may be common in a reference sample; therefore, the
base rate of score differences in reference samples must also be
considered. One should also keep in mind that when raw test
scores are not normally distributed, standardized scores will
not accurately reflect actual population rank, and differences
between standardized scores will be misleading.
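One classical way to operationalize such a comparison is a reliable-difference test. The sketch below uses the standard-error-of-difference formula from classical test theory with hypothetical reliabilities, and for simplicity it ignores the tests' intercorrelation, which the text notes also matters (see Crawford & Garthwaite, 2002, for fuller methods):

```python
import math

def reliable_difference_z(score_a, score_b, sd, rel_a, rel_b):
    """z for a score difference, using the classical SE of the difference."""
    se_diff = sd * math.sqrt(2 - rel_a - rel_b)  # assumes equal SDs on both tests
    return (score_a - score_b) / se_diff

# Two T scores (SD = 10) from tests with hypothetical reliabilities of .90 and .70
print(round(reliable_difference_z(55, 45, sd=10, rel_a=0.90, rel_b=0.70), 2))  # ~1.58
```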
Note also that comparability across tests does not imply
equality in meaning and relative importance of scores. For
example, one may compare standardized scores on measures
of pitch discrimination and intelligence, but it will rarely be
the case that these scores are of equal clinical or practical
significance.
STANDARDIZED PERCENTILES
The standardized scores just described are useful but also
somewhat abstract. In comparison, a more easily under-
standable and clinically useful metric is the percentile,
which denotes the percentage of scores that fall at or below
a given test score. It is critically important to distinguish
between percentile scores that are derived directly from
raw untransformed test score distributions and percentile
scores that are derived from linear transformations of raw
test scores because the two types of percentile scores will
only be equivalent when reference sample distributions are
normally distributed, and they may diverge quite mark-
edly when reference sample distributions are non-normal.
Unfortunately, there is no widely used nomenclature to dis-
tinguish between the two types of percentiles, and so it may
not always be clear which type is being referred to in test
documentation and research publications. To ensure clarity
within this chapter, percentile scores derived from linear
transformations of raw test scores are always referred to as
standardized percentiles.
When raw scores have been transformed into standard-
ized scores, the corresponding standardized percentile rank
can be easily looked up in tables available in most statistical
texts or quickly obtained via online calculators. Z score
conversions to percentiles are shown in Table 1–1. Note
that this method for deriving percentiles should only be
used when raw score distributions are normally distributed.
When raw score distributions are substantially non-normal,
percentiles derived via linear transformation will not accu-
rately correspond to actual percentile ranks within the ref-
erence samples from which they were derived.
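The table look-up can be reproduced with any statistics library; a brief sketch using the normal cumulative distribution function (illustrative only, and valid only under the normality assumption just stated):

```python
from scipy.stats import norm

def standardized_percentile(z):
    """Percentage of the normal distribution falling at or below z."""
    return 100 * norm.cdf(z)

print(round(standardized_percentile(-1.0), 1))  # ~15.9 (the "16th percentile")
print(round(standardized_percentile(1.0), 1))   # ~84.1
```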
INTERPRETATION OF STANDARDIZED PERCENTILES
An important property of the normal curve is that the re-
lationship between raw or z scores (which for purposes of
this discussion are equivalent since they are linear trans-
formations of each other) and percentiles is not linear. That
is, a constant difference between raw or z scores will be as-
sociated with a variable difference in percentile scores as a
function of the distance of the two scores from the mean.
This is due to the fact that there are proportionally more
observations (scores) near the mean than there are farther
from the mean; otherwise, the distribution would be rec-
tangular, or non-normal. This can readily be seen in Figure
1–2, which shows the normal distribution with demar-
cation of z scores and corresponding percentile ranges.
Because percentiles have a nonlinear relationship with raw
scores, they cannot be used for some arithmetic procedures
such as calculation of average scores; standardized scores
must be used instead.
The nonlinear relation between z scores and percentiles
has important interpretive implications. For example, a one-
point difference between two z scores may be interpreted
differently depending on where the two scores fall on the
normal curve. As can be seen, the difference between a z
score of 0 and a z score of +1.00 is 34 percentile points, be-
cause 34% of scores fall between these two z scores (i.e., the
scores being compared are at the 50th and 84th percentiles).
However, the difference between a z score of +2.00 and a z
score of +3.00 is less than three percentile points because
only 2.5% of the distribution falls between these two points
(i.e., the scores being compared are at the 98th and 99.9th
percentiles). On the other hand, interpretation of percen-
tile score differences is also not straightforward in that an
equivalent “difference” between two percentile rankings
may entail different clinical implications depending on
whether the scores occur at the tail end of the curve or if
they occur near the middle of the distribution. For example,
the 30 percentile point difference between scores at the 1st
and 31st percentiles will be more clinically meaningful than
the same 30 percentile point difference between scores at
the 35th and 65th percentiles.
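The nonlinearity described above is easy to verify numerically (a quick illustrative check, not from the text):

```python
from scipy.stats import norm

# Equal one-unit z differences map to very different percentile differences
print(round(100 * (norm.cdf(1.0) - norm.cdf(0.0)), 1))  # ~34.1 points near the mean
print(round(100 * (norm.cdf(3.0) - norm.cdf(2.0)), 1))  # ~2.1 points in the tail
```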
INTERPRETING EXTREME STANDARDIZED SCORES
A final critical issue with respect to the meaning of standard-
ized scores has to do with extreme observations. In clinical
practice, one may encounter standardized scores that are
either extremely low or extremely high. The meaning and
comparability of such scores will depend critically on the
characteristics of the normative samples from which they
are derived.
For example, consider a hypothetical case in which an
examinee obtains a raw score that is below the range of
scores found in a normative sample. Suppose further that
the examinee’s raw score translates to a z score of −5.00,
nominally indicating that the probability of encountering
this score in the normative sample would be 3 in 10 mil-
lion (i.e., a percentile ranking of .00003). This represents a
considerable extrapolation from the actual normative data,
as (1) the normative sample did not include 10 million
individuals, and (2) not a single individual in the normative
sample obtained a score anywhere close to the examinee’s
score. The percentile value is therefore an extrapolation and
confers a false sense of precision. While one may be con-
fident that it indicates impairment, there may be no basis
to assume that it represents a meaningfully “worse” perfor-
mance than a z score of −3.00, or of −4.00.
The estimated prevalence value of an obtained standard
score can be calculated to determine whether interpreta-
tion of extreme scores may be appropriate. This is simply ac-
complished by inverting the percentile score corresponding
to the z score (i.e., dividing 1 by the percentile score). For
example, a z score of −4 is associated with an estimated
frequency of occurrence or prevalence of approximately
0.00003. Dividing 1 by this value gives a rounded result of
33,333. Thus, the estimated prevalence value of this score in
the population is 1 in 33,333. If the normative sample from which a z score is derived is considerably smaller than the denominator of the estimated prevalence value (i.e., 33,333 in the example), then some caution may be warranted in interpreting the percentile.
TABLE 1–1 Score Conversion Table

| Standard score^a | T score | Scaled score^b | Percentile | −z / +z | Percentile | Scaled score^b | T score | Standard score^a |
|---|---|---|---|---|---|---|---|---|
| ≤55 | ≤20 | ≤1 | ≤0.1 | ≥3.00 | ≥99.9 | ≥19 | ≥80 | ≥145 |
| 56–60 | 21–23 | 2 | <1 | 2.67–2.99 | >99 | 18 | 77–79 | 140–144 |
| 61–67 | 24–27 | 3 | 1 | 2.20–2.66 | 99 | 17 | 73–76 | 133–139 |
| 68–70 | 28–30 | 4 | 2 | 1.96–2.19 | 98 | 16 | 70–72 | 130–132 |
| 71–72 | 31 |  | 3 | 1.82–1.95 | 97 |  | 69 | 128–129 |
| 73–74 | 32–33 |  | 4 | 1.70–1.81 | 96 |  | 67–68 | 126–127 |
| 75–76 | 34 | 5 | 5 | 1.60–1.69 | 95 | 15 | 66 | 124–125 |
| 77 |  |  | 6 | 1.52–1.59 | 94 |  |  | 123 |
| 78 | 35 |  | 7 | 1.44–1.51 | 93 |  | 65 | 122 |
| 79 | 36 |  | 8 | 1.38–1.43 | 92 |  | 64 | 121 |
| 80 |  | 6 | 9 | 1.32–1.37 | 91 | 14 |  | 120 |
| 81 | 37 |  | 10 | 1.26–1.31 | 90 |  | 63 | 119 |
|  |  |  | 11 | 1.21–1.25 | 89 |  |  |  |
| 82 | 38 |  | 12 | 1.16–1.20 | 88 |  | 62 | 118 |
| 83 |  |  | 13 | 1.11–1.15 | 87 |  |  | 117 |
| 84 | 39 |  | 14 | 1.06–1.10 | 86 |  | 61 | 116 |
|  |  |  | 15 | 1.02–1.05 | 85 |  |  |  |
| 85 | 40 | 7 | 16 | .98–1.01 | 84 | 13 | 60 | 115 |
|  |  |  | 17 | .94–.97 | 83 |  |  |  |
| 86 | 41 |  | 18 | .90–.93 | 82 |  | 59 | 114 |
| 87 |  |  | 19 | .86–.89 | 81 |  |  | 113 |
|  |  |  | 20 | .83–.85 | 80 |  |  |  |
| 88 | 42 |  | 21 | .79–.82 | 79 |  | 58 | 112 |
|  |  |  | 22 | .76–.78 | 78 |  |  |  |
| 89 |  |  | 23 | .73–.75 | 77 |  |  | 111 |
|  | 43 |  | 24 | .70–.72 | 76 |  | 57 |  |
| 90 |  | 8 | 25 | .66–.69 | 75 | 12 |  | 110 |
|  |  |  | 26 | .63–.65 | 74 |  |  |  |
| 91 | 44 |  | 27 | .60–.62 | 73 |  | 56 | 109 |
|  |  |  | 28 | .57–.59 | 72 |  |  |  |
|  |  |  | 29 | .54–.56 | 71 |  |  |  |
| 92 |  |  | 30 | .52–.53 | 70 |  |  | 108 |
|  | 45 |  | 31 | .49–.51 | 69 |  | 55 |  |
| 93 |  |  | 32 | .46–.48 | 68 |  |  | 107 |
|  |  |  | 33 | .43–.45 | 67 |  |  |  |
| 94 | 46 |  | 34 | .40–.42 | 66 |  | 54 | 106 |
|  |  |  | 35 | .38–.39 | 65 |  |  |  |
|  |  |  | 36 | .35–.37 | 64 |  |  |  |
| 95 |  | 9 | 37 | .32–.34 | 63 | 11 |  | 105 |
|  | 47 |  | 38 | .30–.31 | 62 |  | 53 |  |
| 96 |  |  | 39 | .27–.29 | 61 |  |  | 104 |
|  |  |  | 40 | .25–.26 | 60 |  |  |  |
|  |  |  | 41 | .22–.24 | 59 |  |  |  |
| 97 | 48 |  | 42 | .19–.21 | 58 |  | 52 | 103 |
|  |  |  | 43 | .17–.18 | 57 |  |  |  |
|  |  |  | 44 | .14–.16 | 56 |  |  |  |
| 98 |  |  | 45 | .12–.13 | 55 |  |  | 102 |
|  | 49 |  | 46 | .09–.11 | 54 |  | 51 |  |
| 99 |  |  | 47 | .07–.08 | 53 |  |  | 101 |
|  |  |  | 48 | .04–.06 | 52 |  |  |  |
|  |  |  | 49 | .02–.03 | 51 |  |  |  |
| 100 | 50 | 10 | 50 | .00–.01 | 50 | 10 | 50 | 100 |

a. M = 100, SD = 15.
b. M = 10, SD = 3.
In addition, whenever such ex-
treme scores are being interpreted, examiners should also
verify that the examinee’s raw score falls within the range of
raw scores in the normative sample. If the normative sample
size is substantially smaller than the estimated prevalence
sample size and the examinee’s score falls outside the sample
range, then standardized scores and associated percentiles
should be interpreted with considerable caution. Regardless
of the z score value, it must also be kept in mind that in-
terpretation of the associated percentile value may not be
justifiable if the normative sample has a significantly non-
normal distribution. In sum, the clinical interpretation of
extreme scores depends to a large extent on how extreme
the score is and on the properties of the reference samples
involved. One can have more confidence that a percen-
tile is reasonably accurate if (1) the score falls within the
range of scores in the reference sample, (2) the reference
sample is large and accurately reflects relevant population
parameters, and (3) the shape of the reference sample distri-
bution is approximately normal, particularly in tail regions
where extreme scores are found.
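The estimated prevalence arithmetic can be sketched as follows (our illustration of the inversion step described above; SciPy supplies the normal tail proportion):

```python
from scipy.stats import norm

def estimated_prevalence_denominator(z):
    """Return N such that a score at or below z is expected about 1 time in N."""
    return round(1 / norm.cdf(z))

# The text rounds P(z <= -4) to .00003, giving 1 in 33,333; the exact value is close
print(estimated_prevalence_denominator(-4.0))  # ~31,574
```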
NON-NORMALITY
Although ideal from a psychometric standpoint, normal
distributions appear to be the exception rather than the
rule when it comes to normative data for psychological
measures, even for very large samples. In a landmark study,
Micceri (1989) analyzed 400 reference samples for psycho-
logical and education tests, including 30 national tests and
131 regional tests. He found that extremes of asymmetry
and multimodality were the norm rather than the exception
and so concluded that the “widespread belief in the naïve
assumption of normality” of score distributions for psycho-
logical tests is not supported by the actual data (p. 156).
The primary factors that lead to non-normal test score
distributions have to do with test design, reference sample
characteristics, and the constructs being measured. More
concretely, these factors include (1) test item sets that do not cover a full range of difficulty, resulting in floor/ceiling effects; (2) the existence of distinct, unseparated subpopulations within reference samples; and (3) abilities being measured that are not normally distributed in the population.
SKEW
As with the normal curve, some varieties of non-normality
may be characterized mathematically. Skew is a formal
measure of asymmetry in a frequency distribution that
can be calculated using a specific formula (see Nunnally &
Bernstein, 1994). It is also known as the third moment of a
distribution (the mean and variance are the first and second
moments, respectively). A true normal distribution is per-
fectly symmetrical about the mean and has a skew of zero.
A non-normal but symmetric distribution will also have a
skew value that is at or near zero. Negative skew values indi-
cate that the left tail of the distribution is heavier (and often
more elongated) than the right tail, which may be trun-
cated, while positive skew values indicate that the opposite
pattern is present (see Figure 1–3). When distributions are
skewed, the mean and median are not identical; the mean
will not be at the midpoint in rank, and z scores will not
accurately translate into sample percentile rank values. The
error in mapping of z scores to sample percentile ranks
increases as skew increases.
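For reference, the skewness statistic can be computed directly from sample data; a short sketch using simulated (hypothetical) scores:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
normalish = rng.normal(50, 10, 10_000)  # symmetric: skew near 0
floored = np.maximum(normalish, 45)     # left tail piled up at 45, as with a floor effect

print(round(skew(normalish), 2))  # ~0.00
print(round(skew(floored), 2))    # clearly positive: intact right tail, truncated left
```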
TRUNCATED DISTRIBUTIONS
Significant skew often indicates the presence of a truncated
distribution, characterized by restriction in the range of
scores on one side of a distribution but not the other, as is
the case, for example, with reaction time measures, which
cannot be lower than several hundred milliseconds, but
can reach very high positive values in some individuals. In
fact, distributions of scores from reaction time measures,
whether aggregated across trials on an individual level or
across individuals, are often characterized by positive skew
and positive outliers. Mean values may therefore be posi-
tively biased with respect to the “central tendency” of the
distribution as defined by other indices, such as the me-
dian. Truncated distributions are also commonly seen for
error scores. A good example of this is failure to maintain
set (FMS) scores on the Wisconsin Card Sorting Test (see
Figure 1–2 The normal curve demarcated by z scores, showing the proportion of cases in each band from −3 to +3 (0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15%).
Figure 1–3 Skewed distributions (positive skew and negative skew).
review in this volume). In a normative sample of 30- to 39-
year-old persons, observed raw scores range from 0 to 21,
but the majority of persons (84%) obtain scores of 0 or 1,
and less than 1% obtain scores greater than 3.
FLOOR AND CEILING EFFECTS
Floor and ceiling effects may be defined as the presence of
truncated tails in the context of limitations in range of item
difficulty. For example, a test may be said to have a high floor
when a large proportion of the examinees obtain raw scores
at or near the lowest possible score. This may indicate that
the test lacks a sufficient number and range of easier items.
Conversely, a test may be said to have a low ceiling when
the opposite pattern is present (i.e., when a high number
of examinees obtain raw scores at or near the highest pos-
sible score). Floor and ceiling effects may significantly limit
the usefulness of a measure. For example, a measure with a
high floor may not be suitable for use with low functioning
examinees, particularly if one wishes to delineate level of
impairment.
MULTIMODALITY AND OTHER TYPES OF NON-NORMALITY
Multimodality is the presence of more than one “peak” in a
frequency distribution (see the histogram in Figure 1–4 for
an example). Pronounced multimodality strongly suggests
the presence of two or more distinct subpopulations
within a reference sample, and test developers who are
confronted with such data should strongly consider evalu-
ating grouping variables (e.g., level of education) that might
separate examinees into subgroups that have better shaped
score distributions. Another form of non-normality is the
uniform or near-uniform distribution (a distribution with
no or minimal peak and relatively equal frequency across
all scores), though this type of distribution is rarely seen in
psychological data.
SUBGROUPS VERSUS LARGER REFERENCE SAMPLES
Score distributions for a general population and
subpopulations may not share the same shape. Scores may
be normally distributed within an entire population but
not normally distributed within specific subgroups, and the
converse may also be true. Scores from general populations
and subgroups may even be non-normal in different ways
(e.g., positively vs. negatively skewed). Therefore, test users
should not assume that reference samples and subgroups
from those samples share a common distribution shape but
should carefully evaluate relevant data from test manuals
or other sources to determine the characteristics of the
distributions of any samples or subsamples they may utilize
to obtain standardized scores. It should also be noted that
even when an ability being measured is normally distrib-
uted within a subgroup, distributions of scores from such
subgroups may nevertheless be non-normal if tests do not
include sufficient numbers of items covering a wide enough
range of difficulty, particularly at very low and high levels.
For example, score distributions from intelligence tests
may be truncated and/or skewed within subpopulations
with very low or high levels of education. Within such
subgroups, test scores may be of limited utility for ranking
individuals because of ceiling and floor effects.
SAMPLE SIZE AND NON-NORMALITY
The degree to which a given distribution approximates the
underlying population distribution increases as the number
of observations (N) increases and becomes less accurate as
N decreases. This has important implications for norms de-
rived from small samples. A larger sample will produce a
more normal distribution, but only if the underlying pop-
ulation distribution from which the sample is obtained is
normal. In other words, a large N does not “correct” for
non-normality of an underlying population distribution.
However, small samples may yield non-normal test score
distributions due to random sampling errors, even when the
construct being measured is normally distributed within
the population from which the sample is drawn. That is, one
may not automatically assume, given a non-normal distribu-
tion in a small sample, that the population distribution is in
fact non-normal (note that the converse may also be true).
NON-NORMALITY AS A FUNDAMENTAL CHARACTERISTIC OF CONSTRUCTS BEING MEASURED
Depending on the characteristics of the construct being
measured and the purpose for which a test is being designed,
a normal distribution of reference sample scores may not be
expected or even desirable. In some cases, the population
distribution of the construct being measured may not be
normally distributed (e.g., reaction time). Alternatively,
test developers may want to identify and/or discriminate
between persons at only one end of a continuum of abili-
ties. For example, the executive functioning scales reviewed
in this volume are designed to detect deficits and not exec-
utive functioning strengths; aphasia scales work the same
way. These tests focus on the characteristics of only one side
of the distribution of the general population (i.e., the lower
end), while the characteristics of the other side of the dis-
tribution are less of a concern. In such cases, measures may
even be deliberately designed to have floor or ceiling effects
when administered to a general population. For example,
if one is not interested in one tail (or even one-half ) of the
distribution, items that would provide discrimination in
that region may be omitted to save administration time. In
this case, a test with a high floor or low ceiling in the general
Figure 1–4 A non-normal test score distribution (histogram of raw scores from 20 to 80, M = 50, SD = 10, with an overlaid normal curve; percentiles derived directly from the raw scores at selected points: 0.0, 0.8, 68, 84, 93).
population (and with positive or negative skew) may be
more desirable than a test with a normal distribution.
Nevertheless, all things being equal, a more normal-looking
distribution of scores within the targeted subpopulation is
usually desirable, particularly if tests are to be used across
the range of abilities (e.g., intelligence tests).
IMPLICATIONS OF NON-NORMALITY
When reference sample distributions are substantially non-
normal, any standardized scores derived by linear transfor-
mation, such as T scores and standardized percentiles, will
not accurately correspond to actual percentile ranks within
the reference sample (and, by inference, the reference pop-
ulation). Depending on the degree of non-normality, the
degree of divergence between standardized scores and
percentiles derived directly from reference sample raw
scores can be quite large. For a concrete example of this
problem, consider the histogram in Figure 1–4, which
shows a hypothetical distribution (n = 1,000) of raw scores
from a normative sample for a psychological test. To sim-
plify the example, the raw scores have a mean of 50 and a
standard deviation of 10, and therefore no linear transfor-
mation is required to obtain T scores. At a glance, it is
readily apparent that the distribution of raw scores is grossly
non-normal; it is bimodal with a truncated lower tail and
significant positive skew, consistent with a significant floor
effect and the likely existence of two distinct subpopulations
within the normative sample.
A normal curve derived from the sample mean and
standard deviation is overlaid on the histogram in Figure
1–4 for purposes of comparing the assumed distribution of
raw scores corresponding to T scores with the actual dis-
tribution of raw scores. As can be seen, the shapes of the
assumed and actual distributions differ quite considerably.
Percentile scores derived directly from the raw test scores
are also shown for given T scores to further illustrate the
degree of error that can be associated with standardized
scores derived via linear transformation when reference
sample distributions are non-normal. For example, a T
score of 40 nominally corresponds to the 16th percentile,
but, with respect to the hypothetical test being considered
here, a T score of 40 actually corresponds to a level of per-
formance that falls below the 1st percentile within the ref-
erence sample. Clearly, the difference between percentiles
derived directly from the sample distribution as opposed
to standardized percentiles is not trivial and has significant
implications for clinical interpretation. Therefore, when-
ever reference sample distributions diverge substantially
from normality, percentile scores derived directly from
untransformed raw test scores must be used rather than
scaled scores and percentiles derived from linear transform-
ations, and tables with such data should be provided by test
publishers as appropriate. Ultimately, regardless of what in-
formation test publishers provide, it is always incumbent on
clinicians to evaluate the degree to which reference sample
distributions depart from normality in order to determine
which types of scores should be used.
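To illustrate the divergence, the following sketch (hypothetical data; a scaled chi-square distribution simply serves as a stand-in for a positively skewed reference sample) compares the percentile implied by a linearly derived T score with the percentile computed directly from the raw scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical positively skewed reference sample (n = 1,000).
raw = rng.chisquare(df=3, size=1000) * 5 + 30

mean, sd = raw.mean(), raw.std(ddof=1)
score = mean - sd  # an obtained score one SD below the mean (T = 40)

# Percentile implied by the linear T score if normality held:
nominal_pct = stats.norm.cdf(-1.0) * 100

# Percentile derived directly from the raw score distribution:
actual_pct = stats.percentileofscore(raw, score)

print(f"Nominal percentile (normal assumption): {nominal_pct:.1f}")
print(f"Percentile from the raw distribution:   {actual_pct:.1f}")
# With skewed reference data like these, the two values diverge markedly.
```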
CORRECTIONS FOR NON-NORMALITY
Although the normal curve is from many standpoints
an ideal or even expected distribution for psychological
data, reference sample scores do not always conform to a
normal distribution. When a new test is constructed, non-
normality can be “corrected” by examining the distribution
of scores on the prototype test, adjusting test properties,
and resampling until a normal distribution is reached. For
example, when a test is first administered during a try-out
phase and a positively skewed distribution is obtained (i.e.,
with most scores clustering at the low end of the distribu-
tion), the test likely has too high a floor. Easy items can then
be added so that the majority of scores fall in the middle
of the distribution rather than at the lower end (Urbina,
2014). When this is successful, most examinees obtain
about 50% of the items correct. This level of
difficulty usually provides the best differentiation between
individuals at all ability levels (Urbina, 2014).
When confronted with reference samples that are not
normally distributed, some test developers resort to a va-
riety of “normalizing” procedures, such as log transform-
ations on the raw data, before deriving standardized scores.
A discussion of these procedures is beyond the scope of
this chapter, and interested readers are referred to Urbina
(2014). Although they can be useful in some circumstances,
normalization procedures are by no means a panacea be-
cause they often introduce problems of their own with re-
spect to interpretation. Urbina (2014) states that scores
should only be normalized if (1) they come from a large
and representative sample, or (2) any deviation from nor-
mality arises from defects in the test rather than character-
istics of the sample. Furthermore, it is preferable to modify
test content and procedures during development (e.g., by
adding or modifying items) to obtain a more normal dis-
tribution of scores rather than attempting to transform
non-normal scores into a normal distribution. Whenever
normalization procedures are used, test publishers should
describe in detail the nature of any sample non-normality
that is being corrected, the correction procedures used, and
the degree of success of such procedures (i.e., the distribu-
tion of scores after application of normalizing procedures
should be thoroughly described). The reasons for correc-
tion should also be justified, and percentile conversions de-
rived directly from un-normalized raw scores should also be
provided as an option for users. Despite the limitations in-
herent in methods for correcting for non-normality, Urbina
(2014) notes that most test developers will probably con-
tinue to use such procedures because normally distributed
test scores are required for some statistical analyses. From
a practical point of view, test users should be aware of the
mathematical computations and transformations involved
in deriving scores for their instruments. When all other
things are equal, test users should choose tests that provide
information on score distributions and any procedures that
were undertaken to correct non-normality over those that
provide partial or no information.
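As a sketch of the simplest such procedure, the following applies a log transformation (one of the normalizing transformations mentioned above) to hypothetical positively skewed scores; the lognormal data here are assumed purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical positively skewed raw scores (e.g., completion times).
raw = rng.lognormal(mean=3.0, sigma=0.5, size=500)

# Standardized scores would then be derived from the transformed
# values rather than from the raw scores.
transformed = np.log(raw)

print(f"Skewness before: {stats.skew(raw):.2f}")         # clearly positive
print(f"Skewness after:  {stats.skew(transformed):.2f}")  # near zero
```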
PERCENTILES DERIVED DIRECTLY FROM RAW SCORE DISTRIBUTIONS AS A PRIMARY METRIC FOR TEST RESULTS
Crawford and Garthwaite (2009) argue that, for clinical
assessments, percentile scores derived directly from raw score
distributions should always be obtained and they should serve
as the primary metric for interpretation and presentation of
test results in reports. These researchers state that “percentile
ranks express scores in a form that is of greater relevance to the
neuropsychologist than any alternative metric because they
tell us directly how common or uncommon such scores are
in the normative population” (p. 194). They note that when
reference sample distributions are normally distributed, stan-
dardized scores are also useful, particularly for certain arith-
metical and psychometric procedures for which percentiles
cannot be used, such as averaging scores. However, raw score
percentiles must always be used instead of standardized scores
whenever reference samples are non-normal as the latter have
minimal meaning in such cases. Crawford, Garthwaite, and
Slick (2009) also advance the preceding argument and, in
addition, provide a proposed set of reporting standards for
percentiles as well as detailed methods for calculating accu-
rate confidence intervals for raw score percentiles—including
a link to free software for performing the calculations on
Dr. John Crawford’s website (https://homepages.abdn.
ac.uk/j.crawford/pages/dept/psychom.htm). It is good prac-
tice to include confidence intervals when percentiles are
presented in reports, particularly in high-stakes assessments
where major decisions rely on small score differences (e.g., de-
termination of intellectual disability for criminal-forensic or
disability purposes).
EXTRAPOLATION AND INTERPOLATION
Despite the best efforts of test publishers to obtain optimum
reference samples, there are times when such samples fall
short with respect to score ranges or cell sizes for subgroups
such as age categories. In these cases, test developers may
turn to extrapolation and/or interpolation for purposes of
obtaining a full range of scaled scores, using techniques such
as multiple regression. For example, Heaton and colleagues
have published sets of norms that use multiple regression to
derive scaled scores that are adjusted for demographic char-
acteristics, including some for which reference sample sizes
are very small (Heaton et al., 2003). Although multiple re-
gression is robust to slight violations of assumptions, substan-
tial estimation errors may occur when model assumptions are
violated.
Test publishers sometimes derive standardized score
conversions by extrapolation beyond the bounds of
variables such as age within a reference sample. Such norms
should always be used with considerable caution due to the
lack of actual reference data. Extrapolation methods, such
as regression techniques, depend on trends in the reference
data. Such trends can be complex and difficult to model,
changing slope quite markedly across the range of predictor
variables. For example, in healthy individuals, vocabulary
increases exponentially during preschool years, but then the
rate of acquisition begins to taper off during early school
years and slows considerably over time through early adult-
hood, remains relatively stable in middle age, and then
shows a minor decrease with advancing age. Modeling such
complex curves in a way that allows for accurate extrapola-
tion is certainly a challenge, and even a well-fitting model
that is extended beyond actual data points provides only an
educated guess that may not be accurate.
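A minimal sketch of the hazard (all numbers hypothetical): a quadratic model that fits mean scores well within the sampled age range can produce implausible values when extended beyond it:

```python
import numpy as np

# Hypothetical mean raw scores by age group in a normative sample.
ages = np.array([20, 30, 40, 50, 60, 70])
mean_scores = np.array([52.0, 53.5, 54.0, 53.5, 52.0, 49.5])

coefs = np.polyfit(ages, mean_scores, deg=2)

# Interpolation within the sampled range stays close to the data...
print(f"Predicted mean at age 65: {np.polyval(coefs, 65):.1f}")

# ...but extrapolation is unconstrained by any data: here the fitted
# curve keeps falling steeply past the oldest sampled age.
print(f"Predicted mean at age 90: {np.polyval(coefs, 90):.1f}")
```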
Interpolation, utilizing the same types of methods
as are employed for extrapolation, is sometimes used for
deriving standardized scores when there are gaps in refer-
ence samples with respect to variables such as age or years
of education. When this is done, the same limitations and
interpretive cautions apply. Whenever test publishers use
extrapolation or interpolation to derive scaled scores, the
methods employed should be adequately described, any
violations of underlying assumptions of statistical models
utilized should be noted, and estimation error metrics
should be reported.
MEASUREMENT ERROR
A good working understanding of conceptual issues and
methods of quantifying measurement error is essential for
competent clinical practice. We start our discussion of this
topic with concepts arising from classical test theory.
TRUE SCORES
A central element of classical test theory is the concept
of a true score, or the score an examinee would obtain on
a measure in the absence of any measurement error (Lord
& Novick, 1968). True scores can never be known. Instead,
they are estimated and are conceptually defined as the mean
score an examinee would obtain across an infinite number
of equivalent randomly sampled parallel forms of a test, as-
suming that the examinee’s scores were not systematically
affected by test exposure, practice, or other time-related
factors such as maturation (Lord & Novick, 1968). In
contrast to true scores, obtained scores are the actual scores
yielded by tests. Obtained scores include any measurement
error associated with a given test. That is, they are the sum
of true scores and error. Note that measurement error in the
classical model arises only from test characteristics; meas-
urement error arising from particular characteristics of in-
dividual examinees or testing circumstances is not explicitly
addressed or accounted for.
In the classical model, the relation between obtained
and true scores is expressed in the following formula, where
error (e) is random and all variables are assumed to be nor-
mally distributed:
x = t + e        [3]
Where:
x = obtained score
t = true score
e = error
When test reliability is less than perfect, as is always the
case, the net effect of measurement error across examinees is
to bias obtained scores outward from the population mean.
That is, scores that are above the mean are most likely higher
than true scores, while those that are below the mean are
most likely lower than true scores (Lord & Novick, 1968).
Estimated true scores correct this bias by regressing obtained
scores toward the normative mean, with the amount of re-
gression depending on test reliability and deviation of the
obtained score from the mean. The formula for estimated
true scores (t ′) is:
t′ = X + [rxx(x − X)]        [4]
Where:
X = mean test score
rxx = test reliability (internal consistency reliability)
x = obtained score
If working with z scores, the formula is simpler:
t′ = rxx × z        [5]
Formula 4 shows that an examinee’s estimated true score
is the sum of the mean score of the group they belong
to (i.e., the normative sample) and the deviation of their
obtained score from the normative mean weighted by
test reliability (as derived from the same normative
sample). Furthermore, as test reliability approaches unity
(i.e., r = 1.0), estimated true scores approach obtained
scores (i.e., there is little measurement error, so estimated
true scores and obtained scores are nearly equivalent).
Conversely, as test reliability approaches zero (i.e., when
a test is extremely unreliable), estimated true scores ap-
proach the mean test score. That is, when a test is highly
reliable, greater weight is given to obtained scores than
to the normative mean score; but, when a test is very un-
reliable, greater weight is given to the normative mean
score than to obtained scores. Practically speaking, esti-
mated true scores will always be closer to the mean than
obtained scores (except, of course, where the obtained
score is at the mean).
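A minimal Python sketch of Formula 4, reproducing the estimated true score values shown in Table 1–2 below:

```python
def estimated_true_score(obtained, mean, reliability):
    """Formula 4: t' = M + r_xx * (x - M)."""
    return mean + reliability * (obtained - mean)

# Reproduce Table 1-2 (standard scores: M = 100, SD = 15).
for name, r_xx in [("Test 1", .95), ("Test 2", .80), ("Test 3", .65)]:
    estimates = [int(estimated_true_score(x, 100, r_xx) + 0.5)  # round half up
                 for x in (110, 120, 130)]
    print(name, estimates)
# Test 1 [110, 119, 129]  <- minimal regression toward the mean
# Test 2 [108, 116, 124]
# Test 3 [107, 113, 120]  <- substantial regression toward the mean
```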
THE USE OF TRUE SCORES IN CLINICAL PRACTICE
Although the true score model is abstract, it has practical
utility and important implications for test score interpre-
tation. For example, what may not be immediately obvious
from Formulas 4 and 5 is readily apparent in Table 1–2: esti-
mated true scores translate test reliability (or lack thereof )
into the same metric as actual test scores.
As can be seen in Table 1–2, the degree of regression
to the mean of true scores is inversely related to test relia-
bility and directly related to degree of deviation from the
reference mean. This means that the more reliable a test is,
the closer obtained scores are to true scores and that the fur-
ther away the obtained score is from the sample mean, the
greater the discrepancy between true and obtained scores.
For a highly reliable measure such as Test 1 (r = .95), true
score regression is minimal even when an obtained score lies
a considerable distance from the sample mean; in this ex-
ample, a standard score of 130, or two SDs above the mean,
is associated with an estimated true score of 129. In con-
trast, for a test with low reliability, such as Test 3 (r = .65),
true score regression is quite substantial. For this test, an
obtained score of 130 is associated with an estimated true
score of 120; in this case, fully one-third of the observed de-
viation from the mean is “lost” to regression when the esti-
mated true score is calculated.
Such information has important implications with re-
spect to interpretation of test results. For example, as shown
in Table 1–2, as a result of differences in reliability, obtained
scores of 120 on Test 1 and 130 on Test 3 are associated
with essentially equivalent estimated true scores (i.e., 119
and 120, respectively). If only obtained scores are consid-
ered, one might interpret scores from Test 1 and Test 3 as
significantly different even though these “differences” ac-
tually disappear when measurement precision is taken into
account. It should also be noted that this issue is not limited
to comparisons of scores from the same individual across
different tests but also applies to comparisons between
scores from different individuals from the same test when
the individuals come from different groups and the test in
question has different reliability levels across those groups.
Regression to the mean may also manifest as pro-
nounced asymmetry of confidence intervals centered on
true scores, relative to obtained scores, as discussed in more
detail later. Although calculation of true scores is encour-
aged as a means of translating reliability coefficients into
more concrete and useful values, it is important to con-
sider that any significant difference between characteristics
of an examinee and the sample from which a mean sample
score and reliability estimate were derived may invalidate
the process. For example, it makes little sense to estimate
true scores for severely brain-injured individuals on meas-
ures of cognition using test parameters from healthy nor-
mative samples because mean scores within brain-injured
populations are likely to be substantially different from
those seen in healthy normative samples; reliabilities may
differ substantially as well. Instead, one may be justified in
deriving estimated true scores using data from a comparable
clinical sample if this is available. These issues underscore
the complexities inherent in comparing scores from dif-
ferent tests in different populations.
THE STANDARD ERROR OF MEASUREMENT
Examiners may wish to quantify the margin of error associ-
ated with using obtained scores as estimates of true scores.
When the reference sample score SD and the internal con-
sistency reliability of a test are known, an estimate of the
SD of obtained scores about true scores may be calculated.
This value is known as the standard error of measurement,
or SEM (Lord & Novick, 1968). More simply, the SEM
provides an estimate of the amount of error in a person’s
observed score. It is a function of the reliability of the test
and of the variability of scores within the sample. The SEM
is inversely related to the reliability of the test. Thus, the
greater the reliability of the test, the smaller the SEM is, and
the more confidence the examiner can have in the precision
of the score.
The SEM is defined by the following formula:
SEM = SD √(1 − rxx)        [6]
Where:
SD = the standard deviation of the test, as derived from
an appropriate normative sample
rxx = the reliability coefficient of the test (usually
internal reliability)
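A one-line implementation makes the inverse relation between reliability and the SEM explicit (a sketch; SD = 15 and the reliabilities of Tests 1 to 3 from Table 1–2 are used for illustration):

```python
import math

def sem(sd, reliability):
    """Formula 6: SEM = SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1.0 - reliability)

for r_xx in (.95, .80, .65):
    print(f"r_xx = {r_xx:.2f}  SEM = {sem(15, r_xx):.2f}")
# r_xx = 0.95  SEM = 3.35
# r_xx = 0.80  SEM = 6.71
# r_xx = 0.65  SEM = 8.87
```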
CONFIDENCE INTERVALS
While the SEM can be considered on its own as an index
of test precision, it is not necessarily intuitively interpret-
able, and there is often a tendency to focus excessively on
test scores as point estimates at the expense of considera-
tion of associated estimation error ranges. Such a tendency
to disregard imprecision is particularly inappropriate
when interpreting scores from tests with lower reliability.
Clinically, it is therefore very important to report, in a con-
crete and easily understandable manner, the degree of pre-
cision associated with specific test scores. One method of
doing this is to use confidence intervals.
The SEM is used to form a confidence interval (or range of
scores) around estimated true scores within which obtained
scores are most likely to fall. The distribution of obtained
scores about the true score (the error distribution) is assumed
to be normal, with a mean of zero and an SD equal to the
SEM; therefore, the bounds of confidence intervals can be set
to include any desired range of probabilities by multiplying
by the appropriate z value. Thus, if an individual were to take
a large number of randomly parallel versions of a test, the
resulting obtained scores would fall within an interval of ±1
SEM of the estimated true scores 68% of the time and within
±1.96 SEM 95% of the time (see Table 1–1).

TABLE 1–2 Estimated True Score Values for Three Observed Scores at Three Levels of Reliability

                          OBSERVED SCORES (M = 100, SD = 15)
TEST        RELIABILITY        110        120        130
Test 1          .95            110        119        129
Test 2          .80            108        116        124
Test 3          .65            107        113        120

NOTE: Estimated true scores rounded to whole values.
Obviously, confidence intervals for unreliable tests (i.e.,
with a large SEM) will be larger than those for highly reli-
able tests. For example, we may again use data from Table
1–2. For a highly reliable test such as Test 1, a 95% confi-
dence interval for an obtained score of 110 ranges from 103
to 116. In contrast, the confidence interval for Test 3, a less
reliable test, is considerably larger, ranging from 89 to 124.
It is important to bear in mind that confidence intervals
for obtained scores that are based on the SEM are centered
on estimated true scores and are based on a model that deals
with performance across a large number of randomly par-
allel forms. Such confidence intervals will be symmetric
around obtained scores only when obtained scores are at
the test mean or when reliability is perfect. Confidence
intervals will be asymmetric about obtained scores to the
same degree that true scores diverge from obtained scores.
Therefore, when a test is highly reliable, the degree of asym-
metry will often be trivial, particularly for obtained scores
within one SD of the mean. For tests of lesser reliability,
the asymmetry may be marked. For example, in Table 1–2,
consider the obtained score of 130 on Test 2. The esti-
mated true score in this case is 124 (see Equations 4 and 5).
Using Equation 6 and a z-multiplier of 1.96, we find that a
95% confidence interval for the obtained scores spans ±13
points, or from 111 to 137. This confidence interval is sub-
stantially asymmetric about the obtained score.
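The sketch below reproduces the Test 2 example (obtained score of 130, rxx = .80, M = 100, SD = 15), showing that the interval is centered on the estimated true score of 124 rather than on the obtained score:

```python
import math

def true_score_ci(obtained, mean, sd, reliability, z=1.96):
    """95% CI formed around the estimated true score via the SEM."""
    t_prime = mean + reliability * (obtained - mean)  # Formula 4
    sem = sd * math.sqrt(1.0 - reliability)           # Formula 6
    return t_prime - z * sem, t_prime + z * sem

lo, hi = true_score_ci(obtained=130, mean=100, sd=15, reliability=.80)
print(f"95% CI: {lo:.0f} to {hi:.0f}")  # 111 to 137, asymmetric about 130
```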
It is also important to note that SEM-based confidence
intervals should not be used for estimating the likelihood of
obtaining a given score at retesting with the same measure as
effects of prior exposure are not accounted for. In addition,
Nunnally and Bernstein (1994) point out that use of SEM-
based confidence intervals assumes that error distributions
are normally distributed and homoscedastic (i.e., equal in
spread) across the range of scores obtainable for a given test.
However, this assumption may often be violated. A number
of alternate error models do not require these assumptions
and may thus be more appropriate in some circumstances
(see Nunnally & Bernstein, 1994, for a detailed discussion).
In addition, there are quite a number of alternate methods
for estimating error intervals and adjusting obtained scores
for regression to the mean and other sources of measure-
ment error (Glutting et al., 1987). There is no universally
agreed upon method for estimating measurement errors,
and the most appropriate methods may vary across different
types of tests and interpretive uses, though the majority of
methods will produce roughly similar results in many cases.
In any case, a review of alternate methods for estimating and
correcting for measurement error is beyond the scope of this
book; the methods presented were chosen because they con-
tinue to be widely used and accepted, and they are relatively
easy to grasp conceptually and mathematically. Ultimately,
the choice of which specific method is used for estimating
and correcting for measurement error is far less important
than the issue of whether any such estimates and corrections
are calculated and incorporated into test score interpreta-
tion. That is, test scores should never be interpreted in the
absence of consideration of measurement error.
THE STANDARD ERROR OF ESTIMATION
In addition to estimating confidence intervals for obtained
scores, one may also be interested in estimating confidence
intervals for estimated true scores (i.e., the likely range of
true scores about the estimated true score). For this pur-
pose, one may construct confidence intervals using the
standard error of estimation (SEE; Lord & Novick, 1968).
The formula for this is:
SEE = SD √(rxx(1 − rxx))        [7]
Where:
SD = the standard deviation of the variable being
estimated
rxx = the test reliability coefficient
The SEE, like the SEM, is an indication of test precision.
As with the SEM, confidence intervals are formed around
estimated true scores by multiplying the SEE by a desired z
value. That is, one would expect that, over a large number
of randomly parallel versions of a test, an individual’s true
score would fall within an interval of ±1 SEE of the esti-
mated true scores 68% of the time, and fall within 1.96 SEE
95% of the time. As with confidence intervals based on
the SEM, those based on the SEE will usually not be sym-
metric around obtained scores. All of the other caveats de-
tailed previously regarding SEM-based confidence intervals
also apply.
The choice of constructing confidence intervals based
on the SEM versus the SEE will depend on whether one is
more interested in true scores or obtained scores. That is,
while the SEM is a gauge of test accuracy in that it is used
to determine the expected range of obtained scores about
true scores over parallel assessments (the range of error in
measurement of the true score), the SEE is a gauge of esti-
mation accuracy in that it is used to determine the likely
range within which true scores fall (the range of error of esti-
mation of the true score). Regardless, both SEM-based and
SEE-based confidence intervals are symmetric with respect
to estimated true scores rather than the obtained scores, and
the boundaries of both will be similar for any given level of
confidence interval when a test is highly reliable.
THE STANDARD ERROR OF PREDICTION
When the standard deviation of obtained scores for an al-
ternate form is known, one may calculate the likely range
of obtained scores expected on retesting with a parallel
form. For this purpose, the standard error of prediction (SEP;
Lord & Novick, 1968) may be used to construct confidence
intervals. The formula for this is:
SEP = SDy √(1 − rxx²)        [8]
Where:
SDy = the standard deviation of the parallel form
administered at retest
rxx = the reliability of the form used at initial testing
In this case, confidence intervals are formed around esti-
mated true scores (derived from initial obtained scores) by
multiplying the SEP by a desired z value. That is, one would
expect that, when retested over a large number of randomly
sampled parallel versions of a test, an individual’s obtained
score would fall within an interval of ±1 SEP of the esti-
mated true scores 68% of the time and fall within 1.96 SEP
95% of the time. As with confidence intervals based on the
SEM, those based on the SEP will generally not be symmetric
around obtained scores. All of the other caveats detailed pre-
viously regarding the SEM-based confidence intervals also
apply. In addition, while it may be tempting to use SEP-based
confidence intervals for evaluating significance of change at
retesting with the same measure, this practice violates the
assumptions that a parallel form is used at retest and, particu-
larly, that no prior exposure effects apply.
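For comparison, this sketch computes all three standard errors for a hypothetical test with SD = 15 and rxx = .80, assuming the parallel form has the same SD:

```python
import math

def sem(sd, r):    # Formula 6: obtained scores about true scores
    return sd * math.sqrt(1 - r)

def see(sd, r):    # Formula 7: true scores about estimated true scores
    return sd * math.sqrt(r * (1 - r))

def sep(sd_y, r):  # Formula 8: retest scores on a parallel form
    return sd_y * math.sqrt(1 - r ** 2)

sd, r = 15, .80
print(f"SEM = {sem(sd, r):.2f}")  # 6.71
print(f"SEE = {see(sd, r):.2f}")  # 6.00
print(f"SEP = {sep(sd, r):.2f}")  # 9.00 -- the widest of the three
```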
STANDARD ERRORS AND TRUE SCORES: PRACTICAL ISSUES
Nunnally and Bernstein (1994) note that most test
manuals do “an exceptionally poor job of reporting esti-
mated true scores and confidence intervals for expected
obtained scores on alternative forms. For example,
intervals are often erroneously centered about obtained
scores rather than estimated true scores. Often the topic
is not even discussed” (p. 260). As well, in general, con-
fidence intervals based on age-specific SEMs are prefer-
able to those based on the overall SEM (particularly at the
extremes of the age distribution, where there is the most
variability) and can be constructed using age-based SEMs
found in most manuals.
As outlined earlier, estimated true scores and their as-
sociated confidence intervals can contribute substantially
to the process of interpreting test results, and an argument
can certainly be made that these should be preferred to
obtained scores for clinical purposes and also for research.
Nevertheless, there are compelling practical reasons to
primarily focus on obtained scores, the most important
of which is that virtually all data in test manuals and
independent research concerning psychometric prop-
erties of tests are presented in the metric of obtained
scores. In addition, a particular problem with the use of
the SEP for test-retest comparisons is that it is based on
a psychometric model that typically does not apply: in
most cases, retesting is carried out using the same test that
was originally administered rather than a parallel form.
Usually, obtained test-retest scores are interpreted rather
than the estimated true scores, and test-retest reliability
coefficients for obtained scores are usually lower—and
sometimes much lower—than internal consistency relia-
bility coefficients. In addition, the SEP does not account
for practice/exposure effects, which can be quite substan-
tial when the same test is administered a second time.
As a result, SEP-based confidence intervals will often be
miscentered and too small, resulting in high false-positive
rates when used to identify significant changes in perfor-
mance over time. For more discussion regarding the calcu-
lation and uses of the SEM, SEE, SEP, and alternative error
models, see Dudek (1979), Lord and Novick (1968), and
Nunnally and Bernstein (1994).
SCREENING, DIAGNOSIS, AND OUTCOME PREDICTION OF TESTS
In some cases, clinicians use tests to measure how much
of an attribute (e.g., intelligence) an examinee has, while
in other cases tests are used to help determine whether or
not an examinee has a specific attribute, condition, or ill-
ness that may be either present or absent (e.g., Alzheimer’s
disease). In the latter case, a special distinction in test use
may be made. Screening tests are those which are broadly or
routinely used to detect a specific attribute or illness, often
referred to as a condition of interest (COI), among persons
who are not “symptomatic” but who may nonetheless have
the COI (Streiner, 2003). Diagnostic tests are used to as-
sist in ruling in or out a specific condition in persons who
present with “symptoms” that suggest the diagnosis in
question. Another related use of tests is for purposes of pre-
diction of outcome. As with screening and diagnostic tests,
the outcome of interest may be defined in binary terms—it
will either occur or not occur (e.g., the examinee will be
able to handle independent living or not). Thus, in all three
cases, clinicians will be interested in the relation between
a measure’s distribution of scores and an attribute or out-
come that is defined in binary terms. It should be noted
that tests used for screening, diagnosis, and prediction may
be used when the COI or outcome to be predicted consists
of more than two categories (e.g., mild, moderate, and se-
vere). However, only the binary case will be considered in
this chapter.
Typically, data concerning screening or diagnostic ac-
curacy are obtained by administering a test to a sample of
persons who are also classified, with respect to the COI, by
a so-called gold standard. Those who have the condition ac-
cording to the gold standard are labeled COI+, while those
who do not have the condition are labeled COI−. In medi-
cine, the gold standard may be a highly accurate diagnostic
test that is more expensive and/or has a higher level of asso-
ciated risk of morbidity than some new diagnostic method
that is being evaluated for use as a screening measure or as
a possible replacement for the existing gold standard. In
neuropsychology, the situation is often more complex as
the COI may be a psychological construct or behavior (e.g.,
cognitive impairment, malingering) for which consensus
with respect to fundamental definitions is lacking or diag-
nostic gold standards may not exist.
The simplest way to relate test results to binary diag-
noses or outcomes is to utilize a cutoff score. This is a single
point along the continuum of possible scores for a given
test. Scores at or above the cutoff classify examinees as
belonging to one of two groups; scores below the cutoff
classify examinees as belonging to the other group. Those
who have the COI according to the test are labeled as test
positive (Test+), while those who do not have the COI are
labeled test negative (Test−).
Table 1–3 shows the relation between examinee
classifications based on test results versus classifications
based on a gold standard measure. By convention, test clas-
sification is denoted by row membership and gold standard
classification is denoted by column membership. Cell values
represent the total number of persons from the sample
falling into each of four possible outcomes with respect to
agreement between a test and a respective gold standard.
Agreements between gold standard and test classifications
are referred to as true-positive and true-negative cases, while
disagreements are referred to as false-positive and false-
negative cases, with positive and negative referring to the
presence or absence of a COI per classification by the gold
standard. When considering outcome data, observed out-
come is substituted for the gold standard. It is important
to keep in mind while reading the following section that
while gold standard measures are often implicitly treated as
100% accurate, this may not always be the case. Any limita-
tions in accuracy or applicability of a gold standard or out-
come measure need to be accounted for when interpreting
classification accuracy statistics. See Mossman et al. (2012)
and Mossman et al. (2015) for thorough discussions of this
problem and methods to account for it when validating di-
agnostic measures.
SENSITIVITY, SPECIFICITY, AND LIKELIHOOD RATIOS
The general accuracy of a test with respect to a specific COI
is reflected by data in the columns of a classification accuracy
table (Streiner, 2003). The column-based indices include
sensitivity, specificity, and the positive and negative likelihood
ratios (LR+ and LR−). The formulas for calculation of the
column-based classification accuracy statistics from data in
Table 1–4 are given below:
Sensitivity = A / (A + C)        [9]

Specificity = D / (D + B)        [10]

LR+ = Sensitivity / (1 − Specificity)        [11]

LR− = Specificity / (1 − Sensitivity)        [12]
Sensitivity is defined as the proportion of COI+ examinees
who are correctly classified as such by a test. Specificity is
defined as the proportion of COI− examinees who are cor-
rectly classified as such by a test. The positive likelihood ratio
(LR+) combines sensitivity and specificity into a single
index of overall test accuracy indicating the odds (likeli-
hood) that a positive test result has come from a COI+
examinee. For example, a likelihood ratio of 3.0 may be
interpreted as indicating that a positive test result is three
times as likely to have come from a COI+ examinee as from
a COI− one. The LR− is interpreted conversely to the LR+.
As the LR approaches 1, test classification approximates
random assignment of examinees. That is, a person who is
Test+ is equally likely to be COI+ or COI−. For purposes of
working examples, Table 1–4 presents hypothetical test and
gold standard data.
Using Equations 9 to 12, the hypothetical test
demonstrates moderate sensitivity (.75) and high speci-
ficity (.95), with an LR+ of 15 and an LR− of 3.8. Thus, for
the hypothetical measure, a positive result is 15 times more
likely to be obtained by an examinee who has the COI than
by one who does not, while a negative result is 3.8 times
more likely to be obtained by an examinee who does not
have the COI than by one who does.
TABLE 1–3 Classification/Prediction Accuracy of a Test in Relation to a "Gold Standard" or Actual Outcome

                              GOLD STANDARD
TEST RESULT        COI+                  COI−                  ROW TOTAL
Test Positive      A (True Positive)     B (False Positive)    A + B
Test Negative      C (False Negative)    D (True Negative)     C + D
Column total       A + C                 B + D                 N = A + B + C + D

NOTE: COI = condition of interest.
TABLE 1–4 Classification/Prediction Accuracy of a Test in Relation to a "Gold Standard" or Actual Outcome (Hypothetical Data)

                        GOLD STANDARD
TEST RESULT        COI+        COI−        ROW TOTAL
Test Positive       30           2            32
Test Negative       10          38            48
Column total        40          40          N = 80

NOTE: COI = condition of interest.
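These indices are easy to verify; the short sketch below recomputes the column-based statistics from the Table 1–4 cells:

```python
# Cells from Table 1-4: A = true positives, B = false positives,
# C = false negatives, D = true negatives.
A, B, C, D = 30, 2, 10, 38

sensitivity = A / (A + C)                 # Formula 9
specificity = D / (D + B)                 # Formula 10
lr_pos = sensitivity / (1 - specificity)  # Formula 11
lr_neg = specificity / (1 - sensitivity)  # Formula 12

print(f"Sensitivity = {sensitivity:.2f}, Specificity = {specificity:.2f}")
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.1f}")  # 15.0 and 3.8
```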
Note that sensitivity, specificity, and LR+/− are param-
eter estimates that have associated errors of estimation that
can be quantified. The magnitude of estimation error is in-
versely related to sample size and can be quite large when
sample size is small. The formulas for calculating standard
errors for sensitivity, specificity, and the LR are complex
and will not be presented here (see McKenzie et al., 1997).
Fortunately, these values may also be easily calculated using
a number of readily available computer programs. Using
one of these (Mackinnon, 2000) with data from Table 1–4,
the 95% confidence interval for sensitivity was found to be
.59 to .87, while that for specificity was .83 to .99. LR+ was
3.8 to 58.6, and LR− was 2.2 to 6.5. Clearly, the range of
measurement error is not trivial for this hypothetical study.
In addition to appreciating issues relating to estimation
error, it is also important to understand that while column-
based indices provide useful information about test validity
and utility, a test may nevertheless have high sensitivity and
specificity but still be of limited clinical value in some situ-
ations, as will be detailed later.
POSITIVE AND NEGATIVE PREDICTIVE VALUE
As opposed to being concerned with test accuracy at the
group level, clinicians are typically more concerned with
test accuracy in the context of diagnosis and other deci-
sion making at the level of individual examinees. That is,
clinicians wish to determine whether or not an individual
examinee does or does not have a given COI. In this sce-
nario, clinicians must consider indices derived from the data
in the rows of a classification accuracy table (Streiner, 2003).
These row-based indices are positive predictive value (PPV)
and negative predictive value (NPV). The formulas for cal-
culation of these from data in Table 1–3 are given here:
PPV = A / (A + B)        [13]

NPV = D / (C + D)        [14]
PPV is defined as the probability that an individual with
a positive test result has the COI. Conversely, NPV is de-
fined as the probability that an individual with a negative
test result does not have the COI. For example, predictive
power estimates derived from the data presented in Table
1–4 indicate that PPV = .94 and NPV = .79. Thus, in the
hypothetical dataset, 94% of persons who obtain a positive
test result actually have the COI, while 79% of people who
obtain a negative test result do not in fact have the COI.
When predictive power is close to .50, examinees are ap-
proximately equally likely to be COI+ as COI−, regardless
of whether they are Test+ or Test−. When predictive power
is less than .50, test-based classifications or diagnoses will be
incorrect more often than not. However, predictive power
values at or below .50 may still be informative. For example,
if the population prevalence of a COI is .05 and the PPV
based on test results is .45, a clinician can rightly conclude
that an examinee is much more likely to have the COI than
members of the general population, which may be clinically
relevant.
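A corresponding sketch recomputes the row-based indices from Table 1–4:

```python
# Cells from Table 1-4 (A, B, C, D as before).
A, B, C, D = 30, 2, 10, 38

ppv = A / (A + B)  # Formula 13: P(COI+ given a positive test result)
npv = D / (C + D)  # Formula 14: P(COI- given a negative test result)

print(f"PPV = {ppv:.2f}")  # 0.94
print(f"NPV = {npv:.2f}")  # 0.79
```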
As with sensitivity and specificity, PPV and NPV are
parameter estimates that should always be considered in the
context of estimation error. Unfortunately, standard errors
or confidence intervals for estimates of predictive power are
rarely listed when these values are reported; clinicians are
thus left to their own devices to calculate them. Fortunately,
these values may be easily calculated using a number of freely
available computer programs (see Crawford, Garthwaite, &
Betkowska, 2009; Mackinnon, 2000). Using one of these
(Mackinnon, 2000) with data from Table 1–4, the 95%
confidence intervals for PPV and NPV given the base rate
in the study were found to be .94 to .99 and .65 to .90, re-
spectively. Clearly, the confidence interval range is not
trivial for this small dataset.
BASE RATES
Of critical importance to clinical interpretation of test
scores, PPV and NPV vary with the base rate or prevalence
of a COI.
The prevalence of a COI is defined with respect to
Table 1–3 as:
(A + C) / N        [15]
As should be readily apparent from inspection of Table
1–4, the prevalence of the COI in the sample is 50%.
Formulas for deriving predictive power for any level of
sensitivity and specificity and a specified prevalence are
given here:
PPV = (Prevalence × Sensitivity) / [(Prevalence × Sensitivity) + ((1 − Prevalence) × (1 − Specificity))]        [16]

NPV = ((1 − Prevalence) × Specificity) / [((1 − Prevalence) × Specificity) + (Prevalence × (1 − Sensitivity))]        [17]
From inspection of these formulas, it should be apparent
that, regardless of sensitivity and specificity, predictive
power will vary between 0 and 1 as a function of prevalence.
Application of Formulas 16 and 17 to the data presented in
Table 1–4 across the range of possible base rates provides the
range of possible PPV and NPV values depicted in Figure
1–5 (note that Figure 1–5 was produced by a spreadsheet
developed for analyzing the predictive power of tests and
is freely available from Daniel Slick at dslick@gmail.com).
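A short sketch of Formulas 16 and 17 makes the dependence on base rate explicit, using the sensitivity and specificity from Table 1–4 (the base rates shown are chosen purely for illustration):

```python
def ppv(prev, sens, spec):  # Formula 16
    return (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))

def npv(prev, sens, spec):  # Formula 17
    return ((1 - prev) * spec) / ((1 - prev) * spec + prev * (1 - sens))

sens, spec = .75, .95
for prev in (.008, .05, .15, .50, .80):
    print(f"base rate {prev:5.3f}:  PPV = {ppv(prev, sens, spec):.2f}  "
          f"NPV = {npv(prev, sens, spec):.2f}")
# With sensitivity and specificity fixed, PPV climbs and NPV falls as
# the base rate rises -- the pattern plotted in Figure 1-5.
```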
As can be seen in Figure 1–5, the relation between pre-
dictive power and prevalence is curvilinear and asymptotic,
with endpoints at 0 and 1. For any given test cutoff score,
PPV will always increase with base rate, while NPV will
simultaneously decrease. For the hypothetical test being
considered, one can see that both PPV and NPV are mod-
erately high (at or above .80) when the COI base rate ranges
from 20% to 50%. The tradeoff between PPV and NPV at
high and low base rate levels is also readily apparent; as the
base rate increases above 50%, PPV exceeds .95 while NPV
declines, falling below .50 as the base rate exceeds 80%.
Conversely, as the base rate falls below 30%, NPV exceeds
.95 while PPV rapidly drops off, falling below 50% as the
base rate falls below 7%.
From the foregoing, it is apparent that the predic-
tive power values derived from data presented in Table
1–4 would not be applicable in settings where base rates
vary from the 50% value in the hypothetical dataset. This
is important because, in practice, clinicians may often be
presented with PPV values based on data where “preva-
lence” values are near 50%. This is due to the fact that, re-
gardless of the prevalence of a COI in the population, some
diagnostic validity studies employ equal-sized samples of
COI+ and COI− individuals to facilitate statistical ana-
lyses. In contrast, the actual prevalence of COIs may
differ substantially from 50% in various clinical settings
and circumstances (e.g., screening vs. diagnostic use).
For examples of differing PPV and NPV across different
base rates, see Chapter 16, on the Minnesota Multiphasic
Personality Inventory, 2 (MMPI-2) and Minnesota
Multiphasic Personality Inventory, 2 Restructured Form
(MMPI-2-RF).
For example, suppose that the data from Table 1–4
were from a validity trial of a neuropsychological measure
designed for administration to young adults for purposes
of predicting development of schizophrenia. The question
then arises: Should the measure be used for broad screening
given a lifetime schizophrenia prevalence of .008? Using
Formula 16, one can determine that for this purpose the
measure’s PPV is only .11 and thus the “positive” test results
would be incorrect 89% of the time.
Conversely, the prevalence of a COI may in some
settings be substantially higher than 50%. As an example
of the other extreme, the base rate of head injuries among
persons admitted to an acute hospital head injury reha-
bilitation service is essentially 100%, in which case the
use of neuropsychological tests to determine whether
or not examinees had sustained a head injury would not
only be redundant, but very likely lead to false-negative
errors (such tests could, of course, be legitimately used for
other purposes, such as grading injury severity). Clearly,
clinicians need to carefully consider published data con-
cerning sensitivity, specificity, and predictive power in
light of intended test use and, if necessary, calculate PPV
and NPV values and COI base rate estimates applicable to
specific groups of examinees seen in their own practices.
In addition, it must be kept in mind that PPV and NPV
values calculated for individual examinees are estimates
that have associated measurement errors that allow for con-
struction of confidence intervals. Crawford, Garthwaite,
and Betkowska (2009) provide details on the calculation
of such confidence intervals and also a free computer pro-
gram that performs the calculations.
DIFFICULTIES WITH ESTIMATING AND APPLYING BASE RATES
Prevalence or base rate estimates may be based on large-scale
epidemiological studies that provide good data on the rate of
occurrence of COIs in the general population or within spe-
cific subpopulations and settings (e.g., prevalence rates of var-
ious psychiatric disorders in inpatient psychiatric settings).
However, in some cases, no prevalence data may be avail-
able, or reported prevalence data may not be applicable to
specific settings or subpopulations. In these cases, clinicians
who wish to determine predictive power must develop their
own base rate estimates. Ideally, these can be derived from
data collected within the same setting in which the test will
be employed, though this is typically time-consuming and
many methodological challenges may be faced, including
limitations associated with small sample sizes. Methods for
estimating base rates in such contexts are beyond the scope
of this chapter; interested readers are directed to Mossman
(2003), Pepe (2003), and Rorer and Dawes (1982).
DETERMINING THE OPTIMUM CUTOFF SCORE: ROC ANALYSES AND OTHER METHODS
The foregoing discussion has focused on the diagnostic
accuracy of tests using specific cutoff points, presumably
ones that are optimal for given tasks such as diagnosing
dementia or detecting noncredible performance.

Figure 1–5 Relation of predictive power to prevalence—hypothetical data (sensitivity = .75, specificity = .95; PPV and NPV plotted across base rates from 0 to 1).
A number of methods for determining an optimum cutoff
point are available, and, although they may lead to sim-
ilar results, the differences between them are not trivial.
Many of these methods are mathematically complex and/
or computationally demanding, thus requiring computer
applications.
The determination of an optimum cutoff score for de-
tection or diagnosis of a COI is often based on simulta-
neous evaluation of sensitivity and specificity or predictive
power across a range of scores. In some cases, this infor-
mation, in tabular or graphical form, is simply inspected
and a score is chosen based on a researcher’s or clinician’s
comfort with a particular error rate. For example, in malin-
gering research, cutoffs that minimize false-positive errors
or hold them below a low threshold are often explicitly
chosen (i.e., by convention, a specificity of .90 or higher),
even though such cutoffs are associated with relatively large
false-negative error rates (i.e., lower detection of examinees
with the COI, malingering).
A more formal, rigorous, and often very useful set of
tools for choosing cutoff points and for evaluating and com-
paring test utility for diagnosis and decision making falls
under the rubric of receiver operating characteristics (ROC)
analyses. Clinicians who use tests for diagnostic or other
decision-making purposes should be familiar with ROC
procedures. The statistical procedures utilized in ROC ana-
lyses are closely related to and substantially overlap those of
Bayesian analyses. The central graphic element of ROC ana-
lyses is the ROC graph, which is a plot of the true-positive
proportion (y axis) against the false-positive proportion (x
axis) associated with each specific score in a range of test
scores. Figure 1–6 shows an example of an ROC graph. The
area under the curve indexes the overall discriminative
accuracy of the test (the probability that a randomly chosen
COI+ examinee scores higher than a randomly chosen COI−
examinee), while the slope of the curve at any point is equivalent
to the LR+ associated with a specific test score.
A number of ROC methods have been developed for
determining cutoff points that consider not only accu-
racy, but also allow for factoring in quantifiable or quasi-
quantifiable costs and benefits and the relative importance
of specific costs and benefits associated with any given
cutoff score. ROC methods may also be used to compare
the diagnostic utility of two or more measures, which may
be very useful for purposes of test selection. Although
ROC methods can be very useful clinically, they have not
yet made broad inroads into most of the clinical neuropsy-
chological literature, with the exception of some research
on dementia screening and research on performance va-
lidity and symptom validity (see reviews in this volume).
A detailed discussion of ROC methods is beyond the scope
of this chapter; interested readers are referred to Mossman
and Somoza (1992), Pepe (2003), Somoza and Mossman
(1992), and Swets, Dawes, and Monahan (2000).
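As an illustrative sketch (hypothetical score distributions, not data from any published test), the area under the ROC curve can be computed directly from its probabilistic interpretation, and any single cutoff yields one point on the curve:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical test scores: COI+ examinees score higher on average.
coi_pos = rng.normal(60, 10, size=200)
coi_neg = rng.normal(50, 10, size=200)

# AUC equals the probability that a randomly chosen COI+ examinee
# outscores a randomly chosen COI- examinee (ties counted as half).
diffs = coi_pos[:, None] - coi_neg[None, :]
auc = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)
print(f"AUC = {auc:.2f}")

# Each candidate cutoff gives one (false-positive, true-positive) point:
cutoff = 55
tpp = np.mean(coi_pos >= cutoff)  # true-positive proportion (sensitivity)
fpp = np.mean(coi_neg >= cutoff)  # false-positive proportion (1 - specificity)
print(f"Cutoff {cutoff}: TPP = {tpp:.2f}, FPP = {fpp:.2f}")
```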
EVALUATION OF PREDICTIVE POWER ACROSS A RANGE OF CUTOFF SCORES AND BASE RATES
As noted earlier, it is important to recognize that positive
and negative predictive power are not properties of tests
but rather are properties of specific test scores in specific
contexts. The foregoing sections describing the calculation
and interpretation of predictive power have focused on
methods for evaluating the value of a single cutoff point for
a given test for purposes of classifying examinees as COI+
or COI−. However, by focusing exclusively on single cutoff
points, clinicians are essentially transforming continuous
test scores into binary scores, thus discarding much po-
tentially useful information, particularly when scores are
considerably above or below a cutoff. Lindeboom (1989)
proposed an alternative approach in which predictive
power across a range of test scores and base rates can be
displayed in a single Bayesian probability table. In this ap-
proach, test scores define the rows and base rates define
the columns of a table; individual table cells contain the
associated PPV and NPV for a specific score and spe-
cific base rate. Such tables have rarely been constructed
for standardized measures, but examples can be found in
some test manuals (e.g., the Victoria Symptom Validity
Test; Slick et al., 1997). The advantage of this approach is
that it allows clinicians to consider the diagnostic confi-
dence associated with an examinee’s specific score, leading
to more accurate assessments. A limiting factor for use
of Bayesian probability tables is that they can only be
constructed when sensitivity and specificity values for an
entire range of scores are available, which is rarely the case
for most tests. In addition, predictive power values in such
tables are subject to any validity limitations of underlying
data and should include associated standard errors or confidence intervals.

Figure 1–6 An ROC graph (true-positive proportion on the y axis plotted against false-positive proportion on the x axis).
COMBINING RESULTS OF MULTIPLE SCREENING/DIAGNOSTIC TESTS
Often, more than one test that provides data relevant to a
specific diagnosis is administered. In these cases, clinicians
may wish to integrate predictive power estimates across
measures. There may be a temptation to use the PPV asso-
ciated with a score on one measure as the “base rate” when
the PPV for a score from a second measure is calculated.
For example, suppose that the base rate of a COI is 15%.
When a test designed to detect the COI is administered, an
examinee’s score translates to a PPV of 65%. The examiner
then administers a second test designed to detect the COI,
but when PPV for the examinee’s score on the second test
is calculated, a “base rate” of 65% is used rather than 15%
because the former is now the assumed prior probability
that the examinee has the COI given their score on the first
test administered. The resulting PPV for the examinee’s
score on the second measure is now 99%, and the examiner
concludes that the examinee has the COI. While this pro-
cedure may seem logical, it will produce an inflated PPV es-
timate for the second test score whenever the two measures
are correlated, which will almost always be the case when
both measures are designed to screen for or diagnose the
same COI.
A more defensible method for combining results of
multiple diagnostic tests is to use empirically derived
classification rules based on the number of positive findings
from a set of screening/diagnostic tests. While this ap-
proach to combining test results can produce more accurate
classifications, its use of binary data (positive or negative
findings) as inputs does not capitalize on the full range of
data available from each test, and so accuracy may not be
optimized. To date, this approach to combining test results
has primarily been used with performance/symptom va-
lidity tests, and there have been some interesting debates
in the literature concerning the accuracy and clinical utility
of the derived classification rules; see Larrabee (2014a,
2014b), Bilder et al. (2014), and Davis and Millis (2014).
A preferred psychometric method for integrating scores
from multiple screening/diagnostic measures, one that
utilizes the full range of data from each test, is to construct
group membership (i.e., COI+ vs. COI−) prediction equa-
tions using methods such as logistic regression or multiway
frequency analyses. These methods can be used clinically to
generate binary classifications or classification probabilities,
with the latter being preferred because it is a better gauge of
accuracy. Ideally, the derived classification formulas should
be well validated before being utilized clinically. More
details on methods for combining classification data across
measures may be found in Franklin and Krueger (2003) and
Pepe (2003).
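A minimal sketch of this approach using scikit-learn's logistic regression (the data, group separations, and correlation are entirely hypothetical, and a real clinical rule would require validation and cross-validation before use):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical, positively correlated scores on two diagnostic
# measures for 200 COI+ and 200 COI- examinees.
cov = [[100, 60], [60, 100]]
pos = rng.multivariate_normal([60, 58], cov, size=200)
neg = rng.multivariate_normal([50, 50], cov, size=200)
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

# Logistic regression combines the full range of both scores into a
# single predicted probability of group membership.
model = LogisticRegression().fit(X, y)
prob = model.predict_proba([[62, 61]])[0, 1]
print(f"Estimated P(COI+ | scores of 62 and 61) = {prob:.2f}")
```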
WHY ARE CLASSIFICATION ACCURACY STATISTICS NOT UBIQUITOUS IN NEUROPSYCHOLOGICAL RESEARCH AND CLINICAL PRACTICE?
Of note, the mathematical relations between sensitivity,
specificity, base rates, and predictive power were first elu-
cidated by Thomas Bayes and published in 1763; methods
for deriving predictive power and other related indices of
confidence in decision making are thus often referred to
as Bayesian statistics. Note that in Bayesian terminology,
the prevalence or base rate of a COI is known as the prior
probability, while PPV and NPV are known as posterior
probabilities. Conceptually, the difference between the
prior and posterior probabilities associated with infor-
mation added by a test score is an index of the diagnostic
utility of a test. There is an entire literature concerning
Bayesian methods for statistical analysis of test utility.
These will not be covered here, and interested readers are
referred to Pepe (2003).
Needless to say, Bayes’s work predated the first diag-
nostic applications of psychological tests as we know them
today. However, although neuropsychological tests are rou-
tinely used for diagnostic decision making, information on
the predictive power of most tests is often absent from both
test manuals and applicable research literature. This is so de-
spite the fact that the importance and relevance of Bayesian
approaches to the practice of clinical psychology was
well described 60 years ago by Meehl and Rosen (1955).
Bayesian statistics are finally making major inroads into the
mainstream of neuropsychology, particularly in the research
literature concerning symptom/performance validity meas-
ures, in which estimates of predictive power have become de
rigueur, although these are still typically presented without
associated standard errors, thus greatly reducing the utility
of the data.
ASSESSING CHANGE OVER TIME
Neuropsychologists are often interested in tracking changes
in function over time. In these contexts, three interrelated
questions arise:
• To what degree do changes in examinee test scores
reflect “real” changes in function as opposed to
measurement error?
• To what degree do real changes in examinee test scores
reflect clinically significant changes in function as
opposed to clinically trivial changes?
• To what degree do changes in examinee test scores
conform to expectations, given the application of
treatments or the occurrence of other events or processes
between test and retest, such as head injury,
dementia, or brain surgery?