Psychological - W3 - Chapter 8 - Test Development - DN

Education, 58 cards, created 18 days ago

Anchor protocol is a standardized test answer sheet created by the test publisher to assess and ensure the consistency and accuracy of examiners' scoring. It serves as a benchmark to compare scoring practices across different raters.

Key Terms

Term	Definition
anchor protocol	a test answer sheet developed by a test publisher to test the accuracy of examiners’ scoring p.280
biased test item	an item that favours one group in relation to another when differences in group ability are controlled p.271
binary-choice item	a multiple-choice item that contains only two possible responses (e.g., true-false) p.254
categorical scaling	a system of scaling in which stimuli are placed in one of two or more alternative categories that differ quantitatively with respect to some continuum p.249
categorical scoring	a method of evaluation in which test responses earn credit toward placement in a particular class or category; sometimes testtakers must meet a set number of responses corresponding to a particular criterion to be placed in a specific category; also called class scoring; contrast with cumulative scoring & ipsative scoring p.260
ceiling effect	diminished utility of a tool of assessment in distinguishing testtakers at the high end of the ability, trait, or other attribute being measured p. 259, 307
class scoring	a method of evaluation in which test responses earn credit toward placement in a particular class or category; sometimes testtakers must meet a set number of responses corresponding to a particular criterion to be placed in a specific category; contrast with cumulative scoring & ipsative scoring p.260
comparative scaling	in test development, a method of developing ordinal scales through the use of a sorting task that entails judging a stimulus in comparison with every other stimulus used on the test p.249
completion item	requires an examinee to provide a word or phrase that completes a sentence p. 254
computerized adaptive testing (CAT)	an interactive, computer-administered testtaking process in which items are presented to the testtaker based in part on the testtaker’s performance on previous items p.15, 255-256
co-norming	the test norming process conducted on two or more tests using the same sample of testtakers; when used to validate all of the tests being normed, this process may also be referred to as co-validation p.138n4, 278
constructed-response format	a form of test item requiring a testtaker to construct or create a response, as opposed to simply selecting one; contrast with selected-response format p.252
co-validation	a test validation process conducted on two or more tests using the same sample of testtakers; the name given to co-norming when it is used to validate all of the tests being normed p.278
cross-validation	a revalidation on a sample of testtakers other than the testtakers on whom test performance was originally found to be a valid predictor of some criterion p.278
essay item	a test item that requires a testtaker to write a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation p.255
expert panel	in the test development process, a group of people knowledgeable about the subject matter being tested and/or the population for whom the test is being designed; they can provide input to improve the test’s content, fairness, etc. p.274-275
floor effect	a phenomenon arising from the diminished utility of a tool of assessment in distinguishing testtakers at the low end of the ability, trait, or other attribute being measured p. 256-259
giveaway item	a test item, usually near the beginning of a test of ability or achievement, designed to be relatively easy, usually for the purpose of building the testtaker’s confidence or reducing test-related anxiety p.263n4
What three criteria must be met when correcting for the impact of guessing?	1) must recognize that guesses are not normally totally random; 2) must deal with the problem of omitted items; 3) must take into account that some testtakers are lucky and others unlucky p.269-271
Guttman scale	a scale in which items range sequentially from weaker to stronger expressions of the attitude or belief being measured; constructed so that endorsement of a stronger (later) item presumes that all weaker (earlier) items are also true of the testtaker; named after its developer p.249
ipsative scoring	an approach to scoring & interpretation in which responses & the presumed strength of a measured trait are interpreted relative to the measured strength of other traits for that testtaker; contrast with class scoring & cumulative scoring p.260
item analysis	a general term used to describe various procedures, usually statistical, designed to explore how individual items work compared to others in the test & in the context of the whole test (e.g., to explore the level of difficulty of individual items on an achievement test, or the reliability of a personality test); contrast with qualitative item analysis p.262-275
item bank	a collection of questions to be used in the construction of a test p. 255, 257-259, 282-284
item branching	in computerised adaptive testing (CAT), the individualised presentation of test items drawn from an item bank based on the testtaker’s previous responses p.260
item-characteristic curve (ICC)	a graphic representation of the probabilistic relationship between a person's level of the trait (ability, characteristic) being measured and the probability of responding to an item in a predicted way; also known as a category response curve or an item trace line p.177, 268, 281
item-difficulty index	a statistic obtained by calculating the proportion of the total number of testtakers who answered an item correctly; p denotes item difficulty, and a subscript refers to the item number (e.g., p1 for item 1); can range from 0 to 1; the larger the item-difficulty index, the easier the item (because p represents the proportion of testtakers passing the item); items cannot be too easy or too hard if they are to differentiate between testtakers' knowledge of the subject matter p.263-264
item-discrimination index	a measure of item discrimination, symbolised by d p.264-268
item-endorsement index	the name given to the item-difficulty index (which is used in achievement testing) when it is used in other contexts (e.g., personality testing) p. 263
item fairness	a reference to the degree of bias, if any, in a test item p. 271-272
item format	a reference to the form, plan, structure, arrangement, or layout of individual test items, including whether the test items require testtakers to select or create a response p.252-255
item pool	the reservoir or well from which items will or will not be drawn for the final version of the test; the collection of items to be further evaluated for possible selection for use in an item bank p.251
item-reliability index	provides an indication of the internal consistency of a test; the higher the index, the greater the internal consistency; equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score p.264
item-validity index	a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure; important when a test developer's goal is to maximise the criterion-related validity of a test; the higher the item-validity index, the greater the test's criterion-related validity; to calculate it, we must know the item-score standard deviation (symbolised as s1, s2, s3, etc.) and the correlation between the item score and the criterion score; the item-score standard deviation is obtained from the item-difficulty index: s1 = square root of p1(1 - p1); the correlation between the score on item 1 and the score on a criterion measure (r1c) is then multiplied by s1, and the product (s1 r1c) is the index of the item's validity p.264
Likert scale	a summative rating scale with five alternative responses ranging on a continuum from, e.g., "strongly agree" to "strongly disagree" p.247
matching item	the testtaker is presented with two columns, premises on the left & responses on the right; the task is to determine which response is best matched to which premise; young testtakers may draw a line, while others are typically asked to write a letter/number as a response p.253
method of paired comparisons	a scaling method in which testtakers are presented with a pair of stimuli (e.g., photos) and must select one of them according to a rule (e.g., "select the one that is more appealing") p.248
multiple-choice format	one of the three types of selected-response item formats; has three elements: a stem, a correct alternative or option, and several incorrect alternatives (referred to as distractors or foils) p.252
pilot work	also referred to as pilot study & pilot research; preliminary research surrounding the creation of a prototype test; the general objective is to determine how best to gauge, assess, or evaluate the targeted construct(s) p.243-244
qualitative item analysis	non-statistical procedures designed to explore how individual test items work, both compared to other items in the test & in the context of the whole test; unlike statistical measures, they involve exploration of the issues by verbal means (e.g., interviews & group discussions with testtakers & other relevant parties) p.272-275
qualitative methods	techniques of data generation & analysis that rely primarily on verbal rather than mathematical or statistical procedures p.272
rating scale	a system of ordered numerical or verbal descriptors used to make judgements about the presence, absence, or magnitude of a particular trait, attitude, emotion, or other variable p.205, 247, 371
scaling	1) in test construction, the process of setting rules for assigning numbers in measurement; 2) the process by which a measuring device is designed and calibrated & the way numbers (or other indices) are assigned to different amounts of the trait, attribute, or characteristic being measured p.244-251
scalogram analysis	an item-analysis procedure that entails graphic mapping of a testtaker's responses p.250
scoring drift	a discrepancy between the scoring in an anchor protocol and the scoring of another protocol p. 280
selected-response format	a form of test item requiring testtakers to select a response (e.g., true/false, multiple choice, and matching items), as opposed to creating one; contrast with constructed-response format p.252
sensitivity review	a study of test items, usually during test development, in which items are examined for fairness to all prospective testtakers & for the presence of offensive language, stereotypes, or situations p.274
short-answer item	may also be referred to as a completion item; a word, term, sentence, or paragraph may qualify as a short answer; anything beyond this is an essay item p.254
summative scale	an index derived from the summing of selected scores on a test or sub-test p. 247
test conceptualization	an early stage of the test development process when an idea for a particular test or test revision is conceived p.240, 241-244
test construction	a stage in the process of test development that entails writing test items (or rewriting/revising existing items) as well as formatting items, setting scoring rules, and otherwise designing and building a test p.240
test development	an umbrella term for all that goes into the process of creating a test p. 240-284
test revision	action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement p.240
test tryout	a stage in the process of test development that entails administering a preliminary version of a test to a representative sample of testtakers under conditions that simulate the conditions under which the final version of the test will be administered p.240, 261-262
"think aloud" test administration	a method of qualitative item analysis in which examinees verbalize their thoughts as they take the test; useful in understanding how individual items function in a test & how testtakers interpret or misinterpret the meaning of individual items p.274
true-false item	a binary-choice item (i.e., one with only two possible responses) that requires the testtaker to indicate whether a statement is or is not a fact p.254
validity shrinkage	the decrease in item validities that inevitably occurs after cross-validation p. 278
What is the optimal item difficulty?	usually the midpoint between 1.00 and the probability of answering correctly by guessing (the chance success proportion); e.g., for a true-false item (50% chance of getting it right by guessing): (.50 + 1.00) / 2 = .75; for a five-option multiple-choice item (20% chance): (.20 + 1.00) / 2 = .60 p.263
How can you create a visual representation of the best items on a test (i.e., if the objective is to maximise criterion-related validity)?	by plotting each item's item-validity index against its item-reliability index p.265 Fig 8-5
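
Worked example: the item statistics defined in the table above (item-difficulty index, optimal item difficulty, item-score standard deviation, item-reliability index, and item-validity index) can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's own procedure: the function names and the sample scores are hypothetical, items are assumed to be scored 0 (incorrect) or 1 (correct), and the correlation is an ordinary Pearson r.

```python
from math import sqrt


def item_difficulty(item_scores):
    """p: proportion of testtakers who answered the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)


def optimal_difficulty(chance_success):
    """Midpoint between the chance success proportion and 1.00."""
    return (chance_success + 1.0) / 2


def item_score_sd(p):
    """s: item-score standard deviation for a 0/1 item, sqrt(p(1 - p))."""
    return sqrt(p * (1 - p))


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Raises ZeroDivisionError if either list has no variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def item_reliability_index(item_scores, total_scores):
    """s multiplied by r between the item score and the total test score."""
    s = item_score_sd(item_difficulty(item_scores))
    return s * pearson_r(item_scores, total_scores)


def item_validity_index(item_scores, criterion_scores):
    """s multiplied by r between the item score and the criterion score."""
    s = item_score_sd(item_difficulty(item_scores))
    return s * pearson_r(item_scores, criterion_scores)


# Hypothetical data for four testtakers (invented for illustration only).
item_1 = [1, 1, 0, 0]        # 0/1 scores on item 1
totals = [4, 3, 2, 1]        # total test scores
criterion = [10, 8, 7, 3]    # scores on an external criterion measure

print("p1 =", item_difficulty(item_1))
print("optimal difficulty, true-false =", optimal_difficulty(0.50))
print("item-reliability index =", round(item_reliability_index(item_1, totals), 3))
print("item-validity index =", round(item_validity_index(item_1, criterion), 3))
```

Computing both indices for every item gives the coordinates needed for the plot described in the last row of the table.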
