Psychological Testing: Chapter 5: Reliability
This flashcard set explains the concept of reliability in psychological measurement, including how test scores consist of true scores and error, and introduces the reliability coefficient as a measure of consistency in test results. It emphasizes the importance of understanding variance in observed scores to ensure accurate assessment.
Concept of Reliability
X = T + E
X = Observed score
T = True score
E = Error
True Score Model: the magnitude of the presence of a certain psychological trait as measured by a test of that trait will be due to the true amount of that trait and other factors.
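The X = T + E decomposition can be illustrated with a small simulation (a hedged sketch: the distributions, seed, and variable names are illustrative assumptions, not from the text):

```python
# Minimal simulation of the true score model: X = T + E.
# Assumed for illustration: true scores ~ N(100, 15), error ~ N(0, 5).
import random

random.seed(42)
N = 10_000

true_scores = [random.gauss(100, 15) for _ in range(N)]  # T
errors = [random.gauss(0, 5) for _ in range(N)]          # E
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability coefficient: ratio of true variance to total observed variance.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 3))  # close to the theoretical 15**2 / (15**2 + 5**2) = 0.90
```

Because the error is random and uncorrelated with the true scores, the sample estimate lands near the theoretical reliability of .90; shrinking the error variance pushes it toward 1.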
| Term | Definition |
|---|---|
| Reliability | Consistency in measurement; the total variance in an observed distribution of test scores equals the sum of the true variance plus the error variance |
| Reliability Coefficient | Index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance |
| Concept of Reliability | X = T + E, where X = observed score, T = true score, and E = error |
| True Score Model | The magnitude of the presence of a certain psychological trait as measured by a test of that trait will be due to the true amount of that trait and other factors |
| Variance | Statistic useful in describing sources of test score variability; useful because it can be broken down into components |
| True Variance | Variance from true differences |
| Error Variance | Variance from irrelevant, random sources |
| Reliability of a Test | The greater the proportion of the total variance attributed to true variance, the more reliable the test |
| Sources of Error Variance | Test construction, test administration, and test scoring and interpretation |
| Item/Content Sampling | Terms that refer to variation among items within a test as well as to variation among items between tests |
| Challenge in Test Development | To maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance |
| Factors Related to the Test Environment | Room temperature |
| Factors Related to Testtaker Variables | Pressing emotional problems |
| Factors Related to Examiner Variables | Examiner's physical appearance and demeanor; presence or absence of an examiner |
| Factors Related to Test Scoring | Technical glitches may contaminate data |
| Test-Retest Method | Using the same instrument to measure the same thing at two points in time |
| Test-Retest Reliability | Result of a reliability evaluation; estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test |
| Test-Retest Measure | Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time |
| Coefficient of Stability | Estimate of test-retest reliability when the interval between testings is greater than six months |
| Coefficient of Equivalence | The degree of relationship between various forms of a test; an alternate-forms or parallel-forms coefficient of reliability |
| Parallel Forms | Exist when, for each form of the test, the means and the variances of observed test scores are equal; scores obtained on parallel forms correlate equally with the true score; scores obtained on parallel tests correlate equally with other measures |
| Alternate Forms | Different versions of a test that have been constructed so as to be parallel; designed to be equivalent with respect to variables such as content and level of difficulty |
| Similarity between Alternate-/Parallel-Forms Reliability and Test-Retest Reliability | Two test administrations with the same group are required |
| Item Sampling | Inherent in the computation of an alternate- or parallel-forms reliability coefficient; testtakers may do better or worse on a specific form of the test not as a function of their true ability but simply because of the particular items that were selected for inclusion in the test |
| Internal Consistency Estimate of Reliability / Estimate of Inter-Item Consistency | Obtaining an estimate of the reliability of a test without developing an alternate form and without having to administer the test twice to the same people |
| Split-Half Reliability | Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once; a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice |
| Steps to Compute a Coefficient of Split-Half Reliability | 1. Divide the test into equivalent halves; 2. Calculate a Pearson r between scores on the two halves; 3. Adjust the half-test reliability using the Spearman-Brown formula |
| To Split a Test | Randomly assign items to one or the other half of the test, or assign odd-numbered items to one half and even-numbered items to the other half |
| Odd-Even Reliability | Assign odd-numbered items to one half of the test and even-numbered items to the other half |
| Mini Parallel Forms | Each half equal to the other in format, stylistic, statistical, and related aspects |
| Spearman-Brown Formula | Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test; specific application to estimate the reliability of a test that is lengthened or shortened by any number of items; used to determine the number of items needed to attain a desired level of reliability |
| In Adding Items to Increase Test Reliability to a Desired Level | The rule is that new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured |
| When Internal Consistency Estimates of Reliability Are Inappropriate | When measuring the reliability of heterogeneous tests and speed tests |
| Inter-Item Consistency | Refers to the degree of correlation among all the items on a scale; calculated from a single administration of a single form of a test; useful in assessing the homogeneity of a test |
| Homogeneity | Degree to which a test measures a single factor; extent to which items in a scale are unifactorial |
| Heterogeneity | Degree to which a test measures different factors; composed of items that measure more than one trait |
| Nature of a Homogeneous Test | The more homogeneous a test is, the more inter-item consistency it can be expected to have |
| Testtakers with the Same Score on a Homogeneous Test | Have similar abilities in the area tested |
| Testtakers with the Same Score on a Heterogeneous Test | May have different abilities |
| Homogeneous Test | Insufficient tool for measuring multifaceted psychological variables such as intelligence or personality |
| G. Frederic Kuder & M. W. Richardson | Developed their own measures for estimating reliability, including the Kuder-Richardson Formula 20 (KR-20) |
| Kuder-Richardson Formula 20 (KR-20) | Their most popular formula |
| Where Test Items Are Highly Homogeneous | KR-20 and split-half reliability estimates will be similar |
| Where Test Items Are Highly Heterogeneous | KR-20 will yield lower reliability estimates than the split-half method |
| Dichotomous Items | Items that can be scored right or wrong, such as multiple-choice items |
| Test Battery | A selected assortment of tests and assessment procedures used in the process of evaluation; typically composed of tests designed to measure different variables |
| r KR20 | The Kuder-Richardson Formula 20 reliability coefficient |
| KR-21 | Used if there is reason to assume that all the test items have approximately the same degree of difficulty; outdated in an era of calculators and computers |
| Coefficient Alpha | Variant of the KR-20 that has received the most acceptance and is in widest use today; the mean of all possible split-half correlations, corrected by the Spearman-Brown formula; appropriate for use on tests containing nondichotomous items; the preferred statistic for obtaining an estimate of internal consistency reliability; yields an estimate of the mean of all possible split-half coefficients; widely used as a measure of reliability in part because it requires only one administration of the test; gives information about the test scores, not the test itself |
| Coefficient Alpha Result | Calculated to help answer questions about how similar sets of data are; ranges in value from 0 to 1; if a calculation yields a negative value, report it as zero |
| Scale of Coefficient Alpha | 0 indicates absolutely no similarity; 1 indicates perfect similarity |
| Inter-Scorer Reliability | Degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure |
| Coefficient of Inter-Scorer Reliability | A way to determine the degree of consistency among scorers |
| Approaches to the Estimation of Reliability | Test-retest, alternate- or parallel-forms, internal consistency, and inter-scorer methods |
| How High a Coefficient of Reliability Should Be | Falls on a continuum relative to the purpose and importance of the decisions to be made on the basis of scores on the test |
| Considerations of the Nature of the Test Itself | Whether test items are homogeneous or heterogeneous in nature |
| Sources of Variance in a Hypothetical Test | True variance and error variance |
| Homogeneity of Test Items | A test is homogeneous in items if it is functionally uniform throughout |
| Heterogeneity of Test Items | An estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability |
| Dynamic Characteristic | A trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences |
| Static Characteristic | A trait, state, or ability presumed to be relatively unchanging; an obtained measurement would not be expected to vary significantly as a function of time, so either the test-retest or the alternate-forms method would be appropriate |
| Restriction of Range/Variance | If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower; if the variance of either variable is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher |
| Power Test | When a time limit is long enough to allow testtakers to attempt all items and some items are so difficult that no testtaker is able to obtain a perfect score |
| Speed Test | Generally contains items of uniform level of difficulty so that, when given generous time limits, all testtakers should be able to complete all test items correctly; based on performance speed; the time limit is established so that few, if any, of the testtakers will be able to complete the entire test |
| Reliability Estimate of a Speed Test | Based on performance from two independent testing periods, using test-retest reliability, alternate-forms reliability, or split-half reliability from two separately timed half tests |
| If the Split-Half Procedure Is Used for a Speed Test | The obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula |
| Speed Test Administered Once with a Measure of Internal Consistency Calculated | The result will be a spuriously high reliability coefficient; given two people, one who completes 82 items of a speed test and another who completes 61 items of the same test, the correlation of the two will be close to 1 but will say nothing about response consistency |
| Criterion-Referenced Test | Designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective; tends to contain material that has been mastered in hierarchical fashion; tends to be interpreted in pass-fail terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes |
| Test-Retest Reliability Estimate | Based on the correlation between the total scores on two administrations of the same test |
| Alternate-Forms Reliability Estimate | Based on the correlation between total scores on two forms of the test |
| Split-Half Reliability Estimate | Based on the correlation between scores on two halves of the test, adjusted using the Spearman-Brown formula to obtain a reliability estimate of the whole test |
| Generalizability Theory / Domain Sampling Theory | Seeks to estimate the extent to which specific sources of variation under defined conditions contribute to the test score; a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample |
| Domain of Behavior | Universe of items that could conceivably measure a behavior; a hypothetical construct, one that shares certain characteristics with (and is measured by) the sample of items that make up the test |
| Generalizability Theory | May be viewed as an extension of true score theory wherein the concept of a universe score replaces that of a true score; developed by Lee J. Cronbach; given the same conditions of all the facets in the universe, the exact same test score should be obtained |
| Lee J. Cronbach | Encouraged test developers and researchers to describe the details of the particular test situation (universe) leading to a specific test score |
| Universe | Described in terms of its facets |
| Facets | Include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration |
| Universe Score | The test score obtained under a given universe of conditions; analogous to a true score in the true score model |
| Generalizability Study | Examines how generalizable scores from a particular test are if the test is administered in different situations; examines how much of an impact different facets of the universe have on the test score |
| Coefficients of Generalizability | Express the influence of particular facets on the test score; similar to reliability coefficients in the true score model |
| Decision Study | Developers examine the usefulness of test scores in helping the test user make decisions; designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use |
| Item Response Theory | Provides a way to model the probability that a person with X ability will be able to perform at a level of Y; stated in terms of personality assessment, it models the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it; not a term used to refer to a single theory or method |
| Latent | Physically unobservable |
| Latent-Trait Theory | Synonym for IRT; proposes models that describe how the latent trait influences performance on each test item; the latent trait theoretically can take on values from -infinity to +infinity |
| Characteristics of Items within an IRT Framework | Difficulty level of an item; item's level of discrimination |
| Difficulty | Refers to the attribute of not being easily accomplished, solved, or comprehended; may also refer to physical difficulty |
| Physical Difficulty | How hard or easy it is for a person to engage in a particular activity |
| Discrimination | Signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured |
| Dichotomous Test Items | Test items or questions that can be answered with only one of two alternative responses, such as true-false, yes-no, or correct-incorrect questions |
| Polytomous Test Items | Test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct |
| Georg Rasch | Developed a group of IRT models in which each item on the test is assumed to have an equivalent relationship with the construct being measured by the test |
| Reliability Coefficient | Helps the test developer build an adequate measuring instrument |
| Standard Error of Measurement (SEM) | Provides a measure of the precision of an observed test score; provides an estimate of the amount of error inherent in an observed score or measurement; there is an inverse relationship between the SEM and the reliability of a test: the higher the reliability of a test (or individual subtest within a test), the lower the SEM; a tool used to estimate or infer the extent to which an observed score deviates from a true score; the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests |
| Standard Error of a Score | Another term for standard error of measurement; index of the extent to which an individual's scores vary over tests presumed to be parallel |
| Confidence Interval | Range or band of test scores that is likely to contain the true score |
| Standard Error of the Difference | A statistical measure that can aid a test user in determining how large a difference between two scores should be before it is considered statistically significant |
| Questions that the Standard Error of the Difference Between Two Scores Can Answer | How did this individual's performance on test 1 compare with his or her performance on test 2? |
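Several of the quantities defined above (the Spearman-Brown adjustment, the standard error of measurement, and the confidence interval around an observed score) can be sketched as small helper functions. This is a hedged illustration: the function names are ours, and the standard textbook formulas r_adjusted = n·r / (1 + (n − 1)·r) and SEM = s·√(1 − r) are assumed.

```python
import math

def spearman_brown(r_half: float, n: float = 2.0) -> float:
    """Estimate the reliability of a test lengthened by a factor n
    from the correlation r_half between comparable parts (e.g. halves)."""
    return (n * r_half) / (1 + (n - 1) * r_half)

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = sd * sqrt(1 - r): the higher the reliability, the lower the SEM."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed: float, sem: float, z: float = 1.96):
    """Band around an observed score likely to contain the true score
    (about 95% confidence for z = 1.96, assuming normally distributed error)."""
    return observed - z * sem, observed + z * sem

# Split-half r of .80 adjusted to full-test length:
print(round(spearman_brown(0.80), 2))  # 0.89
# SEM for sd = 15 and reliability = .90:
sem = standard_error_of_measurement(15, 0.90)
print(round(sem, 2))  # 4.74
# 95% confidence band around an observed score of 100:
lo, hi = confidence_interval(100, sem)
print(round(lo, 1), round(hi, 1))  # 90.7 109.3
```

The demo values trace the inverse relationship noted above: raising the reliability toward 1 shrinks the SEM, which in turn narrows the confidence band around the observed score.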