Chapter 5 END – Notes
Scoring and Interpreting Tests
Standards for Scoring
AERA requires test developers to specify clearly and in detail how a test is to be scored, since accurate scoring and testing are essential.
ACA, the American Counseling Association (formerly the American Association for Counseling and Development), identified seven guidelines for test scoring:
1. Routinely randomly rescore a sample of the test answer sheets to verify accuracy.
2. Employ systematic procedures to verify the accuracy of manual, computerized or machine scored tests.
3. Obtain separate and independent verification that appropriate scoring rules and normative conversions are used for each person tested.
4. Verify the accuracy of conversions to normative or descriptive scales before the information is released.
5. Same as #2, but also verify that the person performing the scoring, in whatever modality, is qualified to detect inappropriate or impossible scores.
6. Using objective procedures, detect and identify the conditions and behaviors of the individual taking the test, and make them part of the report.
7. Label clearly the results and the scores that are reported with the date of the test administration.
Models of Scoring
(3 primary models & 1 alternative model)
Cumulative- MOST COMMON MODEL. Assumes that the number of items endorsed or responded to represents the degree of the construct or trait being measured (EX: anxiety, depression, self-esteem). THE HIGHER THE SCORE, THE GREATER THE DEGREE OF THAT TENDENCY/CONSTRUCT (TEST EXAMPLE: BDI-II & DEPRESSION). Used in achievement, aptitude, and personality tests. These tests usually involve differential weighting of items (e.g., a Likert scale) before the items are summed. Example: on intelligence tests such as the Wechsler Adult Intelligence Scale (WAIS-III), vocabulary definitions are rated on a two-point scale for degree of abstractness and specificity. For "orange," the answer "a color" earns 1 point and "a citrus fruit" earns 2 points.
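As a minimal sketch of the cumulative model, the snippet below sums weighted item ratings into a total score. The five items and the 0-3 rating scale are hypothetical, chosen only for illustration; they are not taken from the BDI-II or any other published instrument.

```python
def cumulative_score(item_ratings):
    """Sum the ratings for all items; a higher total implies more of the construct."""
    return sum(item_ratings)


# Hypothetical responses to a five-item inventory, each item rated 0-3.
responses = [2, 3, 1, 0, 2]
print(cumulative_score(responses))  # 8
```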
Class- Used to categorize individuals for the purpose of description or prediction. Usually used with criterion-referenced tests, licensure exams, and mastery tests. Responses are added to achieve a score, but only to determine whether a person falls in the appropriate category (example: Table 5.5, top of page 108). Also used in diagnostic systems, where a test taker's response earns credit toward placement in a particular class or category.
Ipsative- Indicates how an individual has performed on a set of variables or scales. Used in certain personality, interest, and value tests where individuals rank responses. Traditional tests assess an individual's standing on each trait by comparing him or her to norms for others on that trait. An alternative (ipsative) model might instead compare one person's standing on each dimension with his or her position on the other dimensions (not whether he/she is aggressive or curious, but how aggressive he/she is relative to his/her curiosity). The focus is on comparison between the traits. Basically, ipsative scoring compares a test taker's score on one scale to his or her score on another scale within the same test.
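One simple way to illustrate an ipsative comparison is to express each scale score as a deviation from the same person's own average across scales. This is only a sketch under that assumption; the scale names and scores are hypothetical, and real ipsative instruments define their own scoring rules.

```python
def ipsative_profile(scale_scores):
    """Express each scale score relative to the person's own mean across scales."""
    person_mean = sum(scale_scores.values()) / len(scale_scores)
    return {scale: score - person_mean for scale, score in scale_scores.items()}


# Hypothetical scale scores for one test taker.
scores = {"aggression": 12, "curiosity": 18, "sociability": 15}
print(ipsative_profile(scores))
# {'aggression': -3.0, 'curiosity': 3.0, 'sociability': 0.0}
```

Here the person is more curious than aggressive relative to his or her own profile, regardless of how either score compares with a norming group.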
Other Scoring Alternatives
Holistic scoring procedures – Scorers use a rubric to compare a writing sample against a model, for example sorting examinees' writing samples into 5 levels: excellent, very good, good, poor, very poor. This sometimes involves two or more readers scoring each paper; if discrepancies occur, a referee scorer is used.
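The two-reader procedure with a referee can be sketched as follows. The numeric coding (1 = very poor through 5 = excellent), the one-level tolerance, the averaging of agreeing readers, and the rule that the referee's rating replaces the others are all assumptions made for illustration; actual programs set their own discrepancy rules.

```python
def holistic_score(reader_a, reader_b, referee=None, max_gap=1):
    """Combine two readers' rubric ratings, deferring to a referee on discrepancies."""
    if abs(reader_a - reader_b) <= max_gap:
        # Readers agree closely enough: average their ratings.
        return (reader_a + reader_b) / 2
    if referee is None:
        raise ValueError("Readers disagree; a referee rating is needed.")
    # Assumed rule: the referee's rating settles the discrepancy.
    return referee


print(holistic_score(4, 5))             # 4.5
print(holistic_score(2, 5, referee=4))  # 4
```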
Interpreting Scores
Code of Fair Testing Practices in Education guidelines:
Criterion Referenced Interpretation
Also called domain-referenced testing. Used for state and school standards representing mastery of essential or basic skills. These tests/evaluations reference an individual's performance to a defined performance level. The absolute measure for each criterion is usually the percentage of correct answers, along with whether the examinee met the performance level required for that particular area of interest (top of page 108).
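A criterion-referenced decision can be sketched as a percent-correct calculation compared against a required level. The 20-item test and the 80% requirement below are hypothetical values used only to show the arithmetic.

```python
def percent_correct(num_correct, num_items):
    """Percentage of items answered correctly."""
    return 100 * num_correct / num_items


def met_performance_level(num_correct, num_items, required_percent):
    """True if the examinee reached the required percent-correct level."""
    return percent_correct(num_correct, num_items) >= required_percent


# Hypothetical 20-item mastery test with an 80% performance level.
print(percent_correct(16, 20))            # 80.0
print(met_performance_level(16, 20, 80))  # True
print(met_performance_level(15, 20, 80))  # False
```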
Norm Referenced Interpretation
This describes the relative position of the test taker with respect to the normative group. It is used when the examiner needs to discriminate among test takers on the domain being measured. Usually measured in percentiles or standard scores. Percentiles (TABLE 6.2, p. 98) are the most widely used measure of the relative position of the test taker on norm-referenced tests. Percentile ranks range from 1 to 99 and indicate the percentage of people in the norming group who scored at or below that particular score. A percentile rank of 50 is the median.
Standardized tests may also report a percentile band rather than a single score. The upper end of the band is the percentile rank of the score one SEM (standard error of measurement) above the raw score, and the lower end of the band is the percentile rank of the score one SEM below the raw score. SEE TABLE 5.7 pg 109
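The percentile rank and percentile band described above can be sketched as follows. The ten-person norming group and the SEM of 2 are hypothetical; published tests supply their own norm tables and SEM values, so this only illustrates the logic.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norming group scoring at or below `score`, kept within 1-99."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    rank = round(100 * at_or_below / len(norm_scores))
    return min(99, max(1, rank))


def percentile_band(raw_score, sem, norm_scores):
    """Percentile ranks of the scores one SEM below and one SEM above the raw score."""
    return (percentile_rank(raw_score - sem, norm_scores),
            percentile_rank(raw_score + sem, norm_scores))


# Hypothetical norming group of ten raw scores.
norm_group = [10, 12, 14, 15, 15, 16, 18, 19, 20, 22]
print(percentile_rank(15, norm_group))     # 50
print(percentile_band(15, 2, norm_group))  # (20, 60)
```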
Standard Scores
(transformations of raw scores/data to be more user-friendly in interpretation)
What scales of measurement can these be used with? Interval/ratio scales of measurement.
Standard scores are also a means of measuring the relative position of individuals on a test: they express how many standard deviations a given score lies above or below the mean of the norming group. STANDARD SCORES DON’T INCLUDE RAW SCORES, PERCENTILE RANKS, & STANINES & GRADE & AGE EQUIVALENT SCORES.
1. Raw Scores- Give the exact number of points obtained on a test. CANNOT BE INTERPRETED OR COMPARED ON THEIR OWN.
2. Z Scores:
z (standard score) = (examinee's raw score - mean score on the test) / (standard deviation of the test)
See the bottom of pg. 109 for an example of how to calculate z-scores, with illustrations.
Z-score Advantages:
-has a mean of 0 and standard deviation of 1.0
-gives values that are number of standard deviations above and below the mean
-standardized tables of values have been developed for z scores, specifically for normally distributed scores.
Z-score Disadvantages:
-results are often expressed in decimals
-approximately half the results will have a minus sign if we have a normal distribution.
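The z-score formula can be sketched directly in code. The mean of 50 and standard deviation of 10 below are hypothetical test statistics, not values from the textbook example.

```python
def z_score(raw_score, mean, sd):
    """Number of standard deviations the raw score lies above (+) or below (-) the mean."""
    return (raw_score - mean) / sd


# Hypothetical test with a mean of 50 and a standard deviation of 10.
print(z_score(65, 50, 10))  # 1.5
print(z_score(40, 50, 10))  # -1.0
```

The two outputs show both disadvantages listed above: a decimal value and a minus sign.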
3. T scores:
Standard scores with a fixed mean and standard deviation, in units that eliminate the need for decimals and negative signs. Advantage: whole numbers are produced. Disadvantage: you need to know the mean and standard deviation of the original raw scores to get back to the raw score. On most tests (including the SASSI-3) the fixed mean is 50 and the fixed standard deviation is 10; both are constants.
Formula: T (standardized score) = (fixed standard deviation × z-score) + fixed mean (constant)
Example to calculate T scores on the middle of page 110.
Not all tests use a fixed mean of 50 and a fixed standard deviation of 10:
-Wechsler subscales use 10 as the fixed mean and 3 as the fixed standard deviation.
-The SAT and GRE use a fixed mean of 500 and a fixed standard deviation of 100.
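The T-score formula and the other fixed-mean scales listed above are the same linear transformation of z with different constants. The sketch below assumes a z-score of 1.5 and, for the reverse conversion, a hypothetical raw-score mean of 30 and standard deviation of 4.

```python
def from_z(z, fixed_mean=50, fixed_sd=10):
    """Convert a z-score to a scale with the given fixed mean and standard deviation."""
    return fixed_sd * z + fixed_mean


def t_to_raw(t, raw_mean, raw_sd):
    """Recover a raw score from a T score; requires the raw-score mean and SD."""
    z = (t - 50) / 10
    return raw_mean + z * raw_sd


z = 1.5
print(from_z(z))            # 65.0  (T score: mean 50, SD 10)
print(from_z(z, 10, 3))     # 14.5  (Wechsler subscale: mean 10, SD 3)
print(from_z(z, 500, 100))  # 650.0 (SAT/GRE scale: mean 500, SD 100)
print(t_to_raw(65, raw_mean=30, raw_sd=4))  # 36.0
```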
4. Deviation IQ’s:
A standard score usually employed in intelligence tests. Tests that formerly reported intelligence quotients now use a fixed mean of 100 and a predetermined fixed standard deviation. Most major tests now employ a fixed standard deviation of 15. Therefore, a score of 115 on this scale indicates that the test taker is one standard deviation above the mean of the norming group.
5. Stanines:
-Originated from the phrase “standard nine”
A standard score with just nine score values/units (1-9), with a mean/average performance of 5 and a standard deviation of 2.
-All values/units are half a standard deviation wide except for values 1 and 9. The percentage of scores in each stanine, along with percentile ranks, is listed in Table 5.9 pg. 111.
Stanines, like percentile ranks, indicate a student’s relative standing in a reference group. However, since stanines do represent approximately equal units of ability, they are particularly useful for comparing a student’s scores across subtests in a stanine profile. Because of their equal-interval property (where the difference between stanines 2 and 4 represents about the same difference in ability/trait as the difference between stanines 5 and 7), stanines also make it easy to identify broad performance categories. Stanine scores of 1, 2, and 3 are usually considered to reflect below-average performance; 4, 5, and 6 are generally thought of as average; and 7, 8, and 9 are above average.
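Because each stanine from 2 through 8 is half a standard deviation wide and stanine 5 is centered on the mean, the cut points in z-score units follow directly. The sketch below assumes a score that falls exactly on a cut point is placed in the higher stanine; that boundary rule is an assumption, not something the notes specify.

```python
import bisect

# z-score boundaries between adjacent stanines (stanine 5 spans -0.25 to +0.25).
STANINE_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine(z):
    """Map a z-score to a stanine from 1 to 9."""
    return bisect.bisect_right(STANINE_CUTS, z) + 1


def stanine_category(s):
    """Broad performance category for a stanine."""
    if s <= 3:
        return "below average"
    if s <= 6:
        return "average"
    return "above average"


for z in (-2.0, 0.0, 1.0):
    s = stanine(z)
    print(z, s, stanine_category(s))
# -2.0 1 below average
# 0.0 5 average
# 1.0 7 above average
```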
6. Sten Scores:
-Originated from the phrase “standard ten”
A standard score that has 10 score values/units (1-10).
In sten scores, the mean is 5.5 and the standard deviation is 2.
Table 5.10 pg. 112 shows the percentile rank of each sten score relative to the norming group, as well as the percentage of scores in the norming group that fall in each value/unit. Notice the values follow the normal curve: stens 4-5-6-7 comprise about 68% of scores, stens 2 through 9 comprise approximately 95% of scores, and so on.
-The 16PF Questionnaire uses sten scores to present its results.
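With a mean of 5.5 and a standard deviation of 2, each sten covers half a standard deviation, so stens 4-7 span z = -1 to +1 and stens 2-9 span z = -2 to +2. The sketch below maps z-scores to stens and checks those two shares against the normal curve. Placing a score that falls exactly on a boundary in the higher sten is an assumption.

```python
import bisect
import math

# z-score boundaries between adjacent stens (the mean, z = 0, separates stens 5 and 6).
STEN_CUTS = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]


def sten(z):
    """Map a z-score to a sten from 1 to 10."""
    return bisect.bisect_right(STEN_CUTS, z) + 1


def normal_cdf(z):
    """Proportion of a normal distribution falling at or below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


print(sten(-0.2), sten(0.2))                         # 5 6
print(round(normal_cdf(1.0) - normal_cdf(-1.0), 4))  # 0.6827 (stens 4-7)
print(round(normal_cdf(2.0) - normal_cdf(-2.0), 4))  # 0.9545 (stens 2-9)
```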
7. Other Standard Score Scales:
A fixed/arbitrary mean of 15 and a fixed standard deviation of 5 are used for:
-Iowa Tests of Educational Development
-American College Test (ACT)
Normal Curve Equivalent (NCE): normalized standard scores with a mean of 50 and a standard deviation of 21.06. They range from 1 to 99. Used by many schools, educators, and psychologists working with research projects for the U.S. Office of Education, especially reading projects.
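The standard deviation of 21.06 is what makes NCEs of 1, 50, and 99 line up with percentile ranks of 1, 50, and 99. The sketch below converts a percentile rank to an NCE through the normal curve, using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name is hypothetical.

```python
from statistics import NormalDist


def nce_from_percentile(percentile_rank):
    """Convert a percentile rank (1-99) to a Normal Curve Equivalent."""
    z = NormalDist().inv_cdf(percentile_rank / 100)  # normalized z-score
    return 50 + 21.06 * z


for pr in (1, 50, 99):
    print(pr, round(nce_from_percentile(pr), 1))
# 1 1.0
# 50 50.0
# 99 99.0
```

Between those anchor points the two scales diverge: unlike percentile ranks, NCEs are equal-interval units.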
Developmental Scales
Age Norms: based on what typical individuals can do at a given age. Usually used with cognitive tests and tests of social maturity. The scores have meaning when the behaviors being measured vary systematically with age. Age scores compare an individual's performance with what most individuals do at a given age.
Problems: First, growth often does not progress evenly but varies at different periods of time. Second, there are interpretation problems: high scores indicate that individuals are ahead of their peers, but not necessarily that they are able to perform tasks characteristic of a higher age.
Grade Norms: Grade equivalency scores are used by major survey achievement batteries. They provide a way of comparing a student's performance with that of other students at a given grade. Test scores are compared to the average, or sometimes the median, performance at a given grade level. The school year is divided into 10 segments, from #.0 at the beginning of a grade to #.9 at the end of the year for that grade level.
Problems:
-even if a student scores higher than his or her grade level, this doesn't mean there were advanced-grade items on the test or that the student can be competent at that advanced grade level. It means only that his or her raw score is equivalent to the mean or median raw score made by students at that level in the advanced grade.
-scores are not equal units of measure; they are often on an ordinal scale of measurement.
-scores are often used as a standard, and students are expected to be at that grade level.
-sometimes examiners fail to recognize that these scores depend on typical or average performance in the norming group, so at a given grade-level score half of the students in the norming group scored higher and half scored lower.
-standard deviations differ for different subtests in an achievement test, so the range of grade equivalent scores will vary for each subtest, as well as between subtests of different instruments and test publishers.
COMPARISON OF THE TYPES OF TEST SCORES: TABLE 5.11 PG. 115, ADVANTAGES AND DISADVANTAGES
Norms
Statistical or tabular data that summarize the test scores of specific groups of test takers.
-these norming groups should include those with which the users of the test will ordinarily wish to compare the individuals who take the tests.
-in the test manual, authors should (1) describe who is in the norming group
-and (2) explain the different types of converted scores included in the tables.
SEE QUESTIONS 1-5 BOTTOM OF PAGE 114
QUESTION 1 IS KEY: Does the norming group include the type of person with whom the test taker should be compared? Also consider the size of the norming group, summary information about gender and cultural aspects, and the date and description of the norming method.
Conversion Tables
Present every possible raw score and its derived score for a given norm group, including norms that are individual, gender, cultural, and age/grade specific. These can be presented for different scores/subscores on a given test or on multiple tests. TABLE 5.12 page 116.
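A conversion table can be sketched as a lookup from every possible raw score to its derived scores. The 20-point test and the norm-group mean of 10 and standard deviation of 2 are hypothetical, and the table here derives only z and T scores; a published table would be built from actual norming data.

```python
def build_conversion_table(max_raw, norm_mean, norm_sd):
    """Map every possible raw score (0..max_raw) to its z score and T score."""
    table = {}
    for raw in range(max_raw + 1):
        z = (raw - norm_mean) / norm_sd
        table[raw] = {"z": round(z, 2), "T": round(10 * z + 50)}
    return table


# Hypothetical norm group: mean raw score 10, standard deviation 2.
table = build_conversion_table(max_raw=20, norm_mean=10, norm_sd=2)
print(table[12])  # {'z': 1.0, 'T': 60}
print(table[7])   # {'z': -1.5, 'T': 35}
```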
Profiles
Example: on the SASSI-3, results were plotted on a T-score profile sheet by circling the raw score value under each scale and then connecting the dots/values. Profiles aid interpretation by presenting results visually. Profiles can represent the scores on a single battery or on several tests. See Figure 5.5 pg. 117 for an individual profile from the COPS system inventory.
Remember:
Patterns in the shape of the profile are important: all high scores, all low scores, and combinations of the two carry interpretive meaning that can be referenced in a manual.
remember SEM in evaluating each score in a profile
remember scores represent a snapshot in time
remember small differences should not be over interpreted.
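One way to apply the SEM reminder when reading a profile is to treat each score as a band of plus or minus one SEM and to be cautious about differences between scores whose bands overlap. This overlap rule is a rule of thumb assumed for illustration, and the sketch also assumes both scales share the same SEM; the T scores below are hypothetical.

```python
def score_band(score, sem):
    """Band extending one SEM below and one SEM above the obtained score."""
    return (score - sem, score + sem)


def bands_overlap(score_a, score_b, sem):
    """True if the +/- 1 SEM bands of two scores overlap (difference may not be meaningful)."""
    low_a, high_a = score_band(score_a, sem)
    low_b, high_b = score_band(score_b, sem)
    return low_a <= high_b and low_b <= high_a


# Hypothetical T scores on two scales of the same profile, SEM of 3.
print(bands_overlap(55, 58, sem=3))  # True: small difference, do not over-interpret
print(bands_overlap(50, 60, sem=3))  # False: bands are clearly separated
```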
Drummond, R. J. (2000). Appraisal procedures for counselors and helping professionals (5th ed.). Upper Saddle River, NJ: Prentice-Hall.