Validity is the most important concept in test theory.
A valid test is defined as a measure that is sound in purpose and meets satisfactory criteria for test construction; put another way, validity is the soundness of the interpretation of the test. It is the closeness of agreement between what the test measures and the behavior it is intended to measure.
Remember: a test must be reliable to be valid.
There are three types of validity, so when we speak about validity we must determine validity for a specific purpose: look at the testing population and the purpose of the test, and decide which type of validity matters for that particular use of the test.
The three types of validity are:
Content-related evidence of validity
the degree to which the sample of items, tasks, or questions on a test is representative of some defined universe or domain of content.
Content validity is demonstrated when the person who develops the test shows that the items in the test adequately represent all important areas of content.
A table of specifications is how we determine content validity on a knowledge test.
Logical validity is used in tests of physical skill and ability. It is defined as the extent to which a test measures the most important components of skill necessary to perform a motor task adequately.
Criterion-related Evidence of Validity is demonstrated by comparing test scores with one or more external variables that are considered direct measures of the characteristic or behavior in question.
Both types of criterion-related validity, concurrent and predictive, are determined by statistical methods:
Concurrent validity is used when a test is proposed as a substitute for another test that is known to be valid. The known test is said to be the criterion. Use a Pearson product-moment correlation to calculate r_xy, the validity coefficient.
Identify and examine the criterion test: is it a logically valid test?
Evaluate the correlation between the criterion test and the test you are interested in using. Is the correlation high enough (.80 or greater)?
If not valid, look for another test.
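The steps above can be sketched in code. This is a minimal illustration of computing the concurrent validity coefficient r_xy with the Pearson product-moment correlation; the scores below are hypothetical, not from the text.

```python
# Concurrent validity sketch: correlate scores on a known-valid criterion
# test with scores on the proposed substitute test.

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores for eight examinees on each test.
criterion = [85, 78, 92, 70, 88, 75, 95, 80]
proposed  = [82, 75, 90, 72, 85, 74, 93, 78]

r_xy = pearson_r(criterion, proposed)
print(round(r_xy, 3))  # accept the proposed test only if r_xy >= .80
```

If r_xy falls below .80, the guideline in the notes says to look for another test.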
Predictive validity is defined as the degree to which a criterion behavior can be predicted using a score from a predictor test. It has two uses:
to predict current status
to predict future performance
Objectivity of Scoring
An objective test can be scored with very little error.
1. Intrajudge objectivity
The same test user scores the same test two or more times.
2. Interjudge objectivity
Two or more judges score the same test.
Validity for criterion referenced tests:
Definition of a criterion referenced test
a test with a predetermined standard of performance, with the standard tied to a specified domain of behavior.
Mastery learning uses criterion-referenced testing. With mastery learning a minimal standard is set: either a specific score which must be met or a range within which the student should fall. These scores are often set arbitrarily by the instructor. Later we will discuss how to set this standard more scientifically to ensure validity.
If you use the normal curve to distribute grades, as with norm-referenced testing, you give grades based on where students fall within the distribution and then assign a letter grade (A, B, C, etc.).
With mastery learning each student is considered a non-master of the material prior to instruction, and if graphed, you would expect the scores on a test of the material/skill to be positively skewed. After instruction and learning experiences, the majority of students would be expected to score well, that is, to have mastered the material.
With mastery learning each student is expected to reach a minimal level of competence of the skill or material.
Example: the cutoff score on the knowledge test for first aid certification.
The problem with a criterion-referenced test is knowing whether the test is legitimate, i.e.:
how is the standard set?
does it reflect the criterion behavior that has been specified?
In other words, we can set a criterion to be met, but how can we be sure we can interpret the results meaningfully from the standpoint of meeting the standard specified by a specific objective?
Problems with norm-referenced standards are that normative data do not always represent the most desirable standards.
A criterion-referenced test with properly set standards allows the desirable standard to stand even with skewed normative data. (Remember, normative data detect individual differences; in other words, performance is judged by how one scores compared to the rest of the class. Therefore, if a large number of students exceed or fail to meet the objective, you cannot interpret whether the standard set in the objective has been met. The book uses the example of a class whose average percent body fat exceeds the healthy level for fitness purposes: if scored by comparing students within the class, many people at unhealthy body fat levels would be considered to have met the standard, which would be inaccurate.)
Norm-referenced tests also rarely provide feedback to the examinee. Criterion-referenced tests allow feedback to students: if properly done, the student knows what is expected of him or her, knows why he or she did or did not meet the standard, and can then be given instruction to help meet it.
With norm-referenced standards only a small proportion of students can meet the standard (important implications for fitness testing), and allowing only a small proportion to succeed can be discouraging. With criterion-referenced testing, students who meet the criterion score can be said to have achieved, and may be motivated to continue participation.
Domain-referenced validity: here the test is validated by showing that the tasks sampled by the test adequately represent the criterion behavior; the domain represents the criterion behavior. This type of validity is similar to content and logical validity; the difference is that its focus is much narrower, usually a single objective.
Decision validity: the classification as master or non-master must be appropriate. The accuracy of classifications is investigated using decision validity procedures, usually with some sort of statistic.
Uninstructed/instructed approach: two groups are tested, and the score where the two distributions overlap is set as the cutoff score.
Contingency table: the best way. Set the cutoff score at a reasonable level, then tally in a table how many mastered or didn't, then calculate a proportion for each cell, then calculate a contingency coefficient C as the validity coefficient: it is the sum of the proportions in the upper-left and lower-right cells.
The coefficient is interpreted as acceptable if C = .50 or greater; usually you want at least .80.
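The contingency-table calculation can be sketched as follows. The 2x2 counts below are hypothetical; rows are the true state (master/non-master) and columns are the test's classification at the chosen cutoff.

```python
# Decision-validity coefficient C: sum of the proportions in the
# upper-left and lower-right cells of a 2x2 contingency table.

def validity_coefficient(table):
    """C = proportion of correct classifications (agreement cells)."""
    (tt, tf), (ft, ff) = table  # tt = master & classified master, etc.
    total = tt + tf + ft + ff
    return (tt + ff) / total

# Hypothetical tallies for 100 examinees.
table = [[40, 5],
         [8, 47]]

C = validity_coefficient(table)
print(round(C, 2))  # prints 0.87
```

Here C = (40 + 47) / 100 = .87, which exceeds the .80 guideline, so the cutoff would be judged acceptable.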
Criterion-referenced reliability is somewhat different: it looks at the consistency of classification.
It is measured using the proportion of agreement, P.
Set up the same contingency table as for validity, but solve for P, the proportion of agreement, the same as if solving for C.
The kappa coefficient is used to determine whether P happened by chance; use formula 7-1. The meaningful range is 0 to 1.0; anything below .50 is unacceptable.
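A minimal sketch of P and kappa, assuming a 2x2 table of master/non-master classifications from two administrations of the same test (day 1 rows, day 2 columns). The counts are hypothetical, and the standard kappa formula k = (P - Pc) / (1 - Pc) is used here; check it against the book's formula 7-1.

```python
# Proportion of agreement (P) and kappa for criterion-referenced reliability.

def agreement_stats(table):
    (mm, mn), (nm, nn) = table  # mm = master both days, nn = non-master both
    total = mm + mn + nm + nn
    p = (mm + nn) / total  # proportion of agreement
    # Chance agreement Pc from the row and column marginals.
    pc = ((mm + mn) * (mm + nm) + (nm + nn) * (mn + nn)) / total ** 2
    kappa = (p - pc) / (1 - pc)
    return p, kappa

# Hypothetical classification counts for 80 examinees tested twice.
table = [[35, 5],
         [4, 36]]

p, kappa = agreement_stats(table)
print(round(p, 2), round(kappa, 2))
```

In this made-up example P is about .89 and kappa about .78, above the .50 floor mentioned in the notes.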
When developing a knowledge test you must take great care so that you have good validity.
The first step is to develop a table of specifications
You should have a heading for behavior (in other words knowledge, comprehension, application, etc.) and one for content (e.g., doubles, rules, etc.).
You would then decide where to place the emphasis of the test by assigning a percentage to each content area. You should also decide how many questions should be asked at each level of behavior by assigning a percentage to each. Then write questions to fill each cell of the table.
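The steps above can be sketched as a small calculation. The content areas, behavior levels, and percentages below are hypothetical examples of a table of specifications, not values from the text.

```python
# Turn a table of specifications (content % x behavior %) into the number
# of questions to write for each cell of the table.

n_items = 50  # hypothetical test length

content_pct = {"rules": 0.40, "strategy": 0.35, "history": 0.25}
behavior_pct = {"knowledge": 0.50, "comprehension": 0.30, "application": 0.20}

# Items per cell = total items x content percentage x behavior percentage.
for area, c in content_pct.items():
    for level, b in behavior_pct.items():
        print(f"{area}/{level}: {round(n_items * c * b)} questions")
```

For instance, a 50-item test with 40% on rules and 50% at the knowledge level calls for about 10 knowledge-level rules questions.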
When giving a test you should clearly state the directions:
1. print them
2. read them aloud
3. inform the examinee if a correction for guessing will be used
4. call attention to anything else that will help:
let examinees know if there is a time limit
let them know periodically how much time is left
5. have examinees count the pages
6. make sure examinees know the weighting of questions
Types of test items
True-false items consist of a single statement.
Some test writers take trivial items from the text.
Some think true-false items are associated with memorization only.
They are said to be less reliable because of the 50/50 chance of answering correctly by guessing.
Suggestions for writing true-false items are on page 424:
Avoid trivial items
Avoid using sentences from textbooks or stereotyped phrases
Include an equal number of true and false items or more false than true
Reorder the test items randomly
Express only a single idea
Avoid negative statements
Avoid the use of words such as sometimes, always and never
Make false statements plausible
Make true statements clearly true.
Examples of different types of true-false items are shown on page 424.
Multiple-choice items are made up of a stem and alternatives.
Hard to write
The stem is an introductory direct question or an incomplete statement.
It should be brief and clearly stated.
Avoid stereotyped phrased
Don't use negative form when writing.
The alternatives, or foils, contain the answer and incorrect options called distracters.
Show examples of different styles on page 426.
An item can have between 3 and 5 alternatives; it doesn't matter how many as long as all are plausible.
Distracters are sometimes hard to write
When writing, make sure the alternatives are of equal length.
If you use numbers, always order them from smallest to largest.
Never use words like always, never, etc.
Randomize answers to keep from patterning the correct responses.
Matching items consist of two columns of words and/or phrases.
The left column contains the definitions and the right column contains the alternatives; the student is asked to select the alternative from the right that corresponds with each item in the left column.
Usually you want more alternatives than items.
There are many variations; see pages 428 and 429.
Short-answer (completion) items are the easiest to construct.
Examples are on page 430.
Usually more than one answer is possible.
They often measure only simple recall.
The examinee has to think of the answer the test writer has in mind.
Use them only when the answer is clear to experts and can be given in one or two words.
Essay items: the examinee develops a response of several sentences to answer a question.
There are usually a variety of acceptable responses.
One problem is the consistency of evaluation.
Rater differences may lead some to grade more severely, to emphasize writing over content, or to give middle-of-the-road grades so there is room in case somebody does well at the end, etc.
To help correct for this, have examinees put an ID number on the test or something similar.
When constructing essay items, carefully define your objectives: don't be vague.
Make sure the question is narrowed so that the student knows exactly the direction to take.
Several shorter answers may be better than one vague question that a student can bluff through.
When you grade, base your deduction of points on the following:
omission of important or necessary ideas
statements irrelevant to the topic
unsound conclusions reached through mistakes in reasoning or misapplication of principles
errors in spelling and in the mechanics of writing
STANDARD DEVIATION METHOD OF NORM REF. GRADING
USE THE STANDARD DEVIATION TO DETERMINE GRADES.
BASED ON ASSUMPTION OF A NORMAL CURVE
FIND AVE AND S.D.
USE TABLE 9.3 TO DETERMINE THE RANGE FOR A GRADE OF A USING 1.5 S.D.S ABOVE THE AVERAGE. SO, IF YOU HAVE AN S.D. OF 8, MULTIPLY IT BY 1.5 AND ADD IT TO THE AVERAGE. DO THE SAME FOR EACH GRADE.
NOW LAY OUT THE SCALE
DESIGNATE A CERTAIN PERCENTAGE FOR EACH GRADE, THEN RANK ORDER SCORES INTO GRADES ACCORDINGLY. MOST WIDELY USED IS THE 10, 20, 40, 20, 10 METHOD.
MULTIPLY EACH PERCENTAGE BY THE TOTAL NUMBER OF SCORES TO DETERMINE HOW MANY SCORES FALL IN EACH SECTION.
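Both norm-referenced methods above can be sketched in code. The mean of 70 and the multipliers for grades below A are hypothetical; the S.D. of 8 and the 1.5 S.D. cut for an A come from the text, and Table 9.3 should be consulted for the book's exact values.

```python
# Standard deviation method: grade cutoffs from the class mean and S.D.

mean, sd = 70.0, 8.0  # S.D. of 8 from the text; mean of 70 is hypothetical

# Hypothetical S.D. multipliers for the lower bound of each grade
# (1.5 S.D.s above the mean for an A, per the text).
cuts = {"A": 1.5, "B": 0.5, "C": -0.5, "D": -1.5}

for grade, k in cuts.items():
    print(f"{grade}: {mean + k * sd:.1f} and above")

# Percentage method: 10-20-40-20-10 split applied to rank-ordered scores.
n = 30  # hypothetical class size
pcts = [0.10, 0.20, 0.40, 0.20, 0.10]  # A, B, C, D, F
counts = [round(n * p) for p in pcts]
print(counts)  # prints [3, 6, 12, 6, 3]
```

With an S.D. of 8 and a mean of 70, the A cutoff is 70 + 1.5 x 8 = 82; with 30 students, the 10-20-40-20-10 split gives 3 As, 6 Bs, 12 Cs, 6 Ds, and 3 Fs.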
CRITERION-REFERENCED APPROACH: IDENTIFY CUTOFF POINTS OR PERFORMANCE STANDARDS.
USE THE PERCENTAGE CORRECT METHOD, E.G., 9/10 = 90%.
HOW DO WE SET THE STANDARD
ARBITRARY: EXPECT NO PREDETERMINED DISTRIBUTION.
FINAL GRADES: AVERAGE POINTS, AVERAGE LETTER GRADES, AVERAGE RAW TEST SCORES, OR AVERAGE T SCORES.
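Averaging T scores can be sketched as follows, using the standard T-score formula T = 50 + 10z (mean 50, S.D. 10). The raw scores and each test's class mean and S.D. below are hypothetical.

```python
# Final grade by averaging T scores: each raw score is converted to a
# common scale before averaging, so tests with different means and S.D.s
# carry comparable weight.

def t_score(raw, mean, sd):
    """Convert a raw score to a T score (mean 50, S.D. 10)."""
    return 50 + 10 * (raw - mean) / sd

# One student's raw scores on three tests, with each test's mean and S.D.
tests = [(85, 75, 10), (40, 30, 5), (60, 66, 12)]

t_scores = [t_score(r, m, s) for r, m, s in tests]
final = sum(t_scores) / len(t_scores)
print([round(t, 1) for t in t_scores], round(final, 1))
```

This avoids the distortion of averaging raw scores directly, since a raw score of 40 on a hard test can represent better performance than 85 on an easy one.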