Validity

 

Validity is the most important concept in test theory.

 

A valid test is defined as a measure that is sound in purpose and meets satisfactory criteria for test construction.

Put another way, validity is the soundness of the interpretation of the test: the closeness of agreement between what the test measures and the behavior it is intended to measure.

Remember: a test must be reliable to be valid.

 

There are three types of validity, so when we speak about validity we need to specify the validity for a particular purpose.

 

So we need to look at the testing population and purpose and determine which type of validity is important to that particular use of the test.

 The three types of validity are:

content

           logical

criterion-related

            concurrent

            predictive

construct

Content-related evidence of validity

            The degree to which the sample of items, tasks, or questions on a test is representative of some defined universe or domain of content.

                        Demonstrated when the person who develops the test shows that the items in the test adequately represent all important areas of content.

                        A table of specifications is how we determine content validity on a knowledge test.

            Logical validity is used in tests of physical skill and ability.  It is defined as the extent to which a test measures the most important components of skill necessary to perform a motor task adequately.

 Criterion-related Evidence of Validity is demonstrated by comparing test scores with one or more external variables that are considered direct measures of the characteristic or behavior in question. 

            concurrent

            predictive

Both types are determined by statistical methods.

 Concurrent validity is used when a test is proposed as a substitute for another test that is known to be valid.  The known test is said to be the criterion.  Use a Pearson product-moment correlation to calculate the rxy validity coefficient.

            Identify and examine the criterion test.  Is it a logically valid test?

            Evaluate the correlation between the criterion test and the test you are interested in using.  Is the correlation high enough? (.80 or greater)

            If not valid, look for another test.
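The steps above can be sketched numerically.  This is a minimal illustration, not from the text: the score lists are hypothetical, and the pearson function computes the ordinary product-moment correlation.

```python
# Minimal sketch of computing the rxy validity coefficient between a
# criterion test (known to be valid) and a proposed substitute test.
# All scores below are hypothetical.
from statistics import mean

criterion = [78, 85, 90, 70, 88, 95, 60, 82]  # scores on the known, valid test
new_test = [75, 88, 85, 72, 90, 93, 58, 80]   # scores on the proposed substitute

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

r_xy = pearson(criterion, new_test)
print(round(r_xy, 3))  # .80 or greater supports substituting the new test
```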

 Predictive validity is defined as the degree to which a criterion behavior can be predicted using a score from a predictor test.  It has two uses:

                        to predict current state

                        to predict future performance

Objectivity of Scoring

             A.  Definition

                        An objective test can be scored with very little error.

             B.  Types

                        1.  Intrajudge objectivity

                                    The same test user scores the same test two or more times.

                        2.  Interjudge objectivity

                                    Two or more test users independently score the same test.

Validity for criterion referenced tests:

 Definition of a criterion referenced test

a test with a predetermined standard of performance, with the standard tied to a specified domain of behavior.

 Mastery learning uses criterion-referenced testing.  With mastery learning a minimal standard is set: either a specific score which must be met or a range within which the student should fall.  These scores are often arbitrarily set by the instructor.  Later we will discuss how to set this standard more scientifically to ensure validity.

If you use the normal curve as a way of distributing grades, as with norm-referenced testing, you will give grades based on where students fall within the distribution (A, B, C, and so on).

With mastery learning each student is considered to be a non-master of the material prior to instruction.  If graphed, you would expect the scores to be positively skewed when students are tested on the material/skill.  But after instruction and learning experiences, the majority of students would be expected to score well, that is, to have mastered the material.

With mastery learning each student is expected to reach a minimal level of competence of the skill or material.

        

Example: the cutoff score on the knowledge test for first aid certification.

The problem with a criterion-referenced test is knowing whether the test is legitimate, i.e.:

                                                How is the standard set?

                                                Does it reflect the criterion behavior that has been set?

In other words, we can set a criterion to be met, but how can we be sure that we can interpret the results meaningfully from the standpoint of meeting the standard specified by a specific objective?

The advantages of criterion-referenced tests over norm-referenced tests:

Normative data do not always represent the most desirable standards.

A criterion-referenced test with properly set standards will allow the desirable standard to stand even with skewed normative data.  (Remember, normative data detect individual differences; in other words, performance is based upon how one scores compared to the rest of the class.  Therefore, if a large number exceed or fail to meet the objective, you cannot interpret whether the standard set in the objective has been met.  The book uses the example of a class with an average percent body fat that exceeds the healthy level for fitness purposes.  If you score by comparing performance of students within the class, many people who are at unhealthy body fat levels will be considered to have met the standard.  This would be inaccurate.)

Norm-referenced tests rarely provide feedback to the examinee.  Criterion-referenced tests allow feedback to students: if properly done, the student will know what is expected of him/her and why he/she did or did not meet the standard, and can then be given instruction to help meet the standard.

 

With norm-referenced testing, only a small proportion of students can meet the standard (this has important implications for fitness testing).  If only a small proportion are allowed to succeed, it can be discouraging.  With criterion-referenced testing, students who meet the criterion score can be said to have achieved, and may be motivated to continue participation.

 

                   Domain-referenced validity: here the test is validated by showing that the tasks sampled by the test adequately represent the criterion behavior.  The domain here represents the criterion behavior.  This type of validity is similar to content and logical validity; the difference is that its focus is much narrower, usually used to measure a single objective.

 

Decision validity: the classification as master or non-master must be appropriate.  The accuracy of classifications is investigated using decision validity procedures, usually with some sort of statistic.

 Methods:

 Uninstructed-instructed approach: two groups are tested, and the score where the two distributions overlap is set as the cutoff score.

          Judgmental-empirical approach:

The contingency table is the best way.  Set the cutoff score at a reasonable level, then tally in a table how many mastered or did not, and calculate the proportions.  Then calculate a contingency coefficient, C, as the validity coefficient: it is the sum of the proportions in the upper-left and lower-right cells.

The coefficient is meaningful if C = .50 or greater; usually you want at least .80.
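As a sketch of the contingency-table calculation (the counts here are made up for illustration, not taken from the text):

```python
# Hypothetical 2x2 contingency table.  Rows: criterion classification;
# columns: classification by the test being validated.
table = {
    ("master", "master"): 17,        # upper-left: agreed masters
    ("master", "nonmaster"): 3,
    ("nonmaster", "master"): 5,
    ("nonmaster", "nonmaster"): 15,  # lower-right: agreed non-masters
}
n = sum(table.values())

# C is the sum of the proportions in the upper-left and lower-right cells.
C = (table[("master", "master")] + table[("nonmaster", "nonmaster")]) / n
print(C)  # 0.8 with these counts -- above the .50 minimum, meets the .80 target
```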

          Criterion-referenced reliability is somewhat different: it looks at the consistency of classification.

          Measure it using the proportion of agreement, P.

          Set it up the same as for validity (a contingency table), but solve for P, the proportion of agreement, the same as if solving for C.

          The kappa coefficient is used to determine whether P happened by chance.  Use formula 7-1.  The meaningful range is 0 to 1.0; anything below .50 is unacceptable.
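A sketch of P and kappa on a hypothetical 2x2 table of day-1 vs. day-2 classifications.  The counts are made up, and the chance-agreement formula below is the standard kappa form; check formula 7-1 in the text for the exact version it uses.

```python
# Hypothetical day-1 (rows) vs day-2 (columns) master/non-master counts.
a, b = 17, 3   # master both days / master then non-master
c, d = 5, 15   # non-master then master / non-master both days
n = a + b + c + d

P = (a + d) / n  # proportion of agreement (same cells as C)

# Chance agreement computed from the row and column marginals.
Pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
kappa = (P - Pc) / (1 - Pc)
print(round(P, 2), round(kappa, 2))  # 0.8 0.6 with these counts
```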

        When developing a knowledge test you must take great care so that you can have good validity.

        The first step is to develop a table of specifications.

        You should have a heading for behavior (in other words, knowledge, comprehension, application, etc.) and one for content, e.g.:

      General rules

      Singles rules

      Doubles rules, etc.

     You would then decide where to place the emphasis of the test by deciding on a percentage for each content area.  You should also decide how many questions should be asked at each level of behavior by giving a percentage for each.  Then write questions to fill each column of the table.
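A sketch of how the percentages in a table of specifications translate into question counts.  The content areas, behavior levels, and percentages below are all hypothetical:

```python
# Hypothetical table of specifications: percentage of emphasis for each
# content area (rows) and each behavior level (columns).
total_questions = 50
content = {"general rules": 0.40, "singles rules": 0.40, "doubles rules": 0.20}
behavior = {"knowledge": 0.50, "comprehension": 0.30, "application": 0.20}

plan = {}
for area, c_pct in content.items():
    for level, b_pct in behavior.items():
        # number of questions to write for this cell of the table
        plan[(area, level)] = round(total_questions * c_pct * b_pct)

for (area, level), n_items in plan.items():
    print(f"{area:13s} {level:13s} {n_items}")
```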

    Test administration

 When giving a test you should clearly state the directions.

1.      Print them.

2.      Read them aloud.

3.      Inform the examinee if a correction for guessing will be used.

4.      Call attention to anything else that will help:

Let examinees know if there is a time limit.

Let them know periodically how much time is left.

5.      Have examinees count the pages.

6.      Make sure examinees know the weighting of the questions.

 Types of test items

             True false

                        Consists of a single statement.

                        Negatives

    Some people take trivial items from the text.

                                         Some think it is associated with memorization only.

                                                Often ambiguous.

Said to be less reliable because of the 50/50 chance of answering correctly by guessing.

            Suggestions for writing (page 424):

                        Avoid trivial items

Avoid using sentences from textbooks or stereotyped phrases

Avoid ambiguity

Include an equal number of true and false items or more false than true

Reorder the test items randomly

Express only a single idea

Avoid negative statements

Avoid the use of words such as sometimes, always and never

Make false statements plausible

Make true statements clearly true.

Show examples of different types of true-false items on page 424.

Multiple choice

Made up of a stem and alternatives.

Hard to write

Stem is made up of an introductory or direct question or incomplete statement

Should be brief and clearly stated

Avoid stereotyped phrases

Don’t use the negative form when writing.

 

Alternatives, or foils, contain the answer plus incorrect options called distracters.

Show examples of different styles on page 426.

Can have between 3 and 5 alternatives per item.

It doesn’t matter how many as long as all are plausible.

Distracters are sometimes hard to write

When writing, make sure the alternatives are of equal length.

If you use numbers, always order them from smallest to largest.

Never use words like always, never, etc.

Randomize answers to keep from patterning the correct responses.

 Matching

 Two columns of words and or phrases.

The left column contains the definitions and the right one contains the alternatives.  Ask the student to select an alternative from the right column that corresponds with an item in the left column.

Usually want more alternatives than items.

Many variations; see pages 428 and 429.

 Short answer

 Easiest to construct.

Examples on 430.

Negatives

Usually more than one answer is possible.

 Often measures only simple recall.

Students have to think of the answer the test writer has in mind.

 Should be used only when the answer is clear to experts and when the answer can be given in one or two words.

 Essay items.

 The student develops a thought of several sentences to answer a question.

There are usually a variety of responses that are acceptable.

 One problem is the consistency of evaluation.

Scoring is very subjective.

Rater differences may lead to some graders grading more severely,

emphasizing writing over content, or

giving middle-of-the-road grades so there is room in case somebody does well at the end, etc.

 Halo effect.

To help correct for this, you can have students put an ID number on the test or something similar.

 When constructing essay items, make sure you carefully define the objectives: don’t be vague.

Make sure the question is narrowed so that the student knows exactly the direction to take.

Several shorter answers may be better than one vague question that the student can b.s. through.

 Scoring

            When you grade, base your deduction of points on the following:

Incorrect statements

Omission of important or necessary ideas

Statements irrelevant to the topic

Unsound conclusions reached due to mistakes in reasoning or misapplication of principles

Bad writing

Numerous errors in spelling and in the mechanics of writing

 Grading

Standard deviation method of norm-referenced grading

Use the standard deviation to determine grades.

Based on the assumption of a normal curve.

Find the average and standard deviation.

Use Table 9.3 to determine the range for a grade of A, using 1.5 standard deviations above the average.  So if you have a standard deviation of 8, you would multiply it by 1.5 and add the result to the average.  Do the same for each grade.
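The standard-deviation method can be sketched as follows.  The mean (75) is hypothetical; the s.d. of 8 matches the example above, and the grade multipliers below are one common layout; the exact values come from Table 9.3 in the text.

```python
# Sketch of the standard-deviation method of norm-referenced grading.
# avg and sd are hypothetical class statistics.
avg, sd = 75.0, 8.0

# s.d. multipliers for the lower boundary of each grade (illustrative;
# see Table 9.3 for the multipliers the text actually uses).
multipliers = {"A": 1.5, "B": 0.5, "C": -0.5, "D": -1.5}

cutoffs = {grade: avg + z * sd for grade, z in multipliers.items()}
for grade, cutoff in cutoffs.items():
    print(f"{grade}: {cutoff:.1f} and above")  # e.g. A: 87.0 and above
```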

 Now lay out the scale.

Designate a certain percentage for each grade, then rank-order the scores into grades accordingly.  The most widely used is the 10, 20, 40, 20, 10 method.

 Multiply each percentage by the total number of students to determine how many scores fall in each section.
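A sketch of the 10-20-40-20-10 percentage method with a hypothetical set of 20 scores:

```python
# Rank-order the scores, then assign grades by the 10-20-40-20-10 split.
scores = [95, 91, 88, 86, 85, 83, 82, 80, 79, 78,
          77, 76, 75, 74, 73, 71, 70, 68, 65, 60]  # hypothetical class scores
splits = [("A", 0.10), ("B", 0.20), ("C", 0.40), ("D", 0.20), ("F", 0.10)]

ranked = sorted(scores, reverse=True)
n = len(ranked)

grades, start = {}, 0
for grade, pct in splits:
    count = round(n * pct)  # number of scores falling in this grade band
    grades[grade] = ranked[start:start + count]
    start += count

for grade, band in grades.items():
    print(grade, band)
```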

 Criterion-referenced approach: identify cutoff points or performance standards.

 Use the percentage-correct method, applied as we usually do: 9/10 = 90%.

 How do we set the standard?

Arbitrarily; expect no predetermined distribution.

 Final grades: options include averaging points, averaging letter grades, averaging raw test scores,

averaging T-scores,

and weighting.
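One way to combine the last two options is a weighted average of T-scores, which puts tests with different means and spreads on a common scale before weighting.  The scores and weights here are hypothetical:

```python
from statistics import mean, stdev

def t_scores(raw):
    """Convert raw scores to T-scores: mean 50, s.d. 10."""
    m, s = mean(raw), stdev(raw)
    return [50 + 10 * (x - m) / s for x in raw]

test1 = [80, 70, 90, 60, 75]  # hypothetical class scores on two tests
test2 = [40, 45, 50, 35, 42]
w1, w2 = 0.6, 0.4             # weighting: test 1 counts 60%, test 2 counts 40%

t1, t2 = t_scores(test1), t_scores(test2)
final = [w1 * a + w2 * b for a, b in zip(t1, t2)]
print([round(f, 1) for f in final])  # weighted T-score per student
```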