Validity is the most important concept in test theory.
A valid test is defined as a measure that is sound in purpose and meets satisfactory criteria for test construction; put another way, validity is the soundness of the interpretation of the test. It is the closeness of agreement between what the test measures and the behavior it is intended to measure.
Remember: a test must be reliable to be valid.
There are three types of validity, so when we speak about validity we must determine validity for a specific purpose: look at the testing population and the purpose of the test, and decide which type of validity matters for that particular use of the test.
The three types of validity are:
Content-related evidence of validity
the degree to which the sample of items, tasks, or questions on a test is representative of some defined universe or domain of content.
Content validity is demonstrated when the person who develops the test shows that the items in the test adequately represent all important areas of content.
A table of specifications is how we determine content validity on a knowledge test.
Logical validity is used in tests of physical skill and ability. It is defined as the extent to which a test measures the most important components of skill necessary to perform a motor task adequately.
Criterion-related Evidence of Validity is demonstrated by comparing test scores with one or more external variables that are considered direct measures of the characteristic or behavior in question.
Both types of criterion-related validity, concurrent and predictive, are determined by statistical methods:
Concurrent validity is used when a test is proposed as a substitute for another test that is known to be valid. The known test is said to be the criterion. Use a Pearson product-moment correlation to calculate r_xy, the validity coefficient.
Identify and examine the criterion test: is it a logically valid test?
Evaluate the correlation between the criterion test and the test you are interested in using. Is the correlation high enough (.80 or greater)?
If not valid, look for another test.
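The steps above can be sketched in code. This is a minimal illustration of computing the concurrent validity coefficient r_xy with the Pearson product-moment correlation; the scores below are hypothetical, not from the text.

```python
# Concurrent validity sketch: correlate scores on a known-valid criterion
# test with scores on the proposed substitute test.

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores for eight examinees on each test.
criterion = [85, 78, 92, 70, 88, 75, 95, 80]
proposed  = [82, 75, 90, 72, 85, 74, 93, 78]

r_xy = pearson_r(criterion, proposed)
print(round(r_xy, 3))  # accept the proposed test only if r_xy >= .80
```

If r_xy falls below .80, the guideline in the notes says to look for another test.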
Predictive validity is defined as the degree to which a criterion behavior can be predicted using a score from a predictor test. It has two uses:
to predict current status
to predict future performance
Objectivity of Scoring
An objective test can be scored with very little error.
1. Intrajudge objectivity
The same test user scores the same test two or more times.
2. Interjudge objectivity
Two or more judges score the same test.
Validity for criterion referenced tests:
Definition of a criterion referenced test
a test with a predetermined standard of performance, with the standard tied to a specified domain of behavior.
Mastery learning uses criterion-referenced testing. With mastery learning a minimal standard is set: either a specific score which must be met or a range within which the student should fall. These scores are often set arbitrarily by the instructor. Later we will discuss how to set this standard more scientifically to ensure validity.
If you use the normal curve to distribute grades, as with norm-referenced testing, you give grades based on where students fall within the distribution and then assign a letter grade (A, B, C, etc.).
With mastery learning each student is considered a non-master of the material prior to instruction, and if graphed, you would expect the scores on a test of the material/skill to be positively skewed. After instruction and learning experiences, the majority of students would be expected to score well, that is, to have mastered the material.
With mastery learning each student is expected to reach a minimal level of competence of the skill or material.
Example: the cutoff score on the knowledge test for first aid certification.
The problem with a criterion-referenced test is knowing whether the test is legitimate, i.e.:
how is the standard set?
does it reflect the criterion behavior that has been specified?
In other words, we can set a criterion to be met, but how can we be sure we can interpret the results meaningfully from the standpoint of meeting the standard specified by a specific objective?
Problems with norm-referenced standards are that normative data do not always represent the most desirable standards.
A criterion-referenced test with properly set standards allows the desirable standard to stand even with skewed normative data. (Remember, normative data detect individual differences; in other words, performance is judged by how one scores compared to the rest of the class. Therefore, if a large number of students exceed or fail to meet the objective, you cannot interpret whether the standard set in the objective has been met. The book uses the example of a class whose average percent body fat exceeds the healthy level for fitness purposes: if scored by comparing students within the class, many people at unhealthy body fat levels would be considered to have met the standard, which would be inaccurate.)
Norm-referenced tests also rarely provide feedback to the examinee. Criterion-referenced tests allow feedback to students: if properly done, the student knows what is expected of him or her, knows why he or she did or did not meet the standard, and can then be given instruction to help meet it.
With norm-referenced standards only a small proportion of students can meet the standard (important implications for fitness testing), and allowing only a small proportion to succeed can be discouraging. With criterion-referenced testing, students who meet the criterion score can be said to have achieved, and may be motivated to continue participation.
Domain-referenced validity: here the test is validated by showing that the tasks sampled by the test adequately represent the criterion behavior; the domain represents the criterion behavior. This type of validity is similar to content and logical validity; the difference is that its focus is much narrower, usually a single objective.
Decision validity: the classification as master or non-master must be appropriate. The accuracy of classifications is investigated using decision validity procedures, usually with some sort of statistic.
Uninstructed/instructed approach: two groups are tested, and the score where the two distributions overlap is set as the cutoff score.
Contingency table: the best way. Set the cutoff score at a reasonable level, then tally in a table how many mastered or didn't, then calculate a proportion for each cell, then calculate a contingency coefficient C as the validity coefficient: it is the sum of the proportions in the upper-left and lower-right cells.
The coefficient is interpreted as acceptable if C = .50 or greater; usually you want at least .80.
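The contingency-table calculation can be sketched as follows. The 2x2 counts below are hypothetical; rows are the true state (master/non-master) and columns are the test's classification at the chosen cutoff.

```python
# Decision-validity coefficient C: sum of the proportions in the
# upper-left and lower-right cells of a 2x2 contingency table.

def validity_coefficient(table):
    """C = proportion of correct classifications (agreement cells)."""
    (tt, tf), (ft, ff) = table  # tt = master & classified master, etc.
    total = tt + tf + ft + ff
    return (tt + ff) / total

# Hypothetical tallies for 100 examinees.
table = [[40, 5],
         [8, 47]]

C = validity_coefficient(table)
print(round(C, 2))  # prints 0.87
```

Here C = (40 + 47) / 100 = .87, which exceeds the .80 guideline, so the cutoff would be judged acceptable.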
Criterion-referenced reliability is somewhat different: it looks at the consistency of classification.
It is measured using the proportion of agreement, P.
Set up the same contingency table as for validity, but solve for P, the proportion of agreement, the same as if solving for C.
The kappa coefficient is used to determine whether P happened by chance; use formula 7-1. The meaningful range is 0 to 1.0; anything below .50 is unacceptable.
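A minimal sketch of P and kappa, assuming a 2x2 table of master/non-master classifications from two administrations of the same test (day 1 rows, day 2 columns). The counts are hypothetical, and the standard kappa formula k = (P - Pc) / (1 - Pc) is used here; check it against the book's formula 7-1.

```python
# Proportion of agreement (P) and kappa for criterion-referenced reliability.

def agreement_stats(table):
    (mm, mn), (nm, nn) = table  # mm = master both days, nn = non-master both
    total = mm + mn + nm + nn
    p = (mm + nn) / total  # proportion of agreement
    # Chance agreement Pc from the row and column marginals.
    pc = ((mm + mn) * (mm + nm) + (nm + nn) * (mn + nn)) / total ** 2
    kappa = (p - pc) / (1 - pc)
    return p, kappa

# Hypothetical classification counts for 80 examinees tested twice.
table = [[35, 5],
         [4, 36]]

p, kappa = agreement_stats(table)
print(round(p, 2), round(kappa, 2))
```

In this made-up example P is about .89 and kappa about .78, above the .50 floor mentioned in the notes.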
When developing a knowledge test you must take great care so that you have good validity.
The first step is to develop a table of specifications
You should have a heading for behavior (in other words knowledge, comprehension, application, etc.) and one for content (e.g., doubles, rules, etc.).
You would then decide where to place the emphasis of the test by assigning a percentage to each content area. You should also decide how many questions should be asked at each level of behavior by assigning a percentage to each. Then write questions to fill each cell of the table.
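The steps above can be sketched as a small calculation. The content areas, behavior levels, and percentages below are hypothetical examples of a table of specifications, not values from the text.

```python
# Turn a table of specifications (content % x behavior %) into the number
# of questions to write for each cell of the table.

n_items = 50  # hypothetical test length

content_pct = {"rules": 0.40, "strategy": 0.35, "history": 0.25}
behavior_pct = {"knowledge": 0.50, "comprehension": 0.30, "application": 0.20}

# Items per cell = total items x content percentage x behavior percentage.
for area, c in content_pct.items():
    for level, b in behavior_pct.items():
        print(f"{area}/{level}: {round(n_items * c * b)} questions")
```

For instance, a 50-item test with 40% on rules and 50% at the knowledge level calls for about 10 knowledge-level rules questions.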
When giving a test you should clearly state the directions:
1. print them
2. read them aloud
3. inform the examinee if a correction for guessing will be used
4. call attention to anything else that will help:
let examinees know if there is a time limit
let them know periodically how much time is left
5. have examinees count the pages
6. make sure examinees know the weighting of questions
Types of test items
True-false items consist of a single statement.
Some test writers take trivial items from the text.
Some think true-false items are associated with memorization only.
They are said to be less reliable because of the 50/50 chance of answering correctly by guessing.
Suggestions for writing true-false items are on page 424:
Avoid trivial items
Avoid using sentences from textbooks or stereotyped phrases
Include an equal number of true and false items or more false than true
Reorder the test items randomly
Express only a single idea
Avoid negative statements
Avoid the use of words such as sometimes, always and never
Make false statements plausible
Make true statements clearly true.
Examples of different types of true-false items are shown on page 424.
Multiple-choice items are made up of a stem and alternatives.
Hard to write
The stem is an introductory direct question or an incomplete statement.
It should be brief and clearly stated.
Avoid stereotyped phrased
Don't use negative form when writing.
The alternatives, or foils, contain the answer and incorrect options called distracters.
Show examples of different styles on page 426.
An item can have between 3 and 5 alternatives; it doesn't matter how many as long as all are plausible.
Distracters are sometimes hard to write
When writing, make sure the alternatives are of equal length.
If you use numbers, always order them from smallest to largest.
Never use words like always, never, etc.
Randomize answers to keep from patterning the correct responses.
Matching items consist of two columns of words and/or phrases.
The left column contains the definitions and the right column contains the alternatives; the student is asked to select the alternative from the right that corresponds with each item in the left column.
Usually you want more alternatives than items.
There are many variations; see pages 428 and 429.
Short-answer (completion) items are the easiest to construct.
Examples are on page 430.
Usually more than one answer is possible.
They often measure only simple recall.
The examinee has to think of the answer the test writer has in mind.
Use them only when the answer is clear to experts and can be given in one or two words.
Essay items: the examinee develops a response of several sentences to answer a question.
There are usually a variety of acceptable responses.
One problem is the consistency of evaluation.
Rater differences may lead some to grade more severely, to emphasize writing over content, or to give middle-of-the-road grades so there is room in case somebody does well at the end, etc.
To help correct for this, have examinees put an ID number on the test or something similar.
When constructing essay items, carefully define your objectives: don't be vague.
Make sure the question is narrowed so that the student knows exactly the direction to take.
Several shorter answers may be better than one vague question that a student can bluff through.
When you grade, base your deduction of points on the following:
omission of important or necessary ideas
statements irrelevant to the topic
unsound conclusions reached through mistakes in reasoning or misapplication of principles
errors in spelling and in the mechanics of writing
STANDARD DEVIATION METHOD OF NORM REF. GRADING
USE THE STANDARD DEVIATION TO DETERMINE GRADES.
BASED ON ASSUMPTION OF A NORMAL CURVE
FIND AVE AND S.D.
USE TABLE 9.3 TO DETERMINE THE RANGE FOR A GRADE OF A USING 1.5 S.D.S ABOVE THE AVERAGE. SO, IF YOU HAVE AN S.D. OF 8, MULTIPLY IT BY 1.5 AND ADD IT TO THE AVERAGE. DO THE SAME FOR EACH GRADE.
NOW LAY OUT THE SCALE
DESIGNATE A CERTAIN PERCENTAGE FOR EACH GRADE, THEN RANK ORDER SCORES INTO GRADES ACCORDINGLY. MOST WIDELY USED IS THE 10, 20, 40, 20, 10 METHOD.
MULTIPLY EACH PERCENTAGE BY THE TOTAL NUMBER OF SCORES TO DETERMINE HOW MANY SCORES FALL IN EACH SECTION.
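Both norm-referenced methods above can be sketched in code. The mean of 70 and the multipliers for grades below A are hypothetical; the S.D. of 8 and the 1.5 S.D. cut for an A come from the text, and Table 9.3 should be consulted for the book's exact values.

```python
# Standard deviation method: grade cutoffs from the class mean and S.D.

mean, sd = 70.0, 8.0  # S.D. of 8 from the text; mean of 70 is hypothetical

# Hypothetical S.D. multipliers for the lower bound of each grade
# (1.5 S.D.s above the mean for an A, per the text).
cuts = {"A": 1.5, "B": 0.5, "C": -0.5, "D": -1.5}

for grade, k in cuts.items():
    print(f"{grade}: {mean + k * sd:.1f} and above")

# Percentage method: 10-20-40-20-10 split applied to rank-ordered scores.
n = 30  # hypothetical class size
pcts = [0.10, 0.20, 0.40, 0.20, 0.10]  # A, B, C, D, F
counts = [round(n * p) for p in pcts]
print(counts)  # prints [3, 6, 12, 6, 3]
```

With an S.D. of 8 and a mean of 70, the A cutoff is 70 + 1.5 x 8 = 82; with 30 students, the 10-20-40-20-10 split gives 3 As, 6 Bs, 12 Cs, 6 Ds, and 3 Fs.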
CRITERION-REFERENCED APPROACH: IDENTIFY CUTOFF POINTS OR PERFORMANCE STANDARDS.
USE THE PERCENTAGE CORRECT METHOD, E.G., 9/10 = 90%.
HOW DO WE SET THE STANDARD
ARBITRARY: EXPECT NO PREDETERMINED DISTRIBUTION.
FINAL GRADES: AVERAGE POINTS, AVERAGE LETTER GRADES, AVERAGE RAW TEST SCORES, OR AVERAGE T SCORES.
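Averaging T scores can be sketched as follows, using the standard T-score formula T = 50 + 10z (mean 50, S.D. 10). The raw scores and each test's class mean and S.D. below are hypothetical.

```python
# Final grade by averaging T scores: each raw score is converted to a
# common scale before averaging, so tests with different means and S.D.s
# carry comparable weight.

def t_score(raw, mean, sd):
    """Convert a raw score to a T score (mean 50, S.D. 10)."""
    return 50 + 10 * (raw - mean) / sd

# One student's raw scores on three tests, with each test's mean and S.D.
tests = [(85, 75, 10), (40, 30, 5), (60, 66, 12)]

t_scores = [t_score(r, m, s) for r, m, s in tests]
final = sum(t_scores) / len(t_scores)
print([round(t, 1) for t in t_scores], round(final, 1))
```

This avoids the distortion of averaging raw scores directly, since a raw score of 40 on a hard test can represent better performance than 85 on an easy one.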