Thursday 13 September 2018

Norm-referenced and Criterion-referenced Testing

There are two types of standardized tests: norm-referenced and criterion-referenced. Norm-referenced testing measures performance relative to all other students taking the same test. It lets you know how well a student did compared to the rest of the testing population. For example, if a student is ranked in the 86th percentile, that means he/she did better than 86 percent of others who took the test. This is the most common type of standardized testing. Criterion-referenced testing measures factual knowledge of a defined body of material. Multiple-choice tests that people take to get a license, or a test on fractions, are both examples of this type of testing.

In addition to these two main categories, standardized tests can be divided further into performance tests and aptitude tests. Performance tests assess learning that has already occurred in a particular subject area, while aptitude tests assess abilities or skills considered important to future success in school. Intelligence tests are also standardized tests; they aim to determine how well a person can handle problem solving using higher-level cognitive thinking. Commonly called an IQ test, a typical intelligence test asks problems involving pattern recognition and logical reasoning. It then takes into account the time needed and how many questions the person completes correctly, with penalties for guessing. Specific tests and how the results are used change from district to district, but intelligence testing is common during the early years of schooling.
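To make the percentile idea above concrete, here is a minimal Python sketch (the norm-group scores are hypothetical, and different test publishers define percentile ranks in slightly different ways; this version simply counts the scores that fall strictly below the student's score):

    def percentile_rank(score, norm_scores):
        # Percentage of the norm group scoring strictly below the given score.
        below = sum(1 for s in norm_scores if s < score)
        return 100.0 * below / len(norm_scores)

    # Hypothetical norm group: 100 students with raw scores 1..100.
    norm_group = list(range(1, 101))
    print(percentile_rank(87, norm_group))  # 86.0 -> better than 86 percent of test takers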

(b) Advantages
• It can be obtained easily and is available at the researcher’s convenience.
• It can be adopted and implemented quickly.
• It reduces or eliminates faculty time demands in instrument development and grading.
• It allows objective scoring.
• It can provide external validity evidence for the test.
• It helps to provide reference group measures.
• It allows longitudinal comparisons.
• It can test large numbers of students.
(c) Disadvantages
• It measures relatively superficial knowledge or learning.
• Norm-referenced data may be less useful than criterion-referenced.
• It may be cost prohibitive to administer as a pre- and post-test.
• It is more summative than formative (may be difficult to isolate what changes are needed).
• It may be difficult to receive results in a timely manner.

Types, Advantages and Disadvantages of Rating Scales

Numerical Rating Scales:
A sequence of numbers is assigned to descriptive categories; the rater marks a number to indicate the degree to which a characteristic is present.
Graphic Rating Scales:
A set of categories is described at certain points along the line of a continuum; the rater can mark his or her judgment at any location on the line.

Advantages of Rating Scales:
• Used for behaviours not easily measured by other means
• Quick and easy to complete
• The user can apply knowledge about the child from other times
• Minimum of training required
• Easy to design using consistent descriptors (e.g., always, sometimes, rarely, or never)
• Can describe the child’s steps toward understanding or mastery
Disadvantages of Rating Scales:
• Highly subjective (rater error and bias are a common problem).
• Raters may rate a child on the basis of their previous interactions or on an emotional, rather than an objective, basis.
• Ambiguous terms make them unreliable: raters are likely to mark characteristics by using different interpretations of the ratings (e.g., do they all agree on what “sometimes” means?).

What is a Rating Scale?


A rating scale is a tool used for assessing the performance of tasks, skill levels, procedures, processes, qualities, quantities, or end products, such as reports, drawings, and computer programs. These are judged at a defined level within a stated range. Rating scales are similar to checklists except that they indicate the degree of accomplishment rather than just yes or no. Hence a rating scale is used to determine the degree to which the child exhibits a behaviour or the quality of that behaviour; each trait is rated on a continuum, and the observer decides where the child fits on the scale. Overall, rating scales:
• Make a qualitative judgment about the extent to which a behaviour is present
• Consist of a set of characteristics or qualities to be judged by using a systematic procedure
• Are most frequently of the numerical or graphic type

What are the Advantages and Disadvantages of an Interview?

Advantages of Interview
• A very good technique for getting information about complex, emotionally laden subjects.
• Can be easily adapted to the ability of the person being interviewed.
• Yields a good percentage of returns.
• Yields a representative sample of the general population.
• Data collected by this method are likely to be more accurate than data collected by other methods, because issues can be investigated in depth.
• Discovers how individuals think and feel about a topic and why they hold certain opinions.
• Investigates the use, effectiveness and usefulness of particular library collections and services.
• Informs decision making, strategic planning and resource allocation.
• Explores sensitive topics which people may feel uncomfortable discussing in a focus group.
• Adds a human dimension to impersonal data.
• Deepens understanding and explains statistical data.

Disadvantages of Interview

• Time-consuming process.
• Involves high cost.
• Requires a highly skilled interviewer.
• Requires more energy.
• May sometimes involve systematic errors.
• Can be a confusing and complicated method.
• Different interviewers may understand and transcribe interviews in different ways.

What are the Types of Interview?

1. Structured Interview
Here, every single detail of the interview is decided in advance. The questions to be asked, the order in which they will be asked, the time given to each candidate, the information to be collected from each candidate, etc. are all decided in advance. A structured interview is also called a Standardized, Patterned, Directed or Guided interview. Structured interviews are preplanned, accurate and precise. All the interviews will be uniform (the same). Therefore, there will be consistency and minimum bias in structured interviews.


2. Unstructured Interview
This interview is not planned in detail; hence it is also called a Non-Directed interview. The questions to be asked, the information to be collected from the candidates, etc. are not decided in advance. These interviews are unplanned and therefore more flexible. Candidates are more relaxed in such interviews. They are encouraged to express themselves about different subjects, based on their expectations, motivations, background, interests, etc. Here the interviewer can make a better judgment of the candidate's personality, potential, strengths and weaknesses. However, if the interviewer is not efficient, the discussion will lose direction and the interview will be a waste of time and effort.



3. Group Interview
Here, all the candidates or small groups of candidates are interviewed together. The time of the interviewer is saved. A group interview is similar to a group discussion. A topic is given to the group, and they are asked to discuss it. The interviewer carefully watches the candidates. He tries to find out which candidate influences others, who clarifies issues, who summarizes the discussion, who speaks effectively, etc. He tries to judge the behaviour of each candidate in a group situation.



4. Exit Interview

When an employee leaves the company, he is interviewed either by his immediate superior or by the Human Resource Development (HRD) manager. This interview is called an exit interview. An exit interview is conducted to find out why the employee is leaving the company. Sometimes, the employee may be asked to withdraw his resignation by providing some incentives. Exit interviews are also conducted to create a good image of the company in the minds of the employees who are leaving. They help the company to make proper HRD policies, to create a favourable work environment, to create employee loyalty and to reduce labour turnover.



5. Depth Interview
This is a semi-structured interview. The candidate has to give detailed information about his background, special interests, etc. He also has to give detailed information about his subject. A depth interview tries to find out whether the candidate is an expert in his subject or not. Here, the interviewer must have a good understanding of human behaviour.

6. Stress Interview
The purpose of this interview is to find out how the candidate behaves in a stressful situation: whether the candidate gets angry, confused, frightened or nervous, or remains cool. The candidate who keeps his cool in a stressful situation is selected for the stressful job. Here, the interviewer tries to create a stressful situation during the interview. This is done purposely by asking the candidate rapid questions, criticizing his answers, interrupting him repeatedly, etc. The behaviour of the interviewee is then observed, and future educational planning is based on his/her stress levels and handling of stress.


7. Individual Interview
This is a 'One-To-One' Interview. It is a verbal and visual interaction between two people, the interviewer and the candidate, for a particular purpose. The purpose of this interview is to match the candidate with the job. It is a two way communication.


8. Informal Interview
An informal interview is an oral interview which can be arranged at any place. Different questions are asked to collect the required information from the candidate. No specific rigid procedure is followed. It is a friendly interview.

9. Formal Interview
A formal interview is held in a more formal atmosphere. The interviewer asks pre-planned questions. A formal interview is also called a planned interview.

10. Panel Interview
Panel means a selection committee or interview committee that is appointed for interviewing the candidates. The panel may include three to five members. They ask the candidates questions about different aspects and give marks to each candidate. The final decision is taken by all members collectively by rating the candidates. A panel interview is considered better than an interview by a single interviewer because collective judgment is used for selecting suitable candidates.
11. Behavioural Interview
In a behavioural interview, the interviewer will ask you questions based on common situations of the job you are applying for. The logic behind the behavioural interview is that your future performance will be based on your past performance in a similar situation. You should expect questions that ask what you did when you were in a certain situation and how you dealt with it. In a behavioural interview, the interviewer wants to see how you deal with certain problems and what you do to solve them.
12. Phone Interview
A phone interview may be used for a position where the candidate is not local, or as an initial prescreening call to decide whether to invite you in for an in-person interview. You may be asked typical questions or behavioural questions. Most of the time you will schedule an appointment for a phone interview. If the interviewer calls unexpectedly, it is OK to ask them politely to schedule an appointment. For a phone interview, make sure your call waiting is turned off, you are in a quiet room, and you are not eating, drinking or chewing gum.

Tuesday 11 September 2018

Relationship between Validity and Reliability of a Test


Reliability and validity are two different standards used to gauge the usefulness of a test. Though different, they work together. It would not be beneficial to design a test with good reliability that did not measure what it was intended to measure. The inverse, accurately measuring what we desire to measure with a test so flawed that its results are not reproducible, is impossible. Reliability is a necessary requirement for validity: you have to have good reliability in order to have validity. Reliability actually puts a cap or limit on validity, and if a test is not reliable, it cannot be valid. Establishing good reliability is only the first part of establishing validity; validity has to be established separately. Having good reliability does not mean we have good validity, it just means we are measuring something consistently. We must still establish what it is that we are measuring consistently. The main point here is that reliability is necessary but not sufficient for validity. In short, we can say that reliability alone means nothing if the test is not valid.

Factors Affecting the Validity of a Test

Validity evidence is an important aspect to consider when thinking about classroom testing and measurement. Many factors tend to make test results invalid for their intended use. A little careful effort by the test developer helps to control these factors, but some of them need a systematic approach. No teacher would think of measuring knowledge of social studies with an English test. Nor would a teacher consider measuring problem-solving skills in third-grade arithmetic with a test designed for sixth graders. In both instances, the test results would obviously be invalid. The factors influencing validity are of this same general nature but much more subtle in character. For example, a teacher may overload a social studies test with items concerning historical facts, and thus the scores become less valid as a measure of achievement in social studies. Or a third-grade teacher may select appropriate arithmetic problems for a test but use vocabulary in the problems and directions that only the better readers are able to understand. The arithmetic test then becomes, in part, a reading test, which invalidates the results for their intended use. These examples show some of the more subtle factors influencing validity, for which the teacher should be alert, whether constructing classroom tests or selecting published tests. Some other factors that may affect test validity are discussed below.

1. Instructions for Taking the Test:
The instructions accompanying the test should be clear, understandable and in simple language. Unclear instructions on how to respond to the items, whether it is permissible to guess, and how to record the answers will tend to reduce validity.
2. Difficult Language Structure:
Language of the test, or of the instructions accompanying it, that is too complicated for the pupils taking the test will result in the test measuring reading comprehension and aspects of intelligence, which will distort the meaning of the test results. Therefore the language should be simple, keeping in view the grade for which the test is meant.
3. Inappropriate Level of Difficulty:
In norm-referenced tests, items that are too easy or too difficult will not provide reliable discriminations among pupils and will therefore lower validity. In criterion-referenced tests, the failure to match the difficulty specified by the learning outcome will lower validity.
4. Poorly Constructed Test Items:
Test items that unintentionally provide clues to the answer, or that reward mere alertness in detecting such clues, are poor items; they may harm the validity of the test.
5. Ambiguity in Items Statements:
Ambiguous statements in test items contribute to misinterpretations and confusion. Ambiguity sometimes confuses the better pupils more than it does the poor pupils, causing the items to discriminate in a negative direction.
6. Length of the Test:
A test is only a sample of the many questions that might be asked. If a test is too short to provide a representative sample of the performance we are interested in, its validity will suffer accordingly. Similarly, an overly lengthy test is also a threat to the validity evidence of the test.
7. Improper Arrangement of Items:
Test items are typically arranged in order of difficulty, with the easiest items first. Placing difficult items early in the test may cause pupils to spend too much time on these and prevent them from reaching items they could easily answer. Improper arrangement may also influence validity by having a detrimental effect on pupil motivation. The influence is likely to be strongest with young pupils.
8. Identifiable Pattern of Answers:
Placing correct answers in some systematic pattern will enable pupils to guess the answers to some items more easily, and this will lower validity.
In short, any defect in the test's construction that prevents the test items from functioning as intended will invalidate the interpretations to be drawn from the results. There may be many other factors that can also affect the validity of the test to some extent. Some of these factors are listed below.
• Inadequate sample
• Inappropriate selection of constructs or measures
• Items that do not function as intended
• Improper administration: inadequate time allowed, poorly controlled conditions
• Scoring that is subjective
• Insufficient data collected to make valid conclusions
• Too great a variation in data (can't see the wood for the trees)
• Inadequate selection of target subjects
• Complex interaction across constructs
• Subjects giving biased answers or trying to guess what they should say

VALIDITY OF THE ASSESSMENT TOOLS

Nature of Validity
The validity of an assessment tool is the degree to which it measures what it is designed to measure. For example, if a test is designed to measure the skill of adding three-digit numbers in mathematics, but the problems are presented in language that is too difficult for the ability level of the students, then it may not measure the three-digit addition skill and consequently will not be a valid test. Many experts in measurement have defined this term; some of the definitions are given below.
According to the Business Dictionary, “Validity is the degree to which an instrument, selection process, statistical technique, or test measures what it is supposed to measure.”
Cook and Campbell (1979) define validity as the appropriateness or correctness of inferences, decisions, or descriptions made about individuals, groups, or institutions from test results.
According to the APA (American Psychological Association) standards document, validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. Validity, however, is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the scores. It is the inferences regarding specific uses of a test that are validated, not the test itself.
Howell’s (1992) view of test validity is that a valid test must measure specifically what it is intended to measure.
According to Messick, validity is a matter of degree, not absolutely valid or absolutely invalid. He advocates that, over time, validity evidence will continue to gather, either enhancing or contradicting previous findings. Overall, we can say that in terms of assessment, validity refers to the extent to which a test's content is representative of the actual skills learned and whether the test allows accurate conclusions concerning achievement. Therefore validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted. Let's consider the following examples.
Examples:
1. Say you are assigned to observe the effect of a strict attendance policy on class participation. After observing for two or three weeks, you report that class participation did increase after the policy was established.
2. Say you intend to measure intelligence: if math and vocabulary truly represent intelligence, then a math and vocabulary test might be said to have high validity when used as a measure of intelligence.

A test has validity evidence if we can demonstrate that it measures what it claims to measure. For instance, if it is supposed to be a test of fifth-grade arithmetic ability, it should measure fifth-grade arithmetic ability and not reading ability.



Test Validity and Test Validation
Tests can take the form of written responses to a series of questions, such as paper-and-pencil tests, or of judgments by experts about behaviour in the classroom/school or in a work performance appraisal. The form of test results also varies, from pass/fail, to holistic judgments, to a complex series of numbers meant to convey minute differences in behaviour.
Regardless of the form a test takes, its most important aspect is how the results are used and the way those results impact individual persons and society as a whole. Tests used for admission to schools or programs or for educational diagnosis not only affect individuals, but also assign value to the content being tested. A test that is perfectly appropriate and useful in one situation may be inappropriate or insufficient in another. For example, a test that may be sufficient for use in educational diagnosis may be completely insufficient for use in determining graduation from high school.
Test validity, or the validation of a test, explicitly means validating the use of a test in a specific context, such as college admission or placement into a course. Therefore, when determining the validity of a test, it is important to study the test results in the setting in which they are used. In the previous example, in order to use the same test for educational diagnosis as for high school graduation, each use would need to be validated separately, even though the same test is used for both purposes.

Purpose of Measuring Validity
Most, but not all, tests are designed to measure skills, abilities, or traits that are not directly observable. For example, scores on the Scholastic Aptitude Test (SAT) measure developed critical reading, writing and mathematical ability. The score an examinee obtains on the SAT is not a direct measure of critical reading ability in the way that degrees centigrade are a direct measure of the heat of an object. The amount of an examinee's developed critical reading ability must be inferred from the examinee's SAT critical reading score.
The process of using a test score as a sample of behaviour in order to draw conclusions about a larger domain of behaviours is characteristic of most educational and psychological tests. Responsible test developers and publishers must be able to demonstrate that it is possible to use the sample of behaviours measured by a test to make valid inferences about an examinee's ability to perform tasks that represent the larger domain of interest.

Validity versus Reliability
A test can be reliable but may not be valid. If test scores are to be used to make accurate inferences about an examinee's ability, they must be both reliable and valid. Reliability is a prerequisite for validity and refers to the ability of a test to measure a particular trait or skill consistently; in simple words, the same test administered to the same students should yield the same scores. However, tests can be highly reliable and still not be valid for a particular purpose. Consider the example of a thermometer that, because of a systematic error, reads five degrees too high. When repeated readings are taken under the same conditions, the thermometer will yield consistent (reliable) measurements, but the inference about the temperature is faulty.
This analogy makes it clear that determining the reliability of a test is an important first step, but not the defining step, in determining the validity of a test.


Methods of Measuring Validity
Validity is the appropriateness of a particular use of the test scores; test validation is then the process of collecting evidence to justify the intended use of the scores. There are many validation methods for collecting such evidence about the usefulness of assessment tools. Some of them are described below.

Content Validity
Evidence of content validity comes from a judgmental process, which may be formal or informal. The formal process follows a systematic procedure to arrive at a judgment; its important components are the identification of behavioural objectives and the construction of a table of specifications. Content validity evidence involves the degree to which the content of the test matches a content domain associated with the construct. For example, a test of the ability to add two numbers should include a range of combinations of digits. A test with only one-digit numbers, or only even numbers, would not have good coverage of the content domain. Content-related evidence typically involves Subject Matter Experts (SMEs) evaluating test items against the test specifications.
It is a non-statistical type of validity that involves “the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured” (Anastasi & Urbina, 1997). For example, does an IQ questionnaire have items covering all areas of intelligence discussed in the scientific literature?
A test has content validity built into it by careful selection of which items to include (Anastasi & Urbina, 1997). Items are chosen so that they comply with the test specification which is drawn up through a thorough examination of the subject domain. Foxcraft et al. (2004, p. 49) note that by using a panel of experts to review the test specifications and the selection of items the content validity of a test can be improved. The experts will be able to review the items and comment on whether the items cover a representative sample of the behaviour domain.
For example, in developing a teaching competency test, experts in the field of teacher training would identify the information and issues required to be an effective teacher and would then choose (or rate) items that represent the areas of information and skills a teacher is expected to exhibit in the classroom.
Lawshe (1975) proposed that, to judge content validity, each rater should respond to the following question for each item:
Is the skill or knowledge measured by this item:
• Essential
• Useful but not essential
• Not necessary
With respect to educational achievement tests, a test is considered content valid when the proportion of the material covered in the test approximates the proportion of material covered in the course.
There are different types of content validity; the major types, face validity and curricular validity, are described below.
1. Face Validity
Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee that the test actually measures phenomena in that domain. Face validity is very closely related to content validity. While content validity depends on a theoretical basis for judging whether a test assesses all domains of a certain criterion (e.g. does assessing addition skills yield a good measure of mathematical skills? To answer this you have to know what different kinds of arithmetic skills mathematical skills include), face validity relates only to whether a test appears to be a good measure. This judgment is made on the "face" of the test, so it can also be made by an amateur.
Face validity is a starting point, but should NEVER be assumed to be provably valid for any given purpose, as the "experts" may be wrong.
For example, suppose you were taking an instrument that reportedly measures your attractiveness, but the questions asked you to identify the correctly spelled word in each list. There is not much of a link between the claim of what it is supposed to do and what it actually does.
Possible Advantage of Face Validity...
• If the respondent knows what information we are looking for, they can use that “context” to help interpret the questions and provide more useful, accurate answers.
Possible Disadvantage of Face Validity...
• If the respondent knows what information we are looking for, they might try to “bend & shape” their answers to what they think we want.
2. Curricular Validity
The extent to which the content of the test matches the objectives of a specific curriculum as it is formally described. Curricular validity takes on particular importance in situations where tests are used for high-stakes decisions, such as Punjab Examination Commission exams for fifth and eighth grade students and Boards of Intermediate and Secondary Education examinations. In these situations, curricular validity means that the content of a test used to decide whether a student should be promoted to the next level should measure the curriculum that the student is taught in school.
Curricular validity is evaluated by groups of curriculum/content experts. The experts are asked to judge whether the content of the test is parallel to the curriculum objectives and whether the test and curricular emphases are in proper balance. A table of specifications may help to improve the validity of the test.

Construct Validity
Before defining construct validity, it seems necessary to elaborate the concept of a construct. A construct is the concept or characteristic that a test is designed to measure; it provides the target that a particular assessment or set of assessments is designed to measure, and it is a separate entity from the test itself. According to Howell (1992), construct validity is a test's ability to measure factors which are relevant to the field of study. Construct validity is thus an assessment of the quality of an instrument or experimental design; it asks, 'Does it measure the construct it is supposed to measure?' Construct validity is rarely applied to achievement tests.

Construct validity refers to the extent to which operationalizations of a construct (e.g. practical tests developed from a theory) do actually measure what the theory says they do. For example, to what extent is an IQ questionnaire actually measuring "intelligence"? Construct validity evidence involves the empirical and theoretical support for the interpretation of the construct. Such lines of evidence include statistical analyses of the internal structure of the test including the relationships between responses to different test items. They also include relationships between the test and measures of other constructs. As currently understood, construct validity is not distinct from the support for the substantive theory of the construct that the test is designed to measure. As such, experiments designed to reveal aspects of the causal role of the construct also contribute to construct validity evidence.
Construct validity occurs when the theoretical constructs of cause and effect accurately represent the real-world situations they are intended to model. This is related to how well the experiment is operationalized: a good experiment turns the theory (constructs) into actual things you can measure. Sometimes just finding out more about the construct (which itself must be valid) can be helpful. Construct validity addresses the constructs that are mapped into the test items; it is also assured either by a judgmental method or by developing the test specification before the development of the test. Constructs have some essential properties, two of which are listed below:
1. They are abstract summaries of some regularity in nature.
2. They are related to concrete, observable entities.
For Example - Integrity is a construct; it cannot be directly observed, yet it is useful for understanding, describing, and predicting human behaviour.

1. Convergent Validity
Convergent validity refers to the degree to which a measure is correlated with other measures that it is theoretically predicted to correlate with. OR
Convergent validity occurs where measures of constructs that are expected to correlate do so. This is similar to concurrent validity (which looks for correlation with other tests).
For example, if scores on a specific mathematics test are similar to students' scores on other mathematics tests, then convergent validity is high (there is a positive correlation between the scores from similar tests of mathematics).


2. Discriminant Validity
Discriminant validity describes the degree to which an operationalization does not correlate with other operationalizations that it theoretically should not be correlated with. OR
Discriminant validity occurs where measures of constructs that are expected not to relate to each other are indeed found not to correlate, so that it is possible to discriminate between these constructs. For example, if discriminant validity is high, scores on a test designed to assess students' skills in mathematics should not be positively correlated with scores from tests designed to assess intelligence.
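As a rough illustration of both expectations, the sketch below uses made-up scores for ten students and Python's statistics.correlation (available from Python 3.10): the correlation between two mathematics tests should be high (convergent evidence), while the correlation between a mathematics test and a measure of an unrelated construct should be much weaker (discriminant evidence).

    from statistics import correlation  # Pearson's r, Python 3.10+

    # Hypothetical scores for the same ten students.
    math_test_a = [55, 62, 70, 48, 81, 90, 66, 73, 59, 85]
    math_test_b = [52, 65, 68, 50, 79, 88, 70, 71, 61, 83]   # a second test of the same construct
    other_trait = [14, 11, 19, 16, 12, 18, 15, 20, 13, 17]   # a measure of an unrelated construct

    # Convergent evidence: the two mathematics tests should correlate highly.
    print(correlation(math_test_a, math_test_b))
    # Discriminant evidence: this correlation should be much weaker than the one above.
    print(correlation(math_test_a, other_trait))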

Criterion Validity
Criterion validity evidence involves the correlation between the test and a criterion variable (or variables) taken as representative of the construct. In other words, it compares the test with other measures or outcomes (the criteria) already held to be valid. For example, employee selection tests are often validated against measures of job performance (the criterion), and IQ tests are often validated against measures of academic performance (the criterion).
If the test data and criterion data are collected at the same time, this is referred to as concurrent validity evidence. If the test data is collected first in order to predict criterion data collected at a later point in time, then this is referred to as predictive validity evidence.
For example, a company psychologist might measure the job performance of newly hired artists after they have been on the job for 6 months. He or she would then correlate the scores on each predictor with the job performance scores to determine which one is the best predictor.


Concurrent Validity
According to Howell (1992) “concurrent validity is determined using other existing and similar tests which have been known to be valid as comparisons to a test being  developed. There is no other known valid test to measure the range of cultural issues tested for this specific group of subjects”.
Concurrent validity refers to the degree to which scores taken at one point in time correlate with other measures (test, observation or interview) of the same construct measured at the same time. Returning to the selection test example, this would mean that the tests are administered to current employees and then correlated with their scores on performance reviews. Concurrent validity thus measures the relationship with measurements made using existing tests; the existing test is the criterion. For example, a new measure of creativity should correlate with existing measures of creativity.
For example:
To assess the validity of a diagnostic screening test, the predictor (X) is the test and the criterion (Y) is the clinical diagnosis. When the correlation is large, it means that the predictor is useful as a diagnostic tool.

Predictive Validity
Predictive validity assesses how well the test predicts some future behaviour of the examinee. It refers to the degree to which the operationalization can predict (or correlate with) other measures of the same construct that are taken at some time in the future. Again, with the selection test example, this would mean that the tests are administered to applicants, all applicants are hired, their performance is reviewed at a later time, and then their scores on the two measures are correlated. This form of validity evidence is particularly useful and important for aptitude tests, which attempt to predict how well the test taker will do in some future setting.
This measures the extent to which a future level of a variable can be predicted from a current measurement. This includes correlation with measurements made with different instruments. For example, a political poll intends to measure future voting intent. College entry tests should have a high predictive validity with regard to final exam results. When the two sets of scores are correlated, the coefficient that results is called the predictive validity coefficient.
Examples:
1. If higher scores on the Board exams are positively correlated with higher GPAs at the universities, and vice versa, then the Board exams are said to have predictive validity.
2. We might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession.

Predictive validity depends upon the following two steps (a minimal sketch of the procedure follows the list):
• Obtain test scores from a group of respondents, but do not use the test in making a decision.
• At some later time, obtain a performance measure for those respondents, and correlate these measures with the test scores to obtain the predictive validity.
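Here is a minimal Python sketch of that two-step procedure, using made-up admission-test scores and later first-year GPAs for ten respondents (all names and numbers are purely illustrative; statistics.correlation requires Python 3.10+):

    from statistics import correlation  # Pearson's r, Python 3.10+

    # Step 1: test scores collected now, not used for any selection decision.
    admission_scores = [61, 74, 58, 90, 67, 83, 70, 77, 64, 88]

    # Step 2: a performance measure (here, first-year GPA) obtained for the same
    # respondents at a later time.
    first_year_gpa = [2.4, 3.1, 2.2, 3.8, 2.6, 3.5, 2.9, 3.0, 2.5, 3.7]

    # The resulting coefficient is the predictive validity coefficient.
    print(round(correlation(admission_scores, first_year_gpa), 2))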

Scoring the Test

Scoring Objective Test Items
If the student’s answers are recorded on the test paper itself, a scoring key can be made by marking the correct answers on a blank copy of the test. Scoring then is simply a matter of comparing the columns of the answers on this master copy with the columns of answers on each student’s paper. A strip key which consists merely of strips of paper, on which the columns of answers are recorded, may also be used if more convenient. These can easily be prepared by cutting the columns of answers from the master copy of the test and mounting them on strips of cardboard cut from manila folders.
When separate answer sheets are used, a scoring stencil is more convenient. This is a blank answer sheet with holes punched where the correct answers should appear. The stencil is laid over the answer sheet, and the number of the answer checks appearing through holes is counted. When this type of scoring procedure is used, each test paper should also be scanned to make certain that only one answer was marked for each item. Any item containing more than one answer should be eliminated from the scoring.
As each test paper is scored, mark each item that is scored incorrectly. With multiple choice items, a good practice is to draw a red line through the correct answers of the missed items rather than through the student’s wrong answers. This will indicate to the students those items missed and at the same time will indicate the correct answers. Time will be saved and confusion avoided during discussion of the test. Marking the correct answers of the missed items is simple with a scoring stencil. When no answer check appears through a hole in the stencil, a red line is drawn across the hole.
In scoring objective tests, each correct answer is usually counted as one point, because an arbitrary weighting of items makes little difference in the students' final scores. If some items are counted as two points, some as one point, and some as half a point, the scoring becomes more complicated without any accompanying benefits. Scores based on such weightings will be similar to those from the simpler procedure of counting each item as one point. When a test consists of a combination of objective items and a few, more time-consuming, essay questions, however, more than a single point is needed to distinguish several levels of response and to reflect the disproportionate time devoted to each of the essay questions.
When students are told to answer every item on the test, a student's score is simply the number of items answered correctly. There is no need to consider wrong answers or to correct for guessing. When all students answer every item on the test, the rank of the students' scores will be the same whether the number right or a correction for guessing is used.
A simplified form of item analysis is all that is necessary or warranted for classroom tests, because most classroom groups consist of 20 to 40 students. An especially useful procedure is to compare the responses of the ten highest-scoring students with those of the ten lowest-scoring students. As we shall see later, keeping the upper and lower groups at ten students each simplifies the interpretation of the results. It is also a reasonable number for analysis in groups of 20 to 40 students. For example, with a small classroom group, like one of 20 students, it is best to use the upper and lower halves to obtain dependable data, whereas with a larger group, like one of 40 students, use of the upper and lower 25 percent is quite satisfactory. For more refined analysis, the upper and lower 27 percent is often recommended, and most statistical guides are based on that percentage.
To illustrate the method of item analysis, suppose we have just finished scoring 32 test papers for a sixth-grade science unit on weather. Our item analysis might then proceed as follows:
1. Rank the 32 test papers in order from the highest to the lowest score.
2. Select the ten papers with the highest total scores and the ten papers with the lowest total scores.
3. Put aside the middle 12 papers, as they will not be used in the analysis.
4. For each test item, tabulate the number of students in the upper and lower groups who selected each alternative. This tabulation can be made directly on the test paper or on the test item card.
5. Compute the difficulty of each item (percentage of the students who got the item right).
6. Compute the discriminating power of each item (difference between the number of students in the upper and lower groups who got the item right).
7. Evaluate the effectiveness of the distracters in each item (attractiveness of the incorrect alternatives).
Although item analysis by inspection will reveal the general effectiveness of a test item and is satisfactory for most classroom purposes, it is sometimes useful to obtain a more precise estimate of item difficulty and discriminating power. This can be done by applying relatively simple formulas to the item-analysis data.
Computing item difficulty:
The difficulty of a test item is indicated by the percentage of students who get the item right. Hence, we can compute item difficulty (P) by means of following formula, in which R equals the number of students who got the item right, and T equals the total number of students who tried the item.

P=(R/T)x 100
The discriminating power of an achievement test item refers to the degree to which it discriminates between students with high and low achievement. Item discriminating power (D) can be obtained by subtracting the number of students in the lower group who get the item right (RL) from the number of students in the upper group who get the item right (RU) and dividing by one-half the total number of students included in the item analysis (0.5T). Summarized in formula form, it is:
D = (RU - RL) / 0.5T
An item with maximum positive discriminating power is one in which all students in the upper group get the item right and all the students in the lower group get the item wrong. This results in an index of 1.00, as follows:
D= (10-0)/10=1.00
An item with no discriminating power is one in which an equal number of students in both the upper and lower groups get the item right. This results in an index of .00, as follows:
D= (10-10)/10= .00
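Both indices are straightforward to compute. The Python sketch below applies the two formulas to hypothetical tallies for a single item from the 32-paper weather-unit example, with upper and lower groups of ten papers each (so T = 20 for the discrimination index):

    def item_difficulty(right, tried):
        # P = (R / T) x 100: percentage of students who got the item right.
        return 100.0 * right / tried

    def item_discrimination(right_upper, right_lower, analysis_total):
        # D = (RU - RL) / 0.5T, where T is the number of students in the item analysis.
        return (right_upper - right_lower) / (0.5 * analysis_total)

    # Hypothetical tallies for one item: 24 of the 32 students answered it correctly;
    # 9 of the 10 upper-group papers and 4 of the 10 lower-group papers got it right.
    print(item_difficulty(24, 32))        # 75.0
    print(item_discrimination(9, 4, 20))  # 0.5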
Scoring Essay Type Test Items
According to N.E. Gronlund (1990) the chief weakness of the essay test is the difficulty of scoring. The objectivity of scoring the essay questions may be improved by following a few rules developed by test experts.
a. Prepare a scoring key in advance. The scoring key should include the major points of the acceptable answer, the feature of the answer to be evaluated, and the weights assigned to each. To illustrate, suppose the question is “Describe the main elements of teaching.” Suppose also that this question carries 20 marks. We can prepare a scoring key for the question as follows.
i. Outline of the acceptable answer. There are four elements in teaching; these are: the definition of instructional objectives, the identification of the entering behaviour of students, the provision of learning experiences, and the assessment of the students' performance.
ii. Main features of the answer and the weights assigned to each.
- Content: Allow 4 points for each element of teaching.
- Comprehensiveness: Allow 2 points.
- Logical organization: Allow 2 points.
- Irrelevant material: Deduct up to a maximum of 2 points.
- Misspelling of technical terms: Deduct 1/2 point for each mistake, up to a maximum of 2 points.
- Major grammatical mistakes: Deduct 1 point for each mistake, up to a maximum of 2 points.
- Poor handwriting, misspelling of non-technical terms and minor grammatical errors: ignore.
Preparing the scoring key in advance is useful since it provides a uniform standard for evaluation.
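As a minimal illustration (the function and the sample answer below are hypothetical, not part of any prescribed procedure), the 20-mark scoring key in (a) could be applied like this:

    def score_teaching_essay(elements_covered, comprehensiveness, organization,
                             irrelevant_penalty, misspelled_terms, grammar_mistakes):
        # Content: 4 points for each of the four elements of teaching.
        score = 4 * min(elements_covered, 4)
        # Comprehensiveness and logical organization: up to 2 points each.
        score += min(comprehensiveness, 2) + min(organization, 2)
        # Deductions, each capped at 2 points.
        score -= min(irrelevant_penalty, 2)       # irrelevant material
        score -= min(0.5 * misspelled_terms, 2)   # 1/2 point per misspelled technical term
        score -= min(grammar_mistakes, 2)         # 1 point per major grammatical mistake
        return max(score, 0)                      # total out of 20

    # Hypothetical answer: three elements covered, fairly comprehensive and organized,
    # some irrelevant material, two misspelled technical terms, one major grammar error.
    print(score_teaching_essay(3, 2, 1, 1, 2, 1))  # 12 + 2 + 1 - 1 - 1 - 1 = 12.0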
b. Use an appropriate scoring method. There are two scoring methods commonly used by the classroom teacher. The point method and the rating method.
In the point method, the teacher compares each answer with the acceptable answer and assigns a given number of points in terms of how well each answer approximates the acceptable answer. This method is suitable for a restricted response type of question, since in this type each feature of the answer can be identified and given proper point values. For example, suppose the question is: "List five hypotheses that might explain why nations go to war." For this question, we can easily assign point values to each hypothesis and evaluate each answer accordingly.
In the rating method, the teacher reads each answer and places it in one of several categories according to quality. For example, the teacher may set up five categories: excellent (10 points), good (8 points), average (6 points), weak (4 points) and poor (2 points). This method is suitable for an extended response type of question, since in this type we make a gross judgment concerning the main features of the answer. It is good practice to grade each feature separately and then add the point values to get the total score.
c. Read a sampling of the papers to get a 'feel' for the quality of the answers. This will give you confidence in scoring and stability in your judgment.
d. Score one question through all of the papers before going on to the next question. This procedure has three main advantages: first, the comparison of answers makes the scoring more exact and just; second, having to keep only one list of points in mind saves time and promotes accuracy; and third, it avoids a halo effect. A halo effect is the tendency, when rating a person, to let one characteristic influence the ratings on other characteristics.
e. Adopt a definite policy regarding factors which may not be relevant to the learning outcomes being measured. The grading of answers to essay questions is influenced by a large number of factors, including handwriting, spelling, punctuation, sentence structure, style, padding of irrelevant material, and neatness. The teacher should specify which factors will or will not be taken into account and what score values will be assigned to or deducted for each factor.
f. Score the papers anonymously. Have the student record his name on the back or at the end of the paper, rather than at the top of each page. Another way is to let each student have a code number and write it on his paper instead of his name. Keeping the author of the paper unknown will decrease the bias with which the paper is graded.


Administering the Test

I. Test Assembly
We have discussed various aspects of test planning and construction. If you have written instructional objectives, constructed a test blueprint, and written items that match your objectives, then more than likely you will have a good test. All the "raw material" will be there. However, sometimes the raw material, as good as it may be, can be rendered useless by a poorly assembled and administered test. By now you know it requires a substantial amount of time to write objectives, put together a test blueprint, and write items. It is worth a little more time to properly assemble or package your test so that your efforts will not be wasted. Assembly of the test comprises the following steps:-
(i) Group together all items of a similar format, e.g. group all essay type items or all MCQs together.
(ii) Arrange test items from easy to hard.
(iii) Space the items for easy reading.
(iv) Keep items and their options on the same page of the test.
(v) Position illustrations, tables, charts, pictures, diagrams or maps near their descriptions.
(vi) Check answer keys carefully.
(vii) Determine how students will record their answers.
(viii) Provide adequate and proper space for name and date.
(ix) Make test directions precise and clear.
(x) Proofread the test to make it error free.
(xi) Make all items unbiased (gender, culture, ethnic, racial, etc.).
II. Reproduction of the Test
Most test reproduction in schools is done by photocopy machines. As you well know, the quality of such copies can vary tremendously. Regardless of how valid and reliable your test might be, poor printing or copying will not make a good impression. Take the following practical steps to ensure that the time you spent constructing a valid and reliable test does not end in illegible printing.
• Arrange proper printing of the test if the test takers are large in number
• Make photocopies on a well-maintained machine
• Use good quality paper and printing
• Retain the original test in your own custody
• Be careful while making sets of the test (staple the different pages carefully)
• Maintain the confidentiality of the test
III. Administration of the Test
The test is ready. All that remains is to get the students ready and hand out the test. Here are some suggestions to help your students be psychologically prepared for the test:-
• Maintain a positive attitude toward achievement
• Maximize achievement motivation
• Equalize advantages for all the students
• Provide easy, comfortable and proper seating
• Provide proper lighting, temperature, ventilation and drinking water
• Clarify all the rules and regulations of the examination centre/hall
• Rotate distributions
• Remind the students to check their copies
• Monitor students continuously
• Minimize distractions
• Give time warnings properly
• Collect the tests uniformly
• Count the answer sheets, seal them in a bag and hand them over to the quarters concerned

IV. Test Taking Strategies
To improve test-taking skills, there are three approaches that might prove fruitful. First, students need to understand the mechanics of test taking, such as the need to follow instructions carefully, to check their work, and so forth. Second, they need to use appropriate test-taking strategies, including ways in which test items should be approached and how to make educated guesses. Finally, they need to practice their test-taking skills to refine their abilities and to become more comfortable in testing situations. By acting upon the following strategies, students may enhance their test-taking skills:-
• Students need to follow directions carefully.
• Students need to understand how to budget their time.
• Students need to check their work.
• For each item, students need to read the entire test item and all the possible answers very carefully.
• Answer the easier questions first and persist to the end of the test.
• Students need to make educated guesses.
• Use test item formats for practice.
• Review the practice items and answer choices with students.
• Practice using answer sheets.
V. Steps to Prevent Cheating
Cheating is a big issue when administering tests intended to yield reliable and valid data on students' learning achievement. The following steps can be taken to prevent cheating:-
i. Take special precautions to keep the test secure during preparation, storage and administration.
ii. Students should be provided sufficient space on their desks to work easily and to prevent the use of helping material.
iii. If scratch paper is used, have it turned in with the test.
iv. The testing session must be watched carefully. Walk around the room periodically and observe what the students are doing.
v. Two forms of the test can also be used, or some items can be varied across copies, to prevent cheating.
vi. Use special seating arrangements when placing the students for the test. Provide sufficient empty space between students.
vii. Create and maintain a positive attitude concerning the value of tests for improving learning.


General Consideration in Constructing Essay type Test Items

Robert L. Ebel and David A. Frisbie (1991) write in their book that "teachers are often as concerned with measuring the ability of students to think about and use knowledge as they are with measuring the knowledge their students possess." In these instances, tests are needed that permit students some degree of latitude in their responses. Essay tests are adapted to this purpose: the student writes a response to a question that is several paragraphs to several pages long. Essays can be used for higher-level learning outcomes, such as synthesis or evaluation, as well as lower-level outcomes. They provide items for which students supply rather than select the appropriate answer; usually the student composes a response in one or more sentences. Essay tests allow students to demonstrate their ability to recall, organize, synthesize, relate, analyze and evaluate ideas.

Types of Essay Tests
Essay tests may be divided into many types. Monroe and Carter (1993) divide essay tests into many categories, such as: selective recall (basis given); evaluative recall (basis given); comparison of two things on a single designated basis; comparison of two things in general; decisions for or against; cause and effect; explanation of the use or exact meaning of some word, phrase or statement; summary of some unit of the textbook or an article; analysis; statement of relationships; illustration or examples; classification; application of rules, laws, or principles to new situations; discussion; statement of an author's purpose in the selection or organization of material; criticism as to the adequacy, correctness or relevance of a printed statement or of a classmate's answer to a question on the lesson; reorganization of facts; formulation of new questions (problems and questions raised); new methods of procedure; and so on.

Types of Constructed Response Items
Essay items can vary from very lengthy, open-ended, end-of-semester term papers or take-home tests that have flexible page limits (e.g. 10-12 pages, no more than 30 pages, etc.) to essays with responses limited or restricted to one page or less. Thus essay type items are of two types:-
• Restricted Response Essay Items
• Extended Response Essay Items
I. Restricted Response Essay Items
An essay item that poses a specific problem for which a student must recall proper information, organize it in a suitable manner, derive a defensible conclusion, and express it within the limits of the posed problem, or within a page or time limit, is called a restricted response essay item. The statement of the problem specifies response limitations that guide the student in responding and provide evaluation criteria for scoring.
Example 1:
List the major similarities and differences in the lives of people living in Islamabad and Faisalabad.
Example 2:
Compare advantages and disadvantages of lecture teaching method and demonstration teaching method.
When Should Restricted Response Essay Items be used?
Restricted Response Essay Items are usually used to:-
• Analyze relationships
• Compare and contrast positions
• State necessary assumptions
• Identify appropriate conclusions
• Explain cause-and-effect relationships
• Organize data to support a viewpoint
• Evaluate the quality and worth of an item or action
• Integrate data from several sources

II. Extended Response Essay Type Items
An essay item that allows the student to determine the length and complexity of the response is called an extended-response essay item. This type of essay is most useful at the synthesis or evaluation levels of the cognitive domain. Extended response items are used when we are interested in determining whether students can organize, integrate, express, and evaluate information, ideas, or pieces of knowledge.
Example:
Identify as many different ways to generate electricity in Pakistan as you can. Give the advantages and disadvantages of each. Your response will be graded on its accuracy, comprehensiveness and practicality. Your response should be 8-10 pages in length, and it will be evaluated according to the RUBRIC (scoring criteria) already provided.
Scoring Essay Type Items
A rubric, or scoring criteria, is developed to evaluate and score an essay type item. A rubric is a scoring guide for subjective assessments: a set of criteria and standards, linked to learning objectives, that is used to assess a student's performance on papers, projects, essays, and other assignments. Rubrics allow for standardized evaluation according to specified criteria, making grading simpler and more transparent. A rubric may vary from a simple checklist to an elaborate combination of checklists and rating scales. How elaborate your rubric is depends on what you are trying to measure. If your essay item is a restricted-response item simply assessing mastery of factual content, a fairly simple listing of essential points will be sufficient. An example of a rubric for a restricted response item is given below.
Test Item:
Name and describe five of the most important factors of unemployment in Pakistan. (10 points)
Rubric/Scoring Criteria:
(i) 1 point for each of the factors named, to a maximum of 5 points
(ii) One point for each appropriate description of the factors named, to a maximum of 5 points
(iii) No penalty for spelling, punctuation, or grammatical error
(iv) No extra credit for more than five factors named or described.
(v) Extraneous information will be ignored.
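As a minimal illustration (not part of the original rubric), the scoring logic above can be expressed as a short Python sketch; the function name and the example counts are hypothetical:

def score_unemployment_item(num_factors_named, num_factors_described):
    # Sketch of the rubric above: 1 point per factor named (max 5) and
    # 1 point per appropriate description (max 5). Spelling or grammar errors
    # are not penalized and extra factors earn no credit, so neither appears
    # in the calculation.
    points_for_names = min(num_factors_named, 5)
    points_for_descriptions = min(num_factors_described, 5)
    return points_for_names + points_for_descriptions  # maximum 10 points

# Example: a student names 6 factors but describes only 4 of them appropriately.
print(score_unemployment_item(6, 4))  # 9 out of 10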
However, when essay items are measuring higher order thinking skills of the cognitive domain, more complex rubrics are needed. An example of a rubric for a writing test in language is given below.



Advantages of Essay Type Items
The main advantages of essay type tests are as follows:
(i) They can measure complex learning outcomes which cannot be measured by other means.
(ii) They emphasize integration and application of thinking and problem solving skills.
(iii) They can be easily constructed.
(iv) They give examinees freedom to respond within broad limits.
(v) The students cannot guess the answer because they have to supply it rather than select it.
(vi) Practically, it is more economical to use essay type tests if the number of students is small.
(vii) They require less time for typing, duplicating or printing. They can also be written on the blackboard if the number of students is not large.
(viii) They can measure divergent thinking.
(ix) They can be used as a device for measuring and improving the language and expression skills of examinees.
(x) They are more helpful in evaluating the quality of the teaching process.
(xi) Studies have shown that when students know that essay type questions will be asked, they focus on learning broad concepts and on articulating relationships, contrasting and comparing.
(xii) They set higher standards of professional ethics for teachers because they demand more of the teacher's time in assessing and scoring.


Limitations of Essay Type Items
The essay type tests have the following serious limitations as a measuring instrument:
(i) A major problem is the lack of consistency in judgments even among competent examiners.
(ii) They have halo effects. If the examiner is measuring one characteristic, he can be influenced in scoring by another characteristic. For example, a well-behaved student may score more marks on account of his good behaviour as well.
(iii) They have a question-to-question carry-over effect. An examinee who has answered the opening question or questions satisfactorily is likely to score more than one who did not do well in the beginning but did well later on.
(iv) They have an examinee-to-examinee carry-over effect. A particular examinee gets marks not only on the basis of what he has written but also on the basis of whether the previous examinee, whose answer book the examiner marked before, was good or bad.
(v) They have limited content validity because only a small sample of questions can be asked in an essay type test.
(vi) They are difficult to score objectively because the examinee has wide freedom of expression and he writes long answers.
(vii) They are time consuming both for the examiner and the examinee.
(viii) They generally emphasize the lengthy enumeration of memorized facts.
Suggestions for Writing Essay Type Items
I. Ask questions or establish tasks that will require the student to demonstrate command of essential knowledge. This means that students should not be asked merely to reproduce material heard in a lecture or read in a textbook. To "demonstrate command" requires that the question be somewhat novel or new. The substance of the question should be essential knowledge rather than trivia that might be a good board game question.
II. Ask questions that are determinate, in the sense that experts (colleagues in the field) could agree that one answer is better than another. Questions that contain phrases such as "What do you think..." or "What is your opinion about..." are indeterminate. They can be used as a medium for assessing skill in written expression, but because they have no clearly right or wrong answer, they are useless for measuring other aspects of achievement.
III. Define the examinee's task as completely and specifically as possible without interfering with the measurement process itself. It is possible to word an essay item so precisely that there is one and only one very brief answer to it. The imposition of such rigid bounds on the response is more limiting than it is helpful. Examinees do need guidance, however, to judge how extensive their response must be to be considered complete and accurate.
IV. Generally give preference to specific questions that can be answered briefly. The more questions used, the better the test constructor can sample the domain of knowledge covered by the test. And the more responses available for scoring, the more accurate the total test scores are likely to be. In addition, brief responses can be scored more quickly and more accurately than long, extended responses, even when there are fewer of the latter type.
V. Use enough items to sample the relevant content domain adequately, but not so many that students do not have sufficient time to plan, develop, and review their responses. Some instructors use essay tests rather than one of the objective types because they want to encourage and provide practice in written expression. However, when time pressures become great, the essay test is one of the most unrealistic and negative writing experiences to which students can be exposed. Often there is no time for editing, for rereading, or for checking spelling. Planning time is short changed so that writing time will not be. There are few, if any, real writing tasks that require such conditions. And there are few writing experiences that discourage the use of good writing habits as much as essay testing does.
VI. Avoid giving examinees a choice among optional questions unless special circumstances make such options necessary. The use of optional items destroys the strict comparability between student scores because not all students actually take the same test. Student A may have answered items 1-3 and Student B may have answered 3-5. In these circumstances the variability of scores is likely to be quite small because students were able to respond to items they knew more about and ignore items with which they were unfamiliar. This reduced variability contributes to reduced test score reliability. That is, we are less able to identify individual differences in achievement when the test scores form a very homogeneous distribution. In sum, optional items restrict score comparability between students and contribute to low score reliability due to reduced test score variability.
VII. Test the question by writing an ideal answer to it. An ideal response is needed eventually to score the responses. If it is prepared early, it permits a check on the wording of the item, the level of completeness required for an ideal response, and the amount of time required to furnish a suitable response. It even allows the item writer to determine if there is any "correct" response to the question.
VIII. Specify the time allotment for each item and/or specify the maximum number of points to be awarded for the "best" answer to the question. Both pieces of information provide guidance to the examinee about the depth of response expected by the item writer. They also represent legitimate pieces of information a student can use to decide which of several items should be omitted when time begins to run out. Often the number of points attached to the item reflects the number of essential parts to the ideal response. Of course if a definite number of essential parts can be determined, that number should be indicated as part of the question.
IX. Divide a question into separate components when there are obvious multiple questions or pieces to the intended responses. The use of parts helps examinees organizationally and, hence, makes the process more efficient. It also makes the grading process easier because it encourages organization in the responses. Finally, if multiple questions are not identified, some examinees may inadvertently omit some parts, especially when time constraints are great.

General Considerations in Constructing Objective Test Items

The second step in test planning is determining the format and length of the test. The format is based on the different types of items to be included in the test. The construction of valid and good test items is a skill, just like effective teaching: some rules are to be followed and some techniques are to be used to construct good test items. Test items can be used to assess a student's ability to recognize concepts or to recall concepts. Generally there are two types of objective test items:-
i. Select type.
ii. Supply type.

Select Type Items

Matching Items
According to W. Wiersma and S.G. Jurs (1990), the matching items consist of two parallel columns. The column on the left contains the questions to be answered, termed premises; the column on the right, the answers, termed responses. The student is asked to associate each premise with a response to form a matching pair. For example








According to W. Wiersma and S.G. Jurs (1990), in some matching exercises the number of premises and responses is the same; this is termed a balanced or perfect matching exercise. In others, the number of premises and responses may be different.

Advantages
The chief advantage of matching exercises is that a good deal of factual information can be tested in minimal time, making the tests compact and efficient. They are especially well suited to who, what, when and where types of subject matter. Further, students frequently find the tests fun to take because they have puzzle qualities to them.

Disadvantages
The principal difficulty with matching exercises is that teachers often find that the subject matter is insufficient in quantity or not well suited for matching terms. An exercise should be confined to homogeneous items containing one type of subject matter (for instance, authors-novels; inventions-inventors; major events-dates; terms-definitions; rules-examples, and the like). Where unlike clusters of questions are used, the adept but poorly informed student can often recognize the ill-fitting items by their irrelevant and extraneous nature (for instance, in a list of authors, the inclusion of the names of capital cities).
In a matching exercise the student identifies connected items from two lists. It is useful for assessing the ability to discriminate, categorize, and associate among similar concepts.

Suggestions for Writing Matching Items

Here are some suggestions for writing matching items:
i. Keep both the list of descriptions and the list of options fairly short and homogeneous – they should both fit on the same page. Title the lists to ensure homogeneity and arrange the descriptions and options in some logical order. If this is impossible, you’re probably including too wide a variety in the exercise. Try constructing two or more exercises.
ii. Make sure that all the options are plausible distracters for each description to ensure homogeneity of lists.
iii. The list of descriptions on the left side should contain the longer phrases or statements, whereas the options on the right side should consist of short phrases, words or symbols.
iv. Each description in the list should be numbered (each is an item), and the list of options should be identified by letter.
v. Include more options than descriptions. If the option list is longer than the description list, it is harder for students to eliminate options. If the option list is shorter, some options must be used more than once. Always include some options that do not match any of the descriptions, or some that match more than one, or both.

vi. In the directions, specify the basis for matching and whether options can be used more than once.
B. Multiple Choice Questions (MCQ’s)
Norman E. Gronlund (1990) writes that the multiple choice question is probably the most popular as well as the most widely applicable and effective type of objective test item. The student selects a single response from a list of options. It can be used effectively for any level of course outcome. It consists of two parts: the stem, which states the problem, and a list of three to five alternatives, one of which is the correct (key) answer and the others are distracters ("foils", or incorrect options that draw the less knowledgeable pupil away from the correct response).
The stem may be stated as a direct question or as an incomplete statement. For example:

Direct question
Which is the capital city of Pakistan? -------- (Stem)
A. Lahore. -------------------------------------- (Distracter)
B. Karachi. ------------------------------------- (Distracter)
C. Islamabad. ---------------------------------- (Key)
D. Peshawar. ----------------------------------- (Distracter)
Incomplete Statement
The capital city of Pakistan is
A. Lahore.
B. Karachi.
C. Islamabad.
D. Peshawar.


RULES FOR WRITING MULTIPLE-CHOICE QUESTIONS
1. Use Plausible Distracters (wrong-response options)
 Only list plausible distracters, even if the number of options per question changes
 Write the options so they are homogeneous in content
 Use answers given in previous open-ended exams to provide realistic distracters
2. Use a Question Format
 Experts encourage multiple-choice items to be prepared as questions (rather than incomplete statements)
Incomplete Statement Format:
The capital of AJK is in-----------------.
Direct Question Format:
In which of the following cities is the capital of AJK?
3. Emphasize Higher-Level Thinking
 Use memory-plus application questions. These questions require students to recall principles, rules or facts in a real life context.
 The key to preparing memory-plus application questions is to place the concept in a life situation or context that requires the student to first recall the facts and then apply or transfer those facts to the situation.
 Seek support from others who have experience writing higher-level thinking multiple-choice questions.

EXAMPLES:


Memory Only Example (Less Effective)
Which description best characterizes whole foods?
a. orange juice
b. toast
c. bran cereal
d. grapefruit
Memory-Plus Application Example (More Effective)
Sana’s breakfast this morning included one glass of orange juice (from Concentrate), one slice of toast, a small bowl of bran cereal and a grapefruit. What “whole food” did Sana eat for breakfast?
a. orange juice
b. toast
c. bran cereal
d. grapefruit
More Memory-Plus Application Examples
Ability to Interpret Cause-and-Effect Relationships Example
Why does investing money in common stock protect against loss of assets during inflation?
a. It pays higher rates of interest during inflation.
b. It provides a steady but dependable income despite economic conditions.
c. It is protected by the Federal Reserve System.
d. It increases in value as the value of a business increases.
Ability to Justify Methods and Procedures Example
Why is adequate lighting necessary in a balanced aquarium?
a. Fish need light to see their food.
b. Fish take in oxygen in the dark.
c. Plants expel carbon dioxide in the dark.
d. Plants grow too rapidly in the dark.

4. Keep Option Lengths Similar
 Avoid making your correct answer the long or short answer
5. Balance the Placement of the Correct Answer
 By habit, correct answers tend to be placed as the second or third option; vary the position of the correct answer from item to item
6. Be Grammatically Correct
 Use simple, precise and unambiguous wording
 Otherwise, students may be able to select the correct answer simply by finding the only grammatically consistent option
7. Avoid Clues to the Correct Answer
 Avoid answering one question in the test by giving the answer somewhere else in the test
 Have the test reviewed by someone who can find mistakes, clues, grammar and punctuation problems before you administer the exam to students
 Avoid extremes – never, always, only
 Avoid nonsense words and unreasonable statements
8. Avoid Negative Questions
 31 of 35 testing experts recommend avoiding negative questions
 Students may be able to find an incorrect answer without knowing the correct answer
9. Use Only One Correct Option (Or be sure the best option is clearly the best option)
 The item should include one and only one correct or clearly best answer
 With one correct answer, alternatives should be mutually exclusive and not overlapping
 Using MC with questions containing more than one right answer lowers discrimination between students
10. Give Clear Instructions
Such as:

 Questions 1 - 10 are multiple-choice questions designed to assess your ability to remember or recall basic and foundational pieces of knowledge related to this course.
 Please read each question carefully before reading the answer options. When you have a clear idea of the question, find your answer and mark your selection on the answer sheet. Please do not make any marks on this exam.
 Questions 11 – 20 are multiple-choice questions designed to assess your ability to think critically about the subject.
 Please read each question carefully before reading the answer options.
 Be aware that some questions may seem to have more than one right answer, but you are to look for the one that makes the most sense and is the most correct.
 When you have a clear idea of the question, find your answer and mark your selection on the answer sheet.
 You may justify any answer you choose by writing your justification on the blank paper provided.
11. Use Only a Single, Clearly-Defined Problem and Include the Main Idea in the Question
 Students must know what the problem is without having to read the response options
12. Avoid “All the Above” Option
 Students merely need to recognize two correct options to get the answer correct
13. Avoid the “None of the Above” Option
 You will never know if students know the correct answer
14. Don’t Use MCQ When Other Item Types Are More Appropriate
 For example, when plausible distracters are limited, or when assessing problem-solving and creativity
Advantages
The chief advantage of the multiple-choice question, according to N.E. Gronlund (1990), is its versatility. It is capable of being applied to a wide range of subject areas. In contrast to short answer items, which limit the writer to content areas that can be stated in one or two words, and to matching items, which are necessarily bound to homogeneous clusters of one type of subject matter, the multiple choice item has no such restrictions. A multiple choice question also greatly reduces the opportunity for a student to guess the correct answer, from one chance in two with a true-false item to one in four or five, thereby increasing the reliability of the test. Further, since a multiple-choice item contains plausible incorrect or less correct alternatives, it permits the test constructor to fine-tune the discrimination (the degree of homogeneity of the responses) and control the difficulty level of the test.
Disadvantages
N.E. Gronlund (1990) writes that multiple-choice items are difficult to construct. Suitable distracters are often hard to come by, and the teacher is tempted to fill the void with a "junk" response; this has the effect of narrowing the range of options, which the test-wise student can readily exploit. Multiple-choice items are also exceedingly time consuming to fashion, one hour per question being by no means the exception. Finally, they generally take students longer to complete (especially items containing fine discriminations) than do other types of objective questions.

Suggestions for Writing MCQ’s Items
Here are some guidelines for writing multiple-choice tests:
I. The stem of the item should clearly formulate a problem. Include as much of the item as possible, keeping the response options as short as possible. However, include only the material needed to make the problem clear and specific. Be concise – don’t add extraneous information.
II. Be sure that there is one and only one correct or clearly best answer.
III. Be sure wrong answer choices (distracters) are plausible. Eliminate unintentional grammatical clues, and keep the length and form of all the answer choices equal. Rotate the position of the correct answer from item to item randomly.
IV. Use negatively worded questions or statements only if the knowledge being tested requires it. In most cases it is more important for the student to know what a specific item of information is rather than what it is not.
V. Include from three to five options (two to four distracters plus one correct answer) to optimize testing for knowledge rather than encouraging guessing. It is not necessary to provide additional distracters for an item simply to maintain the same number of distracters for each item. This usually leads to poorly constructed distracters that add nothing to test validity and reliability.
VI. To increase the difficulty of a multiple-choice item, increase the similarity of content among the options.
VII. Use the option “none of the above” sparingly and only when the keyed answer can be classified unequivocally as right or wrong.
VIII. Avoid using "all of the above". It is usually the correct answer and makes the item too easy for students with partial information.

II. Supply Type Items

A. Completion Items
Like true-false items, completion items are relatively easy to write. Perhaps the first tests classroom teachers construct, and students take, are completion tests. Like items of all other formats, though, there are good and poor completion items. The student fills in one or more blanks in a statement; such items are also known as "gap-fillers." They are most effective for assessing knowledge and comprehension learning outcomes, but can be written for higher level outcomes, e.g.
The capital city of Pakistan is -----------------.
Suggestions for Writing Completion or Supply Items
Here are our suggestions for writing completion or supply items:
I. If at all possible, items should require a single-word answer or a brief and definite statement. Avoid statements that are so indefinite that they may be logically answered by several terms.
a. Poor item:
Motorway (M1) opened for traffic in ____________.
b. Better item:
Motorway (M1) opened for traffic in the year______.
II. Be sure the question or statement poses a problem to the examinee. A direct question is often more desirable than an incomplete statement because it provides more structure.
III. Be sure the answer that the student is required to produce is factually correct. Be sure the language used in the question is precise and accurate in relation to the subject matter area being tested.
IV. Omit only key words; don’t eliminate so many elements that the sense of the content is impaired.
c. Poor item:
The ____________ type of test item is usually more _________ than the _____ type.
d. Better item:
The supply type of test item is usually graded less objectively than the _________ type.
V. Word the statement such that the blank is near the end of the sentence rather than near the beginning. This will prevent awkward sentences.

VI. If the problem requires a numerical answer, indicate the units in which it is to be expressed.
B. Short Answer
The student supplies a response to a question that might consist of a single word or phrase. Short answer items are most effective for assessing knowledge and comprehension learning outcomes, but can be written for higher level outcomes. Short answer items are of two types.
 Simple direct questions
Who was the first president of Pakistan?
 Completion items
The name of the first president of Pakistan is ___________.
The items can be answered by a word, phrase, number or symbol. Short-answer tests are a cross between essay and objective tests. The student must supply the answer, as with an essay question, but in a highly abbreviated form, as with an objective question.
Advantages
Norman E. Gronlund (1990) writes that short-answer items have a number of advantages.
 They reduce the likelihood that a student will guess the correct answer
 They are relatively easy for a teacher to construct.
 They are well adapted to mathematics, the sciences, and foreign languages where specific types of knowledge are tested (The formula for ordinary table salt is ________).
 They are consistent with the Socratic question and answer format frequently employed in the elementary grades in teaching basic skills.
Disadvantages
According to Norman E. Gronlund (1990) there are also a number of disadvantages of short-answer items.
 They are limited to content areas in which a student’s knowledge can be adequately portrayed by one or two words.
 They are more difficult to score than other types of objective-item tests since students invariably come up with unanticipated answers that are totally or partially correct.

 Short answer items usually provide little opportunity for students to synthesize, evaluate and apply information.

Planning a Test.

The main objective of classroom assessment is to obtain valid, reliable and useful data regarding student learning achievement. This requires determining what is to be measured and then defining it precisely so that assessment tasks to measure the desired performance can be developed. Classroom tests and assessments can be used for the following instructional purposes:

i. Pre-testing
Tests and assessments can be given at the beginning of an instructional unit or course to determine:-
 whether the students have the prerequisite skills needed for the instruction (readiness, motivation etc)
 to what extent the students have already achieved the objectives of planned instruction (to determine placement or modification of instruction)
ii. During the Instruction Testing
 provides bases for formative assessment
 monitor learning progress
 detect learning errors
 provide feedback for students and teachers
iii. End of Instruction Testing
 measure intended learning outcomes
 used for summative assessment
 provides bases for grades, promotion etc
Prior to developing an effective test, one needs to determine whether or not a test is the appropriate type of assessment. If the learning objectives are primarily procedural knowledge (how to perform a task), then a written test may not be the best approach. Assessment of procedural knowledge generally calls for a performance demonstration assessed using a rubric. Where demonstration of a procedure is not appropriate, a test can be an effective assessment tool.
The first stage of developing a test is planning the test content and length. Planning the test begins with development of a blueprint or test specifications for the test structured on the learning outcomes or instructional objectives to be assessed by the test instrument. For each learning outcome, a weight should be assigned based on the relative importance of that outcome in the test. The weight will be used to determine the number of items related to each of the learning outcomes.
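For instance (a minimal sketch in Python; the outcomes, weights and test length below are invented, not taken from the text), the number of items per learning outcome can be derived from the assigned weights like this:

# Hypothetical learning outcomes with assumed weights (weights sum to 1.0).
weights = {
    "knows basic terms": 0.30,
    "understands concepts and principles": 0.45,
    "applies principles to new situations": 0.25,
}
total_items = 40  # assumed total length of the test

for outcome, weight in weights.items():
    number_of_items = round(weight * total_items)
    print(f"{outcome}: {number_of_items} items")
# e.g. a 30% weight on a 40-item test gives 12 items for "knows basic terms"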

Test Specifications
When an engineer prepares a design for a building and chooses the materials he intends to use in construction, he usually knows what the building is going to be used for, and therefore designs it to meet the requirements of its planned inhabitants. Similarly, in testing, the table of specifications is the blueprint of the assessment, which specifies the percentages and weightage of test items and the constructs being measured. It includes the constructs and concepts to be measured, the tentative weightage of each construct, the number of items for each concept, and a description of the item types to be constructed. It is not surprising that specifications are also referred to as 'blueprints', for they are literally architectural drawings for test construction. Fulcher & Davidson (2009) divided test specifications into the following four elements (a simple illustration follows the list):
 Item specifications: Item specifications describe the items, prompts or tasks, and any other material such as texts, diagrams, and charts which are used as stimuli. Typically, a specification at this sub-level contains two key elements: samples of the tasks to be produced, and guiding language that details all information necessary to produce the task.
 Presentation Model: The presentation model provides information about how the items and tasks are presented to the test takers.
 Assembly Model: Assembly model helps the test developer to combine test items and tasks to develop a test format.
 Delivery Model: Delivery Model tells how the actual test is delivered. It includes information regarding test administration, test security/confidentiality and time constraint.
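Purely as an illustration of how these four elements might be recorded for a small classroom test (every name and value below is an assumption, not taken from Fulcher & Davidson), a specification could be sketched as a simple Python structure:

# A hypothetical, much-simplified test specification ("blueprint") sketch.
test_specification = {
    "item_specifications": {
        "sample_task": "Name and describe five factors of unemployment in Pakistan.",
        "guiding_language": "Restricted-response essay; answer within one page.",
    },
    "presentation_model": "Paper booklet; one item per page with its instructions.",
    "assembly_model": {"essay_items": 2, "multiple_choice_items": 20},
    "delivery_model": {"time_allowed_minutes": 90, "administration": "invigilated classroom"},
}

for element, detail in test_specification.items():
    print(element, "->", detail)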


Sunday 9 September 2018

Conducting Parent-Teacher Conferences

The first conference is usually arranged at the beginning of the school year to allow parents and teachers to get acquainted and to prepare a plan for the coming months. Teachers usually receive some training to plan and conduct such conferences. The following steps may be observed for holding effective parent-teacher conferences.
1. Prepare for the conference
 Review the goals and objectives
 Organize the information to present
 If portfolios are to be discussed, make sure they are well arranged
 Start with, and keep, a positive focus
 Announce the final date and time as per the convenience of the parents and children
 Consider socio-cultural barriers of students / parents
 Check with other staff who work with your advisee
 Develop a conference packet including the student's goals, samples of work, and reports or notes from other staff
2. Rehearse the conference with students by role-playing
 Students present their goals, learning activities, samples of work
 Students ask for comments and suggestions from parents
3. Conduct conference with student, parent, and advisor. Advisee takes the lead to the greatest possible extent
 Have a comfortable setting of chairs, tables etc.
 Notify a viable timetable for the conferences
 Review goals set earlier
 Review progress towards goals
 Review progress with samples of work from learning activities
 Present the student's strong points first
 Review attendance and handling of responsibilities at school and home
 Modify goals for balance of the year as necessary
 Determine other learning activities to accomplish goals
 Describe upcoming events and activities
 Discuss how the home can contribute to learning
 Parents should be encouraged to share their thoughts on students’ progress
 Ask parents and students for questions, new ideas
4. Do’s of parent-teacher conferences
 Be friendly
 Be honest
 Be positive in approach
 Be willing to listen and explain
 Be willing to accept parents’ feelings
 Be careful about giving advice
 Be professional and maintain a positive attitude
 Begin with student’s strengths
 Review student’s cumulative record prior to conference
 Assemble samples of student’s work
 List questions to ask parents and anticipate parents’ questions
 Conclude the conference with an overall summary
 Keep a written record of the conference, listing problems and suggestions, with a copy for the parents
5. Don’ts of the parent teacher conference
 Don’t argue
 Don’t get angry
 Don’t ask embarrassing questions
 Don’t talk about other students, parents and teachers
 Don’t bluff if you don’t know
 Don’t reject parents’ suggestions
 Don’t blame parents
 Don't talk too much; be a good listener.

Calculating CGPA and Assigning Letter Grades.

CGPA stands for Cumulative Grade Point Average. It reflects the grade point average of all subjects/courses, representing a student's performance in a composite way. To calculate CGPA, we need the following information:
 Marks in each subject/course
 Grade point average in each subject/course
 Total credit hours (obtained by adding the credit hours of each subject/course)
Calculating CGPA is very simple: the grade points earned in each course are multiplied by that course's credit hours, summed, and then divided by the total credit hours. For example, if a student of an MA Education programme has studied 12 courses, each of 3 credit hours, the total credit hours will be 36. Since all courses carry equal credit hours, the average of the GPAs of all twelve courses will be the CGPA.
In the following table the GPA calculated for a student of the MA Education programme is given as an example.




The average of the GPAs will represent the CGPA: CGPA = (sum of the GPAs of all courses) / (number of courses), which is equivalent to the credit-weighted formula when every course carries the same credit hours.
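As a small worked sketch in Python (the course names, credit hours and grade points below are invented for illustration), the credit-weighted calculation reduces to the plain average of the GPAs when every course carries equal credit hours, as in the example above:

# Hypothetical courses: (name, credit hours, grade point average earned).
courses = [
    ("Course 1", 3, 3.7),
    ("Course 2", 3, 3.0),
    ("Course 3", 3, 4.0),
]

total_credit_hours = sum(credits for _, credits, _ in courses)
total_grade_points = sum(credits * gpa for _, credits, gpa in courses)

cgpa = total_grade_points / total_credit_hours
print(round(cgpa, 2))  # 3.57 -- the same as the plain average of 3.7, 3.0 and 4.0,
                       # because all three courses carry equal credit hours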
Assigning Letter Grades
The letter grade system is the most popular in the world, including Pakistan. Most teachers face problems while assigning grades. There are four core problems or issues in this regard: 1) what should be included in a letter grade, 2) how should achievement data be combined in assigning letter grades, 3) what frame of reference should be used in grading, and 4) how should the distribution of letter grades be determined?
1. Determining what to include in a grade
Letter grades are likely to be most meaningful and useful when they represent achievement only. If they are combined with other factors or aspects such as effort, amount of work completed, personal conduct, and so on, their interpretation becomes hopelessly confused. For example, a letter grade C may represent average achievement with extraordinary effort and excellent conduct and behaviour, or vice versa.
If letter grades are to be valid indicators of achievement, they must be based on valid measures of achievement. This involves defining objectives as intended learning outcomes and developing or selecting tests and assessments which can measure these learning outcomes.
2. Combining data in assigning grades
One of the key concerns while assigning grades is to be clear about what aspects of a student are to be assessed and what the tentative weightage of each element will be. For example, if we decide that 35 percent weightage is to be given to the mid-term assessment, 40 percent to the final term test or assessment, and 25 percent to assignments, presentations, classroom participation, and conduct and behaviour, we have to combine all elements by assigning the appropriate weight to each, and then use the resulting composite scores as a basis for grading.
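Using the illustrative weights just mentioned (35% mid-term, 40% final term, 25% other work), a composite score might be computed as in this minimal Python sketch; the component scores are invented:

# Assumed component scores, each expressed as a percentage out of 100.
mid_term = 70
final_term = 80
assignments_participation_conduct = 90

composite = (0.35 * mid_term
             + 0.40 * final_term
             + 0.25 * assignments_participation_conduct)
print(round(composite, 1))  # 79.0 -- this composite score, not any single component,
                            # is what the letter grade is then based on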
3. Selecting the proper frame of reference for grading
Letter grades are typically assigned on the basis of one of the following frames of reference.
a) Performance in relation to other group members (relative grading)







b) Performance in relation to specified standards (absolute grading)
c) Performance in relation to learning ability (amount of improvement)
Assigning grades on relative basis involves comparing a student’s performance with that of a reference group, mostly class fellows. In this system, the grade is determined by the student’s relative position or ranking in the total group. Although relative grading has a disadvantage of a shifting frame of reference (i.e. grades depend upon the group’s ability), it is still widely used in schools, as most of the time our system of testing is ‘norm-referenced’.
Assigning grades on an absolute basis involves comparing a student's performance to specified standards set by the teacher. This is what we call 'criterion-referenced' testing. If all students show a low level of mastery when measured against the established performance standard, all will receive low grades.
Grading student performance in relation to learning ability is inconsistent with a standards-based system of evaluating and reporting student performance. Improvement over a short time span is also difficult to measure reliably. The resulting lack of reliability in judging achievement in relation to ability, and in judging the degree of improvement, produces grades of low dependability. Therefore such grades should be used only as a supplement to other grading systems.
4. Determining the distribution of grades
The assigning of relative grades is essentially a matter of ranking the student in order of overall achievement and assigning letter grades on the basis of each student’s rank in the group. This ranking might be limited to a single classroom group or might be based on the combined distribution of several classroom groups taking the same course.
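A minimal Python sketch of this ranking idea follows; the class scores and the quota of grades per rank band are invented purely for illustration and are not a recommended distribution:

# Hypothetical composite scores for a small class.
scores = {"Ali": 82, "Sana": 91, "Bilal": 74, "Hina": 67, "Usman": 88}

# Rank students from highest to lowest composite score.
ranked = sorted(scores, key=scores.get, reverse=True)

# Assumed quota: top 20% -> A, next 30% -> B, next 30% -> C, remaining -> D.
grades = {}
for position, student in enumerate(ranked):
    fraction_of_class = (position + 1) / len(ranked)
    if fraction_of_class <= 0.20:
        grades[student] = "A"
    elif fraction_of_class <= 0.50:
        grades[student] = "B"
    elif fraction_of_class <= 0.80:
        grades[student] = "C"
    else:
        grades[student] = "D"

print(grades)  # {'Sana': 'A', 'Usman': 'B', 'Ali': 'C', 'Bilal': 'C', 'Hina': 'D'}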
If grading on the curve is to be done, the most sensible approach in determining the distribution of letter grades in a school is to have the school staff set general guidelines for introductory and advanced courses. All staff members must understand the basis for assigning grades, and this basis must be clearly communicated to the users of the grades. If the objectives of a course are clearly stated and the standards for mastery appropriately set, the letter grades in an absolute system may be defined as the degree to which the objectives have been attained, as in the scale below (a brief sketch of this mapping follows the scale).
A = Outstanding (90 to 100%)
B = Very Good (80-89%)
C = Satisfactory (70-79%)
D = Very Weak (60-69%)
F = Unsatisfactory (Less than 60%)
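A minimal Python sketch of this absolute mapping; the helper name and the example percentage are mine, and the cut-offs follow the scale just listed:

def letter_grade(percentage):
    # Maps a percentage score to the absolute scale listed above.
    if percentage >= 90:
        return "A"   # Outstanding
    elif percentage >= 80:
        return "B"   # Very Good
    elif percentage >= 70:
        return "C"   # Satisfactory
    elif percentage >= 60:
        return "D"   # Very Weak
    else:
        return "F"   # Unsatisfactory

print(letter_grade(79))  # C -- 79% falls in the 70-79% band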

EDUCATION

PHILOSOPHY AND EDUCATION

The word philosophy is derived from the Greek words philia (loving) and sophia (wisdom) and means "the love of wisdom". Philosophy...