CSUF Department of Psychology
RELIABILITY OF MEASUREMENT

Overview

Reliability is, in essence, the extent to which a measurement method is “consistent with itself.”  There are two important aspects of reliability.  The first is the extent to which a measurement method produces the same score for the same case under the same conditions. This is called test-retest reliability.  The second is the extent to which responses to the individual items in a multiple-item measure are consistent with each other. This is called internal consistency.

Test-Retest Reliability

If you take an IQ test today, and then you take it again next week, you would expect your scores to be quite similar.  This is because we think of intelligence as a relatively stable aspect of our personalities.  If you scored very differently when you took the test the second time, you would rightly wonder about the accuracy of the test.

In principle, it is very easy to assess test-retest reliability. Just use the measurement method in question to test several people today and then use it to test the same people again in a day or a week or a year. Then compute Pearson's r between the two sets of scores. A high correlation (say .80 or above) indicates good test-retest reliability. (A scatterplot for such a correlation would have most of the points falling pretty close to a single straight line.) For example, the scatterplot below shows the relationship between 20 people's scores on a self-esteem test taken first on Monday and then again on Friday. The correlation between the two sets of scores is +.87, which indicates good test-retest reliability.
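To make the computation concrete, here is a minimal sketch in Python of a test-retest correlation. The scores are made up for illustration (they are not the 20 people's scores described above), and the function name pearson_r is just a label chosen for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical self-esteem scores for the same 8 people on two occasions
monday = [22, 25, 18, 30, 27, 15, 24, 20]
friday = [21, 27, 17, 29, 25, 17, 26, 19]

r = pearson_r(monday, friday)
print(f"test-retest r = {r:.2f}")
```

If the printed correlation is at or above roughly .80, you would conclude that test-retest reliability is good by the rule of thumb given above.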

In practice, however, this approach does not always work.  Consider that people taking the same test a second time might remember their original answers and give them again just to be consistent.  This would make the test appear to be more reliable than it is. Also, we might expect some variables to change between the first and second measurements. If we measure the mood of everyone in our class today and then we do the same thing next week, the correlation might be low because people’s moods have changed—not because there is anything wrong with our measurement method. There are ways to deal with these problems, but you will have to learn about them in a more advanced measurement course.

Internal Consistency

If you take a 10-item self-esteem test, and you have high self-esteem, then you should tend to give “high self-esteem responses” to all 10 items. If you have low self-esteem, then you should tend to give “low self-esteem responses” to all 10 items. In general, people’s responses to the different items on a multiple-item measure should be positively correlated with each other. If they are not, then this indicates a problem with internal consistency. For example, if people’s responses to Item 3 are completely unrelated to their responses to Item 6, then it does not make sense to think that these two items are both measuring self-esteem … so they probably should not be on the same test.

One simple way to check for internal consistency is to look at the item-total correlations.  These are the correlations between the individual items and the total score. That is, you can compute the correlation between Item 1 and the total score, between Item 2 and the total score, and so on. If the measure is internally consistent, then these correlations should all be positive. Because each item on a self-esteem test is there because it supposedly measures self-esteem, each item should be positively correlated with the total score.  If many items have low or negative correlations with the total score, then this indicates poor internal consistency. Such items are usually dropped from the test. Item-total correlations are interesting in part because your instructors (not just in psychology) will sometimes use them to identify, and maybe even throw out, poor exam questions.
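The item-total check can be sketched in a few lines of Python. This is only an illustration: the responses are invented, the test has just four items to keep it short, and pearson_r and item_total are names chosen for this example. Item 4 has been made deliberately inconsistent with the others so you can see what a poor item looks like.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: each row is one person, each column is one item (1-4 scale)
responses = [
    [4, 4, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 3],
    [2, 1, 2, 1],
    [4, 3, 4, 2],
    [1, 1, 2, 4],
]

# Total score for each person (sum across all items)
totals = [sum(person) for person in responses]

# Correlate each item's scores with the total scores
item_total = []
for i in range(len(responses[0])):
    item_scores = [person[i] for person in responses]
    item_total.append(pearson_r(item_scores, totals))

for i, r in enumerate(item_total, start=1):
    print(f"Item {i}: r = {r:.2f}")
```

Here the total score includes the item being checked, as in the description above. A common variant, usually called the corrected item-total correlation, leaves the item out of the total before correlating.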

A second way to check for internal consistency is to compute what is called the split-half correlation.  This is the correlation between two scores, one based on one half of the items and the other based on the other half of the items.  Imagine that 100 people have taken the Rosenberg Self-Esteem Scale.  You could compute two self-esteem scores for each person: one based on Items 1, 3, 5, 7, and 9, and the other based on Items 2, 4, 6, 8, and 10.  Then you could compute the correlation between these two sets of scores.  Again, it should be fairly strong and positive.
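The odd-even split described above can be sketched in Python as follows. The responses are invented for six people (not real Rosenberg Self-Esteem Scale data), and the names pearson_r, odd_scores, and even_scores are chosen for this example only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: each row is one person's responses to Items 1-10 (1-4 scale)
responses = [
    [4, 4, 3, 4, 4, 3, 4, 4, 3, 4],
    [3, 3, 3, 2, 3, 3, 4, 3, 3, 3],
    [2, 1, 2, 2, 1, 2, 2, 2, 1, 2],
    [1, 2, 1, 1, 2, 1, 1, 1, 2, 1],
    [3, 4, 4, 3, 3, 4, 3, 4, 4, 3],
    [2, 2, 3, 2, 2, 2, 3, 2, 2, 3],
]

# Items 1, 3, 5, 7, 9 sit at list positions 0, 2, 4, 6, 8
odd_scores = [sum(person[0::2]) for person in responses]
# Items 2, 4, 6, 8, 10 sit at list positions 1, 3, 5, 7, 9
even_scores = [sum(person[1::2]) for person in responses]

split_half_r = pearson_r(odd_scores, even_scores)
print(f"split-half r = {split_half_r:.2f}")
```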

A final way to check for internal consistency is to compute Cronbach’s alpha, which is the statistic that is most often presented in research reports. You would do this using a computer, of course, but conceptually Cronbach’s alpha is the mean split-half correlation for all possible ways of splitting the items in half. Note that you could split the items on a 10-item measure into the even items and the odd items, the first half (Items 1–5) and the second half (Items 6–10), or even Items 1, 3, 4, 9, and 10 vs. Items 2, 5, 6, 7, and 8 … and so on. If you were to split the items in each of these ways, compute the split-half correlation for each one, and take the mean of these split-half correlations, you would have, conceptually, Cronbach’s alpha. (By the way, Lee J. Cronbach was an undergraduate at Fresno State and we have an undergraduate Cronbach Scholarship that goes in alternate years to a psychology or math student interested in measurement.)
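In practice, statistical software does not enumerate every possible split. It usually computes alpha from the item variances and the variance of the total scores. Here is a minimal sketch in Python, assuming the standard variance-based formula: alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The formula is not given above, the data are invented, and the function name cronbach_alpha is chosen for this example.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha from a list of rows (one row per person, one column per item).

    Assumes the standard formula:
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])  # number of items
    # Sample variance of each item across people
    item_variances = [
        variance([person[i] for person in responses]) for i in range(k)
    ]
    # Sample variance of the total scores
    total_variance = variance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: 5 people, 3 items, 1-4 scale
responses = [
    [4, 3, 4],
    [3, 3, 3],
    [2, 2, 1],
    [1, 2, 2],
    [4, 4, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

As with the other measures of internal consistency, higher values indicate that the items hang together better.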

Some psychological tests have what are called sub-scales. These are, in essence, separate tests that are combined into one, where each test measures a different construct or a different aspect of the same construct. For example, researchers have identified two components of test anxiety. The first is autonomic nervous system arousal or “nervous feelings” (fast heartbeat, muscle tension), and the second is negative thoughts (e.g., “I’m gonna fail, I’m gonna fail”). Furthermore, they have discovered that these two components seem to be independent of each other. It is possible to have nervous feelings without the negative thoughts, and it is possible to have negative thoughts without the nervous feelings. So a good measure of test anxiety contains two sets of items: one to measure nervous feelings and the other to measure negative thoughts. Note that what is important here is the internal consistency of each sub-scale. You would not want to measure internal consistency across all items because you would not expect them all to be related to each other.

Why Does Reliability Matter?

The main reason that reliability matters is that a measure that is not reliable cannot be valid.  You can think of reliability as being a prerequisite for validity.  For example, if a self-esteem test gives very different scores for the same person under essentially the same conditions, then we are not very well justified in taking either of those scores as a measure of the person’s self-esteem.  Similarly, if the items on a self-esteem test are not correlated with each other, then they cannot all be measuring self-esteem, and an aggregate of them cannot be a very good measure of self-esteem.  There are other reasons that reliability matters, but they can wait until you take an advanced measurement course.