Reliability of Measurement




Reliability is, in essence, the extent to which a measure is “consistent with itself.”  There are two important aspects of reliability.  The first is the extent to which a measure produces the same score for the same case under the same conditions.  We will call this test-retest reliability.  The second is the extent to which responses to the individual items in a multiple-item measure are consistent with each other.  We will call this internal consistency.


Test-Retest Reliability


If you take an IQ test today, and then you take it again next week, you would expect your scores to be quite similar.  This is because we think of IQ as a relatively stable characteristic of a person.  If you scored very differently when you took the test the second time, you would rightly wonder about the test.


In principle, it is very easy to assess test-retest reliability.  Just use the measure for a bunch of cases today and then use it on the same cases again in a day or week or year.  Then compute the correlation between the two sets of scores.  A high correlation (say .80 or above) indicates pretty good test-retest reliability.  (A scatterplot for such a correlation would have most of the points falling pretty close to a single straight line.)
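
The computation described above can be sketched in a few lines of Python.  This is only an illustration: the scores are made up, and the helper name pearson_r is a hypothetical one chosen for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical (made-up) scores for six people tested twice, one week apart.
time1 = [98, 105, 110, 121, 93, 130]
time2 = [101, 103, 112, 118, 95, 127]

r = pearson_r(time1, time2)
print(f"test-retest r = {r:.2f}")
```

With scores like these, where each person lands close to the same place both times, the correlation comes out well above the .80 rule of thumb.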


In practice, however, this approach does not always work.  Consider that people taking the same personality test a second time might remember their original answers and give them again just to be consistent.  This would make the test appear to be more reliable than it actually is.  It is also the case that we would expect some variables to change between the first and second measurements.  If I measure the mood of everyone in our class today and then I do the same thing next week, the correlation might be low because people’s moods have changed, not because there is anything wrong with the measure.  There are some ways to deal with these problems, but you will have to learn about them in the upper-division measurement course (e.g., Psych 149).


Internal Consistency


If you take a 10-item self-esteem test, and you have high self-esteem, then you should tend to give “high self-esteem responses” to all 10 items.  If you have low self-esteem, then you should tend to give “low self-esteem responses” to all 10 items.  In general, people’s responses to the different items on a multiple-item measure should be positively correlated with each other.  If people’s responses to Item 3 are completely unrelated to their responses to Item 6, then it does not seem reasonable to think that these two items are both measuring self-esteem … so they probably should not be on the same test.


One simple way to check for internal consistency is to look at the item-total correlations.  These are the correlations between the individual items and the total score.  That is, you can compute the correlation between Item 1 and the total score, between Item 2 and the total score, and so on.  If the measure is internally consistent, then these correlations should all be positive.  Because each item on a self-esteem test is there because it supposedly measures self-esteem, each item should be positively correlated with the total score.  If a lot of items have low or negative correlations with the total score, then this would indicate poor internal consistency.
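
Here is a minimal sketch of item-total correlations in Python.  The response data are invented for illustration (four items, six people), and pearson_r is a hypothetical helper name.  Note one assumption: each item is included in the total it is correlated with, which is the simple version described above.  A "corrected" version would leave the item out of the total first.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: each row is one person, each column is one item.
# Items 1-3 were made up to agree with each other; Item 4 was made up to disagree.
responses = [
    [4, 4, 3, 2],
    [1, 2, 1, 3],
    [3, 3, 4, 1],
    [2, 1, 2, 4],
    [4, 3, 4, 2],
    [1, 1, 2, 3],
]

totals = [sum(row) for row in responses]   # total score for each person
n_items = len(responses[0])

item_total = []
for j in range(n_items):
    item_scores = [row[j] for row in responses]   # everyone's answer to item j
    item_total.append(pearson_r(item_scores, totals))

for j, r in enumerate(item_total, start=1):
    flag = "  <- check this item" if r <= 0 else ""
    print(f"Item {j}: r = {r:.2f}{flag}")
```

In this made-up data set the first three items correlate positively with the total and the fourth correlates negatively, so the fourth is the one that would get a second look.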


Item-total correlations are interesting in part because your instructors (not just in psychology) will sometimes use them to identify (and maybe even throw out) poor test questions.  I have even seen cases where the correlation is negative.  That is, the students who scored well overall on the exam were actually less likely to answer a particular question correctly than the students who scored poorly.  I think this usually indicates that the poor students were just guessing (e.g., choosing the longest response, or choosing “c”) while the better students had just enough knowledge to be fooled into choosing some “trick” alternative.  Although I do not usually look at item-total correlations on exams, I would definitely throw out any items that had negative correlations.

A second way to check for internal consistency is to compute what is called the split-half reliability coefficient.  This is the correlation between two scores, one based on one half of the items and the other based on the other half of the items.  Imagine that 100 people have taken the Rosenberg self-esteem scale.  You could compute two self-esteem scores for each person: one based on Items 1, 3, 5, 7, and 9, and the other based on Items 2, 4, 6, 8, and 10.  Then you could compute the correlation between these two sets of scores.  Again, it should be fairly strong and positive.
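
The odd-even split just described might look like this in Python.  The responses below are invented (six people rather than 100, and not real Rosenberg data), and pearson_r is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical responses: one row per person, ten items each (Items 1-10).
responses = [
    [4, 4, 3, 4, 4, 3, 4, 4, 3, 4],
    [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
    [3, 3, 2, 3, 3, 3, 2, 3, 3, 3],
    [2, 2, 2, 1, 2, 2, 3, 2, 2, 1],
    [4, 3, 4, 4, 3, 4, 4, 3, 4, 4],
    [2, 1, 2, 2, 1, 2, 1, 2, 2, 2],
]

# Python counts from 0, so Items 1, 3, 5, 7, 9 are positions 0, 2, 4, 6, 8.
odd_scores = [sum(row[0::2]) for row in responses]    # Items 1, 3, 5, 7, 9
even_scores = [sum(row[1::2]) for row in responses]   # Items 2, 4, 6, 8, 10

split_half_r = pearson_r(odd_scores, even_scores)
print(f"split-half r = {split_half_r:.2f}")
```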


A final way to check for internal consistency is to compute Cronbach’s alpha, which is the statistic that is most often presented in research reports.  You would do this using a computer, of course, but conceptually Cronbach’s alpha is the mean split-half correlation for all possible ways of splitting the items in half.  Note that you could split the items on a 10-item measure into the even items and the odd items, the first half (Items 1–5) and the second half (Items 6–10), or even Items 1, 3, 4, 9, and 10 vs. Items 2, 5, 6, 7, and 8 … and so on.  For all possible ways of splitting the items in half, you could compute the split-half reliability coefficient, take the mean of all these correlations, and you would have, at least roughly, Cronbach’s alpha.
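
In practice, software does not usually enumerate every possible split.  A standard computational formula gets alpha directly from variances: alpha = (k / (k - 1)) x (1 - sum of the item variances / variance of the total scores), where k is the number of items.  The sketch below applies that formula to invented data.  The function name cronbach_alpha is a hypothetical one for this example, and sample variances (dividing by n - 1) are assumed throughout.

```python
from statistics import variance  # sample variance (divides by n - 1)

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows (one row per person, one column per item)."""
    k = len(rows[0])                                   # number of items
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])   # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: five people, four items that tend to agree.
responses = [
    [4, 4, 3, 4],
    [1, 2, 1, 1],
    [3, 3, 2, 3],
    [2, 2, 2, 1],
    [4, 3, 4, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

When the items rise and fall together across people, the variance of the total scores is large relative to the sum of the item variances, and alpha approaches 1.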


Some psychological tests have what are called sub-scales.  These are, in essence, separate tests that are combined into one, where each test measures a different construct or a different aspect of the same construct.  For example, researchers have identified two components of test anxiety.  The first is autonomic nervous system arousal or “nervous feelings” (fast heartbeat, muscle tension), and the second is negative thoughts (e.g., “I’m gonna fail, I’m gonna fail”).  Furthermore, they have discovered that these two components seem to be independent of each other.  It is possible to have nervous feelings without the negative thoughts, and it is possible to have the negative thoughts without having the nervous feelings.  So a good measure of test anxiety contains two sets of items, although they may be mixed together: one to measure nervous feelings and the other to measure negative thoughts.  Note that what is important here is the internal consistency of each sub-scale.  You would not want to measure internal consistency across all items because you would not expect them all to be related to each other.
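
To make the sub-scale point concrete, here is a sketch that computes alpha separately for each sub-scale and then, for comparison, across all items.  Everything here is an assumption made for illustration: the six-item test, the responses, the assignment of items to sub-scales, and the names cronbach_alpha and subscales.  Alpha is computed with the usual variance-based formula.

```python
from statistics import variance  # sample variance (divides by n - 1)

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows (one row per person, one column per item)."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical six-item test-anxiety measure with the two kinds of items mixed together.
# Positions 0, 2, 4 are "nervous feelings" items; positions 1, 3, 5 are "negative thoughts" items.
responses = [
    [4, 1, 4, 1, 3, 2],
    [4, 4, 3, 4, 4, 3],
    [1, 4, 1, 3, 2, 4],
    [1, 1, 2, 2, 1, 1],
    [3, 2, 3, 2, 3, 2],
    [2, 3, 2, 3, 2, 3],
]

subscales = {
    "nervous feelings": [0, 2, 4],
    "negative thoughts": [1, 3, 5],
}

alphas = {}
for name, positions in subscales.items():
    # Keep only this sub-scale's items for each person, then compute alpha on those.
    sub_rows = [[row[j] for j in positions] for row in responses]
    alphas[name] = cronbach_alpha(sub_rows)
    print(f"{name}: alpha = {alphas[name]:.2f}")

overall = cronbach_alpha(responses)
print(f"all items together: alpha = {overall:.2f}")
```

In this made-up data set each sub-scale is quite consistent on its own, while alpha across all six items comes out lower, because the two sets of items were constructed to be nearly unrelated to each other.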


Why Does Reliability Matter?


The main reason that reliability matters is that a measure that is not reliable cannot be valid.  You can think of reliability as being a prerequisite for validity.  For example, if a self-esteem test gives very different scores for the same person under essentially the same conditions, then we are not very well justified in taking either of those scores as a measure of the person’s self-esteem.  Similarly, if the items on a self-esteem test are not correlated with each other, then they cannot all be measuring self-esteem, and an aggregate of them cannot be a very good measure of self-esteem.  There are other reasons that reliability matters, but they can wait until you take Psych 149.