Statistical Relationships Between Variables

 

Overview

 

It is often interesting to know about people’s scores on individual variables: the percentage of people in the U.S. who are depressed, the average driving speed on Highway 41, the range of sizes of the normal human brain, and so on.  However, most psychological research concerns relationships between variables.  Of course, there are lots of ways in which variables can be related to each other.  For example, the variables “depression” and “anger” both involve negative emotions.  The variables “working memory capacity” and “vocabulary size” are both related to intellectual functioning.  But these are not the kind of relationships that concern us here.  Instead, we are interested in statistical relationships between variables.  In general, there is a statistical relationship between two variables when the average score on one variable differs reliably across the values or levels of the other variable.

 

So although the percentage of depressed people in the U.S. might be interesting, it might be even more interesting to study the relationship between depression (Variable 1) and where people live, specifically whether they live in urban or rural environments (Variable 2).  If the average depression score (Variable 1) differs across the two levels of where people live, then there is a relationship between these variables.  Similarly, although it might be interesting to know how large human brains are in general, it might be more interesting to know whether there is a relationship between brain size (Variable 1) and intelligence (Variable 2).  If the average intelligence score differs across people who have small, medium, and large brains, then there is a relationship between these variables.

 

There are three basic types of statistical relationship between variables.  The differences among them have to do mainly with whether the two variables involved are both categorical, both quantitative, or one of each.

 

Two Categorical Variables

 

Imagine that you are interested in the relationship between people’s sex (male or female) and whether or not they own a computer (yes or no).  Both of these variables are categorical.  Here there would be a statistical relationship if the percentage of people owning computers (a kind of average) were different for men than it is for women.  For example, if 60% of men own a computer, but only 40% of women own a computer, then this would constitute a statistical relationship between these two variables.  If there were no difference in the percentage of men owning a computer and the percentage of women owning a computer (e.g., both were 50%), then there would be no statistical relationship between the variables.  We might also say there is a null relationship between them.

 

Another way to display these data is in a contingency table. The contingency table below shows that a total of 100 men and 100 women were included in this particular study.  This sample of 200 people also turned out to include 100 computer owners and 100 non-computer owners.  Note that of the men, 60 owned computers and 40 did not.  (This is the 60% mentioned above.)  Of the women, 40 owned computers and 60 did not.  (This is the 40%.)

 

 

 

 

                    Sex
                    Male       Female
Computer    Yes     60         40         = 100 Yes
Owner?      No      40         60         = 100 No
                    = 100 M    = 100 F    = 200 Total

 

Contingency tables help make clear why we compare percentages to establish a statistical relationship rather than absolute numbers.  In other words, why not just note that a greater number of men (60) than women (40) own computers and leave it at that?  The answer is that such absolute numbers are misleading when there are different numbers of cases at the different levels of one or both variables.  Consider the contingency table below.  Here there are still 60 men and 40 women who own computers, but there are now 150 men total.  So although there are still more men who own computers, men are no more likely to own them.  For both men and women, the percentage that owns computers is 40%.  The greater number of male computer owners simply reflects a greater number of men in the sample.

 

 

 

                    Sex
                    Male       Female
Computer    Yes     60         40         = 100 Yes
Owner?      No      90         60         = 150 No
                    = 150 M    = 100 F    = 250 Total
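The comparison between the two contingency tables above can be checked with a few lines of Python.  This is only a sketch: the cell counts are copied from the tables, while the names table_1, table_2, and percent_owners are made up for this illustration.

```python
# Cell counts copied from the two contingency tables above.
# Layout (an assumption of this sketch): table[sex] = (owners, non_owners)
table_1 = {"male": (60, 40), "female": (40, 60)}
table_2 = {"male": (60, 90), "female": (40, 60)}


def percent_owners(table):
    """Return the percentage of computer owners within each sex."""
    result = {}
    for sex, (owners, non_owners) in table.items():
        result[sex] = 100 * owners / (owners + non_owners)
    return result


for name, table in [("first table", table_1), ("second table", table_2)]:
    print(name, percent_owners(table))
```

In both tables the raw count of male owners (60) is larger than the raw count of female owners (40), but only in the first table do the percentages differ (60% vs. 40%).  In the second table both percentages come out to 40%, which is the null relationship described above.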

 

 

All of these basic ideas can be generalized to situations in which one or both of the categorical variables has more than two levels.  The contingency table below shows a statistical relationship between students’ sex and their major (social science, natural science, and humanities).  Here there is a statistical relationship between the two variables because the percentage of male students differs across the three majors.  Specifically, the percentage of men in natural science is 60%, but the percentage of men in social science and humanities is 40%.

 

 

 

 

                    College Major
                    Natural Sci.    Social Sci.    Humanities
Sex     Male        90              80             40            = 210 M
        Female      60              120            60            = 240 F
                    = 150 NS        = 200 SS       = 100 H       = 450 Total
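The same percentage logic extends to a variable with three levels.  The sketch below recomputes the percentage of men in each major from the counts in the table above; the variable names are illustrative, not part of any standard analysis.

```python
# Counts copied from the table above: counts[major] = (men, women)
counts = {
    "natural science": (90, 60),
    "social science": (80, 120),
    "humanities": (40, 60),
}

# Percentage of students in each major who are men.
percent_male = {
    major: 100 * men / (men + women) for major, (men, women) in counts.items()
}

# The percentages differ across majors if they are not all the same value.
percentages_differ = len(set(percent_male.values())) > 1

print(percent_male)
print("Percentages differ across majors:", percentages_differ)
```

Note that this only checks whether the percentages differ in this particular table.  Whether the difference is reliable is a separate question that this sketch does not address.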

 

 

One Categorical and One Quantitative Variable

 

Note that the levels of a categorical variable define different “groups.”  So if we measure the sex of each person in a sample, we will have a male group and a female group.  If we measure the political party preference of each person in a sample, we will have a Democratic group, a Republican group, a Green group, and so on.  When we have one categorical variable and one quantitative variable, therefore, there is an easy way to think about differences in the average score for one variable across levels of the other variable.  Specifically, we can think about differences in the mean score of the quantitative variable across the different groups.

 

For example, the question of whether there is a relationship between sex (male vs. female) and self-rated happiness (e.g., on a 10-point scale) is just the question of whether men and women differ in their mean happiness rating.  If the mean happiness rating for men were 6.03 and the mean happiness rating for women were 8.25, then there would be a statistical relationship between the two variables.  If the two mean happiness ratings were roughly the same, then there would be no statistical relationship (i.e., a null relationship).
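As a minimal sketch of this idea, the Python below computes the mean of a quantitative variable within each group defined by a categorical variable.  The ratings are made up purely for illustration (they are not the 6.03 and 8.25 mentioned above), and group_means is a hypothetical helper name.

```python
from statistics import mean

# Made-up (sex, happiness rating) pairs, for illustration only.
ratings = [
    ("male", 5), ("male", 6), ("male", 7),
    ("female", 8), ("female", 9), ("female", 7),
]


def group_means(cases):
    """Group scores by category, then return the mean score of each group."""
    groups = {}
    for group, score in cases:
        groups.setdefault(group, []).append(score)
    return {group: mean(scores) for group, scores in groups.items()}


print(group_means(ratings))
```

With these invented numbers the two group means differ (6 vs. 8).  Deciding whether such a difference is reliable, rather than a fluke of the sample, takes more than comparing two means.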

 

This kind of statistical relationship is often represented using a bar graph, where the x-axis represents the groups (i.e., the categorical variable) and the y-axis represents the quantitative variable.  For each group, there is a bar that represents the mean score for that group on the quantitative variable.  Here are two examples.  Note that the first shows a relationship between sex and happiness, but the second shows no relationship between political party preference and happiness.

 

 

Two Quantitative Variables

 

If at least one of your two quantitative variables is discrete and has a small number of values, then you can check for a statistical relationship between them by finding the average score on one variable at each level of the other.  Imagine that you are interested in the relationship between the number of close friends people have and their blood pressure.  The number of friends is a discrete quantitative variable with a small number of values: 0, 1, 2, and so on.  It is unlikely to be greater than 8 or 10.  Here you could compute the average blood pressure of people with each possible number of friends and compare.  To make this easier, you would probably also plot the means in a line graph.
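The blood pressure example might be sketched as follows.  The data are invented for illustration, and the output is simply the list of points (number of friends, mean blood pressure) that one would plot in a line graph.

```python
from statistics import mean

# Made-up (number of close friends, systolic blood pressure) pairs.
cases = [
    (0, 130), (0, 134),
    (1, 128), (1, 126),
    (2, 122), (2, 124),
    (3, 118), (3, 120),
]

# Collect the blood pressure readings at each level of "number of friends".
readings_by_level = {}
for friends, pressure in cases:
    readings_by_level.setdefault(friends, []).append(pressure)

# One point per level, in order of increasing number of friends.
line_points = [
    (friends, mean(readings_by_level[friends]))
    for friends in sorted(readings_by_level)
]

print(line_points)
```

Each point stands for a group of cases, not for one person, which is what makes this a line graph of means.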

 

 

When both quantitative variables have many different values (say, more than 10), we do not usually find the average score on one variable for each level of the other.  This is too cumbersome.  Instead, we check for a relationship by plotting the data in a scatterplot.  The important difference between a scatterplot and a line graph is that whereas the points in a line graph represent the average score for a group of cases, the points in a scatterplot represent individual cases.  The scatterplot below shows the relationship between students’ scores on an exam and the amount of time they spent studying for the exam.

                                                                                                                                             

 

With relationships between quantitative variables, we are usually looking for certain kinds of patterns.  One pattern is a positive relationship, in which higher scores on one variable are associated with higher scores on the other.  The second example above (the scatterplot) shows a positive relationship.  Another pattern is a negative relationship, in which higher scores on one variable are associated with lower scores on the other.  The first example above (the line graph) shows a negative relationship.  Both of these relationships are roughly linear.  That is, the points fall roughly along a straight line.  Often, however, relationships between quantitative variables are non-linear.  The general relationship between stress and task performance is often conceptualized as non-linear.  As stress increases from none to a moderate amount, task performance increases, but as stress increases from moderate to extreme, task performance decreases.  Plotted as a line graph or scatterplot, this relationship would look like an upside-down U.
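One rough way to see these three patterns in numbers is to sort the cases by one variable, split them into low, middle, and high thirds, and compare the mean of the other variable across the thirds.  This is only an illustrative sketch, not a standard procedure: the data sets are invented, and means_by_third is a hypothetical helper that assumes the number of cases is a multiple of three.

```python
from statistics import mean


def means_by_third(pairs):
    """Sort (x, y) cases by x, split into thirds, return mean y per third."""
    ordered = sorted(pairs)
    size = len(ordered) // 3
    thirds = [ordered[:size], ordered[size:2 * size], ordered[2 * size:]]
    return [mean([y for _, y in third]) for third in thirds]


# Invented data sets, one for each pattern described above.
positive = [(1, 52), (2, 58), (3, 61), (4, 70), (5, 74), (6, 85)]
negative = [(1, 90), (2, 84), (3, 75), (4, 71), (5, 60), (6, 58)]
inverted_u = [(1, 40), (2, 60), (3, 80), (4, 82), (5, 61), (6, 39)]

for name, data in [("positive", positive),
                   ("negative", negative),
                   ("inverted U", inverted_u)]:
    print(name, means_by_third(data))
```

For the positive data the three means rise, for the negative data they fall, and for the inverted-U data the middle mean is the highest while the low and high thirds have the same mean.  That last case shows why comparing only the lowest and highest scorers can miss a non-linear relationship entirely.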

 

A Few Last Points About Statistical Relationships

 

Underlying similarity.  Although these three types of statistical relationships might seem quite different from each other on the surface, they are essentially the same.  They are all reliable differences in the average score of one variable across levels of a second variable.  Here are two additional ways to see this.  The first is to note that we can talk about a relationship between two variables (say, television ownership and party going) without even specifying whether the variables are measured categorically or quantitatively.  For example, we might hypothesize that there is a relationship between television ownership and party going, so that owning TVs (vs. not owning them) is associated with going to fewer parties.  But depending on how the variables are operationally defined, this could be a relationship between two categorical variables (TV ownership [yes vs. no] and party attendance [yes vs. no]), between a categorical and a quantitative variable (TV ownership [yes vs. no] and number of parties attended), or between two quantitative variables (the number of TVs a person owns and the number of parties he or she attends).  So the nature of the relationship is really the same regardless of which basic type of relationship it is.  The second way to see the underlying similarity of the three types of relationships is to note that the data for any of the types can be presented using any of the four types of graphical displays discussed: contingency tables, bar graphs, line graphs, and scatterplots.  To be sure, plotting the data for two categorical variables in a scatterplot produces a weird-looking scatterplot, but it can be done.  I will leave it as an exercise to the reader….

 

Relationships as comparisons.  Another way to understand statistical relationships is to see that they involve making a comparison.  We compare the men and women in terms of their computer ownership.  We compare the percentages of men and women in different college majors.  We compare the blood pressure of people with different numbers of friends.  We compare ….  You get the idea.

 

What is a “correlation”?  Another name for a statistical relationship is a correlation, and all three types discussed here can rightly be called correlations.  For example, it is perfectly fine to talk about a “correlation between sex and computer ownership.”  In practice, however, relationships between two quantitative variables are more likely to be called correlations.  The others are more likely to be called differences between groups (or “group differences”) … but this does not change the fact that they are the same thing.

 

Bidirectionality.  Statistical relationships are always bidirectional.  If the percentage of computer owners differs between men and women, then the percentage of men differs between computer owners and non-owners.  If people with more friends tend to have lower blood pressure, then people with lower blood pressure tend to have more friends.
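The computer ownership case can be verified directly from the first contingency table in this section (60 and 40 owners, 40 and 60 non-owners).  The sketch below reads the same four cell counts in both directions; the variable names are illustrative.

```python
# Cell counts from the first contingency table in this section.
men_owners, women_owners = 60, 40
men_non_owners, women_non_owners = 40, 60

# Direction 1: percentage of computer owners within each sex.
pct_owners_among_men = 100 * men_owners / (men_owners + men_non_owners)
pct_owners_among_women = 100 * women_owners / (women_owners + women_non_owners)

# Direction 2: percentage of men within each ownership group.
pct_men_among_owners = 100 * men_owners / (men_owners + women_owners)
pct_men_among_non_owners = (
    100 * men_non_owners / (men_non_owners + women_non_owners)
)

print("Owners among men vs. women:",
      pct_owners_among_men, pct_owners_among_women)
print("Men among owners vs. non-owners:",
      pct_men_among_owners, pct_men_among_non_owners)
```

Both directions show a difference of 60% vs. 40%.  The two pairs of percentages match exactly here only because this table has equal row and column totals; in general the percentages will be different numbers in the two directions, but if they differ in one direction they differ in the other.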

 

What vs. why.  To establish that there is a statistical relationship between two variables is to establish a simple statistical fact.  It tells us what is the case.  However, it does not tell us why the statistical relationship exists.  Let us say that we establish that a greater percentage of men than women own computers.  This is a statistical fact … but explaining it is something else.  It could be that men tend to be more “into” technology.  It could be that men tend to have more money.  It could be that men tend to have jobs that require computers.  We will say more about this when we learn about theories.