Review of correlation & introduction
to regression
Correlation deals with the concept of relationship.
A correlation coefficient is used to DESCRIBE a relationship
between two variables.
The sign of the correlation coefficient determines whether
a relationship is described by a regression line of positive or negative slope.
The correlation coefficient is really a sort of average (the geometric mean,
carrying the sign the two slopes share) of the slopes of two regression lines
the best fitting line when Y is predicted from X
and the best fitting line when X is predicted from Y
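A minimal Python sketch of this idea follows. It assumes the "average" meant here is the geometric mean of the two raw-score slopes, which is the standard algebraic result. The data and the helper names (mean, slope, pearson_r) are hypothetical and chosen only for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def slope(predictor, criterion):
    """Least-squares slope of the line predicting criterion from predictor."""
    mp, mc = mean(predictor), mean(criterion)
    sp = sum((p - mp) * (c - mc) for p, c in zip(predictor, criterion))
    ss = sum((p - mp) ** 2 for p in predictor)
    return sp / ss

def pearson_r(x, y):
    """Pearson correlation coefficient of x and y."""
    mx, my = mean(x), mean(y)
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 3, 6, 7, 9]

b_yx = slope(x, y)   # slope when Y is predicted from X
b_xy = slope(y, x)   # slope when X is predicted from Y
r = pearson_r(x, y)

# Geometric mean of the two slopes, given the sign the slopes share
geometric_mean = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(f"slope Y from X = {b_yx:.4f}")
print(f"slope X from Y = {b_xy:.4f}")
print(f"geometric mean = {geometric_mean:.4f}")
print(f"r              = {r:.4f}")
```

The two slopes differ from each other whenever X and Y have different spreads, yet their geometric mean reproduces r.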
Description, as desirable as it can be, may not be sufficient
for some research.
For example, INFERENCES are frequently a goal of research.
While we can make inferences with correlation, correlation
does not imply causation
Inferences can be made as to whether or not there is support
that a true correlation exists or if the null hypothesis should be retained.
However, there is considerable danger in using correlational
outcomes as the basis of many inferences.
Correlation is simply a measure of the strength of a
relationship between variables.
Thus, one should not even think about CAUSATION when
presenting correlational results.
A common mistake among students is yielding to the temptation
to take a correlational result and either assert or imply causation.
Regression is no better suited to determining causation
than correlation if the research methods of data collection do not warrant
it.
Students, and for that matter,
many who should know better, also often forget that it is not the statistical
test that establishes causation, but the rigor of the experimental
methodology.
what is the predicted Final exam for an Exam score of 5?
let's magnify the graph for this...
extend a perpendicular line from 5 on the x-axis to
the regression line
extend a perpendicular line to the y-axis from this
point on the regression line
we can see that the predicted (y) Final from the (x)
Exam is 5.9 as calculated above
please note that the actual Y score was 6 for the X
score of 5
The difference between B1 and b1 requires explanation
B1 is the slope expressed in raw scores
this has been the focus until now
recall that the (geometric) average of the slopes from the line
predicting Y from X and the line predicting X from Y represents our correlation
coefficient
In contrast, b1
is the slope expressed in standard scores
the slopes of the line predicting Y from X and the line
predicting X from Y are the same
in this instance, the correlation is equal to the slope
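The contrast between B1 and b1 can be checked numerically. The sketch below is illustrative only: the data are hypothetical, the helper names (z_scores, slope) are not from the course materials, and the sample standard deviation (n - 1) is an assumed choice. The population standard deviation gives the same standardized slope.

```python
import math

def z_scores(values):
    """Convert raw scores to standard scores (sample SD, n - 1)."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

def slope(predictor, criterion):
    """Least-squares slope of the line predicting criterion from predictor."""
    mp = sum(predictor) / len(predictor)
    mc = sum(criterion) / len(criterion)
    sp = sum((p - mp) * (c - mc) for p, c in zip(predictor, criterion))
    ss = sum((p - mp) ** 2 for p in predictor)
    return sp / ss

# Hypothetical raw scores, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 3, 6, 7, 9]
zx, zy = z_scores(x), z_scores(y)

# Pearson r computed from the standard scores
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

raw_yx, raw_xy = slope(x, y), slope(y, x)      # B1: raw-score slopes
std_yx, std_xy = slope(zx, zy), slope(zy, zx)  # b1: standard-score slopes

print(f"B1 (Y from X) = {raw_yx:.4f}, B1 (X from Y) = {raw_xy:.4f}")
print(f"b1 (Y from X) = {std_yx:.4f}, b1 (X from Y) = {std_xy:.4f}")
print(f"r             = {r:.4f}")
```

In raw scores the two slopes depend on the units of X and Y. Once both variables are standardized, the units drop out and both slopes collapse to r.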
Regression toward the mean
how many times have you heard the expression "regression
toward the mean" but not really understood what was being said?
if you understand the concept of regression toward the mean,
what keeps us all from regressing to mediocrity?
let's take the concept of regression toward the mean to
make sure we understand it
now that you know how to use regression equations to
predict, the mathematical part of regression toward the mean should be
easy to grasp
unless b1 is equal to either 1 or -1, we will make errors in prediction
further, any other value of b1 will lie between -1 and 1, that is, closer
to 0.000
let's say for 2 different tests that measure the same
thing we have a value of b1 that is fairly
high, such as 0.800
let's also say that we have a z score of 2.00 on one
of the tests
we predict that the second test score will be
z times b1 (2.00 x 0.800),
or a z of 1.60
clearly a z of 1.60 is less deviant, or closer to the
mean z = 0.00 than a z of 2.00
demonstrate to yourself that when we predict a score
from one that is greatly below the mean, we see the mirror-image trend
- the greatly negatively deviant value will predict one that is closer
to the mean
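The arithmetic above can be sketched in a few lines of Python. The function name predict_z is hypothetical, and the only values taken from the notes are b1 = 0.800 and z = 2.00. The other z scores are added for illustration.

```python
def predict_z(z_x, b1):
    """Predicted standard score on the second test: z times b1."""
    return b1 * z_x

b1 = 0.800  # standardized slope from the example in the notes

for z in (2.00, -2.00, 0.50, -0.50):
    z_pred = predict_z(z, b1)
    print(f"observed z = {z:+.2f} -> predicted z = {z_pred:+.2f}")
```

Whenever b1 lies strictly between -1 and 1, the predicted z is smaller in absolute value than the observed z, so every prediction sits closer to the mean of 0.00 than the score it was made from.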
have you seen the explanation on page 202 of the text?
perhaps this will help
So why do we not lapse into total regression to mediocrity
on all variables?
the mean is often a terribly seductive concept that
keeps us from thinking about what is really represented by this statistic
in other words, the mean is only a single value, with
several values both above and below the mean - or as we often say,
variability (around the mean)
when we predict Y from X, our best prediction is the mean
of Y for that value of X
however, we have dispersion or variability about the Y mean,
just as we have in any distribution
the following figure illustrates this variability
let us also assume that we have hundreds or even thousands
of people who have each been measured on both X and Y, that the correlation
is 0.700, and that the best fitting regression line is in red
let us focus for a moment on X = 2, Y = 4
note that the predicted Y for X = 2 is Y = 4
note also that the regression line goes through the mean
of the distribution with mean = 4
we can also see that there are many other values represented
in this distribution
look at the predicted Ys for X = 6 and X = 10
we see the same type of distribution about the predicted
Y values and the regression line goes through the mean of each predicted Y
thus, for any given X we are actually predicting a mean
Y score; that is, the Y we predict is a summary score for all individuals with
this given X
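A small simulation can make the same point without the figure. The sketch below is not the data behind the illustration: it assumes standard scores, a population correlation of 0.700, and a hypothetical sample of 100,000 people, and it groups together everyone whose X falls within 0.1 of a chosen value (the helper y_near is made up for this purpose).

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

R = 0.700        # correlation from the notes
N = 100_000      # hypothetical number of people

# Standard-score X and Y built to have population correlation R
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [R * x + math.sqrt(1 - R ** 2) * random.gauss(0.0, 1.0) for x in xs]

def y_near(x_value, half_width=0.1):
    """Y scores of everyone whose X lies within half_width of x_value."""
    return [y for x, y in zip(xs, ys) if abs(x - x_value) <= half_width]

for x_value in (-1.0, 0.0, 1.0):
    group = y_near(x_value)
    print(f"X near {x_value:+.1f}: n = {len(group)}, "
          f"predicted Y = {R * x_value:+.2f}, "
          f"mean Y = {statistics.mean(group):+.2f}, "
          f"SD of Y = {statistics.stdev(group):.2f}")
```

For each X the group mean of Y should land close to the predicted Y, while the standard deviation within the group stays well above zero. Individuals scatter around the prediction, which is why not everyone regresses to mediocrity.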
A real problem
let us assume that we have 13 individuals, each of whom
has an ordered pair
the X value in each ordered pair is a Quiz score
the Y score is a Mid-term exam score (incidentally,
these scores are from a previous offering of this course)
we could use the PEARSON (or CORREL) function to determine
the correlation coefficient
if we use the FUNCTION WIZARD, does it make a difference
whether we put the X or Y scores in Array 1 or Array 2?
what is the correlation coefficient?
what does this coefficient represent besides the index
of association between X & Y?
how many degrees of freedom do we have in this problem?
is our r = 0.776 statistically significant
for p < 0.05? for p < 0.01? for p < 0.001?
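One way to check your answers to the last two questions is sketched below. It rests on assumptions not stated in the notes: a two-tailed test, df = n - 2 for a Pearson r, the conversion of r to a t statistic, and critical t values for df = 11 copied from a standard t table. Only r = 0.776 and n = 13 come from the problem itself.

```python
import math

r = 0.776    # correlation reported in the notes
n = 13       # number of ordered pairs in the notes
df = n - 2   # degrees of freedom for testing a Pearson r (assumed rule)

# Convert r to a t statistic with df degrees of freedom
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Two-tailed critical t values for df = 11, from a standard t table
critical_t = {0.05: 2.201, 0.01: 3.106, 0.001: 4.437}

significant = {alpha: abs(t) > crit for alpha, crit in critical_t.items()}

print(f"df = {df}, t = {t:.2f}")
for alpha, crit in critical_t.items():
    verdict = "significant" if significant[alpha] else "not significant"
    print(f"p < {alpha}: critical t = {crit}, {verdict}")
```

A table of critical values of r for df = 11 leads to the same decisions without the conversion to t.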
how do we easily plot the data and best fitting regression
line?
in Excel, on the Tools menu, the bottommost selection
is Data Analysis...
after selecting Data Analysis..., a pop-up will appear
with the heading Data Analysis
scroll down and select Regression and this will activate
a WIZARD to help
there are four horizontal panels in this WIZARD:
"Input"
"Output Options"
"Residuals"
"Normal probability"
we will deal with the Input and Residuals
supply the WIZARD with the Y array and X array row
& column references
in the "Residuals" panel, select the box
"Line fit plots"
click OK
Excel will automatically add a new worksheet to your
workbook UNLESS YOU SPECIFY A CELL REFERENCE ON THE CURRENT WORKSHEET
IN THE "Output Options" panel of the regression pop-up
in this new worksheet will be your data plot (automatically)
with regression line
note that there are also four separate tables
1. "Regression Statistics": note that the
r (labeled Multiple R) is the top entry
2. "ANOVA": this is an ANOVA source table
3. the Parameter estimates for the Y intercept and
slope (B1)
4. "Residual Output": the predicted Y for
each actual Y and the residual
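For readers who want to see where the numbers in these four tables come from, here is a rough Python sketch of the same quantities for a one-predictor regression. It is not Excel's implementation, the data are hypothetical, and the variable names are chosen only for illustration.

```python
import math

# Hypothetical (X, Y) pairs, for illustration only
x = [2, 4, 5, 7, 8, 10]
y = [3, 4, 6, 7, 9, 10]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_x = sum((a - mean_x) ** 2 for a in x)
ss_total = sum((b - mean_y) ** 2 for b in y)

# Table 3: parameter estimates
slope_b1 = sp / ss_x
intercept = mean_y - slope_b1 * mean_x

# Table 4: predicted Y and residual for each actual Y
predicted = [intercept + slope_b1 * a for a in x]
residuals = [b - p for b, p in zip(y, predicted)]

# Table 2: ANOVA source table
ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum(e ** 2 for e in residuals)
df_regression, df_residual = 1, n - 2
f_ratio = (ss_regression / df_regression) / (ss_residual / df_residual)

# Table 1: regression statistics
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)

print(f"Multiple R = {multiple_r:.4f}, R Square = {r_square:.4f}")
print(f"SS regression = {ss_regression:.4f} (df = {df_regression})")
print(f"SS residual   = {ss_residual:.4f} (df = {df_residual}), F = {f_ratio:.2f}")
print(f"intercept = {intercept:.4f}, slope (B1) = {slope_b1:.4f}")
for a, b, p, e in zip(x, y, predicted, residuals):
    print(f"X = {a:>2}, Y = {b:>2}, predicted Y = {p:.3f}, residual = {e:+.3f}")
```

Multiple R is computed as a square root, so it is never negative. With a single predictor it equals the absolute value of r.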
let's now write the equation for the regression line in
the form:
Read the exam instructions, the exam text, and the exam
data workbook and bring any questions to class on 10/11
Read De Veaux Chapters in this order 12, 13, 11, 14, 15
Chapter 8 problems 16, 18, 20, 22, 24, 30, 32, 39, 40. Be
sure, for any textbook question that asks for a simple yes/no answer or
only a couple of words of response, that you EXPLAIN WHY.
Additional Problems
A. Perform a regression on the class data to look at the
prediction of weight from height
1) what is the Multiple R?
2) what is r?
3) how many degrees of freedom (df) are there
in this problem?
4) how much shared variance is there between weight
and height?
5) what is the significance of Multiple R?
6) show the line fit plot, including the line of best
fit - remember to label graph appropriately
7) what is the equation of the line of best fit?
8) what would we predict the weight to be for someone
who is 69 inches tall?
9) how should we interpret the results of this regression
problem?
B. Perform a regression on the class data to look at the
prediction of weight (in z-scores) from height (in z-scores)
1) what is the Multiple R?
2) what is r?
3) what relationship do you see between the slope (coefficient)
and Multiple R? why?
4) what relationship do you see between the slope (coefficient)
and r? why?
C. Perform a regression on the class data to look at the
prediction of height from weight
1) what is the Multiple R?
2) what is r?
3) how many degrees of freedom (df) are there
in this problem?
4) how much shared variance is there between height
and weight?
5) what is the significance of Multiple R?
6) show the line fit plot, including the line of best
fit - remember to label graph appropriately
7) what is the equation of the line of best fit?
8) how does this equation compare with your answer in
A.7? why?
9) should we try to predict the height for someone with
a weight of 102 pounds? why or why not?
10) how do we determine whether we should examine height
as a function of weight in this problem or weight as a function of height
as in the Additional Problem A?