
This R guide covers material from Chapter 3 of “Forecasting, Time Series, and Regression” by Bowerman, O’Connell, and Koehler, titled Simple Linear Regression.



In Chapter 3, we began by discussing simple linear regression models. Simple linear regression examines the relationship between an independent variable, x, and a dependent variable, y, by looking at a scatterplot of the data and approximating a straight line through the points. The equation for the model is:

\[ y = \mu_{y|x} + \epsilon = \beta_0 + \beta_1 x + \epsilon \]

In this equation, \(\mu_{y|x}\) is the mean value of the dependent variable y when the independent variable takes the value x. \(\beta_0\) is the y-intercept, or the mean value of y when x equals 0. \(\beta_1\) is the slope, the change in the mean value of y with every one-unit increase in x. \(\epsilon\) is the error term that accounts for factors other than x that affect y.
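To make the roles of the y-intercept, the slope, and the error term concrete, here is a minimal sketch that simulates data from a simple linear regression model. The parameter values (an intercept of 2, a slope of 0.5, an error standard deviation of 1) and all variable names are made up for illustration and do not come from the textbook or from the data used later.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(1)
beta0 <- 2      # y-intercept: mean of y when x = 0
beta1 <- 0.5    # slope: change in the mean of y per one-unit increase in x

x <- seq(0, 10, length.out = 50)

# Mean value of y at each x (the straight-line part of the model)
mu_y_given_x <- beta0 + beta1 * x

# Error term: everything other than x that affects y
epsilon <- rnorm(length(x), mean = 0, sd = 1)

# Observed response = mean + error
y <- mu_y_given_x + epsilon
```

Plotting x against y from this sketch would show points scattered around a straight line, which is the pattern simple linear regression tries to recover.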

Our Example

To create this model in R, we must first select a data set. We are going to use a data set that lists the speeds of cars (mph) and the distances (ft) taken to stop. We call up this data and tell R we want to use it by using the data and attach commands as follows:

data(cars)
attach(cars)

Exploratory Data Analysis

If we want to see our variable names and a snippet of the data in the set, we use the head() command with the name of our data as the argument. We can use the summary() function to see a short analysis of our data, including the min, max, mean, median, etc. of our variables. We can use this to get an idea of the range of our data points.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00

In this example, both of our variables are quantitative variables because they are defined using numbers rather than words.
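One quick way to confirm that both variables are stored as numbers, and to see how many observations there are, is sketched below. It assumes the built-in cars data set from R's datasets package; the variable names all_numeric and n_obs are illustrative.

```r
# Load the built-in data set (assumed to be datasets::cars)
data(cars)

# TRUE for every column that R stores as a number
all_numeric <- sapply(cars, is.numeric)
print(all_numeric)

# Number of rows (observations) in the data set
n_obs <- nrow(cars)
print(n_obs)
```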

Creating Scatterplots

We are now going to create a scatterplot of the speed of cars vs. their stopping distances to see the relationship between these two variables. In this example, we will say that the stopping distance depends on the speed, so the stopping distance is the dependent variable (y) and the speed is the independent variable (x). To create a scatterplot, we use the plot function followed by the (predictor variable, response variable). We can also label our axes using the ylab and xlab specifiers and put a title on the graph using the main specifier, as follows:

plot(speed, dist, ylab = "Car Stopping Distance (ft)", xlab = "Speed (mph)", main = "Car Speed vs. Stopping Distance")


Adding the Line of Best Fit

We now want to find the line that describes the relationship between our two variables, the line described by the simple linear regression model. We create this regression line using the lm function with a formula of the form response ~ predictor, as follows:

mymod <- lm(dist ~ speed)
mymod
##
## Call:
## lm(formula = dist ~ speed)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932

After we run our regression model, we can interpret our results. Our regression equation from these results is dist = -17.579 + 3.932(speed) + \(\epsilon\). With a slope of 3.932, we say that with every one-unit (mph) increase in speed, our stopping distance increases by 3.932 feet on average.

In this case, we will not interpret our y-intercept because it would not make sense for there to be a stopping distance at a speed of 0. You would not have a stopping distance if you were not moving to begin with.
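As a check on where the slope and intercept come from, the sketch below recomputes them from summary statistics using the standard least-squares formulas (slope = r times sd(y) / sd(x), intercept = mean(y) minus slope times mean(x)). It assumes the built-in cars data set and passes it through the data argument so the block runs without attach; the names ending in _by_hand and the 20 mph example speed are illustrative choices.

```r
data(cars)

# Fit the model; data = cars avoids relying on attach()
mymod <- lm(dist ~ speed, data = cars)
b <- coef(mymod)
intercept_fit <- unname(b["(Intercept)"])
slope_fit <- unname(b["speed"])

# Least-squares slope and intercept from summary statistics
slope_by_hand <- cor(cars$speed, cars$dist) * sd(cars$dist) / sd(cars$speed)
intercept_by_hand <- mean(cars$dist) - slope_by_hand * mean(cars$speed)

# Estimated mean stopping distance (ft) at a hypothetical speed of 20 mph
predicted_at_20 <- intercept_fit + slope_fit * 20
print(c(slope_fit, slope_by_hand, intercept_fit, intercept_by_hand))
print(predicted_at_20)
```

The two pairs of values agree, and the estimated mean stopping distance at 20 mph comes out at roughly 61 feet, which is what plugging 20 into the regression equation gives.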

We will now add our regression line to the plot by first plotting the data and then using the abline function. The abline function requires two arguments, the values of the intercept and the slope that we got from our regression model. Within the function, they must be put in the order (intercept, slope).

plot(speed, dist, ylab = "Car Stopping Distance", xlab = "Speed", main = "Car Speed vs. Stopping Distance")
abline(-17.579, 3.932)
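Typing the rounded coefficients by hand works, but abline also accepts a fitted lm object and reads the intercept and slope from it directly. The sketch below shows that alternative; it assumes the built-in cars data set and refits the model so the block is self-contained.

```r
data(cars)
mymod <- lm(dist ~ speed, data = cars)

# Scatterplot of the raw data
plot(cars$speed, cars$dist,
     ylab = "Car Stopping Distance (ft)",
     xlab = "Speed (mph)",
     main = "Car Speed vs. Stopping Distance")

# abline() extracts the coefficients from the fitted model,
# so no rounding or retyping is needed
abline(mymod)
```

This avoids copy errors and keeps the plotted line in step with the model if the data change.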


So, the relationship between speed and stopping distance is best described by the line seen here. The equation of this line is the same as our simple linear regression model: dist = -17.579 + 3.932(speed) + \(\epsilon\).

Correlation (r)

Another way we can examine the relationship between speed and stopping distance is to look at the correlation between them. We use the variable r to represent correlation. In our case, this will measure the strength and direction of the linear relationship between the speed and stopping distance. We use the cor() function followed by our two variables:

cor(speed, dist)
## [1] 0.8068949

Because this is positive and fairly close to 1, there is a strong, positive, linear relationship between the two variables. The closer r is to 1, the more strongly correlated (linearly related) two variables are. And because they are positively correlated, this means that they move in the same direction: as the speed increases, the stopping distance increases.
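To see what cor() is measuring, the sketch below recomputes the Pearson correlation from its definition: the sum of the products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. It assumes the built-in cars data set; x, y, and r_by_hand are illustrative names.

```r
data(cars)
x <- cars$speed
y <- cars$dist

# Built-in correlation
r <- cor(x, y)

# Pearson correlation from its definition
dx <- x - mean(x)
dy <- y - mean(y)
r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(c(r, r_by_hand))
```

Both values match the 0.8068949 reported by cor().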

Testing Assumptions

When we create regression models, we make several assumptions about the error term.


Normal Assumption

The first assumption we make is that the error term is normally distributed. We can check this assumption by creating a histogram of the residuals and checking its shape:
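A minimal sketch of one way to draw such a histogram is given here. It assumes the built-in cars data set and refits the model as mymod so the block runs on its own; the axis label and title are illustrative choices.

```r
data(cars)
mymod <- lm(dist ~ speed, data = cars)

# Residuals: observed stopping distance minus fitted stopping distance
res <- residuals(mymod)

# Histogram of the residuals to judge the shape of their distribution
hist(res,
     xlab = "Residual (ft)",
     main = "Histogram of Residuals")
```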


This histogram looks decently bell-shaped, but we do have a bit of a right skew. In the context of our example, this skew means that for higher speeds, we are underpredicting our stopping distances. We would expect there to be more variability in stopping distances for higher speeds just due to variability in the way people stop (slamming on the brakes, coming slowly to a stop, etc.). Differences in cars could also account for some of this variability, just based on how well the cars stop.

If we have trouble seeing whether our error terms follow a normal distribution based on a histogram, we can also plot our residuals against a straight line. We use the qqnorm and qqline functions with our model residuals as the argument to see this comparison:
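A minimal sketch of that comparison follows, assuming the built-in cars data set and a model refit as mymod so the block is self-contained. The closer the points sit to the reference line, the more plausible the normality assumption.

```r
data(cars)
mymod <- lm(dist ~ speed, data = cars)
res <- residuals(mymod)

# Normal Q-Q plot: sample quantiles of the residuals
# against theoretical quantiles of a normal distribution
qqnorm(res)

# Reference line through the first and third quartiles
qqline(res)
```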