Research Question:

How does the date a game is played on impact game attendance in recent years in Major League Baseball?

Annual attendance in Major League Baseball has been decreasing for the past 13 years. Since the league attendance peak in 2007, there has been a decline each season in overall attendance. Baseball has been referred to as America’s pastime and was even deemed an essential job during World War II to keep up the nation’s morale. However, many modern fans think the sport is too slow and the games are too long, and attendance continues to decline. It is necessary to understand what influences a modern fan’s decision to attend a game in order for teams to maximize attendance.

Data Sources:

For this project I have identified three key data sources. The main data source that I originally identified is called games, which comes from a website called Kaggle, at the link: The data comes originally from the Major League Baseball website and was then compiled into this CSV file. The original data was gathered by Major League Baseball and the data was then scraped into this CSV. I cannot find the original data, which is not ideal because I cannot determine what changes have been made from the original data set. The website did provide a link to the original data; however, the link appeared to be out of date and no longer led to the correct page. Below I have included a table of all the variables included in this data set, along with the type of each variable and a description of the variable.

| Variable Name | Type | Description |
| --- | --- | --- |
| attendance | Double | Number of people who attended the game |
| away_final_score | Double | Final score of the away team |
| away_team | Character | Three-letter abbreviation of the away team |
| date | Date | Year, month, and day the game was played |
| elapsed_time | Double | Number of minutes of game play |
| g_id | Double | Game ID |
| home_final_score | Double | Final score of the home team |
| home_team | Character | Three-letter abbreviation of the home team |
| start_time | Time | The scheduled start time |
| umpire_1B | Character | The name of the first base umpire |
| umpire_2B | Character | The name of the second base umpire |
| umpire_3B | Character | The name of the third base umpire |
| venue_name | Character | The name of the stadium the game was played at |
| weather | Character | Temperature and weather conditions |
| wind | Character | Speed and direction of wind |
| delay | Double | Length of delay in minutes |

The variable that I was most interested in within this particular data set was attendance. I wanted to investigate how it was influenced by other variables such as date and weather.

The second data set that I wanted to find was a data set about stadium capacity. The reason I wanted to incorporate this data is that I wanted to take into account the capacity of stadiums, since it could cause misleading results if one stadium always has higher attendance than another simply because there are different numbers of seats in each stadium. I could not find a premade data set with this information, but since there are only 30 parks in Major League Baseball, I decided to find the data and create the tibble by hand. I copied the data from this Wikipedia site: While it is not ideal to manually enter data, I felt that since it was a small data set that could not be found elsewhere it would be my best option. This new tibble contained three variables detailed below:

| Variable Name | Type | Description |
| --- | --- | --- |
| venue_name | Character | The current name of the venue |
| stadium_capacity | Double | The total number of seats in the venue |
| year_built | Double | The year the venue was built |

I then combined the first two data sets into one data set before I started my graphical analysis.

The third data set that I looked for was one that went further back and contained attendance data for more than three years. I was able to find another data set on Kaggle that covers this topic at this link: This data was scraped from a website called Baseball Reference, which is a website that tracks a vast amount of data about baseball. However, no link is provided for the original data set, so I cannot find the original source. This data set has data about each team for each season dating back to 1876. The challenge with this data set is that the source I got it from does not include variable descriptions, and I was unable to determine what some of the variables mean. I still think it is beneficial for looking at overall league trends. Additionally, attendance data in this data set is not available until 1890. There are many variables included in this data set that I have detailed in a table below; however, the only two I used in this project were year and attendance.

| Variable Name | Type | Description |
| --- | --- | --- |
| X1 | Double | The observation number in the data set |
| Year | Double | The year the observation is from |
| Tm | Character | The team the observation is about |
| Lg | Character | The league that the team was playing in |
| G | Double | Number of games played that season |
| W | Double | Number of games won |
| L | Double | Number of games lost |
| Ties | Double | Number of games tied |
| W.L. | Double | Ratio of wins to total games played |
| pythW.L. | Double | Pythagorean winning percentage (estimated games a team should have won) |
| Finish | Character | Final ranking out of the total number of teams |
| GB | Character | Number of games back from the top-ranked team |
| Playoffs | Character | How far into the playoffs the team made it that season |
| R | Double | Number of runs scored |
| RA | Double | Runs allowed (number of runs opposing teams scored) |
| Attendance | Double | Total number of attendees for the season |
| Top.Player | Character | Top player for the season and team |
| Managers | Character | Managers of the team |
| current | Character | The current team name |

Process for Cleaning Data:

The main data was already well organized into a CSV file with clean column names. There was very little that I needed to do to clean this data. The problem I ran into was that some of the parks changed names within the data set or have changed names since. In order to be consistent, I changed all park names to their most recent venue name. This also made it easier to combine my first two data sets. I was able to join the first two data sets using the variable venue_name, which existed in both data sets, to add the data about the venue to the data about each game played at the venue.

Some games also had attendance of 0 or 1; these are referred to as crowdless games. The first crowdless game in MLB was played in 2015 in Baltimore. The game was played with no fans due to civil unrest in the city. I was unable to investigate all the crowdless games in this data set, though they appear to largely occur because of civil unrest in the city where the game is supposed to be played, or in double-header games where attendance was only recorded for one of the games. I chose to remove them from the data set due to how rare they are, and also given that they are typically crowdless due to factors not measured within this data set.
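The rename-then-join step described above can be sketched as follows. This is a minimal Python illustration using plain dictionaries, not the original analysis code (which appears to have used R tibbles). The alias map, the venue names, and the capacity figures are hypothetical placeholders; only the column names venue_name, stadium_capacity, and year_built come from the tables above.

```python
# Hypothetical alias map: older venue name -> most recent venue name.
# The real mapping used in the project is not given in the text.
VENUE_ALIASES = {"Old Park Name": "New Park Name"}


def normalize_venue(name, aliases=VENUE_ALIASES):
    """Return the most recent name for a venue, or the name unchanged."""
    return aliases.get(name, name)


def join_games_with_venues(games, venues, aliases=VENUE_ALIASES):
    """Left-join game rows to venue rows on venue_name.

    Games whose venue has no match keep None for the venue columns,
    so unmatched names are easy to spot afterwards.
    """
    venue_by_name = {v["venue_name"]: v for v in venues}
    joined = []
    for game in games:
        name = normalize_venue(game["venue_name"], aliases)
        venue = venue_by_name.get(name, {})
        row = dict(game)
        row["venue_name"] = name
        row["stadium_capacity"] = venue.get("stadium_capacity")
        row["year_built"] = venue.get("year_built")
        joined.append(row)
    return joined


# Illustrative rows only; these are not values from the real data sets.
example_games = [{"g_id": 1, "venue_name": "Old Park Name", "attendance": 30000}]
example_venues = [
    {"venue_name": "New Park Name", "stadium_capacity": 40000, "year_built": 2000}
]
example_joined = join_games_with_venues(example_games, example_venues)
```

Normalizing names before the join matters because a join on venue_name silently leaves unmatched rows with missing capacity if the two sources spell a park differently.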
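A minimal Python sketch of this filter is shown below, assuming each game is a dictionary with an attendance key. The function name and the example rows are hypothetical. It returns the removed rows as well, so the number of dropped games can be checked.

```python
def split_crowdless(games):
    """Separate games with recorded attendance of 0 or 1 from the rest.

    Returns (kept, removed). Rows with missing attendance are kept here;
    how to treat them is a separate decision not covered in the text.
    """
    kept, removed = [], []
    for game in games:
        if game["attendance"] in (0, 1):
            removed.append(game)
        else:
            kept.append(game)
    return kept, removed


# Illustrative rows only.
example_rows = [
    {"g_id": 1, "attendance": 0},
    {"g_id": 2, "attendance": 1},
    {"g_id": 3, "attendance": 25000},
]
kept_rows, removed_rows = split_crowdless(example_rows)
```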

Creation of New Variables:

There was also a single weather variable that contained both the temperature and the conditions. This was all stored as a string. I separated this variable into two variables. One contained the temperature stored as a double, and the other contained only the weather conditions stored as a string, which can be used as categorical data.

I also created a new variable called percent_capacity, which was the attendance divided by the total capacity of the stadium. The goal of this was to create a more equal comparison between venues of different sizes.
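The split could look roughly like the Python sketch below. The exact format of the weather string is not shown in the text, so the pattern here assumes strings shaped like "73 degrees, partly cloudy"; that format, the function name, and the example string are all assumptions.

```python
import re

# Assumed format: "<number> degrees, <conditions>".
_WEATHER_RE = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*degrees?\s*,\s*(.+?)\s*$", re.IGNORECASE
)


def split_weather(weather):
    """Split a weather string into (temperature, conditions).

    Returns (None, None) when the string is missing or does not match
    the assumed format, so bad rows can be inspected rather than guessed.
    """
    if weather is None:
        return None, None
    match = _WEATHER_RE.match(weather)
    if match is None:
        return None, None
    return float(match.group(1)), match.group(2).lower()


temperature, conditions = split_weather("73 degrees, partly cloudy")
# temperature == 73.0, conditions == "partly cloudy"
```

Lower-casing the conditions keeps values such as "Cloudy" and "cloudy" in the same category.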
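As a small sketch of this calculation in Python: the text defines percent_capacity as attendance divided by capacity, and the summary values shown later (for example 82.43382) suggest the ratio is expressed as a percentage, so the factor of 100 below is an assumption.

```python
def percent_capacity(attendance, stadium_capacity):
    """Attendance as a percentage of stadium capacity.

    Returns None when either value is missing or capacity is zero.
    The factor of 100 is an assumption based on the scale of the
    summary values, not something stated in the text.
    """
    if attendance is None or not stadium_capacity:
        return None
    return 100.0 * attendance / stadium_capacity


example_percent = percent_capacity(30000, 40000)  # 75.0
```

Values above 100 are possible if recorded attendance exceeds the listed seat count, for instance with standing-room tickets, so they should not automatically be treated as errors.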


The first data set that I worked with in my results section was the historical data set. The two variables I used were year and a summary variable, total attendance per year, that I created. The years range from 1890 to 2015. The spread of the total attendance is shown in a boxplot below. As shown in the boxplot, there is a large spread, especially in the upper half of the distribution; however, there are no outliers in this data.

## Rows: 141
## Columns: 2
## $ Year 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884…
## $ total_attendance NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
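The yearly summary variable summarized above could be built roughly as in this Python sketch. It assumes one row per team per season with Year and Attendance keys, as in the third data set, and it assumes that a year's total is left missing whenever any team's attendance is missing for that year, which would be consistent with the NA values in the early years. The function name and example rows are hypothetical.

```python
from collections import defaultdict


def total_attendance_by_year(team_seasons):
    """Sum team attendance per year.

    A year's total is None if any team's attendance is missing for that
    year (an assumed rule, chosen to mirror the NA values shown above).
    """
    by_year = defaultdict(list)
    for row in team_seasons:
        by_year[row["Year"]].append(row["Attendance"])
    totals = {}
    for year in sorted(by_year):
        values = by_year[year]
        if any(value is None for value in values):
            totals[year] = None
        else:
            totals[year] = sum(values)
    return totals


# Illustrative rows only; these are not real attendance figures.
example_seasons = [
    {"Year": 1876, "Tm": "AAA", "Attendance": None},
    {"Year": 2000, "Tm": "AAA", "Attendance": 1000},
    {"Year": 2000, "Tm": "BBB", "Attendance": 2500},
]
example_totals = total_attendance_by_year(example_seasons)
# example_totals == {1876: None, 2000: 3500}
```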


The next variables I worked with were the days of the week, of which there are of course seven, and the mean game attendance percentage by team for each day of the week. The boxplot below shows the spread of mean attendance percentage. The boxplot shows a very even spread with no outliers.

## Rows: 217
## Columns: 4
## $ grouping_variable "Angel Stadium of Anaheim1", "Angel Stadium of Anah…
## $ mean_day_venue 82.43382, 76.46182, 77.18735, 77.66853, 76.64184, 8…
## $ week_day "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "…
## $ venue_name "Angel Stadium of Anaheim", "Angel Stadium of Anahe…
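A grouped mean like mean_day_venue can be sketched in Python as below. It assumes each game row has a date in ISO format (YYYY-MM-DD), a venue_name, and a percent_capacity value. The weekday numbering here is Python's isoweekday (1 = Monday through 7 = Sunday), which may differ from the numbering used in the summary above. The venue name and values in the example are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_percent_by_venue_and_weekday(games):
    """Mean percent_capacity for each (venue_name, weekday) pair.

    Weekdays follow date.isoweekday(): 1 = Monday ... 7 = Sunday.
    Rows with a missing percent_capacity are skipped.
    """
    groups = defaultdict(list)
    for game in games:
        if game["percent_capacity"] is None:
            continue
        weekday = date.fromisoformat(game["date"]).isoweekday()
        groups[(game["venue_name"], weekday)].append(game["percent_capacity"])
    return {key: mean(values) for key, values in groups.items()}


# Illustrative rows only; both dates fall on a Monday.
example_games_by_day = [
    {"date": "2015-04-06", "venue_name": "Example Park", "percent_capacity": 80.0},
    {"date": "2015-04-13", "venue_name": "Example Park", "percent_capacity": 90.0},
]
example_means = mean_percent_by_venue_and_weekday(example_games_by_day)
# example_means == {("Example Park", 1): 85.0}
```

Replacing isoweekday() with the month attribute of the parsed date would give the corresponding grouping by month.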


After looking at the days of the week, I wanted to continue analyzing how the date influences attendance, so I experimented with grouping by month. The months ranged from 3 to 10, as baseball is only played March through October. The boxplot for the mean attendance by month and venue is below. Again it shows a very even spread with no outliers.

## Rows: 227
## Columns: 4
## $ grouping_variable "Angel Stadium of Anaheim10", "Angel Stadium of Ana…
## $ mean_month_venue 69.94456, 82.38399, 78.57694, 81.84634, 87.20427, 8…
## $ month "0", "4", "5", "6", "7", "8", "9", "0", "4", "5", "…
## $ venue_name "Angel Stadium of Anaheim1", "Angel Stadium of Anah…
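The "no outliers" reading of the boxplots can also be checked numerically. The Python sketch below applies the common 1.5 × IQR rule that boxplots typically use to flag outliers. The quartile method chosen here (the "inclusive" method of statistics.quantiles) is an assumption and may differ slightly from the hinges a plotting library computes, so borderline points could be classified differently.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles.

    Missing values (None) are ignored. With fewer than two values there
    is nothing to compare against, so an empty list is returned.
    """
    data = sorted(v for v in values if v is not None)
    if len(data) < 2:
        return []
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return [v for v in data if v < lower_fence or v > upper_fence]


# Illustrative values only, not taken from the data set.
flagged = boxplot_outliers([1, 2, 3, 4, 5, 100])  # [100]
```

An empty result for a column such as mean_month_venue would agree with the boxplot showing no outliers.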