How does the date a game is hosted on impact game attendance in recent years in Major League Baseball?
Annual attendance in Major League Baseball has been decreasing for the past 13 years. Since the league attendance peak in 2007 there has been a decline each season in overall attendance. Baseball has been referred to as America's pastime and was even deemed an essential job during World War II to keep up the nation's morale. However, many modern fans think the sport is too slow and the games are too long, and attendance continues to decline. It is important to understand what influences a modern fan's decision to attend a game in order for teams to maximize attendance.
Data Sources:
For this project I have identified three main data sources. The main data source that I originally identified is called games, which comes from a website called Kaggle, at the link: https://www.kaggle.com/pschale/mlb-pitch-data-20152018#games.csv. The data was originally gathered by Major League Baseball, published on its website, and then scraped and compiled into this CSV file. I cannot find the original data, which is not ideal because I cannot determine what changes have been made from the original data set. The website did provide a link to the original data; however, the link appeared to be out of date and no longer led to the correct page. Below I have included a table of all the variables included in this data set, along with the type of each variable and a description of the variable.
attendance | Double | Number of people who attended the game |
away_final_score | Double | Final score of the away team |
away_team | Character | Three letter abbreviation of the away team |
date | Date | Year, month, and day of date played |
elapsed_time | Double | Number of minutes of game play |
g_id | Double | Game ID |
home_final_score | Double | Final score of the home team |
home_team | Character | Three letter abbreviation of the home team |
start_time | Time | The scheduled start time |
umpire_1B | Character | The name of the first base umpire |
umpire_2B | Character | The name of the second base umpire |
umpire_3B | Character | The name of the third base umpire |
venue_name | Character | The name of the stadium the game was played at |
weather | Character | Temperature and weather conditions |
wind | Character | Speed and direction of wind |
delay | Double | Length of delay in minutes |
The variable that I was most interested in, in this particular data set, was attendance. I wanted to investigate how it was influenced by other variables such as date and weather.
The second data set that I wanted to find was a data set about stadium capacity. The reason I wanted to include this data is to take into account the capacity of stadiums, since it could cause misleading results if one stadium always has higher attendance than another simply because there are different numbers of seats in each stadium. I could not find a premade data set with this information, but there are only 30 parks in Major League Baseball, so I decided to find the data and create the tibble by hand. I copied the data from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_current_Major_League_Baseball_stadiums. While it is not ideal to manually enter data, I felt that since it was a small data set that could not be found elsewhere it would be my best option. This new tibble contained three variables, detailed below:
venue_name | Character | The current name of the venue |
stadium_capacity | Double | The total number of seats in the venue |
year_built | Double | The year the venue was built |
I then combined the first two data sets into one data set before I started my graphical analysis.
The third data set that I looked for was one that went further back and contained attendance data for more than three years. I was able to find another data set on Kaggle that covers this topic at this link: https://www.kaggle.com/timschutzyang/dataset1/data. This data was scraped from a website called Baseball Reference, which is a website that tracks a vast amount of data about baseball. However, no link is provided for the original data set, so I cannot find the original source. This data set has data about each team for each season dating back to 1876. The difficulty with this data set is that the source I got it from does not include variable descriptions, and I was unable to determine what some of the variables mean. I still think it is beneficial for looking at overall league trends. Additionally, attendance data in this data set is not available until 1890. There are many variables included in this data set, which I have detailed in a table below; however, the only two I used in this project were year and attendance.
X1 | Double | The observation number in the data set |
Rk | Double | Unknown |
Year | Double | The year the observation is from |
Tm | Character | The team the observation is about |
Lg | Character | The league that the team was playing in |
G | Double | Number of games played that season |
W | Double | Number of games won |
L | Double | Number of games lost |
Ties | Double | Number of games tied |
W.L. | Double | Ratio of wins to total games played |
pythW.L. | Double | Pythagorean winning percentage (estimated games a team should have won) |
Finish | Character | Final ranking out of total number of teams |
GB | Character | Number of games back from top ranked team |
Playoffs | Character | How far into the playoffs the team made it that season |
R | Double | Number of runs scored |
RA | Double | Runs allowed (number of runs opposing teams scored) |
Attendance | Double | Total number of attendees for the season |
BatAge | Double | Unknown |
PAge | Double | Unknown |
Top.Player | Character | Top player for the season and team |
Managers | Character | Managers of the team |
current | Character | The current team name |
Process for Cleaning Data:
The main data was already well organized into a CSV file with clean column names. There was very little that I needed to do to clean this data. The problem I ran into was that some of the parks changed names within the data set or have changed names since. In order to be consistent, I changed all park names to their most recent venue name. This also made it easier to combine my first two data sets. I was able to join the first two data sets using the variable venue_name, which existed in both data sets, to add the data about the venue to the data about each game played at the venue.
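The renaming and joining steps described above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the code used for the project (which appears to have used R tibbles). The `RENAMES` mapping, the park names, and the helper names `normalize_venue` and `join_on_venue` are hypothetical placeholders.

```python
# Hypothetical mapping from older park names to their most recent names.
# The entry is an illustrative placeholder, not the project's actual mapping.
RENAMES = {"Old Park Name": "New Park Name"}


def normalize_venue(name, renames=RENAMES):
    """Return the most recent name for a venue, or the name unchanged."""
    return renames.get(name, name)


def join_on_venue(games, stadiums):
    """Left-join game rows with stadium rows on venue_name."""
    by_venue = {normalize_venue(s["venue_name"]): s for s in stadiums}
    joined = []
    for game in games:
        venue = normalize_venue(game["venue_name"])
        stadium = by_venue.get(venue, {})
        # Game fields win on key clashes; venue_name is stored normalized.
        joined.append({**stadium, **game, "venue_name": venue})
    return joined


# Made-up rows, only to show the shape of the join.
games = [
    {"g_id": 1, "venue_name": "Old Park Name", "attendance": 30000},
    {"g_id": 2, "venue_name": "Another Park", "attendance": 25000},
]
stadiums = [
    {"venue_name": "New Park Name", "stadium_capacity": 40000, "year_built": 2000},
]
joined = join_on_venue(games, stadiums)
```

A left join keeps every game even when no stadium row matches, which makes unmatched venue names easy to spot because they lack the stadium fields.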
Some games also had an attendance of 0 or 1; these are referred to as crowdless games. The first crowdless game in MLB was played in 2015 in Baltimore. The game was played with no fans due to civil unrest in the city. I was unable to investigate all the crowdless games in this data set, though they appear to mostly occur because of civil unrest in the city where the game was supposed to be played, or in doubleheader games where attendance was only recorded for one of the games. I chose to remove them from the data set due to how rare they are, and also given that they are typically crowdless due to factors not measured in this data set.
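Removing the crowdless games amounts to a simple filter on attendance. A minimal Python sketch, assuming each game is a dictionary with an `attendance` field; the function name `drop_crowdless` and the sample rows are hypothetical:

```python
def drop_crowdless(games, threshold=1):
    """Keep only games whose recorded attendance is above the threshold."""
    return [g for g in games if g["attendance"] > threshold]


# Made-up rows: two crowdless games (attendance 0 and 1) and one normal game.
games = [
    {"g_id": 1, "attendance": 0},
    {"g_id": 2, "attendance": 1},
    {"g_id": 3, "attendance": 31042},
]
kept = drop_crowdless(games)
```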
Creation of New Variables:
There was also a single weather variable that contained both the temperature and the conditions, all stored as one string. I separated this variable into two variables. One contained the temperature stored as a double, and the other contained only the weather conditions stored as a string, which can be used as categorical data.
I also created a new variable called percent_capacity, which was the attendance divided by the total capacity of the stadium. The goal of this was to create a more equal comparison between venues of different sizes.
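One way to split such a string is sketched below in Python. The input format `"73 degrees, partly cloudy"` is an assumption about how the weather column is laid out, and `split_weather` is a hypothetical helper name; the real column may need different parsing.

```python
def split_weather(weather):
    """Split a weather string into (temperature, condition).

    Assumes the format '<number> degrees, <condition>'.
    """
    temp_part, _, condition = weather.partition(",")
    # The first whitespace-separated token is taken to be the temperature.
    temperature = float(temp_part.split()[0])
    return temperature, condition.strip()


temperature, condition = split_weather("73 degrees, partly cloudy")
```

Storing the temperature as a float allows it to be used as a continuous variable, while the condition string can be grouped on as a category.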
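The calculation can be sketched in Python as below. Multiplying by 100 is an assumption, made because the sample values shown later in this write-up are on a 0 to 100 scale; the function name and the numbers are illustrative only.

```python
def percent_capacity(attendance, stadium_capacity):
    """Attendance as a percentage of the stadium's total capacity."""
    if stadium_capacity <= 0:
        raise ValueError("stadium_capacity must be positive")
    return 100.0 * attendance / stadium_capacity


# The same crowd fills a smaller park more than a larger one.
small_park = percent_capacity(30000, 40000)
large_park = percent_capacity(30000, 50000)
```

With these made-up numbers, an identical crowd of 30,000 is 75% of capacity in the smaller park and 60% in the larger one, which is the kind of difference raw attendance would hide.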
Data:
The first data set that I worked with in my results section was the historical data set. The two variables I used were year and a summary variable, total attendance per year, that I created. The years with attendance data range from 1890 to 2015. The variance of the total attendance is shown in a boxplot below. As shown in the boxplot, there is a large spread, especially in the upper half; however, there are no outliers in this data.
## Rows: 141
## Columns: 2
## $ Year 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884…
## $ total_attendance NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
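A summary like total attendance per year can be built by grouping the team-level rows by year and summing. The Python sketch below uses made-up rows and a hypothetical function name; treating a year with no recorded attendance as `None` is an assumption that mirrors the `NA` values in the output above.

```python
def total_attendance_by_year(rows):
    """Sum team attendance per year; years with no recorded values map to None."""
    totals = {}
    for row in rows:
        year, value = row["Year"], row["Attendance"]
        totals.setdefault(year, None)
        if value is not None:
            totals[year] = (totals[year] or 0) + value
    return dict(sorted(totals.items()))


# Made-up rows: one year without attendance data, one year with two teams.
rows = [
    {"Year": 1876, "Tm": "A", "Attendance": None},
    {"Year": 1890, "Tm": "A", "Attendance": 1000},
    {"Year": 1890, "Tm": "B", "Attendance": 2500},
]
totals = total_attendance_by_year(rows)
```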

The next variables I worked with were the days of the week, which are of course a standard seven, and the mean game attendance percentage by team for each day of the week. The boxplot below shows the spread of mean attendance percentage. The boxplot shows a very even spread with no outliers.
## Rows: 217
## Columns: 4
## $ grouping_variable "Angel Stadium of Anaheim1", "Angel Stadium of Anah…
## $ mean_day_venue 82.43382, 76.46182, 77.18735, 77.66853, 76.64184, 8…
## $ week_day "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "…
## $ venue_name "Angel Stadium of Anaheim", "Angel Stadium of Anahe…
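The grouping behind a table like this can be sketched in Python as follows. The ISO date format, the field names on each game, and the function name are assumptions. Note that `isoweekday()` numbers Monday as 1 and Sunday as 7, which may not match the day numbering used in the output above.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_by_venue_and_weekday(games):
    """Mean percent_capacity for each (venue_name, week_day) pair."""
    groups = defaultdict(list)
    for g in games:
        # isoweekday(): Monday is 1 and Sunday is 7.
        week_day = date.fromisoformat(g["date"]).isoweekday()
        groups[(g["venue_name"], week_day)].append(g["percent_capacity"])
    return {key: mean(values) for key, values in groups.items()}


# Made-up rows: two Monday games and one Tuesday game at the same park.
games = [
    {"venue_name": "Park A", "date": "2015-04-06", "percent_capacity": 80.0},
    {"venue_name": "Park A", "date": "2015-04-13", "percent_capacity": 70.0},
    {"venue_name": "Park A", "date": "2015-04-07", "percent_capacity": 60.0},
]
means = mean_by_venue_and_weekday(games)
```

Grouping by month instead only requires replacing the weekday lookup with the `month` attribute of the parsed date.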

After looking at the day of the week I wanted to continue analyzing how the date influences attendance, so I experimented with grouping by month. The months ranged from 3 to 10, as baseball is only played March through October. The boxplot for the mean attendance by month and venue is below. Again it shows a very even spread with no outliers.
## Rows: 227
## Columns: 4
## $ grouping_variable "Angel Stadium of Anaheim10", "Angel Stadium of Ana…
## $ mean_month_venue 69.94456, 82.38399, 78.57694, 81.84634, 87.20427, 8…
## $ month "0", "4", "5", "6", "7", "8", "9", "0", "4", "5", "…
## $ venue_name "Angel Stadium of Anaheim1", "Angel Stadium of Anah…