This project visualizes MTA ridership on a map of New York City. It spans several aspects of data science, from data munging and data visualization to data analysis. The base data set is the NYC turnstile data, which can be found in the Udacity course Intro to Data Science. Additionally, I utilize two external data sets: NYC Subway Station and MTA Remote Unit Code. The location information from the NYC Subway Station data set is added to the Remote Unit data set; the combined data set is then merged with the base data set to provide ridership by station together with each station's physical location. The final outcome is a map of New York City with MTA ridership drawn as dots whose size reflects popularity. The sections are listed below (click a section title to jump to it).
• Import Libraries and Set File Paths
# manipulate data and data frames
import numpy as np
import pandas as pd
# match station names
import string
from fuzzywuzzy import fuzz
# geographical data
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame
# visualization
from IPython.display import IFrame
import matplotlib.pyplot as plt
%matplotlib inline
# do not show warnings
import warnings
warnings.filterwarnings("ignore")
# file paths
turnstile_base = './Data/turnstile_data_master_with_weather.csv'
remote_unit = './Data/Remote-Booth-Station.csv'
station_position = './Data/DOITT_SUBWAY_STATION_01_13SEPT2010.csv'
correct_station_name = './Data/correct_station_name.txt'
nyc_map_data = './Data/Borough Boundaries/geo_export_1e1c40bf-0d50-4899-ab1d-0056152c3698.shp'
mta_map = "./Data/The-New-York-City-Subway-Unique-Geographically-Accurate-Nyc-Subway-Map.jpg"
• Read Base Data
# base data set
master_df = pd.read_csv(turnstile_base)
master_df.drop(['Unnamed: 0'], axis = 1, inplace = True)
# create a column for day
master_df['DATEn'] = pd.to_datetime(master_df['DATEn'])
master_df['day'] = master_df['DATEn'].apply(lambda d : d.day)
# create another column for day of week (Monday = 1, ..., Sunday = 7)
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
master_df['day_of_week'] = master_df['DATEn'].dt.day_name()
master_df['day_of_week'] = master_df['day_of_week'].map(dict(zip(day_list, np.arange(1, 8))))
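The `dict(zip(...))` trick used for the weekday mapping can be illustrated in isolation. This is a minimal stdlib-only sketch, independent of the data set:

```python
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# dict(zip(...)) pairs each day name with its number: Monday -> 1, ..., Sunday -> 7
day_to_number = dict(zip(day_list, range(1, 8)))

print(day_to_number['Monday'])   # 1
print(day_to_number['Sunday'])   # 7
```

Passing this dictionary to `Series.map` replaces each day name with its number in one vectorized step.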
• Define Functions
"""
The station names of the two external data sets are different in alphabet, length, and so on.
The functions below are mostly to match the station names before merging the data sets.
"""
# get rid of ordinal expressions
def process_st_number(words):
    ordinal = ["st", "nd", "rd", "th"]
    result = ""
    words = words.split(" ")
    for word in words:
        if word[0].isdigit():
            for match in ordinal:
                word = word.replace(match, "")
        result = result + " " + word
    # ex. 123rd -> 123
    return result.strip()
# move the number part before the character part
def swap_station_name(name):
    # take care of "name (123)" cases
    name = name.replace("(", "- ")
    name = name.replace(")", "")
    # delete ordinals and make it uppercase
    name = process_st_number(name)
    name = name.replace(" - ", "-").upper()
    split = name.split("-")
    # cases that have a '-' symbol
    if len(split) == 2:
        if (not split[0][0].isdigit()) and (split[1][0].isdigit()):
            return split[1] + "-" + split[0]
        else:
            return split[0] + "-" + split[1]
    # ex. street-123 -> 123-street
    return name
# make the two station lists consistent
def process_station_name(name):
    name = swap_station_name(name)
    # limit the length to 15 characters
    if len(name) >= 15:
        if name[13] == " ":
            return name[0:13]  # special case
        else:
            return name[0:15]
    return name
# find the same or the most similar station name
def compare_station_name(name, match_list):
    # just return the name if it matches exactly
    if name in match_list:
        return name
    # fuzzywuzzy returns a similarity score in the range 0-100
    highest = 0
    similar_name = ""
    for match in match_list:
        score = fuzz.ratio(name, match)
        if score > highest:
            highest = score
            similar_name = match
    return similar_name
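The same nearest-name lookup can be sketched with only the standard library: `difflib.SequenceMatcher` computes a similarity ratio (0.0-1.0) analogous to `fuzz.ratio`. The station names below are hypothetical examples, not taken from the data sets:

```python
from difflib import SequenceMatcher

def closest_name(name, match_list):
    # return the candidate with the highest similarity ratio (0.0 - 1.0)
    best, best_score = "", 0.0
    for match in match_list:
        score = SequenceMatcher(None, name, match).ratio()
        if score > best_score:
            best, best_score = match, score
    return best

# hypothetical candidate list with small spelling differences
candidates = ["42 ST-TIMES SQ", "34 ST-PENN STA", "125 ST"]
print(closest_name("42 ST TIMES SQ", candidates))  # 42 ST-TIMES SQ
```

This mirrors the loop in `compare_station_name`; fuzzywuzzy's Levenshtein-based score and difflib's ratio can rank candidates slightly differently, but the keep-the-highest-score pattern is the same.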
• Read and Process Data Sets in Format
# read Remote Unit data set and process it
remote_unit_df = pd.read_csv(remote_unit, usecols = ['Remote', 'Station'])
remote_unit_df.drop_duplicates(subset = 'Remote', keep = 'first', inplace = True)
remote_unit_df.columns = ['UNIT', 'station']
remote_unit_df['station'] = remote_unit_df['station'].apply(
lambda s : swap_station_name(s))
# read Station Position data set and process it
station_position_df = pd.read_csv(station_position, usecols = ['NAME', 'the_geom'])
station_position_df.drop_duplicates(subset = 'NAME', keep = 'first', inplace = True)
station_position_df.columns = ['station', 'position']
station_position_df['station'] = station_position_df['station'].apply(
lambda s : process_station_name(s))
• Process Data Sets for Comparison
"""
Even after processing station names with the functions, there are still discrepancies.
This is because the differences are out of pattern that the function can catch.
Part of the job is done manually, instread of writing more functions for each case.
"""
# station_temp is from Remote Unit data set
station_position_df['station_temp'] = station_position_df['station'].apply(
lambda name : compare_station_name(name, list(remote_unit_df['station'])))
# compare and find different names
diff = []
for i in range(0, len(station_position_df)):
    if (station_position_df['station'].iloc[i] !=
            station_position_df['station_temp'].iloc[i]):
        diff.append(i)
# look through names to find the ones that are totally different
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
display((station_position_df.iloc[diff]).sort_values('station')[0:10])
• Fine Tuning before Merging
# bring in the manually written correction list
correct_station_name_df = pd.read_csv(correct_station_name)
# correct the names; each row of the file holds a row index and the corrected name
for _, row in correct_station_name_df.iterrows():
    station_position_df.loc[row.iloc[0], 'station_temp'] = row.iloc[1]
# match station names between the two data sets
station_position_df['station'] = station_position_df['station_temp']
station_position_df.drop('station_temp', axis = 1, inplace = True)
• Merge All Data Sets
"""
Now, the two data sets have identical station names to be ready for merge.
By merging them, Remote Unit data set is to have physical positions for each of stations.
And then, the base data set is to have station names and their positions
by being merged with Remote Unit data set.
"""
# merge on station names
remote_unit_df = pd.merge(remote_unit_df, station_position_df, on = 'station')
# merge on Unit
master_df = pd.merge(master_df, remote_unit_df, on = 'UNIT')
# select station and turnstile columns for visualization
entry_exit_by_station = master_df.groupby(
    ['station', 'position'])[['ENTRIESn_hourly', 'EXITSn_hourly']].mean()
entry_exit_by_station_df = pd.DataFrame(entry_exit_by_station)
entry_exit_by_station_df.drop_duplicates(inplace = True)
entry_exit_by_station_df.reset_index(level = ['station', 'position'], inplace = True)
entry_exit_by_station_df.columns = ['station', 'position', 'entry_mean', 'exit_mean']
entry_exit_by_station_df.head()
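The group-and-average step above can be illustrated without pandas. This stdlib-only sketch averages hourly entries per (station, position) key over a small hypothetical sample, which is what `groupby(...).mean()` does per column:

```python
from collections import defaultdict

# hypothetical hourly entry counts keyed by (station, position)
rows = [
    ("42 ST-TIMES SQ", "POINT (-73.98 40.75)", 6000),
    ("42 ST-TIMES SQ", "POINT (-73.98 40.75)", 8000),
    ("125 ST",         "POINT (-73.95 40.80)", 2000),
]

totals = defaultdict(lambda: [0, 0])   # (sum, count) per key
for station, position, entries in rows:
    totals[(station, position)][0] += entries
    totals[(station, position)][1] += 1

entry_mean = {key: s / n for key, (s, n) in totals.items()}
print(entry_mean[("42 ST-TIMES SQ", "POINT (-73.98 40.75)")])  # 7000.0
```

Grouping on both `station` and `position` keeps the position column available after `reset_index`, which is why it appears in the groupby key even though it is redundant with the station name.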
• Process Geographical Positions
"""
GeoPandas requires a certain geographical format.
Since the position is simply a string, it should be processed.
"""
def process_position(pos):
    # drop '(' and ')'; the leading 'POINT' token is skipped by indexing below
    pos = pos.translate({ord(x): None for x in "()"})
    pos = pos.split(" ")
    # Point from shapely, as used by GeoSeries
    return Point(float(pos[1]), float(pos[2]))
entry_exit_by_station_df['position'] = entry_exit_by_station_df['position'].apply(
lambda pos : process_position(pos))
print(type(entry_exit_by_station_df['position'][0]))
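The string-to-coordinates step can be checked in isolation. This stdlib-only sketch parses the same `POINT (lon lat)` format but returns a plain tuple instead of a shapely `Point`:

```python
def parse_point(pos):
    # drop '(' and ')' -> "POINT -73.98 40.75", then split on spaces
    pos = pos.translate({ord(x): None for x in "()"})
    pos = pos.split(" ")
    # pos[0] is the literal 'POINT'; pos[1] and pos[2] are longitude and latitude
    return (float(pos[1]), float(pos[2]))

print(parse_point("POINT (-73.98 40.75)"))  # (-73.98, 40.75)
```

Note the longitude-first ordering: the data set stores points as `(lon lat)`, matching the `Point(x, y)` convention that shapely and GeoPandas expect.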
"""
NYC map can be drawn by utilizing geographical data frames.
MTA stations is drawn on the NYC map as dots at the exact position in size of popularity.
"""
# create geographical data frames
nyc_map = gpd.read_file(nyc_map_data)
nyc_map = nyc_map[nyc_map['boro_name'] != 'Staten Island'] # no data in Staten Island
station_gs = GeoSeries(entry_exit_by_station_df['position'])
# popularity of a station is measured by adding hourly entries and exits
entry_exit = pd.Series((entry_exit_by_station_df['entry_mean'] +
entry_exit_by_station_df['exit_mean']))
turnstile_marker_size_1 = entry_exit/10 # divided by 10 for a reasonable marker size
turnstile_marker_size_2 = (entry_exit - 5000)/10 # offset; meaningful only for stations with over 5000 mean hourly entries and exits
# plot NYC map and geographical turnstile data
fig, ax = plt.subplots(figsize = (12, 12))
nyc_map.plot(ax = ax, cmap = 'Set2', column = 'boro_name', legend = True)
station_gs.plot(ax = ax, marker='.', color = 'red',
markersize = turnstile_marker_size_1, alpha = 1/5)
ax.set_axis_off()
ax.set_title("MTA Ridership")
plt.show()
# actual MTA lines on NYC map
IFrame(mta_map, width = 500, height = 500)
• Popular MTA stations
# stations with over 5000 mean hourly entries and exits
busy_station_df = pd.DataFrame({'station': entry_exit_by_station_df['station'],
'turnstile': entry_exit})
busy_station_df = busy_station_df[busy_station_df['turnstile'] >= 5000]
# plot NYC map and popular stations as stars
fig, ax = plt.subplots(figsize = (12, 12))
nyc_map.plot(ax = ax, cmap = 'Set2', column = 'boro_name', legend = True)
station_gs.plot(ax = ax, marker='*', color = 'blue',
markersize = turnstile_marker_size_2, alpha = 1/5)
plt.show()
print("<Top 10 Busiest Stations>\n",
(busy_station_df.sort_values(by = 'turnstile', ascending = False).head(10))
.to_string(index=False))
As seen in the plot and the image, the position data appears accurate: the dots correspond to the actual layout of the MTA lines on the map. The size of a dot indicates the popularity of a station; the bigger the dot, the more people use it.
Although it is widely known that Manhattan is busy, the visualization was still very useful for understanding how busy Manhattan is relative to other areas. It also helped me discover that there are quite popular MTA stations outside Manhattan. This raises further questions, such as why a particular station in Queens is busy and what its neighborhood is like. To extend the research, I could look into population, income, or similar data for New York City.
Below are links to websites where I found useful information, excluding those already mentioned above.