I explore the data sets, Heart Disease Mortality and Farmers Market In The US, to study the relationship between them. The goal of this project is explore data sets to capture a possible link of Farmers market data to heart disease mortality.
The Center For Disease Control And Prevention provides the data set ‘Heart Disease Mortality Data Among US Adults (35+) by State/County in 2014’ which is downloadable at DATA.GOV. This data set contains the numbers of deaths from Cardiovascular Disease, also simply called heart disease, every 100,000 population by state/county level. Another data set ‘Farmers Market In The US’ is procured from The United States Department of Agriculture at USDA. 8.7k+ Farmers markets are listed with information of state, exact locations and items. Even though the farmers market data are collected between 2011 and 2018, most of the data was updated in 2015 which is almost the same year of that of the heart disease data.
I will be studying these data to discover trends of which states have high or low heart disease mortality and whether the states have a relationship with the number of farmers markets. The importance of these questions stem from the fact that the number one cause of death in the US is none other than heart diseases.
It has been talked a lot about what causes heart diseases. Although people now believe that bad diets could cause heart diseases thanks to a number of researches and education, there lacks the reverse thoughts for good diets that might reduce risks of heart disease. In that sense, farmers market could be a useful indicator as good diets because it is regarded as fresh and healthy.
First step is explore the data sets, separately. I will look into the data, one by one, to see how they are distributed and visualize on the US map to gain some ideas where heart disease mortality or farmers markets are found the most and least.
Second, I will join the two data sets to find relationships between the heart disease mortality and the number of farmers markets. This will be shown on the map for better understanding. The anticipated result is that the mortality is lower where there are many farmers markets than where there are less.
# load packages
library(ggplot2)
library(gridExtra)
library(dplyr)
library(knitr)
library(kableExtra)
library(stringi)
library(readr)
library(maps)
library(mapdata)
The data sets I will use come in different formats. State names are one of those and I need to set a rule for them to merge the data sets. Classifying the states for this follows The US Census Region.
# state names
state.name <- tolower(state.name)
state.name[51] <- "district of columbia"
state.abb[51] <- "DC"
# state divisions
NE <- c("CT","ME","MA","NH","RI","VT","NJ","NY","PA")
MW <- c("IN","IL","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD")
SO <- c("DE","DC","FL","GA","MD","NC","SC","VA","WV","AL",
"KY","MS","TN","AR","LA","OK","TX")
WE <- c("AZ","CO","ID","NM","MT","UT","NV","WY","CA","OR","WA")
region_list <- list(Northeast = NE, Midwest = MW, South = SO, West = WE)
The US Census Region also specifies how the states are grouped into four different regions. The following function will do the job.
# four regions
region_division <- function(state) {
if (state %in% NE) {
return("Northeast")
} else if (state %in% MW) {
return("Midwest")
} else if (state %in% SO) {
return("South")
} else {
return("West")
}
}
States and counties are the variables on which merge will be based on. They should be thoroughly examined and processed through the single function.
process_region <- function(df, state_list) {
# state name for merging
df$state <- state.name[match(trimws(df$state), state_list)]
df$state <- tolower(df$state)
df <- df[which(df$state != "alaska" & df$state != "hawaii"), ]
# state abbreviation for text on axis
df$state_abb <- state.abb[match(df$state, state.name)]
df$state_abb <- factor(df$state_abb)
# region
df$division <- sapply(df$state_abb, function(x) region_division(x))
df$division <- factor(df$division)
# county
df$county <- tolower(df$county)
# remove unnecessary string or character
df$county <- gsub("county", "", df$county)
df$county <- gsub("city", "", df$county)
df$county <- gsub("parish", "", df$county)
df$county <- gsub("\\.", "", df$county)
df$county <- gsub("\\'", "", df$county)
# modify names for consistency
df$county <- gsub("dekalb", "de kalb", df$county)
df$county <- gsub("desoto", "de soto", df$county)
df$county <- gsub("dupage", "du page", df$county)
df$county <- gsub("laporte", "la porte", df$county)
df$county <- gsub("yellowstone", "yellowstone national", df$county)
# washington dc county
df$county[which(df$state == "district of columbia")] <- "washington"
# final screening
df$state <- factor(trimws(df$state))
df$county <- factor(trimws(df$county))
return(df)
}
First, I need to read Heart Disease Mortality and modify it to fit the format in which common variables are refined and named the same with Farmers market. Some other process here includes dropping variables such as race and gender. These could be useful for other analysis but not for the subject of this project. I will use only the overall figures, instead.
# load heart disease mortality data
heart_mortality <- read.csv('../Data/Heart_Disease_Mortality_by_County.csv')
# select columns
heart_mortality <- subset(heart_mortality, GeographicLevel == 'County')
heart_mortality <- heart_mortality[, -c(1, 4:7, 9:13, 15, 17:19)]
heart_mortality <- heart_mortality[!is.na(heart_mortality$Data_Value), ]
# change column names
colnames(heart_mortality) <- c('state', 'county', 'mortality',
'gender', 'race')
# process regional columns
heart_mortality <- process_region(heart_mortality, state.abb)
# show structure
str(heart_mortality)
## 'data.frame': 30810 obs. of 7 variables:
## $ state : Factor w/ 49 levels "alabama","arizona",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ county : Factor w/ 1795 levels "abbeville","acadia",..: 144 79 86 96 159 217 227 238 284 305 ...
## $ mortality: num 434 425 384 494 416 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ race : Factor w/ 6 levels "American Indian and Alaskan Native",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ state_abb: Factor w/ 49 levels "AL","AR","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ division : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
Next, Farmers Market is read and processed in the same way. The data set has a number of categorical variables of items that each farmers markets sells. This will be used as a qualitative aspect while number of farmers market as a quantative aspect. I will give each available item a point. Although there could be better criteria to score them, I choose this simple system for the nature of the project that is exploring the data, not modeling.
# load farmers market data set
farmers_market <- read.csv('../Data/Farmers_Markets_by_County.csv')
# select columns
farmers_market <- farmers_market[, -c(1:9, 12:20, 23, 29, 59)]
farmers_market <- farmers_market[farmers_market$County != "", ]
farmers_market <- farmers_market[!is.na(farmers_market$x), ]
# change column names
colnames(farmers_market)[1:4] <- c('county', 'state', 'long', 'lat')
colnames(farmers_market) <- tolower(colnames(farmers_market))
# process regional columns
farmers_market$state <- tolower(farmers_market$state)
farmers_market <- process_region(farmers_market, state.name)
# show structure
str(farmers_market[1:10])
## 'data.frame': 8053 obs. of 10 variables:
## $ county : Factor w/ 1344 levels "abbeville","accomack",..: 1274 251 789 1164 789 789 789 432 656 789 ...
## $ state : Factor w/ 49 levels "alabama","arizona",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ long : num -88.3 -88.2 -88.2 -88.2 -88.2 ...
## $ lat : num 31.5 32.1 30.6 32.6 30.7 ...
## $ credit : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ wic : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ wiccash : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ sfmnp : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 2 1 2 ...
## $ snap : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 2 1 1 ...
## $ bakedgoods: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 2 1 1 ...
In this section, I will look into the two data sets individually to see how they are distributed and what trends they have in general first, and then I will put them together to find relationships between the two.
Histogram helps you understand the distribution of a data set. I focus on Heart Disease Mortality so I want to get an idea how each county’s mortality is laid out. I will do the similar job to Farmers Market to answer sort of how many questions.
ggplot(heart_mortality, aes(mortality)) +
geom_histogram(binwidth = 20, color = 'black', fill = 'blue') +
ggtitle("Heart Disease Mortality in the US") +
xlab("Number of Deaths per 100,000") + ylab("Counties") +
scale_x_continuous(limits = c(0, 800), breaks = seq(0, 800, 100))
summary(heart_mortality$mortality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 256.8 343.2 352.3 436.4 3000.9
As it turns out, it could be said that they are normally distributed. In fact, it is slightly skewed with the median on the left of the mean. Also, the range is quite large with the min and max set far from each other. Min and max must be very small counties and should probably be thrown out as outliers later for analysis.
ggplot(subset(heart_mortality, gender != "Overall"),
aes(mortality, color = gender)) +
geom_freqpoly(bins = 100) +
ggtitle("Heart Disease Mortality in the US by Gender") +
labs(x = "Number of Deaths per 100,000", y = "Counties", color = "") +
scale_x_continuous(limits = c(0, 900), breaks = seq(0, 900, 100))
It is clear that male is higher in the mortality than female. Because they are distinctively different, I’d better not use ‘Overall’ for gender than separate male and female to compare each individually.
ggplot(subset(heart_mortality, race != "Overall"),
aes(mortality, color = race)) +
geom_freqpoly(bins = 100) +
ggtitle("Heart Disease Mortality in the US by Race") +
labs(x = "Number of Deaths per 100,000", y = "Counties", color = "") +
scale_x_continuous(limits = c(0, 800), breaks = seq(0, 800, 100))
As with gender, it looks like ‘race’ should not be ‘Overall’. Easily speaking, each county has different demographic features and it could become a confounding factor if I use ‘Overall’. Unlike gender, however, race has a problem that each of them has the different number of observations. That is, not all counties have the data for all races. Note that the areas under the lines are not equal. Thus, it needs a further investigation as following.
race_breakdown <- data.frame(table(select(heart_mortality, race, division)))
overall_division <- subset(race_breakdown, race == "Overall")[2:3]
colnames(overall_division)[2] <- "Freq_overall"
race_breakdown <- merge(race_breakdown, overall_division)
race_breakdown$share <- with(race_breakdown, round(Freq/Freq_overall, 2))
race_breakdown <- race_breakdown[race_breakdown$race != "Overall", -4]
race_breakdown <- race_breakdown[order(race_breakdown$race), ]
kable(race_breakdown, row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover",
full_width = FALSE, position = "center"))
division | race | Freq | share |
---|---|---|---|
Midwest | American Indian and Alaskan Native | 353 | 0.11 |
Northeast | American Indian and Alaskan Native | 51 | 0.08 |
South | American Indian and Alaskan Native | 356 | 0.08 |
West | American Indian and Alaskan Native | 577 | 0.46 |
Midwest | Asian and Pacific Islander | 375 | 0.12 |
Northeast | Asian and Pacific Islander | 315 | 0.48 |
South | Asian and Pacific Islander | 756 | 0.18 |
West | Asian and Pacific Islander | 458 | 0.37 |
Midwest | Black | 1129 | 0.36 |
Northeast | Black | 484 | 0.74 |
South | Black | 3523 | 0.83 |
West | Black | 424 | 0.34 |
Midwest | Hispanic | 634 | 0.20 |
Northeast | Hispanic | 376 | 0.58 |
South | Hispanic | 1534 | 0.36 |
West | Hispanic | 814 | 0.66 |
Midwest | White | 3165 | 1.00 |
Northeast | White | 651 | 1.00 |
South | White | 4266 | 1.00 |
West | White | 1242 | 1.00 |
Except for ‘White’, the rest are differently missing in the regions. For example, only about 10% of ‘American Indian and Alaskan Native’ data are available in three regions but 46% in West. The same goes with ‘Asian’, ‘Black’, ‘Hispanic’. Taking a close look at the data, there are such observations that do have ‘Overall’ but not male/female for certain races. For all these, it would be better to focus on race with ‘Overall’ for the given data set, even if this leaves the racial factor uncontrolled to some degree.
# downsize the data set
heart_mortality <- subset(heart_mortality, gender != 'Overall'
& race == 'Overall')
heart_mortality <- heart_mortality[, -5]
# theme for barplot
theme_barplot <- theme_bw() +
theme(axis.text.x = element_text(size = 7, angle = 75),
legend.title = element_blank(),
legend.text = element_text(size = 10),
legend.position = c(0.9,0.8),
legend.background = element_rect(fill = alpha('white', 0)))
Fisrt, I like to know the mean mortality by state and also by division.
# mean mortality by state and division
mortality_state <- heart_mortality %>%
group_by(state_abb, state, division) %>%
summarise(mortality_mean = mean(mortality)) %>%
arrange(state_abb)
# mean mortality
ggplot(mortality_state, aes(reorder(state_abb, -mortality_mean),
mortality_mean,
fill = division)) +
geom_bar(stat = 'identity', width = 0.5) +
ggtitle("Mean Mortality by State") +
xlab("State") + ylab("Mortality") + theme_barplot
Like seen on the bar plot, South is seemingly at highest the risk. Kentuchy comes the seventh, but note that the Midwest state is right above Tennessee in South. The followings are quite well-known facts; Racial Factors against heart disease and Southern Diet known for a lot of meat consumption. The regional mortality can be found again as below.
mortality_region_rank <- aggregate(mortality_state[, "mortality_mean"],
list(mortality_state$division), mean)
kable(mortality_region_rank[order(-mortality_region_rank$mortality_mean), ],
col.names = c("Region", "Mean Mortality"), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover",
full_width = FALSE, position = "center"))
Region | Mean Mortality |
---|---|
South | 410.1525 |
Midwest | 342.7488 |
Northeast | 322.6877 |
West | 304.0757 |
Next, I also like to see by state and division the farmers market count.
# number of farmers market by state and division
farmers_market_state <- farmers_market %>%
group_by(state_abb, state, division) %>%
summarise(count = n()) %>%
arrange(state_abb)
# number of farmers market
ggplot(farmers_market_state,
aes(reorder(state_abb, -count),
count, fill = division)) +
geom_bar(stat = 'identity', width = 0.5) +
ggtitle("Farmers Markets by State") +
xlab("State") + ylab("Count") + theme_barplot
As expected, the largest state California comes first followed by New York, the third largest state. Even though there seems to be a pattern of population, this is not true with so many exceptions including Texas being not even in top 10. One trend is that the last half are mostly states in South and some part of West in Mountain Time Zone.
farmers_market_region <- aggregate(farmers_market_state[, "count"],
list(farmers_market_state$division), mean)
kable(farmers_market_region[order(-farmers_market_region$count), ],
col.names = c("Region", "Count"), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover",
full_width = FALSE, position = "center"))
Region | Count |
---|---|
Northeast | 203.8889 |
Midwest | 199.5833 |
West | 143.2727 |
South | 132.1765 |
Recalling that California has the most farmers market in number, West was not expect to come third with slightly more than South does. This is probably because of other states in West thinly populated.
The findings made by now can be plotted on the real map. In the next, the state map and the county map will be provided with mortality and farmers market data.
state_map <- merge(map_data("state"), mortality_state,
by.x = "region", by.y = "state", all.x = TRUE)
# mean mortality by state
ggplot(data = state_map) +
geom_polygon(aes(long, lat, fill = mortality_mean, group = group),
color = "black", size = 0.1) +
labs(title = "Heart Disease Mortality by State", x = "", y = "", fill = "Mortality") +
coord_fixed(1.3) +
scale_fill_gradientn(colours = rev(rainbow(2)), guide = "colorbar")
As confirmed before, South is the highest in heart disease mortality. Northeast and West are generally low, so is Midwest except for Missouri and Kentuchy, the states neighboring South.
county_map <- merge(map_data("county"), heart_mortality,
by.x = c('region', 'subregion'),
by.y = c('state', 'county'),
all.x = TRUE)
county_map <- county_map[order(county_map$order), ]
# mortality by county
mortality_county_map <- ggplot(data = county_map) +
geom_polygon(aes(long, lat, fill = mortality, group = group),
color = "darkgray", size = 0.1) +
labs(x = "", y = "", fill = "Mortality") +
coord_fixed(1.3) +
scale_fill_gradientn(colours = rev(rainbow(2)), guide = "colorbar")
mortality_county_map +
labs(title = "Heart Disease Mortality by County")
County map shows finer version of data. In fact, this is the map with raw data since Heart Disease Mortality is set by county. The dark gray is missing in the mortality data set. I am leaving them empty rather than artificially filling the data because they will be excluded in the process of analysis.
# farmers market location points
farmers_martket_loc <- data.frame(
lat = as.vector(farmers_market$lat),
long = as.vector(farmers_market$long)
)
mortality_county_map +
labs(title = "Heart Disease Mortality with Farmers Market") +
geom_point(data = farmers_martket_loc, aes(x = long, y = lat),
color = "yellow", size = 0.01, alpha = 0.2)
Now, the county map are plotted with farmers markets which are yellow dots. North East seems to have the most farmers markets followed by Midwest and Far West (CA, OR, WA). One thing to note here is that South obviously has more farmers markets than West, however, South is much higher than West in mortaliy. This must be related with how the states are populated. Examining this map and Population Density Map should will be helpful to understand the issue.
# receive reginal division as parameter
farmers_market_loc_division <- function(division) {
# filter parameter division
farmers_market_division <- farmers_market[farmers_market$division == division, ]
farmers_martket_loc <- data.frame(
lat = as.vector(farmers_market_division$lat),
long = as.vector(farmers_market_division$long)
)
# draw reginal map
p <- ggplot(data = county_map[which(county_map$division == division), ]) +
geom_polygon(aes(long, lat, fill = mortality, group = group),
color = "darkgray", size = 0.05) +
labs(title = division, x = "", y = "", fill = "Mortality") +
coord_fixed(1.3) +
scale_fill_gradientn(colours = rev(rainbow(2)), limits = c(100, 1400),
guide = "colorbar") +
geom_point(data = farmers_martket_loc, aes(x = long, y = lat),
color = "yellow", alpha = 0.5)
return(p)
}
The function does the job of closing up each region, still population adjusted. First to see Northeast, mortality is mostly very low and there are many farmers markets which are small dots in the highly populated region.
farmers_market_loc_division("Northeast")
Midwest also has a lot of farmers markets with decent mortality. There are some red parts that have less or smaller dots in the bottom area.
farmers_market_loc_division("Midwest")
Except for Texas area, it is quite clear that areas in red have smaller dots or even none. This pattern is strong in the middle around Louisiana and Mississippi.
farmers_market_loc_division("South")
West shows blue in most of the areas. There are some large yellow dots where population density is extremely low. Nevada has relatively red counties that have zero or one farmers market.
farmers_market_loc_division("West")
With the idea gained from visualization, it is time to analyze the data. Let’s take a look at the basic statistics first. 1st quantile is same as minimum, suggesting there are just so many counties with only one farmers market. Also, it is skewed to the right and maximum is far out of normal range. I would probably cut off only the right tail to remove outliers.
farmers_market_count <- farmers_market %>%
group_by(state, county) %>%
summarise(farmers_market_count = n()) %>%
arrange(state, county)
summary(farmers_market_count$farmers_market_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 3.632 4.000 125.000
To see the relationship, the number of farmers market in each county is added onto heart disease mortality. Note that Heart Disease Mortality has just a few counties missing the data, but there are many counties that do not appear on Farmers Market. Most of these counties are very small and thought to have none. Nonetheless, I will count them as ‘NA’ and I enter zero for convenience.
heart_mortality <- merge(heart_mortality, farmers_market_count,
by = c("state", "county"), all.x = T)
heart_mortality$farmers_market_count[is.na(heart_mortality$farmers_market_count)] <- 0
The following is to draw point plots of the number of farmers markets versus mortality in the four different regions.
# scatter plot
farmers_market_scatter <- function(var, division, color, x_limit) {
farmers_market_division <- heart_mortality[heart_mortality$division == division, ]
ggplot(farmers_market_division,
aes_string(var, "mortality", color = "gender")) +
geom_point(aes(color = gender), alpha = 0.25,
size = 0.75, position = "jitter") +
geom_smooth() +
labs(title = division, x = "", y = "") +
scale_x_continuous(limits = c(0, x_limit), breaks = seq(0, x_limit, 5)) +
scale_y_continuous(limits = c(100, 600), breaks = seq(150, 600, 50)) +
theme(legend.position = "none")
}
Now, I will call the function to plot for the regions together.
grid.arrange(farmers_market_scatter("farmers_market_count", 'Northeast', 'lightgreen', 15),
farmers_market_scatter("farmers_market_count", 'Midwest', 'lightpink', 15),
farmers_market_scatter("farmers_market_count", 'South', 'lightblue', 15),
farmers_market_scatter("farmers_market_count", 'West', 'violet', 15),
ncol = 4, left = "Mortality", bottom = "Number of Farmers Count")
Upto around 15 seems to be decreasing in mortality in all the regions. However, Midwest and South are increasing after then while North East and West gradually decrease or remain the same. What matters here is the fact that the number of farmers markets becomes very sparse after 10 count, especially in Midwest and South. Thus, I can say that there are declining trends in all the regions until count 10 or so.
One way to get closer to more fair correlation is to consider categorical data given in Farmers Market. There are 34 categories from credit card availability to different foods on sale. These are all indicators to the size of individual farmers market. I have already given each of them one point if the item is available and set aside the results as score variable.
# one point for each item if available
farmers_market$score <- 0
for (r in 1:nrow(farmers_market)) {
for (c in 5:38) {
if (farmers_market[r, c] == "Y") {
farmers_market$score[r] = farmers_market$score[r] + 1
}
}
}
farmers_market_score <- farmers_market %>%
group_by(state, county) %>%
summarise(farmers_market_score = round(mean(score))) %>%
arrange(state, county)
Luckily, the distribution is almost normal excluding at zero. While this many zero-score farmers market was not expected, their meaning is little to nothing because they are highly likely very small-sized farmers markets (none available out of 34 items listed). I will cut them off with some of left and right tails for later analysis.
heart_mortality <- merge(heart_mortality, farmers_market_score,
by = c("state", "county"), all.x = T)
heart_mortality$farmers_market_score[is.na(heart_mortality$farmers_market_score)] <- 0
As with farmers market count, I am drawing point plots in different regions. The X-value will be the scores this time.
grid.arrange(farmers_market_scatter("farmers_market_score", 'Northeast', 'lightgreen', 20),
farmers_market_scatter("farmers_market_score", 'Midwest', 'lightpink', 20),
farmers_market_scatter("farmers_market_score", 'South', 'lightblue', 20),
farmers_market_scatter("farmers_market_score", 'West', 'violet', 20),
ncol = 4, left = "Mortality", bottom = "Number of Farmers Score")
First thing to notice is that West is the lowest in mortality at almost every level, then North East and Midwest, South being the highest. This is not particularly found in the previous comparison of farmers market count. While Midwest does not show any pattern, North East and West show similar pattern at differnt levels, which slightly goes up until 10 and starts to drop. Secondly, South shows some meaningful trend. Only this region has a consistent tendancy of going down as score increases.
Before beginning to analyze the data, I like to apply one more process, removal of outliers. For this, I need to decide two things. One is which features I would do the job to, and the other is how much I want to remove. As confirmed throughout the other processes, ‘farmers market count’ and ‘farmers market score’ are very out of the normally shaped distribution, therefore, it would be risky to touch these features. Another feature to consider is ‘mortality’ which is quite normally distributed with some extreme outliers like 3000 when its mean is around 350. I would like to remove these extreme values by setting quantile 97.5%, that is, removing 2.5% highest values. This task should be done men and women separately because they are differently distributed.
# remove outliers
quantile_cal <- function(gen, prob) {
val <- quantile(heart_mortality[heart_mortality$gender == gen, "mortality"], prob)[[1]]
return(val)
}
heart_mortality <- subset(heart_mortality, gender == "Female" |
mortality < quantile_cal("Male", 0.975))
heart_mortality <- subset(heart_mortality, gender == "Male" |
mortality < quantile_cal("Female", 0.975))
Now that the data are ready, it is time to plot them. I am concerned with correlations, so I will plot correlations of observations. The correlations will be, of course, ‘mortality’ with ‘farmers market count’ and ‘farmers market score’, respectively. The below are the function to calculate correlations and the plot code snippet to build plots, state by state.
# correlation by gender
corr_test <- function(abb, gender_type, var) {
out <- tryCatch({
df <- subset(heart_mortality, state_abb == abb &
gender == gender_type &
farmers_market_count > 0 &
farmers_market_count < 10 &
farmers_market_score < 20)
if (count(df) >= 10) {
out <- as.numeric(with(df, cor.test(mortality,
eval(parse(text = var))))[["estimate"]])
} else {
out <- NA
}
}, error = function(cond) {
out <- NA
})
return(out)
}
# point plot
corr_point <- function(df, title) {
ggplot(df, aes(corr_count, corr_score, color = division)) +
geom_point() + geom_text(aes(label = state), hjust = 0.5, vjust = 1.25, size = 2) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_vline(xintercept = 0, linetype = "dotted") +
labs(title = paste("Correlation of Mortality vs. Farmers Market Count/Score", title),
x = "Count", y = "Score", color = "") +
scale_x_continuous(limits = c(-0.7, 0.7), breaks = seq(-0.7, 0.7, 0.2)) +
scale_y_continuous(limits = c(-0.7, 0.7), breaks = seq(-0.7, 0.7, 0.2))
}
Now, let’s plot the data for male data by state. Before seeing the result, it is noteworthy that the more negative a value is, the stronger evidence it is to claim that more farmers market or bigger farmers market lessens heart disease mortality. In short, it would be ideal if most of them fall in the third quadrant.
state <- c(NE, MW, SO, WE)
corr_male <- data.frame(state)
corr_male$corr_count <- sapply(corr_male$state, function(x)
corr_test(x, "Male", "farmers_market_count"))
corr_male$corr_score <- sapply(corr_male$state, function(x)
corr_test(x, "Male", "farmers_market_score"))
corr_male$division <- sapply(corr_male$state, function(x) region_division(x))
corr_point(corr_male, "(Male)")
Fortunately, I see many the values fall in the negative-negative box, which supports my guess. Most of the state in South like Florida, Tennessee, Arkansas and many are grouped, showing the strong relationship. However, I still see a couple of states in the positive-positive box that could be counterevidence against my claim. Especially Nevada state is a bit off the zero point, suggesting that counties with more farmers market actually has higher heart disease mortality and same for farmers martket size.
# top 10 counties in Nevada by farmers market count
nv_male <- subset(heart_mortality, state_abb == 'NV' & gender == 'Male')
nv_male[order(-nv_male$farmers_market_count)[1:10],
c('county', 'mortality', 'farmers_market_count')]
## county mortality farmers_market_count
## 3432 clark 495.2 10
## 3457 washoe 466.8 8
## 3430 churchill 520.3 3
## 3445 lincoln 489.7 3
## 3447 lyon 400.6 3
## 3427 carson 424.5 2
## 3434 douglas 338.4 1
## 3436 elko 390.3 1
## 3440 eureka 463.3 1
## 3442 humboldt 454.8 1
Looking at the mortalities, they are not really in ascending order. It is not surprising that the top two counties by farmers market count are Clark and Washoe, having Las Vegas and Reno respectively. The two Nevadan cities are best known for casino industry which is commonly associated with alcohol and drugs for the darkside. This could have grounds to claim that there are other stronger causes of heart diseases in the region that are separable from farmers market factors.
corr_female <- data.frame(state)
corr_female$corr_count <- sapply(corr_female$state, function(x)
corr_test(x, "Female", "farmers_market_count"))
corr_female$corr_score <- sapply(corr_female$state, function(x)
corr_test(x, "Female", "farmers_market_score"))
corr_female$division <- sapply(corr_female$state, function(x) region_division(x))
corr_point(corr_female, "(Female)")
The correlations for female look even better in that there is none in the first quadrant. All the farmers market count are negatively correlated with mortality except for Maine. On the contrary, there are several states that are not negatively correlated with farmers market score. By division, South appears to have the strongest relationship with both features, especially, Florida and Tennessee seem to be so. The states in Midwest are grouped together as South, but Northeast and West are not.
For the conclusive analysis, I like to focus on median values since the majority of the farmers market data are centered around value 1 or 2. Median should be more proper to represent the general trend for that reason.
# median mortality by farmers market count
median_mortality_farmers_market_count <- heart_mortality %>%
group_by(farmers_market_count, gender) %>%
summarise(value = median(mortality), type = "count") %>%
arrange(farmers_market_count)
colnames(median_mortality_farmers_market_count)[1] <- "var"
median_mortality_farmers_market_count <- median_mortality_farmers_market_count[1:90, ]
# median mortality by farmers market score
median_mortality_farmers_market_score <- heart_mortality %>%
group_by(farmers_market_score, gender) %>%
summarise(value = median(mortality), type = "score") %>%
arrange(farmers_market_score)
colnames(median_mortality_farmers_market_score)[1] <- "var"
# to plot them together
median_values <- bind_rows(median_mortality_farmers_market_count,
median_mortality_farmers_market_score)
The line plot will show the median values by gender with corresponding points highlighted, for farmers market count and score separately.
# line plot
median_mortality_line <- function(type, x_limit) {
median_values <- median_values[median_values$type == type, ]
median_male <- median(heart_mortality[heart_mortality$gender == "Male", "mortality"])
median_female <- median(heart_mortality[heart_mortality$gender == "Female", "mortality"])
ggplot(median_values, aes(var, value, color = gender)) +
geom_point() + geom_line() +
geom_hline(yintercept = median_male, linetype = "dotted", color = "blue", size = 1) +
geom_hline(yintercept = median_female, linetype = "dotted", color = "red", size = 1) +
labs(x = stri_trans_totitle(type), y = "", color = "") +
scale_x_continuous(limits = c(1, x_limit), breaks = seq(1, x_limit, 2)) +
scale_y_continuous(limits = c(200, 500), breaks = seq(200, 500, 50))
}
There is no point of showing all the count and score values because they are skewed and get scarce after a certain value. I want to see the value points for quantile 90%, 95%, 99%.
cat("\nFarmers Market Count 90% Qt by 5%: ",
quantile(heart_mortality$farmers_market_count, probs = seq(0.9, 1, 0.05)),
"\nFarmers Market Score 90% Qt by 5%: ",
quantile(heart_mortality$farmers_market_score, probs = seq(0.9, 1, 0.05)))
##
## Farmers Market Count 90% Qt by 5%: 6 10 125
## Farmers Market Score 90% Qt by 5%: 18 20 29
90% of counties have 6 or less farmers markets, and 95% having 10 or less. Between 95% and the end is a huge gap. The same can describe the score even if it has a bit less impressive gap between 95% and the maximum. I like to cover 95% of all observations.
# include over 95% observations
grid.arrange(median_mortality_line("count", 11),
median_mortality_line("score", 21),
ncol = 2, left = "Mortality",
top = "Median Mortality vs. Farmers Market Count/Score")
The dotted lines are median values for all regardless of the farmers market factors. Let’s take a look at the farmers market count first. Both male and female show declining trends that are already on or below the aggregated median at 2 counts. The trends continue upto 6 for both genders and then bumps up a little at 7 counts. This could be due to the similar reason I guessed for Nevada counties that a city can develop its own problems as it gets bigger. The trends get back on track from 8 counts and reach the lowest at 11 counts. However, I should not put too much emphasis on the lowest, recalling that 95% quantile is around that, there are not that many observations there.
As for farmers market score, it does not seem to have as strong trends. They have quite a few bumps on the way. Generally speaking, it looks like they are below the dotted lines after score 10 or so. All I can say is that there are slight trends that go down.
I have confirmed some correlation between the number of farmers markets and heart disease mortality for counties. The correlations are found to be strong in the southern states like Florida, South Carolina, Tennessee. Especially, they have stronger relationships with how many farmers markets there are than how big farmers markets are. In other words, quantitative aspects outmatter qualitative aspects. This could be due to the scoring equation I set up that simply gives one point for one item available, which would probably be unfair because some items must have more to do with health than other items. The scoring might explain the median graphs too that show the declining trends for both men and women as the number of farmers markets increase, but not exactly so as the scores go up. In conclusion, I have found some meaningful reltionship that heart disease mortality is lower where there are more farmers markets than there are less, and some relationship with quality of farmers markets.