House prices are considered difficult to predict because there are so many variables in the market. However, the price can be estimated from a series of major factors such as size, area, and quality. Although a price is really the product of numerous interacting factors, these quantitative and qualitative variables are useful to narrow the price down to a range, which is incomparably better than just a mean or median. In the course of this project, we first analyze the observations to gain insight for modeling and to filter problematic data such as influential points. Once a linear model is fitted, we evaluate it with techniques like residual analysis and possibly remove insignificant features. Lastly, we build a predictive model and measure its performance with RMSE.
- Preparation
  - Dataset
  - Rows of Interest
  - Dependent Variable
  - Outliers
  - Drop Variables
- Data Analysis
  - Correlation Matrix
  - Transformation
  - Box Plot
  - Facet Plot
- Diagnostic Analysis
  - Plots and Influential Points
  - Remove Influential Points
- Best Model
  - Adjusted R-squared
  - AIC
- Prediction
  - Root Mean Square Error
- Conclusion
- Reference
The data set used for this project is D.C. Residential Properties, which is refined from the original source on D.C. Open Data; details of the columns can be found on the Metadata page. Before analyzing the data, each column needs to be examined to decide whether to include it in a model, and to find outliers that could have a negative influence on modeling. This process starts with loading the required libraries.
library(dplyr) # df manipulation
library(ggplot2) # graphical plot
library(GGally) # ggplot extension
library(grid) # grid plot
library(gridExtra) # grid extension
library(knitr) # pretty print
library(kableExtra) # print styling
library(stringr) # handle strings
library(ggmap) # geographical map
library(corrplot) # correlation matrix
library(car) # VIF
library(caret) # data set split
library(lmtest) # non-constant variance test
library(MASS) # AIC
library(leaps) # model selection
library(faraway) # max adjusted r
library(DAAG) # cross-validation
# keep dplyr::select (MASS, loaded above, masks it)
select <- dplyr::select
The next step is reading the data set and taking a look at its structure. This step focuses on getting an overall idea of what the columns are. Columns with redundant information, ID columns, or empty columns can be found and dropped. Data types are also checked, since some columns could be converted to a more appropriate type.
# read the dataset
house.price <- read.csv('DC_Properties.csv', na.strings = c("", "NA"))
# show column structures
str(house.price)
## 'data.frame': 158957 obs. of 49 variables:
## $ X.1 : int 0 1 2 3 4 5 6 7 8 9 ...
## $ BATHRM : int 4 3 3 3 2 3 1 3 3 1 ...
## $ HF_BATHRM : int 0 1 1 1 1 2 0 1 1 1 ...
## $ HEAT : Factor w/ 14 levels "Air-Oil","Air Exchng",..: 13 13 8 8 13 8 13 8 13 8 ...
## $ AC : Factor w/ 3 levels "0","N","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ NUM_UNITS : num 2 2 2 2 1 1 2 2 2 1 ...
## $ ROOMS : int 8 11 9 8 11 10 5 8 7 6 ...
## $ BEDRM : int 4 5 5 5 3 5 2 4 3 2 ...
## $ AYB : num 1910 1898 1910 1900 1913 ...
## $ YR_RMDL : num 1988 2007 2009 2003 2012 ...
## $ EYB : int 1972 1972 1984 1984 1985 1972 1957 1972 1967 1950 ...
## $ STORIES : num 3 3 3 3 3 4 2 3 2 2 ...
## $ SALEDATE : Factor w/ 6937 levels "1947-05-14 00:00:00",..: 3257 2422 6427 3926 NA 4847 5141 5249 6889 4539 ...
## $ PRICE : num 1095000 NA 2100000 1602000 NA ...
## $ QUALIFIED : Factor w/ 2 levels "Q","U": 1 2 1 1 2 1 2 1 1 2 ...
## $ SALE_NUM : int 1 1 3 1 1 1 1 1 4 1 ...
## $ GBA : num 2522 2567 2522 2484 5255 ...
## $ BLDG_NUM : int 1 1 1 1 1 1 1 1 1 1 ...
## $ STYLE : Factor w/ 18 levels "1 Story","1.5 Story Fin",..: 7 7 7 7 7 10 4 7 4 4 ...
## $ STRUCT : Factor w/ 9 levels "Default","Multi",..: 4 4 4 4 5 4 4 4 4 4 ...
## $ GRADE : Factor w/ 13 levels "Above Average",..: 13 13 13 13 13 13 1 13 1 9 ...
## $ CNDTN : Factor w/ 7 levels "Average","Default",..: 5 5 7 5 5 5 1 1 7 1 ...
## $ EXTWALL : Factor w/ 25 levels "Adobe","Aluminum",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ ROOF : Factor w/ 16 levels "Built Up","Clay Tile",..: 9 1 1 1 10 1 9 9 1 1 ...
## $ INTWALL : Factor w/ 12 levels "Carpet","Ceramic Tile",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ KITCHENS : num 2 2 2 2 1 1 2 2 2 1 ...
## $ FIREPLACES : int 5 4 4 3 0 4 0 1 1 0 ...
## $ USECODE : int 24 24 24 24 13 11 24 24 24 11 ...
## $ LANDAREA : int 1680 1680 1680 1680 2032 2196 1261 1627 1424 1424 ...
## $ GIS_LAST_MOD_DTTM : Factor w/ 2 levels "2018-07-22 18:01:38",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ SOURCE : Factor w/ 2 levels "Condominium",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ CMPLX_NUM : num NA NA NA NA NA NA NA NA NA NA ...
## $ LIVING_GBA : num NA NA NA NA NA NA NA NA NA NA ...
## $ FULLADDRESS : Factor w/ 105978 levels "1 16TH STREET NE",..: 26459 26410 26354 26291 27230 24511 26809 26392 25601 25911 ...
## $ CITY : Factor w/ 1 level "WASHINGTON": 1 1 1 1 1 1 1 1 1 1 ...
## $ STATE : Factor w/ 1 level "DC": 1 1 1 1 1 1 1 1 1 1 ...
## $ ZIPCODE : num 20009 20009 20009 20009 20009 ...
## $ NATIONALGRID : Factor w/ 105949 levels "18S UH 25216 99697",..: 24683 24713 24755 24776 25561 25343 24575 25100 25215 25118 ...
## $ LATITUDE : num 38.9 38.9 38.9 38.9 38.9 ...
## $ LONGITUDE : num -77 -77 -77 -77 -77 ...
## $ ASSESSMENT_NBHD : Factor w/ 57 levels "16th Street Heights",..: 43 43 43 43 43 43 43 43 43 43 ...
## $ ASSESSMENT_SUBNBHD: Factor w/ 121 levels "001 A American University",..: 93 93 93 93 93 93 93 93 93 93 ...
## $ CENSUS_TRACT : num 4201 4201 4201 4201 4201 ...
## $ CENSUS_BLOCK : Factor w/ 3848 levels "000100 1000",..: 1676 1676 1676 1676 1676 1676 1675 1675 1675 1675 ...
## $ WARD : Factor w/ 8 levels "Ward 1","Ward 2",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ SQUARE : Factor w/ 3292 levels "0004","0005",..: 58 58 58 58 58 58 58 58 58 58 ...
## $ X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ QUADRANT : Factor w/ 4 levels "NE","NW","SE",..: 2 2 2 2 2 2 2 2 2 2 ...
# drop ID, single value, redundant columns
house.price <- select(house.price, -c("X.1", "CITY", "STATE", "X", "Y"))
# ZIPCODE, USECODE are more of categorical data
house.price$ZIPCODE <- as.factor(house.price$ZIPCODE)
house.price$USECODE <- as.factor(house.price$USECODE)
house.price$CENSUS_TRACT <- as.factor(house.price$CENSUS_TRACT)
# show first a few rows
kable(head(house.price, n = 10)) %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
BATHRM | HF_BATHRM | HEAT | AC | NUM_UNITS | ROOMS | BEDRM | AYB | YR_RMDL | EYB | STORIES | SALEDATE | PRICE | QUALIFIED | SALE_NUM | GBA | BLDG_NUM | STYLE | STRUCT | GRADE | CNDTN | EXTWALL | ROOF | INTWALL | KITCHENS | FIREPLACES | USECODE | LANDAREA | GIS_LAST_MOD_DTTM | SOURCE | CMPLX_NUM | LIVING_GBA | FULLADDRESS | ZIPCODE | NATIONALGRID | LATITUDE | LONGITUDE | ASSESSMENT_NBHD | ASSESSMENT_SUBNBHD | CENSUS_TRACT | CENSUS_BLOCK | WARD | SQUARE | QUADRANT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | Warm Cool | Y | 2 | 8 | 4 | 1910 | 1988 | 1972 | 3 | 2003-11-25 00:00:00 | 1095000 | Q | 1 | 2522 | 1 | 3 Story | Row Inside | Very Good | Good | Common Brick | Metal- Sms | Hardwood | 2 | 5 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1748 SWANN STREET NW | 20009 | 18S UJ 23061 09289 | 38.91468 | -77.04083 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 1 | Warm Cool | Y | 2 | 11 | 5 | 1898 | 2007 | 1972 | 3 | 2000-08-17 00:00:00 | NA | U | 1 | 2567 | 1 | 3 Story | Row Inside | Very Good | Good | Common Brick | Built Up | Hardwood | 2 | 4 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1746 SWANN STREET NW | 20009 | 18S UJ 23067 09289 | 38.91468 | -77.04076 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 1 | Hot Water Rad | Y | 2 | 9 | 5 | 1910 | 2009 | 1984 | 3 | 2016-06-21 00:00:00 | 2100000 | Q | 3 | 2522 | 1 | 3 Story | Row Inside | Very Good | Very Good | Common Brick | Built Up | Hardwood | 2 | 4 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1744 SWANN STREET NW | 20009 | 18S UJ 23074 09289 | 38.91468 | -77.04068 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 1 | Hot Water Rad | Y | 2 | 8 | 5 | 1900 | 2003 | 1984 | 3 | 2006-07-12 00:00:00 | 1602000 | Q | 1 | 2484 | 1 | 3 Story | Row Inside | Very Good | Good | Common Brick | Built Up | Hardwood | 2 | 3 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1742 SWANN STREET NW | 20009 | 18S UJ 23078 09288 | 38.91468 | -77.04063 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
2 | 1 | Warm Cool | Y | 1 | 11 | 3 | 1913 | 2012 | 1985 | 3 | NA | NA | U | 1 | 5255 | 1 | 3 Story | Semi-Detached | Very Good | Good | Common Brick | Neopren | Hardwood | 1 | 0 | 13 | 2032 | 2018-07-22 18:01:43 | Residential | NA | NA | 1804 NEW HAMPSHIRE AVENUE NW | 20009 | 18S UJ 23188 09253 | 38.91438 | -77.03936 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 2 | Hot Water Rad | Y | 1 | 10 | 5 | 1913 | NA | 1972 | 4 | 2010-02-26 00:00:00 | 1950000 | Q | 1 | 5344 | 1 | 4 Story | Row Inside | Very Good | Good | Common Brick | Built Up | Hardwood | 1 | 4 | 11 | 2196 | 2018-07-22 18:01:43 | Residential | NA | NA | 1709 S STREET NW | 20009 | 18S UJ 23157 09248 | 38.91433 | -77.03971 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
1 | 0 | Warm Cool | Y | 2 | 5 | 2 | 1917 | 1988 | 1957 | 2 | 2011-05-02 00:00:00 | NA | U | 1 | 1260 | 1 | 2 Story | Row Inside | Above Average | Average | Common Brick | Metal- Sms | Hardwood | 2 | 0 | 24 | 1261 | 2018-07-22 18:01:43 | Residential | NA | NA | 1769 SWANN STREET NW | 20009 | 18S UJ 23042 09323 | 38.91498 | -77.04105 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
3 | 1 | Hot Water Rad | Y | 2 | 8 | 4 | 1906 | 2011 | 1972 | 3 | 2011-09-29 00:00:00 | 1050000 | Q | 1 | 2401 | 1 | 3 Story | Row Inside | Very Good | Average | Common Brick | Metal- Sms | Hardwood | 2 | 1 | 24 | 1627 | 2018-07-22 18:01:43 | Residential | NA | NA | 1746 1/2 T STREET NW | 20009 | 18S UJ 23124 09368 | 38.91541 | -77.04013 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
3 | 1 | Warm Cool | Y | 2 | 7 | 3 | 1908 | 2008 | 1967 | 2 | 2018-05-03 00:00:00 | 1430000 | Q | 4 | 1488 | 1 | 2 Story | Row Inside | Above Average | Very Good | Common Brick | Built Up | Hardwood | 2 | 1 | 24 | 1424 | 2018-07-22 18:01:43 | Residential | NA | NA | 1727 SWANN STREET NW | 20009 | 18S UJ 23142 09324 | 38.91502 | -77.03990 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
1 | 1 | Hot Water Rad | Y | 1 | 6 | 2 | 1908 | 1979 | 1950 | 2 | 2008-12-05 00:00:00 | NA | U | 1 | 1590 | 1 | 2 Story | Row Inside | Good Quality | Average | Common Brick | Built Up | Hardwood | 1 | 0 | 11 | 1424 | 2018-07-22 18:01:43 | Residential | NA | NA | 1733 SWANN STREET NW | 20009 | 18S UJ 23127 09324 | 38.91501 | -77.04008 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
There are two types of properties in the data set, and they show such distinctive patterns that they need to be studied separately. This study focuses on the 'Residential' type, which makes up the majority of observations. One factor that cannot be controlled is that house values fluctuate over time. Prices might need to be corrected to today's values, but that would be very challenging and might only add complication. For that reason, the cases are limited to the most recent two years of data, with the goal of analyzing the latest general trend.
# residential price
print(summary(house.price[house.price$SOURCE == "Residential", 'PRICE']))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 240000 440000 579912 750000 25100000 48796
# condominium price
print(summary(house.price[house.price$SOURCE == "Condominium", 'PRICE']))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 239000 370000 1436073 535000 137427545 11945
# reduce it to 'Residential' type
house.price <- house.price[house.price$SOURCE == "Residential", ]
# cases density histogram by sale date
house.price$SALEDATE <- substr(house.price$SALEDATE, 0, 10)
house.price$SALEDATE <- as.Date(house.price$SALEDATE)
sale.year <- as.numeric(substr(house.price[, "SALEDATE"], 0, 4))
sale.year <- sale.year[!is.na(sale.year)]
hist(sale.year,
     breaks = c(-Inf, seq(1990, 2018, 1)),
     xlim = c(1990, 2018),
     main = "Sales by Year",
     xlab = "Year",
     border = 'blue', col = 'green',
     prob = T)
axis(side = 1, at = c(2015, 2018))
lines(density(sale.year), col = 'red')
There are more observations as the date approaches the present. This is good news, as the study seeks the recent trend. The figure also shows an irregular pattern: the housing market is active in one season and slows down in the next. Old data will be of little use except for finding historical patterns. The data from 2016 onward still covers 20% of all observations.
quantile(sale.year, probs = seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1947 1999 2002 2005 2007 2010 2012 2014 2016 2017 2018
# drop sales made before 2016.01.01.
house.price <- subset(house.price, SALEDATE > as.Date("2015-12-31"))
It is reasonable to investigate house prices first since they are the target to be analyzed and predicted. Looking into the dependent variable also helps with forming ideas about the independent variables, since the goal is to define their relationships. The following histogram gives a basic idea of how prices are distributed.
# drop rows with price as NA
house.price <- house.price[complete.cases(house.price[ , "PRICE"]), ]
# histogram of dependent variable "PRICE"
ggplot(house.price, aes(x = PRICE)) +
geom_histogram(binwidth = 20000,
fill = "blue",
alpha = .25) +
labs(title = "Washington D.C House Price Histogram") +
labs(x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(c(0, 2000000)) +
geom_vline(aes(xintercept = mean(PRICE), color = "mean"),
linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = median(PRICE), color = "median"),
linetype = "dashed", size = 1) +
scale_color_manual(name = "", values = c(mean = "pink", median = "green"))
From both the shape and the statistics, house price data are skewed: most houses fall within a certain range, with a few expensive ones. However, a transformation should be considered carefully, asking for more evidence than just a histogram. The statistics below confirm that prices are skewed with extremely large values, and the minimum of $10 is clearly not a valid sale price either. In addition, judging from the median price, buying a house in Washington D.C. would be very difficult.
summary(house.price$PRICE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 418000 659266 789970 940000 25100000
One of the best ways to get a sense of data is visualization. Since the data are fundamentally geographic, it is a good attempt to visualize prices on a Washington D.C. map. On the map, the darker a dot, the more expensive the house.
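The chunk below references “dc_map”, “price_group”, and “price_label”, which are created in a setup chunk not shown here. A minimal sketch of what that setup might look like (the bounding box, tile source, and bin edges are assumptions, not the original values):
# sketch of the setup objects assumed by the map chunk below
# (hypothetical bounding box, tiles, and price bins)
dc_bbox <- c(left = -77.12, bottom = 38.79, right = -76.90, top = 39.00)
dc_map <- get_stamenmap(dc_bbox, zoom = 12, maptype = "toner-lite")
price_group <- c(0, 250000, 500000, 750000, 1000000, 1500000, Inf)
price_label <- c("< 250K", "250K-500K", "500K-750K", "750K-1M", "1M-1.5M", "> 1.5M")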
ggmap(dc_map) +
geom_point(data = house.price,
aes(LONGITUDE, LATITUDE,
color = cut(PRICE, price_group)),
size = 0.3, alpha = 0.3) +
labs(title = "Washington D.C House Price Heat Map") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
scale_color_brewer(palette = "Oranges",
name = "Price Range",
labels = price_label)
The areas with no dots are mostly hills or other terrain where houses cannot be built. There are clearly some expensive areas and some cheaper ones. Washington D.C. consists of 8 wards, of which wards 2 and 3 have the darker dots while the rest are lighter. Ward 6 in the middle is ambiguous, as its left side is dark but its right side is not. Ward is actually one of the columns, so prices will be studied ward by ward.
As shown on the histogram, there is a series of very expensive houses. These would probably interfere with modeling and must be handled before moving forward. Outliers here are defined as observations more than 1.5 × IQR below the first quartile or above the third quartile. The lower cut-off, however, is negative, leaving only the expensive houses as outliers. \[Interquartile\ Range\ (IQR)=Q_3-Q_1\]
lowerq <- quantile(house.price$PRICE)[2]
upperq <- quantile(house.price$PRICE)[4]
iqr <- upperq - lowerq
lower.threshold <- lowerq - (iqr*1.5)
upper.threshold <- upperq + (iqr*1.5)
cat("Lower Outlier Threshold:", lower.threshold,
"\nUpper Outlier Threshold:", upper.threshold)
## Lower Outlier Threshold: -365000
## Upper Outlier Threshold: 1723000
house.price <- house.price[house.price$PRICE < upper.threshold, ]
As explained, some variables carry no meaningful or unique information. Even if a variable has its own information, it can still be of little use when another variable replaces it or carries better information. The data set has many geographical columns which come in different forms but share common ground. Before taking care of that, NA values should be handled.
# no longer needed
house.price <- select(house.price, -c("SOURCE", "SALEDATE"))
# count NA for each variable
na_count <- sapply(house.price, function(y) {
sum(length(which(is.na(y))))
})
# all the variables with NA's
na_count <- na_count[na_count != 0]
# variable and the number of NA's
print(na_count)
## AYB YR_RMDL STORIES
## 19 4861 19
## KITCHENS CMPLX_NUM LIVING_GBA
## 1 12041 12041
## FULLADDRESS NATIONALGRID ASSESSMENT_SUBNBHD
## 119 119 1439
## CENSUS_BLOCK QUADRANT
## 119 25
“LIVING GBA” and “CMPLX NUM” are completely empty, which makes sense as they are typical condominium columns, and condominiums were excluded at the beginning. Therefore, these two can simply be dropped. “YR RMDL” is about one third empty, presumably because not all houses have been remodeled. While dropping is not always the best option, the solution here is a type conversion into a yes/no flag for whether the house was remodeled. The rest are mostly geographical columns. While neighborhood is accepted as a good indicator of house prices, there are 57 of them, which is too many levels; as the final model will be linear, having too many categories makes it susceptible to overfitting and to losing the general trend. Most importantly, ward seems very useful, as confirmed on the map.
# drop empty variables
house.price <- select(house.price, -c("CMPLX_NUM", "LIVING_GBA"))
# transform YR_RMDL by encoding into Yes or No
house.price$RMDL <- ifelse(is.na(house.price$YR_RMDL), "N", "Y")
house.price$RMDL <- as.factor(house.price$RMDL)
# drop geographical variables except for WARD
house.price <- select(house.price, -c("ZIPCODE", "USECODE", "SQUARE", "NATIONALGRID",
"ASSESSMENT_NBHD", "ASSESSMENT_SUBNBHD",
"CENSUS_BLOCK", "CENSUS_TRACT",
"QUADRANT", "FULLADDRESS",
"LONGITUDE", "LATITUDE"))
# fine tuning
print(levels(house.price$GIS_LAST_MOD_DTTM))
## [1] "2018-07-22 18:01:38" "2018-07-22 18:01:43"
house.price <- select(house.price, -c("GIS_LAST_MOD_DTTM", "YR_RMDL"))
# drop small number of NA's (AYB, STORIES...)
house.price <- house.price[complete.cases(house.price), ]
This part concentrates on relationships among variables. Separate analysis will be conducted by data type, numerical and categorical, and then another for both of them together. For numerical types, correlation matrix will be drawn not only to find relationship with dependent variable but also to catch multicollinearity patterns. For categorical types, box plots will be used to see whether each category has an ability to divide house prices and what values are useful if so. Box plot also can help to decide whether to reduce number of values in a category.
numeric_cols <- names(select_if(house.price, is.numeric))
print(numeric_cols)
## [1] "BATHRM" "HF_BATHRM" "NUM_UNITS" "ROOMS" "BEDRM"
## [6] "AYB" "EYB" "STORIES" "PRICE" "SALE_NUM"
## [11] "GBA" "BLDG_NUM" "KITCHENS" "FIREPLACES" "LANDAREA"
corr <- cor(house.price[, numeric_cols])
corrplot.mixed(corr, number.cex = .75, tl.pos = "lt")
“SALE NUM” has no relationship with any other variable, so it can safely be dropped. “NUM UNITS” and “KITCHENS” have relationships with some variables but almost none with “PRICE”. Our target “PRICE” has a decent relationship with the rest, especially with “GBA” and the room-related variables. However, there are multicollinearity issues, such as between “ROOMS” and “BEDRM”. These should be handled first, and then another matrix should be drawn.
# drop unrelated variables
house.price <- subset(house.price, select = -c(SALE_NUM, BLDG_NUM, NUM_UNITS,
KITCHENS, LANDAREA))
# test for multicollinearity of numerical variables
VIF.model <- lm(PRICE ~ BATHRM + HF_BATHRM + ROOMS + BEDRM +
AYB + EYB + STORIES + GBA + FIREPLACES,
data = house.price)
print(vif(VIF.model))
## BATHRM HF_BATHRM ROOMS BEDRM AYB EYB
## 2.2109 1.2203 2.3428 2.1811 2.0363 2.5310
## STORIES GBA FIREPLACES
## 1.1156 2.5999 1.2261
The variance inflation factor (VIF) measures whether variables are causing multicollinearity problems. A VIF of 5 to 10 is usually taken as the limit that variables should not exceed. Since no variable goes that high, we can keep them all and draw another correlation matrix.
\[VIF_i=\frac{1}{1-R_i^2}\]
# final correlation table
numeric_cols <- names(select_if(house.price, is.numeric))
corr <- cor(house.price[, numeric_cols])
corrplot.mixed(corr, number.cex = .75, tl.pos = "lt")
Even though there are still moderate multicollinearity issues, they are below the serious level and will be handled again during modeling. Again, house prices seem to have high correlations with area-related variables, which comes as no surprise. Quality-related factors such as year built or fireplaces would still mildly affect house prices. Although there are higher numbers elsewhere on the matrix, “PRICE” is the only variable that correlates fairly uniformly with all the others.
A scatter plot is the basic tool for finding a pattern between two numerical variables: a slope, whether upward or downward, means a relationship. When units differ widely among variables, normalization could be considered, because plotting them together as-is might mislead interpretation. A simpler way to deal with this is to plot similar variables together instead of going through a normalization process. Variables related to rooms are in a similar range, so they are plotted together first. Next, the year-related variables and the rest are drawn on the same pane.
# define a function for repeating uses
house.price_scatter <- function(var, x_start, x_lim) {
ggplot(house.price, aes_string(var, "PRICE", fill = var)) +
geom_point(color = "darkgreen", shape = 1, alpha = 0.1,
size = 0.2, position = "jitter") +
geom_smooth(span = 5, method = lm) +
ggtitle(toString(var)) + xlab("") +
theme(legend.position="none") +
scale_x_continuous(limits = c(x_start, x_lim)) +
scale_y_continuous(limits = c(0, 1750000),
breaks = seq(0, 1750000, 250000))
}
# room-related variables
BATHRM <- house.price_scatter("BATHRM", 0, 8)
HF_BATHRM <- house.price_scatter("HF_BATHRM", 0, 4)
BEDRM <- house.price_scatter("BEDRM", 0, 8)
ROOMS <- house.price_scatter("ROOMS", 0, 15)
grid.arrange(BATHRM, HF_BATHRM, BEDRM, ROOMS,
ncol = 2, nrow = 2)
It seems there is a strong upward pattern for all of them. It is natural that a house price increases with the number of rooms. One surprising thing is that the number of bathrooms has a sharper slope than the number of bedrooms. Indeed, it is often seen that a 2-bedroom/2-bathroom house is more expensive than a 3-bedroom/1-bathroom house. Another thing to note is that prices seem to vary more widely as the room counts increase. For example, the price range at 3 to 4 rooms is much smaller than at 7 to 8 rooms. Since the room variables will play a significant role, heteroscedasticity is expected, and this will be addressed later.
# year-related and quality variables
AYB <- house.price_scatter("AYB", 1930, 2020)
EYB <- house.price_scatter("EYB", 1930, 2020)
STORIES <- house.price_scatter("STORIES", 0, 10)
FIREPLACES <- house.price_scatter("FIREPLACES", 0, 5)
grid.arrange(AYB, EYB, STORIES, FIREPLACES,
ncol = 2, nrow = 2)
The year-related variables show only a rough pattern. “EYB” has some slope, driven by data after 2000; these values must be heavily affected by contingent market trends rather than typical features. Although “STORIES” has a sharp slope, most houses have 2 to 3 stories with large variances. “FIREPLACES” shows a decent upward trend within the 0-to-1 range.
# GBA showing growing variances
ggplot(house.price, aes(x = GBA, y = PRICE)) +
geom_point(color = "darkgreen",
shape = 1, alpha = 0.3, size = 0.5) +
geom_smooth() +
ggtitle("GBA") + xlab("") +
scale_x_continuous(limits = c(0, 6000),
breaks = seq(0, 6000, 1000)) +
scale_y_continuous(limits = c(0, 1750000),
breaks = seq(0, 1750000, 250000))
“GBA” reveals the heteroscedasticity issue very clearly: as a house gets bigger, its price could fall in almost any range. One way to understand this is that 1-bedroom houses should sit in a small price range, because even a new house downtown is still just a 1-bedroom. A 4-bedroom house, however, could be worth over a million dollars or be an old, poorly maintained house on auction.
Now that the variance issue is found, it is time to fix it. As a final check, the Breusch-Pagan test can be conducted on a linear model. The test runs with the null hypothesis that the variance is constant across fitted values. As seen below, the results are enough to reject the null hypothesis and confirm the variance issue.
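For reference, the studentized statistic reported below is n times the R-squared of an auxiliary regression of the squared residuals on the predictors, compared against a chi-squared distribution with degrees of freedom equal to the number of predictors: \[BP=n\ R^2_{aux}\sim\chi^2_{p}\]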
# linear model with numerical variables
Numeric.lm <- lm(PRICE ~ BATHRM + HF_BATHRM + BEDRM + ROOMS +
AYB + EYB + STORIES + FIREPLACES + GBA,
data = house.price)
# constant variance tests
print(bptest(Numeric.lm))
##
## studentized Breusch-Pagan test
##
## data: Numeric.lm
## BP = 1007.1, df = 9, p-value < 2.2e-16
print(ncvTest(Numeric.lm))
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1220.9, Df = 1, p = < 2.22e-16
Box-Cox suggests how to transform the data. The method's core concept is the lambda parameter, which is searched over a certain range; a lambda near 0.5 indicates a square-root transformation for prices.
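For reference, the Box-Cox family transforms the response as follows; the estimated lambda of about 0.55 is close enough to 0.5 to justify the simpler square root: \[y^{(\lambda)}=\begin{cases}\frac{y^{\lambda}-1}{\lambda} & \lambda\neq 0\\ \ln{y} & \lambda=0\end{cases}\]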
# box-cox
bc <- boxCox(Numeric.lm)
lambda <- bc$x[which.max(bc$y)]
cat("Lambda for Transformation:", lambda)
## Lambda for Transformation: 0.5454545
# transformation to sqrt
house.price$PRICE_TRANS <- sqrt(house.price$PRICE)
house.price <- select(house.price, -PRICE)
# price-sqrt histogram
ggplot(house.price, aes(x = PRICE_TRANS)) +
geom_histogram(aes(y = ..density..),
binwidth = 20, fill = "blue", alpha = 0.25) +
geom_density(fill = "pink", alpha = 0.25) +
labs(title = "Washington D.C House Price SQRT") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(aes(xintercept = mean(PRICE_TRANS), color = "mean"), size = 0.5) +
geom_vline(aes(xintercept = median(PRICE_TRANS), color = "median"),
linetype = "dashed", size = 0.5) +
scale_x_continuous(limits = c(0, 1500)) +
scale_color_manual(name = "", values = c(mean = "red", median = "blue")) +
stat_function(fun = dnorm, args = list(
mean = mean(house.price$PRICE_TRANS), sd = sd(house.price$PRICE_TRANS)),
linetype = "dashed")
After transformation, the distribution is better than the initial one. The dashed line is the normal curve for comparison. Despite not being a perfect match, at least the skewness is fixed, as the mean and median are aligned.
# final correlations with PRICE_TRANS
X_num <- names(select_if(house.price, is.numeric))
X_num <- X_num[X_num != "PRICE_TRANS"]
cat("[Price SQRT vs. X Variables Correlation]\n")
## [Price SQRT vs. X Variables Correlation]
kable(cor(house.price[, X_num], house.price$PRICE_TRANS)) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Variable | Correlation |
---|---|
BATHRM | 0.4432783 |
HF_BATHRM | 0.2619754 |
ROOMS | 0.2784958 |
BEDRM | 0.3369135 |
AYB | -0.2574367 |
EYB | 0.2098391 |
STORIES | 0.2011221 |
GBA | 0.4984254 |
FIREPLACES | 0.4796368 |
Although the correlations decreased a bit, it is more important to abide by the assumptions of constant variance and normality; thus the transformation is necessary.
A box plot helps to see whether a categorical variable can divide the dependent variable into groups. The data set has several categorical variables covering building materials, functions, quality, and area. Some have too many distinct values, some have unbalanced values, and others have their own issues. Typical remedies are dropping the variable, merging sparse levels into groups, and re-encoding the variable numerically, some of which will be used in this project.
categorical_cols <- names(select_if(house.price, is.factor))
print(categorical_cols)
## [1] "HEAT" "AC" "QUALIFIED" "STYLE" "STRUCT"
## [6] "GRADE" "CNDTN" "EXTWALL" "ROOF" "INTWALL"
## [11] "WARD" "RMDL"
# number of unique values in each categorical variable
str(apply(house.price[, categorical_cols], 2, function(x) unique(x)))
## List of 12
## $ HEAT : chr [1:14] "Warm Cool" "Hot Water Rad" "Forced Air" "Ht Pump" ...
## $ AC : chr [1:3] "Y" "N" "0"
## $ QUALIFIED: chr [1:2] "Q" "U"
## $ STYLE : chr [1:13] "2 Story" "3 Story" "2.5 Story Fin" "4 Story" ...
## $ STRUCT : chr [1:8] "Row Inside" "Semi-Detached" "Multi" "Row End" ...
## $ GRADE : chr [1:9] "Above Average" "Very Good" "Average" "Excellent" ...
## $ CNDTN : chr [1:7] "Very Good" "Excellent" "Good" "Average" ...
## $ EXTWALL : chr [1:21] "Common Brick" "Stone" "Stucco" "Brick/Stone" ...
## $ ROOF : chr [1:12] "Built Up" "Metal- Sms" "Slate" "Comp Shingle" ...
## $ INTWALL : chr [1:11] "Hardwood" "Hardwood/Carp" "Carpet" "Wood Floor" ...
## $ WARD : chr [1:8] "Ward 2" "Ward 1" "Ward 6" "Ward 5" ...
## $ RMDL : chr [1:2] "Y" "N"
# fix '0' values in AC
cat("Number of '0's in AC:", pull(count(house.price[house.price$AC == '0', ])))
## Number of '0's in AC: 6
house.price[house.price$AC == '0', 'AC'] <- "N"
# define a function for repeated uses
house.price_boxplot <- function(var, x.angle) {
ggplot(house.price, aes_string(var, "PRICE_TRANS", fill = var)) +
geom_boxplot(outlier.colour = "black",
outlier.shape = 16,
outlier.size = 1) +
geom_hline(aes(yintercept = median(PRICE_TRANS)),
color = 'gray', linetype = "dashed", size = 0.5) +
ggtitle(toString(var)) + xlab("") +
theme(legend.position="none",
axis.text.x = element_text(angle = x.angle, hjust = 1))
}
# air conditioning, remodeled, qualified, heating, ward
lay <- rbind(c(1, 2, 3),
c(4, 4, 5))
grid.arrange(grobs = list(house.price_boxplot("AC", 0),
house.price_boxplot("RMDL", 0),
house.price_boxplot("QUALIFIED", 0),
house.price_boxplot("HEAT", 90),
house.price_boxplot("WARD", 90)),
heights = c(0.4, 0.6),
layout_matrix = lay)
“AC”, “RMDL”, and “QUALIFIED” have two classes each. Among them, “QUALIFIED” does a good job of separating expensive houses from cheap ones; naturally, qualified houses are worth more, and the same goes for remodeled houses and houses with air conditioning. “HEAT” is confusing in that most of its values sit around the median price (the dashed line). Though there are values such as “Air Exchng” or “Evp Cool”, their counts do not seem large enough to make a contribution, so “HEAT” should be dropped. “WARD” looks very promising, as each ward occupies its own price range. One issue is that the wards have many outliers, and wards 7 and 8 are almost identical. Nonetheless, this looks useful enough to work alongside the numerical variables.
# drop HEAT
print(summary(house.price$HEAT))
## Air-Oil Air Exchng Elec Base Brd Electric Rad Evp Cool
## 11 2 8 6 2
## Forced Air Gravity Furnac Hot Water Rad Ht Pump Ind Unit
## 5450 8 3957 189 1
## No Data Wall Furnace Warm Cool Water Base Brd
## 7 22 2319 20
house.price <- select(house.price, -HEAT)
grid.arrange(house.price_boxplot("STYLE", 90),
house.price_boxplot("STRUCT", 90),
ncol = 2)
The values in “STYLE” are essentially story counts. Since the “STORIES” variable already exists, this column can be removed; also, almost all observations (9,557) are “2 Story”, making it severely unbalanced. “STRUCT” looks like it could help identify cheap houses through the values “Multi” or “Semi-Detached”.
# style
print(summary(house.price$STYLE))
## 1 Story 1.5 Story Fin 1.5 Story Unfin 2 Story
## 461 245 16 9557
## 2.5 Story Fin 2.5 Story Unfin 3 Story 3.5 Story Fin
## 633 63 918 14
## 3.5 Story Unfin 4 Story 4.5 Story Fin 4.5 Story Unfin
## 0 52 0 0
## Bi-Level Default Outbuildings Split Foyer
## 1 8 0 18
## Split Level Vacant
## 16 0
house.price <- select(house.price, -STYLE)
# struct
print(summary(house.price$STRUCT))
## Default Multi Row End Row Inside Semi-Detached
## 2 625 1432 5046 1829
## Single Town End Town Inside Vacant Land
## 3019 12 37 0
house.price$STRUCT <- ifelse(house.price$STRUCT %in%
c("Multi", "Semi-Detached", "Town Inside"),
"Cheap", "Expensive")
# condition, grade
grid.arrange(house.price_boxplot("CNDTN", 90),
house.price_boxplot("GRADE", 90),
ncol = 2)
“CNDTN” and “GRADE” show a similar pattern: average or fair houses are cheaper. Those houses sit below the median line and the rest sit clearly above it, so these values can be grouped into two. This way, the values in “CNDTN” become balanced too. The values in “GRADE”, however, cannot be cleanly divided into two groups, as there are unignorable discrepancies among similar values. This will be investigated further.
# grade
print(summary(house.price$GRADE))
## Above Average Average Excellent Exceptional-A Exceptional-B
## 3741 4675 229 14 1
## Exceptional-C Exceptional-D Fair Quality Good Quality Low Quality
## 0 0 22 2313 0
## No Data Superior Very Good
## 0 164 843
# condition
print(summary(house.price$CNDTN))
## Average Default Excellent Fair Good Poor Very Good
## 4179 1 236 117 5392 28 2049
house.price$CNDTN <- ifelse(house.price$CNDTN %in%
c("Average", "Fair", "Poor"), "Bad", "Good")
# exwall
house.price_boxplot("EXTWALL", 90)
There are just too many values in “EXTWALL”, so grouping them is inevitable. With some extra research on exterior materials, they can be grouped into “Premium” or “Normal”. Generally, brick and stucco are considered relatively better materials.
# exterior wall
print(summary(house.price$EXTWALL))
## Adobe Aluminum Brick Veneer Brick/Siding Brick/Stone
## 0 100 129 779 70
## Brick/Stucco Common Brick Concrete Concrete Block Default
## 65 9025 12 7 2
## Face Brick Hardboard Metal Siding Plywood Rustic Log
## 123 18 7 0 0
## Shingle SPlaster Stone Stone Veneer Stone/Siding
## 121 0 57 35 43
## Stone/Stucco Stucco Stucco Block Vinyl Siding Wood Siding
## 18 286 4 661 440
house.price$EXTWALL <- ifelse(house.price$EXTWALL %in%
c("Brick Veneer", "Brick/Stone", "Brick/Stucco", "Hardboard",
"Stone", "Stone/Stucco", "Stucco Block", "Wood Siding"),
"Premium", "Normal")
# interior wall
grid.arrange(house.price_boxplot("INTWALL", 90),
house.price_boxplot("ROOF", 90),
ncol = 2)
“INTWALL” might have a chance to be grouped, but it should be dropped because of its imbalance: 9,466 houses have the “Hardwood” interior type, which is almost all cases. As for “ROOF”, there are four types that are clearly above the median line, and these can be grouped together.
# drop Interior
print(summary(house.price$INTWALL))
## Carpet Ceramic Tile Default Hardwood Hardwood/Carp
## 300 6 6 9466 1419
## Lt Concrete Parquet Resiliant Terrazo Vinyl Comp
## 8 2 3 1 3
## Vinyl Sheet Wood Floor
## 0 788
house.price <- select(house.price, -INTWALL)
# Roof: cheap vs. expensive
print(summary(house.price$ROOF))
## Built Up Clay Tile Comp Shingle Composition Ro Concrete
## 3869 47 3464 24 0
## Concrete Tile Metal- Cpr Metal- Pre Metal- Sms Neopren
## 0 5 32 3460 121
## Shake Shingle Slate Typical Water Proof
## 65 43 853 19 0
## Wood- FS
## 0
house.price$ROOF <- ifelse(house.price$ROOF %in%
c("Clay Tile", "Metal- Cpr", "Neopren", "Slate"),
"Expensive", "Cheap")
A facet plot usually needs two numerical variables and a categorical variable. While “CNDTN” is now divided into two groups, it would not be a good idea to apply the same grouping to “GRADE”, since they are similar attributes. This is where a facet plot helps, to see whether “GRADE” could instead be converted into a numerical variable.
# facet points by grade
ggplot(house.price[!house.price$GRADE %in%
c("Exceptional-A", "Exceptional-B", "Fair Quality"), ],
aes(GBA, PRICE_TRANS, color = GRADE)) +
geom_point(alpha = 0.05, shape = 1) +
geom_smooth(method = lm) +
facet_wrap(~GRADE) +
scale_x_continuous(limits = c(0, 4000),
breaks = seq(0, 4000, 2000))
# median price by grade
GRADE_median <- house.price %>%
group_by(GRADE) %>%
summarise(MEDIAN_PRICE = median(PRICE_TRANS)) %>%
arrange(MEDIAN_PRICE)
GRADE_median$MEDIAN_PRICE <- round(GRADE_median$MEDIAN_PRICE, digits = 2)
kable(GRADE_median) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
GRADE | MEDIAN_PRICE |
---|---|
Fair Quality | 409.13 |
Average | 663.25 |
Above Average | 782.62 |
Exceptional-B | 921.95 |
Good Quality | 950.53 |
Very Good | 1067.71 |
Superior | 1113.55 |
Excellent | 1140.18 |
Exceptional-A | 1214.66 |
# new variable median price by grade
house.price$GRADE_MEDIAN_PRICE <- lapply(house.price$GRADE, function(x)
GRADE_median[match(x, GRADE_median$GRADE), "MEDIAN_PRICE"][[1]])
house.price$GRADE_MEDIAN_PRICE <- as.numeric(house.price$GRADE_MEDIAN_PRICE)
house.price <- select(house.price, -GRADE)
As seen in the graph and the table, prices are distinctively different by grade. Therefore, these per-grade median prices will replace the categorical variable “GRADE”, rather than grouping its values into reduced categories.
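As a side note, the lapply() encoding above can also be done with a single vectorized match() lookup; a sketch producing the same column (match() keeps the row names that later filtering steps rely on):
# equivalent vectorized alternative to the lapply() encoding above
house.price$GRADE_MEDIAN_PRICE <-
  GRADE_median$MEDIAN_PRICE[match(house.price$GRADE, GRADE_median$GRADE)]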
Recalling the goal of this project, evaluating linear models is crucial to building the best model. Diagnostic analysis is the step that does this in various ways, such as the Q-Q plot, the Residuals vs. Fitted plot, and Cook's distance. These techniques are designed to test assumptions like constant variance, normality of residuals, and random error. The first step is building a linear model that will serve as a baseline and be modified later.
\[Cook's\ Distance\ (D_i)=\frac{(y_i-\hat{y}_i)^2}{p\times MSE}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right]\]
house.lm <- lm(PRICE_TRANS ~ ., data = house.price)
summary(house.lm)
##
## Call:
## lm(formula = PRICE_TRANS ~ ., data = house.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1033.15 -48.89 -1.19 47.75 766.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.610e+02 1.503e+02 5.065 4.14e-07 ***
## BATHRM 2.367e+01 1.376e+00 17.195 < 2e-16 ***
## HF_BATHRM 1.989e+01 1.687e+00 11.791 < 2e-16 ***
## ACY 8.028e+00 2.612e+00 3.074 0.002117 **
## ROOMS -5.383e-02 6.188e-01 -0.087 0.930682
## BEDRM 4.084e+00 1.214e+00 3.365 0.000768 ***
## AYB -1.169e+00 5.455e-02 -21.429 < 2e-16 ***
## EYB 9.887e-01 1.062e-01 9.312 < 2e-16 ***
## STORIES -4.009e-01 1.662e+00 -0.241 0.809352
## QUALIFIEDU -9.573e+01 2.363e+00 -40.508 < 2e-16 ***
## GBA 6.857e-02 2.357e-03 29.092 < 2e-16 ***
## STRUCTExpensive 3.413e+01 2.300e+00 14.839 < 2e-16 ***
## CNDTNGood 3.435e+01 2.583e+00 13.300 < 2e-16 ***
## EXTWALLPremium 5.489e+00 3.490e+00 1.573 0.115761
## ROOFExpensive 1.115e+01 3.351e+00 3.328 0.000877 ***
## FIREPLACES 1.362e+01 1.350e+00 10.085 < 2e-16 ***
## WARDWard 2 8.984e+01 5.515e+00 16.289 < 2e-16 ***
## WARDWard 3 5.428e+01 4.768e+00 11.382 < 2e-16 ***
## WARDWard 4 -6.287e+01 3.984e+00 -15.783 < 2e-16 ***
## WARDWard 5 -8.875e+01 3.925e+00 -22.610 < 2e-16 ***
## WARDWard 6 3.723e+00 3.852e+00 0.966 0.333921
## WARDWard 7 -2.157e+02 4.370e+00 -49.356 < 2e-16 ***
## WARDWard 8 -2.458e+02 4.677e+00 -52.548 < 2e-16 ***
## RMDLY 2.674e+00 2.331e+00 1.147 0.251409
## GRADE_MEDIAN_PRICE 2.443e-01 9.596e-03 25.454 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92.86 on 11977 degrees of freedom
## Multiple R-squared: 0.8132, Adjusted R-squared: 0.8128
## F-statistic: 2172 on 24 and 11977 DF, p-value: < 2.2e-16
A few variables are not significant. “ROOMS” must be due to collinearity, and the rest are seemingly just not useful. Interestingly, “Ward 6” was already ambiguous on the map earlier.
With the baseline model, diagnostic plots can be drawn. The Residuals vs. Fitted plot should show a trend line lying horizontally around zero. An ideal Q-Q plot follows the diagonal line. Cook's distance provides limits that points should not cross; it gives the indices of problematic observations so they can be excluded.
plot(house.lm)
Generally speaking, the Residuals vs. Fitted plot looks okay, but the Q-Q plot does not stay on the normal line; from its shape, the distribution could be light-tailed. The Residuals vs. Leverage plot seems fine, as the points stay within the Cook's distance boundary.
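The observations flagged in the plots can also be listed programmatically; a minimal sketch using cooks.distance() (taking the top four mirrors the plots above and is an assumption, not part of the original workflow):
# rank observations by Cook's distance and show the most influential
cd <- cooks.distance(house.lm)
head(sort(cd, decreasing = TRUE), 4)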
# influential observations from Cook's Distance-Leverage
X_y <- names(select_if(house.price, is.numeric))
print(house.price[row.names(house.price) %in%
c(21, 18569, 24841, 97202), c(X_y)])
## BATHRM HF_BATHRM ROOMS BEDRM AYB EYB STORIES GBA FIREPLACES
## 21 3 1 14 5 1880 1987 3.0 3465 3
## 18569 2 1 8 3 1890 1976 2.0 2272 0
## 24841 4 1 7 3 1996 2009 2.5 2480 2
## 97202 1 0 4 1 1915 1954 1.0 684 0
## PRICE_TRANS GRADE_MEDIAN_PRICE
## 21 182.296462 1067.71
## 18569 3.162278 1113.55
## 24841 409.878031 1067.71
## 97202 1178.982612 663.25
# drop influential observations
house.price <- house.price[!row.names(house.price) %in%
c(21, 18569, 24841, 97202), ]
There are four observations that the plots suggest examining closely. They are clearly priced strangely for their features. For example, index 21 has 14 rooms yet a transformed price of only 182, while its grade's median is about 1068. The rest also deviate from the normal pattern.
resid.plot <- function(model) {
ggplot(model, aes(.resid)) +
geom_density(fill = "pink", color = "red", alpha = 0.5) +
stat_function(fun = dnorm, args = list(
mean = mean(resid(model)), sd = sd(resid(model))),
linetype = "dashed", geom = "area", alpha = 0.2) +
xlab("Price Square Root Residuals") + ylab("Density") +
ggtitle("Residual Plot") +
scale_x_continuous(limits = c(-500, 500)) +
coord_fixed(ratio = 1e5)
}
resid.plot(house.lm)
Following the warning from the Q-Q plot, the residual distribution should be checked again. The gray area in the back is the normal distribution, while the red area is the data distribution. It has a high peak with light tails on both sides, suggesting that most observations are centered around the mean. As long as it is a symmetric bell shape, it will not cause serious problems. However, the long tails still suggest some extreme values.
One method to detect influential points is DFBETAS, which measures how much impact each observation has on a particular coefficient. Recalling the correlation matrix, the top three predictors will be investigated in order to remove influential points. The ultimate goal of this process is, again, finding the general trend without being biased by extreme values.
\[DFBETAS_{ij}=\frac{\hat{\beta}_j-\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X^{T}X)^{-1}_{jj}}}\]
# dfbeta
dfbeta.GBA <- dfbetaPlots(house.lm, terms = "GBA")
dfbeta.BATHRM <- dfbetaPlots(house.lm, terms = "BATHRM")
dfbeta.FIREPLACES <- dfbetaPlots(house.lm, terms = "FIREPLACES")
“GBA”, “BATHRM”, and “FIREPLACES” are the three variables most correlated with price. Because they will affect house prices the most, influential points should be detected based on them. The following calculates the DFBETAS values and the threshold for the data set, and finds influential points for the three features.
\[Threshold=\frac{2}{\sqrt{n}}\ where\ n=observations\]
# calculate the dfbetas threshold (2 / sqrt(n))
dfbeta.thrhd <- 2 / sqrt(nrow(house.price))
dfbetas <- dfbetas(house.lm)
cat("DFBETA limit:", dfbeta.thrhd)
## DFBETA limit: 0.01825894
# influential point indices
dfbetas.infl.obs <- names(dfbetas[abs(dfbetas[ , "GBA"])
> dfbeta.thrhd, "GBA"])
dfbetas.infl.obs <- c(dfbetas.infl.obs,
names(dfbetas[abs(dfbetas[ , "BATHRM"])
> dfbeta.thrhd, "BATHRM"]))
dfbetas.infl.obs <- c(dfbetas.infl.obs,
names(dfbetas[abs(dfbetas[ , "FIREPLACES"])
> dfbeta.thrhd, "FIREPLACES"]))
# filter observations over dfbeta limit
house.price <- house.price[!row.names(house.price) %in% dfbetas.infl.obs, ]
house.infl.lm <- lm(PRICE_TRANS ~., data = house.price)
summary(house.infl.lm)
##
## Call:
## lm(formula = PRICE_TRANS ~ ., data = house.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466.85 -43.51 -1.04 40.47 544.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.651e+02 1.153e+02 5.767 8.28e-09 ***
## BATHRM 2.481e+01 1.124e+00 22.072 < 2e-16 ***
## HF_BATHRM 1.991e+01 1.316e+00 15.125 < 2e-16 ***
## ACY 8.312e+00 1.993e+00 4.171 3.05e-05 ***
## ROOMS -4.748e-01 4.962e-01 -0.957 0.338653
## BEDRM 5.246e+00 9.690e-01 5.414 6.31e-08 ***
## AYB -1.203e+00 4.323e-02 -27.831 < 2e-16 ***
## EYB 1.061e+00 8.227e-02 12.898 < 2e-16 ***
## STORIES 9.766e-01 1.250e+00 0.782 0.434494
## QUALIFIEDU -7.831e+01 1.875e+00 -41.760 < 2e-16 ***
## GBA 7.230e-02 2.077e-03 34.813 < 2e-16 ***
## STRUCTExpensive 3.292e+01 1.742e+00 18.902 < 2e-16 ***
## CNDTNGood 3.277e+01 1.980e+00 16.549 < 2e-16 ***
## EXTWALLPremium 3.696e-01 2.719e+00 0.136 0.891884
## ROOFExpensive 9.263e+00 2.626e+00 3.528 0.000421 ***
## FIREPLACES 1.490e+01 1.207e+00 12.341 < 2e-16 ***
## WARDWard 2 1.015e+02 4.449e+00 22.805 < 2e-16 ***
## WARDWard 3 6.428e+01 3.709e+00 17.330 < 2e-16 ***
## WARDWard 4 -5.428e+01 3.087e+00 -17.587 < 2e-16 ***
## WARDWard 5 -7.815e+01 3.061e+00 -25.532 < 2e-16 ***
## WARDWard 6 4.901e+00 3.006e+00 1.630 0.103099
## WARDWard 7 -2.095e+02 3.388e+00 -61.847 < 2e-16 ***
## WARDWard 8 -2.351e+02 3.607e+00 -65.190 < 2e-16 ***
## RMDLY 3.035e+00 1.785e+00 1.700 0.089117 .
## GRADE_MEDIAN_PRICE 2.430e-01 7.571e-03 32.094 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.8 on 10660 degrees of freedom
## Multiple R-squared: 0.891, Adjusted R-squared: 0.8907
## F-statistic: 3630 on 24 and 10660 DF, p-value: < 2.2e-16
After the influential points are removed, R-squared improves remarkably from 0.81 to 0.89. However, there are still insignificant features, so the last step is dropping them. These are features whose presence makes no difference to the model, so R-squared should stay the same after losing them. Beyond those, “ROOF” is relatively weaker than the rest. “Ward 6” is one of the values of “WARD”, and it will be handled before building the predictive model.
# drop insignificant regressors
house.price <- select(house.price, -c("ROOMS", "STORIES", "EXTWALL", "RMDL"))
write.csv(house.price, file = "DC_Properties_final.csv", row.names = F)
# regression with significant variables
house.final.lm <- lm(PRICE_TRANS ~ ., data = house.price)
summary(house.final.lm)
##
## Call:
## lm(formula = PRICE_TRANS ~ ., data = house.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466.55 -43.35 -0.93 40.45 544.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.982e+02 1.110e+02 6.292 3.25e-10 ***
## BATHRM 2.478e+01 1.083e+00 22.871 < 2e-16 ***
## HF_BATHRM 1.992e+01 1.314e+00 15.160 < 2e-16 ***
## ACY 9.162e+00 1.939e+00 4.726 2.32e-06 ***
## BEDRM 4.992e+00 9.271e-01 5.385 7.41e-08 ***
## AYB -1.230e+00 4.042e-02 -30.426 < 2e-16 ***
## EYB 1.071e+00 8.181e-02 13.093 < 2e-16 ***
## QUALIFIEDU -7.852e+01 1.871e+00 -41.969 < 2e-16 ***
## GBA 7.161e-02 1.909e-03 37.507 < 2e-16 ***
## STRUCTExpensive 3.286e+01 1.728e+00 19.018 < 2e-16 ***
## CNDTNGood 3.386e+01 1.879e+00 18.019 < 2e-16 ***
## ROOFExpensive 9.258e+00 2.621e+00 3.532 0.000415 ***
## FIREPLACES 1.495e+01 1.203e+00 12.430 < 2e-16 ***
## WARDWard 2 1.013e+02 4.442e+00 22.803 < 2e-16 ***
## WARDWard 3 6.396e+01 3.665e+00 17.448 < 2e-16 ***
## WARDWard 4 -5.441e+01 3.059e+00 -17.785 < 2e-16 ***
## WARDWard 5 -7.819e+01 3.046e+00 -25.673 < 2e-16 ***
## WARDWard 6 4.768e+00 2.999e+00 1.590 0.111878
## WARDWard 7 -2.094e+02 3.354e+00 -62.446 < 2e-16 ***
## WARDWard 8 -2.353e+02 3.571e+00 -65.876 < 2e-16 ***
## GRADE_MEDIAN_PRICE 2.431e-01 7.523e-03 32.319 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.8 on 10664 degrees of freedom
## Multiple R-squared: 0.8909, Adjusted R-squared: 0.8907
## F-statistic: 4356 on 20 and 10664 DF, p-value: < 2.2e-16
grid.arrange(resid.plot(house.lm) + ggtitle("Original Residuals"),
resid.plot(house.infl.lm) + ggtitle("After Removing Influential Points"),
ncol = 2)
On the left is the residual distribution before eliminating the influential points, and on the right after. With the influential points removed, the assumption of normal residuals is better met, nearly matching the normal distribution in gray.
Although there are many other measurements for selecting features, two representative ones are used in this project: Adjusted R-squared to measure overall fit, and AIC to judge whether adding another feature helps the model.
Unlike R-squared, Adjusted R-squared takes the number of parameters into consideration. That is, adding a new variable can make the score worse if the variable does not help enough in explaining the errors.
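For reference, with n observations and p predictors: \[Adjusted\ R^2=1-(1-R^2)\ \frac{n-1}{n-p-1}\]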
# full linear model
rsq.lm <- lm(PRICE_TRANS ~ ., data = house.price)
# convert it to matrix
x <- model.matrix(rsq.lm)[, -1]
y <- house.price$PRICE_TRANS
# 5 best models by adjusted r-squared
adjr <- leaps(x, y, method = "adjr2")
maxadjr(adjr, 5)
## 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
## 0.891
## 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20
## 0.891
## 1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20
## 0.891
## 1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,18,19,20
## 0.891
## 1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
## 0.891
The result lists the predictor numbers and the adjusted R-squared for each of the five best models. While all the models perform the same (0.89), they have different sets of predictors. Since a simpler model is better for the same performance, the 4th should be chosen, as it has the fewest features.
print(colnames(x[, c(11, 17)]))
## [1] "ROOFExpensive" "WARDWard 6"
x <- x[, -c(11, 17)]
The two features the 4th model excludes happen to be the ones that already had significance issues. Now that it is clear they do not need to be in the model, “ROOF” and “Ward 6” can be let go. Nevertheless, it is safer to look at feature selection from a different angle as well.
Basically, AIC is designed to decide whether to include an additional variable. The lower the score, the better the model. The process iterates over features, calculates AIC, and continues until AIC stops decreasing. The following starts with a null model and adds features one by one to see how AIC changes.
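For reference, with k estimated parameters and maximized likelihood: \[AIC=2k-2\ln(\hat{L})\] For a linear model this is equivalent, up to an additive constant, to \[AIC=n\ln(RSS/n)+2k\] which is the quantity step() reports below.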
# null model for variable addition
null.lm <- lm(PRICE_TRANS ~ 1, data = house.price)
full.lm <- lm(PRICE_TRANS ~ ., data = house.price)
# stepwise selection
step(null.lm, scope = list(lower = null.lm, upper = full.lm),
direction = "both")
## Start: AIC=113448.5
## PRICE_TRANS ~ 1
##
## Df Sum of Sq RSS AIC
## + WARD 7 280357353 155994622 102472
## + GRADE_MEDIAN_PRICE 1 209347790 227004185 106468
## + GBA 1 127386145 308965830 109762
## + FIREPLACES 1 121626952 314725023 109959
## + BATHRM 1 98008564 338343411 110732
## + BEDRM 1 58669252 377682723 111908
## + QUALIFIED 1 47450012 388901963 112220
## + STRUCT 1 43602660 392749315 112326
## + CNDTN 1 40578184 395773791 112408
## + HF_BATHRM 1 33346273 403005702 112601
## + AYB 1 33215069 403136906 112605
## + ROOF 1 27149644 409202331 112764
## + AC 1 25692455 410659520 112802
## + EYB 1 18547356 417804619 112986
## <none> 436351975 113449
##
## Step: AIC=102471.6
## PRICE_TRANS ~ WARD
##
## Df Sum of Sq RSS AIC
## + GBA 1 53904968 102089653 97944
## + BATHRM 1 49887222 106107400 98356
## + GRADE_MEDIAN_PRICE 1 44064741 111929881 98927
## + BEDRM 1 33110405 122884217 99924
## + CNDTN 1 26333572 129661050 100498
## + EYB 1 23263095 132731526 100748
## + FIREPLACES 1 20577094 135417527 100962
## + QUALIFIED 1 20035415 135959207 101005
## + AC 1 17124704 138869918 101231
## + HF_BATHRM 1 12797291 143197331 101559
## + STRUCT 1 6563659 149430963 102014
## + ROOF 1 3748826 152245796 102214
## + AYB 1 401705 155592917 102446
## <none> 155994622 102472
## - WARD 7 280357353 436351975 113449
##
## Step: AIC=97943.52
## PRICE_TRANS ~ WARD + GBA
##
## Df Sum of Sq RSS AIC
## + CNDTN 1 21986830 80102823 95354
## + QUALIFIED 1 19369301 82720353 95698
## + GRADE_MEDIAN_PRICE 1 15902380 86187274 96136
## + AC 1 15530322 86559332 96182
## + BATHRM 1 10894819 91194834 96740
## + EYB 1 9993302 92096351 96845
## + HF_BATHRM 1 6097378 95992275 97288
## + STRUCT 1 6087446 96002208 97289
## + FIREPLACES 1 5987187 96102466 97300
## + BEDRM 1 2149040 99940613 97718
## + ROOF 1 999819 101089835 97840
## + AYB 1 98023 101991630 97935
## <none> 102089653 97944
## - GBA 1 53904968 155994622 102472
## - WARD 7 206876176 308965830 109762
##
## Step: AIC=95353.98
## PRICE_TRANS ~ WARD + GBA + CNDTN
##
## Df Sum of Sq RSS AIC
## + GRADE_MEDIAN_PRICE 1 10702257 69400566 93824
## + QUALIFIED 1 9857858 70244966 93953
## + STRUCT 1 4421828 75680995 94749
## + FIREPLACES 1 4035908 76066915 94804
## + BATHRM 1 2843797 77259026 94970
## + AC 1 2461461 77641362 95022
## + HF_BATHRM 1 1904389 78198434 95099
## + AYB 1 1355193 78747630 95174
## + BEDRM 1 1087531 79015292 95210
## + EYB 1 903071 79199752 95235
## + ROOF 1 703228 79399595 95262
## <none> 80102823 95354
## - CNDTN 1 21986830 102089653 97944
## - GBA 1 49558227 129661050 100498
## - WARD 7 198756939 278859762 108668
##
## Step: AIC=93823.58
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE
##
## Df Sum of Sq RSS AIC
## + QUALIFIED 1 9153173 60247393 92314
## + AYB 1 3359196 66041370 93295
## + BATHRM 1 2859494 66541072 93376
## + STRUCT 1 2184079 67216486 93484
## + AC 1 1423247 67977319 93604
## + FIREPLACES 1 1347815 68052751 93616
## + BEDRM 1 1272061 68128505 93628
## + HF_BATHRM 1 855942 68544624 93693
## + ROOF 1 96729 69303837 93811
## + EYB 1 76573 69323993 93814
## <none> 69400566 93824
## - GRADE_MEDIAN_PRICE 1 10702257 80102823 95354
## - CNDTN 1 16786708 86187274 96136
## - GBA 1 26158907 95559473 97239
## - WARD 7 118871084 188271650 104473
##
## Step: AIC=92314.34
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED
##
## Df Sum of Sq RSS AIC
## + AYB 1 3896218 56351175 91602
## + STRUCT 1 2307855 57939537 91899
## + BATHRM 1 2284329 57963063 91903
## + FIREPLACES 1 1326205 58921188 92079
## + BEDRM 1 1178410 59068983 92105
## + AC 1 694024 59553368 92193
## + HF_BATHRM 1 562033 59685359 92216
## + EYB 1 331220 59916172 92257
## + ROOF 1 139550 60107842 92292
## <none> 60247393 92314
## - QUALIFIED 1 9153173 69400566 93824
## - CNDTN 1 9229694 69477087 93835
## - GRADE_MEDIAN_PRICE 1 9997573 70244966 93953
## - GBA 1 26838667 87086060 96249
## - WARD 7 111273121 171520513 103479
##
## Step: AIC=91601.98
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB
##
## Df Sum of Sq RSS AIC
## + BATHRM 1 2576776 53774399 91104
## + STRUCT 1 2237156 54114019 91171
## + EYB 1 2139600 54211574 91190
## + HF_BATHRM 1 1203142 55148032 91373
## + AC 1 1083758 55267417 91396
## + FIREPLACES 1 923559 55427616 91427
## + BEDRM 1 742529 55608646 91462
## + ROOF 1 81048 56270127 91589
## <none> 56351175 91602
## - AYB 1 3896218 60247393 92314
## - QUALIFIED 1 9690195 66041370 93295
## - CNDTN 1 10671769 67022944 93453
## - GRADE_MEDIAN_PRICE 1 12127684 68478859 93683
## - GBA 1 27495744 83846919 95846
## - WARD 7 65460729 121811904 99825
##
## Step: AIC=91103.86
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM
##
## Df Sum of Sq RSS AIC
## + STRUCT 1 2554350 51220049 90586
## + HF_BATHRM 1 2148606 51625793 90670
## + EYB 1 1771359 52003039 90748
## + FIREPLACES 1 1036545 52737854 90898
## + AC 1 720785 53053614 90962
## + BEDRM 1 118109 53656289 91082
## + ROOF 1 94257 53680141 91087
## <none> 53774399 91104
## - BATHRM 1 2576776 56351175 91602
## - AYB 1 4188664 57963063 91903
## - CNDTN 1 6513401 60287800 92324
## - QUALIFIED 1 9086078 62860477 92770
## - GBA 1 10944722 64719121 93081
## - GRADE_MEDIAN_PRICE 1 12265770 66040169 93297
## - WARD 7 64864224 118638623 99545
##
## Step: AIC=90585.86
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT
##
## Df Sum of Sq RSS AIC
## + HF_BATHRM 1 1841526 49378523 90197
## + EYB 1 1190449 50029600 90337
## + FIREPLACES 1 820460 50399588 90415
## + AC 1 637281 50582768 90454
## + BEDRM 1 104029 51116020 90566
## + ROOF 1 65605 51154444 90574
## <none> 51220049 90586
## - STRUCT 1 2554350 53774399 91104
## - BATHRM 1 2893970 54114019 91171
## - AYB 1 4129337 55349385 91412
## - CNDTN 1 5945282 57165331 91757
## - QUALIFIED 1 9176424 60396473 92345
## - GRADE_MEDIAN_PRICE 1 9701864 60921913 92437
## - GBA 1 11198634 62418683 92697
## - WARD 7 60370243 111590292 98892
##
## Step: AIC=90196.62
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM
##
## Df Sum of Sq RSS AIC
## + EYB 1 717029 48661493 90042
## + FIREPLACES 1 683499 48695024 90050
## + AC 1 299398 49079124 90134
## + BEDRM 1 72069 49306454 90183
## + ROOF 1 53412 49325111 90187
## <none> 49378523 90197
## - HF_BATHRM 1 1841526 51220049 90586
## - STRUCT 1 2247270 51625793 90670
## - BATHRM 1 3778421 53156944 90982
## - CNDTN 1 4241031 53619554 91075
## - AYB 1 5040915 54419438 91233
## - QUALIFIED 1 8605793 57984316 91911
## - GRADE_MEDIAN_PRICE 1 8865587 58244110 91959
## - GBA 1 9220270 58598793 92024
## - WARD 7 58144830 107523353 98498
##
## Step: AIC=90042.33
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB
##
## Df Sum of Sq RSS AIC
## + FIREPLACES 1 786479 47875014 89870
## + AC 1 170258 48491235 90007
## + ROOF 1 127923 48533571 90016
## + BEDRM 1 90893 48570600 90024
## <none> 48661493 90042
## - EYB 1 717029 49378523 90197
## - HF_BATHRM 1 1368107 50029600 90337
## - STRUCT 1 1842007 50503501 90437
## - CNDTN 1 2149258 50810751 90502
## - BATHRM 1 3282557 51944050 90738
## - AYB 1 4403962 53065456 90966
## - GRADE_MEDIAN_PRICE 1 6568573 55230066 91393
## - QUALIFIED 1 8130725 56792219 91691
## - GBA 1 9249969 57911463 91900
## - WARD 7 57338791 106000284 98347
##
## Step: AIC=89870.22
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES
##
## Df Sum of Sq RSS AIC
## + BEDRM 1 123606 47751408 89845
## + AC 1 100503 47774511 89850
## + ROOF 1 62403 47812611 89858
## <none> 47875014 89870
## - FIREPLACES 1 786479 48661493 90042
## - EYB 1 820009 48695024 90050
## - HF_BATHRM 1 1217304 49092318 90137
## - STRUCT 1 1650818 49525832 90230
## - CNDTN 1 1946932 49821946 90294
## - BATHRM 1 3309849 51184863 90583
## - AYB 1 4340168 52215182 90795
## - GRADE_MEDIAN_PRICE 1 4962468 52837482 90922
## - QUALIFIED 1 8067402 55942416 91532
## - GBA 1 8213569 56088583 91560
## - WARD 7 55292893 103167907 98060
##
## Step: AIC=89844.6
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES + BEDRM
##
## Df Sum of Sq RSS AIC
## + AC 1 106444 47644964 89823
## + ROOF 1 62448 47688960 89833
## <none> 47751408 89845
## - BEDRM 1 123606 47875014 89870
## - FIREPLACES 1 819192 48570600 90024
## - EYB 1 845807 48597215 90030
## - HF_BATHRM 1 1175261 48926669 90102
## - STRUCT 1 1631824 49383232 90202
## - CNDTN 1 1948501 49699909 90270
## - BATHRM 1 2539171 50290579 90396
## - AYB 1 4225864 51977272 90749
## - GRADE_MEDIAN_PRICE 1 4926213 52677621 90892
## - GBA 1 6216626 53968034 91150
## - QUALIFIED 1 8064515 55815923 91510
## - WARD 7 55288477 103039885 98049
##
## Step: AIC=89822.76
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES + BEDRM +
## AC
##
## Df Sum of Sq RSS AIC
## + ROOF 1 55664 47589300 89812
## <none> 47644964 89823
## - AC 1 106444 47751408 89845
## - BEDRM 1 129547 47774511 89850
## - EYB 1 721979 48366943 89981
## - FIREPLACES 1 747167 48392131 89987
## - HF_BATHRM 1 1041156 48686120 90052
## - CNDTN 1 1483245 49128209 90148
## - STRUCT 1 1643256 49288220 90183
## - BATHRM 1 2338711 49983675 90333
## - AYB 1 4079026 51723990 90698
## - GRADE_MEDIAN_PRICE 1 4945920 52590884 90876
## - GBA 1 6321029 53965993 91152
## - QUALIFIED 1 7846204 55491168 91450
## - WARD 7 55385669 103030633 98050
##
## Step: AIC=89812.27
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES + BEDRM +
## AC + ROOF
##
## Df Sum of Sq RSS AIC
## <none> 47589300 89812
## - ROOF 1 55664 47644964 89823
## - AC 1 99660 47688960 89833
## - BEDRM 1 129399 47718698 89839
## - FIREPLACES 1 689541 48278841 89964
## - EYB 1 765049 48354348 89981
## - HF_BATHRM 1 1025557 48614857 90038
## - CNDTN 1 1448874 49038174 90131
## - STRUCT 1 1613975 49203275 90167
## - BATHRM 1 2334329 49923629 90322
## - AYB 1 4131116 51720416 90700
## - GRADE_MEDIAN_PRICE 1 4661249 52250548 90809
## - GBA 1 6277879 53867179 91134
## - QUALIFIED 1 7860573 55449872 91444
## - WARD 7 55284755 102874055 98035
##
## Call:
## lm(formula = PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE +
## QUALIFIED + AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES +
## BEDRM + AC + ROOF, data = house.price)
##
## Coefficients:
## (Intercept) WARDWard 2 WARDWard 3
## 698.16080 101.28301 63.95630
## WARDWard 4 WARDWard 5 WARDWard 6
## -54.41096 -78.18838 4.76844
## WARDWard 7 WARDWard 8 GBA
## -209.41612 -235.26479 0.07161
## CNDTNGood GRADE_MEDIAN_PRICE QUALIFIEDU
## 33.86244 0.24312 -78.51845
## AYB BATHRM STRUCTExpensive
## -1.22984 24.77822 32.85632
## HF_BATHRM EYB FIREPLACES
## 19.91879 1.07114 14.95291
## BEDRM ACY ROOFExpensive
## 4.99228 9.16163 9.25816
As the result shows, the full model has the lowest AIC, meaning that keeping all the candidate features is best by this criterion. However, the coefficients for "ROOF" and for "Ward 6" (relative to the baseline ward) are negligible, so they can be removed at the cost of only a small increase in AIC. With that adjustment, the feature selection agrees with both the Adjusted R-squared and the AIC results.
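As a sanity check, the model can be refit without "ROOF" and its AIC compared with the stepwise trace. A minimal sketch, where step.model is a placeholder name for the stepAIC result above:
# a minimal sketch; step.model is a placeholder for the stepAIC result object
reduced.model <- update(step.model, . ~ . - ROOF) # refit without ROOF
extractAIC(reduced.model) # on the same scale as the AIC values printed by stepAIC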
A predictive model can serve two purposes: to predict the price of a house, and to check whether a house is valued properly. The principal steps are to split the data set, train a model on one part, and test it against the rest. Since any single training/testing split could be biased, the partitioning should be repeated across the data set (cross-validation).
# data frame for prediction: x holds the selected features, y the transformed price
price.df <- data.frame(x, y)
# hold out 30% of the rows for validation
train.indice <- createDataPartition(price.df$y, p = 0.7, list = FALSE)
training.set <- price.df[train.indice, ]
testing.set <- price.df[-train.indice, ]
# 5-fold cross-validation repeated 3 times
train.control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# fit on the training set only so the testing set remains unseen
lm.model <- train(y ~ ., data = training.set, trControl = train.control, method = "lm")
summary(lm.model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466.25 -43.10 -1.17 40.51 543.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 763.86590 109.74435 6.960 3.59e-12 ***
## BATHRM 24.81960 1.08399 22.897 < 2e-16 ***
## HF_BATHRM 20.09576 1.31399 15.294 < 2e-16 ***
## ACY 9.52283 1.93769 4.915 9.03e-07 ***
## BEDRM 4.84664 0.92349 5.248 1.57e-07 ***
## AYB -1.21242 0.04018 -30.172 < 2e-16 ***
## EYB 1.02048 0.08083 12.625 < 2e-16 ***
## QUALIFIEDU -78.50728 1.87148 -41.949 < 2e-16 ***
## GBA 0.07181 0.00191 37.604 < 2e-16 ***
## STRUCTExpensive 33.16673 1.72692 19.206 < 2e-16 ***
## CNDTNGood 34.35376 1.87601 18.312 < 2e-16 ***
## FIREPLACES 15.56695 1.19340 13.044 < 2e-16 ***
## WARDWard.2 97.12006 3.95214 24.574 < 2e-16 ***
## WARDWard.3 61.87410 3.00451 20.594 < 2e-16 ***
## WARDWard.4 -56.78342 2.21129 -25.679 < 2e-16 ***
## WARDWard.5 -81.42730 2.20527 -36.924 < 2e-16 ***
## WARDWard.7 -212.63356 2.56912 -82.765 < 2e-16 ***
## WARDWard.8 -238.45803 2.87632 -82.904 < 2e-16 ***
## GRADE_MEDIAN_PRICE 0.24693 0.00743 33.234 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.84 on 10666 degrees of freedom
## Multiple R-squared: 0.8908, Adjusted R-squared: 0.8906
## F-statistic: 4833 on 18 and 10666 DF, p-value: < 2.2e-16
The result is not very different from the initial modeling, suggesting that the prediction model performs consistently. All the variables are now significant, and the Adjusted R-squared is only marginally lower than the R-squared. Taken together, there is no remaining feature to eliminate. The last step is to measure how well the model predicts.
Root Mean Square Error (RMSE) is widely used to evaluate a regression model. It is the square root of the mean of the squared differences between the true and predicted values, so the sign of each error is ignored while large misses are penalized more. Note that the RMSE of about 65.15 below is in units of the square-root-transformed price, not in dollars.
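In symbols, for $n$ test observations with true values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$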
# predict transformed prices for the held-out testing set
price_pred <- predict(lm.model, newdata = testing.set)
# caret's RMSE() takes predictions first, then observed values
rmse <- RMSE(price_pred, testing.set$y)
cat("Root Mean Square Error (RMSE):", rmse)
## Root Mean Square Error (RMSE): 65.15407
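Because the model works on the square-root scale, a fixed error there corresponds to a dollar error that grows with the price: if $P = S^2$, then to first order $\Delta P \approx 2\sqrt{P}\,\Delta S$. A minimal sketch of this delta-method approximation at a few illustrative price points, using the RMSE above:
# a minimal sketch: first-order dollar error implied by a sqrt-scale error of ~65
prices <- c(5e5, 1e6, 2e6)
round(2 * sqrt(prices) * 65.15) # approximate dollar error at each price point
Since RMSE is inflated by a few large misses, the typical (median) dollar error computed below is considerably smaller than this approximation suggests.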
# back-transform from the square-root scale to dollars
True.Price <- (testing.set$y)^2
Predicted.Price <- price_pred^2
pred.df <- data.frame(True.Price, Predicted.Price)
# prediction error in dollars
pred.df$Diff <- pred.df$True.Price - pred.df$Predicted.Price
kable(head(pred.df, n = 20)) %>%
kable_styling() %>%
scroll_box(width = "500px", height = "500px")
| | True.Price | Predicted.Price | Diff |
|---:|---:|---:|---:|
| 44 | 1700000 | 1658674.3 | 41325.71 |
| 161 | 830000 | 886218.6 | -56218.57 |
| 162 | 635000 | 572257.5 | 62742.50 |
| 276 | 899000 | 750211.1 | 148788.89 |
| 341 | 928000 | 689197.8 | 238802.15 |
| 370 | 1250000 | 1132902.5 | 117097.55 |
| 533 | 1120000 | 1138339.3 | -18339.27 |
| 560 | 805000 | 942926.6 | -137926.60 |
| 623 | 1440000 | 1348719.1 | 91280.93 |
| 648 | 1412500 | 1434301.9 | -21801.92 |
| 842 | 932500 | 908637.5 | 23862.54 |
| 891 | 1041000 | 906915.5 | 134084.52 |
| 916 | 1450000 | 1365898.3 | 84101.65 |
| 921 | 635000 | 846595.9 | -211595.91 |
| 964 | 801600 | 765697.9 | 35902.05 |
| 978 | 1120160 | 1296661.6 | -176501.58 |
| 1001 | 1200000 | 1057110.6 | 142889.37 |
| 1057 | 915000 | 786833.2 | 128166.81 |
| 1071 | 822500 | 805455.1 | 17044.86 |
| 1075 | 875000 | 786492.7 | 88507.29 |
cat("True Price vs. Predicted Price\n",
"Mean Error: $", mean(abs(pred.df$Diff)), '\n',
"Median Error: $", median(abs(pred.df$Diff)))
## True Price vs. Predicted Price
## Mean Error: $ 80025.89
## Median Error: $ 61852.59
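To make the median error concrete, a minimal sketch using pred.df from above computes the implied dollar range around a hypothetical $1,000,000 house:
# a minimal sketch: implied prediction range for a hypothetical $1,000,000 house
med.err <- median(abs(pred.df$Diff)) # median absolute error in dollars
c(low = 1e6 - med.err, high = 1e6 + med.err)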
As shown above, the median absolute error in dollars is about $61,853, which implies that the model typically misses the price of a given house by roughly that amount. For example, a $1,000,000 house would usually be predicted between about $938,000 and $1,062,000. The following residual plot is expressed in dollar amounts. There is no apparent pattern among the residuals; the left side looks darker simply because more observations fall in that price range.
# residuals in dollars vs. predicted price; dashed lines mark +/- $250K
ggplot(pred.df, aes(Predicted.Price, Diff)) +
  geom_point(alpha = 0.25) +
  geom_hline(yintercept = 0, color = 'red') +
  geom_hline(yintercept = c(-250000, 250000), color = 'blue', linetype = "dashed") +
  ggtitle("Residual Plot")
# predicted prices for houses whose true price is near the mean and the median
mean_price <- pred.df[pred.df$True.Price > 650000 &
                        pred.df$True.Price < 660000, "Predicted.Price"]
median_price <- pred.df[pred.df$True.Price > 615000 &
                          pred.df$True.Price < 625000, "Predicted.Price"]
cat("Mean and Median Price Prediction\n",
    "Mean (650K~660K): $", mean(mean_price), '\n',
    "Median (615K~625K): $", mean(median_price))
## Mean and Median Price Prediction
## Mean (650K~660K): $ 661071.5
## Median (615K~625K): $ 610062.6
Prediction at specific values helps to get a feel for how this model works. The mean and median house prices are $660,583 and $619,900, respectively, and the predictions above for houses in those ranges average $661,072 and $610,063. The gap between the true statistics and the predictions is therefore under $10K, which is much lower than the overall error, and more meaningful because most of the observations, and likewise most new data, lie around the mean and median.
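In the same spirit, a single house can be scored directly. A minimal sketch that reuses an observed row as a template so the factor levels match the training data; new.house and the overridden field values are purely illustrative:
# a minimal sketch: predict one hypothetical house (values are illustrative)
new.house <- testing.set[1, ] # reuse a real row so factor levels are valid
new.house$BATHRM <- 2 # override a couple of fields of interest
new.house$GBA <- 1500 # gross building area in square feet
(predict(lm.model, newdata = new.house))^2 # back-transform to dollars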
# overlay histograms of true and predicted prices in $20K bins
ggplot(pred.df) +
  geom_histogram(aes(True.Price, color = "True"),
                 binwidth = 20000, fill = "white", alpha = 0.25) +
  geom_histogram(aes(Predicted.Price, color = "Predicted"),
                 binwidth = 20000, fill = "white", alpha = 0.25) +
  ggtitle("True Price vs. Predicted Price")
Finally, histograms of the true and predicted values can be drawn together to see how similar the two distributions are across the price range. During the analysis we found that the distributions had issues in the tails, and the same appears here: most of the large errors occur in the tails. However, this should not be a serious problem because the model focuses on the majority of houses rather than the extremely expensive or cheap ones; the sketch below offers one way to verify this numerically.
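A minimal sketch comparing the mean absolute error in the tails (bottom and top deciles of the true price) against the middle of the distribution:
# a minimal sketch: mean absolute error in the tails vs. the middle
cut.points <- quantile(pred.df$True.Price, c(0.1, 0.9))
in.tail <- pred.df$True.Price < cut.points[1] | pred.df$True.Price > cut.points[2]
cat("Tail error: $", mean(abs(pred.df$Diff[in.tail])), '\n',
    "Middle error: $", mean(abs(pred.df$Diff[!in.tail])))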
Beginning with about 50 available variables, we successfully narrowed the set down to the 13 features used in the final prediction model. Adjusted R-squared and AIC were the criteria for feature selection. Some of the findings are as follows.
House prices rise as the number of rooms and the overall size increase. Location factors such as ward, along with house condition, affect prices critically, while building materials matter much less. The following diagnostic plots are what we used to check the model assumptions.
After the diagnosis, DFBETAS was used to filter out influential points, and removing them improved the linear model's conformance to the assumptions. Through all these processes, we could finally build a predictive model. The data were split into a training set and a testing set so that the model was fitted only on the training set and then evaluated against the held-out testing set. By doing so, the model was tested on data it had not seen during training.
The performance of the model was measured at an R-squared of 0.89. More concretely, the overall median error was about $61,853, while for average houses priced near the mean and median the predictions were within roughly $10,000 of the true values. Considering how expensive houses in Washington D.C. are, an error of that size is quite impressive, even if there could be further improvement with more variables added.