House prices are considered difficult to predict because there are so many variables in the market. However, the price can be estimated from a series of major factors such as size, area, and quality. Although a price is really the product of numerous interacting factors, these quantitative and qualitative variables are useful to narrow the price down to a range, which is incomparably better than just a mean or median. In the course of this project, we first analyze the observations to gain insight for modeling and to filter problematic data such as influential points. Once a linear model is fitted, we evaluate it with techniques like residual analysis and possibly remove insignificant features. Lastly, we build a predictive model and measure its performance with RMSE.
- Preparation
  - Dataset
  - Rows of Interest
  - Dependent Variable
  - Outliers
  - Drop Variables
- Data Analysis
  - Correlation Matrix
  - Transformation
  - Box Plot
  - Facet Plot
- Diagnostic Analysis
  - Plots and Influential Points
  - Remove Influential Points
- Best Model
  - Adjusted R-squared
  - AIC
- Prediction
  - Root Mean Square Error
- Conclusion
- Reference
The data set used for this project is D.C. Residential Properties, which is refined from the original source on D.C. Open Data; details of the columns can be found on the Metadata page. Before analyzing the data, each column needs to be examined to decide whether to include it in a model, and to find outliers that could have a negative influence on modeling. This process starts with loading the required libraries.
library(dplyr) # df manipulation
library(ggplot2) # graphical plot
library(GGally) # ggplot extension
library(grid) # grid plot
library(gridExtra) # grid extension
library(knitr) # pretty print
library(kableExtra) # print styling
library(stringr) # handle strings
library(ggmap) # geographical map
library(corrplot) # correlation matrix
library(car) # VIF
library(caret) # data set split
library(lmtest) # non-constant variance test
library(MASS) # AIC
library(leaps) # model selection
library(faraway) # max adjusted r
library(DAAG) # cross-validation
# keep dplyr::select (MASS, loaded above, masks it)
select <- dplyr::select
The next step is reading the data set and taking a look at its structure. This step focuses on getting an overall idea of what the columns are. Columns with redundant information, ID columns, or empty columns can be found and dropped. Data types are also checked, since some columns could be converted to a more appropriate type.
# read the dataset
house.price <- read.csv('DC_Properties.csv', na.strings = c("", "NA"))
# show column structures
str(house.price)
## 'data.frame': 158957 obs. of 49 variables:
## $ X.1 : int 0 1 2 3 4 5 6 7 8 9 ...
## $ BATHRM : int 4 3 3 3 2 3 1 3 3 1 ...
## $ HF_BATHRM : int 0 1 1 1 1 2 0 1 1 1 ...
## $ HEAT : Factor w/ 14 levels "Air-Oil","Air Exchng",..: 13 13 8 8 13 8 13 8 13 8 ...
## $ AC : Factor w/ 3 levels "0","N","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ NUM_UNITS : num 2 2 2 2 1 1 2 2 2 1 ...
## $ ROOMS : int 8 11 9 8 11 10 5 8 7 6 ...
## $ BEDRM : int 4 5 5 5 3 5 2 4 3 2 ...
## $ AYB : num 1910 1898 1910 1900 1913 ...
## $ YR_RMDL : num 1988 2007 2009 2003 2012 ...
## $ EYB : int 1972 1972 1984 1984 1985 1972 1957 1972 1967 1950 ...
## $ STORIES : num 3 3 3 3 3 4 2 3 2 2 ...
## $ SALEDATE : Factor w/ 6937 levels "1947-05-14 00:00:00",..: 3257 2422 6427 3926 NA 4847 5141 5249 6889 4539 ...
## $ PRICE : num 1095000 NA 2100000 1602000 NA ...
## $ QUALIFIED : Factor w/ 2 levels "Q","U": 1 2 1 1 2 1 2 1 1 2 ...
## $ SALE_NUM : int 1 1 3 1 1 1 1 1 4 1 ...
## $ GBA : num 2522 2567 2522 2484 5255 ...
## $ BLDG_NUM : int 1 1 1 1 1 1 1 1 1 1 ...
## $ STYLE : Factor w/ 18 levels "1 Story","1.5 Story Fin",..: 7 7 7 7 7 10 4 7 4 4 ...
## $ STRUCT : Factor w/ 9 levels "Default","Multi",..: 4 4 4 4 5 4 4 4 4 4 ...
## $ GRADE : Factor w/ 13 levels "Above Average",..: 13 13 13 13 13 13 1 13 1 9 ...
## $ CNDTN : Factor w/ 7 levels "Average","Default",..: 5 5 7 5 5 5 1 1 7 1 ...
## $ EXTWALL : Factor w/ 25 levels "Adobe","Aluminum",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ ROOF : Factor w/ 16 levels "Built Up","Clay Tile",..: 9 1 1 1 10 1 9 9 1 1 ...
## $ INTWALL : Factor w/ 12 levels "Carpet","Ceramic Tile",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ KITCHENS : num 2 2 2 2 1 1 2 2 2 1 ...
## $ FIREPLACES : int 5 4 4 3 0 4 0 1 1 0 ...
## $ USECODE : int 24 24 24 24 13 11 24 24 24 11 ...
## $ LANDAREA : int 1680 1680 1680 1680 2032 2196 1261 1627 1424 1424 ...
## $ GIS_LAST_MOD_DTTM : Factor w/ 2 levels "2018-07-22 18:01:38",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ SOURCE : Factor w/ 2 levels "Condominium",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ CMPLX_NUM : num NA NA NA NA NA NA NA NA NA NA ...
## $ LIVING_GBA : num NA NA NA NA NA NA NA NA NA NA ...
## $ FULLADDRESS : Factor w/ 105978 levels "1 16TH STREET NE",..: 26459 26410 26354 26291 27230 24511 26809 26392 25601 25911 ...
## $ CITY : Factor w/ 1 level "WASHINGTON": 1 1 1 1 1 1 1 1 1 1 ...
## $ STATE : Factor w/ 1 level "DC": 1 1 1 1 1 1 1 1 1 1 ...
## $ ZIPCODE : num 20009 20009 20009 20009 20009 ...
## $ NATIONALGRID : Factor w/ 105949 levels "18S UH 25216 99697",..: 24683 24713 24755 24776 25561 25343 24575 25100 25215 25118 ...
## $ LATITUDE : num 38.9 38.9 38.9 38.9 38.9 ...
## $ LONGITUDE : num -77 -77 -77 -77 -77 ...
## $ ASSESSMENT_NBHD : Factor w/ 57 levels "16th Street Heights",..: 43 43 43 43 43 43 43 43 43 43 ...
## $ ASSESSMENT_SUBNBHD: Factor w/ 121 levels "001 A American University",..: 93 93 93 93 93 93 93 93 93 93 ...
## $ CENSUS_TRACT : num 4201 4201 4201 4201 4201 ...
## $ CENSUS_BLOCK : Factor w/ 3848 levels "000100 1000",..: 1676 1676 1676 1676 1676 1676 1675 1675 1675 1675 ...
## $ WARD : Factor w/ 8 levels "Ward 1","Ward 2",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ SQUARE : Factor w/ 3292 levels "0004","0005",..: 58 58 58 58 58 58 58 58 58 58 ...
## $ X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ QUADRANT : Factor w/ 4 levels "NE","NW","SE",..: 2 2 2 2 2 2 2 2 2 2 ...
# drop ID, single value, redundant columns
house.price <- select(house.price, -c("X.1", "CITY", "STATE", "X", "Y"))
# ZIPCODE, USECODE are more of categorical data
house.price$ZIPCODE <- as.factor(house.price$ZIPCODE)
house.price$USECODE <- as.factor(house.price$USECODE)
house.price$CENSUS_TRACT <- as.factor(house.price$CENSUS_TRACT)
# show first a few rows
kable(head(house.price, n = 10)) %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
BATHRM | HF_BATHRM | HEAT | AC | NUM_UNITS | ROOMS | BEDRM | AYB | YR_RMDL | EYB | STORIES | SALEDATE | PRICE | QUALIFIED | SALE_NUM | GBA | BLDG_NUM | STYLE | STRUCT | GRADE | CNDTN | EXTWALL | ROOF | INTWALL | KITCHENS | FIREPLACES | USECODE | LANDAREA | GIS_LAST_MOD_DTTM | SOURCE | CMPLX_NUM | LIVING_GBA | FULLADDRESS | ZIPCODE | NATIONALGRID | LATITUDE | LONGITUDE | ASSESSMENT_NBHD | ASSESSMENT_SUBNBHD | CENSUS_TRACT | CENSUS_BLOCK | WARD | SQUARE | QUADRANT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | Warm Cool | Y | 2 | 8 | 4 | 1910 | 1988 | 1972 | 3 | 2003-11-25 00:00:00 | 1095000 | Q | 1 | 2522 | 1 | 3 Story | Row Inside | Very Good | Good | Common Brick | Metal- Sms | Hardwood | 2 | 5 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1748 SWANN STREET NW | 20009 | 18S UJ 23061 09289 | 38.91468 | -77.04083 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 1 | Warm Cool | Y | 2 | 11 | 5 | 1898 | 2007 | 1972 | 3 | 2000-08-17 00:00:00 | NA | U | 1 | 2567 | 1 | 3 Story | Row Inside | Very Good | Good | Common Brick | Built Up | Hardwood | 2 | 4 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1746 SWANN STREET NW | 20009 | 18S UJ 23067 09289 | 38.91468 | -77.04076 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 1 | Hot Water Rad | Y | 2 | 9 | 5 | 1910 | 2009 | 1984 | 3 | 2016-06-21 00:00:00 | 2100000 | Q | 3 | 2522 | 1 | 3 Story | Row Inside | Very Good | Very Good | Common Brick | Built Up | Hardwood | 2 | 4 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1744 SWANN STREET NW | 20009 | 18S UJ 23074 09289 | 38.91468 | -77.04068 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 1 | Hot Water Rad | Y | 2 | 8 | 5 | 1900 | 2003 | 1984 | 3 | 2006-07-12 00:00:00 | 1602000 | Q | 1 | 2484 | 1 | 3 Story | Row Inside | Very Good | Good | Common Brick | Built Up | Hardwood | 2 | 3 | 24 | 1680 | 2018-07-22 18:01:43 | Residential | NA | NA | 1742 SWANN STREET NW | 20009 | 18S UJ 23078 09288 | 38.91468 | -77.04063 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
2 | 1 | Warm Cool | Y | 1 | 11 | 3 | 1913 | 2012 | 1985 | 3 | NA | NA | U | 1 | 5255 | 1 | 3 Story | Semi-Detached | Very Good | Good | Common Brick | Neopren | Hardwood | 1 | 0 | 13 | 2032 | 2018-07-22 18:01:43 | Residential | NA | NA | 1804 NEW HAMPSHIRE AVENUE NW | 20009 | 18S UJ 23188 09253 | 38.91438 | -77.03936 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
3 | 2 | Hot Water Rad | Y | 1 | 10 | 5 | 1913 | NA | 1972 | 4 | 2010-02-26 00:00:00 | 1950000 | Q | 1 | 5344 | 1 | 4 Story | Row Inside | Very Good | Good | Common Brick | Built Up | Hardwood | 1 | 4 | 11 | 2196 | 2018-07-22 18:01:43 | Residential | NA | NA | 1709 S STREET NW | 20009 | 18S UJ 23157 09248 | 38.91433 | -77.03971 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2006 | Ward 2 | 0152 | NW |
1 | 0 | Warm Cool | Y | 2 | 5 | 2 | 1917 | 1988 | 1957 | 2 | 2011-05-02 00:00:00 | NA | U | 1 | 1260 | 1 | 2 Story | Row Inside | Above Average | Average | Common Brick | Metal- Sms | Hardwood | 2 | 0 | 24 | 1261 | 2018-07-22 18:01:43 | Residential | NA | NA | 1769 SWANN STREET NW | 20009 | 18S UJ 23042 09323 | 38.91498 | -77.04105 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
3 | 1 | Hot Water Rad | Y | 2 | 8 | 4 | 1906 | 2011 | 1972 | 3 | 2011-09-29 00:00:00 | 1050000 | Q | 1 | 2401 | 1 | 3 Story | Row Inside | Very Good | Average | Common Brick | Metal- Sms | Hardwood | 2 | 1 | 24 | 1627 | 2018-07-22 18:01:43 | Residential | NA | NA | 1746 1/2 T STREET NW | 20009 | 18S UJ 23124 09368 | 38.91541 | -77.04013 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
3 | 1 | Warm Cool | Y | 2 | 7 | 3 | 1908 | 2008 | 1967 | 2 | 2018-05-03 00:00:00 | 1430000 | Q | 4 | 1488 | 1 | 2 Story | Row Inside | Above Average | Very Good | Common Brick | Built Up | Hardwood | 2 | 1 | 24 | 1424 | 2018-07-22 18:01:43 | Residential | NA | NA | 1727 SWANN STREET NW | 20009 | 18S UJ 23142 09324 | 38.91502 | -77.03990 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
1 | 1 | Hot Water Rad | Y | 1 | 6 | 2 | 1908 | 1979 | 1950 | 2 | 2008-12-05 00:00:00 | NA | U | 1 | 1590 | 1 | 2 Story | Row Inside | Good Quality | Average | Common Brick | Built Up | Hardwood | 1 | 0 | 11 | 1424 | 2018-07-22 18:01:43 | Residential | NA | NA | 1733 SWANN STREET NW | 20009 | 18S UJ 23127 09324 | 38.91501 | -77.04008 | Old City 2 | 040 D Old City 2 | 4201 | 004201 2005 | Ward 2 | 0152 | NW |
There are two types of properties in the data set, and they show such distinctive patterns that they need to be studied separately. This study focuses on the 'Residential' type, which makes up the majority of observations. One factor that cannot be controlled is that house values fluctuate over time. Prices might need to be corrected to today's values, but that would be very challenging and might only add complication. For that reason, the cases are limited to the most recent two years of data, with the goal of analyzing the latest general trend.
# residential price
print(summary(house.price[house.price$SOURCE == "Residential", 'PRICE']))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 240000 440000 579912 750000 25100000 48796
# condominium price
print(summary(house.price[house.price$SOURCE == "Condominium", 'PRICE']))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 239000 370000 1436073 535000 137427545 11945
# reduce it to 'Residential' type
house.price <- house.price[house.price$SOURCE == "Residential", ]
# cases density histogram by sale date
house.price$SALEDATE <- substr(house.price$SALEDATE, 0, 10)
house.price$SALEDATE <- as.Date(house.price$SALEDATE)
sale.year <- as.numeric(substr(house.price[, "SALEDATE"], 0, 4))
sale.year <- sale.year[!is.na(sale.year)]
hist(sale.year,
     breaks = c(-Inf, seq(1990, 2018, 1)),
     xlim = c(1990, 2018),
     main = "Sales by Year",
     xlab = "Year",
     border = 'blue', col = 'green',
     prob = T)
axis(side = 1, at = c(2015, 2018))
lines(density(sale.year), col = 'red')
There are more observations as the date approaches the present. This is good news, as the study seeks the recent trend. The figure also shows an irregular pattern: the housing market is active in one season and slows down in the next. Old data will be of little use except for finding historical patterns. The data from 2016 onward still covers 20% of all observations.
quantile(sale.year, probs = seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1947 1999 2002 2005 2007 2010 2012 2014 2016 2017 2018
# drop sales made before 2016.01.01.
house.price <- subset(house.price, SALEDATE > as.Date("2015-12-31"))
It is reasonable to investigate house prices first since they are the target to be analyzed and predicted. Looking into the dependent variable also helps with forming ideas about the independent variables, since the goal is to define their relationships. The following histogram gives a basic idea of how prices are distributed.
# drop rows with price as NA
house.price <- house.price[complete.cases(house.price[ , "PRICE"]), ]
# histogram of dependent variable "PRICE"
ggplot(house.price, aes(x = PRICE)) +
geom_histogram(binwidth = 20000,
fill = "blue",
alpha = .25) +
labs(title = "Washington D.C House Price Histogram") +
labs(x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(c(0, 2000000)) +
geom_vline(aes(xintercept = mean(PRICE), color = "mean"),
linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = median(PRICE), color = "median"),
linetype = "dashed", size = 1) +
scale_color_manual(name = "", values = c(mean = "pink", median = "green"))
From both the shape and the statistics, house price data are skewed: most houses fall within a certain range, with a few expensive ones. However, a transformation should be considered carefully, asking for more evidence than just a histogram. The statistics below confirm that prices are skewed with extremely large values, and the minimum of $10 is clearly not a valid sale price either. In addition, judging from the median price, buying a house in Washington D.C. would be very difficult.
summary(house.price$PRICE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 418000 659266 789970 940000 25100000
One of the best ways to get a sense of data is visualization. Since the data are fundamentally geographic, it is a good attempt to visualize prices on a Washington D.C. map. On the map, the darker a dot, the more expensive the house.
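The chunk below references “dc_map”, “price_group”, and “price_label”, which are created in a setup chunk not shown here. A minimal sketch of what that setup might look like (the bounding box, tile source, and bin edges are assumptions, not the original values):
# sketch of the setup objects assumed by the map chunk below
# (hypothetical bounding box, tiles, and price bins)
dc_bbox <- c(left = -77.12, bottom = 38.79, right = -76.90, top = 39.00)
dc_map <- get_stamenmap(dc_bbox, zoom = 12, maptype = "toner-lite")
price_group <- c(0, 250000, 500000, 750000, 1000000, 1500000, Inf)
price_label <- c("< 250K", "250K-500K", "500K-750K", "750K-1M", "1M-1.5M", "> 1.5M")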
ggmap(dc_map) +
geom_point(data = house.price,
aes(LONGITUDE, LATITUDE,
color = cut(PRICE, price_group)),
size = 0.3, alpha = 0.3) +
labs(title = "Washington D.C House Price Heat Map") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
scale_color_brewer(palette = "Oranges",
name = "Price Range",
labels = price_label)
The areas with no dots are mostly hills or other terrain where houses cannot be built. There are clearly some expensive areas and some cheaper ones. Washington D.C. consists of 8 wards, of which wards 2 and 3 have the darker dots while the rest are lighter. Ward 6 in the middle is ambiguous, as its left side is dark but its right side is not. Ward is actually one of the columns, so prices will be studied ward by ward.
As shown on the histogram, there is a series of very expensive houses. These would probably interfere with modeling and must be handled before moving forward. Outliers here are defined as observations more than 1.5 × IQR below the first quartile or above the third quartile. The lower cut-off, however, is negative, leaving only the expensive houses as outliers. \[Interquartile\ Range\ (IQR)=Q_3-Q_1\]
lowerq <- quantile(house.price$PRICE)[2]
upperq <- quantile(house.price$PRICE)[4]
iqr <- upperq - lowerq
lower.threshold <- lowerq - (iqr*1.5)
upper.threshold <- upperq + (iqr*1.5)
cat("Lower Outlier Threshold:", lower.threshold,
"\nUpper Outlier Threshold:", upper.threshold)
## Lower Outlier Threshold: -365000
## Upper Outlier Threshold: 1723000
house.price <- house.price[house.price$PRICE < upper.threshold, ]
As explained, some variables carry no meaningful or unique information. Even if a variable has its own information, it can still be of little use when another variable replaces it or carries better information. The data set has many geographical columns which come in different forms but share common ground. Before taking care of that, NA values should be handled.
# no longer needed
house.price <- select(house.price, -c("SOURCE", "SALEDATE"))
# count NA for each variable
na_count <- sapply(house.price, function(y) {
sum(length(which(is.na(y))))
})
# all the variables with NA's
na_count <- na_count[na_count != 0]
# variable and the number of NA's
print(na_count)
## AYB YR_RMDL STORIES
## 19 4861 19
## KITCHENS CMPLX_NUM LIVING_GBA
## 1 12041 12041
## FULLADDRESS NATIONALGRID ASSESSMENT_SUBNBHD
## 119 119 1439
## CENSUS_BLOCK QUADRANT
## 119 25
“LIVING GBA” and “CMPLX NUM” are completely empty, which makes sense as they are typical condominium columns, and condominiums were excluded at the beginning. Therefore, these two can simply be dropped. “YR RMDL” is about one third empty, presumably because not all houses have been remodeled. While dropping is not always the best option, the solution here is a type conversion into a yes/no flag for whether the house was remodeled. The rest are mostly geographical columns. While neighborhood is accepted as a good indicator of house prices, there are 57 of them, which is too many levels; as the final model will be linear, having too many categories makes it susceptible to overfitting and to losing the general trend. Most importantly, ward seems very useful, as confirmed on the map.
# drop empty variables
house.price <- select(house.price, -c("CMPLX_NUM", "LIVING_GBA"))
# transform YR_RMDL by encoding into Yes or No
house.price$RMDL <- ifelse(is.na(house.price$YR_RMDL), "N", "Y")
house.price$RMDL <- as.factor(house.price$RMDL)
# drop geographical variables except for WARD
house.price <- select(house.price, -c("ZIPCODE", "USECODE", "SQUARE", "NATIONALGRID",
"ASSESSMENT_NBHD", "ASSESSMENT_SUBNBHD",
"CENSUS_BLOCK", "CENSUS_TRACT",
"QUADRANT", "FULLADDRESS",
"LONGITUDE", "LATITUDE"))
# fine tuning
print(levels(house.price$GIS_LAST_MOD_DTTM))
## [1] "2018-07-22 18:01:38" "2018-07-22 18:01:43"
house.price <- select(house.price, -c("GIS_LAST_MOD_DTTM", "YR_RMDL"))
# drop small number of NA's (AYB, STORIES...)
house.price <- house.price[complete.cases(house.price), ]
This part concentrates on relationships among variables. Separate analysis will be conducted by data type, numerical and categorical, and then another for both of them together. For numerical types, correlation matrix will be drawn not only to find relationship with dependent variable but also to catch multicollinearity patterns. For categorical types, box plots will be used to see whether each category has an ability to divide house prices and what values are useful if so. Box plot also can help to decide whether to reduce number of values in a category.
numeric_cols <- names(select_if(house.price, is.numeric))
print(numeric_cols)
## [1] "BATHRM" "HF_BATHRM" "NUM_UNITS" "ROOMS" "BEDRM"
## [6] "AYB" "EYB" "STORIES" "PRICE" "SALE_NUM"
## [11] "GBA" "BLDG_NUM" "KITCHENS" "FIREPLACES" "LANDAREA"
corr <- cor(house.price[, numeric_cols])
corrplot.mixed(corr, number.cex = .75, tl.pos = "lt")
“SALE NUM” has no relationship with any other variable, so it can safely be dropped. “NUM UNITS” and “KITCHENS” have relationships with some variables but almost none with “PRICE”. Our target “PRICE” has a decent relationship with the rest, especially with “GBA” and the room-related variables. However, there are multicollinearity issues, such as between “ROOMS” and “BEDRM”. These should be handled first, and then another matrix should be drawn.
# drop unrelated variables
house.price <- subset(house.price, select = -c(SALE_NUM, BLDG_NUM, NUM_UNITS,
KITCHENS, LANDAREA))
# test for multicollinearity of numerical variables
VIF.model <- lm(PRICE ~ BATHRM + HF_BATHRM + ROOMS + BEDRM +
AYB + EYB + STORIES + GBA + FIREPLACES,
data = house.price)
print(vif(VIF.model))
## BATHRM HF_BATHRM ROOMS BEDRM AYB EYB
## 2.2109 1.2203 2.3428 2.1811 2.0363 2.5310
## STORIES GBA FIREPLACES
## 1.1156 2.5999 1.2261
The variance inflation factor (VIF) measures whether variables are causing multicollinearity problems. A VIF of 5 to 10 is usually taken as the limit that variables should not exceed. Since no variable goes that high, we can keep them all and draw another correlation matrix.
\[VIF_i=\frac{1}{1-R_i^2}\]
# final correlation table
numeric_cols <- names(select_if(house.price, is.numeric))
corr <- cor(house.price[, numeric_cols])
corrplot.mixed(corr, number.cex = .75, tl.pos = "lt")
Even though there are still moderate multicollinearity issues, they are below the serious level and will be handled again during modeling. Again, house prices seem to have high correlations with area-related variables, which comes as no surprise. Quality-related factors such as year built or fireplaces would still mildly affect house prices. Although there are higher numbers elsewhere on the matrix, “PRICE” is the only variable that correlates fairly uniformly with all the others.
A scatter plot is the basic tool for finding a pattern between two numerical variables: a slope, whether upward or downward, means a relationship. When units differ widely among variables, normalization could be considered, because plotting them together as-is might mislead interpretation. A simpler way to deal with this is to plot similar variables together instead of going through a normalization process. Variables related to rooms are in a similar range, so they are plotted together first. Next, the year-related variables and the rest are drawn on the same pane.
# define a function for repeating uses
house.price_scatter <- function(var, x_start, x_lim) {
ggplot(house.price, aes_string(var, "PRICE", fill = var)) +
geom_point(color = "darkgreen", shape = 1, alpha = 0.1,
size = 0.2, position = "jitter") +
geom_smooth(span = 5, method = lm) +
ggtitle(toString(var)) + xlab("") +
theme(legend.position="none") +
scale_x_continuous(limits = c(x_start, x_lim)) +
scale_y_continuous(limits = c(0, 1750000),
breaks = seq(0, 1750000, 250000))
}
# room-related variables
BATHRM <- house.price_scatter("BATHRM", 0, 8)
HF_BATHRM <- house.price_scatter("HF_BATHRM", 0, 4)
BEDRM <- house.price_scatter("BEDRM", 0, 8)
ROOMS <- house.price_scatter("ROOMS", 0, 15)
grid.arrange(BATHRM, HF_BATHRM, BEDRM, ROOMS,
ncol = 2, nrow = 2)
It seems there is a strong upward pattern for all of them. It is natural that a house price increases with the number of rooms. One surprising thing is that the number of bathrooms has a sharper slope than the number of bedrooms. Indeed, it is often seen that a 2-bedroom/2-bathroom house is more expensive than a 3-bedroom/1-bathroom house. Another thing to note is that prices seem to vary more widely as the room counts increase. For example, the price range at 3 to 4 rooms is much smaller than at 7 to 8 rooms. Since the room variables will play a significant role, heteroscedasticity is expected, and this will be addressed later.
# year-related and quality variables
AYB <- house.price_scatter("AYB", 1930, 2020)
EYB <- house.price_scatter("EYB", 1930, 2020)
STORIES <- house.price_scatter("STORIES", 0, 10)
FIREPLACES <- house.price_scatter("FIREPLACES", 0, 5)
grid.arrange(AYB, EYB, STORIES, FIREPLACES,
ncol = 2, nrow = 2)
The year-related variables show only a rough pattern. “EYB” has some slope, driven by data after 2000; these values must be heavily affected by contingent market trends rather than typical features. Although “STORIES” has a sharp slope, most houses have 2 to 3 stories with large variances. “FIREPLACES” shows a decent upward trend within the 0-to-1 range.
# GBA showing growing variances
ggplot(house.price, aes(x = GBA, y = PRICE)) +
geom_point(color = "darkgreen",
shape = 1, alpha = 0.3, size = 0.5) +
geom_smooth() +
ggtitle("GBA") + xlab("") +
scale_x_continuous(limits = c(0, 6000),
breaks = seq(0, 6000, 1000)) +
scale_y_continuous(limits = c(0, 1750000),
breaks = seq(0, 1750000, 250000))
“GBA” reveals the heteroscedasticity issue very clearly: as a house gets bigger, its price could fall in almost any range. One way to understand this is that 1-bedroom houses should sit in a small price range, because even a new house downtown is still just a 1-bedroom. A 4-bedroom house, however, could be worth over a million dollars or be an old, poorly maintained house on auction.
Now that the variance issue is found, it is time to fix it. As a final check, the Breusch-Pagan test can be conducted on a linear model. The test runs with the null hypothesis that the variance is constant across fitted values. As seen below, the results are enough to reject the null hypothesis and confirm the variance issue.
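For reference, the studentized statistic reported below is n times the R-squared of an auxiliary regression of the squared residuals on the predictors, compared against a chi-squared distribution with degrees of freedom equal to the number of predictors: \[BP=n\ R^2_{aux}\sim\chi^2_{p}\]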
# linear model with numerical variables
Numeric.lm <- lm(PRICE ~ BATHRM + HF_BATHRM + BEDRM + ROOMS +
AYB + EYB + STORIES + FIREPLACES + GBA,
data = house.price)
# constant variance tests
print(bptest(Numeric.lm))
##
## studentized Breusch-Pagan test
##
## data: Numeric.lm
## BP = 1007.1, df = 9, p-value < 2.2e-16
print(ncvTest(Numeric.lm))
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1220.9, Df = 1, p = < 2.22e-16
Box-Cox suggests how to transform the data. The method's core concept is the lambda parameter, which is searched over a certain range; a lambda near 0.5 indicates a square-root transformation for prices.
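For reference, the Box-Cox family transforms the response as follows; the estimated lambda of about 0.55 is close enough to 0.5 to justify the simpler square root: \[y^{(\lambda)}=\begin{cases}\frac{y^{\lambda}-1}{\lambda} & \lambda\neq 0\\ \ln{y} & \lambda=0\end{cases}\]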
# box-cox
bc <- boxCox(Numeric.lm)
lambda <- bc$x[which.max(bc$y)]
cat("Lambda for Transformation:", lambda)
## Lambda for Transformation: 0.5454545
# transformation to sqrt
house.price$PRICE_TRANS <- sqrt(house.price$PRICE)
house.price <- select(house.price, -PRICE)
# price-sqrt histogram
ggplot(house.price, aes(x = PRICE_TRANS)) +
geom_histogram(aes(y = ..density..),
binwidth = 20, fill = "blue", alpha = 0.25) +
geom_density(fill = "pink", alpha = 0.25) +
labs(title = "Washington D.C House Price SQRT") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(aes(xintercept = mean(PRICE_TRANS), color = "mean"), size = 0.5) +
geom_vline(aes(xintercept = median(PRICE_TRANS), color = "median"),
linetype = "dashed", size = 0.5) +
scale_x_continuous(limits = c(0, 1500)) +
scale_color_manual(name = "", values = c(mean = "red", median = "blue")) +
stat_function(fun = dnorm, args = list(
mean = mean(house.price$PRICE_TRANS), sd = sd(house.price$PRICE_TRANS)),
linetype = "dashed")
After transformation, the distribution is better than the initial one. The dashed line is the normal curve for comparison. Despite not being a perfect match, at least the skewness is fixed, as the mean and median are aligned.
# final correlations with PRICE_TRANS
X_num <- names(select_if(house.price, is.numeric))
X_num <- X_num[X_num != "PRICE_TRANS"]
cat("[Price SQRT vs. X Variables Correlation]\n")
## [Price SQRT vs. X Variables Correlation]
kable(cor(house.price[, X_num], house.price$PRICE_TRANS)) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Variable | Correlation |
---|---|
BATHRM | 0.4432783 |
HF_BATHRM | 0.2619754 |
ROOMS | 0.2784958 |
BEDRM | 0.3369135 |
AYB | -0.2574367 |
EYB | 0.2098391 |
STORIES | 0.2011221 |
GBA | 0.4984254 |
FIREPLACES | 0.4796368 |
Although the correlations decreased a bit, it is more important to abide by the assumptions of constant variance and normality; thus the transformation is necessary.
A box plot helps to see whether a categorical variable can divide the dependent variable into groups. The data set has several categorical variables covering building materials, functions, quality, and area. Some have too many distinct values, some have unbalanced values, and others have their own issues. Typical remedies are dropping the variable, merging sparse levels into groups, and re-encoding the variable numerically, some of which will be used in this project.
categorical_cols <- names(select_if(house.price, is.factor))
print(categorical_cols)
## [1] "HEAT" "AC" "QUALIFIED" "STYLE" "STRUCT"
## [6] "GRADE" "CNDTN" "EXTWALL" "ROOF" "INTWALL"
## [11] "WARD" "RMDL"
# number of unique values in each categorical variable
str(apply(house.price[, categorical_cols], 2, function(x) unique(x)))
## List of 12
## $ HEAT : chr [1:14] "Warm Cool" "Hot Water Rad" "Forced Air" "Ht Pump" ...
## $ AC : chr [1:3] "Y" "N" "0"
## $ QUALIFIED: chr [1:2] "Q" "U"
## $ STYLE : chr [1:13] "2 Story" "3 Story" "2.5 Story Fin" "4 Story" ...
## $ STRUCT : chr [1:8] "Row Inside" "Semi-Detached" "Multi" "Row End" ...
## $ GRADE : chr [1:9] "Above Average" "Very Good" "Average" "Excellent" ...
## $ CNDTN : chr [1:7] "Very Good" "Excellent" "Good" "Average" ...
## $ EXTWALL : chr [1:21] "Common Brick" "Stone" "Stucco" "Brick/Stone" ...
## $ ROOF : chr [1:12] "Built Up" "Metal- Sms" "Slate" "Comp Shingle" ...
## $ INTWALL : chr [1:11] "Hardwood" "Hardwood/Carp" "Carpet" "Wood Floor" ...
## $ WARD : chr [1:8] "Ward 2" "Ward 1" "Ward 6" "Ward 5" ...
## $ RMDL : chr [1:2] "Y" "N"
# fix '0' values in AC
cat("Number of '0's in AC:", pull(count(house.price[house.price$AC == '0', ])))
## Number of '0's in AC: 6
house.price[house.price$AC == '0', 'AC'] <- "N"
# define a function for repeated uses
house.price_boxplot <- function(var, x.angle) {
ggplot(house.price, aes_string(var, "PRICE_TRANS", fill = var)) +
geom_boxplot(outlier.colour = "black",
outlier.shape = 16,
outlier.size = 1) +
geom_hline(aes(yintercept = median(PRICE_TRANS)),
color = 'gray', linetype = "dashed", size = 0.5) +
ggtitle(toString(var)) + xlab("") +
theme(legend.position="none",
axis.text.x = element_text(angle = x.angle, hjust = 1))
}
# air conditioning, remodeled, qualified, heating, ward
lay <- rbind(c(1, 2, 3),
c(4, 4, 5))
grid.arrange(grobs = list(house.price_boxplot("AC", 0),
house.price_boxplot("RMDL", 0),
house.price_boxplot("QUALIFIED", 0),
house.price_boxplot("HEAT", 90),
house.price_boxplot("WARD", 90)),
heights = c(0.4, 0.6),
layout_matrix = lay)
“AC”, “RMDL”, and “QUALIFIED” have two classes each. Among them, “QUALIFIED” does a good job of separating expensive houses from cheap ones; naturally, qualified houses are worth more, and the same goes for remodeled houses and houses with air conditioning. “HEAT” is confusing in that most of its values sit around the median price (the dashed line). Though there are values such as “Air Exchng” or “Evp Cool”, their counts do not seem large enough to make a contribution, so “HEAT” should be dropped. “WARD” looks very promising, as each ward occupies its own price range. One issue is that the wards have many outliers, and wards 7 and 8 are almost identical. Nonetheless, this looks useful enough to work alongside the numerical variables.
# drop HEAT
print(summary(house.price$HEAT))
## Air-Oil Air Exchng Elec Base Brd Electric Rad Evp Cool
## 11 2 8 6 2
## Forced Air Gravity Furnac Hot Water Rad Ht Pump Ind Unit
## 5450 8 3957 189 1
## No Data Wall Furnace Warm Cool Water Base Brd
## 7 22 2319 20
house.price <- select(house.price, -HEAT)
grid.arrange(house.price_boxplot("STYLE", 90),
house.price_boxplot("STRUCT", 90),
ncol = 2)
The values in “STYLE” are essentially story counts. Since the “STORIES” variable already exists, this column can be removed; also, almost all observations (9,557) are “2 Story”, making it severely unbalanced. “STRUCT” looks like it could help identify cheap houses through the values “Multi” or “Semi-Detached”.
# style
print(summary(house.price$STYLE))
## 1 Story 1.5 Story Fin 1.5 Story Unfin 2 Story
## 461 245 16 9557
## 2.5 Story Fin 2.5 Story Unfin 3 Story 3.5 Story Fin
## 633 63 918 14
## 3.5 Story Unfin 4 Story 4.5 Story Fin 4.5 Story Unfin
## 0 52 0 0
## Bi-Level Default Outbuildings Split Foyer
## 1 8 0 18
## Split Level Vacant
## 16 0
house.price <- select(house.price, -STYLE)
# struct
print(summary(house.price$STRUCT))
## Default Multi Row End Row Inside Semi-Detached
## 2 625 1432 5046 1829
## Single Town End Town Inside Vacant Land
## 3019 12 37 0
house.price$STRUCT <- ifelse(house.price$STRUCT %in%
c("Multi", "Semi-Detached", "Town Inside"),
"Cheap", "Expensive")
# condition, grade
grid.arrange(house.price_boxplot("CNDTN", 90),
house.price_boxplot("GRADE", 90),
ncol = 2)
“CNDTN” and “GRADE” show a similar pattern: average or fair houses are cheaper. Those houses sit below the median line and the rest sit clearly above it, so these values can be grouped into two. This way, the values in “CNDTN” become balanced too. The values in “GRADE”, however, cannot be cleanly divided into two groups, as there are unignorable discrepancies among similar values. This will be investigated further.
# grade
print(summary(house.price$GRADE))
## Above Average Average Excellent Exceptional-A Exceptional-B
## 3741 4675 229 14 1
## Exceptional-C Exceptional-D Fair Quality Good Quality Low Quality
## 0 0 22 2313 0
## No Data Superior Very Good
## 0 164 843
# condition
print(summary(house.price$CNDTN))
## Average Default Excellent Fair Good Poor Very Good
## 4179 1 236 117 5392 28 2049
house.price$CNDTN <- ifelse(house.price$CNDTN %in%
c("Average", "Fair", "Poor"), "Bad", "Good")
# exwall
house.price_boxplot("EXTWALL", 90)
There are just too many values in “EXTWALL”, so grouping them is inevitable. With some extra research on exterior materials, they can be grouped into “Premium” or “Normal”. Generally, brick and stucco are considered relatively better materials.
# exterior wall
print(summary(house.price$EXTWALL))
## Adobe Aluminum Brick Veneer Brick/Siding Brick/Stone
## 0 100 129 779 70
## Brick/Stucco Common Brick Concrete Concrete Block Default
## 65 9025 12 7 2
## Face Brick Hardboard Metal Siding Plywood Rustic Log
## 123 18 7 0 0
## Shingle SPlaster Stone Stone Veneer Stone/Siding
## 121 0 57 35 43
## Stone/Stucco Stucco Stucco Block Vinyl Siding Wood Siding
## 18 286 4 661 440
house.price$EXTWALL <- ifelse(house.price$EXTWALL %in%
c("Brick Veneer", "Brick/Stone", "Brick/Stucco", "Hardboard",
"Stone", "Stone/Stucco", "Stucco Block", "Wood Siding"),
"Premium", "Normal")
# interior wall
grid.arrange(house.price_boxplot("INTWALL", 90),
house.price_boxplot("ROOF", 90),
ncol = 2)
“INTWALL” might have a chance to be grouped, but it should be dropped because of its imbalance: 9,466 houses have the “Hardwood” interior type, which is almost all cases. As for “ROOF”, there are four types that are clearly above the median line, and these can be grouped together.
# drop Interior
print(summary(house.price$INTWALL))
## Carpet Ceramic Tile Default Hardwood Hardwood/Carp
## 300 6 6 9466 1419
## Lt Concrete Parquet Resiliant Terrazo Vinyl Comp
## 8 2 3 1 3
## Vinyl Sheet Wood Floor
## 0 788
house.price <- select(house.price, -INTWALL)
# Roof: cheap vs. expensive
print(summary(house.price$ROOF))
## Built Up Clay Tile Comp Shingle Composition Ro Concrete
## 3869 47 3464 24 0
## Concrete Tile Metal- Cpr Metal- Pre Metal- Sms Neopren
## 0 5 32 3460 121
## Shake Shingle Slate Typical Water Proof
## 65 43 853 19 0
## Wood- FS
## 0
house.price$ROOF <- ifelse(house.price$ROOF %in%
c("Clay Tile", "Metal- Cpr", "Neopren", "Slate"),
"Expensive", "Cheap")
A facet plot usually needs two numerical variables and a categorical variable. While “CNDTN” is now divided into two groups, it would not be a good idea to apply the same grouping to “GRADE”, since they are similar attributes. This is where a facet plot helps, to see whether “GRADE” could instead be converted into a numerical variable.
# facet points by grade
ggplot(house.price[!house.price$GRADE %in%
c("Exceptional-A", "Exceptional-B", "Fair Quality"), ],
aes(GBA, PRICE_TRANS, color = GRADE)) +
geom_point(alpha = 0.05, shape = 1) +
geom_smooth(method = lm) +
facet_wrap(~GRADE) +
scale_x_continuous(limits = c(0, 4000),
breaks = seq(0, 4000, 2000))
# median price by grade
GRADE_median <- house.price %>%
group_by(GRADE) %>%
summarise(MEDIAN_PRICE = median(PRICE_TRANS)) %>%
arrange(MEDIAN_PRICE)
GRADE_median$MEDIAN_PRICE <- round(GRADE_median$MEDIAN_PRICE, digits = 2)
kable(GRADE_median) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
GRADE | MEDIAN_PRICE |
---|---|
Fair Quality | 409.13 |
Average | 663.25 |
Above Average | 782.62 |
Exceptional-B | 921.95 |
Good Quality | 950.53 |
Very Good | 1067.71 |
Superior | 1113.55 |
Excellent | 1140.18 |
Exceptional-A | 1214.66 |
# new variable median price by grade
house.price$GRADE_MEDIAN_PRICE <- lapply(house.price$GRADE, function(x)
GRADE_median[match(x, GRADE_median$GRADE), "MEDIAN_PRICE"][[1]])
house.price$GRADE_MEDIAN_PRICE <- as.numeric(house.price$GRADE_MEDIAN_PRICE)
house.price <- select(house.price, -GRADE)
As seen in the graph and the table, prices are distinctively different by grade. Therefore, these per-grade median prices will replace the categorical variable “GRADE”, rather than grouping its values into reduced categories.
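As a side note, the lapply() encoding above can also be done with a single vectorized match() lookup; a sketch producing the same column (match() keeps the row names that later filtering steps rely on):
# equivalent vectorized alternative to the lapply() encoding above
house.price$GRADE_MEDIAN_PRICE <-
  GRADE_median$MEDIAN_PRICE[match(house.price$GRADE, GRADE_median$GRADE)]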
Recalling the goal of this project, evaluating linear models is crucial to building the best model. Diagnostic analysis is the step that does this in various ways, such as the Q-Q plot, the Residuals vs. Fitted plot, and Cook's distance. These techniques are designed to test assumptions like constant variance, normality of residuals, and random error. The first step is building a linear model that will serve as a baseline and be modified later.
\[Cook's\ Distance\ (D_i)=\frac{(y_i-\hat{y}_i)^2}{p\times MSE}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right]\]
house.lm <- lm(PRICE_TRANS ~ ., data = house.price)
summary(house.lm)
##
## Call:
## lm(formula = PRICE_TRANS ~ ., data = house.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1033.15 -48.89 -1.19 47.75 766.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.610e+02 1.503e+02 5.065 4.14e-07 ***
## BATHRM 2.367e+01 1.376e+00 17.195 < 2e-16 ***
## HF_BATHRM 1.989e+01 1.687e+00 11.791 < 2e-16 ***
## ACY 8.028e+00 2.612e+00 3.074 0.002117 **
## ROOMS -5.383e-02 6.188e-01 -0.087 0.930682
## BEDRM 4.084e+00 1.214e+00 3.365 0.000768 ***
## AYB -1.169e+00 5.455e-02 -21.429 < 2e-16 ***
## EYB 9.887e-01 1.062e-01 9.312 < 2e-16 ***
## STORIES -4.009e-01 1.662e+00 -0.241 0.809352
## QUALIFIEDU -9.573e+01 2.363e+00 -40.508 < 2e-16 ***
## GBA 6.857e-02 2.357e-03 29.092 < 2e-16 ***
## STRUCTExpensive 3.413e+01 2.300e+00 14.839 < 2e-16 ***
## CNDTNGood 3.435e+01 2.583e+00 13.300 < 2e-16 ***
## EXTWALLPremium 5.489e+00 3.490e+00 1.573 0.115761
## ROOFExpensive 1.115e+01 3.351e+00 3.328 0.000877 ***
## FIREPLACES 1.362e+01 1.350e+00 10.085 < 2e-16 ***
## WARDWard 2 8.984e+01 5.515e+00 16.289 < 2e-16 ***
## WARDWard 3 5.428e+01 4.768e+00 11.382 < 2e-16 ***
## WARDWard 4 -6.287e+01 3.984e+00 -15.783 < 2e-16 ***
## WARDWard 5 -8.875e+01 3.925e+00 -22.610 < 2e-16 ***
## WARDWard 6 3.723e+00 3.852e+00 0.966 0.333921
## WARDWard 7 -2.157e+02 4.370e+00 -49.356 < 2e-16 ***
## WARDWard 8 -2.458e+02 4.677e+00 -52.548 < 2e-16 ***
## RMDLY 2.674e+00 2.331e+00 1.147 0.251409
## GRADE_MEDIAN_PRICE 2.443e-01 9.596e-03 25.454 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92.86 on 11977 degrees of freedom
## Multiple R-squared: 0.8132, Adjusted R-squared: 0.8128
## F-statistic: 2172 on 24 and 11977 DF, p-value: < 2.2e-16
A few variables are not significant. “ROOMS” must be due to collinearity, and the rest are seemingly just not useful. Interestingly, “Ward 6” was already ambiguous on the map earlier.
With the baseline model, diagnostic plots can be drawn. The Residuals vs. Fitted plot should show a trend line lying horizontally around zero. An ideal Q-Q plot follows the diagonal line. Cook's distance provides limits that points should not cross; it gives the indices of problematic observations so they can be excluded.
plot(house.lm)
Generally speaking, the Residuals vs. Fitted plot looks okay, but the Q-Q plot does not stay on the normal line; from its shape, the distribution could be light-tailed. The Residuals vs. Leverage plot seems fine, as the points stay within the Cook's distance boundary.
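The observations flagged in the plots can also be listed programmatically; a minimal sketch using cooks.distance() (taking the top four mirrors the plots above and is an assumption, not part of the original workflow):
# rank observations by Cook's distance and show the most influential
cd <- cooks.distance(house.lm)
head(sort(cd, decreasing = TRUE), 4)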
# influential observations from Cook's Distance-Leverage
X_y <- names(select_if(house.price, is.numeric))
print(house.price[row.names(house.price) %in%
c(21, 18569, 24841, 97202), c(X_y)])
## BATHRM HF_BATHRM ROOMS BEDRM AYB EYB STORIES GBA FIREPLACES
## 21 3 1 14 5 1880 1987 3.0 3465 3
## 18569 2 1 8 3 1890 1976 2.0 2272 0
## 24841 4 1 7 3 1996 2009 2.5 2480 2
## 97202 1 0 4 1 1915 1954 1.0 684 0
## PRICE_TRANS GRADE_MEDIAN_PRICE
## 21 182.296462 1067.71
## 18569 3.162278 1113.55
## 24841 409.878031 1067.71
## 97202 1178.982612 663.25
# drop influential observations
house.price <- house.price[!row.names(house.price) %in%
c(21, 18569, 24841, 97202), ]
There are four observations that the plots suggest examining closely. They are clearly priced strangely for their features. For example, index 21 has 14 rooms yet a transformed price of only 182, while its grade's median is about 1068. The rest also deviate from the normal pattern.
resid.plot <- function(model) {
ggplot(model, aes(.resid)) +
geom_density(fill = "pink", color = "red", alpha = 0.5) +
stat_function(fun = dnorm, args = list(
mean = mean(resid(model)), sd = sd(resid(model))),
linetype = "dashed", geom = "area", alpha = 0.2) +
xlab("Price Square Root Residuals") + ylab("Density") +
ggtitle("Residual Plot") +
scale_x_continuous(limits = c(-500, 500)) +
coord_fixed(ratio = 1e5)
}
resid.plot(house.lm)
Following the warning from the Q-Q plot, the residual distribution should be checked again. The gray area in the back is the normal distribution, while the red area is the data distribution. It has a high peak with light tails on both sides, suggesting that most observations are centered around the mean. As long as it is a symmetric bell shape, it will not cause serious problems. However, the long tails still suggest some extreme values.
One method to detect influential points is DFBETAS, which measures how much impact each observation has on a particular coefficient. Recalling the correlation matrix, the top three predictors will be investigated in order to remove influential points. The ultimate goal of this process is, again, finding the general trend without being biased by extreme values.
\[DFBETAS_{ij}=\frac{\hat{\beta}_j-\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X^{T}X)^{-1}_{jj}}}\]
# dfbeta
dfbeta.GBA <- dfbetaPlots(house.lm, terms = "GBA")
dfbeta.BATHRM <- dfbetaPlots(house.lm, terms = "BATHRM")
dfbeta.FIREPLACES <- dfbetaPlots(house.lm, terms = "FIREPLACES")
“GBA”, “BATHRM”, and “FIREPLACES” are the three variables most correlated with price. Because they will affect house prices the most, influential points should be detected based on them. The following calculates the DFBETAS values and the threshold for the data set, and finds influential points for the three features.
\[Threshold=\frac{2}{\sqrt{n}}\ where\ n=observations\]
# calculate the dfbetas threshold (2 / sqrt(n))
dfbeta.thrhd <- 2 / sqrt(nrow(house.price))
dfbetas <- dfbetas(house.lm)
cat("DFBETA limit:", dfbeta.thrhd)
## DFBETA limit: 0.01825894
# influential point indices
dfbetas.infl.obs <- names(dfbetas[abs(dfbetas[ , "GBA"])
> dfbeta.thrhd, "GBA"])
dfbetas.infl.obs <- c(dfbetas.infl.obs,
names(dfbetas[abs(dfbetas[ , "BATHRM"])
> dfbeta.thrhd, "BATHRM"]))
dfbetas.infl.obs <- c(dfbetas.infl.obs,
names(dfbetas[abs(dfbetas[ , "FIREPLACES"])
> dfbeta.thrhd, "FIREPLACES"]))
# filter observations over dfbeta limit
house.price <- house.price[!row.names(house.price) %in% dfbetas.infl.obs, ]
house.infl.lm <- lm(PRICE_TRANS ~., data = house.price)
summary(house.infl.lm)
##
## Call:
## lm(formula = PRICE_TRANS ~ ., data = house.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466.85 -43.51 -1.04 40.47 544.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.651e+02 1.153e+02 5.767 8.28e-09 ***
## BATHRM 2.481e+01 1.124e+00 22.072 < 2e-16 ***
## HF_BATHRM 1.991e+01 1.316e+00 15.125 < 2e-16 ***
## ACY 8.312e+00 1.993e+00 4.171 3.05e-05 ***
## ROOMS -4.748e-01 4.962e-01 -0.957 0.338653
## BEDRM 5.246e+00 9.690e-01 5.414 6.31e-08 ***
## AYB -1.203e+00 4.323e-02 -27.831 < 2e-16 ***
## EYB 1.061e+00 8.227e-02 12.898 < 2e-16 ***
## STORIES 9.766e-01 1.250e+00 0.782 0.434494
## QUALIFIEDU -7.831e+01 1.875e+00 -41.760 < 2e-16 ***
## GBA 7.230e-02 2.077e-03 34.813 < 2e-16 ***
## STRUCTExpensive 3.292e+01 1.742e+00 18.902 < 2e-16 ***
## CNDTNGood 3.277e+01 1.980e+00 16.549 < 2e-16 ***
## EXTWALLPremium 3.696e-01 2.719e+00 0.136 0.891884
## ROOFExpensive 9.263e+00 2.626e+00 3.528 0.000421 ***
## FIREPLACES 1.490e+01 1.207e+00 12.341 < 2e-16 ***
## WARDWard 2 1.015e+02 4.449e+00 22.805 < 2e-16 ***
## WARDWard 3 6.428e+01 3.709e+00 17.330 < 2e-16 ***
## WARDWard 4 -5.428e+01 3.087e+00 -17.587 < 2e-16 ***
## WARDWard 5 -7.815e+01 3.061e+00 -25.532 < 2e-16 ***
## WARDWard 6 4.901e+00 3.006e+00 1.630 0.103099
## WARDWard 7 -2.095e+02 3.388e+00 -61.847 < 2e-16 ***
## WARDWard 8 -2.351e+02 3.607e+00 -65.190 < 2e-16 ***
## RMDLY 3.035e+00 1.785e+00 1.700 0.089117 .
## GRADE_MEDIAN_PRICE 2.430e-01 7.571e-03 32.094 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.8 on 10660 degrees of freedom
## Multiple R-squared: 0.891, Adjusted R-squared: 0.8907
## F-statistic: 3630 on 24 and 10660 DF, p-value: < 2.2e-16
After the influential points are removed, R-squared improves remarkably from 0.81 to 0.89. However, there are still insignificant features, so the last step is dropping them. These are features whose presence makes no difference to the model, so R-squared should stay the same after losing them. Beyond those, “ROOF” is relatively weaker than the rest. “Ward 6” is one of the values of “WARD”, and it will be handled before building the predictive model.
# drop insignificant regressors
house.price <- select(house.price, -c("ROOMS", "STORIES", "EXTWALL", "RMDL"))
write.csv(house.price, file = "DC_Properties_final.csv", row.names = F)
# regression with significant variables
house.final.lm <- lm(PRICE_TRANS ~ ., data = house.price)
summary(house.final.lm)
##
## Call:
## lm(formula = PRICE_TRANS ~ ., data = house.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466.55 -43.35 -0.93 40.45 544.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.982e+02 1.110e+02 6.292 3.25e-10 ***
## BATHRM 2.478e+01 1.083e+00 22.871 < 2e-16 ***
## HF_BATHRM 1.992e+01 1.314e+00 15.160 < 2e-16 ***
## ACY 9.162e+00 1.939e+00 4.726 2.32e-06 ***
## BEDRM 4.992e+00 9.271e-01 5.385 7.41e-08 ***
## AYB -1.230e+00 4.042e-02 -30.426 < 2e-16 ***
## EYB 1.071e+00 8.181e-02 13.093 < 2e-16 ***
## QUALIFIEDU -7.852e+01 1.871e+00 -41.969 < 2e-16 ***
## GBA 7.161e-02 1.909e-03 37.507 < 2e-16 ***
## STRUCTExpensive 3.286e+01 1.728e+00 19.018 < 2e-16 ***
## CNDTNGood 3.386e+01 1.879e+00 18.019 < 2e-16 ***
## ROOFExpensive 9.258e+00 2.621e+00 3.532 0.000415 ***
## FIREPLACES 1.495e+01 1.203e+00 12.430 < 2e-16 ***
## WARDWard 2 1.013e+02 4.442e+00 22.803 < 2e-16 ***
## WARDWard 3 6.396e+01 3.665e+00 17.448 < 2e-16 ***
## WARDWard 4 -5.441e+01 3.059e+00 -17.785 < 2e-16 ***
## WARDWard 5 -7.819e+01 3.046e+00 -25.673 < 2e-16 ***
## WARDWard 6 4.768e+00 2.999e+00 1.590 0.111878
## WARDWard 7 -2.094e+02 3.354e+00 -62.446 < 2e-16 ***
## WARDWard 8 -2.353e+02 3.571e+00 -65.876 < 2e-16 ***
## GRADE_MEDIAN_PRICE 2.431e-01 7.523e-03 32.319 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.8 on 10664 degrees of freedom
## Multiple R-squared: 0.8909, Adjusted R-squared: 0.8907
## F-statistic: 4356 on 20 and 10664 DF, p-value: < 2.2e-16
grid.arrange(resid.plot(house.lm) + ggtitle("Original Residuals"),
resid.plot(house.infl.lm) + ggtitle("After Removing Influential Points"),
ncol = 2)
On the left is the residual distribution before eliminating the influential points, and on the right after. With the influential points removed, the assumption of normal residuals is better met, nearly matching the normal distribution in gray.
Although there are many other measurements for selecting features, two representative ones are used in this project: Adjusted R-squared to measure overall fit, and AIC to judge whether adding another feature helps the model.
Unlike R-squared, Adjusted R-squared takes the number of parameters into consideration. That is, adding a new variable can make the score worse if the variable does not help enough in explaining the errors.
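For reference, with n observations and p predictors: \[Adjusted\ R^2=1-(1-R^2)\ \frac{n-1}{n-p-1}\]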
# full linear model
rsq.lm <- lm(PRICE_TRANS ~ ., data = house.price)
# convert it to matrix
x <- model.matrix(rsq.lm)[, -1]
y <- house.price$PRICE_TRANS
# 5 best models by adjusted r-squared
adjr <- leaps(x, y, method = "adjr2")
maxadjr(adjr, 5)
## 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
## 0.891
## 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20
## 0.891
## 1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20
## 0.891
## 1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,18,19,20
## 0.891
## 1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
## 0.891
The result lists the predictor numbers and the adjusted R-squared for each of the five best models. While all the models perform the same (0.89), they have different sets of predictors. Since a simpler model is better for the same performance, the 4th should be chosen, as it has the fewest features.
print(colnames(x[, c(11, 17)]))
## [1] "ROOFExpensive" "WARDWard 6"
x <- x[, -c(11, 17)]
The two features the 4th model excludes happen to be the ones that already had significance issues. Now that it is clear they do not need to be in the model, “ROOF” and “Ward 6” can be let go. Nevertheless, it is safer to look at feature selection from a different angle as well.
Basically, AIC is designed to decide whether to include an additional variable. The lower the score, the better the model. The process iterates over features, calculates AIC, and continues until AIC stops decreasing. The following starts with a null model and adds features one by one to see how AIC changes.
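For reference, with k estimated parameters and maximized likelihood: \[AIC=2k-2\ln(\hat{L})\] For a linear model this is equivalent, up to an additive constant, to \[AIC=n\ln(RSS/n)+2k\] which is the quantity step() reports below.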
# null model for variable addition
null.lm <- lm(PRICE_TRANS ~ 1, data = house.price)
full.lm <- lm(PRICE_TRANS ~ ., data = house.price)
# stepwise selection
step(null.lm, scope = list(lower = null.lm, upper = full.lm),
direction = "both")
## Start: AIC=113448.5
## PRICE_TRANS ~ 1
##
## Df Sum of Sq RSS AIC
## + WARD 7 280357353 155994622 102472
## + GRADE_MEDIAN_PRICE 1 209347790 227004185 106468
## + GBA 1 127386145 308965830 109762
## + FIREPLACES 1 121626952 314725023 109959
## + BATHRM 1 98008564 338343411 110732
## + BEDRM 1 58669252 377682723 111908
## + QUALIFIED 1 47450012 388901963 112220
## + STRUCT 1 43602660 392749315 112326
## + CNDTN 1 40578184 395773791 112408
## + HF_BATHRM 1 33346273 403005702 112601
## + AYB 1 33215069 403136906 112605
## + ROOF 1 27149644 409202331 112764
## + AC 1 25692455 410659520 112802
## + EYB 1 18547356 417804619 112986
## <none> 436351975 113449
##
## Step: AIC=102471.6
## PRICE_TRANS ~ WARD
##
## Df Sum of Sq RSS AIC
## + GBA 1 53904968 102089653 97944
## + BATHRM 1 49887222 106107400 98356
## + GRADE_MEDIAN_PRICE 1 44064741 111929881 98927
## + BEDRM 1 33110405 122884217 99924
## + CNDTN 1 26333572 129661050 100498
## + EYB 1 23263095 132731526 100748
## + FIREPLACES 1 20577094 135417527 100962
## + QUALIFIED 1 20035415 135959207 101005
## + AC 1 17124704 138869918 101231
## + HF_BATHRM 1 12797291 143197331 101559
## + STRUCT 1 6563659 149430963 102014
## + ROOF 1 3748826 152245796 102214
## + AYB 1 401705 155592917 102446
## <none> 155994622 102472
## - WARD 7 280357353 436351975 113449
##
## Step: AIC=97943.52
## PRICE_TRANS ~ WARD + GBA
##
## Df Sum of Sq RSS AIC
## + CNDTN 1 21986830 80102823 95354
## + QUALIFIED 1 19369301 82720353 95698
## + GRADE_MEDIAN_PRICE 1 15902380 86187274 96136
## + AC 1 15530322 86559332 96182
## + BATHRM 1 10894819 91194834 96740
## + EYB 1 9993302 92096351 96845
## + HF_BATHRM 1 6097378 95992275 97288
## + STRUCT 1 6087446 96002208 97289
## + FIREPLACES 1 5987187 96102466 97300
## + BEDRM 1 2149040 99940613 97718
## + ROOF 1 999819 101089835 97840
## + AYB 1 98023 101991630 97935
## <none> 102089653 97944
## - GBA 1 53904968 155994622 102472
## - WARD 7 206876176 308965830 109762
##
## Step: AIC=95353.98
## PRICE_TRANS ~ WARD + GBA + CNDTN
##
## Df Sum of Sq RSS AIC
## + GRADE_MEDIAN_PRICE 1 10702257 69400566 93824
## + QUALIFIED 1 9857858 70244966 93953
## + STRUCT 1 4421828 75680995 94749
## + FIREPLACES 1 4035908 76066915 94804
## + BATHRM 1 2843797 77259026 94970
## + AC 1 2461461 77641362 95022
## + HF_BATHRM 1 1904389 78198434 95099
## + AYB 1 1355193 78747630 95174
## + BEDRM 1 1087531 79015292 95210
## + EYB 1 903071 79199752 95235
## + ROOF 1 703228 79399595 95262
## <none> 80102823 95354
## - CNDTN 1 21986830 102089653 97944
## - GBA 1 49558227 129661050 100498
## - WARD 7 198756939 278859762 108668
##
## Step: AIC=93823.58
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE
##
## Df Sum of Sq RSS AIC
## + QUALIFIED 1 9153173 60247393 92314
## + AYB 1 3359196 66041370 93295
## + BATHRM 1 2859494 66541072 93376
## + STRUCT 1 2184079 67216486 93484
## + AC 1 1423247 67977319 93604
## + FIREPLACES 1 1347815 68052751 93616
## + BEDRM 1 1272061 68128505 93628
## + HF_BATHRM 1 855942 68544624 93693
## + ROOF 1 96729 69303837 93811
## + EYB 1 76573 69323993 93814
## <none> 69400566 93824
## - GRADE_MEDIAN_PRICE 1 10702257 80102823 95354
## - CNDTN 1 16786708 86187274 96136
## - GBA 1 26158907 95559473 97239
## - WARD 7 118871084 188271650 104473
##
## Step: AIC=92314.34
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED
##
## Df Sum of Sq RSS AIC
## + AYB 1 3896218 56351175 91602
## + STRUCT 1 2307855 57939537 91899
## + BATHRM 1 2284329 57963063 91903
## + FIREPLACES 1 1326205 58921188 92079
## + BEDRM 1 1178410 59068983 92105
## + AC 1 694024 59553368 92193
## + HF_BATHRM 1 562033 59685359 92216
## + EYB 1 331220 59916172 92257
## + ROOF 1 139550 60107842 92292
## <none> 60247393 92314
## - QUALIFIED 1 9153173 69400566 93824
## - CNDTN 1 9229694 69477087 93835
## - GRADE_MEDIAN_PRICE 1 9997573 70244966 93953
## - GBA 1 26838667 87086060 96249
## - WARD 7 111273121 171520513 103479
##
## Step: AIC=91601.98
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB
##
## Df Sum of Sq RSS AIC
## + BATHRM 1 2576776 53774399 91104
## + STRUCT 1 2237156 54114019 91171
## + EYB 1 2139600 54211574 91190
## + HF_BATHRM 1 1203142 55148032 91373
## + AC 1 1083758 55267417 91396
## + FIREPLACES 1 923559 55427616 91427
## + BEDRM 1 742529 55608646 91462
## + ROOF 1 81048 56270127 91589
## <none> 56351175 91602
## - AYB 1 3896218 60247393 92314
## - QUALIFIED 1 9690195 66041370 93295
## - CNDTN 1 10671769 67022944 93453
## - GRADE_MEDIAN_PRICE 1 12127684 68478859 93683
## - GBA 1 27495744 83846919 95846
## - WARD 7 65460729 121811904 99825
##
## Step: AIC=91103.86
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM
##
## Df Sum of Sq RSS AIC
## + STRUCT 1 2554350 51220049 90586
## + HF_BATHRM 1 2148606 51625793 90670
## + EYB 1 1771359 52003039 90748
## + FIREPLACES 1 1036545 52737854 90898
## + AC 1 720785 53053614 90962
## + BEDRM 1 118109 53656289 91082
## + ROOF 1 94257 53680141 91087
## <none> 53774399 91104
## - BATHRM 1 2576776 56351175 91602
## - AYB 1 4188664 57963063 91903
## - CNDTN 1 6513401 60287800 92324
## - QUALIFIED 1 9086078 62860477 92770
## - GBA 1 10944722 64719121 93081
## - GRADE_MEDIAN_PRICE 1 12265770 66040169 93297
## - WARD 7 64864224 118638623 99545
##
## Step: AIC=90585.86
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT
##
## Df Sum of Sq RSS AIC
## + HF_BATHRM 1 1841526 49378523 90197
## + EYB 1 1190449 50029600 90337
## + FIREPLACES 1 820460 50399588 90415
## + AC 1 637281 50582768 90454
## + BEDRM 1 104029 51116020 90566
## + ROOF 1 65605 51154444 90574
## <none> 51220049 90586
## - STRUCT 1 2554350 53774399 91104
## - BATHRM 1 2893970 54114019 91171
## - AYB 1 4129337 55349385 91412
## - CNDTN 1 5945282 57165331 91757
## - QUALIFIED 1 9176424 60396473 92345
## - GRADE_MEDIAN_PRICE 1 9701864 60921913 92437
## - GBA 1 11198634 62418683 92697
## - WARD 7 60370243 111590292 98892
##
## Step: AIC=90196.62
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM
##
## Df Sum of Sq RSS AIC
## + EYB 1 717029 48661493 90042
## + FIREPLACES 1 683499 48695024 90050
## + AC 1 299398 49079124 90134
## + BEDRM 1 72069 49306454 90183
## + ROOF 1 53412 49325111 90187
## <none> 49378523 90197
## - HF_BATHRM 1 1841526 51220049 90586
## - STRUCT 1 2247270 51625793 90670
## - BATHRM 1 3778421 53156944 90982
## - CNDTN 1 4241031 53619554 91075
## - AYB 1 5040915 54419438 91233
## - QUALIFIED 1 8605793 57984316 91911
## - GRADE_MEDIAN_PRICE 1 8865587 58244110 91959
## - GBA 1 9220270 58598793 92024
## - WARD 7 58144830 107523353 98498
##
## Step: AIC=90042.33
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB
##
## Df Sum of Sq RSS AIC
## + FIREPLACES 1 786479 47875014 89870
## + AC 1 170258 48491235 90007
## + ROOF 1 127923 48533571 90016
## + BEDRM 1 90893 48570600 90024
## <none> 48661493 90042
## - EYB 1 717029 49378523 90197
## - HF_BATHRM 1 1368107 50029600 90337
## - STRUCT 1 1842007 50503501 90437
## - CNDTN 1 2149258 50810751 90502
## - BATHRM 1 3282557 51944050 90738
## - AYB 1 4403962 53065456 90966
## - GRADE_MEDIAN_PRICE 1 6568573 55230066 91393
## - QUALIFIED 1 8130725 56792219 91691
## - GBA 1 9249969 57911463 91900
## - WARD 7 57338791 106000284 98347
##
## Step: AIC=89870.22
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES
##
## Df Sum of Sq RSS AIC
## + BEDRM 1 123606 47751408 89845
## + AC 1 100503 47774511 89850
## + ROOF 1 62403 47812611 89858
## <none> 47875014 89870
## - FIREPLACES 1 786479 48661493 90042
## - EYB 1 820009 48695024 90050
## - HF_BATHRM 1 1217304 49092318 90137
## - STRUCT 1 1650818 49525832 90230
## - CNDTN 1 1946932 49821946 90294
## - BATHRM 1 3309849 51184863 90583
## - AYB 1 4340168 52215182 90795
## - GRADE_MEDIAN_PRICE 1 4962468 52837482 90922
## - QUALIFIED 1 8067402 55942416 91532
## - GBA 1 8213569 56088583 91560
## - WARD 7 55292893 103167907 98060
##
## Step: AIC=89844.6
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES + BEDRM
##
## Df Sum of Sq RSS AIC
## + AC 1 106444 47644964 89823
## + ROOF 1 62448 47688960 89833
## <none> 47751408 89845
## - BEDRM 1 123606 47875014 89870
## - FIREPLACES 1 819192 48570600 90024
## - EYB 1 845807 48597215 90030
## - HF_BATHRM 1 1175261 48926669 90102
## - STRUCT 1 1631824 49383232 90202
## - CNDTN 1 1948501 49699909 90270
## - BATHRM 1 2539171 50290579 90396
## - AYB 1 4225864 51977272 90749
## - GRADE_MEDIAN_PRICE 1 4926213 52677621 90892
## - GBA 1 6216626 53968034 91150
## - QUALIFIED 1 8064515 55815923 91510
## - WARD 7 55288477 103039885 98049
##
## Step: AIC=89822.76
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES + BEDRM +
## AC
##
## Df Sum of Sq RSS AIC
## + ROOF 1 55664 47589300 89812
## <none> 47644964 89823
## - AC 1 106444 47751408 89845
## - BEDRM 1 129547 47774511 89850
## - EYB 1 721979 48366943 89981
## - FIREPLACES 1 747167 48392131 89987
## - HF_BATHRM 1 1041156 48686120 90052
## - CNDTN 1 1483245 49128209 90148
## - STRUCT 1 1643256 49288220 90183
## - BATHRM 1 2338711 49983675 90333
## - AYB 1 4079026 51723990 90698
## - GRADE_MEDIAN_PRICE 1 4945920 52590884 90876
## - GBA 1 6321029 53965993 91152
## - QUALIFIED 1 7846204 55491168 91450
## - WARD 7 55385669 103030633 98050
##
## Step: AIC=89812.27
## PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE + QUALIFIED +
## AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES + BEDRM +
## AC + ROOF
##
## Df Sum of Sq RSS AIC
## <none> 47589300 89812
## - ROOF 1 55664 47644964 89823
## - AC 1 99660 47688960 89833
## - BEDRM 1 129399 47718698 89839
## - FIREPLACES 1 689541 48278841 89964
## - EYB 1 765049 48354348 89981
## - HF_BATHRM 1 1025557 48614857 90038
## - CNDTN 1 1448874 49038174 90131
## - STRUCT 1 1613975 49203275 90167
## - BATHRM 1 2334329 49923629 90322
## - AYB 1 4131116 51720416 90700
## - GRADE_MEDIAN_PRICE 1 4661249 52250548 90809
## - GBA 1 6277879 53867179 91134
## - QUALIFIED 1 7860573 55449872 91444
## - WARD 7 55284755 102874055 98035
##
## Call:
## lm(formula = PRICE_TRANS ~ WARD + GBA + CNDTN + GRADE_MEDIAN_PRICE +
## QUALIFIED + AYB + BATHRM + STRUCT + HF_BATHRM + EYB + FIREPLACES +
## BEDRM + AC + ROOF, data = house.price)
##
## Coefficients:
## (Intercept) WARDWard 2 WARDWard 3
## 698.16080 101.28301 63.95630
## WARDWard 4 WARDWard 5 WARDWard 6
## -54.41096 -78.18838 4.76844
## WARDWard 7 WARDWard 8 GBA
## -209.41612 -235.26479 0.07161
## CNDTNGood GRADE_MEDIAN_PRICE QUALIFIEDU
## 33.86244 0.24312 -78.51845
## AYB BATHRM STRUCTExpensive
## -1.22984 24.77822 32.85632
## HF_BATHRM EYB FIREPLACES
## 19.91879 1.07114 14.95291
## BEDRM ACY ROOFExpensive
## 4.99228 9.16163 9.25816
As the result shows, the full model has the lowest AIC, meaning that keeping all the candidate features is best by this criterion. However, the coefficients for "ROOF" and for "Ward 6" (relative to the baseline ward) are negligible, so they can be removed at the cost of only a small increase in AIC. With that adjustment, the feature selection agrees with both the Adjusted R-squared and the AIC results.
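As a sanity check, the model can be refit without "ROOF" and its AIC compared with the stepwise trace. A minimal sketch, where step.model is a placeholder name for the stepAIC result above:
# a minimal sketch; step.model is a placeholder for the stepAIC result object
reduced.model <- update(step.model, . ~ . - ROOF) # refit without ROOF
extractAIC(reduced.model) # on the same scale as the AIC values printed by stepAIC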
A predictive model can serve two purposes: to predict the price of a house, and to check whether a house is valued properly. The principal steps are to split the data set, train a model on one part, and test it against the rest. Since any single training/testing split could be biased, the partitioning should be repeated across the data set (cross-validation).
# data frame for prediction: x holds the selected features, y the transformed price
price.df <- data.frame(x, y)
# hold out 30% of the rows for validation
train.indice <- createDataPartition(price.df$y, p = 0.7, list = FALSE)
training.set <- price.df[train.indice, ]
testing.set <- price.df[-train.indice, ]
# 5-fold cross-validation repeated 3 times
train.control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# fit on the training set only so the testing set remains unseen
lm.model <- train(y ~ ., data = training.set, trControl = train.control, method = "lm")
summary(lm.model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466.25 -43.10 -1.17 40.51 543.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 763.86590 109.74435 6.960 3.59e-12 ***
## BATHRM 24.81960 1.08399 22.897 < 2e-16 ***
## HF_BATHRM 20.09576 1.31399 15.294 < 2e-16 ***
## ACY 9.52283 1.93769 4.915 9.03e-07 ***
## BEDRM 4.84664 0.92349 5.248 1.57e-07 ***
## AYB -1.21242 0.04018 -30.172 < 2e-16 ***
## EYB 1.02048 0.08083 12.625 < 2e-16 ***
## QUALIFIEDU -78.50728 1.87148 -41.949 < 2e-16 ***
## GBA 0.07181 0.00191 37.604 < 2e-16 ***
## STRUCTExpensive 33.16673 1.72692 19.206 < 2e-16 ***
## CNDTNGood 34.35376 1.87601 18.312 < 2e-16 ***
## FIREPLACES 15.56695 1.19340 13.044 < 2e-16 ***
## WARDWard.2 97.12006 3.95214 24.574 < 2e-16 ***
## WARDWard.3 61.87410 3.00451 20.594 < 2e-16 ***
## WARDWard.4 -56.78342 2.21129 -25.679 < 2e-16 ***
## WARDWard.5 -81.42730 2.20527 -36.924 < 2e-16 ***
## WARDWard.7 -212.63356 2.56912 -82.765 < 2e-16 ***
## WARDWard.8 -238.45803 2.87632 -82.904 < 2e-16 ***
## GRADE_MEDIAN_PRICE 0.24693 0.00743 33.234 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.84 on 10666 degrees of freedom
## Multiple R-squared: 0.8908, Adjusted R-squared: 0.8906
## F-statistic: 4833 on 18 and 10666 DF, p-value: < 2.2e-16
The result is not very different from the initial modeling, suggesting that the prediction model performs consistently. All the variables are now significant, and the Adjusted R-squared is only marginally lower than the R-squared. Taken together, there is no remaining feature to eliminate. The last step is to measure how well the model predicts.
Root Mean Square Error (RMSE) is widely used to evaluate a regression model. It is the square root of the mean of the squared differences between the true and predicted values, so the sign of each error is ignored while large misses are penalized more. Note that the RMSE of about 65.15 below is in units of the square-root-transformed price, not in dollars.
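In symbols, for $n$ test observations with true values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$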
# predict transformed prices for the held-out testing set
price_pred <- predict(lm.model, newdata = testing.set)
# caret's RMSE() takes predictions first, then observed values
rmse <- RMSE(price_pred, testing.set$y)
cat("Root Mean Square Error (RMSE):", rmse)
## Root Mean Square Error (RMSE): 65.15407
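Because the model works on the square-root scale, a fixed error there corresponds to a dollar error that grows with the price: if $P = S^2$, then to first order $\Delta P \approx 2\sqrt{P}\,\Delta S$. A minimal sketch of this delta-method approximation at a few illustrative price points, using the RMSE above:
# a minimal sketch: first-order dollar error implied by a sqrt-scale error of ~65
prices <- c(5e5, 1e6, 2e6)
round(2 * sqrt(prices) * 65.15) # approximate dollar error at each price point
Since RMSE is inflated by a few large misses, the typical (median) dollar error computed below is considerably smaller than this approximation suggests.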
# back-transform from the square-root scale to dollars
True.Price <- (testing.set$y)^2
Predicted.Price <- price_pred^2
pred.df <- data.frame(True.Price, Predicted.Price)
# prediction error in dollars
pred.df$Diff <- pred.df$True.Price - pred.df$Predicted.Price
kable(head(pred.df, n = 20)) %>%
kable_styling() %>%
scroll_box(width = "500px", height = "500px")
| | True.Price | Predicted.Price | Diff |
|---:|---:|---:|---:|
| 44 | 1700000 | 1658674.3 | 41325.71 |
| 161 | 830000 | 886218.6 | -56218.57 |
| 162 | 635000 | 572257.5 | 62742.50 |
| 276 | 899000 | 750211.1 | 148788.89 |
| 341 | 928000 | 689197.8 | 238802.15 |
| 370 | 1250000 | 1132902.5 | 117097.55 |
| 533 | 1120000 | 1138339.3 | -18339.27 |
| 560 | 805000 | 942926.6 | -137926.60 |
| 623 | 1440000 | 1348719.1 | 91280.93 |
| 648 | 1412500 | 1434301.9 | -21801.92 |
| 842 | 932500 | 908637.5 | 23862.54 |
| 891 | 1041000 | 906915.5 | 134084.52 |
| 916 | 1450000 | 1365898.3 | 84101.65 |
| 921 | 635000 | 846595.9 | -211595.91 |
| 964 | 801600 | 765697.9 | 35902.05 |
| 978 | 1120160 | 1296661.6 | -176501.58 |
| 1001 | 1200000 | 1057110.6 | 142889.37 |
| 1057 | 915000 | 786833.2 | 128166.81 |
| 1071 | 822500 | 805455.1 | 17044.86 |
| 1075 | 875000 | 786492.7 | 88507.29 |
cat("True Price vs. Predicted Price\n",
"Mean Error: $", mean(abs(pred.df$Diff)), '\n',
"Median Error: $", median(abs(pred.df$Diff)))
## True Price vs. Predicted Price
## Mean Error: $ 80025.89
## Median Error: $ 61852.59
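To make the median error concrete, a minimal sketch using pred.df from above computes the implied dollar range around a hypothetical $1,000,000 house:
# a minimal sketch: implied prediction range for a hypothetical $1,000,000 house
med.err <- median(abs(pred.df$Diff)) # median absolute error in dollars
c(low = 1e6 - med.err, high = 1e6 + med.err)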
As shown above, the median absolute error in dollars is about $61,853, which implies that the model typically misses the price of a given house by roughly that amount. For example, a $1,000,000 house would usually be predicted between about $938,000 and $1,062,000. The following residual plot is expressed in dollar amounts. There is no apparent pattern among the residuals; the left side looks darker simply because more observations fall in that price range.
# residuals in dollars vs. predicted price; dashed lines mark +/- $250K
ggplot(pred.df, aes(Predicted.Price, Diff)) +
  geom_point(alpha = 0.25) +
  geom_hline(yintercept = 0, color = 'red') +
  geom_hline(yintercept = c(-250000, 250000), color = 'blue', linetype = "dashed") +
  ggtitle("Residual Plot")
# predicted prices for houses whose true price is near the mean and the median
mean_price <- pred.df[pred.df$True.Price > 650000 &
                        pred.df$True.Price < 660000, "Predicted.Price"]
median_price <- pred.df[pred.df$True.Price > 615000 &
                          pred.df$True.Price < 625000, "Predicted.Price"]
cat("Mean and Median Price Prediction\n",
    "Mean (650K~660K): $", mean(mean_price), '\n',
    "Median (615K~625K): $", mean(median_price))
## Mean and Median Price Prediction
## Mean (650K~660K): $ 661071.5
## Median (615K~625K): $ 610062.6
Prediction at specific values helps to get a feel for how this model works. The mean and median house prices are $660,583 and $619,900, respectively, and the predictions above for houses in those ranges average $661,072 and $610,063. The gap between the true statistics and the predictions is therefore under $10K, which is much lower than the overall error, and more meaningful because most of the observations, and likewise most new data, lie around the mean and median.
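In the same spirit, a single house can be scored directly. A minimal sketch that reuses an observed row as a template so the factor levels match the training data; new.house and the overridden field values are purely illustrative:
# a minimal sketch: predict one hypothetical house (values are illustrative)
new.house <- testing.set[1, ] # reuse a real row so factor levels are valid
new.house$BATHRM <- 2 # override a couple of fields of interest
new.house$GBA <- 1500 # gross building area in square feet
(predict(lm.model, newdata = new.house))^2 # back-transform to dollars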
# overlay histograms of true and predicted prices in $20K bins
ggplot(pred.df) +
  geom_histogram(aes(True.Price, color = "True"),
                 binwidth = 20000, fill = "white", alpha = 0.25) +
  geom_histogram(aes(Predicted.Price, color = "Predicted"),
                 binwidth = 20000, fill = "white", alpha = 0.25) +
  ggtitle("True Price vs. Predicted Price")
Finally, histograms of the true and predicted values can be drawn together to see how similar the two distributions are across the price range. During the analysis we found that the distributions had issues in the tails, and the same appears here: most of the large errors occur in the tails. However, this should not be a serious problem because the model focuses on the majority of houses rather than the extremely expensive or cheap ones; the sketch below offers one way to verify this numerically.
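A minimal sketch comparing the mean absolute error in the tails (bottom and top deciles of the true price) against the middle of the distribution:
# a minimal sketch: mean absolute error in the tails vs. the middle
cut.points <- quantile(pred.df$True.Price, c(0.1, 0.9))
in.tail <- pred.df$True.Price < cut.points[1] | pred.df$True.Price > cut.points[2]
cat("Tail error: $", mean(abs(pred.df$Diff[in.tail])), '\n',
    "Middle error: $", mean(abs(pred.df$Diff[!in.tail])))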
Beginning with about 50 available variables, we successfully narrowed the set down to the 13 features used in the final prediction model. Adjusted R-squared and AIC were the criteria for feature selection. Some of the findings are as follows.
House prices rise as the number of rooms and the overall size increase. Location factors such as ward, along with house condition, affect prices critically, while building materials matter much less. The following diagnostic plots are what we used to check the model assumptions.
After the diagnosis, DFBETAS was used to filter out influential points, and removing them improved the linear model's conformance to the assumptions. Through all these processes, we could finally build a predictive model. The data were split into a training set and a testing set so that the model was fitted only on the training set and then evaluated against the held-out testing set. By doing so, the model was tested on data it had not seen during training.
The performance of the model was measured at an R-squared of 0.89. More concretely, the overall median error was about $61,853, while for average houses priced near the mean and median the predictions were within roughly $10,000 of the true values. Considering how expensive houses in Washington D.C. are, an error of that size is quite impressive, even if there could be further improvement with more variables added.