Use logistic regression to build a classification model based on the Boston Housing SPSS data.
Each case in the dataset has 9 attributes:
CRIM: per capita crime rate by town
INDUS: proportion of non-retail business acres per town
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk – 0.63)^2, where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
TAX: tax band levied by the government
The goal is to predict the tax band levied by the government (TAX) from the other eight attributes, which serve as the predictors.
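For orientation, the general form of a binary logistic regression on these predictors can be written as below. This is only a sketch of the model structure: it assumes TAX takes two bands coded 1 and 2 and that band 2 is the modelled event, and the coefficients \beta_0, ..., \beta_8 are placeholders to be estimated by SPSS, not values from any output.

```latex
\ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\mathrm{CRIM} + \beta_2\,\mathrm{INDUS} + \beta_3\,\mathrm{RM}
  + \beta_4\,\mathrm{AGE} + \beta_5\,\mathrm{DIS} + \beta_6\,\mathrm{PTRATIO}
  + \beta_7\,\mathrm{B} + \beta_8\,\mathrm{LSTAT},
\qquad p = P(\mathrm{TAX} = 2 \mid \text{predictors})
```

Under this form, exp(\beta_j) is the multiplicative change in the odds of band 2 for a one-unit increase in predictor j, holding the other predictors fixed.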
Why should the data be partitioned into training, validation and test sets? What will the training and test sets be used for in this task?
1. Partition the data into training and test sets in a 70/30 proportion. Perform tasks 2 to 9 on the training set only.
2. Explore the data set by running descriptive analysis, boxplots and histograms. Based on these methods describe the data and make relevant conclusions. Do the data need cleaning? Why? How would you clean them?
3. Compute the correlation table for the predictors and search for highly correlated pairs. Such pairs are potentially redundant and can cause multicollinearity. Discuss the results (e.g. which variables could or should be removed from the data set, and why).
4. Fit a binary logistic regression to predict TAX using the Enter attribute selection method. Interpret the SPSS outputs.
5. Write out the fitted logistic regression model and interpret it.
6. Re-run the binary logistic regression model using the Stepwise attribute selection method. Explain the outputs and how the attribute selection was performed using this method.
7. What is the predicted classification of housing in a town with a per capita crime rate of CRIM = 1?
8. What is the minimum per capita crime rate by town (CRIM) at which housing would be classified as tax band 2?
9. Discuss the performance of the logistic regression model by calculating metrics such as precision, recall and F-measure.
10. Now, using the final trained model, predict the tax band on the test data set and estimate the accuracy rate. Discuss the results.
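The arithmetic behind tasks 1, 7, 8 and 9 can be sketched in a few lines of Python, which may help when checking SPSS results by hand. This is an illustrative sketch, not the SPSS procedure: the function names are hypothetical helpers, the bands are assumed to be coded 1 and 2 with band 2 as the positive class, the classification cutoff is assumed to be 0.5, and the coefficients in the demo are made-up numbers rather than values from any fitted model.

```python
import math
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict_probability(intercept, coefs, x):
    """Probability of band 2 under a logistic model with given coefficients."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(prob, cutoff=0.5):
    """Assign band 2 when the probability reaches the cutoff, else band 1."""
    return 2 if prob >= cutoff else 1


def decision_threshold(intercept, slope, cutoff=0.5):
    """Predictor value at which a one-predictor model reaches the cutoff.

    With a positive slope this is the minimum value classified as band 2;
    with a negative slope it is the maximum.
    """
    return (math.log(cutoff / (1.0 - cutoff)) - intercept) / slope


def precision_recall_f1(actual, predicted, positive=2):
    """Precision, recall and F-measure for the class labelled `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical coefficients for a CRIM-only model, NOT real SPSS output.
    b0, b1 = -2.0, 0.5
    print("Band at CRIM = 1:", classify(predict_probability(b0, [b1], [1.0])))
    print("CRIM threshold for band 2:", decision_threshold(b0, b1))
```

In practice the intercept and slopes would be read from the B column of the SPSS "Variables in the Equation" table, and the threshold calculation applies directly only when CRIM is the sole predictor or the other predictors are held at fixed values.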