CSE5DSS – Decision Support Systems

Individual Assignment 2, 2020

Due Date: Wednesday, May 20, 5:00pm

Assessment Weight: 15% of the final mark for the subject

Instructions

• This is an INDIVIDUAL assignment. You are not permitted to collaborate with any other student,

and you are not permitted to outsource the work to any other party.

• This assignment consists of three separate problems. You are required to solve all three problems.

• Total marks available for the assignment is 50.

Plagiarism

Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the

work is your own. When submitting your assignment via the LMS, the following announcement will

appear:

Software will be used to assist in the detection of plagiarism. Students are referred to the section on

‘Academic Misconduct’ in the subject’s guideline available on LMS.

Lateness Policy

Penalties are applied to late assignments (5% of total possible marks for the task is deducted per day,

accepted up to 5 days after the due date only). An assignment submitted more than five working days

after the due date will not be accepted.

Submission Procedure

Assignments are to be submitted electronically via the Learning Management System. You should

submit a single PDF file which contains your reports for all three problems.

Problem 1: Predicting price of houses using linear regression (15 marks)

The data in the table below provides some historical data on the selling price, area, number of

bedrooms, and age of houses that were sold in a neighbourhood. The purpose of this exercise is

to develop models for predicting selling price on the basis of one or more of the other three

variables.

Selling Price ($1000)   Area (m²)   Bedrooms   Age
920                     254         4          10
1244                    377         4          10
1160                    223         3          1
1140                    232         4          3
632                     171         3          40
904                     221         3          7
472                     124         2          25
512                     155         2          30
700                     214         3          18
1152                    232         3          2
740                     208         3          30
1180                    290         4          0
1152                    230         3          3
1320                    265         3          3
492                     159         3          30
760                     215         3          19
1104                    232         3          1

Part 1

Use WEKA to create three linear regression models for predicting the selling price (see footnote 1).

• Model 1 should use only the area as input

• Model 2 should use only the number of bedrooms as input

• Model 3 should use only the age as input

For each model, record the regression equation, the training error, and the leave-one-out cross-validation error.

Use the regression equation from each model to predict the selling price of a 20-year-old house with

three bedrooms and an area of 225 m².

Which model do you believe provides the most reliable prediction? You MUST justify your answer

based on relevant data from the results that you have provided.
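
To make the terms "training error" and "leave-one-out cross-validation error" concrete, the following Python sketch fits an ordinary least-squares line to each single predictor and reports both errors. It is an illustration only and is not a substitute for the WEKA output that the assignment asks for. Assumptions: the error measure is root mean squared error (WEKA also reports other measures), the helper names `fit`, `predict`, `rmse` and `training_and_loocv_rmse` are made up for this sketch, and the figures may differ slightly from WEKA's because WEKA applies a very small ridge penalty by default.

```python
import numpy as np

# Columns: price ($1000), area (m^2), bedrooms, age. Values copied from the table above.
DATA = np.array([
    [920, 254, 4, 10], [1244, 377, 4, 10], [1160, 223, 3, 1],
    [1140, 232, 4, 3], [632, 171, 3, 40], [904, 221, 3, 7],
    [472, 124, 2, 25], [512, 155, 2, 30], [700, 214, 3, 18],
    [1152, 232, 3, 2], [740, 208, 3, 30], [1180, 290, 4, 0],
    [1152, 230, 3, 3], [1320, 265, 3, 3], [492, 159, 3, 30],
    [760, 215, 3, 19], [1104, 232, 3, 1],
], dtype=float)
PRICE, AREA, BEDROOMS, AGE = 0, 1, 2, 3


def fit(X, y):
    """Ordinary least squares; returns [intercept, slope_1, ...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the regression equation to each row of X."""
    return coef[0] + X @ coef[1:]


def rmse(errors):
    return float(np.sqrt(np.mean(np.square(errors))))


def training_and_loocv_rmse(columns):
    """Return (training RMSE, leave-one-out RMSE) for the given predictor columns."""
    X, y = DATA[:, columns], DATA[:, PRICE]
    train = rmse(y - predict(fit(X, y), X))
    held_out = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # every house except house i
        coef = fit(X[keep], y[keep])           # refit without house i
        held_out.append(y[i] - predict(coef, X[i:i + 1])[0])
    return train, rmse(held_out)


# The query house: 225 m^2, three bedrooms, 20 years old.
QUERY = {AREA: 225.0, BEDROOMS: 3.0, AGE: 20.0}

for name, col in [("area", AREA), ("bedrooms", BEDROOMS), ("age", AGE)]:
    coef = fit(DATA[:, [col]], DATA[:, PRICE])
    train, loo = training_and_loocv_rmse([col])
    estimate = predict(coef, np.array([[QUERY[col]]]))[0]
    print(f"{name}: price = {coef[0]:.2f} + {coef[1]:.2f} * {name}; "
          f"train RMSE = {train:.1f}, LOOCV RMSE = {loo:.1f}, "
          f"prediction = {estimate:.1f}")
```

The leave-one-out error is never smaller than the training error for a least-squares fit, because each held-out house is predicted by a model that did not see it. That is why it is the more trustworthy of the two figures when judging reliability.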

Part 2

Now investigate the four possible models that use more than one input variable (i.e., area and

bedrooms; area and age; bedrooms and age; area, bedrooms and age). As in Part 1, record the

regression equation, the training error, and the leave-one-out cross-validation error. Which of these

models provides the most reliable prediction? How does this compare with the best model from Part 1?
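
As a quick check that the four combinations listed in Part 2 are exhaustive, the short Python sketch below enumerates every subset of the three predictors that contains at least two variables. The function name `multi_input_subsets` is hypothetical and exists only for this illustration.

```python
from itertools import combinations

PREDICTORS = ["area", "bedrooms", "age"]


def multi_input_subsets(names):
    """All subsets of the predictors that contain at least two variables."""
    return [subset
            for size in range(2, len(names) + 1)
            for subset in combinations(names, size)]


for subset in multi_input_subsets(PREDICTORS):
    print(" + ".join(subset))
```

When comparing these models with those from Part 1, keep in mind that adding an input to a least-squares model can never increase the training error, so a lower training error alone does not show that a larger model is more reliable. The leave-one-out error is the fairer basis for comparison.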

What to submit

Submit a brief report presenting your results and justifications. Your submission will be marked

according to the completeness and correctness of your responses.

1 Use the Linear Regression function, which you will find in the Classify tab, in the Functions section. You will need to

remove some predictor variables after you open the file. By default, WEKA Linear Regression performs automatic

attribute selection. We are interested in comparing the difference in performance between combinations of input

variables, so it is important to switch off automatic attribute selection. To do this, go to the Linear Regression

properties dialog box, and set the value of attributeSelectionMethod to ‘No attribute selection’.

Problem 2: Classifying credit risk (15 marks)

The German Credit dataset is a well-known dataset that is widely used for comparing the

performance of data-mining and machine learning techniques on classification tasks. The dataset

can be obtained from the UCI machine learning repository. The following URL will take you to the UCI

web page for this file:

https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Read the documentation for this dataset, and then download the file german.csv from the LMS page.

Answer the following preliminary questions:

i. How many features, including the target variable, does the data contain?

ii. How many of the attributes are numeric?

iii. How many of the attributes are categorical (including binary)?

iv. How many examples does the data contain?

v. Which attribute represents the class variable?

vi. How many possible values can the class variable take?

vii. What does each of the values of the class variable represent?
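
Several of the preliminary questions can be cross-checked with a few lines of Python before the file is opened in WEKA. The sketch below is one possible approach using only the standard library. The function name `summarise` and the three-row sample are invented for illustration and are not taken from german.csv. The sketch treats a column as numeric when every value parses as a number, which is only a heuristic: a categorical attribute stored as integer codes would be miscounted, so the dataset documentation remains the authority.

```python
import csv
import io


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarise(csv_text):
    """Count attributes and examples, and split columns into numeric and categorical."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    numeric, categorical = [], []
    for j, name in enumerate(header):
        column = [record[j] for record in records]
        (numeric if all(_is_number(v) for v in column) else categorical).append(name)
    return {
        "attributes": len(header),
        "examples": len(records),
        "numeric": numeric,
        "categorical": categorical,
    }


# Hypothetical sample with made-up column names and values.
SAMPLE = (
    "duration,purpose,amount,class\n"
    "6,radio,1169,good\n"
    "48,car,5951,bad\n"
    "12,education,2096,good\n"
)
print(summarise(SAMPLE))
```

For a real file, the text could be read with `open(path, newline="").read()` and passed to the same function.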

Now load the file into WEKA and compare the performance of each of the following classifiers using

10-fold cross-validation:

• Logistic regression (classifier->functions->logistic)

• Decision trees (classifier->trees->J48)

• Naïve Bayes (classifier->bayes->NaïveBayes)

• Multilayer Perceptron (classifier->functions->MultilayerPerceptron)

Present the confusion matrix showing the results for each of the four classifiers, and for each case,

calculate the accuracy, precision, and recall.
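
The three measures follow directly from the four cells of a two-class confusion matrix. The Python sketch below shows the arithmetic. The function name `metrics` and the counts are hypothetical and are not results from any classifier. Precision and recall depend on which class is treated as "positive", so the report should state that choice explicitly.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix cells.

    tp: positives predicted positive    fn: positives predicted negative
    fp: negatives predicted positive    tn: negatives predicted negative
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical counts, chosen only to illustrate the calculation.
accuracy, precision, recall = metrics(tp=600, fp=150, fn=100, tn=150)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these made-up counts the accuracy is (600 + 150) / 1000 = 0.75, the precision is 600 / 750 = 0.8, and the recall is 600 / 700, roughly 0.857.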

As described in the documentation for the dataset, the costs of misclassification are not equal, and it

is worse (in fact 5 times worse) to classify a customer as good when they are bad than it is to classify

a customer as bad when they are good. Using the results that you have provided above, calculate

the weighted misclassification error for each of the classifiers, and, on the basis of these calculations,

recommend which of the classifiers is the best to use on this dataset. Make sure that you show all

calculations, and provide a clear justification for your answer.
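
One way to express the weighted misclassification error is to multiply each type of error by its cost and add the results. The Python sketch below applies the 5:1 ratio described above. The function name and the error counts are hypothetical, and dividing by the number of examples to obtain a per-example figure is an optional convention assumed here, not something the assignment specifies.

```python
COST_BAD_AS_GOOD = 5   # a bad customer classified as good
COST_GOOD_AS_BAD = 1   # a good customer classified as bad


def weighted_misclassification_cost(good_as_bad, bad_as_good):
    """Total cost of the errors in a confusion matrix under the 5:1 cost ratio."""
    return COST_GOOD_AS_BAD * good_as_bad + COST_BAD_AS_GOOD * bad_as_good


# Hypothetical error counts out of 1000 examples, for illustration only.
total_cost = weighted_misclassification_cost(good_as_bad=100, bad_as_good=150)
print(total_cost, total_cost / 1000)
```

With these made-up counts the total is 1 × 100 + 5 × 150 = 850, or 0.85 per example. A classifier with higher accuracy can still have a higher weighted cost if more of its errors are of the expensive kind.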

What to submit

Submit a brief report presenting your answers to preliminary questions, your results and

justifications. Your submission will be marked according to the completeness and correctness of

your responses.

Problem 3: Mining a Bank Marketing Dataset (20 marks)

The paper “Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM

Methodology”, by Moro, Laureano and Cortez (2011) describes the application of data mining to the

direct marketing campaigns of a Portuguese banking institution. The paper can be accessed from here:

https://www.researchgate.net/publication/236231158_Using_Data_Mining_for_Bank_Direct_Marketing_An_Application_of_the_CRISP-DM_Methodology

The original datasets used in the paper can be found in the file bank.zip, which you will find at the

UCI machine learning repository located at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

The file bank.zip contains a number of files:

• bank-names.txt, which contains, amongst other information, a description of the fields

contained in the dataset;

• bank-full.csv, which is the full dataset, containing 45,212 examples, and

• bank.csv, which is the reduced dataset, containing 4,521 examples (10% of the samples in

the full dataset).

Note that the datasets from the repository are actually semicolon-separated, not comma-separated

(despite the .csv extension). For your convenience, a comma-separated version of bank.csv has been

placed on the LMS. Please use the version from the LMS. A copy of the file bank-names.txt has also

been placed on the LMS.
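
For reference, the kind of conversion described above can be sketched in Python with the standard csv module. This is only an illustration of what the difference between the two formats amounts to, and the LMS version of the file should still be used. The function name `semicolon_to_comma` and the two-line sample are made up for this sketch.

```python
import csv
import io


def semicolon_to_comma(text):
    """Re-write semicolon-separated text as comma-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)   # fields are re-quoted only where necessary
    return out.getvalue()


# Hypothetical two-line sample in the semicolon-separated style.
SAMPLE = 'age;"job";"balance"\n30;"unemployed";1787\n'
print(semicolon_to_comma(SAMPLE))
```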

Preliminary questions

Answer the following preliminary questions:

i. How many features, including the class variable, does the data contain?

ii. How many of the attributes are numeric?

iii. How many of the attributes are categorical (including binary)?

iv. How many examples does the data contain?

v. Which attribute represents the class variable?

vi. How many possible values can the class variable take?

vii. What does each of the values of the class variable represent?

Investigation task

Your task is to evaluate and compare the performance of the following three classifiers on this

dataset:

• Logistic regression (classifier->functions->logistic)

• Decision trees (classifier->trees->J48)

• Naïve Bayes (classifier->bayes->NaïveBayes)

Apply each of the above classifiers to the bank.csv dataset using 10-fold cross-validation, and using

the default parameter settings for each classifier.

For each classifier, you should provide the following:

• Confusion Matrix

• Accuracy

• Plot of ROC curve

• Area under ROC curve

• A chart showing Lift for the first five deciles (see footnote 2)

2 You will probably find it easiest to produce the ROC and Lift curves using Excel. To do this, you will need to

export the relevant output from WEKA. You can do this by right-clicking on one of the results in the WEKA

Result List, pointing the cursor to ‘Visualize threshold curve’ and selecting ‘yes’. A new window should appear

showing an ROC curve for the classifier. Click on the ‘Save’ button.
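
The quantities behind the ROC and Lift charts can also be computed by hand, which helps when checking a spreadsheet. The Python sketch below is a minimal illustration under several assumptions: each example has a predicted probability for the positive class, labels are coded 1 for positive and 0 for negative, "Lift" means cumulative lift (the positive rate among the top-scored examples divided by the overall positive rate), and there are at least ten examples. The function names and the ten-example sample are hypothetical.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, one per distinct threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Tied scores share a threshold, so emit a point only after the last of them.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def cumulative_lift(scores, labels, deciles=5):
    """Cumulative lift for the top 10%, 20%, ... of examples ranked by score."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: -pair[0])]
    base_rate = sum(ranked) / len(ranked)
    lifts = []
    for d in range(1, deciles + 1):
        top = ranked[: len(ranked) * d // 10]
        lifts.append((sum(top) / len(top)) / base_rate)
    return lifts


# Hypothetical scores and labels, for illustration only.
SCORES = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
LABELS = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(auc(roc_points(SCORES, LABELS)))
print(cumulative_lift(SCORES, LABELS))
```

A lift above 1 in the early deciles means that contacting the highest-scored customers first reaches positives faster than contacting customers at random.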

Report

Write a report which presents the information you have collected in an organised and coherent

manner. The results should be presented in a way that facilitates comparison between classifiers.

The report must address the following question: Based on these results, which classifier gives best

performance on this task? In answering this question, you should not simply compare the

performance of the classifiers against each of the above measures. You should pay careful

consideration to the nature of the task, and determine (providing justification for) which measure or

measures are most appropriate for this task.

What to submit

Submit a report (maximum 1,000 words, plus charts, tables, etc.) presenting your answers to the

preliminary questions, your results and justifications. Your submission will be marked according to

the completeness and correctness of your responses, and the clarity, cohesiveness and presentation

quality of your report.