CSE5DSS – Decision Support Systems
Individual Assignment 2, 2020
Due Date: Wednesday, May 20, 5:00pm
Assessment Weight: 15% of the final mark for the subject
Instructions
• This is an INDIVIDUAL assignment. You are not permitted to collaborate with any other student,
and you are not permitted to outsource the work to any other party.
• This assignment consists of three separate problems. You are required to solve all three problems.
• Total marks available for the assignment is 50.
Plagiarism
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the
work is your own. When submitting your assignment via the LMS, the following announcement will
appear:
Software will be used to assist in the detection of plagiarism. Students are referred to the section on
‘Academic Misconduct’ in the subject’s guideline available on LMS.
Lateness Policy
Penalties are applied to late assignments: 5% of the total possible marks for the task is deducted per day,
and late assignments are accepted up to 5 days after the due date only. An assignment submitted more than
five working days after the due date will not be accepted.
Submission Procedure
Assignments are to be submitted electronically via the Learning Management System. You should
submit a single PDF file which contains your reports for all three problems.
Problem 1: Predicting price of houses using linear regression (15 marks)
The table below provides some historical data on the selling price, area, number of
bedrooms, and age of houses that were sold in a neighbourhood. The purpose of this exercise is
to develop models for predicting selling price on the basis of one or more of the other three
variables.
Selling Price ($1000)   Area (m²)   Bedrooms   Age (years)
920                     254         4          10
1244                    377         4          10
1160                    223         3          1
1140                    232         4          3
632                     171         3          40
904                     221         3          7
472                     124         2          25
512                     155         2          30
700                     214         3          18
1152                    232         3          2
740                     208         3          30
1180                    290         4          0
1152                    230         3          3
1320                    265         3          3
492                     159         3          30
760                     215         3          19
1104                    232         3          1
Part 1
Use WEKA to create three linear regression models for predicting the selling price. [1]
• Model 1 should use only the area as input
• Model 2 should use only the number of bedrooms as input
• Model 3 should use only the age as input
For each model, record the regression equation, the training error, and the leave-one-out cross-validation error.
Use the regression equation from each model to predict the selling price of a 20-year-old house with
three bedrooms and an area of 225 m².
Which model do you believe provides the most reliable prediction? You MUST justify your answer
based on relevant data from the results that you have provided.
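As a rough cross-check of what the three single-input models involve, the sketch below fits each one by ordinary least squares with NumPy and computes a training error and a leave-one-out error. It is an illustration only, not the WEKA procedure: the choice of mean absolute error as the error measure and the helper names (`fit`, `predict`, `training_mae`, `loocv_mae`) are assumptions, and WEKA's reported figures may differ slightly from a plain least-squares fit.

```python
import numpy as np

# Data copied from the table above: price ($1000), area (m2), bedrooms, age.
DATA = np.array([
    [920, 254, 4, 10],
    [1244, 377, 4, 10],
    [1160, 223, 3, 1],
    [1140, 232, 4, 3],
    [632, 171, 3, 40],
    [904, 221, 3, 7],
    [472, 124, 2, 25],
    [512, 155, 2, 30],
    [700, 214, 3, 18],
    [1152, 232, 3, 2],
    [740, 208, 3, 30],
    [1180, 290, 4, 0],
    [1152, 230, 3, 3],
    [1320, 265, 3, 3],
    [492, 159, 3, 30],
    [760, 215, 3, 19],
    [1104, 232, 3, 1],
], dtype=float)
PRICE = DATA[:, 0]
COLUMNS = {"area": 1, "bedrooms": 2, "age": 3}


def fit(X, y):
    """Ordinary least squares; returns [intercept, slope_1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the regression equation to each row of X."""
    return coef[0] + X @ coef[1:]


def training_mae(X, y):
    """Mean absolute error on the same rows used for fitting (assumed metric)."""
    return float(np.mean(np.abs(predict(fit(X, y), X) - y)))


def loocv_mae(X, y):
    """Leave-one-out: refit without row i, then score the prediction for row i."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef = fit(X[keep], y[keep])
        errors.append(abs(predict(coef, X[i:i + 1])[0] - y[i]))
    return float(np.mean(errors))


if __name__ == "__main__":
    # The house to be priced: 225 m2, three bedrooms, 20 years old.
    query = {"area": 225.0, "bedrooms": 3.0, "age": 20.0}
    for name, col in COLUMNS.items():
        X = DATA[:, [col]]
        coef = fit(X, PRICE)
        estimate = predict(coef, np.array([[query[name]]]))[0]
        print(f"{name}: price = {coef[0]:.2f} + {coef[1]:.2f} * {name}; "
              f"train MAE = {training_mae(X, PRICE):.2f}, "
              f"LOOCV MAE = {loocv_mae(X, PRICE):.2f}, "
              f"prediction = {estimate:.2f}")
```

The leave-one-out figure is the one that speaks to reliability, since each prediction is made for a house the model did not see during fitting.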
Part 2
Now investigate the four possible models that use more than one input variable (i.e., area and
bedrooms; area and age; bedrooms and age; area, bedrooms and age). As in Part 1, record the
regression equation, the training error, and the leave-one-out cross-validation error. Which of these
models provides the most reliable prediction? How does this compare with the best model from Part 1?
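The four multi-input models are simply every subset of the three predictors with two or more members. The sketch below enumerates them and scores each with leave-one-out cross-validation, again using plain least squares in NumPy rather than WEKA. The helper names (`design`, `loocv_mae`, `predictor_subsets`) and the use of mean absolute error are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np

# Columns: price ($1000), area (m2), bedrooms, age (copied from the Problem 1 table).
DATA = np.array([
    (920, 254, 4, 10), (1244, 377, 4, 10), (1160, 223, 3, 1),
    (1140, 232, 4, 3), (632, 171, 3, 40), (904, 221, 3, 7),
    (472, 124, 2, 25), (512, 155, 2, 30), (700, 214, 3, 18),
    (1152, 232, 3, 2), (740, 208, 3, 30), (1180, 290, 4, 0),
    (1152, 230, 3, 3), (1320, 265, 3, 3), (492, 159, 3, 30),
    (760, 215, 3, 19), (1104, 232, 3, 1),
], dtype=float)
NAMES = ["area", "bedrooms", "age"]  # predictors live in columns 1, 2, 3


def design(X):
    """Prepend a column of ones so the model has an intercept."""
    return np.column_stack([np.ones(len(X)), X])


def loocv_mae(X, y):
    """Leave-one-out mean absolute error for least squares on predictors X."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, *_ = np.linalg.lstsq(design(X[keep]), y[keep], rcond=None)
        errors.append(abs((design(X[i:i + 1]) @ coef)[0] - y[i]))
    return float(np.mean(errors))


def predictor_subsets(names, min_size=2):
    """All combinations of predictor names with at least min_size members."""
    return [c for k in range(min_size, len(names) + 1)
            for c in combinations(names, k)]


if __name__ == "__main__":
    y = DATA[:, 0]
    for subset in predictor_subsets(NAMES):
        cols = [NAMES.index(n) + 1 for n in subset]
        X = DATA[:, cols]
        coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
        train = float(np.mean(np.abs(design(X) @ coef - y)))
        print(subset, np.round(coef, 3), round(train, 2), round(loocv_mae(X, y), 2))
```

With only 17 houses, adding inputs will tend to lower the training error even when it does not help on unseen houses, which is why the leave-one-out column is the more informative one to compare.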
What to submit
Submit a brief report presenting your results and justifications. Your submission will be marked
according to the completeness and correctness of your responses.
[1] Use the Linear Regression function, which you will find in the Classify tab, in the Functions section. You will need to
remove some predictor variables after you open the file. By default, WEKA Linear Regression performs automatic
attribute selection. We are interested in comparing the difference in performance between combinations of input
variables, so it is important to switch off automatic attribute selection. To do this, go to the Linear Regression
properties dialog box, and set the value of attributeSelectionMethod to ‘No attribute selection’.
Problem 2: Classifying credit risk (15 marks)
The German Credit dataset is a well-known dataset that is widely used for comparing the
performance of data-mining and machine learning techniques on classification tasks. The dataset
can be obtained from the UCI machine learning repository. The following URL will take you to the UCI
web page for this file:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
Read the documentation for this dataset, and then download the file german.csv from the LMS page.
Answer the following preliminary questions:
i. How many features, including the target variable, does the data contain?
ii. How many of the attributes are numeric?
iii. How many of the attributes are categorical (including binary)?
iv. How many examples does the data contain?
v. Which attribute represents the class variable?
vi. How many possible values can the class variable take?
vii. What does each of the values of the class variable represent?
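Several of these counts can be sanity-checked with a few lines of Python before opening WEKA. The sketch below is a minimal illustration that assumes a comma-separated file with a header row and the class variable in the last column. The function name `summarise` and the column names in the sample are made up for the example and are not taken from german.csv. Note that a column whose values all parse as numbers may still be a coded categorical attribute, so the dataset documentation remains the authority on attribute types.

```python
import csv
import io


def summarise(csv_text, class_column=None):
    """Count examples, features, numeric/categorical columns and class values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    class_column = class_column or columns[-1]  # assumption: class is last

    def is_numeric(name):
        # A column counts as numeric only if every value parses as a float.
        try:
            for row in rows:
                float(row[name])
            return True
        except ValueError:
            return False

    numeric = [c for c in columns if is_numeric(c)]
    return {
        "examples": len(rows),
        "features": len(columns),
        "numeric": numeric,
        "categorical": [c for c in columns if c not in numeric],
        "class_values": sorted({row[class_column] for row in rows}),
    }


if __name__ == "__main__":
    # Hypothetical three-row sample, used only to show the output shape.
    sample = (
        "duration,purpose,amount,class\n"
        "6,radio,1169,good\n"
        "48,radio,5951,bad\n"
        "12,education,2096,good\n"
    )
    print(summarise(sample))
```

To run it on a real file, read the file contents into a string and pass that string to `summarise`.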
Now load the file into WEKA and compare the performance of each of the following classifiers using
10-fold cross-validation:
• Logistic regression (classifier->functions->logistic)
• Decision trees (classifier->trees->J48)
• Naïve Bayes (classifier->bayes->NaïveBayes)
• Multilayer Perceptron (classifier->functions->MultilayerPerceptron)
Present the confusion matrix showing the results for each of the four classifiers, and for each case,
calculate the accuracy, precision, and recall.
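The three measures follow directly from the four cells of a two-class confusion matrix. The sketch below shows the arithmetic. The function name `confusion_metrics` and the counts in the usage example are invented for illustration, and which class is treated as "positive" is a choice you need to state in your report.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall from a 2x2 confusion matrix.

    tp: positives classified as positive    fn: positives classified as negative
    fp: negatives classified as positive    tn: negatives classified as negative
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. a classifier that never predicts positive).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    # Made-up counts, not results from any classifier on german.csv.
    accuracy, precision, recall = confusion_metrics(tp=60, fn=10, fp=15, tn=15)
    print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

WEKA also reports precision and recall per class, which can be used to check hand calculations.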
As described in the documentation for the dataset, the costs of misclassification are not equal, and it
is worse (in fact 5 times worse) to classify a customer as good when they are bad than it is to classify
a customer as bad when they are good. Using the results that you have provided above, calculate
the weighted misclassification error for each of the classifiers, and, on the basis of these calculations,
recommend which of the classifiers is the best to use on this dataset. Make sure that you show all
calculations, and provide a clear justification for your answer.
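One way to turn the 5:1 cost ratio into a single figure is to weight each type of error by its cost and divide by the number of examples. The sketch below assumes that definition. The assignment does not prescribe a formula, so the normalisation, the function name `weighted_error`, and the counts in the example are all assumptions.

```python
def weighted_error(good_as_bad, bad_as_good, total,
                   cost_good_as_bad=1.0, cost_bad_as_good=5.0):
    """Cost-weighted misclassification error per example.

    good_as_bad: good customers classified as bad (cost 1 each)
    bad_as_good: bad customers classified as good (cost 5 each)
    """
    cost = cost_good_as_bad * good_as_bad + cost_bad_as_good * bad_as_good
    return cost / total


if __name__ == "__main__":
    # Two hypothetical classifiers evaluated on 1000 examples each.
    a = weighted_error(good_as_bad=40, bad_as_good=60, total=1000)   # 100 raw errors
    b = weighted_error(good_as_bad=100, bad_as_good=30, total=1000)  # 130 raw errors
    print(f"A: {a:.2f}  B: {b:.2f}")
```

In this made-up example, classifier B makes more errors overall but has the lower weighted error, because fewer of its mistakes are the expensive kind. The same reversal can happen with real results, so accuracy alone is not enough to rank the classifiers.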
What to submit
Submit a brief report presenting your answers to preliminary questions, your results and
justifications. Your submission will be marked according to the completeness and correctness of
your responses.
Problem 3: Mining a Bank Marketing Dataset (20 marks)
The paper “Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM
Methodology”, by Moro, Laureano and Cortez (2011) describes the application of data mining to the
direct marketing campaigns of a Portuguese banking institution. The paper can be accessed from here:
https://www.researchgate.net/publication/236231158_Using_Data_Mining_for_Bank_Direct_Marketing_An_Application_of_the_CRISP-DM_Methodology
The original datasets used in the paper can be found in the file bank.zip, which you will find at the
UCI machine learning repository located at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing.
The file bank.zip contains a number of files:
• bank-names.txt, which contains, amongst other information, a description of the fields
contained in the dataset;
• bank-full.csv, which is the full dataset, containing 45,212 examples, and
• bank.csv, which is the reduced dataset, containing 4,521 examples (10% of the samples in
the full dataset).
Note that the datasets from the repository are actually semicolon-separated, not comma-separated
(despite the .csv extension). For your convenience, a comma-separated version of bank.csv has been
placed on the LMS. Please use the version from the LMS. A copy of the file bank-names.txt has also
been placed on the LMS.
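For reference, converting a semicolon-separated file to a comma-separated one takes only a few lines with Python's standard `csv` module. The sketch below is a minimal illustration: the function name `semicolon_to_comma` and the two-line sample are made up for the example, and the LMS version of bank.csv should still be the one you use.

```python
import csv
import io


def semicolon_to_comma(text):
    """Re-write semicolon-separated text as comma-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    # The writer quotes a field only when it contains a comma, quote or newline.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    # Illustrative sample in the same style as the repository files.
    sample = '"age";"job";"balance";"y"\n30;"unemployed";1787;"no"\n'
    print(semicolon_to_comma(sample), end="")
```

Using the `csv` module rather than a plain text replace keeps any field that itself contains a comma intact, because the writer quotes it.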
Preliminary questions
Answer the following preliminary questions:
i. How many features, including the class variable, does the data contain?
ii. How many of the attributes are numeric?
iii. How many of the attributes are categorical (including binary)?
iv. How many examples does the data contain?
v. Which attribute represents the class variable?
vi. How many possible values can the class variable take?
vii. What does each of the values of the class variable represent?
Investigation task
Your task is to evaluate and compare the performance of the following three classifiers on this
dataset:
• Logistic regression (classifier->functions->logistic)
• Decision trees (classifier->trees->J48)
• Naïve Bayes (classifier->bayes->NaïveBayes)
Apply each of the above classifiers to the bank.csv dataset using 10-fold cross-validation, and using
the default parameter settings for each classifier.
For each classifier, you should provide the following:
• Confusion Matrix
• Accuracy
• Plot of ROC curve
• Area under ROC curve
• A chart showing Lift for the first five deciles [2]
[2] You will probably find it easiest to produce the ROC and Lift curves using Excel. To do this, you will need to
export the relevant output from WEKA. You can do this by right-clicking on one of the results in the WEKA
Result List, pointing the cursor to ‘Visualize threshold curve’ and selecting ‘yes’. A new window should appear
showing an ROC curve for the classifier. Click on ‘Save’.
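To make clear what the ROC, AUC and Lift figures measure, the sketch below computes them from a list of predicted scores and true labels. It is an illustration under stated assumptions, not a description of WEKA's output. Labels are 1 for the class treated as positive and 0 otherwise, lift is computed cumulatively over the top 10%, 20%, and so on of examples ranked by score (one common reading of "lift for the first five deciles"), and the function names are invented for the example.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, sweeping the threshold downwards."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        tp += label
        fp += 1 - label
        # Emit a point only after the last example in a run of tied scores.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def decile_lift(scores, labels, deciles=5):
    """Cumulative lift after each of the first `deciles` tenths of the ranked list."""
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda p: -p[0])]
    base_rate = sum(labels) / len(labels)
    lifts = []
    for d in range(1, deciles + 1):
        top = ranked[:max(1, round(len(ranked) * d / 10))]
        # Lift = positive rate among the top-ranked examples / overall positive rate.
        lifts.append((sum(top) / len(top)) / base_rate)
    return lifts


if __name__ == "__main__":
    # Made-up scores and labels, only to show the calculations.
    scores = [0.9, 0.8, 0.7, 0.6]
    labels = [1, 0, 1, 0]
    print(roc_points(scores, labels))
    print(auc(roc_points(scores, labels)))
    print(decile_lift(scores, labels))
```

A lift of 2 in a decile means that contacting the customers ranked in that top slice finds twice as many positives as contacting the same number of customers chosen at random.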
Report
Write a report which presents the information you have collected in an organised and coherent
manner. The results should be presented in a way that facilitates comparison between classifiers.
The report must address the following question: Based on these results, which classifier gives the best
performance on this task? In answering this question, you should not simply compare the
performance of the classifiers against each of the above measures. You should give careful
consideration to the nature of the task, and determine (providing justification for) which measure or
measures are most appropriate for this task.
What to submit
Submit a report (maximum 1,000 words, plus charts, tables, etc.) presenting your answers to the
preliminary questions, your results and justifications. Your submission will be marked according to
the completeness and correctness of your responses, and the clarity, cohesiveness and presentation
quality of your report.