Data Analytics Question

Overview

This assessment task requires you to submit your answers to the questions provided in a Word document. You need to transfer the results from the Excel file into the Word document.

In addition, please submit your detailed Excel file with a tab explaining everything you did (state each formula, what it is for, and so on). This is used to cross-check your process.

If you think any question contains an issue or is ambiguous, please state your assumptions (if any) and explain them clearly in your report.

Report Length: fewer than 2,500 words

Please review the attached detailed assignment guide before starting the assignment!

Assessment Details

Please download the data file named "questions 1, 2, 3 data" for Questions 1, 2 and 3.

The dataset is adapted from:

Question 1 (20 marks)

The provided dataset contains real-world data quality issues that must be addressed before any analysis can be performed.

Create a table with four columns: Variable, Missing Values, Outliers, and Errors. In the Variable column, list the names of all columns in the dataset. For each variable, record the corresponding count of missing values, outliers, and errors found.

Note that errors may include invalid entries such as impossible numeric values (e.g. a negative price or zero number of doors), nonsensical codes, or data type mismatches.

Once the audit table is complete, clean the dataset by addressing all identified issues. Report the total number of rows remaining after cleaning and briefly explain the strategy you applied for each type of issue (removal, imputation, correction, etc.).

Variable   | Missing Values | Outliers | Errors
Variable 1 |                |          |
Variable 2 |                |          |
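For illustration only, the audit of one numeric column can be sketched in Python as below. This is a minimal sketch under stated assumptions: the column values are hypothetical (not taken from the provided dataset), the validity rule "price must be positive" is one example of an error check, and the 1.5 x IQR fence is an assumed outlier rule that the task itself does not prescribe. The helper name `audit_column` is made up for this example.

```python
from statistics import quantiles


def audit_column(values, is_valid):
    """Count missing values, errors and outliers in one numeric column."""
    # Missing: cells with no value at all (represented here by None).
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Errors: values that are present but fail the validity rule.
    valid = [v for v in present if is_valid(v)]
    errors = len(present) - len(valid)

    # Outliers: valid values outside the 1.5 x IQR fences (assumed rule).
    q1, _, q3 = quantiles(valid, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = sum(1 for v in valid if v < low or v > high)

    return {"missing": missing, "errors": errors, "outliers": outliers}


# Hypothetical "Price" column; None marks a blank cell.
price = [12000, 15000, None, 14000, -500, 13500, 16000, 95000, 14500, None]
price_audit = audit_column(price, is_valid=lambda v: v > 0)
print(price_audit)
```

In this sketch, erroneous values are excluded before the outlier fences are computed, so an impossible entry such as a negative price is counted once as an error rather than also as an outlier. Whichever convention you adopt, state it in your report.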

Question 2 (20 marks)

Develop six visualisations that best illustrate which variables have the strongest relationship with Price.

Only figures are accepted; tables will receive no marks

To identify the strongest relationships, you are advised to generate a larger pool of visuals (around 20 to 30) and select the six most effective. Only include the final six visualisations in your report.

Include a variety of visualisation types

Ensure your six selected visuals cover both categorical and numerical variables.
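One possible way to shortlist numerical variables before plotting is to rank them by the absolute value of their Pearson correlation with Price. The sketch below is illustrative only: the column names `Mileage` and `Doors` and all values are hypothetical, and correlation screening is an assumed approach rather than a required step.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical cleaned data (not from the provided dataset).
price = [30000, 25000, 20000, 15000, 10000]
candidates = {
    "Mileage": [10, 30, 50, 70, 90],
    "Doors": [4, 2, 4, 2, 4],
}

# Rank candidate predictors by strength of linear association with Price.
ranked = sorted(
    ((name, pearson_r(column, price)) for name, column in candidates.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:.2f}")
```

Correlation only captures linear relationships between numerical variables. Categorical variables need a different comparison, for example the spread of Price across categories, and the final evidence in the report must still be presented as figures.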

Question 3 (30 marks)

Develop a regression model predicting Price as the outcome variable.

Refine the model using different strategies (e.g. including/excluding variables) to improve accuracy

Present the final regression equation

Interpret the equation: explain how specific variables influence Price

Assess the model’s accuracy, including limitations and potential areas for improvement

Note: models built on real-world data may have low accuracy.

Note: Excel’s regression tool has a limit on the number of variables (16 variables) it can handle simultaneously. You are therefore expected to select only the most relevant predictors before building your model.
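As a rough illustration of what a regression equation and its accuracy measure look like, the sketch below fits a one-predictor least-squares line and reports R-squared. It is deliberately simplified: the assignment expects a multiple regression built in Excel, and the `mileage` and `price` values here are hypothetical, not taken from the provided dataset.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R-squared: share of the variation in y explained by the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - sse / sst


# Hypothetical data: mileage in thousands of km, price in thousands of dollars.
mileage = [10, 20, 30, 40, 50]
price = [29, 27, 22, 21, 16]

slope, intercept, r_squared = fit_simple_regression(mileage, price)
print(f"Price = {intercept:.2f} + ({slope:.2f}) * Mileage, R^2 = {r_squared:.3f}")
```

With this made-up data the slope is -0.32, which would be read as: each additional 1,000 km is associated with a price about 0.32 thousand dollars lower, holding nothing else constant because no other predictors are in the model. In a multiple regression, each coefficient is instead interpreted with the other predictors held fixed.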

Question 4 (10 marks)

The following screenshot (see the second tab of the data file, or the attachment named "question 4 screenshot") shows the logistic regression output for the attached dataset "question 4 data". The response variable, called card, is a binary variable that is counted as a success (yes or 1) if the customer's credit card application is accepted.

Write the logistic regression equation based on the output and interpret the results.

Calculate the probability of class 1 for the output variable, then classify each observation using cut-off values of 30% and 70%.

Calculate overall error, sensitivity, and specificity for both cut-off values. Explain the steps of calculations.

Explain which cut-off value is a better measure in this business context.
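The calculation steps for this question can be sketched as follows. This is a hedged illustration, not the expected answer: the coefficients `b0` and `b1`, the predicted probabilities, and the actual outcomes are all hypothetical and are not taken from the screenshot or the question 4 data. The rule "classify as 1 when the probability is at or above the cut-off" is also an assumption.

```python
from math import exp


def logistic(z):
    """Convert a linear predictor (log-odds) into a probability of class 1."""
    return 1.0 / (1.0 + exp(-z))


def classification_metrics(probs, actual, cutoff):
    """Overall error, sensitivity and specificity at a given cut-off."""
    predicted = [1 if p >= cutoff else 0 for p in probs]
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # true negatives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false positives
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # false negatives
    return {
        "error": (fp + fn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical coefficients: log-odds = b0 + b1 * x (not from the screenshot).
b0, b1 = -1.0, 0.5
example_p = logistic(b0 + b1 * 2.0)

# Hypothetical predicted probabilities and actual outcomes (1 = accepted).
probs = [0.9, 0.8, 0.65, 0.4, 0.35, 0.2, 0.1, 0.75]
actual = [1, 1, 1, 1, 0, 0, 0, 0]

metrics_30 = classification_metrics(probs, actual, cutoff=0.30)
metrics_70 = classification_metrics(probs, actual, cutoff=0.70)
print(example_p, metrics_30, metrics_70)
```

In this made-up example, lowering the cut-off raises sensitivity and lowers specificity. Which trade-off is preferable depends on the relative cost of wrongly accepting versus wrongly rejecting an applicant.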

Question 5 (20 marks)

A logistics company, SwiftMove, is planning a promotional campaign by placing advertisements across three online freight and logistics directories. The marketing manager has labelled these platforms as X, Y, and Z.

Estimated reach, cost per advertisement, and maximum allowed placements for each platform are provided in the table below.

Limitations                                               | Platform X | Platform Y | Platform Z
Estimated audience reach per advertisement                | 110,000    | 42,000     | 75,000
Cost per advertisement for the first 5 advertisements ($) | 2,800      | 600        | 950
Cost per advertisement for more than 5 advertisements ($) | 2,400      | 600        | 950
Maximum acceptable number of advertisements               | 18         | 25         | 30

To maintain a balanced campaign:

Platform Y advertisements must not exceed 65% of the total number of advertisements placed

Platform X must account for at least 25% of the total advertisements placed

The total advertising budget is capped at $54,000

Platform X offers a tiered discount based on order volume; Platforms Y and Z have fixed rates. (For example, if 8 advertisements are ordered on Platform X, the first 5 are priced at $2,800 each and the remaining 3 at $2,400 each, so the total cost would be 5 × $2,800 + 3 × $2,400 = $21,200.)

Part a) How many advertisements should be placed on each platform to maximise total audience reach? Clearly define the decision variables, objective function, and constraints. Build the model in Excel and save the worksheet (named Question 5 Part a) in your submission file.

Part b) Rewrite the model to determine what price per advertisement Platform X must charge so that exactly 9 advertisements are allocated to Platform X, assuming the company still aims to maximise total audience reach. Assume the rates for Platforms Y and Z remain unchanged.
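To make the structure of the Part a model concrete, the sketch below encodes the decision variables (x, y, z = number of advertisements on Platforms X, Y and Z), the objective, and the constraints, then enumerates every integer combination. It is an illustration under stated assumptions: the advertisement counts are treated as whole numbers, exhaustive search stands in for Excel Solver, and the function names are made up for this example. The submission itself must still be built in Excel.

```python
REACH = {"X": 110_000, "Y": 42_000, "Z": 75_000}
MAX_ADS = {"X": 18, "Y": 25, "Z": 30}
BUDGET = 54_000


def cost_x(n):
    """Tiered cost on Platform X: first 5 ads at $2,800, the rest at $2,400."""
    return 2_800 * min(n, 5) + 2_400 * max(n - 5, 0)


def total_cost(x, y, z):
    return cost_x(x) + 600 * y + 950 * z


def total_reach(x, y, z):
    return REACH["X"] * x + REACH["Y"] * y + REACH["Z"] * z


def is_feasible(x, y, z):
    total = x + y + z
    return (
        total_cost(x, y, z) <= BUDGET
        and 100 * y <= 65 * total   # Y at most 65% of all ads (integer math)
        and 100 * x >= 25 * total   # X at least 25% of all ads
    )


# Enumerate every integer plan within the per-platform maximums.
best = max(
    (
        (x, y, z)
        for x in range(MAX_ADS["X"] + 1)
        for y in range(MAX_ADS["Y"] + 1)
        for z in range(MAX_ADS["Z"] + 1)
        if is_feasible(x, y, z)
    ),
    key=lambda plan: total_reach(*plan),
)
print(best, total_reach(*best), total_cost(*best))
```

The percentage constraints are written with integer arithmetic to avoid floating-point rounding at the boundary. The tiered cost makes the budget constraint piecewise linear, which is worth keeping in mind when choosing how to express it in Solver.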

Rubric:

Criteria | Ratings | Pts
Q1 | Excellent detection and management of errors, missing values and outliers. | 20 pts
Q2 | Choice of visualisation technique is appropriate. Details of visual (axis title, data-to-ink ratio) are presented properly. Interpretations are appropriate. Reasonable arguments are presented. | 20 pts
Q3 | Correct presentation of the regression table, regression equation, and interpretations. | 30 pts
Q4 | Correct presentation of the confusion matrix and accuracy measures, and interpretations. | 10 pts
Q5 | The optimisation model is correct. The model is solved correctly. Excellent interpretations are presented. | 20 pts
