✍Researcher Perfect🎓, [22.04.21 11:32]
Paper details

Part I

1. Clean and prepare the attached data file for use in Weka as an .arff file. Show your steps. (Make sure that the label is in the last column.)

2. Compare class determination using Random Forest and Random Tree.

3. Compare the value of a hold-out set to k-fold cross-validation for validating the model.

4. Suggest a filter to be used to improve validation results. Explain your answer.

5. For one model, find the weakest attribute. Explain how you found it.

Submit all answers and screenshots in one Word document.

Write the questions down followed by the answers.
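
One way to approach the .arff preparation step is sketched below in Python. It is only an illustration under stated assumptions: the attached data file is not shown here, so `to_arff`, its parameters, and the column handling are hypothetical. The sketch assumes the data is already loaded as rows of strings, that attribute names and nominal values contain no spaces or commas, and that the label is nominal. It moves the label column to the end and writes `?` for empty cells, which is ARFF's missing-value marker.

```python
def _is_number(text):
    """Return True if the cell parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def to_arff(relation, header, rows, label):
    """Render rows of strings as ARFF text with the label column last.

    Hypothetical helper: column names and values are assumed to contain
    no spaces or commas, so nothing is quoted.
    """
    label_idx = header.index(label)
    # Keep the other columns in their original order, then append the label.
    order = [i for i in range(len(header)) if i != label_idx] + [label_idx]
    # Trim whitespace and mark empty cells as missing ("?").
    cleaned = [[(cell.strip() or "?") for cell in row] for row in rows]

    lines = ["@relation " + relation, ""]
    for i in order:
        present = [row[i] for row in cleaned if row[i] != "?"]
        if i != label_idx and all(_is_number(v) for v in present):
            kind = "numeric"
        else:
            # Nominal attribute: list the distinct observed values.
            kind = "{" + ",".join(sorted(set(present))) + "}"
        lines.append("@attribute " + header[i] + " " + kind)

    lines += ["", "@data"]
    lines += [",".join(row[i] for i in order) for row in cleaned]
    return "\n".join(lines) + "\n"
```

The returned text can be saved with an .arff extension and opened in the Weka Explorer. Weka can also load a CSV file directly and save it as ARFF, so this sketch is mainly useful for documenting the cleaning steps.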

Part II

1. Draw a 2×2 confusion matrix. Label the boxes. Show observations and predictions. Show how to calculate sensitivity, specificity, positive predictive value and negative predictive value.

2. Draw and explain the model induction algorithm. Explain each step in detail.

3. Why is clustering considered an iterative process?

4. 80 people are tested for HIV. 28 test positive. 25 of those have the disease. 4 of those who tested negative have the disease. Fill out the confusion matrix.

5. How does Weka use a test set to validate a classification model? Include in your discussion how Weka masks the target.

6. The following data reveal the results of pre-reading testing and the Reading Scale (RS) of the results. Using simple regression, predict the test score if the Reading Scale score is 6.3. Show your work.

Test Score    RS
103           7.0
101           6.7
[more figures will be provided]
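
For the HIV screening question in Part II, the four cells of the confusion matrix follow from the counts stated in the question, and the four rates follow from the cells. The Python sketch below shows that arithmetic; the function names `confusion_from_counts` and `rates` are made up for illustration and are not part of Weka.

```python
def confusion_from_counts(total, test_pos, true_pos, false_neg):
    """Derive all four confusion-matrix cells from the counts in the question."""
    false_pos = test_pos - true_pos   # tested positive but healthy
    test_neg = total - test_pos       # everyone who tested negative
    true_neg = test_neg - false_neg   # tested negative and healthy
    return {"TP": true_pos, "FP": false_pos, "FN": false_neg, "TN": true_neg}


def rates(m):
    """Sensitivity, specificity, PPV and NPV from the four cells."""
    return {
        "sensitivity": m["TP"] / (m["TP"] + m["FN"]),  # of the diseased, share caught
        "specificity": m["TN"] / (m["TN"] + m["FP"]),  # of the healthy, share cleared
        "ppv": m["TP"] / (m["TP"] + m["FP"]),          # of positives, share truly diseased
        "npv": m["TN"] / (m["TN"] + m["FN"]),          # of negatives, share truly healthy
    }


# Counts taken directly from the question: 80 tested, 28 positive,
# 25 of the positives diseased, 4 of the negatives diseased.
hiv = confusion_from_counts(total=80, test_pos=28, true_pos=25, false_neg=4)
hiv_rates = rates(hiv)
```

With these counts the cells come out as TP = 25, FP = 3, FN = 4 and TN = 48, which sum to the 80 people tested.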

7. What is the difference between a supervised and an unsupervised data set?

8. What is overfitting in a classification model?

9. How is market basket analysis used?

10. The following dataset shows the results of a school reading test to determine learning disabilities. The school has to provide special education resources to those who need them. The younger the child, the more effective the intervention. However, many children overcome the disability on their own. The cost of a false positive is $3500,000. The cost of a false negative is $10,000. What is the optimal threshold for the school to provide the services? Use confusion matrices and show your work.

Test Score    LD
103           x
101
97
[more figures will be provided]
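
The threshold question can be worked by building one confusion matrix per candidate threshold and pricing its errors. The Python sketch below illustrates that procedure under assumptions: the full table is not shown here, so the functions take the scores and LD flags as arguments, and the names `confusion_at` and `cheapest_threshold` are hypothetical. Whether a low or a high score should trigger services is not stated in the question, so that direction is left as the `low_is_positive` parameter.

```python
def confusion_at(scores, has_ld, threshold, low_is_positive=True):
    """Count (TP, FP, FN, TN) when children on one side of the threshold get services.

    Assumption: with low_is_positive=True a child is flagged when
    score <= threshold; otherwise when score >= threshold.
    """
    tp = fp = fn = tn = 0
    for score, ld in zip(scores, has_ld):
        flagged = score <= threshold if low_is_positive else score >= threshold
        if flagged and ld:
            tp += 1
        elif flagged:
            fp += 1
        elif ld:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def cheapest_threshold(scores, has_ld, cost_fp, cost_fn, low_is_positive=True):
    """Return (threshold, total_cost) with the lowest total misclassification cost."""
    # Try every observed score, plus one value beyond each end so that
    # "flag nobody" and "flag everybody" are both considered.
    candidates = sorted(set(scores) | {min(scores) - 1, max(scores) + 1})
    best = None
    for t in candidates:
        tp, fp, fn, tn = confusion_at(scores, has_ld, t, low_is_positive)
        cost = fp * cost_fp + fn * cost_fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best
```

Once the full table is available, the two costs from the question are passed as `cost_fp` and `cost_fn`. The same per-threshold counts also give the points of an ROC curve, since each threshold yields a false positive rate FP / (FP + TN) and a true positive rate TP / (TP + FN).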

Extra Credit 1: Construct the ROC curve for this data.

Extra Credit 2: What is the algorithm used in simple linear regression? How is it used to find the line of best fit? How does the model work? What would be the next step, if you are following the model induction algorithm, after you find the line?
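
Simple linear regression is usually fitted by ordinary least squares, which picks the slope and intercept that minimize the sum of squared vertical distances between the points and the line. The Python sketch below shows the closed-form calculation; `fit_line` and `predict` are illustrative names. It uses only the two Test Score and RS rows visible in the regression question, because the remaining rows are not shown, so the fitted line is a demonstration and not the answer for the full data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Only the two rows visible in the question; the rest are elided there.
rs = [7.0, 6.7]
test_score = [103, 101]
slope, intercept = fit_line(rs, test_score)
predicted = predict(slope, intercept, 6.3)
```

With only two points the fit is simply the line through them, so the full table is needed before `predicted` means anything. In the model induction cycle, the step after fitting is typically to validate the line on data that was not used to fit it.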

Submit all answers and screenshots in one Word document.