Quantitative Data Analysis
Module Code: DS7006
Deadline: 11th January 2016
Individual Data Analysis Project
Submit: 11th January 2016
You will be provided with a data set as an SQLite database for you to analyse using R. The data set should be assessed for reliability and explored, and hypotheses should be raised and tested. The report should provide the reader with a clear understanding of:
• which variables you chose to use,
• which techniques you used and in which order,
• why you chose to apply each of these techniques,
• what outcome resulted at each stage of the analysis and what it means.
State clearly the hypotheses that you develop through the data exploration. Use tables and visualisations as appropriate to present your analysis. Show the key elements of your SQL and R scripts, organised in an appendix.
Where texts and other background reading are cited, a list of references should be provided using Harvard style. Your report should tell the ‘story’ of your analysis project rather than just being a ‘catalogue’ of things done to data. Both assignments must be submitted through Moodle before midnight on the due date.
PLEASE NOTE: you must restrict your choice of data to the following regions: London, South East and East of England.
The SQLite data set is attached separately in *.sql format.
DS7006 – Quantitative Data Analysis
What makes a good individual data analysis project?
The report should be no more than 4,000 words. Although it should contain graphs and tables (even maps), these should be carefully constructed and included for the express purpose of illustrating what is discussed in the text. So, for example, giving Q-Q plots for 20 variables is unnecessary; just a few to illustrate the typical characteristics found would do. Marks will be deducted for gratuitous use of graphics and tables to bulk up the assignment to make it look BIG!
Set the scene: Begin by briefly introducing the topic you are analysing. Are there any relevant theories about the phenomenon you are looking at? Is there a key literature? What are the objectives of your analysis, and what is the story you want to tell through data analysis? “In this project, I am going to show through data analysis that…”.
Data acquisition: You have been supplied with an SQLite database containing a range of variables for districts in England. Some data are from the 2001 census; others cover 2003-2005. An Excel spreadsheet indicates what each field is. Map SHP files are also provided. All the tables can be joined using the same primary key. You should choose a range of variables that might be analytically interesting, join them together through one or more SQL queries, and export the result as a CSV file ready to use in R. Decide which are your dependent and independent variables. For count data you will need to normalise by an appropriate base population (e.g. per thousand population; per thousand pensioners).
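As a rough illustration of normalising count data, the R sketch below converts counts to rates per thousand of a base population. The column names and figures are invented for the example; they are not fields from the supplied database.

```r
# Hypothetical extract: district names, columns and values are invented.
districts <- data.frame(
  district    = c("A", "B", "C"),
  population  = c(150000, 80000, 220000),
  pensioners  = c(30000, 20000, 33000),
  burglaries  = c(1200, 400, 2750),
  care_claims = c(900, 700, 660)
)

# Rate per thousand of whichever base population suits the count.
per_thousand <- function(count, base) 1000 * count / base

# Burglaries are related to the whole population,
# care claims to the pensioner population only.
districts$burglaries_per_1000_pop <-
  per_thousand(districts$burglaries, districts$population)
districts$claims_per_1000_pensioners <-
  per_thousand(districts$care_claims, districts$pensioners)

print(districts)
```

The choice of base population matters: a count that only applies to pensioners would be understated if divided by the total population.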
Data exploration: This includes univariate, bivariate and multivariate exploration. The aim is to understand and check the veracity of individual variables and to search for possible relationships between the dependent and independent variables. As a result of this, were any corrections made to the data? Using neat tables, boxplots, scatter graphs etc., what did you find out about central tendency, spread, outliers, missing values, correlations etc.? Does this raise any new hypotheses you could test?
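One possible shape for a first exploration pass in base R is sketched below. The data are simulated and the variable names are placeholders, so treat it as an outline of the checks rather than a recipe for the supplied data set.

```r
# Simulated stand-in data; variable names are placeholders.
set.seed(1)
n <- 50
unemployment_rate <- rnorm(n, mean = 5, sd = 1.5)
crime_rate <- 20 + 3 * unemployment_rate + rnorm(n, mean = 0, sd = 4)
d <- data.frame(unemployment_rate, crime_rate)

d$crime_rate[3] <- NA            # plant one missing value
d$unemployment_rate[10] <- 25    # plant one implausible value

# Univariate: central tendency, spread and missing values.
print(summary(d))
missing <- colSums(is.na(d))

# Values beyond the boxplot whiskers, as a first screen for outliers.
outliers <- boxplot.stats(d$unemployment_rate)$out

# Bivariate: correlation using only complete cases.
r <- cor(d$unemployment_rate, d$crime_rate, use = "complete.obs")

# Plots are only drawn in an interactive session.
if (interactive()) {
  boxplot(d$unemployment_rate, main = "Unemployment rate")
  plot(d$unemployment_rate, d$crime_rate)
}

print(missing)
print(outliers)
print(r)
```

A flagged value is only a candidate for correction; whether to fix, keep or drop it is a judgement to justify in the report.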
(Factor analysis): Are you dealing with many variables? If so, should you carry out a factor analysis on the independent variables to find the key dimensions within your data? What did you find?
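A minimal sketch of a factor analysis using factanal() from R's stats package is given below. The six variables are simulated from two underlying dimensions purely for illustration, and the choice of two factors with a varimax rotation is an assumption of the example, not a recommendation for your data.

```r
# Simulated data: six variables driven by two hidden dimensions.
set.seed(42)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
vars <- data.frame(
  v1 = 0.9 * f1 + rnorm(n, sd = 0.4),
  v2 = 0.8 * f1 + rnorm(n, sd = 0.5),
  v3 = 0.7 * f1 + rnorm(n, sd = 0.6),
  v4 = 0.9 * f2 + rnorm(n, sd = 0.4),
  v5 = 0.8 * f2 + rnorm(n, sd = 0.5),
  v6 = 0.7 * f2 + rnorm(n, sd = 0.6)
)

# Maximum-likelihood factor analysis with two factors.
fa <- factanal(vars, factors = 2, rotation = "varimax")

# Hide small loadings so the pattern is easier to read.
print(fa$loadings, cutoff = 0.3)

# Uniqueness: share of each variable's variance not explained by the factors.
print(round(fa$uniquenesses, 2))
```

The interpretation lies in which variables load together on each factor and what that grouping might represent.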
(Classification): Are you dealing with many cases that, if hierarchically classified, might show something interesting and new? How many groups did you choose and what distinguishes each group from the others?
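A hierarchical classification could be sketched in base R roughly as follows. The cases are simulated, and both Ward's method and the choice of three groups are assumptions made for the example.

```r
# Simulated cases scattered around three centres; names are placeholders.
set.seed(3)
centres <- rbind(c(2, 2), c(8, 3), c(5, 9))
cases <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(20, centres[i, 1], 0.5), rnorm(20, centres[i, 2], 0.5))
}))
colnames(cases) <- c("var_a", "var_b")

# Standardise so that no variable dominates the distance measure.
scaled <- scale(cases)

# Agglomerative clustering on Euclidean distances, Ward's criterion.
tree <- hclust(dist(scaled), method = "ward.D2")

# Cut the tree into a chosen number of groups.
groups <- cutree(tree, k = 3)

# Group sizes and group means on the original scale.
group_means <- aggregate(as.data.frame(cases),
                         by = list(group = groups), FUN = mean)
print(table(groups))
print(group_means)

# The dendrogram helps to justify the number of groups.
if (interactive()) plot(tree, labels = FALSE)
```

Comparing the group means is one simple way to describe what distinguishes each group from the others.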
Hypothesis testing: Clearly state your null hypotheses. Make sure you choose the right test. Are the data you are using in the test paired or independent? If using a parametric test, are your data normally distributed? Show the test of normality. What confidence level are you using to decide whether to reject your null hypothesis (if not 95%, i.e. a significance level of 0.05, then give a reason)? What is the outcome of the test and what is your interpretation of what this means?
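The sequence of checks described above might look something like this in R for two independent groups. The values are simulated, and using the Shapiro-Wilk test to choose between a t-test and a Wilcoxon rank-sum test is one common approach, assumed here for illustration.

```r
# Simulated rates for districts in two regions (values are invented).
set.seed(7)
london     <- rnorm(30, mean = 12, sd = 2)
south_east <- rnorm(30, mean = 10, sd = 2)

# Null hypothesis: there is no difference in the rate between the two regions.
alpha <- 0.05   # significance level corresponding to 95% confidence

# Test of normality for each group (Shapiro-Wilk).
normal_ok <- shapiro.test(london)$p.value > alpha &&
  shapiro.test(south_east)$p.value > alpha

# Independent samples: parametric test if normality is not rejected,
# otherwise a non-parametric alternative.
# For paired data, paired = TRUE would be added to either call.
result <- if (normal_ok) {
  t.test(london, south_east)
} else {
  wilcox.test(london, south_east)
}

reject_null <- result$p.value < alpha
print(result)
print(reject_null)
```

The printed test result is only the outcome; the report still needs to say what rejecting, or failing to reject, the null hypothesis means for the question asked.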
(Regression): Is it appropriate to build a multiple regression model to show how all the independent variables work together to predict a target dependent variable? What is the R² and the significance of the regression model?
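A minimal multiple regression sketch using lm() is shown below. The variables are simulated and their names are placeholders; the point is only where R² and the overall model significance can be read from.

```r
# Simulated data; variable names are placeholders.
set.seed(11)
n <- 80
unemployment <- rnorm(n, mean = 6, sd = 2)
overcrowding <- rnorm(n, mean = 10, sd = 3)
crime_rate <- 5 + 1.5 * unemployment + 0.8 * overcrowding +
  rnorm(n, mean = 0, sd = 3)

# Dependent variable on the left, independent variables on the right.
model <- lm(crime_rate ~ unemployment + overcrowding)
s <- summary(model)

# Proportion of variance explained, plain and adjusted.
r_squared     <- s$r.squared
adj_r_squared <- s$adj.r.squared

# Overall significance of the model from its F statistic.
f <- s$fstatistic
model_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(s)
```

Adjusted R² is usually the fairer figure to report when several independent variables are included, since plain R² never falls when a variable is added.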
Conclusions: What conclusions can you draw from the data analysis? What are the main findings? What are the strengths and weaknesses of what you have done? Are there any implications for future analysis?
References: list of key references using Harvard style.
Appendices: your SQL commands and R scripts.