Q1 – Infant Mortality

The file infant.xls contains World Bank data on country-level infant mortality rates across the developed and developing world. The data set contains 394 observations over a 25-year period (1990 to 2014) and includes several covariates that might plausibly affect health outcomes. The variables are defined as follows:

Infant Mortality – The number of deaths per 1000 live births.

GDP – The real GDP per capita (in US dollars) for each country.

Year – The year the observation was recorded.

Contraceptive Prevalence – The proportion of individuals with access to contraceptives.

Physicians – The number of doctors per 1000 people.

Sanitation – The proportion of individuals with access to modern sanitation facilities.

Education – A measure of educational attainment in terms of years of post-primary attendance.

Your task is to produce an econometric model that helps explain the factors that drive infant mortality. The ultimate aim is to inform policies that could reduce infant mortality in developing countries. You should analyse the data and produce a short report providing expert advice based on your results.

When interpreting your model you should also consider at least some of the following:

(1) The quality of the fit

(2) The sign and significance of the coefficients

(3) Whether the specification of your model is appropriate for the data

(4) What the model ultimately implies about infant mortality rates, and what can be done about them.
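For readers who want to cross-check their spreadsheet output in code, the following is a minimal sketch of how points (1) to (3) might be examined. It assumes a log-linear specification and uses synthetic stand-in data. The variable names, the choice of regressors and the data-generating process are all illustrative assumptions, not the contents of infant.xls or a recommended final model.

```python
import numpy as np

# Synthetic stand-in data: these numbers are made up for illustration only
# and are NOT taken from infant.xls.
rng = np.random.default_rng(0)
n = 394
gdp = rng.lognormal(mean=8.5, sigma=1.0, size=n)   # real GDP per capita (US$)
sanitation = rng.uniform(0.2, 1.0, size=n)         # share with access
physicians = rng.uniform(0.1, 4.0, size=n)         # doctors per 1000 people

# Assumed data-generating process, chosen only so the sketch has something
# to recover; the true relationship in the real data is unknown.
log_mort = (7.0 - 0.5 * np.log(gdp) - 1.0 * sanitation - 0.1 * physicians
            + rng.normal(0.0, 0.2, size=n))


def ols(y, X):
    """Return OLS coefficients, standard errors and R-squared."""
    X = np.column_stack([np.ones(len(y)), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]                  # degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, se, r2


X = np.column_stack([np.log(gdp), sanitation, physicians])
beta, se, r2 = ols(log_mort, X)

names = ["const", "log_gdp", "sanitation", "physicians"]
for name, b, s in zip(names, beta, se):
    print(f"{name:>12}: coef={b: .3f}  t={b / s: .2f}")
print(f"R-squared = {r2:.3f}")
```

With both mortality and GDP in logs, the GDP coefficient reads as an elasticity. Whether logs are appropriate for the real data is part of point (3) and should be checked rather than assumed.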

Q2 – Psychological Traits and Violent Crime

Regression models can be used in an incredibly wide array of contexts, including areas such as psychology and criminology. In this (fictitious) example you are to take some data on the incidence of violent crime, and use it to produce a model that can help identify suspects in a murder case. The file crime.xls has data on a set of individual-level personality traits obtained from the Big Five and Dark Triad constructs, together with an indicator of whether the individual has been convicted of a violent crime. The variables are as follows:

Extroversion {1-10} – 10 indicates more extroverted.

Openness to Experience {1-10} – 10 indicates more openness.

Sensitivity {1-10} – 10 indicates more sensitivity to negative emotion.

Agreeableness {1-10} – 10 indicates more agreeable.

Conscientiousness {1-10} – 10 indicates more conscientious.

Narcissism {1-7} – 7 indicates higher levels.

Psychopathy {1-7} – 7 indicates higher levels.

Machiavellianism {1-7} – 7 indicates higher levels.

Age – measured in years.

Female {0-1} – 1 indicates female.

Offence {0-1} – 1 indicates the individual has been convicted of a violent offence.

Your task is to analyse the data set in order to extract some practical insights into the psychology of violent crime.

Calculate the average values of the eight psychometric variables for individuals who have, and have not, been convicted of violent offences. Report your results. Which variables have the biggest differences between the two groups? Discuss.

Estimate a linear probability model explaining violent crime convictions as a function of age, gender, and the suite of personality characteristics given above {hint: this is a straightforward regression model in Excel using convictions as the dependent variable}. Treat the psychometric variables as continuous and gender as a dummy variable.

Interpret your model with the aim of providing useful information for non-statisticians. Which variables are the most important in this multiple regression model?

In the first dot-point, you identified key variables by looking for large discrepancies in average values between our violent and non-violent groups. In the second, you identified key variables using a regression model. Briefly discuss the difference between these two approaches. Which one is more likely to identify variables that cause violent crime? Why?

Suppose you are part of an investigation into a murder case. The police have identified four suspects and a personality evaluation is given to each. Using your model and the data below, obtain a predicted value of Offence (i.e. the fitted probability of a violent-offence conviction) for each suspect. Which one(s) would you recommend the police focus their attention on? Why?

                     John Smith   Jane Doe   Adam Jones   Brian Greene
Extraversion                  7          4            5              2
Openness                      3          3            8              4
Sensitivity                   3          7            2              6
Agreeableness                 7          7            3              3
Conscientiousness             8          8            7              8
Narcissism                    3          5            7              2
Psychopathy                   2          3            6              3
Machiavellianism              2          2            7              6
Age                          36         41           34             32
Female                        0          1            0              0
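The three computational steps in this question (group averages, the linear probability model, and a prediction for a suspect) can be sketched in code as follows. This is only an illustration under stated assumptions: the data are synthetic, the conviction probabilities are invented, and only John Smith's scores are taken from the table above. The printed prediction therefore says nothing about the real suspects.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
traits = ["extroversion", "openness", "sensitivity", "agreeableness",
          "conscientiousness", "narcissism", "psychopathy", "machiavellianism"]

# Synthetic stand-in data (NOT the contents of crime.xls).
big5 = rng.integers(1, 11, size=(n, 5)).astype(float)   # scores 1..10
dark = rng.integers(1, 8, size=(n, 3)).astype(float)    # scores 1..7
age = rng.integers(18, 70, size=n).astype(float)
female = rng.integers(0, 2, size=n).astype(float)

# Assumed conviction probabilities, invented purely so the sketch runs.
p = np.clip(0.05 + 0.06 * dark[:, 1] - 0.02 * big5[:, 3] - 0.05 * female,
            0.01, 0.99)
offence = (rng.random(n) < p).astype(float)

# Step 1: average psychometric scores for each group.
psych = np.column_stack([big5, dark])
mean_offender = psych[offence == 1].mean(axis=0)
mean_other = psych[offence == 0].mean(axis=0)
for name, a, b in zip(traits, mean_offender, mean_other):
    print(f"{name:>18}: convicted={a:.2f}  not convicted={b:.2f}  diff={a - b:+.2f}")

# Step 2: linear probability model, i.e. OLS with the 0/1 outcome as the
# dependent variable. Column order: const, eight traits, age, female.
X = np.column_stack([np.ones(n), psych, age, female])
beta, *_ = np.linalg.lstsq(X, offence, rcond=None)

# Step 3: fitted value for one individual, using John Smith's scores.
# The coefficients come from synthetic data, so the number is illustrative.
john = np.array([1, 7, 3, 3, 7, 8, 3, 2, 2, 36, 0], dtype=float)
y_hat_john = john @ beta
print(f"Fitted value for John Smith (synthetic fit): {y_hat_john:.3f}")
```

Because the model is linear, fitted values can fall below 0 or above 1. This is a known limitation of the linear probability model and is worth mentioning when interpreting predictions for non-statisticians.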

Q3 – Non-linear Functional Forms

Most companies face a dilemma when setting prices for their products. If they set the price of a good too low, then a high volume of sales will not be enough to make up for a small margin on each item. On the other hand, if they set the price too high, the large margin per sale will not make up for a reduced quantity. Maximising profit therefore requires a balancing act that trades off prices and quantities.

The file textiles.xls has data on textile profits, prices per unit of output, and production costs {a quality index, labour costs, price of materials}. You are to analyse the data with a statistical model and produce a brief statement advising a textile company on pricing policy.

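Since firms are trying to maximise profits, estimate a model for profit of the form below using Excel, and report your results.

$$\text{Profit}_i = \beta_0 + \beta_1 \text{Price}_i + \beta_2 \text{Price}_i^2 + \beta_3 \text{Quality}_i + \beta_4 \text{Labour}_i + \beta_5 \text{Materials}_i + \varepsilon_i$$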

Briefly interpret your model. Do your control variables (quality, labour, materials) have the expected signs? Why or why not?

What percentage of the variation in profit is accounted for by the model? Explain.

The parameter $\beta_2$ (and the quadratic transform $\text{Price}^2$ that it multiplies) produce the non-linearity in the model. What sign (positive or negative) do we expect for $\beta_2$? Why?

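The optimal price in this model is given by $P^* = -\beta_1 / (2\beta_2)$, where $\beta_1$ and $\beta_2$ are the coefficients on price and squared price. What price would you recommend a textile manufacturer to charge per unit of output?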

Statistical models can often be improved by including more variables in the estimation. Suggest two variables that you could include in your model that would improve its predictive capacity.
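As a rough illustration of the optimal-price step, the sketch below fits a profit regression that is quadratic in price and reads off the turning point of the fitted curve. Setting the derivative of profit with respect to price to zero gives a price equal to minus the price coefficient divided by twice the squared-price coefficient. The data are synthetic, and the variable names and "true" coefficients are assumptions chosen so that profit peaks at a price of 8. None of it comes from textiles.xls.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic stand-in data (NOT the contents of textiles.xls).
price = rng.uniform(2.0, 14.0, size=n)
quality = rng.uniform(0.0, 10.0, size=n)
labour = rng.uniform(5.0, 15.0, size=n)
materials = rng.uniform(1.0, 5.0, size=n)

# Assumed relationship: an inverted U in price with its peak at price = 8.
profit = (10.0 + 8.0 * price - 0.5 * price**2
          + 1.5 * quality - 0.8 * labour - 2.0 * materials
          + rng.normal(0.0, 1.0, size=n))

# OLS with price and price squared plus the three controls.
X = np.column_stack([np.ones(n), price, price**2, quality, labour, materials])
beta, *_ = np.linalg.lstsq(X, profit, rcond=None)
b_price, b_price_sq = beta[1], beta[2]

# Share of the variation in profit accounted for by the model.
resid = profit - X @ beta
r2 = 1.0 - (resid @ resid) / np.sum((profit - profit.mean()) ** 2)

# Turning point of the fitted quadratic. It is a maximum only if the
# squared-price coefficient is negative.
optimal_price = -b_price / (2.0 * b_price_sq)

print(f"price coef = {b_price:.3f}, price^2 coef = {b_price_sq:.3f}")
print(f"R-squared = {r2:.3f}")
print(f"estimated profit-maximising price = {optimal_price:.2f}")
```

In Excel the same calculation needs only the two price coefficients from the regression output. It is worth checking that the recommended price lies inside the range of prices observed in the data.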

Q4 – Clustering

Political scientists are aware that individuals respond in differing ways to various policy platforms and campaign messages. For example, some political messages will resonate strongly in certain segments of the community, and yet be very unpopular in other segments. For this reason, campaigns often like to split voters into groups and target each set of voters in different ways.

Suppose you work for an election campaign that wishes to target voters in this way. The file voting.xls has data on the ages, income levels, and population densities of the home addresses of 20 voters (note that the z-transformed versions are also available). You are to use a k-means clustering algorithm to show your campaign colleagues how such a breakdown may be performed.

Clustering works by taking some randomly chosen “seed” points and allocating each observation to the nearest available seed. The process then “iterates” until the allocations are stable. Explain what is meant by the term “iterates” in this context.

Take the z-transformed variables z_age, z_income and z_popdensity and perform a k-means clustering procedure to sort your data into three groups (hint: use the Excel file k-means for this). Approximately how many iterations did you need before the process was complete?

Present three scatter plots showing your allocation of observations into clusters. Which cluster has the most observations? Which has the highest income? Which one is situated in the most densely packed urban area? Answer these questions by reporting and interpreting the centroid means for each cluster.

Briefly explain how you could use your results to tailor your advertising campaign to target specific clusters within the broader population.
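To make the idea of iterating concrete, here is a minimal sketch of a k-means loop. Each pass reallocates every observation to its nearest centroid and then recomputes the centroids, and the loop stops once an entire pass changes no allocation. The 20 synthetic voters, the group centres and the function name k_means are hypothetical and stand in for voting.xls. The supplied Excel workbook may differ in its details.

```python
import numpy as np


def k_means(points, k, rng, max_iter=100):
    """Plain k-means. Returns (labels, centroids, passes used)."""
    # Seeds: k distinct observations chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.full(len(points), -1)
    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            return labels, centroids, iteration    # allocations stable: stop
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:                   # keep old centroid if empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids, max_iter


# Synthetic stand-in data: 20 voters drawn around three made-up centres.
# Columns: age (years), income (thousands), population density (arbitrary units).
rng = np.random.default_rng(3)
centres = np.array([[25.0, 30.0, 8.0], [45.0, 90.0, 5.0], [68.0, 50.0, 1.0]])
raw = np.vstack([c + rng.normal(0.0, [3.0, 5.0, 0.5], size=(m, 3))
                 for c, m in zip(centres, [7, 7, 6])])

# z-transform each column so that no variable dominates the distances.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

labels, centroids, n_iter = k_means(z, 3, rng)
print(f"passes until allocations stopped changing: {n_iter}")
for j in range(3):
    size = int(np.sum(labels == j))
    print(f"cluster {j}: n={size}, centroid (z_age, z_income, z_popdensity) = "
          f"{np.round(centroids[j], 2)}")
```

Different random seeds can produce different final clusters, so it is sensible to rerun the procedure a few times and compare the centroid means before interpreting them.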