1. Random selection from the population of interest

This implies truly random selection – you cannot pick the sampling units yourself or simply assume that the selection process is random. Random number generators are commonly used to identify the units to be sampled. It is vital that you select units from a good ‘sampling frame’, i.e. a list of all possible units/people in the population of interest. If this list is incomplete in any way, your sample could be biased. Market research surveys are generally regarded as poor quality because they tend to rely on ‘convenience’ samples (e.g. recruiting anyone who is readily available) – an obviously non-random way to sample people. The randomness of the sample is what makes statistical inference (creating estimates of population parameters with known precision) possible.
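As a minimal sketch of drawing a simple random sample from a frame (the frame and identifiers here are entirely made up for illustration), Python's standard library can do the random number generation:

```python
import random

# Hypothetical sampling frame: a complete list of everyone in the population.
sampling_frame = [f"person_{i:04d}" for i in range(1, 1001)]

# Draw a simple random sample of 100 units without replacement.
# Seeding the generator makes the draw reproducible (useful for audit trails).
rng = random.Random(42)
sample = rng.sample(sampling_frame, k=100)

print(len(sample))       # 100
print(len(set(sample)))  # 100 distinct units – no duplicates
```

Note that `random.sample` draws without replacement, so no unit can be selected twice – the usual situation in survey sampling.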

2. Independence of sampling units

Knowing whether your sampling units are ‘independent’ is vital when analysing sample data. If we took a random sample of 1,000 pupils at 50 schools and related their improvement in academic achievement to school characteristics, we would have to take into account that there are only 50 schools, not 1,000. Analysing the data at the level of the pupil would give incorrect estimates of precision for variables at the school level. This is an example of the kind of analysis for which multilevel modelling was developed. Single-level analyses do not deal well with this kind of situation because the resulting estimates rely on the assumption that the sampling units are independent.
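One common way to quantify this loss of precision (not discussed in the text above, so treat it as an aside) is the ‘design effect’, which converts a clustered sample into an effective number of independent observations. A sketch, assuming equal cluster sizes and a purely illustrative intraclass correlation:

```python
def effective_sample_size(n, cluster_size, icc):
    """Approximate effective sample size under cluster sampling.

    deff = 1 + (m - 1) * icc, where m is the average cluster size and
    icc is the intraclass correlation (how alike units within a cluster are).
    """
    deff = 1 + (cluster_size - 1) * icc
    return n / deff

# 1,000 pupils in 50 schools -> 20 pupils per school on average.
# An ICC of 0.2 is an arbitrary illustrative value.
n_eff = effective_sample_size(1000, cluster_size=20, icc=0.2)
print(round(n_eff))  # ~208 – far fewer 'independent' observations than 1,000
```

Even a modest intraclass correlation can shrink 1,000 pupils to the informational equivalent of a few hundred independent observations, which is why treating clustered data as independent overstates precision.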

3. Sample size considerations

Generally, the larger the sample size, the more precise the estimates will be. However, working out the sample size required for sufficient precision (whilst avoiding the time and effort involved in sampling more units than required) depends on several different considerations. For example:

- the acceptable level of error (e.g. proportion of false positives and false negatives)
- the expected strength of relationships between variables
- the number of variables to be included simultaneously in models
- the precision of measurement instruments (i.e. the quality and accuracy of questions in a survey)

The table below shows some indicative^{1} population sizes along with the minimum sample size, the percentage of the population this represents, and the maximum number of parameters (explanatory variables) these samples would safely allow in a statistical model, where the main outcome of interest is binary or categorical.

| Population size | Minimum sample size | % of population | Maximum model parameters |
|---|---|---|---|
| 100 | 80 | 80% | 8 |
| 500 | 218 | 44% | 21 |
| 1,000 | 278 | 28% | 27 |
| 10,000 | 370 | 4% | 37 |

When the population of interest is small, you need a larger proportion of the population in your sample to obtain sufficiently precise estimates. This is because the error in estimates (standard error, SE) is related to the inverse of the square root of the sample size (n):

SE=\frac{SD}{\sqrt{n}}

…where SD is the standard deviation.
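The square root in the denominator is what produces diminishing returns: quadrupling the sample size only halves the standard error. A quick illustration, using an arbitrary SD of 10:

```python
import math

def standard_error(sd, n):
    # SE = SD / sqrt(n)
    return sd / math.sqrt(n)

sd = 10.0  # arbitrary illustrative standard deviation
for n in (100, 400, 1600):
    print(n, standard_error(sd, n))  # 1.0, 0.5, 0.25
```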

Also, the smaller the sample size, the less scope there is for breaking an analysis down using explanatory variables. As a rule of thumb, the number of parameters allowed in a statistical model should not exceed 10% of the sample size. This places a limitation on models estimated using small samples.

The table above is actually rather optimistic, owing to the assumptions used to calculate it (see Bartlett et al.^{1}). In practice, a larger sample is preferable for statistical models, as it provides statistical ‘power’ – i.e. the ability to detect small associations in the data. Generally, sample sizes of at least 500 are desirable when incorporating 20+ parameters in a model.

National survey datasets generally have sample sizes of 5,000 to 20,000 for any one wave of data collection. This provides a great deal of flexibility for social scientists when specifying statistical models. Nevertheless, it is important to base models on a thorough theoretical framework, usually informing the selection of explanatory variables through a review of the available literature on the topic being investigated.

4. Missing data

There are two main sources of missing data:

- **missing units** – this occurs when all data for a particular sampling unit is missing (e.g. when a person does not respond to a survey)
- **missing items** – this occurs when a particular item of data is missing (e.g. an answer to a question on a survey)

The problem with missing data is that it is often not random (people have reasons for not responding) and can therefore bias your analyses. Statisticians tend to categorise missing data into one of three types:

- MCAR (missing completely at random) – this is where the missingness is completely (statistically) random. Although this will reduce your available sample size, it does not bias your results.
- MAR (missing at random) – this is where the missingness is related to variables that have been collected. By conditioning on these variables in analyses, any bias is controlled for.
- NMAR (not missing at random) – this is where the missingness is related to the variable on which data is missing. For example, estimates of income are frequently problematic: people often do not want to tell a researcher their income, and this reluctance can itself depend on the level of their income. Generally, it is more challenging to prevent this kind of missing data from biasing analyses.
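A deterministic toy example (incomes 1–100, entirely made up) shows why MCAR leaves estimates roughly unbiased while NMAR does not:

```python
# Hypothetical incomes 1..100; the true mean is 50.5.
population = list(range(1, 101))
true_mean = sum(population) / len(population)

# MCAR: every 10th person happens to be unreachable – unrelated to income.
mcar = [x for i, x in enumerate(population) if i % 10 != 0]

# NMAR: the highest earners refuse to report their income.
nmar = [x for x in population if x <= 90]

print(true_mean)              # 50.5
print(sum(mcar) / len(mcar))  # 51.0 – close to the truth
print(sum(nmar) / len(nmar))  # 45.5 – clearly biased downward
```

The MCAR sample is smaller but its mean stays near the truth; the NMAR sample systematically understates the population mean, and no amount of extra data collected the same way will fix that.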

Usually, proactive prevention is better than reactive remedy. Some ways to prevent missing data include:

- keeping a survey very short and piloting questions
- using incentives to encourage participation
- building a relationship with respondents to encourage participation

If data are missing, there are two general approaches that can be used to counteract possible bias in analyses: reweighting and imputation. Reweighting involves estimating the probability of a unit being missing, based on a range of explanatory variables. The inverse of this probability can then be used as a weight, increasing the influence of those units that were likely to be missing but are, in fact, present in the data.
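A minimal sketch of the reweighting idea, using made-up response probabilities (in practice these would be estimated from explanatory variables, e.g. with a logistic regression):

```python
# Each respondent: (observed value, estimated probability of responding).
# High values respond less often here, so the unweighted mean is biased low.
respondents = [
    (20, 0.9), (25, 0.9), (30, 0.9),  # likely responders
    (80, 0.3),                        # unlikely responder who did respond
]

naive_mean = sum(v for v, _ in respondents) / len(respondents)

# Inverse-probability weights: rare responders count for more.
weights = [1 / p for _, p in respondents]
weighted_mean = sum(w * v for (v, _), w in zip(respondents, weights)) / sum(weights)

print(naive_mean)     # 38.75
print(weighted_mean)  # higher – the under-represented group is up-weighted
```

The single hard-to-reach respondent stands in for the many similar people who did not respond at all, which is exactly what the inverse-probability weight encodes.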

Imputation involves filling in missing data by estimating what the missing values would have been if they had been recorded. Often this is done multiple times, to give a range of plausible values for each missing item (this is called multiple imputation), with all this information then incorporated into subsequent analyses. Currently, multiple imputation is the gold standard for dealing with missing data. Although powerful, these methods are relatively complex to implement, and so it is generally preferable to do everything possible to prevent missingness occurring in the first place.
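A highly simplified sketch of the multiple-imputation idea – drawing several plausible values for each gap, estimating on each completed dataset, then pooling the estimates. Real implementations model the missing values from the other variables; the hot-deck-style draw from observed values here is just for illustration:

```python
import random
import statistics

data = [12, 15, None, 14, None, 16, 13]  # None marks missing items
observed = [x for x in data if x is not None]

rng = random.Random(0)
estimates = []
for _ in range(5):  # five imputed datasets
    # Fill each gap with a random draw from the observed values.
    completed = [x if x is not None else rng.choice(observed) for x in data]
    estimates.append(statistics.mean(completed))

# Pool the five per-dataset estimates (a simplified version of Rubin's rules).
pooled = statistics.mean(estimates)
print(pooled)
```

The spread of the five estimates is itself informative: unlike single imputation, multiple imputation lets the uncertainty about the missing values flow through into the final standard errors.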