Chapter 25: Assessing risk of bias in a non-randomized study

Jonathan AC Sterne, Miguel A Hernán, Alexandra McAleenan, Barnaby C Reeves, Julian PT Higgins

Key Points:

The Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool is recommended for assessing the risk of bias in non-randomized studies of interventions included in Cochrane Reviews.
Review authors should specify important confounding domains and co-interventions of concern in their protocol.
At the start of a ROBINS-I assessment of a study, review authors should describe a ‘target trial’, which is a hypothetical pragmatic randomized trial of the interventions compared in the study, conducted on the same participant group and without features putting it at risk of bias.
Assessment of risk of bias in a non-randomized study should address pre-intervention, at-intervention, and post-intervention features of the study. The issues related to post-intervention features are similar to those in randomized trials.
Many features of ROBINS-I are shared with the RoB 2 tool for assessing risk of bias in randomized trials. Like RoB 2, ROBINS-I focuses on a specific result, is structured into a fixed set of domains of bias, includes signalling questions that inform risk-of-bias judgements, and leads to an overall risk-of-bias judgement.
Based on answers to the signalling questions, judgements for each bias domain, and for overall risk of bias, can be ‘Low’, ‘Moderate’, ‘Serious’ or ‘Critical’ risk of bias.
The full guidance documentation for the ROBINS-I tool, including the latest variants for different study designs, is available at www.riskofbias.info.

Cite this chapter as: Sterne JAC, Hernán MA, McAleenan A, Reeves BC, Higgins JPT. Chapter 25: Assessing risk of bias in a non-randomized study [last updated October 2019]. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.5. Cochrane, 2024. Available from cochrane.org/handbook.

25.1 Introduction

Cochrane Reviews often include non-randomized studies of interventions (NRSI), as discussed in detail in Chapter 24. Risk of bias should be assessed for each included study (see Chapter 7). The Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool (Sterne et al 2016) is recommended for assessing risk of bias in an NRSI: it provides a framework for assessing the risk of bias in a single result (an estimate of the effect of an experimental intervention compared with a comparator intervention on a particular outcome). Many features of ROBINS-I are shared with the RoB 2 tool for assessing risk of bias in randomized trials (see Chapter 8).

Evaluating risk of bias in results of NRSI requires both methodological and content expertise. The process is more involved than for randomized trials, and the participation of both methodologists with experience in the relevant study designs or design features, and health professionals with knowledge of prognostic factors that influence intervention decisions for the target patient or population group, is recommended (see Chapter 24). At the planning stage, the review question must be clearly articulated, and important potential problems in NRSI relevant to the review should be identified. This includes a preliminary specification of important confounders and co-interventions (see Section 25.3.1). Each study should then be carefully examined, considering all the ways in which its results might be put at risk of bias.

In this chapter we summarize the biases that can affect NRSI and describe the main features of the ROBINS-I tool. Since the initial version of the tool was published in 2016 (Sterne et al 2016), developments to it have continued. At the time of writing, a new version is under preparation, with variants for several types of NRSI design. The full guidance documentation for the ROBINS-I tool, including the latest variants for different study designs, is available at www.riskofbias.info.

25.1.1 Defining bias in a non-randomized study

We define bias as the systematic difference between the results obtained from an NRSI and those obtained from a pragmatic randomized trial with no flaws in its conduct (both with a very large sample size) that addresses the same question and is conducted on the same participant group. Defined in this way, bias is distinct from issues of indirectness (applicability, generalizability or transportability to types of individuals who were not included in the study; see Chapter 14) and distinct from chance. For example, restricting the study sample to individuals free of comorbidities may limit the utility of its findings because they cannot be generalized to clinical practice, where comorbidities are common. However, such restriction does not bias the results of the study in relation to individuals free of comorbidities.

Evaluations of risk of bias in the results of NRSI are thus facilitated by considering each NRSI as an attempt to emulate (mimic) a hypothetical ‘target’ randomized trial (see also Section 25.3.2). This is the hypothetical pragmatic randomized trial that compares the health effects of the same interventions, conducted on the same participant group and without features putting it at risk of bias (Institute of Medicine 2012, Hernán and Robins 2016). Importantly, a target randomized trial need not be feasible or ethical. For example, there would be no problem specifying a target trial that randomized individuals to receive tobacco cigarettes or no cigarettes to examine the effects of smoking, even though such a trial would not be ethical in practice. Similarly, there would be no problem specifying a target trial that randomized multiple countries to implement a ban on smoking in public places, even though this would not be feasible in practice.

25.2 Biases in non-randomized studies

When a systematic review includes randomized trials, its results correspond to the causal effects of the interventions studied provided that the trials have no bias. Randomization is used to avoid an influence of either known or unknown prognostic factors (factors that predict the outcome, such as severity of illness or presence of comorbidities) on intervention group assignment. There is greater potential for bias in NRSI than in randomized trials. A key concern is the possibility of confounding (see Section 25.2.1). NRSI may also be affected by biases that are referred to in the epidemiological literature as selection bias (see Section 25.2.2) and information bias (see Section 25.2.3). Furthermore, we are at least as concerned about reporting biases as we are when including randomized trials (see Section 25.2.4).

25.2.1 Confounding

Confounding occurs when there are common causes of the choice of intervention and the outcome of interest. In the presence of confounding, the association between intervention and outcome differs from its causal effect. This difference is known as confounding bias. A confounding domain (or, more loosely, a ‘confounder’) is a pre-intervention prognostic factor (i.e. a variable that predicts the outcome of interest) that also predicts whether an individual receives one or the other intervention of interest. Some common examples are severity of pre-existing disease, presence of comorbidities, healthcare use, physician prescribing practices, adiposity, and socio-economic status.

Investigators measure specific variables (often also referred to as confounders) in an attempt to control fully or partly for these confounding domains. For example, baseline immune function and recent weight loss may be used to adjust for disease severity; hospitalizations and number of medical encounters in the six months preceding baseline may be used to adjust for healthcare use; geographic measures to adjust for physician prescribing practices; body mass index and waist-to-hip ratio to adjust for adiposity; and income and education to adjust for socio-economic status.

The confounding domains that are important in the context of particular interventions may vary across study settings. For example, socio-economic status might be an important confounder in settings where cost or having insurance cover affects access to health care, but might not introduce confounding in studies conducted in countries in which access to the interventions of interest is universal and therefore socio-economic status does not influence intervention received.

Confounding may be overcome, in principle, either by design (e.g. by restricting eligibility to individuals who all have the same value of the baseline confounders) or – more commonly – through statistical analyses that adjust (‘control’) for the confounder(s). Adjusting for factors that are not confounders, and in particular adjusting for variables that could be affected by intervention (‘post-intervention’ variables), may introduce bias.
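The difference between a crude association and an adjusted estimate can be illustrated with a small worked example. The sketch below uses entirely hypothetical counts in which disease severity is the only confounding domain and the intervention has no effect within either severity stratum. The Mantel-Haenszel risk ratio is used here as one familiar way of adjusting by stratification; it is an assumption of this illustration, not a method prescribed by ROBINS-I, and the names `Stratum`, `crude_risk_ratio` and `mantel_haenszel_risk_ratio` are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    """Counts within one level of a confounding domain (hypothetical data)."""
    label: str
    treated_events: int
    treated_total: int
    comparator_events: int
    comparator_total: int


def crude_risk_ratio(strata):
    """Risk ratio after pooling all strata, i.e. ignoring the confounder."""
    treated_events = sum(s.treated_events for s in strata)
    treated_total = sum(s.treated_total for s in strata)
    comparator_events = sum(s.comparator_events for s in strata)
    comparator_total = sum(s.comparator_total for s in strata)
    return (treated_events / treated_total) / (comparator_events / comparator_total)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel risk ratio pooled across strata of the confounder."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = s.treated_total + s.comparator_total
        numerator += s.treated_events * s.comparator_total / n
        denominator += s.comparator_events * s.treated_total / n
    return numerator / denominator


# Hypothetical cohort: risk is 50% in severe disease and 10% in mild disease
# regardless of intervention, but severely ill people are more likely to be treated.
STRATA = [
    Stratum("severe disease", 400, 800, 100, 200),
    Stratum("mild disease", 20, 200, 80, 800),
]

crude = crude_risk_ratio(STRATA)
adjusted = mantel_haenszel_risk_ratio(STRATA)
print(f"crude RR = {crude:.2f}, adjusted RR = {adjusted:.2f}")
```

With these invented counts the crude risk ratio is about 2.33 although the stratum-specific (and Mantel-Haenszel) risk ratio is 1: the whole crude association is confounding bias. In a real NRSI the confounding domains are rarely captured this completely by a single measured variable.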

In practice, confounding is not fully overcome. First, residual confounding occurs when a confounding domain is measured with error, or when the relationship between the confounding domain and the outcome or exposure (depending on the analytic approach being used) is imperfectly modelled. For example, in an NRSI comparing two antihypertensive drugs, we would expect residual confounding if pre-intervention blood pressure was measured three months before the start of intervention, but the blood pressures used by clinicians to decide between the drugs at the point of intervention were not available in our dataset. Second, unmeasured confounding occurs when a confounding domain has not been measured at all, or is not controlled for in the analysis. This would be the case if no pre-intervention blood pressure measurements were available, or if the analysis failed to control for pre-intervention blood pressure despite it being measured. Unmeasured confounding usually cannot be excluded, because we are seldom certain that we know all the confounding domains.

When NRSI are to be included in a review, review authors should attempt to pre-specify important confounding domains in their protocol. The identification of potential confounding domains requires subject-matter knowledge. For example, experts on surgery are best-placed to identify prognostic factors that are likely to be related to the choice of a surgical strategy. We recommend that subject-matter experts be included in the team writing the review protocol, and we encourage the listing of confounding domains in the review protocol, based on initial discussions among the review authors and existing knowledge of the literature.

25.2.2 Selection bias

Selection bias occurs when some eligible participants, or some follow-up time of some participants, or some outcome events, are excluded in a way that leads to the association between intervention and outcome in the NRSI differing from the association that would have been observed in the target trial. This phenomenon is distinct from that of confounding, although the term selection bias is sometimes used to mean confounding. Selection biases occur in NRSI either due to selection of participants or follow-up time into the study (addressed in the ‘Bias in selection of participants into the study’ domain), or selection of participants or follow-up time out of the study (addressed in the ‘Bias due to missing data’ domain).

Our use of the term ‘selection bias’ is intended to refer only to bias that would arise even if the effect of interest were null, that is, biases that are internal to the study, and not to issues of indirectness (generalizability, applicability or transferability to people who were excluded from the study) (Schünemann et al 2013).

Selection bias occurs when selection of participants or follow-up time is related to both intervention and outcome. For example, studies of folate supplementation during pregnancy to prevent neural tube defects in children were biased because they only included mothers and children if children were born alive (Hernán et al 2002). The bias arose because having a live birth (rather than a stillbirth or therapeutic abortion, for which outcome data were not available) is related to both the intervention (because folate supplementation increases the chance of a live birth) and the outcome (because the presence of neural tube defects makes a live birth less likely) (Velie and Shaw 1996, Hernán et al 2002).
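The mechanism can be sketched numerically. The example below is loosely patterned on the live-birth example above, but every count and probability in it is hypothetical and chosen only for illustration. In the full cohort the intervention has no effect on the outcome (risk ratio of 1). Selection into the analysis depends on both the intervention and the outcome, and the risk ratio among those selected departs from 1.

```python
def risk_ratio_from_counts(counts):
    """Risk ratio from counts keyed by (received_intervention, has_outcome)."""
    exposed_events = counts[(True, True)]
    exposed_total = exposed_events + counts[(True, False)]
    unexposed_events = counts[(False, True)]
    unexposed_total = unexposed_events + counts[(False, False)]
    return (exposed_events / exposed_total) / (unexposed_events / unexposed_total)


# Hypothetical full cohort: outcome risk is 1% in both groups (true RR = 1).
COHORT = {
    (True, True): 100,
    (True, False): 9900,
    (False, True): 100,
    (False, False): 9900,
}

# Hypothetical probability of being selected into the analysis. Selection is
# more likely with the intervention and less likely when the outcome is present.
P_SELECTED = {
    (True, True): 0.60,
    (True, False): 0.95,
    (False, True): 0.30,
    (False, False): 0.90,
}

# Expected counts among those selected into the analysis.
SELECTED = {key: n * P_SELECTED[key] for key, n in COHORT.items()}

rr_full = risk_ratio_from_counts(COHORT)
rr_selected = risk_ratio_from_counts(SELECTED)
print(f"full cohort RR = {rr_full:.2f}, selected participants RR = {rr_selected:.2f}")
```

With these invented values the risk ratio among selected participants is about 1.9 even though the true effect is null. The size and direction of such bias in any real study depend on how selection relates to intervention and outcome, which this sketch does not attempt to estimate.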

Selection bias can also occur when some follow-up time is excluded from the analysis. For example, there is potential for bias when prevalent users of an intervention (those already receiving the intervention), rather than incident (new) users are included in analyses comparing them with non-users. This is a type of selection bias that has also been termed inception bias or lead time bias. If participants are not followed from assignment of the intervention (inception), as they would be in a randomized trial, then a period of follow-up has been excluded, and individuals who experienced the outcome soon after starting the intervention will be missing from analyses.

Selection bias may also arise because of missing data due to, among other reasons, attrition (loss to follow-up), missed appointments, incomplete data collection and exclusion of participants from analysis by primary investigators. In NRSI, data may be missing for baseline characteristics (including interventions received or baseline confounders), for pre-specified co-interventions, for outcome measurements, for other variables involved in the analysis or a combination of these. Specific considerations for missing data broadly follow those established for randomized trials and described in the RoB 2 tool for randomized trials (see Chapter 8).

25.2.3 Information bias

Bias may be introduced if intervention status is misclassified, or if outcomes are misclassified or measured with error. Such bias is often referred to as information bias or measurement bias. Errors in classification (or measurement) may be non-differential or differential, and in general we are more concerned about such errors when they are differential. Differential misclassification of intervention status occurs when misclassifications are related to subsequent outcome or to risk of the outcome. Differential misclassification (or measurement error) in outcomes occurs when it is related to intervention status.
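The distinction can be illustrated with expected counts. The sketch below assumes a hypothetical cohort with a true risk ratio of 2 and applies invented sensitivity and specificity values to the classification of intervention status. In the non-differential scenario the error rates are the same whether or not the outcome occurs; in the differential scenario, people who go on to have the outcome are more likely to be correctly recorded as having received the intervention. The function name and all numbers are assumptions made for this example.

```python
def observed_risk_ratio(true_counts, sensitivity, specificity):
    """Expected risk ratio after misclassification of intervention status.

    true_counts: dict keyed by (truly_exposed, has_outcome) -> count
    sensitivity: dict keyed by has_outcome -> P(classified exposed | truly exposed)
    specificity: dict keyed by has_outcome -> P(classified unexposed | truly unexposed)
    """
    # classified[classified_exposed][has_outcome] -> expected count
    classified = {True: {True: 0.0, False: 0.0}, False: {True: 0.0, False: 0.0}}
    for (truly_exposed, has_outcome), n in true_counts.items():
        if truly_exposed:
            p_classified_exposed = sensitivity[has_outcome]
        else:
            p_classified_exposed = 1.0 - specificity[has_outcome]
        classified[True][has_outcome] += n * p_classified_exposed
        classified[False][has_outcome] += n * (1.0 - p_classified_exposed)
    risk = {}
    for group in (True, False):
        total = classified[group][True] + classified[group][False]
        risk[group] = classified[group][True] / total
    return risk[True] / risk[False]


# Hypothetical truth: risk 20% with intervention, 10% without (RR = 2).
TRUE_COUNTS = {
    (True, True): 200,
    (True, False): 800,
    (False, True): 100,
    (False, False): 900,
}

perfect = {True: 1.0, False: 1.0}
true_rr = observed_risk_ratio(TRUE_COUNTS, perfect, perfect)

# Non-differential: same error rates whether or not the outcome occurs.
non_differential = observed_risk_ratio(
    TRUE_COUNTS,
    sensitivity={True: 0.8, False: 0.8},
    specificity={True: 0.9, False: 0.9},
)

# Differential: better ascertainment of intervention among those with the outcome.
differential = observed_risk_ratio(
    TRUE_COUNTS,
    sensitivity={True: 0.95, False: 0.8},
    specificity={True: 0.9, False: 0.9},
)

print(f"true RR = {true_rr:.2f}")
print(f"non-differential RR = {non_differential:.2f}")
print(f"differential RR = {differential:.2f}")
```

In this invented scenario the non-differential errors pull the risk ratio from 2 towards the null (to about 1.6), consistent with the usual pattern noted in Table 25.3.a, whereas the differential errors push it above 2. Differential misclassification can bias in either direction depending on how the errors relate to the outcome.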

Misclassification of intervention status is seldom a problem in randomized trials and other experimental studies, because interventions are actively assigned by the researcher and their accurate recording is a key feature of the study. However, in observational studies information about interventions allocated or received must be ascertained. To prevent differential misclassification of intervention status it is important that, wherever possible, interventions are defined and categorized without knowledge of subsequent outcomes. A well-known example of differential misclassification, when knowledge of subsequent outcomes might affect classification of interventions, is recall bias in a case-control study: cases may be more likely than controls to recall potentially important events or report exposure to risk factors they believe to be responsible for their disease. Differential misclassification of intervention status can occur in cohort studies if information on intervention status is obtained retrospectively. This can happen if information (or availability of information) on intervention status is influenced by outcomes: for example, in a cohort study in elderly people in which the outcome is dementia, participants’ recall of past intervention status at study inception may be affected by pre-existing mild cognitive impairment. Such problems can be avoided if information about intervention status is collected at the time of the intervention and the information is complete and accessible to those undertaking the NRSI.

Bias in measurement of the outcome is often referred to as detection bias. Examples of situations in which such bias can arise are if (i) outcome assessors are aware of intervention status (particularly when assessment of the outcome is subjective); (ii) different methods (or intensities of observation) are used to assess outcomes in the different intervention groups; and (iii) measurement errors are related to intervention status (or to a confounder of the intervention-outcome relationship). Blinding of outcome assessors aims to prevent systematic differences in measurements between intervention groups but is frequently not possible or not performed in NRSI.

25.2.4 Reporting bias

Concerns over selection of the reported results from NRSI reflect the same concerns as for randomized trials (see Chapter 7 and Chapter 8, Section 8.7). Selective reporting typically arises from a desire for findings to be newsworthy, or sufficiently noteworthy to merit publication: this could be the case if previous evidence (or a prior hypothesis) is either supported or contradicted. Although there is a lack of empirical evidence of selective reporting in NRSI compared with randomized trials, it is difficult to imagine that the problem is any less serious for NRSI. Many NRSI do not have written protocols, and many are exploratory so – by design – involve inspecting many associations between intervention and outcome.

Selection of the reported result will lead to bias if it is based on the P value, magnitude or direction of the intervention effect estimate. Bias due to selection of the outcome measure occurs when an effect estimate for a particular outcome is selected from among multiple measurements, for example when a measurement is made at a number of time points or using multiple scales. Bias due to selection of the analysis occurs when the reported results are selected from intervention effects estimated in multiple ways, such as analyses of both change scores and post-intervention scores adjusted for baseline, or multiple analyses with adjustment for different sets of potential confounders. Finally, there may be selective reporting of a subgroup of participants, selected from a larger NRSI, for which results are reported on the basis of a more interesting finding.

The separate issue of bias due to missing results, where non-reporting of study outcomes or whole studies is related to the P value, magnitude or direction of the intervention effect estimate, is addressed outside the framework of the ROBINS-I tool, and is described in detail in Chapter 13.

25.3 The ROBINS-I tool

25.3.1 At protocol stage: listing the confounding domains and the possible co-interventions

Review authors planning a ROBINS-I assessment should list important confounding domains in their protocol. Relevant confounding domains are the prognostic factors (predictors of the outcome) that also predict whether an individual receives one or the other intervention of interest.

Review authors are also encouraged to list important co-interventions in their protocol. Relevant co-interventions are the interventions or exposures that individuals might receive after or with initiation of the intervention of interest, which are related to the intervention received and which are prognostic for the outcome of interest. Therefore, co-interventions are a type of confounder, which we consider separately to highlight their importance.

Important confounders and co-interventions are likely to be identified both through the knowledge of subject-matter experts who are members of the review team, and through initial (scoping) reviews of the literature. Discussions with health professionals who make intervention decisions for the target patient or population groups may also be helpful. Assessment of risk of bias may, for some domains, rely heavily on expert opinion rather than empirical data: this means that consensus may not be reached among experts with different opinions. Nonetheless use of ROBINS-I should help structure discussions about risk of bias and make disagreements explicit.

25.3.2 Specifying a target trial specific to the study

ROBINS-I requires that review authors explicitly identify the interventions that would be compared in the hypothetical target trial that the NRSI is trying to emulate (see Section 25.1.1). Often the description of these interventions will require subject-matter knowledge, because information provided by the investigators of the observational study is insufficient to define the target trial. For example, NRSI authors may refer to ‘use of therapy [A],’ which does not directly correspond to the intervention ‘prescribe therapy [A]’ that would be tested in an intention-to-treat analysis of the target trial. Meaningful assessment of risk of bias is problematic in the absence of well-defined interventions.

25.3.3 Specifying the nature of the effect of interest

In the target trial, the effect of interest will be either the effect of assignment to the interventions at baseline, regardless of the extent to which the interventions were received as intended, or the effect of adhering to the interventions as specified in the study protocol (see Chapter 8, Section 8.2.2). Risk of bias will be assessed in relation to one of these effects. The choice of effect of interest is a decision of the review authors. However, it may be influenced by the analyses that produced the NRSI result being assessed, because the result may correspond more closely to one of the effects of interest and would, therefore, be at greater risk of bias with respect to the alternative effect of interest.

In a randomized trial, these two effects may be interpreted as the intention-to-treat (ITT) effect and the per protocol effect (see also Chapter 8, Section 8.2.2). Analogues of these effects can be defined for NRSI. For example, the ITT effect can be approximated by the effect of prescribing experimental intervention versus prescribing comparator intervention. When prescription information is not available, the ITT effect can be approximated by the effect of starting the experimental intervention versus starting comparator intervention, which corresponds to the ITT effect in a trial in which participants assigned to an intervention always start the intervention. An analogue of the effect of adhering to the intervention as described in the trial protocol is (starting and) adhering to experimental intervention versus (starting and) adhering to comparator intervention unless medical reasons (e.g. toxicity) indicate discontinuation.

For both NRSI and randomized trials, unbiased estimation of the effect of adhering to sustained interventions (interventions that continue over time, such as daily ingestion of a drug intervention) requires appropriate adjustment for prognostic factors (‘time-varying confounders’) that predict deviations from the intervention after the start of follow-up (baseline). Review authors should seek specialist advice when assessing intervention effects estimated using methods that adjust for time-varying confounding.

When the effect of interest is that of assignment to the intervention (or starting intervention at baseline), risk-of-bias assessments need not be concerned with post-baseline deviations from intended interventions that reflect the natural course of events. For example, a departure from an allocated intervention that was clinically necessary because of a sudden worsening of the patient’s condition does not lead to bias. The only post-baseline deviations that may lead to bias are the potentially biased actions of researchers arising from the experimental context. Observational studies estimating the effect of assignment to intervention from routine data should therefore have no concerns about post-baseline deviations from intended interventions.

By contrast, when the effect of interest is adhering to the intended intervention, risk-of-bias assessments of both NRSI and randomized trials should consider post-baseline deviations from the intended interventions, including lack of adherence and differences in additional interventions (co-interventions) between intervention groups.

25.3.4 Domains of bias

The domains included in ROBINS-I cover all types of bias that are currently understood to affect the results of NRSI. Each domain is mandatory, and no additional domains should be added. Table 25.3.a lists the bias domains covered by the tool for most types of NRSI. Versions of the tool are available, or in development, for several types of NRSI, and the variant selected should be appropriate to the key features of the study being assessed (see latest details at www.riskofbias.info).

In common with RoB 2 (Chapter 8, Section 8.2.3), the tool comprises, for each domain:

a series of ‘signalling questions’;
a judgement about risk of bias for the domain, which is facilitated by an algorithm that maps responses to the signalling questions to a proposed judgement;
free text boxes to justify responses to the signalling questions and risk-of-bias judgements; and
an option to predict (and explain) the likely direction of bias.

The signalling questions aim to elicit information relevant to the risk-of-bias judgement for the domain, and work in the same way as for RoB 2 (see Chapter 8, Section 8.2.3). The response options are:

yes;
probably yes;
probably no;
no;
no information.

Based on these responses to the signalling questions, the options for a domain-level risk-of-bias judgement are ‘Low’, ‘Moderate’, ‘Serious’ or ‘Critical’ risk of bias, with an additional option of ‘No information’ (see Table 25.3.b). These differ from the risk-of-bias judgements for the RoB 2 tool (Chapter 8, Section 8.2.3).

Note that a judgement of ‘Low risk of bias’ corresponds to the absence of bias in a well-performed randomized trial, with regard to the domain being considered. This category thus provides a reference for risk-of-bias assessment in NRSI, in particular for the ‘pre-intervention’ and ‘at-intervention’ domains. Because of confounding, we anticipate that only rarely will design or analysis features of a non-randomized study lead to a classification of low risk of bias when studying the intended effects of interventions. On the other hand, confounding may be a less serious concern when studying unintended effects of intervention (Institute of Medicine 2012). By contrast, since randomization does not protect against post-intervention biases, we expect more overlap between assessments of randomized trials and assessments of NRSI for the post-intervention domains. Nonetheless, other features of randomized trials that are usually not feasible in NRSI, such as blinding of participants, health professionals or outcome assessors, may make NRSI more at risk of post-intervention biases.

As for RoB 2, a free text box alongside the signalling questions and judgements provides space for review authors to present supporting information for each response. Brief, direct quotations from the text of the study report should be used whenever possible.

The tool includes an optional component to judge the direction of the bias for each domain and overall. For some domains, the bias is most easily thought of as being towards or away from the null. For example, suspicion of selective non-reporting of statistically non-significant results would suggest bias away from the null. However, for other domains (in particular confounding, selection bias and forms of measurement bias such as differential misclassification), the bias needs to be thought of as an increase or decrease in the effect estimate to favour either the experimental intervention or comparator compared with the target trial, rather than towards or away from the null. For example, confounding bias that decreases the effect estimate would be towards the null if the true risk ratio were greater than 1, and away from the null if the risk ratio were less than 1. If review authors do not have a clear rationale for judging the likely direction of the bias, they should not attempt to guess it and should leave this response blank.
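The relationship between the two ways of describing direction can be made concrete with a short sketch. The function below is a hypothetical helper, not part of the ROBINS-I tool: it compares the distance of the true and biased risk ratios from 1 on the log scale. It assumes both risk ratios are positive and classifies purely by distance from the null, so a biased estimate that crosses to the other side of 1 is not treated specially.

```python
import math


def direction_relative_to_null(true_rr, biased_rr):
    """Describe a bias as towards or away from the null (risk ratio of 1).

    Compares |log(RR)| for the true and biased risk ratios. Hypothetical
    helper for illustration only.
    """
    if true_rr <= 0 or biased_rr <= 0:
        raise ValueError("risk ratios must be positive")
    true_distance = abs(math.log(true_rr))
    biased_distance = abs(math.log(biased_rr))
    if math.isclose(true_distance, biased_distance):
        return "no change in distance from the null"
    if biased_distance < true_distance:
        return "towards the null"
    return "away from the null"


# The same kind of bias (a decrease in the effect estimate) in two settings:
harmful = direction_relative_to_null(true_rr=2.0, biased_rr=1.5)
protective = direction_relative_to_null(true_rr=0.5, biased_rr=0.4)
print(harmful)
print(protective)
```

Both calls describe a bias that decreases the estimate, yet the first is towards the null and the second away from it. This is why, for domains such as confounding, it is clearer to record the direction as favouring the experimental intervention or the comparator.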

Table 25.3.a Bias domains included in the ROBINS-I tool

Bias domain	Category of bias	Explanation
Pre-intervention domains
Bias due to confounding	Confounding	Baseline confounding occurs when one or more prognostic variables (factors that predict the outcome of interest) also predicts the intervention received at baseline. ROBINS-I can also address time-varying confounding, which occurs when post-baseline prognostic factors affect the intervention received after baseline.
Bias in selection of participants into the study	Selection bias	When exclusion of some eligible participants, or the initial follow-up time of some participants, or some outcome events, is related to both intervention and outcome, there will be an association between interventions and outcome even if the effect of interest is truly null. This type of bias is distinct from confounding. A specific example is bias due to the inclusion of prevalent users, rather than new users, of an intervention.
At-intervention domain
Bias in classification of interventions	Information bias	Bias introduced by either differential or non-differential misclassification of intervention status. Non-differential misclassification is unrelated to the outcome and will usually bias the estimated effect of intervention towards the null. Differential misclassification occurs when misclassification of intervention status is related to the outcome or the risk of the outcome.
Post-intervention domains
Bias due to deviations from intended interventions	Confounding	Bias that arises when there are systematic differences between experimental intervention and comparator groups in the care provided, which represent a deviation from the intended intervention(s). Assessment of bias in this domain will depend on the effect of interest (either the effect of assignment to intervention or the effect of adhering to intervention).
Bias due to missing data	Selection bias	Bias that arises when later follow-up is missing for individuals initially included and followed (e.g. differential loss to follow-up that is affected by prognostic factors); bias due to exclusion of individuals with missing information about intervention status or other variables such as confounders.
Bias in measurement of the outcome	Information bias	Bias introduced by either differential or non-differential errors in measurement of outcome data. Such bias can arise when outcome assessors are aware of intervention status, if different methods are used to assess outcomes in different intervention groups, or if measurement errors are related to intervention status or effects.
Bias in selection of the reported result	Reporting bias	Selective reporting of results from among multiple measurements of the outcome, analyses or subgroups in a way that depends on the findings.

Table 25.3.b Reaching a risk-of-bias judgement for an individual bias domain

Risk-of-bias judgement	Interpretation
Low risk of bias	The study is comparable to a well-performed randomized trial with regard to this domain.
Moderate risk of bias	The study is sound for a non-randomized study with regard to this domain but cannot be considered comparable to a well-performed randomized trial.
Serious risk of bias	The study has some important problems in this domain.
Critical risk of bias	The study is too problematic in this domain to provide any useful evidence on the effects of intervention.
No information	No information on which to base a judgement about risk of bias for this domain.

25.3.5 Reaching an overall risk-of-bias judgement for a result

The response options for an overall risk-of-bias judgement for a result, across all domains, are the same as for individual domains. Table 25.3.c shows the approach to mapping risk-of-bias judgements within domains to an overall judgement for the outcome.

Judging a result to be at a particular level of risk of bias for an individual domain implies that the result has an overall risk of bias at least this severe. For example, a judgement of ‘Serious’ risk of bias within any domain implies that the concerns identified have serious implications for the result overall, irrespective of which domain is being assessed. In practice this means that if the answers to the signalling questions yield a proposed judgement of ‘Serious’ or ‘Critical’ risk of bias, review authors should consider whether any identified problems are of sufficient concern to warrant this judgement for that result overall. If this is not the case, the appropriate action would be to retain the answers to the signalling questions but override the proposed default judgement and provide justification.

‘Moderate’ risk of bias in multiple domains may lead review authors to decide on an overall judgement of ‘Serious’ risk of bias for that outcome or group of outcomes, and ‘Serious’ risk of bias in multiple domains may lead review authors to decide on an overall judgement of ‘Critical’ risk of bias.

Once an overall judgement has been reached for an individual study result, this information should be presented in the review and reflected in the analysis and conclusions. For discussion of the presentation of risk-of-bias assessments and how they can be incorporated into analyses, see Chapter 7. Risk-of-bias assessments also feed into one domain of the GRADE approach for assessing certainty of a body of evidence, as discussed in Chapter 14.

Table 25.3.c Reaching an overall risk-of-bias judgement for a specific outcome

Overall risk-of-bias judgement	Interpretation	Criterion
Low risk of bias	The study is comparable to a well-performed randomized trial.	The study is judged to be at low risk of bias for all domains for this result.
Moderate risk of bias	The study appears to provide sound evidence for a non-randomized study but cannot be considered comparable to a well-performed randomized trial.	The study is judged to be at low or moderate risk of bias for all domains.
Serious risk of bias	The study has one or more important problems.	The study is judged to be at serious risk of bias in at least one domain, but not at critical risk of bias in any domain.
Critical risk of bias	The study is too problematic to provide any useful evidence and should not be included in any synthesis.	The study is judged to be at critical risk of bias in at least one domain.
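
The criteria in Table 25.3.c amount to taking the most severe domain-level judgement as the default overall judgement. The following is a minimal sketch of that mapping, not part of the ROBINS-I tool itself: the names `Risk` and `default_overall_judgement` are hypothetical, and ‘No information’ judgements are assumed to have been resolved beforehand. Review authors may still raise the overall judgement above this default when several domains are at ‘Moderate’ or ‘Serious’ risk of bias, as described in the text above.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Risk-of-bias judgement, ordered by severity (hypothetical helper)."""
    LOW = 1
    MODERATE = 2
    SERIOUS = 3
    CRITICAL = 4


def default_overall_judgement(domain_judgements):
    """Return the default overall judgement for one result.

    domain_judgements maps a bias domain name to a Risk value.
    The default is the most severe domain-level judgement; it does not
    model the discretionary escalation that review authors may apply.
    """
    if not domain_judgements:
        raise ValueError("at least one domain judgement is required")
    return max(domain_judgements.values())


# Illustrative (made-up) assessment of a single study result.
assessment = {
    "confounding": Risk.MODERATE,
    "selection of participants": Risk.LOW,
    "classification of interventions": Risk.LOW,
    "deviations from intended interventions": Risk.LOW,
    "missing data": Risk.SERIOUS,
    "measurement of the outcome": Risk.LOW,
    "selection of the reported result": Risk.MODERATE,
}
print(default_overall_judgement(assessment).name)
```

Because one ‘Serious’ domain is present and no domain is ‘Critical’, the default for this made-up assessment is ‘Serious’.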

25.4 Risk of bias in follow-up (cohort) studies

As discussed in Chapter 24 (Section 24.2), labels such as ‘cohort study’ can be inconsistently applied and encompass many specific study designs. For this reason, these terms are generally discouraged in Cochrane Reviews in favour of using specific features to describe how the study was designed and analysed. For the purposes of ROBINS-I, we define a category of studies, which we refer to as follow-up studies, in which participants are followed up from the start of intervention up to a later time for ascertainment of outcomes of interest. This includes inception cohort studies (in which participants are identified at the start of intervention), non-randomized controlled trials, many analyses of routine healthcare databases, and retrospective cohort studies.

The issues covered by ROBINS-I for follow-up studies are summarized in Table 25.4.a. A distinctive feature of a ROBINS-I assessment of follow-up studies is that it addresses both baseline confounding (the most familiar type) and time-varying confounding. Baseline confounding occurs when one or more pre-intervention prognostic factors predict the intervention received at start of follow-up. A pre-intervention variable is one that is measured before the start of interventions of interest. For example, a cohort study comparing two antiretroviral drug regimens for HIV should control for CD4 cell count measured before the start of antiretroviral therapy, because this is strongly prognostic for the outcomes AIDS and death, and is also likely to influence choice of regimen. Baseline confounding is likely to be an issue in most NRSI.

In some NRSI, particularly those based on routinely collected data, participants switch between the interventions being compared over time, and the follow-up time from these individuals is divided between the intervention groups according to the intervention received at any point in time. If post-baseline prognostic factors affect the interventions to which the participants switch, then this can lead to time-varying confounding. For example, suppose a study of patients treated for HIV partitions follow-up time into periods during which patients were receiving different antiretroviral regimens and compares outcomes during these periods in the analysis. Post-baseline CD4 cell counts might influence switches between the regimens of interest. When such post-baseline prognostic variables are affected by the interventions themselves (e.g. antiretroviral regimen may influence post-baseline CD4 count), we say that there is treatment-confounder feedback. This implies that conventional adjustment (e.g. Poisson or Cox regression models) is not appropriate as a means of controlling for time-varying confounding. Other post-baseline prognostic factors, such as adverse effects of an intervention, may also predict switches between interventions.
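
To make the partitioning of follow-up time concrete, the sketch below sums person-time and events by the regimen being received during each period. The data layout, the regimen labels ‘A’ and ‘B’ and the function name `person_time_by_regimen` are all hypothetical. The sketch only shows how follow-up time is divided between intervention groups; the resulting crude totals do nothing to address the time-varying confounding described above.

```python
from collections import defaultdict


def person_time_by_regimen(episodes):
    """Sum follow-up days and outcome events for each regimen.

    episodes: iterable of (patient_id, regimen, start_day, end_day, event)
    tuples, where event is True if the outcome occurred at end_day.
    A patient who switches regimen contributes one episode per regimen.
    """
    totals = defaultdict(lambda: {"days": 0, "events": 0})
    for _patient, regimen, start, end, event in episodes:
        if end <= start:
            raise ValueError("each episode must have end_day > start_day")
        totals[regimen]["days"] += end - start
        totals[regimen]["events"] += int(event)
    return dict(totals)


# Made-up follow-up histories: p1 and p3 switch between regimens.
episodes = [
    ("p1", "A", 0, 200, False),
    ("p1", "B", 200, 365, True),
    ("p2", "A", 0, 365, False),
    ("p3", "B", 0, 100, False),
    ("p3", "A", 100, 300, False),
]
totals = person_time_by_regimen(episodes)
for regimen, t in sorted(totals.items()):
    print(regimen, t["days"], "days,", t["events"], "event(s)")
```

If the switches made by p1 and p3 were prompted by post-baseline CD4 counts, a comparison of these totals would be subject to time-varying confounding.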

Note that a change from the baseline intervention may result in switching to an intervention other than the alternative of interest in the study (i.e. from experimental intervention to something other than the comparator intervention, or from comparator intervention to something other than the experimental intervention). If follow-up time is re-allocated to the alternative intervention in the analysis that produced the result being assessed for risk of bias, then there is a potential for bias arising from time-varying confounding. If follow-up time was not allocated to the alternative intervention, then the potential for bias is considered either (i) under the domain ‘Bias due to deviations from intended interventions’ if interest is in the effect of adhering to intervention and the follow-up time on the subsequent intervention is included in the analysis, or (ii) under ‘Bias due to missing data’ if the follow-up time on the subsequent intervention is excluded from the analysis.

Table 25.4.a Bias domains included in the ROBINS-I tool for follow-up studies, with a summary of the issues addressed

Bias domain	Issues addressed*
Bias due to confounding	Whether: the review author should consider baseline confounding only, or both baseline confounding and time-varying confounding (arising in studies in which follow-up time is split according to the intervention being received); all important confounding domains were controlled for; the confounding domains were measured validly and reliably by the variables available; and appropriate analysis methods were used to control for the confounding.
Bias in selection of participants into the study	Whether: selection of participants into the study (or into the analysis) was based on participant characteristics observed after the start of intervention; (if applicable) these characteristics were associated with intervention and influenced by outcome (or a cause of the outcome); start of follow-up and start of intervention were the same; and (if applicable) adjustment techniques were used to correct for the presence of selection biases.
Bias in classification of interventions	Whether: intervention status was classified correctly for all (or nearly all) participants; information used to classify intervention groups was recorded at the start of the intervention; and classification of intervention status could have been influenced by knowledge of the outcome or risk of the outcome.
Bias due to deviations from intended interventions	When the review authors’ interest is in the effect of assignment to intervention (see Section 25.3.3): Whether: there were deviations from the intended intervention because of the experimental context (i.e. deviations that do not reflect usual practice); and, if so, whether they were balanced between groups and likely to have affected the outcome. When the review authors’ interest is in the effect of adhering to intervention (see Section 25.3.3): Whether: important co-interventions were balanced across intervention groups; failures in implementing the intervention could have affected the outcome and were unbalanced across intervention groups; study participants adhered to the assigned intervention regimen and if not whether non-adherence was unbalanced across intervention groups; and (if applicable) an appropriate analysis was used to estimate the effect of adhering to the intervention.
Bias due to missing data	Whether: the number of participants omitted from the analysis due to missing outcome data was small; the number of participants omitted from the analysis due to missing data on intervention status was small; the number of participants omitted from the analysis due to missing data on other variables needed for the analysis was small; (if applicable) there was evidence that the result was not biased by missing outcome data; and (if applicable) missingness in the outcome was likely to depend on the true value of the outcome (e.g. because of different proportions of missing outcome data, or different reasons for missing outcome data, between intervention groups).
Bias in measurement of the outcome	Whether: the method of measuring the outcome was inappropriate; measurement or ascertainment of the outcome could have differed between intervention groups; outcome assessors were aware of the intervention received by study participants; and (if applicable) assessment of the outcome could have been influenced by knowledge of the intervention received, and whether this was likely.
Bias in selection of the reported result	Whether: the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple outcome measurements within the outcome domain; the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple analyses of the data; and the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple subgroups of a larger cohort.
* For the precise wording of signalling questions and guidance for answering each one, see the full ROBINS-I tool at www.riskofbias.info.

25.5 Risk of bias in uncontrolled before-after studies (including interrupted time series)

In some studies measurements of the outcome variable are made both before and after an intervention takes place. The measurements may be made on individuals, clusters of individuals, or administrative entities according to the unit of analysis of the study. There may be only one unit, several units or many units. Here, we consider only uncontrolled studies in which all units contributing to the analysis received the (same) intervention. Controlled versions of these studies are covered in Section 25.6.

This category of studies includes interrupted time series (ITS) studies (Kontopantelis et al 2015, Polus et al 2017). ITS studies collect longitudinal data measured at an aggregate level (across participants within one or more units), with several measurement times before implementation of the intervention, and several measurement times after implementation of the intervention. These studies might be characterized as uncontrolled, repeated cross-sectional designs, where the population of interest may be defined geographically or through interaction with a health service, and measures of activity or outcomes may include different individuals at each time point. A specific time point known as the ‘interruption’ defines the distinction between ‘before’ (or ‘pre-intervention’) and ‘after’ (or ‘post-intervention’) time points. Specifying the exact time of this interruption can be challenging, especially when an intervention has many phases or when periods of preparation of the intervention may result in progressive changes in outcomes (e.g. when there are debates and processes leading to a new law or policy). The data from an ITS are typically a single time series, and may be analysed using time series methods (e.g. ARIMA models). In an ITS analysis, the ‘comparator group’ is constructed by making assumptions about the trajectory of outcomes had there been no intervention (or interruption), based on patterns observed before the intervention. The intervention effect is estimated by comparing the observed outcome trajectory after intervention with the assumed trajectory had there been no intervention.
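
The idea of constructing the ‘comparator’ from pre-intervention patterns can be sketched as follows. This is a deliberately simplified illustration that assumes a linear pre-intervention trend fitted by ordinary least squares; it is not the ARIMA approach mentioned above, and it ignores autocorrelation and seasonality. The function names and the data are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of value = intercept + slope * time."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least two pre-intervention time points")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return mean_v - slope * mean_t, slope


def its_differences(times, values, interruption):
    """Observed minus extrapolated outcome at each post-interruption time.

    The pre-interruption trend is extrapolated forward as the assumed
    trajectory had there been no intervention.
    """
    pre = [(t, v) for t, v in zip(times, values) if t < interruption]
    post = [(t, v) for t, v in zip(times, values) if t >= interruption]
    intercept, slope = fit_line([t for t, _ in pre], [v for _, v in pre])
    return [(t, v - (intercept + slope * t)) for t, v in post]


# Made-up aggregate outcome at eight time points; interruption at time 4.
times = list(range(8))
values = [10, 12, 14, 16, 13, 15, 17, 19]
for t, diff in its_differences(times, values, interruption=4):
    print(t, round(diff, 2))
```

Rerunning the sketch with a different `interruption` value changes the estimated differences, which is one way to see why the choice of interruption point matters.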

The category also includes studies in which multiple individuals are each measured before and after receiving an intervention: there may be several pre- and post-intervention measurements. These studies might be characterized as uncontrolled, longitudinal designs (alternatively they may be referred to as repeated measures studies, before-after studies, pre-post studies or reflexive control studies). One special case is a study with a single pre-intervention outcome measurement and a single post-intervention outcome measurement for each of multiple participants. Such a study will usually be judged to be at serious or critical risk of bias because it is impossible to determine whether pre-post changes are due to the intervention rather than other factors.

The main issues addressed in a ROBINS-I evaluation of an uncontrolled before-after study are summarized below and in Table 25.5.a. We address issues only for the effect of assignment to intervention, since we do not expect uncontrolled before-after studies to examine the effect of starting and adhering to the intended intervention.

There is a possibility that extraneous events or changes in context occur around the time at which the intervention is introduced. Bias will be introduced if these external forces influence the outcome. This issue is addressed under the first domain of ROBINS-I (‘Bias due to confounding’).
There should be sufficient data to extrapolate from outcomes before the intervention into the future. ‘Sufficient’ means enough time points, over a sufficient period of time, to characterize trends and patterns. This issue is also addressed under ‘Bias due to confounding’.
ITS analyses require specification of a time point (the ‘interruption’) before which there was no intervention (the pre-intervention period) and after which there has been an intervention (the post-intervention period). However, interventions do not happen instantaneously, so this time point may be before, or after, some important features of the intervention were implemented. The time point could be selected to maximize the apparent effect: this issue is covered primarily in the domain ‘Bias in classification of interventions’ but is also relevant to ‘Bias in selection of the reported result’, since researchers could conduct analyses with different interruption points and report the one that maximizes the support for their hypothesis.
The interruption time point might be before important features of the intervention have been implemented, so that there is a delay before the intervention is fully effective. Such lagging of effects should not be regarded as bias, but is rather an issue of applicability of some of the measurement times. Lagging effects can be accommodated in analyses if sufficient post-intervention measurements are available, for example by excluding data from a phase-in period of the intervention.
The interruption time point might be after important features of the intervention have been implemented: for example, if anticipation of a policy change alters people’s behaviour so that there is early impact of the intervention before its main implementation. Such effects will attenuate differences between pre- and post-intervention outcomes. We address this issue as a type of contamination of the pre-intervention period by aspects of the intervention and consider it under ‘Bias due to deviations from intended interventions’.
Changes in administrative procedures related to collection of outcome data (e.g. bookkeeping, changes to success criteria) may coincide with the intervention. This is addressed under ‘Bias in measurement of the outcome’. Further outcome measurement issues include ‘evaluation apprehension’, for example, when awareness of past responses to questionnaires influences subsequent responses.
The intervention might cause attrition from the framework or system used to measure outcomes. This is a bias due to selection out of the study, and is addressed in the domain ‘Bias due to missing data’.

Table 25.5.a Bias domains included in the ROBINS-I tool for (uncontrolled) before-after studies, with a summary of the issues addressed

Bias domain	Additional or different issues addressed compared with follow-up studies*
Bias due to confounding	Whether: measurements of outcomes were made at sufficient pre-intervention time points to permit characterization of pre-intervention trends and patterns; there are extraneous events or changes in context around the time of the intervention that could have influenced the outcome; and the study authors used an appropriate analysis method that accounts for time trends and patterns, and controls for all the important confounding domains.
Bias in selection of participants into the study	The issues are similar to those for follow-up studies. For studies that prospectively follow a specific group of units from pre-intervention to post-intervention, selection bias is unlikely. For repeated cross-sectional surveys of a population, there is the potential for selection bias even if the study is prospective.
Bias in classification of interventions	Whether specification of the distinction between pre-intervention time points and post-intervention time points could have been influenced by the outcome data.
Bias due to deviations from intended interventions	Assuming the review authors’ interest is in the effect of assignment to intervention (see Section 25.3.3): Whether the effects of any preparatory (pre-interruption) phases of the intervention were appropriately accounted for.
Bias due to missing data	Whether outcome data were missing for whole clusters (units of multiple individuals) as well as for individual participants.
Bias in measurement of the outcome	Whether: methods of outcome assessment were comparable before and after the intervention; and there were changes in systematic errors in measurement of the outcome coincident with implementation of the intervention.
Bias in selection of the reported result	The issues are the same as for follow-up studies.
* For the precise wording of signalling questions and guidance for answering each one, see the full ROBINS-I tool at www.riskofbias.info.

25.6 Risk of bias in controlled before-after studies

The term ‘controlled before-after study’ (CBA) is often used for studies in which: (i) units are non-randomly allocated to a group that receives an intervention or to an alternative group that receives nothing or a comparator intervention; and (ii) at least one measurement of the outcome variable is made in both groups before and after implementation of the intervention (Eccles et al 2003, Polus et al 2017). The comparator group(s) may be contemporaneous or not. This category also includes controlled interrupted time series (CITSs) (Lopez Bernal et al 2018). The units included in the study may be individuals, clusters of individuals, or administrative units. The intervention may be at the level of the individual unit or at some aggregate (cluster) level. Studies may follow the same units over time (sometimes referred to as within-person or within-unit longitudinal designs) or look at (possibly) different units at the different time points (sometimes referred to as repeated cross-sectional designs, where the population of interest may be defined geographically or through interaction with a health service, and may include different individuals over time).

A common analysis of CBA studies is a ‘difference in differences’ analysis, in which before-after differences in the outcome (possibly averaged over multiple units) are contrasted between the intervention and comparator groups. The outcome measurements before and after intervention may be single observations, means, or measures of trend or pattern. The assumption underlying such an analysis is that the before-after change in the intervention group is equivalent to the before-after change in the comparator group, except for any causal effects of the intervention; that is, that the pre-post intervention difference in the comparator group reflects what would have happened in the intervention group had the intervention not taken place.
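
The arithmetic of a ‘difference in differences’ analysis can be sketched as follows, taking a single summary value (for example a mean) per group and period. The function name and the numbers are hypothetical, and the estimate can be interpreted as an intervention effect only under the assumption stated above.

```python
def difference_in_differences(intervention_pre, intervention_post,
                              comparator_pre, comparator_post):
    """Before-after change in the intervention group minus that in the comparator group."""
    change_intervention = intervention_post - intervention_pre
    change_comparator = comparator_post - comparator_pre
    return change_intervention - change_comparator


# Made-up mean outcomes: the intervention group falls from 20 to 14,
# the comparator group from 21 to 19.
estimate = difference_in_differences(20.0, 14.0, 21.0, 19.0)
print(estimate)  # -4.0
```

Here the comparator group's change of -2 is taken as what would have happened without the intervention, so only the remaining -4 of the intervention group's change of -6 is attributed to the intervention.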

The main issues addressed in a ROBINS-I evaluation of a controlled before-after study are summarized below and in Table 25.6.a.

The occurrence of extraneous events around the time of intervention may differ between the intervention and comparator groups. This is addressed under ‘Bias due to confounding’.
Trends and patterns of the outcome over time may differ between the intervention and comparator groups. The plausibility of this threat to validity can be assessed if more than one pre-intervention measurement of the outcome is available: the more measurements, the better the pre-intervention trends can be modelled and compared between groups. This issue is also addressed under ‘Bias due to confounding’.
If the definition of the intervention and comparator groups depends on pre-intervention outcome measurements (e.g. if individuals with high values are selected for intervention and those with low values for the comparator), regression to the mean may be confused with a treatment effect. The plausibility of this threat can be assessed by having more than one pre-intervention measurement. This is addressed under ‘Bias due to confounding’.
There is a risk of selection bias in repeated cross-sectional surveys if the types of participants/units included in repeated surveys change over time, and such changes differ between intervention and comparator groups. Changes might occur contemporaneously with the intervention if it causes (or requires) attrition from the measurement framework. These issues are addressed under ‘Bias in selection of participants into the study’ and ‘Bias due to missing data’.
Outcome measurement methods might change between pre- and post-intervention periods. Such a change may complicate analyses if it occurs in the intervention and comparator groups at the same time, but it is a threat to validity if it differs between them. This is addressed under ‘Bias in measurement of the outcome’.
Poor specification of the time point before which there was no intervention and after which there has been an intervention may introduce bias. This is addressed under ‘Bias in classification of interventions’.
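
The regression-to-the-mean issue listed above can be illustrated with a small simulation in which no intervention is given at all. It is a sketch under assumed values: the distributions, the selection threshold and the function name are hypothetical. Units are selected because their first noisy measurement is high, and their second measurement is then compared with the first.

```python
import random
from statistics import mean


def simulate_regression_to_mean(n=5000, threshold=60.0, seed=1):
    """Return mean first and second measurements among units selected on the first.

    Each unit has a stable true level plus independent measurement error,
    and nothing changes between the two measurements.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n):
        true_value = rng.gauss(50.0, 10.0)      # stable underlying level
        m1 = true_value + rng.gauss(0.0, 10.0)  # first noisy measurement
        m2 = true_value + rng.gauss(0.0, 10.0)  # second noisy measurement
        if m1 > threshold:                      # 'selected for intervention'
            first.append(m1)
            second.append(m2)
    return mean(first), mean(second)


mean_first, mean_second = simulate_regression_to_mean()
print(f"selected units: first = {mean_first:.1f}, second = {mean_second:.1f}")
```

The selected units' second measurement is lower on average than their first even though nothing was done to them, which is the pattern that could be mistaken for a treatment effect.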

Table 25.6.a Bias domains included in the ROBINS-I tool for controlled before-after studies, with a summary of the issues addressed

Bias domain	Additional or different issues addressed compared with follow-up studies*
Bias due to confounding	Whether: measurements of outcomes were made at sufficiently many time points, in both the intervention and comparator groups, to permit characterization of pre-intervention trends and patterns; any extraneous events or changes in context around the time of the intervention that could have influenced the outcome were experienced equally by both intervention groups; and pre-intervention trends and patterns in outcomes were analysed appropriately and found to be similar across the intervention and comparator groups.
Bias in selection of participants into the study	The issues are similar to those for follow-up studies. For repeated cross-sectional surveys of a population, there is the potential for selection bias if changes in the types of participants/units included in repeated surveys differ between intervention and comparator groups.
Bias in classification of interventions	Whether classification of time points as before versus after intervention could have been influenced by post-intervention outcome data.
Bias due to deviations from intended interventions	Assuming the review authors’ interest is in the effect of assignment to intervention (see Section 25.3.3): The issues are the same as for follow-up studies.
Bias due to missing data	Whether outcome data were missing for whole clusters as well as for individual participants.
Bias in measurement of the outcome	Whether: methods of outcome assessment were comparable across intervention groups and before and after the intervention; and there were changes in systematic errors in measurement of the outcome coincident with implementation of the intervention.
Bias in selection of the reported result	The issues are the same as for follow-up studies.
* For the precise wording of signalling questions and guidance for answering each one, see the full ROBINS-I tool at www.riskofbias.info.

25.7 Chapter information

Authors: Jonathan AC Sterne, Miguel A Hernán, Alexandra McAleenan, Barnaby C Reeves, Julian PT Higgins

Acknowledgements: ROBINS-I was developed by a large collaborative group, and we acknowledge the contributions of Jelena Savović, Nancy Berkman, Meera Viswanathan, David Henry, Douglas Altman, Mohammed Ansari, Rebecca Armstrong, Isabelle Boutron, Iain Buchan, James Carpenter, An-Wen Chan, Rachel Churchill, Jonathan Deeks, Roy Elbers, Atle Fretheim, Jeremy Grimshaw, Asbjørn Hróbjartsson, Jemma Hudson, Jamie Kirkham, Evan Kontopantelis, Peter Jüni, Yoon Loke, Luke McGuinness, Jo McKenzie, Laurence Moore, Matt Page, Theresa Pigott, Stephanie Polus, Craig Ramsay, Deborah Regidor, Eva Rehfuess, Hannah Rothstein, Lakhbir Sandhu, Pasqualina Santaguida, Holger Schünemann, Beverley Shea, Sasha Shepperd, Ian Shrier, Hilary Thomson, Peter Tugwell, Lucy Turner, Jeffrey Valentine, Hugh Waddington, Elizabeth Waters, George Wells, Penny Whiting and David Wilson.

Funding: Development of ROBINS-I was funded by a Methods Innovation Fund grant from Cochrane and by Medical Research Council (MRC) grant MR/M025209/1. JACS, BCR and JPTH are members of the National Institute for Health Research (NIHR) Biomedical Research Centre at University Hospitals Bristol NHS Foundation Trust and the University of Bristol, the NIHR Collaboration for Leadership in Applied Health Research and Care West (CLAHRC West) at University Hospitals Bristol NHS Foundation Trust, and the MRC Integrative Epidemiology Unit at the University of Bristol. JACS and JPTH received funding from NIHR Senior Investigator awards NF-SI-0611-10168 and NF-SI-0617-10145, respectively. JPTH and AM are funded in part by Cancer Research UK (grant C18281/A19169). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health, the MRC or Cancer Research UK.

25.8 References

Eccles M, Grimshaw J, Campbell M, Ramsay C. Research designs for studies evaluating the effectiveness of change and improvement strategies. Quality and Safety in Health Care 2003; 12: 47–52.

Hernán MA, Hernandez-Diaz S, Werler MM, Mitchell AA. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. American Journal of Epidemiology 2002; 155: 176–184.

Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology 2016; 183: 758–764.

Institute of Medicine. Ethical and Scientific Issues in Studying the Safety of Approved Drugs. Washington (DC): The National Academies Press; 2012.

Kontopantelis E, Doran T, Springate DA, Buchan I, Reeves D. Regression based quasi-experimental approach when randomisation is not an option: interrupted time series analysis. BMJ 2015; 350: h2750.

Lopez Bernal J, Cummins S, Gasparrini A. The use of controls in interrupted time series studies of public health interventions. International Journal of Epidemiology 2018; 47: 2082–2093.

Polus S, Pieper D, Burns J, Fretheim A, Ramsay C, Higgins JPT, Mathes T, Pfadenhauer LM, Rehfuess EA. Heterogeneity in application, design, and analysis characteristics was found for controlled before-after and interrupted time series studies included in Cochrane reviews. Journal of Clinical Epidemiology 2017; 91: 56–69.

Schünemann HJ, Tugwell P, Reeves BC, Akl EA, Santesso N, Spencer FA, Shea B, Wells G, Helfand M. Non-randomized studies as a source of complementary, sequential or replacement evidence for randomized controlled trials in systematic reviews on the effects of interventions. Research Synthesis Methods 2013; 4: 49–62.

Sterne JAC, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A, Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L, Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC, Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing risk of bias in non-randomized studies of interventions. BMJ 2016; 355: i4919.

Velie EM, Shaw GM. Impact of prenatal diagnosis and elective termination on prevalence and risk estimates of neural tube defects in California, 1989–1991. American Journal of Epidemiology 1996; 144: 473–479.