Chapter 18: Patient-reported outcomes

Bradley C Johnston, Donald L Patrick, Tahira Devji, Lara J Maxwell, Clifton O Bingham III, Dorcas E Beaton, Maarten Boers, Matthias Briel, Jason W Busse, Alonso Carrasco-Labra, Robin Christensen, Bruno R da Costa, Regina El Dib, Anne Lyddiatt, Raymond W Ostelo, Beverley Shea, Jasvinder Singh, Caroline B Terwee, Paula R Williamson, Joel J Gagnier, Peter Tugwell, Gordon H Guyatt

Key Points:

Summary data on patient-reported outcomes (PROs) are important to ensure healthcare decision makers are informed about the outcomes most meaningful to patients.
Authors of systematic reviews that include PROs should have a good understanding of how patient-reported outcome measures (PROMs) are developed, including the constructs they are intended to measure, their reliability, validity and responsiveness.
Authors should pre-specify at the protocol stage a hierarchy of preferred PROMs to measure the outcomes of interest.

Cite this chapter as: Johnston BC, Patrick DL, Devji T, Maxwell LJ, Bingham III CO, Beaton D, Boers M, Briel M, Busse JW, Carrasco-Labra A, Christensen R, da Costa BR, El Dib R, Lyddiatt A, Ostelo RW, Shea B, Singh J, Terwee CB, Williamson PR, Gagnier JJ, Tugwell P, Guyatt GH. Chapter 18: Patient-reported outcomes [last updated October 2019]. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.5. Cochrane, 2024. Available from cochrane.org/handbook.

18.1 Introduction to patient-reported outcomes

18.1.1 What are patient-reported outcomes?

A patient-reported outcome (PRO) is “any report of the status of a patient’s health condition that comes directly from the patient without interpretation of the patient’s response by a clinician or anyone else” (FDA 2009). PROs are one of several clinical outcome assessment methods that complement biomarkers, measures of morbidity (e.g. stroke, myocardial infarction), burden (e.g. hospitalization), and survival used and reported in clinical trials and non-randomized studies (FDA 2018).

Patient-reported outcome measures (PROMs) are instruments that are used to measure the PROs, most often self-report questionnaires. Although investigators may address patient-relevant outcomes via proxy reports or observations from caregivers, health professionals, or parents and guardians, these are not PROMs but rather clinician-reported or observer-reported outcomes (Powers et al 2017).

PROs provide crucial information for patients and clinicians facing choices in health care. Conducting systematic reviews and meta-analyses including PROMs and interpreting their results is not straightforward, and guidance can help review authors address the challenges.

The objectives of this chapter are to: (i) describe the category of outcomes known as PROs and their importance for healthcare decision making; (ii) illustrate the key issues related to reliability, validity and responsiveness that systematic review authors should consider when including PROs; and (iii) address the structure and content (domains, items) of PROs and provide guidance for combining information from different PROs. This chapter outlines a step-by-step approach to addressing each of these elements in the systematic review process. The focus is on the use of PROs in randomized trials, and what is crucial in this context when selecting PROs to include in a meta-analysis. The principles also apply to systematic reviews of non-randomized studies addressing PROs (e.g. dealing with adverse drug reactions).

18.1.2 Why patient-reported outcomes?

PROs provide patients’ perspectives regarding treatment benefit and harm, directly measure treatment benefit and harm beyond survival, major morbid events and biomarkers, and are often the outcomes of most importance to patients and families.

Self-reported outcomes often correlate poorly with physiological and other outcomes such as performance-related outcomes, clinician-reported outcomes, or biomarkers. In asthma, Yohannes and colleagues (Yohannes et al 1998) found that variability in exercise capacity contributed to only 3% of the variability in breathing problems on a patient self-report questionnaire. In chronic obstructive pulmonary disease (COPD), the reported correlations between forced expiratory volume (FEV1) and quality of life (QoL) are weak (r=0.14 to 0.41) (Jones 2001). In peripheral arterial occlusive disease, correlations between haemodynamic variables and QoL are low (e.g. r=–0.17 for QoL pain subscale and Doppler sonographic ankle/brachial pressure index) (Müller-Bühl et al 2003). In osteoarthritis, there is discordance between radiographic arthritis and patient-reported pain (Hannan et al 2000). These findings emphasize the often important limitations of biomarkers for informing the impact of interventions on the patient experience or the patient’s perspective of disease (Bucher et al 2014).

PROs are essential when externally observable patient-important outcomes are rare or unavailable. They provide the only reasonable strategy for evaluating the impact of treatment on many conditions, including pain syndromes, fatigue, disorders such as irritable bowel syndrome, sexual dysfunction, and emotional function, and on adverse effects such as nausea and anxiety, for which physiological measurements are limited or unavailable.

18.2 Formulation of the review

In this section we describe PROMs in more detail and discuss some issues to consider when deciding which PROMs to address in a review.

A common term used in the health status measurement literature is construct. Construct refers to what PROMs are trying to measure, the concept that defines the PROM such as pain, physical function or depressive mood. Constructs are the postulated attributes of the person that investigators hope to capture with the PROM (Cronbach and Meehl 1955).

Many different ways exist to label and classify PROMs and the constructs they measure. For instance, reports from patients include signs (observable manifestations of a condition), sensations (most commonly classified as symptoms that may be attributable to disease and/or treatment), behaviours and abilities (commonly classified as functional status), general perceptions or feelings of well-being, general health, satisfaction with treatment, reports of adverse effects, adherence to treatment, and participation in social or community events and health-related quality of life (HRQoL).

Investigators can use different approaches to capture patient perspectives, including interviews, self-completed questionnaires, diaries, and via different interfaces such as hand-held devices or computers. Review authors must identify the postulated constructs that are important to patients, and then determine the extent to which the PROMs used and reported in the trials address those constructs, the characteristics (measurement properties) of the PROMs used, and communicate this information to the reader (Calvert et al 2013).

Focusing now on HRQoL, an important PRO, some approaches attempt to cover the full range of health-related patient experience – including, for instance, self-care, and physical, emotional and social function – and thus enable comparisons between the impact of treatments on HRQoL across diseases or conditions. Authors often call these approaches generic instruments (Guyatt et al 1989, Patrick and Deyo 1989). These include utility measures such as the EuroQol five dimensions questionnaire (EQ-5D) or the Health Utilities Index (HUI). They also include health profiles such as the Short Form 36-item (SF-36) or the SF-12; these have come to dominate the field of health profiles (Tarlov et al 1989, Ware et al 1995, Ware et al 1996). An alternative approach to measuring PROs is to focus on much more specific constructs: PROMs may be specific to function (e.g. sleep, sexual function), to a disease (e.g. asthma, heart failure), to a population (e.g. the frail elderly) or to a symptom (e.g. pain, fatigue) (Guyatt et al 1989, Patrick and Deyo 1989). Another domain-specific measurement system now receiving attention is the Patient-Reported Outcomes Measurement Information System (PROMIS). PROMIS is a National Institutes of Health funded PROM programme using computerized adaptive testing from large item banks for over 70 domains (e.g. anxiety, depression, pain, social function) relevant to a wide variety of chronic diseases (Cella et al 2007, Witter 2016, PROMIS 2018).

Authors often use the terms ‘quality of life’, ‘health status’, ‘functional status’, ‘HRQoL’ and ‘well-being’ loosely and interchangeably. Systematic review authors must therefore consider carefully the constructs that the PROMs have actually measured. To do so, they may need to examine the items or questions included in a PROM.

Another issue to consider is whether and how the individual items of instruments are weighted. A number of approaches can be used to arrive at weights (Wainer 1976). Utility instruments designed for economic analysis put greater emphasis on item weighting, attempting ultimately to present HRQoL as a continuum anchored between death and full health. Many PROMs weight items equally in the calculation of the overall score, a reasonable approach. Readers can refer to a helpful overview of classical test theory and item response theory to understand better the merits and limitations of weighting (Cappelleri et al 2014).

Table 18.2.a presents a framework for considering and reporting PROMs in clinical trials, including their constructs and how they were measured. A good understanding of the PROMs identified in the included studies for a review is essential to appropriate analysis of outcomes across studies, and appraisal of the certainty of the evidence.

Table 18.2.a Checklist for describing and assessing PROMs in clinical trials. Adapted from Guyatt et al (1997)

1. What were the PROMs assessing?

1.1. What concepts or constructs were the PROMs used in the study assessing?

1.2. What rationale (if any) for selection of concepts or constructs did the authors provide?

1.3. Were patients involved in the development (e.g. focus groups, surveys) of PROMs?

2. Omissions

2.1. Were there any important aspects of patients’ health (e.g. symptoms, function, perceptions) or quality of life (e.g. overall evaluation, satisfaction with life) that were not reported in this study? A search for ‘Core Outcome Sets’ for the condition would be helpful (see Section 18.4.1).

3. What were the measurement strategies?

3.1. Did investigators use instruments that yield a single indicator or index number, or a profile, or a battery of instruments?

3.2. Did investigators use specific or generic measures, or both?

4. Did the instruments work in the way they were supposed to work – validity?

4.1. Was evidence of prior validation for use in the current population presented?

5. Did the instruments work in the way they were supposed to work – responsiveness?

5.1. Are the PROMs able to detect important change in patient status, even if those changes are small?

6. Can you make the magnitude of effect (if any) understandable to readers – interpretability?

6.1. If the intervention has had an apparent impact on a PROM, can you provide users with a sense of whether that effect is trivial, small but important, moderate, or large?

18.3 Appraisal of evidence

18.3.1 Measurement of PROs: single versus multiple time-points

To be useful, instruments must be able to distinguish between situations of interest (Boers et al 1998). When results are available for only one time-point (e.g. for classification), the key issue for PROMs is to be able to distinguish individuals with more desirable scores from those whose scores are less desirable. The key measurement issues in such contexts are reliability and cross-sectional construct validity (Kirshner and Guyatt 1985, Beaton et al 2016).

In longitudinal studies such as randomized trials, investigators usually obtain measurements at multiple time-points, for example at the beginning of the trial and again following administration of the interventions. In this context, PROMs must be able to distinguish those who have experienced positive changes over time from those who have experienced negative changes, those who experienced less positive change, or those who experienced no change at all, and to estimate accurately the magnitude of those changes. The key measurement issues in these contexts – sometimes referred to as evaluative – are responsiveness and longitudinal construct validity (Kirshner and Guyatt 1985, Beaton et al 2016).

18.3.2 Reliability

Intuitively, many think of reliability as obtaining the same scores on repeated administration of an instrument in stable respondents. That stability (or lack of measurement error) is important, but not sufficient. Satisfactory instruments must be able to distinguish between individuals despite measurement error.

Reliability statistics therefore look at the ratio of the variability between respondents (typically the numerator of a reliability statistic) to the total variability (the variability between respondents and the variability within respondents). The most commonly used statistics to measure reliability are a kappa coefficient for categorical data, a weighted kappa coefficient for ordered categorical data, and an intraclass correlation coefficient for continuous data (de Vet et al 2011).
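As a rough illustration of the ratio described above, the following sketch estimates an intraclass correlation coefficient from repeated administrations of an instrument in stable respondents. It assumes a one-way random-effects model, which is only one of several ICC variants; the function name `one_way_icc` and the example scores are hypothetical and are not drawn from any study cited in this chapter.

```python
from statistics import mean


def one_way_icc(scores):
    """Estimate a one-way random-effects ICC (an assumed model choice).

    `scores` holds one list per respondent, each containing the same
    number (k >= 2) of repeated measurements.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two respondents")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("each respondent needs the same number (>= 2) of measurements")

    grand_mean = mean([x for row in scores for x in row])
    row_means = [mean(row) for row in scores]

    # Mean squares from a one-way analysis of variance.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    # Variance components: between respondents and within respondents.
    var_between = (ms_between - ms_within) / k
    var_within = ms_within
    total = var_between + var_within
    if total == 0:
        raise ValueError("no variability in the scores")

    # Reliability: between-respondent variability over total variability.
    return var_between / total


# Made-up test-retest scores for four stable respondents, two administrations each.
example_scores = [[2.0, 2.5], [4.0, 4.5], [6.0, 5.5], [3.0, 3.0]]
print(round(one_way_icc(example_scores), 2))
```

A value near 1 indicates that most of the observed variability reflects differences between respondents rather than measurement error within respondents.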

Limitations in reliability will be of most concern for the review author when randomized trials have failed to establish the superiority of an experimental intervention over a comparator intervention. The reason is that lack of reliability cannot create intervention effects that are not present, but can obscure true intervention effects as a result of random error. When a systematic review does not find evidence that an intervention affects a PROM, review authors should consider whether this may be due to poor reliability (e.g. if reliability coefficients are less than 0.7) rather than lack of an effect.

18.3.3 Validity

Validity has to do with whether the instrument is measuring what it is intended to measure. Content validity assessment involves patient and clinician evaluation of the relevance and comprehensiveness of the content contained in the measures, usually obtained through qualitative research with patients and families (Johnston et al 2012). Guidance is available on the assessment of content validity for PROMs used in clinical trials (Patrick et al 2011a, Patrick et al 2011b).

Construct validity involves examining the logical relationships that should exist between assessment measures. For example, in patients with COPD, we would expect that patients with lower treadmill exercise capacity generally will have more dyspnoea (shortness of breath) in daily life than those with higher exercise capacity, and we would expect to see substantial correlations between a new measure of emotional function and existing emotional function questionnaires.

When we are interested in evaluating change over time – that is, in the context of evaluation when measures are available both before and after an intervention – we examine correlations of change scores. For example, patients with COPD who deteriorate in their treadmill exercise capacity should, in general, show increases in dyspnoea, while those whose exercise capacity improves should experience less dyspnoea. Similarly, a new emotional function instrument should show concurrent improvement in patients who improve on existing measures of emotional function. The technical term for this process is testing an instrument’s longitudinal construct validity.

Review authors should look for evidence of the validity of PROMs used in clinical studies. Unfortunately, reports of randomized trials using PROMs seldom review or report evidence of the validity of the instruments they use, but when these are available review authors can gain some reassurance from statements (backed by citations) that the questionnaires have been previously validated, or could seek additional published information on named PROMs. Ideally, review authors should look for systematic reviews of the measurement properties of the instruments in question. The Consensus-based standards for the selection of health measurement instruments (COSMIN) website offers a database of such reviews (COSMIN Database of Systematic Reviews). In addition, the Patient-Reported Outcomes and Quality of Life Instruments Database (PROQOLID) provides documentation of the measurement properties for over 1000 PROs.
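The correlation of change scores used to test longitudinal construct validity can be sketched as follows. The helper names `change_scores` and `pearson_r`, and the patient data, are hypothetical; the example assumes a dyspnoea score on which higher values mean more dyspnoea, so a negative correlation with change in exercise capacity would be the expected pattern.

```python
from math import sqrt


def change_scores(before, after):
    """Return the after-minus-before change for each patient."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    return [a - b for b, a in zip(before, after)]


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Made-up data for five patients: treadmill distance (metres) and a
# dyspnoea score (higher = more dyspnoea), before and after an intervention.
treadmill_before = [300, 320, 280, 350, 310]
treadmill_after = [340, 310, 300, 390, 305]
dyspnoea_before = [5.0, 4.5, 6.0, 4.0, 5.5]
dyspnoea_after = [4.0, 4.8, 5.5, 3.0, 5.6]

r = pearson_r(
    change_scores(treadmill_before, treadmill_after),
    change_scores(dyspnoea_before, dyspnoea_after),
)
print(round(r, 2))
```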

If the validity of the PROMs used in a systematic review remains unclear, review authors should consider whether the PROM is an appropriate measure of the review’s planned outcomes, or whether it should be excluded (ideally, this would be considered at the protocol stage), and any included results should be interpreted with appropriate caution. For instance, in a review of flavonoids for haemorrhoids, authors of primary trials used PROMs to ascertain patients’ experience with pain and bleeding (Alonso-Coello et al 2006). Although the wording of these PROMs was simple and made intuitive sense, the absence of formal validation raises concerns over whether these measures can give meaningful data to distinguish between the intervention and its comparators.

A final concern about validity arises if the measurement instrument is used with a different population, or in a culturally and linguistically different environment from the one in which it was developed. Ideally, PROMs should be re-validated in each study, but systematic review authors should be careful not to be too critical on this basis alone.

18.3.4 Responsiveness

In the evaluative context, randomized trial participant measurements are typically available before and after the intervention. PROMs must therefore be able to distinguish among patients who remain the same, improve or deteriorate over the course of the trial (Guyatt et al 1987, Revicki et al 2008). Authors often refer to this measurement property as responsiveness; alternatives are sensitivity to change or ability to detect change.

As with reliability, responsiveness becomes an issue when a meta-analysis suggests no evidence of a difference between an intervention and control. An instrument with a poor ability to measure change can result in false-negative results, in which the intervention improves how patients feel, yet the instrument fails to detect the improvement. This problem may be particularly salient for generic questionnaires that have the advantage of covering all relevant areas of HRQoL, but the disadvantage of covering each area superficially or without the detail required for the particular context of use (Wiebe et al 2003, Johnston et al 2016a). Thus, in studies that show no difference in PROMs between intervention and control, lack of instrument responsiveness is one possible reason. Review authors should look for published evidence of responsiveness. If there is an absence of prior evidence of responsiveness, this represents a potential reason for being less certain about evidence from a series of randomized trials. For instance, a systematic review of respiratory muscle training in COPD found no effect on patients’ function. However, two of the four studies that assessed a PROM used instruments without established responsiveness (Smith et al 1992).

18.3.5 Reporting bias

Studies focusing on PROs often use a number of PROMs to measure the same or similar constructs. This situation creates a risk of selective outcome reporting bias, in which trial authors select for publication a subset of the PROMs on the basis of the results; that is, those that indicate larger intervention effects or statistically significant P values (Kirkham et al 2010). Further detailed discussion of selective outcome reporting is presented in Chapter 7 (Section 7.2.3.3); see also Chapter 8 (Section 8.7).

Systematic reviews focusing on PROs should be alert to this problem. When only a small number of eligible studies have reported results for a particular PROM, particularly if the PROM is mentioned in a study protocol or methods section, or if it is a salient outcome that one would expect conscientious investigators to measure, review authors should note the possibility of reporting bias and consider rating down certainty in evidence as part of their GRADE assessment (see Chapter 14) (Guyatt et al 2011). For instance, authors of a systematic review evaluating the responsiveness of PROs among patients with rare lysosomal storage diseases encountered eligible studies in which the use of a PRO was described in the methods, but there were either no data or limited PRO data in the results. When authors did present some information about results, the reports sometimes included only interim or end-of-study results. Such instances are likely to be an indication of selective outcome reporting bias: it seems implausible that, if results showed apparent benefit on PROs, investigators would mention a PRO in the methods and subsequently fail to report results (Johnston et al 2016b).

18.4 Synthesis and interpretation of evidence

18.4.1 Selecting from multiple PROMs

The definition of a particular PRO may vary between studies, and this may justify use of different instruments (i.e. different PROMs). Even if the definitions are similar (or if, as happens more commonly, the investigators do not define the PRO), the investigators may choose different instruments to measure the PROs, especially if there is a lack of consensus on which instrument to use (Prinsen et al 2016).

When trials report results for more than one instrument, authors should – independent of knowledge of the results and ideally at the protocol stage – create a hierarchy based on reported measurement properties of PROMs (Tendal et al 2011, Christensen et al 2015), considering a detailed understanding of what each PROM measures (see Table 18.2.a), and its demonstrated reliability, validity, responsiveness and interpretability (see Section 18.3). This will allow authors to decide which instruments will be used for data extraction and synthesis. For example, the following instruments are all validated, patient-reported pain instruments that an investigator may use in a primary study to assess an intervention’s usefulness for treating pain:

7-item Integrated Pain Score;
10-point Visual Analogue Scale for Pain;
20-item McGill Pain Questionnaire; and
56-item Brief Pain Inventory (PROQOLID 2018).

In some clinical fields core outcome sets are available to guide the use of appropriate PROs (COMET 2018). Only rarely do these include specific guidance on which PROMs are preferable, although methods have been proposed for this (Prinsen et al 2016). Within the field of rheumatology, the Outcome Measures in Rheumatology (OMERACT) initiative has developed a conceptual framework known as OMERACT Filter 2.0 to identify both core domain sets (what outcome should be measured) and core outcome measurement sets (how the outcome should be measured, i.e. which PROM to use) (Boers et al 2014). This is a generic framework and applicable to those developing core outcome sets outside the field of rheumatology.

As an example of a pre-defined hierarchy, for knee osteoarthritis, OMERACT has used a published hierarchy based on responsiveness for extraction of PROMs evaluating pain and physical function for performing systematic reviews (Juhl et al 2012).

Authors should decide in advance whether to exclude PROMs not included in the hierarchy, or to include additional measures where none of the preferred measures are available.
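A pre-specified hierarchy of this kind can be applied mechanically at the data extraction stage. The sketch below is only an illustration: the function `select_prom`, the placeholder instrument names and their ordering are hypothetical and do not represent a recommended ranking of any real instruments.

```python
def select_prom(reported, hierarchy):
    """Return the highest-ranked PROM that a trial reported.

    `hierarchy` lists instrument names from most to least preferred and
    should be fixed at the protocol stage, before results are seen.
    Returns None when the trial reported none of the ranked instruments,
    leaving the pre-specified rule for unranked measures to the caller.
    """
    reported_set = set(reported)
    for name in hierarchy:
        if name in reported_set:
            return name
    return None


# Placeholder hierarchy, most preferred first (hypothetical names).
pain_hierarchy = ["PROM A", "PROM B", "PROM C"]

# Instruments reported by three hypothetical trials.
trials = {
    "Trial 1": ["PROM C", "PROM B"],
    "Trial 2": ["PROM A"],
    "Trial 3": ["PROM X"],
}

for trial, reported in trials.items():
    print(trial, select_prom(reported, pain_hierarchy))
```

Returning None rather than silently choosing an unranked instrument keeps the choice between exclusion and inclusion of additional measures an explicit, pre-specified decision.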

18.4.2 Synthesizing data from multiple PROMs

While a hierarchy can be helpful in identifying the review authors’ preferred measures, and excluding some measures considered inappropriate, it remains likely that authors will encounter studies using several different PROMs to measure a given construct, either within one study or across multiple studies. Authors must then decide how to approach synthesis of multiple measures, and among them, consider which measures to include in a single meta-analysis on a particular construct (Tendal et al 2011, Christensen et al 2015).

When deciding if statistical synthesis is appropriate, review authors will often find themselves reading between the lines to try to get a precise notion of the underlying construct for the PROMs used. They may have to consult the articles that describe the development and prior use of PROMs included in the primary studies, or look at the instruments to understand the concepts being measured.

For example, authors of a Cochrane Review of cognitive behavioural therapy (CBT) for tinnitus included HRQoL as a PRO (Martinez-Devesa et al 2007), assessed with different PROMs: four trials using the Tinnitus Handicap Questionnaire; one trial the Tinnitus Questionnaire; and one trial the Tinnitus Reaction Questionnaire. Review authors compared the content of the PROMs and concluded that statistical pooling was appropriate.

The most compelling evidence regarding the appropriateness of including different PROMs in the same meta-analysis would come from a finding of substantial correlations between the instruments. For example, the two major instruments used to measure HRQoL in patients with COPD are the Chronic Respiratory Questionnaire (CRQ) and the St. George’s Respiratory Questionnaire (SGRQ). Correlations between the two questionnaires in individual studies have varied from 0.3 to 0.6 in both cross-sectional (correlations at a point in time) and longitudinal (correlations of change) comparisons (Rutten-van Mölken et al 1999, Singh et al 2001, Schünemann et al 2003, Schünemann et al 2005). In one study, investigators examined the correlations between group mean changes in the CRQ and SGRQ in 15 studies including 23 patient groups and found a correlation of 0.88 (Puhan et al 2006).

Ideally, the decision to combine scores from different PROMs would be based not only on their measuring similar constructs but also on their satisfactory validity and, depending on whether measurements were available both before and after the intervention or only after it, on their responsiveness or reliability. For example, extensive evidence of validity is available for both the CRQ and the SGRQ. The CRQ has, however, proved more responsive than the SGRQ: in an investigation that included 15 studies using both instruments, standardized response means of the CRQ (median 0.51, interquartile range (IQR) 0.19 to 0.98) were significantly higher (P <0.001) than those associated with the SGRQ (median 0.26, IQR −0.03 to 0.40) (Puhan et al 2006). As a result, pooling results from trials using these two instruments could lead to underestimates of intervention effect in studies using the SGRQ (Puhan et al 2006, Johnston et al 2010). This can be tested using a sensitivity analysis of studies using the more responsive versus less responsive instrument.
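For readers unfamiliar with the standardized response mean, a minimal sketch follows. It assumes the common definition (mean change divided by the standard deviation of change); the function name and the scores are hypothetical and are unrelated to the CRQ and SGRQ figures quoted above.

```python
from statistics import mean, stdev


def standardized_response_mean(before, after):
    """Mean change divided by the standard deviation of change.

    Assumes paired before/after scores for the same patients.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need paired scores for at least two patients")
    changes = [a - b for b, a in zip(before, after)]
    sd_change = stdev(changes)
    if sd_change == 0:
        raise ValueError("undefined when every patient changes by the same amount")
    return mean(changes) / sd_change


# Made-up scores for the same six patients on two hypothetical instruments.
instrument_1_before = [3.0, 4.0, 3.5, 5.0, 4.5, 3.0]
instrument_1_after = [4.0, 4.5, 4.5, 5.5, 5.5, 3.5]
instrument_2_before = [50, 42, 47, 35, 40, 52]
instrument_2_after = [55, 40, 50, 45, 38, 54]

print(round(standardized_response_mean(instrument_1_before, instrument_1_after), 2))
print(round(standardized_response_mean(instrument_2_before, instrument_2_after), 2))
```

Comparing the two values for the same patients gives one indication of which instrument is more responsive in that sample.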

Usually, detailed data such as those described above will be unavailable. Investigators must then fall back on intuitive decisions about the extent to which different instruments are measuring the same underlying concept. For example, the authors of a meta-analysis of psychosocial interventions in the treatment of pre-menstrual syndrome faced a profusion of outcome measures, with 25 PROMs used in their nine eligible studies (Busse et al 2009). They dealt with this problem by having two experienced clinical researchers, knowledgeable about the study area and not otherwise involved in the review, independently examine each instrument – including all domains – and group 16 PROMs into six discrete conceptual categories. Any discrepancies were resolved by discussion to achieve consensus. Table 18.4.a details the categories and the included instruments within each category.

Authors should follow the guidance elsewhere in this Handbook on appropriate methods of synthesizing different outcome measures in a single analysis (Chapter 10) and interpreting these results in a way that is most meaningful for decision makers (Chapter 15).

Table 18.4.a Examples of potentially combinable PROMs measuring similar constructs from a review of psychosocial interventions in the treatment of pre-menstrual syndrome (Busse et al 2009). Reproduced with permission of Karger

Anxiety
Beck Anxiety Inventory
Menstrual Symptom Diary-Anxiety domain
State and Trait Anxiety Scale-State Anxiety domain
Behavioural Changes
Menstrual Distress Questionnaire-Behavioural Changes domain
Pre-Menstrual Assessment Form-Social Withdrawal domain
Depression
Beck Depression Inventory
Depression Adjective Checklist State-Depression domain
General Contentment Scale-Depression and Well-being domain
Menstrual Symptom Diary-Depression domain
Menstrual Distress Questionnaire-Negative Affect domain
Interference
Global Rating of Interference
Daily Record of Menstrual Complaints-Interference domain
Sexual Relations
Marital Satisfaction Inventory-Sexual Dissatisfaction domain
Social Adjustment Scale-Sexual Relationship domain
Water Retention and Oedema
Menstrual Distress Questionnaire-Water Retention domain
Menstrual Symptom Diary-Oedema domain

Having decided which PROs and subsequently PROMs to include in a meta-analysis, review authors face the challenge of ensuring the results they present are interpretable to their target audiences. For instance, if told that the mean difference between rehabilitation and standard care in a series of randomized trials using the CRQ was 1.0 (95% CI 0.6 to 1.5), many readers would be uncertain whether this represents a trivial, small but important, moderate, or large effect (Guyatt et al 1998, Brozek et al 2006, Schünemann et al 2006). Similarly, the interpretation of a standardized mean difference is challenging for most (Johnston et al 2016b). Chapter 15 summarizes the various statistical presentation approaches that can be used to improve the interpretability of summary estimates. Further, for those interested in additional guidance, the GRADE working group summarizes five presentation approaches to enhancing the interpretability of pooled estimates of PROs when preparing ‘Summary of findings’ tables (Thorlund et al 2011, Guyatt et al 2013, Johnston et al 2013).
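To show why a standardized mean difference is hard to interpret, the sketch below computes one from summary data. It assumes the common form in which the difference in means is divided by the pooled standard deviation, without any small-sample correction; the function name and all numbers are hypothetical.

```python
from math import sqrt


def standardized_mean_difference(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Difference in means divided by the pooled standard deviation."""
    if n_1 < 2 or n_2 < 2:
        raise ValueError("each group needs at least two participants")
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    if pooled_var == 0:
        raise ValueError("undefined when both standard deviations are zero")
    return (mean_1 - mean_2) / sqrt(pooled_var)


# Two made-up trials measuring a similar construct on different scales:
# one on a 7-point scale, the other on a 100-point scale.
smd_trial_a = standardized_mean_difference(5.0, 1.0, 50, 4.5, 1.0, 50)
smd_trial_b = standardized_mean_difference(60.0, 20.0, 40, 50.0, 20.0, 40)
print(smd_trial_a, smd_trial_b)
```

Both results are expressed in standard deviation units, so the original scale points are lost; this is the main reason readers find such estimates difficult to relate to patient experience.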

18.5 Chapter information

Authors: Bradley C Johnston, Donald L Patrick, Tahira Devji, Lara J Maxwell, Clifton O Bingham III, Dorcas Beaton, Maarten Boers, Matthias Briel, Jason W Busse, Alonso Carrasco-Labra, Robin Christensen, Bruno R da Costa, Regina El Dib, Anne Lyddiatt, Raymond W Ostelo, Beverley Shea, Jasvinder Singh, Caroline B Terwee, Paula R Williamson, Joel J Gagnier, Peter Tugwell, Gordon H Guyatt

Funding: DB is on the executive of OMERACT (Outcome Measures in Rheumatology) (unpaid position). OMERACT is supported through partnership with multiple industries and OMERACT funds support staff to assist in the development of methods and materials around core outcome set development that influenced this chapter. The Parker Institute, Bispebjerg and Frederiksberg Hospital (RC) is supported by a core grant from the Oak Foundation (OCAY-13-309). TD has received funding from the Canadian Institutes of Health Research for research related to patient-reported outcomes and minimal important differences. RWO received research grants (paid to the Institute) from Netherlands Organisation Scientific Research (NWO); Netherlands Organisation for Health Research and Development (ZonMw); Wetenschappelijk College Fysiotherapie/KNGF Ned Ver Manuele Therapie; European Chiropractors’ Union; Amsterdam Movement Sciences; National Health Care Institute (ZiN); De Friesland Zorgverzekeraar. PRW’s work within the COMET Initiative is funded through grant NIHR Senior Investigator Award (NF-SI_0513-10025).

18.6 References

Alonso-Coello P, Zhou Q, Martinez-Zapata MJ, Mills E, Heels-Ansdell D, Johanson JF, Guyatt G. Meta-analysis of flavonoids for the treatment of haemorrhoids. British Journal of Surgery 2006; 93: 909-920.

Beaton D, Boers M, Tugwell P. Assessment of Health Outcomes. In: Firestein G, Budd R, Gabriel SE, McInnes IB, O'Dell J, editors. Kelley and Firestein's Textbook of Rheumatology. 10th ed. Philadelphia (PA): Elsevier; 2016. p. 496-508.

Boers M, Brooks P, Strand CV, Tugwell P. The OMERACT filter for Outcome Measures in Rheumatology. Journal of Rheumatology 1998; 25: 198-199.

Boers M, Kirwan JR, Wells G, Beaton D, Gossec L, d'Agostino MA, Conaghan PG, Bingham CO, 3rd, Brooks P, Landewe R, March L, Simon LS, Singh JA, Strand V, Tugwell P. Developing core outcome measurement sets for clinical trials: OMERACT filter 2.0. Journal of Clinical Epidemiology 2014; 67: 745-753.

Brozek JL, Guyatt GH, Schünemann HJ. How a well-grounded minimal important difference can enhance transparency of labelling claims and improve interpretation of a patient reported outcome measure. Health and Quality of Life Outcomes 2006; 4: 69.

Bucher HC, Cook DJ, Holbrook AM, Guyatt G. Chapter 13.4: Surrogate Outcomes. In: Guyatt G, Rennie D, Meade MO, Cook DJ, editors. Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. 3rd ed. New York: McGraw-Hill Education; 2014.

Busse JW, Montori VM, Krasnik C, Patelis-Siotis I, Guyatt GH. Psychological intervention for premenstrual syndrome: a meta-analysis of randomized controlled trials. Psychotherapy and Psychosomatics 2009; 78: 6-15.

Calvert M, Blazeby J, Altman DG, Revicki DA, Moher D, Brundage MD. Reporting of patient-reported outcomes in randomized trials: the CONSORT PRO extension. JAMA 2013; 309: 814-822.

Cappelleri JC, Jason Lundy J, Hays RD. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures. Clinical Therapeutics 2014; 36: 648-662.

Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, Ader D, Fries JF, Bruce B, Rose M. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap cooperative group during its first two years. Medical Care 2007; 45: S3-S11.

Christensen R, Maxwell LJ, Jüni P, Tovey D, Williamson PR, Boers M, Goel N, Buchbinder R, March L, Terwee CB, Singh JA, Tugwell P. Consensus on the Need for a Hierarchical List of Patient-reported Pain Outcomes for Metaanalyses of Knee Osteoarthritis Trials: An OMERACT Objective. Journal of Rheumatology 2015; 42: 1971-1975.

COMET. Core Outcome Measures in Effectiveness Trials 2018. http://www.comet-initiative.org.

Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychological Bulletin 1955; 52: 281-302.

de Vet HCW, Terwee CB, Mokkink LB, Knol DL. Measurement in Medicine: A Practical Guide. Cambridge: Cambridge University Press; 2011.

FDA. Guidance for Industry: Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. Rockville, MD; 2009. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf.

FDA. Clinical Outcome Assessment Program. Silver Spring, MD: US Food and Drug Administration; 2018. https://www.fda.gov/Drugs/DevelopmentApprovalProcess/DrugDevelopmentToolsQualificationProgram/ucm284077.htm.

Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. Journal of Chronic Diseases 1987; 40: 171-178.

Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, Norris S, Falck-Ytter Y, Glasziou P, DeBeer H, Jaeschke R, Rind D, Meerpohl J, Dahm P, Schünemann HJ. GRADE guidelines: 1. Introduction-GRADE evidence profiles and summary of findings tables. Journal of Clinical Epidemiology 2011; 64: 383-394.

Guyatt GH, Veldhuyzen Van Zanten SJ, Feeny DH, Patrick DL. Measuring quality of life in clinical trials: a taxonomy and review. CMAJ: Canadian Medical Association Journal 1989; 140: 1441-1448.

Guyatt GH, Naylor CD, Juniper E, Heyland DK, Jaeschke R, Cook DJ. Users' guides to the medical literature. XII. How to use articles about health-related quality of life. Evidence-Based Medicine Working Group. JAMA 1997; 277: 1232-1237.

Guyatt GH, Juniper EF, Walter SD, Griffith LE, Goldstein RS. Interpreting treatment effects in randomised trials. BMJ 1998; 316: 690-693.

Guyatt GH, Thorlund K, Oxman AD, Walter SD, Patrick D, Furukawa TA, Johnston BC, Karanicolas P, Akl EA, Vist G, Kunz R, Brozek J, Kupper LL, Martin SL, Meerpohl JJ, Alonso-Coello P, Christensen R, Schünemann HJ. GRADE guidelines: 13. Preparing summary of findings tables and evidence profiles-continuous outcomes. Journal of Clinical Epidemiology 2013; 66: 173-183.

Hannan MT, Felson DT, Pincus T. Analysis of the discordance between radiographic changes and knee pain in osteoarthritis of the knee. Journal of Rheumatology 2000; 27: 1513-1517.

Johnston BC, Thorlund K, Schünemann HJ, Xie F, Murad MH, Montori VM, Guyatt GH. Improving the interpretation of quality of life evidence in meta-analyses: the application of minimal important difference units. Health and Quality of Life Outcomes 2010; 8: 116.

Johnston BC, Thorlund K, da Costa BR, Furukawa TA, Guyatt GH. New methods can extend the use of minimal important difference units in meta-analyses of continuous outcome measures. Journal of Clinical Epidemiology 2012; 65: 817-826.

Johnston BC, Patrick DL, Thorlund K, Busse JW, da Costa BR, Schünemann HJ, Guyatt GH. Patient-reported outcomes in meta-analyses-part 2: methods for improving interpretability for decision-makers. Health and Quality of Life Outcomes 2013; 11: 211.

Johnston BC, Miller PA, Agarwal A, Mulla S, Khokhar R, De Oliveira K, Hitchcock CL, Sadeghirad B, Mohiuddin M, Sekercioglu N, Seweryn M, Koperny M, Bala MM, Adams-Webber T, Granados A, Hamed A, Crawford MW, van der Ploeg AT, Guyatt GH. Limited responsiveness related to the minimal important difference of patient-reported outcomes in rare diseases. Journal of Clinical Epidemiology 2016a; 79: 10-21.

Johnston BC, Alonso-Coello P, Friedrich JO, Mustafa RA, Tikkinen KA, Neumann I, Vandvik PO, Akl EA, da Costa BR, Adhikari NK, Dalmau GM, Kosunen E, Mustonen J, Crawford MW, Thabane L, Guyatt GH. Do clinicians understand the size of treatment effects? A randomized survey across 8 countries. CMAJ: Canadian Medical Association Journal 2016b; 188: 25-32.

Jones PW. Health status measurement in chronic obstructive pulmonary disease. Thorax 2001; 56: 880-887.

Juhl C, Lund H, Roos EM, Zhang W, Christensen R. A hierarchy of patient-reported outcomes for meta-analysis of knee osteoarthritis trials: empirical evidence from a survey of high impact journals. Arthritis 2012; 2012: 136245.

Kirkham JJ, Dwan KM, Altman DG, Gamble C, Dodd S, Smyth R, Williamson PR. The impact of outcome reporting bias in randomised controlled trials on a cohort of systematic reviews. BMJ 2010; 340: c365.

Kirshner B, Guyatt G. A methodological framework for assessing health indices. Journal of Chronic Diseases 1985; 38: 27-36.

Martinez-Devesa P, Waddell A, Perera R, Theodoulou M. Cognitive behavioural therapy for tinnitus. Cochrane Database of Systematic Reviews 2007; 9: CD005233.

Müller-Bühl U, Engeser P, Klimm H-D, Wiesemann A. Quality of life and objective disease criteria in patients with intermittent claudication in general practice. Family Practice 2003; 20: 36-40.

Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Medical Care 1989; 27: S217-232.

Patrick DL, Burke LB, Gwaltney CJ, Leidy NK, Martin ML, Molsen E, Ring L. Content validity--establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force report: part 1--eliciting concepts for a new PRO instrument. Value in Health 2011a; 14: 967-977.

Patrick DL, Burke LB, Gwaltney CJ, Leidy NK, Martin ML, Molsen E, Ring L. Content validity--establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force report: part 2--assessing respondent understanding. Value in Health 2011b; 14: 978-988.

Powers JH, 3rd, Patrick DL, Walton MK, Marquis P, Cano S, Hobart J, Isaac M, Vamvakas S, Slagle A, Molsen E, Burke LB. Clinician-Reported Outcome Assessments of Treatment Benefit: Report of the ISPOR Clinical Outcome Assessment Emerging Good Practices Task Force. Value in Health 2017; 20: 2-14.

Prinsen CA, Vohra S, Rose MR, Boers M, Tugwell P, Clarke M, Williamson PR, Terwee CB. How to select outcome measurement instruments for outcomes included in a "Core Outcome Set" - a practical guideline. Trials 2016; 17: 449.

PROMIS. Patient Reported Outcomes Measurement Information System 2018. http://www.healthmeasures.net/explore-measurement-systems/promis.

PROQOLID. Patient Reported Outcomes and Quality of Life Instruments Database 2018. https://eprovide.mapi-trust.org/about/about-proqolid.

Puhan MA, Soesilo I, Guyatt GH, Schünemann HJ. Combining scores from different patient reported outcome measures in meta-analyses: when is it justified? Health and Quality of Life Outcomes 2006; 4: 94.

Revicki D, Hays RD, Cella D, Sloan J. Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. Journal of Clinical Epidemiology 2008; 61: 102-109.

Rutten-van Mölken M, Roos B, Van Noord JA. An empirical comparison of the St George's Respiratory Questionnaire (SGRQ) and the Chronic Respiratory Disease Questionnaire (CRQ) in a clinical trial setting. Thorax 1999; 54: 995-1003.

Schünemann HJ, Best D, Vist G, Oxman AD, GRADE Working Group. Letters, numbers, symbols and words: how to communicate grades of evidence and recommendations. CMAJ: Canadian Medical Association Journal 2003; 169: 677-680.

Schünemann HJ, Goldstein R, Mador MJ, McKim D, Stahl E, Puhan M, Griffith LE, Grant B, Austin P, Collins R, Guyatt GH. A randomised trial to evaluate the self-administered standardised chronic respiratory questionnaire. European Respiratory Journal 2005; 25: 31-40.

Schünemann HJ, Akl EA, Guyatt GH. Interpreting the results of patient reported outcome measures in clinical trials: the clinician's perspective. Health and Quality of Life Outcomes 2006; 4: 62.

Singh SJ, Sodergren SC, Hyland ME, Williams J, Morgan MD. A comparison of three disease-specific and two generic health-status measures to evaluate the outcome of pulmonary rehabilitation in COPD. Respiratory Medicine 2001; 95: 71-77.

Smith K, Cook D, Guyatt GH, Madhavan J, Oxman AD. Respiratory muscle training in chronic airflow limitation: a meta-analysis. American Review of Respiratory Disease 1992; 145: 533-539.

Tarlov AR, Ware JE, Jr., Greenfield S, Nelson EC, Perrin E, Zubkoff M. The Medical Outcomes Study. An application of methods for monitoring the results of medical care. JAMA 1989; 262: 925-930.

Tendal B, Nuesch E, Higgins JP, Jüni P, Gøtzsche PC. Multiplicity of data in trial reports and the reliability of meta-analyses: empirical study. BMJ 2011; 343: d4829.

Thorlund K, Walter SD, Johnston BC, Furukawa TA, Guyatt GH. Pooling health-related quality of life outcomes in meta-analysis-a tutorial and review of methods for enhancing interpretability. Research Synthesis Methods 2011; 2: 188-203.

Wainer H. Estimating coefficients in linear models: It don't make no nevermind. Psychological Bulletin 1976; 83: 213-217.

Ware J, Jr., Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Medical Care 1996; 34: 220-233.

Ware JE, Jr., Kosinski M, Bayliss MS, McHorney CA, Rogers WH, Raczek A. Comparison of methods for the scoring and statistical analysis of SF-36 health profile and summary measures: summary of results from the Medical Outcomes Study. Medical Care 1995; 33: AS264-279.

Wiebe S, Guyatt G, Weaver B, Matijevic S, Sidwell C. Comparative responsiveness of generic and specific quality-of-life instruments. Journal of Clinical Epidemiology 2003; 56: 52-60.

Witter JP. Introduction: PROMIS a first look across diseases. Journal of Clinical Epidemiology 2016; 73: 87-88.

Yohannes AM, Roomi J, Waters K, Connolly MJ. Quality of life in elderly patients with COPD: measurement and predictive factors. Respiratory Medicine 1998; 92: 1231-1236.