A Practical Guide to Quasi-Experimental Methods (PSM and DID)


A quasi-experiment is an empirical interventional study used to estimate the causal impact of an intervention on target population without random assignment. - Wikipedia

In this tutorial, we use simple datasets to illustrate two quasi-experimental methods: Propensity Score Matching (PSM) and Difference-in-differences (DID). We focus on the practical side of applying the methods and provide code in both Python and R via Kaggle (see links below).

This is joint work with my friend Dr. Gang Wang.

Propensity Score Matching

Propensity Score Matching (PSM) is a quasi-experimental method in which the researcher uses statistical techniques to construct an artificial control group by matching each treated unit with a non-treated unit of similar characteristics. Using these matches, the researcher can estimate the impact of an intervention. - World Bank

We use two datasets to show:

  • why matching matters via a simple simulated example
  • why propensity score matching is needed and how to do that in Python and R using a Groupon example

The Groupon dataset is from Gang’s paper.

We have shared the datasets and code (Python by me and R by Gang) on Kaggle:

  • Simple Matching (Python)
  • Simple Matching (R)
  • Propensity Score Matching (Python)
  • Propensity Score Matching (R)

We only focus on the key concepts and issues in this tutorial and advise the readers to go over the annotated code for details.

Simple Matching

smoker.csv is a simple simulated dataset about patients with three columns:

  • smoker: 1 yes, 0 no
  • treatment: 1 treated (treatment group), 0 not treated (control group)
  • outcome: 1 dead, 0 not dead

[Figure: death rates in the control and treatment groups before matching]

If we ignore the smoker status and compare the death rate in the control group (23.5%) with the treatment group (34%), we would reach the following conclusion:

Given that the ATE (average treatment effect) is 0.105 (0.34 - 0.235), the death rate would increase by 10.5 percentage points if the treatment were given.

So, the treatment should NOT be done.

Now let’s look at the smoker patient ratio in control and treatment groups:

[Figure: smoker ratio in the control and treatment groups before matching]

There are far more smokers among the patients in the treatment group (56%) than in the control group (19%). Is it possible that smoking, rather than the treatment, drives the higher death rate?

Next, we do a simple 1:1 matching so that the smoker ratio in the control and treatment groups will be the same. For each patient in the treatment group, we randomly choose a patient from the control group to match:

  • if the treated patient is a smoker, we select a smoker from the control group
  • if the treated patient is not a smoker, we select a non-smoker from the control group

Here, we use random sampling without replacement, i.e., once a patient in the control group is matched with a patient in the treatment group, that patient cannot be used to match other patients.
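Below is a minimal sketch of this exact matching in Python (pandas). It assumes smoker.csv has exactly the three columns described above; the seed is arbitrary, and the annotated Kaggle notebook remains the reference implementation.

```python
import pandas as pd

df = pd.read_csv("smoker.csv")           # columns: smoker, treatment, outcome

treated = df[df["treatment"] == 1]
control = df[df["treatment"] == 0].copy()

matched_rows = []
for _, t in treated.iterrows():
    # candidate controls: still-unmatched patients with the same smoker status
    candidates = control[control["smoker"] == t["smoker"]]
    if candidates.empty:
        continue                          # treated patient with no match is dropped
    c = candidates.sample(n=1, random_state=42)
    matched_rows.extend([t, c.iloc[0]])
    control = control.drop(c.index)       # without replacement: cannot be reused

matched = pd.DataFrame(matched_rows)
print(matched.groupby("treatment")["smoker"].mean())   # smoker ratios should now be equal
print(matched.groupby("treatment")["outcome"].mean())  # death rate in each matched group
```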

After matching, the smoker patient ratio in control and treatment groups is the same:

[Figure: smoker ratio in the control and treatment groups after matching]

Note: given that there are patients that are not matched, the total number of patients in the matched dataset is often smaller than in the original dataset.

Now, if we compare the death rate in the matched control group (56%) with the matched treatment group (34%), we would reach the following conclusion:

Given that the ATE (average treatment effect) is -0.22 (0.34 - 0.56), the death rate would decrease by 22 percentage points if the treatment were given.

So, the treatment should be done.

After matching, our conclusion about the treatment is the opposite!

In the simple matching example above, we match patients only based on one binary feature: smoker or not. In real datasets used in observational studies, each observation often has many features. Matching observations with the same values across multiple features is often not feasible.

Let’s look at a Groupon example:

[Figure: a Groupon deal page showing a minimum requirement of 100 committed buyers]

Some Groupon deals have a minimum requirement, e.g., as shown in the picture above, the deal is only activated when there are at least 100 committed buyers.

So, we can define:

  • Control group: deals without a minimum requirement
  • Treatment group: deals with a minimum requirement

We are interested in:

Does having the minimum requirement affect the deal outcomes, such as revenue, quantity sold, and Facebook likes received?

groupon.csv has the following structure:

[Figure: structure of groupon.csv]

Finding two deals from different groups that match exactly across many features, such as price, coupon_duration, featured or not, etc., would be very difficult, if not impossible.

Therefore, we need to use propensity score matching.

The propensity score definition is simple:

The propensity score is the probability that an observation (a deal in this case) belongs to the treatment group, given its observed features.

The calculation of the propensity score can then be framed as a binary classification problem: given the features of a deal, classify it as 0 (no minimum requirement, control group) or 1 (minimum requirement, treatment group).

Two issues here:

  • which model to use for classification: logistic regression is the usual choice
  • which features to exclude: those that predict the treatment status perfectly, such as min_req, from which the treatment feature is directly derived (see the code notebook for the result of adding min_req), and those that may themselves be affected by the treatment

So, we use the prom_length, price, discount_pct, coupon_duration, featured, and limited_supply features to train a logistic regression model that predicts the target variable treatment; a minimal code sketch follows, and the resulting score distributions are plotted below.
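Here is a minimal sketch of this step, assuming groupon.csv contains the columns named above plus the treatment indicator (the exact preprocessing in the Kaggle notebooks may differ):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("groupon.csv")
features = ["prom_length", "price", "discount_pct",
            "coupon_duration", "featured", "limited_supply"]

# Fit a logistic regression that predicts treatment status from the deal features
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[features], df["treatment"])

# The propensity score is the predicted probability of belonging to the treatment group
df["ps"] = ps_model.predict_proba(df[features])[:, 1]
print(df.groupby("treatment")["ps"].describe())  # compare the two score distributions
```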

[Figure: propensity score distributions for the control and treatment groups]

The figure above shows substantial overlap of the propensity scores between the control and treatment groups, which is a good foundation for matching; if there is little or no overlap, it is difficult or impossible to find enough matches.

For example, if min_req is used to calculate the propensity score, there is no overlap at all, as shown below:

[Figure: propensity score distributions when min_req is included (no overlap)]

If we plot revenue against start_date, we can see that the date is highly correlated with treatment:

[Figure: revenue over start_date by treatment status]

If start_date is used to calculate the propensity score, there is only a little overlap, resulting in too few matched observations:

[Figure: propensity score distributions when start_date is included (little overlap)]

The matching based on the propensity score is straightforward (a code sketch follows the list):

  • for each observation in the treatment group, use KNN to find the N closest neighbors in the control group based on the propensity score, within a fixed radius/caliper (e.g., 0.05, or 25% of the standard deviation of the propensity scores)
  • for 1:1 matching, pick the closest one as the match
  • for 1:M matching, pick the M closest ones from the N neighbors as matches
  • if matching is done without replacement, a matched control cannot be reused
  • drop observations without any match
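A minimal sketch of 1:1 nearest-neighbour matching without replacement on the propensity score, continuing from the df and ps column of the previous sketch; the 0.05 caliper is just one of the choices mentioned above, and the greedy pairing below is a simplification of what the notebooks do.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

caliper = 0.05                                     # or 0.25 * df["ps"].std()
treated = df[df["treatment"] == 1]
control = df[df["treatment"] == 0]

# For each treated deal, look up its closest control deals by propensity score
nn = NearestNeighbors(n_neighbors=5)
nn.fit(control[["ps"]])
distances, indices = nn.kneighbors(treated[["ps"]])

used, pairs = set(), []
for i, (dist_row, idx_row) in enumerate(zip(distances, indices)):
    for d, j in zip(dist_row, idx_row):
        ctrl_idx = control.index[j]
        if d <= caliper and ctrl_idx not in used:  # within caliper and not reused
            used.add(ctrl_idx)
            pairs.append((treated.index[i], ctrl_idx))
            break                                   # 1:1 matching: keep only the closest

matched = pd.concat([df.loc[[t for t, _ in pairs]],
                     df.loc[[c for _, c in pairs]]])
print(f"{len(pairs)} matched pairs, {len(matched)} deals kept")
```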

One way to assess the matching is to show the mean difference of each feature between the control and treatment groups before and after the match - we hope matching decreases these differences, i.e., the characteristics of the deals are more similar in the control and treatment groups after matching.

The following table and figure show the before/after p-values of the mean difference and the effect size for different features used for calculating the propensity score:

[Table and figure: before/after p-values of the mean differences and effect sizes for the features used in the propensity score]

We can see that, except for discount_pct and featured, which are not significant, all other features are more similar/balanced after matching.

We can then conduct t-tests (see the code notebook for details; a minimal sketch follows the list below) for the dependent variables in the dataset before and after the match:

  • for revenue (revenue), the p-value changed from 0.04 to 0.051; 0.051 is not significant at the 0.05 level, so after matching we can no longer conclude that having a minimum requirement changes revenue.
  • for Facebook likes received (fb_likes), the p-value changed from 0.004 to 0.002; both are significant, suggesting that having the minimum requirement increases the Facebook likes received.
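A minimal sketch of these t-tests, assuming df and matched from the previous sketches and that the outcome columns are named revenue and fb_likes as in the text:

```python
from scipy import stats

def outcome_pvalue(data, outcome):
    treat = data.loc[data["treatment"] == 1, outcome]
    ctrl = data.loc[data["treatment"] == 0, outcome]
    return stats.ttest_ind(treat, ctrl).pvalue

for outcome in ["revenue", "fb_likes"]:
    print(outcome,
          "before matching:", round(outcome_pvalue(df, outcome), 3),
          "after matching:", round(outcome_pvalue(matched, outcome), 3))
```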

In the Propensity Score Matching (Python) notebook, I also include code that uses the psmpy package, which generates similar results. Given that the package is not open-sourced yet, I cannot check the implementation details.


Difference-in-Differences

The difference-in-differences method is a quasi-experimental approach that compares the changes in outcomes over time between a population enrolled in a program (the treatment group) and a population that is not (the comparison group). It is a useful tool for data analysis. - World Bank

The full Python code can be accessed at Kaggle: Difference-in-Differences in Python

The dataset is adapted from the dataset in Card and Krueger (1994), which estimates the causal effect of an increase in the state minimum wage on employment.

  • On April 1, 1992, New Jersey raised its state minimum wage from $4.25 to $5.05, while the minimum wage in Pennsylvania stayed at $4.25.
  • Data about employment in fast-food restaurants in NJ (the treatment group) and PA (the control group) were collected in February 1992 and in November 1992.
  • 384 restaurants in total remain after removing null values.

The calculation of DID is simple (a code sketch follows the list):

  • mean PA (control group) employment per restaurant before/after the treatment is 23.38/21.10, so the after-before difference for the control group is -2.28 (21.10 - 23.38)
  • mean NJ (treatment group) employment per restaurant before/after the treatment is 20.43/20.90, so the after-before difference for the treatment group is 0.47 (20.90 - 20.43)
  • the difference-in-differences (DID) is 2.75 (0.47 - (-2.28)), i.e., (the after-before difference of the treatment group) - (the after-before difference of the control group)
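A minimal sketch of this arithmetic in Python, assuming a table with columns g (1 = NJ, treatment), t (1 = after), and emp (FTE employment per restaurant); the file and column names are illustrative, not necessarily those used in the Kaggle dataset.

```python
import pandas as pd

# one row per restaurant-period; file and column names (g, t, emp) are assumed
employment = pd.read_csv("employment.csv")

means = employment.groupby(["g", "t"])["emp"].mean()
diff_control = means[(0, 1)] - means[(0, 0)]   # PA: after - before  (about -2.28)
diff_treated = means[(1, 1)] - means[(1, 0)]   # NJ: after - before  (about  0.47)
did = diff_treated - diff_control              # difference-in-differences (about 2.75)
print(diff_control, diff_treated, did)
```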

The same DID result can be obtained via regression, which allows adding control variables if needed:

$y = \beta_0 + \beta_1 * g + \beta_2 * t + \beta_3 * (t * g) + \varepsilon$

  • g is 0 for the control group and 1 for the treatment group
  • t is 0 for before and 1 for after

We can insert the values of g and t into the equation and see that the coefficient ($\beta_3$) of the interaction of g and t is the value for DID:

  • control group (g = 0): before = $\beta_0$, after = $\beta_0 + \beta_2$, so the after-before difference is $\beta_2$
  • treatment group (g = 1): before = $\beta_0 + \beta_1$, after = $\beta_0 + \beta_1 + \beta_2 + \beta_3$, so the after-before difference is $\beta_2 + \beta_3$
  • DID = $(\beta_2 + \beta_3) - \beta_2 = \beta_3$
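The regression version, as a minimal statsmodels sketch using the same assumed employment DataFrame and column names; control variables could simply be added to the formula.

```python
import statsmodels.formula.api as smf

# The coefficient on the g:t interaction is the DID estimate (beta_3)
did_model = smf.ols("emp ~ g + t + g:t", data=employment).fit()
print(did_model.summary())
print("DID estimate:", did_model.params["g:t"])
```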

The following shows the result of the regression for the employment dataset:

[Figure: regression output for the DID model on the employment dataset]

The p-value for $\beta_3$ in this example is not significant: the point estimate suggests that the average number of employees per restaurant increased by 2.75 FTE (full-time equivalent) after the minimum wage increase, but the effect is not statistically distinguishable from zero and may be due to random factors.

References

  • A hands-on introduction to Propensity Score use for beginners
  • psmpy: Propensity Score Matching in Python
  • Analyze Causal Effect Using Diff-in-Diff Model
  • Diff in Diff Testing (Python)

PS. The image for this post is generated via Midjourney using the prompt “Quasi-Experimental Methods”.


BMC Health Services Research

A comparison of four quasi-experimental methods: an analysis of the introduction of activity-based funding in Ireland

Gintare Valentelyte, Conor Keegan, Jan Sorensen


Received 2021 Dec 13; Accepted 2022 Sep 16; Collection date 2022.


Health services research often relies on quasi-experimental study designs in the estimation of treatment effects of a policy change or an intervention. The aim of this study is to compare some of the commonly used non-experimental methods in estimating intervention effects, and to highlight their relative strengths and weaknesses. We estimate the effects of Activity-Based Funding, a hospital financing reform of Irish public hospitals, introduced in 2016.

We estimate and compare four analytical methods: Interrupted time series analysis, Difference-in-Differences, Propensity Score Matching Difference-in-Differences and the Synthetic Control method. Specifically, we focus on the comparison between the control-treatment methods and the non-control-treatment approach, interrupted time series analysis. Our empirical example evaluated the length of stay impact post hip replacement surgery, following the introduction of Activity-Based Funding in Ireland. We also contribute to the very limited research reporting the impacts of Activity-Based-Funding within the Irish context.

Interrupted time-series analysis produced statistically significant results that differed in interpretation, while the Difference-in-Differences, Propensity Score Matching Difference-in-Differences and Synthetic Control methods, which incorporate control groups, suggested no statistically significant intervention effect on patient length of stay.

Our analysis confirms that different analytical methods for estimating intervention effects provide different assessments of the intervention effects. It is crucial that researchers employ appropriate designs which incorporate a counterfactual framework. Such methods tend to be more robust and provide a stronger basis for evidence-based policy-making.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12913-022-08657-0.

Keywords: Interrupted time-series, Difference-in-differences, Propensity score matching, Synthetic control, Activity-based funding, Policy evaluation

Introduction

In health services research, quasi-experimental methods continue to be the main approaches used in the identification of impacts of policy interventions. These methods provide alternatives to randomised experiments e.g. Randomised Controlled Trials (RCTs), which are less prevalent in health policy research, particularly for larger scale interventions. Examples of previously conducted experiments include the RAND Health Insurance Experiment [ 1 ] and the Oregon Health Insurance Experiment [ 2 ] which have since led to the restructuring of health insurance plan policies across the United States. Although such large-scale experiments can generate robust evidence for informing health policy decisions, they are often too complex, expensive, unethical or infeasible to implement for larger scale policies and interventions [ 3 , 4 ]. Quasi-experimental methods provide an alternative means to policy evaluation, using non-experimental data sources, where randomisation is infeasible or unethical when the intervention already occurred and its evaluation occurred later [ 3 ].

The evaluation of policy impacts, regardless of analytical approach, is aimed at identifying causal effects of a policy change. A concise guide highlights the approaches which are appropriate for evaluating the impact of health policies [ 3 ]. A recent review identified a number of methods appropriate for estimating intervention effects [ 5 ]. Additionally, several control-treatment approaches have recently been compared in terms of their relative performance [ 6 , 7 ].

However, there is limited empirical evidence in the health services research field comparing control-treatment analytical approaches to non-control-treatment approaches, used for estimating health intervention or policy effects. We use an empirical example of Activity-Based Funding (ABF), a hospital financing intervention, to estimate the policy impact using four non-experimental methods: Interrupted Time-Series (ITS), Difference-in-Differences (DiD), Propensity Score Matching Difference-in-Differences (PSM DiD), and Synthetic Control (SC). A review of the application of these methods in the literature examining ABF impacts has recently been undertaken [ 5 ]. Out of 19 identified studies, six studies employed ITS, seven employed DiD and one study employed the SC approach [ 5 ]. The identified effects, as assessed by reporting on a set of hospital outcomes, varied based on the analytical method that was used. The studies which employed ITS all reported statistically significant effects post-ABF which have led to increased levels of hospital activity [ 8 , 9 ], and reductions in patient length of stay (LOS) [ 10 – 13 ]. In contrast, the evidence is more mixed, among the remaining studies which employed control-treatment methods. For example, significant increases in hospital activity were reported in three studies which used the DiD approach [ 14 – 16 ], while another study found no significant impacts in terms of activity [ 17 ]. Similarly, contrasting evidence in terms of changes in LOS [ 16 , 18 , 19 ] and mortality [ 18 , 20 ] were also reported. Therefore, the overall evidence on the impacts of ABF on hospital outcomes can be considered mixed, and as highlighted by Palmer et al. (2014) [ 21 ] ‘Inferences regarding the impact of ABF are limited both by inevitable study design constraints (randomized trials of ABF are unlikely to be feasible) and by avoidable weaknesses in methodology of many studies’ [ 21 ].

The aim of this study is to compare these analytical methods in their estimation of intervention effects, using an empirical case of ABF introduction in Ireland. Specifically, we focus on the comparison of control-treatment analytical approaches (DiD, PSM DiD, SC), to ITS, a commonly used non-control-treatment approach for evaluating policies and interventions. Additionally, we contribute to the very limited research evidence assessing the impacts of ABF within the Irish context.

ABF and the Irish health system

Activity-based funding (ABF) is a financing model that incentivises hospitals to deliver care more efficiently [ 22 ]. Under ABF, hospitals receive prospectively set payments based on the number and type of patients treated [ 22 ]. Services provided to patients are reflected by an efficient price of providing those services and adjustments incorporated for different patient populations served. Prices are determined prospectively e.g. in terms of Diagnosis Related Groups (DRGs), and reflect differences in hospital activity, based on types of diagnosis and procedures provided to patients [ 23 ]. DRGs provide transparent price differences, directly linking hospital services provision to hospital payments. In theory, this incentivises hospitals to deliver more efficient healthcare (e.g. shorten LOS) and to be more transparent in their allocation of resources and finances [ 22 , 24 ].

The Irish healthcare system is predominantly a public health system, with the majority of health expenditure raised through general taxation (72%), and remainder through out-of-pocket payments (13%) and voluntary private health insurance (15%) [ 25 ]. In Ireland, most hospital care is delivered in public hospitals and this care is mostly government-financed, with approximately one-fifth of care delivered in public hospitals privately financed [ 25 , 26 ]. Patients who receive private day or inpatient treatment in public hospitals are required to pay private accommodation and consultant charges. The majority of private patient activity in public hospitals is funded through private health insurance with the remainder through out-of-pocket payments. Public or private patient status relates to whether the hospital patient saw their consultant on a public or private basis [ 27 ]. For non-consultant hospital staff, the same publicly funded staff are employed in delivering care to both publicly and privately financed patients [ 27 ].

Traditionally, all Irish public hospitals were funded on a budgetary block grant basis based on historical performance, making it difficult to measure and monitor activity and funding of public hospital care [ 28 ]. On the 1st January 2016, a major financing reform was introduced, and funding of public patients in most public hospitals moved to ABF [ 29 ]. ABF was introduced centrally by the Health Services Executive (HSE), responsible for delivery of public health services in Ireland. All public inpatient activity is funded under ABF, while all outpatient and Emergency Department (ED) activity continues to be funded using block budgets [ 30 ]. The ABF funding model is based on prospectively set average DRG prices, and additionally financially penalises hospitals for long patient LOS [ 30 ]. Additionally, the amount of activity that a hospital can carry out as well as the maximum funding it can receive, is capped, to preserve the overall health budget provided to a particular hospital [ 30 ]. Public patient reimbursement is based on the average price of DRGs, in contrast to private patients who are reimbursed at a per-diem basis [ 30 ].

Thus, this key difference in reimbursement between public and private patients treated in the same hospitals, lends itself to a naturally occurring control group for our analysis using the control-treatment approaches.

Estimation models

Interrupted time-series analysis

Interrupted Time Series (ITS) analysis identifies intervention effects by comparing the level and trend of outcomes pre and post intervention [ 31 ]. Often, ITS compares outcome changes for a single population and does not specify a control group against which intervention effects can be compared [ 32 ]. This can bias the estimated intervention effects, as a defined control group often eliminates any unmeasured group or time-invariant confounders from the intervention itself [ 33 ]. Therefore, ITS can overestimate the effects of an intervention producing misleading estimation results [ 4 ].

The ITS analysis model can be presented as [ 34 , 35 ],
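The paper's exact specification is not reproduced in this extract; a standard single-group segmented-regression form of an ITS model, with illustrative notation, is

$$Y_t = \beta_0 + \beta_1 T_t + \beta_2 X_t + \beta_3 (X_t \times T_t) + \varepsilon_t$$

where $Y_t$ is the outcome in period $t$ (here, average LOS), $T_t$ is the time elapsed since the start of the study, $X_t$ is an indicator equal to 1 in post-intervention periods, $\beta_2$ captures the level change and $\beta_3$ the trend change after the intervention.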

Potential outcomes framework

Alternatively, analytical approaches such as Difference-in-Differences (DiD), Propensity Score Matching Difference-in-Differences (PSM DiD) and Synthetic Control (SC) overcome some of the shortcomings of ITS. These approaches are based on the counterfactual framework and the idea of potential outcomes which quantify the estimation of causal effects of a policy or an intervention 1 . The potential outcomes framework defines a causal effect for an individual as the difference in outcomes that would have been observed for that individual with and without being exposed to an intervention [ 36 , 37 ]. Since we can never observe both potential outcomes for any one individual (we cannot go back in time to expose them to the intervention), we cannot compute the individual treatment effect [ 36 ]. Researchers therefore focus on average causal effects across populations guided by this potential outcomes framework [ 3 , 36 , 37 ]. Therefore in practice, estimation is always related to the counterfactual outcome, which is represented by the control group [ 36 , 38 ] 2 . Consequently, it is for this reason all of these analytical approaches use a clearly defined control group in estimation, against which the outcomes for a group affected by the intervention are compared. The inclusion of a control group improves the robustness of the estimated intervention effects, by approximating experimental designs such as a RCT, the gold standard [ 38 ].

Difference-in-differences analysis

The DiD approach estimates causal effects by comparing the observed outcome changes pre intervention with the counterfactual outcomes post intervention, between a naturally occurring control group and a treatment group exposed to the intervention change [ 33 ]. The key advantage of the DiD approach is its use of the intervention itself as a naturally occurring experiment, allowing to eliminate any exogenous effects from events occurring simultaneously to the intervention [ 33 , 38 ].

The DiD approach estimates the average treatment effect on the treated (ATT) across individual units at a particular time point, represented by the general DiD model as [ 3 , 6 , 33 , 38 ],
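The general model is not reproduced in this extract; a standard two-group, two-period DiD regression, with illustrative notation, is

$$Y_{it} = \alpha + \beta \, Treat_i + \gamma \, Post_t + \delta \, (Treat_i \times Post_t) + X_{it}'\theta + \varepsilon_{it}$$

where $Treat_i$ indicates membership of the treatment group, $Post_t$ indicates the post-intervention period, $X_{it}$ is a vector of covariates, and $\delta$ is the DiD estimate of the ATT.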

However, DiD relies on the parallel trends assumption which states that, in the absence of treatment, the average outcomes for the treated and control groups would have followed parallel trends over time [ 33 ]. This parallel trends assumption can be represented as [ 33 , 38 ],
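Using the potential-outcomes notation of footnote 1 below (where $Y_0(i,t)$ is the untreated outcome and $D(i)$ the treatment indicator), the assumption can be written as

$$E[\,Y_0(i,1) - Y_0(i,0) \mid D(i) = 1\,] = E[\,Y_0(i,1) - Y_0(i,0) \mid D(i) = 0\,]$$

i.e., in the absence of treatment, the expected change in the outcome would have been the same for the treatment and control groups.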

Propensity score matching difference-in-differences

PSM DiD is an extension to the standard DiD approach. Using this approach, outcomes between treatment and control groups are compared, after matching them with similar observable factors, followed by estimation by DiD [ 40 – 42 ]. Combining the PSM approach with DiD allows further elimination of any time-invariant differences between the treatment and control groups, and allows selection on observables and unobservables which are constant over time [ 40 , 43 ]. Additionally, matching on the propensity score accounts for imbalances in the distribution of the covariates between the treatment and control groups [ 40 ] 4 . We present this model as follows [ 40 ],

In our final PSM DiD estimation model we estimate the average treatment effect on the treated (ATT) using nearest neighbour matching propensity scores, by selecting the one comparison unit i.e. patient episode whose propensity score is nearest to the treated unit in question. We present our estimation model as follows:

Additionally, PSM makes the parallel trends assumption more plausible as the control groups are based on similar propensity scores in the PSM DiD approach. PSM forms statistical twin pairs before conducting DiD estimation, thus increasing the credibility of the identification of the treatment effect [ 40 ]. Instead, PSM relies on the conditional independence assumption (CIA). This assumption states that, in absence of the intervention, the expected outcomes for the treated and control groups would have been the same, conditional on their past outcomes and observed characteristics pre-intervention [ 40 , 44 ]. However, it is also important to note, that even if covariate balance is achieved in PSM DiD, this does not necessarily mean that there will be balance across variables that were not used to build the propensity score [ 40 , 44 ]. It is for this reason that the CIA assumption is still required.

Furthermore, recent developments of the DiD approach have highlighted that additional assumptions are necessary to ensure the estimated treatment effects are unbiased [ 45 ]. It is proposed that estimates will remain consistent after conditioning on a vector of pre-treatment covariates [ 45 ]. This was our motivation for employing the PSM DiD approach, as it accounts for pre-intervention characteristics, which allow to further minimise estimation bias. PSM DiD achieves this by properly applied propensity scores, based on matched pre-intervention characteristics, thus eliminating observations that are not similar between treatment and control groups [ 41 ]. Further developments have been made to account for multiple treatment groups, which receive treatment at various time periods i.e. differential timing DiD [ 46 ]. However, this does not affect our analysis, as the introduction of ABF in our empirical example took place at one time.

Synthetic control

The Synthetic Control (SC) method estimates the ATT by constructing a counterfactual treatment-free outcome for the treated unit using the weighted average of available control units pre-intervention [ 44 , 47 , 48 ]. The weights are chosen so that the outcomes and covariates for the treated unit and the synthetic control are similar in the pre-treatment period [ 44 , 48 ]. The parallel trends assumption may not hold in reality, particularly when estimating policy impacts, and so analytical approaches that avoid the parallel trends assumption, such as SC, have been considered.

The SC approach becomes particularly useful in cases when a naturally occurring control group cannot be established, or in cases where the parallel trends assumption does not hold, and can often complement other analytical approaches [ 48 ]. Similarly to PSM, the SC method also relies on the CIA, and controls for pre-treatment outcomes and covariates by re-weighting treated observations, using a semiparametric approach [ 44 ]. For a single treated unit the synthetic control is formed by finding the vector of weights W that minimises [ 44 ]:
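The expression is not reproduced in this extract; in the standard Abadie et al. formulation, with $X_1$ the vector of pre-treatment characteristics of the treated unit, $X_0$ the matrix of the same characteristics for the control units, and $V$ a positive semidefinite weighting matrix, the weights solve

$$W^*(V) = \arg\min_{W} \; (X_1 - X_0 W)' \, V \, (X_1 - X_0 W) \quad \text{subject to } w_j \ge 0, \;\; \textstyle\sum_j w_j = 1.$$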

The choice of V is important as W* depends on the choice of V . The synthetic control W*(V) is meant to reproduce the behaviour of the outcome variable for the treated unit in the absence of the treatment. Often a V that minimises the mean squared prediction error is chosen [ 44 , 48 ]:
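The criterion is not reproduced in this extract; in the standard formulation, with $Z_1$ and $Z_0$ the pre-intervention outcomes of the treated unit and the control units, $V$ is chosen as

$$V^* = \arg\min_{V} \; \big(Z_1 - Z_0 W^*(V)\big)' \big(Z_1 - Z_0 W^*(V)\big).$$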

Similarly, we limit biases in our estimated treatment effects [ 45 ] using the SC approach, which restricts the synthetic control weights to be positive and sum to one and such that the chosen weights minimise the mean squared prediction error with respect to the outcome [ 49 ].

Data and methods

In our empirical example analysis, we used national Hospital In-Patient Enquiry (HIPE) administrative activity data from 2013 to 2019 for 19 public acute hospitals providing orthopaedic services in Ireland. HIPE data used in our analysis record and classify all activity (public and private) in Irish public hospitals [ 27 ]. We divided our data into quarterly time periods (n = 27) based on admission date. Data were available for 12 quarters pre-ABF introduction, and 15 quarters post-ABF introduction. We assessed the impact of ABF on patient average LOS, following elective hip replacement surgery, for a total of 19,565 hospital patient episodes.

For each analysis, we included hospital fixed effects and controlled for the same covariates: Age categories (reference category 60–69 years), average number of diagnoses, average number of additional procedures (additional to hip replacement), Diagnosis-Related Group (DRG) complexity (split by minor and major complexity) and interaction variables: Age categories by average number of diagnoses, age categories by average number of additional procedures, age categories by DRG complexity.

We estimated the ITS model using ordinary least squares and included public patient episodes only. Following guidance from previous studies [ 32 , 50 ], we accounted for seasonality by including indicator variables for elapsed time since ABF introduction. Additionally, we checked for presence of autocorrelation by plotting the residuals and the partial autocorrelation function [ 32 , 50 ].

For the remaining models, we used treatment and control groups consisting of public and private patient episodes, respectively, and estimated the average treatment effects on the treated (ATT). We used the key differences in reimbursement between public (DRG payments) and private (per-diem payments) patient episodes, to differentiate our treatment group from the control group. The identification strategy exploits the fact that per-diem funding of private patient care remained unchanged over the study period. Any change in outcome between public and private patients before and after the introduction of ABF should be due to the policy introduction.

In our DiD analysis, we controlled for common aggregate shock changes by including dummy variables for each time period (time fixed effects). We additionally examined the parallel trends assumption by interacting the time and treatment indicators in the pre-ABF period (see Supplementary Tables  4 , Additional File 6 ).

We estimated PSM DiD in a number of steps 6 : First we estimated propensity scores to treatment based on our list of covariates, using a probit regression. Second, we matched the observations in the treatment group (public patient episodes) with observations in the control group (private patient episodes) as per estimated propensity scores with the common support condition imposed. Finally, we compared the changes in the average LOS of the treated and matched controls by DiD estimation.

The SC estimation 7 was conducted at the hospital level. It has been reported that the SC approach used in our analysis works best with aggregate-level data [ 44 , 48 , 52 ]. We incorporated the nested option in our estimation, a fully nested optimization procedure that searches among all (diagonal) positive semidefinite matrices and sets of weights for the best fitting convex combination of the control units [ 44 , 52 ]. The synthetic control group composition consisted of private patient episodes based on characteristics from 9 different public hospitals from the sample of 19 hospitals used in our analysis [see Supplementary Tables  1 , Additional File 2 ].

To examine whether the estimated effects from all analyses still hold, we conducted sensitivity analysis and re-estimated each analytical model using trimmed LOS at 7 days (at the 90th percentile of the LOS distribution). As illustrated by the distribution of LOS in Supplementary Fig.  1 , Additional File 1 , this allowed for the exclusion of outlier LOS values. Additionally, to test the robustness of the estimated treatment effects, we tested the empirical strength of each model by inclusion and exclusion of certain covariates. We also examined the trends in the pre-ABF period across all DiD models, to check whether the trends were similar across the treatment and control groups.

Table 1 summarises the key descriptive statistics of the data analysed. Over the study period, the overall average LOS for this sample of patient episodes was 5.2 days (5.3 and 5.0 days for public and private patients, respectively). The largest age group (31.7% of patients) was 60–69 years (30.9% of public and 33.8% of private patients, respectively). The average number of additional diagnoses was 2.5 for public and 2.1 for private patients (overall average of 2.4), and average additional procedures were 3.3 for public and 2.8 for private patients. The DRG complexity indicates that most patients (95.7%) had undergone minor complexity hip replacement surgery.

Table 1. Descriptive statistics of key covariates used in all models, by treatment and control group

We illustrate the estimated intervention effects for each of the models in Fig.  1 . We observe a clear reduction in the average LOS from the ITS estimates (Fig.  1 a). However, the DiD and PSM DiD estimates are very similar, and we do not observe a clear effect on the average LOS, with most coefficients distributed closely around zero (Fig.  1 b and c). Similarly, the SC approach could not identify a clear effect (Fig.  1 d). Additionally, both the SC (Fig.  1 d & Supplementary Tables  1 , Additional File 2 ) and PSM DiD (Supplementary Fig.  2 , Additional File 3 ) approaches achieved good balance between the treated (public patient episodes) and control (private patient episodes) groups. Our examination of the pre-ABF trends did not identify any significant differences between treatment and control groups (see Supplementary Tables  4 , Additional File 6 ).

Fig. 1. Model estimates

Table  2 summarises the estimated treatment effects for each estimation model 8 . The ITS analysis suggested ABF had the largest and statistically significant impact on the average LOS for public patients, a reduction of 0.7 days (p < 0.01). However, this effect could not be observed with the control-treatment approaches, although we also see a negative but smaller effect on the average LOS from the DiD, PSM DiD and SC estimates. The effect is not statistically significant for any of these models. As illustrated in Fig.  2 below, we observe a generally declining trend in the average LOS for both the public and private patients in our data. This explains the statistically significant effects of ITS, relative to the control-treatment methods, which differentiate out the average LOS effects between both public and private patient episodes.

Table 2. Estimated treatment effects by estimation model

Note: The treatment effect parameters for ITS, DiD and PSM DiD all indicate a marginal change (reduction) in the average LOS. The DiD and PSM DiD estimates are almost identical, indicating that the matched propensity scores did not have a substantial impact on the overall DiD estimates. In contrast, the difference between ITS and DiD estimates is substantial. All models control for hospital Fixed Effects (FE) except for SC estimation at the hospital level. a: 16 observations in the treatment group were not matched; b: the SC method relies on minimising the RMSPE; c: due to aggregated data at hospital level. Significance level: *** p < 0.01; robust standard errors in parentheses.

Fig. 2. Average LOS by quarter, 2013–2019, for treatment and control groups

The results from our sensitivity analysis (Supplementary Tables  2 , Additional File 4 ) revealed no material change for the ITS estimates, which remained statistically significant (p < 0.001). The estimated treatment effects from the control-treatment approaches remained small, and not statistically significant. Similarly, additional robustness testing of the estimated treatment effects by each model (and pre-ABF trend examination) remained consistent with the main results (Supplementary Tables  3 , Additional File 5 ).

In this study we compared the key analytical methods that have been used in the evaluation of policy interventions and used the introduction of Activity-Based Funding (ABF) in Irish public hospitals as an illustrative policy case. Specifically, we compared several control-treatment methods (DiD, PSM DiD, SC), to a non-control-treatment approach, ITS. We contribute to the limited empirical evidence in the health services research field comparing control-treatment analytical approaches to non-control-treatment approaches, based on recent evidence highlighting the common use of these methods in estimation of health intervention or policy effects [ 5 ]. Additionally, we contribute to the very limited research evidence on the evaluation of the ABF policy within the Irish context. We were able to utilise an important dimension of the funding changes, by exploiting the fact that both publicly and privately financed patients are treated in public hospitals in Ireland and over the period of analysis, private patients were not subject to a change in their funding.

From our comparative methods analysis, ITS produced statistically significant estimates, indicating a reduction in LOS post ABF introduction, relative to control-treatment approaches, which did not indicate any significant effects. This is in line with the results from other studies, which have estimated ABF effects using ITS, and have reported significant reductions in LOS [ 10 – 13 ]. Caution should be taken when considering ITS, as the estimates may not truly capture the effects of the intervention of interest. This could lead to incorrect inferences, and potentially to misguided assessment of impacts from policy changes across the hospital sector. For instance, the estimated reduction in LOS for Irish public patients, may incorrectly indicate that the ABF reform has been successful. From a policy perspective, the importance of the resulting ABF effects, would be informed by the size of ITS estimates, providing potentially misleading evidence on the funding reform.

Further, caution should be taken, as ITS analysis does not include a control group, relative to the other methods we considered which incorporated a control and treatment groups. Therefore the conclusions drawn from the ITS analysis will differ to those drawn from the control-treatment approaches. Additionally, our findings from ITS analysis align with a recent study which tested the empirical strength of the ITS approach, by comparing the estimated ITS results to the results from a RCT [ 4 ]. Relative to a RCT, ITS produced misleading results, primarily driven by the lack of control group, and ITS model assumptions [ 4 ]. This would suggest, a comparison of the slope of outcomes before and after an intervention may lead to biased estimates when evaluating causal effects on outcomes affected over time, due to influences by simultaneous and other unobservable factors at the time of the intervention.

However, over the study period, the average LOS for both public (treatment) and private (control) patient cases shows a reducing trend over time (Fig.  2 ). By limiting the analysis to the public patients only, the ITS approach ignores the system level effect for all patients treated (public and private), across public hospitals, and picks up a statistically significant and negative effect. In contrast, the control-treatment approaches account for the simultaneous downward trend in private (control) patient activity, thus approximating a natural experiment (e.g. a RCT) more closely, and producing more robust estimates, relative to ITS.

It is important to note that often no comparison group may be available, limiting the analysis to the ITS approach. This may be driven by various data limitations. For example, the data available over a period may only partially be available for a specific intervention. Therefore, conventional regression modelling may be the only feasible approach to account for pre-intervention differences, even though there is evidence that these methods may provide biased results, most notably in the presence of time-dependent confounders [ 4 ]. Additionally, certain intervention and policy evaluations may not be feasible under a control-treatment design, and for which the ITS approach is more suitable. This applies to studies which focus on a specific patient [ 53 ] or hospital group [ 10 ], or policies at a more aggregate or population level [ 54 ], for which it is difficult to identify a naturally occurring control group. Therefore, the inclusion of a control group in these instances would not be suitable, suggesting a before-after comparison in the level and trend of outcomes using ITS analysis as a more suitable approach. Additionally, ITS models may be more effective in the evaluation of policy and intervention effects when the control-treatment specific assumptions of parallel trends and the common independence assumptions do not hold [ 55 ].

Additionally, ITS has been highlighted as an effective approach to study short-term policy and intervention effects, as estimation of long-term effects can be biased due to the presence of simultaneous shocks to the outcome of interest [ 56 ]. In contrast, control-treatment approaches such as DiD and SC have been recognised as more appropriate and robust for estimation of long-term intervention effects [ 57 ], as these allow intervention effects to change over time [ 38 , 49 ]. Despite recent improvements and developments of the ITS approach [ 34 , 35 ], the benefits of adopting control-treatment approaches for health intervention and policy evaluation, have been previously highlighted [ 33 ].

It should be noted that all of the methods applied in this study are limited to the evaluation of a single policy. Therefore, any other smaller scale simultaneous policies that are implemented during the period of analysis are difficult to differentiate out in many instances. However, the control-treatment methods account for any unmeasured group or time-invariant confounders from the main intervention itself by incorporating a control group [ 33 ]. For example, the introduction of ABF in our empirical example may have been accompanied by a hospital-wide discharge policy aimed at reducing LOS. In this instance, ITS may attribute the reduction in LOS as the impact of ABF entirely, although this is a hospital policy effect. Alternatively, the inclusion of a control group (e.g. patients targeted in the LOS policy, but not to ABF) would difference out the ABF effect from the LOS policy, and would capture effects specific to ABF introduction. In this case, ITS may overestimate the impacts of ABF relative to the other approaches and may further contribute to different evidence base for policy decisions.

This study has several limitations. First, we limited our ITS analysis to a single group (public patient episodes) despite recent developments to ITS for multiple group comparisons [ 34 ]. However, this was informed by a recent review, which identified that ITS was employed to estimate intervention effects for a single group [ 5 ]. Second, for each of the control-treatment methods, we assumed that any individual shocks following ABF introduction had the same expected effect on the average LOS for the treatment and control groups. Third, we assumed that all of the models were correctly specified in terms of their respective identification and functional form assumptions. However, if either the identification or the functional assumptions are violated, the estimates can be biased, particularly as highlighted in recent literature on DiD approaches [ 45 ]. Fourth, we limited our focus on two key assumptions applicable to the quasi-experimental approaches i.e. parallel trends and conditional independence, and did not focus on other assumptions e.g. common shock assumption. Fifth, recent research evidence has addressed the issues related to intervention ‘spillover effects’ i.e. the unintended consequences of health-related interventions beyond those initially intended [ 58 ]. It is possible that the differing estimated effects, based on the analytical method used, may have, or could lead to spillover effects as a result. However, given the nature of the data used in our analysis, and our focus on a single procedure in our empirical analysis, it is difficult to identify any potential spillover effects, which may have been linked to ABF. More exploration of such effects may be necessary in future research. Finally, caution should be taken in generalising the reported ABF effects in this study given that our empirical example focused on one procedural group in one particular country.

In health services research it is not always feasible to conduct experimental analysis and we therefore often rely on observational analysis to identify the impact of policy interventions. We demonstrated that ITS analysis produces results different in interpretation relative to control-treatment approaches such as DiD, PSM DiD and SC. Our comparative method analysis therefore suggests that choice of analytical method should be carefully considered and researchers should strive to employ more appropriate designs incorporating control and treatment groups. These methods are more robust and provide a stronger basis for evidence-based policy-making and evidence for informing future financing reform and policy.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Acknowledgements

The authors wish to thank the Data Analytics team at the Healthcare Pricing Office (HPO) for granting access to the data used in this study. This study was conducted as part of the Health Research Board (HRB) SPHeRE Programme (Grant No. SPHeRE-2018-1). The Health Research Board (HRB) supports excellent research that improves people’s health, patient care and health service delivery. An earlier version of this work has been previously presented at the virtual International Health Economics Association (IHEA) Congress 2021.

List of abbreviations

RCT: Randomised Controlled Trial

ITS: Interrupted Time Series

DiD: Difference-in-Differences

PSM: Propensity Score Matching

PSM DiD: Propensity Score Matching Difference-in-Differences

SC: Synthetic Control

CIA: Conditional Independence Assumption

ABF: Activity-Based Funding

ATT: Average Treatment effect on the Treated

HSE: Health Service Executive

HIPE: Hospital In-Patient Enquiry

LOS: Length of Stay

DRG: Diagnosis-Related Group

Author contribution

JS and GV conceived the study. GV drafted and edited the manuscript and performed statistical analysis. CK and JS critically revised the manuscript. All authors approved the final draft.

This research was funded by the Health Research Board SPHeRE-2018-1.

Data Availability

The data that support the findings of this study were made available under a strict user agreement with the Healthcare Pricing Office. Access to the data may only be sought directly from the Healthcare Pricing Office.

Declarations

Ethics approval and consent to participate.

Ethical approval for this study was granted by the Research Ethics Committee of the Royal College of Surgeons of Ireland (REC201910019). We confirm that all methods in this study were carried out in accordance with their specifications and other relevant guidelines and regulations. The ethics committee recognized that explicit consent to participate in the study was not required, as the data used in this study were retrospective, routinely collected, and anonymised. The data controller, the Healthcare Pricing Office, responsible for holding and managing the national Hospital In-Patient Enquiry (HIPE) database, granted access and permission to use the data in this study. The Healthcare Pricing Office ensured strict data user agreements were followed, and the data were anonymised by limiting certain combinations of data that could lead to patient identification. This was in line with the Healthcare Pricing Office adherence to The Data Protection Acts 1998 to 2018 and Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016, also known as the General Data Protection Regulation or GDPR (HPO Data Protection Statement Version 1.2, May 2020, Healthcare Pricing Office; available at: https://hpo.ie/data_protection/HPO_Data_Protection_Statement_Version_1.2_May2020_Covid_1.pdf ).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

The treatment effect in terms of potential outcomes: $Y_0(i,t)$ is the outcome that individual $i$ would attain at time $t$ in the absence of treatment, and $Y_1(i,t)$ is the outcome that individual $i$ would attain at time $t$ if exposed to treatment. The treatment effect on the outcome for individual $i$ at time $t$ is $Y_1(i,t) - Y_0(i,t)$. The fundamental identification problem is that, for any individual $i$ and time $t$, the two potential outcomes $Y_0(i,t)$ and $Y_1(i,t)$ are never observed together, so we cannot compute the individual treatment effect. We only observe the outcome $Y(i,t) = Y_0(i,t)(1 - D(i,t)) + Y_1(i,t)D(i,t)$, where $D(i,t) = 0$ denotes control and $D(i,t) = 1$ denotes treatment. Since treatment occurs after period $t = 0$, we can denote $D(i) = D(i,1)$; then $Y(i,0) = Y_0(i,0)$ and $Y(i,1) = Y_0(i,1)(1 - D(i)) + Y_1(i,1)D(i)$ (Rubin (1974)).

The change in outcomes from pre to post-intervention in the control group is a proxy for the counterfactual change in untreated potential outcomes in the treatment group.

The unit of analysis is the discharge, but by definition we only have one observation per discharge; therefore we cannot apply discharge fixed effects and instead include hospital fixed effects.

Matching on the propensity score works because it imposes the same distribution of the covariates for both the control and treatment groups (Rosenbaum and Rubin (1983)).

The common support condition guarantees that only units with suitable control cases are considered by dropping treatment observations whose propensity score is higher than the maximum or less than the minimum propensity score of the controls.

Using the psmatch2 Stata command with nearest neighbour matching, which showed the best balancing properties after comparing several algorithms [ 51 ].

Using the synth Stata command [ 44 , 52 ].

Reported p-values for ITS and DiD are for the hypothesis that ATT = 0. For DiD PSM, reported p-values are conditional on the matched data. For SC, reported p-values were calculated using placebo-tests in a procedure akin to permutation tests (Abadie et al. 2010). This involved iteratively resampling from the control pool, and in each iteration re-assigning each control unit as a ‘placebo treated unit’, with a probability according to the proportion of treated units in the original sample. The synthetic control method was then applied on these ‘placebo data’ and ATT calculated for the placebo treated versus control units. The p-value for the ATT was calculated according to the proportion of the replicates in which the absolute value of the placebo-ATT exceeded the estimated ATT. It should be noted that the p-value based on placebo tests relate to falsification tests, while the p-values reported for the other methods relate to sampling uncertainty. Hence the p-values between each estimated model are not directly comparable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Gintare Valentelyte, Email: [email protected].

Conor Keegan, Email: [email protected].

Jan Sorensen, Email: [email protected].

References

1. Brook RH, Keeler EB, Lohr KN, Newhouse JP, Ware JE, Rogers WH, et al. The Health Insurance Experiment: A Classic RAND Study Speaks to the Current Health Care Reform Debate. Santa Monica: RAND Corporation; 2006.
2. Finkelstein A, Taubman S, Wright B, Bernstein M, Gruber J, Newhouse JP, et al. The Oregon Health Insurance Experiment: Evidence from the first year. Q J Econ. 2012;127(3):1057–106. doi: 10.1093/qje/qjs020.
3. Jones AM, Rice N. Econometric evaluation of health policies. Oxford: Oxford University Press; 2011.
4. Baicker K, Svoronos T. Testing the Validity of the Single Interrupted Time Series Design. CID Working Paper 364, Center for International Development at Harvard University; 2019.
5. Valentelyte G, Keegan C, Sorensen J. Analytical methods to assess the impacts of activity-based funding (ABF): a scoping review. Health Econ Rev. 2021;11(1):17. doi: 10.1186/s13561-021-00315-1.
6. O’Neill S, Kreif N, Grieve R, Sutton M, Sekhon JS. Estimating causal effects: considering three alternatives to difference-in-differences estimation. Health Serv Outcomes Res Methodol. 2016;16:1–21. doi: 10.1007/s10742-016-0146-8.
7. O’Neill S, Kreif N, Sutton M, Grieve R. A comparison of methods for health policy evaluation with controlled pre-post designs. Health Serv Res. 2020;55(2):328–38. doi: 10.1111/1475-6773.13274.
8. Sutherland JM, Liu G, Crump RT, Law M. Paying for volume: British Columbia’s experiment with funding hospitals based on activity. Health Policy. 2016;120(11):1322–8. doi: 10.1016/j.healthpol.2016.09.010.
9. Januleviciute J, Askildsen JE, Kaarboe O, Siciliani L, Sutton M. How do Hospitals Respond to Price Changes? Evidence from Norway. Health Econ. 2016;25(5):620–36. doi: 10.1002/hec.3179.
10. Shmueli A, Intrator O, Israeli A. The effects of introducing prospective payments to general hospitals on length of stay, quality of care, and hospitals’ income: the early experience of Israel. Soc Sci Med. 2002;55(6):981–9. doi: 10.1016/s0277-9536(01)00233-7.
11. Perelman J, Closon MC. Hospital response to prospective financing of in-patient days: The Belgian case. Health Policy. 2007;84(2–3):200–9. doi: 10.1016/j.healthpol.2007.05.010.
12. Martinussen PE, Hagen TP. Reimbursement systems, organisational forms and patient selection: Evidence from day surgery in Norway. Health Econ Policy Law. 2009;4(2):139–58. doi: 10.1017/S1744133109004812.
13. Theurl E, Winner H. The impact of hospital financing on the length of stay: Evidence from Austria. Health Policy. 2007;82(3):375–89. doi: 10.1016/j.healthpol.2006.11.001.
14. Gaughan J, Gutacker N, Grašič K, Kreif N, Siciliani L, Street A. Paying for efficiency: Incentivising same-day discharges in the English NHS. J Health Econ. 2019;68:102226. doi: 10.1016/j.jhealeco.2019.102226.
15. Allen T, Fichera E, Sutton M. Can Payers Use Prices to Improve Quality? Evidence from English Hospitals. Health Econ. 2016;25(1):56–70. doi: 10.1002/hec.3121.
16. Verzulli R, Fiorentini G, Lippi Bruni M, Ugolini C. Price Changes in Regulated Healthcare Markets: Do Public Hospitals Respond and How? Health Econ. 2017;26(11):1429–46. doi: 10.1002/hec.3435.
17. Krabbe-Alkemade YJFM, Groot TLCM, Lindeboom M. Competition in the Dutch hospital sector: an analysis of health care volume and cost. Eur J Health Econ. 2017;18(2):139–53. doi: 10.1007/s10198-016-0762-9.
18. Hamada H, Sekimoto M, Imanaka Y. Effects of the per diem prospective payment system with DRG-like grouping system (DPC/PDPS) on resource usage and healthcare quality in Japan. Health Policy. 2012;107(2):194–201. doi: 10.1016/j.healthpol.2012.01.002.
19. Farrar S, Yi D, Sutton M, Chalkley M, Sussex J, Scott A. Has payment by results affected the way that English hospitals provide care? Difference-in-differences analysis. BMJ. 2009;339(7720):554–6. doi: 10.1136/bmj.b3047.
20. Cooper Z, Gibbons S, Jones S, McGuire A. Does Hospital Competition Save Lives? Evidence From The English NHS Patient Choice Reforms. Econ J. 2011;121(554):F228–F260. doi: 10.1111/j.1468-0297.2011.02449.x.
21. Palmer KS, Agoritsas T, Martin D, Scott T, Mulla SM, Miller AP, et al. Activity-based funding of hospitals and its impact on mortality, readmission, discharge destination, severity of illness, and volume of care: a systematic review and meta-analysis. PLoS ONE. 2014;9(10):e109975. doi: 10.1371/journal.pone.0109975.
22. Street A, Vitikainen K, Bjorvatn A, Hvenegaard A. Introducing activity-based financing: a review of experience in Australia, Denmark, Norway and Sweden. Working Papers 030cherp, Centre for Health Economics, University of York; 2007.
23. Street A, Maynard A. Activity based financing in England: the need for continual refinement of payment by results. Health Econ Policy Law. 2007;2(4):419–27. doi: 10.1017/S174413310700429X.
24. Shleifer A. A Theory of Yardstick Competition. RAND J Econ. 1985;16(3):319–27.
25. Brick A, Nolan A, O’Reilly J, Smith S. Resource Allocation, Financing and Sustainability in Health Care. Evidence for the Expert Group on Resource Allocation and Financing in the Health Sector. Dublin: The Economic and Social Research Institute (ESRI); 2010.
26. Keegan C, Connolly S, Wren MA. Measuring healthcare expenditure: different methods, different results. Ir J Med Sci. 2018;187(1):13–23. doi: 10.1007/s11845-017-1623-y.
27. Healthcare Pricing Office. Activity in Acute Public Hospitals in Ireland. 2021.
28. Department of Health. Future Health. A Strategic Framework for Reform of the Health Service 2012–2015. Dublin; 2012.
29. Health Service Executive (HSE). Activity-Based Funding Programme Implementation Plan 2015–2017. Dublin; 2015.
30. Healthcare Pricing Office. Introduction to the Price Setting Process for Admitted Patients V1.0 26May2015. 2015.
31. Kontopantelis E, Doran T, Springate DA, Buchan I, Reeves D. Regression based quasi-experimental approach when randomisation is not an option: interrupted time series analysis. BMJ. 2015;350:h2750. doi: 10.1136/bmj.h2750.
32. Bernal JL, Cummins S, Gasparrini A. Interrupted time series regression for the evaluation of public health interventions: a tutorial. Int J Epidemiol. 2017;46(1):348–55. doi: 10.1093/ije/dyw098.
33. Blundell R, Costa Dias M. Evaluation Methods for Non-Experimental Data. Fisc Stud. 2000;21(4):427–68.
34. Linden A. Conducting Interrupted Time-series Analysis for Single- and Multiple-group Comparisons. Stata J. 2015;15(2):480–500.
35. Linden A, Adams JL. Applying a propensity score-based weighting model to interrupted time series data: improving causal inference in programme evaluation. J Eval Clin Pract. 2011;17(6):1231–8. doi: 10.1111/j.1365-2753.2010.01504.x.
36. Rubin DB. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. J Am Stat Assoc. 2005;100(469):322–31.
37. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701.
38. Angrist JD, Pischke J-S. Parallel Worlds: Fixed Effects, Differences-in-differences, and Panel Data. In: Mostly Harmless Econometrics. Princeton: Princeton University Press; 2009.
39. Basu S, Meghani A, Siddiqi A. Evaluating the Health Impact of Large-Scale Public Policy Changes: Classical and Novel Approaches. Annu Rev Public Health. 2017;38:351–70. doi: 10.1146/annurev-publhealth-031816-044208.
40. Heckman JJ, Ichimura H, Todd PE. Matching As An Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme. Rev Econ Stud. 1997;64(4):605–54.
41. Heckman J, Ichimura H, Smith J, Todd PE. Characterizing Selection Bias Using Experimental Data. Econometrica. 1998;66(5):1017–98.
42. Song Y, Sun W. Health Consequences of Rural-to-Urban Migration: Evidence from Panel Data in China. Health Econ. 2016;25(10):1252–67. doi: 10.1002/hec.3212.
43. Glazerman S, Levy DM, Myers D. Nonexperimental Replications of Social Experiments: A Systematic Review. 2003.
44. Abadie A, Diamond A, Hainmueller J. Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program. J Am Stat Assoc. 2010;105(490):493–505.
45. Sant’Anna PHC, Zhao J. Doubly robust difference-in-differences estimators. J Econom. 2020;219(1):101–22.
46. Callaway B, Sant’Anna PHC. Difference-in-Differences with multiple time periods. J Econom. 2020.
47. Kreif N, Grieve R, Hangartner D, Turner AJ, Nikolova S, Sutton M. Examination of the Synthetic Control Method for Evaluating Health Policies with Multiple Treated Units. Health Econ. 2016;25(12):1514–28. doi: 10.1002/hec.3258.
48. Bouttell J, Craig P, Lewsey J, Robinson M, Popham F. Synthetic control methodology as a tool for evaluating population-level health interventions. J Epidemiol Community Health. 2018;72(8):673. doi: 10.1136/jech-2017-210106.
49. Abadie A. Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. J Econ Lit. 2021;59(2):391–425.
50. Cruz M, Bender M, Ombao H. A robust interrupted time series model for analyzing complex health care intervention data. Stat Med. 2017;36(29):4660–76. doi: 10.1002/sim.7443.
51. Leuven E, Sianesi B. PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. Boston College Department of Economics; 2003.
52. Abadie A, Diamond AJ, Hainmueller J. Comparative Politics and the Synthetic Control Method. Am J Pol Sci. 2015;59(2):495–510.
53. Epstein RA, Feix J, Arbogast PG, Beckjord SH, Bobo WV. Changes to the financial responsibility for juvenile court ordered psychiatric evaluations reduce inpatient services utilization: an interrupted time series study. BMC Health Serv Res. 2012;12(1):136. doi: 10.1186/1472-6963-12-136.
54. Pincus D, Widdifield J, Palmer KS, Paterson JM, Li A, Huang A, et al. Effects of hospital funding reform on wait times for hip fracture surgery: a population-based interrupted time-series analysis. BMC Health Serv Res. 2021;21(1):576. doi: 10.1186/s12913-021-06601-2.
55. Hudson J, Fielding S, Ramsay CR. Methodology and reporting characteristics of studies using interrupted time series design in healthcare. BMC Med Res Methodol. 2019;19(1):137. doi: 10.1186/s12874-019-0777-x.
56. Ewusie JE, Soobiah C, Blondal E, Beyene J, Thabane L, Hamid JS. Methods, Applications and Challenges in the Analysis of Interrupted Time Series Data: A Scoping Review. J Multidiscip Healthc. 2020;13:411–23. doi: 10.2147/JMDH.S241085.
57. Aragón MJ, Chalkley M, Kreif N. The long-run effects of diagnosis related group payment on hospital lengths of stay in a publicly funded health care system: Evidence from 15 years of micro data. Health Econ. 2022.
58. Francetic I, Meacock R, Elliott J, Kristensen SR, Britteon P, Lugo-Palacios DG, et al. Framework for identification and measurement of spillover effects in policy implementation: intended non-intended targeted non-targeted spillovers (INTENTS). Implement Sci Commun. 2022;3(1):30. doi: 10.1186/s43058-022-00280-8.


Propensity scores

Propensity score matching (PSM) is a quasi-experimental method used to estimate the difference in outcomes between beneficiaries and non-beneficiaries that is attributable to a particular program.

PSM reduces the selection bias that may be present in non-experimental data. Selection bias exists when units (e.g. individuals, villages, schools) cannot or have not been randomly assigned to a particular program, and those units which choose or are eligible to participate are systematically different from those that are not.

A propensity score is the estimated probability that a unit is exposed to the program, constructed from the unit’s observed characteristics. The propensity scores of all units in the sample, both beneficiaries and non-beneficiaries, are used to create a comparison group against which the program’s impact can be measured. By comparing units that do not participate in the program but otherwise share the same observed characteristics as those that do, PSM reduces the bias due to observed differences and yields an estimate of the causal effect of the program on the outcome or outcomes of interest.

PSM consists of four phases: estimating the probability of participation (the propensity score) for each unit in the sample; selecting a matching algorithm to match beneficiaries with non-beneficiaries and construct a comparison group; checking for balance in the characteristics of the treatment and comparison groups; and estimating the program effect and interpreting the results. Each phase is described below; a minimal code sketch of the full workflow follows the list.

  • Estimating the propensity score: The propensity scores are constructed using a logit or probit regression that estimates the probability of a unit’s exposure to the program, conditional on a set of observable characteristics that may affect participation. For the propensity scores to estimate the probability of participation correctly, the characteristics included in the model should be carefully chosen and as exhaustive as possible. However, characteristics that may have been affected by the treatment must not be included; for this reason, it is best to estimate the propensity scores from baseline data, if available. Once the relevant covariates are selected, the logit or probit regression is fitted and the predicted probabilities are obtained.
  • Nearest-neighbour matching: Each program beneficiary is matched to the non-beneficiary with the closest propensity score. Non-beneficiaries for which there is no beneficiary with a sufficiently similar score are discarded from the sample; the same applies to beneficiaries for which there is no similar non-beneficiary. A variation of nearest-neighbour matching matches multiple non-beneficiaries (for example, the five nearest) to a single beneficiary.
  • Radius matching (i.e. ‘caliper’ matching): A maximum propensity score distance – a ‘caliper’ – is established, and all non-beneficiaries within the given radius of a beneficiary are matched to that beneficiary.
  • Kernel matching: For each treated unit, a weighted average of the outcomes of all non-beneficiaries is computed. The weights are based on the distance between each non-beneficiary’s propensity score and that of the treated unit, with the largest weights given to non-beneficiaries whose scores are closest to the treated unit’s.
  • Check for balance: Once units are matched, the characteristics of the constructed treatment and comparison groups should not differ significantly; i.e., the matched units in the two groups should be statistically comparable. Balance is typically tested with a t-test comparing the means of each covariate included in the propensity score across the treatment and comparison groups. If balance is not achieved (i.e., the means of some covariates differ significantly), a different matching method or specification should be used until the sample is sufficiently balanced.
  • Estimating the program effect and interpreting results: After the propensity scores have been estimated, a matching algorithm applied, and balance achieved, the intervention’s impact is estimated by averaging the differences in outcome between each treated unit and its matched neighbour or neighbours in the constructed comparison group. This average difference between participants and their matches can then be interpreted as the impact of the program.
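
To make the four phases concrete, here is a minimal Python sketch of the full workflow. It assumes a pandas DataFrame df with a binary treatment column, an outcome column and a few hypothetical baseline covariates (the column names are illustrative only), uses greedy 1:1 nearest-neighbour matching without replacement, and checks balance with t-tests; a real analysis would normally also impose a caliper and a common-support restriction.

```python
import pandas as pd
from scipy import stats
from sklearn.linear_model import LogisticRegression

COVARIATES = ["age", "income", "education"]   # hypothetical baseline covariates
TREAT, OUTCOME = "treated", "outcome"         # hypothetical column names

def psm_att(df: pd.DataFrame) -> float:
    # Phase 1: estimate propensity scores with a logit model.
    logit = LogisticRegression(max_iter=1000).fit(df[COVARIATES], df[TREAT])
    df = df.assign(pscore=logit.predict_proba(df[COVARIATES])[:, 1])

    # Phase 2: greedy 1:1 nearest-neighbour matching on the propensity score,
    # without replacement (each non-beneficiary is used at most once).
    treated = df[df[TREAT] == 1]
    controls = df[df[TREAT] == 0].copy()
    pairs = []
    for i, score in treated["pscore"].items():
        j = (controls["pscore"] - score).abs().idxmin()
        pairs.append((i, j))
        controls = controls.drop(index=j)
    matched_t = df.loc[[t for t, _ in pairs]]
    matched_c = df.loc[[c for _, c in pairs]]

    # Phase 3: check balance with a t-test on each covariate; large p-values
    # suggest the matched groups are statistically comparable.
    for cov in COVARIATES:
        _, p = stats.ttest_ind(matched_t[cov], matched_c[cov])
        print(f"balance check for {cov}: p = {p:.3f}")

    # Phase 4: the program effect is the average difference in outcomes
    # between matched treated units and their matched comparisons.
    return matched_t[OUTCOME].mean() - matched_c[OUTCOME].mean()
```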

Jalan and Ravallion (2003) conducted an impact evaluation that measured the effect of access to piped water on the incidence and duration of diarrhea among children under five years of age in 16 states in India. The study used a household survey conducted in 1993–94 and applied PSM to create comparable treatment and comparison households from within the larger sample. The authors argued that impact estimates based on the full sample would be subject to selection bias, because not all characteristics that influence both child health and water source selection are observable or included in the survey. They also argued that including variables that do not necessarily predict outcomes reduces bias in the estimates of causal effects. Pre-exposure variables (e.g. state of residence, household composition, assets, religion, access to public goods and village characteristics) were incorporated into a propensity score through a logit model. Five-nearest-neighbour matching was used to create the analysis sample of approximately 33,000 observations; around 650 households were excluded after matching because no sufficiently similar matches were available. The authors concluded that access to piped water reduces disease prevalence by 21% and illness duration by 29%.

Source: Jalan J, and Ravallion M (2003).

Advice for choosing this method

  • Propensity score matching requires statistical computations and is best conducted using statistical programs such as Stata or SPSS. It may be useful to involve an experienced statistician, depending on levels of staff knowledge. 
  • PSM demands a deep understanding of the observable covariates that drive participation in an intervention and requires that there is substantial overlap between the propensity scores of those subjects or units which have benefited from the program and those that have not; this is called the ‘common support’. If either of these two factors is lacking, PSM is not a suitable methodology for estimating causal effects of an intervention.
  • PSM also requires a large sample size in order to obtain statistically reliable results. This is true of many causal inference methodologies, but it is particularly true of PSM because observations that fall outside the common support are discarded.
  • PSM is not a panacea – because it matches only on observed information, it may not eliminate bias from unobserved differences between treatment and comparison groups.
  • It is important to understand the trade-off between reducing bias and reducing standard errors that arises when specifying the matching algorithm. For example, when choosing the caliper size for radius matching, a caliper that is too large risks matching very dissimilar units, while a caliper that is too small may leave the sample too small to obtain statistically convincing results. Similarly, for nearest-neighbour matching, using multiple neighbours reduces the variance of the estimate (more control observations are used) but tends to increase bias, because the additional matches are less similar to the treated unit than its single closest match. Comparable trade-offs exist for every matching algorithm; the radius-matching step is sketched in code below.
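
To illustrate the caliper trade-off, here is a small sketch of the radius-matching step, reusing the treated and controls frames (with their pscore column) from the workflow sketch above; the caliper width passed in is up to the analyst, and the often-cited rule of thumb of 0.2 standard deviations of the propensity score (or of its logit) is a convention, not a requirement.

```python
def radius_match(treated, controls, caliper):
    # Radius ('caliper') matching: every non-beneficiary whose propensity
    # score lies within the caliper of a beneficiary is matched to that
    # beneficiary. A wide caliper admits more, but more dissimilar, matches
    # (lower variance, higher bias); a narrow caliper keeps only close
    # matches but may leave many treated units unmatched (higher variance).
    matches = {}
    for i, score in treated["pscore"].items():
        within = controls.index[(controls["pscore"] - score).abs() <= caliper]
        if len(within) > 0:
            matches[i] = list(within)   # treated units with no match are dropped
    return matches
```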

Advice for using this method

  • The evaluator needs a deep understanding of the context in order to select the observable covariates to include in the propensity score – rarely will a comprehensive list exist. Useful criteria to consider include the explicit criteria that determine participation in the intervention (i.e. project or program eligibility rules), as well as factors associated with self-selection (e.g. the subject’s distance on foot from a project location). Covariates that are affected by the intervention should not be included in the propensity score.
  • No matching algorithm is superior in every context – each involves a trade-off between efficiency and bias. Balance statistics from multiple matching algorithms should be examined to determine which method achieves the best balance.
  • Increase the reliability of the evaluation by running several matching algorithms and choosing the one that produces the best balance statistics – for example, by comparing standardized mean differences, as in the sketch below.
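
One simple way to compare algorithms is to summarise the balance of each matched sample with standardized mean differences, a common complement to the t-tests mentioned earlier. The helper below is a hypothetical illustration that works on matched treated/comparison frames such as matched_t and matched_c from the workflow sketch above.

```python
import numpy as np

def standardized_mean_differences(matched_t, matched_c, covariates):
    # Standardized mean difference (SMD) per covariate: difference in means
    # divided by the pooled standard deviation. Absolute SMDs below roughly
    # 0.1 are often read as acceptable balance; compute this for the sample
    # produced by each matching algorithm and keep the best-balanced one.
    smd = {}
    for cov in covariates:
        pooled_sd = np.sqrt((matched_t[cov].var() + matched_c[cov].var()) / 2)
        smd[cov] = (matched_t[cov].mean() - matched_c[cov].mean()) / pooled_sd
    return smd
```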

This book provides "a comprehensive overview of steps in designing and evaluating programs amid uncertain and potentially confounding conditions."

This book from the World Bank provides a detailed introduction to impact evaluations in the development field. It also provides several tools and approaches for conducting impact evaluations.

Propensity Scores are discussed in detail from Page 348 in the Understanding Propensity Scores unit.  

In this paper, Canning, Mahal, Odumosu and Okonkwo (2006) use propensity score matching to assess the effect of HIV/AIDS on an individual's ability to access health care in Nigeria.

In this study, propensity score matching is used to assess the impact of micro-finance loans on reducing poverty when they are used for productive means.

This study uses propensity score matching to estimate the distribution of net income gains from an Argentinean workfare program.

Dehejia R, Wahba S (2002). “Propensity Score-Matching Methods for Nonexperimental Causal Studies”. Review of Economics and Statistics, 84(1), 151–161.

Heckman J, Ichimura H, Todd P (1997). “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme”. Review of Economic Studies, 64(4), 605–654.

Heckman J, Ichimura H, Todd P (1998). “Matching as an Econometric Evaluation Estimator”. Review of Economic Studies, 65(2), 261–294. Retrieved from https://www.researchgate.net/publication/4783246_Matching_As_An_Econometric...

Heckman J, Ichimura H, Smith J, Todd P (1998). “Characterizing Selection Bias Using Experimental Data”. Econometrica, 66(5), 1017–1098. Retrieved from https://www.jstor.org/stable/2999630

Jalan J, Ravallion M (2003). “Estimating the Benefit Incidence of an Antipoverty Program by Propensity Score Matching”. Journal of Business & Economic Statistics, 21(1), 19–30. Retrieved from http://fmwww.bc.edu/RePEc/es2000/0873.pdf

Rosenbaum P, Rubin D (1983). “The Central Role of the Propensity Score in Observational Studies for Causal Effects”. Biometrika, 70(1), 41–55.

Rosenbaum P, Rubin D (1985). “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score”. The American Statistician, 39(1), 33–38.


