5. Lessons Learned
5.3 Lessons Learned about Collecting Data to Support Therapy Payment and
A key feature of the DOTPA project was the collection of data on therapy patients either directly or indirectly via a proxy respondent. Therapy payment reforms and case-mix and outcomes measurement will likely require collection of similar data. Here, we offer observations from our experience and analysis on the usefulness of patient self-report data in addition to clinician assessment data, procedures for collecting the data, and issues with missing data among the collected data. We also discuss what we learned about sample sizes necessary for developing therapy case-mix adjustment models.
29 This item is “Does the patient have any difficulty with memory, attention, problem solving, planning, organizing,
5.3.1 Collecting Patient Self-Report Data in Addition to Clinician-Observed Data
The DOTPA project collected patient self-report data and clinician-observed data from therapy sites of care. Collecting the patient self-report data incurs costs and imposes burdens on patients. An important question for future data collection is whether the usefulness of the patient self-report data justifies the cost and burden of collecting the information.
Twelve of the 14 TEP members responded that they prefer to include both
clinician-observed and self-reported items in a case-mix model to predict therapy expenditures. Overall, the TEP thought that both clinician-observed and self-reported items were
complementary and should be combined to obtain a more complete picture of the patient’s functional ability. Several members pointed out that the self-reported items would not be appropriate for patients with cognitive impairments.
Analyses from the DOTPA Measurement Report (Kline et al., 2014) found some item similarity between the patient self-report and clinician-observed items, but low correlations and low intra-class correlations led to the recommendation of including both scales to capture potentially different aspects of function. TEP comments and the recommendations from the DOTPA Measurement Report were also validated by empirical analysis in Section 5.4.1 of the DOTPA Payment Alternatives Report (Amico et al., 2014b). The explanatory power of the model including both the clinician-observed and self-reported measures was greater than that of either the clinician-observed mobility scale or the self-report measures alone. This supports the assertion that the two types of measure collect different information. Although there is value to collecting both self-report and clinician-observed measures, future work should consider the added cost of collecting these measures in relation to the benefit of added explanatory power.
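The link between low inter-scale correlation and added explanatory power can be illustrated with the standard formula for the R-squared of a linear model with two predictors, expressed in terms of pairwise correlations. The sketch below is an illustration only: the correlation values are hypothetical and are not DOTPA results, and the function names are our own.

```python
def r2_single(r_y):
    """R-squared of a one-predictor linear model: the squared correlation."""
    return r_y ** 2


def r2_two_predictors(r_y1, r_y2, r_12):
    """R-squared of a two-predictor linear model, from pairwise correlations.

    r_y1, r_y2: correlation of each predictor with the outcome.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


# Hypothetical correlations with therapy expenditures (illustration only).
clinician_r, self_report_r = 0.5, 0.4

# Same two scales, but with weak versus strong overlap between them.
low_overlap = r2_two_predictors(clinician_r, self_report_r, r_12=0.2)
high_overlap = r2_two_predictors(clinician_r, self_report_r, r_12=0.9)

print(f"clinician-observed only: {r2_single(clinician_r):.3f}")
print(f"self-report only:        {r2_single(self_report_r):.3f}")
print(f"both, low overlap:       {low_overlap:.3f}")
print(f"both, high overlap:      {high_overlap:.3f}")
```

Under these assumed values, the combined model gains more over the best single scale when the two scales are weakly correlated than when they largely duplicate each other.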
5.3.2 Data Collection Procedures
The DOTPA data collection began with recruitment of participating Medicare outpatient therapy providers. This effort was originally scheduled for several months before the collection of any assessments but proved more difficult than was anticipated. Many providers who
declined to participate cited the length of the assessment instrument and a lack of sufficient time and resources as their reasons for doing so.30 Difficulties in recruitment led the DOTPA team to extend the enrollment period into the data collection phase and to revise enrollment targets downward. Future data collection efforts should consider a shorter survey instrument to
minimize respondent burden or at least an electronic data collection instrument with appropriate skip patterns to lower burden.
An individual provider’s participation in DOTPA began with a Web-based training session instructing them in both the data collection protocol and the clinical content of the assessment instrument. Although Webinars proved much less costly than in-person training sessions, they may have been less effective. As with most Webinars, it was difficult to monitor the engagement of the participants. Often, completed assessments were returned to the DOTPA team with errors on points that had been discussed at length during training. Future efforts should consider more interactive approaches to training from a distance that maximize participant engagement, as well as whether in-person training is worth the additional cost.
30 Some participating provider sites later indicated that skip patterns on the instrument made its completion easier
Assessments were collected on paper forms and returned by mail every 2 weeks. As such, an assessment was typically 2–3 weeks old before the DOTPA team could review the completeness of the responses. Furthermore, because some assessment data were captured with a paper form, initial review was limited to the structure of the responses (e.g., date of birth being recorded in the form mm/dd/yyyy), and immediate validation of the content with other data sources such as claims or Medicare enrollment files was not feasible. Many of the errors discovered during this review were mistakes in protocol (e.g., incorrectly following a skip pattern or leaving a required question blank). Any mistakes were immediately communicated to the provider site coordinator. Often, corrections were obtained from these providers, although some could not locate the missing or correct information (usually citing a lack of time).
The DOTPA team then transcribed completed assessments into electronic databases. This proved a much more burdensome and costly task than was originally anticipated. Transcription was partially automated by optical character recognition software as originally planned, but output was at times unreliable and required manual validation. Furthermore, transcription of alphanumeric and free-text responses (e.g., Medicare Health Insurance Claim Number, date of birth, National Provider Identifier) could not be automated and required manual data entry. This time-consuming effort delayed completion of the analytic data file and detailed validation of the data contained within the assessments.
Future efforts should use electronic methods of data collection that could both automatically generate analytic databases and validate responses in real time. These methods would effectively eliminate burdensome transcription and review efforts and would prevent providers from submitting assessments that contain protocol mistakes. The system could also automate skip patterns in the assessment instrument. Skip patterns alleviated some of the anticipated participant burden. Although automation would not further reduce this burden, it could present a less challenging first impression during recruitment and boost the effectiveness of that effort. Such a system was tested with a small subset of DOTPA participants during data collection, and the effort was largely successful.
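As a rough illustration of what real-time validation could look like, the following minimal sketch checks a submitted response for blank required items, a date of birth in the form mm/dd/yyyy, and a correctly followed skip pattern. The item names and the skip rule are hypothetical and are not taken from the CARE instrument; the system tested with DOTPA participants may have worked differently.

```python
from datetime import datetime

# Hypothetical item names and skip rule, for illustration only.
REQUIRED_ITEMS = ["date_of_birth", "uses_wheelchair"]
# (gate item, gate answer that triggers the skip, dependent item)
SKIP_RULES = [("uses_wheelchair", "no", "wheelchair_distance")]


def is_blank(value):
    return value in (None, "")


def validate(response):
    """Return a list of error messages; an empty list means the form passes."""
    errors = []

    # Required items must not be left blank.
    for item in REQUIRED_ITEMS:
        if is_blank(response.get(item)):
            errors.append(f"{item}: required item left blank")

    # Structural check: date of birth must parse as mm/dd/yyyy.
    dob = response.get("date_of_birth")
    if not is_blank(dob):
        try:
            datetime.strptime(dob, "%m/%d/%Y")
        except ValueError:
            errors.append("date_of_birth: expected mm/dd/yyyy")

    # Skip patterns: the dependent item is skipped or required
    # depending on the answer to the gate item.
    for gate, skip_value, dependent in SKIP_RULES:
        gate_answer = response.get(gate)
        if is_blank(gate_answer):
            continue  # a blank gate item is reported by the required check
        answered = not is_blank(response.get(dependent))
        if gate_answer == skip_value and answered:
            errors.append(f"{dependent}: must be skipped when {gate} is '{skip_value}'")
        elif gate_answer != skip_value and not answered:
            errors.append(f"{dependent}: required when {gate} is not '{skip_value}'")

    return errors


# A response with a malformed date and a missed follow-up item.
for message in validate({"date_of_birth": "1940-01-05", "uses_wheelchair": "yes"}):
    print(message)
```

Running such checks at the point of entry would let the provider correct a mistake while the patient record is still at hand, rather than weeks later after a mailed form is reviewed.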
5.3.3 Missing Item Responses
In addition to the small overall participant sample sizes, many of the individual items on the CARE assessment were not answered, which made it impossible to construct the patient function scales for the affected patients. We therefore had to drop many cases from the analysis in order to run the regression models. Some of the nonreporting may have been because responding therapists did not see certain CARE questions as relevant to their treatment of the patient or did not feel qualified to provide the requested information. For example, these reasons may have been behind the low response rate to the CARE self-care items by physical therapy patients. As discussed in Section 5.2.2, there was also confusion about the Not Assessed items that had only one response option but could have represented three different reasons for clinicians’ inability to assess. Restructuring the CARE questionnaire so that it asks for reporting of only data that are feasible and relevant for the respondent to provide, as well as including multiple response options that specify the reason data were not provided, could lessen the degree of missing or unusable data in future data collection efforts. As was done for DOTPA, the revised instrument would need to be tested on a small set of providers and beneficiaries to ensure that the changes work as intended.
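The effect of item nonresponse on the usable sample can be sketched as follows. The example is a toy illustration with hypothetical item names and made-up responses. It assumes that a scale can be constructed only when every one of its items is answered (complete-case analysis), which is a simplification of the procedures actually used.

```python
# Hypothetical scale items and toy responses (None = item not answered).
SCALE_ITEMS = ["item_a", "item_b", "item_c"]

assessments = [
    {"item_a": 3, "item_b": 4, "item_c": 2},
    {"item_a": 5, "item_b": None, "item_c": 1},
    {"item_a": None, "item_b": None, "item_c": 4},
    {"item_a": 2, "item_b": 6, "item_c": 3},
]


def item_nonresponse_rates(records, items):
    """Share of records in which each item was left unanswered."""
    return {
        item: sum(r.get(item) is None for r in records) / len(records)
        for item in items
    }


def complete_cases(records, items):
    """Records with every scale item answered (all others are dropped)."""
    return [r for r in records if all(r.get(i) is not None for i in items)]


rates = item_nonresponse_rates(assessments, SCALE_ITEMS)
usable = complete_cases(assessments, SCALE_ITEMS)

print("nonresponse rate by item:", rates)
print(f"usable cases: {len(usable)} of {len(assessments)}")
```

Tabulating nonresponse by item in this way, ideally also by discipline and by recorded reason, would show which items are driving the loss of cases and are candidates for restructuring.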
5.3.4 Sample Sizes
Because of the issues noted previously, the project did not have the intended sample size available for analysis. OT had only about 500 CARE-C episodes or beneficiary/years for analysis, and SLP had fewer than 200. Sample sizes for the combined-disciplines nursing facility analysis were 500–600 episodes or beneficiary/years. These very small sample sizes meant that many of our OT and SLP community case-mix regression models and our nursing facility models overfit the data and did not achieve valid and reproducible results. We had far too many explanatory variables in the OT, SLP, and nursing facility models relative to sample size, so we could not draw many solid conclusions about the independent effect of those variables on therapy expenditures. The classification and regression tree (CART) analysis was even more limited by small sample sizes. Mutually exclusive case-mix group sizes, and therefore, statistical precision in measuring expenditure differences, rapidly diminish in the CART analysis because case-mix groups are iteratively split by additional variables.
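The speed at which group sizes shrink can be seen with simple arithmetic. The sketch below assumes, purely for illustration, a balanced binary tree in which every split divides a group in half and equal outcome variance in every group; real CART splits are uneven, so some groups would be smaller still. The sample sizes are the rounded figures cited in this section.

```python
import math


def average_group_size(n_cases, depth):
    """Average cases per group after `depth` levels of binary splits."""
    return n_cases / 2 ** depth


def relative_standard_error(n_cases, depth):
    """Standard error of a group mean relative to that of the unsplit sample.

    Assumes equal variance in every group, so the ratio is
    sqrt(n_cases / group size) = sqrt(2 ** depth).
    """
    return math.sqrt(n_cases / average_group_size(n_cases, depth))


# Rounded sample sizes from the text (episodes or beneficiary/years).
for label, n_cases in [("SLP", 200), ("OT", 500), ("PT", 4000)]:
    sizes = [average_group_size(n_cases, depth) for depth in range(1, 5)]
    print(f"{label}: average group size at depths 1-4: {sizes}")

print("standard error multiplier at depth 4:", relative_standard_error(500, 4))
```

Under these assumptions, each additional level of splitting halves the average group size and raises the standard error of a group's mean expenditure by a factor of about 1.41.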
The PT sample size of more than 4,000 CARE-C episodes or beneficiary/years was much better. Nevertheless, the most expensive cases—the ones most likely to exceed the therapy caps—tend to be rare and were poorly represented in the PT sample. The small sample sizes of expensive cases limit the ability to identify and develop case-mix categories to adjust for them. Future work should consider oversampling the rare, expensive cases, such as paralysis,
amputation, burns, severe head injuries, and severe neurological disorders (e.g., quadriplegic cerebral palsy, amyotrophic lateral sclerosis) to identify and develop case-mix categories to adjust for these costly groups. Future CART work to create mutually exclusive case-mix groups will need to have much larger sample sizes to achieve validity and reproducibility.
Without attempting a formal statistical power analysis, we opine that a minimum of 5,000 episodes or beneficiary/years is necessary to develop a valid basic case-mix model. To develop a refined model, random samples of at least tens of thousands of cases, or smaller random samples with an oversample of high-need conditions or beneficiaries, are required.
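One way to combine a random sample with an oversample of high-need cases is stratified sampling with design weights. The sketch below is a minimal illustration under assumed, hypothetical stratum sizes; it is not a sampling plan. Each sampled case carries a weight equal to the stratum population size divided by the stratum sample size, so that weighted totals still represent the full population.

```python
import random


def stratified_sample(population, sample_sizes, seed=0):
    """Draw a fixed number of cases per stratum and attach design weights.

    population: dict mapping stratum name -> list of cases.
    sample_sizes: dict mapping stratum name -> number of cases to draw.
    Each sampled case gets weight N_h / n_h (stratum size / stratum sample).
    """
    rng = random.Random(seed)
    sample = []
    for stratum, cases in population.items():
        n = min(sample_sizes.get(stratum, 0), len(cases))
        if n == 0:
            continue  # nothing drawn from this stratum
        weight = len(cases) / n
        for case in rng.sample(cases, n):
            sample.append({"stratum": stratum, "case": case, "weight": weight})
    return sample


# Hypothetical population: 9,800 typical cases and 200 rare, high-cost cases.
population = {
    "typical": list(range(9800)),
    "high_cost": list(range(9800, 10000)),
}

# Oversample the rare stratum: take all 200 high-cost cases, 800 typical cases.
sample = stratified_sample(population, {"typical": 800, "high_cost": 200})

n_high_cost = sum(1 for s in sample if s["stratum"] == "high_cost")
weighted_total = sum(s["weight"] for s in sample)

print(f"sample size: {len(sample)}, high-cost cases: {n_high_cost}")
print(f"weighted total: {weighted_total}")
```

In this toy design, high-cost cases make up 20 percent of the sample but only 2 percent of the population; the weights would need to be applied in any analysis meant to describe the population as a whole.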