Accuracy of GoogleTranslate™
The population of patients in the US with limited English proficiency (LEP), those who speak English less than "very well,"1 is substantial and continues to grow.1, 2 Patients with LEP are at risk for lower quality health care overall than their English‐speaking counterparts.3–8 Professional in‐person interpreters greatly improve spoken communication and quality of care for these patients,4, 9 but their assistance is typically based on the clinical encounter. Particularly if interpreting by phone, interpreters are unlikely to be able to help with materials such as discharge instructions or information sheets meant for family members. Professional written translations of patient educational material help to bridge this gap, allowing clinicians to convey detailed written instructions to patients. However, professional translations must be prepared well in advance of any encounter and can only be used for easily anticipated problems.
The need to translate less common, patient‐specific instructions arises spontaneously in clinical practice, and formally prepared written translations are not useful in these situations. Online translation tools such as GoogleTranslate (available at
We conducted a pilot evaluation of an online translation tool as it relates to detailed, complex patient educational material. Our primary goal was to compare the accuracy of a Spanish translation generated by the online tool to that done by a professional agency. Our secondary goals were: 1) to assess whether sentence word length or complexity mediated the accuracy of GT; and 2) to lay the foundation for a more comprehensive study of the accuracy of online translation tools, with respect to patient educational material.
Methods
Translation Tool and Language Choice
We selected Google Translate (GT) since it is one of the more commonly used online translation tools and because Google is the most widely used search engine in the United States.13 GT uses statistical translation methodology to convert text, documents, and websites between languages; statistical translation involves the following three steps. First, the translation program recognizes a sentence to translate. Second, it compares the words and phrases within that sentence to the billions of words in its library (drawn from bilingual professionally translated documents, such as United Nations proceedings). Third, it uses this comparison to generate a translation combining the words and phrases deemed most equivalent between the source sentence and the target language. If there are multiple sentences, the program recognizes and translates each independently. As the body of bilingual work grows, the program learns and refines its rules automatically.14 In contrast, in rule‐based translation, a program would use manually prespecified rules regarding word choice and grammar to generate a translation.15 We assessed GT's accuracy translating from English to Spanish because Spanish is the predominant non‐English language spoken in the US.1
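The three steps described above can be illustrated with a toy sketch. The phrase table, its probabilities, and the function names are invented for illustration and do not reflect GT's actual data or implementation. Real statistical systems learn far larger tables from bilingual text and combine them with additional models.

```python
import re

# Hypothetical phrase table: English phrase -> candidate Spanish phrases with
# made-up probabilities standing in for statistics learned from bilingual text.
PHRASE_TABLE = {
    "take": [("tome", 0.7), ("tomar", 0.3)],
    "this medication": [("este medicamento", 0.9), ("esta medicación", 0.1)],
    "twice a day": [("dos veces al día", 0.95), ("dos veces por día", 0.05)],
}


def split_sentences(text):
    """Step 1: recognize the sentences to translate."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def translate_sentence(sentence, table=PHRASE_TABLE, max_len=3):
    """Steps 2 and 3: look up phrases and keep the most probable candidate."""
    words = sentence.lower().rstrip(".!?").split()
    output, i = [], 0
    while i < len(words):
        # Try the longest phrase first (a simplifying assumption of this sketch).
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in table:
                best, _ = max(table[phrase], key=lambda cand: cand[1])
                output.append(best)
                i += n
                break
        else:
            output.append(words[i])  # unknown word is passed through unchanged
            i += 1
    return " ".join(output)


def translate(text):
    # Each sentence is translated independently of its neighbors.
    return [translate_sentence(s) for s in split_sentences(text)]
```

Because each sentence is handled on its own, a fragment followed by a list has little context to draw on, which is consistent with the sentence-level evaluation used in this study.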
Document Selection and Preparation
We selected the instruction manual regarding warfarin use prepared by the Agency for Healthcare Research and Quality (AHRQ) for this accuracy evaluation. We selected this manual,16 written at a 6th grade reading level, because a professional Spanish translation was available (completed by ASET International Service, LLC, before and independently of this study), and because patient educational material regarding warfarin has been associated with fewer bleeding events.17 We downloaded the English document on October 19, 2009 and used the GT website to translate it en bloc. We then copied the resulting Spanish output into a text file. The English document and the professional Spanish translation (downloaded the same day) were both converted into text files in the same manner.
Grading Methodology
We scored the translation chosen using both manual and automated evaluation techniques. These techniques are widely used in the machine translation literature and are explained below.
Manual Evaluation: Evaluators, Domains, Scoring
We recruited three nonclinician, bilingual, native Spanish‐speaking research assistants as evaluators. The evaluators were all college educated with a Bachelor's degree or higher and were of Mexican, Nicaraguan, and Guatemalan ancestry. Each evaluator received a brief orientation regarding the project, as well as an explanation of the scores, and then proceeded to the blinded evaluation independently.
We asked evaluators to score sentences on Likert scales along five primary domains: fluency, adequacy, meaning, severity, and preference. Fluency and adequacy are well accepted components of machine translation evaluation,18 with fluency being an assessment of grammar and readability ranging from 5 ("Perfect fluency; like reading a newspaper") to 1 ("No fluency; no appreciable grammar, not understandable") and adequacy being an assessment of information preservation ranging from 5 ("100% of information conveyed from the original") to 1 ("0% of information conveyed from the original"). Given that a translated sentence can be highly adequate but drastically change the connotation and intent of the original (eg, a sentence that contains 75% of the correct words but changes "take this medication twice a day" to "take this medication once every two days"), we asked evaluators to assess meaning, a measure of connotation and intent maintenance, with scores ranging from 5 ("Same meaning as original") to 1 ("Totally different meaning from the original").19 Evaluators also assessed severity, a new measure of potential harm if a given sentence was assessed as having errors of any kind, ranging from 5 ("Error, no effect on patient care") to 1 ("Error, dangerous to patient") with an additional option of N/A ("Sentence basically accurate"). Finally, evaluators rated a blinded preference (also a new measure) for either of two translated sentences, ranging from "Strongly prefer translation #1" to "Strongly prefer translation #2." The order of the sentences was random (eg, sometimes the professional translation was first and sometimes the GT translation was). We subsequently converted this to preference for the professional translation, ranging from 5 ("Strongly prefer the professional translation") to 1 ("Strongly prefer the GT translation") in order to standardize the responses (Figures 1 and 2).
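The conversion of the blinded rating into a preference for the professional translation can be sketched as follows. The numeric coding of the raw rating (1 = strongly prefer translation #1, 3 = no preference, 5 = strongly prefer translation #2) and the function name are assumptions made for illustration; the text specifies only the end points of the converted scale.

```python
def preference_for_professional(raw_score, professional_first):
    """Convert a blinded 1-5 rating into preference for the professional translation.

    Assumed raw coding (hypothetical): 1 = strongly prefer translation #1,
    3 = no preference, 5 = strongly prefer translation #2.
    Returned scale: 5 = strongly prefer professional, 1 = strongly prefer GT.
    """
    if raw_score not in (1, 2, 3, 4, 5):
        raise ValueError("raw_score must be an integer from 1 to 5")
    # When the professional translation was shown first, a low raw score favors
    # it, so the scale is flipped; otherwise the raw score is already aligned.
    return 6 - raw_score if professional_first else raw_score
```

For example, a rating of 1 on a pair in which the professional translation appeared first becomes 5, while the same rating on a pair in which GT appeared first stays 1.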
The overall flow of the study is given in Figure 3. Each evaluator initially scored 20 sentences translated by GT and 10 sentences translated professionally along the first four domains. All 30 of these sentences were randomly selected from the original, 263‐sentence pamphlet. For fluency, evaluators had access only to the translated sentence to be scored; for adequacy, meaning, and severity, they had access to both the translated sentence and the original English sentence. Ten of the 30 sentences were further selected randomly for scoring on the preference domain. For these 10 sentences, evaluators compared the GT and professional translations of the same sentence (with the original English sentence available as a reference) and indicated a preference, for any reason, for one translation or the other. Evaluators were blinded to the technique of translation (GT or professional) for all scored sentences and domains. We chose twice as many sentences from the GT preparations for the first four domains to maximize measurements for the translation technology we were evaluating, with the smaller number of professional translations serving as controls.
After scoring the first 30 sentences, evaluators met with one of the authors (R.R.K.) to discuss and consolidate their approach to scoring. They then scored an additional 10 GT‐translated sentences and 5 professionally translated sentences for the first four domains, and 9 of these 15 sentences for preference, to see if the meeting changed their scoring approach. These sentences were selected randomly from the original, 263‐sentence pamphlet, excluding the 30 evaluated in the previous step.
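A minimal sketch of the two sampling rounds follows, assuming sentences are numbered 1 to 263 and that assignment to the GT or professional arm within each round was also random (the text does not say how that assignment was made). The seed and function name are illustrative only.

```python
import random


def draw_evaluation_rounds(n_sentences=263, seed=2009):
    """Draw the 30-sentence and 15-sentence rounds without overlap."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    all_ids = range(1, n_sentences + 1)

    def draw(pool, n_gt, n_prof, n_pref):
        chosen = rng.sample(pool, n_gt + n_prof)
        return {
            "gt": chosen[:n_gt],                      # scored as GT output
            "professional": chosen[n_gt:],            # scored as professional output
            "preference": rng.sample(chosen, n_pref)  # subset compared head to head
        }

    round1 = draw(list(all_ids), n_gt=20, n_prof=10, n_pref=10)
    used = set(round1["gt"]) | set(round1["professional"])
    # The second round excludes every sentence evaluated in the first round.
    remaining = [i for i in all_ids if i not in used]
    round2 = draw(remaining, n_gt=10, n_prof=5, n_pref=9)
    return round1, round2
```

Together the two rounds yield the 45 sentences and 19 preference pairs described in the text.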
Automated Machine Translation Evaluation
Machine translation researchers have developed automated measures allowing the rapid and inexpensive scoring and rescoring of translations. These automated measures supplement more time‐ and resource‐intensive manual evaluations. The automated measures are based upon how well the translation compares to one or, ideally, multiple professionally prepared reference translations. They correlate well with human judgments on the domains above, especially when multiple reference translations are used (increasing the number of reference translations increases the variability allowed for words and phrases in the machine translation, improving the likelihood that differences in score are related to differences in quality rather than differences in translator preference).20 For this study, we used Metric for Evaluation of Translation with Explicit Ordering (METEOR), a machine translation evaluation system that allows additional flexibility for the machine translation in terms of grading individual sentences and being sensitive to synonyms, word stemming, and word order.21 We obtained a METEOR score for each of the GT‐translated sentences using the professional translation as our reference, and assessed correlation between this automated measure and the manual evaluations for the GT sentences, with the aim of assessing the feasibility of using METEOR in future work on patient educational material translation.
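To give a sense of how a reference-based automated score works, the sketch below computes a recall-weighted harmonic mean of unigram precision and recall against a single reference. It is a deliberately simplified, METEOR-inspired illustration and not the METEOR tool used in this study: it matches exact words only and omits stemming, synonym matching, and the word-order penalty. The function name and the weight `alpha=0.9` are assumptions.

```python
from collections import Counter


def unigram_fmean(candidate, reference, alpha=0.9):
    """Recall-weighted harmonic mean of exact unigram precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped count of exact word matches between candidate and reference.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)  # share of candidate words found in reference
    recall = matches / len(ref)      # share of reference words found in candidate
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```

With only one reference, a correct translation that uses different wording from the professional version is penalized, which is one reason multiple references are preferred.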
Outcomes and Statistical Analysis
We compared the scores assigned to GT‐translated sentences on each of the five manually scored domains with the scores of the professionally translated sentences. We also assessed the impact of word count and sentence complexity on the scores achieved specifically by the GT‐translated sentences. Both analyses used clustered linear regression to account for the fact that each of the 45 sentences was scored by each of the three evaluators. Sentences were classified as simple if they contained one or fewer clauses and complex if they contained more than one clause.22 We also assessed interrater reliability for the manual scoring system using intraclass correlation coefficients and repeatability. Repeatability is an estimate of the maximum difference, with 95% confidence, between scores assigned to the same sentence on the same domain by two different evaluators;23 lower values indicate greater agreement between evaluators. Since we did not have clinical data or a gold standard, we used repeatability to estimate the value above which a difference between two scores might be clinically significant and not simply due to interrater variability.24 Finally, we assessed the correlation of the manual scores with those calculated by the METEOR automated evaluation tool using Pearson correlation coefficients. All analyses were conducted using Stata 11 (College Station, TX).
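The analyses in this study were run in Stata. As a rough illustration only, the Python sketch below shows one common way to compute a repeatability coefficient (1.96 × √2 × the within-sentence standard deviation, following the Bland and Altman approach) and a Pearson correlation coefficient. The function names and input layout are assumptions, and the exact estimator used in the study may differ.

```python
import math
from statistics import mean, variance


def repeatability(scores_by_sentence):
    """Repeatability coefficient for one domain.

    scores_by_sentence: list of lists, one inner list per sentence holding the
    scores that the different evaluators gave it (at least two per sentence).
    """
    # Pooled within-sentence variance, assuming equal raters per sentence.
    within_var = mean(variance(scores) for scores in scores_by_sentence)
    # Two scores of the same sentence are expected to differ by less than
    # this value about 95% of the time.
    return 1.96 * math.sqrt(2 * within_var)


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A repeatability value near 0 means evaluators gave nearly identical scores, while a value approaching the width of the 5‐point scale means their scores could be almost opposite.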
Results
Sentence Description
A total of 45 sentences were evaluated by the bilingual research assistants. The initial 30 sentences and the 15 sentences scored after the consolidation meeting received similar scores on all outcomes, after adjustment for word length and complexity, so we pooled all 45 sentences (as well as the 19 total sentence pairs scored for preference) for the final analysis. Average sentence lengths were 14.2 words, 15.5 words, and 16.6 words for the English source text, professionally translated sentences, and GT‐translated sentences, respectively. Thirty‐three percent of the English source sentences were simple and 67% were complex.
Manual Evaluation Scores
Sentences translated by GT received worse scores on fluency as compared to the professional translations (3.4 vs 4.7, P < 0.0001). Comparisons for adequacy and meaning were not statistically significantly different. GT‐translated sentences contained more errors of any severity as compared to the professional translations (39% vs 22%, P = 0.05), but a similar number of serious, clinically impactful errors (severity scores of 3, 2, or 1; 4% vs 2%, P = 0.61). However, one GT‐translated sentence was considered erroneous with a severity level of 1 ("Error, dangerous to patient"). This particular sentence was 25 words long and complex in structure in the original English document; all three evaluators considered the GT translation nonsensical ("La hemorragia mayor, llame a su médico, o ir a la emergencia de un hospital habitación si usted tiene cualquiera de los siguientes: Red N, oscuro, café o cola de orina de color."). Evaluators had no overall preference for the professional translation (3.2, 95% confidence interval = 2.7 to 3.7, with 3 indicating no preference; P = 0.36) (Table 1).
Domain | GoogleTranslate Translation | Professional Translation | P Value |
---|---|---|---|
Fluency* | 3.4 | 4.7 | <0.0001 |
Adequacy* | 4.5 | 4.8 | 0.19 |
Meaning* | 4.2 | 4.5 | 0.29 |
Severity | | | |
Any error | 39% | 22% | 0.05 |
Serious error | 4% | 2% | 0.61 |
Preference* | 3.2 | | 0.36 |
Mediation of Scores by Sentence Length or Complexity
We found that sentence length was not associated with scores for fluency, adequacy, meaning, severity, or preference (P > 0.30 in each case). Complexity, however, was significantly associated with preference: evaluators preferred the professional translation for complex English sentences while being more ambivalent about simple English sentences (3.6 vs 2.6, P = 0.03).
Interrater Reliability and Repeatability
We assessed the interrater reliability for each domain using intraclass correlation coefficients and repeatability. For fluency, the intraclass correlation was best at 0.70; for adequacy, it was 0.58; for meaning, 0.42; for severity, 0.48; and for preference, 0.37. The repeatability scores were 1.4 for fluency, 0.6 for adequacy, 2.2 for meaning, 1.2 for severity, and 3.8 for preference, indicating that two evaluators might give a sentence almost the same score (at most, 1 point apart from one another) for adequacy, but might have opposite preferences regarding which translation of a sentence was superior.
Correlation with METEOR
Correlations between the first four domains and the METEOR scores were lower than in prior studies.21 Fluency correlated best with METEOR at 0.53; adequacy correlated least with METEOR at 0.29. The remaining correlations fell in between. All correlations were statistically significant at P < 0.01 (Table 2).
Domain | Correlation with METEOR | P value |
---|---|---|
Fluency | 0.53 | <0.0001 |
Adequacy | 0.29 | 0.006 |
Meaning | 0.33 | 0.002 |
Severity | 0.39 | 0.002 |
Discussion
In this preliminary study comparing the accuracy of GT to professional translation for patient educational material, we found that GT was inferior to the professional translation in grammatical fluency but generally preserved the content and sense of the original text. Out of 30 GT sentences assessed, there was one substantially erroneous translation that was considered potentially dangerous. Evaluators preferred the professional translation for complex sentences, but when the English source sentence was simple (containing a single clause), this preference disappeared.
Like Sharif and Tse,12 we found that for information not arranged in sentences, automated translation sometimes produced nonsensical sentences. In our study, these resulted from an English sentence fragment followed by a bulleted list; in their study, the nonsensical translations resulted from pharmacy labels. The difference in frequency of these errors between our studies may have resulted partly from the translation tool evaluated (GT vs programs used by pharmacies in the Bronx), but may have also been due to our use of machine translation for complete sentences, the purpose for which it is optimally designed. The hypothesis that machine translations of clinical information are most understandable when used for simple, complete sentences concurs with the methodology used by these tools and requires further study.
GT has the potential to be very useful to clinicians, particularly for those instances when the communication required is both spontaneous and routine or noncritical. For example, in the inpatient setting, patients could communicate diet and other nonclinical requests, as well as ask or answer simple, short questions when the interpreter is not available. In such situations, the low cost and ease of using online translations and machine translation more generally may help to circumvent the tendency of clinicians to get by with inadequate language skills or to avoid communication altogether.25 If used wisely, GT and other online tools could supplement the use of standardized translations and professional interpreters in helping clinicians to overcome language barriers and linguistic inertia, though this will require further assessment.
Ours is a pilot study, and while it suggests a more promising way to use online translation tools, significant further evaluation is required regarding accuracy and applicability prior to widespread use of any machine translation tools for patient care. First, the document we utilized for evaluation was a professionally translated patient educational brochure provided to individuals starting a complex medication. As online translation tools would most likely not be used in this setting, but rather for spontaneous and less critical patient‐specific instructions, further testing of GT as applied to such scenarios should be considered. Second, we only evaluated GT for English translated into Spanish; its usefulness in other languages will need to be evaluated. It also remains to be seen how easily GT translations will be understood by patients, who may have variable medical understanding and educational attainment as compared to our evaluators. Finally, in this evaluation, we only assessed automated written translation, not automated spoken translation services such as those now available on cellular phones and other mobile devices.11 The latter are based upon translation software with an additional speech recognition interface. These applications may prove to be even more useful than online translation, but the speech recognition component will add an additional layer of potential error and these applications will need to be evaluated on their own merits.
The domains chosen for this study had only moderate interrater reliability as assessed by intraclass correlation and repeatability, with meaning and preference scoring particularly poorly. The latter domains in particular will require more thorough assessment before routine use in online translation assessment. The variability in all domains may have resulted partly from the choice of nonclinicians of different ancestral backgrounds as evaluators. However, this variability is likely more representative of the wide range of patient backgrounds. Because our evaluators were not professional translators, we asked a professional interpreter to grade all sentences to assess the quality of their evaluation. While the interpreter noted slightly fewer errors among the professionally translated sentences (13% vs 22%) and slightly more errors among the GT‐translated sentences (50% vs 39%), and preferred the professional translation slightly more (3.8 vs 3.2), his scores for all of the other measures were almost identical, increasing our confidence in our primary findings (Appendix A). Additionally, since statistical translation is conducted sentence by sentence, in our study evaluators only scored translations at the sentence level. The accuracy of GT for whole paragraphs or entire documents will need to be assessed separately. The correlation between METEOR and the manual evaluation scores was lower than in prior studies; while inexpensive to assess, METEOR will have to be recalibrated in optimal circumstances (with several reference translations available rather than just one) before it can be used to supplement the assessment of new languages, new materials, other translation technologies, and improvements in a given technology over time for patient educational material.
In summary, GT scored worse in grammar but similarly in content and sense to the professional translation, committing one critical error in translating a complex, fragmented sentence as nonsense. We believe that, with further study and judicious use, GT has the potential to substantially improve clinicians' communication with patients with limited English proficiency in the area of brief spontaneous patient‐specific information, supplementing well the role that professional spoken interpretation and standardized written translations already play.
- Language use and English‐speaking ability: 2000. In: Census 2000 Brief. Washington, DC: US Census Bureau; 2003. p. 2. http://www.census.gov/prod/2003pubs/c2kbr-29.pdf.
- The need for more research on language barriers in health care: a proposed research agenda. Milbank Q. 2006;84(1):111–133.
- Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care. 2007;19(2):60–67.
- The impact of medical interpreter services on the quality of health care: a systematic review. Med Care Res Rev. 2005;62(3):255–299.
- Errors in medical interpretation and their potential clinical consequences in pediatric encounters. Pediatrics. 2003;111(1):6–14.
- The effect of English language proficiency on length of stay and in‐hospital mortality. J Gen Intern Med. 2004;19(3):221–228.
- Influence of language barriers on outcomes of hospital care for general medicine inpatients. J Hosp Med. 2010;5(5):276–282.
- Hospitals, language, and culture: a snapshot of the nation. Los Angeles, CA: The California Endowment, the Joint Commission; 2007. p. 51–52. http://www.jointcommission.org/assets/1/6/hlc_paper.pdf.
- Do professional interpreters improve clinical care for patients with limited English proficiency? A systematic review of the literature. Health Serv Res. 2007;42(2):727–754.
- Google's Computing Power Refines Translation Tool. New York Times; March 9, 2010. Accessed March 24, 2010. http://www.nytimes.com/2010/03/09/technology/09translate.html?_r=1.
- I, Translator. New York Times; March 20, 2010. Accessed March 24, 2010. http://www.nytimes.com/2010/03/21/opinion/21bellos.html.
- Accuracy of computer‐generated, Spanish‐language medicine labels. Pediatrics. 2010;125(5):960–965.
- Nielsen NetRatings Search Engine Ratings. SearchEngineWatch; August 22, 2006. Accessed March 24, 2010. http://searchenginewatch.com/2156451.
- Google. Google Translate Help; 2010. Accessed March 24, 2010. http://translate.google.com/support/?hl=en.
- Chapter 4: Basic strategies. In: An Introduction to Machine Translation; 1992. Accessed April 22, 2010. http://www.hutchinsweb.me.uk/IntroMT-4.pdf.
- Your Guide to Coumadin®/Warfarin Therapy. Agency for Healthcare Research and Quality; August 21, 2008. Accessed October 19, 2009. http://www.ahrq.gov/consumer/btpills.htm.
- Patient reported receipt of medication instructions for warfarin is associated with reduced risk of serious bleeding events. J Gen Intern Med. 2008;23(10):1589–1594.
- The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proceedings of AMTA, 1994, Columbia, MD; October 1994.
- Overview of the IWSLT 2005 evaluation campaign. In: Proceedings of IWSLT 2005, Pittsburgh, PA; October 2005.
- BLEU: a method for automatic evaluation of machine translation. In: ACL‐2002: 40th Annual Meeting of the Association for Computational Linguistics. 2002:311–318.
- METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the Second Workshop on Statistical Machine Translation at ACL, Prague, Czech Republic; June 2007.
- The Structure of a Sentence. Ottawa: The Writing Centre, University of Ottawa; 2007.
- Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307–310.
- Measurement, reproducibility, and validity. In: Epidemiologic Methods 203. San Francisco: Department of Biostatistics and Epidemiology, University of California; 2009.
- Getting by: underuse of interpreters by resident physicians. J Gen Intern Med. 2009;24(2):256–262.
The population of patients in the US with limited English proficiency (LEP)those who speak English less than very well1is substantial and continues to grow.1, 2 Patients with LEP are at risk for lower quality health care overall than their English‐speaking counterparts.38 Professional in‐person interpreters greatly improve spoken communication and quality of care for these patients,4, 9 but their assistance is typically based on the clinical encounter. Particularly if interpreting by phone, interpreters are unlikely to be able to help with materials such as discharge instructions or information sheets meant for family members. Professional written translations of patient educational material help to bridge this gap, allowing clinicians to convey detailed written instructions to patients. However, professional translations must be prepared well in advance of any encounter and can only be used for easily anticipated problems.
The need to translate less common, patient‐specific instructions arises spontaneously in clinical practice, and formally prepared written translations are not useful in these situations. Online translation tools such as GoogleTranslate (available at
We conducted a pilot evaluation of an online translation tool as it relates to detailed, complex patient educational material. Our primary goal was to compare the accuracy of a Spanish translation generated by the online tool to that done by a professional agency. Our secondary goals were: 1) to assess whether sentence word length or complexity mediated the accuracy of GT; and 2) to lay the foundation for a more comprehensive study of the accuracy of online translation tools, with respect to patient educational material.
Methods
Translation Tool and Language Choice
We selected Google Translate (GT) since it is one of the more commonly used online translation tools and because Google is the most widely used search engine in the United States.13 GT uses statistical translation methodology to convert text, documents, and websites between languages; statistical translation involves the following three steps. First, the translation program recognizes a sentence to translate. Second, it compares the words and phrases within that sentence to the billions of words in its library (drawn from bilingual professionally translated documents, such as United Nations proceedings). Third, it uses this comparison to generate a translation combining the words and phrases deemed most equivalent between the source sentence and the target language. If there are multiple sentences, the program recognizes and translates each independently. As the body of bilingual work grows, the program learns and refines its rules automatically.14 In contrast, in rule‐based translation, a program would use manually prespecified rules regarding word choice and grammar to generate a translation.15 We assessed GT's accuracy translating from English to Spanish because Spanish is the predominant non‐English language spoken in the US.1
Document Selection and Preparation
We selected the instruction manual regarding warfarin use prepared by the Agency for Healthcare Research and Quality (AHRQ) for this accuracy evaluation. We selected this manual,16 written at a 6th grade reading level, because a professional Spanish translation was available (completed by ASET International Service, LLC, before and independently of this study), and because patient educational material regarding warfarin has been associated with fewer bleeding events.17 We downloaded the English document on October 19, 2009 and used the GT website to translate it en bloc. We then copied the resulting Spanish output into a text file. The English document and the professional Spanish translation (downloaded the same day) were both converted into text files in the same manner.
Grading Methodology
We scored the translation chosen using both manual and automated evaluation techniques. These techniques are widely used in the machine translation literature and are explained below.
Manual Evaluation: Evaluators, Domains, Scoring
We recruited three nonclinician, bilingual, nativeSpanish‐speaking research assistants as evaluators. The evaluators were all college educated with a Bachelor's degree or higher and were of Mexican, Nicaraguan, and Guatemalan ancestry. Each evaluator received a brief orientation regarding the project, as well as an explanation of the scores, and then proceeded to the blinded evaluation independently.
We asked evaluators to score sentences on Likert scales along five primary domains: fluency, adequacy, meaning, severity, and preference. Fluency and adequacy are well accepted components of machine translation evaluation,18 with fluency being an assessment of grammar and readability ranging from 5 (Perfect fluency; like reading a newspaper) to 1 (No fluency; no appreciable grammar, not understandable) and adequacy being an assessment of information preservation ranging from 5 (100% of information conveyed from the original) to 1 (0% of information conveyed from the original). Given that a sentence can be highly adequate but drastically change the connotation and intent of the sentence (eg, a sentence that contains 75% of the correct words but changes a sentence from take this medication twice a day to take this medication once every two days), we asked evaluators to assess meaning, a measure of connotation and intent maintenance, with scores ranging from 5 (Same meaning as original) to 1 (Totally different meaning from the original).19 Evaluators also assessed severity, a new measure of potential harm if a given sentence was assessed as having errors of any kind, ranging from 5 (Error, no effect on patient care) to 1 (Error, dangerous to patient) with an additional option of N/A (Sentence basically accurate). Finally, evaluators rated a blinded preference (also a new measure) for either of two translated sentences, ranging from Strongly prefer translation #1 to Strongly prefer translation #2. The order of the sentences was random (eg, sometimes the professional translation was first and sometimes the GT translation was). We subsequently converted this to preference for the professional translation, ranging from 5 (Strongly prefer the professional translation) to 1 (Strongly prefer the GT translation) in order to standardize the responses (Figures 1 and 2).
The overall flow of the study is given in Figure 3. Each evaluator initially scored 20 sentences translated by GT and 10 sentences translated professionally along the first four domains. All 30 of these sentences were randomly selected from the original, 263‐sentence pamphlet. For fluency, evaluators had access only to the translated sentence to be scored; for adequacy, meaning, and severity, they had access to both the translated sentence and the original English sentence. Ten of the 30 sentences were further selected randomly for scoring on the preference domain. For these 10 sentences, evaluators compared the GT and professional translations of the same sentence (with the original English sentence available as a reference) and indicated a preference, for any reason, for one translation or the other. Evaluators were blinded to the technique of translation (GT or professional) for all scored sentences and domains. We chose twice as many sentences from the GT preparations for the first four domains to maximize measurements for the translation technology we were evaluating, with the smaller number of professional translations serving as controls.
After scoring the first 30 sentences, evaluators met with one of the authors (R.R.K.) to discuss and consolidate their approach to scoring. They then scored an additional 10 GT‐translated sentences and 5 professionally translated sentences for the first four domains, and 9 of these 15 sentences for preference, to see if the meeting changed their scoring approach. These sentences were selected randomly from the original, 263‐sentence pamphlet, excluding the 30 evaluated in the previous step.
Automated Machine Translation Evaluation
Machine translation researchers have developed automated measures allowing the rapid and inexpensive scoring and rescoring of translations. These automated measures supplement more time‐ and resource‐intensive manual evaluations. The automated measures are based upon how well the translation compares to one or, ideally, multiple professionally prepared reference translations. They correlate well with human judgments on the domains above, especially when multiple reference translations are used (increasing the number of reference translations increases the variability allowed for words and phrases in the machine translation, improving the likelihood that differences in score are related to differences in quality rather than differences in translator preference).20 For this study, we used Metric for Evaluation of Translation with Explicit Ordering (METEOR), a machine translation evaluation system that allows additional flexibility for the machine translation in terms of grading individual sentences and being sensitive to synonyms, word stemming, and word order.21 We obtained a METEOR score for each of the GT‐translated sentences using the professional translation as our reference, and assessed correlation between this automated measure and the manual evaluations for the GT sentences, with the aim of assessing the feasibility of using METEOR in future work on patient educational material translation.
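As a rough illustration of how a METEOR‐style score is assembled from word matches and word order, the sketch below implements a simplified variant. It assumes exact‐word matching only (no stemming or synonym stages), a greedy left‐to‐right alignment, and the harmonic‐mean and fragmentation‐penalty constants of the originally published METEOR formulation. The function name `simple_meteor` is hypothetical, and the released METEOR tool may differ in its parameters and matching stages.

```python
def simple_meteor(hypothesis: str, reference: str) -> float:
    """Simplified METEOR-style score for one sentence against one reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()

    # Greedy alignment: each hypothesis word takes the first unused identical
    # reference word. (The real metric also tries stems and synonyms.)
    used = [False] * len(ref)
    alignment = []  # list of (hypothesis_index, reference_index) pairs
    for i, word in enumerate(hyp):
        for j, ref_word in enumerate(ref):
            if not used[j] and word == ref_word:
                used[j] = True
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Harmonic mean weighted heavily toward recall.
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A chunk is a run of matches that is contiguous in both sentences;
    # more chunks means the word order is more fragmented.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1

    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

In this sketch, a translation that contains all of the reference words in scrambled order scores lower than one that preserves the order, because the fragmentation penalty grows with the number of chunks.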
Outcomes and Statistical Analysis
For each of the five manually scored domains, we compared the scores assigned to GT‐translated sentences with those assigned to professionally translated sentences. We also assessed the impact of word count and sentence complexity on the scores achieved specifically by the GT‐translated sentences. Both analyses used clustered linear regression to account for the fact that each of the 45 sentences was scored by each of the three evaluators. Sentences were classified as simple if they contained one or fewer clauses and complex if they contained more than one clause.22 We also assessed interrater reliability for the manual scoring system using intraclass correlation coefficients and repeatability. Repeatability is an estimate of the maximum difference, with 95% confidence, between scores assigned to the same sentence on the same domain by two different evaluators;23 lower scores indicate greater agreement between evaluators. Since we did not have clinical data or a gold standard, we used repeatability to estimate the value above which a difference between two scores might be clinically significant and not simply due to interrater variability.24 Finally, we assessed the correlation of the manual scores with those calculated by the METEOR automated evaluation tool using Pearson correlation coefficients. All analyses were conducted using Stata 11 (College Station, TX).
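A minimal sketch of one common way to compute a repeatability coefficient follows. It assumes the Bland‐Altman‐style definition, 1.96 × √2 × s_w, where s_w is the within‐sentence standard deviation of the evaluators' scores. The function name and the list‐of‐lists data layout are hypothetical, and the Stata procedure used for the analysis is not reproduced here.

```python
import math
from statistics import mean


def repeatability(scores_by_sentence):
    """Repeatability coefficient for one domain.

    scores_by_sentence: one inner list per sentence, holding the score each
    evaluator gave that sentence (same number of evaluators per sentence).
    """
    raters = len(scores_by_sentence[0])

    # Pool the squared deviations of each score from its own sentence mean.
    ss_within = 0.0
    for scores in scores_by_sentence:
        sentence_mean = mean(scores)
        ss_within += sum((s - sentence_mean) ** 2 for s in scores)

    # Within-sentence standard deviation (one-way ANOVA residual).
    degrees_of_freedom = len(scores_by_sentence) * (raters - 1)
    s_within = math.sqrt(ss_within / degrees_of_freedom)

    # Difference between two evaluators expected to be exceeded only 5% of the time.
    return 1.96 * math.sqrt(2) * s_within
```

When all evaluators give identical scores to every sentence, the coefficient is 0; larger values mean that two evaluators may plausibly differ by that many scale points on the same sentence.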
Results
Sentence Description
A total of 45 sentences were evaluated by the bilingual research assistants. The initial 30 sentences and the 15 sentences scored after the consolidation meeting received similar scores on all outcomes, after adjustment for word length and complexity, so we pooled all 45 sentences (as well as the 19 total sentence pairs scored for preference) for the final analysis. Average sentence lengths were 14.2 words, 15.5 words, and 16.6 words for the English source text, professionally translated sentences, and GT‐translated sentences, respectively. Thirty‐three percent of the English source sentences were simple and 67% were complex.
Manual Evaluation Scores
Sentences translated by GT received worse scores on fluency as compared to the professional translations (3.4 vs 4.7, P < 0.0001). Scores for adequacy and meaning did not differ significantly between the two. GT‐translated sentences contained more errors of any severity as compared to the professional translations (39% vs 22%, P = 0.05), but a similar number of serious, clinically impactful errors (severity scores of 3, 2, or 1; 4% vs 2%, P = 0.61). However, one GT‐translated sentence was considered erroneous with a severity level of 1 ("Error, dangerous to patient"). This particular sentence was 25 words long and complex in structure in the original English document; all three evaluators considered the GT translation nonsensical ("La hemorragia mayor, llame a su médico, o ir a la emergencia de un hospital habitación si usted tiene cualquiera de los siguientes: Red N, oscuro, café o cola de orina de color."). Evaluators had no overall preference for the professional translation (3.2, 95% confidence interval = 2.7 to 3.7, with 3 indicating no preference; P = 0.36) (Table 1).
| | GoogleTranslate Translation | Professional Translation | P Value |
|---|---|---|---|
| Fluency* | 3.4 | 4.7 | <0.0001 |
| Adequacy* | 4.5 | 4.8 | 0.19 |
| Meaning* | 4.2 | 4.5 | 0.29 |
| Severity | | | |
| Any error | 39% | 22% | 0.05 |
| Serious error | 4% | 2% | 0.61 |
| Preference* (for the professional translation) | 3.2 | | 0.36 |
Mediation of Scores by Sentence Length or Complexity
We found that sentence length was not associated with scores for fluency, adequacy, meaning, severity, or preference (P > 0.30 in each case). Complexity, however, was significantly associated with preference: evaluators preferred the professional translation for complex English sentences while being more ambivalent about simple English sentences (3.6 vs 2.6, P = 0.03).
Interrater Reliability and Repeatability
We assessed the interrater reliability for each domain using intraclass correlation coefficients and repeatability. For fluency, the intraclass correlation was best at 0.70; for adequacy, it was 0.58; for meaning, 0.42; for severity, 0.48; and for preference, 0.37. The repeatability scores were 1.4 for fluency, 0.6 for adequacy, 2.2 for meaning, 1.2 for severity, and 3.8 for preference, indicating that two evaluators might give a sentence almost the same score (at most, 1 point apart from one another) for adequacy, but might have opposite preferences regarding which translation of a sentence was superior.
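For readers who wish to compute a comparable agreement summary, the sketch below shows a one‐way random‐effects intraclass correlation for a single domain. The specific ICC form used in the analysis is not stated in this text, so the choice of the one‐way model is an assumption, and the function name and data layout are hypothetical.

```python
from statistics import mean


def icc_oneway(scores_by_sentence):
    """One-way random-effects ICC for single ratings.

    scores_by_sentence: one inner list per sentence, holding the score each
    evaluator gave that sentence (same number of evaluators per sentence).
    Undefined (division by zero) if every score in the data set is identical.
    """
    n = len(scores_by_sentence)
    k = len(scores_by_sentence[0])

    grand_mean = mean(s for row in scores_by_sentence for s in row)
    row_means = [mean(row) for row in scores_by_sentence]

    # Between-sentence and within-sentence mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (s - m) ** 2 for row, m in zip(scores_by_sentence, row_means) for s in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

Values near 1 indicate that most of the variation in scores comes from real differences between sentences, while values near 0 or below indicate that disagreement between evaluators dominates.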
Correlation with METEOR
Correlations between the first four domains and the METEOR scores were lower than in prior studies.21 Fluency correlated best with METEOR at 0.53; adequacy correlated least with METEOR at 0.29. The remaining correlations fell in between. All correlations were statistically significant at P < 0.01 (Table 2).
| | Correlation with METEOR | P value |
|---|---|---|
| Fluency | 0.53 | <0.0001 |
| Adequacy | 0.29 | 0.006 |
| Meaning | 0.33 | 0.002 |
| Severity | 0.39 | 0.002 |
Discussion
In this preliminary study comparing the accuracy of GT to professional translation for patient educational material, we found that GT was inferior to the professional translation in grammatical fluency but generally preserved the content and sense of the original text. Out of 30 GT sentences assessed, there was one substantially erroneous translation that was considered potentially dangerous. Evaluators preferred the professionally translated sentences for complex sentences, but when the English source sentence was simple (containing a single clause), this preference disappeared.
Like Sharif and Tse,12 we found that for information not arranged in sentences, automated translation sometimes produced nonsensical sentences. In our study, these resulted from an English sentence fragment followed by a bulleted list; in their study, the nonsensical translations resulted from pharmacy labels. The difference in frequency of these errors between our studies may have resulted partly from the translation tool evaluated (GT vs programs used by pharmacies in the Bronx), but may have also been due to our use of machine translation for complete sentences, the purpose for which it is optimally designed. The hypothesis that machine translations of clinical information are most understandable when used for simple, complete sentences concurs with the methodology used by these tools and requires further study.
GT has the potential to be very useful to clinicians, particularly for those instances when the communication required is both spontaneous and routine or noncritical. For example, in the inpatient setting, patients could communicate diet and other nonclinical requests, as well as ask or answer simple, short questions when the interpreter is not available. In such situations, the low cost and ease of using online translations and machine translation more generally may help to circumvent the tendency of clinicians to get by with inadequate language skills or to avoid communication altogether.25 If used wisely, GT and other online tools could supplement the use of standardized translations and professional interpreters in helping clinicians to overcome language barriers and linguistic inertia, though this will require further assessment.
Ours is a pilot study, and while it suggests a more promising way to use online translation tools, significant further evaluation is required regarding accuracy and applicability prior to widespread use of any machine translation tools for patient care. First, the document we utilized for evaluation was a professionally translated patient educational brochure provided to individuals starting a complex medication. As online translation tools would most likely not be used in this setting, but rather for spontaneous and less critical patient‐specific instructions, further testing of GT as applied to such scenarios should be considered. Second, we only evaluated GT for English translated into Spanish; its usefulness in other languages will need to be evaluated. It also remains to be seen how easily GT translations will be understood by patients, who may have variable medical understanding and educational attainment as compared to our evaluators. Finally, in this evaluation, we only assessed automated written translation, not automated spoken translation services such as those now available on cellular phones and other mobile devices.11 The latter are based upon translation software with an additional speech recognition interface. These applications may prove to be even more useful than online translation, but the speech recognition component will add a further layer of potential error, and these applications will need to be evaluated on their own merits.
The domains chosen for this study had only moderate interrater reliability as assessed by intraclass correlation and repeatability, with meaning and preference scoring particularly poorly. The latter domains in particular will require more thorough assessment before routine use in online translation assessment. The variability in all domains may have resulted partly from the choice of nonclinicians of different ancestral backgrounds as evaluators. However, this variability is likely more representative of the wide range of patient backgrounds. Because our evaluators were not professional translators, we asked a professional interpreter to grade all sentences to assess the quality of their evaluation. While the interpreter noted slightly fewer errors among the professionally translated sentences (13% vs 22%) and slightly more errors among the GT‐translated sentences (50% vs 39%), and preferred the professional translation slightly more (3.8 vs 3.2), his scores for all of the other measures were almost identical, increasing our confidence in our primary findings (Appendix A). Additionally, since statistical translation is conducted sentence by sentence, in our study evaluators only scored translations at the sentence level. The accuracy of GT for whole paragraphs or entire documents will need to be assessed separately. The correlation between METEOR and the manual evaluation scores was lower than in prior studies; while inexpensive to assess, METEOR will have to be recalibrated in optimal circumstances (with several reference translations available rather than just one) before it can be used to supplement the assessment of new languages, new materials, other translation technologies, and improvements in a given technology over time for patient educational material.
In summary, GT scored worse in grammar but similarly in content and sense to the professional translation, committing one critical error in translating a complex, fragmented sentence as nonsense. We believe that, with further study and judicious use, GT has the potential to substantially improve clinicians' communication with patients with limited English proficiency in the area of brief spontaneous patient‐specific information, supplementing well the role that professional spoken interpretation and standardized written translations already play.
- Language use and English‐speaking ability: 2000. In: Census 2000 Brief. Washington, DC: US Census Bureau; 2003. p. 2. http://www.census.gov/prod/2003pubs/c2kbr-29.pdf.
- The need for more research on language barriers in health care: a proposed research agenda. Milbank Q. 2006;84(1):111–133.
- Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care. 2007;19(2):60–67.
- The impact of medical interpreter services on the quality of health care: a systematic review. Med Care Res Rev. 2005;62(3):255–299.
- Errors in medical interpretation and their potential clinical consequences in pediatric encounters. Pediatrics. 2003;111(1):6–14.
- The effect of English language proficiency on length of stay and in‐hospital mortality. J Gen Intern Med. 2004;19(3):221–228.
- Influence of language barriers on outcomes of hospital care for general medicine inpatients. J Hosp Med. 2010;5(5):276–282.
- Hospitals, language, and culture: a snapshot of the nation. Los Angeles, CA: The California Endowment, the Joint Commission; 2007. p. 51–52. http://www.jointcommission.org/assets/1/6/hlc_paper.pdf.
- Do professional interpreters improve clinical care for patients with limited English proficiency? A systematic review of the literature. Health Serv Res. 2007;42(2):727–754.
- Google's Computing Power Refines Translation Tool. New York Times; March 9, 2010. Accessed March 24, 2010. http://www.nytimes.com/2010/03/09/technology/09translate.html?_r=1.
- I, Translator. New York Times; March 20, 2010. Accessed March 24, 2010. http://www.nytimes.com/2010/03/21/opinion/21bellos.html.
- Accuracy of computer‐generated, Spanish‐language medicine labels. Pediatrics. 2010;125(5):960–965.
- Nielsen NetRatings Search Engine Ratings. SearchEngineWatch; August 22, 2006. Accessed March 24, 2010. http://searchenginewatch.com/2156451.
- Google. Google Translate Help; 2010. Accessed March 24, 2010. http://translate.google.com/support/?hl=en.
- Chapter 4: Basic strategies. In: An Introduction to Machine Translation; 1992. Accessed April 22, 2010. http://www.hutchinsweb.me.uk/IntroMT-4.pdf.
- Your Guide to Coumadin®/Warfarin Therapy. Agency for Healthcare Research and Quality; August 21, 2008. Accessed October 19, 2009. http://www.ahrq.gov/consumer/btpills.htm.
- Patient reported receipt of medication instructions for warfarin is associated with reduced risk of serious bleeding events. J Gen Intern Med. 2008;23(10):1589–1594.
- The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proceedings of AMTA, 1994, Columbia, MD; October 1994.
- Overview of the IWSLT 2005 evaluation campaign. In: Proceedings of IWSLT 2005, Pittsburgh, PA; October 2005.
- BLEU: a method for automatic evaluation of machine translation. In: ACL‐2002: 40th Annual Meeting of the Association for Computational Linguistics. 2002:311–318.
- METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the Second Workshop on Statistical Machine Translation at ACL, Prague, Czech Republic; June 2007.
- The Structure of a Sentence. Ottawa: The Writing Centre, University of Ottawa; 2007.
- Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307–310.
- Measurement, reproducibility, and validity. In: Epidemiologic Methods 203. San Francisco: Department of Biostatistics and Epidemiology, University of California; 2009.
- Getting by: underuse of interpreters by resident physicians. J Gen Intern Med. 2009;24(2):256–262.
Copyright © 2011 Society of Hospital Medicine
Language Barriers and Hospital Care
Forty‐five million Americans speak a language other than English, and more than 19 million of these speak English less than very well, or are limited English proficient (LEP).1 The number of non‐English‐speaking and LEP people in the US has risen in recent decades, presenting a challenge to healthcare systems to provide high‐quality, patient‐centered care for these patients.2
For outpatients, language barriers are a fundamental contributor to gaps in health care. In the clinic setting, patients who do not speak English well have less access to a usual source of care and lower rates of physician visits and preventive services.3–6 Even when patients with language barriers do have access to care, they have poorer adherence, decreased comprehension of their diagnoses, decreased satisfaction with care, and increased medication complications.7–10
Few studies, however, have examined how language influences outcomes of hospital care. Compared to English‐speakers, patients who do not speak English well may experience longer lengths of stay,11 and have more adverse events while in the hospital.12 However, these previous studies have not investigated outcomes immediately post‐hospitalization, such as readmission rates and mortality, nor have they directly addressed the interaction between ethnicity and language.
To understand these questions, we analyzed data collected from a university‐based teaching hospital which cares for patients of diverse cultural and language backgrounds. Using these data, we examined how patients' primary language influenced hospital costs, length of stay (LOS), 30‐day readmission, and 30‐day mortality risk.
Patients and Methods
Patient Population and Setting
Our study examined patients admitted to the General Medicine Service at the University of California, San Francisco Medical Center (UCSF) between July 1, 2001 and June 30, 2003, the time period during which UCSF participated in the Multicenter Hospitalist Trial (MHT), a prospective quasi‐randomized trial of hospitalist care for general medicine patients.13, 14
UCSF Moffitt‐Long Hospital is a 400‐bed urban academic medical center which provides services to the City and County of San Francisco, an ethnically and linguistically diverse area. UCSF employs staff language interpreters in Spanish, Chinese and Russian who travel to its many outpatient clinics, Comprehensive Cancer Center, Children's Hospital, as well as to Moffitt‐Long Hospital upon request; phone interpretation is also available when in‐person interpreters are not available, for off hours needs and for less common languages. During the period of this study there were no specific inpatient guidelines in place for use of interpretation services at UCSF, nor were there any specific interventions targeting LEP or non‐English speaking inpatients.
Patients were eligible for the MHT if they were 18 years of age or older and admitted at random to a hospitalist or non‐hospitalist physician (eg, outpatient general internist attending on average 1‐month/year); a minority of patients were cared for directly by their primary care physician while in the hospital, and were excluded. For purposes of our study, which merged MHT data with hospital administrative data on primary language, we further excluded all admissions for patients for whom primary language was missing (n = 5), whose language was listed as unknown or other (n = 78) or as sign language (n = 3), or whose language was listed but was not one of the included languages (n = 258). Included languages were English, Chinese, Russian and Spanish. Because LOS and cost data were skewed, we excluded those admissions with the top 1% longest stays and the top 1% highest cost (n = 176); these exclusions did not alter the proportion of admissions across language and ethnicity. In addition, we excluded 102 admissions that were missing data on cost and 11 with costs <$500, which were likely to be erroneous. Our research was approved by the UCSF Institutional Review Board.
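The trimming of extreme admissions described above could be sketched as follows. This is a minimal Python illustration and not the authors' procedure (the analysis was run in STATA): the field names `los_days` and `cost`, the helpers `p99` and `trim_top_one_percent`, and the use of the 99th‐percentile cut point from `statistics.quantiles` are all assumptions, and percentile definitions differ between statistical packages.

```python
from statistics import quantiles


def p99(values):
    """Return the 99th-percentile cut point (default method of statistics.quantiles)."""
    return quantiles(values, n=100)[98]


def trim_top_one_percent(admissions):
    """Drop admissions whose LOS or cost exceeds its 99th-percentile cut point.

    `admissions` is a list of dicts with the hypothetical keys
    'los_days' and 'cost'.
    """
    los_cut = p99([a["los_days"] for a in admissions])
    cost_cut = p99([a["cost"] for a in admissions])
    return [
        a for a in admissions
        if a["los_days"] <= los_cut and a["cost"] <= cost_cut
    ]


# Synthetic example: 200 admissions with steadily increasing LOS and cost.
sample = [{"los_days": d, "cost": 1000 * d} for d in range(1, 201)]
kept = trim_top_one_percent(sample)
```

In this synthetic data the two largest admissions are extreme on both measures, so only they are dropped; in real data the LOS and cost outliers need not coincide.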
Data Sources
We collected administrative data from Transition Systems Inc (TSI, Boston, MA) billing databases at UCSF as part of the MHT. These data include patient demographics, insurance, costs, ICD‐9CM diagnostic codes, admission and discharge dates in Uniform Bill 92 format. Patient mortality information was collected as part of the MHT using the National Death Index.14
Language data were collected from a separate patient‐registration database (STOR) at UCSF. Information on a patient's primary language is entered at the time each patient first registers at UCSF, whether for the index hospitalization or for prior clinic visits, and is based generally on patient self report. As part of our validation step, we cross‐checked 829 STOR language entries against patient reports and found 91% agreement with the majority of the errors classifying non‐English speakers as English‐speakers.
Measures
Predictor
Our primary language variable was derived using language designations collected from patient registration databases described above. Using these data we specified our key language groups as English, Chinese (Cantonese or Mandarin), Russian, or Spanish.
Outcomes
LOS and total cost of hospital stay for each hospitalization were derived from administrative data sources. Readmissions were identified at the time patients were readmitted to UCSF (eg, flagged in administrative data). Mortality was determined by whether an individual patient with an admission in the database was recorded in the National Death Index as dead within 30 days of admission.
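The 30‐day mortality definition can be illustrated with a short sketch. The function name `died_within_30_days` is hypothetical, and treating day 30 itself as inside the window is an assumption; the text does not specify how the boundary was handled.

```python
from datetime import date


def died_within_30_days(admit_date, death_date):
    """Return True if a recorded death falls 0 to 30 days after admission.

    `death_date` is None when no death record was found.
    The inclusive upper bound (day 30 counts) is an assumption.
    """
    if death_date is None:
        return False
    days_after_admission = (death_date - admit_date).days
    return 0 <= days_after_admission <= 30


# Example: admitted January 1, 2002; death recorded January 31, 2002 (day 30).
flag = died_within_30_days(date(2002, 1, 1), date(2002, 1, 31))
```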
Covariates
Additional covariates included age at admission, gender, ethnicity as recorded in registration databases (White, African American, Asian, Latino, Other), insurance, principal billing diagnosis, whether or not a patient received intensive care unit (ICU) care, type of admitting attending physician (Hospitalist/non‐Hospitalist), and an administrative Charlson comorbidity score.15 To collapse the principal diagnoses into categories, we used the Healthcare Cost and Utilization Project (HCUP)'s Clinical Classification System, which allowed us to classify each diagnosis in 1 of 14 generally accepted categories.16
Analysis
Statistical analyses were performed using STATA statistical software (STATACorp, Version 9, College Station, TX). We examined descriptive means and proportions for all variables, including sociodemographic, hospitalization, comorbidity and outcome variables. We compared English and non‐English speakers on all covariate and outcome variables using t‐tests for comparison of means and chi‐square for comparison of categorical variables.
It was not possible to fully test the language‐by‐ethnicity interaction (whether or not the impact of language varied by ethnic group) because many cells of the joint distribution were very sparse (eg, the sample contained very few non‐English‐speaking African Americans). Therefore, to better understand the influence of English vs. non‐English language usage across different ethnic groups, we created a combined language‐ethnicity predictor variable which categorized each subject first by language and then for the English‐speakers by ethnicity. For example, a Chinese, Spanish or Russian speaker would be categorized as such, and an English‐speaker could fall into the English‐White, English‐African American, English‐Asian or English‐Latino group. This allowed us to test whether there were any differences in language effects across the White, Asian, and Latino ethnicities, and any difference in ethnicity effects among English‐speakers.
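The combined predictor can be sketched as a simple two‐step categorization. The function name `language_ethnicity` and the exact label strings are illustrative assumptions; only the grouping logic (language first, then ethnicity for English speakers) comes from the description above.

```python
NON_ENGLISH_LANGUAGES = {"Chinese", "Spanish", "Russian"}
ETHNICITIES = {"White", "African American", "Asian", "Latino", "Other"}


def language_ethnicity(language, ethnicity):
    """Categorize a subject by language, then by ethnicity if English-speaking."""
    if language in NON_ENGLISH_LANGUAGES:
        # Non-English speakers are grouped by language alone.
        return language
    if language == "English" and ethnicity in ETHNICITIES:
        # English speakers are split by recorded ethnicity.
        return f"English-{ethnicity}"
    raise ValueError(f"unsupported combination: {language!r}, {ethnicity!r}")


groups = [
    language_ethnicity("Chinese", "Asian"),
    language_ethnicity("English", "Asian"),
    language_ethnicity("English", "African American"),
]
```

Collapsing the two variables this way sidesteps the sparse cells of a full interaction while still allowing contrasts such as Chinese speakers versus English‐speaking Asians.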
Because cost and LOS were skewed, we used negative binomial models for LOS and log transformed costs. We performed a sensitivity analysis testing whether our results were robust to the exclusion of the admissions with the top 1% LOS and top 1% cost. We used logistic regression for the 30‐day readmission and mortality outcomes.
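Coefficients from models fitted on a log scale, such as negative binomial models with a log link or linear models of log‐transformed cost, are commonly reported as percent differences using (exp(β) − 1) × 100. The sketch below assumes that convention; the text does not state how the percent differences in Tables 3 and 4 were derived, and the function names and the example coefficient are hypothetical.

```python
import math


def percent_difference(beta):
    """Convert a log-scale coefficient into a percent difference."""
    return (math.exp(beta) - 1.0) * 100.0


def percent_difference_ci(beta, standard_error, z=1.96):
    """Approximate 95% CI: transform the endpoints of beta +/- z * SE."""
    lower = percent_difference(beta - z * standard_error)
    upper = percent_difference(beta + z * standard_error)
    return lower, upper


# Illustrative coefficient only: a ratio of 0.928 corresponds to a 7.2% reduction.
example = percent_difference(math.log(0.928))
```

Because the interval is symmetric on the log scale, the transformed interval is slightly asymmetric around the point estimate on the percent scale.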
Our primary predictor was the language‐ethnicity variable described above. To determine the independent association between this predictor and our key outcomes, we then built models which included additional potential confounders selected either for face validity or because of observed confounding with other covariates. Our inclusion of potential confounders was limited by the variables available in the administrative database; thus, we were not able to pursue detailed analyses of communication and literacy factors and their interaction with our predictor or their independent impact on outcomes. Models also included a linear spline with a single knot at age 65 years as a further adjustment for age in Medicare recipients.17–19 For the 30‐day readmission outcome model, we excluded those admissions for which the patient either died in the hospital or was discharged to hospice care. Within each model we tested the impact of a language barrier using custom contrasts. This allowed us to examine the language‐ethnicity effect aggregating all non‐English speakers compared to all English‐speakers, comparing each non‐English speaking group to all English‐speakers, comparing Chinese speakers to English‐speaking Asians and Spanish speakers to English‐speaking‐Latinos, as well as to test whether the effect of English language is the same across ethnicities.
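A linear spline with a single knot at age 65 can be represented by two regression terms: age itself and the number of years beyond the knot. The sketch below shows one common parameterization; the function names and the example slopes are assumptions for illustration, not values from the study.

```python
def age_spline_terms(age, knot=65.0):
    """Return (age, years above the knot); the second term is 0 below the knot."""
    return age, max(0.0, age - knot)


def age_contribution(age, slope_below, slope_change, knot=65.0):
    """Age effect on the linear predictor.

    The slope is `slope_below` up to the knot and
    `slope_below + slope_change` above it.
    """
    base, above = age_spline_terms(age, knot)
    return slope_below * base + slope_change * above


# Hypothetical slopes, for illustration only.
effect_at_72 = age_contribution(72, slope_below=0.01, slope_change=0.02)
```

This lets the fitted age effect bend at 65 while remaining continuous, which is one way to adjust for age differently among Medicare‐eligible patients.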
Results
Admission Characteristics of the Sample
A total of 7023 patients were admitted to the General Medicine service, 5877 (84%) of whom were English‐speakers and 1146 (16%) non‐English‐speakers (Table 1). Overall, half of the admitted patients were women (50%), and the vast majority were insured (93%). The most common principal diagnoses were respiratory and gastrointestinal disorders. Only a small number of non‐English speakers, 164 (14%), were recorded in the UCSF Interpreter Services database as having had any interaction with a professional staff interpreter during their hospitalization.
Characteristic | English (n = 5877), n (%) | Non‐English (n = 1146), n (%) |
---|---|---|
Socio‐economic variables | ||
Language‐ethnicity | ||
English | ||
White | 3066 (52.2) | |
African American | 1351 (23.0) | |
Asian | 544 (9.3) | |
Latino | 298 (5.1) | |
Other | 618 (10.5) | |
Chinese speakers | | 584 (51.0) |
Spanish speakers | | 272 (25.3) |
Russian speakers | | 290 (23.7) |
Age, mean (SD) (range 18–105) | 58.8 (20.3) | 72.3 (15.5) |
Gender | ||
Male | 2967 (50.5) | 514 (44.8) |
Female | 2910 (49.5) | 632 (55.2) |
Insurance | ||
Medicare | 2878 (49.0) | 800 (69.8) |
Medicaid | 1201 (20.4) | 193 (16.8) |
Commercial | 1358 (23.1) | 106 (9.3) |
Charity/other | 440 (7.5) | 47 (4.1) |
Hospitalization variables | ||
Admitted to ICU | ||
Yes | 721 (12.3) | 149 (13.0) |
Attending physician | ||
Hospitalist | 3950 (67.2) | 781 (68.2) |
Comorbidity variables | ||
Principal Diagnosis | ||
Respiratory disorder | 1061 (18.1) | 225 (19.6) |
Gastrointestinal disorder | 963 (16.4) | 205 (17.9) |
Circulatory disorder | 613 (10.4) | 140 (12.2) |
Endocrine/metabolism | 671 (11.4) | 80 (7.0) |
Injury/poisoning | 475 (8.1) | 64 (5.6) |
Malignancy | 395 (6.7) | 107 (9.3) |
Renal/urinary disorder | 383 (6.5) | 108 (9.4) |
Skin disorder | 278 (4.7) | 28 (2.9) |
Infection/fatigue NOS | 206 (3.5) | 45 (3.4) |
Blood disorder (non‐malignant) | 189 (3.2) | 38 (3.3) |
Musculoskeletal/connective tissue disorder | 164 (2.8) | 33 (2.9) |
Mental disorder/substance abuse | 171 (2.9) | 7 (0.6) |
Nervous system/brain infection | 137 (2.3) | 26 (2.3) |
Unclassified | 171 (2.9) | 40 (3.5) |
Charlson Index score, mean ± SD | 0.97 ± 1.33 | 1.10 ± 1.42 |
Among English speakers, Whites and African Americans were the most common ethnicities; however, more than 500 admissions were categorized as Asian ethnicity, and more than 600 as patients of other ethnicity. Close to 300 admissions were for Latinos. Among non‐English speakers, Chinese speakers had the largest number of admissions (n = 584), while Spanish and Russian speakers had similar numbers (n = 272 and 290 respectively).
Non‐English speakers were older, more likely to be female, more likely to be insured by Medicare, and more likely to have a higher comorbidity index score. While comorbidity scores were similar among non‐English speakers (Chinese 1.13 ± 1.50; Russian 1.09 ± 1.37; Spanish 1.06 ± 1.30), they differed considerably among English speakers (White 0.94 ± 1.29; African American 1.05 ± 1.40; Asian 1.04 ± 1.45; Latino 0.89 ± 1.23; Other 0.91 ± 1.29).
Hospital Outcome by Language‐Ethnicity Group (Table 2)
When aggregated together, non‐English speakers were somewhat more likely to be dead at 30‐days and have lower cost admissions; however, they did not differ from English speakers on LOS or readmission rates. While differences among disaggregated language‐ethnicity groups were not all statistically significant, English‐speaking Whites had the longest LOS (mean = 4.9 days) and highest costs (mean = $10,530). English‐speaking African Americans, Chinese and Spanish speakers had the highest 30‐day readmission rates; whereas, English‐speaking Latinos and Russian speakers had markedly lower 30‐day readmission rates (2.5% and 6.4%, respectively). Chinese speakers had the highest 30‐day mortality, followed by English speaking Whites and Asians.
Language‐Ethnicity Group | LOS,* Mean Days (SD) | Cost, Mean $ (SD) | 30‐Day Readmission, n (%) | 30‐Day Mortality, n (%) |
---|---|---|---|---|
English speakers (all) | 4.7 (4.5) | 10,035 (15,041) | 648 (11.9) | 613 (10.4) |
White | 4.9 (5.1) | 10,530 (15,894) | 322 (11.4) | 377 (12.3) |
African American | 4.5 (4.8) | 9107 (13,314) | 227 (17.5) | 91 (6.7) |
Asian | 4.3 (4.5) | 9933 (15,607) | 43 (8.8) | 67 (12.3) |
Latino | 4.6 (4.8) | 9823 (14,113) | 7 (2.5) | 18 (6.0) |
Other | 4.5 (4.8) | 9662 (14,016) | 49 (8.5) | 60 (9.7) |
Non‐English speakers (all) | 4.5 (4.5) | 9515 (13,213) | 117 (11.0) | 147 (12.8) |
Chinese speakers | 4.5 (4.6) | 9505 (12,841) | 69 (12.8) | 85 (14.6) |
Spanish speakers | 4.5 (4.5) | 9115 (13,846) | 31 (12.0) | 28 (10.3) |
Russian speakers | 4.7 (4.2) | 9846 (13,360) | 17 (6.4) | 34 (11.7) |
We further investigated differences among English speakers to better understand the very high rate of readmission for African Americans and the very low rate for English‐speaking Latinos. African Americans were on average younger than other English speakers (55 ± 19 years vs. 60 ± 21 years; P < 0.001), but they had higher comorbidity scores than other English speakers (1.05 ± 1.40 vs. 0.94 ± 1.31; P = 0.008), and were more likely to be admitted for non‐malignant blood disorders (eg, sickle cell disease), endocrine disorders (eg, diabetes mellitus), and circulatory disorders (eg, stroke). English‐speaking Latinos were also younger than other English speakers (53 ± 21 years vs. 59 ± 20 years; P < 0.001), but in contrast they trended toward lower comorbidity scores (0.87 ± 1.23 vs. 0.97 ± 1.33; P = 0.2), and were more likely to be admitted for gastrointestinal and musculoskeletal disorders, and less likely to be admitted for malignancy and endocrine disorders.
Multivariate Analyses: Association of Aggregated and Disaggregated Language‐Ethnicity Groups With Hospital Outcomes (Table 3)
In multivariate models examining aggregated language‐ethnicity groups, non‐English speakers had a trend toward higher odds of readmission at 30 days post‐discharge than the English‐speaking group (odds ratio [OR], 1.3; 95% confidence interval [CI], 1.0‐1.7). There were no significant differences for LOS, cost, or 30‐day mortality. Compared to English speakers, Chinese and Spanish speakers had 70% and 50% higher adjusted odds of readmission at 30 days post‐discharge, respectively, while Russian speakers' odds of readmission were not increased. Additionally, Chinese speakers had 7% shorter LOS than English‐speakers. There were no significant differences among any of the language‐ethnicity groups for 30‐day mortality. The increased odds of readmission for Chinese and Spanish speakers compared to English speakers were robust to reinclusion of the admissions with the top 1% LOS and top 1% cost.
Language Categorization | LOS, % Difference (95% CI) | Total Cost, % Difference (95% CI) | 30‐Day Readmission,* OR (95% CI) | Mortality, OR (95% CI) |
---|---|---|---|---|
All English speakers | Reference | Reference | Reference | Reference |
Non‐English speakers | −3.1 (−8.7 to 3.1) | −2.5 (−8.3 to 2.1) | 1.3 (1.0 to 1.7) | 0.9 (0.7 to 1.2) |
All English speakers | Reference | Reference | Reference | Reference |
Chinese speakers | −7.2 (−13.9 to 0) | −5.3 (−12.2 to 2.1) | 1.7 (1.2 to 2.3) | 1.0 (0.8 to 1.4) |
Spanish speakers | −3.0 (−12.6 to 7.6) | −3.0 (−12.7 to 7.7) | 1.5 (1.0 to 2.3) | 0.9 (0.6 to 1.5) |
Russian speakers | 1.5 (−8.3 to 12.2) | 0.9 (−8.9 to 11.8) | 0.8 (0.5 to 1.4) | 0.8 (0.5 to 1.2) |
Multivariate Analyses: Association of Language for Asians and Latinos, and of Ethnicity for English speakers, With Hospital Outcomes (Table 4)
Both Chinese and Spanish speakers had significantly higher odds of 30‐day readmission than their English‐speaking Asian and Latino counterparts. There were no significant differences in LOS, cost, or 30‐day mortality in this within‐ethnicity analysis. Among English speakers, admissions for patients with Asian ethnicity were 15% shorter and resulted in 9% lower costs than for Whites. While LOS and cost were similar for English‐speaking Latino and White admissions, English‐speaking Latinos had markedly lower odds of 30‐day readmission than their White counterparts. African Americans, by contrast, had 6% shorter LOS, 40% higher odds of readmission, and 30% lower odds of mortality at 30 days than English‐speaking Whites.
Language‐Ethnicity Comparisons | LOS, % Difference (95% CI) | Total Cost, % Difference (95% CI) | 30‐Day Readmission,* OR (95% CI) | Mortality, OR (95% CI) |
---|---|---|---|---|
English speaking Asians | Reference | Reference | Reference | Reference |
Chinese speakers | 2.2 (−7.4 to 12.7) | 0.3 (−9.2 to 10.7) | 1.5 (1.0 to 2.3) | 0.8 (0.6 to 1.2) |
English speaking Latinos | Reference | Reference | Reference | Reference |
Spanish speakers | −4.5 (−16.8 to 9.5) | −1.2 (−14.0 to 13.5) | 5.7 (2.4 to 13.2) | 1.2 (0.6 to 2.4) |
English‐White | Reference | Reference | Reference | Reference |
English‐African American | −6.2 (−11.3 to −0.9) | −4.4 (−9.6 to 1.1) | 1.4 (1.1 to 1.7) | 0.6 (0.5 to 0.8) |
English‐Asian | −14.6 (−20.9 to −7.9) | −8.6 (−15.4 to −1.4) | 0.8 (0.5 to 1.0) | 1.0 (0.7 to 1.4) |
English‐Latino | −4.5 (−13.5 to 5.4) | −5.0 (−14.0 to 5.0) | 0.2 (0.1 to 0.4) | 0.6 (0.4 to 1.0) |
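As a rough cross‐check on the within‐ethnicity readmission comparison, unadjusted odds ratios can be computed directly from the crude 30‐day readmission percentages reported in Table 2. The sketch below is illustrative only: the function names are hypothetical, the inputs are the rounded published percentages, and no covariate adjustment is applied, so the results are not expected to match the adjusted estimates exactly.

```python
def odds(probability):
    """Convert a probability into odds."""
    return probability / (1.0 - probability)


def unadjusted_odds_ratio(p_group, p_reference):
    """Crude odds ratio comparing a group with its reference group."""
    return odds(p_group) / odds(p_reference)


# Crude 30-day readmission proportions taken from Table 2.
spanish_vs_english_latino = unadjusted_odds_ratio(0.120, 0.025)
chinese_vs_english_asian = unadjusted_odds_ratio(0.128, 0.088)
```

These crude ratios come out at roughly 5.3 and 1.5, in the same direction as the adjusted odds ratios of 5.7 and 1.5 reported for the same comparisons.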
Conclusion/Discussion
Our results indicate that language barriers may contribute to higher readmission rates for non‐English speakers, but that they have less impact on care efficiency or mortality. This finding of an association between language and readmission, without a similar association with efficiency, suggests a potentially communication‐critical step in care.20, 21 Patients with language barriers are more likely to experience adverse events, and those events are often caused by errors in communication.12 It is conceivable that higher readmission risk for Chinese and Spanish speakers in our study was, at least in part, due to gaps in communication that are present in all patient groups, and exacerbated by the presence of a language barrier. This barrier is likely present during hospitalization but magnified at discharge, limiting caregivers' ability to understand patients' needs for home care, while simultaneously limiting patients' understanding of the discharge plan. After discharge, it is also possible that non‐English speakers are less able to communicate their needs as they arise, or, more subtly, feel less supported by a primarily English‐speaking healthcare system. As in other clinical arenas,22 it is quite possible that increased access to professional interpreters in the hospital setting, and particularly at the time of discharge, would enhance communication and outcomes for LEP patients. Our interpreter services data showed that the patients in our study had quite limited access to staff professional interpreters.
Our findings differ somewhat from those of John‐Baptiste et al.,11 who found that language barriers contributed to increased LOS for patients with cardiac and major surgical diagnoses. Our study's findings are akin to recent research suggesting that being a monolingual Spanish speaker or receiving interpreter services may not significantly impact LOS or cost of hospitalization,23 and that LOS and in‐hospital mortality do not differ for non‐English speakers and English speakers after acute myocardial infarction.24 These studies, along with our results, suggest that care efficiency in the hospital may be driven much more by clinical acuity (eg, the need to respond rapidly to urgent clinical signs such as hypotension, fever and respiratory distress) than by adequacy of communication. For example, elderly LEP patients may be even more likely than English speakers to have vigilant family members at the bedside throughout their hospitalization due to their need for communication assistance; these family members can quickly alert hospital staff to concerning changes in the patient's condition.
Our results also suggest the possibility that language and ethnicity are not monolithic concepts, and that even within language and ethnic groups there are potential differences in care pattern. For example, not speaking English may be a surrogate marker for unmeasured factors such as social supports and access to care. Language is intimately associated with culture; it remains plausible that cultural differences between highly acculturated and less acculturated members of a given ethno‐cultural group may have contributed to our observed differences in readmission rates. Differences in culture and associated factors, such as social support or use of multiple hospital systems, may account for lack of higher readmission risk in Russian speakers, while Chinese and Spanish speakers had higher readmission risk.
In addition, our finding that English‐speaking Latinos had lower readmission risk than any other group may be more consistent with their clinical characteristics (eg, younger age, fewer comorbidities) than with cultural factors. Our finding that African American patients had the highest readmission risk in our hospital was both surprising and concerning. Some of this increased risk may be explained by clinical characteristics, such as higher comorbidities and higher rates of diagnoses leading to frequent admissions (eg, sickle cell disease); however, the reasons for this disparity deserve further investigation.
Our study has limitations. First, our data are administrative, and lack information about patients' educational attainment, social support, acculturation, utilization of other hospital systems, and usual source of care. Despite this, we were able to account for many significant covariates that might contribute to readmission rates, including age, insurance status, gender, comorbidities, and admission to the intensive care unit.25–28 Second, our information about patients' English language proficiency is limited. While direct assessments of English proficiency are more accurate ways to determine a patient's ability to communicate with health care providers in English,29 our language validation work conducted in preparation for this study suggests that most of our patients recorded as having a non‐English primary language (87%) also have a low score on a language acculturation scale.
Third, only 14% of our non‐English speaking subjects utilized professional staff interpreters, and we had no information on the use of professional telephonic interpreters, or ad hoc interpreters (family members, non‐interpreter staff members) and their impact on our results. It is well‐documented that ad hoc interpreters are used frequently in healthcare, particularly in the hospital setting, and thus we can assume this to be true in our study.30, 31 As noted above, it is likely that the advocacy of family members and friends at the bedside helped to minimize potential differences in care efficiency for patients with language barriers. Finally, our study was performed at a single university‐based hospital and may not produce results which are applicable to other care settings.
Our findings point to several avenues for future research on language barriers and hospitalized patients. First, the field would benefit from an examination of the impact of easy access to professional interpreters during hospitalization on outcomes of hospital care, in particular on readmission rates. Second, there is a need for development and assessment of best practices for creating a culture of professional interpreter utilization in the hospital among physicians and nursing staff. Third, investigation of the role of caregiver presence in the hospital room, and how this might differ by patient culture, age, and language ability, may further elucidate some of the differences across language groups observed in our study. Lastly, a more granular investigation of clinician‐patient communication and the importance of interpersonal processes of care on both patient satisfaction and understanding of and adherence to discharge instructions could lead to the development of detailed interventions to enhance this communication and these outcomes, as similar work has done for communication‐sensitive outcomes in the outpatient arena.32–34
In summary, our study suggests that higher risk for readmission can be added to the unfortunate list of outcomes which are worsened due to language barriers, pointing to transition from the hospital as a potentially communication‐critical step in care which may be amenable to intervention. Our findings also suggest that this risk can vary even between groups of patients who do not speak English primarily. Whether and to what degree language and communication barriers alone (including access to professional interpreters and patient‐centered communication) during hospitalization, or differences in caregiver social support both during and after hospitalization as well as access to care post‐hospitalization, contribute to these findings is a worthy subject of future research.
Acknowledgements
The authors acknowledge Dr. Eliseo J. Pérez‐Stable for his mentorship on this project.
- Language Use and English‐Speaking Ability: 2000. Available at: http://www.census.gov/prod/2003pubs/c2kbr‐29.pdf. Accessed January 2010.
- U.S. Department of Health and Human Services. 2006 National Healthcare Disparities Report. AHRQ Publication No. 070012; 2006.
- Disparities in health care by race, ethnicity, and language among the insured: findings from a national sample. Med Care. 2002;40(1):52–59.
- The effect of physician‐patient communication on mammography utilization by different ethnic groups. Med Care. 1991;29(11):1065–1082.
- Language of interview: relevance for research of Southwest Hispanics. Am J Pub Health. 1991;81(11):1399–1404.
- Is language a barrier to the use of preventive services? J Gen Intern Med. 1997;12(8):472–477.
- Impact of language barriers on patient satisfaction in an emergency department. J Gen Intern Med. 1999;14:82–87.
- Patient comprehension of doctor‐patient communication on discharge from the emergency department. J Emerg Med. 1997;15(1):1–7.
- Drug complications in outpatients. J Gen Intern Med. 2000;15:149–154.
- Language concordance as a determinant of patient compliance and emergency room use in patients with asthma. Med Care. 1988;26(12):1119–1128.
- The effect of English language proficiency on length of stay and in‐hospital mortality. J Gen Intern Med. 2004;19:221–228.
- Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care. 2007;19(2):60–67.
- Factors associated with discussion of care plans and code status at the time of hospital admission: results from the Multicenter Hospitalist Study. J Hosp Med. 2008;3(6):437–445.
- Quality of care for decompensated heart failure: comparable performance between academic hospitalists and non‐hospitalists. J Gen Intern Med. 2008;23(9):1399–1406.
- A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–383.
- AHRQ. Healthcare Cost and Utilization Project: Tools 132(3):191–200.
- Free knot splines for logistic models and threshold selection. Comput Methods Programs Biomed. 2005;77(1):1–9.
- Statistical methods in epidemiology: a comparison of statistical methods to analyze dose‐response and trend analysis in epidemiologic studies. J Clin Epidemiol. 1998;51(12):1223–1233.
- The Care Transitions Project. Health Care Policy and Research, Practitioner Tools. Available at: http://www.caretransitions.org/practitioner_tools.asp.
- AHRQ. Improving safety at the point of care. Available at: http://www.ahrq.gov/qual/pips. Accessed January 2010.
- Do professional interpreters improve clinical care for patients with limited English proficiency? A systematic review of the literature. Health Serv Res. 2007;42(2):727–754.
- The impact of an enhanced interpreter service intervention on hospital costs and patient satisfaction. J Gen Intern Med. 2007;22 Suppl 2:306–311.
- Acute myocardial infarction length of stay and hospital mortality are not associated with language preference. J Gen Intern Med. 2008;23(2):190–194.
- A systematic literature review of factors affecting outcome in older medical patients admitted to hospital. Age Ageing. 2004;33(2):110–115.
- Risk factors and prognostic predictors of unexpected intensive care unit admission within 3 days after ED discharge. Am J Emerg Med. 2007;25(9):1009–1014.
- Effect of gender, ethnicity, pulmonary disease, and symptom stability on rehospitalization in patients with heart failure. Am J Cardiol. 2007;100(7):1139–1144.
- Bouncing back: patterns and predictors of complicated transitions 30 days after hospitalization for acute ischemic stroke. J Am Geriatr Soc. 2007;55(3):365–373.
- Identification of limited English proficient patients in clinical care. J Gen Intern Med. 2008;23(10):1555–1560.
- Hospital language services for patients with limited English proficiency: results from a national survey. Health Research October 2006.
- Hospitals, Language, and Culture: a Snapshot of the Nation. The Joint Commission and The California Endowment; 2007.
- A randomized controlled trial of interventions to enhance patient‐physician partnership, patient adherence and high blood pressure control among ethnic minorities and poor persons: study protocol NCT00123045. Implement Sci. 2009;4:7.
- Interpersonal processes of care and patient satisfaction: do associations differ by race, ethnicity, and language? Health Serv Res. 2009;44(4):1326–1344.
- Understanding concordance in patient‐physician relationships: personal and ethnic dimensions of shared identity. Ann Fam Med. 2008;6(3):198–205.
Forty‐five million Americans speak a language other than English, and more than 19 million of these speak English less than "very well," or are limited English proficient (LEP).1 The number of non‐English‐speaking and LEP people in the US has risen in recent decades, presenting a challenge to healthcare systems to provide high‐quality, patient‐centered care for these patients.2
For outpatients, language barriers are a fundamental contributor to gaps in health care. In the clinic setting, patients who do not speak English well have less access to a usual source of care and lower rates of physician visits and preventive services.3–6 Even when patients with language barriers do have access to care, they have poorer adherence, decreased comprehension of their diagnoses, decreased satisfaction with care, and increased medication complications.7–10
Few studies, however, have examined how language influences outcomes of hospital care. Compared to English‐speakers, patients who do not speak English well may experience longer lengths of stay,11 and have more adverse events while in the hospital.12 However, these previous studies have not investigated outcomes immediately post‐hospitalization, such as readmission rates and mortality, nor have they directly addressed the interaction between ethnicity and language.
To understand these questions, we analyzed data collected from a university‐based teaching hospital which cares for patients of diverse cultural and language backgrounds. Using these data, we examined how patients' primary language influenced hospital costs, length of stay (LOS), 30‐day readmission, and 30‐day mortality risk.
Patients and Methods
Patient Population and Setting
Our study examined patients admitted to the General Medicine Service at the University of California, San Francisco Medical Center (UCSF) between July 1, 2001 and June 30, 2003, the time period during which UCSF participated in the Multicenter Hospitalist Trial (MHT), a prospective quasi‐randomized trial of hospitalist care for general medicine patients.13, 14
UCSF Moffitt‐Long Hospital is a 400‐bed urban academic medical center which provides services to the City and County of San Francisco, an ethnically and linguistically diverse area. UCSF employs staff language interpreters in Spanish, Chinese and Russian who travel to its many outpatient clinics, Comprehensive Cancer Center, Children's Hospital, as well as to Moffitt‐Long Hospital upon request; phone interpretation is also available when in‐person interpreters are not available, for off hours needs and for less common languages. During the period of this study there were no specific inpatient guidelines in place for use of interpretation services at UCSF, nor were there any specific interventions targeting LEP or non‐English speaking inpatients.
Patients were eligible for the MHT if they were 18 years of age or older and admitted at random to a hospitalist or non‐hospitalist physician (eg, outpatient general internist attending on average 1 month/year); a minority of patients were cared for directly by their primary care physician while in the hospital, and were excluded. For purposes of our study, which merged MHT data with hospital administrative data on primary language, we further excluded all admissions for patients whose primary language was missing (n = 5), was listed as unknown or other (n = 78), was sign language (n = 3), or was listed but was not one of the included languages (n = 258). Included languages were English, Chinese, Russian, and Spanish. Because LOS and cost data were skewed, we excluded those admissions with the top 1% longest stays and the top 1% highest cost (n = 176); these exclusions did not alter the proportion of admissions across language and ethnicity. In addition, we excluded 102 admissions that were missing data on cost and 11 with costs <$500, which were likely to be erroneous. Our research was approved by the UCSF Institutional Review Board.
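The trimming of the top 1% of LOS and cost described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the nearest‐rank percentile rule, the function names `percentile_cutoff` and `trim_extremes`, and the synthetic admissions are all hypothetical and are not taken from the study, which was analyzed in STATA.

```python
import math

def percentile_cutoff(values, pct):
    """Value at the pct-th percentile, using the nearest-rank rule (an assumption)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def trim_extremes(admissions, pct=99):
    """Keep admissions whose LOS and cost are both at or below the pct-th percentile."""
    los_cut = percentile_cutoff([a["los"] for a in admissions], pct)
    cost_cut = percentile_cutoff([a["cost"] for a in admissions], pct)
    return [a for a in admissions if a["los"] <= los_cut and a["cost"] <= cost_cut]

# Synthetic data (hypothetical): 198 ordinary admissions plus two extreme ones.
admissions = [{"los": 1 + i % 10, "cost": 5000 + 10 * i} for i in range(198)]
admissions.append({"los": 90, "cost": 6000})     # extreme length of stay
admissions.append({"los": 3, "cost": 400000})    # extreme cost
kept = trim_extremes(admissions)
```

Trimming on each variable separately, as here, removes an admission if it is extreme on either LOS or cost; the article does not specify the exact percentile rule that was applied.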
Data Sources
We collected administrative data from Transition Systems Inc (TSI, Boston, MA) billing databases at UCSF as part of the MHT. These data include patient demographics, insurance, costs, ICD‐9‐CM diagnostic codes, and admission and discharge dates in Uniform Bill 92 format. Patient mortality information was collected as part of the MHT using the National Death Index.14
Language data were collected from a separate patient‐registration database (STOR) at UCSF. Information on a patient's primary language is entered at the time each patient first registers at UCSF, whether for the index hospitalization or for prior clinic visits, and is based generally on patient self report. As part of our validation step, we cross‐checked 829 STOR language entries against patient reports and found 91% agreement with the majority of the errors classifying non‐English speakers as English‐speakers.
Measures
Predictor
Our primary language variable was derived using language designations collected from patient registration databases described above. Using these data we specified our key language groups as English, Chinese (Cantonese or Mandarin), Russian, or Spanish.
Outcomes
LOS and total cost of hospital stay for each hospitalization were derived from administrative data sources. Readmissions were identified at the time patients were readmitted to UCSF (eg, flagged in administrative data). Mortality was determined by whether an individual patient with an admission in the database was recorded in the National Death Index as dead within 30 days of admission.
Covariates
Additional covariates included age at admission, gender, ethnicity as recorded in registration databases (White, African American, Asian, Latino, Other), insurance, principal billing diagnosis, whether or not a patient received intensive care unit (ICU) care, type of admitting attending physician (Hospitalist/non‐Hospitalist), and an administrative Charlson comorbidity score.15 To collapse the principal diagnoses into categories, we used the Healthcare Cost and Utilization Project (HCUP)'s Clinical Classification System, which allowed us to classify each diagnosis in 1 of 14 generally accepted categories.16
Analysis
Statistical analyses were performed using STATA statistical software (STATACorp, Version 9, College Station, TX). We examined descriptive means and proportions for all variables, including sociodemographic, hospitalization, comorbidity and outcome variables. We compared English and non‐English speakers on all covariate and outcome variables using t‐tests for comparison of means and chi‐square for comparison of categorical variables.
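As a rough illustration of the chi‐square comparison for categorical variables, the sketch below computes an uncorrected Pearson statistic for a 2 × 2 table in plain Python. The counts are the gender‐by‐language figures reported in Table 1; the function name `chi_square_2x2` is hypothetical, and because the original analysis was run in STATA this should not be read as the authors' code.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Rows: male, female. Columns: English, non-English (counts from Table 1).
stat = chi_square_2x2(2967, 514, 2910, 632)

# With 1 degree of freedom, a statistic above about 3.84 corresponds to P < 0.05.
significant = stat > 3.84
```

Whether a continuity correction was applied in the original analysis is not stated, so the uncorrected form here is an assumption.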
It was not possible to fully test the language‐by‐ethnicity interaction (whether or not the impact of language varied by ethnic group) because many cells of the joint distribution were very sparse (eg, the sample contained very few non‐English‐speaking African Americans). Therefore, to better understand the influence of English vs. non‐English language usage across different ethnic groups, we created a combined language‐ethnicity predictor variable which categorized each subject first by language and then, for the English‐speakers, by ethnicity. For example, a Chinese, Spanish, or Russian speaker would be categorized as such, and an English‐speaker could fall into the English‐White, English‐African American, English‐Asian, or English‐Latino group. This allowed us to test whether there were any differences in language effects across the White, Asian, and Latino ethnicities, and any difference in ethnicity effects among English‐speakers.
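The combined language‐ethnicity variable can be expressed as a small categorization rule. The following is a minimal sketch, assuming string labels for language and ethnicity; the function name `language_ethnicity_group` and the exact label spellings are hypothetical, and an "English‐Other" group is included because Table 1 reports one.

```python
NON_ENGLISH_LANGUAGES = {"Chinese", "Russian", "Spanish"}
NAMED_ENGLISH_ETHNICITIES = {"White", "African American", "Asian", "Latino"}

def language_ethnicity_group(language, ethnicity):
    """Categorize first by language; split English speakers further by ethnicity."""
    if language in NON_ENGLISH_LANGUAGES:
        # Non-English speakers are grouped by language only.
        return language
    if language == "English":
        if ethnicity in NAMED_ENGLISH_ETHNICITIES:
            return "English-" + ethnicity
        return "English-Other"
    # Other languages were excluded from the study sample.
    raise ValueError("language not among the included languages: " + language)

groups = [
    language_ethnicity_group("Chinese", "Asian"),
    language_ethnicity_group("English", "Latino"),
    language_ethnicity_group("English", "Other"),
]
```

A single categorical predictor of this kind sidesteps the sparse cells of a full language‐by‐ethnicity interaction, at the cost of not separating ethnicity among non‐English speakers.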
Because cost and LOS were skewed, we used negative binomial models for LOS and log transformed costs. We performed a sensitivity analysis testing whether our results were robust to the exclusion of the admissions with the top 1% LOS and top 1% cost. We used logistic regression for the 30‐day readmission and mortality outcomes.
Our primary predictor was the language‐ethnicity variable described above. To determine the independent association between this predictor and our key outcomes, we then built models which included additional potential confounders selected either for face validity or because of observed confounding with other covariates. Our inclusion of potential confounders was limited by the variables available in the administrative database; thus, we were not able to pursue detailed analyses of communication and literacy factors and their interaction with our predictor or their independent impact on outcomes. Models also included a linear spline with a single knot at age 65 years as a further adjustment for age in Medicare recipients.17–19 For the 30‐day readmission outcome model, we excluded those admissions for which the patient either died in the hospital or was discharged to hospice care. Within each model we tested the impact of a language barrier using custom contrasts. This allowed us to examine the language‐ethnicity effect aggregating all non‐English speakers compared to all English‐speakers, comparing each non‐English‐speaking group to all English‐speakers, comparing Chinese speakers to English‐speaking Asians and Spanish speakers to English‐speaking Latinos, as well as to test whether the effect of English language is the same across ethnicities.
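A linear spline with a single knot lets the slope for age change at 65 while keeping the fitted line continuous. The sketch below shows one common parameterization (age plus a hinge term); the article does not say which parameterization was used, and the function names and slope values here are purely hypothetical.

```python
def age_spline_terms(age, knot=65.0):
    """Return the two regression terms: age itself and the amount by which age exceeds the knot."""
    return age, max(0.0, age - knot)

def age_contribution(age, slope_below, slope_change, knot=65.0):
    """Linear predictor contribution: slope_below applies everywhere,
    slope_change is added only for years beyond the knot."""
    base, above = age_spline_terms(age, knot)
    return slope_below * base + slope_change * above

# Hypothetical coefficients, for illustration only.
younger = age_contribution(50, slope_below=0.01, slope_change=0.02)
older = age_contribution(80, slope_below=0.01, slope_change=0.02)
```

Under this parameterization the slope is `slope_below` before age 65 and `slope_below + slope_change` afterward.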
Results
Admission Characteristics of the Sample
A total of 7023 patients were admitted to the General Medicine service, 5877 (84%) of whom were English‐speakers and 1146 (16%) non‐English‐speakers (Table 1). Overall, half of the admitted patients were women (50%), and the vast majority were insured (93%). The most common principal diagnoses were respiratory and gastrointestinal disorders. Only a small number of non‐English speakers (164; 14%) were recorded in the UCSF Interpreter Services database as having had any interaction with a professional staff interpreter during their hospitalization.
Characteristic | English (n = 5877), n (%) | Non‐English (n = 1146), n (%) |
---|---|---|
Socio‐economic variables | ||
Language‐ethnicity | ||
English | ||
White | 3066 (52.2) | |
African American | 1351 (23.0) | |
Asian | 544 (9.3) | |
Latino | 298 (5.1) | |
Other | 618 (10.5) | |
Chinese speakers | | 584 (51.0) |
Spanish speakers | | 272 (23.7) |
Russian speakers | | 290 (25.3) |
Age mean (SD) (range 18‐105) | 58.8 (20.3) | 72.3 (15.5) |
Gender | ||
Male | 2967 (50.5) | 514 (44.8) |
Female | 2910 (49.5) | 632 (55.2) |
Insurance | ||
Medicare | 2878 (49.0) | 800 (69.8) |
Medicaid | 1201 (20.4) | 193 (16.8) |
Commercial | 1358 (23.1) | 106 (9.3) |
Charity/other | 440 (7.5) | 47 (4.1) |
Hospitalization variables | ||
Admitted to ICU | ||
Yes | 721 (12.3) | 149 (13.0) |
Attending physician | ||
Hospitalist | 3950 (67.2) | 781 (68.2) |
Comorbidity variables | ||
Principal Diagnosis | ||
Respiratory disorder | 1061 (18.1) | 225 (19.6) |
Gastrointestinal disorder | 963 (16.4) | 205 (17.9) |
Circulatory disorder | 613 (10.4) | 140 (12.2) |
Endocrine/metabolism | 671 (11.4) | 80 (7.0) |
Injury/poisoning | 475 (8.1) | 64 (5.6) |
Malignancy | 395 (6.7) | 107 (9.3) |
Renal/urinary disorder | 383 (6.5) | 108 (9.4) |
Skin disorder | 278 (4.7) | 28 (2.9) |
Infection/fatigue NOS | 206 (3.5) | 45 (3.4) |
Blood disorder (non‐malignant) | 189 (3.2) | 38 (3.3) |
Musculoskeletal/connective tissue disorder | 164 (2.8) | 33 (2.9) |
Mental disorder/substance abuse | 171 (2.9) | 7 (0.6) |
Nervous system/brain infection | 137 (2.3) | 26 (2.3) |
Unclassified | 171 (2.9) | 40 (3.5) |
Charlson Index score, mean ± SD | 0.97 ± 1.33 | 1.10 ± 1.42 |
Among English speakers, Whites and African Americans were the most common ethnicities; however, more than 500 admissions were categorized as Asian ethnicity, and more than 600 as patients of other ethnicity. Close to 300 admissions were for Latinos. Among non‐English speakers, Chinese speakers had the largest number of admissions (n = 584), while Spanish and Russian speakers had similar numbers (n = 272 and 290 respectively).
Non‐English speakers were older, more likely to be female, more likely to be insured by Medicare, and more likely to have a higher comorbidity index score. While comorbidity scores (mean ± SD) were similar among non‐English speakers (Chinese 1.13 ± 1.50; Russian 1.09 ± 1.37; Spanish 1.06 ± 1.30), they differed considerably among English speakers (White 0.94 ± 1.29; African American 1.05 ± 1.40; Asian 1.04 ± 1.45; Latino 0.89 ± 1.23; Other 0.91 ± 1.29).
Hospital Outcome by Language‐Ethnicity Group (Table 2)
When aggregated, non‐English speakers were somewhat more likely to have died within 30 days and to have lower‐cost admissions; however, they did not differ from English speakers on LOS or readmission rates. While differences among disaggregated language‐ethnicity groups were not all statistically significant, English‐speaking Whites had the longest LOS (mean = 4.9 days) and highest costs (mean = $10,530). English‐speaking African Americans, Chinese speakers, and Spanish speakers had the highest 30‐day readmission rates, whereas English‐speaking Latinos and Russian speakers had markedly lower 30‐day readmission rates (2.5% and 6.4%, respectively). Chinese speakers had the highest 30‐day mortality, followed by English‐speaking Whites and Asians.
Language‐Ethnicity Group | LOS,* mean days (SD) | Cost, mean $ (SD) | 30‐Day Readmission, n (%) | 30‐Day Mortality, n (%) |
---|---|---|---|---|
English speakers (all) | 4.7 (4.5) | 10,035 (15,041) | 648 (11.9) | 613 (10.4) |
White | 4.9 (5.1) | 10,530 (15,894) | 322 (11.4) | 377 (12.3) |
African American | 4.5 (4.8) | 9107 (13,314) | 227 (17.5) | 91 (6.7) |
Asian | 4.3 (4.5) | 9933 (15,607) | 43 (8.8) | 67 (12.3) |
Latino | 4.6 (4.8) | 9823 (14,113) | 7 (2.5) | 18 (6.0) |
Other | 4.5 (4.8) | 9662 (14,016) | 49 (8.5) | 60 (9.7) |
Non‐English speakers (all) | 4.5 (4.5) | 9515 (13,213) | 117 (11.0) | 147 (12.8) |
Chinese speakers | 4.5 (4.6) | 9505 (12,841) | 69 (12.8) | 85 (14.6) |
Spanish speakers | 4.5 (4.5) | 9115 (13,846) | 31 (12.0) | 28 (10.3) |
Russian speakers | 4.7 (4.2) | 9846 (13,360) | 17 (6.4) | 34 (11.7) |
We further investigated differences among English speakers to better understand the very high rate of readmission for African Americans and the very low rate for English‐speaking Latinos. African Americans were on average younger than other English speakers (55 ± 19 years vs. 60 ± 21 years; P < 0.001), but they had higher comorbidity scores than other English speakers (1.05 ± 1.40 vs. 0.94 ± 1.31; P = 0.008), and were more likely to be admitted for non‐malignant blood disorders (eg, sickle cell disease), endocrine disorders (eg, diabetes mellitus), and circulatory disorders (eg, stroke). English‐speaking Latinos were also younger than other English speakers (53 ± 21 years vs. 59 ± 20 years; P < 0.001), but in contrast they trended toward lower comorbidity scores (0.87 ± 1.23 vs. 0.97 ± 1.33; P = 0.2), and were more likely to be admitted for gastrointestinal and musculoskeletal disorders, and less likely to be admitted for malignancy and endocrine disorders.
Multivariate Analyses: Association of Aggregated and Disaggregated Language‐Ethnicity Groups With Hospital Outcomes (Table 3)
In multivariate models examining aggregated language‐ethnicity groups, non‐English speakers had a trend toward higher odds of readmission at 30 days post‐discharge than the English‐speaking group (odds ratio [OR], 1.3; 95% confidence interval [CI], 1.0‐1.7). There were no significant differences for LOS, cost, or 30‐day mortality. Compared to English speakers, Chinese and Spanish speakers had 70% and 50% higher adjusted odds of readmission at 30 days post‐discharge, respectively, while Russian speakers' odds of readmission were not increased. Additionally, Chinese speakers had 7% shorter LOS than English‐speakers. There were no significant differences among any of the language‐ethnicity groups for 30‐day mortality. The increased odds of readmission for Chinese and Spanish speakers compared to English speakers were robust to reinclusion of the admissions with the top 1% LOS and top 1% cost.
Language Categorization | LOS, % Difference (95% CI) | Total Cost, % Difference (95% CI) | 30‐Day Readmission,* OR (95% CI) | Mortality, OR (95% CI) |
---|---|---|---|---|
All English speakers | Reference | Reference | Reference | Reference |
Non‐English speakers | −3.1 (−8.7 to 3.1) | −2.5 (−8.3 to 2.1) | 1.3 (1.0 to 1.7) | 0.9 (0.7 to 1.2) |
All English speakers | Reference | Reference | Reference | Reference |
Chinese speakers | −7.2 (−13.9 to 0) | −5.3 (−12.2 to 2.1) | 1.7 (1.2 to 2.3) | 1.0 (0.8 to 1.4) |
Spanish speakers | −3.0 (−12.6 to 7.6) | −3.0 (−12.7 to 7.7) | 1.5 (1.0 to 2.3) | 0.9 (0.6 to 1.5) |
Russian speakers | 1.5 (−8.3 to 12.2) | 0.9 (−8.9 to 11.8) | 0.8 (0.5 to 1.4) | 0.8 (0.5 to 1.2) |
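For readers less familiar with these effect measures, the sketch below shows how a coefficient from a log‐scale model (such as a negative binomial model for LOS or a model of log‐transformed cost) is conventionally converted to a percent difference, and how an odds ratio maps to "percent higher odds." The article does not state the exact conversion used, so the formula `(exp(beta) − 1) × 100` is an assumption, and the function names are hypothetical.

```python
import math

def percent_difference(beta):
    """Percent difference implied by a coefficient on the log scale."""
    return (math.exp(beta) - 1) * 100

def log_coefficient(pct):
    """Inverse of percent_difference: the log-scale coefficient for a given percent difference."""
    return math.log(1 + pct / 100)

def percent_change_in_odds(odds_ratio):
    """Percent change in odds relative to the reference group."""
    return (odds_ratio - 1) * 100

# An OR of 1.7 (Chinese speakers, 30-day readmission) corresponds to about 70% higher odds.
chinese_readmission = percent_change_in_odds(1.7)

# A LOS that is 7.2% shorter corresponds to a negative log-scale coefficient.
beta_los = log_coefficient(-7.2)
```

Because such intervals are typically built on the log scale and then transformed, the reported percent‐difference intervals need not be symmetric around the point estimate.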
Multivariate Analyses: Association of Language for Asians and Latinos, and of Ethnicity for English speakers, With Hospital Outcomes (Table 4)
Both Chinese and Spanish speakers had significantly higher odds of 30‐day readmission than their English‐speaking Asian and Latino counterparts. There were no significant differences in LOS, cost, or 30‐day mortality in this within‐ethnicity analysis. Among English speakers, admissions for patients with Asian ethnicity were 15% shorter and resulted in 9% lower costs than for Whites. While LOS and cost were similar for English‐speaking Latino and White admissions, English‐speaking Latinos had markedly lower odds of 30‐day readmission than their White counterparts. African Americans, by contrast, had 6% shorter LOS, 40% higher odds of readmission, and 40% lower odds of mortality at 30 days than English‐speaking Whites (OR, 0.6; Table 4).
Language‐Ethnicity Comparisons | LOS, % Difference (95% CI) | Total Cost, % Difference (95% CI) | 30‐Day Readmission,* OR (95% CI) | Mortality, OR (95% CI) |
---|---|---|---|---|
English speaking Asians | Reference | Reference | Reference | Reference |
Chinese speakers | 2.2 (−7.4 to 12.7) | 0.3 (−9.2 to 10.7) | 1.5 (1.0 to 2.3) | 0.8 (0.6 to 1.2) |
English speaking Latinos | Reference | Reference | Reference | Reference |
Spanish speakers | −4.5 (−16.8 to 9.5) | −1.2 (−14.0 to 13.5) | 5.7 (2.4 to 13.2) | 1.2 (0.6 to 2.4) |
English‐White | Reference | Reference | Reference | Reference |
English‐African American | −6.2 (−11.3 to −0.9) | −4.4 (−9.6 to 1.1) | 1.4 (1.1 to 1.7) | 0.6 (0.5 to 0.8) |
English‐Asian | −14.6 (−20.9 to −7.9) | −8.6 (−15.4 to −1.4) | 0.8 (0.5 to 1.0) | 1.0 (0.7 to 1.4) |
English‐Latino | −4.5 (−13.5 to 5.4) | −5.0 (−14.0 to 5.0) | 0.2 (0.1 to 0.4) | 0.6 (0.4 to 1.0) |
Conclusion/Discussion
Our results indicate that language barriers may contribute to higher readmission rates for non‐English speakers, but that they have less impact on care efficiency or mortality. This finding of an association between language and readmission, without a similar association with efficiency, suggests a potentially communication‐critical step in care.20, 21 Patients with language barriers are more likely to experience adverse events, and those events are often caused by errors in communication.12 It is conceivable that higher readmission risk for Chinese and Spanish speakers in our study was, at least in part, due to gaps in communication that are present in all patient groups, and exacerbated by the presence of a language barrier. This barrier is likely present during hospitalization but magnified at discharge, limiting caregivers' ability to understand patients' needs for home care, while simultaneously limiting patients' understanding of the discharge plan. After discharge, it is also possible that non‐English speakers are less able to communicate their needs as they arise, or, more subtly, feel less supported by a primarily English‐speaking healthcare system. As in other clinical arenas,22 it is quite possible that increased access to professional interpreters in the hospital setting, and particularly at the time of discharge, would enhance communication and outcomes for LEP patients. Our interpreter services data showed that the patients in our study had quite limited access to staff professional interpreters.
Our findings differ somewhat from those of John‐Baptiste et al.,11 who found that language barriers contributed to increased LOS for patients with cardiac and major surgical diagnoses. Our study's findings are akin to recent research suggesting that being a monolingual Spanish speaker or receiving interpreter services may not significantly impact LOS or cost of hospitalization,23 and that LOS and in‐hospital mortality do not differ for non‐English speakers and English speakers after acute myocardial infarction.24 These studies, along with our results, suggest that care efficiency in the hospital may be driven much more by clinical acuity (eg, the need to respond rapidly to urgent clinical signs such as hypotension, fever, and respiratory distress) than by adequacy of communication. For example, elderly LEP patients may be even more likely than English speakers to have vigilant family members at the bedside throughout their hospitalization due to their need for communication assistance; these family members can quickly alert hospital staff to concerning changes in the patient's condition.
Our results also suggest the possibility that language and ethnicity are not monolithic concepts, and that even within language and ethnic groups there are potential differences in care pattern. For example, not speaking English may be a surrogate marker for unmeasured factors such as social supports and access to care. Language is intimately associated with culture; it remains plausible that cultural differences between highly acculturated and less acculturated members of a given ethno‐cultural group may have contributed to our observed differences in readmission rates. Differences in culture and associated factors, such as social support or use of multiple hospital systems, may account for lack of higher readmission risk in Russian speakers, while Chinese and Spanish speakers had higher readmission risk.
Forty‐five million Americans speak a language other than English, and more than 19 million of these speak English less than "very well," or are limited English proficient (LEP).1 The number of non‐English‐speaking and LEP people in the US has risen in recent decades, presenting a challenge to healthcare systems to provide high‐quality, patient‐centered care for these patients.2
For outpatients, language barriers are a fundamental contributor to gaps in health care. In the clinic setting, patients who do not speak English well have less access to a usual source of care and lower rates of physician visits and preventive services.3–6 Even when patients with language barriers do have access to care, they have poorer adherence, decreased comprehension of their diagnoses, decreased satisfaction with care, and increased medication complications.7–10
Few studies, however, have examined how language influences outcomes of hospital care. Compared to English speakers, patients who do not speak English well may experience longer lengths of stay11 and have more adverse events while in the hospital.12 However, these previous studies have not investigated outcomes immediately post‐hospitalization, such as readmission rates and mortality, nor have they directly addressed the interaction between ethnicity and language.
To address these questions, we analyzed data collected from a university‐based teaching hospital that cares for patients of diverse cultural and language backgrounds. Using these data, we examined how patients' primary language influenced hospital costs, length of stay (LOS), 30‐day readmission, and 30‐day mortality risk.
Patients and Methods
Patient Population and Setting
Our study examined patients admitted to the General Medicine Service at the University of California, San Francisco Medical Center (UCSF) between July 1, 2001 and June 30, 2003, the period during which UCSF participated in the Multicenter Hospitalist Trial (MHT), a prospective quasi‐randomized trial of hospitalist care for general medicine patients.13, 14
UCSF Moffitt‐Long Hospital is a 400‐bed urban academic medical center which provides services to the City and County of San Francisco, an ethnically and linguistically diverse area. UCSF employs staff language interpreters in Spanish, Chinese and Russian who travel to its many outpatient clinics, Comprehensive Cancer Center, Children's Hospital, as well as to Moffitt‐Long Hospital upon request; phone interpretation is also available when in‐person interpreters are not available, for off hours needs and for less common languages. During the period of this study there were no specific inpatient guidelines in place for use of interpretation services at UCSF, nor were there any specific interventions targeting LEP or non‐English speaking inpatients.
Patients were eligible for the MHT if they were 18 years of age or older and admitted at random to a hospitalist or non‐hospitalist physician (eg, outpatient general internist attending on average 1 month/year); a minority of patients were cared for directly by their primary care physician while in the hospital, and were excluded. For purposes of our study, which merged MHT data with hospital administrative data on primary language, we further excluded all admissions for patients whose primary language was missing (n = 5), was listed as unknown or other language (n = 78), was sign language (n = 3), or was listed but was not one of the included languages (n = 258). Included languages were English, Chinese, Russian, and Spanish. Because LOS and cost data were skewed, we excluded those admissions with the top 1% longest stays and the top 1% highest cost (n = 176); these exclusions did not alter the proportion of admissions across language and ethnicity. In addition, we excluded 102 admissions that were missing data on cost and 11 with costs <$500, which were likely to be erroneous. Our research was approved by the UCSF Institutional Review Board.
Data Sources
We collected administrative data from Transition Systems Inc (TSI, Boston, MA) billing databases at UCSF as part of the MHT. These data include patient demographics, insurance, costs, ICD‐9‐CM diagnostic codes, and admission and discharge dates in Uniform Bill 92 format. Patient mortality information was collected as part of the MHT using the National Death Index.14
Language data were collected from a separate patient‐registration database (STOR) at UCSF. Information on a patient's primary language is entered at the time each patient first registers at UCSF, whether for the index hospitalization or for prior clinic visits, and is generally based on patient self‐report. As part of our validation step, we cross‐checked 829 STOR language entries against patient reports and found 91% agreement; most of the errors classified non‐English speakers as English speakers.
Measures
Predictor
Our primary language variable was derived using language designations collected from patient registration databases described above. Using these data we specified our key language groups as English, Chinese (Cantonese or Mandarin), Russian, or Spanish.
Outcomes
LOS and total cost of hospital stay for each hospitalization were derived from administrative data sources. Readmissions were identified at the time patients were readmitted to UCSF (eg, flagged in administrative data). Mortality was determined by whether an individual patient with an admission in the database was recorded in the National Death Index as dead within 30 days of admission.
Covariates
Additional covariates included age at admission, gender, ethnicity as recorded in registration databases (White, African American, Asian, Latino, Other), insurance, principal billing diagnosis, whether or not a patient received intensive care unit (ICU) care, type of admitting attending physician (Hospitalist/non‐Hospitalist), and an administrative Charlson comorbidity score.15 To collapse the principal diagnoses into categories, we used the Healthcare Cost and Utilization Project (HCUP)'s Clinical Classification System, which allowed us to classify each diagnosis in 1 of 14 generally accepted categories.16
Analysis
Statistical analyses were performed using Stata statistical software (StataCorp, Version 9, College Station, TX). We examined descriptive means and proportions for all variables, including sociodemographic, hospitalization, comorbidity, and outcome variables. We compared English and non‐English speakers on all covariate and outcome variables using t‐tests for comparison of means and chi‐square tests for comparison of categorical variables.
It was not possible to fully test the language‐by‐ethnicity interaction (whether or not the impact of language varied by ethnic group) because many cells of the joint distribution were very sparse (eg, the sample contained very few non‐English‐speaking African Americans). Therefore, to better understand the influence of English vs. non‐English language usage across different ethnic groups, we created a combined language‐ethnicity predictor variable that categorized each subject first by language and then, for the English speakers, by ethnicity. For example, a Chinese, Spanish, or Russian speaker would be categorized as such, and an English speaker could fall into the English‐White, English‐African American, English‐Asian, English‐Latino, or English‐Other group. This allowed us to test whether there were any differences in language effects across the White, Asian, and Latino ethnicities, and any difference in ethnicity effects among English speakers.
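The combined predictor described above can be illustrated with a short Python sketch. This is not the authors' code (the analysis was run in Stata); the function name, the label strings, and the error handling are assumptions made for illustration only.

```python
# Hypothetical sketch of the combined language-ethnicity predictor.
# Category labels mirror the groups named in the text; everything else is assumed.
NON_ENGLISH_LANGUAGES = {"Chinese", "Spanish", "Russian"}
ENGLISH_ETHNICITIES = {"White", "African American", "Asian", "Latino", "Other"}


def language_ethnicity_group(language: str, ethnicity: str) -> str:
    """Classify first by language; only English speakers are split by ethnicity."""
    if language in NON_ENGLISH_LANGUAGES:
        # Non-English speakers are grouped by language alone.
        return f"{language} speaker"
    if language == "English":
        if ethnicity not in ENGLISH_ETHNICITIES:
            raise ValueError(f"unrecognized ethnicity: {ethnicity!r}")
        return f"English-{ethnicity}"
    # Other languages were excluded from the study sample.
    raise ValueError(f"language not included in the study: {language!r}")


if __name__ == "__main__":
    print(language_ethnicity_group("Chinese", "Asian"))   # Chinese speaker
    print(language_ethnicity_group("English", "Latino"))  # English-Latino
```

Collapsing the two variables into one categorical predictor avoids estimating interaction terms for sparse cells, at the cost of not separating ethnicity effects among non‐English speakers.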
Because cost and LOS were skewed, we used negative binomial models for LOS and linear models of log‐transformed costs. We performed a sensitivity analysis testing whether our results were robust to the exclusion of the admissions with the top 1% LOS and top 1% cost. We used logistic regression for the 30‐day readmission and mortality outcomes.
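Both a negative binomial model and a model of log‐transformed cost estimate coefficients on the log scale. The sketch below shows one common way to express such a coefficient as a percent difference, as reported in Tables 3 and 4. The source does not state how its percent differences were computed, so this conversion and the example coefficient are assumptions.

```python
import math


def percent_difference(log_scale_coefficient: float) -> float:
    """Convert a log-scale regression coefficient to a percent difference."""
    return 100.0 * (math.exp(log_scale_coefficient) - 1.0)


def coefficient_from_percent(percent: float) -> float:
    """Inverse conversion: percent difference back to a log-scale coefficient."""
    return math.log(1.0 + percent / 100.0)


if __name__ == "__main__":
    # Hypothetical coefficient, chosen only to illustrate the arithmetic.
    beta = -0.0747
    print(f"{percent_difference(beta):.1f}%")  # about -7.2%
```

A coefficient of zero corresponds to no difference, and a confidence interval is converted by applying the same function to each bound.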
Our primary predictor was the language‐ethnicity variable described above. To determine the independent association between this predictor and our key outcomes, we then built models that included additional potential confounders selected either for face validity or because of observed confounding with other covariates. Our inclusion of potential confounders was limited by the variables available in the administrative database; thus, we were not able to pursue detailed analyses of communication and literacy factors and their interaction with our predictor or their independent impact on outcomes. Models also included a linear spline with a single knot at age 65 years as a further adjustment for age in Medicare recipients.17–19 For the 30‐day readmission outcome model, we excluded those admissions for which the patient either died in the hospital or was discharged to hospice care. Within each model we tested the impact of a language barrier using custom contrasts. This allowed us to examine the language‐ethnicity effect by aggregating all non‐English speakers and comparing them to all English speakers, by comparing each non‐English‐speaking group to all English speakers, and by comparing Chinese speakers to English‐speaking Asians and Spanish speakers to English‐speaking Latinos, as well as to test whether the effect of English language was the same across ethnicities.
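A linear spline with a single knot lets the slope for age change at 65 years while keeping the fitted line continuous. The following Python sketch shows one standard way to build the two spline terms; the function names and the example coefficients are hypothetical and are not taken from the study.

```python
from typing import Tuple


def age_spline_terms(age: float, knot: float = 65.0) -> Tuple[float, float]:
    """Return (age, years above the knot) for a single-knot linear spline."""
    age = float(age)
    return age, max(0.0, age - knot)


def age_contribution(age: float, slope_below: float, slope_change: float,
                     knot: float = 65.0) -> float:
    """Linear-predictor contribution of age.

    The slope is `slope_below` up to the knot and
    `slope_below + slope_change` beyond it.
    """
    linear, above = age_spline_terms(age, knot)
    return slope_below * linear + slope_change * above


if __name__ == "__main__":
    print(age_spline_terms(50))  # (50.0, 0.0)
    print(age_spline_terms(72))  # (72.0, 7.0)
    # Hypothetical coefficients, for illustration only.
    print(age_contribution(75, slope_below=0.01, slope_change=0.02))
```

Both terms enter the regression as ordinary covariates, so the same construction works in the negative binomial, log‐cost, and logistic models.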
Results
Admission Characteristics of the Sample
A total of 7023 patients were admitted to the General Medicine service, 5877 (84%) of whom were English speakers and 1146 (16%) non‐English speakers (Table 1). Overall, half of the admitted patients were women (50%), and the vast majority were insured (93%). The most common principal diagnoses were respiratory and gastrointestinal disorders. Only 164 non‐English speakers (14%) were recorded in the UCSF Interpreter Services database as having had any interaction with a professional staff interpreter during their hospitalization.
Characteristic | English (n = 5877), n (%) | Non‐English (n = 1146), n (%) |
---|---|---|
Socio‐economic variables | | |
Language‐ethnicity | | |
English | | |
White | 3066 (52.2) | |
African American | 1351 (23.0) | |
Asian | 544 (9.3) | |
Latino | 298 (5.1) | |
Other | 618 (10.5) | |
Chinese speakers | | 584 (51.0) |
Spanish speakers | | 272 (25.3) |
Russian speakers | | 290 (23.7) |
Age, mean (SD), years (range 18‐105) | 58.8 (20.3) | 72.3 (15.5) |
Gender | | |
Male | 2967 (50.5) | 514 (44.8) |
Female | 2910 (49.5) | 632 (55.2) |
Insurance | | |
Medicare | 2878 (49.0) | 800 (69.8) |
Medicaid | 1201 (20.4) | 193 (16.8) |
Commercial | 1358 (23.1) | 106 (9.3) |
Charity/other | 440 (7.5) | 47 (4.1) |
Hospitalization variables | | |
Admitted to ICU | | |
Yes | 721 (12.3) | 149 (13.0) |
Attending physician | | |
Hospitalist | 3950 (67.2) | 781 (68.2) |
Comorbidity variables | | |
Principal diagnosis | | |
Respiratory disorder | 1061 (18.1) | 225 (19.6) |
Gastrointestinal disorder | 963 (16.4) | 205 (17.9) |
Circulatory disorder | 613 (10.4) | 140 (12.2) |
Endocrine/metabolism | 671 (11.4) | 80 (7.0) |
Injury/poisoning | 475 (8.1) | 64 (5.6) |
Malignancy | 395 (6.7) | 107 (9.3) |
Renal/urinary disorder | 383 (6.5) | 108 (9.4) |
Skin disorder | 278 (4.7) | 28 (2.9) |
Infection/fatigue NOS | 206 (3.5) | 45 (3.4) |
Blood disorder (non‐malignant) | 189 (3.2) | 38 (3.3) |
Musculoskeletal/connective tissue disorder | 164 (2.8) | 33 (2.9) |
Mental disorder/substance abuse | 171 (2.9) | 7 (0.6) |
Nervous system/brain infection | 137 (2.3) | 26 (2.3) |
Unclassified | 171 (2.9) | 40 (3.5) |
Charlson Index score, mean ± SD | 0.97 ± 1.33 | 1.10 ± 1.42 |
Among English speakers, Whites and African Americans were the most common ethnicities; however, more than 500 admissions were categorized as Asian ethnicity, and more than 600 as patients of other ethnicity. Close to 300 admissions were for Latinos. Among non‐English speakers, Chinese speakers had the largest number of admissions (n = 584), while Spanish and Russian speakers had similar numbers (n = 272 and 290 respectively).
Non‐English speakers were older, more likely to be female, more likely to be insured by Medicare, and more likely to have a higher comorbidity index score. While comorbidity scores were similar among non‐English speakers (Chinese 1.13 ± 1.50; Russian 1.09 ± 1.37; Spanish 1.06 ± 1.30), they differed considerably among English speakers (White 0.94 ± 1.29; African American 1.05 ± 1.40; Asian 1.04 ± 1.45; Latino 0.89 ± 1.23; Other 0.91 ± 1.29).
Hospital Outcome by Language‐Ethnicity Group (Table 2)
When aggregated, non‐English speakers were somewhat more likely to be dead at 30 days and to have lower‐cost admissions; however, they did not differ from English speakers on LOS or readmission rates. While differences among disaggregated language‐ethnicity groups were not all statistically significant, English‐speaking Whites had the longest LOS (mean = 4.9 days) and highest costs (mean = $10,530). English‐speaking African Americans, Chinese speakers, and Spanish speakers had the highest 30‐day readmission rates, whereas English‐speaking Latinos and Russian speakers had markedly lower 30‐day readmission rates (2.5% and 6.4%, respectively). Chinese speakers had the highest 30‐day mortality, followed by English‐speaking Whites and Asians.
Language‐Ethnicity Group | LOS,* Mean Days (SD) | Cost, Mean $ (SD) | 30‐Day Readmission, n (%) | 30‐Day Mortality, n (%) |
---|---|---|---|---|
English speakers (all) | 4.7 (4.5) | 10,035 (15,041) | 648 (11.9) | 613 (10.4) |
White | 4.9 (5.1) | 10,530 (15,894) | 322 (11.4) | 377 (12.3) |
African American | 4.5 (4.8) | 9107 (13,314) | 227 (17.5) | 91 (6.7) |
Asian | 4.3 (4.5) | 9933 (15,607) | 43 (8.8) | 67 (12.3) |
Latino | 4.6 (4.8) | 9823 (14,113) | 7 (2.5) | 18 (6.0) |
Other | 4.5 (4.8) | 9662 (14,016) | 49 (8.5) | 60 (9.7) |
Non‐English speakers (all) | 4.5 (4.5) | 9515 (13,213) | 117 (11.0) | 147 (12.8) |
Chinese speakers | 4.5 (4.6) | 9505 (12,841) | 69 (12.8) | 85 (14.6) |
Spanish speakers | 4.5 (4.5) | 9115 (13,846) | 31 (12.0) | 28 (10.3) |
Russian speakers | 4.7 (4.2) | 9846 (13,360) | 17 (6.4) | 34 (11.7) |
We further investigated differences among English speakers to better understand the very high rate of readmission for African Americans and the very low rate for English‐speaking Latinos. African Americans were on average younger than other English speakers (55 ± 19 years vs. 60 ± 21 years; P < 0.001), but they had higher comorbidity scores than other English speakers (1.05 ± 1.40 vs. 0.94 ± 1.31; P = 0.008), and were more likely to be admitted for non‐malignant blood disorders (eg, sickle cell disease), endocrine disorders (eg, diabetes mellitus), and circulatory disorders (eg, stroke). English‐speaking Latinos were also younger than other English speakers (53 ± 21 years vs. 59 ± 20 years; P < 0.001), but in contrast they trended toward lower comorbidity scores (0.87 ± 1.23 vs. 0.97 ± 1.33; P = 0.2), and were more likely to be admitted for gastrointestinal and musculoskeletal disorders, and less likely to be admitted for malignancy and endocrine disorders.
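For orientation, the unadjusted readmission percentages in Table 2 can be converted to crude odds ratios relative to all English speakers. The Python sketch below uses only the published percentages; the function names are hypothetical, and because the inputs are rounded the results are approximate.

```python
def odds(proportion: float) -> float:
    """Convert a proportion (strictly between 0 and 1) to odds."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must be strictly between 0 and 1")
    return proportion / (1.0 - proportion)


def crude_odds_ratio(p_group: float, p_reference: float) -> float:
    """Unadjusted odds ratio of a group relative to a reference group."""
    return odds(p_group) / odds(p_reference)


# 30-day readmission percentages as printed in Table 2.
READMISSION_PCT = {
    "English speakers (all)": 11.9,
    "Chinese speakers": 12.8,
    "Spanish speakers": 12.0,
    "Russian speakers": 6.4,
}

reference = READMISSION_PCT["English speakers (all)"] / 100.0
crude = {
    group: crude_odds_ratio(pct / 100.0, reference)
    for group, pct in READMISSION_PCT.items()
    if group != "English speakers (all)"
}

if __name__ == "__main__":
    for group, value in crude.items():
        print(f"{group}: crude OR = {value:.2f}")
```

With these inputs the crude ratios are approximately 1.09 (Chinese), 1.01 (Spanish), and 0.51 (Russian). They are unadjusted and need not match estimates from models that account for age, insurance, comorbidity, and other covariates.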
Multivariate Analyses: Association of Aggregated and Disaggregated Language‐Ethnicity Groups With Hospital Outcomes (Table 3)
In multivariate models examining aggregated language‐ethnicity groups, non‐English speakers had a trend toward higher odds of readmission at 30 days post‐discharge than the English‐speaking group (odds ratio [OR], 1.3; 95% confidence interval [CI], 1.0‐1.7). There were no significant differences for LOS, cost, or 30‐day mortality. Compared to English speakers, Chinese and Spanish speakers had 70% and 50% higher adjusted odds of readmission at 30 days post‐discharge, respectively, while Russian speakers' odds of readmission were not increased. Additionally, Chinese speakers had 7% shorter LOS than English speakers. There were no significant differences among any of the language‐ethnicity groups for 30‐day mortality. The increased odds of readmission for Chinese and Spanish speakers compared to English speakers were robust to reinclusion of the admissions with the top 1% LOS and top 1% cost.
Language Categorization | LOS, % Difference (95% CI) | Total Cost, % Difference (95% CI) | 30‐Day Readmission,* OR (95% CI) | Mortality, OR (95% CI) |
---|---|---|---|---|
All English speakers | Reference | Reference | Reference | Reference |
Non‐English speakers | −3.1 (−8.7 to 3.1) | −2.5 (−8.3 to 2.1) | 1.3 (1.0 to 1.7) | 0.9 (0.7 to 1.2) |
All English speakers | Reference | Reference | Reference | Reference |
Chinese speakers | −7.2 (−13.9 to 0) | −5.3 (−12.2 to 2.1) | 1.7 (1.2 to 2.3) | 1.0 (0.8 to 1.4) |
Spanish speakers | −3.0 (−12.6 to 7.6) | −3.0 (−12.7 to 7.7) | 1.5 (1.0 to 2.3) | 0.9 (0.6 to 1.5) |
Russian speakers | 1.5 (−8.3 to 12.2) | 0.9 (−8.9 to 11.8) | 0.8 (0.5 to 1.4) | 0.8 (0.5 to 1.2) |
Multivariate Analyses: Association of Language for Asians and Latinos, and of Ethnicity for English speakers, With Hospital Outcomes (Table 4)
Both Chinese and Spanish speakers had significantly higher odds of 30‐day readmission than their English‐speaking Asian and Latino counterparts. There were no significant differences in LOS, cost, or 30‐day mortality in this within‐ethnicity analysis. Among English speakers, admissions for patients with Asian ethnicity were 15% shorter and resulted in 9% lower costs than for Whites. While LOS and cost were similar for English‐speaking Latino and White admissions, English‐speaking Latinos had markedly lower odds of 30‐day readmission than their White counterparts. African Americans, in contrast, had 6% shorter LOS, 40% higher odds of readmission, and 30% lower odds of mortality at 30 days than English‐speaking Whites.
Language‐Ethnicity Comparisons | LOS, % Difference (95% CI) | Total Cost, % Difference (95% CI) | 30‐Day Readmission,* OR (95% CI) | Mortality, OR (95% CI) |
---|---|---|---|---|
English‐speaking Asians | Reference | Reference | Reference | Reference |
Chinese speakers | 2.2 (−7.4 to 12.7) | 0.3 (−9.2 to 10.7) | 1.5 (1.0 to 2.3) | 0.8 (0.6 to 1.2) |
English‐speaking Latinos | Reference | Reference | Reference | Reference |
Spanish speakers | −4.5 (−16.8 to 9.5) | −1.2 (−14.0 to 13.5) | 5.7 (2.4 to 13.2) | 1.2 (0.6 to 2.4) |
English‐White | Reference | Reference | Reference | Reference |
English‐African American | −6.2 (−11.3 to −0.9) | −4.4 (−9.6 to 1.1) | 1.4 (1.1 to 1.7) | 0.6 (0.5 to 0.8) |
English‐Asian | −14.6 (−20.9 to −7.9) | −8.6 (−15.4 to −1.4) | 0.8 (0.5 to 1.0) | 1.0 (0.7 to 1.4) |
English‐Latino | −4.5 (−13.5 to 5.4) | −5.0 (−14.0 to 5.0) | 0.2 (0.1 to 0.4) | 0.6 (0.4 to 1.0) |
Conclusion/Discussion
Our results indicate that language barriers may contribute to higher readmission rates for non‐English speakers, but that they have less impact on care efficiency or mortality. This finding of an association between language and readmission, without a similar association with efficiency, suggests a potentially communication‐critical step in care.20, 21 Patients with language barriers are more likely to experience adverse events, and those events are often caused by errors in communication.12 It is conceivable that higher readmission risk for Chinese and Spanish speakers in our study was, at least in part, due to gaps in communication that are present in all patient groups, and exacerbated by the presence of a language barrier. This barrier is likely present during hospitalization but magnified at discharge, limiting caregivers' ability to understand patients' needs for home care, while simultaneously limiting patients' understanding of the discharge plan. After discharge, it is also possible that non‐English speakers are less able to communicate their needs as they arise or, more subtly, feel less supported by a primarily English‐speaking healthcare system. As in other clinical arenas,22 it is quite possible that increased access to professional interpreters in the hospital setting, and particularly at the time of discharge, would enhance communication and outcomes for LEP patients. Our interpreter services data showed that the patients in our study had quite limited access to staff professional interpreters.
Our findings differ somewhat from those of John‐Baptiste et al.,11 who found that language barriers contributed to increased LOS for patients with cardiac and major surgical diagnoses. Our study's findings are akin to recent research suggesting that being a monolingual Spanish speaker or receiving interpreter services may not significantly impact LOS or cost of hospitalization,23 and that LOS and in‐hospital mortality do not differ for non‐English speakers and English speakers after acute myocardial infarction.24 These studies, along with our results, suggest that care efficiency in the hospital may be driven much more by clinical acuity (eg, the need to respond rapidly to urgent clinical signs such as hypotension, fever, and respiratory distress) than by adequacy of communication. For example, elderly LEP patients may be even more likely than English speakers to have vigilant family members at the bedside throughout their hospitalization due to their need for communication assistance; these family members can quickly alert hospital staff to concerning changes in the patient's condition.
Our results also suggest the possibility that language and ethnicity are not monolithic concepts, and that even within language and ethnic groups there are potential differences in care pattern. For example, not speaking English may be a surrogate marker for unmeasured factors such as social supports and access to care. Language is intimately associated with culture; it remains plausible that cultural differences between highly acculturated and less acculturated members of a given ethno‐cultural group may have contributed to our observed differences in readmission rates. Differences in culture and associated factors, such as social support or use of multiple hospital systems, may account for lack of higher readmission risk in Russian speakers, while Chinese and Spanish speakers had higher readmission risk.
In addition, our finding that English‐speaking Latinos had lower readmission risk than any other group may be more consistent with their clinical characteristics (eg, younger age, fewer comorbidities) than with cultural factors. Our finding that African American patients had the highest readmission risk in our hospital was both surprising and concerning. Some of this increased risk may be explained by clinical characteristics, such as higher comorbidities and higher rates of diagnoses leading to frequent admissions (eg, sickle cell disease); however, the reasons for this disparity deserve further investigation.
Our study has limitations. First, our data are administrative, and lack information about patients' educational attainment, social support, acculturation, utilization of other hospital systems, and usual source of care. Despite this, we were able to account for many significant covariates that might contribute to readmission rates, including age, insurance status, gender, comorbidities, and admission to the intensive care unit.25–28 Second, our information about patients' English language proficiency is limited. While direct assessments of English proficiency are more accurate ways to determine a patient's ability to communicate with health care providers in English,29 our language validation work conducted in preparation for this study suggests that most of our patients recorded as having a non‐English primary language (87%) also have a low score on a language acculturation scale.
Third, only 14% of our non‐English‐speaking subjects utilized professional staff interpreters, and we had no information on the use of professional telephonic interpreters or ad hoc interpreters (family members, non‐interpreter staff members) and their impact on our results. It is well documented that ad hoc interpreters are used frequently in healthcare, particularly in the hospital setting, and thus we can assume this to be true in our study.30, 31 As noted above, it is likely that the advocacy of family members and friends at the bedside helped to minimize potential differences in care efficiency for patients with language barriers. Finally, our study was performed at a single university‐based hospital and may not produce results that are applicable to other care settings.
Our findings point to several avenues for future research on language barriers and hospitalized patients. First, the field would benefit from an examination of the impact of easy access to professional interpreters during hospitalization on outcomes of hospital care, in particular on readmission rates. Second, there is a need for development and assessment of best practices for creating a culture of professional interpreter utilization in the hospital among physicians and nursing staff. Third, investigation of the role of caregiver presence in the hospital room, and how this might differ by patient culture, age, and language ability, may further elucidate some of the differences across language groups observed in our study. Lastly, a more granular investigation of clinician‐patient communication and the importance of interpersonal processes of care on both patient satisfaction and understanding of and adherence to discharge instructions could lead to the development of detailed interventions to enhance this communication and these outcomes, as it has done for communication‐sensitive outcomes in the outpatient arena.32–34
In summary, our study suggests that higher risk for readmission can be added to the unfortunate list of outcomes that are worsened by language barriers, pointing to transition from the hospital as a potentially communication‐critical step in care that may be amenable to intervention. Our findings also suggest that this risk can vary even between groups of patients who do not speak English primarily. Whether and to what degree language and communication barriers alone during hospitalization (including access to professional interpreters and patient‐centered communication), or differences in caregiver social support both during and after hospitalization as well as access to care post‐hospitalization, contribute to these findings is a worthy subject of future research.
Acknowledgements
The authors acknowledge Dr. Eliseo J. Pérez‐Stable for his mentorship on this project.
- Language Use and English‐Speaking Ability: 2000. Available at: http://www.census.gov/prod/2003pubs/c2kbr‐29.pdf. Accessed January 2010.
- U.S. Department of Health and Human Services. 2006 National Healthcare Disparities Report. AHRQ Publication No. 070012; 2006.
- Disparities in health care by race, ethnicity, and language among the insured: findings from a national sample. Med Care. 2002;40(1):52–59.
- The effect of physician‐patient communication on mammography utilization by different ethnic groups. Med Care. 1991;29(11):1065–1082.
- Language of interview: relevance for research of Southwest Hispanics. Am J Pub Health. 1991;81(11):1399–1404.
- Is language a barrier to the use of preventive services? J Gen Intern Med. 1997;12(8):472–477.
- Impact of language barriers on patient satisfaction in an emergency department. J Gen Intern Med. 1999;14:82–87.
- Patient comprehension of doctor‐patient communication on discharge from the emergency department. J Emerg Med. 1997;15(1):1–7.
- Drug complications in outpatients. J Gen Intern Med. 2000;15:149–154.
- Language concordance as a determinant of patient compliance and emergency room use in patients with asthma. Med Care. 1988;26(12):1119–1128.
- The effect of English language proficiency on length of stay and in‐hospital mortality. J Gen Intern Med. 2004;19:221–228.
- Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care. 2007;19(2):60–67.
- Factors associated with discussion of care plans and code status at the time of hospital admission: results from the Multicenter Hospitalist Study. J Hosp Med. 2008;3(6):437–445.
- Quality of care for decompensated heart failure: comparable performance between academic hospitalists and non‐hospitalists. J Gen Intern Med. 2008;23(9):1399–1406.
- A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–383.
- AHRQ. Healthcare Cost and Utilization Project: Tools 132(3):191–200.
- Free knot splines for logistic models and threshold selection. Comput Methods Programs Biomed. 2005;77(1):1–9.
- Statistical methods in epidemiology: a comparison of statistical methods to analyze dose‐response and trend analysis in epidemiologic studies. J Clin Epidemiol. 1998;51(12):1223–1233.
- The Care Transitions Project. Health Care Policy and Research, Practitioner Tools. Available at: http://www.caretransitions.org/practitioner_tools.asp.
- AHRQ. Improving safety at the point of care. Available at: http://www.ahrq.gov/qual/pips. Accessed January 2010.
- Do professional interpreters improve clinical care for patients with limited English proficiency? A systematic review of the literature. Health Serv Res. 2007;42(2):727–754.
- The impact of an enhanced interpreter service intervention on hospital costs and patient satisfaction. J Gen Intern Med. 2007;22 Suppl 2:306–311.
- Acute myocardial infarction length of stay and hospital mortality are not associated with language preference. J Gen Intern Med. 2008;23(2):190–194.
- A systematic literature review of factors affecting outcome in older medical patients admitted to hospital. Age Ageing. 2004;33(2):110–115.
- Risk factors and prognostic predictors of unexpected intensive care unit admission within 3 days after ED discharge. Am J Emerg Med. 2007;25(9):1009–1014.
- Effect of gender, ethnicity, pulmonary disease, and symptom stability on rehospitalization in patients with heart failure. Am J Cardiol. 2007;100(7):1139–1144.
- Bouncing back: patterns and predictors of complicated transitions 30 days after hospitalization for acute ischemic stroke. J Am Geriatr Soc. 2007;55(3):365–373.
- Identification of limited English proficient patients in clinical care. J Gen Intern Med. 2008;23(10):1555–1560.
- Hospital language services for patients with limited English proficiency: results from a national survey. Health Research October 2006.
- Hospitals, Language, and Culture: a Snapshot of the Nation. The Joint Commission and The California Endowment; 2007.
- A randomized controlled trial of interventions to enhance patient‐physician partnership, patient adherence and high blood pressure control among ethnic minorities and poor persons: study protocol NCT00123045. Implement Sci. 2009;4:7.
- Interpersonal processes of care and patient satisfaction: do associations differ by race, ethnicity, and language? Health Serv Res. 2009;44(4):1326–1344.
- Understanding concordance in patient‐physician relationships: personal and ethnic dimensions of shared identity. Ann Fam Med. 2008;6(3):198–205.
- Language Use and English‐Speaking Ability: 2000. Available at: http://www.census.gov/prod/2003pubs/c2kbr-29.pdf. Accessed January 2010.
- U.S. Department of Health and Human Services. 2006 National Healthcare Disparities Report. AHRQ Publication No. 070012; 2006.
- Disparities in health care by race, ethnicity, and language among the insured: findings from a national sample. Med Care. 2002;40(1):52–59.
- The effect of physician‐patient communication on mammography utilization by different ethnic groups. Med Care. 1991;29(11):1065–1082.
- Language of interview: relevance for research of Southwest Hispanics. Am J Pub Health. 1991;81(11):1399–1404.
- Is language a barrier to the use of preventive services? J Gen Intern Med. 1997;12(8):472–477.
- Impact of language barriers on patient satisfaction in an emergency department. J Gen Intern Med. 1999;14:82–87.
- Patient comprehension of doctor‐patient communication on discharge from the emergency department. J Emerg Med. 1997;15(1):1–7.
- Drug complications in outpatients. J Gen Intern Med. 2000;15:149–154.
- Language concordance as a determinant of patient compliance and emergency room use in patients with asthma. Med Care. 1988;26(12):1119–1128.
- The effect of English language proficiency on length of stay and in‐hospital mortality. J Gen Intern Med. 2004;19:221–228.
- Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care. 2007;19(2):60–67.
- Factors associated with discussion of care plans and code status at the time of hospital admission: results from the Multicenter Hospitalist Study. J Hosp Med. 2008;3(6):437–445.
- Quality of care for decompensated heart failure: comparable performance between academic hospitalists and non‐hospitalists. J Gen Intern Med. 2008;23(9):1399–1406.
- A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–383.
- AHRQ. Healthcare Cost and Utilization Project: Tools 132(3):191–200.
- Free knot splines for logistic models and threshold selection. Comput Methods Programs Biomed. 2005;77(1):1–9.
- Statistical methods in epidemiology: a comparison of statistical methods to analyze dose‐response and trend analysis in epidemiologic studies. J Clin Epidemiol. 1998;51(12):1223–1233.
- The Care Transitions Project. Health Care Policy and Research, Practitioner Tools. Available at: http://www.caretransitions.org/practitioner_tools.asp.
- AHRQ. Improving safety at the point of care. Available at: http://www.ahrq.gov/qual/pips. Accessed January 2010.
- Do professional interpreters improve clinical care for patients with limited English proficiency? A systematic review of the literature. Health Serv Res. 2007;42(2):727–754.
Copyright © 2010 Society of Hospital Medicine