**ORIGINAL RESEARCH **INVESTIGACIÓN ORIGINAL

**A common error in the ecological regression of cancer incidence on the deprivation index**

**Un error frecuente en los modelos de regresión ecológica de la incidencia de cáncer con respecto al índice de carencia **

**Gemma Renart ^{I}; Marc Saez^{I}; Carme Saurina^{I}; Rafael Marcos-Gragera^{I}; Ricardo Ocaña-Riola^{II}; Carmen Martos^{III}; Maria A. Barceló^{I}; Federico Arribas^{IV}; Tomás Alcalá^{IV}**

^{I}Research Group on Statistics, Econometrics and Health (GRECS), Universitat de Girona, Girona, Catalonia, Spain. Send correspondence to Gemma Renart, gemma.renart@udg.edu ^{II}Escuela Andaluza de Salud Pública (EASP), Granada, Spain. ^{III}Centro Superior de Investigación en Salud Pública (CSISP), Valencia, Spain ^{IV}Instituto Aragonés de Ciencias de la Salud (IACS), Zaragoza, Spain

**ABSTRACT**

**OBJECTIVE:** To determine if introducing age as another explanatory variable in an ecological regression model relating crude rates of cancer incidence and a deprivation index provides better results than the usual practice of using the standard incidence ratio (SIR) as the response variable, introducing the non-standardized index, and not including age in the model. **METHODS:** Relative risks associated with the deprivation index for some locations of cancer in Spain's Girona Health Region were estimated using two different models. Model 1 estimated relative risks with the indirect method, using the SIR as the response variable. Model 2 estimated relative risks using age as an explanatory variable and crude cancer rates as the response variable. Two scenarios and two sub-scenarios were simulated to test the properties of the estimators and the goodness of fit of the two models. **RESULTS:** The results obtained from Model 2's estimates were slightly better (less biased) than those from Model 1. The results of the simulation showed that in all cases (two scenarios and two sub-scenarios) Model 2 had a better fit than Model 1. The probability density for the parameter of interest provided evidence that Model 1 leads to biased estimates. **CONCLUSIONS:** When attempting to explain the relative risk of incidence of cancer using ecological models that control geographic variability, introducing age as another explanatory variable and crude rates as a response variable provides less biased results.

**Key words:** Incidence; spatial analysis; risk; Spain.

**RESUMEN*** *

**OBJETIVO:** Determinar si la introducción de la edad como otra variable independiente en un modelo de regresión ecológica que relaciona las tasas brutas de incidencia de cáncer con un índice de carencia, ofrece mejores resultados que la práctica corriente del uso de la razón de incidencia normalizada como criterio de valoración, con introducción del índice sin normalización y sin incluir la edad en el modelo. **MÉTODOS:**Se calcularon los riesgos relativos asociados con el índice de carencia de algunos tipos de cáncer en la Región Sanitaria de Girona en España, mediante dos modelos diferentes. En el modelo 1 se calcularon los riesgos relativos con el método indirecto, usando la razón de incidencia normalizada como criterio de valoración. En el modelo 2 se calcularon los riesgos relativos introduciendo la edad como una variable independiente y las tasas brutas de cáncer como criterio de valoración. Se simularon dos hipótesis y dos subhipótesis con el fin de verificar las propiedades de los estimadores y la bondad del ajuste de ambos modelos. **RESULTADOS:** Los resultados obtenidos a partir de las estimaciones con el modelo 2 fueron un poco mejores (menos sesgados) que los resultados obtenidos con el modelo 1. Los resultados de la simulación indicaron que en todos los casos (las dos hipótesis y las dos subhipótesis) el modelo 2 exhibió un mejor ajuste que el modelo 1. La función de densidad del parámetro de interés puso en evidencia que el modelo 1 da lugar a estimaciones sesgadas. **CONCLUSIONES: **Cuando se intenta explicar el riesgo relativo de incidencia de cáncer mediante modelos de regresión ecológica que tienen en cuenta la variabilidad geográfica, se obtienen resultados menos sesgados cuando se introduce la edad como una de las variables independientes y se utilizan las tasas brutas de incidencia como criterio de valoración.

**Palabras clave:** Incidencia; análisis espacial; riesgo; España.

In epidemiology, death or incidence ratios, standardized by factors known to confound the relationships of interest, are used to compare incidence and mortality in different geographic areas. The indirect method of standardization compares cases observed in a particular area with those one would expect to find within a certain reference population if the risks were the same for each age group. The standardizing factor is usually age distribution. The ratio of observed cases to expected cases, known as the standardized mortality ratio (SMR) or standard incidence ratio (SIR), is essentially a relative risk estimator for the area (i.e., an estimator of the risk of illness in an area in relation to the reference population). However, it has been demonstrated that problems arise with the use of age-adjusted rates in ecological regression models (1). Rosenbaum and Rubin (2) confirm that using standardized rates as a response variable in regression models leads to biased results because only the response values and not the predictor values are adjusted by the same confounding factor, usually distribution by age, resulting in what is known as the "mutual standardization problem." Anselin (3, 4) confirms that rates derived from both direct and indirect standardization are calculated assuming a homogeneous relationship between risk and distribution by age in space (and time). The use of a non-standardized predictor variable implicitly assumes that the effect is constant across all strata of the confounding variable. Grisotto et al. (5), in line with Rosenbaum and Rubin (2), show that less biased estimators would be obtained by standardizing both the response and the predictor using the same variable, or by using the crude rates as response and including age in the regression models as one more explanatory variable.

The objective of this research was to show that introducing age as another explanatory variable in an ecological regression model relating crude rates of cancer incidence and a deprivation index provides better results than the usual practice of using the SIR as a response, introducing the non-standardized index, and not including age in the model.

**MATERIALS AND METHODS**

The current study was undertaken within the framework of the MEDEA project.^{5} One of the project objectives was to estimate the relative risks associated with a deprivation index for some cancer locations in the Girona Health Region (GHR) in the province of Girona in northern Catalonia, an autonomous region in Spain, and to ascertain if the index could explain part of the spatial variability found in some of these locations (6, 7).

The analysis was performed on data provided by the Girona Cancer Registry (8, 9) for 1) incident cases of lung, tracheal, and bronchial cancer (codes C33–C34 in the International Classification of Diseases, 10th revision [ICD-10]); melanoma skin cancer (ICD-10 code C43); and non-Hodgkin's lymphoma (ICD-10 codes C82–C85 and C96) for men and women; 2) incident cases of larynx cancer (ICD-10 code C32) in men; and 3) breast cancer (ICD-10 code C50) in women.

All residents of the GHR (670 096 people, including 339 839 men and 330 257 women, according to the 2006 municipal population register) were included in the study population. The study took place from the years 1993 to 2006 (both inclusive), and the geographic area of analysis was the census tract.

The SIRs were calculated using the number of observed cases of the neoplasia of interest in the census tract *i* (with *i* = 1–500) during the period 1993–2006, and the number of expected cases of those diseases for the same tract. The reference population for the SIR was assumed to be the estimated population for each census tract in the GHR.

Although widely used, SIRs do have some limitations (7, 10, 11). These problems can be solved by smoothing. In this study, the Besag, York, and Mollié (BYM) model (12, 13) was used for Model 1, within a full Bayesian perspective. Two random effects were introduced into the model (spatial dependency and [nonspatial] unstructured variability) to gather all unexplained variability, and the parameters were assigned a probability distribution (prior distribution).

The method used to estimate the model parameters is explained in Annex 1.

To test the properties of the estimators and the goodness of fit of the two models, two scenarios and two sub-scenarios were simulated (Annex 2).

**Model 1**

Stratifying by sex, the BYM model (Model 1) was specified as a generalized linear mixed model (GLMM) with a Poisson response variable

where *O _{i}* is the number of cases observed in each census tract

*i*; m

_{i}*is the mean for the Poisson distribution (*

*E*(

*O*));

_{i}*E*is the number of expected cases in each tract

_{i}*i*;

*S*is the random effect that captures the spatial variability; and

_{i}*u*

*is the random effect that captured the unstructured variability.*

_{i}A deprivation index was introduced into the model as an explanatory variable to capture the specific socioeconomic contextual effects of geographic location on health. The index was constructed in accordance with the protocol established for the MEDEA project (14). *Index _{ji}* in Model 1 denotes dummy variables relative to the quintile

*j*for each census tract

*i*of the deprivation index, and

*b*

*is the associated parameter. The first quintile was used as the reference.*

_{j}The random effects *u** _{i}* in Model 1 were independent and normally distributed, with zero mean and constant variance. For the random effect that captures the spatial variability, a conditional autoregressive model (CAR)

*S*was used (15–17). The CAR model assumed dependency existed between neighboring areas, with "neighboring" defined as "adjacent" (18).

_{i}**Model 2**

For Model 2, crude rates were used as the response variable, incorporating the age structure of the population as an explanatory variable. The population was divided into five age groups (< 1 year, 1–14 years, 15–44 years, 45–64 years, and 65 years or more) corresponding to the five stages of health care (neonatal, child, young adult, adult, and old age). As cancer incidence is low for those under 14 years old, the first three age groups were combined into one cohort, for a total of three age groups (£ 44 years, 45–64 years, and 65 years or older). This resulted in alternative specifications for Model 2

where *Pob _{i}* denoted the population of census tract

*i*,

*P*4564

*denotes the percentage of the population between ages 45 and 64 years (both inclusive), and*

_{i }*P*65

*M*denotes the population 65 years or older. The first age group (£ 44 years) was not included in the model to avoid problems of colinearity with the two other age groups.

_{i}

**RESULTS**

The results obtained from the model estimators are shown in Table 1. Model 1 used the SIR as the response variable without standardizing the deprivation index. Model 2 used crude cancer rates as the response variable, and a non-standardized deprivation index, but included the age structure of the population.

Model goodness of fit, measured using the deviance information criterion (DIC), was very similar or slightly better in Model 2. However, the standard errors of the random effect that captures the spatial variability and the standard deviations of the random effect that captures the nonstructured variability were lower when using Model 2 for all types of cancers except melanoma skin cancer in men and women, and tracheal, bronchial, and lung cancer in women (where they were higher), and non-Hodgkin's lymphoma (where they were almost equal).

The most significant differences were found in the statistical significance of the relative risk associated with the quintiles of the deprivation index. Thus, while the results of Model 1 indicate an association between the quintiles of the deprivation index and male cases of tracheal, bronchial, and lung cancer; larynx cancer; and non-Hodgkin's lymphoma, the statistical significances disappeared completely when using Model 2. Moreover, the relative risks of Model 2 were much lower (albeit without statistical significance) than those obtained in Model 1.

The only case where both methods of estimation provided a statistically significant association with the deprivation index (albeit only in the fifth quintile) was for incidence of breast cancer. However, the relative risk obtained in Model 2 was lower than that obtained in Model 1.

In the above-mentioned three cases where Model 1 revealed a significant association between the deprivation index and incidence of cancer (tracheal, bronchial, and lung; larynx; and non-Hodgkin's lymphoma—all in men), being in either of the two age groups included in Model 2 (45–64 years, and 65 years or older) appeared to be a significant predictor. Statistical significance was the same in all three cases: negative for the group aged 45–64 years, and positive for the group aged 65 years or older.

**Simulation**

Even in the most favorable scenario (Scenario 1), the results for coverage (Table 2) and probability density for the parameter of interest (Figure 1) provide evidence that Model 1, which used the standard practice in ecological spatial regression (considering SIR as the response variable and a non-standardized deprivation index as an explanatory variable), provided biased estimates.

In all scenarios, Model 2 had a better fit than Model 1 in terms of lower DIC, effective number of parameters, and standard deviations of the random effects.

**DISCUSSION**

Technological progress and the availability of geographic information have allowed for the application of spatial epidemiology in the area of public health to identify areas with a higher risk of given health problems, using the SMR or the SIR, where incidence serves as an indicator for comparing risks in the different geographic areas studied. Various indices have been introduced in the models to explain geographic variability (12, 13, 19).

The method of indirect standardization of rates is frequently used to study the space–time distribution of incidence and mortality in small areas. However, various limitations exist that suggest this method should not be used in studies with geographic correlation, time series analysis of morbidity, or risk comparison between areas (20).

The SIR, like the SMR, is essentially a quotient of rates, adjusted using the direct method, where the numerator is the weighted mean for the specific study areas, or target population, and the denominator is the weighted mean for the rates specific to the external region, or non-target population. In this calculation, the standard or reference population corresponds to the area of study itself (the target population). The weighted values used to discern the weighted mean of specific rates derive from said population, so the reference population is never the external area or non-target population, as is often asserted in scientific publications. For this reason, SIRs for different geographic regions always have different reference populations, and the confusion bias resulting from the different population pyramids is always present when comparisons are attempted. Therefore, it is incorrect to claim that geographic areas with elevated SIRs show a higher incidence than areas with low SIRs. If the areas are not comparable, it is not possible to rank their values as synonyms of incidence frequency adjusted according to age and sex groupings.

In addition, calculating the SIR would only make sense when the specific rates of the target and non-target areas are proportional. This condition is especially difficult to verify when the indirect method is applied, because the specific target area (20) rates are not used. For this reason, studying the geographic distribution of incidence using the SIR or SMR may dilute important aspects of each stratum of the population and result in biased results (21).

These questions about the use of SIRs also affect the analysis of ecological associations. The majority of epidemiological studies model the logarithm of the mean number of cases observed according to the Napierian logarithm of the number of cases expected, which acts as an offset, plus a linear combination of explanatory variables. The parameters of the model are estimated through frequentist or Bayesian methods, and the exponential value of the linear combination of explanatory variables is risk, or the adjusted SIR. However, because the order of these values lacks epidemiological sense, it is inappropriate to equate a percentage increase or decrease in the SIR between two consecutive values of an explanatory variable to an increase or decrease in incidence adjusted by age or sex, because the compared areas have a different population pyramid (20).

Two scenarios with one sub-scenario each were simulated using two different models. Model 1 reproduced the standard practice, with the SIR as a response variable and the non-standardized deprivation index as an explanatory variable. In Model 2, crude rates were the response variable, and the non-standardized deprivation index (the explanatory variable) was adjusted by age.

The results of the simulation provide evidence that, even supposing that the effect of the explanatory variable (the deprivation index, in this case) is constant between strata of the confounding variable (age, in this study), which would theoretically be the most favorable scenario for standard approximation, the estimators used in Model 1 prove to be biased. In addition, goodness of fit, measured as both DIC and the size of the standard deviations of the random effects, proved to be much better in Model 2, as confirmed by the cancer incidence estimates from the various study sites.

**Limitations**

This study had some limitations. First, the duration of the study period was 14 years, during which time changes may have occurred in the geographic distribution of cancer incidence, the population pyramid, and the indicators that constitute the deprivation index. This could have produced bias in the results, which were obtained using models that do not consider dynamic behaviors because annually updated sources of census information are not available. Second, the deprivation index may not truly measure deprivation among women. However, even if it does not, the study results indicate that the variance in statistical significance across age groups and index quintiles seems to function differently among women versus men. Third, the fact that the statistical significance of the breast cancer index does not disappear may be explained by the age cutoff points (45 and 64 years respectively) for incidence of two types of breast cancer (onset before age 45–50 and onset after age 45–50) and the fact that incidence among older age groups is better explained by the deprivation index. None of these limitations are likely to have rendered the main results of this work invalid, however, because the evidence suggests that both geographic variability of incidence and the deprivation index depend on age.

**Conclusion**

This study found that when attempting to explain the relative risk of incidence of cancer using ecological models that control geographic variability (meaning both spatial and nonstructural extra variability), crude rates should be used as a response variable and age included as another explanatory variable. The adjustments obtained would appear to be in line with Rosenbaum and Rubin (2), Anselin (3, 4), and Grisotto et al. (5), suggesting that the parameter estimators using SIRs as a response variable without standardizing the deprivation index (the standard focus) are biased and thus providing more evidence of the "mutual standardization problem" (biased results stemming from the use of standardized rates as a response variable in regression models). Introducing age as another explanatory variable in an ecological regression model relating crude rates of cancer incidence and the deprivation index provides better results because there is no reason to assume that the effect of each deprivation index quintile is constant across the different strata of the age variable (5). The current study results also suggest that geographic variability of incidence and the deprivation index depend on age. Therefore, the standard (SIR) focus should probably not be used, as its effect would not be independent of the variable associated with risk.

**Acknowledgments**. The authors appreciate the comments of the attendees at the XXVIII Reunión Anual de la Sociedad Española de Epidemiología (Valencia, Spain, October 27–29, 2010), where a preliminary version of this work was presented.

**Funding**. This work was partially supported by the Health Research Fund (*Fondo de Investigación Sanitaria*, FIS) of the Spanish Ministry of Science and Innovation, through projects ETS-06/90553, ETS-07/90453, and FIS-08/0142; the Agency for Technology Evaluation and Medical Research (*Agencia de Evaluación de Tecnología e Investigación Médicas*, AATRM); the Catalan Health Service of the Catalan Government (*Generalitat de Catalunya*), through project AATRM-006/01/2006; and the Second Programme of Community action in the field of Health (2008–2013) of the European Commission's Directorate General for Health and Consumer Protection (DG-SANCO), through project A/101156.

**Conflicts of interest**. None. All authors disclose any financial and personal relationships with other people or organizations that could inappropriately influence and/or bias their work.

**REFERENCES**

1. Morgenstern H. Ecologic studies. In: Rothman KJ, Greenland S, Lash TI, editors. Modern epidemiology. 3rd ed. Philadelphia: Lippincott-Williams & Wilkins; 2008.

2. Rosenbaum P, Rubin D. Difficulties with regression analyses of age-adjusted rates. Biometrics. 1984;40:437–43.

3. Anselin L, Lozano N, Koschinsky J. Rate transformations and smoothing. Working paper. Urbana-Champaign, Illinois: Spatial Analysis Laboratory, Department of Geography, University of Illinois; 2006.

4. Anselin L. How (not) to lie with spatial statistics. Am J Prev Med. 2006;30(2 Suppl):S3–6.

5. Grisotto L, Catelan D, Accetta G, Biggeri A. Material deprivation as marker of health needs. Statistica. 2010;70(3):343–52.

6. Borrell C, Marí-Dell'olmo M, Serral G, Martínez-

Beneito M, Gotsens M; MEDEA members. Inequalities in mortality in small areas of eleven Spanish cities (the multicenter MEDEA project). Health Place. 2010;16(4):703–11.

7. Barceló MA, Saez M, Cano-Serral G, Martínez-

Beneito MA, Martínez JM, Borrell C, et al. Métodos para la suavización de indicadores de mortalidad: aplicación al análisis de desigualdades en mortalidad en ciudades del Estado español (Proyecto MEDEA). Gac Sanit. 2008;22(6):596–608.

8. Unitat d'Epidemiologia i Registre de Càncer de Girona (ES). Cancer in Girona 2003–2004 [in Catalan]. Girona: UERCG-Pla Director d'Oncologia; 2009. (CanGir No. 2, March 2009).

9. Curado MP, Edwards B, Shin HR, Storm H, Ferlay J, Heanue M, et al., editors. Cancer incidence in five continents, vol. IX. Lyon: International Agency for Research on Cancer; 2007. (Scientific Publication No. 160).

10. Clayton D, Kaldor J. Empirical Bayes estimates of age-standardized relative risks. Biometrics. 1987;43(3):671–81.

11. Lawson AB, Browne WJ, Vidal-Rodeiro CL. Disease mapping with WinBUGS and MLwiN.

Chichester, West Sussex: John Wiley & Sons; 2003.

12. Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Statist Math. 1991;43(1):1–59.

13. Mollié A. Bayesian mapping of disease. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov chain Monte Carlo in practice. London: Chapman & Hall; 1996. Pp. 359–79.

14. Domínguez-Berjón MF, Borrell C, Cano-Serral G, Esnaola S, Nolasco A, Pasarín MI, et al. Construcción de un índice de privación a partir de datos censales en grandes ciudades españolas (Proyecto MEDEA). Gac Sanit. 2008; 22(3):179–87.

15. Besag J. Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc B. 1974;36(2):192–236.

16. Clayton DG, Bernadinelli L, Montomoli C. Spatial correlation in ecological analysis. Int J Epidemiol. 1993;22(6):1193–202.

17. Kelsall J, Wakefield J. Modeling spatial variation in disease risk: a geostatistical approach. J Am Stat Assoc. 2002;97(459):692–701.

18. Saez M, Saurina C. Estadística y epidemiología espacial. Girona: Edicions a Petició; 2007.

19. Elliot P, Best NG. Geographical patterns of disease. In: Armitage P, Colton T, editors. Encyclopedia of biostatistics. 2nd ed. Chichester: John Wiley & Sons; 2005. Available from: http://eu.wiley.com/legacy/wileychi/eob/articles.html

20. Ocaña-Riola R. Common errors in disease mapping. Geospat Health. 2010;4(2):139–54.

21. Ocaña-Riola R, Mayoral-Cortés JM. Spatio-temporal trends of mortality in small areas of Southern Spain. BMC Public Health. 2010; 10:26.

Manuscript received on 21 June 2012.

Revised version accepted for publication on 16 July 2013.

**Appendix**