This article describes methodological aspects of defining the study population, the sampling plan, and the weighting and calibration of the effective sample of the Brazilian National Survey on Child Nutrition (ENANI-2019). This population-based household survey assessed breastfeeding and dietary intake, nutritional status by anthropometry, and micronutrient deficiencies by blood biomarkers in children under five years of age. The data were obtained with a probability sample, stratified by the country’s five major geographic regions and clustered by census enumeration areas (CEAs). The sample size was set at 15,000 households distributed in 1,500 CEAs, with 300 CEAs allocated to each of Brazil’s five major geographic regions and 10 eligible households per CEA, selected by inverse sampling. This design allowed estimating the population parameters required to reach the study’s objectives. The basic sampling design weights were calculated as the inverse of the households’ probabilities of inclusion in the study. Imputation was used to compensate for item non-response in the target variables, except for the blood biomarker data. Finally, calibration used population totals of children in 60 post-strata, defined by cross-classifying major geographic region, sex, and age. The final sample included 14,558 children residing in 12,524 households, distributed in 1,382 CEAs in the 26 states of Brazil and the Federal District. The data from the ENANI-2019 survey will support strategies for the promotion and implementation of public policies for children under five years of age.
Infant; Preschool Child; Statistical Models; Sampling Studies; Methods
The Ministry of Health funded the Brazilian National Survey on Child Nutrition (ENANI-2019) (call for projects CNPq/MS/SCTIE/DECIT/SAS/DAB/CGAN n. 11/2017). ENANI-2019 is structured in three domains: assessment of breastfeeding and dietary intake; anthropometric assessment of nutritional status; and assessment of micronutrient deficiencies in children under five years of age, by major geographic region, sex, and age group.
The data were obtained with a probability household sample survey, with geographic stratification and clustering by census enumeration areas (CEAs), using sampling methods such as those adopted by official statistical institutes in their surveys of human populations [1: Instituto Brasileiro de Geografia e Estatística. Projeções da população. Pesquisa Nacional por Amostra de Domicílios 2012. Notas metodológicas. Pesquisa básica. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2013]. This allowed ENANI-2019 to estimate, reproducibly and scientifically, the population parameters required to reach its objectives. The basic idea when sampling human populations is to reach them through households, which are grouped in CEAs, which are in turn grouped according to situation (urban versus rural) in subdistricts, districts, municipalities, and so on. The approach rests on the concepts of household and resident (the latter to ensure that individuals with more than one residential address are not more likely to enter the sample) and on the selection of areas.
ENANI-2019 provides a unique opportunity to elucidate the various aspects of the nutritional assessment of children and to support public health policies for this vulnerable age group. Thus, this manuscript aims to describe methodological aspects of defining the study population, the sampling plan, the sample weighting, and the effective sample of ENANI-2019.
The study population for ENANI-2019 was defined as the set of children under five years of age residing in permanent private households throughout Brazil with at least one child under five years of age on the date of the survey interview. Therefore, the study population did not include: (1) children residing in collective households (hotels, boarding houses, orphanages, shelters, detention centers, barracks, hospitals, etc.), improvised private households, and permanent private households without children; (2) indigenous children living in villages; (3) foreign children living in households where Portuguese was not spoken; and (4) children with conditions that prevented them from undergoing anthropometric measurement.
The Institutional Review Board of the Clementino Fraga Filho University Hospital of the Federal University of Rio de Janeiro (UFRJ) approved the study under number CAAE 89798718.7.0000.5257. Data were collected after the child’s parents or guardians signed two copies of the free and informed consent form. The methods used in the development of ENANI-2019 have been described in detail in specific publications [2: Alves-Santos NH, Castro IRR, Anjos LA, Lacerda EMA, Normando P, Freitas MB, et al. General methodological aspects in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; 37:e00300020] [3: Lacerda EMA, Boccolini CS, Alves-Santos NH, Castro IRR, Anjos LA, Crispim SP, et al. Methodological aspects of the assessment of dietary intake in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; 37:e00301420] [4: Anjos LA, Ferreira HS, Alves-Santos NH, Freitas MB, Boccolini CS, Lacerda EMA, et al. Methodological aspects of the anthropometric assessment in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; e00293320] [5: Castro IRR, Normando P, Alves-Santos NH, Bezerra FF, Citelli M, Pedrosa LFC, et al. Methodological aspects of the micronutrient assessment in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; 37:e00301120].
The sampling plan of ENANI-2019 used stratification and clustering, incorporating two or three selection stages. The population’s stratification for sampling purposes was guided by the study’s objectives and the definition of the five major geographic regions in Brazilian territory as target domains for publication of results.
The primary sampling units (PSUs) were the municipalities or the CEAs, and the elementary sampling units were always the households. In each selected household, all residents were enrolled, and the study’s target data were recorded for all the resident children under five years of age.
Strata were formed by allocating Brazilian municipalities (according to the territorial base used by the Brazilian Institute of Geography and Statistics, IBGE, in the population estimates for July 1, 2016) [6: Instituto Brasileiro de Geografia e Estatística. Estimativas da população residente nos municípios e para as Unidades da Federação brasileiros com data de referência em 1º de julho de 2016 (notas metodológicas). Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2016] to two blocks: (1) each of the state capitals plus the Federal District (27 strata) and each of the 20 municipalities with more than 500,000 inhabitants (20 strata) and (2) the other municipalities in each major geographic region (5 strata) (Table 1). Therefore, all the state capitals and large municipalities (> 500,000 inhabitants) defined as strata in block 1 were included in the sample with certainty; they are selection strata rather than primary sampling units.
Table 1. Projection of the Brazilian population under five years of age according to major geographic regions and sample selection strata. Brazilian National Survey on Child Nutrition (ENANI-2019).
The total population and the population of children under five years of age were estimated for July 1, 2016, for each of the 5,570 Brazilian municipalities using the linear trend method [7: Madeira JL, Simões CCS. Estimativas preliminares da população urbana e rural segundo as unidades da federação, de 1960/1980 por uma nova metodologia. Revista Brasileira de Estatística 1972; 33:3-11], the same method used by the IBGE to produce the population estimates that the Federal Accounts Court uses to determine each municipality’s share of the participatory fund for municipalities [6]. Table 1 shows the estimates obtained per selection stratum.
In the 47 strata formed by each of the municipalities included with certainty in the sample (block 1), the PSU was the CEA (IBGE), and the secondary sampling unit (SSU) was the eligible household (with children from the study population). In the other strata (block 2), the PSU was the municipality, the SSU was the CEA, and the tertiary sampling unit (TSU) was the eligible household.
Calculation of the sample size
Calculation of the sample size was guided by the project’s budget parameters, the blood sample collection logistics, and the experience with similar surveys conducted by the Society for the Development of Scientific Research (Science).
Considering the target domain (major geographic region), the minimum proportion was specified as Pmin = 2%. The relative margin of error of the estimate should be at most dR = 35%, with a confidence coefficient of (1-α) = 95%. According to Cochran [8: Cochran WG. Sampling techniques. 3rd Ed. New York: John Wiley & Sons; 1977] and assuming simple random sampling without replacement (SRS), the sample size needed to estimate proportions equal to or greater than Pmin with a relative error no greater than dR at confidence level 1-α is calculated by:
where zα/2 is the (1 - α/2) quantile of the standard normal distribution.
Since the sample design is complex (stratified and clustered), the design effect must be considered when calculating the sample size. Pessoa & Silva [9: Pessoa DG, Silva PLN. Análise de dados amostrais complexos. São Paulo: Associação Brasileira de Estatística; 1998] recommend multiplying the sample size obtained by Expression 1 by an estimate of the design effect (deff) for the key survey variable. Since there were no data on deff from previous household surveys on the topic, a deff of 1.95 was set for calculating the sample size. Although arbitrary, a deff greater than one is preferable to making no adjustment to the sample size for the expected effects of clustering under the adopted sampling design. Data from the study showed that the deff for the estimates of the proportion of children who did not receive breastmilk on the day before the interview, by sex and age group, varied from 2.3 to 5.7, while the proportions themselves varied from 12.8% (girls under six months of age) to 97.4% (girls four years old). Similar ranges for deff were observed for estimates by sex and age of the children’s average weight (2.9 to 7.3) and height (2.7 to 6.6). These results suggest that the value used in calculating the sample size was too small. For future sample size calculations for the same population, the data from this study can be used to estimate deff values for other key survey variables.
The sample size of households to be interviewed in each major geographic region was thus calculated by Expression 2:
Since there are five estimation domains, the total sample size was calculated at 14,990 (= 5 × 2,998) households.
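The arithmetic behind these figures can be checked with a short sketch. It assumes that Expression 1 takes the usual form for a relative margin of error under SRS with a negligible finite population correction, n = z²(1 - Pmin) / (dR² × Pmin), and that each intermediate result is rounded up to the next integer. Both points are assumptions made for illustration, as are the function names.

```python
import math
from statistics import NormalDist


def srs_size_relative_error(p_min, rel_error, confidence=0.95):
    """SRS size so that a proportion >= p_min has relative error <= rel_error.

    Assumed form: z^2 * (1 - p_min) / (rel_error^2 * p_min), ignoring the
    finite population correction.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * (1 - p_min) / (rel_error ** 2 * p_min)


def complex_design_size(p_min, rel_error, deff, confidence=0.95):
    """Inflate the SRS size by the design effect, rounding up at each step."""
    n_srs = math.ceil(srs_size_relative_error(p_min, rel_error, confidence))
    return math.ceil(n_srs * deff)


# Values stated in the text: Pmin = 2%, dR = 35%, deff = 1.95, five regions
per_region = complex_design_size(p_min=0.02, rel_error=0.35, deff=1.95)
total = 5 * per_region
print(per_region, total)
```

Under these assumptions the sketch gives 2,998 households per region and 14,990 in total, which matches the figures reported above.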
It was also determined that ten eligible households would be interviewed in each selected CEA, which led to a sample of m = 1,500 CEAs, 300 in each major geographic region. This definition also resulted from the Science team’s accumulated experience with household survey samples and from evidence on how the sample size per CEA affects the precision of the estimates and the data collection costs. The number ten could be considered small compared to that adopted in other household surveys, such as the Brazilian Continuous National Household Sample Survey (PNAD Contínua), which selects 14 households (eligible or not) per CEA [10: Instituto Brasileiro de Geografia e Estatística. Pesquisa Nacional por Amostra de Domicílios Contínua. Notas metodológicas, versão 1.8. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2020]. However, in ENANI-2019, it would be difficult to reach 14 eligible households per CEA. Based on an average CEA size of 300 households, a proportion of children under five years of age estimated at 7.2% in 2016, and the assumption that each household would have at most one child under five years of age, the expected number of eligible households per CEA would be 21.6 (300 × 0.072). Since CEA sizes vary considerably (above and below the average of 300 households), and since this estimate is optimistic because it depends on the hypothesis of one eligible child per household, the target sample size of ten eligible households per CEA appeared reasonable and was adopted.
Allocation of the CEA sample in the selection strata
There are various ways of allocating the CEA sample size among the selection strata. At one extreme, there is proportional allocation, which ensures that the sample size in each stratum is proportional to its population, with the disadvantage of concentrating the sample in the strata with the largest population. At the other extreme is equal allocation (as among the major geographic regions), which ensures that the margin of error (or sampling precision) is similar across strata, but is advisable only when the strata are estimation domains. Between these extremes, there is power allocation, which makes the sample size in the stratum proportional to a power p (0 < p < 1) of its population. The larger the power p, the more closely power allocation approximates proportional allocation, and the smaller the power p, the closer it gets to equal allocation.
Expression 3 presents the form of power allocation [11: Bankier MD. Power allocations: determining sample sizes for subnational areas. Am Stat 1988; 42:174-7] used to define the number of CEAs sampled in each selection stratum h within each major geographic region:
where poph represents the population under five years of age in stratum h, estimated for July 1, 2016, as previously indicated in Table 1.
The Science experience in household sample surveys led to the use of a power allocation with p = 1/3, which displays a certain proportionality with the stratum’s population, without allowing excessive concentration in the more highly populated strata.
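The behavior of power allocation can be illustrated with a minimal sketch. It assumes the allocation described in the text, in which the number of CEAs in stratum h is proportional to poph raised to the power p and the allocations within a region add up to the regional total. The stratum names and populations are hypothetical, and the results are left unrounded.

```python
def power_allocation(populations, total_sample, power):
    """Split total_sample across strata in proportion to population ** power."""
    if not 0 <= power <= 1:
        raise ValueError("power must lie in [0, 1]")
    powered = {h: pop ** power for h, pop in populations.items()}
    denominator = sum(powered.values())
    return {h: total_sample * value / denominator for h, value in powered.items()}


# Hypothetical under-five populations for three selection strata in one region
example_pops = {"capital": 200_000, "large_city": 50_000, "other": 750_000}

alloc = power_allocation(example_pops, total_sample=300, power=1 / 3)
proportional = power_allocation(example_pops, total_sample=300, power=1)
equal = power_allocation(example_pops, total_sample=300, power=0)

for name in example_pops:
    print(name, round(equal[name], 1), round(alloc[name], 1), round(proportional[name], 1))
```

The endpoints p = 0 and p = 1 are accepted here only to show the two extremes; with p = 1/3 the most populous stratum receives more CEAs than under equal allocation but far fewer than under proportional allocation.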
For the strata of “other municipalities” in the five major geographic regions, the number of CEAs to select in each municipality determined the number of municipalities to select in each of these strata. The decision was to select five CEAs per municipality in all the major geographic regions, except in the North, where eight CEAs were selected per municipality. This larger number of CEAs per municipality in the “other municipalities” stratum of the North made it possible to reduce the number of selected municipalities. The North of Brazil poses major difficulties in terms of access and traveling time from the municipalities to their respective state capitals. In most municipalities, the traveling time could prevent blood sample collection and increase the study’s costs. Table 2 shows the planned sample size for CEAs and households.
Table 2. Sample size of census enumeration areas and households for Brazil and according to major geographic regions, selection strata, and municipalities. Brazilian National Survey on Child Nutrition (ENANI-2019).
Sample selection methods in the various stages
When the municipality was the PSU (block 2, the “other municipalities” strata of the major geographic regions), it was selected by systematic sampling with probability proportional to size (PPS), with the size measure being the population under five years of age in the municipality, estimated for July 1, 2016.
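A minimal sketch of systematic PPS selection follows. It is a generic textbook version, not the study’s actual selection program: units are laid end to end on a line according to their size, a random start is drawn within the first sampling interval, and a unit is selected whenever a selection point falls inside its segment. The municipality sizes and the seed are hypothetical.

```python
import random


def systematic_pps(sizes, n, rng):
    """Return the indices of n units selected by systematic PPS sampling.

    A unit larger than the sampling interval can be hit more than once; in
    practice such units are usually taken with certainty beforehand.
    """
    total = sum(sizes)
    step = total / n
    start = rng.uniform(0, step)
    points = [start + k * step for k in range(n)]
    selected = []
    cumulative = 0
    next_point = 0
    for unit, size in enumerate(sizes):
        cumulative += size
        while next_point < n and points[next_point] < cumulative:
            selected.append(unit)
            next_point += 1
    # Guard against floating-point rounding at the very end of the list
    while next_point < n:
        selected.append(len(sizes) - 1)
        next_point += 1
    return selected


# Hypothetical under-five populations of eight municipalities in one stratum
municipality_sizes = [1200, 300, 4500, 800, 2500, 700, 1500, 500]
sample = systematic_pps(municipality_sizes, n=2, rng=random.Random(2019))
print(sample)
```

Because every size in this example is smaller than the sampling interval (12,000 / 2 = 6,000), the two selected municipalities are always distinct.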
Since lower-income CEAs were expected to have more eligible children than higher-income CEAs, care was taken for the sample to cover the range of the population’s income in the selected municipalities, so that the different child feeding patterns in the study population would be represented. Thus, before the CEAs were selected, an additional stratification was performed, based on quartiles of the distribution of the average head-of-household income in each CEA, according to the 2010 Population Census. Next, the numbers of CEAs to be selected in each income stratum were allocated. Finally, within each municipality and income stratum, CEAs were selected by Pareto PPS sampling [12: Rosén BA. A user's guide to Pareto pps sampling. Stockholm: Statistiska Centralbyrån; 2000] [13: Freitas MPS, Antonaci GA. Sistema Integrado de Pesquisas Domiciliares: amostra mestra 2010 e amostra da PNAD Contínua. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2014]. The size measure for CEA sampling was the number of children under five years of age in the CEA, based on the 2010 Population Census, the most recent source of information available per CEA at the time of the survey.
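Pareto PPS sampling can be sketched in a few lines. In the usual formulation, each unit receives a target inclusion probability λ = n × size / total and a uniform random number U, and the n units with the smallest ranking values [U / (1 - U)] / [λ / (1 - λ)] form the sample. The version below is a generic illustration with hypothetical CEA sizes, not the study’s selection program.

```python
import random


def pareto_pps(sizes, n, rng):
    """Select n distinct units by Pareto PPS sampling; returns sorted indices."""
    if any(size <= 0 for size in sizes):
        raise ValueError("all size measures must be positive")
    total = sum(sizes)
    targets = [n * size / total for size in sizes]  # target inclusion probabilities
    if any(t >= 1 for t in targets):
        raise ValueError("units with n * size / total >= 1 must be taken with certainty first")
    ranked = []
    for unit, lam in enumerate(targets):
        u = rng.random()  # uniform on [0, 1)
        q = (u / (1 - u)) / (lam / (1 - lam))  # Pareto ranking variable
        ranked.append((q, unit))
    ranked.sort()
    return sorted(unit for _, unit in ranked[:n])


# Hypothetical numbers of children under five in ten CEAs of one income stratum
cea_sizes = [35, 12, 48, 20, 27, 15, 40, 9, 31, 23]
selected_ceas = pareto_pps(cea_sizes, n=3, rng=random.Random(2010))
print(selected_ceas)
```

Unlike systematic PPS, this scheme always returns a fixed number of distinct units, and the inclusion probabilities are only approximately equal to the targets.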
The adopted selection scheme prioritized the CEAs’ stratification by income and did not consider stratification by the urban-versus-rural situation. In this sense, the participation of rural CEAs in the sample would be approximately proportional to that observed in the municipalities. However, due to the logistic difficulty of household blood sample collection and the samples’ transportation to the local laboratory for processing, 46 rural CEAs which were more than two hours’ travel time from the municipal center (time interval greater than allowed by the study’s protocol for collection and transportation of blood samples) were replaced by closer CEAs. Later, as a function of the blood sample logistics, another 11 rural CEAs were also replaced during data collection. The implication of these operational restrictions of the blood sample collection and processing was the small presence of rural CEAs in the sample (only 32 rural CEAs among the 1,392 CEAs with data collected), resulting in estimates with a low level of precision for this setting.
The selection of eligible households within each selected CEA used inverse sampling [14: Haldane JBS. On a method of estimating frequencies. Biometrika 1945; 33:222-5] [15: Vasconcellos MTL, Silva PLN, Szwarcwald CL. Sampling design for the World Health Survey in Brazil. Cad Saúde Pública 2005; 21 Suppl:S89-99] [16: Vasconcellos MTL, Silva PLN, Anjos LA. Sample design for the Nutrition, Physical Activity and Health Survey (PNAFS), Niterói, Rio de Janeiro, Brazil. Estadística 2013; 65:47-61] during the data collection operation.
The collection began with the identification of selected CEAs (maps, descriptions, limits, and areas of exclusion, and the list of addresses in the National Registry of Addresses for Statistical Purposes - CNEFE, all available on the IBGE website). This was followed by updating the registry of addresses per CEA via the Census Tract Address Updating System (SAES), an app developed by Science and operated via the mobile data collection device (MDC). At this time, the interviewers canvassed each selected CEA, conducting the confirmations, corrections, inclusions, and exclusions of addresses for the buildings found along the way. Each identified building was classified as either a household (private or collective) or an establishment.
In each selected CEA, having concluded the update of the address registry, the SAES numbered the addresses classified as private households (PH) sequentially, starting with one, according to the order of the path taken by the interviewer in the CEA. Then, selection tables were used to generate a random permutation of the PH by blocks of ten for the CEA’s addresses (in each block, the ten PH were placed in the order of the path to facilitate the interviewer’s movement). The interviewer’s MDC displayed the first 20 addresses (in random order) to be visited to define the household’s eligibility and obtain (if eligible) the family’s consent to conduct the interview.
For each selected household in which the visit and contact did not result in an interview (ineligible PH, vacant PH, refusal, etc.), the data control app installed in the MDC added a new address to the list of PH addresses to be visited. This procedure ended when ten complete interviews had been obtained in the CEA or when all PH in the CEA had been visited. Thus, in each eligible interviewed household, information was collected on all the resident children under five years of age.
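The stopping rule of this procedure can be simulated with a simplified sketch. It assumes a single random permutation of all private households in the CEA and ignores the blocks of ten and the initial list of 20 addresses described above. The eligibility rule in the example (one household in twelve, all of which agree to the interview) is hypothetical.

```python
import random


def inverse_sample(households, is_interviewed, target=10, rng=None):
    """Visit households in random order until `target` interviews are obtained.

    Stops early only when every household in the CEA has been visited.
    Returns the lists of interviewed and visited households.
    """
    rng = rng or random.Random()
    order = list(households)
    rng.shuffle(order)
    interviewed, visited = [], []
    for household in order:
        if len(interviewed) == target:
            break
        visited.append(household)
        if is_interviewed(household):
            interviewed.append(household)
    return interviewed, visited


def eligible(household):
    """Hypothetical rule: every twelfth household is eligible and consents."""
    return household % 12 == 0


# A CEA with 300 private households (25 eligible under the rule above)
done, seen = inverse_sample(range(1, 301), eligible, rng=random.Random(7))
# A small CEA with 30 private households (only 2 eligible): the list is exhausted
done_small, seen_small = inverse_sample(range(1, 31), eligible, rng=random.Random(7))
print(len(done), len(seen), len(done_small), len(seen_small))
```

The number of visits needed to reach ten interviews is random, which is why both the number of households visited and the number interviewed are recorded for each CEA.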
Probabilistic sampling scheme
The probability of inclusion in the sample of municipality i in stratum h, represented by P(Mhi), depends on it being included with certainty in the sample (making it a selection stratum) or on it having been a PSU in one of the “other municipalities” strata, as indicated in Expression 4:
where pophi represents the population under five years of age in municipality Mhi, estimated for July 1, 2016, by the linear trend method [7]; Th represents the total number of municipalities in stratum h; and th represents the size of the sample of municipalities in stratum h.
The conditional probability of inclusion in the sample of CEA j in municipality i in stratum h, given the selection or inclusion of municipality Mhi and represented by P(Shij|Mhi), is given by Expression 5:
where domhij represents the number of households in CEA Shij according to the 2010 Population Census; Thi represents the total number of CEAs in income stratum g, to which CEA j of municipality Mhi belongs; and thi represents the sample size of CEAs in income stratum g, to which CEA j of municipality Mhi belongs.
The sum of households in the CEAs was calculated in the set of CEAs belonging to each income stratum g in the municipality.
Thus, the probability of inclusion in the sample of CEA Shij is expressed by:
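By the product rule, this probability is P(Shij) = P(Mhi) × P(Shij|Mhi). The sketch below illustrates the calculation under the assumption that both stages follow the standard PPS form (sample size times the unit’s size measure divided by the total size in the stratum), with P(Mhi) = 1 for municipalities included with certainty. The function names and the numbers are hypothetical.

```python
import math


def municipality_probability(certainty, pop=None, pop_stratum=None, n_municipalities=None):
    """P(Mhi): 1 for certainty municipalities, PPS probability otherwise."""
    if certainty:
        return 1.0
    # Capped at 1 as a safeguard for very large units (an assumption of this sketch)
    return min(1.0, n_municipalities * pop / pop_stratum)


def cea_conditional_probability(size, size_income_stratum, n_ceas):
    """P(Shij|Mhi): PPS within the income stratum of the municipality."""
    return min(1.0, n_ceas * size / size_income_stratum)


def cea_probability(p_municipality, p_cea_given_municipality):
    """P(Shij) = P(Mhi) * P(Shij|Mhi)."""
    return p_municipality * p_cea_given_municipality


# Block 2 example: 20 municipalities sampled; this one holds 12,000 of the
# stratum's 1,200,000 children; 2 CEAs sampled in an income stratum whose
# size measures add up to 3,000, of which this CEA accounts for 150.
p_m = municipality_probability(False, pop=12_000, pop_stratum=1_200_000, n_municipalities=20)
p_s_given_m = cea_conditional_probability(size=150, size_income_stratum=3_000, n_ceas=2)
p_s = cea_probability(p_m, p_s_given_m)

# Block 1 example: a state capital, included with certainty
p_s_capital = cea_probability(municipality_probability(True), p_s_given_m)
print(round(p_s, 4), round(p_s_capital, 4))
```

In block 1 the municipality stage contributes a factor of one, so the CEA’s inclusion probability equals its conditional probability.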
In CEA Shij, the conditional probability of interviewing household Dhijk is expressed by:
where represents the number of private households in CEA Shij obtained after updating the CEA’s address registry, performed at the time of the study; vhij is the total number of eligible private households selected and visited in CEA Shij; and ehij represents the total number of households interviewed in CEA Shij.
Thus, the probability of inclusion in the sample of household Dhijk is expressed by:
The objective of this stage was to calculate and assign sampling weights to the children to allow estimating target parameters in the study population as a whole and for specific target analyses. Good sampling weights allow unbiased estimation of the target population parameters, compensating for the effects of unit non-response and estimating with efficiency (small margin of error). The guide proposed by Valliant & Dever [17: Valliant R, Dever JA. Survey weights: a step-by-step guide to calculation. College Station: Stata Press; 2018] was followed in the elaboration of the study’s final sampling weights.
Since the study sample was stratified and clustered with unequal selection probabilities, it was necessary to calculate and use sampling weights for each of the households interviewed to allow unbiased estimation of target parameters in the population. The sampling weights were calculated in three or four stages, depending on the set of target information. The sampling weights were all calibrated to known population totals, seeking to correct typical biases in household samples and biases resulting from potential differential non-response or due to other difficulties faced while conducting the study.
Basic sampling weights were obtained in the first stage as the inverse of the inclusion probabilities of the interviewed households. The basic weights for the households were calculated with Expression 9:
To better control the estimates’ variability, the basic weights were truncated from above at 10,000 (that is, weights greater than 10,000 were trimmed to this value). This type of treatment is frequently used when the basic weights vary widely [18: Potter FJ. Survey of procedures to control extreme sampling weights. In: Proceedings of the Survey Research Methods Section. Alexandria: American Statistical Association; 1988. p. 453-8].
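The two operations just described, inverting the inclusion probability and trimming at 10,000, can be written as a short sketch. The inclusion probabilities in the example are hypothetical.

```python
import math


def basic_weight(inclusion_probability, cap=10_000):
    """Inverse of the inclusion probability, truncated from above at `cap`."""
    if not 0 < inclusion_probability <= 1:
        raise ValueError("inclusion probability must lie in (0, 1]")
    return min(1 / inclusion_probability, cap)


# Hypothetical household inclusion probabilities
probabilities = [0.02, 0.001, 0.00005]
weights = [basic_weight(p) for p in probabilities]
print(weights)
```

The third household would have an untrimmed weight of 20,000, so it is the only one affected by the cap.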
The household’s basic weight is applied to all the data obtained since no selection is made among the resident children. Therefore, the basic weight for all the children was set equal to their household’s basic weight. The basic weight calculated with Expression 9 underwent two or three adjustment stages, depending on the set of target variables for the analysis.
The study’s data collection was interrupted on March 17, 2020, due to the adoption of social distancing measures in response to the COVID-19 pandemic. As a result, the sample of CEAs was not collected in its entirety. The collection was concluded in most strata and PSUs, but in some, it did not occur (Table 2). In these strata and PSUs, the weights calculated in Expression 9 were adjusted via multiplication by a factor, as indicated in Expression 10:
where Ahi is the set of CEAs sampled in PSU i in stratum h; and Chi is the set of CEAs collected in PSU i in stratum h.
To simplify the presentation of the following weight adjustment stages, we change the notation and omit the stratum, municipality, and CEA indices, which are not needed to understand the expressions and calculations of the adjustment factors in the subsequent stages.
In the absence of non-response, the population total for a study variable y could be estimated without bias using the Horvitz-Thompson estimator [19: Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952; 47:663-85], as given by Expression 11:
where wk is the adjusted basic weight of unit k, obtained by Expression 10 at the end of stage 1, and s is the set of units in the sample.
Likewise, the population average of y, that is, the population total divided by N, where N is the population’s size, would be estimated using the Hájek estimator [20: Hájek J. Comment on a paper by D Basu. In: Godambe VP, Sprott DA, editors. Foundations of statistical inference. Toronto: Holt, Rinehart and Winston; 1971. p. 236], as shown in Expression 12:
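Both estimators are short enough to sketch directly. In their standard forms, the Horvitz-Thompson estimate of a total is the sum of wk × yk over the sample s, and the Hájek estimate of a mean divides that sum by the sum of the weights, so that N does not need to be known. The function names and the four-child data set below are made up for illustration.

```python
def horvitz_thompson_total(values, weights):
    """Weighted sum of the sampled values: sum of w_k * y_k over the sample."""
    return sum(w * y for y, w in zip(values, weights))


def hajek_mean(values, weights):
    """Weighted total divided by the sum of the weights."""
    return horvitz_thompson_total(values, weights) / sum(weights)


# Hypothetical indicator (1 = child has the characteristic) and adjusted weights
y = [1, 0, 1, 1]
w = [100, 250, 50, 100]

estimated_total = horvitz_thompson_total(y, w)
estimated_proportion = hajek_mean(y, w)
print(estimated_total, estimated_proportion)
```

With an indicator variable, the estimated total is the estimated number of children with the characteristic and the Hájek mean is the estimated proportion.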
As in any study, the ENANI-2019 sample presented both unit and item non-response that need to be compensated for in the analyses. Therefore, imputation was used to compensate for item non-response for most of the variables.
Laboratory analyses of blood samples showed unit non-response (lack of measures for all the biomarkers) and item non-response (lack of measures for some subset of biomarkers), as observed in Castro et al. [5]. Considering the nature of these measurements, we decided that it was not possible to compensate for non-response in this set of variables using imputation. To compensate for this non-response, adjustments were made to the children’s basic weights via the following steps.
Step 1: 25 groups of children were created with available responses for different subsets of variables in the data of blood biomarkers (Table 3). These groups were identified by dummy variables gkr taking a value of 1 for available responses in group r for child k, and 0 otherwise, with r varying from 1 to 25.
Table 3. Number of children 6 to 59 months of age with results of blood biomarkers, according to age group. Brazilian National Survey on Child Nutrition (ENANI-2019).
Step 2: for each dummy variable with an available response in group r, a logistic regression model was fitted for the probability of response, defined in Expression 13:
where xk is a vector with selected predictive variables for explaining the propensity to respond, and θ is a vector of parameters to be estimated.
The fitted model was used to obtain estimates of response probabilities in group r, as shown in Expression 14:
The predictive variables considered in the fitted models in all the response groups were the same and are listed in Box 1. The selection of variables for inclusion in these models was based on a set of potentially relevant predictors for explaining the pattern of responses to groups of blood biomarkers, including characteristics of the region, households, and children. Next, initial models were fitted to the data, followed by step-by-step inclusion of new predictors until reaching the set of variables with significant and relevant main effects. No models were tested for interactions between predictors.
Box 1. Predictive variables used to model the probability of response for each group of blood biomarkers. Brazilian National Survey on Child Nutrition (ENANI-2019).
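Step 3: for each group of records with an available response, the inverse of the estimated probability of response in the group was used as a factor to correct the child’s basic weight, obtaining adjusted weights according to Expression 15:
$$w_{kr} = \frac{g_{kr}\, w_k}{\hat{p}_{kr}} \qquad (15)$$
where $\hat{p}_{kr}$ is the estimated probability of response in group $r$ for child $k$.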
Since 25 groups of children were formed with different sets of available variables in the section on blood biomarkers, there are 25 sets of weights adjusted for non-response. In addition to the basic weight recommended for all the other analyses, each child will have a specific weight for each of these 25 sets. For each set of variables in which the child presents a complete response in all the variables, the corresponding weight is positive. It is null in case of non-response in at least one variable in the set of target variables. The data analysts will be responsible for selecting the adequate weights for the analyses that include blood biomarkers.
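Steps 1 to 3 can be sketched for a single response group as follows. This is a simplified illustration under stated assumptions, not the procedure actually run for ENANI-2019: the model has only an intercept and one hypothetical predictor (the survey used the variables in Box 1), the fit is unweighted, and the function names and data are invented for the example.

```python
import math


def fit_logistic(x, g, iterations=25):
    """Fit P(g=1 | x) = 1 / (1 + exp(-(a + b*x))) by Newton-Raphson (unweighted)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Score vector (gradient of the log-likelihood)
        s0 = sum(gi - pi for gi, pi in zip(g, p))
        s1 = sum((gi - pi) * xi for gi, pi, xi in zip(g, p, x))
        # Information matrix (2 x 2)
        v = [pi * (1.0 - pi) for pi in p]
        h00 = sum(v)
        h01 = sum(vi * xi for vi, xi in zip(v, x))
        h11 = sum(vi * xi * xi for vi, xi in zip(v, x))
        det = h00 * h11 - h01 * h01
        # Newton step: parameters += inverse(information) * score
        a += (h11 * s0 - h01 * s1) / det
        b += (h00 * s1 - h01 * s0) / det
    return a, b


def adjust_weights(w, g, p_hat):
    """Basic weight divided by estimated response probability; zero for non-respondents."""
    return [wk / pk if gk == 1 else 0.0 for wk, gk, pk in zip(w, g, p_hat)]


# Hypothetical data: x is a binary predictor, g the response indicator for one group
x = [0, 0, 0, 0, 1, 1, 1, 1, 1]
g = [1, 1, 0, 0, 1, 1, 1, 1, 0]
w = [100.0] * 9

a, b = fit_logistic(x, g)
p_hat = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
w_adj = adjust_weights(w, g, p_hat)
```

With a single binary predictor the fitted probabilities equal the observed response rates in each class, so the respondents' adjusted weights add up to the total basic weight of all children in the group. In the survey this calculation would be repeated for each of the 25 response groups, giving one set of adjusted weights per group.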
The final stage in the adjustment of the basic weights was calibration. The basic idea of calibration is to estimate factors $f_k$ (called calibration factors) that multiply the basic weights to generate the calibrated weights. These factors have the property of eliminating differences between estimates obtained with the calibrated weights and the corresponding population totals (known from other sources) for a set of ancillary calibration or post-strata variables 21,22. Calibration helps compensate for children’s total non-response, seeking to mitigate the effects of differential non-response that can affect estimates derived from the sample.
Calibration in ENANI-2019 employed population totals of children for 60 post-strata defined by cross-classifying the following variables: major geographic region (5 classes), sex (2 classes), and age (6 classes: 0 to 5 months; 6 to 11 months; 1 year; 2 years; 3 years; 4 years).
The subdivision of children under one year in two age classes for calibration purposes was necessary given the rules for applying part of the questionnaire and collecting blood samples: only children six months of age or older had blood samples drawn. Therefore, to avoid the need to use different population totals for calibration of the principal weights and the weights for groups of blood biomarker variables, the calibrations of all the weights to the two age classes for children under one year were considered separately. The totals used for the calibration are population projections by IBGE for January 1, 2020, disaggregated by major geographic region, sex, and five groups of individual ages in years. To obtain the totals for the two age groups under one year, the IBGE projections for children under one year were divided by two.
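The construction of the 60 calibration totals can be sketched as below. The region, sex, and age labels follow the text, but the projection counts are placeholders and the function name is hypothetical; only the halving rule for children under one year is taken from the description above.

```python
from itertools import product

REGIONS = ["North", "Northeast", "Southeast", "South", "Central West"]
SEXES = ["male", "female"]


def build_poststrata_totals(projections):
    """Expand projections keyed by (region, sex, age in years 0..4) into the six
    calibration age classes, splitting the under-one total into two equal halves."""
    totals = {}
    for (region, sex, age), n in projections.items():
        if age == 0:
            totals[(region, sex, "0-5 months")] = n / 2
            totals[(region, sex, "6-11 months")] = n / 2
        else:
            label = "1 year" if age == 1 else f"{age} years"
            totals[(region, sex, label)] = n
    return totals


# Placeholder projections: the same invented count in every cell
projections = {(r, s, a): 1000.0 for r, s, a in product(REGIONS, SEXES, range(5))}
totals = build_poststrata_totals(projections)
n_poststrata = len(totals)  # 5 regions x 2 sexes x 6 age classes
```

Splitting the under-one projection evenly leaves the overall population total unchanged while providing separate targets for children below and above six months of age.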
In the calibration of weights, the objective is to minimize the distance (Expression 16) between the calibrated weights ($f_k w_k$) and the weights one wishes to calibrate ($w_k$), simultaneously complying with two sets of restrictions:
$$X_C = \sum_{k \in C} f_k w_k x_k = \sum_{k \in U} x_k$$
$$f_k = f_j \quad \text{for all } k, j \in H$$
where $U$ represents the set of children in the study population; $C$ represents the set of children in the available effective sample; $H$ represents a household with two or more children interviewed; $x_k$ is the vector with values for the variables that identify the post-stratum to which the children belong (indicators of cells in the table obtained by cross-classifying major geographic region x sex x age group); $X_C$ is the estimated total with calibrated weights $f_k w_k$ for the post-strata, and $\sum_{k \in U} x_k$ are the population totals for the post-strata according to the respective population projections.
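A simplified sketch of calibration to post-stratum totals is given below. It deliberately omits the household restriction (equal factors for children of the same household), in which case calibration on cell indicators reduces to a ratio adjustment within each post-stratum. It is therefore an illustration of post-stratification only, not of the integrated household weighting used in the survey. Names and numbers are hypothetical.

```python
from collections import defaultdict


def poststratify(weights, cells, population_totals):
    """Return calibration factors f_k = N_cell / (sum of w_k in the cell).

    Assumption: no household-level restriction is imposed on the factors.
    """
    weighted_counts = defaultdict(float)
    for w, c in zip(weights, cells):
        weighted_counts[c] += w
    return [population_totals[c] / weighted_counts[c] for c in cells]


# Hypothetical sample of five children in two post-strata
weights = [100.0, 120.0, 80.0, 150.0, 50.0]
cells = ["A", "A", "B", "B", "B"]
population_totals = {"A": 275.0, "B": 252.0}

factors = poststratify(weights, cells, population_totals)
calibrated = [f * w for f, w in zip(factors, weights)]
```

After the adjustment, the calibrated weights in each post-stratum add up to the corresponding population total, which is the first of the two restrictions described above.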
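The estimator using calibrated weights for totals is expressed as:
$$\hat{Y}_C = \sum_{k \in C} f_k w_k y_k$$
and the corresponding estimator for population means is expressed as:
$$\hat{\bar{Y}}_C = \frac{\sum_{k \in C} f_k w_k y_k}{\sum_{k \in C} f_k w_k}$$
where $y_k$ is the value of the variable of interest for child $k$.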
The calibrated weights should be used in all the analyses, not only with the children’s data but also with other data, such as those on the households and the children’s parents or guardians. Calibration of weights in the way described here is called “integrated household weighting”, ensuring that all the units (children, etc.) from the same household have equal weights 23.
This statement applies to basic weights but not to weights for the groups of blood biomarker variables. In this case, if the household has children under six months and children over six months of age, the former will have null weights since they did not participate in this part of the survey. As mentioned, for groups of blood biomarker variables, it will always be up to the data analyst to select the adequate weight for each analysis.
To estimate variances, it is recommended to use a combination of the ultimate cluster and linearization methods 9, as implemented, for example, in the survey package 24 of R software (http://www.r-project.org).
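To make the ultimate cluster idea concrete, the following Python sketch computes the variance estimate for an estimated total, treating the weighted totals of the primary sampling units (PSUs) within each stratum as if they had been drawn independently with replacement. It is a minimal illustration, not a substitute for the R survey package: the record layout and function name are hypothetical, taking each CEA as a PSU is an assumption of the example, and estimators of means or ratios would additionally require the linearization step.

```python
from collections import defaultdict


def ultimate_cluster_variance(records):
    """records: iterable of (stratum, psu, weight, y).

    Returns the estimated total and its ultimate cluster variance estimate:
    sum over strata of n_h / (n_h - 1) * sum_i (z_hi - mean_h)^2,
    where z_hi is the weighted total of PSU i in stratum h.
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * y

    total, variance = 0.0, 0.0
    for stratum, psus in psu_totals.items():
        z = list(psus.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        z_bar = sum(z) / n_h
        total += sum(z)
        variance += n_h / (n_h - 1) * sum((zi - z_bar) ** 2 for zi in z)
    return total, variance


# Hypothetical data: two strata with two PSUs each
records = [
    ("N", 1, 10.0, 1.0), ("N", 1, 10.0, 0.0), ("N", 2, 20.0, 1.0),
    ("S", 3, 10.0, 1.0), ("S", 3, 10.0, 1.0), ("S", 4, 30.0, 1.0),
]
total_hat, var_hat = ultimate_cluster_variance(records)
```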
Even samples with optimal planning may undergo adjustments during data collection for various reasons. Although potential sources of bias, in practice, such adjustments are unavoidable. In the specific case of the ENANI-2019 sample, the main problem during data collection was the interruption on March 17, 2020, due to the COVID-19 pandemic. Before the interruption, there was a need to make substitutions and inclusions of CEAs in the sample, as described below. Even one entire municipality, Jataí (Goiás State), had to be replaced by the municipality of Luziânia (Goiás State), since it was not possible to find a clinical laboratory that could perform the blood sample collection in Jataí (Table 2).
As presented in Table 2, from a total of 1,500 selected CEAs, data collection was performed in 1,382, for a total loss of 7.9%. Losses resulting from the data collection’s premature interruption were the highest in the North and Northeast regions (22.3% and 13.7% of the 300 selected CEAs in these regions, respectively). Conversely, there were no losses in the South, and the losses were smaller in the Central West and Southeast (3% and 0.3% of the CEAs, respectively; Table 2).
In the total sample of CEAs, 37 (2.5% of the total planned sample of CEAs) had to be replaced due to difficulties during data collection, either because of the distance to the municipal center (which prevented the blood sample collection) or because of difficulties in access to the CEA resulting from civil unrest (drug trafficking, land disputes, etc.). Besides these replacements, 18 CEAs were added to the study sample to compensate for cases in which data collection did not produce interviews with eligible households, having exhausted all the PH addresses.
As for the sample of households, 12,524 (83.5%) households were obtained, compared to the expected total of 15,000 eligible households. In addition, data were obtained from 14,558 children, with a loss of only 3% in relation to the expected total of 15,000.
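The coverage figures reported for CEAs, households, and children follow directly from the planned and achieved counts, as the short calculation below shows. The counts are those given in the text; the variable and function names are illustrative.

```python
planned_ceas, surveyed_ceas = 1500, 1382
planned_households, surveyed_households = 15000, 12524
expected_children, surveyed_children = 15000, 14558


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


cea_loss = percent(planned_ceas - surveyed_ceas, planned_ceas)                   # 7.9
household_coverage = percent(surveyed_households, planned_households)            # 83.5
child_loss = percent(expected_children - surveyed_children, expected_children)   # 2.9
children_per_household = round(surveyed_children / surveyed_households, 2)       # 1.16
```

The smaller loss among children than among households reflects the fact that the interviewed households contained, on average, more than one eligible child.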
One technique used in the survey was the inverse sampling of households, which functions as “sample screening”. In a sampling process that seeks to locate households with members of a specific population, as in the current survey (children under five years of age), a standard alternative procedure would be to use complete or “census screening”, visiting all the households in the selected CEAs and attempting to determine whether they contained members of the target population. This alternative would involve a higher cost in updating the registry of addresses in the CEAs. In addition, it would represent a stage of creation of a complete registry of eligible households in each CEA. This stage would involve an increase not only in costs but also in time.
By adopting inverse sampling, the screening for eligible households was carried out by sampling and allowed the sample selection and approach for interviewing to occur during the same process of visiting the CEA to locate and visit the selected households. The necessary cost and time for data collection were thus much smaller. An effect of this approach is the selection of large numbers of addresses that lack an eligible household. Still, this cost is much lower than with the alternative approach of registering all the households in each CEA with a visit to verify eligibility.
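The logic of inverse sampling within a CEA can be sketched as follows: addresses are visited in a previously randomized order until the target number of eligible households is reached or the address list is exhausted. This is a simplified illustration under assumptions; the function name, the eligibility rule, and the toy address list are hypothetical, and refusals and closed households are ignored.

```python
def inverse_sample(addresses, is_eligible, target=10):
    """Visit addresses in the given (already randomized) order until `target`
    eligible households are found or the list is exhausted.

    Returns the selected addresses and the number of addresses visited.
    """
    visited, selected = 0, []
    for address in addresses:
        if len(selected) == target:
            break
        visited += 1
        if is_eligible(address):
            selected.append(address)
    return selected, visited


# Hypothetical CEA with 40 addresses; every fourth one has a child under five
addresses = list(range(1, 41))
selected, visited = inverse_sample(addresses, lambda a: a % 4 == 0, target=3)

# With an unreachable target, the whole list is visited
all_selected, all_visited = inverse_sample(addresses, lambda a: a % 4 == 0, target=20)
```

The number of addresses visited is random and depends on how rare eligible households are, which is why many visited addresses turn out to be ineligible, and why a CEA can be exhausted before the target is reached.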
Overall, it was necessary to select and visit 193,212 addresses in the selected CEAs, resulting in an average of 140 households visited per CEA, with an average of 9 eligible households interviewed per selected CEA. Of all the selected addresses, 75.1% were ultimately ineligible for various reasons, mostly households without children under five years of age (Table 4).
Table 4. Number of addresses visited, according to the result of visit. Brazilian National Survey on Child Nutrition (ENANI-2019).
In addition to the ineligible households, 14.2% of all the selected and visited households were classified as closed at the end of the operation (Table 4). Households were classified as closed when they had residents at the time of the survey but it was not possible to contact them, even after applying the contact protocol for data collection (at least four visits on different days and at different times).
ENANI-2019 experienced a refusal rate of 35.8%, considering only selected eligible households that were contacted successfully (Table 4). Considering all the selected and visited households as the denominator, the refusal rate was 3.7%. The largest loss among the selected and visited households was due to initial refusal to participate in the survey (33.2% of eligible households, as shown in Table 4), not surprising in a survey with the kind of demand that ENANI-2019 exerts on families (obtaining data on their children, including the collection of blood samples).
ENANI-2019 is the first nationwide household survey in Brazil that jointly investigated breastfeeding and complementary feeding practices, individual dietary intake, anthropometric nutritional status, and micronutrient deficiencies in children under five years of age. Determination of the sample size and the methodology used in allocating CEAs in the selection strata allowed the representation of the target population in each major geographic region. The study’s fieldwork presented good results compared to other household surveys using the highest sampling standards in Brazil. The results will allow comparisons with previous studies and support strategic decisions on implementing public policies for under-five children.
To the field staff and participating families who made this study possible. To the Brazilian Ministry of Health/Brazilian National Research Council (CNPq) - process: 440890/2017-9.
- 1. Instituto Brasileiro de Geografia e Estatística. Projeções da população. Pesquisa Nacional por Amostra de Domicílios 2012. Notas metodológicas. Pesquisa básica. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2013.
- 2. Alves-Santos NH, Castro IRR, Anjos LA, Lacerda EMA, Normando P, Freitas MB, et al. General methodological aspects in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; 37:e00300020.
- 3. Lacerda EMA, Boccolini CS, Alves-Santos NH, Castro IRR, Anjos LA, Crispim SP, et al. Methodological aspects of the assessment of dietary intake in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; 37:e00301420.
- 4. Anjos LA, Ferreira HS, Alves-Santos NH, Freitas MB, Boccolini CS, Lacerda EMA, et al. Methodological aspects of the anthropometric assessment in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; e00293320.
- 5. Castro IRR, Normando P, Alves-Santos NH, Bezerra FF, Citelli M, Pedrosa LFC, et al. Methodological aspects of the micronutrient assessment in the Brazilian National Survey on Child Nutrition (ENANI-2019): a population-based household survey. Cad Saúde Pública 2021; 37:e00301120.
- 6. Instituto Brasileiro de Geografia e Estatística. Estimativas da população residente nos municípios e para as Unidades da Federação brasileiros com data de referência em 1º de julho de 2016 (notas metodológicas). Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2016.
- 7. Madeira JL, Simões CCS. Estimativas preliminares da população urbana e rural segundo as unidades da federação, de 1960/1980 por uma nova metodologia. Revista Brasileira de Estatística 1972; 33:3-11.
- 8. Cochran WG. Sampling techniques. 3rd Ed. New York: John Wiley & Sons; 1977.
- 9. Pessoa DG, Silva PLN. Análise de dados amostrais complexos. São Paulo: Associação Brasileira de Estatística; 1998.
- 10. Instituto Brasileiro de Geografia e Estatística. Pesquisa Nacional por Amostra de Domicílios Contínua. Notas metodológicas, versão 1.8. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2020.
- 11. Bankier MD. Power allocations: determining sample sizes for subnational areas. Am Stat 1988; 42:174-7.
- 12. Rosén BA. A user's guide to Pareto pps sampling. Stockholm: Statistiska Centralbyrån; 2000.
- 13. Freitas MPS, Antonaci GA. Sistema Integrado de Pesquisas Domiciliares: amostra mestra 2010 e amostra da PNAD Contínua. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2014.
- 14. Haldane JBS. On a method of estimating frequencies. Biometrika 1945; 33:222-5.
- 15. Vasconcellos MTL, Silva PLN, Szwarcwald CL. Sampling design for the World Health Survey in Brazil. Cad Saúde Pública 2005; 21 Suppl:S89-99.
- 16. Vasconcellos MTL, Silva PLN, Anjos LA. Sample design for the Nutrition, Physical Activity and Health Survey (PNAFS), Niterói, Rio de Janeiro, Brazil. Estadística 2013; 65:47-61.
- 17. Valliant R, Dever JA. Survey weights: a step-by-step guide to calculation. College Station: Stata Press; 2018.
- 18. Potter FJ. Survey of procedures to control extreme sampling weights. In: Proceedings of the Survey Research Methods Section. Alexandria: American Statistical Association; 1988. p. 453-8.
- 19. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952; 47:663-85.
- 20. Hájek J. Comment on a paper by D Basu. In: Godambe VP, Sprott DA, editors. Foundations of statistical inference. Toronto: Holt, Rinehart and Winston; 1971. p. 236.
- 21. Silva PLN. Calibration estimation: when and why, how much and how. Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística; 2004. (Textos para Discussão da Diretoria de Pesquisas, 14).
- 22. Deville JC, Särndal CE. Calibration estimators in survey sampling. J Am Stat Assoc 1992; 87:376-82.
- 23. Lemaitre GE, Dufour J. An integrated method for weighting persons and families. Surv Methodol 1987; 13:199-207.
- 24. Lumley T. Complex surveys: a guide to analysis using R. Hoboken: John Wiley & Sons; 2010. (Wiley Series in Survey Methodology).
- Publication in this collection
30 Aug 2021
- Date of issue
13 Feb 2021
29 Apr 2021
18 May 2021