Ana Luiza BierrenbachI; Antony Peter StevensI; Adriana Bacelar Ferreira GomesI; Elza Ferreira NoronhaII; Ruth GlattI; Carolina Novaes CarvalhoI; João Gregório de Oliveira JuniorI; Maria de Fátima Marinho de SouzaI
ISecretaria de Vigilância em Saúde. Ministério da Saúde. Brasília, DF, Brasil
IIFaculdade de Medicina. Universidade de Brasília. Brasília, DF, Brasil
OBJECTIVE: To evaluate the impact on tuberculosis (TB) incidence rates of removal of improper duplicate records from the notification system.
METHODS: Data from the Sistema de Informação de Agravos de Notificação (Brazilian Information System for Tuberculosis Notification) from 2000 to 2004 were analyzed. Repeat records were identified through probabilistic record linkage, classified into six mutually exclusive categories, and then kept, combined or removed from the database.
RESULTS: Of all TB records, 73.7% had no duplicate, 18.9% were duplicate, 4.7% were triplicate, and 2.7% were quadruplicate or more. Of all repeat records, 47.3% were classified as transfer in/out, 23.6% as return after default, 16.4% as true duplicates, 10% as relapse, 2.5% as inconclusive and 0.2% had missing data. These proportions differed among Brazilian states. Removal of improper duplicate records reduced the TB incidence rate per 100,000 inhabitants by 6.1% in the year 2000 (from 44 to 41.3), 8.3% in 2001 (from 44.5 to 40.8), 9.4% in 2002 (from 45.8 to 41.5), 9.2% in 2003 (from 46.9 to 42.6) and 8.4% in 2004 (from 45.4 to 41.6).
CONCLUSIONS: The study results indicate that the observed tuberculosis incidence rates are closer to the actual rates than those obtained from the raw database at state and country level. The use of a record linkage approach should be promoted to improve the quality of notification system data.
Key words: Tuberculosis, epidemiology. Disease Notification. Diseases registries. Data sources. Information Systems. Brazil.
The Sistema de Informação de Agravos de Notificação (SINAN, Brazilian Information System for Disease Notification) collects and processes data on compulsory disease notification nationwide.1 Improper repeat records in health information systems jeopardize correct interpretation of epidemiological surveillance data.
Repeat notification of chronic diseases such as tuberculosis (TB) can be attributed to data entry or processing errors. The same patient can also be reported repeatedly by different health units because of authorized or voluntary transfers between units during treatment, or because of different treatments due to relapse after cure or return after default.2 Although they concern the same patient, relapses and returns are considered valid entries in this database as they are new TB episodes. All other repeat records must be removed.
The objective of the present study was to assess the impact on TB incidence rates of removal of improper repeat records from the notification system.
Nationwide TB notification records for the period 2000-2004 were studied. Data provided by health departments at state level were made available by SINAN-TB National Management in February 2006.
The following steps were taken to identify repeat records: 1) database pre-processing; 2) identification of matched records (matches) using the record linkage software Link-Plus; 3) ascertainment of whether matched records concerned the same patient (links); 4) post-processing with regrouping of records concerning the same patient. Linked records concerning the same patient were considered repeat records.
During database pre-processing, the contents of the variables "patient's name" and "patient's mother's name" were corrected to increase the likelihood of finding matched records. These procedures included: 1) correction of obvious typing errors; 2) elimination or replacement of special characters (%, /); 3) capitalization of names; 4) removal of any individual letters or prepositions from names; 5) removal of terms indicating lack of information on patient's name and patient's mother's name (e.g. don't know, unknown).
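The automatable part of these pre-processing steps (steps 2 to 5) can be sketched as follows. This is a minimal illustration only, not the procedure actually used in the study: the placeholder terms and the list of prepositions are assumptions, and correction of typing errors (step 1) is left out because it requires manual review.

```python
# Hypothetical lists: the study does not enumerate the exact terms it removed.
PLACEHOLDERS = {"IGNORADO", "IGNORADA", "NAO SABE", "NÃO SABE", "UNKNOWN", "DON T KNOW"}
PREPOSITIONS = {"DE", "DA", "DO", "DAS", "DOS", "E"}


def normalize_name(raw):
    """Return a cleaned, capitalized name, or '' if it carries no information."""
    text = raw.upper()
    # Step 2: replace special characters (%, /, punctuation, digits) with spaces.
    text = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    tokens = text.split()
    # Step 5: drop values that only say the name is unknown.
    if " ".join(tokens) in PLACEHOLDERS:
        return ""
    # Step 4: drop isolated letters and prepositions.
    kept = [t for t in tokens if len(t) > 1 and t not in PREPOSITIONS]
    return " ".join(kept)


print(normalize_name("maria  da silva/ santos"))  # MARIA SILVA SANTOS
```

Applying the same function to both name variables keeps the two fields comparable before linkage.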
Matched records were identified using the record linkage software Link-Plus (CDC, Atlanta, Georgia, USA)3 through probabilistic search for repeat records. Probabilistic record linkage (PRL), developed by Fellegi & Sunter,2 makes it possible to estimate the likelihood of agreement and disagreement of the variables selected for record linkage (linkage variables).
The software was set up to search for repeat records. Variables such as "patient's name," "mother's name" and "date of birth" were included as matching variables. The variable "gender" was selected as the blocking variable, i.e., a variable used for separating the file into smaller blocks to speed up the linkage process.
Probabilities in the linkage process were obtained through an indirect approach, i.e., probability estimates were derived from the SINAN-TB records undergoing linkage. Neither default probabilities nor probabilities preset by the investigators were used.
Link-Plus estimates a score for each pair of matched records. The higher the score, the more likely it is that a matched pair concerns the same patient. Pairs scoring above the set cutoff value are considered repeat records and pairs scoring below it are considered single records. A cutoff value of six was set. When the linkage process is complete, reports listing the pairs of matched records and the single records are issued.
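The general idea of blocking, scoring and applying a cutoff can be illustrated with a small Fellegi & Sunter style sketch. It does not reproduce Link-Plus internals: the m- and u-probabilities below are invented for illustration (the study estimated them from the data), field names are hypothetical, and fields are compared by exact equality only. The cutoff of six is the value reported above.

```python
import math
from itertools import combinations

# Hypothetical (m, u) pairs: m = P(agree | same patient), u = P(agree | different patients).
M_U = {
    "name": (0.95, 0.001),
    "mother_name": (0.90, 0.001),
    "birth_date": (0.90, 0.01),
}
CUTOFF = 6.0  # cutoff value reported in the study


def pair_score(a, b):
    """Sum of log2 agreement/disagreement weights over the matching variables."""
    score = 0.0
    for field, (m, u) in M_U.items():
        if not a[field] or not b[field]:
            continue  # assumption: missing values contribute no evidence
        if a[field] == b[field]:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def candidate_pairs(records, block_field="gender"):
    """Only compare records that fall in the same block."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[block_field], []).append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def matched_pairs(records, cutoff=CUTOFF):
    """Return (id, id, score) for every candidate pair scoring above the cutoff."""
    pairs = []
    for a, b in candidate_pairs(records):
        score = pair_score(a, b)
        if score > cutoff:
            pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs
```

Blocking on gender means that two records with different recorded gender are never compared, which is the price paid for the speed-up.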
Three successive rounds of manual removal were conducted to ascertain whether pairs of matched records concerned the same patient, in which case they were called linked or repeat records. Pairs whose records did not concern the same patient were broken up based on a set of information and criteria. For example, misreporting of date of birth was common, as evidenced by inconsistencies between date of birth and patient's age. Records with inconsistent dates of birth have low negative predictive value in ascertaining whether a pair of linked records concerns the same patient, while consistent dates have high positive predictive value. The investigators' knowledge of the composition of Brazilian proper names was also applied. For example, because families often give similar names to their children, Link-Plus matched records of likely siblings as if they concerned the same patient; these pairs were broken up during manual removal. In uncertain cases, the investigators took a conservative approach and did not consider the matched records as repeat.
The first two rounds of removal were based only on linkage variables and program scores. The third round of removal was carried out after regrouping of repeat records, when other variables were compared, such as municipality and notifying health unit or municipality and home address. In all steps, program scores helped to determine which records required careful consideration during removal.
Link-Plus yields results as paired records, but some records are transitively paired. According to transitive logic, if record A is associated with records B and C, then records B and C are also necessarily associated. Thus, records A, B and C were regrouped as a record triplet concerning the same patient even if records B and C had not been matched by the record linkage program.
As a result of this process, groups of three, four or more records were considered as concerning one patient. The largest group of repeat records concerning the same patient included 15 records.
In the last step, records were classified as single (one notification and no repeat), duplicate (one notification and one repeat), triplicate (one notification and two repeats), and so on.
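The transitive regrouping and the labelling by group size can be sketched with a union-find structure. The source does not say how regrouping was implemented, so this is one possible approach; function names and record identifiers are illustrative.

```python
def group_records(record_ids, linked_pairs):
    """Merge pairwise links into groups of records concerning the same patient."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent pointers to the group representative (with path halving).
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for rid in record_ids:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())


def size_label(n):
    """Label a group by the number of records it contains."""
    return {1: "single", 2: "duplicate", 3: "triplicate"}.get(n, "quadruplicate or more")


# A is linked to B and to C; B and C were never matched directly.
groups = group_records(["A", "B", "C", "D"], [("A", "B"), ("A", "C")])
print([(g, size_label(len(g))) for g in groups])
# [(['A', 'B', 'C'], 'triplicate'), (['D'], 'single')]
```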
For the classification of repeat records, the values of the following variables were compared: notification number, date of notification, date of diagnosis, date of current notification, date of current treatment start, date of notification completion, code of notifying municipality, code of notifying health unit, code of health unit providing patient follow-up, type of system entry, TB clinical form and status at completion.
Repeat records were classified in six mutually exclusive categories as follows:
- Missing data: repeat records with missing information for variables "date of notification" and/or "type of system entry" and/or "code of notifying health unit".
- True duplication: repeat records from the same notifying health unit with the same (and non-missing) information for the variable "code of notifying municipality" and either the same date of notification or a time interval of up to 60 days between notifications. As two different charts for coding health units may have been in use concurrently, records were considered to be from the same health unit if they had the same code or corresponding codes in the two charts. All states were asked to provide their plan of health unit code change, but only some of them provided it in time to be included in the study.
- Relapse: repeat records where categories in the variables related to "type of system entry" and/or "status at completion" indicated prior cure.
- Return: repeat records where categories in the variables related to "type of system entry" and/or "status at completion" indicated prior default.
- Transfer between health units: repeat records notified by different health units with information in the variables related to "type of system entry" and/or "status at completion" indicating case transfer. Repeat records that, although with same (or corresponding) codes for notifying health unit, showed different code for health unit providing patient follow-up were also classified as transfer between health units.
- Inconclusive: classification was not possible even though variables did not have any missing information.
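The six categories above can be expressed as a rule cascade over a pair of repeat records. The sketch below is illustrative only: the study used Stata, the field names and category codes are hypothetical, the order in which the rules are tested is an assumption (the source states only that the categories are mutually exclusive), and the mapping between old and new health unit codes is not modelled.

```python
from datetime import date

REQUIRED = ("notification_date", "entry_type", "unit_code")


def classify_repeat(earlier, later):
    """Classify a pair of repeat records (earlier and later notification)."""
    for rec in (earlier, later):
        if any(rec.get(field) in (None, "") for field in REQUIRED):
            return "missing data"
    if later["entry_type"] == "relapse" or earlier.get("outcome") == "cure":
        return "relapse"
    if later["entry_type"] == "return after default" or earlier.get("outcome") == "default":
        return "return"
    same_unit = earlier["unit_code"] == later["unit_code"]
    flagged = later["entry_type"] == "transfer" or earlier.get("outcome") == "transfer"
    other_followup = earlier.get("followup_unit_code") != later.get("followup_unit_code")
    if (not same_unit and flagged) or (same_unit and other_followup):
        return "transfer between health units"
    days_apart = abs((later["notification_date"] - earlier["notification_date"]).days)
    same_municipality = earlier["municipality_code"] == later["municipality_code"]
    if same_unit and same_municipality and days_apart <= 60:
        return "true duplication"
    return "inconclusive"


first = {
    "notification_date": date(2003, 1, 10), "entry_type": "new case",
    "unit_code": "U1", "followup_unit_code": "U1",
    "municipality_code": "M1", "outcome": None,
}
second = {**first, "notification_date": date(2003, 2, 1)}
print(classify_repeat(first, second))  # true duplication
```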
Repeat records classified as "transfer between health units" were grouped as within municipalities, when the notifying health units belonged to the same municipality; between municipalities, when they were from different municipalities but within the same state; and inter-state when they were from different states.
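As a small illustration of this grouping, the scope of a transfer can be derived from the state and municipality of the two notifying units. Field names are hypothetical.

```python
def transfer_scope(origin, destination):
    """Return the geographic scope of a transfer between two notifying units."""
    if origin["state"] != destination["state"]:
        return "between states"
    if origin["municipality_code"] != destination["municipality_code"]:
        return "between municipalities"
    return "within municipality"
```

The state is tested first so that municipality codes only need to be unique within a state.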
Score comparison and classification were carried out using Stata 8.2 software.
After classification, repeat records were then either excluded or remained in the database following SINAN working guidelines. Hence, records classified as relapse, return, and inconclusive remained in the database. For "true duplication," the oldest record (or the most complete one, if both had same date of notification) was left in the database. For "transfer between health units," notification form information of the oldest record was joined to follow-up form information of the most recent record.4 A database was defined as "complete" when it included all notified records and "lean" when it included non-excluded records only.
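A minimal sketch of how a "lean" database could be derived from classified groups is given below. The split of fields between the notification form and the follow-up form is an assumption made for illustration, as is the treatment of any category not mentioned in the rules above (such records are simply kept).

```python
from datetime import date

# Hypothetical split of fields between the two forms.
NOTIFICATION_FIELDS = ("notification_number", "notification_date", "unit_code", "entry_type")
FOLLOWUP_FIELDS = ("followup_unit_code", "outcome", "completion_date")


def completeness(rec):
    """Number of fields that are filled in."""
    return sum(1 for value in rec.values() if value not in (None, ""))


def resolve(category, records):
    """Return the records that stay in the lean database for one group."""
    ordered = sorted(records, key=lambda r: r["notification_date"])
    oldest, newest = ordered[0], ordered[-1]
    if category == "true duplication":
        # Keep the oldest record; on a date tie, keep the most complete one.
        tied = [r for r in ordered if r["notification_date"] == oldest["notification_date"]]
        return [max(tied, key=completeness)]
    if category == "transfer between health units":
        # Join the oldest notification form to the most recent follow-up form.
        merged = {f: oldest.get(f) for f in NOTIFICATION_FIELDS}
        merged.update({f: newest.get(f) for f in FOLLOWUP_FIELDS})
        return [merged]
    return ordered  # relapse, return and inconclusive records all remain
```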
Following SINAN guidelines for epidemiological surveillance actions,1 a new TB case was defined as any notification in which: 1) the variable "system entry" reported "new case" or "don't know"; and 2) the variable "status at completion" was left blank in the category "diagnosis change".
TB incidence rates were estimated as the number of new cases living in a given area and diagnosed in a given year, divided by the population living in this area in the same year and multiplied by 100,000. Population-based data were provided by the Instituto Brasileiro de Geografia e Estatística (Brazilian Institute of Geography and Statistics, IBGE).5
The TB notification database for the period 2000-2004 included 482,501 records comprising all types of system entries and all TB clinical forms. Of these, more than 70% were single records, and no clear trend was seen in single, duplicate, triplicate, and quadruplicate or more records (Table 1). In all Brazilian regions, the proportion of single, duplicate, triplicate and quadruplicate or more records did not vary much over the years studied, but it varied widely in some states.
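The rate calculation, and the relative reduction between the complete and lean databases, can be written out as a short worked example. The function names are illustrative; the national rates are those reported in this article, and the percentages computed from them agree with the reported reductions.

```python
def incidence_rate(new_cases, population, per=100_000):
    """New cases in an area and year per `per` inhabitants of that area and year."""
    return new_cases * per / population


def relative_reduction(rate_complete, rate_lean):
    """Percentage drop in the rate after removing improper repeat records."""
    return (rate_complete - rate_lean) / rate_complete * 100


# National rates per 100,000 inhabitants reported in this article: (complete, lean).
REPORTED = {
    2000: (44.0, 41.3),
    2001: (44.5, 40.8),
    2002: (45.8, 41.5),
    2003: (46.9, 42.6),
    2004: (45.4, 41.6),
}

for year, (complete, lean) in REPORTED.items():
    print(year, round(relative_reduction(complete, lean), 1))
# 2000 6.1
# 2001 8.3
# 2002 9.4
# 2003 9.2
# 2004 8.4
```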
Table 2 shows that, in 2003, states with the lowest and highest rates of single records were Goiás (21.1%) and Roraima (86.9%), respectively.
Table 3 displays the annual proportions in the six repeat record classifications. "Transfers between health units" was the most prevalent category in the study period, accounting for 55.4% of all repeat records in the first year and then remaining around 47% in subsequent years. Returns accounted for 12% of repeat records in 2000 and then remained constant at around 25%. Overall, true duplications decreased and relapses increased over the period studied.
Of all 32,341 repeat records classified as "transfers between health units," 40.4% were within municipality; 47.8% between municipalities; and 11.8% between states.
Table 4 shows the classification of repeat records notified in 2003 by regions and states. Different proportions in each classification were found between states of the same region. Although some states had a small number of repeat records, Roraima, Amazonas and Amapá had the highest rates of transfers between health units, while Acre had the lowest rate. In Goiás, true duplication accounted for 74% of repeat records, more than twice the proportion found in Paraíba, ranked second in this category.
Table 5 shows a comparison of annual TB incidence rates between complete and lean databases, i.e., before and after removal of duplicate records and joining of transferred cases. With rare exceptions, the annual TB incidence rates differed between the two databases in all states over the period studied. Differences were greater than 10% in at least one year in the states of Amapá, Goiás, Paraíba, Piauí, Rio Grande do Norte, São Paulo and Tocantins. Goiás showed a difference higher than 34% in all years studied. Nationwide, the difference in incidence rates between the two databases ranged from 6.1% in 2000 to 9.4% in 2002, with no clear trend. Table 5 also shows rate differences between regions and states over the years that cannot be attributed to repeat records in the database, but this analysis is beyond the scope of this study.
SINAN was created in the early 1990s and has undergone several updates to eliminate errors and make it more suitable to meet new demands in epidemiological surveillance. Although all Brazilian municipalities pass on their information to SINAN, around 70% carry out direct entry of electronic data. Database updates at higher hierarchical levels are routinely conducted through vertical data transfers. Working guidelines and task descriptions at local, state and country levels are regulated in official documents available to all users.1
In accordance with epidemiological surveillance guidelines, SINAN has implemented specific routine procedures for managing repeat TB patient records and has its own tools to help identification of potential duplicates as well as correction procedures. However, given the number of repeat records found in SINAN-TB database, these routine procedures are possibly not implemented as necessary and/or not adequately followed by system users, especially at local level. Implementation of routine procedures is a priority action that should be taken by TB surveillance officials at administrative level working together with information system managers.3,6
The study results showed quality issues in SINAN-TB databases in all Brazilian states. The reduction in annual TB incidence rates resulting from record linkage, classification and removal of improper repeat records from the SINAN-TB database may actually have been even greater, since there were unclassified repeat records and plans of health unit code changes were not available for all states. It is also likely that repeat records were left undetected in the linkage process, as there is no gold standard to ascertain the sensitivity of Link-Plus. Preliminary studies in the SINAN database (unpublished data) showed that its sensitivity was comparable to that obtained using the Levenshtein distance algorithm applied to patient's name, patient's mother's name and date of birth.7
Alternatively, it is possible that the magnitude of the reduction in TB annual incidence rates was overestimated if linked records of different patients were misclassified as repeat records. Misclassification of repeat records as true duplication or transfer between health units may also have contributed to overestimation. Though possible, these scenarios are unlikely given the study's conservative approach.
In a probabilistic approach, exact agreement between linkage variables is not required for record linkage. However, improper classification of records as concerning the same patient was prevented by the investigators' subsequent check of matched records. Thorough manual removal of matched records helped to improve specificity without affecting sensitivity in finding repeat records in the SINAN-TB database.
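For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version; how the distance was applied in the unpublished comparison (for example, which threshold was used) is not stated in the source.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("SOUZA", "SOUSA"))  # 1
```

A small distance between two names, such as the single substitution above, suggests a spelling variant rather than a different person.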
With respect to repeat records classification, only relapses, returns after default and transfers between health units in different states would be actually expected in the core national database. The other categories found reflect flawed operation and management of information system at the different levels engaged in TB surveillance and control.
Although their reporting to SINAN is mandatory, there was missing information for the variables "date of notification," "type of system entry," and "health unit code". This can be explained by faulty system operation where corrupted files are generated due to inadequate use of tools to access the original database (Sinanw.GDB) which eventually damages the system. Errors may also occur due to the fact that some states use parallel reporting systems and data are passed on to SINAN with missing mandatory fields producing incomplete databases.
True duplication can arise when a patient receives care from different providers in the same health unit after the visit that elicited the first notification, for example when the patient returns to the unit for sputum collection or medicine supply. On these occasions health providers may file a new notification to be on the safe side, and both records are eventually entered in the database. However, if any of the main fields differ (notification number, date of notification, notifying municipality and unit), the system will not recognize the records as concerning the same patient and a duplicate will be generated.
Potential duplication in SINAN database can be ascertained using two different approaches. The first one is from listings of notifications including patient's name or their mothers' name in alphabetical order. The second approach is from listings of potential duplicates identified as having same information in a variable automatically created by the program. This automatic variable consists of a combination of patient's first and last name, gender and date of birth. Health providers engaged in TB surveillance are required to check these listings and investigate potential duplicates by contacting notifying health units so as to take the proper action. When such procedures are not routinely implemented, duplicates amass at all system levels.
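The second approach can be pictured as grouping records on a composite key. The sketch below is an assumption about how such a key might be built from the four components named above; the actual format of the variable generated by SINAN is not described in the source, and the field names are hypothetical.

```python
from collections import defaultdict


def duplicate_key(rec):
    """Composite key: first name, last name, gender and date of birth."""
    tokens = rec["name"].upper().split()
    first, last = (tokens[0], tokens[-1]) if tokens else ("", "")
    return (first, last, rec["gender"], rec["birth_date"])


def potential_duplicates(records):
    """Return lists of record ids that share the same composite key."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[duplicate_key(rec)].append(rec["id"])
    return [ids for ids in by_key.values() if len(ids) > 1]
```

A listing built this way only flags candidates; as described above, each one still has to be investigated with the notifying health units.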
The finding of records with codes of different health units but same information for the remaining variables was attributed to the introduction of new health unit codes and flawed standardization of new codes. Records with old codes were not replaced with records with new codes during vertical data transfer and thus duplicates were generated. After this programming failure was identified, SINAN national management provided the states an explanatory technical note and program correction application. The number of duplications generated due to this program failure yet to be removed from the database is now small. Therefore, the authors chose to classify this information together with other repeat records in the true duplication category. However, this program application was not widely used in the state of Goiás at the time of the study, producing 97.6% of true duplication and affecting the state's TB incidence rates.
In regard to repeat records related to transfers between health units, almost 90% were within municipalities or within the same state and these records should have been joined at local or state level, respectively. Routine procedures available in SINAN for identification and joining of transferred patient records are not automatically implemented and involvement of surveillance data management officials is necessary as these procedures require knowledge on specific TB surveillance notions. For adequate intervention the reasons why joining routine procedures are not available should be investigated.
It is also likely that, among repeat records classified as inconclusive, there are transfers between health units or returns after default which were not identified as such by the health system and therefore not properly recorded in SINAN. To overcome this problem, better TB patient follow-up is needed, and surveillance staff should report any case transfer or return after default to the source health units.
The variations in data quality of SINAN-TB databases observed between states should be carefully assessed, as all data management levels are equally responsible for generating repeat records. Moreover, the interpretation of data presented here is limited to the comparison of data quality related to repeat records. Analysis of underreporting, missing information, data inconsistency, and delayed information transmission was beyond the scope of the present study, but it would have been necessary if the aim of the study had been a comprehensive assessment of data quality in the SINAN-TB database.
Notwithstanding the limitations of the study approach, it is believed that the TB annual incidence rates found in this study are closer estimates of the actual rates than those obtained from crude data at both national and state levels. TB record linkage using SINAN's core tools or other related linkage applications should be continuously promoted to improve the quality of notification data.1
The present study is part of the Programa Nacional de Controle da Tuberculose (National Program for Tuberculosis Control) evaluation study, coordinated by the Department of Health Status Analysis and the Brazilian Ministry of Health Department of Epidemiological Surveillance. Data linkage using the approach described here made it possible to assess the baseline quality of the SINAN-TB database for 2000-2004 and to develop an intervention strategy implemented in the second half of the year 2005.
1. Camargo Jr KR, Coeli CM. Reclink: aplicativo para o relacionamento de bases de dados, implementando o método probabilistic record linkage. Cad Saude Publica. 2000;16(2):439-47.
2. Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183-210.
3. Laguardia J, Domingues CMA, Carvalho C, Lauerman CR, Macário E, Glatt R. Sistema de Informação de Agravos de Notificação (Sinan): desafios no desenvolvimento de um sistema de informação em saúde. Epidemiol Serv Saude. 2004;13(3):135-46.
Ana L Bierrenbach
Esplanada dos Ministérios, Bloco G Edifício Sede, 1º andar, sala 150
70058-900 Brasília, DF, Brasil
Note: See the Letter to the Editor in this Supplement.
1 Ministério da Saúde. Secretaria de Vigilância em Saúde. Sistema de Informações de Agravos de Notificação Normas e Rotinas. Brasília; 2004. (Série A: normas e manuais técnicos)
2 Ministério da Saúde. Fundação Nacional de Saúde. Tuberculose - Guia de Vigilância Epidemiológica. Brasília; 2002
3 Centers for Disease Control and Prevention. Link Plus fact sheet. Atlanta: 2004 [access on: Sept 02, 2005]. Available from: http://ftp.cdc.gov/pub/Software/RegistryPlus/Link_Plus/Link%20Plus.htm
4 Ministério da Saúde. Fundação Nacional de Saúde. Tuberculose - Guia de Vigilância Epidemiológica. Brasília; 2002
5 Departamento de Informática do Sistema Único de Saúde. Informações de saúde: demográficas e socioeconômicas. Brasília; 2005. [Access on Sept 2, 2005]. Available from: http://w3.datasus.gov.br/datasus/datasus.php?area=359A1B379C6D0E0F359G23HIJd6L26M0N&VInclude= ../site/infsaude.php
6 Glatt R. Análise da qualidade da base de dados de Aids do Sistema de Informação de Agravos de Notificação (Sinan) [master's dissertation]. Rio de Janeiro: Escola Nacional de Saúde Pública da FIOCRUZ; 2004.
7 Black PE. Levenshtein distance. In: Black PE, editor. Dictionary of Algorithms and Data Structures. Gaithersburg: National Institute of Standards and Technology; 2005. Available from: http://www.nist.gov/dads/HTML/Levenshtein.html [Accessed on Nov 3 2006]