User Notes


Purpose of Merging
Unique Identifiers in NHIS Public Use Files
Linking Keys in IHIS
Merging Variables from Original NHIS Public Use Files to IHIS Data Files

  1. Obtain original NHIS public use files
  2. Download and Edit Merging Syntax Files
    Person (Stata) Linking Files
    Person (SAS) Linking Files
    Household (Stata) Linking Files
    Household (SAS) Linking Files
    Mortality (Stata) Linking Files
    Mortality (SAS) Linking Files
  3. Merge NHIS data to IHIS data

Checking Merge Results
Merge type 1: One-to-one merge within a year
Merge type 2: Many-to-one merge within a year
Example of merging variables from NHIS data with a multiple-year IHIS data file
Help with Merging


Purpose of Merging

In some circumstances, users may want to link additional variables from the original National Health Interview Survey (NHIS) public use files (that are not yet in the Integrated Health Interview Series (IHIS) system) to an IHIS data extract. A linking key must be used for this purpose. IHIS has created linking keys from the series of original NHIS variables that are used to uniquely identify households or individuals. However, to ensure correct linkage, unique identifiers identical to those created in the IHIS data must be generated in the NHIS public use data. While the text that follows is relevant to both person-level merging and household-level merging, most of the discussion emphasizes person-level merging, as that is the most likely need for IHIS users. However, the same principles generally apply to household-level merging.

Unique Identifiers in NHIS Public Use Files

Unique identifiers in NHIS data vary slightly across time, due to changes in the variables released in the public use data files. Please refer to Tables 1, 2, and 3 for details. Following the specified sequence of linking variables is critical for creating an IHIS-compatible unique identifier in the NHIS data. The variable names of these unique identifiers in the NHIS data should not be changed if users intend to use the IHIS Stata and SAS syntax files provided here for merging.

Table 1: Household-level unique identifiers in NHIS data

Year                     File   Variable sequence
1969                     H      quarter psurandr weekcen segment hhid
1970–1991, 1993, 1994    H      quarter psunumr weekcen segnum hhnum
1992                     H      year quarter psunumr weekcen segnum hhnum
1995–1996                H      hhid
1997–2006                H      hhx

Table 2: Person-level unique identifiers in NHIS data

Year                     File   Variable sequence
1969                     P      quarter psurandr weekcen segment hhid pernum
1970–1991, 1993, 1994    P      quarter psunumr weekcen segnum hhnum pnum
1992                     P      year quarter psunumr weekcen segnum hhnum pnum
1995–1996                P      hhid pnum
1997–2003                P      hhx fmx px
2004–2006                P      hhx fmx fpx

Table 3: Mortality-level unique identifiers in NHIS data

Year                     File   Variable sequence
1986–1996                M      publicid
1997–2000                M      publicid2

Linking Keys in IHIS

IHIS has taken the different unique identifiers across years of NHIS data into account and generated linking keys in the IHIS data. There are two linking keys in IHIS. The first linking key is NHISHID, which is a unique identifier for household records. The second linking key is NHISPID, which is a unique identifier for the person records. Each IHIS linking key uniquely identifies households or persons, respectively, within a year.

Certain modifications to the original unique identifiers in NHIS data have been made in IHIS to produce linking keys that are coded comparably across the multiple years of data while still uniquely identifying records. These modifications include:

  1. Concatenating the original unique identifier variables and generating a single string or character variable as the linking key;
  2. Padding each component variable with leading zeros to achieve comparable width across years; and
  3. Replacing erroneous characters in a handful of cases.

These same modifications must be applied to the NHIS data to ensure proper linkage. To help users merge variables from the original NHIS public use data with IHIS data, linking syntax files for each year from 1970 to 2006 have been provided in Stata and SAS format. See below for an overview of the merging process and a discussion of the merging syntax files, with annotated examples.
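The first two modifications (concatenation and zero-padding) can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the IHIS-provided Stata or SAS syntax. The function name `make_linking_key` and the component widths shown are hypothetical; the actual widths and any character replacements are defined in the year-specific linking files.

```python
# Illustrative sketch only. The widths below are hypothetical, not the
# widths IHIS actually uses; consult the year-specific linking files.

def make_linking_key(record, components):
    """Concatenate identifier components into one zero-padded string key.

    `components` is an ordered list of (variable_name, width) pairs; the
    order must follow the variable sequence for the year in question.
    """
    parts = []
    for name, width in components:
        value = str(record[name]).strip()
        if len(value) > width:
            raise ValueError(f"{name}={value!r} is wider than {width}")
        parts.append(value.zfill(width))  # pad with leading zeros
    return "".join(parts)

# Hypothetical widths for the 1997-2003 person-level sequence: hhx fmx px.
PERSON_1997_2003 = [("hhx", 6), ("fmx", 2), ("px", 2)]

record = {"hhx": 123, "fmx": 1, "px": 2}
key = make_linking_key(record, PERSON_1997_2003)
print(key)  # 0001230102
```

Storing the key as a string rather than a number is what preserves the leading zeros, which is why the linking key is described above as a string or character variable.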

Merging Variables from Original NHIS Public Use Files to IHIS Data Files

There are three general steps that users need to take to merge variables from the original NHIS public use file to an IHIS data file:

  • Obtain the NHIS data;
  • Download and edit the merging syntax files for Stata or SAS; and
  • Merge the data files.

The discussion that follows will mostly focus on person-level merging, since that is likely to be the most common need for IHIS users. However, the same principles generally apply to household-level merging.

  1. Obtain original NHIS public use files

    Original NHIS public use files can be downloaded from the National Center for Health Statistics (NCHS).

     

  2. Download and Edit Merging Syntax Files

    To help users properly link variables from the original NHIS public use data with IHIS data, linking syntax files for each year from 1970 to 2006 have been provided in Stata and SAS format.

    These linking files will work with multiple years of IHIS data if users merge on YEAR and NHISPID for person-level files. YEAR and NHISHID are likewise needed for merging household-level files. Users can copy and paste each individual linkage program to a single file as needed, including programming statements for whichever years of data are required for a particular research project.

    Person (Stata) Linking Files
    Person (SAS) Linking Files
    Household (Stata) Linking Files
    Household (SAS) Linking Files

    MORTALITY FILES: All NHIS public use files can be linked using the person-level or household-level linking keys, with the exception of the recently released NHIS Mortality Files. To link the NHIS mortality files to IHIS, another set of linking files is required because the unique ID in the NHIS mortality files was constructed in a different way.

    Mortality (Stata) Linking Files
    Mortality (SAS) Linking Files

    The merging syntax files contain four sections. Users will need to edit sections 1 and 2 for their specific research project. Sections 3 and 4 will run based on user specifications made in the preceding two sections. The following discussion provides a general overview of these four sections.

    Section 1 is where users will specify the directory location and name of the data files with which they are working.

    This specification includes the original NHIS data file to be merged, the IHIS data file, and the newly created merged data file.

    Section 2 is where users will specify the names of the specific variables from the NHIS data that they want to merge to their IHIS data.

    Users are cautioned that variables in the NHIS public use files may or may not be comparable over time. Users are strongly advised against merging entire NHIS data files. Rather, users should identify variables from the NHIS source data for which a recoding plan is already devised. Users should then merge only this subset of variables with IHIS data. We strongly recommend that users rename variables in the NHIS source data before the merge, so they can clearly distinguish NHIS variables from IHIS variables.
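The advice above (keep only a planned subset of variables and rename them before merging) can be sketched as follows. This is a Python illustration, not the IHIS syntax; the function `select_and_rename`, the `nhis_` prefix, and the variable names `varA`, `varB`, and `varC` are all hypothetical.

```python
def select_and_rename(rows, id_vars, keep_vars, prefix="nhis_"):
    """Keep the identifier variables plus a chosen subset, prefixing the subset.

    Identifier variables keep their original names, since the linking
    syntax expects them unchanged; only the substantive variables are renamed.
    """
    out = []
    for row in rows:
        new_row = {v: row[v] for v in id_vars}
        for v in keep_vars:
            new_row[prefix + v] = row[v]
        out.append(new_row)
    return out

# Hypothetical NHIS source records with three substantive variables.
nhis_rows = [
    {"hhx": "000123", "fmx": "01", "px": "02", "varA": 1, "varB": 7, "varC": 0},
]

# Only varA and varB have a recoding plan, so varC is left behind.
subset = select_and_rename(nhis_rows, ["hhx", "fmx", "px"], ["varA", "varB"])
print(subset)
```

The prefix makes the origin of each variable obvious after the merge, so NHIS source variables are not mistaken for harmonized IHIS variables.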

    Section 3 contains syntax to prepare the NHIS data for each specific year. Users do not need to make any changes in this section. The syntax checks whether there are duplicates of the unique identifiers and records this information in the log. For person-level files, there should not be any duplicates. For episode-level files, such as conditions or doctor visits, duplicates will occur because an individual can have none, one, or many records. As mentioned previously, linking keys need to be created in the NHIS source data to ensure correct linking to the IHIS data. The code in this section also generates linking keys that are identical to the linking keys in IHIS for the same year.

    Finally, the syntax in this section checks for duplicates in the newly created linking key and writes this information to the log file. Results of this second duplicates check should be the same as the first. Next, the user-selected variables (specified in section 2) are kept, and this modified NHIS data file is saved as a temporary file for the merge.
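The duplicate checks described for section 3 amount to counting how often each key value occurs. A rough Python sketch of such a check is shown below; the function name `duplicate_report` and the toy keys are hypothetical, and the real syntax files write equivalent information to the Stata or SAS log.

```python
from collections import Counter

def duplicate_report(keys):
    """Summarize how many key values occur more than once."""
    counts = Counter(keys)
    dups = {k: n for k, n in counts.items() if n > 1}
    return {
        "records": len(keys),
        "distinct_keys": len(counts),
        "duplicated_keys": len(dups),
        # Records beyond the first occurrence of each duplicated key.
        "surplus_records": sum(n - 1 for n in dups.values()),
    }

# Person-level file: every key should appear exactly once.
person_report = duplicate_report(["A", "B", "C"])

# Episode-level file (for example, conditions): repeated keys are expected.
condition_report = duplicate_report(["A", "A", "A", "C"])

print(person_report)
print(condition_report)
```

Running the same report on the original identifier variables and on the newly built linking key should give identical counts; a difference would suggest the key was constructed incorrectly.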

    Section 4 contains code to merge the data files and to assess the quality of the merge. Users do not need to make any changes in this section. The syntax is written so that a user's specified IHIS data file is accessed, duplicates of the linking key in the IHIS data file are assessed, and the results are written to the log. The modified NHIS data file, with a subset of variables, is then merged to the IHIS data file. Syntax has been written to assess the status of the merge, and the results are written to the log. Duplicate checks and merge statistics can be reviewed by the user to evaluate the status of the merge. Additional information about interpreting the merge results is given below.

  3. Merge NHIS data to IHIS data

    Once the merging syntax files are edited with user specifications, these files can be run to complete the merge. This section contains a discussion of the two types of merging users may encounter and how to assess the quality of the merge, using the statistics produced by the merging syntax files.

    There are two main types of merges possible when combining NHIS source data with IHIS data. The first merge type is a one-to-one merge (for example, merging person-level variables from the NHIS person files or sample adult files, where there is only one possible record per person, to the IHIS data). A second merge type is a many-to-one merge (for example, merging NHIS condition files where there can be none, one, or many condition records for a person in the IHIS data).

    As discussed earlier, NHISHID and NHISPID uniquely identify households and persons, respectively, within each year. When using multiple years of IHIS data, there is no need to subset multiple year data into single year files for proper merging. However, users must include the variable YEAR in combination with NHISHID or NHISPID for linking to be successful.
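Why YEAR must accompany the linking key can be seen in a small Python sketch. The data are invented, the function `merge_on_year_and_key` is hypothetical, and the example assumes a one-to-one merge in which the same NHISPID value happens to appear in two different years.

```python
def merge_on_year_and_key(master, using, key="nhispid"):
    """Attach `using` variables to `master` records matched on (year, key).

    Assumes at most one `using` record per (year, key) pair.
    Master values win if a variable name appears in both files.
    """
    lookup = {(r["year"], r[key]): r for r in using}
    merged = []
    for row in master:
        match = lookup.get((row["year"], row[key]), {})
        merged.append({**match, **row})
    return merged

# The same key value occurs in both years, so the key alone is ambiguous.
ihis = [
    {"year": 2004, "nhispid": "0001", "age": 30},
    {"year": 2005, "nhispid": "0001", "age": 52},
]
nhis = [
    {"year": 2004, "nhispid": "0001", "nhis_x": 1},
    {"year": 2005, "nhispid": "0001", "nhis_x": 9},
]

merged = merge_on_year_and_key(ihis, nhis)
for row in merged:
    print(row["year"], row["nhispid"], row["age"], row["nhis_x"])
```

Matching on the key alone would pair each IHIS record with whichever NHIS record was seen last for that key, regardless of year.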

Checking Merge Results

Users should review the results of each merge, to ensure that the merge occurred as expected. After merging, tabulating the frequencies of the variable _merge within a year will allow users to assess the status of the merge. The values of the _merge variable report the merge status for each record. Values of _merge are as follows:

_merge = 1  observation in master dataset only (the IHIS data)
_merge = 2  observation in merging dataset only (the original NHIS data)
_merge = 3  observation in both master (IHIS) and merging (NHIS) datasets  
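The three status codes can be reproduced with simple set logic. The following Python sketch is an illustration under stated assumptions rather than the code in the syntax files: the functions `merge_status` and `tabulate_by_year` are hypothetical, keys are represented as (year, identifier) pairs, and each key is assumed to occur at most once per file.

```python
from collections import Counter

def merge_status(master_keys, using_keys):
    """Return {key: code} with 1 = master only, 2 = using only, 3 = both."""
    master, using = set(master_keys), set(using_keys)
    status = {}
    for k in master | using:
        if k in master and k in using:
            status[k] = 3
        elif k in master:
            status[k] = 1
        else:
            status[k] = 2
    return status

def tabulate_by_year(status):
    """Count status codes within each year; keys are (year, id) pairs."""
    table = {}
    for (year, _), code in status.items():
        table.setdefault(year, Counter())[code] += 1
    return table

# Toy data: three IHIS persons, one of whom appears in the NHIS file.
ihis_keys = [(2005, "a"), (2005, "b"), (2005, "c")]
nhis_keys = [(2005, "a")]

table = tabulate_by_year(merge_status(ihis_keys, nhis_keys))
for year in sorted(table):
    for code in (1, 2, 3):
        print(year, code, table[year][code])
```

A nonzero count for code 2 would mean that some NHIS records found no IHIS counterpart, which usually signals a problem with how the linking key was built.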

Merge type 1: One-to-one merge within a year

There are two types of one-to-one merges that users may see. The first is an exact match, and the second is a subset match.

An exact match occurs when there is exactly one record in the original NHIS data file for each individual record in the IHIS data. For example, users merging additional variables from the NHIS person files with IHIS data will have exactly one record per person per year in both the NHIS data and the IHIS data. After merging, results like those described in the following examples should show in the results window and the log file.

For example, the following merge assessment statistics will occur when merging additional variables from the 1994 NHIS person file to IHIS data. These statistics will appear in the results window and the log file. All records have a value of _merge=3, since all observations are in both the IHIS data (master file) and the NHIS data (merging file). The frequency of observations in _merge=3 should equal both the total number of observations in the IHIS file for the specified year and the total number of observations in the NHIS file being merged.

Stata example:

. bysort year: tab _merge

-> year = 1994

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |    116,179      100.00      100.00
------------+-----------------------------------
      Total |    116,179      100.00

SAS example:

The FREQ Procedure

DATA SET SOURCE FOR OBS

                                  Cumulative   Cumulative
_merge   Frequency    Percent      Frequency      Percent
     3      116179     100.00         116179       100.00

A subset match occurs when the NHIS data file only represents a sub-sample of the NHIS survey respondents in a given year. For example, the 2005 NHIS cancer supplement file includes only a subset of adults from the 2005 NHIS data. Users who wish to merge this supplement with IHIS data will only have matching records for a subset of those in the original NHIS data. While this is still a one-to-one merge, the merge will only occur for this subset of adults in the IHIS data. The merge proceeds in the same way as for an exact match, but the merge results differ, as the following example shows.

For example, the following merge assessment statistics will occur when merging variables from the 2005 cancer supplement to IHIS data. These statistics will appear in the results window and the log file. Records now have values of _merge=1 and _merge=3, since observations either are only in the IHIS data (master file) or are in both the IHIS data (master file) and the NHIS data (merging file). The frequency of observations for _merge=3 should equal the total number of observations in the NHIS file that was merged (n = 31,428). The frequency of total observations should equal the total number of person-record observations in the IHIS file for the specified year (n = 98,649).

Stata example:

. bysort year: tab _merge

-> year = 2005

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     67,221       68.14       68.14   IHIS only
          3 |     31,428       31.86      100.00   IHIS and Cancer Supp
------------+-----------------------------------
      Total |     98,649      100.00

SAS example:

The FREQ Procedure

DATA SET SOURCE FOR OBS

                                  Cumulative   Cumulative
_merge   Frequency    Percent      Frequency      Percent
     1       67221      68.14          67221        68.14   IHIS only
     3       31428      31.86          98649       100.00   IHIS and Cancer Supp

Merge type 2: Many-to-one merge within a year

Some NHIS data files, such as condition files and doctor visit files, contain multiple records for some persons. When merging these files, the IHIS data will be expanded to represent multiple records for those individuals who had multiple records in the merging file. Duplicates in the linking key are now expected for individuals who have more than one record.

As an example, the following merge assessment statistics will occur when merging variables from the 1974 condition file to IHIS data. These statistics will appear in the results window and the log file. Checking the merge status is more difficult in this situation. The frequency of observations for _merge=3 should equal the total number of observations in the original NHIS data that was merged (n = 37,453). The total number of observations (n = 126,571) should now be larger than the number of observations in the IHIS data for the specified year (n = 116,287) but smaller than the sum of the number of records in the master file (n = 116,287) and the number of records in the merging file (n = 37,453). This is because some individuals have multiple records, some have a single record, and some have no record in the merging file.

Stata example:

. bysort year: tab _merge

-> year = 1974

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     89,118       70.41       70.41
          3 |     37,453       29.59      100.00
------------+-----------------------------------
      Total |    126,571      100.00

SAS example:

The FREQ Procedure

DATA SET SOURCE FOR OBS

                                  Cumulative   Cumulative
_merge   Frequency    Percent      Frequency      Percent
     1       89118      70.41          89118        70.41
     3       37453      29.59         126571       100.00

 

Example of merging variables from NHIS data with a multiple-year IHIS data file

When merging data from multiple years to a multi-year IHIS data file, users can copy and paste syntax from each individual year to a single syntax file. In the first merge, the user must specify the directory path and name of their IHIS data file (see section 1.3 in the example below) and specify the directory path of the final merged data file they are creating (see section 1.4 in the example below). In the second and any subsequent merge, the user must use the directory path and name of the final merged data file specified in the previous merge (section 1.4) as the IHIS data master file (section 1.3). This will ensure that variables from multiple years of NHIS data are merged with a single multi-year IHIS data file.

The following links provide a simplified example of merging two years of NHIS data to a multi-year IHIS file. For this example, the IHIS data file contains variables from 2004 and 2005. We want to merge additional variables from the NHIS 2004 and NHIS 2005 data files. Users can run each of the year-specific syntax files individually. Alternatively, users can copy and paste syntax from the separate merging syntax files to a single merging syntax file. Users are urged to specify carefully all directory paths and file names, paying special attention to the master data file to be specified for the second year: it is now the merged file produced by the previous merge.

Stata example (pdf)
SAS example (pdf)
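The chaining logic described above, in which the output of one merge becomes the master file for the next, can be sketched in Python as follows. This is an in-memory illustration with invented data, not a substitute for the linked Stata and SAS examples; the function `merge_year` and the variable `nhis_x` are hypothetical.

```python
def merge_year(master, nhis_year_rows, year, key="nhispid"):
    """Attach one year's NHIS variables to the matching-year master records.

    Assumes a one-to-one merge: at most one NHIS record per key in a year.
    Records from other years pass through unchanged.
    """
    lookup = {r[key]: r for r in nhis_year_rows}
    out = []
    for row in master:
        extra = lookup.get(row[key], {}) if row["year"] == year else {}
        out.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return out

# Multi-year IHIS extract and two year-specific NHIS source files.
ihis = [
    {"year": 2004, "nhispid": "0001"},
    {"year": 2005, "nhispid": "0001"},
]
nhis_2004 = [{"nhispid": "0001", "nhis_x": 1}]
nhis_2005 = [{"nhispid": "0001", "nhis_x": 9}]

# The result of each merge is the master file for the next one.
master = ihis
for year, rows in [(2004, nhis_2004), (2005, nhis_2005)]:
    master = merge_year(master, rows, year)

for row in master:
    print(row)
```

Restarting the second merge from the original IHIS extract instead of the first merge's output would discard the 2004 variables, which is the mistake the guidance on the master data file is meant to prevent.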

Help with Merging

This user note and the accompanying Stata and SAS syntax files were written to provide general guidance and facilitate the process of merging variables from the original NHIS public use files to an IHIS data file. We attempted to anticipate issues that might occur in the most common merging scenarios. However, if problems arise, users are encouraged to contact IHIS for assistance.

For assistance, please e-mail us at: IHIS@pop.umn.edu

Please provide a brief description of the problem and attach the Stata or SAS log file for us to review.

Last revised: 19 Dec 2008