The Centers for Disease Control (CDC) collects vital statistics (mortality and natality) and hosts these for public use. These data can be accessed via a web-based query builder or by download. The files are available as zip archives, and when uncompressed contain tabular data with observations stored in fixed width format.
This post documents the steps I took to prepare the 2016 CDC national natality data for an analysis in R.
The first step is to download the data necessary. In this case, I was interested in 2016 statistics. But it’s worth pointing out that the data for births in the United States was collected and is available for public use all the way back to 1968. This data is collated from standard birth certificates, so some fields may vary across years.
After downloading the zip file (133MB) and extracting the archive, the key step for me was reading the fixed width format into R. These data are tidy in the sense that each observation is stored in a row, and each feature in a column. But unlike some other tabular formats, there isn’t a standard separator between columns. Instead data must be parsed by the position (start and stop) of the column.
You can read data from this format into R as a
read.fwf. Alternatively, the
readr package has the
read_fwf() function, which is what I chose to use.
The CDC provides a user guide with documentation on the length, order, possible values and names for each column in the corresponding datasets. After looking through the files for the fields I needed, I put together a query for only the columns of interest.
library(readr) births2016 <- read_fwf(births2016"Nat2016PublicUS.c20170517.r20170913.txt", fwf_positions(start = c(75,490,521), end = c(76,491,521), col_names = c("maternal_age","ega", "antibiotics") ) )
If you wanted to read in all of the columns, you could use
fwf_empty() instead of
fwf_positions in the call to
births2016 <- read_fwf("Nat2016PublicUS.c20170517.r20170913.txt", fwf_empty("Nat2016PublicUS.c20170517.r20170913.txt") )
Keep in mind
fwf_empty can accommodate a “col_names” argument as well. But in this case there would be > 100 column names to provide, some of which are “filler” according to the dataset documentation.