Multiple R Markdown Reports, Multiple Data Sets, Single File

Until recently I had yet to run into a scenario where I needed to use R Markdown to produce a “templated” set of reports. That is, a group of PDFs or HTML files with common R code (and therefore common report features, like figures, tables, summary statistics, etc.) executed dynamically on different data sets.

I encountered this problem when I was trying to automate a workflow for generating reports on publication data. The datasets are compiled individually for each department (Neurology, Pathology, etc.) but the final product (a PDF) features a common set of plots based on the input data set.

The brute force GUI solution here would be to:

  1. Open up RStudio
  2. Write an R Markdown report that reads in the first data set
  3. Click “Knit PDF”
  4. Point the read_csv() in the R Markdown file to the second data set
  5. Click “Knit PDF”
  6. Repeat steps 4 & 5 until all reports have been generated

I could have potentially > 50 data sets. So the process above wasn’t going to work for me.

Instead I started looking around for ways to loop through a list of files and generate multiple PDF reports. Each PDF would be based on data from each file.

There are several Stack Overflow posts that speak to this issue. One describes a similar problem, but in terms of knitr with Sweave. Another seems more relevant to my workflow (since I’m using R Markdown rather than Sweave) but explains how to create multiple reports from a single data set.

My solution borrows from the posts above. The basic premise here is that you can use a single R script to “drive” the rendering of your R Markdown file. Another piece of documentation I found does a great job detailing this idea.

All that said … here’s the workflow that works for me:

  1. Move all data files to a “data” subdirectory of my project root
library(gapminder)
library(dplyr)
library(readr)

gapminder %>%
  filter(continent == "Asia") %>%
  write_csv("data/asia.csv")

gapminder %>%
  filter(continent == "Americas") %>%
  write_csv("data/americas.csv")

gapminder %>%
  filter(continent == "Africa") %>%
  write_csv("data/africa.csv")

gapminder %>%
  filter(continent == "Europe") %>%
  write_csv("data/europe.csv")

gapminder %>%
  filter(continent == "Oceania") %>%
  write_csv("data/oceania.csv")
  1. Check that data is prepared correctly (i.e. all columns are named the same, missing values are represented consistently)

  2. Write an R Markdown file that is generic enough to work for each data set … and the title of in the YAML header for the Rmd can include data specific to the file being processed

     
      ---
    title: '`r unique(cont$continent)`'
    output: pdf_document
    ---
    library(ggplot2) 
    library(dplyr) 
    
    cont %>% 
      # average life expectancy and gdp per capita 
      group_by(country) %>% 
      summarise(lifeExp = mean(lifeExp), gdpPercap = mean(gdpPercap)) %>% 
      # scatterplot of life expectancy by gdp per capita 
      ggplot(aes(lifeExp, gdpPercap, label = country)) + 
      geom_text() + 
      theme_minimal() 
  3. Write an R script that:

    1. Loops through the data folder
    2. Creates an individual object (pubs) based on the given file in the loop
    3. Renders a PDF (using rmarkdown::render()) that is named according to the raw data file.
library(knitr)
library(readr)
library(dplyr)

filelist <- list.files("data")

# loop through the file list to read in data and clean it up

for (file in filelist) {
  
  fp <- paste("data/", file, sep="")
  
  # read in continents
  cont <- 
    read_csv(fp) %>%
    # look only at data since 1980
    filter(year >= 1980)
  
  rmarkdown::render(input = "multireport.Rmd", 
                    output_format = "pdf_document",
                    output_file = paste0(gsub(".csv","", file), ".pdf"),
                    output_dir = "reports")
  
}
  1. Run the R script
source("multireports.R")

Related