Looping Across Files

When performing the same manipulation across many files, it can be useful to write a loop.

Let’s assume you have a file structure organized as follows:

.
+-- loop.sh
+-- loop.R
+-- raw
|   +-- sample1.bed
|   +-- sample2.bed
|   +-- sample3.bed
|   +-- sample4.bed
|   +-- sample5.bed
+-- processed
|   +-- 

You’d like to loop through all of the .bed files in the raw/ directory and, for each one, write the first and last columns to a .csv file in the processed/ directory that shares the same file prefix (e.g. “sample1”) as the given input.
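
Before writing the full loop, the core transformation is easy to sanity-check on a single hypothetical row, using the same awk one-liner that appears in the bash script below:

printf 'chr1\t100\t200\tfeature1\n' | awk 'BEGIN{FS="\t";OFS=","}{print $1, $NF}'
# prints: chr1,feature1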

To do so, you could write a bash script in a file called loop.sh:

#!/bin/bash

# loop through all .bed files in the raw dir
FILES=raw/*.bed

for f in $FILES
do
  # message
  printf "\n%s\n%s\n" "processing $f ..." "------------------------------"

  # get the input file name ...
  # ... and strip the extension to build the output name
  in_fn=$(basename -- "$f")
  out_fn="${in_fn%.*}"

  # use awk to print the first and last columns ...
  # ... switching the output separator to a comma (csv)
  awk 'BEGIN{FS="\t";OFS=","}{print $1, $NF}' "$f" > processed/"$out_fn".csv

done
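
With the directory layout above in place, a typical way to run the script from the project root is to make it executable first (invoking it with bash loop.sh works just as well):

chmod +x loop.sh
./loop.sh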

And in loop.R you could write something like this:

# loop through all .bed files in the raw dir
files <- list.files("raw", pattern = "\\.bed$", full.names = TRUE)

for (f in files) {

  # .bed files are tab-delimited and carry no header row
  bed <- read.delim(f, sep = "\t", header = FALSE)

  # build the output path from the input file prefix
  out_fp <- paste0("processed/",
                   tools::file_path_sans_ext(basename(f)),
                   ".csv")

  # keep only the first and last columns
  write.csv(bed[, c(1, ncol(bed))], row.names = FALSE, file = out_fp)
  
}
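
The R version can be run non-interactively from the same directory, for example with Rscript:

Rscript loop.R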

Both of the above should work and yield the following:

.
+-- loop.sh
+-- loop.R
+-- raw
|   +-- sample1.bed
|   +-- sample2.bed
|   +-- sample3.bed
|   +-- sample4.bed
|   +-- sample5.bed
+-- processed
|   +-- sample1.csv
|   +-- sample2.csv
|   +-- sample3.csv
|   +-- sample4.csv
|   +-- sample5.csv

However, your choice of method could have serious performance implications.

I tested this on dummy files that were roughly 100MB each (~ 2.5 million rows, 4 columns).
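
Those test files aren’t reproduced here, but a bash sketch along these lines (the row count, column values, and file names are assumptions chosen to match the description above) generates tab-separated dummy files of a similar shape:

# write five 4-column, tab-separated dummy files into raw/
mkdir -p raw processed
for i in 1 2 3 4 5
do
  awk 'BEGIN{OFS="\t"; for (n = 1; n <= 2500000; n++) print "chr1", n, n + 100, "feature" n}' > raw/sample"$i".bed
done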

The R method took ~90 seconds; the bash script finished in 30 seconds, roughly three times faster.
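
To benchmark the two approaches yourself, the shell’s time keyword gives a quick comparison (absolute numbers will vary with hardware and disk speed):

time bash loop.sh
time Rscript loop.R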
