A simple mistake I occasionally still make when scripting in Bash

I was writing a simple script to download dbSNP151 and do some light processing of that download. Specifically, I wanted to make sure all the SNPs have RSIDs in the name column, that there are no alternative chromosomes, e.g., chr6_cox_hap2, and then create a BED format file (tab-separated with 4 columns: chromosome, start position (bp), end position (bp), and ID, here the RSID). Like I said the script was quite simple, just 43 lines.

#!/bin/bash
# Download data from UCSC
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp151.txt.gz

pv -p snp151.txt.gz  | gunzip  > snp151.txt

orig_wc=$(wc -l snp151.txt)
orig_wc=$(printf "%'d\n" $orig_wc)

awk -F'\t' '$5 ~ /^rs/' snp151.txt > snp151_rsid_only.txt

rsid_wc=$(wc -l snp151_rsid_only.txt)
rsid_wc=$(printf "%'d\n" $rsid_wc)

awk -F'\t' '!($2 ~ /^chr[0-9A-Za-z]+_/)' snp151.txt > snp151_chr_clean.txt

chr_wc=$(wc -l snp151_chr_clean.txt)
chr_wc=$(printf "%'d\n" $chr_wc)

awk -F'\t' '{print $2"\t"$3"\t"$4"\t"$5}' snp151_chr_clean.txt > snp151_clean.bed

bed_wc=$(wc -l snp151_clean.bed)
bed_wc=$(printf "%'d\n" $bed_wc)

echo -e "Orig len: ${orig_wc}"
echo -e "RSID len: ${rsid_wc}"
echo -e "CHR  len: ${chr_wc}"
echo -e "BED  len: ${bed_wc}"

# Chr counts
for curr_chr in chr{1..22} chrX chrY; do

  cnt=0
  gc=$(grep -c "${curr_chr}" snp151_clean.bed)
  
  if [[ "${gc}" -gt 0 ]]; then
    cnt="${gc}"
  fi
  
  cnt=$(printf "%'d\n" $cnt)
  
  echo -e "${curr_chr}\t${cnt}"
done

I then run it and get the following error:

./dl_snp151.sh: line 8: printf: snp151.txt: invalid number

This points to the ‘printf’ call on line 8 getting an invalid number. After a quick look at the code, I found an old nemesis, calling ‘wc’ without using ‘awk’ before processing the ‘wc’ output as a number.

orig_wc=$(wc -l snp151.txt)
orig_wc=$(printf "%'d\n" $orig_wc)

The utility ‘wc’ outputs the count followed by the name of the file it was called on, whether you are asking for the number of lines, bytes, or characters in the file.

wc -l dl_snp151.sh
43 dl_snp151.sh

wc -c dl_snp151.sh
1017 dl_snp151.sh

wc -m dl_snp151.sh
1017 dl_snp151.sh

Luckily this has a simple fix, by just piping the output of ‘wc’ to ‘awk’, you get just the numerical value and not the filename too.

orig_wc=$(wc -l snp151.txt | awk '{print $1}' )
orig_wc=$(printf "%'d\n" $orig_wc)

So after updating a few instances in the script I made this mistake, it now runs perfectly.

#!/bin/bash
# Download data
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp151.txt.gz

pv -p snp151.txt.gz  | gunzip  > snp151.txt

orig_wc=$(wc -l snp151.txt | awk '{print $1}' )
orig_wc=$(printf "%'d\n" $orig_wc)

awk -F'\t' '$5 ~ /^rs/' snp151.txt > snp151_rsid_only.txt

rsid_wc=$(wc -l snp151_rsid_only.txt | awk '{print $1}' )
rsid_wc=$(printf "%'d\n" $rsid_wc)

awk -F'\t' '!($2 ~ /^chr[0-9A-Za-z]+_/)' snp151.txt > snp151_chr_clean.txt

chr_wc=$(wc -l snp151_chr_clean.txt | awk '{print $1}' )
chr_wc=$(printf "%'d\n" $chr_wc)

awk -F'\t' '{print $2"\t"$3"\t"$4"\t"$5}' snp151_chr_clean.txt > snp151_clean.bed

bed_wc=$(wc -l snp151_clean.bed | awk '{print $1}')
bed_wc=$(printf "%'d\n" $bed_wc )

echo -e "Orig len: ${orig_wc}"
echo -e "RSID len: ${rsid_wc}"
echo -e "CHR  len: ${chr_wc}"
echo -e "BED  len: ${bed_wc}"

# Chr counts
for curr_chr in chr{1..22} chrX chrY; do

  cnt=0
  gc=$(grep -c "${curr_chr}" snp151_clean.bed)
  
  if [[ "${gc}" -gt 0 ]]; then
    cnt="${gc}"
  fi
  
  cnt=$(printf "%'d\n" $cnt)
  
  echo -e "${curr_chr}\t${cnt}"
done

...
Orig len: 683,635,300
RSID len: 683,635,300
CHR  len: 660,963,690
BED  len: 660,963,690
chr1    277,198,171
chr2    87,835,164
chr3    45,423,006
chr4    43,658,468
chr5    40,974,723
chr6    38,316,687
chr7    36,568,742
chr8    34,770,865
chr9    28,774,967
chr10   30,525,909
chr11   31,276,099
chr12   30,325,780
chr13   22,389,392
chr14   20,384,489
chr15   19,100,663
chr16   21,010,247
chr17   18,595,923
chr18   17,708,515
chr19   14,227,801
chr20   14,550,626
chr21   8,717,936
chr22   9,065,231
chrX    26,167,218
chrY    1,273,382

I tend to run into this after coding in some other language for a bit and then coming back to Bash. It’s just a simple mistake luckily with a simple solution.

A simple mistake I occasionally still make when scripting in Bash

Be First to Comment

Leave a Reply Cancel reply