I was writing a simple script to download dbSNP151 and do some light processing of that download. Specifically, I wanted to make sure all the SNPs have RSIDs in the name column, that there are no alternative chromosomes, e.g., chr6_cox_hap2, and then create a BED format file (tab-separated with 4 columns: chromosome, start position (bp), end position (bp), and ID, here the RSID). Like I said the script was quite simple, just 43 lines.
#!/bin/bash
# Download data from UCSC
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp151.txt.gz
pv -p snp151.txt.gz | gunzip > snp151.txt
orig_wc=$(wc -l snp151.txt)
orig_wc=$(printf "%'d\n" $orig_wc)
awk -F'\t' '$5 ~ /^rs/' snp151.txt > snp151_rsid_only.txt
rsid_wc=$(wc -l snp151_rsid_only.txt)
rsid_wc=$(printf "%'d\n" $rsid_wc)
awk -F'\t' '!($2 ~ /^chr[0-9A-Za-z]+_/)' snp151.txt > snp151_chr_clean.txt
chr_wc=$(wc -l snp151_chr_clean.txt)
chr_wc=$(printf "%'d\n" $chr_wc)
awk -F'\t' '{print $2"\t"$3"\t"$4"\t"$5}' snp151_chr_clean.txt > snp151_clean.bed
bed_wc=$(wc -l snp151_clean.bed)
bed_wc=$(printf "%'d\n" $bed_wc)
echo -e "Orig len: ${orig_wc}"
echo -e "RSID len: ${rsid_wc}"
echo -e "CHR len: ${chr_wc}"
echo -e "BED len: ${bed_wc}"
# Chr counts
for curr_chr in chr{1..22} chrX chrY; do
cnt=0
gc=$(grep -c "${curr_chr}" snp151_clean.bed)
if [[ "${gc}" -gt 0 ]]; then
cnt="${gc}"
fi
cnt=$(printf "%'d\n" $cnt)
echo -e "${curr_chr}\t${cnt}"
done
I then run it and get the following error:
./dl_snp151.sh: line 8: printf: snp151.txt: invalid number
This points to the ‘printf’ call on line 8 getting an invalid number. After a quick look at the code, I found an old nemesis, calling ‘wc’ without using ‘awk’ before processing the ‘wc’ output as a number.
orig_wc=$(wc -l snp151.txt)
orig_wc=$(printf "%'d\n" $orig_wc)
The utility ‘wc’ outputs the count followed by the name of the file it was called on, whether you are asking for the number of lines, bytes, or characters in the file.
wc -l dl_snp151.sh
43 dl_snp151.sh
wc -c dl_snp151.sh
1017 dl_snp151.sh
wc -m dl_snp151.sh
1017 dl_snp151.sh
Luckily this has a simple fix, by just piping the output of ‘wc’ to ‘awk’, you get just the numerical value and not the filename too.
orig_wc=$(wc -l snp151.txt | awk '{print $1}' )
orig_wc=$(printf "%'d\n" $orig_wc)
So after updating a few instances in the script I made this mistake, it now runs perfectly.
#!/bin/bash
# Download data
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp151.txt.gz
pv -p snp151.txt.gz | gunzip > snp151.txt
orig_wc=$(wc -l snp151.txt | awk '{print $1}' )
orig_wc=$(printf "%'d\n" $orig_wc)
awk -F'\t' '$5 ~ /^rs/' snp151.txt > snp151_rsid_only.txt
rsid_wc=$(wc -l snp151_rsid_only.txt | awk '{print $1}' )
rsid_wc=$(printf "%'d\n" $rsid_wc)
awk -F'\t' '!($2 ~ /^chr[0-9A-Za-z]+_/)' snp151.txt > snp151_chr_clean.txt
chr_wc=$(wc -l snp151_chr_clean.txt | awk '{print $1}' )
chr_wc=$(printf "%'d\n" $chr_wc)
awk -F'\t' '{print $2"\t"$3"\t"$4"\t"$5}' snp151_chr_clean.txt > snp151_clean.bed
bed_wc=$(wc -l snp151_clean.bed | awk '{print $1}')
bed_wc=$(printf "%'d\n" $bed_wc )
echo -e "Orig len: ${orig_wc}"
echo -e "RSID len: ${rsid_wc}"
echo -e "CHR len: ${chr_wc}"
echo -e "BED len: ${bed_wc}"
# Chr counts
for curr_chr in chr{1..22} chrX chrY; do
cnt=0
gc=$(grep -c "${curr_chr}" snp151_clean.bed)
if [[ "${gc}" -gt 0 ]]; then
cnt="${gc}"
fi
cnt=$(printf "%'d\n" $cnt)
echo -e "${curr_chr}\t${cnt}"
done
...
Orig len: 683,635,300
RSID len: 683,635,300
CHR len: 660,963,690
BED len: 660,963,690
chr1 277,198,171
chr2 87,835,164
chr3 45,423,006
chr4 43,658,468
chr5 40,974,723
chr6 38,316,687
chr7 36,568,742
chr8 34,770,865
chr9 28,774,967
chr10 30,525,909
chr11 31,276,099
chr12 30,325,780
chr13 22,389,392
chr14 20,384,489
chr15 19,100,663
chr16 21,010,247
chr17 18,595,923
chr18 17,708,515
chr19 14,227,801
chr20 14,550,626
chr21 8,717,936
chr22 9,065,231
chrX 26,167,218
chrY 1,273,382
I tend to run into this after coding in some other language for a bit and then coming back to Bash. It’s just a simple mistake luckily with a simple solution.
Be First to Comment