Filtering Lines in One File Based on Matching Conditions in Another File Using AWK
In this article, we will explore how to use the AWK scripting language to filter lines in one file based on matching conditions specified in another file. We’ll go through a step-by-step explanation of the problem, discuss the limitations of the provided R code, and then delve into the AWK solutions offered.
Understanding the Problem
We have two files: file1 with 511 lines and file2 with approximately 12,500,003 lines. The contents of these files are as follows:
file1
chr start end
1 1227897 2779043
1 6644723 8832944
1 11067792 11372913
1 17287414 17342924
1 23254576 23590651
file2
CHR POS REF A1 OBS_CT BETA SE P ALT_FREQS RSID_UKB
1 58396 T C 382851 0.0882097 0.0677502 0.192923 0.000249012 rs570371753
1 91588 G A 382852 0.265908 0.0879796 0.00250811 0.000148375 rs554639997
1 713979 C G 382837 0.00630607 0.0925289 0.945664 0.000138059 rs117217250
1 715265 C T 377557 0.00260829 0.00617561 0.672768 0.0331599 rs12184267
1 715367 A G 377954 0.00212642 0.00615857 0.729886 0.0333038 rs12184277
1 717485 C A 377980 0.00449142 0.00615965 0.465899 0.0332908 rs12184279
1 6702159 G T 378749 0.00305772 0.00604916 0.613223 0.0345562 rs116801199
1 9902231 G C 378573 0.00216983 0.00607117 0.720793 0.0342995 rs12565286
1 23364524 C G 377155 0.00505093 0.00588132 0.390447 0.0368034 rs2977670
Our goal is to remove every line in file2 whose chromosome (CHR) matches a line in file1 and whose position (POS) lies between that line's start and end, with both bounds included.
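To make the goal concrete, here is a minimal reference sketch in awk. It assumes whitespace-separated files with one header line each and inclusive range bounds; the array names (chr, lo, hi) and the scratch directory are illustrative, not part of the original solutions. It checks every position against every range, so it is meant as a correctness baseline on the sample data rather than a fast solution.

```shell
# Build the sample files in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
chr start end
1 1227897 2779043
1 6644723 8832944
1 11067792 11372913
1 17287414 17342924
1 23254576 23590651
EOF
cat > "$tmp/file2" <<'EOF'
CHR POS REF A1 OBS_CT BETA SE P ALT_FREQS RSID_UKB
1 58396 T C 382851 0.0882097 0.0677502 0.192923 0.000249012 rs570371753
1 91588 G A 382852 0.265908 0.0879796 0.00250811 0.000148375 rs554639997
1 713979 C G 382837 0.00630607 0.0925289 0.945664 0.000138059 rs117217250
1 715265 C T 377557 0.00260829 0.00617561 0.672768 0.0331599 rs12184267
1 715367 A G 377954 0.00212642 0.00615857 0.729886 0.0333038 rs12184277
1 717485 C A 377980 0.00449142 0.00615965 0.465899 0.0332908 rs12184279
1 6702159 G T 378749 0.00305772 0.00604916 0.613223 0.0345562 rs116801199
1 9902231 G C 378573 0.00216983 0.00607117 0.720793 0.0342995 rs12565286
1 23364524 C G 377155 0.00505093 0.00588132 0.390447 0.0368034 rs2977670
EOF

kept=$(awk '
  # First file: remember every range, skipping the header line.
  NR == FNR && FNR > 1 { n++; chr[n] = $1; lo[n] = $2 + 0; hi[n] = $3 + 0 }
  NR == FNR { next }
  # Second file: always keep the header.
  FNR == 1 { print; next }
  # Drop the record if its position falls inside any range on the same chromosome.
  {
    for (i = 1; i <= n; i++)
      if ($1 == chr[i] && $2 >= lo[i] && $2 <= hi[i]) next
    print
  }' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

On this sample, the records for rs116801199 (inside 6644723 to 8832944) and rs2977670 (inside 23254576 to 23590651) are dropped, while rs12565286 at position 9902231 is kept because it falls in the gap between two ranges.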
Limitations of R Code
The provided R code takes approximately 1 hour to run. It loops over the rows of file1 and, on each iteration, rescans all of file2 with a vectorized comparison and builds a new filtered copy of it. With 511 ranges and about 12.5 million records, that amounts to hundreds of full passes over the large file.
for (row in 1:nrow(file1)) {
  # The trailing comma selects rows; without it the data frame is indexed by column.
  file2 <- file2[!(file2$CHR == file1$chr[row] &
                   file2$POS >= file1$start[row] &
                   file2$POS <= file1$end[row]), ]
}
AWK Solutions
Ugly but Quick Solution
The first solution offered builds an array indexed by every (CHR, position) pair covered by a range in file1 (from start to end). It is quick to write, but it runs slowly and uses a lot of memory because it creates one key for every position in every range.
awk '(NR==FNR) { for(i=$2;i<=$3;++i) a[$1,i]; next }
     (FNR==1) { print; next }
     !(($1,$2) in a)' file1 file2
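To get a feel for why one key per position is expensive, one can add up the lengths of the ranges, which is roughly the number of array keys this approach creates when the ranges do not overlap. The sketch below does this for the five sample ranges only; the temporary file is an illustrative stand-in for file1.

```shell
# Illustrative stand-in for file1, holding only the five sample ranges.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr start end
1 1227897 2779043
1 6644723 8832944
1 11067792 11372913
1 17287414 17342924
1 23254576 23590651
EOF

# Sum the length of every range (both endpoints counted), skipping the header.
keys=$(awk 'NR > 1 { n += $3 - $2 + 1 } END { print n }' "$tmp")
echo "$keys"   # 4436078
rm -f "$tmp"
```

These five ranges alone already need about 4.4 million keys, and the full file1 has 511 lines.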
Quicker Solution
A quicker approach stores, for each chromosome, all of its ranges in a single string, so a[chr] holds "start1 end1 start2 end2 ...". For every record of file2, it saves the record in t, the chromosome in c (c=$1) and the position in p (p=$2). Assigning a[c] to $0 makes awk re-split the record into fields, so $1 = 1227897, $2 = 2779043, etc. All that remains is to step through the fields two at a time and check whether p lies in one of the ranges. If it does, the record is skipped; otherwise t is printed.
awk '(NR==FNR) { a[$1]=a[$1] FS $2 FS $3; next }
     { t=$0; c=$1; p=$2; $0=a[c] }
     { for(i=1;i<=NF;i+=2) if( $i<=p && p<=$(i+1) ) next; print t }' file1 file2
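The field trick used here can be seen in isolation. In the small sketch below, the variable name ranges and the dummy input line are illustrative; it only shows that assigning a string to $0 makes awk split it into fields again, which is what lets the range list be walked with $i and $(i+1).

```shell
# Assigning to $0 makes awk re-split the record into fields.
out=$(echo "dummy input" | awk '{
  ranges = " 1227897 2779043 6644723 8832944"   # same shape as a[c], leading separator included
  $0 = ranges                                   # fields are recomputed from the new record
  print NF, $1, $4
}')
echo "$out"   # 4 1227897 8832944
```

With the default field separator, leading blanks are ignored, so the leading separator in the stored string does not produce an empty first field.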
Optimized Solution
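The third solution, suggested by SiegeX in the comment section, further optimizes performance by assuming that file2 is sorted by position within each chromosome. It again stores each chromosome's ranges under a[$1], this time in reverse order, so that the lowest range is the last pair of fields. For every record it walks the ranges from the end of the list and permanently drops any range that ends before the current position, since no later record can fall inside it. This also relies on the ranges in file1 being sorted by start, as they are in the sample.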
awk '(NR==FNR) { a[$1]=$2 FS $3 FS a[$1]; next }
     { t=$0; c=$1; p=$2; $0=a[c] }
     { for(i=NF-1;i>0;i-=2) {
         # Range ends before p: remove it from the list for good.
         if ( $(i+1)<p ) { $i=$(i+1)=""; $0=$0; $1=$1; a[c]=$0 }
         # p lies inside the range: skip this record.
         else if ( $i<=p && p<=$(i+1) ) { next }
       }
       print t
     }' file1 file2
All three AWK solutions implement the same filter. The second and third avoid the per-position keys of the first, and the third, given a sorted file2, is likely to be the fastest approach for large files.
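If file2 is not already sorted, it can be sorted while keeping the header line on top. The sketch below uses a shortened three-column stand-in for file2 (the file name and contents are illustrative) and sorts by chromosome, then numerically by position, using only standard head, tail and sort options.

```shell
# Shortened, unsorted stand-in for file2 (illustrative data).
tmp=$(mktemp -d)
cat > "$tmp/file2" <<'EOF'
CHR POS REF
2 500 A
1 91588 G
1 58396 T
2 40 C
EOF

# Emit the header unchanged, then sort the remaining lines:
# key 1 (chromosome) as text, key 2 (position) numerically.
sorted=$( { head -n 1 "$tmp/file2"; tail -n +2 "$tmp/file2" | sort -k1,1 -k2,2n; } )
printf '%s\n' "$sorted"
rm -rf "$tmp"
```

The chromosome key is compared as text here, so the chromosomes themselves may not come out in numeric order. That is fine for the third solution, which only needs positions to be ascending within each chromosome.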
Conclusion
In this article, we have explored how to use the AWK scripting language to filter lines in one file based on matching conditions specified in another file. We discussed the limitations of the provided R code and presented three AWK solutions: an array keyed by every covered position, per-chromosome range lists that are re-split into fields for each record, and a variant that prunes ranges it no longer needs when file2 is sorted.
The first is the simplest to write, while the second and third are the ones suited to handling large datasets efficiently.
Additional Advice
- Sort your data by chromosome and position before filtering, so that solutions that rely on sorted input can be used.
- Store multiple ranges per key as a list (here, one space-separated string per chromosome) rather than creating one key per value.
- Let awk re-split a stored list into fields by assigning it to $0, instead of parsing the list by hand for every record.
Last modified on 2024-10-29