Description
I am processing a VCF file to split the fields up to and including INFO (columns 1-8) into tab-separated columns. The example input has 250K variants.
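For illustration, with two hypothetical INFO keys DP and AF, the transformation I want looks like this:

# INFO as stored in the VCF:  DP=14;AF=0.5
# desired tab-separated output (fixed columns, then one column per INFO key):
# CHROM  POS  ID  REF  ALT  QUAL  FILTER  DP  AF
# chr1   101  .   A    G    50    PASS    14  0.5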
To do this I am using the code below:
###STEP1: Reading the VCF file took approx. ~3.4 min (202 sec elapsed)####
library(vcfR)      # queryMETA(), extract.info() and getFIX() below are also from vcfR
library(parallel)  # provides mclapply(), used in STEP3b

system.time(pop_vcf <- read.vcfR("pop.vcf.gz"))
user system elapsed
196.416 5.611 202.040
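Since only columns 1-8 are needed, one option that may cut the STEP1 time is to bypass vcfR's genotype parsing entirely and read the tab-delimited body directly. A minimal sketch, assuming data.table (plus R.utils for .gz support) is installed:

library(data.table)
# skip = "#CHROM" starts reading at the column-header line, ignoring the ## meta lines
pop_tab <- fread("pop.vcf.gz", sep = "\t", skip = "#CHROM",
                 select = 1:8, header = TRUE)

Note this returns the raw INFO strings without vcfR's metadata, so the INFO keys would then have to be parsed from the strings themselves (as in the sketch after STEP3a) rather than taken from queryMETA().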
###STEP2: Extracting the names (IDs) of all keys declared for the INFO field in the VCF header #####
my_INFOs <- grep("INFO", queryMETA(pop_vcf), value = TRUE)
my_INFOs <- sub("INFO=ID=", "", my_INFOs)
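At this point my_INFOs should be a plain character vector of INFO IDs, e.g. (hypothetical keys):

my_INFOs
# [1] "DP" "AF" "MQ"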
### STEP3a: Using lapply to extract the values for all the keys in the INFO column. This took ~5 min (302 sec) for the function below ####
system.time(my_INFOm <- matrix(unlist(lapply(my_INFOs, function(x){ extract.info(pop_vcf, element = x) })),
ncol = length(my_INFOs), byrow = FALSE))
user system elapsed
302.210 0.092 302.336
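Part of the cost here is likely that each extract.info() call scans the entire INFO column for a single key, so the runtime scales with (number of keys) x (number of variants). A single-pass alternative that splits each INFO string once and aligns the values to the declared keys may be faster; a rough sketch in base R (not vcfR's API, untested on your data):

info_raw <- getINFO(pop_vcf)                        # one raw INFO string per variant
info_kv  <- strsplit(info_raw, ";", fixed = TRUE)   # split into "key=value" tokens
my_INFOm <- t(vapply(info_kv, function(kv) {
  keys <- sub("=.*", "", kv)                        # text before the first "="
  vals <- sub("^[^=]*=?", "", kv)                   # text after it ("" for flag keys)
  vals[match(my_INFOs, keys)]                       # reorder; NA where a key is absent
}, character(length(my_INFOs))))

The column names can then be set from my_INFOs exactly as in STEP4.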
### STEP3b: The same implemented with mclapply; this used ~316 sec of CPU time but only ~170 sec elapsed ####
system.time(my_INFOm <- matrix(unlist(mclapply(my_INFOs, function(x){ extract.info(pop_vcf, element = x) })),
ncol = length(my_INFOs), byrow = FALSE))
user system elapsed
316.789 1.800 170.253
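One thing to check: without an explicit mc.cores argument, mclapply defaults to getOption("mc.cores", 2L), i.e. usually only 2 workers (and mclapply relies on forking, which is unavailable on Windows). A sketch that requests more cores; the core count here is an assumption, adjust for your machine:

n_cores <- max(1L, detectCores() - 1L)   # leave one core free
system.time(my_INFOm <- matrix(
  unlist(mclapply(my_INFOs,
                  function(x) extract.info(pop_vcf, element = x),
                  mc.cores = n_cores)),
  ncol = length(my_INFOs), byrow = FALSE))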
###STEP4: The split INFO columns are combined with the fixed columns (CHROM through FILTER, from getFIX) to get the final output####
colnames(my_INFOm) <- my_INFOs
popcount_out <- cbind(getFIX(pop_vcf), my_INFOm)
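To materialise the result as the tab-separated file described above, a final write step might look like this (the output filename is hypothetical):

write.table(popcount_out, "pop_info_split.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)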
The input dataset with 250K variants takes ~10 min to run STEP1 + STEP3a (or STEP1 + STEP3b), and the runtime grows as the number of variants in the input VCF increases. Is it possible to reduce the runtime with some parallel processing methods in R so that it scales better with the size of the input VCF? I am not familiar with parallel processing; in the code above I used lapply and mclapply for STEP3, and neither gave a significant improvement.
Any suggestions to improve performance when reading the file in STEP1 and extracting the INFO values in STEP3a/3b?
Since this is a performance issue rather than a bug in the package, I cannot provide a reproducible example, and I am not able to share the input file.