Improving run time performance #140

Open
@arumds

I am processing a VCF file to split its columns up to and including the INFO field (columns 1-9) into tab-separated columns. The example input has 250K variants.

To do this I am using the code below:

### STEP1: Reading the VCF file took roughly 3.3 min (196 sec user time, 202 sec elapsed) ####
library(vcfR)
system.time(pop_vcf <- vcfR::read.vcfR("pop.vcf.gz"))
   user  system elapsed 
196.416   5.611 202.040

### STEP2: Collecting the names of all the keys in the INFO field ####
my_INFOs <- grep("INFO", queryMETA(pop_vcf), value = TRUE)
my_INFOs <- sub("INFO=ID=", "", my_INFOs)


### STEP3a: Using lapply to extract the values for every key in the INFO column. This took ~5 min (302 sec) ####

system.time(my_INFOm <- matrix(unlist(lapply(my_INFOs, function(x) {
  extract.info(pop_vcf, element = x)
})), ncol = length(my_INFOs), byrow = FALSE))

   user  system elapsed 
302.210   0.092 302.336 

### STEP3b: The same step implemented with mclapply took 316 sec of user time but 170 sec (~3 min) elapsed ####
system.time(my_INFOm <- matrix(unlist(parallel::mclapply(my_INFOs, function(x) {
  extract.info(pop_vcf, element = x)
})), ncol = length(my_INFOs), byrow = FALSE))

   user  system elapsed 
316.789   1.800 170.253

### STEP4: The INFO fields are combined with columns 1-8 to get the final output ####
colnames(my_INFOm) <- as.character(my_INFOs)
popcount_out <- cbind(getFIX(pop_vcf), my_INFOm)
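
One possible reason STEP3 is slow is that every call to extract.info has to scan all the INFO strings again, once per key. A minimal sketch of a single-pass alternative in base R follows. `split_info` is a hypothetical helper, not part of vcfR; the toy `info` vector stands in for the INFO column (assumed to be reachable as `pop_vcf@fix[, "INFO"]`), and recording flag-only keys as "TRUE" is an arbitrary choice made for this illustration.

```r
# Hypothetical helper (not part of vcfR): split INFO strings into a character
# matrix with one column per key, parsing each string only once.
split_info <- function(info, keys) {
  # Break every INFO string into its "KEY=value" (or flag-only "KEY") entries.
  entries <- strsplit(info, ";", fixed = TRUE)
  row_id  <- rep.int(seq_along(entries), lengths(entries))
  entries <- unlist(entries, use.names = FALSE)

  # Text before the first "=" is the key; text after it is the value.
  eq    <- as.integer(regexpr("=", entries, fixed = TRUE))
  key   <- ifelse(eq > 0, substr(entries, 1, eq - 1), entries)
  # Assumption for this sketch: a flag without "=" is recorded as "TRUE".
  value <- ifelse(eq > 0, substr(entries, eq + 1, nchar(entries)), "TRUE")

  # Fill the matrix with one vectorised assignment; absent keys stay NA.
  out  <- matrix(NA_character_, nrow = length(info), ncol = length(keys),
                 dimnames = list(NULL, keys))
  keep <- key %in% keys
  out[cbind(row_id[keep], match(key[keep], keys))] <- value[keep]
  out
}

# Toy input standing in for the INFO column of the real file.
info <- c("AC=2;AF=0.5;DB", "AC=1;AF=0.25", "AF=1.0")
info_matrix <- split_info(info, keys = c("AC", "AF", "DB"))
info_matrix
```

The result is a character matrix with one row per variant, so it could be passed to cbind in the same way as my_INFOm. Whether it is actually faster on a 250K-variant file would need to be timed.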

The input dataset with 250K variants takes roughly 6 to 8 min of elapsed time to run STEP1 + STEP3b or STEP1 + STEP3a, and the run time grows with the number of variants in the input VCF file. Is it possible to reduce the run time with some parallel processing method in R so that it scales better with the size of the input VCF? I am not familiar with parallel processing methods. In the code above I tried lapply and mclapply for STEP3; mclapply lowered the elapsed time (170 vs 302 sec) but did not reduce the total CPU time, and the whole run still takes several minutes.
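
For reference when comparing the timings: in system.time output, "elapsed" is wall-clock time, while "user" also counts CPU time spent in forked child processes, so a parallel run can show a higher "user" value and still finish sooner. mclapply uses `getOption("mc.cores", 2L)` workers unless `mc.cores` is set. The sketch below sets the core count explicitly and parallelises over chunks of rows rather than over keys. `count_entries` and the toy `info` vector are hypothetical stand-ins for the real per-chunk work and data.

```r
library(parallel)

# Toy INFO strings standing in for the INFO column of a real VCF.
info <- rep(c("AC=2;AF=0.5", "AC=1;AF=0.25;DB"), times = 5000)

# Hypothetical per-chunk worker: count the entries in each INFO string.
count_entries <- function(x) lengths(strsplit(x, ";", fixed = TRUE))

# mclapply() relies on forking, which Windows lacks, so use one core there.
cores <- detectCores()
if (is.na(cores) || .Platform$OS.type == "windows") cores <- 1L
n_cores <- max(1L, cores - 1L)

# One contiguous chunk of rows per worker, kept in the original order.
chunk_size <- ceiling(length(info) / n_cores)
chunks <- split(info, ceiling(seq_along(info) / chunk_size))

timing <- system.time(
  res <- mclapply(chunks, count_entries, mc.cores = n_cores)
)
n_entries <- unlist(res, use.names = FALSE)

# Wall-clock seconds; the "user" entry also sums CPU time of the forked workers.
timing[["elapsed"]]
```

Forking copies nothing up front, but each worker's result has to be sent back to the parent process, so the gain depends on how large the returned objects are relative to the computation.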

Any suggestions to improve performance when reading the file in STEP1 and in STEP3a/3b?
Since this is a performance issue rather than a bug in the package, I do not have a reproducible example, and I am not able to share an input file.
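
If the file carries many sample columns, part of the STEP1 time may go into genotype data that this workflow never uses. The sketch below is one base-R way to read only the first eight columns, assuming that is all that is needed. It builds a tiny made-up gzipped VCF so it can run without the real data; the sample names S1 and S2 and the 1000-line header limit are assumptions of this example. The read.vcfR documentation also describes `nrows`, `skip` and `cols` arguments for subsetting, which may be worth checking. Whether either route is faster on the real file would need to be measured.

```r
# Build a tiny gzipped VCF so the sketch is runnable; a real file would replace it.
vcf_path <- tempfile(fileext = ".vcf.gz")
con <- gzfile(vcf_path, "w")
writeLines(c(
  "##fileformat=VCFv4.2",
  "##INFO=<ID=AC,Number=A,Type=Integer,Description=\"Allele count\">",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "1\t100\t.\tA\tG\t50\tPASS\tAC=2\tGT\t0/1\t0/1",
  "1\t200\t.\tC\tT\t60\tPASS\tAC=1\tGT\t0/0\t0/1"
), con)
close(con)

# Find the "#CHROM" line to learn the column names and how many lines to skip.
# Assumption: the meta region fits within the first 1000 lines.
header_scan <- readLines(gzfile(vcf_path), n = 1000)
header_row  <- grep("^#CHROM", header_scan)[1]
col_names   <- strsplit(sub("^#", "", header_scan[header_row]),
                        "\t", fixed = TRUE)[[1]]

# "NULL" in colClasses makes read.table() skip that column entirely.
col_classes <- c(rep("character", 8), rep("NULL", length(col_names) - 8))
fix <- read.table(gzfile(vcf_path), sep = "\t", skip = header_row,
                  header = FALSE, col.names = col_names,
                  colClasses = col_classes, quote = "", comment.char = "",
                  stringsAsFactors = FALSE)
fix
```

Here `fix` is a data frame holding CHROM through INFO as character columns; the FORMAT and sample columns are never parsed.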
