Description
I am processing a VCF file to split the fields up to and including INFO (columns 1-8) into tab-separated columns. The example input has 250K variants.
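For illustration, with two hypothetical INFO keys DP and AF, the transformation I want looks like this:

# INFO as stored in the VCF:  DP=14;AF=0.5
# desired tab-separated output (fixed columns, then one column per INFO key):
# CHROM  POS  ID  REF  ALT  QUAL  FILTER  DP  AF
# chr1   101  .   A    G    50    PASS    14  0.5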
To do this I am using the code below:
###STEP1: Reading the VCF file took approx. ~3.4 min (202 sec elapsed)####
library(vcfR)      # queryMETA(), extract.info() and getFIX() below are also from vcfR
library(parallel)  # provides mclapply(), used in STEP3b

system.time(pop_vcf <- read.vcfR("pop.vcf.gz"))
user system elapsed
196.416 5.611 202.040
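Since only columns 1-8 are needed, one option that may cut the STEP1 time is to bypass vcfR's genotype parsing entirely and read the tab-delimited body directly. A minimal sketch, assuming data.table (plus R.utils for .gz support) is installed:

library(data.table)
# skip = "#CHROM" starts reading at the column-header line, ignoring the ## meta lines
pop_tab <- fread("pop.vcf.gz", sep = "\t", skip = "#CHROM",
                 select = 1:8, header = TRUE)

Note this returns the raw INFO strings without vcfR's metadata, so the INFO keys would then have to be parsed from the strings themselves (as in the sketch after STEP3a) rather than taken from queryMETA().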
###STEP2: Extracting the names (IDs) of all keys declared for the INFO field in the VCF header #####
my_INFOs <- grep("INFO", queryMETA(pop_vcf), value = TRUE)
my_INFOs <- sub("INFO=ID=", "", my_INFOs)
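At this point my_INFOs should be a plain character vector of INFO IDs, e.g. (hypothetical keys):

my_INFOs
# [1] "DP" "AF" "MQ"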
### STEP3a: Using lapply to extract the values for all the keys in the INFO column. This took ~5 min (302 sec) for the function below ####
system.time(my_INFOm <- matrix(unlist(lapply(my_INFOs, function(x){ extract.info(pop_vcf, element = x) })),
ncol = length(my_INFOs), byrow = FALSE))
user system elapsed
302.210 0.092 302.336
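Part of the cost here is likely that each extract.info() call scans the entire INFO column for a single key, so the runtime scales with (number of keys) x (number of variants). A single-pass alternative that splits each INFO string once and aligns the values to the declared keys may be faster; a rough sketch in base R (not vcfR's API, untested on your data):

info_raw <- getINFO(pop_vcf)                        # one raw INFO string per variant
info_kv  <- strsplit(info_raw, ";", fixed = TRUE)   # split into "key=value" tokens
my_INFOm <- t(vapply(info_kv, function(kv) {
  keys <- sub("=.*", "", kv)                        # text before the first "="
  vals <- sub("^[^=]*=?", "", kv)                   # text after it ("" for flag keys)
  vals[match(my_INFOs, keys)]                       # reorder; NA where a key is absent
}, character(length(my_INFOs))))

The column names can then be set from my_INFOs exactly as in STEP4.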
### STEP3b: The same implemented with mclapply; this used ~316 sec of CPU time but only ~170 sec elapsed ####
system.time(my_INFOm <- matrix(unlist(mclapply(my_INFOs, function(x){ extract.info(pop_vcf, element = x) })),
ncol = length(my_INFOs), byrow = FALSE))
user system elapsed
316.789 1.800 170.253
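One thing to check: without an explicit mc.cores argument, mclapply defaults to getOption("mc.cores", 2L), i.e. usually only 2 workers (and mclapply relies on forking, which is unavailable on Windows). A sketch that requests more cores; the core count here is an assumption, adjust for your machine:

n_cores <- max(1L, detectCores() - 1L)   # leave one core free
system.time(my_INFOm <- matrix(
  unlist(mclapply(my_INFOs,
                  function(x) extract.info(pop_vcf, element = x),
                  mc.cores = n_cores)),
  ncol = length(my_INFOs), byrow = FALSE))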
###STEP4: The split INFO columns are combined with the fixed columns (CHROM through FILTER, from getFIX) to get the final output####
colnames(my_INFOm) <- my_INFOs
popcount_out <- cbind(getFIX(pop_vcf), my_INFOm)
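To materialise the result as the tab-separated file described above, a final write step might look like this (the output filename is hypothetical):

write.table(popcount_out, "pop_info_split.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)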
The input dataset with 250K variants takes ~10 min to run STEP1 + STEP3a (or STEP1 + STEP3b), and the runtime grows as the number of variants in the input VCF increases. Is it possible to reduce the runtime with some parallel processing methods in R so that it scales better with the size of the input VCF? I am not familiar with parallel processing; in the code above I used lapply and mclapply for STEP3, and neither gave a significant improvement.
Any suggestions to improve performance when reading the file in STEP1 and extracting the INFO values in STEP3a/3b?
Since this is a performance issue rather than a bug in the package, I cannot provide a reproducible example, and I am not able to share the input file.