Publish to cloud tooling providers like Dockstore, AnVIL, etc #188

Open

whaleyr opened this issue Jul 22, 2024 · 9 comments

@whaleyr
Contributor

whaleyr commented Jul 22, 2024

We want to make PharmCAT easily available on cloud genomics analysis platforms. We already publish a Docker image to Docker Hub so it should be relatively easy to make that image available to different cloud providers. For example, we want to enable access from AnVIL.

After doing some research, it seems the best route is to publish a workflow through Dockstore. That will make it available not only through AnVIL but also DNAstack, DNAnexus, and others.

Current questions are:

  • How do we get this integrated with our release process?
  • How do we get stats on usage?
  • How can we test to ensure availability and success on all the downstream platforms?
@markwoon
Contributor

markwoon commented Sep 10, 2024

Right now, the PharmCAT-Pipeline workflow only works on a single VCF file and does not handle an outside call file. The pipeline script relies on naming conventions for files in the same directory, but that doesn't work here because there's no concept of a directory in the cloud.

Part 1: accept an outside_call_file parameter for an outside call file. This will limit us to single-sample VCFs.

  • check that the file uses the same basename as the VCF and the .outside.tsv suffix
  • if yes, move it to the data dir
  • if no, move it to the data dir and rename it

Part 2: accept a file parameter

  • if it's a VCF file, proceed as normal
  • if it's a compressed file, uncompress it and copy the contents to the data directory (the contents will have to abide by the file naming conventions)

You can provide either vcf_file and/or outside_call_file, OR the file parameter (maybe allow outside_call_file when file points to a VCF file? But that might just be extra complication).
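
For concreteness, a minimal WDL sketch of the Part 1 handling (only vcf_file and outside_call_file mirror the parameters above; the data-dir layout, the suffix stripping, and the pharmcat_pipeline invocation are illustrative assumptions, not the actual workflow):

```wdl
version 1.0

# Hypothetical sketch of the Part 1 logic described above.
task pharmcat_pipeline {
  input {
    File vcf_file
    File? outside_call_file
  }

  String vcf_name = basename(vcf_file)

  command <<<
    set -e
    mkdir data
    cp ~{vcf_file} data/

    # If an outside call file was given, place it next to the VCF and
    # rename it so it shares the VCF's basename with a .outside.tsv
    # suffix, per the pipeline script's naming conventions.
    outside="~{default='' outside_call_file}"
    if [ -n "$outside" ]; then
      base="~{vcf_name}"
      base="${base%%.vcf*}"   # strip .vcf / .vcf.gz / .vcf.bgz
      cp "$outside" "data/${base}.outside.tsv"
    fi

    pharmcat_pipeline "data/~{vcf_name}"
  >>>

  output {
    # everything the pipeline wrote into the data dir
    Array[File] results = glob("data/*")
  }

  runtime {
    docker: "pgkb/pharmcat:latest"
  }
}
```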

Documentation tasks:

  • Update the WDL docs to cover all of these cases.
  • Link to the pipeline script's docs on file naming conventions.
  • Provide an example download that demonstrates this.

Super bonus part 3: do we want to support URLs in addition to files? There is probably no good reason to from the user's perspective, but it would mean we can write tests for the WDL that run automatically (I think).

@markwoon
Contributor

Note: I messed up the Dockstore integration on the last release. It should be fixed for the next release, though.

@AndreRico
Collaborator

Proposal for Aligning and Simplifying the PharmCAT Pipeline

To simplify the maintenance of the PharmCAT_Pipeline and ensure it remains robust, I propose we keep the WDL focused on its core functionality of processing a single VCF file at a time. By doing this, we maintain clarity and ease of maintenance in the WDL itself, while offloading the complexity of file management to earlier workflow steps.

For handling issues like multiple files, compressed formats, and file naming conventions, we can delegate these tasks to upstream workflows within Terra or AnVIL. These workflows can manage tasks such as:

  • Decompressing files if needed.
  • Mapping multiple files for individual processing.
  • Renaming or organizing files according to required naming conventions.

By leveraging Terra and AnVIL’s ability to orchestrate custom workflows, users can create preprocessing steps that handle file management and preparation before invoking the PharmCAT_Pipeline for each individual file. This modular approach keeps the pipeline clean and focused while allowing flexibility for diverse file formats and workflows.
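
To illustrate, an upstream preprocessing step of this kind might look something like the following sketch (task and workflow names are hypothetical, and the archive format and unpacking command are assumptions):

```wdl
version 1.0

# Hypothetical upstream preprocessing: unpack an archive of
# single-sample VCFs so a downstream step can call the
# PharmCAT_Pipeline once per file.
task prepare_vcfs {
  input {
    File archive   # assumed to be a .tar.gz of VCFs
  }

  command <<<
    set -e
    mkdir vcfs
    tar -xzf ~{archive} -C vcfs
  >>>

  output {
    Array[File] vcf_files = glob("vcfs/*.vcf*")
  }

  runtime {
    docker: "ubuntu:22.04"
  }
}

workflow prepare_inputs {
  input {
    File archive
  }

  call prepare_vcfs { input: archive = archive }

  output {
    # hand these to the PharmCAT_Pipeline one at a time,
    # e.g. via a scatter in a wrapping workflow
    Array[File] vcf_files = prepare_vcfs.vcf_files
  }
}
```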

Next Suggested Steps:

Use Case Simulations: We can simulate a few use cases involving multiple files, compressed files, and naming conventions. Then, we’ll build workflows that manage these tasks before calling the PharmCAT_Pipeline. This will ensure the process is flexible and can handle different scenarios.

Comprehensive Documentation: We should document these workflows to guide users on how to set up file preprocessing workflows in Terra or AnVIL. This documentation will include examples of how to manage files and call them in the WDL one by one.

This modular approach will reduce the complexity within the pipeline itself, delegating file handling logic to other parts of the workflow, which simplifies both maintenance and usability across multiple platforms.

@markwoon
Contributor

Details on file inputs: https://pharmcat.org/using/Running-PharmCAT-Pipeline/#inputs

@BinglanLi
Collaborator

This is the link to the PharmCAT tutorial. It includes some real-world VCFs and outside call files.

@AndreRico
Collaborator

Hi all, apologies for the delay! I took some time to dive deeper into the PharmCAT_Pipeline code, and it’s clear that it isn’t fully optimized for cloud environments. You had mentioned this issue before, but it really hit home after reviewing the code more closely.

I’m currently working on creating individual WDLs for each of the 4 modules, trying to replicate the logic of the PharmCAT_Pipeline in AnVIL. I’m not entirely sure if we’ll be able to replicate it 100%, but I do think having these modules separated could be valuable for future use cases.

That said, what do you think about developing a version of PharmCAT_Pipeline specifically designed to work in cloud environments?

@markwoon
Contributor

Yes, the pipeline script is meant to be a very simple wrapper around our main tools.

Using it was the quickest way to get going on Dockstore. You're welcome to create a better WDL script, but let's review it together, since we may then be able to enable more functionality.

@AndreRico
Collaborator

@markwoon, I created a new WDL (https://dockstore.org/workflows/github.com/AndreRico/PharmCAT_Dockstore/PharmCAT-VCF_Preprocessor:main?tab=files) with two tasks: one to convert the cloud environment into a Path environment, and a second to receive that path environment and run the VCF Preprocessor.
I ran some tests using a txt file pointing to Google Cloud Storage, but I will need help testing the other functionalities of the VCF Preprocessor. I believe we can replicate this for the full pipeline by adding this conversion task before calling the PharmCAT-Pipeline.
I will keep you informed of the progress.
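
A rough sketch of the idea (the names are illustrative, not the actual PharmCAT_Dockstore code, and the preprocessor invocation is a guess; check the VCF Preprocessor docs for the exact command and flags):

```wdl
version 1.0

# A manifest (txt file) lists one VCF URI per line; coercing those
# lines to File values lets the WDL engine localize them from cloud
# storage before the preprocessor runs.
workflow vcf_preprocessor_from_manifest {
  input {
    File manifest   # one gs:// (or other) VCF URI per line
  }

  # read_lines() returns Array[String]; assigning it to Array[File]
  # asks the engine (Cromwell on Terra/AnVIL) to localize each URI
  Array[File] vcfs = read_lines(manifest)

  scatter (vcf in vcfs) {
    call run_preprocessor { input: vcf_file = vcf }
  }
}

task run_preprocessor {
  input {
    File vcf_file
  }

  command <<<
    set -e
    # illustrative invocation; flags may differ from the real script
    pharmcat_vcf_preprocessor -vcf ~{vcf_file} -o .
  >>>

  output {
    Array[File] preprocessed = glob("*.vcf*")
  }

  runtime {
    docker: "pgkb/pharmcat:latest"
  }
}
```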

@markwoon
Contributor

“convert the cloud environment into a Path environment”

I assume this is cloud_reader.wdl. I'm not sure I understand why it's necessary. If the files are already in the cloud, you can pass them directly to the WDL. We just need to accept a file array, and users can select multiple files.

On the other hand, now that I think about it, this would also resolve the original problems I had with PharmCAT_Pipeline.wdl...
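
Something like this sketch is all it would take (names are illustrative; the point is just the Array[File] input plus a scatter):

```wdl
version 1.0

# Sketch of the simpler shape: accept the files directly as an
# Array[File] input (Terra/AnVIL users can multi-select), let the
# engine localize them, and run the pipeline once per file.
workflow pharmcat_pipeline_multi {
  input {
    Array[File] vcf_files
  }

  scatter (vcf in vcf_files) {
    call pharmcat_one { input: vcf_file = vcf }
  }
}

task pharmcat_one {
  input {
    File vcf_file
  }

  command <<<
    set -e
    pharmcat_pipeline ~{vcf_file}
  >>>

  output {
    Array[File] results = glob("*.report.html")
  }

  runtime {
    docker: "pgkb/pharmcat:latest"
  }
}
```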
