A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.
For more details, see the following resources:
- arXiv Paper: http://arxiv.org/abs/2503.14377
- Dataset: https://huggingface.co/datasets/vector-institute/open-pmc
- Model Checkpoint: https://huggingface.co/vector-institute/open-pmc-clip
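The dataset linked above can also be loaded directly from the Hugging Face Hub for a quick look at its contents. A minimal sketch, assuming the Hugging Face datasets library is installed and that the default configuration loads without extra arguments:
from datasets import load_dataset

# download (or load from cache) Open-PMC from the Hugging Face Hub
ds = load_dataset("vector-institute/open-pmc")

# print the available splits and column names before iterating over examples
print(ds)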
- Installing Dependencies
- Download and Parse Image-Caption Pairs
- Run Benchmarking Experiments
- Citation
We use poetry for dependency management. Please make sure it is installed, then follow the instructions below to set up your virtual environment.
- Create a venv with python3.10 and activate it.
python --version # must print 3.10
python -m venv <your-venv-name>
source <your-venv-name>/bin/activate
- Navigate to the root directory of the pmc-data-extraction repository and install dependencies.
Two of the required dependencies are mmlearn and open_clip. You can install them either with pip or from source.
To install mmlearn and open_clip with pip, run
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test,open_clip,mmlearn --all-extras
then skip to step 6: Check Installations.
To install mmlearn and open_clip from source, run
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test --all-extras
The above command assumes that you will install the mmlearn and open_clip packages from source, using the submodules found in pmc-data-extraction/openpmcvl/experiment.
- Clone the mmlearn and open_clip submodules.
git submodule init
git submodule update
You should see the source files inside pmc-data-extraction/openpmcvl/experiment/open_clip and pmc-data-extraction/openpmcvl/experiment/mmlearn.
- Install mmlearn from source.
cd openpmcvl/experiment/mmlearn
python3 -m pip install -e .
- Install open_clip from source.
cd ../open_clip
make install
make install-training
- Check installations.
pip freeze | grep mmlearn
pip freeze | grep open_clip
python
>>> import mmlearn
>>> import open_clip
>>> mmlearn.__file__
>>> open_clip.__file__
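The same check can be run non-interactively, which is handy inside job scripts. A one-line sketch (the printed paths will differ per machine):
python -c "import mmlearn, open_clip; print(mmlearn.__file__); print(open_clip.__file__)"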
Note: Since these submodules (mmlearn and open_clip) only exist on the main branch of this repository, if you switch to a branch where they are absent, your Python interpreter will not be able to find these packages and you will get import errors.
The codebase used to download PubMed articles and parse image-text pairs from them is stored in openpmcvl/foundation. It relies heavily on the Build PMC-OA codebase [1].
To download and parse articles with licenses that allow commercial use, run
# activate virtual environment
source /path/to/your/venv/bin/activate
# navigate to root directory of the package
cd openpmcvl/foundation
# download all volumes with a commercially usable license
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/commercial --license-type comm --volumes 0 1 2 3 4 5 6 7 8 9 10 11
To download and parse open-access articles that are not licensed for commercial use, run
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/noncommercial --license-type noncomm --volumes 1 2 3 4 5 6 7 8 9 10 11
To download and parse open-access articles with licenses other than those mentioned above, run
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/other --license-type other --volumes 0 1 2 3 4 5 6 7 8 9 10 11
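If you need all three license buckets, the three commands above can be wrapped in a small shell loop. A sketch, assuming a common download root with one hypothetical subdirectory per license type (adjust --volumes per license type as in the commands above):
# download commercial, noncommercial, and other-license articles under one root
EXTRACTION_ROOT=path/to/download/directory
for LICENSE in comm noncomm other; do
    python -u src/fetch_oa.py --num-retries 5 \
        --extraction-dir "$EXTRACTION_ROOT/$LICENSE" \
        --license-type "$LICENSE" \
        --volumes 0 1 2 3 4 5 6 7 8 9 10 11  # the noncommercial example above omits volume 0
done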
We use mmlearn to run benchmarking experiments. Many experiments can be run with our dataset and mmlearn. A simple example of training with our dataset is given below:
# navigate to root directory of the repository
cd pmc-data-extraction
# set pythonpath
export PYTHONPATH="./"
# run training experiment
mmlearn_run \
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+experiment=pmcoa2_matched \
experiment_name=pmcoa2_matched_train \
dataloader.train.batch_size=256 \
task.encoders.text.pretrained=False \
task.encoders.rgb.pretrained=False
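If a run is interrupted, training can presumably be resumed from a saved checkpoint using the same resume_from_checkpoint override shown in the evaluation example further below. A sketch with a placeholder checkpoint path, assuming the override applies to training runs as well as evaluation:
# resume the training run above from a saved checkpoint
mmlearn_run \
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+experiment=pmcoa2_matched \
experiment_name=pmcoa2_matched_train \
dataloader.train.batch_size=256 \
task.encoders.text.pretrained=False \
task.encoders.rgb.pretrained=False \
resume_from_checkpoint="path/to/model/checkpoint"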
Four downstream evaluation experiments can be run with checkpoints generated during training: cross-modal retrieval, zero-shot classification, linear probing, and patient-to-patient retrieval. An example of cross-modal retrieval on the MIMIC-IV-CXR dataset is given below:
mmlearn_run \
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+experiment=pmcoa2_matched \
experiment_name=pmcoa2_matched_retrieval_mimic \
job_type=eval \
~datasets.test.pmcoa2 \
+datasets@datasets.test.mimic=MIMICIVCXR \
datasets.test.mimic.split=test \
+datasets/transforms@datasets.test.mimic.transform=biomedclip_vision_transform \
datasets.test.mimic.transform.job_type=eval \
dataloader.test.batch_size=64 \
resume_from_checkpoint="path/to/model/checkpoint"
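To evaluate several checkpoints with the same retrieval configuration, the command above can be looped in a shell script. A sketch, assuming checkpoints are saved as .ckpt files under a single directory (both the directory and the file extension are placeholders):
# run the same cross-modal retrieval evaluation for every checkpoint in a directory
for CKPT in path/to/checkpoints/*.ckpt; do
    mmlearn_run \
    'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
    +experiment=pmcoa2_matched \
    experiment_name="pmcoa2_matched_retrieval_mimic_$(basename "$CKPT" .ckpt)" \
    job_type=eval \
    ~datasets.test.pmcoa2 \
    +datasets@datasets.test.mimic=MIMICIVCXR \
    datasets.test.mimic.split=test \
    +datasets/transforms@datasets.test.mimic.transform=biomedclip_vision_transform \
    datasets.test.mimic.transform.job_type=eval \
    dataloader.test.batch_size=64 \
    resume_from_checkpoint="$CKPT"
done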
For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to openpmcvl/experiment/scripts. For more information about mmlearn, please refer to the package's official codebase.
If you find the code useful for your research, please consider citing:
@article{baghbanzadeh2025advancing,
title={Advancing Medical Representation Learning Through High-Quality Data},
author={Baghbanzadeh, Negin and Fallahpour, Adibvafa and Parhizkar, Yasaman and Ogidi, Franklin and Roy, Shuvendu and Ashkezari, Sajad and Khazaie, Vahid Reza and Colacci, Michael and Etemad, Ali and Afkanpour, Arash and Dolatabadi, Elham},
journal={arXiv preprint arXiv:2503.14377},
year={2025}
}