A toolkit to download, augment, and benchmark Open-PMC, a large dataset of image-text pairs extracted from open-access scientific articles on PubMed Central.
For more details, see the following resources:
- arXiv Paper: http://arxiv.org/abs/2503.14377
- Dataset: https://huggingface.co/datasets/vector-institute/open-pmc
- Model Checkpoint: https://huggingface.co/vector-institute/open-pmc-clip
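The dataset linked above can also be loaded directly from the Hugging Face Hub for a quick look at its contents. A minimal sketch, assuming the Hugging Face datasets library is installed and that the default configuration loads without extra arguments:
from datasets import load_dataset

# download (or load from cache) Open-PMC from the Hugging Face Hub
ds = load_dataset("vector-institute/open-pmc")

# print the available splits and column names before iterating over examples
print(ds)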
- Installing Dependencies
- Download and Parse Image-Caption Pairs
- Run Benchmarking Experiments
- Citation
We use poetry for dependency management. Please make sure it is installed, then follow the instructions below to set up your virtual environment.
- Create a venv with python3.10 and activate it.
python --version # must print 3.10
python -m venv <your-venv-name>
source <your-venv-name>/bin/activate
- Navigate to the root directory of the pmc-data-extraction repository and install dependencies.
Two of the required dependencies are mmlearn and open_clip. You can install them either with pip or from source.
To install mmlearn and open_clip with pip, run
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test,open_clip,mmlearn --all-extras
then skip to step 6: Check Installations.
To install mmlearn and open_clip from source, run
cd path/to/pmc-data-extraction
pip install --upgrade pip
poetry install --no-root --with test --all-extras
The above command assumes that you will install the mmlearn and open_clip packages from source, using the submodules found in pmc-data-extraction/openpmcvl/experiment.
- Clone the mmlearn and open_clip submodules.
git submodule init
git submodule update
You should see the source files inside pmc-data-extraction/openpmcvl/experiment/open_clip and pmc-data-extraction/openpmcvl/experiment/mmlearn.
- Install mmlearn from source.
cd openpmcvl/experiment/mmlearn
python3 -m pip install -e .
- Install open_clip from source.
cd ../open_clip
make install
make install-training
- Check installations.
pip freeze | grep mmlearn
pip freeze | grep open_clip
python
>>> import mmlearn
>>> import open_clip
>>> mmlearn.__file__
>>> open_clip.__file__
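The same check can be run non-interactively, which is handy inside job scripts. A one-line sketch (the printed paths will differ per machine):
python -c "import mmlearn, open_clip; print(mmlearn.__file__); print(open_clip.__file__)"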
Note: Since these submodules (mmlearn and open_clip) only exist on the main branch of this repository, if you switch to a branch where they are absent, your Python interpreter will not be able to find these packages and you will get import errors.
The codebase used to download PubMed articles and parse image-text pairs from them is stored in openpmcvl/foundation. It relies heavily on the Build PMC-OA codebase [1].
To download and parse articles with licenses that allow commercial use, run
# activate virtual environment
source /path/to/your/venv/bin/activate
# navigate to root directory of the package
cd openpmcvl/foundation
# download all volumes with a commercially usable license
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/commercial --license-type comm --volumes 0 1 2 3 4 5 6 7 8 9 10 11
To download and parse open-access articles that are not licensed for commercial use, run
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/noncommercial --license-type noncomm --volumes 1 2 3 4 5 6 7 8 9 10 11
To download and parse open-access articles with licenses other than those mentioned above, run
python -u src/fetch_oa.py --num-retries 5 --extraction-dir path/to/download/directory/other --license-type other --volumes 0 1 2 3 4 5 6 7 8 9 10 11
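If you need all three license buckets, the three commands above can be wrapped in a small shell loop. A sketch, assuming a common download root with one hypothetical subdirectory per license type (adjust --volumes per license type as in the commands above):
# download commercial, noncommercial, and other-license articles under one root
EXTRACTION_ROOT=path/to/download/directory
for LICENSE in comm noncomm other; do
    python -u src/fetch_oa.py --num-retries 5 \
        --extraction-dir "$EXTRACTION_ROOT/$LICENSE" \
        --license-type "$LICENSE" \
        --volumes 0 1 2 3 4 5 6 7 8 9 10 11  # the noncommercial example above omits volume 0
done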
We use mmlearn to run benchmarking experiments. Many experiments can be run with our dataset and mmlearn. A simple example of training with our dataset is given below:
# navigate to root directory of the repository
cd pmc-data-extraction
# set pythonpath
export PYTHONPATH="./"
# run training experiment
mmlearn_run \
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+experiment=pmcoa2_matched \
experiment_name=pmcoa2_matched_train \
dataloader.train.batch_size=256 \
task.encoders.text.pretrained=False \
task.encoders.rgb.pretrained=False
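If a run is interrupted, training can presumably be resumed from a saved checkpoint using the same resume_from_checkpoint override shown in the evaluation example further below. A sketch with a placeholder checkpoint path, assuming the override applies to training runs as well as evaluation:
# resume the training run above from a saved checkpoint
mmlearn_run \
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+experiment=pmcoa2_matched \
experiment_name=pmcoa2_matched_train \
dataloader.train.batch_size=256 \
task.encoders.text.pretrained=False \
task.encoders.rgb.pretrained=False \
resume_from_checkpoint="path/to/model/checkpoint"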
Four downstream evaluation experiments can be run with checkpoints generated during training: cross-modal retrieval, zero-shot classification, linear probing, and patient-to-patient retrieval. An example of cross-modal retrieval on the MIMIC-IV-CXR dataset is given below:
mmlearn_run \
'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
+experiment=pmcoa2_matched \
experiment_name=pmcoa2_matched_retrieval_mimic \
job_type=eval \
~datasets.test.pmcoa2 \
+datasets@datasets.test.mimic=MIMICIVCXR \
datasets.test.mimic.split=test \
+datasets/transforms@datasets.test.mimic.transform=biomedclip_vision_transform \
datasets.test.mimic.transform.job_type=eval \
dataloader.test.batch_size=64 \
resume_from_checkpoint="path/to/model/checkpoint"
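To evaluate several checkpoints with the same retrieval configuration, the command above can be looped in a shell script. A sketch, assuming checkpoints are saved as .ckpt files under a single directory (both the directory and the file extension are placeholders):
# run the same cross-modal retrieval evaluation for every checkpoint in a directory
for CKPT in path/to/checkpoints/*.ckpt; do
    mmlearn_run \
    'hydra.searchpath=[pkg://openpmcvl.experiment.configs]' \
    +experiment=pmcoa2_matched \
    experiment_name="pmcoa2_matched_retrieval_mimic_$(basename "$CKPT" .ckpt)" \
    job_type=eval \
    ~datasets.test.pmcoa2 \
    +datasets@datasets.test.mimic=MIMICIVCXR \
    datasets.test.mimic.split=test \
    +datasets/transforms@datasets.test.mimic.transform=biomedclip_vision_transform \
    datasets.test.mimic.transform.job_type=eval \
    dataloader.test.batch_size=64 \
    resume_from_checkpoint="$CKPT"
done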
For more comprehensive examples of shell scripts that run various experiments with Open-PMC, refer to openpmcvl/experiment/scripts. For more information about mmlearn, please refer to the package's official codebase.
If you find the code useful for your research, please consider citing:
@article{baghbanzadeh2025advancing,
title={Advancing Medical Representation Learning Through High-Quality Data},
author={Baghbanzadeh, Negin and Fallahpour, Adibvafa and Parhizkar, Yasaman and Ogidi, Franklin and Roy, Shuvendu and Ashkezari, Sajad and Khazaie, Vahid Reza and Colacci, Michael and Etemad, Ali and Afkanpour, Arash and Dolatabadi, Elham},
journal={arXiv preprint arXiv:2503.14377},
year={2025}
}