OpenDeepArxiv is an open-source project designed to streamline the process of searching for research papers on arXiv, filtering them by relevance, and generating comprehensive PDF reports complete with summaries. The pipeline leverages state-of-the-art summarization techniques to help researchers quickly grasp the latest findings in their fields.
The project is organized into several stages:
- ArXiv Search: Retrieves papers using custom queries constructed with OpenAI. It handles API communication and downloads PDFs.
- Similarity Filtering: Processes paper metadata and applies text similarity measures to shortlist the most relevant publications.
- PDF Report Generation: Produces a final PDF report that combines paper metadata with a concise summary generated for each relevant paper.
Each stage breaks down as follows:

- **ArXiv Search:**
  - Constructs a custom query using a prompt and OpenAI.
  - Communicates with the arXiv API to retrieve search results.
  - Downloads available PDFs and stores metadata for further processing.
- **Similarity Filtering:**
  - Computes text embeddings for the retrieved papers.
  - Compares these embeddings against the user-defined topic.
  - Removes papers below a configurable similarity threshold and trims outlier entries based on the percentage setting.
- **PDF Report Generation:**
  - Utilizes a summarization pipeline to process the filtered papers.
  - Leverages advanced models (e.g., OpenAI's) to generate concise summaries.
  - Combines summaries with metadata into a comprehensive PDF report.
- Automated Research: The end-to-end pipeline automates paper retrieval, relevance filtering, and report generation.
- Advanced Filtering: Uses modern embedding techniques for smart similarity filtering, ensuring only the most pertinent papers are selected.
- Customizable: Configuration via environment variables makes it easy to tailor search parameters, filters, and report formatting.
- Open Source Collaboration: Contributions are welcome; the package is designed to evolve with community feedback and improvements.
- Automated search and download of arXiv papers.
- Intelligent filtering using similarity thresholds.
- Comprehensive and visually appealing PDF reports.
- Easy configuration via environment variables.
- Python 3.7+
- Libraries: `argparse`, `logging`, `pandas`, `tqdm`, `arxiv`, `openai`, `dotenv`, `feedparser`, `requests`, etc.
- An OpenAI API key (set in the `.env` file)
- A virtual environment is recommended.
- Clone the repository:

  ```bash
  git clone https://github.com/GreamDesu/OpenDeepArxiv.git
  cd OpenDeepArxiv
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Windows: .\venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the project root with the following content (a configuration-loading sketch follows these steps):

  ```text
  OPENAI_API_KEY="your_openai_api_key_here"
  ARXIV_PDF_FOLDER="arxiv_downloads"
  EMBEDDING_FILE=bert_embeddings.npy
  FILENAME_FILE=pdf_filenames.csv
  SIMILARITY_MATRIX_FILE=bert_similarity_matrix.csv
  FINAL_CSV_PATH=filtered_papers_final.csv
  ```

- Run the pipeline:

  ```bash
  python main.py --topic "machine learning"
  ```
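For reference, here is a minimal sketch of how `config.py` might read these variables, assuming the project uses `python-dotenv` (listed among the dependencies); the real module may handle defaults and validation differently:

```python
# Illustrative settings loader, not the actual config.py.
import os
from dotenv import load_dotenv

load_dotenv()  # read key/value pairs from the .env file into the environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ARXIV_PDF_FOLDER = os.getenv("ARXIV_PDF_FOLDER", "arxiv_downloads")
EMBEDDING_FILE = os.getenv("EMBEDDING_FILE", "bert_embeddings.npy")
FILENAME_FILE = os.getenv("FILENAME_FILE", "pdf_filenames.csv")
SIMILARITY_MATRIX_FILE = os.getenv("SIMILARITY_MATRIX_FILE", "bert_similarity_matrix.csv")
FINAL_CSV_PATH = os.getenv("FINAL_CSV_PATH", "filtered_papers_final.csv")
```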
- **ArXiv Search:** Initiated from `main.py`, it uses `ArxivSearch` to create a search query and download papers.
- **Similarity Filtering:** The `PaperFilter` processes the metadata and applies the similarity thresholds configured in `config.py`.
- **Report Generation:** The `SummarizationPipeline` generates a PDF report with summaries of the selected papers.
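To make the flow concrete, here is a hypothetical sketch of how `main.py` might wire these classes together. The class names come from this README, but the method names below are assumptions, not the project's actual API:

```python
# Hypothetical wiring sketch -- method names are illustrative, not the real API.
from arxiv_search import ArxivSearch
from similarity_filters import PaperFilter
from summarization import SummarizationPipeline

def run_pipeline(topic: str) -> None:
    papers = ArxivSearch().search_and_download(topic)   # build query, fetch PDFs
    relevant = PaperFilter().filter(papers, topic)      # embedding-based cut
    SummarizationPipeline().build_report(relevant)      # summaries -> PDF
```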
- **User Query:**
  - The user starts by providing a topic or research question. This input represents their area of interest and drives the entire workflow.
- **LLM Query Formulation:**
  - The user query is sent to a large language model (LLM).
  - The LLM leverages its background knowledge and embedded instructions (including details from the arXiv API documentation and module code) to generate precise search parameters.
  - This process converts the vague user interest into a structured query that can include filters like subject categories, date ranges, and sorting priorities.
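A minimal sketch of this step, assuming the OpenAI chat completions API; the model name and prompt below are illustrative, while the project's real instructions live in files such as `arxiv_api_instructions.md` and `taxonomy.md`:

```python
# Illustrative sketch: turn a research topic into an arXiv query string.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def formulate_query(topic: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model, not necessarily the project's
        messages=[
            {"role": "system", "content": (
                "You translate research topics into arXiv API search queries. "
                "Use field prefixes (ti:, abs:, cat:) and Boolean operators. "
                "Return only the query string."
            )},
            {"role": "user", "content": topic},
        ],
    )
    return response.choices[0].message.content.strip()

print(formulate_query("machine learning"))  # e.g. 'abs:"machine learning" AND cat:cs.LG'
```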
- **Downloading Papers:**
  - The generated query is handed off to the arXiv API.
  - Relevant papers are retrieved based on the custom query parameters.
  - Each paper's metadata is saved locally, and its corresponding PDF is downloaded if available.
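As a sketch of what this stage does, using the `arxiv` library from the dependency list (the folder name matches the `.env` example above; the project's `ArxivSearch` class may structure this differently):

```python
# Illustrative sketch: fetch results for a query and download the PDFs.
import os
import arxiv

def download_papers(query, max_results=25, folder="arxiv_downloads"):
    os.makedirs(folder, exist_ok=True)
    client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    metadata = []
    for result in client.results(search):
        metadata.append({
            "title": result.title,
            "summary": result.summary,
            "authors": [a.name for a in result.authors],
        })
        result.download_pdf(dirpath=folder)  # save the PDF alongside the metadata
    return metadata
```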
- **Rough Filtering (First-Level Filtering):**
  - The system computes text embeddings for each paper's abstract or full text using advanced models (e.g., BERT).
  - At this stage, the pipeline removes papers that are either:
    - overly similar, preventing redundancy (i.e., duplicate or near-duplicate content), or
    - irrelevant to the user-specified topic based on a similarity threshold defined in the configuration.
  - This stage helps in reducing the overall pool to a more manageable and relevant subset.
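A rough numpy-only sketch of the thresholding logic, assuming precomputed embeddings such as the `bert_embeddings.npy` file named in the configuration (the threshold values here are placeholders, and the actual `PaperFilter` may differ):

```python
# Illustrative sketch: keep papers close to the topic, drop near-duplicates.
import numpy as np

def rough_filter(paper_embs, topic_emb, min_topic_sim=0.4, max_pair_sim=0.95):
    # Normalize rows so dot products become cosine similarities.
    paper_embs = paper_embs / np.linalg.norm(paper_embs, axis=1, keepdims=True)
    topic_emb = topic_emb / np.linalg.norm(topic_emb)

    kept = []
    for i, emb in enumerate(paper_embs):
        if emb @ topic_emb < min_topic_sim:
            continue  # irrelevant to the user-specified topic
        if any(emb @ paper_embs[j] > max_pair_sim for j in kept):
            continue  # near-duplicate of a paper already kept
        kept.append(i)
    return kept  # indices of papers that survive the rough filter
```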
- **Fine Filtering (Second-Level Filtering):**
  - The refined list from the rough filtering is then passed to the LLM once again.
  - The LLM reviews each paper's content against the original user query and its own prior knowledge.
  - This step further eliminates any papers that, despite passing the rough filtering, do not sufficiently match the research focus.
  - This two-step filtering ensures high precision in the paper selection process.
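A hedged sketch of such a relevance check; the prompt and model are illustrative rather than the project's exact choices:

```python
# Illustrative sketch: ask the LLM for a yes/no relevance verdict per paper.
from openai import OpenAI

client = OpenAI()

def is_relevant(topic, title, abstract):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {"role": "system", "content": (
                "Answer strictly 'yes' or 'no': is the paper below relevant "
                "to the given research topic?"
            )},
            {"role": "user",
             "content": f"Topic: {topic}\n\nTitle: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```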
- **Summarization:**
  - For each paper that survives the filtering stages, a summarization pipeline is activated.
  - The LLM processes the paper's content to generate concise yet comprehensive summaries.
  - This step is crucial to ensure that useful information is preserved even as the content is compressed into a digestible format.
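The summarization call can be sketched the same way; again, the prompt, model, and naive truncation limit are assumptions rather than the project's exact choices:

```python
# Illustrative sketch: compress a paper's text into a short summary.
from openai import OpenAI

client = OpenAI()

def summarize(title, text, max_chars=20000):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {"role": "system", "content": (
                "Summarize the paper in 5-8 sentences, covering the problem, "
                "the method, and the key results."
            )},
            # Naive truncation keeps the request within the context window.
            {"role": "user", "content": f"{title}\n\n{text[:max_chars]}"},
        ],
    )
    return response.choices[0].message.content.strip()
```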
- **Report Generation:**
  - The system combines the summaries with the original metadata (title, authors, publication year, etc.) to form a comprehensive PDF report.
  - This report provides users with an in-depth overview of the field, highlighting state-of-the-art research and key insights in a readable format.
  - The final report serves as a valuable resource for quickly familiarizing oneself with the research domain.
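One possible shape for this final step, sketched with the `fpdf2` library as an assumption for illustration (the project's `SummarizationPipeline` may use a different PDF backend and layout):

```python
# Illustrative sketch: render (metadata, summary) pairs into a simple PDF.
from fpdf import FPDF

def build_report(entries, path="report.pdf"):
    """entries: dicts with 'title', 'authors' (list of str), and 'summary'."""
    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    for entry in entries:
        pdf.add_page()
        pdf.set_font("Helvetica", style="B", size=14)
        pdf.multi_cell(0, 8, entry["title"])
        pdf.set_font("Helvetica", size=10)
        pdf.multi_cell(0, 6, ", ".join(entry["authors"]))
        pdf.ln(4)
        pdf.multi_cell(0, 6, entry["summary"])
    pdf.output(path)
```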
```text
OpenDeepArxiv/
├── __init__.py                          # Package initialization and exports
├── __main__.py                          # Entry point for running the package as a module
├── .env                                 # Environment variables (e.g., API keys)
├── arxiv_api_instructions.md
├── arxiv_downloads/                     # Directory to save downloaded PDFs
├── arxiv_implementation_instructions.py
├── arxiv_search.py                      # Module for constructing queries and downloading papers
├── config.py                            # Configuration settings for the project
├── main.py                              # Main pipeline entry point with CLI support
├── README.md                            # Project documentation (this file)
├── similarity_filters.py                # Module for paper filtering and similarity analysis
├── summarization.py                     # Module for generating summaries and reports
├── taxonomy.md                          # Taxonomy details used by the OpenAI query generator
└── ...                                  # Other files and notebooks
```
The pipeline can be executed directly from the command line. For example:

```bash
python -m OpenDeepArxiv --topic "Diffusion models in robotics"
```

Alternatively, you can run the pipeline using the `main.py` script:

```bash
python main.py --topic "Diffusion models in robotics"
```
You can import the `main` function from the package in your own scripts:

```python
from OpenDeepArxiv import main

topic = "Diffusion models in robotics"
main(topic)
```
Import the package directly into your notebook:

```python
from OpenDeepArxiv import main

topic = "Diffusion models in robotics"
main(topic)
```
- Modularity: The project is broken into modules (`arxiv_search.py`, `similarity_filters.py`, and `summarization.py`) to ease maintenance and future enhancements.
- Logging & Error Handling: Replace print statements with `logging` and review error handling to keep pipeline execution robust (see the sketch below).
- Extensibility: Further customization can be added, for example, to support asynchronous processing or integration with additional APIs.
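For the logging note above, a minimal standard-library setup sketch (the format string is an illustrative choice, not the project's):

```python
# Illustrative sketch: route pipeline messages through logging instead of print.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("OpenDeepArxiv")
logger.info("Pipeline started")  # instead of print("Pipeline started")
```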
Contributions are welcome! This project is open-sourced and intended to promote collaborative improvements. If you’d like to contribute:
- Fork the repository and make your changes.
- Submit a pull request with your improvements or bug fixes.
- Open issues for bugs or feature requests.
Your input helps make OpenDeepArxiv more robust and user-friendly.
This project is licensed under the MIT License. See the LICENSE file for details.
OpenDeepArxiv is maintained by GreamDesu.