It can't scrape URLs from the source #637

Closed · JorgeICS opened this issue Sep 5, 2024 · 12 comments

JorgeICS commented Sep 5, 2024

Due to the policies of some LLMs that don't allow showing URLs in a prompt response, I believe it's not possible to retrieve full URLs from a webpage or document. I've tried Llama 3 via the Groq API and the ChatGPT API, but I only get hyperlink text or page titles. Is there any other way to extract links using this library?

ekinsenler (Contributor) commented Sep 5, 2024

@JorgeICS you mean from the whole domain or just from the URL?

VinciGit00 (Collaborator) commented:

Can you provide the code, please?

JorgeICS (Author) commented Sep 6, 2024

> @JorgeICS you mean from the whole domain or just from the URL?

I mean, I just get the domain if I ask for the URLs.

VinciGit00 (Collaborator) commented:

Ok, but can you share the code?

JorgeICS (Author) commented Sep 6, 2024

> Can you provide the code, please?

Over the last few days I've tried various approaches. The behavior with URLs is very inconsistent; it depends on the model and on the webpage (which I also varied). With the same prompt I got different results: sometimes it worked and sometimes it didn't. Sometimes it returns just the domain, other times incorrect or auto-completed URLs, and in some cases it did return the exact URL that appears on the page.

I use Spyder:

import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Spyder already runs an event loop, so nested asyncio calls need patching
import nest_asyncio
nest_asyncio.apply()

load_dotenv()

# Read the Groq API key from the .env file (assumed variable name; the assignment was omitted above)
groq_key = os.getenv("GROQ_API_KEY")
graph_config = {
    "llm": {
        "model": "groq/llama3-70b-8192",
        "api_key": groq_key,
        "temperature": 0.5,
        "max_tokens":8192
    },
    "headless": False
}

smart_scraper_graph = SmartScraperGraph(
    prompt="I want a list with the URLs of each AI tool mentioned in the article",
    source="https://guides.library.georgetown.edu/ai/tools",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

I get three different results by varying the temperature:

I've extracted the URLs of the AI tools mentioned in the article. Here they are:
- Elicit: https://elicit.ai
- Perplexity: https://perplexity.ai
- Consensus: https://consensus.ai
- Semantic Scholar: https://www.semanticscholar.org
- Research Rabbit: https://researchrabbit.com
- Connected Papers: https://www.connectedpapers.com
- scite: https://scite.ai
- Scholarcy: https://scholarcy.com
- ChatGPT: https://chat.openai.com
- Gemini: https://geminimodel.com

{'AI Tools': [{'Elicit': 'Elicit FAQs'}, {'Perplexity': 'Perplexity FAQs'}, {'Consensus': 'Consensus FAQs'}, {'Semantic Scholar': 'Semantic Scholar FAQs'}, {'Research Rabbit': 'Research Rabbit FAQs'}, {'Connected Papers': 'Connected Papers - About'}, {'scite': 'scite FAQs; how scite works'}, {'Scholarcy': 'Scholarcy FAQs'}, {'ChatGPT': 'OpenAI Help Center - ChatGPT'}, {'Gemini': 'Gemini FAQ'}]}

{'AI Tools URLs': ['https://elicit.org/faq', 'https://perplexity.ai/faq', 'https://consensus.ai/faq', 'https://www.semanticscholar.org/faq', 'https://researchrabbit.ai/faq', 'https://connectedpapers.com/about', 'https://scite.ai/faq', 'https://scholarcy.com/faq', 'https://help.openai.com/en/courses/course-getting-started-chatgpt', 'https://gemini.google/faq']}

ekinsenler (Contributor) commented Sep 8, 2024

@JorgeICS You can change line 70 in parse_node.py from `docs_transformed = Html2TextTransformer().transform_documents(input_data[0])` to `docs_transformed = Html2TextTransformer(ignore_links=False).transform_documents(input_data[0])`; that way the LLM will receive the links as well.
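
As an aside, here is a minimal sketch of what that flag changes, using langchain_community's Html2TextTransformer directly (the class that parse_node.py calls). By default the HTML-to-text conversion strips hyperlinks, so the LLM never sees the URLs; the sample document below is a hypothetical example, not taken from the library:

from langchain_core.documents import Document
from langchain_community.document_transformers import Html2TextTransformer

# A toy HTML document containing one hyperlink (illustrative only)
doc = Document(page_content='<p>Try <a href="https://elicit.ai">Elicit</a> for literature review.</p>')

# Default behavior: the link target is dropped, only the anchor text survives
stripped = Html2TextTransformer().transform_documents([doc])
print(stripped[0].page_content)
# -> Try Elicit for literature review.

# With ignore_links=False the URL is preserved as a markdown-style link
kept = Html2TextTransformer(ignore_links=False).transform_documents([doc])
print(kept[0].page_content)
# -> Try [Elicit](https://elicit.ai) for literature review.

This explains the symptom above: without the flag, the model only ever receives anchor text and page titles, so any "URLs" it returns have to be guessed (hence the auto-completed or incorrect domains).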

VinciGit00 self-assigned this Sep 8, 2024
JorgeICS (Author) commented:

It worked, thanks!

VinciGit00 (Collaborator) commented:

Ok, please update to the new beta.

axibo-reiner commented:

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.3:70b-instruct-q8_0",
        "model_tokens": 8192,
        "temperature": 0.5,
        "base_url": "http://localhost:11434"
    },
    "verbose": True,
    "headless": True,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="I want a list with the URLs of each AI tool mentioned in the article",
    source="https://guides.library.georgetown.edu/ai/tools",
    config=graph_config
)


# Run the pipeline
result = smart_scraper_graph.run()

import json
print(json.dumps(result, indent=4))

I seem to be having the same issue using Ollama.

VinciGit00 (Collaborator) commented:
@ekinsenler Does it work with older versions?

ekinsenler (Contributor) commented:
I don't think it's the same issue, as the current version already calls `Html2TextTransformer(ignore_links=False)` in parse_node.py.

axibo-reiner commented:
Moved here: #952
