It can't scrape URLs from the source #637

Closed · JorgeICS opened this issue Sep 5, 2024 · 12 comments

JorgeICS commented Sep 5, 2024

Due to the policies of some LLMs that don't allow showing URLs in a prompt response, I believe it's not possible to retrieve full URLs from a webpage or document. I've tried Llama 3 via the Groq API and the ChatGPT API, but I only get hyperlink text or page titles. Is there any other way to extract links using this library?

ekinsenler (Contributor) commented Sep 5, 2024

@JorgeICS you mean from the whole domain or just from the URL?

VinciGit00 (Collaborator) commented:

Can you provide the code, please?

JorgeICS (Author) commented Sep 6, 2024

> @JorgeICS you mean from the whole domain or just from the URL?

I mean, I just get the domain if I ask for the URLs.

VinciGit00 (Collaborator) commented:

Ok, but can you share the code?

JorgeICS (Author) commented Sep 6, 2024

> Can you provide the code, please?

Over the last few days I've tried various approaches. The behavior with URLs is very inconsistent; it depends on the model and on the webpage (which I also varied). With the same prompt I got different results: sometimes it worked and sometimes it didn't. Sometimes it returns just the domain, other times incorrect or auto-completed URLs, and in some cases it did return the exact URL that appears on the page.

I use Spyder:

import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Spyder already runs an event loop, so nested asyncio calls need patching
import nest_asyncio
nest_asyncio.apply()

load_dotenv()

# Read the Groq API key from the .env file (assumed variable name; the assignment was omitted above)
groq_key = os.getenv("GROQ_API_KEY")
graph_config = {
    "llm": {
        "model": "groq/llama3-70b-8192",
        "api_key": groq_key,
        "temperature": 0.5,
        "max_tokens":8192
    },
    "headless": False
}

smart_scraper_graph = SmartScraperGraph(
    prompt="I want a list with the URLs of each AI tool mentioned in the article",
    source="https://guides.library.georgetown.edu/ai/tools",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

I get three different results by varying the temperature:

I've extracted the URLs of the AI tools mentioned in the article. Here they are:
- Elicit: https://elicit.ai
- Perplexity: https://perplexity.ai
- Consensus: https://consensus.ai
- Semantic Scholar: https://www.semanticscholar.org
- Research Rabbit: https://researchrabbit.com
- Connected Papers: https://www.connectedpapers.com
- scite: https://scite.ai
- Scholarcy: https://scholarcy.com
- ChatGPT: https://chat.openai.com
- Gemini: https://geminimodel.com

{'AI Tools': [{'Elicit': 'Elicit FAQs'}, {'Perplexity': 'Perplexity FAQs'}, {'Consensus': 'Consensus FAQs'}, {'Semantic Scholar': 'Semantic Scholar FAQs'}, {'Research Rabbit': 'Research Rabbit FAQs'}, {'Connected Papers': 'Connected Papers - About'}, {'scite': 'scite FAQs; how scite works'}, {'Scholarcy': 'Scholarcy FAQs'}, {'ChatGPT': 'OpenAI Help Center - ChatGPT'}, {'Gemini': 'Gemini FAQ'}]}

{'AI Tools URLs': ['https://elicit.org/faq', 'https://perplexity.ai/faq', 'https://consensus.ai/faq', 'https://www.semanticscholar.org/faq', 'https://researchrabbit.ai/faq', 'https://connectedpapers.com/about', 'https://scite.ai/faq', 'https://scholarcy.com/faq', 'https://help.openai.com/en/courses/course-getting-started-chatgpt', 'https://gemini.google/faq']}

ekinsenler (Contributor) commented Sep 8, 2024

@JorgeICS You can change line 70 in parse_node.py from `docs_transformed = Html2TextTransformer().transform_documents(input_data[0])` to `docs_transformed = Html2TextTransformer(ignore_links=False).transform_documents(input_data[0])`; that way the LLM will receive the links as well.
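
As an aside, here is a minimal sketch of what that flag changes, using langchain_community's Html2TextTransformer directly (the class that parse_node.py calls). By default the HTML-to-text conversion strips hyperlinks, so the LLM never sees the URLs; the sample document below is a hypothetical example, not taken from the library:

from langchain_core.documents import Document
from langchain_community.document_transformers import Html2TextTransformer

# A toy HTML document containing one hyperlink (illustrative only)
doc = Document(page_content='<p>Try <a href="https://elicit.ai">Elicit</a> for literature review.</p>')

# Default behavior: the link target is dropped, only the anchor text survives
stripped = Html2TextTransformer().transform_documents([doc])
print(stripped[0].page_content)
# -> Try Elicit for literature review.

# With ignore_links=False the URL is preserved as a markdown-style link
kept = Html2TextTransformer(ignore_links=False).transform_documents([doc])
print(kept[0].page_content)
# -> Try [Elicit](https://elicit.ai) for literature review.

This explains the symptom above: without the flag, the model only ever receives anchor text and page titles, so any "URLs" it returns have to be guessed (hence the auto-completed or incorrect domains).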

VinciGit00 self-assigned this Sep 8, 2024
JorgeICS (Author) commented:

It worked, thanks!

VinciGit00 (Collaborator) commented:

Ok, please update to the new beta.

axibo-reiner commented:

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.3:70b-instruct-q8_0",
        "model_tokens": 8192,
        "temperature": 0.5,
        "base_url": "http://localhost:11434"
    },
    "verbose": True,
    "headless": True,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="I want a list with the URLs of each AI tool mentioned in the article",
    source="https://guides.library.georgetown.edu/ai/tools",
    config=graph_config
)


# Run the pipeline
result = smart_scraper_graph.run()

import json
print(json.dumps(result, indent=4))

I seem to be having the same issue using Ollama.

VinciGit00 (Collaborator) commented:
@ekinsenler Does it work with older versions?

ekinsenler (Contributor) commented:
I don't think it's the same issue, as the current version already calls `Html2TextTransformer(ignore_links=False)` in parse_node.py.

axibo-reiner commented:
Moved here: #952
