Not actually scraping/no response #507

Chris-421 · 2024-08-06T13:01:37Z

Hi, I am starting out with this project to scrape some data from the following website: jumbo.com. However, I am not getting the expected response. The code is basically this tutorial, with only headless: False added and both the link and the prompt changed.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "format": "json",
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "proxy": {
            "server": "broker",
            "criteria": {
                "anonymous": True,
                "secure": True,
                "countryset": {"IT"},
                "timeout": 10.0,
                "max_shape": 3
            },
        },
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="list me all categories of the products and corresponding links to these categories",
    source="https://www.jumbo.com/producten/",
    config=graph_config,
)

# Run the scraper graph
result = smart_scraper_graph.run()

print("Scraper Result:", result)

graph_exec_info = smart_scraper_graph.get_execution_info()
print(graph_exec_info)

However, this does not generate the expected response. Instead of a list of product categories and links, I get:
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.jumbo.com/producten/) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Processing chunks: 0%| | 0/1 [00:28<?, ?it/s]
Scraper Result: {'type': 'accordion', 'title': 'Openingstijden', 'content': 'https://www.jumbo.com/winkels'}
exec_info: [{'node_name': 'Fetch', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 83.95596218109131}, {'node_name': 'Parse', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 0.00398707389831543}, {'node_name': 'GenerateAnswer', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 28.795868158340454}, {'node_name': 'TOTAL RESULT', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 112.75581741333008}]
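One way to read the execution info above is to list the nodes that report zero tokens. If every node does, including GenerateAnswer, the run recorded no LLM usage at all (which may simply mean usage is not tracked for this backend). This is a minimal sketch that assumes the list-of-dicts shape printed above; `nodes_without_tokens` is a hypothetical helper, not part of scrapegraphai.

```python
def nodes_without_tokens(exec_info):
    """Return the names of nodes that report zero total tokens.

    Assumes exec_info is a list of dicts with 'node_name' and
    'total_tokens' keys, as in the output printed above.
    """
    return [
        entry["node_name"]
        for entry in exec_info
        if entry.get("total_tokens", 0) == 0
    ]


# Abbreviated copy of the execution info from the post.
sample = [
    {"node_name": "Fetch", "total_tokens": 0, "exec_time": 83.95596218109131},
    {"node_name": "Parse", "total_tokens": 0, "exec_time": 0.00398707389831543},
    {"node_name": "GenerateAnswer", "total_tokens": 0, "exec_time": 28.795868158340454},
    {"node_name": "TOTAL RESULT", "total_tokens": 0, "exec_time": 112.75581741333008},
]

print(nodes_without_tokens(sample))
```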

I also tested the tutorial itself (that is, the original prompt and link), which returns only a single video:
Scraper Result: {'type': 'video', 'title': 'Tech Support: Pyrotechnician Answers Fireworks Questions From Twitter', 'description': 'WIRED is where tomorrow is realized. It is the essential source of information and ideas that make sense of a world in constant transformation.', 'url': 'https://www.wired.com/video/watch/tech-support-pyrotechnician-answers-fireworks-questions-from-twitter'}
The exec info is similar, again with 0 tokens.
What am I doing wrong? Is the code incorrect, is my LLM setup not working, or is it something else?
PS: From a similar discussion I found out it might be due to blockers, so I tried other sites, including Wikipedia. However, the results still did not match the prompt or the tutorial's output. Additionally, these blockers should theoretically be circumvented by the proxy and headless: False, right?
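To separate "is the LLM setup working" from "is the scraper working", it may help to first confirm that the local Ollama server is reachable and that both models from the config are pulled. The sketch below assumes Ollama's `/api/tags` endpoint, which lists local models, and the base URL from the config above. The helper names are hypothetical, and `has_model` compares only the base name before the `:` tag.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if a local model's base name (before ':') equals `wanted`."""
    return any(name.split(":")[0] == wanted for name in model_names(tags_payload))


def fetch_tags(base_url="http://localhost:11434", timeout=5.0):
    """Query the Ollama server for its local models (performs a network call)."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Example usage (requires a running Ollama server):
#   tags = fetch_tags()
#   for wanted in ("llama3.1", "nomic-embed-text"):
#       print(wanted, "available:", has_model(tags, wanted))
```

If `fetch_tags` raises a connection error, the server is not reachable at that address. If a model is reported as unavailable, it likely still needs to be pulled.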

VinciGit00 · 2024-09-19T12:28:56Z

Maintainer

OK, please update to the latest version.
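To confirm which version is installed before and after upgrading (for example with `pip install --upgrade scrapegraphai`), a small check like the following may help. It is a sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("scrapegraphai:", installed_version("scrapegraphai"))
```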
