It can't scrape URLs from the source #637
Comments
@JorgeICS you mean from the whole domain or just from the URL? |
Can you provide the code please? |
I mean, I only get the domain when I ask for the URLs |
Ok but can you share the code? |
These days, I've been trying in various ways. I've noticed that it's very inconsistent with URLs; it depends on the model and the webpage (which I also changed). With the same prompt, I got different results: sometimes it worked, and sometimes it didn’t. Sometimes it shows the domain, other times incorrect or auto-completed URLs, and in some cases, it did give me the URL that appears on the page. I use Spyder:
I get 3 different results by varying the temperature
|
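For anyone comparing runs: a minimal sketch of the same kind of config with the temperature pinned to 0, which may reduce run-to-run variation but is not guaranteed to make the output deterministic. The keys and values are copied from the config posted in this thread; only the temperature value is changed, and that choice is an assumption on my part, not a confirmed fix.

```python
# Sketch only: same keys as the config posted in this thread,
# with the temperature lowered to 0 to reduce sampling randomness.
graph_config = {
    "llm": {
        "model": "ollama/llama3.3:70b-instruct-q8_0",
        "model_tokens": 8192,
        "temperature": 0,  # assumption: lower temperature gives steadier output
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": True,
}
```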
@JorgeICS You can change the line 70 in |
It worked, thanks! |
Ok please update to the new beta |
import json

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.3:70b-instruct-q8_0",
        "model_tokens": 8192,
        "temperature": 0.5,
        "base_url": "http://localhost:11434"
    },
    "verbose": True,
    "headless": True,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="I want a list with the URLs of each AI tool mentioned in the article",
    source="https://guides.library.georgetown.edu/ai/tools",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

I seem to be having the same issue using ollama |
@ekinsenler Does it work with older versions? |
I don't think it's the same issue as the current version already includes the |
moved here -> #952 |
Because some LLMs have policies that don't allow showing URLs in a prompt response, I believe it's not possible to retrieve full URLs from a webpage or document. I've tried Llama 3 via the Groq API and the ChatGPT API, but I only get hyperlinks or page titles. Is there any other way to extract links using this library?
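One possible workaround that does not rely on the model reproducing URLs: collect the `href` values from the page's HTML directly and only use the LLM for the parts that need it. The sketch below uses only the Python standard library and is not part of this library's API. `LinkCollector`, `extract_links`, and the sample HTML are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the unique link targets found in html, in first-seen order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(parser.links))


# Hypothetical sample page, standing in for HTML fetched elsewhere.
sample_html = """
<p>See <a href="https://example.com/tool-a">Tool A</a>,
<a href="/tools/tool-b">Tool B</a> and
<a href="https://example.com/tool-a">Tool A again</a>.</p>
"""
links = extract_links(sample_html, "https://example.org/ai/tools")
print(links)
```

Since the links come straight from the markup, they cannot be shortened to the domain or auto-completed by the model. The trade-off is that this returns every link on the page, so picking out only the AI tools would still need filtering.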