v1.15.0: OmniScraperGraph not working: Error parsing input keys for ImageToText #580
Comments
@skrawcz will you do it?
Yes, we will take a look shortly. Thanks for bringing it up!
Hey! I thought this was a Burr error, but it looks like it's actually an error in the workflow: not all of the state items are written. I'm not sure I'm the best person to debug this (I don't have full context), but I did create this to make it easier to debug/read. In the Burr case it's fairly clear what's happening, and the non-Burr case (with this fix) now displays the same information. Thoughts on how to proceed? Did something change recently? https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/611/files
Hi, I do not know if this is helpful or not. The error I encountered is in the "standard" ScrapeGraphAI mode, without using Burr. When I first saw the error I tried to work out whether it was an issue on my side, and tried Burr to see if I could figure something out. Unfortunately Burr also raised an error. It seemed to me that Burr was failing because the underlying structure of OmniScraper was not working properly. I included both logs in my issue simply because the Burr error felt much more detailed than the generic error thrown by ScrapeGraphAI, and I thought it could help figure out where the bug is.
Got it! Yeah, I think Burr was slightly more helpful (it showed which keys were missing), but I added that back to the core library too :)
I dug down and found the issue. Looking at omni_scraper_graph.py (L59), it is easy to see how the graph gets created:

```python
def _create_graph(self) -> BaseGraph:
    """
    Creates the graph of nodes representing the workflow for web scraping.

    Returns:
        BaseGraph: A graph instance representing the web scraping workflow.
    """
    fetch_node = FetchNode(
        input="url | local_dir",
        output=["doc", "link_urls", "img_urls"],
        node_config={
            "loader_kwargs": self.config.get("loader_kwargs", {}),
        }
    )
    parse_node = ParseNode(
        input="doc",
        output=["parsed_doc"],
        node_config={
            "chunk_size": self.model_token
        }
    )
    image_to_text_node = ImageToTextNode(
        input="img_urls",
        output=["img_desc"],
        node_config={
            "llm_model": OpenAIImageToText(self.config["llm"]),
            "max_images": self.max_images
        }
    )
    generate_answer_omni_node = GenerateAnswerOmniNode(
        input="user_prompt & (relevant_chunks | parsed_doc | doc) & img_desc",
        output=["answer"],
        node_config={
            "llm_model": self.llm_model,
            "additional_info": self.config.get("additional_info"),
            "schema": self.schema
        }
    )

    return BaseGraph(
        nodes=[
            fetch_node,
            parse_node,
            image_to_text_node,
            generate_answer_omni_node,
        ],
        edges=[
            (fetch_node, parse_node),
            (parse_node, image_to_text_node),
            (image_to_text_node, generate_answer_omni_node)
        ],
        entry_point=fetch_node,
        graph_name=self.__class__.__name__
    )
```
Possible First Issue (Minor)

Compare the FetchNode above with how FetchNode is instantiated elsewhere in the library, with a much fuller node_config:

```python
fetch_node = FetchNode(
    input="url| local_dir",
    output=["doc", "link_urls", "img_urls"],
    node_config={
        "llm_model": self.llm_model,
        "force": self.config.get("force", False),
        "cut": self.config.get("cut", True),
        "loader_kwargs": self.config.get("loader_kwargs", {}),
        "browser_base": self.config.get("browser_base")
    }
)
```

MAIN ISSUE

By just looking at the input/output parameters it all theoretically makes sense: FetchNode declares `doc`, `link_urls` and `img_urls` as outputs, and ImageToTextNode later consumes `img_urls`.

The issue is in FetchNode: although it declares all three keys, in practice it only writes `doc` to the state, so `link_urls` and `img_urls` never exist and ImageToTextNode fails while parsing its input keys.

Proposed Solution

I do not know whether you want this feature implemented inside ParseNode or FetchNode, so I did not push a fix, but with a very simple patch I tested, a small function that extracts those URLs and writes them to the state (see the sketch below), it should start working again.
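As a rough illustration of the kind of helper meant above (not the actual patch; the function name and the use of BeautifulSoup are my assumptions), extracting link and image URLs from the fetched HTML could look like this:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_urls(html: str, base_url: str) -> tuple[list[str], list[str]]:
    """Return (link_urls, img_urls) found in an HTML document.

    Hypothetical helper used only to sketch the idea; the real fix lives
    inside whichever node updates the scraping state.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Resolve relative hrefs/srcs against the page URL so downstream
    # nodes receive absolute URLs.
    link_urls = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    img_urls = [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]

    return link_urls, img_urls
```

Whichever node hosts it would then call it on the fetched document and update the state with `link_urls` and `img_urls` before ImageToTextNode runs.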
Yes please, can you fix it and make a pull request?
Should I add the function to FetchNode or to ParseNode?
Parse
Hi @LorenzoPaleari, can you update and say if everything is OK?
I pushed the changes and it should work, although beta5 is broken; I'm opening an issue.
Now the beta is stable, can you try again?
Sorry, I got busy over the last few days.

ParseNode needs to be able to extract the link URLs and image URLs from the document and update the state with those values. I used a flag parameter since I did not know whether this particular function would be used anywhere other than OmniScraperGraph, but it can always be used when building a CustomGraph. In 1.19.0-beta2 the fix I made got removed. The code I added did not change the normal execution workflow of ParseNode in any way: it just called a couple of functions to extract the links before parsing, and the document parsing then flowed exactly as before, with the only difference that the extracted links were used to update the state. My change only touches the state variables so that the nodes called later correctly receive the image and link URLs, while the issue you mention was about GPT (or any LLM) not being able to extract URLs from the parsed document itself, which I did not change.
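A minimal sketch of that flag-guarded behaviour (hypothetical names and a simplified structure; the real ParseNode also handles chunking, token limits, and so on), reusing the `extract_urls` helper sketched earlier:

```python
def parse_state(state: dict, parse_urls: bool = False) -> dict:
    """Simplified stand-in for the ParseNode change described above."""
    html = state["doc"]

    # Flag-guarded: only OmniScraperGraph (or a CustomGraph that opts in)
    # asks for link/image extraction, so other graphs are unaffected.
    if parse_urls:
        link_urls, img_urls = extract_urls(html, state["url"])
        state.update({"link_urls": link_urls, "img_urls": img_urls})

    # The normal document parsing then proceeds exactly as before;
    # a single chunk stands in here for the real chunking logic.
    state["parsed_doc"] = [html]
    return state
```

The point is simply that the state gains `link_urls` and `img_urls` before parsing, while the parsing itself is untouched.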
Hi @LorenzoPaleari, sorry for what I've done. |
@VinciGit00 Pull Request |
Hi, please update to the new version.
Hi everyone, is there any solution for this issue? I'm encountering the same error. Any help is welcome.
Turns out I had this issue because I was using Python 3.9. I installed 3.11 and it worked. |
Describe the bug
OmniScraperGraph throws an error. Tested on the minimal example from GitHub, omni_scraper_openai.py.
To Reproduce
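The original reproduction script is not preserved in this thread; as a sketch, the repository's OmniScraperGraph example is used roughly like this (API key, URL, prompt, and model name below are placeholders, not the reporter's actual values):

```python
from scrapegraphai.graphs import OmniScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
        "model": "gpt-4o",                 # placeholder model name
    },
    "max_images": 5,
    "verbose": True,
}

# Build the graph from a prompt and a source URL, then run it.
omni_scraper_graph = OmniScraperGraph(
    prompt="List the projects with their descriptions and images.",
    source="https://example.com/projects",  # placeholder URL
    config=graph_config,
)

result = omni_scraper_graph.run()
print(result)
```

On v1.15.0 running a script like this fails with the "Error parsing input keys for ImageToText" error from the title.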
Output
Adding burr arguments to graph config:
Desktop
google-crc32c does not support Python 3.12 (#568)