Adding line to find_table in order to detect row header without outside border #4567
-
Don't be afraid to add as many lines (for Maybe you find this script useful:
-
The script should work for many Form 10-K tables, as long as there is only one table per page.
-
Thanks for the example. I will have to study it a bit more, since I only have basic coding experience and am new to Python.
-
I tried to replicate your code, but with an LLM factored in, since there is no way to guarantee that the table will end with a shaded row. I used PandasQueryEngine in LlamaIndex to ask whether a column header contains a date, and it returns a True or False string value. With that in mind, I'm trying to extend the table's bounding box, but I ran into a problem: the to_pandas function includes text outside of the detected bounding box as the column header.

    import pymupdf
    doc = pymupdf.open("sample_docs/Ascend_10Q_2025.pdf")
    # Find the table using default behavior
    # Determine the x0 (left) position of every $, count them, and filter out every x0 position that occurs fewer than 2 times
    # Determine shaded rows and calculate the height of 1 row
    # Create a new bounding box by extending the previously detected table by 1 row height
    # Try to draw lines to the left of $ from top to bottom of the table bounding box
    # Call find_tables again with the new bounding box and lines
    for c in tabsFinder2.cells:

I'm a bit confused about your method of treating the $ sign and the amount as a single box. If I'm reading it right, you are drawing extra boxes around the two cells? But the initial list of y-values is determined from shaded rects only, so wouldn't it fail for white-filled cells?
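To make the "$ position" steps in the outline above concrete, here is a minimal sketch of how they could look. It is an illustration under assumptions, not the code from this comment: `dollar_column_lines`, `min_count`, and `gap` are hypothetical names, the input is assumed to be word tuples shaped like the output of `page.get_text("words")` (`x0, y0, x1, y1, text, ...`), and each "$" is assumed to be extracted as its own word. Positions are grouped by rounding x0 to the nearest point, which is a simplification.

```python
from collections import Counter


def dollar_column_lines(words, table_bbox, min_count=2, gap=1.0):
    """Build vertical lines just left of recurring "$" positions.

    words:      tuples shaped like page.get_text("words") output,
                i.e. (x0, y0, x1, y1, text, ...)
    table_bbox: (x0, y0, x1, y1) of the detected table
    Returns a list of ((x, top), (x, bottom)) point pairs.
    """
    _, top, _, bottom = table_bbox
    # Count the rounded left edge of every "$" that lies inside the table.
    counts = Counter(
        round(w[0])
        for w in words
        if w[4] == "$" and w[1] >= top and w[3] <= bottom
    )
    # Keep only positions that occur at least min_count times.
    xs = sorted(x for x, n in counts.items() if n >= min_count)
    # One vertical line per kept position, shifted slightly to the left.
    return [((x - gap, top), (x - gap, bottom)) for x in xs]
```

The returned point pairs have the shape that the `add_lines` argument of `page.find_tables()` expects, so they could be passed along together with the extended bounding box as `clip`. Whether the extra lines produce the desired cells depends on the page.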
-
I'm trying to extract data from financial documents such as quarterly and annual reports. I came across a PDF that looks like the following:
The default find_tables function could not see the header containing the date values. Reading the documentation, I thought including a line in the function call would make it work. Here's my attempt:
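For reference, a minimal sketch of that idea, not the code from the original post. It assumes the header sits roughly one row height above the detected table; `find_table_with_header`, `extend_bbox_up`, `top_border_line`, and `row_height` are hypothetical names. The only PyMuPDF features used are `page.find_tables()` with its `clip` and `add_lines` arguments, and the `tables` and `bbox` attributes of the result.

```python
def extend_bbox_up(bbox, row_height):
    """Return bbox grown upward by one row height (y grows downward)."""
    x0, y0, x1, y1 = bbox
    return (x0, y0 - row_height, x1, y1)


def top_border_line(bbox):
    """Return a horizontal line along the top edge of bbox as two points."""
    x0, y0, x1, _ = bbox
    return ((x0, y0), (x1, y0))


def find_table_with_header(page, row_height):
    """Detect a table, then retry with a larger clip and a virtual top border.

    page is expected to be a pymupdf.Page; row_height is an assumed
    estimate of the header row's height in points.
    """
    found = page.find_tables()
    if not found.tables:
        return None
    clip = extend_bbox_up(tuple(found.tables[0].bbox), row_height)
    # The added line acts as the missing border above the header row.
    retry = page.find_tables(clip=clip, add_lines=[top_border_line(clip)])
    return retry.tables[0] if retry.tables else None
```

If the header still does not show up, the row height estimate or the position of the added line are the first things to check.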
But the extracted dataframe keeps looking like this:
Also, any suggestions on how to fix the column alignment issue? Would it be some kind of dataframe modification (i.e., overwrite the $ sign with the value to its right and drop the column)?
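One possible shape for that cleanup, as a minimal sketch on plain rows rather than on the dataframe. It assumes the rows are lists of cell strings (or None) such as `Table.extract()` returns, and that a "$" column contains nothing but "$" or empty cells. `merge_symbol_columns` is a hypothetical helper name. It prefixes the symbol to the value instead of discarding it.

```python
def merge_symbol_columns(rows, symbol="$"):
    """Fold columns that hold only `symbol` into the column to their right.

    rows: list of lists of cell strings or None.
    Returns new rows with the symbol-only columns removed.
    """
    if not rows:
        return []
    width = max(len(r) for r in rows)
    # Normalize: strip text, turn None into "", pad short rows.
    grid = [
        [(c or "").strip() for c in r] + [""] * (width - len(r))
        for r in rows
    ]
    # A column is dropped if its non-empty cells are all the symbol.
    # The last column is never dropped because it has no right neighbour.
    drop = set()
    for j in range(width - 1):
        filled = [row[j] for row in grid if row[j]]
        if filled and all(c == symbol for c in filled):
            drop.add(j)
    result = []
    for row in grid:
        new_row = []
        for j, cell in enumerate(row):
            if j in drop:
                continue
            # Prefix the symbol when the dropped left neighbour held one.
            if (j - 1) in drop and row[j - 1] == symbol and cell:
                cell = symbol + cell
            new_row.append(cell)
        result.append(new_row)
    return result
```

Because whole columns are removed rather than single cells, rows without a "$" stay aligned with rows that have one. The cleaned rows can then be turned into a dataframe with the first row as column names.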