Adding line to find_table in order to detect row header without outside border #4567

TastyMoocow · 2025-06-20T06:15:41Z

I'm trying to extract data from financial documents such as quarterly and annual reports. I came across a PDF that looks like the following:

The default find_tables function could not see the header containing the date values. Reading the documentation, I thought including a line in the function call would make it work. Here's my attempt:

1. Figure out the page bounds so I know how far to draw the horizontal line.
2. Figure out the position of the words "(in thousands, except per share amounts)" so I know where the horizontal line should start.
3. Create 2 points representing the start of the words and the end of the page in the horizontal direction.
4. Create a line from those two points.
5. Call find_tables with the add_lines option, using the line created in Step 4.
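For readers following along, the geometry of these steps can be sketched with plain tuples. The helper name `header_rule` and the sample coordinates are made up for illustration, and placing the line at the top edge of the phrase is an assumption, since the post does not say which y value was used.

```python
def header_rule(page_rect, phrase_rect, y=None):
    """Return a horizontal line as a pair of (x, y) points.

    Both rectangles are (x0, y0, x1, y1) tuples, e.g. taken from page.rect
    and from one hit of page.search_for(...).
    ASSUMPTION: y defaults to the top edge of the phrase.
    """
    _, _, page_x1, _ = page_rect
    phrase_x0, phrase_y0, _, _ = phrase_rect
    if y is None:
        y = phrase_y0
    # Start at the left edge of the phrase, end at the right edge of the page.
    return ((phrase_x0, y), (page_x1, y))


page_rect = (0, 0, 612, 792)        # made-up page size
phrase_rect = (72, 150, 260, 162)   # made-up position of the phrase
line = header_rule(page_rect, phrase_rect)
print(line)
```

With PyMuPDF, a pair of points like this is what goes into the list given to `add_lines`, as in `page.find_tables(add_lines=[line])`.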

But the extracted dataframe keeps looking like this:

Also, any suggestions on how to fix the column alignment issue? Would it take some kind of dataframe modification (i.e., overwrite the $ sign with the value to its right and drop the column)?

JorjMcKie · 2025-06-20T11:46:29Z

Maintainer

Don't be afraid to add as many lines (for add_lines) or rectangles (for add_boxes) as you can think of. Redundant information will be filtered out by the table finder.
You could for example look at the bbox of the shaded rectangles, take the top one and compute a new rectangle by subtracting the rectangle's height from its y-values. Then use this as an item for add_boxes.
Do the same for the last shaded rectangle.
Compute the union of the rectangles and also use it as an item for the add_boxes parameter.
Search for "$" and use each of the returned x0 values to make a vertical line within the computed union bbox.
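The steps above can be sketched roughly as follows. This is not the attached script, only one reading of the description, written with plain `(x0, y0, x1, y1)` tuples so it runs without a PDF. The function names are hypothetical, and treating "similar for the last shaded rectangle" as adding a box of the same height below it is an assumption.

```python
def boxes_from_shaded(shaded):
    """Build the extra boxes from the shaded row rectangles.

    Returns (above, below, union): one box above the first shaded row,
    one below the last (ASSUMPTION), and the union of everything.
    """
    rows = sorted(shaded, key=lambda r: r[1])  # top to bottom
    tx0, ty0, tx1, ty1 = rows[0]
    bx0, by0, bx1, by1 = rows[-1]
    above = (tx0, ty0 - (ty1 - ty0), tx1, ty0)  # shift top row up by its height
    below = (bx0, by1, bx1, by1 + (by1 - by0))  # shift last row down by its height
    every = rows + [above, below]
    union = (
        min(r[0] for r in every),
        min(r[1] for r in every),
        max(r[2] for r in every),
        max(r[3] for r in every),
    )
    return above, below, union


def dollar_lines(dollar_x0s, union, ndigits=1):
    """One vertical line per distinct "$" x0, spanning the union bbox."""
    xs = sorted({round(x, ndigits) for x in dollar_x0s})
    return [((x, union[1]), (x, union[3])) for x in xs]


shaded = [(50, 100, 500, 112), (50, 124, 500, 136)]  # made-up shaded rows
above, below, union = boxes_from_shaded(shaded)
lines = dollar_lines([300, 300, 400], union)         # made-up "$" positions
print(above, below, union, lines)
```

With PyMuPDF the results would be handed over as `page.find_tables(add_boxes=[above, below, union], add_lines=lines)`, with the shaded rectangles coming from `page.get_drawings()` and the x0 values from `page.search_for("$")`.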

Maybe you find this script useful:
test.zip

2 replies

JorjMcKie Jun 20, 2025
Maintainer

There still are 2 useless columns, but that could be cleaned up in one way or another ...

JorjMcKie Jun 20, 2025
Maintainer

For example use a larger value for snapping small column widths: snap_x_tolerance=10. Or use pandas to remove stuff:

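```python
import numpy as np

df = tab.to_pandas()  # tab is one table returned by page.find_tables()
df.replace("", np.nan, inplace=True)  # treat empty strings as missing
df.dropna(axis=0, how="all", inplace=True)  # drop rows that are entirely empty
df.dropna(axis=1, how="all", inplace=True)  # drop columns that are entirely empty
print(df.to_markdown(index=False))
```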
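The same cleanup can also be sketched without pandas, on the list of rows that `tab.extract()` returns. The helper `clean_rows` is hypothetical, and blanking cells that contain only "$" before dropping empty columns is an assumption about what the cleanup should do with the separate "$" columns; it expects every row to have the same number of cells.

```python
def clean_rows(rows):
    """Drop empty rows and columns, treating a lone "$" as empty.

    rows: list of equally long lists of cell strings (or None).
    """
    norm = [
        ["" if (c is None or c.strip() == "$") else c.strip() for c in row]
        for row in rows
    ]
    norm = [row for row in norm if any(row)]  # remove all-empty rows
    if not norm:
        return []
    # Keep only columns that have content in at least one row.
    keep = [i for i in range(len(norm[0])) if any(row[i] for row in norm)]
    return [[row[i] for i in keep] for row in norm]


rows = [                                   # made-up extracted table
    ["", "", "2025", "", "2024"],
    ["Revenue", "$", "1,200", "$", "1,100"],
    ["Costs", None, "800", "", "700"],
    ["", "", "", "", ""],
]
cleaned = clean_rows(rows)
for row in cleaned:
    print(row)
```

Blanking the "$" cells loses the currency marker, so if it matters it could instead be prefixed to the value on its right before the empty columns are dropped.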

JorjMcKie · 2025-06-20T12:49:28Z

Maintainer

The script should work for many Form-10K tables - as long as there only is one per page.
Otherwise, you could develop additional logic that detects clusters of shaded rows and treats each cluster as its own bbox.
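A minimal sketch of such clustering logic, again with plain tuples: shaded rows are sorted top to bottom, and a new cluster starts whenever the vertical gap to the previous shaded row is larger than a threshold. The function names and the threshold of 1.5 row heights (which assumes shaded and unshaded rows alternate) are assumptions to tune per document.

```python
def cluster_shaded_rows(shaded, gap_factor=1.5):
    """Group shaded (x0, y0, x1, y1) rects into vertically close clusters."""
    clusters = []
    for rect in sorted(shaded, key=lambda r: r[1]):
        if clusters:
            prev = clusters[-1][-1]
            height = prev[3] - prev[1]
            # Same cluster if the gap is at most gap_factor row heights.
            if rect[1] - prev[3] <= gap_factor * height:
                clusters[-1].append(rect)
                continue
        clusters.append([rect])
    return clusters


def union_bbox(rects):
    """Smallest rectangle containing all given rects."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


shaded = [                                  # made-up: two tables on one page
    (50, 100, 500, 112), (50, 124, 500, 136), (50, 148, 500, 160),
    (50, 400, 500, 412), (50, 424, 500, 436),
]
bboxes = [union_bbox(c) for c in cluster_shaded_rows(shaded)]
print(bboxes)
```

Each resulting bbox could then be processed on its own with the same box and line logic used for the single-table case.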

1 reply

JorjMcKie Jun 20, 2025
Maintainer

... or, more radically, do a find_tables() first, then look at each tab.bbox and apply the script's logic to it separately.
Then repeat find_tables(), but this time provide all the lines and bboxes identified.

TastyMoocow · 2025-06-20T20:59:32Z

Author

Thanks for the example. I will have to study it a bit more, since I only have basic coding experience and am new to Python.

0 replies

TastyMoocow · 2025-06-24T00:50:34Z

Author

I tried to replicate your code while also bringing in an LLM, since there's no guarantee that the table ends with a shaded row. I used PandasQueryEngine in LlamaIndex to ask whether a column header contains a date; it returns a True or False string value. With that in mind, I'm trying to extend the table's bounding box, but I ran into a problem: the to_pandas function includes text from outside the detected bounding box as the column header.

```python
import pymupdf
from pymupdf import Point, Rect, Page
import os
from collections import Counter

doc = pymupdf.open("sample_docs/Ascend_10Q_2025.pdf")
data = doc.load_page(5)

# Find the table using default behavior
tabsFinder = data.find_tables()
df = tabsFinder[0].to_pandas()
tbbox = tabsFinder[0].bbox

# Determine the x0 (left) position of every "$", count them,
# and filter out every x0 that occurs fewer than 2 times
dollarxZero = []
for r in data.search_for("$"):
    dollarxZero.append(r.x0)
dollarxZeroCounter = dict(Counter(dollarxZero))
filteredDollarX = [k for k, v in dollarxZeroCounter.items() if v >= 2]

# Determine shaded rows and calculate the height of 1 row
shaded = [
    p
    for p in data.get_drawings()
    if p["rect"] in data.rect
    and p["fill"] != (1, 1, 1)
    and p["rect"].height >= 10
    and p["type"] == "f"
]
singlePath = shaded[0]
extendheight = singlePath["rect"].y1 - singlePath["rect"].y0

# Create a new bounding box by extending the previously detected table by 1 row height
newBbox = Rect(Point(tbbox[0], tbbox[1] - extendheight), Point(tbbox[2], tbbox[3]))

# Try to draw lines to the left of $ from top to bottom of the table bounding box
lines = []
lines.append((Point(filteredDollarX[0], newBbox[1]), Point(filteredDollarX[0], newBbox[3])))
lines.append((Point(filteredDollarX[1], newBbox[1]), Point(filteredDollarX[1], newBbox[3])))
# lines.append((point0, point1))

# Call find_tables again with the new bounding box and lines
tabsFinder2 = data.find_tables(add_boxes=[newBbox], add_lines=lines)
df2 = tabsFinder2[0].to_pandas()
print(df2.to_markdown())

# Outline every detected cell in red and save a copy for inspection
for c in tabsFinder2.cells:
    if c:
        data.draw_rect(c, color=(1, 0, 0))
doc.ez_save("sample_docs/second_detected2.pdf")
```

I'm a bit confused about your method for treating the $ sign and the amount as a single box. If I'm reading it right, you are drawing extra boxes around the two cells? But the initial list of y-values is determined from the shaded rects only, so wouldn't it fail for white-filled cells?

0 replies
