
[BUG] Record Manager: unstable source path in S3 Directory Loader causes duplicate chunks #4689

Open
@rquintanab


Describe the bug
When Record Manager is enabled for an S3 Directory Loader and the SourceId Key is left at its default value (source), Flowise stores a different temporary path in metadata.source on every run:

source:"/tmp/s3fileloader-GjaLgi/docs/XXXX.pdf"
source:"/tmp/s3fileloader-fkHJjB/docs/XXXX.pdf"

Because the path changes each time, Record Manager treats the same PDF as a new document and inserts duplicate chunks into the vector database.
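The effect can be illustrated with a minimal sketch. It assumes the record key is derived from a chunk's content together with its metadata, which is how LangChain-style indexing typically works; `recordKey` and the `Chunk` type are hypothetical names for illustration and are not Flowise's actual implementation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical chunk shape, for illustration only.
interface Chunk {
  pageContent: string;
  metadata: Record<string, string>;
}

// Assumed keying scheme: hash of content plus canonicalised metadata.
function recordKey(chunk: Chunk): string {
  const canonicalMetadata = JSON.stringify(
    Object.keys(chunk.metadata)
      .sort()
      .map((k) => [k, chunk.metadata[k]])
  );
  return createHash("sha256")
    .update(chunk.pageContent)
    .update("\u0000")
    .update(canonicalMetadata)
    .digest("hex");
}

const text = "same chunk text from the same PDF";

// Two runs over the same object, as reported above: only the temp dir differs.
const firstRun: Chunk = {
  pageContent: text,
  metadata: { source: "/tmp/s3fileloader-GjaLgi/docs/XXXX.pdf" },
};
const secondRun: Chunk = {
  pageContent: text,
  metadata: { source: "/tmp/s3fileloader-fkHJjB/docs/XXXX.pdf" },
};

// The same chunk with a source that does not change between runs.
const stableFirst: Chunk = { pageContent: text, metadata: { source: "docs/XXXX.pdf" } };
const stableSecond: Chunk = { pageContent: text, metadata: { source: "docs/XXXX.pdf" } };

console.log(recordKey(firstRun) === recordKey(secondRun)); // false: treated as new
console.log(recordKey(stableFirst) === recordKey(stableSecond)); // true: matched
```

Under that assumption, any per-run component in `metadata.source` is enough to make every chunk look new, even when the PDF itself is unchanged.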


To Reproduce

  1. Go to Data Sources → New Loader → S3 Directory Loader.
  2. Configure the bucket and prefix.
  3. Enable Record Manager and leave SourceId Key as source (default).
  4. Run the loader (process and upsert).
  5. Run the loader again on the same data set.
  6. Inspect the vector store or logs – you will see that new chunks are inserted instead of being matched to existing records.

Expected behavior
metadata.source should be stable for a given S3 object (e.g. use the S3 key docs/XXXX.pdf), so Record Manager can recognise existing documents and avoid duplicates.
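One possible shape for this, as a sketch only: strip the per-run temporary directory from the local path so that what remains is the object's key. The function `withStableSource`, the `LoadedDoc` type, and the idea that the loader knows its temporary directory at this point are assumptions made for illustration, not the loader's actual code.

```typescript
import { posix } from "node:path";

// Hypothetical document shape, for illustration only.
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Rewrites metadata.source from "<tempDir>/<key>" to "<key>".
// Documents whose source is missing or outside tempDir are returned unchanged.
function withStableSource(doc: LoadedDoc, tempDir: string): LoadedDoc {
  const localPath = doc.metadata.source;
  if (typeof localPath !== "string" || localPath === "") return doc;

  const key = posix.relative(tempDir, localPath);
  if (key === "" || key.startsWith("..") || posix.isAbsolute(key)) return doc;

  return { ...doc, metadata: { ...doc.metadata, source: key } };
}

// The two temp paths from the report map to the same source.
const runA = withStableSource(
  { pageContent: "chunk", metadata: { source: "/tmp/s3fileloader-GjaLgi/docs/XXXX.pdf" } },
  "/tmp/s3fileloader-GjaLgi"
);
const runB = withStableSource(
  { pageContent: "chunk", metadata: { source: "/tmp/s3fileloader-fkHJjB/docs/XXXX.pdf" } },
  "/tmp/s3fileloader-fkHJjB"
);

console.log(runA.metadata.source); // docs/XXXX.pdf
console.log(runB.metadata.source); // docs/XXXX.pdf
```

If the same key can appear in more than one bucket, prefixing the bucket name to the key would keep the identifier unique; that is a design choice this sketch leaves open.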


Screenshots
N/A


Flow
N/A


Setup

  • Installation: docker
  • Flowise Version: 3.0.1
  • OS: Windows 11
  • Browser: Google Chrome

Additional context

  • The current behaviour forces users either to supply a custom SourceId Key or to accept data duplication.
  • Setting a custom SourceId Key is not a viable workaround: the same value is applied to every document within the S3 folder, which again prevents Record Manager from distinguishing individual files.
  • A consistent identifier derived from each object’s S3 key (or an option to choose that behaviour) would let Record Manager properly update existing documents instead of re-inserting them.
