Skip to content

Misc SimilarityIndex.add_to_index() issues #169

Open
@allenleetc

Description

@allenleetc

Minor issues for similarity indexes when add_to_index() is called with overlapping/existing sample_ids.

I believe the issues are in fob.internal.core.utils.add_ids() so may extend to all backends. That said I am testing with mongodb backend

Dataset Prep (currently must run over direct connection to MongoDB Atlas for mongodb backend

# Create a dataset that is half-indexed
dsn = 'quickstart-test-similarity-image-8'
if fo.dataset_exists(dsn):
    fo.delete_dataset(dsn)
    
dataset = foz.load_zoo_dataset("quickstart",max_samples=8,dataset_name=dsn)
dataset.persistent = True
si = fob.compute_similarity(
   dataset[:4],
   model='clip-vit-base32-torch',
   embeddings='embs_half',
   brain_key='bk_half',
   backend='mongodb'
)
print(si.total_index_size) # returns 4

Issue 0, "last" sample_id to be added already exists

Apparent issue here, maybe use max(jj) rather than last element of jj

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')

ids_add = ds[-2:].values('id')
ids_add.append(ds.first().id)
print(ids_add)
print(ds.values('id'))
d = 512
embs_add = np.random.rand(len(ids_add),d)
embs_add = embs_add.tolist()
si.add_to_index(embs_add, ids_add) # errors IndexError: index 4 is out of bounds for axis 0 with size 4

For reference, docstring from SimilarityIndex.add_to_index

            overwrite (True): whether to replace (True) or ignore (False)
                existing embeddings with the same sample/label IDs
            allow_existing (True): whether to ignore (True) or raise an error
                (False) when ``overwrite`` is False and a provided ID already
                exists in the
            warn_existing (False): whether to log a warning if an embedding is
                not added to the index because its ID already exists

Issue 1, overwrite=False, allow_existing=False. Expect error here

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add)) # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=False, allow_existing=False) # this works/succeeds silently

Issue 2, overwrite=False, warn_existing=True. Expect warning here


dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add))  # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=False, warn_existing=True) # works/succeeds silently

Issue 3, overwrite=True, allow_existing=False. Don't expect error here as overwrite=True

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = [ds.first().id, ds.last().id]
print(set(si.sample_ids).intersection(ids_add))
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=True, allow_existing=False) # errors

Metadata

Metadata

Assignees

Labels

bugBug fixes

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions