Description
Minor issues arise for similarity indexes when add_to_index()
is called with overlapping/existing sample_ids.
I believe the issues are in fob.internal.core.utils.add_ids(),
so they may extend to all backends. That said, I am testing with the mongodb
backend.
Dataset Prep (currently must run over a direct connection to MongoDB Atlas for the mongodb
backend)
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Create a dataset that is half-indexed
dsn = 'quickstart-test-similarity-image-8'
if fo.dataset_exists(dsn):
    fo.delete_dataset(dsn)
dataset = foz.load_zoo_dataset("quickstart", max_samples=8, dataset_name=dsn)
dataset.persistent = True
# Index only the first 4 of the 8 samples
si = fob.compute_similarity(
    dataset[:4],
    model='clip-vit-base32-torch',
    embeddings='embs_half',
    brain_key='bk_half',
    backend='mongodb',
)
print(si.total_index_size)  # returns 4
Issue 0, "last" sample_id to be added already exists
Apparent issue here; maybe use max(jj)
rather than the last element of jj
import numpy as np

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
# Two unindexed samples followed by one sample that is already in the index
ids_add = ds[-2:].values('id')
ids_add.append(ds.first().id)
print(ids_add)
print(ds.values('id'))
d = 512  # embedding dimension
embs_add = np.random.rand(len(ids_add), d)
embs_add = embs_add.tolist()
# errors with IndexError: index 4 is out of bounds for axis 0 with size 4
si.add_to_index(embs_add, ids_add)
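To illustrate why the last element of jj might be the problem, here is a minimal NumPy sketch. It is not the fob implementation: place_ids() and its use_max flag are hypothetical, and it assumes that jj holds the target position of each incoming ID and that the merged array is sized from jj. When the final ID already exists, jj[-1] points into the old range, so the array is never grown and the earlier new IDs fall out of bounds.

```python
import numpy as np


def place_ids(existing_ids, new_ids, use_max):
    """Hypothetical stand-in for an ID-merging step (not the fob code)."""
    index_of = {_id: i for i, _id in enumerate(existing_ids)}
    next_slot = len(existing_ids)

    # jj[k] is the position that new_ids[k] should occupy
    jj = []
    for _id in new_ids:
        if _id not in index_of:
            index_of[_id] = next_slot
            next_slot += 1
        jj.append(index_of[_id])
    jj = np.array(jj)

    # Assumed sizing rule: the last element vs. the maximum of jj
    required = int(jj.max() if use_max else jj[-1]) + 1
    size = max(len(existing_ids), required)

    merged = np.empty(size, dtype=object)
    merged[: len(existing_ids)] = existing_ids
    merged[jj] = new_ids  # raises IndexError if size is too small
    return merged


existing = ["a", "b", "c", "d"]
incoming = ["e", "f", "a"]  # two new IDs, then one that already exists

print(list(place_ids(existing, incoming, use_max=True)))

try:
    place_ids(existing, incoming, use_max=False)
except IndexError as e:
    print("IndexError:", e)
```

With use_max=True the array is sized for the largest target position, so the order of ids_add no longer matters.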
For reference, docstring from SimilarityIndex.add_to_index:
    overwrite (True): whether to replace (True) or ignore (False)
        existing embeddings with the same sample/label IDs
    allow_existing (True): whether to ignore (True) or raise an error
        (False) when ``overwrite`` is False and a provided ID already
        exists in the index
    warn_existing (False): whether to log a warning if an embedding is
        not added to the index because its ID already exists
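As a reading of that docstring, the behavior I would expect can be sketched as follows. This is only an illustration of the documented semantics, not the library's code: select_ids_to_write() is a hypothetical helper, and the choice of ValueError as the exception type is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def select_ids_to_write(
    index_ids, ids, overwrite=True, allow_existing=True, warn_existing=False
):
    """Hypothetical helper: returns the IDs whose embeddings get written."""
    existing = set(index_ids)
    duplicates = [_id for _id in ids if _id in existing]

    # overwrite=True replaces existing embeddings, so nothing is filtered
    # and allow_existing does not apply
    if overwrite or not duplicates:
        return list(ids)

    # overwrite=False: existing IDs are either an error or skipped
    if not allow_existing:
        raise ValueError("Found %d IDs already in the index" % len(duplicates))

    if warn_existing:
        logger.warning("Ignoring %d IDs already in the index", len(duplicates))

    return [_id for _id in ids if _id not in existing]


index_ids = ["a", "b"]
print(select_ids_to_write(index_ids, ["a", "c"], overwrite=False))
```

Under this reading, Issue 1 should raise, Issue 2 should log a warning, and Issue 3 should succeed.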
Issue 1, overwrite=False, allow_existing=False. Expect error here
dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add)) # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
# succeeds silently; no error is raised
si.add_to_index(embs_add, ids_add, overwrite=False, allow_existing=False)
Issue 2, overwrite=False, warn_existing=True. Expect warning here
dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add)) # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
# succeeds silently; no warning is logged
si.add_to_index(embs_add, ids_add, overwrite=False, warn_existing=True)
Issue 3, overwrite=True, allow_existing=False. Don't expect error here as overwrite=True
dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = [ds.first().id, ds.last().id]
print(set(si.sample_ids).intersection(ids_add))
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=True, allow_existing=False) # errors