Hi, I noticed that the loss used in this repo is a binary cross-entropy loss between the prediction and the mask: `loss = F.binary_cross_entropy_with_logits(pred, mask)`. However, the loss described in the paper is a contrastive loss between visual and textual features.
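
To make the comparison concrete, here is a minimal sketch of the two formulations as I understand them. It assumes that `pred` is the dot product between per-pixel visual features and a sentence-level text feature, and that the contrastive loss takes a sigmoid form over those similarities. Both are my assumptions, and the tensor names and shapes (`visual`, `text`, `B`, `D`, `H`, `W`) are hypothetical, not taken from the repo or the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical shapes: batch of 2, 8-dim features, 4x4 feature map.
B, D, H, W = 2, 8, 4, 4
visual = torch.randn(B, D, H, W)                  # per-pixel visual features (assumed)
text = torch.randn(B, D)                          # sentence-level text feature (assumed)
mask = torch.randint(0, 2, (B, 1, H, W)).float()  # binary ground-truth mask

# Assumed text-to-pixel similarity: dot product of the text feature
# with the visual feature at every pixel, used as the logit.
pred = torch.einsum("bdhw,bd->bhw", visual, text).unsqueeze(1)

# Loss as written in the repo.
bce_loss = F.binary_cross_entropy_with_logits(pred, mask)

# Assumed sigmoid-style contrastive loss:
#   -log(sigmoid(s))      for pixels inside the mask (positives)
#   -log(1 - sigmoid(s))  for pixels outside the mask (negatives)
# using the identity log(1 - sigmoid(s)) = logsigmoid(-s).
pos_term = -F.logsigmoid(pred) * mask
neg_term = -F.logsigmoid(-pred) * (1 - mask)
contrastive_loss = (pos_term + neg_term).mean()

# Under these assumptions the two values agree up to floating-point error.
same = torch.allclose(bce_loss, contrastive_loss, atol=1e-5)
```

If the paper's contrastive loss has this sigmoid form and `pred` is such a similarity map, the two would be the same quantity under different names. If the paper instead uses a softmax-style (InfoNCE-like) loss, they would differ.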