Best practices for running a model for sentence embeddings on two different columns #14468

Closed as not planned
@macarran

Description

Link to the documentation pages (if available)

No response

How could the documentation be improved?

Hi,
I searched for existing documentation or discussions on how to compute sentence embeddings over two separate columns and did not find any. Are there any recommendations or known gotchas on the topic? Say I have a DataFrame with a name column and an address column, and would like to use an XLM-RoBERTa model to compute sentence embeddings for both. The best I could come up with is the following:

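import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.{SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.Pipeline

// Builds a pipeline that embeds a single text column and writes the result to "finished_embeddings".
def createPipeline(source: String): Pipeline = {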
  val documentAssembler = new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol("document")

  val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

  val embeddings = XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(false)

  val sentenceEmbeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "embeddings"))
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

  val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols("sentence_embeddings")
    .setOutputCols("finished_embeddings")
    .setOutputAsVector(true)
    .setCleanAnnotations(false)

  new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    sentenceEmbeddings,
    embeddingsFinisher
  ))
}

I then apply it once per column, something like this:

val testDataWithNameEmbeddings = createPipeline("name")
  .fit(testData)
  .transform(testData)
  .select($"name", $"address", $"finished_embeddings".alias("name_embeddings"))

val testDataWithBothEmbeddings = createPipeline("address")
  .fit(testDataWithNameEmbeddings)
  .transform(testDataWithNameEmbeddings)
  .select(
    $"name",
    $"address",
    $"name_embeddings",
    $"finished_embeddings".alias("address_embeddings")
  )
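
For more than two columns, the same chaining could be written as a fold over the column names. This is only a sketch of the approach above, not a recommended Spark NLP pattern: `embedColumns` is a hypothetical helper, it takes the `createPipeline` function from this issue as a parameter, and it assumes the pipeline always writes its result to `finished_embeddings`. Intermediate annotation columns are dropped after each pass so the next pipeline can reuse the same column names.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: runs one single-column pipeline per entry in `columns`
// and appends a "<column>_embeddings" column for each of them.
def embedColumns(
    df: DataFrame,
    columns: Seq[String],
    createPipeline: String => Pipeline
): DataFrame =
  columns.foldLeft(df) { (acc, column) =>
    val outputCol = s"${column}_embeddings"
    // Keep everything accumulated so far plus the new embeddings column;
    // "document", "token", etc. are discarded.
    val keep = acc.columns.toSeq :+ outputCol

    createPipeline(column)
      .fit(acc)
      .transform(acc)
      .withColumnRenamed("finished_embeddings", outputCol)
      .select(keep.map(col): _*)
  }
```

Usage would be `embedColumns(testData, Seq("name", "address"), createPipeline)`. Note that `createPipeline` calls `pretrained` on every invocation, so the model is set up once per column.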

This appears to work, but feels... wrong? The existence of MultiDocumentAssembler and of setInputCols on several of the stages led me down a rabbit hole of trying different approaches to see if I could annotate and tokenize multiple columns in one stage, but I ran into various errors and failed assertions in different components of the pipeline. For example, calling setInputCols on Tokenizer with an array containing more than one column results in:

IllegalArgumentException: requirement failed: setInputCols in REGEX_TOKENIZER_2889f26665ad expecting 1 columns. Provided column amount: 2. Which should be columns from the following annotators: document.

The closest thing I found is this old issue where someone was trying to run multiple models in one pipeline, but if I add two embeddings stages to the pipeline, Spark ML fails with:

IllegalArgumentException: requirement failed: Cannot have duplicate components in a pipeline.
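
One possible way around this error, assuming it came from adding the same stage instance twice (Spark ML rejects a pipeline whose stage array contains the same object more than once), is to build a fresh set of stages per source column and give every intermediate column a unique name. The sketch below is untested against Spark NLP 5.5.1; `stagesFor` and `createCombinedPipeline` are hypothetical helper names, and each `pretrained` call is assumed to set up its own copy of the model, so memory use would likely grow with the number of columns.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.{SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Hypothetical helper: a distinct set of stage instances for one source column,
// with every output column prefixed by the source name to avoid collisions.
def stagesFor(source: String): Seq[PipelineStage] = {
  val documentAssembler = new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol(s"${source}_document")

  val tokenizer = new Tokenizer()
    .setInputCols(Array(s"${source}_document"))
    .setOutputCol(s"${source}_token")

  val embeddings = XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array(s"${source}_document", s"${source}_token"))
    .setOutputCol(s"${source}_token_embeddings")

  val sentenceEmbeddings = new SentenceEmbeddings()
    .setInputCols(Array(s"${source}_document", s"${source}_token_embeddings"))
    .setOutputCol(s"${source}_sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

  val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols(s"${source}_sentence_embeddings")
    .setOutputCols(s"${source}_embeddings")
    .setOutputAsVector(true)

  Seq(documentAssembler, tokenizer, embeddings, sentenceEmbeddings, embeddingsFinisher)
}

// Hypothetical helper: one pipeline containing the stages for every source column.
def createCombinedPipeline(sources: Seq[String]): Pipeline =
  new Pipeline().setStages(sources.flatMap(stagesFor).toArray)
```

With this, `createCombinedPipeline(Seq("name", "address")).fit(testData).transform(testData)` would be expected to produce `name_embeddings` and `address_embeddings` in a single pass, at the cost of holding one model per column.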

I'm not sure how common this use case is, given that there don't seem to be other issues like it, but I would appreciate some thoughts on the topic.

Thanks!

Environment: Spark 3.5.0, Scala 2.12, Spark NLP 5.5.1
