Best practices for running a model for sentence embeddings on two different columns

### Link to the documentation pages (if available)

_No response_

### How could the documentation be improved?

Hi,
I searched the existing documentation and discussions for guidance on running a sentence-embedding model over two separate columns and did not find any. Are there any recommendations or known gotchas on the topic? Say I have a DataFrame with `name` and `address` columns and would like to use a RoBERTa model to compute sentence embeddings for both. The best I could come up with is the following:

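```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.{SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.Pipeline

// Builds a pipeline that embeds a single text column named `source`.
def createPipeline(source: String): Pipeline = {
  // Wrap the raw text column in a document annotation
  val documentAssembler = new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol("document")

  val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

  // Token-level embeddings from the pretrained multilingual model
  val embeddings = XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(false)

  // Average the token embeddings into one vector per document
  val sentenceEmbeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "embeddings"))
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

  // Convert the annotations into plain Spark ML vectors
  val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols("sentence_embeddings")
    .setOutputCols("finished_embeddings")
    .setOutputAsVector(true)
    .setCleanAnnotations(false)

  new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    sentenceEmbeddings,
    embeddingsFinisher
  ))
}
```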

I then run it once per column, roughly like this:
```scala
// The `$"..."` column syntax requires `import spark.implicits._` in scope.
val testDataWithNameEmbeddings = createPipeline("name")
  .fit(testData)
  .transform(testData)
  .select($"name", $"address", $"finished_embeddings".alias("name_embeddings"))

// The select above drops the intermediate annotation columns,
// so the second pass can reuse the same column names.
val testDataWithBothEmbeddings = createPipeline("address")
  .fit(testDataWithNameEmbeddings)
  .transform(testDataWithNameEmbeddings)
  .select(
    $"name",
    $"address",
    $"name_embeddings",
    $"finished_embeddings".alias("address_embeddings")
  )
```
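
One variation on the sequential approach, sketched here rather than tested against Spark NLP, would be to fit the pipeline once on a generic input column and reuse the fitted model for every source column, so `pretrained` is only called once. The helper name `embedColumns` and the scratch column `text` are hypothetical; the sketch assumes the model was built with `createPipeline("text")` and that it writes its result to `finished_embeddings`.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: applies one fitted model to each source column in turn.
// Assumes the model reads from `inputCol` and writes vectors to `outputCol`.
def embedColumns(
    model: PipelineModel,
    data: DataFrame,
    sources: Seq[String],
    inputCol: String = "text",
    outputCol: String = "finished_embeddings"
): DataFrame =
  sources.foldLeft(data) { (df, source) =>
    // Columns to carry over unchanged, including embeddings from earlier passes
    val kept = df.columns.map(col)
    model
      // Copy the current source column into the column the model reads from
      .transform(df.withColumn(inputCol, col(source)))
      // Keep only the carried-over columns plus the renamed result, which also
      // drops the scratch column and the intermediate annotation columns
      .select(kept :+ col(outputCol).alias(s"${source}_embeddings"): _*)
  }
```

Usage might look like `val model = createPipeline("text").fit(testData.withColumn("text", col("name")))` followed by `embedColumns(model, testData, Seq("name", "address"))`. This still runs the model once per column; it only avoids building and loading the pipeline twice.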

This appears to work, but feels... wrong? The existence of `MultiDocumentAssembler` and of `setInputCols` on several of the stages led me down a rabbit hole of trying to annotate and tokenize multiple columns in one stage, but I hit a variety of errors and failed assertions in different components of the pipeline. For example, calling `setInputCols` on `Tokenizer` with an array containing more than one column results in:

`IllegalArgumentException: requirement failed: setInputCols in REGEX_TOKENIZER_2889f26665ad expecting 1 columns. Provided column amount: 2. Which should be columns from the following annotators: document`.
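
Since `Tokenizer` accepts exactly one document column, another option that may be worth trying is a single pipeline in which each source column gets its own chain of stages with column-specific names. The sketch below is untested and makes several assumptions: the helper names `outputColumn`, `stagesFor`, and `createMultiColumnPipeline` are hypothetical, calling `pretrained` once per column presumably loads the model once per column, and `setCleanAnnotations(false)` is kept so that one chain's finisher does not remove columns another chain still needs.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.{SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Hypothetical naming scheme for the final vector column of each source column
def outputColumn(source: String): String = s"${source}_embeddings"

// Hypothetical helper: a full chain of fresh stage instances for one column,
// with every intermediate column name prefixed by the source column name.
def stagesFor(source: String): Seq[PipelineStage] = {
  val document = s"${source}_document"
  val token = s"${source}_token"
  val tokenEmbeddings = s"${source}_token_embeddings"
  val sentence = s"${source}_sentence_embeddings"

  Seq(
    new DocumentAssembler().setInputCol(source).setOutputCol(document),
    new Tokenizer().setInputCols(Array(document)).setOutputCol(token),
    XlmRoBertaEmbeddings
      .pretrained("xlm_roberta_base", "xx")
      .setInputCols(Array(document, token))
      .setOutputCol(tokenEmbeddings)
      .setCaseSensitive(false),
    new SentenceEmbeddings()
      .setInputCols(Array(document, tokenEmbeddings))
      .setOutputCol(sentence)
      .setPoolingStrategy("AVERAGE"),
    new EmbeddingsFinisher()
      .setInputCols(sentence)
      .setOutputCols(outputColumn(source))
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
  )
}

// One pipeline containing an independent chain per source column
def createMultiColumnPipeline(sources: Seq[String]): Pipeline =
  new Pipeline().setStages(sources.flatMap(stagesFor).toArray)
```

With this layout, `createMultiColumnPipeline(Seq("name", "address")).fit(testData).transform(testData)` would be expected to yield `name_embeddings` and `address_embeddings` in one pass, at the cost of holding two copies of the model.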

The closest thing I found is [this old issue](https://github.com/JohnSnowLabs/spark-nlp/issues/2839), where someone was trying to run multiple models in one pipeline, but if I try to add two `embeddings` stages to the pipeline, Spark ML fails with:

`IllegalArgumentException: requirement failed: Cannot have duplicate components in a pipeline.`
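
As far as I understand Spark ML, this check fires when the same stage instance appears more than once in `setStages`, not when two stages of the same class are present. The sketch below illustrates that with plain Spark ML stages and no Spark NLP dependency; the helper `wordsStage` and the two-column schema are made up for the example, and whether this matches the exact cause of the error above is an assumption.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{Tokenizer => MlTokenizer}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.util.Try

// Illustrative schema with the two text columns from the example
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("address", StringType)
))

// Hypothetical helper: a fresh Spark ML tokenizer per column
def wordsStage(source: String): MlTokenizer =
  new MlTokenizer().setInputCol(source).setOutputCol(s"${source}_words")

// The same instance listed twice
val shared = wordsStage("name")
val duplicated = new Pipeline().setStages(Array[PipelineStage](shared, shared))

// Two separate instances of the same class
val distinct = new Pipeline().setStages(
  Array[PipelineStage](wordsStage("name"), wordsStage("address"))
)

// Schema validation is where the duplicate check runs; `fit` triggers it too
val duplicatedResult = Try(duplicated.transformSchema(schema))
val distinctResult = Try(distinct.transformSchema(schema))

println(duplicatedResult.isFailure) // the shared instance is rejected
println(distinctResult.isSuccess)   // separate instances pass validation
```

If that reading is right, creating a new annotator instance per column, each with its own output column, should get past this particular error.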

I'm not sure how common this use case is, given that there don't seem to be other issues like it, but I would appreciate some thoughts on the topic.

Thanks!

Environment: `Spark 3.5.0, Scala 2.12, Spark NLP 5.5.1`


Best practices for running a model for sentence embeddings on two different columns #14468
