Description
Link to the documentation pages (if available)
No response
How could the documentation be improved?
Hi,

I tried searching for existing documentation or discussions on how to run a model for sentence embeddings over two separate columns and did not find any. I was wondering if there are any recommendations or known gotchas on the topic. Say I have a data frame with a `name` and an `address` column, and I would like to use a RoBERTa model to compute sentence embeddings for both. The best I could come up with is something like the following:
```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Tokenizer, SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.Pipeline

def createPipeline(source: String): Pipeline = {
  val documentAssembler = new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol("document")

  val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

  val embeddings = XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(false)

  val sentenceEmbeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "embeddings"))
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

  val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols("sentence_embeddings")
    .setOutputCols("finished_embeddings")
    .setOutputAsVector(true)
    .setCleanAnnotations(false)

  new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    sentenceEmbeddings,
    embeddingsFinisher
  ))
}
```
And then basically doing something like this:

```scala
val testDataWithNameEmbeddings = createPipeline("name")
  .fit(testData)
  .transform(testData)
  .select($"name", $"address", $"finished_embeddings".alias("name_embeddings"))

val testDataWithBothEmbeddings = createPipeline("address")
  .fit(testDataWithNameEmbeddings)
  .transform(testDataWithNameEmbeddings)
  .select($"name", $"address", $"name_embeddings", $"finished_embeddings".alias("address_embeddings"))
```
This appears to work, but feels... wrong? The existence of `MultiDocumentAssembler` and the `setInputCols` APIs on several of the stages led me down a rabbit hole of trying different approaches to see whether I could annotate and tokenize multiple columns in one stage, but I hit a variety of issues and assertions in different components of the pipeline. For example, calling `setInputCols` on `Tokenizer` with an array containing more than one column results in:

```
IllegalArgumentException: requirement failed: setInputCols in REGEX_TOKENIZER_2889f26665ad expecting 1 columns. Provided column amount: 2. Which should be columns from the following annotators: document
```
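For reference, my (possibly mistaken) reading of `MultiDocumentAssembler` is that it can assemble several columns into several `DOCUMENT` outputs at once, but each downstream annotator still consumes a fixed set of input columns, so it seems aimed at paired inputs like question/context rather than fanning one pipeline out over independent columns. Roughly:

```scala
import com.johnsnowlabs.nlp.MultiDocumentAssembler

// MultiDocumentAssembler emits one DOCUMENT column per input column,
// but downstream annotators such as Tokenizer still assert exactly one
// DOCUMENT input column, which matches the assertion above.
val multiAssembler = new MultiDocumentAssembler()
  .setInputCols("name", "address")
  .setOutputCols("name_document", "address_document")
```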
The closest thing I stumbled upon is this old issue, where someone was trying to run multiple models in one pipeline. But if I try to add two `embeddings` stages to the pipeline, spark-ml fails with:

```
IllegalArgumentException: requirement failed: Cannot have duplicate components in a pipeline.
```
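In case it is useful, my understanding (unverified) is that Spark ML's duplicate check rejects the same stage *object* appearing twice in `setStages`, not two stages of the same class. If that is right, a single pipeline built from independently constructed, per-column stage instances with column-suffixed output names might avoid both the two-pass transform and this error. A sketch only, assuming the instance-identity interpretation and that each `XlmRoBertaEmbeddings.pretrained(...)` call returns a distinct annotator instance:

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Tokenizer, SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Build one set of stages per source column; every stage is a fresh
// object and every output column is suffixed with the source name,
// so no two entries in the stages array are the same instance.
def createStages(source: String): Array[PipelineStage] = Array(
  new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol(s"${source}_document"),
  new Tokenizer()
    .setInputCols(Array(s"${source}_document"))
    .setOutputCol(s"${source}_token"),
  XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array(s"${source}_document", s"${source}_token"))
    .setOutputCol(s"${source}_embeddings")
    .setCaseSensitive(false),
  new SentenceEmbeddings()
    .setInputCols(Array(s"${source}_document", s"${source}_embeddings"))
    .setOutputCol(s"${source}_sentence_embeddings")
    .setPoolingStrategy("AVERAGE"),
  new EmbeddingsFinisher()
    .setInputCols(s"${source}_sentence_embeddings")
    .setOutputCols(s"${source}_finished_embeddings")
    .setOutputAsVector(true)
    .setCleanAnnotations(false)
)

// Single fit/transform pass producing name_finished_embeddings and
// address_finished_embeddings side by side.
val pipeline = new Pipeline()
  .setStages(createStages("name") ++ createStages("address"))
```

The obvious downside would be that the pretrained model is loaded twice, so I am not sure this is actually preferable to running two pipelines.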
I'm not sure how common this use case is, given that there don't seem to be other issues like it, but I would appreciate some thoughts on the topic.
Thanks!
Environment: Spark 3.5.0, Scala 2.12, Spark NLP 5.5.1