Hello Villu,
Thank you for this great package for exporting Spark ML models. However, I am finding it quite hard to work with for my use case:
My input: a column named 'sentence'
My output: a column named 'prediction' produced by logistic regression classification on the 'sentence' column
My pipeline: RegexTokenizer -> NGram -> CountVectorizer -> IDF -> VectorAssembler -> LogisticRegression
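Roughly, the pipeline is assembled like this (a simplified sketch; the intermediate column names, the n-gram size and the label column here are placeholders, not my exact code):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, IDF, NGram, RegexTokenizer, VectorAssembler}

// Tokenize the raw 'sentence' column
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")

// Build n-grams from the tokens (n = 2 is just a placeholder)
val ngram = new NGram()
  .setN(2)
  .setInputCol("tokens")
  .setOutputCol("ngrams")

// Term counts and TF-IDF weighting
val countVectorizer = new CountVectorizer()
  .setInputCol("ngrams")
  .setOutputCol("tf")

val idf = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")

// Assemble the feature vector and classify;
// LogisticRegression writes its result to the default 'prediction' column
val assembler = new VectorAssembler()
  .setInputCols(Array("tfidf"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, ngram, countVectorizer, idf, assembler, lr))
```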
Problem 1:
My RegexTokenizer code (shown in rough form below) throws an error when I try to export the pipeline. So I tried to implement the tokenization myself and pass a column of token arrays as the input instead, and then I got another error.

Problem 2:
After tracking down some related issues, I understand that the vector type is not supported as an input, so I have to consider building the pipeline from the tokenizer again. I then changed my tokenizer to splitter mode, which only led to the next problem.
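In rough form, the two tokenizer configurations look like this (simplified; the patterns and column names here are representative, not the exact snippets):

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// Problem 1: matching ("catching") mode -- the pattern describes the tokens to extract
val matchingTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setGaps(false)                    // treat the pattern as tokens to extract
  .setPattern("\\b[a-zA-Z]{3,}\\b")  // keep words of 3+ letters

// Problem 2: splitter mode -- the pattern describes the separator instead
val splitterTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setGaps(true)       // treat the pattern as a separator
  .setPattern("\\s+")  // split on whitespace
```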
Problem 3:
After checking the source code (around line 118), I see there is a requirement that a token cannot start or end with a punctuation character. But this happens a lot: for "This is a sentence.", after splitting on spaces, the last token ends with a period ("."). In matching ("catching") mode, a pattern such as "\b[a-zA-Z]{3,}\b" would extract 'clean' tokens easily.
Having no other choice, I kept hacking on it. I then tried splitting the sentence with the pattern \\b[^a-zA-Z]{0,}\\b, which splits the text on non-English-letter characters, and filtered the tokens by setting the minimum token length to 3. This works fine in Spark, but when I export the pipeline I get another error.
Problem 4:
As the error message reads, getMinTokenLength is not supported in jpmml-sparkml.
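The attempted configuration is roughly the following (a simplified sketch):

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// Split on runs of non-letters, then drop short tokens
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setGaps(true)                      // splitter mode
  .setPattern("\\b[^a-zA-Z]{0,}\\b")  // split on non-English-letter characters
  .setMinTokenLength(3)               // drop tokens shorter than 3 characters

// Spark runs this fine, but exporting the fitted pipeline with jpmml-sparkml
// fails because minTokenLength is not supported by the converter.
```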
I'm really frustrated, since this is a simple and typical task. I've tried different ways to work around it, but they all failed. Could you please point me in the right direction? Thank you.
I found that PMML only supports splitter mode: according to the v4.3 specification, wordSeparatorCharacterRE is used to pass a regular expression describing the separator characters.
It seems it is not possible to add a matching mode unless the specification is updated.
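So, if I understand correctly, the only tokenizer shape that can be represented is a plain splitter, roughly like this (a sketch, assuming the split pattern is what ends up as the TextIndex's wordSeparatorCharacterRE):

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// A splitter-mode tokenizer that should be expressible in PMML 4.3:
// the split pattern plays the role of the word separator regular expression.
val exportableTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setGaps(true)           // splitter mode only
  .setPattern("\\s+")      // a separator regex, not a token-matching regex
  .setMinTokenLength(1)    // keep the default, since minTokenLength is not supported
```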