Request matching mode and setMinTokenLength support for RegexTokenizer #42

tong-zeng opened this issue Apr 24, 2018 · 3 comments

@tong-zeng

Hello Villu,
Thank you for this great package for exporting Spark ML models. However, I am finding it not easy to work with:

My input: a column named 'sentence'
My output: a column named 'prediction' produced by logistic classification for the column 'sentence'
My pipeline: RegexTokenizer -> NGram -> CountVectorizer -> IDF -> VectorAssembler -> LogisticRegression (roughly as sketched below)
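
For context, the pipeline looks roughly like this. This is only a sketch for illustration: the label column, n-gram size and intermediate column names are placeholders, not my exact code.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RegexTokenizer, NGram, CountVectorizer, IDF, VectorAssembler

# Intermediate column names and parameters are illustrative only.
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words")
ngrams = NGram(n=2, inputCol="words", outputCol="ngrams")
counts = CountVectorizer(inputCol="ngrams", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tfidf")
assembler = VectorAssembler(inputCols=["tfidf"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, ngrams, counts, idf, assembler, lr])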

Problem 1:
My RegexTokenizer code is as below:

from pyspark.ml import feature

tokenizer = feature.RegexTokenizer() \
  .setGaps(False) \
  .setPattern("\\b[a-zA-Z]{3,}\\b") \
  .setInputCol("sentence") \
  .setOutputCol("words")

But it throws an error:

IllegalArgumentException: 'Expected splitter mode, got token matching mode'

So I thought about implementing the tokenizer myself and passing a column of token arrays as input, but then I got:

Problem 2:

IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

After tracking down some related issues, I understand that the vector type is not supported, so I had to rebuild the pipeline starting from the tokenizer. I changed my tokenizer to splitter mode:

tokenizer = feature.RegexTokenizer() \
  .setGaps(True) \
  .setPattern("\\s+") \
  .setInputCol("sentence") \
  .setOutputCol("words")

Then I got:

Problem 3:

 java.lang.IllegalArgumentException: .
	at org.jpmml.sparkml.feature.CountVectorizerModelConverter.encodeFeatures(CountVectorizerModelConverter.java:118)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:80)

After checking the source code at line 118, I see there is a requirement that a token cannot start or end with punctuation. But this happens a lot. For example, in "This is a sentence.", after splitting on spaces the last token ends with a period ( . ). In that case, using matching mode, the pattern \b[a-zA-Z]{3,}\b can extract 'clean' tokens easily.
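
A quick illustration of the difference, using plain Python re just to show the token shapes:

import re

text = "This is a sentence."

# Splitter mode: split on whitespace; the trailing period stays attached.
print(re.split(r"\s+", text))                  # ['This', 'is', 'a', 'sentence.']

# Matching mode: extract letter-only tokens of length >= 3; punctuation is dropped.
print(re.findall(r"\b[a-zA-Z]{3,}\b", text))   # ['This', 'sentence']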
I had no choice but to continue hacking. Next I tried to split the sentence with the pattern \\b[^a-zA-Z]{0,}\\b, which splits the text on non-English letters, and then filtered the tokens by setting the minimum token length to 3. This works fine in Spark, but when I export the pipeline I get another error:

Problem 4:

java.lang.IllegalArgumentException: Expected 1 as minimum token length, got 3 as minimum token length
	at org.jpmml.sparkml.feature.RegexTokenizerConverter.encodeFeatures(RegexTokenizerConverter.java:51)

As the message reads, setMinTokenLength with any value other than 1 is not supported by jpmml-sparkml.
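
For reference, the tokenizer configuration that triggers this was roughly the following (a sketch, not my exact code):

from pyspark.ml import feature

# Splitter mode on non-letter runs, keeping only tokens of length >= 3.
tokenizer = feature.RegexTokenizer() \
  .setGaps(True) \
  .setPattern("\\b[^a-zA-Z]{0,}\\b") \
  .setMinTokenLength(3) \
  .setInputCol("sentence") \
  .setOutputCol("words")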

I'm really frustrated, since this is a simple and typical task. I've tried different ways to overcome it, but all have failed. Could you please point me in the right direction? Thank you.

@tong-zeng (Author)

I found that PMML only supports splitter mode: according to the PMML 4.3 specification, a regular expression can only be passed as the separator via wordSeparatorCharacterRE.
It seems it is not possible to add matching mode unless the specification is updated.

@tong-zeng (Author)

tong-zeng commented Apr 24, 2018

I also notice that the rule that tokens cannot start or end with punctuation is a requirement of the PMML standard.

But Spark doesn't require this. It would be good to document these differences, to help users take them into account when training the model.

@tong-zeng (Author)

tong-zeng commented Apr 24, 2018

In the end, I adjusted my input data to cater to JPMML, re-trained the model in Spark, and then exported it to PMML. Haha, it works, thank you.
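
For anyone hitting the same wall, the adjustment amounts to stripping punctuation before a plain whitespace splitter-mode tokenizer. A rough sketch of the idea (df stands for the training DataFrame; the exact cleaning I used is not shown here):

from pyspark.sql import functions as F
from pyspark.ml.feature import RegexTokenizer

# Replace everything that is not a letter or whitespace, so that
# whitespace-split tokens no longer start or end with punctuation.
df_clean = df.withColumn("sentence", F.regexp_replace("sentence", "[^a-zA-Z\\s]", " "))

# Splitter mode on whitespace, which jpmml-sparkml can convert.
tokenizer = RegexTokenizer(gaps=True, pattern="\\s+",
                           inputCol="sentence", outputCol="words")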
