Request matching mode and setMinTokenLength support for RegexTokenizer #42

tong-zeng opened this issue Apr 24, 2018 · 3 comments

@tong-zeng

Hello Villu,
Thank you for this great package for exporting Spark ML models. However, I am finding it not easy to work with:

My input: a column named 'sentence'
My output: a column named 'prediction' produced by logistic classification for the column 'sentence'
My pipeline: RegexTokenizer -> NGram -> CountVectorizer -> IDF -> VectorAssembler -> LogisticRegression (roughly as sketched below)
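
For context, the pipeline looks roughly like this. This is only a sketch for illustration: the label column, n-gram size and intermediate column names are placeholders, not my exact code.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RegexTokenizer, NGram, CountVectorizer, IDF, VectorAssembler

# Intermediate column names and parameters are illustrative only.
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words")
ngrams = NGram(n=2, inputCol="words", outputCol="ngrams")
counts = CountVectorizer(inputCol="ngrams", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tfidf")
assembler = VectorAssembler(inputCols=["tfidf"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, ngrams, counts, idf, assembler, lr])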

Problem 1:
My RegexTokenizer code is as below:

from pyspark.ml import feature

tokenizer = feature.RegexTokenizer() \
  .setGaps(False) \
  .setPattern("\\b[a-zA-Z]{3,}\\b") \
  .setInputCol("sentence") \
  .setOutputCol("words")

But it throws an error:

IllegalArgumentException: 'Expected splitter mode, got token matching mode'

So I thought about implementing the tokenizer myself and passing a column of token arrays as input, but then I got:

Problem 2:

IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

After tracking down some related issues, I understand that the vector type is not supported, so I had to rebuild the pipeline starting from the tokenizer. I changed my tokenizer to splitter mode:

tokenizer = feature.RegexTokenizer() \
  .setGaps(True) \
  .setPattern("\\s+") \
  .setInputCol("sentence") \
  .setOutputCol("words")

Then I got:

Problem 3:

 java.lang.IllegalArgumentException: .
	at org.jpmml.sparkml.feature.CountVectorizerModelConverter.encodeFeatures(CountVectorizerModelConverter.java:118)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:80)

After checking the source code at line 118, I see there is a requirement that a token cannot start or end with punctuation. But this happens a lot. For example, in "This is a sentence.", after splitting on spaces the last token ends with a period ( . ). In that case, using matching mode, the pattern \b[a-zA-Z]{3,}\b can extract 'clean' tokens easily.
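
A quick illustration of the difference, using plain Python re just to show the token shapes:

import re

text = "This is a sentence."

# Splitter mode: split on whitespace; the trailing period stays attached.
print(re.split(r"\s+", text))                  # ['This', 'is', 'a', 'sentence.']

# Matching mode: extract letter-only tokens of length >= 3; punctuation is dropped.
print(re.findall(r"\b[a-zA-Z]{3,}\b", text))   # ['This', 'sentence']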
I had no choice but to continue hacking. Next I tried to split the sentence with the pattern \\b[^a-zA-Z]{0,}\\b, which splits the text on non-English letters, and then filtered the tokens by setting the minimum token length to 3. This works fine in Spark, but when I export the pipeline I get another error:

Problem 4:

java.lang.IllegalArgumentException: Expected 1 as minimum token length, got 3 as minimum token length
	at org.jpmml.sparkml.feature.RegexTokenizerConverter.encodeFeatures(RegexTokenizerConverter.java:51)

As the message reads, setMinTokenLength with any value other than 1 is not supported by jpmml-sparkml.
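
For reference, the tokenizer configuration that triggers this was roughly the following (a sketch, not my exact code):

from pyspark.ml import feature

# Splitter mode on non-letter runs, keeping only tokens of length >= 3.
tokenizer = feature.RegexTokenizer() \
  .setGaps(True) \
  .setPattern("\\b[^a-zA-Z]{0,}\\b") \
  .setMinTokenLength(3) \
  .setInputCol("sentence") \
  .setOutputCol("words")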

I'm really frustrated, since this is a simple and typical task. I've tried different ways to overcome it, but all have failed. Could you please point me in the right direction? Thank you.

@tong-zeng (Author)

I found that PMML only supports splitter mode: according to the PMML 4.3 specification, a regular expression can only be passed as the separator via wordSeparatorCharacterRE.
It seems it is not possible to add matching mode unless the specification is updated.

@tong-zeng (Author)

tong-zeng commented Apr 24, 2018

I also notice that the rule that tokens cannot start or end with punctuation is a requirement of the PMML standard.

But Spark doesn't require this. It would be good to document these differences, to help users take them into account when training the model.

@tong-zeng (Author)

tong-zeng commented Apr 24, 2018

In the end, I adjusted my input data to cater to JPMML, re-trained the model in Spark, and then exported it to PMML. Haha, it works, thank you.
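
For anyone hitting the same wall, the adjustment amounts to stripping punctuation before a plain whitespace splitter-mode tokenizer. A rough sketch of the idea (df stands for the training DataFrame; the exact cleaning I used is not shown here):

from pyspark.sql import functions as F
from pyspark.ml.feature import RegexTokenizer

# Replace everything that is not a letter or whitespace, so that
# whitespace-split tokens no longer start or end with punctuation.
df_clean = df.withColumn("sentence", F.regexp_replace("sentence", "[^a-zA-Z\\s]", " "))

# Splitter mode on whitespace, which jpmml-sparkml can convert.
tokenizer = RegexTokenizer(gaps=True, pattern="\\s+",
                           inputCol="sentence", outputCol="words")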
