
The batch prediction mode should support row-level exception handling #26

Open

vruusmann opened this issue Aug 15, 2024 · 4 comments

@vruusmann
Member

See jpmml/jpmml-evaluator#271 (comment) and jpmml/jpmml-evaluator#271 (comment)

@brother-darion

In the batch prediction scenario, I think it would be better to have an option to choose between raising an exception and setting the failing record's result to NaN while continuing to predict the others.

The Java interface o.j.e.Evaluator only supports a single-row prediction mode via Evaluator#evaluate(Map).

The Python interface builds its batch prediction mode jpmml_evaluator.Evaluator.evaluateAll(DataFrame) on top of it. The main benefit of the batch interface is that all rows are moved between Python and Java in a single call (instead of many calls, one call per row).
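
For context, a minimal usage sketch of the batch interface (the file paths are placeholders):

```python
from jpmml_evaluator import make_evaluator

import pandas

# Build a PMML model evaluator; "DecisionTree.pmml" and "input.csv" are placeholder paths
evaluator = make_evaluator("DecisionTree.pmml") \
    .verify()

arguments_df = pandas.read_csv("input.csv")

# A single Python-to-Java round trip for the whole batch
results_df = evaluator.evaluateAll(arguments_df)
```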

Now, this is actually a good idea - the JPMML-Evaluator-Python library should provide an option for configuring "what to do about an EvaluationException".

I can quickly think of two options:

  1. "return invalid" aka "as-is". Matches the current behaviour, where the Java exception is propagated to the top, and the evaluation is stopped at that location.
  2. "replace with NaN" aka "ignore". The Java component will catch a row-specific exception, and replaces the result for that row with Double#NaN (or some other user-specified constant?).

Also, in "return invalid" aka "as-is" mode, it should be possible to configure if partial results can be returned or not. Suppose there is a batch of 10'000 rows, and the evaluation fails on row 8566 because of some data input error. I think it might makse sense to return the leading 8565 results in that case.
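
For illustration, the "replace with NaN" policy can already be emulated in Python application code by falling back to per-row Evaluator.evaluate(dict) calls; a minimal sketch, where the result field name "y" is a placeholder:

```python
import pandas

def evaluate_all_ignore(evaluator, arguments_df):
    """Emulates a hypothetical "replace with NaN" aka "ignore" policy with per-row calls."""
    results = []
    for _, row in arguments_df.iterrows():
        try:
            results.append(evaluator.evaluate(row.to_dict()))
        except Exception:
            # Replace the failed row's result with NaN ("y" is a placeholder result field)
            results.append({"y": float("nan")})
    return pandas.DataFrame(results, index = arguments_df.index)
```

Note that this emulation forfeits the batch interface's main benefit (one Python-to-Java call), which is exactly why the policy belongs in the Java component.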

Right, these are really user-friendly options. And these two options would be added on top of the current behaviour, which is to just throw an exception, right? Like you said, clear feedback is important, and these options are important too.

Also, I was thinking that the "replace with NaN" mode needs a threshold (a specified number of failed rows) for deciding whether to stop the evaluation or not, because in scenarios where people are using wrong data altogether, it would be a little annoying to still evaluate everything. Something like the sketch below.
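
A minimal sketch of this idea, continuing the per-row emulation from above (the parameter name max_errors is made up):

```python
def evaluate_all_with_threshold(evaluator, arguments_df, max_errors = 10):
    """Hypothetical "replace with NaN, but stop after too many failures" policy."""
    results = []
    errors = 0
    for _, row in arguments_df.iterrows():
        try:
            results.append(evaluator.evaluate(row.to_dict()))
        except Exception:
            errors += 1
            if errors > max_errors:
                # Too many failed rows - the whole batch is probably bad data
                raise
            results.append({"y": float("nan")})  # "y" is a placeholder result field
    return pandas.DataFrame(results, index = arguments_df.index)
```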

What do you think?

@vruusmann
Member Author

There is a third option - "omit row" aka "drop". If there are evaluation errors, then the corresponding rows are simply omitted from the results batch.

The "omit row" option assumes that the user has assigned custom identifiers to the rows of the arguments batch. So, if there are 156 argument rows, and only 144 result rows (meaning that 12 rows errored out), then the user can locally identify "successful" vs "failed" rows in her application code.

See #23 about row identifiers.
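
For example, assuming the "drop" policy and a custom identifier index (both hypothetical at this point), the failed rows could be identified like this:

```python
# "record_id" is a placeholder identifier column (see #23 about row identifiers)
arguments_df = arguments_df.set_index("record_id")

# Hypothetically evaluated with the "omit row" aka "drop" policy
results_df = evaluator.evaluateAll(arguments_df)

# 156 argument rows vs 144 result rows - the index difference yields the 12 failed rows
failed_ids = arguments_df.index.difference(results_df.index)
```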

@vruusmann
Member Author

As a general comment - my "design assumption" behind the Evaluator.evaluateAll(X) method is that the size of the arguments dataframe is about/up to 10'000 cells (e.g. a dataframe of 10 features x 1'000 rows).

My thinking is that the data is being moved between Python and Java environments using the Pickle protocol. If the pickle payload gets really big (say, 1'000'000 cells instead of 10'000 cells), then the Java component responsible for loading/dumping might start hitting unexpected memory/processing limitations.

If the dataset is much bigger than 10'000 cells, then it should be partitioned into multiple chunks in Python application code. And the chunking algorithm should be prepared to handle the "omit row" option gracefully.
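
A minimal sketch of such application-level chunking (the cell budget and function name are made up):

```python
import math

import numpy
import pandas

def evaluate_in_chunks(evaluator, arguments_df, max_cells = 10_000):
    """Partitions a big arguments dataframe into roughly max_cells-sized chunks."""
    rows_per_chunk = max(1, max_cells // len(arguments_df.columns))
    n_chunks = math.ceil(len(arguments_df) / rows_per_chunk)
    result_chunks = [evaluator.evaluateAll(chunk) for chunk in numpy.array_split(arguments_df, n_chunks)]
    # pandas.concat keeps each chunk's index, so chunks with omitted ("dropped") rows
    # concatenate cleanly - the final result simply has fewer rows than the input
    return pandas.concat(result_chunks)
```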

@vruusmann
Member Author

my "design assumption" behind the Evaluator.evaluateAll(X) method is that the size of the arguments dataframe is about/up to 10'000 cells

The Evaluator.evaluateAll(X) method should have an extra parameter for controlling the batch size. The default would be my design assumption - about 10'000 cells. But the end user can increase or decrease its value if needed.

This way, the chunking logic would be nicely available at the JPMML-Evaluator-Python library level, leaving the actual Python application code clean.
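
In other words, something along these lines (the parameter name batch_size is hypothetical):

```python
# Library-level chunking, hidden behind a hypothetical batch_size parameter (in cells)
results_df = evaluator.evaluateAll(arguments_df, batch_size = 10_000)
```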
