Troubleshooting XGBoost model performance #128
PMML is a high-level ML workflow representation, and is therefore able to operate with string values as-is. In contrast, an Apache Spark ML pipeline is a much lower-level representation, and needs to transform string values into numeric values first (here: map category levels to category indices). From the PMML perspective, these intermediate numeric index features are redundant.
The JPMML-SparkML library provides an API for making redundant features visible, by converting them to their continuous representation:

```java
org.jpmml.converter.Schema schema = ...;

List<? extends Feature> features = schema.getFeatures();

List<Feature> allNumericFeatures = features.stream()
	// THIS!
	.map(feature -> feature.toContinuousFeature())
	.collect(Collectors.toList());

Schema allNumericSchema = new Schema(schema.getEncoder(), schema.getLabel(), allNumericFeatures);
```

The resulting `allNumericSchema` describes every feature in purely numeric terms.
Do you have 10k features entering the pipeline, or exiting it? Consider ZIP code as a categorical string feature. A PMML representation would accept one string feature, and return one string feature. An Apache Spark ML pipeline would accept one string feature and return 50k+ binary indicator features (one per ZIP code). Clearly, the latter is not sustainable.
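The blow-up can be seen with any one-hot encoder; a minimal sketch using pandas (toy ZIP codes here, not the real 50k+ cardinality):

```python
import pandas as pd

# Toy example: one categorical ZIP code column
zips = pd.Series(["10001", "94103", "60601", "10001"], name = "zip")

# One-hot encoding yields one binary indicator column per distinct level,
# so the column count grows with the cardinality of the feature
encoded = pd.get_dummies(zips)
print(encoded.shape)  # → (4, 3)
```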
My educated guess is that the performance loss happens because of the one-hot encoding step.

Do your categorical features contain any missing values? If you have both kinds of categorical features, you may split their processing between two sub-pipelines - the first one (with missing values) performs the extra missing-value handling, while the second one does not.
If you want to fix this ML workflow once and for all, then you should simply upgrade the XGBoost library to some 1.5.X version, so that the one-hot-encoding of categorical features happens "natively" inside the XGBoost library. That is, it will be possible to pass the output of `StringIndexer` directly to the XGBoost estimator.

Better yet, upgrade to XGBoost version 1.6.X, and you shall get native multi-category splits on categorical features (as opposed to primitivistic one-category-against-all-other-categories splits, as is the case with OHE).
Leaving this issue open as a reminder to implement some kind of "transform smart PMML-level feature representation into dumb Apache Spark ML-level feature representation" functionality, as demonstrated in the above code snippet. This "dumbing down" requirement applies to other ML frameworks as well (e.g. Scikit-Learn). Therefore, it is likely to land in the core JPMML-Converter library.
@mlsquareup What's your target Apache Spark ML version? Also, what's your current XGBoost version - have you considered upgrading it to 1.5+ or 1.6+?

Maybe I can do a small tutorial about this topic... It's the year 2023, and nobody should be doing "external OHE plus legacy XGBoost" anymore. It's "native XGBoost" now!
Hi, I collaborate with @mlsquareup. Wanted to respond to a few points here.
Entering it. There's a lot of features. We haven't checked exactly how many are exiting, but it's 10k + a little more. Only a small fraction of the 10k features are string features that will be one-hot encoded.
We are completely aware of the limitations of one-hot encoding. We don't expect it to work well, or performantly, unless the string column has very few distinct values. This is the only case for which we are using such encodings. We wouldn't attempt to one-hot encode a zip code.
Probably not, because we have a few different datasets and not all of them have missing features. In any case, I suppose what I wanted to check was: if there are 10k features going in, and the XGBoost booster has a few hundred trees that are relatively deep, does it sound reasonable that a single prediction takes on the order of 600 ms?
I believe to do this, we would at least still need the StringIndexer. The categorical values accepted by XGBoost are integers in the range [0, number_of_categories).
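What StringIndexer produces can be sketched in a few lines (a simplified model of its default frequency-ordered behaviour, not Spark's actual implementation):

```python
from collections import Counter

def fit_string_indexer(values):
    # Spark's StringIndexer default: order labels by descending frequency,
    # then assign indices 0, 1, 2, ... - i.e. integers in [0, number_of_categories)
    ordered = [value for value, _ in Counter(values).most_common()]
    return {value: index for index, value in enumerate(ordered)}

mapping = fit_string_indexer(["a", "b", "a", "c", "a", "b"])
print(mapping)  # → {'a': 0, 'b': 1, 'c': 2}
```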
We really want to do this, but the newer versions of XGBoost appear to have performance issues for workflows with high numbers of features, which is why we haven't upgraded to 1.5+. This has been to some extent acknowledged by the XGBoost maintainers, see dmlc/xgboost#7214 (comment) and the mention of the Epsilon dataset there. We're still investigating how we can get around the problem and upgrade, but until we find a solution that doesn't massively regress our training time, we can't upgrade.
We're on 1.0. I know, I know, that must seem preposterous. It is 8x faster than 1.5 on our high-feature-count datasets! We're looking into what settings could be tweaked to restore some of the performance and upgrade.
We're on Spark 3.2.
Apparently, upgrading XGBoost won't help any JVM-based applications currently, because the categorical features support won't be around till XGBoost 2.0: dmlc/xgboost#7802
Your pipeline doesn't change the total number of features much. But does it transform them? For example, do numeric features undergo scaling, normalization etc.?

The JPMML-SparkML library performs feature usage analysis, and only keeps those features that are actually used by XGBoost split conditions. I'm wondering about the used/unused feature ratio in your pipeline - how many of those 10k features make their way into the PMML document. You can check this manually - open the PMML file in a text editor, and count the number of `MiningField` elements in the model's `MiningSchema`.

In any case, a 600 millis prediction time is rather unexpected. Is this the "first version" of a model (ie. trying to get some new idea working), or did the performance of an existing model regress that much?

Also, what's your PMML engine? Is it PyPMML (as mentioned in the opening comment) or is it JPMML-Evaluator-Python? In single prediction mode (ie. one data record at a time), I'd expect JPMML-Evaluator to be able to match XGBoost-via-Python performance.
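A quick way to get that used-feature count programmatically (a standard-library-only sketch; the PMML namespace version and file path are assumptions):

```python
import xml.etree.ElementTree as ET

def count_mining_fields(pmml_path):
    # Recent JPMML converters emit PMML 4.4; adjust the namespace URI
    # if your file declares a different PMML version
    ns = {"pmml": "http://www.dmg.org/PMML-4_4"}
    tree = ET.parse(pmml_path)
    fields = tree.getroot().findall(".//pmml:MiningSchema/pmml:MiningField", ns)
    return len(fields)
```

Comparing this count against the 10k raw inputs gives the used/unused feature ratio.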
@eugeneyarovoi Could you please contact me via e-mail so that I could ask more technical details?
No, numeric features are not transformed in any way, since XGBoost models don't need feature normalization or imputation.
It is the first version. The same model was implemented in Python as follows:
This implementation ran in about 15 ms, vs. 600 ms for PMML. Our PMML engine was PyPMML. I believe my coworker who posted earlier tried JPMML-Evaluator, and it did improve performance, but not nearly to the point of the 15 ms solution. If this is of interest, I can get more details.

We are indeed testing in single-prediction mode. Just to clarify, "single-prediction mode" isn't some setting that has to be manually enabled, correct? It is just what happens when you test with one row of data? Our tests measure average inference time for one row of data, since this emulates the online inference setting.

If you like, maybe we can try an "ablation study" where we remove the SQLTransformer and Sparse2Dense steps, to confirm those have nothing to do with the PMML performance. We may be forced to keep SQLTransformer because Spark errors on empty strings in some cases (I don't recall the details). Removing Sparse2Dense will result in a working model that incorrectly treats 0 as missing, but that's OK if it's just for this test.
I'm very much interested in analyzing this misbehaviour in more detail. I'd need a PMML file, and some test data, to run everything locally. Let's co-ordinate via e-mail.
Yes, it's the default - one `evaluator.evaluate(arguments)` call per data record.

In principle, this Map-oriented API may be sub-optimal here, because instantiating a 10k-entry `Map` for every data record is costly in itself.

In the end, I'd love to run the example in transpiled mode using the JPMML-Transpiler library.
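As a rough illustration of that per-record overhead (pure Python; the absolute numbers will vary by machine):

```python
import timeit

# Building a fresh 10k-entry dict for every single data record adds up quickly
keys = ["x" + str(i + 1) for i in range(10000)]
values = list(range(10000))

per_record = timeit.timeit(lambda: dict(zip(keys, values)), number = 100) / 100
print(per_record)  # seconds spent just materializing the argument map
```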
We can't provide the original PMML file, as it may leak confidential info, having been trained on confidential data. However, I can try to run it through a transformation where we just replace all data with random values (keeping string column cardinality the same), retrain the model, and check that it still exhibits the performance characteristics we've mentioned.

Have you tested PMML with a very high number of input features (10k+) before? Based on our experiences so far, we tend to think PMML pipelines containing OHE + XGBoost with very large numbers of features generally have these performance characteristics.
A PMML file plus 1k .. 10k data records would be sufficient (covers most XGBoost branches, plus warms up the JVM sufficiently). You can obfuscate a PMML file by hashing field names. There's even a small command-line application for that (first hash the PMML file, then CSV header cells):
Nothing very systematic. I'm mostly testing with smaller datasets, as my main goal is to ensure correctness and reproducibility of predictions. |
XGBoost differs from regular Apache Spark ML models, because it converts numeric features from `double` values to `float` values.

In an XGBoost-native environment, this conversion is very cheap - a primitive value cast. In a PMML environment, this conversion may be encoded differently - via field data type declarations (good), or an explicit cast using a special-purpose `DerivedField` element (bad).

Perhaps the model performance is bad because the PMML document contains instructions for casting 10k numeric features using 10k `DerivedField` elements.

@eugeneyarovoi Can you help me to identify how the casts to the `float` data type are encoded in your PMML file?

Option one:
Option two:
Which option describes your PMML file?
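Why the `double`-to-`float` narrowing matters at all can be sketched in pure Python (a `struct`-based round-trip standing in for a real 32-bit cast):

```python
import struct

def to_float32(x):
    # Round-trip a Python double through a 32-bit float representation
    return struct.unpack("f", struct.pack("f", x))[0]

threshold = to_float32(0.1)  # split thresholds are stored as 32-bit floats
value = 0.1                  # the incoming value arrives as a 64-bit double

# Comparing without narrowing can flip a split decision near the threshold:
# float32(0.1) rounds slightly above the double 0.1
print(value < threshold)              # → True
print(to_float32(value) < threshold)  # → False
```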
Fixed this issue by introducing a new conversion option, `org.jpmml.sparkml.xgboost.HasSparkMLXGBoostOptions.OPTION_INPUT_FLOAT`.

Consider the Iris species classification:

```scala
val labelIndexer = new StringIndexer().setInputCol("Species").setOutputCol("idx_Species")
val labelIndexerModel = labelIndexer.fit(df)

val assembler = new VectorAssembler().setInputCols(Array("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")).setOutputCol("featureVector")

val classifier = new XGBoostClassifier(Map("objective" -> "multi:softprob", "num_class" -> 3, "num_round" -> 17)).setLabelCol(labelIndexer.getOutputCol).setFeaturesCol(assembler.getOutputCol)

val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, classifier))
val pipelineModel = pipeline.fit(df)
```

The default behaviour is to accept `double` values, and cast them to `float` values using `DerivedField` elements:

```xml
<PMML>
  <DataDictionary>
    <DataField name="Petal_Length" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="float(Petal_Length)" optype="continuous" dataType="float">
      <FieldRef field="Petal_Length"/>
    </DerivedField>
  </TransformationDictionary>
</PMML>
```

Activating this transformation option:

```scala
import org.jpmml.sparkml.PMMLBuilder

new PMMLBuilder(df.schema, pipelineModel).putOption(org.jpmml.sparkml.xgboost.HasSparkMLXGBoostOptions.OPTION_INPUT_FLOAT, true).buildFile(new File("/path/to/XGBoostIris.pmml"))
```

The new behaviour is to accept `float` values directly:

```xml
<PMML>
  <DataDictionary>
    <DataField name="Petal_Length" optype="continuous" dataType="float"/>
  </DataDictionary>
</PMML>
```

In the context of the original issue, this transformation eliminates the need to perform 10'000 `DerivedField`-based casts per data record.

This transformation is currently "off" by default. It should not have any side effects if the Spark pipeline only contains XGBoost estimators. This claim is backed by actual integration tests: https://github.com/jpmml/jpmml-sparkml/blob/2.0.2/pmml-sparkml-xgboost/src/test/java/org/jpmml/sparkml/xgboost/testing/XGBoostTest.java#L61-L71
Accepting this challenge:
Expecting to see a <1 ms average prediction time. Will report back as soon as I have my results (ETA: May 2023).
Created a 1000 x 10'000 dataset using SkLearn:

```python
from pandas import DataFrame, Series
from sklearn.datasets import make_regression

X, y = make_regression(n_samples = 1000, n_features = 10000, n_informative = 5000, random_state = 13)

X = DataFrame(X, columns = ["x" + str(i + 1) for i in range(X.shape[1])])
y = Series(y, name = "y")
```

Trained an XGBoost regressor using Apache Spark 3.4.0 and XGBoost4J-Spark(_2.12) 1.7.5:

```scala
val assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("featureVector")

val trackerConf = TrackerConf(0, "scala")
val regressor = new XGBoostRegressor(Map("objective" -> "reg:squarederror", "num_round" -> 500, "max_depth" -> 5, "tracker_conf" -> trackerConf)).setLabelCol(labelCol).setFeaturesCol(assembler.getOutputCol)

val pipeline = new Pipeline().setStages(Array(assembler, regressor))
val pipelineModel = pipeline.fit(df)
```

Exported the fitted pipeline model into two PMML files - without and with the "input_float" transformation:

```scala
import org.jpmml.sparkml.PMMLBuilder

var pmmlBuilder = new PMMLBuilder(schema, pipelineModel)
pmmlBuilder = pmmlBuilder.putOption(org.jpmml.sparkml.model.HasPredictionModelOptions.OPTION_KEEP_PREDICTIONCOL, false)
pmmlBuilder.buildFile(new File("pipeline.pmml"))

pmmlBuilder = pmmlBuilder.putOption(org.jpmml.sparkml.xgboost.HasSparkMLXGBoostOptions.OPTION_INPUT_FLOAT, true)
pmmlBuilder.buildFile(new File("pipeline-float.pmml"))
```

Evaluated the PMML files with JPMML-Evaluator-Python 0.9.0 and PyPMML 0.9.17:

```python
import timeit

from jpmml_evaluator import make_evaluator
from pypmml import Model

evaluator = make_evaluator(pmml_path, lax = True)
evaluator.verify()

model = Model.fromFile(pmml_path)

# Evaluate in batch mode
print(timeit.Timer("evaluator.predict(df)", globals = globals()).timeit(number = rounds))

# Evaluate in row-by-row mode
def evaluate_row(X):
    return evaluator.evaluate(X.to_dict())["y"]

print(timeit.Timer("df.apply(evaluate_row, axis = 1)", globals = globals()).timeit(number = rounds))

# Evaluate in batch mode
print(timeit.Timer("model.predict(df)", globals = globals()).timeit(number = rounds))
```

The dataset was scored twice - first with all 10k input features, and then with only the actually used 3.5k features. Limiting the dataset using PMML model schema information:

```python
import pandas

df = pandas.read_csv(csv_path)
print(df.shape)

# Drop unused input columns - retains around 3.5k columns out of the initial 10k columns
df = df[[inputField.name for inputField in evaluator.getInputFields()]]
print(df.shape)
```
Timings for `pipeline.pmml`:

Timings for `pipeline-float.pmml`:
Conclusions:

- This issue was raised because the OP was using PyPMML in full dataset mode. By switching from PyPMML to JPMML-Evaluator-Python, it is possible to speed up the evaluation (1467 ms / 23 ms) = ~63 times instantly.
- The "input_float" transformation doesn't seem to be as effective as hoped. It appears to speed things up by ~20%.
- What helps evaluation speeds considerably is limiting the amount of data transfer between Python and Java environments (sending the feature vector). By eliminating roughly 2/3 of the input columns - those that are provably not needed by the XGBoost model - it's possible to speed up evaluation (23 ms / 10 ms) = ~2 .. 2.5 times.

Putting everything together, it's possible to go from PyPMML's 1467 ms to JPMML-Evaluator-Python's 8 ms by changing only a couple lines of Python code in the final scoring script.
The attained 8 ms average prediction time (on my computer) is still far away from the stated <1 ms goal. The bottleneck appears to be data transfer between Python and Java environments. In the current case, the solution would be about passing the data between the two environments in bulk, rather than assembling and sending a fresh argument map for every data record.
I've just released a new JPMML-Evaluator-Python version with reduced Python-to-Java data exchange overhead.

New timings for `pipeline.pmml`:

New timings for `pipeline-float.pmml`:
Now that the majority of data exchange overhead has been eliminated, it's possible to see that the "input_float" transformation option actually makes a difference - the average scoring time falls from 5.1 ms to 3.6 ms for the full dataset (all 10'000 input features), and from 4.0 ms to 2.3 ms for the limited dataset (the actually used 3'638 input features).

Got some ideas for future optimization work. The 1 ms goal (on my computer) is not that far off anymore.
Hi,

We're attempting to convert a SparkML pipeline of

[SQLTransformer (simple string replacement for empty/null strings), StringIndexer, OneHotEncoder, VectorAssembler, Sparse2Dense, XGBoost classifier]

into just

[SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler (optional), Sparse2Dense (optional)].

The output .pmml file for [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler], when loaded via PyPMML and called with .predict(), outputs only a few of the derived string columns, not any of the many, many numeric features. Also, the derived string columns do not come out as one-hot encodings, or even indexed values. Is there a way to convert a pipeline of [SQLTransformer, StringIndexer, OneHotEncoder, VectorAssembler (optional)] that outputs the exact output we would get from the original SparkML pipeline? I.e. it should have all numeric features, with the string features string-indexed and then one-hot encoded.

Context: We noticed severe performance issues for PMML models that had 10k+ features. The PMML model is a converted

[SQLTransformer (simple string replacement for empty/null strings), StringIndexer, OneHotEncoder, VectorAssembler, Sparse2Dense, XGBoost classifier]

Spark pipeline. We wanted to determine the cause of the poor performance, so we separated the XGBoost classifier and are trying to performance-test just the preprocessing portion of the pipeline. Does this make sense with PMML, to only do the preprocessing portion via PMML? Or should we do all PMML or none at all?