Check if the number of points to sample for top-level tree is less than the number of records in training dataset #21

jaceklaskowski · 2017-02-27T13:34:41Z

I just ran into this issue. The cause was that the number of points to sample (defaults to 1000) was higher than the number of records in the training dataset. Perhaps obvious to ML practitioners, but I spent a few minutes debugging to nail it down.

It'd be nice to find this out before fitting a model, or at least to get a more user-friendly error message.

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (333.3333333333333) must be on interval [0, 1]
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:148)
	at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:495)
	at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:490)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.sample(RDD.scala:490)
	at org.apache.spark.ml.knn.KNN.fit(KNN.scala:387)
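
To illustrate the kind of up-front check being asked for, here is a minimal sketch. It assumes the sampling fraction is computed as the number of points to sample divided by the record count, which would be consistent with the 333.33 in the trace (1000 points over 3 records), but I have not confirmed that against KNN.scala. `TopTreeSizeCheck` and its methods are hypothetical names for illustration, not part of the library.

```scala
// Hypothetical helper, NOT part of spark-knn: validates the sample size
// before any sampling is attempted.
object TopTreeSizeCheck {

  /** Fraction of the training set that would be sampled.
    * Assumption: fraction = topTreeSize / numRecords.
    */
  def samplingFraction(topTreeSize: Int, numRecords: Long): Double =
    topTreeSize.toDouble / numRecords

  /** Returns Right(fraction) when the fraction is a valid probability,
    * or Left(message) with a readable explanation otherwise.
    */
  def validate(topTreeSize: Int, numRecords: Long): Either[String, Double] =
    if (numRecords <= 0)
      Left("Training dataset is empty; cannot sample points for the top-level tree.")
    else if (topTreeSize > numRecords)
      Left(
        s"Number of points to sample ($topTreeSize) exceeds the number of " +
          s"records in the training dataset ($numRecords). Lower the sample " +
          "size or provide more training data.")
    else
      Right(samplingFraction(topTreeSize, numRecords))
}
```

With the default of 1000 points and a 3-record dataset, `TopTreeSizeCheck.validate(1000, 3L)` returns a `Left` with a message that names both numbers, instead of the "Sampling fraction" failure deep inside `RDD.sample`.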


kaushikacharya · 2017-07-16T06:28:31Z

In the sample function in org.apache.spark.rdd.RDD.scala,
there's a check for fraction >= 0:

require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

But the corresponding check for fraction <= 1 isn't present.
I'm just wondering whether this check should also be there.

EDIT: It does throw an error like this:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1.7921146953405018) must be on interval [0, 1]
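
For context on why the two bounds show up in different places: as far as I know, `RDD.sample` accepts fractions above 1 when sampling with replacement (the fraction is then the expected number of times each element is picked), so only the lower bound applies to both modes, and the upper bound is enforced by the Bernoulli sampler used without replacement. The sketch below is a simplified stand-in for that split, not Spark's actual code; `SampleChecks.checkFraction` is a hypothetical name.

```scala
import scala.util.Try

// Simplified model of the two-stage validation; NOT Spark's implementation.
object SampleChecks {

  def checkFraction(withReplacement: Boolean, fraction: Double): Unit = {
    // Lower bound: applies to both sampling modes.
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

    // Upper bound: only meaningful without replacement, where the
    // fraction is the probability of keeping each element.
    if (!withReplacement) {
      require(fraction <= 1,
        s"Sampling fraction ($fraction) must be on interval [0, 1]")
    }
  }
}

object SampleChecksDemo extends App {
  // Without replacement, a fraction above 1 is rejected.
  println(Try(SampleChecks.checkFraction(withReplacement = false, 1.79)).isFailure)
  // With replacement, the same fraction is accepted.
  println(Try(SampleChecks.checkFraction(withReplacement = true, 1.79)).isSuccess)
}
```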

jaceklaskowski changed the title ~~Check if the number of points to sample for top-level tree is greater than the number of records in training dataset~~ Check if the number of points to sample for top-level tree is less than the number of records in training dataset Feb 27, 2017

kaushikacharya mentioned this issue Jan 11, 2018

knn.fit(training) throws an exception #32

Open

aaronquantexa mentioned this issue Apr 21, 2021

Add check that data size is greater than topTreeSize #52

Open
