Check if the number of points to sample for top-level tree is less than the number of records in training dataset #21

Open
jaceklaskowski opened this issue Feb 27, 2017 · 1 comment

Comments

@jaceklaskowski
Contributor

I just ran into this issue. The cause was that the number of points to sample for the top-level tree (defaults to 1000) was higher than the number of records in the training dataset. Perhaps obvious for ML practitioners, but I spent a few minutes debugging to nail it down.

It'd be nice to find this out before fitting a model, or at least to get a more user-friendly error message.

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (333.3333333333333) must be on interval [0, 1]
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:148)
	at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:495)
	at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:490)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.sample(RDD.scala:490)
	at org.apache.spark.ml.knn.KNN.fit(KNN.scala:387)
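
A minimal sketch of what such an up-front check might look like, written as a plain Scala function with no Spark dependency. `TopTreeSizeCheck` and `samplingFraction` are hypothetical names, not part of spark-knn, and the formula (number of points divided by number of records) is an assumption inferred from the reported fraction: 333.33... is consistent with 1000 / 3.

```scala
// Hypothetical helper, not part of spark-knn.
// Validates the sample size before a sampling fraction is computed,
// so the user sees a message about their own parameters rather than
// the generic "Sampling fraction ... must be on interval [0, 1]".
object TopTreeSizeCheck {

  /** Returns the sampling fraction, ASSUMING it is topTreeSize / numRecords. */
  def samplingFraction(topTreeSize: Int, numRecords: Long): Double = {
    require(numRecords > 0, "Training dataset is empty")
    require(
      topTreeSize <= numRecords,
      s"Number of points to sample for the top-level tree ($topTreeSize) " +
        s"must not exceed the number of records in the training dataset ($numRecords)")
    topTreeSize.toDouble / numRecords
  }
}
```

The check allows the two numbers to be equal, since a fraction of exactly 1 still lies in [0, 1]. Under these assumptions, calling `TopTreeSizeCheck.samplingFraction(1000, 3L)` would fail with a message naming both values instead of the raw fraction.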
@jaceklaskowski jaceklaskowski changed the title from "Check if the number of points to sample for top-level tree is greater than the number of records in training dataset" to "Check if the number of points to sample for top-level tree is less than the number of records in training dataset" Feb 27, 2017
@kaushikacharya
Contributor

kaushikacharya commented Jul 16, 2017

In the sample function of org.apache.spark.rdd.RDD.scala,
there's a check for fraction >= 0:

require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

But the corresponding check for fraction <= 1 isn't present.
I'm wondering whether that check should be there as well.

EDIT: It does throw an error like this:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1.7921146953405018) must be on interval [0, 1]
