Algorithm fails on small data #41

Open
nsutcliffe opened this issue Apr 15, 2019 · 0 comments

As part of our CI/CD pipeline, we have a daily build that runs on a relatively small amount of data. It uncovered an interesting bug. The method estimateTau contains the following line:

val y = DenseVector(estimators.map { case (_, d) => math.log(d) })

Here, d is the average distance between points. On the small data used in our daily build, beta can exceed 0. When this happens, yMax, which is defined as:

val yMax = breeze.linalg.max(y)

is used as the bufferSize. Because our average distances are small (math.log(d) is negative for any d < 1), yMax is below -1.
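To illustrate how yMax ends up below -1, here is a minimal sketch that mirrors the two lines above without the Breeze dependency. The object name SmallDataRepro and the sample values are hypothetical and chosen only so that every average distance is below 1/e; they are not taken from our build.

```scala
object SmallDataRepro {
  // Hypothetical (sample size, average distance) pairs; all distances are below 1/e.
  val estimators: Seq[(Int, Double)] = Seq((100, 0.12), (200, 0.20), (400, 0.26))

  // Same transformation as in estimateTau, using a plain Seq instead of a DenseVector.
  val y: Seq[Double] = estimators.map { case (_, d) => math.log(d) }

  // Plain-Scala stand-in for breeze.linalg.max(y).
  val yMax: Double = y.max

  // math.log(d) < -1 whenever d < 1/e (about 0.368), so yMax is below -1 here.
  val isBelowMinusOne: Boolean = yMax < -1.0
}
```

With these values every entry of y is below -1, so the maximum is too. Any sample whose largest average distance is under roughly 0.368 behaves the same way.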

Specifically, the following appears in the log:

ERROR KNN: Unable to estimate Tau with positive beta: 0.1577160047542901. This maybe because data is too small.
Setting to -1.3153582722102333 which is the maximum average distance we found in the sample.
This may leads to poor accuracy. Consider manually set bufferSize instead.
You can also try setting balanceThreshold to zero so only metric trees are built.

(This message is only logged; the code does not stop here and continues running.)

Exception in thread "main" java.lang.IllegalArgumentException: knn_2166a4d536d3 parameter bufferSize given invalid value -1.3153582722102333

This exception is what stops the pipeline.

From my understanding, very low average distances will always cause this error whenever beta exceeds 0.
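One possible direction for a fix, sketched under assumptions: the log message describes the fallback value as "the maximum average distance", but the value it prints is the log of that distance. Exponentiating yMax recovers the distance itself, which is always positive. The object BufferSizeFallback and the helper fallbackBufferSize are hypothetical names for illustration, not part of the library, and I have not checked whether this matches what the maintainers intended.

```scala
object BufferSizeFallback {
  // Hypothetical helper: yMax is the log of the largest average distance,
  // so math.exp(yMax) converts it back to a distance, which is always > 0.
  def fallbackBufferSize(yMax: Double): Double = math.exp(yMax)
}
```

For the value from our log, BufferSizeFallback.fallbackBufferSize(-1.3153582722102333) is a small positive number rather than a value below -1. Whether that distance is a good buffer size for accuracy is a separate question.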
