Algorithm fails on small data #41

nsutcliffe · 2019-04-15T17:29:58Z

As part of our CI/CD pipeline, we have a daily build that runs on a relatively small amount of data. Through it we discovered an interesting bug. The method estimateTau contains the following line:

val y = DenseVector(estimators.map { case (_, d) => math.log(d) })

Here, d is the average distance between points, so y holds the logarithms of those distances. On the small data used in our daily build, we find that beta can exceed 0. When this happens, yMax, which is defined as:

val yMax = breeze.linalg.max(y)

is negative (-1.3153582722102333 in our run, since y holds the logarithms of average distances below 1), and is subsequently used as the bufferSize.
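
To illustrate why yMax can be negative, here is a minimal sketch in plain Scala with no Breeze dependency. The object name LogFallbackDemo and the sample distances are made up for illustration; they are not taken from our data or from the library.

```scala
// Minimal sketch: the maximum of log-distances is negative
// whenever every average distance is below 1.
object LogFallbackDemo {
  // Hypothetical average distances, all below 1 (illustrative values only).
  val avgDistances: Seq[Double] = Seq(0.05, 0.12, 0.2)

  // Mirrors the shape of `y` in estimateTau: the log of each average distance.
  val y: Seq[Double] = avgDistances.map(d => math.log(d))

  // Mirrors `yMax`: the largest log-distance, here log(0.2), which is negative.
  val yMax: Double = y.max
}
```

With distances like these, LogFallbackDemo.yMax is negative, so using it directly as bufferSize yields an invalid value.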

Specifically, the following appears in the log:

ERROR KNN: Unable to estimate Tau with positive beta: 0.1577160047542901. This maybe because data is too small.
Setting to -1.3153582722102333 which is the maximum average distance we found in the sample.
This may leads to poor accuracy. Consider manually set bufferSize instead.
You can also try setting balanceThreshold to zero so only metric trees are built.

(This error is only logged; it does not stop the code, and execution continues.)

Exception in thread "main" java.lang.IllegalArgumentException: knn_2166a4d536d3 parameter bufferSize given invalid value -1.3153582722102333

This exception then stops the pipeline.

From my understanding, this error will always occur when beta exceeds 0 and every average distance is below 1, because the logarithm of each such distance is negative.
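
One possible guard is sketched below, under the assumption that the fallback is meant to be a distance rather than the logarithm of one: exponentiate the maximum log-distance before using it as bufferSize. The object BufferSizeFallback and the helper fallbackBufferSize are hypothetical and not part of the library, and this may not be how the maintainers would prefer to fix it.

```scala
// Hypothetical guard, not library code: turn the maximum log-distance
// back into a distance so the fallback bufferSize is always positive.
object BufferSizeFallback {
  def fallbackBufferSize(logDistances: Seq[Double]): Double = {
    require(logDistances.nonEmpty, "need at least one log-distance")
    // exp of any finite value is strictly positive, so the result
    // cannot be rejected for being negative.
    math.exp(logDistances.max)
  }
}
```

For the value from our log, BufferSizeFallback.fallbackBufferSize(Seq(-1.3153582722102333)) returns a positive number (roughly 0.27) instead of the negative value that triggers the IllegalArgumentException.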
