Allow for subsampling when inputting large training data #1
It might be better to allow the user to set the learning schedule freely by allowing a trained learner to be given as an argument.
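A rough sketch of what that interface could look like (the function name fit_mil, the init_learner argument, and the two-stage usage are purely illustrative, not an existing API; bags_small denotes a pre-made subsample):

from sklearn.linear_model import LogisticRegression

def fit_mil(bags, lower_threshold, upper_threshold, init_learner=None, n_iter=10):
    # Hypothetical signature: if the caller already has a learner trained
    # elsewhere (e.g. on a subsample), it becomes the starting point, so
    # the learning schedule is controlled from outside.
    learner = init_learner if init_learner is not None else LogisticRegression(max_iter=1000)
    # ... the count-based MIL loop would go here, starting from the labels
    # predicted by `learner` instead of from a cold start ...
    return learner

# User-controlled schedule: fit on a subsample first, then refine on everything.
coarse = fit_mil(bags_small, lower_small, upper_small)
refined = fit_mil(bags, lower_threshold, upper_threshold, init_learner=coarse)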
Should the subsampling be over bags or over instances? The advantage of subsampling bags is its simplicity; because of that simplicity, it combines well with the solution of passing a trained model as an argument. However, if the number of bags is not large, there is a concern that accuracy will deteriorate. Subsampling instances is more complicated in that you need to change the values of lower_threshold and upper_threshold. For example, if you subsample by a factor of 10, you should multiply lower_threshold and upper_threshold by 0.1. If this makes them too small, the target instances may not be included in any of the subsamples, so you may need to change the subsample ratio for each bag.
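A minimal sketch of that per-bag scaling (assuming each bag is an (n_instances, n_features) array and that lower/upper are its per-class count thresholds; the helper name is only illustrative):

import numpy as np

rng = np.random.default_rng(0)

def subsample_bag(bag, lower, upper, ratio=0.1):
    # Keep roughly `ratio` of the instances and scale the count
    # thresholds by the same factor, as described above.
    n = len(bag)
    keep = rng.choice(n, size=max(1, int(n * ratio)), replace=False)
    new_lower = np.floor(lower * ratio)
    new_upper = np.ceil(upper * ratio)
    # If scaling pushes a positive lower bound to zero, the target class
    # may never appear in the subsample, which is the concern above;
    # in that case a larger ratio should be used for this bag.
    return bag[keep], new_lower, new_upper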
By subsampling instances, the lower and upper limits on the number of target instances change. The actual number of target instances in a subsampled bag is then described by some probability distribution. As a prior, we would assume the bag contains a number of target instances distributed uniformly between the specified lower and upper limits. Rather than simply scaling the limits by the subsampling ratio, it would be better to set the upper limit higher and the lower limit lower than those scaled values, to account for this uncertainty.

We can get the posterior probability that the number of target instances does not reach the lower limit, and conversely the posterior probability that it exceeds the upper limit. Let's use this.
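One way to make that concrete (a sketch only; it assumes the subsample is drawn without replacement, so the target count in the subsample is hypergeometric, and the uniform prior above is placed on the true count; scipy is assumed available and all names are illustrative):

import numpy as np
from scipy.stats import hypergeom

def threshold_violation_probs(n_total, m_sub, lower, upper, sub_lower, sub_upper):
    # Uniform prior: the true number of target instances k lies anywhere
    # in [lower, upper] with equal probability.
    ks = np.arange(lower, upper + 1)
    prior = 1.0 / len(ks)
    # For each k, the target count X in a subsample of m_sub instances
    # drawn without replacement from n_total is hypergeometric.
    p_below = sum(prior * hypergeom(n_total, k, m_sub).cdf(sub_lower - 1) for k in ks)
    p_above = sum(prior * hypergeom(n_total, k, m_sub).sf(sub_upper) for k in ks)
    return p_below, p_above

# Example: a bag of 1000 instances with 50-200 targets, subsampled to 100
# instances, with candidate scaled limits of 5 and 20.
print(threshold_violation_probs(1000, 100, 50, 200, sub_lower=5, sub_upper=20))

These probabilities can then guide how much lower and higher the scaled limits should be set.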
In addition to subsampling, the following approaches can be taken. One way to implement a distributed algorithm is as follows:
When subsampling the instances of a bag, it is very likely that the least frequent classes in that bag will disappear.
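That risk can be quantified directly (a sketch using the hypergeometric distribution; the numbers are only an example):

from scipy.stats import hypergeom

def prob_class_disappears(n, k, m):
    # Probability that a class with k instances in a bag of n instances
    # vanishes entirely when m instances are kept without replacement:
    # the chance of drawing none of the k in m draws.
    return hypergeom(n, k, m).pmf(0)

# Example: a class with 3 instances in a bag of 200, keeping 20 instances.
print(prob_class_disappears(200, 3, 20))  # about 0.73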
I came up with a fast way to find the initial labels without subsampling.
minimal code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Cluster all instances from all bags at once.
flatten_features = np.vstack(bags)
n_clusters = 15
cluster_generator = KMeans(n_clusters=n_clusters, random_state=0)
cluster_generator.fit(flatten_features)

# One-hot encode the cluster index of each instance in each bag.
# (On scikit-learn >= 1.2, use sparse_output=False instead of sparse=False.)
cluster_encoder = OneHotEncoder(sparse=False)
cluster_encoder.fit(np.arange(n_clusters).reshape((-1, 1)))
one_hot_encoded = [
    cluster_encoder.transform(
        cluster_generator.predict(b).reshape((-1, 1))) for b in bags]

# Count how many instances of each cluster appear in each bag.
bag_cluster_table = np.array([np.sum(x, axis=0) for x in one_hot_encoded])
print(bag_cluster_table[:8])
print("---")
# lower_threshold comes from the MIL problem setup (per-bag count limits).
print(lower_threshold[:8])

# lower_threshold [bag * feature] * W_lower[feature * cluster] <= one_hot_encoded [bag * cluster]
# upper_threshold [bag * feature] * W_upper[feature * cluster] >= one_hot_encoded [bag * cluster]
I have implemented this idea. It required some non-trivial improvements, but I basically followed it. The accuracy of this method may not be good, but convergence is fast, and it is a good candidate for generating the initial labels.
How about kernel approximation with an SGDClassifier-based linear SVM?
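For reference, a minimal sketch of that combination in scikit-learn (RBFSampler is one possible kernel approximation; Nystroem would also work, and the hyperparameters here are placeholders):

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Approximate an RBF kernel with random Fourier features, then train a
# linear SVM with SGD on the transformed features; both steps scale
# roughly linearly in the number of instances.
model = make_pipeline(
    RBFSampler(gamma=0.1, n_components=300, random_state=0),
    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0),
)

# Toy data standing in for the flattened instances and their current labels.
X = np.random.default_rng(0).normal(size=(5000, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
model.fit(X, y)
print(model.score(X, y))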
The following policy is a good one.
When the total number of instances is large, the computation time of most scikit-learn trainers becomes unacceptably long.
In the early iterations of multiple instance learning, it may not be necessary to include all instances in the training data.
Therefore, we will add a function that adjusts the amount of subsampled data used for training according to how far the multiple instance learning has converged.
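A rough sketch of such a schedule (the function names and the use of the label change rate as the convergence signal are only assumptions, not the final design):

import numpy as np

rng = np.random.default_rng(0)

def subsample_ratio(label_change_rate, min_ratio=0.1):
    # Use a small subsample while the instance labels are still changing a
    # lot, and grow toward the full data as the MIL iterations converge.
    # label_change_rate is the fraction of instance labels that changed in
    # the previous iteration.
    return min(1.0, max(min_ratio, 1.0 - label_change_rate))

def subsample_bags(bags, ratio):
    # Uniform instance-level subsampling per bag, keeping at least one instance.
    return [bag[rng.choice(len(bag), size=max(1, int(len(bag) * ratio)), replace=False)]
            for bag in bags]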