This repository has been archived by the owner on Aug 10, 2022. It is now read-only.

Logistic regression / DT / RF: under-the-hood handling of missing values in WEKA #43

Open
panos1998 opened this issue Apr 5, 2022 · 7 comments
Labels
question Further information is requested

Comments

@panos1998

Hello. I am trying to implement some algorithms from a paper that used the Weka software, but mine must be implemented in Python. Python does not handle missing values out of the box, unlike Weka. I am asking what logistic regression, decision trees, and random forests do under the hood so that they run without throwing an error about missing values.

@fracpete fracpete added the question Further information is requested label Apr 5, 2022
@fracpete
Member

fracpete commented Apr 5, 2022

There are numerous ways of dealing with missing values:

  • weka.classifiers.functions.Logistic: uses the ReplaceMissingValues filter (the easiest approach, though not necessarily the best)
  • weka.classifiers.trees.RandomForest: uses weka.classifiers.trees.RandomTree inside weka.classifiers.meta.Bagging; RandomTree ignores missing values while building the tree (from what I can tell) and, at prediction time, takes the probabilities of its children into account if there is a missing value for a specific attribute
  • "decision tree" describes a wide range of algorithms with different approaches
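For reference, Weka's ReplaceMissingValues filter fills missing numeric values with the column mean and missing nominal values with the column mode. A minimal Python sketch of the same idea (toy data and variable names are mine, not Weka's):

```python
import numpy as np
from collections import Counter

# numeric attribute: replace NaN with the column mean
age = np.array([25.0, np.nan, 35.0])
age = np.where(np.isnan(age), np.nanmean(age), age)

# nominal attribute: replace None with the most frequent value
color = ["red", "red", None]
mode = Counter(c for c in color if c is not None).most_common(1)[0][0]
color = [mode if c is None else c for c in color]
```

This is the "easiest approach" fracpete mentions: simple, but it can distort the per-class distributions, since the fill values are computed over the whole column.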

@panos1998
Author

Thanks for your nice answer. By decision tree I mean C4.5 and CART. Also, what is going on with Naive Bayes? I found this in the Weka source code:

```java
if ((m_Instances.numInstances() > 0)
    && !m_Instances.instance(0).isMissing(attribute)) {
  double lastVal = m_Instances.instance(0).value(attribute);
  double currentVal, deltaSum = 0;
  int distinct = 0;
  for (int i = 1; i < m_Instances.numInstances(); i++) {
    Instance currentInst = m_Instances.instance(i);
    if (currentInst.isMissing(attribute)) {
      break;
    }
```

Does this mean removing records with missing values, or the corresponding column? I tried both methods, and the second approach gave results closer to the original implementation in the paper.

@fracpete
Member

fracpete commented Apr 5, 2022

  • J48 (improved version of C4.5): it's a rather complicated algorithm, as the tree building depends on whether binary trees are used and whether reduced error pruning is applied
  • CART: not sure what's going on there
  • NaiveBayes: that bit of code just determines the numeric precision to use, otherwise it uses a default precision
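A rough sketch of what "determines the numeric precision" means here (not the exact Weka code; the function name and default value are hypothetical): treat an attribute's precision as the mean gap between adjacent distinct sorted values, falling back to a default when there is too little data.

```python
def estimate_precision(values, default=0.01):
    # distinct, sorted, non-missing values of the attribute
    vals = sorted({v for v in values if v is not None})
    if len(vals) < 2:
        return default  # too little data: use a default precision
    # mean delta between adjacent distinct values
    deltas = [b - a for a, b in zip(vals, vals[1:])]
    return sum(deltas) / len(deltas)
```

So the quoted Java snippet is about estimating how finely a numeric attribute is measured, not about dropping rows or columns.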

If you are using Python, why are you not using sklearn? The algorithms in that framework should produce similar results to Weka. It also already has ways of imputing missing values.
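If sklearn is an option, one way to mimic Weka's Logistic (which applies ReplaceMissingValues internally) is to put a mean imputer in front of the classifier in a pipeline. A sketch on toy data (the data itself is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 8.0], [6.0, np.nan]])
y = np.array([0, 0, 1, 1])

# impute NaNs with column means, then fit logistic regression
clf = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
clf.fit(X, y)

# missing values at prediction time are filled with the training means
pred = clf.predict([[1.5, 2.5], [7.0, np.nan]])
```

The pipeline guarantees the same imputation statistics (training-set means) are reused at prediction time, which matches how a filter learned on the training data behaves.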

Finally, these repos are just downstream mirrors of the main SVN repo (and might disappear again in the future). Please use the mailing list for questions regarding Weka.

@eibe

eibe commented Apr 5, 2022 via email

@panos1998
Author

> • J48 (improved version of C4.5): it's a rather complicated algorithm, as the tree building depends on whether binary trees are used and whether reduced error pruning is applied
> • CART: not sure what's going on there
> • NaiveBayes: that bit of code just determines the numeric precision to use, otherwise it uses a default precision
>
> If you are using Python, why are you not using sklearn? The algorithms in that framework should produce similar results to Weka. It also already has ways of imputing missing values.
>
> Finally, these repos are just downstream mirrors of the main SVN repo (and might disappear again in the future). Please use the mailing list for questions regarding Weka.

Because the purpose of my project is a diploma thesis, I need documented research on what Weka does and how the same job can be done with sklearn. sklearn classification models do not directly support imputation or missing-value handling in general; I must apply some imputation, or drop the empty rows/columns, separately from the classification algorithm. But in my case, I first need to know what Weka does about missing values and then whether that procedure can be "translated" into Python with sklearn. That is why I am asking about Weka's hidden missing-value handling: after reading the Weka source code, only for logistic regression is it clear that mean/mode replacement is used, and this is well documented. For Naive Bayes, random forest, and decision trees I did not get a clear picture of how missing values are treated; it is more complicated.

The idea is, having the algorithms and the corresponding results from the scientific paper, to reproduce them with sklearn. Another way is to take the initiative and try some imputations, such as nearest neighbour: vary the number of neighbours and choose the parameter value that gives the closest result. If you asked me what I would do after your explanations, for the forest/trees I would probably experiment with nearest neighbour and mean/mode imputation.
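For the nearest-neighbour idea, sklearn's KNNImputer lets you vary n_neighbors and compare the downstream results. A toy sketch (the data is made up to show how k changes the filled value):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],   # value to impute
              [2.0, 2.5],
              [8.0, 9.0]])

# k=1: the gap is filled from the single closest row (by the
# features both rows have); k=2: mean of the two closest rows
X1 = KNNImputer(n_neighbors=1).fit_transform(X)
X2 = KNNImputer(n_neighbors=2).fit_transform(X)
```

One can then run the classifier once per k and keep the k whose results are closest to the paper's, as described above.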

@panos1998
Author

panos1998 commented Apr 5, 2022

> Naive Bayes just skips missing values when estimating and calculating probabilities.

For Naive Bayes, I tried dropping rows with missing values; the dataset shrank to 50% of its size and the results were very different from the paper's. I then tried dropping columns with missing values instead, keeping the dataset length constant, and the results were not perfect but feasible. So I am confused whether Naive Bayes really drops the rows with missing values, or whether "skips missing values" means something I do not understand at all. One last thing: I read in one paper about the "fractional" procedure, but I don't understand what it is and whether it is possible in Python.
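On "skips missing values": as I understand it, Weka's NaiveBayes ignores a missing value only for the attribute it occurs in, per instance; it does not drop whole rows or columns. A hedged Gaussian sketch of that per-feature skipping (function names and the toy data are mine, not Weka's):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    # per class: per-feature mean/std over the *non-missing* entries only
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = np.nanmean(Xc, axis=0)
        sd = np.nanstd(Xc, axis=0) + 1e-6  # avoid zero variance
        params[c] = (mu, sd, len(Xc) / len(X))
    return params

def predict_one(params, x):
    present = ~np.isnan(x)  # missing features are skipped, not imputed
    scores = {}
    for c, (mu, sd, prior) in params.items():
        ll = -0.5 * np.sum(((x[present] - mu[present]) / sd[present]) ** 2
                           + np.log(2 * np.pi * sd[present] ** 2))
        scores[c] = np.log(prior) + ll
    return max(scores, key=scores.get)
```

Since each class likelihood is a product over features, leaving a feature out of the product for one instance is well defined; that is why neither row deletion nor column deletion reproduces it exactly.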

@eibe

eibe commented Apr 6, 2022 via email
