Documentation [WIP]
parul100495 committed Dec 11, 2020
1 parent 9d4eecf commit a9a12c5
Showing 118 changed files with 4,009 additions and 2,729 deletions.
Binary file added __pycache__/create_plots.cpython-37.pyc
Binary file added __pycache__/equation_parser.cpython-37.pyc
Binary file added __pycache__/gather_results.cpython-37.pyc
Binary file added __pycache__/get_bounds.cpython-37.pyc
Binary file added __pycache__/inequalities.cpython-37.pyc
Binary file added __pycache__/qsa.cpython-37.pyc
48 changes: 48 additions & 0 deletions create_plots.py
@@ -0,0 +1,48 @@
import matplotlib.pyplot as plt
from gather_results import gather_results
import numpy as np

csv_path = 'exp/lag_exp/csv/'
img_path = 'exp/lag_exp/images/'


def loadAndPlotResults(fileName, ylabel, output_file, is_yAxis_prob, legend_loc):
    # Each CSV holds columns: data amount (m), QSA mean, QSA stderr,
    # logistic-regression mean, logistic-regression stderr.
    file_ms, file_QSA, file_QSA_stderror, file_LS, file_LS_stderror = np.loadtxt(
        fileName, delimiter=',', unpack=True)

    fig = plt.figure()

    plt.xlim(min(file_ms), max(file_ms))
    plt.xlabel("Amount of data", fontsize=16)
    plt.xscale('log')
    plt.xticks(fontsize=12)
    plt.ylabel(ylabel, fontsize=16)

    if is_yAxis_prob:
        plt.ylim(-0.1, 1.1)
    # else:
    #     plt.ylim(-0.2, 2.2)
    #     plt.plot([1, 100000], [1.25, 1.25], ':k')
    #     plt.plot([1, 100000], [2.1, 2.1], ':k')

    plt.plot(file_ms, file_QSA, 'b-', linewidth=3, label='QSA')
    plt.errorbar(file_ms, file_QSA, yerr=file_QSA_stderror, fmt='.k')
    plt.plot(file_ms, file_LS, 'r-', linewidth=3, label='LogRes')
    plt.errorbar(file_ms, file_LS, yerr=file_LS_stderror, fmt='.k')
    plt.legend(loc=legend_loc, fontsize=12)
    plt.tight_layout()

    plt.savefig(output_file, bbox_inches='tight')
    plt.show(block=False)


if __name__ == "__main__":
    gather_results()

    loadAndPlotResults(csv_path + 'fs.csv', 'Log Loss', img_path + 'tutorial7MSE_py.png', False, 'lower right')
    loadAndPlotResults(csv_path + 'solutions_found.csv', 'Probability of Solution',
                       img_path + 'tutorial7PrSoln_py.png', True, 'best')
    loadAndPlotResults(csv_path + 'failures_g1.csv', r'Probability of $g(a(D))>0$',
                       img_path + 'tutorial7PrFail1_py.png', True, 'best')
    loadAndPlotResults(csv_path + 'upper_bound.csv', 'upper bound', img_path + 'tutorial7PrFail2_py.png', False, 'best')
    plt.show()
Binary file added docs/_images/bound-no.png
Binary file added docs/_images/bound-yes.png
Binary file added docs/_images/const.png
15 changes: 9 additions & 6 deletions docs/_sources/code.rst.txt
@@ -1,22 +1,25 @@
 Code Documentation
 ==================

-.. automodule:: config
+.. automodule:: qsa
    :members:

-.. automodule:: connect_database
+.. automodule:: logistic_regression_functions
    :members:

-.. automodule:: data_pre_processing
+.. automodule:: equation_parser
    :members:

-.. automodule:: regression_model
+.. automodule:: get_bounds
    :members:

-.. automodule:: error_analysis
+.. automodule:: inequalities
    :members:

-.. automodule:: plot_results
+.. automodule:: gather_results
    :members:

+.. automodule:: create_plots
+   :members:
36 changes: 18 additions & 18 deletions docs/_sources/index.rst.txt
@@ -1,35 +1,35 @@

-.. MSR Project documentation master file, created by
+.. FairSeldonian Project documentation master file, created by
    sphinx-quickstart on Tue Nov 17 18:50:48 2020.
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.

-Welcome to MSR Project!
+Welcome to FairSeldonian Project!
 =======================================

-The `International Conference on Mining Software Repositories (MSR) <https://2020.msrconf.org/>`_ has hosted the mining challenge every year, since 2006.
-With this challenge, they aim at everyone interested to apply their tools to a common dataset.
-The challenge is for researchers and practitioners to use their mining tools and approaches on a dare.
-The goal is to answer questions about a given dataset, that were previously unanswered.
+With the growing use of machine learning and artificial intelligence in everyday life,
+the need to mitigate unethical and unfair behaviour in machine learning models is rising
+at an alarming rate. This calls for a tool that evaluates and mitigates unfairness in such
+models, helping data scientists deal with these problems in the models they develop.

-MSR 2020 is a challenge which presents `Software Heritage Graph Dataset <https://docs.softwareheritage.org/devel/swh-dataset/index.html>`_.
-This is one of the largest known existing public archive of software source code and accompanying development history.
-More information can be found in 'Dataset' tab.

-Our goal here is to answer certain research questions (mentioned in 'Problem Statement' tab) and gain some meaninful insights from the dataset.
+Fair-Seldonian leverages the Seldonian algorithm to tackle this problem: the responsibility
+of regulating the undesirable behavior of machine learning algorithms and making them 'fair'
+is transferred from the user to the designer of the algorithm (i.e. it is handled by the ML
+researcher at creation time).
+The implementation supports fairness in machine learning under any generic constraint.
+In addition, it includes several extensions that tighten the bounds and make the algorithm
+more efficient.
+
+More details on the Seldonian framework are available `here <https://aisafety.cs.umass.edu>`_.

 .. toctree::
    :maxdepth: 2
    :caption: Contents:

-   problem
-   dataset
-   pre_req
-   queries
-   model
+   intro
+   quickstart
+   variants
    code
    results
    references
18 changes: 18 additions & 0 deletions docs/_sources/intro.rst.txt
@@ -0,0 +1,18 @@
Introduction
============

With the growing use of machine learning and artificial intelligence in everyday life,
the need to mitigate unethical and unfair behaviour in machine learning models is rising
at an alarming rate. This calls for a tool that evaluates and mitigates unfairness in such
models, helping data scientists deal with these problems in the models they develop.

Fair-Seldonian leverages the Seldonian algorithm to tackle this problem: the responsibility
of regulating the undesirable behavior of machine learning algorithms and making them 'fair'
is transferred from the user to the designer of the algorithm (i.e. it is handled by the ML
researcher at creation time).
The implementation supports fairness in machine learning under any generic constraint.
In addition, it includes several extensions that tighten the bounds and make the algorithm
more efficient.

More details on the Seldonian framework are available `here <https://aisafety.cs.umass.edu>`_.
47 changes: 47 additions & 0 deletions docs/_sources/quickstart.rst.txt
@@ -0,0 +1,47 @@
Getting Started
===============

This page is a quick-start guide for developers who want to use the
FairSeldonian Python library in their own projects, or to extend and
enhance the codebase to improve the framework.

Pre-requisites
---------------
Library pre-requisites:

* sklearn - machine learning model implementations like logistic regression.

* matplotlib - visualizations of the plots.

* numpy - handling data.

* pandas - handling data in the form of dataframes.

* ray - for parallelisation of the algorithm.

* torch - for tensors used in the codebase.
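
For example, assuming a standard pip-based environment (the package list mirrors the pre-requisites above; versions are not pinned here):

.. code-block::

   pip install scikit-learn matplotlib numpy pandas ray torch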

Quick start
-----------
The complete code resides in the `code` folder.
To run an experiment, you first need to amend the configuration in `main.py`.


Go to a terminal or any IDE and run that file using the following command:

.. code-block::

   python main.py <mode>

The default mode is `base`. To understand the other modes present in the code, refer to the `Variants` section of this documentation.

Configuration
-------------
The user must set up the following to make full use of this framework:

- **Configuration of the experiment structure**


Collaboration
-------------
Please feel free to contribute to this code base.
91 changes: 91 additions & 0 deletions docs/_sources/variants.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
Variants
========

The codebase allows you to choose the variant for tuning and experimenting with the framework.

Basic Seldonian
---------------
To begin with, we implemented the vanilla Seldonian algorithm to classify datapoints into two groups, with the difference in true positives between the groups as the fairness constraint.

To use this mode, pass the CLI parameter `base`:

.. code-block::

   python main.py base

Improvements to confidence interval
-----------------------------------
In the candidate selection process, we used a Hoeffding-inequality confidence interval of the form

.. math::
   estimate \pm 2 \sqrt{\frac{\ln(1/\delta)}{2 |D_{safety}|}}

Instead, this interval can be improved by using separate terms for (a) the error in the candidate estimate and (b) the confidence interval on the safety set:

.. math::
   estimate \pm \sqrt{\frac{\ln(1/\delta)}{2 |D_{safety}|}} + \sqrt{\frac{\ln(1/\delta)}{2 |D_{candidate}|}}

This is specifically helpful in cases where the sizes of the two data splits differ greatly.
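
A minimal sketch of the two interval widths (NumPy only; the function names here are illustrative, not the project's API):

.. code-block:: python

   import numpy as np

   def interval_base(delta, n_safety):
       # original: 2 * sqrt(ln(1/delta) / (2 * n_safety))
       return 2.0 * np.sqrt(np.log(1.0 / delta) / (2.0 * n_safety))

   def interval_mod(delta, n_safety, n_candidate):
       # improved: one term per data split
       return (np.sqrt(np.log(1.0 / delta) / (2.0 * n_safety))
               + np.sqrt(np.log(1.0 / delta) / (2.0 * n_candidate)))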

To use this mode, pass the CLI parameter `mod`:

.. code-block::

   python main.py mod

Improvement in bound propagation around constant values
-------------------------------------------------------
As constants have a fixed value, there is no need to wrap a confidence interval around them. Thus, when
one child of a binary operator is a constant, the :math:`\delta` value can go directly to the other
(variable) child and need not be split equally in half. The figure below contrasts the naive and the
improved bound propagation for a constant-valued node of the same tree; a short sketch of the rule follows it.

.. image:: images/const.png
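
A minimal sketch of this allocation rule, assuming a binary constraint tree whose nodes expose `is_leaf`, `is_constant`, `left` and `right` (illustrative attributes, not the project's actual API):

.. code-block:: python

   def assign_deltas(node, delta):
       # Constants need no confidence interval, so they consume no budget.
       if node.is_constant:
           return
       if node.is_leaf:
           node.delta = delta                   # leaf keeps the full remaining budget
           return
       if node.left.is_constant:
           assign_deltas(node.right, delta)     # all of delta to the variable child
       elif node.right.is_constant:
           assign_deltas(node.left, delta)
       else:
           assign_deltas(node.left, delta / 2)  # default: split delta equally
           assign_deltas(node.right, delta / 2)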

To use this mode, pass the CLI parameter `const`:

.. code-block::

   python main.py const

Improvement in bound propagation from union bound
-------------------------------------------------
A user may define the fairness constraint in such a way that a particular element appears multiple times
in the same tree. Instead of treating those occurrences as independent elements, we can combine them
using the union bound and then use the resulting value of :math:`\delta`. This theoretically tightens the
bound, giving better accuracy and more valid solutions.

Example: suppose A appears 3 times with :math:`\delta/2`, :math:`\delta/4` and :math:`\delta/8`. We can simply take the sum

.. math::
   \delta_{sum} = \delta/2 + \delta/4 + \delta/8 = 7\delta/8

and find a single confidence interval using :math:`\delta_{sum}`. The figures below show the naive and the improved implementation of this functionality, respectively; a short sketch of the merging step follows them.

.. image:: images/bound-no.png

.. image:: images/bound-yes.png
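
A minimal sketch of the merging step (assuming leaf objects with illustrative `name` and `delta` attributes, not the project's actual API):

.. code-block:: python

   from collections import defaultdict

   def merged_deltas(leaves):
       # Sum the budgets of repeated occurrences of the same base variable,
       # so one shared confidence interval is computed with the total delta.
       total = defaultdict(float)
       for leaf in leaves:
           total[leaf.name] += leaf.delta
       return dict(total)

   # Example from the text: A appears with delta/2, delta/4 and delta/8,
   # so its merged budget is 7*delta/8.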

Optimization with Lagrangian/KKT
--------------------------------
To use the Lagrangian/KKT technique to optimise the objective function when searching for a candidate solution, several additional modifications are made:

- Objective function: the implementation that finds the candidate solution now sets the (minimized) objective to

  .. math::
     -\hat{f} + (\mu \cdot \mathit{upperBound})

- Value of :math:`\mu`: we calculate the value of :math:`\mu` as

  .. math::
     -\nabla f(\theta^{*}) / \nabla g_{i}(\theta^{*})

  which must be positive to respect the inequality of the fairness constraint; if the computed value is negative, we hard-code it to some positive value (say, 1).

- Change prediction to a continuous function: classification is essentially a step function (0/1 for a binary classifier, as in this case). Instead of returning a label, we change the model to return the probability of that label, which makes the function easy to differentiate. Users must make this change to the predict function for their own use-case.

- 2-player approach to solve KKT: one way to solve the KKT optimization problem is a 2-player approach, where we fix a value of :math:`\mu` and optimize the function w.r.t. :math:`\theta`, then fix :math:`\theta` and optimize w.r.t. :math:`\mu`, repeating until the values converge or an iteration limit is exceeded. To speed up the optimization, we instead performed a single run with one value of :math:`\mu`, fetched from the derivative of the log loss divided by the derivative of the fairness constraint at the initial :math:`\theta`, and optimized the Lagrangian value using the Powell optimizer, as sketched below.
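
A minimal sketch of that single-run optimization (SciPy's Powell method; the function names and scalar derivatives are illustrative assumptions, not the project's API):

.. code-block:: python

   from scipy.optimize import minimize

   def solve_candidate(theta0, fhat, upper_bound, dfhat, dg):
       # One-shot Lagrangian candidate search: fhat is the primary objective,
       # upper_bound the fairness-constraint bound, dfhat/dg their scalar
       # derivatives at theta0.
       mu = -dfhat(theta0) / dg(theta0)
       if mu <= 0:          # mu must be positive to respect the inequality
           mu = 1.0         # constraint; fall back to a positive default
       objective = lambda theta: -fhat(theta) + mu * upper_bound(theta)
       return minimize(objective, theta0, method='Powell').x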

