Documentation [WIP]
parul100495 committed Dec 11, 2020
1 parent 9d4eecf commit a9a12c5
Showing 118 changed files with 4,009 additions and 2,729 deletions.
Binary file added __pycache__/create_plots.cpython-37.pyc
Binary file added __pycache__/equation_parser.cpython-37.pyc
Binary file added __pycache__/gather_results.cpython-37.pyc
Binary file added __pycache__/get_bounds.cpython-37.pyc
Binary file added __pycache__/inequalities.cpython-37.pyc
Binary file added __pycache__/qsa.cpython-37.pyc
48 changes: 48 additions & 0 deletions create_plots.py
@@ -0,0 +1,48 @@
import matplotlib.pyplot as plt
from gather_results import gather_results
import numpy as np

csv_path = 'exp/lag_exp/csv/'
img_path = 'exp/lag_exp/images/'


def loadAndPlotResults(fileName, ylabel, output_file, is_yAxis_prob, legend_loc):
    # Each CSV holds columns: data amount (m), QSA mean, QSA stderr,
    # logistic-regression mean, logistic-regression stderr.
    file_ms, file_QSA, file_QSA_stderror, file_LS, file_LS_stderror = np.loadtxt(
        fileName, delimiter=',', unpack=True)

    fig = plt.figure()

    plt.xlim(min(file_ms), max(file_ms))
    plt.xlabel("Amount of data", fontsize=16)
    plt.xscale('log')
    plt.xticks(fontsize=12)
    plt.ylabel(ylabel, fontsize=16)

    if is_yAxis_prob:
        plt.ylim(-0.1, 1.1)
    # else:
    #     plt.ylim(-0.2, 2.2)
    #     plt.plot([1, 100000], [1.25, 1.25], ':k')
    #     plt.plot([1, 100000], [2.1, 2.1], ':k')

    plt.plot(file_ms, file_QSA, 'b-', linewidth=3, label='QSA')
    plt.errorbar(file_ms, file_QSA, yerr=file_QSA_stderror, fmt='.k')
    plt.plot(file_ms, file_LS, 'r-', linewidth=3, label='LogRes')
    plt.errorbar(file_ms, file_LS, yerr=file_LS_stderror, fmt='.k')
    plt.legend(loc=legend_loc, fontsize=12)
    plt.tight_layout()

    plt.savefig(output_file, bbox_inches='tight')
    plt.show(block=False)


if __name__ == "__main__":
    gather_results()

    loadAndPlotResults(csv_path + 'fs.csv', 'Log Loss', img_path + 'tutorial7MSE_py.png', False, 'lower right')
    loadAndPlotResults(csv_path + 'solutions_found.csv', 'Probability of Solution',
                       img_path + 'tutorial7PrSoln_py.png', True, 'best')
    loadAndPlotResults(csv_path + 'failures_g1.csv', r'Probability of $g(a(D))>0$',
                       img_path + 'tutorial7PrFail1_py.png', True, 'best')
    loadAndPlotResults(csv_path + 'upper_bound.csv', 'upper bound', img_path + 'tutorial7PrFail2_py.png', False, 'best')
    plt.show()
Binary file added docs/_images/bound-no.png
Binary file added docs/_images/bound-yes.png
Binary file added docs/_images/const.png
15 changes: 9 additions & 6 deletions docs/_sources/code.rst.txt
@@ -1,22 +1,25 @@
 Code Documentation
 ==================

-.. automodule:: config
+.. automodule:: qsa
    :members:

-.. automodule:: connect_database
+.. automodule:: logistic_regression_functions
    :members:

-.. automodule:: data_pre_processing
+.. automodule:: equation_parser
    :members:

-.. automodule:: regression_model
+.. automodule:: get_bounds
    :members:

-.. automodule:: error_analysis
+.. automodule:: inequalities
    :members:

-.. automodule:: plot_results
+.. automodule:: gather_results
    :members:

+.. automodule:: create_plots
+   :members:
36 changes: 18 additions & 18 deletions docs/_sources/index.rst.txt
@@ -1,35 +1,35 @@

-.. MSR Project documentation master file, created by
+.. FairSeldonian Project documentation master file, created by
    sphinx-quickstart on Tue Nov 17 18:50:48 2020.
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.

-Welcome to MSR Project!
+Welcome to FairSeldonian Project!
 =======================================

-The `International Conference on Mining Software Repositories (MSR) <https://2020.msrconf.org/>`_ has hosted the mining challenge every year, since 2006.
-With this challenge, they aim at everyone interested to apply their tools to a common dataset.
-The challenge is for researchers and practitioners to use their mining tools and approaches on a dare.
-The goal is to answer questions about a given dataset, that were previously unanswered.
+With the growing use of machine learning and artificial intelligence in everyday life,
+the need to mitigate unethical and unfair behaviour in machine learning models is rising
+at an alarming rate. This calls for a tool that evaluates and mitigates unfairness in such
+models, helping data scientists deal with these problems in the models they develop.

-MSR 2020 is a challenge which presents `Software Heritage Graph Dataset <https://docs.softwareheritage.org/devel/swh-dataset/index.html>`_.
-This is one of the largest known existing public archive of software source code and accompanying development history.
-More information can be found in 'Dataset' tab.

-Our goal here is to answer certain research questions (mentioned in 'Problem Statement' tab) and gain some meaninful insights from the dataset.
+Fair-Seldonian leverages the Seldonian algorithm to tackle this problem: the responsibility
+of regulating the undesirable behavior of machine learning algorithms and making them 'fair'
+is transferred from the user to the designer of the algorithm (i.e. it is handled by the ML
+researcher at creation time).
+The implementation supports fairness in machine learning under any generic constraint.
+In addition, it includes several extensions that tighten the bounds and make the algorithm
+more efficient.
+
+More details on the Seldonian framework are available `here <https://aisafety.cs.umass.edu>`_.

 .. toctree::
    :maxdepth: 2
    :caption: Contents:

-   problem
-   dataset
-   pre_req
-   queries
-   model
+   intro
+   quickstart
+   variants
    code
    results
    references
18 changes: 18 additions & 0 deletions docs/_sources/intro.rst.txt
@@ -0,0 +1,18 @@
Introduction
============

With the growing use of machine learning and artificial intelligence in everyday life,
the need to mitigate unethical and unfair behaviour in machine learning models is rising
at an alarming rate. This calls for a tool that evaluates and mitigates unfairness in such
models, helping data scientists deal with these problems in the models they develop.

Fair-Seldonian leverages the Seldonian algorithm to tackle this problem: the responsibility
of regulating the undesirable behavior of machine learning algorithms and making them 'fair'
is transferred from the user to the designer of the algorithm (i.e. it is handled by the ML
researcher at creation time).
The implementation supports fairness in machine learning under any generic constraint.
In addition, it includes several extensions that tighten the bounds and make the algorithm
more efficient.

More details on the Seldonian framework are available `here <https://aisafety.cs.umass.edu>`_.
47 changes: 47 additions & 0 deletions docs/_sources/quickstart.rst.txt
@@ -0,0 +1,47 @@
Getting Started
===============

This page is a quick-start guide for developers who want to use the
FairSeldonian Python library in their own projects, or to extend and
enhance the codebase to improve the framework.

Pre-requisites
---------------
Library pre-requisites:

* sklearn - machine learning model implementations like logistic regression.

* matplotlib - visualizations of the plots.

* numpy - handling data.

* pandas - handling data in the form of dataframes.

* ray - for parallelisation of the algorithm.

* torch - for tensors used in the codebase.
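
For example, assuming a standard pip-based environment (the package list mirrors the pre-requisites above; versions are not pinned here):

.. code-block::

   pip install scikit-learn matplotlib numpy pandas ray torch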

Quick start
-----------
The complete code resides in the `code` folder.
To run an experiment, you first need to amend the configuration in `main.py`.


Go to a terminal or any IDE and run that file using the following command:

.. code-block::

   python main.py <mode>

The default mode is `base`. To understand the other modes present in the code, refer to the `Variants` section of this documentation.

Configuration
-------------
The user must set up the following to make full use of this framework:

- **Configuration of the experiment structure**


Collaboration
-------------
Please feel free to contribute to this code base.
91 changes: 91 additions & 0 deletions docs/_sources/variants.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
Variants
========

The codebase allows you to choose the variant for tuning and experimenting with the framework.

Basic Seldonian
---------------
To begin with, we implemented the vanilla Seldonian algorithm to classify datapoints into two groups, with the difference in true positives between the groups as the fairness constraint.

To use this mode, pass the CLI parameter `base`:

.. code-block::

   python main.py base

Improvements to confidence interval
-----------------------------------
In the candidate selection process, we used a Hoeffding-inequality confidence interval of the form

.. math::
   estimate \pm 2 \sqrt{\frac{\ln(1/\delta)}{2 |D_{safety}|}}

Instead, this interval can be improved by using separate terms for (a) the error in the candidate estimate and (b) the confidence interval on the safety set:

.. math::
   estimate \pm \sqrt{\frac{\ln(1/\delta)}{2 |D_{safety}|}} + \sqrt{\frac{\ln(1/\delta)}{2 |D_{candidate}|}}

This is specifically helpful in cases where the sizes of the two data splits differ greatly.
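
A minimal sketch of the two interval widths (NumPy only; the function names here are illustrative, not the project's API):

.. code-block:: python

   import numpy as np

   def interval_base(delta, n_safety):
       # original: 2 * sqrt(ln(1/delta) / (2 * n_safety))
       return 2.0 * np.sqrt(np.log(1.0 / delta) / (2.0 * n_safety))

   def interval_mod(delta, n_safety, n_candidate):
       # improved: one term per data split
       return (np.sqrt(np.log(1.0 / delta) / (2.0 * n_safety))
               + np.sqrt(np.log(1.0 / delta) / (2.0 * n_candidate)))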

To use this mode, pass the CLI parameter `mod`:

.. code-block::

   python main.py mod

Improvement in bound propagation around constant values
-------------------------------------------------------
As constants have a fixed value, there is no need to wrap a confidence interval around them. Thus, when
one child of a binary operator is a constant, the :math:`\delta` value can go directly to the other
(variable) child and need not be split equally in half. The figure below contrasts the naive and the
improved bound propagation for a constant-valued node of the same tree; a short sketch of the rule follows it.

.. image:: images/const.png
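
A minimal sketch of this allocation rule, assuming a binary constraint tree whose nodes expose `is_leaf`, `is_constant`, `left` and `right` (illustrative attributes, not the project's actual API):

.. code-block:: python

   def assign_deltas(node, delta):
       # Constants need no confidence interval, so they consume no budget.
       if node.is_constant:
           return
       if node.is_leaf:
           node.delta = delta                   # leaf keeps the full remaining budget
           return
       if node.left.is_constant:
           assign_deltas(node.right, delta)     # all of delta to the variable child
       elif node.right.is_constant:
           assign_deltas(node.left, delta)
       else:
           assign_deltas(node.left, delta / 2)  # default: split delta equally
           assign_deltas(node.right, delta / 2)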

To use this mode, pass the CLI parameter `const`:

.. code-block::

   python main.py const

Improvement in bound propagation from union bound
-------------------------------------------------
A user may define the fairness constraint in such a way that a particular element appears multiple times
in the same tree. Instead of treating those occurrences as independent elements, we can combine them
using the union bound and then use the resulting value of :math:`\delta`. This theoretically tightens the
bound, giving better accuracy and more valid solutions.

Example: suppose A appears 3 times with :math:`\delta/2`, :math:`\delta/4` and :math:`\delta/8`. We can simply take the sum

.. math::
   \delta_{sum} = \delta/2 + \delta/4 + \delta/8 = 7\delta/8

and find a single confidence interval using :math:`\delta_{sum}`. The figures below show the naive and the improved implementation of this functionality, respectively; a short sketch of the merging step follows them.

.. image:: images/bound-no.png

.. image:: images/bound-yes.png
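
A minimal sketch of the merging step (assuming leaf objects with illustrative `name` and `delta` attributes, not the project's actual API):

.. code-block:: python

   from collections import defaultdict

   def merged_deltas(leaves):
       # Sum the budgets of repeated occurrences of the same base variable,
       # so one shared confidence interval is computed with the total delta.
       total = defaultdict(float)
       for leaf in leaves:
           total[leaf.name] += leaf.delta
       return dict(total)

   # Example from the text: A appears with delta/2, delta/4 and delta/8,
   # so its merged budget is 7*delta/8.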

Optimization with Lagrangian/KKT
--------------------------------
To use the Lagrangian/KKT technique to optimise the objective function when searching for a candidate solution, several additional modifications are made:

- Objective function: the implementation that finds the candidate solution now sets the (minimized) objective to

  .. math::
     -\hat{f} + (\mu \cdot \mathit{upperBound})

- Value of :math:`\mu`: we calculate the value of :math:`\mu` as

  .. math::
     -\nabla f(\theta^{*}) / \nabla g_{i}(\theta^{*})

  which must be positive to respect the inequality of the fairness constraint; if the computed value is negative, we hard-code it to some positive value (say, 1).

- Change prediction to a continuous function: classification is essentially a step function (0/1 for a binary classifier, as in this case). Instead of returning a label, we change the model to return the probability of that label, which makes the function easy to differentiate. Users must make this change to the predict function for their own use-case.

- 2-player approach to solve KKT: one way to solve the KKT optimization problem is a 2-player approach, where we fix a value of :math:`\mu` and optimize the function w.r.t. :math:`\theta`, then fix :math:`\theta` and optimize w.r.t. :math:`\mu`, repeating until the values converge or an iteration limit is exceeded. To speed up the optimization, we instead performed a single run with one value of :math:`\mu`, fetched from the derivative of the log loss divided by the derivative of the fairness constraint at the initial :math:`\theta`, and optimized the Lagrangian value using the Powell optimizer, as sketched below.
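
A minimal sketch of that single-run optimization (SciPy's Powell method; the function names and scalar derivatives are illustrative assumptions, not the project's API):

.. code-block:: python

   from scipy.optimize import minimize

   def solve_candidate(theta0, fhat, upper_bound, dfhat, dg):
       # One-shot Lagrangian candidate search: fhat is the primary objective,
       # upper_bound the fairness-constraint bound, dfhat/dg their scalar
       # derivatives at theta0.
       mu = -dfhat(theta0) / dg(theta0)
       if mu <= 0:          # mu must be positive to respect the inequality
           mu = 1.0         # constraint; fall back to a positive default
       objective = lambda theta: -fhat(theta) + mu * upper_bound(theta)
       return minimize(objective, theta0, method='Powell').x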

