Merge pull request #3 from ihmeuw-msca/bugfix/plots-and-erroring

Bugfix/plots and erroring
ihmeuw-msca · Sep 17, 2024 · b55bc2b
2 parents 7d39aa7 + 5c666fe
commit b55bc2b

Showing 27 changed files with 569 additions and 398 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
@@ -4,6 +4,7 @@ on:
   push:
     tags:
       - "v[0-9]+.[0-9]+.[0-9]+"
+  workflow_dispatch:
 
 permissions:
   contents: write
@@ -20,14 +21,14 @@ jobs:
         python-version: "3.12"
     - name: Install dependencies
       run: python -m pip install build . --upgrade pip
-    - name: Build package distribution
-      run: python -m build --sdist --wheel --outdir dist/ .
-    - name: Publish package distribution to PyPI
-      uses: pypa/gh-action-pypi-publish@release/v1
-      with:
-        skip_existing: true
-        user: __token__
-        password: ${{ secrets.PYPI_API_TOKEN }}
+    # - name: Build package distribution
+    #   run: python -m build --sdist --wheel --outdir dist/ .
+    # - name: Publish package distribution to PyPI
+    #   uses: pypa/gh-action-pypi-publish@release/v1
+    #   with:
+    #     skip_existing: true
+    #     user: __token__
+    #     password: ${{ secrets.PYPI_API_TOKEN }}
 
   docs:
     runs-on: ubuntu-latest

diff --git a/.gitignore b/.gitignore
@@ -127,3 +127,8 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# Misc.
+.DS_Store
+*.csv
+*.parquet
diff --git a/docs/_static/css/custom.css b/docs/_static/css/custom.css
@@ -16,4 +16,4 @@ h6 {
   font-size: 1rem;
   font-weight: 500;
   margin: auto;
-}
+}
diff --git a/docs/_templates/sidebar/variant-selector.html b/docs/_templates/sidebar/variant-selector.html
@@ -12,4 +12,4 @@
     </ul>
   <li>
   </ul>
-</div>
+</div>
diff --git a/docs/api_reference/index.rst b/docs/api_reference/index.rst
@@ -10,4 +10,4 @@ API reference
 .. note::
    Briefly describe the organization of the API reference if any.
 
-In PyPkg, we only provide a dummy function :py:func:`.example.add` to show the bone structure of a Python pacakge.
+In ensemble, we only provide TBD features
diff --git a/docs/developer_guide/index.rst b/docs/developer_guide/index.rst
@@ -8,4 +8,4 @@ Developer guide
 
     * briefly describe how to contribute
     * contributing to the documentation
-    * contributing to the code base
+    * contributing to the code base
diff --git a/docs/getting_started/index.rst b/docs/getting_started/index.rst
@@ -7,19 +7,22 @@ Getting started
    installation
    quickstart
 
-.. note::
+Welcome to ensemble!
+--------------------
 
-    This page can be used to introduce some basic concepts to beginners.
-    It should be welcoming and not technical.
-    You can also use it address some of the pre-requisites for the package.
+ensemble allows you to fit a weighted linear combination of distributions to individual-level data,
+or create an ensemble distribution given mean, variance, the distributions you want to include, and
+their respective weights.
 
+**BEFORE YOU PROCEED**
 
-Welcome to PyPkg!
------------------
+* We define an ensemble distribution in this package to be a weighted sum of individual named
+  distributions.
+* Weights must sum to 1 in order to satisfy the property that a probability density function (PDF)
+  must integrate to 1.
+* Distributions with differing supports cannot be in the same ensemble.
+* The current implementation forces the mean and variance to be the same for every distribution,
+  and forces the ensemble mean and variance to match the data, if relevant.
 
-PyPkg package is a template package.
-It tries to show the basic bone structure of a package with minimal functionality.
-It is meant to be used as a starting point for new packages.
-
-For installing the package please check :ref:`Installing PyPkg`.
-For a simple example please check :ref:`Quickstart`.
+For installing the package please check :ref:`Installing ensemble`.
+For a simple example please check :ref:`Quickstart`.
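
Note on the hunk above: the requirement that weights sum to 1 follows from a short calculation. The sketch below uses generic symbols (:math:`w_i` for the weights, :math:`f_i` for the component PDFs, :math:`k` for the number of components); the notation is an assumption for illustration and is not taken from the package.

.. math::

   f(x) = \sum_{i=1}^{k} w_i f_i(x), \qquad
   \int f(x)\,dx = \sum_{i=1}^{k} w_i \int f_i(x)\,dx = \sum_{i=1}^{k} w_i = 1 .

Each :math:`f_i` integrates to 1 on its own, so the ensemble density integrates to 1 exactly when the weights sum to 1. Nonnegative weights additionally keep :math:`f(x) \ge 0`.
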
diff --git a/docs/getting_started/installation.rst b/docs/getting_started/installation.rst
@@ -1,26 +1,27 @@
-================
-Installing pypkg
-================
+===================
+Installing ensemble
+===================
 
 Python version
 --------------
 
-The package :code:`pypkg` is written in Python
-and requires Python 3.10 or later.
+The package :code:`ensemble` is written in Python
+and requires Python 3.12 or later.
 
-:code:`pypkg` package is distributed at
-`PyPI <https://pypi.org/project/pypkg/>`_.
+:code:`ensemble` package is distributed at
+.. `PyPI <https://pypi.org/project/ensemble/>`_.
+TBD
 To install the package:
 
 .. code::
 
-   pip install pypkg
+   pip install ensemble
 
 For developers, you can clone the repository and install the package in the
 development mode.
 
 .. code::
 
-    git clone https://github.com/ihmeuw-msca/pypkg.git
-    cd pypkg
-    pip install -e ".[test,docs]"
+    git clone https://github.com/ihmeuw-msca/ensemble.git
+    cd ensemble
+    pip install -e ".[test,docs]"
diff --git a/docs/getting_started/quickstart.rst b/docs/getting_started/quickstart.rst
@@ -7,7 +7,22 @@ Example
 
 .. code-block:: python
 
-    from pypkg.example import add
+    import scipy.stats
+    from ensemble.ensemble_model import EnsembleModel, EnsembleFitter
+    # creates an ensemble distribution composed of the normal and gumbel distributions both sharing
+    # the same mean and variance; the normal distribution can be thought of as contributing a
+    # quarter of the "height" of the density curve to the ensemble's density, and the gumbel as
+    # contributing the remaining 3 quarters
+    normal_gumbel = EnsembleModel(distributions=["normal", "gumbel"],
+                                  weights=[0.25, 0.75],
+                                  mean=4,
+                                  variance=1)
 
-    result = add(1, 2)
-    assert result == 3
+    # fits an EnsembleModel object to standard normal draws. Here, the user has specified a
+    # distribution (the gumbel) that is not reflected in the truth. The model typically correctly
+    # identifies this and will give weights close to 1 for the normal, and 0 for the gumbel
+    std_norm_draws = scipy.stats.norm.rvs(0, 1, size=100)
+    model = EnsembleFitter(["normal", "gumbel"], "L2").fit(std_norm_draws)
+
+    fitted_weights = model.weights
+    fitted_model = model.ensemble_model
diff --git a/docs/index.rst b/docs/index.rst
@@ -1,4 +1,4 @@
-PyPkg documentation
+ensemble documentation
 ===================
 
 .. toctree::
@@ -9,22 +9,10 @@ PyPkg documentation
    api_reference/index
    developer_guide/index
 
-.. note::
-
-   In this page, please use one or two paragraphs to summarize the main purpose of the package.
-   Following the summary, please provide guidence of how people should use this documentation.
-
-PyPkg is a template package to guide you setup your own Python package.
-It can be cloned and used as a starting point for your project.
-We also want to use this documentation to help users understand key concepts when building a Python package, include
-
-* Project organization
-* Style guide
-* Testing and documentation
-* Continuous integration and deployment
-
-It will bring standards, consistency and best practices into your projects and
-make collaborations easier.
+Ensemble distributions in this package are a weighted linear combination of user specified
+distributions. This package provides the functionality to both fit an ensemble distribution
+to individual-level data (microdata) and create custom ensemble distributions given a list of
+distributions and their respective weights.
 
 .. list-table::
    :header-rows: 1
@@ -33,8 +21,8 @@ make collaborations easier.
    * - :ref:`Getting started`
      - :ref:`User guide`
 
-   * - If you are new to PyPkg, this is where you should go. It contains main use cases and examples to help you get started.
-     - The user guide provides in-depth information on key concepts of package building with useful background information and explanation.
+   * - If you are new to ensemble, this is where you should go. It contains main use cases and examples to help you get started.
+     - The user guide provides in-depth information on key concepts of ensemble distributions with useful background information and explanation.
 
 .. list-table::
    :header-rows: 1

diff --git a/docs/meta.toml b/docs/meta.toml
@@ -1,3 +1,3 @@
 versions = [
     "0.1.0",
-]
+]
diff --git a/docs/user_guide/cicd.rst b/docs/user_guide/cicd.rst
diff --git a/docs/user_guide/concepts.rst b/docs/user_guide/concepts.rst
@@ -0,0 +1,48 @@
+========
+Concepts
+========
+
+Distributions
+-------------
+
+Each individual distribution in an ensemble is fit to the given mean and variance of the data. This
+process typically involves using algebra to express the parameters of the distribution in terms of
+the sample mean and variance, and then solving the resulting two-parameter system. You may look
+within the :code:`create_scipy_dist()` function to find the equations used. The single exception is
+the Fisk distribution, where the form of the PDF necessitates the use of numerical minimization.
+
+EnsembleModel
+-------------
+
+PDF, CDF, PPF
+^^^^^^^^^^^^^
+
+Methods used for creating the PDF, CDF, and PPF of the EnsembleDistribution object are relatively
+"off the shelf" so to speak, generally following the structure and methodology of scipy's
+implementation `here <https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_distn_infrastructure.py>`_.
+In summary, the PDF and CDF are just weighted linear combinations of the component distributions'
+PDFs and CDFs, while the PPF requires the use of Brent's algorithm to solve for the quantile at
+which the ensemble CDF equals the requested probability.
+
+rvs
+^^^
+
+:code:`rvs`, scipy's name for the function that generates draws, is not implemented by inverting
+the PPF as in the source code linked above. Instead, since sampling from a weighted combination of
+distributions is equivalent to first choosing a component distribution with probability equal to
+its weight (a multinomial draw) and then sampling from that component, the latter method has been
+chosen here for efficiency purposes.
+
+stats_temp
+^^^^^^^^^^
+
+A getter function for the mean and variance supplied to the EnsembleDistribution object. It does
+not supply skewness and kurtosis like scipy's :code:`stats()`.
+
+EnsembleFitter
+--------------
+
+The :code:`fit()` function fits ensemble distributions by minimizing the distance between the
+eCDF of the given microdata and the CDF of an ensemble distribution, as measured by a chosen
+penalty. Legacy code at IHME implements only the Kolmogorov-Smirnov distance, but the sum of
+squares and L1 norm distance metrics have also been implemented.
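
Note on ``concepts.rst``: the ideas described in the hunk above (moment matching, weighted PDF/CDF, a Brent-based PPF, and drawing samples by first picking a component) can be sketched with plain scipy. This is a minimal illustration under stated assumptions, not the package's implementation: every function name below (``moment_matched_components``, ``mixture_pdf``, ``mixture_cdf``, ``mixture_ppf``, ``mixture_rvs``) is hypothetical, and only the normal and Gumbel distributions are covered.

```python
import numpy as np
from scipy import optimize, stats


def moment_matched_components(mean, variance):
    """Hypothetical helper: normal and Gumbel sharing one mean and variance."""
    normal = stats.norm(loc=mean, scale=np.sqrt(variance))
    # Right-skewed Gumbel: variance = (pi * scale)**2 / 6,
    # mean = loc + euler_gamma * scale. Solve for scale, then loc.
    g_scale = np.sqrt(6.0 * variance) / np.pi
    gumbel = stats.gumbel_r(loc=mean - np.euler_gamma * g_scale, scale=g_scale)
    return [normal, gumbel]


def mixture_pdf(x, components, weights):
    # Weighted linear combination of the component PDFs.
    return sum(w * c.pdf(x) for c, w in zip(components, weights))


def mixture_cdf(x, components, weights):
    # Weighted linear combination of the component CDFs.
    return sum(w * c.cdf(x) for c, w in zip(components, weights))


def mixture_ppf(q, components, weights):
    # The mixture quantile lies between the smallest and largest
    # component quantiles, which gives Brent's method a valid bracket.
    lo = min(c.ppf(q) for c in components)
    hi = max(c.ppf(q) for c in components)
    if lo == hi:
        return lo
    return optimize.brentq(
        lambda x: mixture_cdf(x, components, weights) - q, lo, hi
    )


def mixture_rvs(size, components, weights, seed=None):
    rng = np.random.default_rng(seed)
    # Decide how many draws come from each component, then sample each.
    counts = rng.multinomial(size, weights)
    draws = np.concatenate(
        [c.rvs(size=n, random_state=rng) for c, n in zip(components, counts)]
    )
    rng.shuffle(draws)
    return draws


components = moment_matched_components(mean=4.0, variance=1.0)
weights = [0.25, 0.75]
median = mixture_ppf(0.5, components, weights)
draws = mixture_rvs(1000, components, weights, seed=0)
```

Sampling by component avoids one root-finding call per draw, which is the efficiency argument made in the ``rvs`` section.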
diff --git a/docs/user_guide/documentation.rst b/docs/user_guide/documentation.rst
diff --git a/docs/user_guide/ensemble_fitting.rst b/docs/user_guide/ensemble_fitting.rst
@@ -0,0 +1,53 @@
+================
+Ensemble Fitting
+================
+
+In order to fit an ensemble distribution to microdata, use the :code:`EnsembleFitter` object. The
+object must be initialized with two things.
+
+*A list of named distributions.* Distributions can differ in their "supports". A support, for our
+purposes, can be thought of as the x values that are compatible with a given distribution. For
+example, the Normal distribution is supported on the entire real line, so it can take negative x
+values, but the Gamma is only supported on (0, :math:`\infty`), so it cannot take negative values.
+**Recall: you are not permitted to use distributions with differing supports in the same
+ensemble.**
+
+*A penalty function of choice.* In a nutshell, we are minimizing the distances between the empirical
+cumulative distribution function (eCDF) and the CDF of the ensemble subject to said chosen penalty.
+The penalties currently implemented are as follows:
+
+* :code:`"L1"`: L1 norm
+* :code:`"sum_squares"`: sum of squares
+* :code:`"KS"`: the Kolmogorov-Smirnov distance, A.K.A. infinity norm
+
+Finally, the function of interest for this use case is the :code:`fit()` function.
+
+Example: Fitting an Ensemble
+----------------------------
+
+Suppose we have microdata for systolic blood pressure (SBP) from a certain population of young men
+in Seattle. Since SBP must be positive, let's use all the distributions (except the exponential)
+with a positive support to fit this data.
+
+.. code-block:: python
+
+    import scipy.stats as stats
+    from ensemble.model import EnsembleFitter
+
+    SBP_vals = stats.norm(loc=120, scale=7).rvs(size=100)
+    model = EnsembleFitter(
+        distributions=["gamma", "invgamma", "fisk", "lognormal"],
+        objective="L2"
+    )
+    res = model.fit(SBP_vals)
+
+:code:`res` contains an array of fitted weights as well as an :code:`EnsembleDistribution` object
+that has already been initialized with the distributions provided to :code:`model`. They can be
+accessed as follows:
+
+.. code-block:: python
+
+    # fitted weights
+    fitted_weights = res.weights
+
+    # fitted ensemble
+    fitted_ensemble = res.ensemble_distribution
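
Note on ``ensemble_fitting.rst``: the fitting idea described in the hunk above (minimize a distance between the eCDF of the data and the ensemble CDF) can be sketched with scipy alone. This is an illustration under assumptions, not the package's code: ``fit_weights`` is a hypothetical name, the objective shown is the sum of squares, the components are moment-matched to the sample mean and variance using the standard formulas for the gamma and lognormal, and the constrained solver (SLSQP) is a choice made here for brevity.

```python
import numpy as np
from scipy import optimize, stats


def fit_weights(data, components):
    """Hypothetical sketch: weights minimizing squared eCDF-vs-CDF distance."""
    x = np.sort(np.asarray(data, dtype=float))
    ecdf = np.arange(1, x.size + 1) / x.size
    # One column of CDF values per component, evaluated at the sorted data.
    cdfs = np.column_stack([c.cdf(x) for c in components])

    def objective(w):
        return np.sum((cdfs @ w - ecdf) ** 2)

    k = len(components)
    result = optimize.minimize(
        objective,
        x0=np.full(k, 1.0 / k),  # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,  # weights stay in [0, 1]
        constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
    )
    return result.x


rng = np.random.default_rng(0)
sbp = stats.norm(loc=120, scale=7).rvs(size=100, random_state=rng)
m, v = sbp.mean(), sbp.var(ddof=1)

# Moment matching: gamma has mean a * scale and variance a * scale**2;
# lognormal has sigma**2 = log(1 + v / m**2) and mu = log(m) - sigma**2 / 2.
sigma2 = np.log1p(v / m**2)
components = [
    stats.gamma(a=m**2 / v, scale=v / m),
    stats.lognorm(s=np.sqrt(sigma2), scale=np.exp(np.log(m) - sigma2 / 2)),
]
weights = fit_weights(sbp, components)
```

The equality constraint enforces that the weights sum to 1, and the bounds keep each weight between 0 and 1, so the result is a valid set of ensemble weights.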