Merge pull request #3 from ihmeuw-msca/bugfix/plots-and-erroring
Bugfix/plots and erroring
mbi6245 authored Sep 17, 2024
2 parents 7d39aa7 + 5c666fe commit b55bc2b
Showing 27 changed files with 569 additions and 398 deletions.
Binary file added .DS_Store
Binary file not shown.
17 changes: 9 additions & 8 deletions .github/workflows/deploy.yml
Expand Up @@ -4,6 +4,7 @@ on:
push:
tags:
- "v[0-9]+.[0-9]+.[0-9]+"
workflow_dispatch:

permissions:
contents: write
Expand All @@ -20,14 +21,14 @@ jobs:
python-version: "3.12"
- name: Install dependencies
run: python -m pip install build . --upgrade pip
- name: Build package distribution
run: python -m build --sdist --wheel --outdir dist/ .
- name: Publish package distribution to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
with:
skip_existing: true
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}
# - name: Build package distribution
# run: python -m build --sdist --wheel --outdir dist/ .
# - name: Publish package distribution to PyPI
# uses: pypa/gh-action-pypi-publish@release/v1
# with:
# skip_existing: true
# user: __token__
# password: ${{ secrets.PYPI_API_TOKEN }}

docs:
runs-on: ubuntu-latest
5 changes: 5 additions & 0 deletions .gitignore
Expand Up @@ -127,3 +127,8 @@ dmypy.json

# Pyre type checker
.pyre/

# Misc.
.DS_Store
*.csv
*.parquet
2 changes: 1 addition & 1 deletion docs/_static/css/custom.css
Expand Up @@ -16,4 +16,4 @@ h6 {
font-size: 1rem;
font-weight: 500;
margin: auto;
}
}
2 changes: 1 addition & 1 deletion docs/_templates/sidebar/variant-selector.html
Expand Up @@ -12,4 +12,4 @@
</ul>
<li>
</ul>
</div>
</div>
2 changes: 1 addition & 1 deletion docs/api_reference/index.rst
Expand Up @@ -10,4 +10,4 @@ API reference
.. note::
Briefly describe the organization of the API reference if any.

In PyPkg, we only provide a dummy function :py:func:`.example.add` to show the bone structure of a Python pacakge.
In ensemble, we only provide TBD features
2 changes: 1 addition & 1 deletion docs/developer_guide/index.rst
Expand Up @@ -8,4 +8,4 @@ Developer guide

* briefly describe how to contribute
* contributing to the documentation
* contributing to the code base
* contributing to the code base
27 changes: 15 additions & 12 deletions docs/getting_started/index.rst
Expand Up @@ -7,19 +7,22 @@ Getting started
installation
quickstart

.. note::
Welcome to ensemble!
--------------------

This page can be used to introduce some basic concepts to beginners.
It should be welcoming and not technical.
You can also use it address some of the pre-requisites for the package.
ensemble allows you to fit a weighted linear combination of distributions to individual-level data,
or create an ensemble distribution given mean, variance, the distributions you want to include, and
their respective weights.

**BEFORE YOU PROCEED**

Welcome to PyPkg!
-----------------
* We define an ensemble distribution in this package to be a weighted sum of individual named
distributions.
* Weights must sum to 1 in order to satisfy the property that a probability density function (PDF)
must integrate to 1.
* Distributions with differing supports cannot be in the same ensemble.
* Current implementation forces mean and variance to be equivalent over each distribution, and
forces ensemble mean and variance to match data, if relevant.

PyPkg package is a template package.
It tries to show the basic bone structure of a package with minimal functionality.
It is meant to be used as a starting point for new packages.

For installing the package please check :ref:`Installing PyPkg`.
For a simple example please check :ref:`Quickstart`.
For installing the package please check :ref:`Installing distrx`.
For a simple example please check :ref:`Quickstart`.
23 changes: 12 additions & 11 deletions docs/getting_started/installation.rst
@@ -1,26 +1,27 @@
================
Installing pypkg
================
===================
Installing ensemble
===================

Python version
--------------

The package :code:`pypkg` is written in Python
and requires Python 3.10 or later.
The package :code:`ensemble` is written in Python
and requires Python 3.12 or later.

:code:`pypkg` package is distributed at
`PyPI <https://pypi.org/project/pypkg/>`_.
:code:`ensemble` package is distributed at
.. `PyPI <https://pypi.org/project/ensemble/>`_.
TBD
To install the package:

.. code::
pip install pypkg
pip install ensemble
For developers, you can clone the repository and install the package in the
development mode.

.. code::
git clone https://github.com/ihmeuw-msca/pypkg.git
cd pypkg
pip install -e ".[test,docs]"
git clone https://github.com/ihmeuw-msca/ensemble.git
cd ensemble
pip install -e ".[test,docs]"
21 changes: 18 additions & 3 deletions docs/getting_started/quickstart.rst
Expand Up @@ -7,7 +7,22 @@ Example

.. code-block:: python
from pypkg.example import add
import scipy.stats
from ensemble.ensemble_model import EnsembleModel, EnsembleFitter
# creates an ensemble distribution composed of the normal and gumbel distributions both sharing
# the same mean and variance; the normal distribution can be thought of as contributing a
# quarter of the "height" of the density curve to the ensemble's density, and the gumbel as
# contributing the remaining 3 quarters
normal_gumbel = EnsembleModel(distributions=["normal", "gumbel"],
weights=[0.25, 0.75],
mean=4,
variance=1)
result = add(1, 2)
assert result == 3
# fits an EnsembleModel object to standard normal draws. Here, the user has specified a
# distribution (the gumbel) that is not reflected in the truth. The model typically correctly
# identifies this and will give weights close to 1 for the normal, and 0 for the gumbel
std_norm_draws = scipy.stats.norm.rvs(0, 1, size=100)
model = EnsembleFitter(["normal", "gumbel"], "L2").fit(std_norm_draws)
fitted_weights = model.weights
fitted_model = model.ensemble_model
26 changes: 7 additions & 19 deletions docs/index.rst
@@ -1,4 +1,4 @@
PyPkg documentation
ensemble documentation
===================

.. toctree::
Expand All @@ -9,22 +9,10 @@ PyPkg documentation
api_reference/index
developer_guide/index

.. note::

In this page, please use one or two paragraphs to summarize the main purpose of the package.
Following the summary, please provide guidence of how people should use this documentation.

PyPkg is a template package to guide you setup your own Python package.
It can be cloned and used as a starting point for your project.
We also want to use this documentation to help users understand key concepts when building a Python package, include

* Project organization
* Style guide
* Testing and documentation
* Continuous integration and deployment

It will bring standards, consistency and best practices into your projects and
make collaborations easier.
Ensemble distributions in this package are a weighted linear combination of user specified
distributions. This package provides the functionality to both fit an ensemble distribution
to individual-level data (microdata) and create custom ensemble distributions given a list of
distributions and their respective weights.

.. list-table::
:header-rows: 1
Expand All @@ -33,8 +21,8 @@ make collaborations easier.
* - :ref:`Getting started`
- :ref:`User guide`

* - If you are new to PyPkg, this is where you should go. It contains main use cases and examples to help you get started.
- The user guide provides in-depth information on key concepts of package building with useful background information and explanation.
* - If you are new to ensemble, this is where you should go. It contains main use cases and examples to help you get started.
- The user guide provides in-depth information on key concepts of ensemble distributions with useful background information and explanation.

.. list-table::
:header-rows: 1
2 changes: 1 addition & 1 deletion docs/meta.toml
@@ -1,3 +1,3 @@
versions = [
"0.1.0",
]
]
9 changes: 0 additions & 9 deletions docs/user_guide/cicd.rst

This file was deleted.

48 changes: 48 additions & 0 deletions docs/user_guide/concepts.rst
@@ -0,0 +1,48 @@
========
Concepts
========

Distributions
-------------

Each individual distribution in an ensemble is fit to the given mean and variance of the data. This
typically involves writing the mean and variance in terms of the distribution's two parameters,
treating the sample mean and variance as given, and solving the resulting two-equation system for
those parameters. You may look within the :code:`create_scipy_dist()` function to find the equations
used. The single exception is the Fisk distribution, where the form of the PDF necessitates the use
of numerical minimization.
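
As an illustration of the moment-matching idea described above, here is a minimal sketch for two
distributions. The helper names :code:`gamma_from_moments` and :code:`normal_from_moments` are
hypothetical and are not taken from :code:`create_scipy_dist()`; the formulas are the standard
closed-form relations between mean, variance, and the scipy parameters.

```python
import math

import scipy.stats


def gamma_from_moments(mean, variance):
    """Hypothetical helper: gamma with the requested mean and variance.

    For a gamma with shape k and scale theta, mean = k * theta and
    variance = k * theta**2, so k = mean**2 / variance and theta = variance / mean.
    """
    shape = mean**2 / variance
    scale = variance / mean
    return scipy.stats.gamma(a=shape, scale=scale)


def normal_from_moments(mean, variance):
    """Hypothetical helper: the normal is parameterized by mean and standard deviation."""
    return scipy.stats.norm(loc=mean, scale=math.sqrt(variance))


gamma_dist = gamma_from_moments(mean=4.0, variance=1.0)
normal_dist = normal_from_moments(mean=4.0, variance=1.0)
```

Both frozen distributions then report the same mean and variance through scipy's :code:`mean()`
and :code:`var()` methods, which is the property the ensemble relies on.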

EnsembleModel
-------------

PDF, CDF, PPF
^^^^^^^^^^^^^

Methods used for creating the PDF, CDF, and PPF of the EnsembleDistribution object are relatively
"off the shelf" so to speak, generally following the structure and methodology of scipy's
implementation `here <https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_distn_infrastructure.py>`_.
In summary, the PDF and CDF are weighted linear combinations of the component distributions' PDFs
and CDFs, while the PPF requires the use of Brent's algorithm to solve for the quantile at which the
ensemble CDF equals the requested probability.
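
The following is a minimal sketch of that idea, not the package's own code. It builds a
normal/Gumbel ensemble with a shared mean and variance directly from scipy frozen distributions;
the function names and the fixed root-finding bracket of (-50, 50) are assumptions chosen for this
example.

```python
import numpy as np
import scipy.stats
from scipy.optimize import brentq

mean, sd = 4.0, 1.0
# Gumbel parameters matching the same mean and standard deviation.
gumbel_scale = sd * np.sqrt(6.0) / np.pi
components = [
    scipy.stats.norm(loc=mean, scale=sd),
    scipy.stats.gumbel_r(loc=mean - np.euler_gamma * gumbel_scale, scale=gumbel_scale),
]
weights = np.array([0.25, 0.75])


def ensemble_pdf(x):
    # Weighted sum of the component densities.
    return sum(w * d.pdf(x) for w, d in zip(weights, components))


def ensemble_cdf(x):
    # Weighted sum of the component CDFs.
    return sum(w * d.cdf(x) for w, d in zip(weights, components))


def ensemble_ppf(q, lower=-50.0, upper=50.0):
    # Brent's method finds x such that ensemble_cdf(x) == q inside the bracket.
    return brentq(lambda x: ensemble_cdf(x) - q, lower, upper)


median = ensemble_ppf(0.5)
```

A fixed bracket only works when the requested quantile lies inside it; a more general
implementation would need to widen the bracket or derive it from the component distributions.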

rvs
^^^

:code:`rvs`, scipy's name for the function that generates draws, is not implemented by solving for
the PPF as in the source code linked above. Instead, since drawing from a weighted combination of
distributions is equivalent to first choosing a component distribution with probability equal to
its weight (a multinomial draw) and then sampling from that component, the latter method has been
chosen here for efficiency purposes.
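
A minimal sketch of this sampling scheme, assuming scipy frozen distributions as components. The
variable names are illustrative and this is not the package's implementation.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

mean, sd = 4.0, 1.0
gumbel_scale = sd * np.sqrt(6.0) / np.pi
components = [
    scipy.stats.norm(loc=mean, scale=sd),
    scipy.stats.gumbel_r(loc=mean - np.euler_gamma * gumbel_scale, scale=gumbel_scale),
]
weights = [0.25, 0.75]
n_draws = 1000

# Decide how many draws come from each component with a single multinomial draw.
counts = rng.multinomial(n_draws, weights)

# Sample the required number of draws from each component and pool them.
draws = np.concatenate(
    [d.rvs(size=int(c), random_state=rng) for d, c in zip(components, counts)]
)

# Shuffle so the output is not grouped by component.
rng.shuffle(draws)
```

This avoids one root-finding call per draw, which is what inverting the ensemble CDF would
require.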

stats_temp
^^^^^^^^^^

A getter function for the mean and variance supplied to the EnsembleDistribution object. Unlike
scipy's :code:`stats()`, it does not supply skewness or kurtosis.

EnsembleFitter
--------------

The :code:`fit()` function fits an ensemble distribution by minimizing the distance between the
eCDF of the given microdata and the CDF of the ensemble distribution, as measured by a chosen
penalty. Legacy code at IHME implements only the Kolmogorov-Smirnov distance, but the sum of
squares and L1 norm distance metrics have been implemented here as well.
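
To make the optimization concrete, here is a minimal sketch that fits weights under a sum of
squares penalty. It is an assumption-laden illustration, not the method used by :code:`fit()`: the
optimizer (scipy's SLSQP), the starting weights, and all variable names are choices made for this
example only.

```python
import numpy as np
import scipy.stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = np.sort(scipy.stats.norm.rvs(loc=0, scale=1, size=200, random_state=rng))
ecdf = np.arange(1, data.size + 1) / data.size

# Each component is matched to the sample mean and variance.
mean, sd = data.mean(), data.std(ddof=1)
gumbel_scale = sd * np.sqrt(6.0) / np.pi
components = [
    scipy.stats.norm(loc=mean, scale=sd),
    scipy.stats.gumbel_r(loc=mean - np.euler_gamma * gumbel_scale, scale=gumbel_scale),
]

# Column j holds the CDF of component j evaluated at the sorted data.
cdf_matrix = np.column_stack([d.cdf(data) for d in components])


def sum_squares(weights):
    # Squared distance between the ensemble CDF and the eCDF.
    return np.sum((cdf_matrix @ weights - ecdf) ** 2)


result = minimize(
    sum_squares,
    x0=np.full(len(components), 1.0 / len(components)),
    method="SLSQP",
    bounds=[(0.0, 1.0)] * len(components),
    constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
)
fitted_weights = result.x
```

The bounds and the equality constraint keep the weights non-negative and summing to 1, so the
fitted combination is still a valid distribution.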
8 changes: 0 additions & 8 deletions docs/user_guide/documentation.rst

This file was deleted.

53 changes: 53 additions & 0 deletions docs/user_guide/ensemble_fitting.rst
@@ -0,0 +1,53 @@
================
Ensemble Fitting
================

In order to fit an ensemble distribution to microdata, use the :code:`EnsembleFitter` object. The
object must be initialized with 2 things.

*A list of named distributions.* Different distributions can have different "supports". A
support, for our purposes, can be thought of as the x values that are compatible with some given
distribution. For example, the Normal distribution is supported on the entire real line, so it can
take negative x values, but the Gamma is only supported on (0, :math:`\infty`), so it cannot take
negative values. **Recall: you are not permitted to use distributions with differing supports in the
same ensemble.**

*A penalty function of choice.* In a nutshell, we are minimizing the distances between the empirical
cumulative distribution function (eCDF) and the CDF of the ensemble subject to said chosen penalty.
The penalties currently implemented are as follows:

* :code:`"L1"`: L1 norm
* :code:`"sum_squares"`: sum of squares
* :code:`"KS"`: the Kolmogorov-Smirnov distance, A.K.A. infinity norm
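
For intuition, the three penalties can be computed by hand for a single candidate CDF. This is a
sketch on simulated data; the dictionary keys simply mirror the labels in the list above and are
not guaranteed to match the strings the package accepts.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
data = np.sort(rng.normal(size=100))

# Empirical CDF evaluated at the sorted data points.
ecdf = np.arange(1, data.size + 1) / data.size

# Candidate CDF to compare against (a standard normal here).
model_cdf = scipy.stats.norm.cdf(data)
residual = ecdf - model_cdf

distances = {
    "L1": np.sum(np.abs(residual)),      # sum of absolute differences
    "sum_squares": np.sum(residual**2),  # sum of squared differences
    "KS": np.max(np.abs(residual)),      # largest absolute difference
}
```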

Finally, the function of interest for this use case is the :code:`fit()` function.

Example: Fitting an Ensemble
----------------------------

Suppose we have microdata for systolic blood pressure (SBP) from a certain population of young men
in Seattle. Since SBP must be positive, let's use all the distributions (except the exponential)
with a positive support to fit this data.

.. code-block:: python

    import scipy.stats as stats
    from ensemble.model import EnsembleFitter

    SBP_vals = stats.norm(loc=120, scale=7).rvs(size=100)
    model = EnsembleFitter(
        distributions=["gamma", "invgamma", "fisk", "lognormal"],
        objective="L2"
    )
    res = model.fit(SBP_vals)

:code:`res` contains an array of fitted weights as well as an :code:`EnsembleDistribution` object
that has already been initialized with the distributions provided to :code:`model`. They can be
accessed as follows:

.. code-block:: python

    # fitted weights
    fitted_weights = res.weights

    # fitted ensemble
    fitted_ensemble = res.ensemble_distribution