Rormula

Rormula is a Python package that parses Wilkinson notation to create model matrices useful in design of experiments. It can also be used for column arithmetic similar to df.eval, where df is a Pandas dataframe. Rormula is significantly faster than df.eval or Formulaic for small matrices, but it is still a prototype that is not well tested.

Getting Started with Wilkinson Notation

pip install rormula

Currently, the supported operations are +, :, and ^. New operators are easy to add, but each one has to be implemented explicitly. There are several ways to provide inputs and receive results. The result can be either a Pandas dataframe or a list of names together with a Numpy array.

import numpy as np
import pandas as pd
from rormula import Wilkinson, SeparatedData
data_np = np.random.random((10, 2))
data = pd.DataFrame(data=data_np, columns=["a", "b"])
ror = Wilkinson("a+b+a:b")

# option 1 returns the model matrix as pandas dataframe
mm_df = ror.eval_asdf(data)
assert isinstance(mm_df, pd.DataFrame)
print(mm_df)

# option 2 is faster
mm_names, mm = ror.eval(data)
assert isinstance(mm, np.ndarray)
assert isinstance(mm_names, list)
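
To make the meaning of the formula concrete, the following pure-Python sketch builds the same kind of matrix by hand for a+b+a:b: an intercept column of ones, the columns a and b, and their element-wise product for the interaction a:b. It does not call Rormula. The function name model_matrix_by_hand is made up for illustration, and the column order (intercept first) is an assumption based on the names asserted in the separated-data example of this README.

```python
import random


def model_matrix_by_hand(a, b):
    """Expand the formula "a+b+a:b" manually (illustration only, not Rormula's code)."""
    names = ["Intercept", "a", "b", "a:b"]
    # One row per observation: intercept, main effects, interaction.
    rows = [[1.0, x, y, x * y] for x, y in zip(a, b)]
    return names, rows


random.seed(0)
a = [random.random() for _ in range(10)]
b = [random.random() for _ in range(10)]
names, mm = model_matrix_by_hand(a, b)
print(names)
print(mm[0])
```

Each row has four entries, one per name, which is the shape one would expect from ror.eval for 10 rows of input.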

Regarding inputs, the fastest option is to use the interface with separated categorical and numerical data, even if there is no categorical data. The categorical data is expected to have the object-dtype O. Admittedly, the current interface is rather tedious.

data = pd.DataFrame(
   data=np.random.random((100, 3)),
   columns=["alpha", "beta", "gamma"],
)
separated_data = SeparatedData(
   numerical_cols=data.columns.to_list(),
   numerical_data=data.to_numpy(),
   categorical_cols=[],
   categorical_data=np.zeros((100, 0), dtype="O"),
)
ror = Wilkinson("alpha + beta + alpha:gamma")
names, mm = ror.eval(separated_data)
assert names == ["Intercept", "alpha", "beta", "alpha:gamma"]
assert mm.shape == (100, 4)
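
Since filling the separated interface by hand is tedious, a small helper can do the split. The sketch below is not part of Rormula: split_columns is a hypothetical function that sorts dataframe columns by dtype and returns a dictionary whose keys match the SeparatedData fields used above. Treating every non-numeric column as categorical is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def split_columns(df):
    """Return keyword arguments named like the SeparatedData fields.

    Hypothetical helper, not part of Rormula: numeric columns go to the
    numerical block, everything else to the categorical block.
    """
    numerical = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return {
        "numerical_cols": numerical.columns.to_list(),
        "numerical_data": numerical.to_numpy(dtype=float),
        "categorical_cols": categorical.columns.to_list(),
        # astype("O") yields the object dtype, also for zero columns.
        "categorical_data": categorical.to_numpy().astype("O"),
    }


df = pd.DataFrame(
    {
        "alpha": np.random.random(5),
        "beta": np.random.random(5),
        "color": ["red", "blue", "red", "green", "blue"],
    }
)
kwargs = split_columns(df)
print(kwargs["numerical_cols"], kwargs["categorical_cols"])
```

Assuming SeparatedData accepts exactly these keyword arguments, as in the example above, the result could be passed on as SeparatedData(**split_columns(df)).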

Getting Started with Column Arithmetic

You can do calculations on the columns of a Pandas dataframe.

import numpy as np
import pandas as pd
from rormula import Arithmetic

df = pd.DataFrame(
   data=np.random.random((100, 3)), columns=["alpha", "beta", "gamma"]
)
s = "beta*alpha - 1 + 2^beta + alpha / gamma"
rormula = Arithmetic(s, "s")
df_ror = rormula.eval_asdf(df.copy())
pd_s = f's={s.replace("^", "**")}'
assert df_ror.shape == (100, 4)
assert np.allclose(df_ror, df.eval(pd_s))

Arithmetic.eval_asdf evaluates the expression string and puts the result into your input dataframe. Arithmetic.eval returns the result as a 2d Numpy array with one column. In contrast to pd.DataFrame.eval, Arithmetic.eval does not execute any Python code; it only understands a list of predefined operators. Besides the usual suspects such as +, -, and ^, the operators include a conditional restriction. You can use a comparison operator like ==, which compares float values with a tolerance. Internally, the result of == is a list of indices that can be used to reduce the columns with |, as in the following example.

data = np.ones((100, 3))
data[5, :] = 2.5
data[7, :] = 2.5
df = pd.DataFrame(data=data, columns=["alpha", "beta", "gamma"])
s = "beta|alpha==2.5"
rormula = Arithmetic(s, "reduced")
res = rormula.eval_asdf(df)
assert res.shape == (2, 1)
assert np.allclose(res, 2.5)
print(res)

The output is

   reduced
0      2.5
1      2.5

Since the resulting dataframe has fewer rows than the input dataframe, the result is returned as a new dataframe with a single column.
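
The two steps behind beta|alpha==2.5 can be sketched in plain Python: first collect the row indices where alpha equals 2.5 up to a tolerance, then keep only those rows of beta. This is an illustration of the idea, not Rormula's implementation. The function names and the tolerance values are assumptions, since the README does not state which tolerance Rormula uses.

```python
import math


def indices_where_close(column, value, rel_tol=1e-9, abs_tol=1e-9):
    """Row indices whose entry equals `value` up to a tolerance (tolerances are assumed)."""
    return [
        i
        for i, x in enumerate(column)
        if math.isclose(x, value, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


def reduce_rows(column, indices):
    """Keep only the rows selected by `indices`."""
    return [column[i] for i in indices]


alpha = [1.0] * 100
beta = [1.0] * 100
for row in (5, 7):
    alpha[row] = 2.5
    beta[row] = 2.5

idx = indices_where_close(alpha, 2.5)  # mirrors alpha==2.5
reduced = reduce_rows(beta, idx)       # mirrors beta|...
print(idx, reduced)  # prints [5, 7] [2.5, 2.5]
```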

Contribute

To run the tests, you need to have Rust installed.

Python Tests

  1. Go to the directory of the Python package
    cd rormula
    
  2. Install dev dependencies via
    pip install -r requirements.txt
    
  3. Create a development build of Rormula
    maturin develop --release
    
  4. Run
    python test/test.py
    

Rust Tests

Run

cargo test

from the project's root.

Rough Time Measurements

We compare Rormula to the well-established and far more mature package Formulaic. The tests create a formula in Wilkinson notation and sample random data points, 100 rows in the first two tests. The output on my machine is

- test just numerical 100 rows
Rormula took 0.0009s
Rormula asdf took 0.0213s
Formulaic took 0.1193s
- test numerical and categorical 100 rows
Rormula took 0.0032s
Rormula asdf took 0.0149s
Formulaic took 0.1705s
- test just numerical 100000 rows
Rormula took 0.2240s
Rormula asdf took 0.2895s
Formulaic took 0.2300s

For the lines that start with Rormula took, we have separated categorical and numerical data beforehand. For the lines that start with Rormula asdf took, we pass and receive pandas dataframes. The time is measured for 100 applications of the formula. The first two tests use a small data set with 100 rows. For more rows, e.g., 10k+, Formulaic becomes competitive or even faster.
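
A measurement of this kind, the total time for 100 applications, can be reproduced with a small harness like the one below. It is a sketch and not the repository's test code: time_repeated and evaluate_formula are hypothetical names, and the stand-in workload would be replaced by a real call such as ror.eval(data).

```python
import time


def time_repeated(func, repetitions=100):
    """Total wall-clock seconds for `repetitions` calls of `func`."""
    start = time.perf_counter()
    for _ in range(repetitions):
        func()
    return time.perf_counter() - start


calls = []


def evaluate_formula():
    # Hypothetical stand-in; a real measurement would call e.g. ror.eval(data).
    calls.append(sum(x * x for x in range(100)))


elapsed = time_repeated(evaluate_formula)
print(f"stand-in took {elapsed:.4f}s")
```

Summing over all repetitions rather than timing a single call reduces the influence of timer resolution, which matters for evaluations that finish in microseconds.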

Profiling

We use Counts for profiling Rust code.

To run profiling one can use

maturin develop --release --features print_timings
python test/test_wilkinson.py 2> counts.txt
counts -i -e counts.txt

See rormula/profile.sh. To profile other specific parts of the Rust code, use the timing! macro.

let res = timing!(some_calculation(), "name of some calculation");

Note that running in profiling mode makes the whole program slower, so the time measurements of the section above no longer hold.