-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a data generator class based on the demo notebook #411
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still taking a look at this. You might want to run make lint
locally in order to format
pymc_marketing/mmm/data_generator.py
Outdated
df = self.df | ||
self.is_lagged = True # signals that adstock has been applied | ||
self.lag_fn = fn | ||
self.alphas = np.random.beta(.5, .5, self.num_channels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to use or pass another random seed here
Can you post a picture of some of the different plots and their variations |
pymc_marketing/mmm/data_generator.py
Outdated
Parameters | ||
suffix: str | ||
Channel suffix. I.e. x1_adstock "_adstock" would be the | ||
channel suffix. | ||
show: bool | ||
Can be turned off for overlays.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stick with numpy docstrings
pymc_marketing/mmm/data_generator.py
Outdated
|
||
def plot_adstock_effect(self): | ||
'''Plots the raw channels compared to when their adstock effects are applied.''' | ||
assert self.is_lagged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more informative error messages and error types could be helpfuk
pymc_marketing/mmm/data_generator.py
Outdated
if not self.is_lagged: | ||
fig, ax = self.plot_channel_spends(show=False) | ||
self.plot_channel_spends( | ||
suffix="_saturated", title="Saturation Effect", fig=fig, ax=ax, alpha=.5) | ||
else: | ||
fig, ax = self.plot_channel_spends(suffix="_adstock", show=False) | ||
self.plot_channel_spends( | ||
suffix="_adstock_saturated", alpha=.5, title="Saturation Effect", | ||
fig=fig, ax=ax) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be consolidate and by using if block to define different kwargs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call
pymc_marketing/mmm/data_generator.py
Outdated
(self.sin_coef, self.cos_coef) = np.random.exponential(seasonality_scale, size=2) | ||
num_periods = len(df) | ||
freq = 52 # weeks per year | ||
df["cs"] = -self.sin_coef * np.sin(num_periods * 2 * np.pi * df.index / freq) | ||
df["cc"] = self.cos_coef * np.cos(num_periods * 2 * np.pi * df.index / freq) | ||
df["seasonality"] = 0.5 * (df["cs"] + df["cc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This already exists in the mmm/utils.py
Could that be utilized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That native approach of creating 2N columns and dotting with the sin and cos coefs would make the df unwieldy to pass around. I prefer keeping them compressed into a single 'seasonality' column.
I did however generalize this to N degrees of Fourier series.
pymc_marketing/mmm/data_generator.py
Outdated
import seaborn as sns | ||
import warnings | ||
|
||
warnings.filterwarnings('ignore', category=FutureWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these coming from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python3.11/site-packages/seaborn/_core.py:1218: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
moved it back to matplotlib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, too many plots in other areas, just going to keep the warnings statement unless it's a big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're coming from seaborn^
Codecov Report
@@ Coverage Diff @@
## main #411 +/- ##
==========================================
- Coverage 88.67% 78.81% -9.86%
==========================================
Files 21 22 +1
Lines 1880 2115 +235
==========================================
Hits 1667 1667
- Misses 213 448 +235
|
A gist with the plots: |
Everything should be addressed now. Let me know if there's anything else. |
Also, looks like a new version just dropped. Let me know if you'd like me to remake this PR on top of a fresh branch or not. |
logistic_saturation, | ||
) | ||
|
||
warnings.filterwarnings("ignore", category=FutureWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have a blank warning filter like this. You can suppress specific warnings around specific function calls instead with a local with warnings.catch_warnings():
) | ||
|
||
warnings.filterwarnings("ignore", category=FutureWarning) | ||
FIGSIZE = (15, 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matplotlib / Seaborn / Arviz use a global variable (something rc) that controls things like this. We should let those be used by default, and users can tweak them that way as well.
seed: int = sum(map(ord, "six-mmm")) | ||
self.rng: np.random.Generator = np.random.default_rng(seed=seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow user to provide an optional seed?
This functionality is fine, although I fear a bit hard to maintain as we change how MMM works. Ideally the MMM models could be used to generate synthetic data, but that's another story. We should add a couple of tests to make sure basic functionality is not fundamentally broken. |
Interesting... I can remove the pymc_marketing dependencies (on the transform functions) to reduce your having to worry about breaking the generator with core functionality upgrades. Let me know.
Sure, I'll try and get some tests in there (this week perhaps). |
Hi @nikisix It might be good to separate the data generation into:
The y data generation like @ricardoV94 mentioned can be do with the model itself. We have some examples of that being done here and here with the Separating these would allow us to not duplicate the data generation process defined by the model! |
Hi @wd60622,
A bit, but honestly I don't have a ton of time for this right now.
Doesn't that seem a bit circular though? As I implied to @ricardoV94, my intuition actually pulls me the opposite direction and remove all model/transform dependencies. This has 2 benefits:
Am I missing something perhaps? However, I did write a test file with graphs (as requested). So if that's the delta between this PR and acceptance I can commit it, and y'all can take it from there. |
@wd60622 on second thought I agree with you guys about leveraging the models directly. Mainly because I can't think of a way to implement an independent data generation process. So you might as well couple this to a model type, give up on using it for model selection, but have a cleaner approach to parameter recovery. Point taken. Also the |
Hi @nikisix
Then just remove the target generation process which can be saved for later. I.e. remove intercept, trend, seasonality, etc because that is handled by the model. Maybe @juanitorduz has some thoughts too |
f9a38fd
to
818ba39
Compare
📚 Documentation preview 📚: https://pymc-marketing--411.org.readthedocs.build/en/411/