Adding a data generator class based on the demo notebook #411

nikisix · 2023-10-24T02:53:41Z

📚 Documentation preview 📚: https://pymc-marketing--411.org.readthedocs.build/en/411/

wd60622

Still taking a look at this. You might want to run make lint locally to format the code.

wd60622 · 2023-10-26T19:12:44Z

pymc_marketing/mmm/data_generator.py

+        df = self.df
+        self.is_lagged = True # signals that adstock has been applied
+        self.lag_fn = fn
+        self.alphas = np.random.beta(.5, .5, self.num_channels)


Might be good to use or pass another random seed here
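A minimal sketch of what passing a generator could look like, assuming the alphas are drawn in a helper that receives a `np.random.Generator`. The function name `draw_adstock_alphas` is hypothetical and not part of the PR:

```python
import numpy as np


def draw_adstock_alphas(num_channels, rng):
    """Draw one Beta(0.5, 0.5) adstock decay rate per channel (hypothetical helper)."""
    # Using the caller's Generator instead of the global np.random state
    # makes the draw reproducible and independent of other code.
    return rng.beta(0.5, 0.5, size=num_channels)


# Two generators built from the same seed produce the same alphas.
alphas_a = draw_adstock_alphas(3, np.random.default_rng(0))
alphas_b = draw_adstock_alphas(3, np.random.default_rng(0))
```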

wd60622 · 2023-10-26T19:24:15Z

Can you post a picture of some of the different plots and their variations?

wd60622 · 2023-10-26T19:18:43Z

pymc_marketing/mmm/data_generator.py

+        Parameters
+        suffix: str
+            Channel suffix. I.e. x1_adstock "_adstock" would be the
+            channel suffix.
+        show: bool
+            Can be turned off for overlays.'''


Stick with numpy docstrings
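For reference, a sketch of the numpy docstring layout applied to a small helper. The function `channel_columns` is made up for illustration; only the section layout (underlined `Parameters` and `Returns` headers, `name : type` entries) is the point:

```python
def channel_columns(channels, suffix=""):
    """Build suffixed column names for a list of channels.

    Parameters
    ----------
    channels : list of str
        Base channel names, e.g. ``["x1", "x2"]``.
    suffix : str, optional
        Channel suffix. For ``"x1_adstock"`` the suffix is ``"_adstock"``.

    Returns
    -------
    list of str
        One column name per channel with the suffix appended.
    """
    return [f"{channel}{suffix}" for channel in channels]
```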

wd60622 · 2023-10-26T19:20:21Z

pymc_marketing/mmm/data_generator.py

+
+    def plot_adstock_effect(self):
+        '''Plots the raw channels compared to when their adstock effects are applied.'''
+        assert self.is_lagged


Some more informative error messages and error types could be helpful
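One possible shape for this, sketched on a toy class. `ToyGenerator` and `apply_adstock` are hypothetical names, and `RuntimeError` is just one reasonable choice of exception type:

```python
class ToyGenerator:
    """Stand-in for the data generator; only tracks the adstock flag."""

    def __init__(self):
        self.is_lagged = False

    def apply_adstock(self):
        # The real method would transform the channels; here we only set the flag.
        self.is_lagged = True

    def plot_adstock_effect(self):
        # An explicit exception survives `python -O` (which strips asserts)
        # and tells the user how to fix the problem.
        if not self.is_lagged:
            raise RuntimeError(
                "Adstock has not been applied yet; call apply_adstock() "
                "before plot_adstock_effect()."
            )
        return "plotted"
```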

wd60622 · 2023-10-26T19:22:46Z

pymc_marketing/mmm/data_generator.py

+        if not self.is_lagged:
+            fig, ax = self.plot_channel_spends(show=False)
+            self.plot_channel_spends(
+                    suffix="_saturated", title="Saturation Effect", fig=fig, ax=ax, alpha=.5)
+        else:
+            fig, ax = self.plot_channel_spends(suffix="_adstock", show=False)
+            self.plot_channel_spends(
+                    suffix="_adstock_saturated", alpha=.5, title="Saturation Effect",
+                    fig=fig, ax=ax)


These can be consolidated by using an if block to define the differing kwargs
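A rough sketch of that consolidation. To keep it runnable outside the class, the plotting method is passed in as a callable and replaced by a recording stub (`record_call`, hypothetical); the keyword arguments mirror the ones in the diff above:

```python
def plot_saturation_effect(plot_channel_spends, is_lagged):
    """Overlay saturated spends on the base spends with a single call path."""
    base_kwargs = {"show": False}
    overlay_kwargs = {"suffix": "_saturated", "title": "Saturation Effect", "alpha": 0.5}
    # Only the suffixes differ between the two branches.
    if is_lagged:
        base_kwargs["suffix"] = "_adstock"
        overlay_kwargs["suffix"] = "_adstock_saturated"
    fig, ax = plot_channel_spends(**base_kwargs)
    return plot_channel_spends(fig=fig, ax=ax, **overlay_kwargs)


# Stub that records how it was called instead of drawing anything.
calls = []


def record_call(**kwargs):
    calls.append(kwargs)
    return "fig", "ax"


plot_saturation_effect(record_call, is_lagged=True)
```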

wd60622 · 2023-10-26T19:23:25Z

pymc_marketing/mmm/data_generator.py

+        (self.sin_coef, self.cos_coef) = np.random.exponential(seasonality_scale, size=2)
+        num_periods = len(df)
+        freq = 52 # weeks per year
+        df["cs"] = -self.sin_coef * np.sin(num_periods * 2 * np.pi * df.index / freq)
+        df["cc"] =  self.cos_coef * np.cos(num_periods * 2 * np.pi * df.index / freq)
+        df["seasonality"] = 0.5 * (df["cs"] + df["cc"])


This already exists in the mmm/utils.py
Could that be utilized?

That native approach of creating 2N columns and dotting with the sin and cos coefs would make the df unwieldy to pass around. I prefer keeping them compressed into a single 'seasonality' column.

I did however generalize this to N degrees of Fourier series.
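For readers following along, a sketch of how an N-degree Fourier series can be collapsed into a single seasonality array without materializing 2N columns on the DataFrame. This is an illustration under assumptions, not the code from the PR: the function name, the argument layout, and the omission of the sign and 0.5 scaling used in the diff are all choices made here.

```python
import numpy as np


def fourier_seasonality(t, sin_coefs, cos_coefs, period=52):
    """Sum an N-degree Fourier series into one seasonality value per time step."""
    t = np.asarray(t, dtype=float)
    sin_coefs = np.asarray(sin_coefs, dtype=float)
    cos_coefs = np.asarray(cos_coefs, dtype=float)
    orders = np.arange(1, len(sin_coefs) + 1)          # 1, 2, ..., N
    angles = 2 * np.pi * np.outer(t, orders) / period  # shape (len(t), N)
    # The dot products reduce the N sine and N cosine terms to a single column.
    return np.sin(angles) @ sin_coefs + np.cos(angles) @ cos_coefs


# Two years of weekly steps with a 2-degree series.
seasonality = fourier_seasonality(np.arange(104), [0.3, 0.1], [0.5, 0.2])
```

The result is one array of length `len(t)`, so it can be assigned directly to a single "seasonality" column.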

wd60622 · 2023-10-26T19:27:03Z

pymc_marketing/mmm/data_generator.py

+import seaborn as sns
+import warnings
+
+warnings.filterwarnings('ignore', category=FutureWarning)


What are these coming from?

python3.11/site-packages/seaborn/_core.py:1218: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):

moved it back to matplotlib

Nevermind, too many plots in other areas, just going to keep the warnings statement unless it's a big deal.

They're coming from seaborn^

codecov · 2023-10-31T20:15:46Z

Codecov Report

Merging #411 (b490b15) into main (916dce4) will decrease coverage by 9.86%.
The diff coverage is 0.00%.

❗ Current head b490b15 differs from pull request most recent head 93b2b45. Consider uploading reports for the commit 93b2b45 to get more accurate results

@@            Coverage Diff             @@
##             main     #411      +/-   ##
==========================================
- Coverage   88.67%   78.81%   -9.86%     
==========================================
  Files          21       22       +1     
  Lines        1880     2115     +235     
==========================================
  Hits         1667     1667              
- Misses        213      448     +235

Files	Coverage Δ
pymc_marketing/mmm/data_generator.py	`0.00% <0.00%> (ø)`

nikisix · 2023-10-31T21:33:39Z

A gist with the plots:
https://gist.github.com/nikisix/df755ade12c6a05f879f9241c7c422d4

nikisix · 2023-10-31T21:39:58Z

Everything should be addressed now. Let me know if there's anything else.

nikisix · 2023-10-31T21:41:53Z

Also, looks like a new version just dropped. Let me know if you'd like me to remake this PR on top of a fresh branch or not.

ricardoV94 · 2023-11-01T11:56:18Z

pymc_marketing/mmm/data_generator.py

+    logistic_saturation,
+)
+
+warnings.filterwarnings("ignore", category=FutureWarning)


We shouldn't have a blanket warning filter like this. You can suppress specific warnings around specific function calls instead, with a local with warnings.catch_warnings():
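A small sketch of the local pattern, using only the standard library. `noisy_plot_call` is a hypothetical stand-in for the seaborn call that emits the FutureWarning:

```python
import warnings


def noisy_plot_call():
    """Stand-in for a plotting call that emits a FutureWarning."""
    warnings.warn("is_categorical_dtype is deprecated", FutureWarning)
    return "plot"


# The filter only applies inside the with block; the previous filters
# are restored on exit, so the rest of the library is unaffected.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_plot_call()
```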

ricardoV94 · 2023-11-01T11:57:12Z

pymc_marketing/mmm/data_generator.py

+)
+
+warnings.filterwarnings("ignore", category=FutureWarning)
+FIGSIZE = (15, 8)


Matplotlib / Seaborn / Arviz use a global variable (something rc) that controls things like this. We should let those be used by default, and users can tweak them that way as well.
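For Matplotlib, the global in question is `matplotlib.rcParams`. A sketch of how a user could get the same figure size without a module-level FIGSIZE constant, assuming Matplotlib is installed:

```python
import matplotlib as mpl

# Whatever the user (or their matplotlibrc) has configured.
default_size = list(mpl.rcParams["figure.figsize"])

# Temporarily override the default figure size for a block of plotting code.
with mpl.rc_context({"figure.figsize": (15, 8)}):
    size_inside = list(mpl.rcParams["figure.figsize"])

# Outside the context the previous setting is back in effect.
size_after = list(mpl.rcParams["figure.figsize"])
```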

ricardoV94 · 2023-11-01T11:57:44Z

pymc_marketing/mmm/data_generator.py

+        seed: int = sum(map(ord, "six-mmm"))
+        self.rng: np.random.Generator = np.random.default_rng(seed=seed)


Allow user to provide an optional seed?
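One way this could look, keeping the current hard-coded value as the default. The class name `SeededGenerator`, the constant `DEFAULT_SEED`, and the `draw` method are hypothetical; only the seed expression and the `default_rng` call come from the diff:

```python
import numpy as np

DEFAULT_SEED = sum(map(ord, "six-mmm"))


class SeededGenerator:
    """Toy class showing an overridable seed."""

    def __init__(self, seed=DEFAULT_SEED):
        # default_rng accepts an int, None (fresh entropy), or an existing Generator.
        self.rng = np.random.default_rng(seed)

    def draw(self, n):
        return self.rng.uniform(size=n)


same_a = SeededGenerator().draw(5)
same_b = SeededGenerator().draw(5)
other = SeededGenerator(seed=1).draw(5)
```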

ricardoV94 · 2023-11-01T12:00:09Z

This functionality is fine, although I fear a bit hard to maintain as we change how MMM works. Ideally the MMM models could be used to generate synthetic data, but that's another story.

We should add a couple of tests to make sure basic functionality is not fundamentally broken.

nikisix · 2023-11-01T17:47:28Z

> This functionality is fine, although I fear a bit hard to maintain as we change how MMM works. Ideally the MMM models could be used to generate synthetic data, but that's another story.

Interesting... I can remove the pymc_marketing dependencies (on the transform functions) to reduce your having to worry about breaking the generator with core functionality upgrades. Let me know.

> We should add a couple of tests to make sure basic functionality is not fundamentally broken.

Sure, I'll try and get some tests in there (this week perhaps).

wd60622 · 2024-05-09T16:14:31Z

Hi @nikisix
Are you still interested in this?

It might be good to separate the data generation into:

X data generation
y data generation

The y data generation, like @ricardoV94 mentioned, can be done with the model itself. We have some examples of that being done here and here with the pm.do operator.

Separating these would allow us to not duplicate the data generation process defined by the model!

nikisix · 2024-05-10T19:02:37Z

Hi @wd60622,
Thanks for reaching out, responses inline.

> Hi @nikisix Are you still interested in this?

A bit, but honestly I don't have a ton of time for this right now.

> It might be good to separate the data generation into:
>
> X data generation
> y data generation
>
> The y data generation, like @ricardoV94 mentioned, can be done with the model itself. We have some examples of that being done here and here with the pm.do operator.
>
> Separating these would allow us to not duplicate the data generation process defined by the model!

Doesn't that seem a bit circular though? As I implied to @ricardoV94, my intuition actually pulls me in the opposite direction: remove all model/transform dependencies. This has 2 benefits:

Model selection can run on generated data in an unbiased way. For example, if the data is generated from an adstock Hill model, then adstock Hill model types would have an unfair advantage in model comparison.
Decouples the data generator from model code, making the codebase more robust.

Am I missing something perhaps?

However, I did write a test file with graphs (as requested). So if that's the delta between this PR and acceptance I can commit it, and y'all can take it from there.

nikisix · 2024-05-10T19:21:40Z

@wd60622 on second thought I agree with you guys about leveraging the models directly. Mainly because I can't think of a way to implement an independent data generation process. So you might as well couple this to a model type, give up on using it for model selection, but have a cleaner approach to parameter recovery. Point taken. Also the do operator looks slick.
Like I said, kind of swamped right now. Would you be ok accepting this (+test) as a first round, and then we can update the core generator to use pymc.do as a second pass?

wd60622 · 2024-06-12T15:25:05Z

> @wd60622 on second thought I agree with you guys about leveraging the models directly. Mainly because I can't think of a way to implement an independent data generation process. So you might as well couple this to a model type, give up on using it for model selection, but have a cleaner approach to parameter recovery. Point taken. Also the do operator looks slick. Like I said, kind of swamped right now. Would you be ok accepting this (+test) as a first round, and then we can update the core generator to use pymc.do as a second pass?

Hi @nikisix
I think it'd be good to just have a generation process for

dates
controls / events
spends

Then just remove the target generation process which can be saved for later. I.e. remove intercept, trend, seasonality, etc because that is handled by the model.
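To make the proposed split concrete, a rough sketch of an inputs-only generator that produces dates, an event indicator, and channel spends, and leaves the target to the model. Everything here is an assumption for illustration: the function name, the weekly frequency, the start date, and the Gamma and Bernoulli distributions are not taken from the PR.

```python
import numpy as np


def generate_inputs(n_weeks, channels, start="2023-01-02", event_prob=0.05, seed=None):
    """Generate dates, an event flag and channel spends (no target)."""
    rng = np.random.default_rng(seed)
    # Weekly dates starting at `start`.
    dates = np.datetime64(start, "D") + np.arange(n_weeks) * np.timedelta64(7, "D")
    # Binary indicator for weeks in which an event happens.
    event = rng.random(n_weeks) < event_prob
    # One non-negative spend series per channel.
    spends = rng.gamma(shape=2.0, scale=100.0, size=(n_weeks, len(channels)))
    data = {"date": dates, "event": event}
    for i, channel in enumerate(channels):
        data[channel] = spends[:, i]
    return data


X = generate_inputs(n_weeks=10, channels=["x1", "x2"], seed=0)
```

The returned dict can be handed to `pandas.DataFrame(X)` if a DataFrame is preferred; intercept, trend, and seasonality are deliberately absent.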

Maybe @juanitorduz has some thoughts too

Adding a data generator class based on the demo notebook

b490b15

wd60622 reviewed Oct 26, 2023

Nick Tomasino added 2 commits October 31, 2023 14:34

MMM data_generator: responding to PR feedback

3f53f55

linted

93b2b45

ricardoV94 reviewed Nov 1, 2023

ricardoV94 added enhancement New feature or request MMM labels Nov 1, 2023

ulfaslak force-pushed the main branch 3 times, most recently from f9a38fd to 818ba39 Compare September 8, 2024 12:38

twiecki force-pushed the main branch from 35dc7f1 to 9f9c67f Compare September 10, 2024 15:25
