stateful op? #7946
-
Hello, I'm trying to use Dagster to simulate how an online signal processing pipeline would work. While I would prefer that it could actually run online indefinitely on streaming data, this seems impossible with Dagster; please correct me if I'm wrong. If I can simulate it, then I can start with an offline dataset, which might be a 100 GB time series, and process it chunk-by-chunk through the full signal processing pipeline. For this to work, the ops need to maintain some state between chunks. Is this at all possible? I wouldn't mind having the op write its state to shared memory and then access that shared memory on subsequent calls, for example. The second problem is that I'm pretty sure map() over dynamic outputs is unordered, and chunk order matters for a time series. Any suggestions? Or is Dagster not the right framework for what I want? Thanks in advance.

Edit: Thanks to both of you for your input. For the current project, Dagster meets all of my strict requirements but falls short on a couple of desired qualities. I'll keep looking, but I may well come back to Dagster and just limit my project to offline data analysis.
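For concreteness, here is roughly the kind of stateful chunk processing I mean, in plain Python/SciPy; `iter_chunks` is just a stand-in for however the dataset actually gets loaded:

```python
# Plain-Python sketch of the processing I want to express as ops:
# the filter's internal state (zi) must carry over from chunk to chunk.
import numpy as np
from scipy.signal import butter, sosfilt, sosfilt_zi

sos = butter(4, 0.1, output="sos")  # 4th-order low-pass, normalized cutoff 0.1
zi = sosfilt_zi(sos)                # initial filter state, updated per chunk


def iter_chunks(path, chunk_len=1_000_000):
    """Stand-in loader: yield the big time series in fixed-size chunks."""
    data = np.memmap(path, dtype=np.float64, mode="r")
    for start in range(0, len(data), chunk_len):
        yield np.asarray(data[start:start + chunk_len])


for chunk in iter_chunks("huge_timeseries.f64"):
    out, zi = sosfilt(sos, chunk, zi=zi)  # zi carries state between chunks
    # ... downstream processing of `out` ...
```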
-
Hey there! I'm not familiar with what an online signal processing pipeline would look like, so please forgive any incorrect terminology.

I think your best course of action will be to write a sensor. If you write a job that can process a single piece of data, then you could write a sensor that watches the dataset and starts a new execution of the job each time there is new data. The sensor runs on a cadence (by default every 30 seconds, but this can be modified), so you would handle the case where multiple pieces of new data appear between runs of the sensor. One potential drawback of this approach is that each execution of the job will run independently of (and potentially in parallel with) other executions of…
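Rough sketch of that sensor approach (untested; `find_new_chunks` is a placeholder for however you detect new data):

```python
from dagster import RunRequest, job, op, sensor


@op(config_schema={"chunk_path": str})
def process_chunk(context):
    chunk_path = context.op_config["chunk_path"]
    # load and process the single chunk at chunk_path
    ...


@job
def process_chunk_job():
    process_chunk()


def find_new_chunks(cursor):
    """Placeholder: return paths of chunks that appeared after `cursor`."""
    return []


@sensor(job=process_chunk_job, minimum_interval_seconds=30)
def new_chunk_sensor(context):
    for chunk_path in find_new_chunks(context.cursor):
        yield RunRequest(
            run_key=chunk_path,  # de-dupes: at most one run per chunk
            run_config={
                "ops": {"process_chunk": {"config": {"chunk_path": chunk_path}}}
            },
        )
```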
-
Hi @cboulay - if you want sub-ms startup time then I don't think you want to create a job, yeah. Using a resource and the in-process executor to share state between ops in a single job may be the way to go here. I believe you're also correct that map() is unordered - the thinking there was that you'd be likely to want to run the mapped outputs in parallel.
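Something like this is what I had in mind - an untested sketch, and the `SignalState` / `filter_chunk` names are just placeholders:

```python
from dagster import in_process_executor, job, op, resource


class SignalState:
    """Plain object holding whatever needs to survive between op invocations."""

    def __init__(self):
        self.filter_state = None


@resource
def signal_state_resource(_init_context):
    # One instance per run; with the in-process executor every op
    # sees this same object, so mutations are visible downstream.
    return SignalState()


@op
def load_chunk():
    # Stand-in for however a chunk enters the job.
    return [0.0] * 1024


@op(required_resource_keys={"state"})
def filter_chunk(context, chunk):
    state = context.resources.state
    # read state.filter_state, filter `chunk`, write the updated state back
    ...
    return chunk


@job(
    resource_defs={"state": signal_state_resource},
    executor_def=in_process_executor,  # keep all ops in one process so state is shared
)
def signal_job():
    filter_chunk(load_chunk())
```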