stateful op? #7946
-
Hello, I'm trying to use Dagster to simulate how an online signal processing pipeline would work. While I would prefer that it could actually run online indefinitely on streaming data, this seems impossible with Dagster; please correct me if I'm wrong. If I can simulate it, then I can start with an offline dataset, which might be a 100 GB time series, and process it chunk-by-chunk through the full signal processing pipeline. For this to work, the ops need to maintain some state between chunks. Is this at all possible? I wouldn't mind having the op write its state to shared memory and then access that shared memory on subsequent calls, for example. The second problem is that I'm pretty sure map() over dynamic outputs is unordered, and chunk order matters for a time series. Any suggestions? Or is Dagster not the right framework for what I want? Thanks in advance.

Edit: Thanks to both of you for your input. For the current project, Dagster meets all of my strict requirements but falls short on a couple of desired qualities. I'll keep looking, but I may well come back to Dagster and just limit my project to offline data analysis.
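For concreteness, here is roughly the kind of stateful chunk processing I mean, in plain Python/SciPy; `iter_chunks` is just a stand-in for however the dataset actually gets loaded:

```python
# Plain-Python sketch of the processing I want to express as ops:
# the filter's internal state (zi) must carry over from chunk to chunk.
import numpy as np
from scipy.signal import butter, sosfilt, sosfilt_zi

sos = butter(4, 0.1, output="sos")  # 4th-order low-pass, normalized cutoff 0.1
zi = sosfilt_zi(sos)                # initial filter state, updated per chunk


def iter_chunks(path, chunk_len=1_000_000):
    """Stand-in loader: yield the big time series in fixed-size chunks."""
    data = np.memmap(path, dtype=np.float64, mode="r")
    for start in range(0, len(data), chunk_len):
        yield np.asarray(data[start:start + chunk_len])


for chunk in iter_chunks("huge_timeseries.f64"):
    out, zi = sosfilt(sos, chunk, zi=zi)  # zi carries state between chunks
    # ... downstream processing of `out` ...
```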
-
Hey there! I'm not familiar with what an online signal processing pipeline would look like, so please forgive any incorrect terminology.

I think your best course of action will be to write a sensor. If you write a job that can process a single piece of data, then you could write a sensor that watches the dataset and starts a new execution of the job each time there is new data. The sensor runs on a cadence (by default every 30 seconds, but this can be modified), so you would handle the case where multiple pieces of new data appear between runs of the sensor. One potential drawback of this approach is that each execution of the job will run independently of (and potentially in parallel with) other executions of…
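Rough sketch of that sensor approach (untested; `find_new_chunks` is a placeholder for however you detect new data):

```python
from dagster import RunRequest, job, op, sensor


@op(config_schema={"chunk_path": str})
def process_chunk(context):
    chunk_path = context.op_config["chunk_path"]
    # load and process the single chunk at chunk_path
    ...


@job
def process_chunk_job():
    process_chunk()


def find_new_chunks(cursor):
    """Placeholder: return paths of chunks that appeared after `cursor`."""
    return []


@sensor(job=process_chunk_job, minimum_interval_seconds=30)
def new_chunk_sensor(context):
    for chunk_path in find_new_chunks(context.cursor):
        yield RunRequest(
            run_key=chunk_path,  # de-dupes: at most one run per chunk
            run_config={
                "ops": {"process_chunk": {"config": {"chunk_path": chunk_path}}}
            },
        )
```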
-
Hi @cboulay - if you want sub-ms startup time then I don't think you want to create a job, yeah. Using a resource and the in-process executor to share state between ops in a single job may be the way to go here. I believe you're also correct that map() is unordered - the thinking there was that you'd be likely to want to run the mapped outputs in parallel.
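Something like this is what I had in mind - an untested sketch, and the `SignalState` / `filter_chunk` names are just placeholders:

```python
from dagster import in_process_executor, job, op, resource


class SignalState:
    """Plain object holding whatever needs to survive between op invocations."""

    def __init__(self):
        self.filter_state = None


@resource
def signal_state_resource(_init_context):
    # One instance per run; with the in-process executor every op
    # sees this same object, so mutations are visible downstream.
    return SignalState()


@op
def load_chunk():
    # Stand-in for however a chunk enters the job.
    return [0.0] * 1024


@op(required_resource_keys={"state"})
def filter_chunk(context, chunk):
    state = context.resources.state
    # read state.filter_state, filter `chunk`, write the updated state back
    ...
    return chunk


@job(
    resource_defs={"state": signal_state_resource},
    executor_def=in_process_executor,  # keep all ops in one process so state is shared
)
def signal_job():
    filter_chunk(load_chunk())
```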