Non-blocking queries #299

As someone trying to put an API in production that uses R to connect to a database, the ability to send a query without blocking the R process is sorely missed. (See https://github.com/r-dbi/DBI/issues/69.)

RPostgres is very close to supporting this already -- it uses libpq's asynchronous API and already has an asynchronous loop to check for interrupts. In hacking on a local source install, I was able to add async functionality very easily. Of course, I'd probably need to add tests, error handling, etc. before it could actually be merged.

@krlmlr Would you be open to a PR adding a dbSendQueryAsync method or similar?

Comments
Thanks. Sure, that would be awesome. Would you like to contribute a short design document first, so that we can discuss before actually implementing?
Great! Where should I contribute the design document?
Perhaps a vignette in this package?
@krlmlr What do you think about this? https://github.com/zmbc/RPostgres/blob/async-queries-vignette/vignettes/async-queries.Rmd
@krlmlr Did you get a chance to read the design document? Would love to know your feedback.
Thanks, sorry it took me so long. The document looks good; I have a few questions: …
Hi @krlmlr!
Yes, I think so. And I don't understand your second question. I was able to get asynchronous queries working by moving a single line of code -- …
dbHasCompleted doesn't need an async version, I think. Or are you proposing that we rename dbHasExecutedQuery to dbHasCompletedAsync? If the latter, I don't think it is a good idea, for two reasons: dbHasCompleted is in the DBI spec and will be expected to behave that way, and dbHasExecutedQuery answers a different question entirely, so giving them similar names would be confusing.
I mentioned my quick-and-dirty approach above (of course, that simple change does not implement the status-checking method, but that's not too hard). I am not sure of the best way to avoid code duplication -- perhaps the synchronous result and the async result could be subclasses of the same class, or the synchronous result could use the async result under the hood? I favor the latter but would like to know what you think.
Thanks. I think I missed an important detail: is this about supporting multiple open result sets, or about being able to do a non-blocking dbSendQuery()?
As stated in the design doc, the purpose of this feature is to be able to send a query in a non-blocking fashion, so that other R code can run while the query is being executed by the database. There can still only be one query and one result set at a time per connection, unless I am seriously misunderstanding the libpq API. Maybe you thought I was saying this because of a typo in the design document, which I have just corrected.
I now have a better understanding of the problem after working on #272; your comments helped a lot.

```r
library(RPostgres)
conn1 <- dbConnect(Postgres())
sql <- "SELECT pg_sleep(2)"
print(Sys.time())
#> [1] "2021-09-19 18:05:31 CEST"
res <- dbSendQuery(conn1, sql)
print(Sys.time())
#> [1] "2021-09-19 18:05:33 CEST"
dbFetch(res)
#>   pg_sleep
#> 1 blob[0 B]
print(Sys.time())
#> [1] "2021-09-19 18:05:33 CEST"
dbClearResult(res)
dbDisconnect(conn1)
```

Created on 2021-09-19 by the reprex package (v2.0.1)

I understand you're saying that #272 and this issue are very similar; both affect the PqResultImpl class.
This class already combines sending queries (with a result set) and statements (which return the number of rows affected). I'm afraid all these degrees of freedom will complicate the implementation. I also have doubts about the usefulness for queries that return data: RPostgres operates in "single row mode" via PQsetSingleRowMode().

Regarding your draft, I have a few questions: …
That said, I'll consider it when resuming work on #272. I have a proof of concept with two implementations of the …
Thanks @krlmlr.
This is a fair point. I'd actually be curious to see whether a query could be constructed that would return rows with a large interval between them. But I do not think this is the common case; see below.
The much more common case, in my opinion, is that the delay is caused by network latency between the database and the computer where the R code is running. In modern web development (and perhaps other types of development besides), IO is almost always asynchronous for this reason -- it can be much more expensive than actual computation, and having a process block for it is quite inefficient.

Primarily queries. But I don't see a good reason why this would be easier to implement for only queries as opposed to all kinds of statements.
This was addressed in my vignette. There is no need to complicate the API with a timeout; users can implement that themselves quite easily using the status-checking method.
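For illustration, a minimal sketch of such a user-side timeout; dbSendQueryAsync() is the method proposed in this issue, and dbIsBusy() stands in for the status-checking method, whose real name is undecided:

```r
# Hypothetical sketch: poll the in-flight query and bail out past a deadline.
fetch_with_timeout <- function(conn, sql, timeout_secs = 5) {
  res <- dbSendQueryAsync(conn, sql)  # hypothetical non-blocking send
  deadline <- Sys.time() + timeout_secs
  while (dbIsBusy(res)) {             # hypothetical status check
    if (Sys.time() > deadline) {
      dbClearResult(res)              # clearing an unfinished result cancels the query
      stop("query timed out after ", timeout_secs, " seconds")
    }
    Sys.sleep(0.05)                   # yield briefly between status checks
  }
  on.exit(dbClearResult(res))
  dbFetch(res)
}
```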
I didn't address this in the vignette or my proof of concept, but this should just cancel the query without sending a new one.
I don't think RPostgres needs to worry about this. Users of RPostgres already need to be aware that they can only do one thing on a connection at a time (otherwise their results are overwritten). Users can use the pool package to make multiple async requests at once, yes; I didn't do this in my vignette to keep things simple, but in my actual testing of my proof of concept I was using pool (see the sketch below).
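A hedged sketch of that pattern (not the actual test code): each in-flight async query gets its own connection from a pool, with dbSendQueryAsync() again hypothetical:

```r
library(pool)

pool <- dbPool(RPostgres::Postgres(), minSize = 1, maxSize = 4)

run_async <- function(sql) {
  conn <- poolCheckout(pool)           # dedicated connection for this query
  res <- dbSendQueryAsync(conn, sql)   # hypothetical non-blocking send
  list(conn = conn, res = res)
}

finish <- function(handle) {
  out <- dbFetch(handle$res)
  dbClearResult(handle$res)
  poolReturn(handle$conn)              # hand the connection back to the pool
  out
}
```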
I don't think this is the only way to do it. Why not this: make the PqResultImpl class only perform queries asynchronously. Then the synchronous API uses the asynchronous API but blocks until the result is present (see the R sketch below). Even if that doesn't work, it definitely seems possible to combine these concerns in PqResultImpl, or to have two versions of PqResultImpl using inheritance.

This may be unrelated to asynchronicity, but I do not understand #272. It's unclear to me what the word "immediate" has to do with prepared statements, and why the user should have to specify it. If the user tries to run a query/statement with params, it should use prepared statements, and if there are no params, it shouldn't. Am I missing something?
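A minimal R-level sketch of that blocking-wrapper idea (the real PqResultImpl lives in C++; dbSendQueryAsync() and dbIsBusy() are hypothetical names used only for illustration):

```r
# The synchronous path reuses the asynchronous one and simply waits:
# the existing busy-loop, isolated in a single place.
send_query_blocking <- function(conn, sql) {
  res <- dbSendQueryAsync(conn, sql)  # hypothetical async send
  while (dbIsBusy(res)) {             # hypothetical status check
    Sys.sleep(0.01)
  }
  res
}
```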
Thanks for your detailed response. Can you point me to a Shiny app that would benefit from asynchronous I/O? From my experience, I/O is often the result of a user interaction, and little computation can happen while we're waiting for the I/O. I'm curious to see such an example.

I'm a bit wary of handling everything asynchronously. I'm open to a pull request that adds asynchronous processing as an (internal) option, exposing it as you describe in the vignette. We should use the …

Perhaps "immediate" is not the best wording. Doing everything with prepared statements would have been much simpler, but it turns out that we need to use the other "immediate" API occasionally. A future version should distinguish explicitly between these two modes of operation.
If a user requests something that requires a database query, the issue is not that they must wait for the database query; it is that everyone else using the Shiny app must also wait, because database connections cannot be passed to another process (e.g. with {future}).

Of course, there are workarounds, such as running multiple Shiny processes. But that complicates the routing of requests and still sets an arbitrary limit on the number of concurrent queries that can be running before even basic requests, such as CSS, take far longer than they should. A much more elegant solution is to not block the process in the first place.
As I noted, RPostgres already handles everything asynchronously. All I'm proposing is that this busy-loop logic be extracted into a separate class, and that an R interface be exposed to the version that does not have the busy-loop.
Makes sense. I did not know about this convention.
No, I am suggesting exactly the opposite. Use the non-prepared API (the "immediate" API) all the time, except when there are query parameters.
Again, my suggestion was the opposite. I may be missing something, but I see no reason why I, as a user of RPostgres, should have to know about different libpq APIs. From my reading of the libpq documentation, the difference between them is that the prepared API allows you to parse and plan a query once and then run it multiple times. Even if we want to use the prepared API for all queries with parameters (I don't know why that is a desirable default), there is no reason to use it for queries without parameters, as they by definition can't be run multiple times with different parameters. Put another way, to justify the …
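To make the rule being argued for here concrete, a hedged sketch; exec_simple() and exec_prepared() are made-up helper names, while PQsendQuery() and PQsendQueryParams() are the real libpq entry points they would wrap:

```r
# Use the simple ("immediate") API when there are no parameters and the
# prepared/params API otherwise, so users never have to choose.
send_query_auto <- function(conn, sql, params = NULL) {
  if (is.null(params)) {
    exec_simple(conn, sql)            # would call libpq's PQsendQuery()
  } else {
    exec_prepared(conn, sql, params)  # would call PQsendQueryParams() etc.
  }
}
```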
Thanks, the multi-session Shiny is a good example. To be most useful in Shiny, do we need/want to integrate with the {promises} package? How would we achieve that? I suspect we can already offload database queries to a different process, but the serialization overhead might be substantial. Maybe we can start with a draft implementation that changes the …

I wasn't very clear. Historically, RPostgres (and some other backend packages, including odbc) only had this one code path -- everything was routed via the "prepared" API. It turned out that this doesn't work for at least odbc and RMariaDB -- backends will need to use both the simple and the "prepared" API. If we accept this, there's no reason except compatibility to use the "prepared" API for simple statements. Changing to use "simple" by default feels like a breaking change; perhaps it's not.
We could. I wasn't sure whether to directly integrate with {promises} or not. Whether it's done by RPostgres or by the user, it's very easy; with the R API I outlined in the vignette, it's about 15 lines of code (see the sketch below).
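For concreteness, here is roughly what those 15 lines could look like -- a sketch that polls from the {later} event loop, again using the hypothetical dbSendQueryAsync()/dbIsBusy() names:

```r
library(promises)
library(later)

fetch_async <- function(conn, sql) {
  res <- dbSendQueryAsync(conn, sql)  # hypothetical non-blocking send
  promise(function(resolve, reject) {
    poll <- function() {
      if (dbIsBusy(res)) {            # hypothetical status check
        later(poll, delay = 0.1)      # re-check from the event loop
      } else {
        tryCatch({
          out <- dbFetch(res)
          dbClearResult(res)
          resolve(out)
        }, error = function(e) reject(e))
      }
    }
    poll()
  })
}

# Usage in Shiny or at the console:
# fetch_async(conn, "SELECT pg_sleep(2)") %...>% print()
```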
Interesting. I'm just going off this documentation for the {future} package; I tried a few different ways and wasn't able to successfully transfer a DB connection between processes. But in any case, it seems to me a very hacky solution to use a worker process (really more of a "slacker" process!) just to execute a wait loop.
I'm happy to take a shot at this and make a PR.
Thanks for this explanation. Based on the documentation, I don't see any reason to think it would be a breaking change, except that it would allow multiple statements/queries to be sent in a single string (which doesn't break any old functionality, only adds new functionality). But, I understand why you would be worried that it could break some existing behavior. I guess my question would be, what would make you feel good about going forward with that change? Would the DBItest tests passing be enough?
Thanks for the pointer. Can we create a promise that doesn't need polling? Either way, looking forward to reviewing your PR.

I'd say it's impossible to transfer a database connection to another process. The process that runs the future will have to instantiate its own connection. Offloading to an external process brings the advantage that we can execute multiple queries truly simultaneously. I have opened r-dbi/dbi3#47 for further discussion.

Most users won't really care about simple vs. prepared, and it's easy to implement a wrapper that always sets the immediate argument.
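Such a wrapper could be a one-liner; this sketch assumes the immediate argument ends up on dbSendQuery() as discussed in #272, and pinning it to TRUE is just an illustrative choice:

```r
# Always use one mode, regardless of the eventual default.
send_immediate <- function(conn, sql, ...) {
  DBI::dbSendQuery(conn, sql, ..., immediate = TRUE)
}
```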
To avoid polling: can we create a background task from C++ with {later}, just for the execution of the query?
Here you can find a related issue.
It will require some polling no matter what, I think. This is about concurrency, not parallelism; a promise cannot resolve until the R process is idle. So we can't handle the database's response the moment it comes in, but rather as soon as the R process finishes what it's doing. I don't see a way to do that without doing some kind of polling from the event loop of the R process. Now, it's possible that doing something with a background task is smart, for example if it's not good to call libpq's …

I haven't tried this with R's {promises}, but it's possible that …
Maybe using promises::then's …
Yes, I think we'll have to work along these lines. I think something like the following might work: …

For a draft, we can implement a new …
@krlmlr I understand what you're proposing and I think it would work. However, I don't understand why we would do it. Is it a performance optimization? That would imply that polling is expensive.

Note that something will have to be waiting for the query no matter what. In your proposal, it's the background thread. In my original promise implementation, it's the R process. The question is whether waiting for the query is a time-consuming operation that shouldn't interfere with R process work. I think it won't be.

If your concern isn't the expensiveness of polling but that it may not resolve the promise immediately, …
Thanks. To me, it seems that polling means that either the CPU is busy, or that we employ some sort of sleeping and might be seeing the update too late. We can start with polling and switch to a signal-based approach later.
@krlmlr Just to be clear about what "the CPU is busy" means: a polling loop as described above wouldn't significantly block other R code from running, since R code would run before event loop callbacks and would therefore wait for at most one check of the query's status.

I definitely think we can start with a polling approach. I see how there could be an advantage to the signal-based approach, but my guess is that in practice it will be very small and not worth the complexity. But we can test that out after doing the initial implementation.