
Keep profile and history data in shared memory #66

Open

wants to merge 6 commits into master

Conversation

maksm90
Collaborator

@maksm90 maksm90 commented Feb 1, 2023

To simplify interaction between the collector process and client backends requesting profile or history data, this patch adds shared data structures to store statistics: a fixed-size shared hash table for the profile and a fixed-size shared array implementing a ring buffer for history data.
The shared hash table for the profile has a fixed size specified by the pg_wait_sampling.max_profile_entries GUC. When the table overflows, the least used entries are evicted. The eviction algorithm is the same as in the pg_stat_kcache extension: it is based on a usage metric stored within the hash table entries.
The shared structures for profile and history are purely in-memory and are not persisted to disk, so all statistics are fully reset after a server restart. This is acceptable because, for wait monitoring, it is enough to track differential counters in the profile statistics.
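For illustration, the eviction works roughly like this (a simplified sketch, not the exact patch code: the entry fields `usage` and `key` are assumed names, and the real algorithm, as in pg_stat_kcache, evicts a batch of least-used entries rather than a single victim):

```c
/*
 * Simplified sketch of usage-based eviction (pg_stat_kcache style).
 * Caller is assumed to hold pgws_profile_lock exclusively.
 */
static void
profile_evict_least_used(void)
{
	HASH_SEQ_STATUS hash_seq;
	ProfileHashEntry *entry;
	ProfileHashEntry *victim = NULL;

	/* Scan the shared hash table and remember the least-used entry. */
	hash_seq_init(&hash_seq, pgws_profile_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
	{
		if (victim == NULL || entry->usage < victim->usage)
			victim = entry;
	}

	/* Remove the victim to make room for a new entry. */
	if (victim != NULL)
		hash_search(pgws_profile_hash, &victim->key, HASH_REMOVE, NULL);
}
```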

The current patch also makes all timing period GUCs reloadable via SIGHUP. The other GUCs affect the allocation of shared resources in one way or another, so they are only changeable via server restart.

History keeping does not look usable for regular monitoring of wait events, so in the current patch it is disabled by default by setting the pg_wait_sampling.history_period GUC to zero.

@maksm90 maksm90 requested a review from shinderuk February 1, 2023 07:47
@maksm90 maksm90 changed the title from "Keep profile and history data in shared structures" to "Keep profile and history data in shared memory" on Feb 1, 2023
@maksm90
Collaborator Author

maksm90 commented Feb 1, 2023

@rjuju could you review this PR and give some feedback?

* or without any locks as it's done in pg_stat_get_activity() function.
* These arrays are not accessible externally and require to add some
* iterator object into corresponding containing modules.
*/
if (pid == 0)
Contributor

I agree that we can follow pg_stat_get_activity and get away without any locks. The only risk is that we take a few incorrect samples and attribute wait_event_info to a wrong pid. That doesn't seem to be a problem. At least pg_stat_activity lives with it.

Concerning the pid == 0 check. I think we are safe, because the wait_event_info == 0 condition guards us from sampling PGPROC's of terminated backends. Although PGPROC.pid is not cleared upon termination, wait_event_info is (not in ProcKill, but at various other places, e.g. AbortTransaction calls pgstat_report_wait_end).
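Concretely, the lock-free loop I have in mind would be something like this (just a sketch; `record_sample` is a made-up placeholder):

```c
/* Sketch: sample all PGPROC slots without taking any locks. */
int			i;

for (i = 0; i < ProcGlobal->allProcCount; i++)
{
	PGPROC	   *proc = &ProcGlobal->allProcs[i];
	uint32		wait_event_info = proc->wait_event_info;

	/*
	 * Unused slots have pid == 0, and terminated backends have
	 * wait_event_info reset to 0, so both get skipped here.
	 */
	if (proc->pid == 0 || wait_event_info == 0)
		continue;

	record_sample(proc->pid, wait_event_info);	/* made-up placeholder */
}
```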

Should we decide to sample non-waiting states (wait_event_info == 0), we could iterate over BackendStatusArray to visit only live backends, like pg_stat_activity does via pgstat_read_current_status.

What do you think?

Collaborator Author

> I agree that we can follow pg_stat_get_activity and get away without any locks.

In pg_stat_activity we iterate over the auxiliary PROC slots without any lock. This is practical because these slots are reserved for long-lived background workers such as the bgwriter, checkpointer, walwriter, archiver and wal_receiver. Changes in this specific range of slots occur very rarely and are most commonly initiated by external forces. Although we might use the ProcStructLock spinlock to protect access when we gather wait event info, as this array is small enough (it currently occupies 5 slots).
Iteration over the other slots, intended mostly for regular backends, is protected in pg_stat_activity by ProcArrayLock, and it has to be performed over the ProcArray as an indirection level to reach the real working PROC slots.

> Concerning the pid == 0 check. I think we are safe, because the wait_event_info == 0 condition guards us from sampling PGPROC's of terminated backends. Although PGPROC.pid is not cleared upon termination, wait_event_info is (not in ProcKill, but at various other places, e.g. AbortTransaction calls pgstat_report_wait_end).

Yes, in the current case it works. But when we start to add the "no wait" state, as pointed out in #10, we will have to reconsider the criteria for deciding whether a PROC slot has been freed.

> Should we decide to sample non-waiting states (wait_event_info == 0), we could iterate over BackendStatusArray to visit only live backends, like pg_stat_activity does via pgstat_read_current_status.

Looking up PROC slots by the pid values obtained from BackendStatusArray items makes the iteration routine quadratic in time complexity. This factor is the main reason why sampling pg_stat_activity as frequently as the pg_wait_sampling extension does is not practical. IMO the most appropriate solution, which I propose in the comment, is to organize (in the future) the iteration over ProcArray and AuxiliaryProcs.

* Therefore to reduce possible contention it's worth to segregate the logic
* of PGPROCs iteration under ProcArrayLock and storing results to profile
* and/or history under corresponding another lock.
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
Contributor

Why do we take ProcArrayLock anyway? What concurrent operations does it prevent?
I don't think that blocking ProcArrayAdd and ProcArrayRemove buys us anything. We are not iterating over the procArray, but over allProcs. And we don't need the lock to read PGPROC.pid and wait_event_info.

Collaborator Author

Yes, you are right. In the current case, where we also verify slot validity via a non-zero wait_event_info value, we can omit acquiring this lock. This is mostly inherited from the previous implementation.

| pg_wait_sampling.profile_queries | bool | Whether profile should be per query | true |
| Parameter name | Data type | Description | Default value | Change policy |
| ------------------------------------ | --------- | ----------------------------------------------------------------------------------- | ------------- | ------------- |
| pg_wait_sampling.max_profile_entries | int4 | Maximum number of entries in profile hash table | 5000 | restart |
Contributor

Maybe just profile_size, like history_size?

Collaborator Author

@maksm90 maksm90 Feb 18, 2023

I mostly mimicked the max_connections name; the _size suffix, to me, is associated with something measured in bytes.

processes are left in hash table. Therefore, it's necessary to take into
account periodic flushing of profile to prevent recycling of 32-bit pid values
in profile hash table and as consequence possible increments to profile entries
belonging to some old processes with the same pid values as for current ones.
Contributor

Do you think pid recycling is such an important concern in practice to justify this warning? Wouldn't we run out of hash table entries and evict the old ones before we hit a recycled pid? I'm not sure this is frequent enough to draw the user's attention to.

Collaborator Author

Hmm, theoretically, confusion caused by pid recycling is possible. And we have to declare it frankly, I suppose.

Collaborator Author

@maksm90 maksm90 Feb 19, 2023

However, it's wise to note in the doc that if the profile hash table is not unreasonably large and no processes other than postgres are actively spawned in the system, the probability of such pid recycling goes to zero.

| pg_wait_sampling.profile_period | int4 | Period for profile sampling in milliseconds (zero value disables profile gathering) | 10 | reload |
| pg_wait_sampling.history_period | int4 | Period for history sampling in milliseconds (zero value disables history gathering) | 0 | reload |
| pg_wait_sampling.profile_pid | bool | Whether profile should be per pid | true | restart |
| pg_wait_sampling.profile_queries | bool | Whether profile should be per query | true | restart |
Contributor

Does changing profile_pid or profile_queries really require a restart?

Collaborator Author

Right now it looks redundant; this is intended for future work.
Tracking pid would determine whether we have to add a separate hash table to account for counters of currently active processes alongside another hash table for historic data that accumulates counters of finished ones, or how big a single hash table for historic and current statistics has to be when statistics from completed processes are flushed into the corresponding historic entries.
Similarly for tracking queryid: taking into account that the number of pid values and wait events is bounded, we could tune the memory required for the profile hash table more finely by knowing whether queryid tracking is on and what the expected maximum number of tracked queryid values is (or of other future fields with variable cardinality in the profile key).

In addition, it is somewhat confusing to have profile entries with null values in the pid and/or queryid fields alongside non-null ones that in fact correspond to the same entries. This requires a profile reset after the setting is applied.

typedef uint64 pgwsQueryId;
#else
typedef uint32 pgwsQueryId;
#endif
Contributor

Why not just use uint64?

Collaborator Author

This would add extra space to the profile hash key for older PG versions and, as a consequence, increase the size of the profile hash table. Also, this distinction in the type definition already existed in pg_wait_sampling - I just moved it to compat.h to avoid using #if directives in the code.
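For reference, the compat.h definition is essentially this (the excerpt above is truncated at the top; the exact version guard shown here is my assumption, based on queryId becoming 64-bit in v11):

```c
/* compat.h: Query.queryId became uint64 in PostgreSQL 11. */
#if PG_VERSION_NUM >= 110000
typedef uint64 pgwsQueryId;
#else
typedef uint32 pgwsQueryId;
#endif
```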

Contributor

Query.queryId is 64-bit since v11. I don't think we need to worry about allocating a bit more memory than strictly necessary for v10 and below. But surely we can leave this for another patch, if you like.

*vars = get_guc_variables();
*num_vars = GetNumConfigOptions();
#define PG_WAIT_EXTENSION -1
return WaitLatch(latch, wakeEvents, timeout);
Contributor

I think we can readily drop support for v9.6 and even v10. So we don't need this compatibility wrapper.
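For context, the wrapper in question is essentially a version shim like this (a sketch; the actual name in the patch may differ):

```c
/* WaitLatch gained a wait_event_info argument in PostgreSQL 10. */
static inline int
pgws_wait_latch(Latch *latch, int wakeEvents, long timeout)
{
#if PG_VERSION_NUM >= 100000
	return WaitLatch(latch, wakeEvents, timeout, PG_WAIT_EXTENSION);
#else
	return WaitLatch(latch, wakeEvents, timeout);
#endif
}
```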

Collaborator Author

@maksm90 maksm90 Feb 18, 2023

Good idea! I think this ultimately has to be your decision, though, and I'll build on it further.

HTAB *pgws_profile_hash = NULL;
LWLock *pgws_profile_lock = NULL;
History *pgws_history_ring = NULL;
LWLock *pgws_history_lock = NULL;
Contributor

Maybe we should put these variables into a single struct pgws_shared_state or something like that?

Collaborator Author

Do you mean putting all these pointers into some structure? I don't think this brings anything, even for code readability.

DefineCustomBoolVariable("pg_wait_sampling.profile_queries",
"Sets whether profile should be collected per query.", NULL,
&WhetherProfileQueryId, true,
PGC_POSTMASTER, 0, NULL, pgwsEnableQueryId, NULL);
Contributor

We should call MarkGUCPrefixReserved or EmitWarningsOnPlaceholders after we've defined all of the GUCs.
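Something like this at the end of the GUC definitions (MarkGUCPrefixReserved exists since v15; EmitWarningsOnPlaceholders covers older versions):

```c
	/* After all DefineCustom*Variable() calls in _PG_init(): */
#if PG_VERSION_NUM >= 150000
	MarkGUCPrefixReserved("pg_wait_sampling");
#else
	EmitWarningsOnPlaceholders("pg_wait_sampling");
#endif
```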

Collaborator Author

Good catch! Thanks

"pg_wait_sampling queryids",
sizeof(pgwsQueryId) * get_max_procs_count(),
&found);
MemSet(pgws_proc_queryids, 0, sizeof(pgwsQueryId) * get_max_procs_count());
Contributor

I think this MemSet should be in the if (!found) branch.
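I.e. roughly (assuming the allocation above comes from ShmemInitStruct):

```c
	pgws_proc_queryids = ShmemInitStruct("pg_wait_sampling queryids",
										 sizeof(pgwsQueryId) * get_max_procs_count(),
										 &found);
	/* Only zero the array when we created it, not when we attached to it. */
	if (!found)
		MemSet(pgws_proc_queryids, 0,
			   sizeof(pgwsQueryId) * get_max_procs_count());
```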

Collaborator Author

Agreed. Thanks

info.entrysize = sizeof(ProfileHashEntry);
pgws_profile_hash = ShmemInitHash("pg_wait_sampling hash",
MaxProfileEntries, MaxProfileEntries,
&info, HASH_ELEM | HASH_BLOBS);
Contributor

This code should be inside the if (!found) branch. Otherwise, this code will be called by every new backend in the EXEC_BACKEND configuration (e.g. on Windows).

Collaborator Author

The ShmemInitHash() function is itself idempotent: it handles the already-initialized case internally.

profile->items = (ProfileItem *) receive_array(PROFILE_REQUEST,
sizeof(ProfileItem), &profile->count);
/* Extract profile from shared memory */
profile_count = hash_get_num_entries(pgws_profile_hash);
Contributor

We should do this after locking the hash. Otherwise, we might not allocate enough memory if the hash grows before we lock it.
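I.e. something along these lines (sketch):

```c
	/* Take the lock first, then size the result array. */
	LWLockAcquire(pgws_profile_lock, LW_SHARED);
	profile_count = hash_get_num_entries(pgws_profile_hash);
	/* ... allocate profile_count items and copy the entries here ... */
	LWLockRelease(pgws_profile_lock);
```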

Collaborator Author

Yes, you are right. Thanks

DefineCustomIntVariable("pg_wait_sampling.profile_period",
"Sets period of waits profile sampling in milliseconds.",
"0 disables profiling.",
&ProfilePeriod, 10, 0, INT_MAX,
Contributor

We must pass the GUC_UNIT_MS flag, so that time units, e.g. '10ms', are accepted.
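I.e. something like this (sketch; PGC_SIGHUP per the patch description):

```c
	DefineCustomIntVariable("pg_wait_sampling.profile_period",
							"Sets period of waits profile sampling in milliseconds.",
							"0 disables profiling.",
							&ProfilePeriod,
							10, 0, INT_MAX,
							PGC_SIGHUP,
							GUC_UNIT_MS,	/* accept units such as '10ms' or '1s' */
							NULL, NULL, NULL);
```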

Collaborator Author

Thanks, good addition!

if (!profile_pid)
item.pid = 0;
/* Set up key for hashtable search */
key.pid = WhetherProfilePid ? pid : 0;
Contributor

I think we'd better put a memset here:

> Caller must ensure there are no undefined padding bits in the keys!

https://github.com/postgres/postgres/blob/master/src/backend/utils/hash/dynahash.c#L37

We don't have any padding currently, but let's be future proof.
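Something like this (sketch; the key type name and the fields other than pid are assumed):

```c
	ProfileHashKey	key;		/* assumed key struct name */

	/* Zero the whole struct so any future padding bytes are well defined. */
	MemSet(&key, 0, sizeof(key));
	key.pid = WhetherProfilePid ? pid : 0;
	key.wait_event_info = wait_event_info;
	key.queryid = WhetherProfileQueryId ? queryid : 0;
```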

Collaborator Author

Thanks, good addition! Will be added

@shinderuk
Contributor

Why do we need to maintain our own array of queryIds? Why can't we just read PgBackendStatus.st_query_id as pg_stat_activity does? It turns out st_query_id is zero during execution of a prepared statement. (exec_execute_message in postgres.c calls pgstat_report_activity(STATE_RUNNING) and it resets st_query_id.)

Here is a quick demo. In a psql session execute: select pg_sleep(30) \bind \g. (\bind uses the extended query protocol, like prepared statements do.) In another session query pg_stat_activity for the first session while it's sleeping:

wait_event_type  | Timeout
wait_event       | PgSleep
state            | active
query_id         | 
query            | select pg_sleep(30) 

Oops, query_id is blank.

Arguably it's a bug: https://www.postgresql.org/message-id/CA%2B427g8DiW3aZ6pOpVgkPbqK97ouBdf18VLiHFesea2jUk3XoQ%40mail.gmail.com

@maksm90
Collaborator Author

maksm90 commented May 26, 2024

> Why do we need to maintain our own array of queryIds? Why can't we just read PgBackendStatus.st_query_id as pg_stat_activity does?

The answer is that PgBackendStatus.st_query_id tracks the queryId of only the top-level statement. More discussion about the current design is in #42 (comment), and related issues are in #43.
