Detect when the kernel multiplexed hw counters. #2421
If this happens, the counter values are incorrect. This happens when e.g. multiple rr instances run in parallel on the same machine.
Thanks!
BTW, this is NOT correct. We run many rr instances on the same machine a lot, e.g. rr tests routinely do this. AFAIK triggering this requires monitoring "too many" different hardware events in the same process using some combination of system-wide and process-specific perf events. rr only monitors one or two events, and those only in the processes recorded or replayed by rr.
I guess it is most likely that your systems have some system-wide monitoring going on that competes with rr for the PMU hardware. It's unfortunate that perf_event_open doesn't support an option that lets us indicate that we don't support multiplexing.
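For readers following along, here is a minimal sketch of one common way to notice multiplexing after the fact. It is not necessarily what this PR does. It assumes the counter was opened with `PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING` in `read_format`, in which case a non-group `read()` returns the value followed by the two times. The names `counter_reading`, `request_times`, `was_multiplexed` and `read_counter` are hypothetical helpers invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* Layout returned by read() on a non-group perf fd when read_format is
 * PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING. */
struct counter_reading {
  uint64_t value;
  uint64_t time_enabled; /* time the event was enabled */
  uint64_t time_running; /* time the event was actually on the PMU */
};

/* Hypothetical helper: ask the kernel to report both times on read(). */
void request_times(struct perf_event_attr *attr) {
  attr->read_format |=
      PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING;
}

/* Hypothetical helper: if the event ran for less time than it was enabled,
 * it was descheduled at some point, so the raw value undercounts. */
int was_multiplexed(const struct counter_reading *r) {
  return r->time_running < r->time_enabled;
}

/* Hypothetical helper: returns 0 on a full read, -1 otherwise. */
int read_counter(int fd, struct counter_reading *out) {
  assert(out != NULL);
  ssize_t n = read(fd, out, sizeof(*out));
  return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

A caller would apply `request_times` to the attr before `perf_event_open`, then check `was_multiplexed` after each `read_counter` and treat a positive result as "the value cannot be trusted".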
This actually broke things on my machine :-(
I've reverted this change, sorry. Instead I've applied 3edf7ed which sets |
BTW the error was here:
We needed to pass We could reland this with that fixed, but I think it should not be needed now that we're requesting pinned counters. |
Thanks for the pinning change. I tried it together with the detection for multiplexing, and the multiplexing is still happening :-( So I created a roll forward #2423 that fixes the problem with tmp_attr that you mentioned above.
I think it's worth looking into why multiplexing is still happening. My
understanding is that with pinned counters, only pinned per-CPU counters
would displace rr's counters, and in that case rr would error out when it
next reads the counter value. If this isn't true it's important to know why.
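A rough sketch of the pinned-counter behavior described above, under stated assumptions: per the `perf_event_open` man page, a pinned counter that cannot be scheduled enters an error state in which `read()` returns end-of-file (0 bytes). The helpers `init_pinned_attr`, `classify_read` and `read_pinned_counter`, and the `read_status` enum, are hypothetical names for illustration and are not rr's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/perf_event.h>

enum read_status { READ_OK, READ_PINNED_ERROR, READ_FAILED };

/* Hypothetical helper: fill in an attr that requests a pinned counter. */
void init_pinned_attr(struct perf_event_attr *attr, uint32_t type,
                      uint64_t config) {
  memset(attr, 0, sizeof(*attr));
  attr->size = sizeof(*attr);
  attr->type = type;
  attr->config = config;
  attr->pinned = 1; /* always on the PMU if possible, else error state */
}

/* Hypothetical helper: interpret the return value of read() on a perf fd.
 * A pinned counter that lost the PMU reports EOF, i.e. read() returns 0. */
enum read_status classify_read(ssize_t n, size_t expected) {
  if (n == 0) {
    return READ_PINNED_ERROR;
  }
  if (n < 0 || (size_t)n != expected) {
    return READ_FAILED;
  }
  return READ_OK;
}

/* Hypothetical helper: read a plain 64-bit counter value (read_format 0). */
enum read_status read_pinned_counter(int fd, uint64_t *value) {
  assert(value != NULL);
  return classify_read(read(fd, value, sizeof(*value)), sizeof(*value));
}
```

With this shape, a displaced counter surfaces as `READ_PINNED_ERROR` rather than as a silently wrong value, which is the outcome the pinning change is meant to guarantee.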
Mmh, do you have an idea how we could narrow this further down? I am thinking of creating an isolated test program for this that demonstrates whether multiplexing / the error behavior you mentioned happens... WDYT?
Yeah that sounds like a good idea. I was planning to run a similar
experiment after I reassemble my laptop.
Okay, here's my test program:
|
Testing unpinned per-CPU counters:
Then
|
But rr, using pinned counters, works as expected:
|
If I kill those unpinned counters and restart with pinned per-CPU counters:
Then
This error could be more informative, but as expected, having requested pinning, we get a read error instead of getting invalid counter values. |
These results all match my expectations on kernel 5.3.11-300.fc31.x86_64. You may wish to verify these results on whatever kernel you're using. I guess it's possible that older kernels have different behavior, or your kernel has patches making it different from upstream. Another possibility is that your system-wide counters are using |