linux-kernel - Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 8 Aug 2018 10:33:01 -0700
From:   Reinette Chatre <reinette.chatre@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Dave Hansen <dave.hansen@...el.com>, tglx@...utronix.de,
        mingo@...hat.com, fenghua.yu@...el.com, tony.luck@...el.com,
        vikas.shivappa@...ux.intel.com, gavin.hindman@...el.com,
        jithu.joseph@...el.com, hpa@...or.com, x86@...nel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination
 with perf

Hi Peter and Tony,

On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>> FWIW, how long is that IRQ disabled section? It looks like something
>>> that could be taking a bit of time. We have these people that care about
>>> IRQ latency.
>>
>> We work closely with customers needing low latency as well as customers
>> needing deterministic behavior.
>>
>> This measurement is triggered by the user as a validation mechanism of
>> the pseudo-locked memory region after it has been created as part of
>> system setup as well as during runtime if there are any concerns with
>> the performance of an application that uses it.
>>
>> This measurement would thus be triggered before the sensitive workloads
>> start - during system setup, or if an issue is already present. In
>> either case the measurement is triggered by the administrator via debugfs.
> 
> That does not in fact include the answer to the question. Also, it
> assumes a competent operator (something I've found is not always true).

My apologies, I did not intend to avoid your question. The main goal of
this measurement is for an operator to test if a pseudo-locked region
has been set up correctly. It is thus targeted for system setup time,
before the IRQ sensitive workloads - some of which would use these
memory regions - start.

>>>  - I don't much fancy people accessing the guts of events like that;
>>>    would not an inline function like:
>>>
>>>    static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>>    {
>>> 	u64 val;
>>>
>>> 	lockdep_assert_irqs_disabled();
>>>
>>> 	rdpmcl(event->hw.event_base_rdpmc, val);
>>> 	return val;
>>>    }
>>>
>>>    Work for you?
>>
>> No. This does not provide accurate results. Implementing the above produces:
>> pseudo_lock_mea-366   [002] ....    34.950740: pseudo_lock_l2: hits=4096
>> miss=4
> 
> But it being an inline function should allow the compiler to optimize
> and lift the event->hw.event_base_rdpmc load like you now do manually.
> Also, like Tony already suggested, you can prime that load just fine by
> doing an extra invocation.
> 
> (and note that the above function is _much_ simpler than
> perf_event_read_local())

Unfortunately I do not find this to be the case. When I implement
x86_perf_rdpmc() _exactly_ as you suggest above and do the measurement like:

l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
l2_hits_after = x86_perf_rdpmc(l2_hit_event);
l2_miss_after = x86_perf_rdpmc(l2_miss_event);


Then the results are not accurate, neither are the consistently
inaccurate to consider a constant adjustment:

pseudo_lock_mea-409   [002] ....   194.322611: pseudo_lock_l2: hits=4100
miss=0
pseudo_lock_mea-412   [002] ....   195.520203: pseudo_lock_l2: hits=4096
miss=3
pseudo_lock_mea-415   [002] ....   196.571114: pseudo_lock_l2: hits=4097
miss=3
pseudo_lock_mea-422   [002] ....   197.629118: pseudo_lock_l2: hits=4097
miss=3
pseudo_lock_mea-425   [002] ....   198.687160: pseudo_lock_l2: hits=4096
miss=3
pseudo_lock_mea-428   [002] ....   199.744156: pseudo_lock_l2: hits=4096
miss=2
pseudo_lock_mea-431   [002] ....   200.801131: pseudo_lock_l2: hits=4097
miss=2
pseudo_lock_mea-434   [002] ....   201.858141: pseudo_lock_l2: hits=4097
miss=2
pseudo_lock_mea-437   [002] ....   202.917168: pseudo_lock_l2: hits=4096
miss=2

I was able to test Tony's theory and replacing the reading of the
"after" counts with a direct rdpmcl() improve the results. What I mean
is this:

l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
rdpmcl(l2_hit_pmcnum, l2_hits_after);
rdpmcl(l2_miss_pmcnum, l2_miss_after);

I did not run my full tests with the above but a simple read of 256KB
pseudo-locked memory gives:
pseudo_lock_mea-492   [002] ....   372.001385: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-495   [002] ....   373.059748: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-498   [002] ....   374.117027: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-501   [002] ....   375.182864: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-504   [002] ....   376.243958: pseudo_lock_l2: hits=4096
miss=0

We thus seem to be encountering the issue Tony predicted where the
memory being tested is evicting the earlier measurement code and data.

>>>  - native_read_pmc(); are you 100% sure this code only ever runs on
>>>    native and not in some dodgy virt environment?
>>
>> My understanding is that a virtual environment would be a customer of a
>> RDT allocation (cache or memory bandwidth). I do not see if/where this
>> is restricted though - I'll move to rdpmcl() but the usage of a cache
>> allocation feature like this from a virtual machine needs more
>> investigation.
> 
> I can imagine that hypervisors that allow physical partitioning could
> allow delegating the rdt crud to their guests when they 'own' a full
> socket or whatever the domain is for this.

I am using rdpmcl() now

> 
>> Will do. I created the following helper function that can be used after
>> interrupts are disabled:
>>
>> static inline int perf_event_error_state(struct perf_event *event)
>> {
>>         int ret = 0;
>>         u64 tmp;
>>
>>         ret = perf_event_read_local(event, &tmp, NULL, NULL);
>>         if (ret < 0)
>>                 return ret;
>>
>>         if (event->attr.pinned && event->oncpu != smp_processor_id())
>>                 return -EBUSY;
>>
>>         return ret;
>> }
> 
> Nah, stick the test in perf_event_read_local(), that actually needs it.

I will keep this outside for now since it does not seem as though
perf_event_read_local() works well for this use case.


>>> Also, while you disable IRQs, your fancy pants loop is still subject to
>>> NMIs that can/will perturb your measurements, how do you deal with
>>> those?
> 
>> Customers interested in this feature are familiar with dealing with them
>> (and also SMIs). The user space counterpart is able to detect such an
>> occurrence.
> 
> You're very optimistic about your customers capabilities. And this might
> be true for the current people you're talking to, but once this is
> available and public, joe monkey will have access and he _will_ screw it
> up.

I think this ventures into an area that is an existing issue for all
workloads that fall into this category, not unique to this.

> 
>> Please note that if an NMI arrives it would be handled with the
>> currently active cache capacity bitmask so none of the pseudo-locked
>> memory will be evicted since no capacity bitmask overlaps with the
>> pseudo-locked region.
> 
> So exceptions change / have their own bitmask?

My understanding is that interrupts are run with the current active
capacity bitmask. The capacity bitmasks specify which portions of cache
a cpu and/or task may allocate into. No capacity bitmask would overlap
with the pseudo-locked region and thus no cpu or task would be able to
evict the data from the pseudo-locked region. Even if later interrupts
do obtain their own class of service with its own capacity bitmasks -
these btimasks would still not be allowed to overlap with the
pseudo-locked region.

Reinette