Message-ID: <CALcN6mgoRcKKhPYpedE1DmZJCBC=vgU185OgYbU-TPN=Kk9teg@mail.gmail.com>
Date:   Tue, 27 Dec 2016 17:23:46 -0800
From:   David Carrillo-Cisneros <davidcc@...gle.com>
To:     Andi Kleen <andi@...stfloor.org>
Cc:     Shivappa Vikas <vikas.shivappa@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        x86 <x86@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
        "Shankar, Ravi V" <ravi.v.shankar@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        Fenghua Yu <fenghua.yu@...el.com>,
        Stephane Eranian <eranian@...gle.com>, hpa@...or.com
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

On Tue, Dec 27, 2016 at 3:10 PM, Andi Kleen <andi@...stfloor.org> wrote:
> On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
>> When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
>> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
>> ~1170ns
>>
>> Most of the time is spent in the cgroup ctx switch (~1120ns).
>>
>> When using continuous monitoring in CQM driver, the avg time to
>> find the rmid to write inside of pqr_context switch is ~16ns
>>
>> Note that this excludes the MSR write. It's only the overhead of
>> finding the RMID to write in PQR_ASSOC. Both paths call the same
>> routine to find the RMID, so there are about 1100 ns of overhead in
>> perf_cgroup_switch. By inspection I assume most of it comes from
>> iterating over the pmu list.
>
> Do Kan's pmu list patches help?
>
> https://patchwork.kernel.org/patch/9420035/

I think these are independent problems. Kan's patches aim to reduce the
overhead of multiple events in the same task context. The overhead numbers
I posted measure only _one_ event in the cpu's context.

>
>>
>> > Or is there some other overhead other than the MSR write
>> > you're concerned about?
>>
>> No, that problem is solved with the PQR software cache introduced in the series.
>
> So it's already fixed?

Sort of. With the PQR sw cache there is only one write to the MSR, and
only when either the RMID or the CLOSID actually changes.

>
> How much is the cost with your cache?

If there is no change in CLOSID or RMID, the hook and comparison take
about 60 ns. If there is a change, the write to the MSR + other overhead
is about 610 ns (dominated by the MSR write).
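
To make the two cases concrete, here is a minimal user-space sketch of the
caching idea. It is not the code from the series: struct pqr_cache,
pqr_cache_update() and fake_wrmsr_pqr_assoc() are hypothetical names, and
the MSR write is simulated with a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU PQR state; not the series' struct. */
struct pqr_cache {
	uint32_t rmid;             /* last RMID written */
	uint32_t closid;           /* last CLOSID written */
	unsigned long msr_writes;  /* number of simulated MSR writes */
};

/* Simulated MSR write. In the kernel this would be a write to PQR_ASSOC. */
static void fake_wrmsr_pqr_assoc(struct pqr_cache *c, uint32_t rmid,
				 uint32_t closid)
{
	(void)rmid;
	(void)closid;
	c->msr_writes++;
}

/*
 * Fast path: compare against the cached values and return without
 * touching the MSR. Slow path: update the cache and do one write.
 * Returns true if a write was issued.
 */
static bool pqr_cache_update(struct pqr_cache *c, uint32_t rmid,
			     uint32_t closid)
{
	assert(c != NULL);

	if (c->rmid == rmid && c->closid == closid)
		return false;

	c->rmid = rmid;
	c->closid = closid;
	fake_wrmsr_pqr_assoc(c, rmid, closid);
	return true;
}
```

In this sketch the first branch corresponds to the ~60 ns case and the
second to the ~610 ns case.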

>
>>
>>
>> > Perhaps some optimization could be done in the code to make it faster,
>> > then the new interface wouldn't be needed.
>>
>> There are some. One in my list is to create a list of pmus with at
>> least one cgroup event and use it to iterate over in
>> perf_cgroup_switch, instead of using the "pmus" list. The pmus list
>> has grown a lot recently with the addition of all the uncore pmus.
>
> Kan's patches above already do that I believe.

See previous answer.

>
>>
>> Despite this optimization, it's unlikely that the whole sched_out +
>> sched_in gets that close to the 15 ns of the non perf_event approach.
>
> It would be good to see how close we can get. I assume
> there is more potential for optimizations and fast pathing.

I will work on the optimization I described earlier that avoids iterating
over all pmus on the cgroup switch. That should remove the bulk of the
overhead, but more work will probably still be needed to get close to the
15ns overhead.
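
Roughly, the idea looks like the user-space sketch below. This is only an
illustration under my own assumptions, not a patch: struct pmu_sketch,
struct pmu_lists and all the helper names are hypothetical, and locking
and RCU are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified pmu descriptor; not the kernel's struct pmu. */
struct pmu_sketch {
	const char *name;
	int nr_cgroup_events;           /* cgroup events on this pmu */
	struct pmu_sketch *next_all;    /* link in the list of all pmus */
	struct pmu_sketch *next_cgroup; /* link in the cgroup-only list */
};

struct pmu_lists {
	struct pmu_sketch *all;    /* every registered pmu */
	struct pmu_sketch *cgroup; /* pmus with >= 1 cgroup event */
};

static void pmu_register(struct pmu_lists *l, struct pmu_sketch *p)
{
	p->nr_cgroup_events = 0;
	p->next_cgroup = NULL;
	p->next_all = l->all;
	l->all = p;
}

/* The first cgroup event on a pmu links it into the short list. */
static void pmu_add_cgroup_event(struct pmu_lists *l, struct pmu_sketch *p)
{
	if (p->nr_cgroup_events++ == 0) {
		p->next_cgroup = l->cgroup;
		l->cgroup = p;
	}
}

/* The last cgroup event going away unlinks the pmu again. */
static void pmu_del_cgroup_event(struct pmu_lists *l, struct pmu_sketch *p)
{
	struct pmu_sketch **pp = &l->cgroup;

	assert(p->nr_cgroup_events > 0);
	if (--p->nr_cgroup_events > 0)
		return;
	while (*pp && *pp != p)
		pp = &(*pp)->next_cgroup;
	if (*pp)
		*pp = p->next_cgroup;
	p->next_cgroup = NULL;
}

/* pmus visited when walking the full list on a cgroup switch. */
static int full_walk_visits(const struct pmu_lists *l)
{
	int n = 0;

	for (const struct pmu_sketch *p = l->all; p; p = p->next_all)
		n++;
	return n;
}

/* pmus visited when walking only the cgroup-only list. */
static int cgroup_walk_visits(const struct pmu_lists *l)
{
	int n = 0;

	for (const struct pmu_sketch *p = l->cgroup; p; p = p->next_cgroup)
		n++;
	return n;
}
```

With many uncore pmus registered and a single cgroup event, the switch
path in this sketch visits one entry instead of the whole list.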

Thanks,
David
