linux-kernel - Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads


Message-ID: <9ba76110-ea81-2d0d-ba49-68ac1104c10e@linux.intel.com>
Date:   Wed, 12 Sep 2018 11:27:24 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>, Jiri Olsa <jolsa@...hat.com>
Cc:     Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Andi Kleen <ak@...ux.intel.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly
 parallel CPU bound workloads


Hi,

On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means 
>>> incorporating some POSIX API implementation into the tool, avoiding 
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a very good 
>> starting point.
> 
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.

Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel/user buffer allocation, needs to be taken into account by an
effective solution. Luckily, data loss hasn't been observed when testing
matrix multiplication on 96-core dual-socket machines.
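
For illustration only, the per-CPU reader idea discussed above could be
sketched roughly as below. This is not code from perf or from Jiri's patches:
the names pinned_reader and start_readers are hypothetical, the buffer
draining is a placeholder, and it assumes Linux, where os.sched_setaffinity()
with pid 0 applies to the calling thread.

```python
import os
import threading


def pinned_reader(cpu, results):
    # Pin the calling thread to a single CPU (Linux only; pid 0 = caller).
    os.sched_setaffinity(0, {cpu})
    # Placeholder for draining that CPU's buffer: a real reader would
    # consume data produced on the same CPU here. We only record the
    # affinity mask so the pinning can be checked.
    results[cpu] = os.sched_getaffinity(0)


def start_readers():
    # Use only the CPUs this process is currently allowed to run on.
    cpus = sorted(os.sched_getaffinity(0))
    results = {}
    threads = [
        threading.Thread(target=pinned_reader, args=(cpu, results))
        for cpu in cpus
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With one reader per CPU, each reader touches only the buffer written on its
own CPU, which is the locality property mentioned above.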

> 
> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.

Yep, *FENCE may be costly in HW, especially at larger scale.

> 

Thanks,
Alexey