Message-ID: <9ba76110-ea81-2d0d-ba49-68ac1104c10e@linux.intel.com>
Date: Wed, 12 Sep 2018 11:27:24 +0300
From: Alexey Budankov <alexey.budankov@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, Jiri Olsa <jolsa@...hat.com>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Namhyung Kim <namhyung@...nel.org>,
Andi Kleen <ak@...ux.intel.com>,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly
parallel CPU bound workloads
Hi,
On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means
>>> incorporating some POSIX API implementation into the tool, avoiding
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a very good
>> starting point.
>
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.
Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel/user buffer allocation, needs to be taken into account by an
effective solution. Luckily, no data loss has been observed when testing
matrix multiplication on 96-core dual-socket machines.
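To make the per-cpu reader idea concrete, here is a minimal sketch of how
the tool could pin one reader thread per CPU, so the buffer pages a thread
drains stay on the local node. Illustration only, not a patch;
drain_ring() is a made-up placeholder for the actual read/record path:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

void drain_ring(void *ring_base);       /* hypothetical consume path */

struct reader_arg {
        int  cpu;               /* CPU this thread is bound to */
        void *ring_base;        /* mmap'ed perf ring buffer of that CPU */
};

static void *reader_fn(void *p)
{
        struct reader_arg *arg = p;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(arg->cpu, &set);
        /* Pin the reader to the CPU that writes the buffer, so reads
         * and writes hit the same node and nothing bounces over the
         * interconnect. */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (;;)
                drain_ring(arg->ring_base);

        return NULL;
}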
>
> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.
Yep, *FENCE instructions may be costly in hardware, especially at larger scale.
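FWIW, on the reading side the mmap protocol (documented in
perf_event_open(2)) only needs acquire/release ordering, which on x86
compiles down to plain loads and stores, i.e. no MFENCE at all. A rough
sketch of the reader loop, with record_size() as a made-up helper that
steps over one record:

#include <linux/perf_event.h>
#include <stdint.h>

uint64_t record_size(void *data, uint64_t tail, uint64_t size); /* hypothetical */

static void consume(struct perf_event_mmap_page *pc, void *data, uint64_t size)
{
        uint64_t head, tail = pc->data_tail;

        /* Acquire pairs with the kernel's release store of data_head,
         * so all records up to head are visible before we read them. */
        head = __atomic_load_n(&pc->data_head, __ATOMIC_ACQUIRE);

        while (tail < head)
                tail += record_size(data, tail, size);

        /* Release makes sure our reads of the records complete before
         * the kernel sees the freed space. */
        __atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE);
}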
>
Thanks,
Alexey