linux-kernel - Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads


Message-ID: <1ad36918-ddd0-aa3c-c52e-e4e419409dd4@linux.intel.com>
Date:   Mon, 10 Sep 2018 17:48:46 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Andi Kleen <ak@...ux.intel.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly
 parallel CPU bound workloads

Hi Ingo,

On 10.09.2018 15:06, Ingo Molnar wrote:
> 
> * Alexey Budankov <alexey.budankov@...ux.intel.com> wrote:
> 
>> Hi Ingo,
>>
>> On 10.09.2018 12:18, Ingo Molnar wrote:
>>>
>>> * Alexey Budankov <alexey.budankov@...ux.intel.com> wrote:
>>>
>>>>
>>>> Currently in record mode the tool implements trace writing serially. 
>>>> The algorithm loops over mapped per-cpu data buffers and stores 
>>>> ready data chunks into a trace file using the write() system call.
>>>>
>>>> Under some circumstances the kernel may lack free space in a buffer 
>>>> because the buffer's other half has not yet been written to disk: 
>>>> the tool is busy writing another buffer's data at that moment.
>>>>
>>>> Thus the serial trace writing implementation may cause the kernel 
>>>> to lose profiling data, and that is what is observed when profiling 
>>>> highly parallel CPU bound workloads on machines with a large number 
>>>> of cores.
>>>
>>> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
>>>
>>>> The data loss metric is the ratio lost_time/elapsed_time, where 
>>>> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
>>>> records and elapsed_time is the elapsed application run time 
>>>> under profiling.
>>>>
>>>> Applying asynchronous trace streaming through the POSIX AIO API
>>>> (http://man7.org/linux/man-pages/man7/aio.7.html) 
>>>> lowers the data loss metric, providing a 2x improvement and 
>>>> reducing 98% loss to almost 0%.
>>>
>>> Hm, instead of AIO why don't we use explicit threads? I think POSIX AIO will fall back 
>>> to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
>>> events).
>>
>> Explicit threading is surely an option but having more threads 
>> in the tool that stream performance data is a considerable 
>> design complication.
>>
>> Luckily, the glibc AIO implementation is already based on pthreads, 
>> but with only one writing thread per distinct fd.
> 
> My argument is, we don't want to rely on glibc's choices here. They might
> use a different threading design in the future, or it might differ between
> libc versions.
> 
> The basic flow of tracing/profiling data is something we should control explicitly,
> via explicit threading.

It may sound too optimistic, but the glibc API is expected to stay backward 
compatible, and that includes the POSIX AIO part. The internal implementation 
also tends to evolve toward better options over time, most likely building on 
the modern kernel capabilities mentioned here: 
http://man7.org/linux/man-pages/man2/io_submit.2.html

Explicit threading in the tool for AIO means, in the simplest case, 
reimplementing part of the POSIX AIO API inside the tool rather than reusing 
existing code. That tends to be error prone and costly.

Regards,
Alexey

> 
> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf record -a', not 
> inherited workflow tracing. For system-wide profiling the ideal tracing setup is clean per-CPU 
> separation, i.e. per CPU event fds, per CPU threads that read and then write into separate 
> per-CPU files.
> 
> Thanks,
> 
> 	Ingo
>
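
For reference, the data loss metric defined in the quoted cover letter 
(lost_time/elapsed_time) can be sketched as below. This is only an 
illustration under assumptions: struct interval and data_loss_ratio() are 
hypothetical names, and how the run is split into intervals is not 
specified in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one time interval of the profiled run. */
struct interval {
    double start;     /* interval start, in seconds */
    double end;       /* interval end, in seconds */
    int has_lost;     /* nonzero if it contains a PERF_RECORD_LOST record */
};

/*
 * Sum the lengths of all intervals that contain a lost record and divide
 * by the elapsed run time. Returns 0.0 for a non-positive elapsed time.
 */
double data_loss_ratio(const struct interval *iv, size_t n,
                       double elapsed_time)
{
    double lost_time = 0.0;
    size_t i;

    assert(iv != NULL || n == 0);
    if (elapsed_time <= 0.0)
        return 0.0;

    for (i = 0; i < n; i++) {
        if (iv[i].has_lost)
            lost_time += iv[i].end - iv[i].start;
    }
    return lost_time / elapsed_time;
}
```

A ratio near 1.0 means that almost every interval of the run saw lost 
records, and a ratio near 0.0 means that almost none did.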