Message-ID: <5694F347.5010700@huawei.com>
Date:	Tue, 12 Jan 2016 20:36:23 +0800
From:	"Wangnan (F)" <wangnan0@...wei.com>
To:	Alexei Starovoitov <alexei.starovoitov@...il.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
CC:	<acme@...nel.org>, <linux-kernel@...r.kernel.org>,
	<pi3orama@....com>, <lizefan@...wei.com>, <netdev@...r.kernel.org>,
	<davem@...emloft.net>, Adrian Hunter <adrian.hunter@...el.com>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	David Ahern <dsahern@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Yunlong Song <yunlong.song@...wei.com>
Subject: Re: [PATCH 27/53] perf/core: Put size of a sample at the end of it
 by PERF_SAMPLE_TAILSIZE



On 2016/1/12 14:11, Alexei Starovoitov wrote:
> On Tue, Jan 12, 2016 at 01:33:28PM +0800, Wangnan (F) wrote:
>>
>> On 2016/1/12 2:09, Alexei Starovoitov wrote:
>>> On Mon, Jan 11, 2016 at 01:48:18PM +0000, Wang Nan wrote:
>>>> This patch introduces a PERF_SAMPLE_TAILSIZE flag which allows a size
>>>> field to be attached at the end of a sample. The idea comes from [1]:
>>>> with the size at the tail of an event, a user program that reads from
>>>> the ring buffer can parse events backward.
>>>>
>>>> For example:
>>>>
>>>>     head
>>>>      |
>>>>      V
>>>>   +--+---+-------+----------+------+---+
>>>>   |E6|...|   B  8|   C    11|  D  7|E..|
>>>>   +--+---+-------+----------+------+---+
>>>>
>>>> In this case, from the 'head' pointer provided by kernel, user program
>>>> can first see '6' by (*(head - sizeof(u64))), then it can get the start
>>>> pointer of record 'E', then it can read size and find start position
>>>> of record D, C, B in similar way.
>>> adding extra 8 bytes for every sample is quite unfortunate.
>>> How about another idea:
>>> . update data_tail pointer when head is about to overwrite it
>>>
>>> Ex:
>>>     head   data_tail
>>>      |       |
>>>      V       V
>>>   +--+-------+-------+---+----+---+
>>>   |E |  ...  |   B   | C |  D | E |
>>>   +--+-------+-------+---+----+---+
>>>
>>> if new sample F is about to overwrite B, the kernel would need
>>> to read the size of B from B's header and update data_tail to point C.
>>> Or even further.
>>> Compared to the TAILSIZE approach, the kernel would now be doing both
>>> reads and writes into the ring buffer, and there is a concern that reads may
>>> be hitting cold data, but if the records are small they may be
>>> actually on the same cache line brought by the previous
>>> read A's header, write E record cycle. So I think we shouldn't see
>>> cache misses.
>> After the ring buffer rewinds, we need a read before nearly
>> every write operation. The performance penalty depends on
>> the write-allocate configuration. In addition, another data
>> dependency is introduced: we must wait until the size of
>> event B is retrieved before overwriting it.
>>
>> Even in the very first attempt in 2013 in [1], reading from the ring
>> buffer was avoided. I don't think Peter has changed his mind.
>>
>>> Another concern is validity of records stored. If user space messes
>>> with ring-buffer, kernel won't be able to move data_tail properly
>>> and would need to indicate that to userspace somehow.
>>> But the memory saving of 8 bytes per record could be sizable.
>> Yes. But I have already discussed this with Peter in [2].
>> Last month I suggested:
>>
>> <quote>
>>
>>   1. If PERF_SAMPLE_SIZE is selected, we can avoid outputting the event
>>      size in the header, which eliminates the extra space cost;
>> </quote>
>>
>> However:
>>
>> <quote>
>>
>> That would mandate you always parse the stream backwards. Which seems
>> rather unfortunate. Also, no you cannot recoup the extra space, see the
>> alignment and size requirement.
> hmm, in this kernel patch I see that you're adding 8 bytes for
> every record via this extra TAILSIZE flag and in perf you're
> walking the ring buffer backwards by reading these 8-byte
> sizes, comparing header sizes and so on until reaching the beginning,
> where you start dumping it as normal.
> So for this 'signal to perf' approach to work the ring buffer
> will contain tailsizes everywhere just so that user space can
> find the beginning. That's not very pretty. imo if kernel
> can do header read to adjust data_tail it would make user
> space side clean. Maybe there are other solutions.
> Adding tailsize seems like brute force hack.
> There must be some nicer way.
Hi Peter,

  What's your opinion? Should we reconsider moving the size field from
the header to the end? Or moving the whole header to the end of a record?

Thank you.
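
For readers following the thread, here is a minimal user-space sketch of the
backward walk that the tail-size idea enables. It is not code from the patch
and does not use the real perf mmap interface. The names rec_header, ring,
RB_SIZE, rb_append and rb_walk_backward are hypothetical, and the sketch
assumes each record carries its total size twice: once in a header shaped
like struct perf_event_header and once as a trailing u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 256u            /* hypothetical ring size, power of two */
#define RB_MASK (RB_SIZE - 1u)

/* Same field layout as the kernel's struct perf_event_header. */
struct rec_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;              /* total record size, trailing u64 included */
};

struct ring {
    uint8_t  data[RB_SIZE];
    uint64_t head;              /* free-running write position */
};

/* Byte-wise copies so that records may straddle the wrap point. */
static void rb_copy_in(struct ring *rb, uint64_t pos, const void *src, size_t len)
{
    const uint8_t *s = src;
    for (size_t i = 0; i < len; i++)
        rb->data[(pos + i) & RB_MASK] = s[i];
}

static void rb_copy_out(const struct ring *rb, uint64_t pos, void *dst, size_t len)
{
    uint8_t *d = dst;
    for (size_t i = 0; i < len; i++)
        d[i] = rb->data[(pos + i) & RB_MASK];
}

/* Writer: header, payload, then the total size repeated as a trailing u64.
 * Old records are overwritten without being read first. */
static void rb_append(struct ring *rb, uint32_t type,
                      const void *payload, uint16_t payload_len)
{
    struct rec_header hdr;
    uint64_t tail_size;

    assert(payload_len % 8 == 0);   /* keep records u64-aligned */
    hdr.type = type;
    hdr.misc = 0;
    hdr.size = (uint16_t)(sizeof(hdr) + payload_len + sizeof(tail_size));
    assert(hdr.size <= RB_SIZE);
    tail_size = hdr.size;

    rb_copy_in(rb, rb->head, &hdr, sizeof(hdr));
    rb_copy_in(rb, rb->head + sizeof(hdr), payload, payload_len);
    rb_copy_in(rb, rb->head + sizeof(hdr) + payload_len,
               &tail_size, sizeof(tail_size));
    rb->head += hdr.size;
}

/* Reader: start at head and step backward one record at a time.
 * Stores record types newest-first in types[] and returns how many
 * intact records were found. */
static size_t rb_walk_backward(const struct ring *rb, uint32_t *types, size_t max)
{
    uint64_t pos = rb->head;
    uint64_t avail = rb->head < RB_SIZE ? rb->head : RB_SIZE;
    size_t n = 0;

    while (n < max && avail >= sizeof(struct rec_header) + sizeof(uint64_t)) {
        struct rec_header hdr;
        uint64_t tail_size;

        /* The u64 just before pos is the size of the record ending at pos. */
        rb_copy_out(rb, pos - sizeof(tail_size), &tail_size, sizeof(tail_size));
        if (tail_size < sizeof(hdr) + sizeof(tail_size) || tail_size > avail)
            break;              /* record start already overwritten, or garbage */

        rb_copy_out(rb, pos - tail_size, &hdr, sizeof(hdr));
        if (hdr.size != tail_size)
            break;              /* header and tail disagree: stop here */

        types[n++] = hdr.type;
        pos -= tail_size;
        avail -= tail_size;
    }
    return n;
}
```

In this sketch the writer never reads old data, which is the property the
tail size is meant to preserve. The cost is the extra 8 bytes per record
discussed above. Comparing the header size against the tail size is only a
heuristic stop condition here, not a guarantee that the record is valid.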