linux-kernel - Re: [PATCH 4/4] perf core: Add backward attribute to perf event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160329002855.GC31198@ast-mbp.thefacebook.com>
Date:	Mon, 28 Mar 2016 17:28:56 -0700
From:	Alexei Starovoitov <alexei.starovoitov@...il.com>
To:	Wang Nan <wangnan0@...wei.com>
Cc:	Alexei Starovoitov <ast@...nel.org>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel@...r.kernel.org,
	Brendan Gregg <brendan.d.gregg@...il.com>,
	He Kuang <hekuang@...wei.com>, Jiri Olsa <jolsa@...nel.org>,
	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
	Namhyung Kim <namhyung@...nel.org>, pi3orama@....com,
	Zefan Li <lizefan@...wei.com>
Subject: Re: [PATCH 4/4] perf core: Add backward attribute to perf event

On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:
> This patch introduces 'write_backward' bit to perf_event_attr, which
> controls the direction of a ring buffer. After set, the corresponding
> ring buffer is written from end to beginning. This feature is design to
> support reading from overwritable ring buffer.
> 
> Ring buffer can be created by mapping a perf event fd. Kernel puts event
> records into ring buffer, user programs like perf fetch them from
> address returned by mmap(). To prevent racing between kernel and perf,
> they communicate to each other through 'head' and 'tail' pointers.
> Kernel maintains 'head' pointer, points it to the next free area (tail
> of the last record). Perf maintains 'tail' pointer, points it to the
> tail of last consumed record (record has already been fetched). Kernel
> determines the available space in a ring buffer using these two
> pointers, prevents to overwrite unfetched records.
> 
> By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
> Different from normal ring buffer, perf is unable to maintain 'tail'
> pointer because writing is forbidden. Therefore, for this type of ring
> buffers, kernel overwrite old records unconditionally, works like flight
> recorder. This feature would be useful if reading from overwritable ring
> buffer were as easy as reading from normal ring buffer. However,
> there's an obscure problem.
> 
> The following figure demonstrates the state of an overwritable ring
> buffer which is nearly full. In this figure, the 'head' pointer points
> to the end of last record, and a long record 'E' is pending. For a
> normal ring buffer, a 'tail' pointer would have pointed to position (X),
> so kernel knows there's no more space in the ring buffer. However, for
> an overwritable ring buffer, kernel doesn't care the 'tail' pointer.
> 
>    (X)                              head
>     .                                |
>     .                                V
>     +------+-------+----------+------+---+
>     |A....A|B.....B|C........C|D....D|   |
>     +------+-------+----------+------+---+
> 
> After writing record 'E', record 'A' is overwritten.
> 
>       head
>        |
>        V
>     +--+---+-------+----------+------+---+
>     |.E|..A|B.....B|C........C|D....D|E..|
>     +--+---+-------+----------+------+---+
> 
> Now perf decides to read from this ring buffer. However, none of the
> the two natural positions, 'head' and the start of this ring buffer,
> are pointing to the head of a record. Even perf can read the full ring
> buffer, it is unable to find the position to start decoding.
> 
> The first attempt tries to solve this problem AFAIK can be found from
> [1]. It makes kernel to maintain 'tail' pointer: updates it when ring
> buffer is half full. However, this approach introduces overhead to
> fast path. Test result shows a 1% overhead [2]. In addition, this method
> utilizes no more tham 50% records.
> 
> Another attempt can be found from [3], which allow putting the size of
> an event at the end of each record. This approach allows perf to find
> records in a backword manner from 'head' pointer by reading size of a
> record from its tail. However, because of alignment requirement, it
> needs 8 bytes to record the size of a record, which is a huge waste. Its
> performance is also not good, because more data need to be written.
> This approach also introduces some extra branch instructions to fast
> path.
> 
> 'write_backward' is a better solution to this problem.
> 
> Following figure demonstrates the state of the overwritable ring buffer
> when 'write_backward' is set before overwriting:
> 
>        head
>         |
>         V
>     +---+------+----------+-------+------+
>     |   |D....D|C........C|B.....B|A....A|
>     +---+------+----------+-------+------+
> 
> and after overwriting:
>                                      head
>                                       |
>                                       V
>     +---+------+----------+-------+---+--+
>     |..E|D....D|C........C|B.....B|A..|E.|
>     +---+------+----------+-------+---+--+
> 
> In each situation, 'head' points to the beginning of the newest record.
> From this record, perf can iterate over the full ring buffer, fetching
> as mush records as possible one by one.
> 
> The only limitation needs to consider is back-to-back reading. Due to
> the non-deterministic of user program, it is impossible to ensure the
> ring buffer keeps stable during reading. Consider an extreme situation:
> perf is scheduled out after reading record 'D', then a burst of events
> come, eat up the whole ring buffer (one or multiple rounds), but 'head'
> pointer happends to be at the same position when perf comes back.
> Continue reading after 'D' is incorrect now.
> 
> To prevent this problem, we need to find a way to ensure the ring buffer
> is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
> suggested because its overhead is lower than
> ioctl(PERF_EVENT_IOC_ENABLE).
> 
> This patch utilizes event's default overflow_handler introduced
> previously. perf_event_output_backward() is created as the default
> overflow handler for backward ring buffers. To avoid extra overhead to
> fast path, original perf_event_output() becomes __perf_event_output()
> and marked '__always_inline'. In theory, there's no extra overhead
> introduced to fast path.
> 
> Performance result:
> 
> Calling 3000000 times of 'close(-1)', use gettimeofday() to check
> duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
> system calls. In ns.
> 
> Testing environment:
> 
>  CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
>  Kernel : v4.5.0
>                    MEAN         STDVAR
> BASE            800214.950    2853.083
> PRE1           2253846.700    9997.014
> PRE2           2257495.540    8516.293
> POST           2250896.100    8933.921
> 
> Where 'BASE' is pure performance without capturing. 'PRE1' is test
> result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
> patch. 'POST' is test result after this patch. See [4] for detail
> experimental setup.
> 
> Considering the stdvar, this patch doesn't introduce performance
> overhead to fast path.
> 
> [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
> [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
> [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
> 
> Signed-off-by: Wang Nan <wangnan0@...wei.com>
> Cc: He Kuang <hekuang@...wei.com>
> Cc: Alexei Starovoitov <ast@...nel.org>
> Cc: Arnaldo Carvalho de Melo <acme@...hat.com>
> Cc: Brendan Gregg <brendan.d.gregg@...il.com>
> Cc: Jiri Olsa <jolsa@...nel.org>
> Cc: Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>
> Cc: Namhyung Kim <namhyung@...nel.org>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Zefan Li <lizefan@...wei.com>
> Cc: pi3orama@....com
> ---
>  include/linux/perf_event.h      | 28 +++++++++++++++++++++---
>  include/uapi/linux/perf_event.h |  3 ++-
>  kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
>  kernel/events/ring_buffer.c     | 14 ++++++++++++
>  4 files changed, 84 insertions(+), 9 deletions(-)

Very useful feature. Looks good.
Acked-by: Alexei Starovoitov <ast@...nel.org>