lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1449494917-62970-2-git-send-email-wangnan0@huawei.com>
Date:	Mon, 7 Dec 2015 13:28:35 +0000
From:	Wang Nan <wangnan0@...wei.com>
To:	<acme@...nel.org>, <peterz@...radead.org>
CC:	<linux-kernel@...r.kernel.org>, <paulus@...ba.org>,
	<kan.liang@...el.com>, <pi3orama@....com>,
	<adrian.hunter@...el.com>, <dsahern@...il.com>, <mingo@...nel.org>,
	<yunlong.song@...wei.com>, Wang Nan <wangnan0@...wei.com>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	"Peter Zijlstra" <a.p.zijlstra@...llo.nl>
Subject: [RFC PATCH v2 1/3] perf/core: Put size of a sample at the end of it

This is an RFC patch which is for overwrite mode ring buffer. I'd like
to discuss the correctness of this new idea for retriving as many
events as possible from overwrite mode ring buffer. If there's no
fundamental problem, I'll start perf side work.

The biggest problem for overwrite ring buffer is that it is hard to find
the start position of valid record. [1] and [2] tries to solve this
problem by introducing 'tail' and 'next_tail' into metadata page, and
update them each time the ring buffer is half full. Which adds more
instructions to event output code path, hurt performance. In addition,
even with them we still unable to recover all possible records. For
example:

             data_tail          head
                |                 |
                V                 V
 +------+-------+----------+------+---+
 |  A   |   B   |   C      |  D   |   |
 +------+-------+----------+------+---+

If a record written at head pointer and it overwrites record A:

   head      data_tail
    |           |
    V           V
 +--+---+-------+----------+------+---+
 |E |...|   B   |   C      |  D   | E |
 +--+---+-------+----------+------+---+

Record B is still valid but we can't get it through data_tail.

This patch suggests a different solution for this problem that, by
appending the length of a record at the end of it, user program is
possible to get every possible record in a backward manner, don't
need saving tail pointer.

For example:

   head
    |
    V
 +--+---+-------+----------+------+---+
 |E6|...|   B  8|   C    11|  D  7|E..|
 +--+---+-------+----------+------+---+

In this case, from the 'head' pointer provided by kernel, user program
can first see '6' by (*(head - sizeof(u16))), then it can get the start
pointer of record 'E', then it can read size and find start position
of record D, C, B in similar way.

Kernel side implementation is easy: simply adding a PERF_SAMPLE_SIZE_AT_END
for the size output.

This sloution requires user program (perf) do more things. At least
following things and limitations should be considered:

 1. Before reading such ring buffer, perf must ensure all events which
    may output to it is already stopped, so the 'head' pointer it get
    is the end of the last record.

Further improvement can be taken:

 1. We must ensure all events attached this ring buffer has
    'PERF_SAMPLE_SIZE_AT_END' selected.

 2. 8 bytes extra space is required for each record.

 3. We can find a way to append size information for tracking events.

[1] http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
[2] http://lkml.kernel.org/r/20151023151205.GW11639@twins.programming.kicks-ass.net

Signed-off-by: Wang Nan <wangnan0@...wei.com>
Cc: Adrian Hunter <adrian.hunter@...el.com>
Cc: Arnaldo Carvalho de Melo <acme@...hat.com>
Cc: David Ahern <dsahern@...il.com>
Cc: Ingo Molnar <mingo@...nel.org>
Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: Yunlong Song <yunlong.song@...wei.com>
---
 include/uapi/linux/perf_event.h | 3 ++-
 kernel/events/core.c            | 9 +++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 1afe962..e79606c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -139,8 +139,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
+	PERF_SAMPLE_SIZE_AT_END			= 1U << 19,
 
-	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c3d61b9..cfe9336 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5452,6 +5452,12 @@ void perf_output_sample(struct perf_output_handle *handle,
 		}
 	}
 
+	/* Should be the last one */
+	if (sample_type & PERF_SAMPLE_SIZE_AT_END) {
+		perf_output_skip(handle, sizeof(u64) - sizeof(header->size));
+		perf_output_put(handle, header->size);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -5571,6 +5577,9 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_SIZE_AT_END)
+		header->size += sizeof(u64);
 }
 
 void perf_event_output(struct perf_event *event,
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ