[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20231219072932.0d7383f2@gandalf.local.home>
Date: Tue, 19 Dec 2023 07:29:32 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Masami Hiramatsu <mhiramat@...nel.org>, Mark Rutland
<mark.rutland@....com>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Subject: [for-linus][PATCH] ring-buffer: Fix slowpath of interrupted event
While working on the ring buffer, I found one more bug with the timestamp
code, and the fix for this removed the need for the final 64-bit cmpxchg!
The ring buffer events hold a "delta" from the previous event. If it is
determined that the delta can not be calculated, it falls back to adding an
absolute timestamp value. The way to know if the delta can be used is via
two stored timestamps in the per-cpu buffer meta data:
before_stamp and write_stamp
The before_stamp is written by every event before it tries to allocate its
space on the ring buffer. The write_stamp is written after it allocates its
space and knows that nothing came in after it read the previous
before_stamp and write_stamp and the two matched.
A previous fix dd9394257078 ("ring-buffer: Do not try to put back
write_stamp") removed putting back the write_stamp to match the
before_stamp so that the next event could use the delta, but races were
found where the two would match, but not be for of the previous event.
It was determined to allow the event reservation to not have a valid
write_stamp when it is finished, and this fixed a lot of races.
The last use of the 64-bit timestamp cmpxchg depended on the write_stamp
being valid after an interruption. But this is no longer the case, as if an
event is interrupted by a softirq that writes an event, and that event gets
interrupted by a hardirq or NMI and that writes an event, then the softirq
could finish its reservation without a valid write_stamp.
In the slow path of the event reservation, a delta can still be used if the
write_stamp is valid. Instead of using a cmpxchg against the write stamp,
the before_stamp needs to be read again to validate the write_stamp. The
cmpxchg is not needed.
This updates the slowpath to validate the write_stamp by comparing it to
the before_stamp and removes all rb_time_cmpxchg() as there are no more
users of that function.
The removal of the 32-bit updates of rb_time_t will be done in the next
merge window.
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace/urgent
Head SHA1: b803d7c664d55705831729d2f2e29c874bcd62ea
Steven Rostedt (Google) (1):
ring-buffer: Fix slowpath of interrupted event
----
kernel/trace/ring_buffer.c | 79 ++++++++++++++--------------------------------
1 file changed, 24 insertions(+), 55 deletions(-)
---------------------------
commit b803d7c664d55705831729d2f2e29c874bcd62ea
Author: Steven Rostedt (Google) <rostedt@...dmis.org>
Date: Mon Dec 18 23:07:12 2023 -0500
ring-buffer: Fix slowpath of interrupted event
To synchronize the timestamps with the ring buffer reservation, there are
two timestamps that are saved in the buffer meta data.
1. before_stamp
2. write_stamp
When the two are equal, the write_stamp is considered valid, as in, it may
be used to calculate the delta of the next event as the write_stamp is the
timestamp of the previous reserved event on the buffer.
This is done by the following:
/*A*/ w = current position on the ring buffer
before = before_stamp
after = write_stamp
ts = read current timestamp
if (before != after) {
write_stamp is not valid, force adding an absolute
timestamp.
}
/*B*/ before_stamp = ts
/*C*/ write = local_add_return(event length, position on ring buffer)
if (w == write - event length) {
/* Nothing interrupted between A and C */
/*E*/ write_stamp = ts;
delta = ts - after
/*
* If nothing interrupted again,
* before_stamp == write_stamp and write_stamp
* can be used to calculate the delta for
* events that come in after this one.
*/
} else {
/*
* The slow path!
* Was interrupted between A and C.
*/
This is the place that there's a bug. We currently have:
after = write_stamp
ts = read current timestamp
/*F*/ if (write == current position on the ring buffer &&
after < ts && cmpxchg(write_stamp, after, ts)) {
delta = ts - after;
} else {
delta = 0;
}
The assumption is that if the current position on the ring buffer hasn't
moved between C and F, then it also was not interrupted, and that the last
event written has a timestamp that matches the write_stamp. That is the
write_stamp is valid.
But this may not be the case:
If a task context event was interrupted by softirq between B and C.
And the softirq wrote an event that got interrupted by a hard irq between
C and E.
and the hard irq wrote an event (does not need to be interrupted)
We have:
/*B*/ before_stamp = ts of normal context
---> interrupted by softirq
/*B*/ before_stamp = ts of softirq context
---> interrupted by hardirq
/*B*/ before_stamp = ts of hard irq context
/*E*/ write_stamp = ts of hard irq context
/* matches and write_stamp valid */
<----
/*E*/ write_stamp = ts of softirq context
/* No longer matches before_stamp, write_stamp is not valid! */
<---
w != write - length, go to slow path
// Right now the order of events in the ring buffer is:
//
// |-- softirq event --|-- hard irq event --|-- normal context event --|
//
after = write_stamp (this is the ts of softirq)
ts = read current timestamp
if (write == current position on the ring buffer [true] &&
after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
delta = ts - after [Wrong!]
The delta is to be between the hard irq event and the normal context
event, but the above logic made the delta between the softirq event and
the normal context event, where the hard irq event is between the two. This
will shift all the remaining event timestamps on the sub-buffer
incorrectly.
The write_stamp is only valid if it matches the before_stamp. The cmpxchg
does nothing to help this.
Instead, the following logic can be done to fix this:
before = before_stamp
ts = read current timestamp
before_stamp = ts
after = write_stamp
if (write == current position on the ring buffer &&
after == before && after < ts) {
delta = ts - after
} else {
delta = 0;
}
The above will only use the write_stamp if it still matches before_stamp
and was tested to not have changed since C.
As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
This means the 32-bit rb_time_t workaround can finally be removed. But
that's for a later time.
Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home
Cc: stable@...r.kernel.org
Cc: Masami Hiramatsu <mhiramat@...nel.org>
Cc: Mark Rutland <mark.rutland@....com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@...dmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5a114e752f11..83eab547f1d1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -700,48 +700,6 @@ rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
return local_try_cmpxchg(l, &expect, set);
}
-static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
-{
- unsigned long cnt, top, bottom, msb;
- unsigned long cnt2, top2, bottom2, msb2;
- u64 val;
-
- /* Any interruptions in this function should cause a failure */
- cnt = local_read(&t->cnt);
-
- /* The cmpxchg always fails if it interrupted an update */
- if (!__rb_time_read(t, &val, &cnt2))
- return false;
-
- if (val != expect)
- return false;
-
- if ((cnt & 3) != cnt2)
- return false;
-
- cnt2 = cnt + 1;
-
- rb_time_split(val, &top, &bottom, &msb);
- msb = rb_time_val_cnt(msb, cnt);
- top = rb_time_val_cnt(top, cnt);
- bottom = rb_time_val_cnt(bottom, cnt);
-
- rb_time_split(set, &top2, &bottom2, &msb2);
- msb2 = rb_time_val_cnt(msb2, cnt);
- top2 = rb_time_val_cnt(top2, cnt2);
- bottom2 = rb_time_val_cnt(bottom2, cnt2);
-
- if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
- return false;
- if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
- return false;
- if (!rb_time_read_cmpxchg(&t->top, top, top2))
- return false;
- if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
- return false;
- return true;
-}
-
#else /* 64 bits */
/* local64_t always succeeds */
@@ -755,11 +713,6 @@ static void rb_time_set(rb_time_t *t, u64 val)
{
local64_set(&t->time, val);
}
-
-static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
-{
- return local64_try_cmpxchg(&t->time, &expect, set);
-}
#endif
/*
@@ -3610,20 +3563,36 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
} else {
u64 ts;
/* SLOW PATH - Interrupted between A and C */
- a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
- /* Was interrupted before here, write_stamp must be valid */
+
+ /* Save the old before_stamp */
+ a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
RB_WARN_ON(cpu_buffer, !a_ok);
+
+ /*
+ * Read a new timestamp and update the before_stamp to make
+ * the next event after this one force using an absolute
+ * timestamp. This is in case an interrupt were to come in
+ * between E and F.
+ */
ts = rb_time_stamp(cpu_buffer->buffer);
+ rb_time_set(&cpu_buffer->before_stamp, ts);
+
+ barrier();
+ /*E*/ a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
+ /* Was interrupted before here, write_stamp must be valid */
+ RB_WARN_ON(cpu_buffer, !a_ok);
barrier();
- /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
- info->after < ts &&
- rb_time_cmpxchg(&cpu_buffer->write_stamp,
- info->after, ts)) {
- /* Nothing came after this event between C and E */
+ /*F*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
+ info->after == info->before && info->after < ts) {
+ /*
+ * Nothing came after this event between C and F, it is
+ * safe to use info->after for the delta as it
+ * matched info->before and is still valid.
+ */
info->delta = ts - info->after;
} else {
/*
- * Interrupted between C and E:
+ * Interrupted between C and F:
* Lost the previous events time stamp. Just set the
* delta to zero, and this will be the same time as
* the event this event interrupted. And the events that
Powered by blists - more mailing lists