lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 19 Dec 2023 23:37:10 +0900
From: Masami Hiramatsu (Google) <mhiramat@...nel.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Linux Trace Kernel
 <linux-trace-kernel@...r.kernel.org>, Masami Hiramatsu
 <mhiramat@...nel.org>, Mark Rutland <mark.rutland@....com>, Mathieu
 Desnoyers <mathieu.desnoyers@...icios.com>, Linus Torvalds
 <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] ring-buffer: Fix slowpath of interrupted event

On Mon, 18 Dec 2023 23:07:12 -0500
Steven Rostedt <rostedt@...dmis.org> wrote:

> From: "Steven Rostedt (Google)" <rostedt@...dmis.org>
> 
> To synchronize the timestamps with the ring buffer reservation, there are
> two timestamps that are saved in the buffer meta data.
> 
> 1. before_stamp
> 2. write_stamp
> 
> When the two are equal, the write_stamp is considered valid, as in, it may
> be used to calculate the delta of the next event as the write_stamp is the
> timestamp of the previous reserved event on the buffer.
> 
> This is done by the following:
> 
>  /*A*/	w = current position on the ring buffer
> 	before = before_stamp
> 	after = write_stamp
> 	ts = read current timestamp
> 
> 	if (before != after) {
> 		write_stamp is not valid, force adding an absolute
> 		timestamp.

(additional comment)
This happens if this caller interrupts another reservation process's B to E.
(thus the original one only updates "before_stamp", but "write_stamp" is old.)

> 	}
> 
>  /*B*/	before_stamp = ts
> 
>  /*C*/	write = local_add_return(event length, position on ring buffer)
> 
> 	if (w == write - event length) {
> 		/* Nothing interrupted between A and C */
>  /*E*/		write_stamp = ts;
> 		delta = ts - after
> 		/*
> 		 * If nothing interrupted again,
> 		 * before_stamp == write_stamp and write_stamp
> 		 * can be used to calculate the delta for
> 		 * events that come in after this one.
> 		 */

(additional comment)
If something interrupts between C and E, the write_stamp goes backward
because interrupted one updates the before_stamp and write_stamp with
new timestamp. But in this case, write_stamp(=ts) != before_stamp(=new ts).
Thus anyway the next entry will use absolute time stamp forcibly.

> 	} else {
> 
> 		/*
> 		 * The slow path!
> 		 * Was interrupted between A and C.
> 		 */
> 
> This is the place that there's a bug. We currently have:
> 
> 		after = write_stamp
> 		ts = read current timestamp
> 
>  /*F*/		if (write == current position on the ring buffer &&
> 		    after < ts && cmpxchg(write_stamp, after, ts)) {
> 
> 			delta = ts - after;
> 
> 		} else {
> 			delta = 0;
> 		}
> 
> The assumption is that if the current position on the ring buffer hasn't
> moved between C and F, then it also was not interrupted, and that the last
> event written has a timestamp that matches the write_stamp. That is the
> write_stamp is valid.
> 
> But this may not be the case:
> 
> If a task context event was interrupted by softirq between B and C.
> 
> And the softirq wrote an event that got interrupted by a hard irq between
> C and E.
> 
> and the hard irq wrote an event (does not need to be interrupted)
> 
> We have:
> 
>  /*B*/ before_stamp = ts of normal context
> 
>    ---> interrupted by softirq
> 
> 	/*B*/ before_stamp = ts of softirq context
> 
> 	  ---> interrupted by hardirq
> 
> 		/*B*/ before_stamp = ts of hard irq context
> 		/*E*/ write_stamp = ts of hard irq context
> 
> 		/* matches and write_stamp valid */
> 	  <----
> 
> 	/*E*/ write_stamp = ts of softirq context
> 
> 	/* No longer matches before_stamp, write_stamp is not valid! */
> 
>    <---
> 
>  w != write - length, go to slow path
> 
> // Right now the order of events in the ring buffer is:
> //
> // |-- softirq event --|-- hard irq event --|-- normal context event --|
> //
> 
>  after = write_stamp (this is the ts of softirq)
>  ts = read current timestamp
> 
>  if (write == current position on the ring buffer [true] &&
>      after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
> 
> 	delta = ts - after  [Wrong!]
> 
> The delta is to be between the hard irq event and the normal context
> event, but the above logic made the delta between the softirq event and
> the normal context event, where the hard irq event is between the two. This
> will shift all the remaining event timestamps on the sub-buffer
> incorrectly.
> 
> The write_stamp is only valid if it matches the before_stamp. The cmpxchg
> does nothing to help this.

That's right. Even if someone interrupts between A and C, we can write
the ts to write_stamp. In this case, write_stamp(old) != before_stamp(new)
so the next entry will forcibly record the absolute timestamp.
In this case, what we need to do here is using the absolute timestamp instead
of delta.

> 
> Instead, the following logic can be done to fix this:
> 
> 	before = before_stamp
> 	ts = read current timestamp
> 	before_stamp = ts
> 
> 	after = write_stamp
> 
> 	if (write == current position on the ring buffer &&
> 	    after == before && after < ts) {
> 
> 		delta = ts - after
> 
> 	} else {
> 		delta = 0;
> 	}

Ah, this is a good idea. Instead of using the old timestamp, use
delta = 0, thus it will reuse the interrupted timestamp if someone
interrupts between A and C.

Yeah the above works, but my question is, do we really need this
really slow path? I mean;

> 	if (w == write - event length) {
> 		/* Nothing interrupted between A and C */
>  /*E*/		write_stamp = ts;
> 		delta = ts - after

	} else {
		/*
		  Something interrupted between A and C, which should record
		  a new entry before this reserved entry with newer timestamp.
		  we reuse it.
		 */
	 	ts = after = write_stamp;
		delta = 0;
	}

Isn't this enough?

Thank you,

> 
> The above will only use the write_stamp if it still matches before_stamp
> and was tested to not have changed since C.
> 
> As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
> 
> This mean the 32-bit rb_time_t workaround can finally be removed. But
> that's for a later time.
> 
> Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
> 
> Cc: stable@...r.kernel.org
> Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp")
> Signed-off-by: Steven Rostedt (Google) <rostedt@...dmis.org>
> ---
>  kernel/trace/ring_buffer.c | 79 ++++++++++++--------------------------
>  1 file changed, 24 insertions(+), 55 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5a114e752f11..e7055f52afe0 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -700,48 +700,6 @@ rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
>  	return local_try_cmpxchg(l, &expect, set);
>  }
>  
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> -	unsigned long cnt, top, bottom, msb;
> -	unsigned long cnt2, top2, bottom2, msb2;
> -	u64 val;
> -
> -	/* Any interruptions in this function should cause a failure */
> -	cnt = local_read(&t->cnt);
> -
> -	/* The cmpxchg always fails if it interrupted an update */
> -	 if (!__rb_time_read(t, &val, &cnt2))
> -		 return false;
> -
> -	 if (val != expect)
> -		 return false;
> -
> -	 if ((cnt & 3) != cnt2)
> -		 return false;
> -
> -	 cnt2 = cnt + 1;
> -
> -	 rb_time_split(val, &top, &bottom, &msb);
> -	 msb = rb_time_val_cnt(msb, cnt);
> -	 top = rb_time_val_cnt(top, cnt);
> -	 bottom = rb_time_val_cnt(bottom, cnt);
> -
> -	 rb_time_split(set, &top2, &bottom2, &msb2);
> -	 msb2 = rb_time_val_cnt(msb2, cnt);
> -	 top2 = rb_time_val_cnt(top2, cnt2);
> -	 bottom2 = rb_time_val_cnt(bottom2, cnt2);
> -
> -	if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
> -		return false;
> -	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
> -		return false;
> -	if (!rb_time_read_cmpxchg(&t->top, top, top2))
> -		return false;
> -	if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
> -		return false;
> -	return true;
> -}
> -
>  #else /* 64 bits */
>  
>  /* local64_t always succeeds */
> @@ -755,11 +713,6 @@ static void rb_time_set(rb_time_t *t, u64 val)
>  {
>  	local64_set(&t->time, val);
>  }
> -
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> -	return local64_try_cmpxchg(&t->time, &expect, set);
> -}
>  #endif
>  
>  /*
> @@ -3610,20 +3563,36 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
>  	} else {
>  		u64 ts;
>  		/* SLOW PATH - Interrupted between A and C */
> -		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> -		/* Was interrupted before here, write_stamp must be valid */
> +
> +		/* Save the old before_stamp */
> +		a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
>  		RB_WARN_ON(cpu_buffer, !a_ok);
> +
> +		/*
> +		 * Read a new timestamp and update the before_stamp to make
> +		 * the next event after this one force using an absolute
> +		 * timestamp. This is in case an interrupt were to come in
> +		 * between E and F.
> +		 */
>  		ts = rb_time_stamp(cpu_buffer->buffer);
> +		rb_time_set(&cpu_buffer->before_stamp, ts);
> +
> +		barrier();
> + /*E*/		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> +		/* Was interrupted before here, write_stamp must be valid */
> +		RB_WARN_ON(cpu_buffer, !a_ok);
>  		barrier();
> - /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> -		    info->after < ts &&
> -		    rb_time_cmpxchg(&cpu_buffer->write_stamp,
> -				    info->after, ts)) {
> -			/* Nothing came after this event between C and E */
> + /*F*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> +		    info->after == info->before && info->after < ts) {
> +			/*
> +			 * Nothing came after this event between C and F, it is
> +			 * safe to use info->after for the delta as it
> +			 * matched info->before and is still valid.
> +			 */
>  			info->delta = ts - info->after;
>  		} else {
>  			/*
> -			 * Interrupted between C and E:
> +			 * Interrupted between C and F:
>  			 * Lost the previous events time stamp. Just set the
>  			 * delta to zero, and this will be the same time as
>  			 * the event this event interrupted. And the events that
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@...nel.org>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ