Message-ID: <571ad413-5fd2-496f-96f7-06ca95b1ec9a@paulmck-laptop>
Date: Thu, 13 Nov 2025 09:09:17 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Petr Mladek <pmladek@...e.com>
Cc: linux-kernel@...r.kernel.org, linux-next@...r.kernel.org,
	d-tatianin@...dex-team.ru, john.ogness@...utronix.de,
	sfr@...b.auug.org.au, rostedt@...dmis.org, senozhatsky@...omium.org
Subject: Re: [BUG -next] WARNING: kernel/printk/printk_ringbuffer.c:1278 at
 get_data+0xb3/0x100

On Thu, Nov 13, 2025 at 08:37:15AM +0100, Petr Mladek wrote:
> Hi Paul,
> 
> first, thanks a lot for reporting the regression.
> 
> On Wed 2025-11-12 16:52:16, Paul E. McKenney wrote:
> > Hello!
> > 
> > Some rcutorture runs on next-20251110 hit the following error on x86:
> > 
> > WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0xb3/0x100, CPU#0: rcu_torture_sta/63
> > 
> > This happens in about 20-25% of the rcutorture runs, and is the
> > WARN_ON_ONCE(1) in the "else" clause of get_data().  There was no
> > rcutorture scenario that failed to reproduce this bug, so I am guessing
> > that the various .config files will not provide useful information.
> > Please see the end of this email for a representative splat, which is
> > usually rcutorture printing out something or another.  (Which, in its
> > defense, has worked just fine in the past.)
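
For anyone reading along in the archives, the warning fires when a data
block's begin/next logical positions match neither of the two legal
shapes.  Below is a toy, self-contained C sketch of that three-way
check (my own paraphrase, not the kernel's actual code; RING_SIZE,
WRAPS(), and classify() are made up for illustration):

#include <stdio.h>

#define RING_SIZE 4096UL			/* hypothetical data-ring size */
#define WRAPS(lpos) ((lpos) / RING_SIZE)	/* wrap count of a logical position */

/*
 * Mirrors the shape of the branch in get_data(): a block description
 * is either a regular (non-wrapping) block, a wrapping block (begin
 * one wrap behind next), or inconsistent, in which case the real
 * function fires the WARN_ON_ONCE() quoted above.
 */
static const char *classify(unsigned long begin, unsigned long next)
{
	if (WRAPS(begin) == WRAPS(next) && begin < next)
		return "regular data block";
	if (WRAPS(begin + RING_SIZE) == WRAPS(next))
		return "wrapping data block";
	return "inconsistent: the kernel would WARN_ON_ONCE() here";
}

int main(void)
{
	printf("%s\n", classify(100, 200));			/* regular */
	printf("%s\n", classify(RING_SIZE - 64, RING_SIZE + 32));	/* wrapping */
	printf("%s\n", classify(300, 100));			/* inconsistent */
	return 0;
}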
> > 
> > Bisection converged on this commit:
> > 
> > 67e1b0052f6b ("printk_ringbuffer: don't needlessly wrap data blocks around")
> > 
> > Reverting this commit suppressed (or at least hugely reduced the
> > probability of) the WARN_ON_ONCE().
> > 
> > The SRCU-T, SRCU-U, and TREE09 scenarios hit this most frequently at
> > about double the base rate, but are CONFIG_SMP=n builds.  The RUDE01
> > scenario was the most productive CONFIG_SMP=y scenario.  Reproduce as
> > follows, where "N" is the number of CPUs on your system divided by three,
> > rounded down:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*RUDE01"
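
(Worked example: on a 48-CPU system, N = floor(48/3) = 16, so the run
would be --configs "16*RUDE01".)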
> > 
> > Or if you can do CONFIG_SMP=n, the following works, where "N" is the
> > number of CPUs on your system:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "N*SRCU-T"
> > 
> > Or please tell me what debug I should enable on my runs.
> 
> The problem was reported by two test robots last week. It happens when
> a message fits exactly up to the last byte before the ring buffer gets
> wrapped for the first time. It is interesting that you have seen
> it so frequently (in about 20-25% of rcutorture runs).
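
To spell out that corner case numerically, here is the same toy WRAPS()
arithmetic as in my sketch above (again, not the kernel's macros):

#include <stdio.h>

#define RING_SIZE 4096UL			/* hypothetical data-ring size */
#define WRAPS(lpos) ((lpos) / RING_SIZE)	/* wrap count of a logical position */

int main(void)
{
	unsigned long begin = RING_SIZE - 64;	/* block starts 64 bytes from the end */
	unsigned long next = RING_SIZE;		/* and fits exactly to the last byte */

	/*
	 * Prints "wraps: 0 vs 1".  The wrap counts differ even though
	 * no byte of the block ever wrapped, which is presumably the
	 * boundary that the fix mentioned below has to treat as valid.
	 */
	printf("wraps: %lu vs %lu\n", WRAPS(begin), WRAPS(next));
	return 0;
}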
> 
> Anyway, I pushed a fix on Monday. It is the commit
> cc3bad11de6e0d601 ("printk_ringbuffer: Fix check of
> valid data size when blk_lpos overflows"), see
> https://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git/commit/?h=for-6.19&id=cc3bad11de6e0d6012460487903e7167d3e73957

Even better!  Thank you for the fix.

> Thanks a lot for such an exhaustive report. And I am sorry that you
> probably spent a lot of time on it.

Well, actually, it was the first time that I turned "git bisect run"
loose on a full (and fully scripted) remote RCU run.  Each step involved
checking out 20 systems from the test group, building 20 kernels,
downloading the build products to each of the 20 systems, running each
of 286 guest OSes (15 each for 19 of the kernels and one instance of
the large one) spread over the 20 systems, waiting for them to finish,
uploading the test results, returning the systems to the test group,
analyzing them, and reporting either success (all runs succeeded) or
failure (at least one failure across the 286 guest OSes).  Then I grepped
through the run-results directory to get you the failure rate.  Of course,
that failure rate indicates that I could have done the bisection more
quickly and with much less hardware, but that would have required me to
stop the other things I was doing and actually think about this.

Each step took somewhere between 90 minutes and two hours on a total of
1600 CPUs, and all ~11 bisection steps completed without my intervention.
Thus far, neither the test grid, the systems, the scripting, nor git
bisect have complained about my having wasted their time, but what with
AI they probably soon will do so.

I am somewhat surprised that it all went through without something
breaking, but I guess that we all get lucky from time to time.  ;-)

So not a lot of work for me, which is a good thing, given that I had
lots of other distractions, including another much more troublesome
bisection on ARM that actually found a bug that had not yet been fixed.
A trivial bug, admittedly, but such is life!

And again, thank you for so quickly fixing this!

							Thanx, Paul
