linux-kernel - Re: [PATCH 2/2] printk: use the lockless ringbuffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87pneiyv12.fsf@linutronix.de>
Date:   Thu, 13 Feb 2020 23:36:25 +0100
From:   John Ogness <john.ogness@...utronix.de>
To:     Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Cc:     Petr Mladek <pmladek@...e.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Andrea Parri <parri.andrea@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        kexec@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] printk: use the lockless ringbuffer

On 2020-02-13, Sergey Senozhatsky <sergey.senozhatsky.work@...il.com> wrote:
>>>> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>>>>  
>>>>  		logbuf_unlock_irq();
>>>>  		ret = wait_event_interruptible(log_wait,
>>>> -					       user->seq != log_next_seq);
>>>> +					prb_read_valid(prb, user->seq, r));
>>>>  		if (ret)
>>>>  			goto out;
>>>>  		logbuf_lock_irq();
>>>>  	}
>>>>  
>>>> -	if (user->seq < log_first_seq) {
>>>> -		/* our last seen message is gone, return error and reset */
>>>> -		user->idx = log_first_idx;
>>>> -		user->seq = log_first_seq;
>>>> +	if (user->seq < r->info->seq) {
>>>> +		/* the expected message is gone, return error and reset */
>>>> +		user->seq = r->info->seq;
>>>>  		ret = -EPIPE;
>>>>  		logbuf_unlock_irq();
>>>>  		goto out;
>>>>  	}
>>>
>>> Sorry, why doesn't this do something like
>>>
>>> 	if (user->seq < prb_first_seq(prb)) {
>>> 		/* the expected message is gone, return error and reset */
>>> 		user->seq = prb_first_seq(prb);
>>> 		ret = -EPIPE;
>>> 		...
>>> 	}
>> 
>> Here prb_read_valid() was successful, so a record _was_ read. The
>> kerneldoc for the prb_read_valid() says:
>
> Hmm, yeah. That's true.
>
> OK, something weird...
>
> I ran some random printk-pressure test (mostly printks from IRQs;
> + some NMI printk-s, but they are routed through nmi printk-safe
> buffers; + some limited number of printk-safe printk-s, routed
> via printk-safe buffer (so, once again, IRQ); + user-space
> journalctl -f syslog reader), and after the test 'cat /dev/kmsg'
> is terminally broken
>
> [..]
> cat /dev/kmsg
> cat: /dev/kmsg: Broken pipe

In mainline you can have this "problem" as well. Once the ringbuffer has
wrapped, any read to a newly opened /dev/kmsg when a new message arrived
will result in an EPIPE. This happens quite easily once the ringbuffer
has wrapped because each new message is overwriting the oldest message.

Although it can be convenient, cat(1) is actually a poor tool for
viewing the ringbuffer for this reason. Unfortunately dmesg(1) is
sub-optimal as well because it does not show the sequence numbers. So
with dmesg(1) you cannot see if a message was dropped. :-/

> So I printed seq numbers from devksmg read to a seq buffer and dumped
> it via procfs, just seq numbers before we adjust user->seq (set to
> r->seq) and after
>
> +                       offt += snprintf(BUF + offt,
> +                                       sizeof(BUF) - offt,
> +                                       "%s: devkmsg_read() error %llu %llu %llu\n",
> +                                       current->comm,
> +                                       user->seq,
> +                                       r->info->seq,
> +                                       prb_first_seq(prb));
>
>
> ...
> systemd-journal: devkmsg_read() error 1979281 1982465 1980933
> systemd-journal: corrected seq 1982465 1982465
> cat: devkmsg_read() error 1980987 1982531 1980987
> cat: corrected seq 1982531 1982531
> cat: devkmsg_read() error 1981015 1982563 1981015
> cat: corrected seq 1982563 1982563

The situation with a data-less record is the same as when the ringbuffer
wraps: cat is hitting that EPIPE. But re-opening the file descriptor is
not going to help because it will not be able to get past that data-less
record.

We could implement it such that devkmsg_read() will skip over data-less
records instead of issuing an EPIPE. (That is what dmesg does.) But then
do we need EPIPE at all? The reader can see that is has missed records
by tracking the sequence number, so could we just get rid of EPIPE? Then
cat(1) would be a great tool to view the raw ringbuffer. Please share
your thoughts on this.


On a side note (but related to data-less records): I hacked the
ringbuffer code to inject data-less records at various times in order to
verify your report. And I stumbled upon a bug in the ringbuffer, which
can lead to an infinite loop in console_unlock(). The problem occurs at:

    retry = prb_read_valid(prb, console_seq, NULL);

which will erroneously return true if console_seq is pointing to a
data-less record but there are no valid records after it. The following
patch fixes the bug. And yes, for v2 I have added comments to the
desc_read_committed() code.

I now have 2 bugfixes queued up for v2. The first one is here[0].

[0] https://lkml.kernel.org/r/87wo919grz.fsf@linutronix.de

John Ogness


diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 796257f226ee..31893051ad6b 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1074,6 +1071,7 @@ static int desc_read_committed(struct prb_desc_ring *desc_ring,
 			       unsigned long id, u64 seq,
 			       struct prb_desc *desc)
 {
+	struct prb_data_blk_lpos *blk_lpos = &desc->text_blk_lpos;
 	enum desc_state d_state;
 
 	d_state = desc_read(desc_ring, id, desc);
@@ -1084,6 +1082,11 @@ static int desc_read_committed(struct prb_desc_ring *desc_ring,
 	else if (d_state != desc_committed)
 		return -EINVAL;
 
+	if (blk_lpos->begin == INVALID_LPOS &&
+	    blk_lpos->next == INVALID_LPOS) {
+		return -ENOENT;
+	}
+
 	return 0;
 }