Message-ID: <CAL+tcoAgur5MNODLbGP2zRje8T22KgekwSOvxfLnKvGFupO9ag@mail.gmail.com>
Date: Tue, 13 May 2025 09:37:35 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: axboe@...nel.dk, rostedt@...dmis.org, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org,
linux-block@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>, Yushan Zhou <katrinzhou@...cent.com>
Subject: Re: [PATCH v1 1/5] relayfs: support a counter tracking if per-cpu
buffers is full
On Tue, May 13, 2025 at 8:51 AM Andrew Morton <akpm@...ux-foundation.org> wrote:
>
> On Mon, 12 May 2025 10:49:31 +0800 Jason Xing <kerneljasonxing@...il.com> wrote:
>
> > From: Jason Xing <kernelxing@...cent.com>
> >
> > Add a 'full' field to the per-cpu buffer structure to count how
> > often the buffer is full, which means either: 1) relayfs doesn't
> > accept new data in non-overwrite mode (the default), or 2) relayfs
> > starts over and overwrites old unread data when the kernel module
> > has its own subbuf_start callback to support overwrite mode. This
> > counter works for both overwrite and non-overwrite modes.
> >
> > For performance reasons, this counter has no explicit lock to
> > protect it from being modified by different threads at the same
> > time. Nor is an atomic operation used to increment it, as blktrace
> > does, because atomics have side effects when multiple threads
> > access the counter simultaneously on a machine with many cpus,
> > say, more than 200. As relay_write() and __relay_write() show, the
> > writer is expected to decide up front how to lock the whole write
> > process, so no extra lock is needed to keep the counter accurate.
> >
> > Running 'pahole --hex -C rchan_buf vmlinux' shows that this field
> > fits into a 4-byte hole in cacheline 2.
>
> Does this alter blktrace output? If so is that backward-compatible
> (and do we care). Is there any blktrace documentation which should be
> updated?
Thanks for the review.
No, it doesn't. I tested blktrace by running 'blktrace -d /dev/vda1'
and verified the dropped field after applying the entire series, so
it stays backward-compatible.
>
> Also, please check Documentation/filesystems/relay.rst and see if any
> updates should be made to reflect the changes in this patchset.
Right, will do it accordingly.
>
> I'm not really clear on the use cases of this counter - perhaps you can
> be more verbose about this in the changelog.
Will add more detail.
The existing 'blk_dropped_read' code in blktrace might give you a
hint. In production, we sometimes hit the case where the reader
cannot consume the data fast enough, which leads to data loss.
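For reference, blk_dropped_read() in kernel/trace/blktrace.c exposes
the dropped counter through a debugfs file; roughly (simplified from
the kernel source), it looks like:

	static ssize_t blk_dropped_read(struct file *filp, char __user *buffer,
					size_t count, loff_t *ppos)
	{
		struct blk_trace *bt = filp->private_data;
		char buf[16];

		/* report how many events were dropped while buffers were full */
		snprintf(buf, sizeof(buf), "%u\n", atomic_read(&bt->dropped));

		return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf));
	}

The new 'full' counter is meant to answer the same question at the
relayfs layer.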
>
> > diff --git a/include/linux/relay.h b/include/linux/relay.h
> > index f80b0eb1e905..022cf11e5a92 100644
> > --- a/include/linux/relay.h
> > +++ b/include/linux/relay.h
> > @@ -28,6 +28,14 @@
> > */
> > #define RELAYFS_CHANNEL_VERSION 7
> >
> > +/*
> > + * Relay buffer error statistics dump
> > + */
> > +struct rchan_buf_error_stats
> > +{
> > + unsigned int full; /* counter for buffer full */
> > +};
>
> Why a struct?
Because I'm going to add more error-related counters, as patch [4/5]
does. Does that make sense?
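To illustrate (the second field here is only a hypothetical example
of what a later patch could add, not the exact name used):

	struct rchan_buf_error_stats
	{
		unsigned int full;	/* counter for buffer full */
		unsigned int big;	/* hypothetical: events larger than a subbuffer */
	};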
>
> > /*
> > * Per-cpu relay channel buffer
> > */
> > @@ -43,6 +51,7 @@ struct rchan_buf
> > struct irq_work wakeup_work; /* reader wakeup */
> > struct dentry *dentry; /* channel file dentry */
> > struct kref kref; /* channel buffer refcount */
> > + struct rchan_buf_error_stats stats; /* error stats */
>
> Could simply use
>
> unsigned int full;
>
> here?
>
> Also, the name "full" implies to me "it is full". Perhaps "full_count"
> would be better.
Got it. Makes sense to me.
>
> > struct page **page_array; /* array of current buffer pages */
> > unsigned int page_count; /* number of current buffer pages */
> > unsigned int finalized; /* buffer has been finalized */
> > diff --git a/kernel/relay.c b/kernel/relay.c
> > index 5aeb9226e238..b5db4aa60da1 100644
> > --- a/kernel/relay.c
> > +++ b/kernel/relay.c
> > @@ -252,8 +252,13 @@ EXPORT_SYMBOL_GPL(relay_buf_full);
> > static int relay_subbuf_start(struct rchan_buf *buf, void *subbuf,
> > void *prev_subbuf)
> > {
> > + int buf_full = relay_buf_full(buf);
> > +
> > + if (buf_full)
> > + buf->stats.full++;
>
> I don't understand the changelog's description of this, sorry.
>
> Is it saying "this operation is protected by a lock"? If so, please
> specifically state which lock this is.
>
> Or is it saying "we don't care if this races because the counter will
> be close enough". If so then maybe so, but things like KCSAN will
> probably detect and warn and then people will want to address this.
>
Sorry for the confusion. I meant that the whole write process should
run under lock protection, which users themselves are expected to
take care of. A user who calls __relay_write() has already considered
the racy case and takes an additional lock before writing. On the
assumption that the whole write process is protected, there is no
need to add any form of lock internally (not even atomic operations)
to ensure the consistency of full_count.
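As a rough sketch (hypothetical caller code, not part of this patch),
the assumption looks like:

	/* my_lock and flags belong to the hypothetical caller */
	spin_lock_irqsave(&my_lock, flags);
	__relay_write(chan, data, length);	/* may bump buf->stats.full */
	spin_unlock_irqrestore(&my_lock, flags);

With that serialization in place, the plain increment in
relay_subbuf_start() cannot race with another writer on the same
buffer.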
I will revise it in the next re-spin.
Thanks,
Jason