Message-ID: <0a7b6c7e-c388-4327-b666-6225bc63c90c@lucifer.local>
Date: Wed, 10 Dec 2025 16:18:36 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Vlastimil Babka <vbabka@...e.cz>,
David Laight <david.laight.linux@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...nel.org>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
oliver.sang@...el.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: avoid use of BIT() macro for initialising VMA flags
On Tue, Dec 09, 2025 at 10:26:22AM +0100, Mateusz Guzik wrote:
> On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> > As Mateusz pointed out off-list, the profiles look like mutexes are doing
> > less optimistic spinning and more sleeping. Which IMHO isn't something that
> > this change can directly affect.
> >
>
> Not mutexes but rwsems.
>
> The bench at hand has some of the code spinlocked, other parts take
> rwsems for reading *or* writing.
>
> I had a peek at rwsem implementation and to my understanding it can
> degrade to no spinning in a microbenchmark setting like this one,
> provided you are unlucky enough.
So we're just sleep, sleep, sleeping? That would explain it...
I mean, is this an issue with the IPC design or with rwsems in general?
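To check I follow, here's a toy sketch of the mental model I have - quite
possibly wrong, and definitely not the real kernel/locking/rwsem.c code -
namely that optimistic spinning wants a single running owner to watch, which a
reader-owned rwsem doesn't provide, so a would-be spinner gives up and sleeps,
and later lockers queue behind it:

	/*
	 * Toy model only, not the real rwsem. owner_on_cpu() stands in for
	 * "the owner task is currently running on a CPU".
	 */
	struct toy_rwsem {
		int readers;			/* > 0 while read-locked */
		struct task_struct *writer;	/* set only while write-locked */
	};

	static bool toy_should_spin(const struct toy_rwsem *sem)
	{
		if (sem->readers > 0)
			return false;	/* reader-owned: no single owner to watch */
		if (!sem->writer)
			return true;	/* about to be free, spinning pays off */
		return owner_on_cpu(sem->writer); /* stop once owner is off-CPU */
	}

If that's roughly right, then a small perturbation in timing could indeed tip
the whole benchmark from mostly-spinning to mostly-sleeping.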
>
> In particular you can get unlucky if existing timings get perturbed,
> which I presume is happening after Lorenzo's patch.
>
> To demonstrate I wrote a toy patch which conditionally converts affected
> down_read calls into down_write (inlined at the end).
>
> While the original report is based on a 192-thread box, I was only able
> to test with 80 threads. Even so, the crux of the issue was nicely
> reproduced.
>
> ./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80
I wonder if the large core count matters here in particular; I was doing this
(albeit in a VM...) with 62 cores.
>
> Top says (times vary, idle is growing over time):
> %Cpu(s): 3.3 us, 24.4 sy, 0.0 ni, 72.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> ... but if I flip the switch to down_write:
>
> %Cpu(s): 6.3 us, 80.9 sy, 0.0 ni, 12.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> The switch is a sysctl named fs.magic_tunable (0 == down_read; 1 == down_write).
>
> In terms of performance I see the following:
>
> stress-ng: metrc: [2546] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
> # sysctl fs.magic_tunable=0 ## down_read
> stress-ng: metrc: [2546] msg 63353488 10.01 28.21 213.26 6331298.95 262362.91 30.16 2016
> # sysctl fs.magic_tunable=1 ## down_write
> stress-ng: metrc: [2036] msg 455014809 10.00 48.79 676.42 45496870.65 627425.68 90.64 2056
>
> That is to say rwsem code is the real culprit and Lorenzo is a random
> (albeit deserving) victim.
;)
Ack, interesting.
>
> I see two action items:
> - massage the patch back to a state where things compile to the same asm
> as before as it clearly avoidably regressed regardless of the
> aforementioned issue
I think, given the fix is trivially and obviously correct and not egregious to
take, we should do that for the time being.
> - figure out what to do with rwsem code for read vs write spinning
I have no idea how the code generation could have varied here. I did wonder
whether the changes somehow pessimised the optimiser for heuristic compiler
reasons, or whether something somewhere has an issue with a certain incantation
of cast (but only after a certain flavour of nested macro expansion?...)
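(Purely as illustration, not the actual hunk: my understanding is that BIT(n)
just expands to an unsigned long shift, so something like

	/* Hypothetical flag definitions, not the real ones: */
	#define TOY_VM_READ_LITERAL	0x00000001UL	/* literal form */
	#define TOY_VM_READ_BIT		BIT(0)		/* i.e. (1UL << 0) */

ought to fold to identical constants at compile time, which is what makes any
codegen difference so surprising.)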
>
> I'm not picking this up for the time being, but I might look at this at
> some point.
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index cd4a3db4659a..de1ef700d144 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
> return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
> }
>
> +unsigned long magic_tunable;
> +
> static const struct ctl_table fs_stat_sysctls[] = {
> {
> .procname = "file-nr",
> @@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
> .extra1 = SYSCTL_LONG_ZERO,
> .extra2 = SYSCTL_LONG_MAX,
> },
> + {
> + .procname = "magic_tunable",
> + .data = &magic_tunable,
> + .maxlen = sizeof(magic_tunable),
> + .mode = 0644,
> + .proc_handler = proc_doulongvec_minmax,
> + .extra1 = SYSCTL_LONG_ZERO,
> + .extra2 = SYSCTL_LONG_MAX,
> + },
> +
> {
> .procname = "nr_open",
> .data = &sysctl_nr_open,
> diff --git a/ipc/msg.c b/ipc/msg.c
> index ee6af4fe52bf..fa835ea53e09 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -474,6 +474,8 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
> return err;
> }
>
> +extern unsigned long magic_tunable;
> +
> static int msgctl_info(struct ipc_namespace *ns, int msqid,
> int cmd, struct msginfo *msginfo)
> {
> @@ -495,11 +497,19 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
> msginfo->msgmnb = ns->msg_ctlmnb;
> msginfo->msgssz = MSGSSZ;
> msginfo->msgseg = MSGSEG;
> - down_read(&msg_ids(ns).rwsem);
> - if (cmd == MSG_INFO)
> - msginfo->msgpool = msg_ids(ns).in_use;
> - max_idx = ipc_get_maxidx(&msg_ids(ns));
> - up_read(&msg_ids(ns).rwsem);
> + if (!READ_ONCE(magic_tunable)) {
> + down_read(&msg_ids(ns).rwsem);
> + if (cmd == MSG_INFO)
> + msginfo->msgpool = msg_ids(ns).in_use;
> + max_idx = ipc_get_maxidx(&msg_ids(ns));
> + up_read(&msg_ids(ns).rwsem);
> + } else {
> + down_write(&msg_ids(ns).rwsem);
> + if (cmd == MSG_INFO)
> + msginfo->msgpool = msg_ids(ns).in_use;
> + max_idx = ipc_get_maxidx(&msg_ids(ns));
> + up_write(&msg_ids(ns).rwsem);
> + }
> if (cmd == MSG_INFO) {
> msginfo->msgmap = min_t(int,
> percpu_counter_sum(&ns->percpu_msg_hdrs),
> diff --git a/ipc/util.c b/ipc/util.c
> index cae60f11d9c2..c65c8289a54b 100644
> --- a/ipc/util.c
> +++ b/ipc/util.c
> @@ -771,6 +771,7 @@ struct ipc_proc_iter {
> struct ipc_namespace *ns;
> struct pid_namespace *pid_ns;
> struct ipc_proc_iface *iface;
> + bool writelocked;
> };
>
> struct pid_namespace *ipc_seq_pid_ns(struct seq_file *s)
> @@ -828,6 +829,8 @@ static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
> return sysvipc_find_ipc(&iter->ns->ids[iface->ids], pos);
> }
>
> +extern unsigned long magic_tunable;
> +
> /*
> * File positions: pos 0 -> header, pos n -> ipc idx = n - 1.
> * SeqFile iterator: iterator value locked ipc pointer or SEQ_TOKEN_START.
> @@ -844,7 +847,13 @@ static void *sysvipc_proc_start(struct seq_file *s, loff_t *pos)
> * Take the lock - this will be released by the corresponding
> * call to stop().
> */
> - down_read(&ids->rwsem);
> + if (!READ_ONCE(magic_tunable)) {
> + down_read(&ids->rwsem);
> + iter->writelocked = false;
> + } else {
> + down_write(&ids->rwsem);
> + iter->writelocked = true;
> + }
>
> /* pos < 0 is invalid */
> if (*pos < 0)
> @@ -871,7 +880,10 @@ static void sysvipc_proc_stop(struct seq_file *s, void *it)
>
> ids = &iter->ns->ids[iface->ids];
> /* Release the lock we took in start() */
> - up_read(&ids->rwsem);
> + if (!iter->writelocked)
> + up_read(&ids->rwsem);
> + else
> + up_write(&ids->rwsem);
> }
>
> static int sysvipc_proc_show(struct seq_file *s, void *it)