linux-kernel - Re: [PATCH v2] The value may overflow

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d2a3456b-aee9-f5b0-f8e4-5c5030c3217f@efficios.com>
Date:   Tue, 5 Sep 2023 15:36:40 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     paulmck@...nel.org
Cc:     David Laight <David.Laight@...lab.com>,
        Denis Arefev <arefev@...mel.ru>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        "rcu@...r.kernel.org" <rcu@...r.kernel.org>,
        "lvc-project@...uxtesting.org" <lvc-project@...uxtesting.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "trufanov@...mel.ru" <trufanov@...mel.ru>,
        "vfh@...mel.ru" <vfh@...mel.ru>, ldufour@...ux.ibm.com
Subject: Re: [PATCH v2] The value may overflow

On 9/5/23 15:27, Paul E. McKenney wrote:
> On Tue, Sep 05, 2023 at 09:40:51AM -0700, Paul E. McKenney wrote:
>> On Tue, Sep 05, 2023 at 10:34:43AM -0400, Mathieu Desnoyers wrote:
>>> On 9/5/23 10:15, David Laight wrote:
>>>> ...
>>>>> That would instead be more than 512-16=496 CPUs, correct?  496 CPUs would
>>>>> only require a 31-bit shift, which should be OK, but 497 would require
>>>>> a 32-bit shift, which would result in sign extension.  If it turns out
>>>>> that sign extension is OK, then we should get in trouble at 513 CPUs,
>>>>> which would result in a 33-bit shift (and is that even defined in C?).
>>>>
>>>> Not quite right :-)
>>>>
>>>> (1 << 31) is int and negative, that gets sign extended before
>>>> being converted to 'unsigned long' - so has the top 33 bits set.
>>
>> Good point, thank you for the correction.
>>
>>>> (1 << 32) is undefined, the current x86 cpu ignore the high
>>>> shift bits so it is (1 << 0).
>>
>> Which would cause it to attempt to invoke SRCU callbacks on the
>> lowest-numbered CPU associated with that srcu_node structure.
>>
>>> Yes, I was about to reply the same thing. A shift of 31 is buggy,
>>> because shifting 1 << 31 raises the sign bit, which sets the top 33
>>> bits when cast to unsigned long. A shift of 1 << 32 is undefined,
>>> with for instance x86 choosing to ignore the top bit.
>>>
>>> But in order to have a 1 << 31 shift from this expression:
>>>
>>>                  sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
>>>
>>> I *think* we need the group to have 32 cpus or more (indexed within
>>> the group from grplo to grplo + 31 (both inclusive)).
>>>
>>> So as soon as we have one group with 32 cpus, the problem should show
>>> up. With FANOUT_LEAF=16, we can have 15 groups of 31 cpus and 1
>>> group of 32 cpus, for:
>>>
>>>    15*31 + 32 = 497 cpus.
>>>
>>> AFAIU, this would be the minimum value for smp_processor_id()+1 which
>>> triggers this issue.
>>
>> By default, there are 16 CPUs per leaf srcu_node structure.  Each leaf
>> srcu_node structure takes up one bit in the parent srcu_node structures'
>> srcu_data_have_cbs[] array.  Up to 30 bits is OK, 31 bits is questionable,
>> and 32 bits and larger erroneous.
>>
>> This is the situation as I see it (assuming dense CPU numbering):
>>
>> 	# Leaf srcu_node		Largest
>> 	structures	#CPUs		CPU #		Status
>>
>> 	0-30		0-480		479		Good
>> 	31		481-496		495		Questionable
>> 	32-		497-		496+		Bad.
>>
>> Tree SRCU differs from Tree RCU in its operation, so this bug should
>> not hold up SRCU grace periods.  It might instead cause SRCU callbacks
>> to be ignored (which would admittedly look quite similar to the user).
>>
>> My attempts to cheat the numbering system ran up against the limited
>> height of the srcu_node tree.
>>
>> But there is still the question of why this has not been seen in the
>> wild, given that there really are systems with more than 479 CPUs
>> out there.  One possibility is the call to srcu_schedule_cbs_sdp()
>> from srcu_funnel_gp_start().  But it does not seem likely that this
>> would happen all that often, as it requires back-to-back grace periods
>> and then some.
>>
>> Maybe people with large systems boot with srcutree.convert_to_big=0?
>>
>> Adding Laurent for his thoughts.
>>
>> Aha!
>>
>> Here is what makes it work, given David's description of 1<<32:
>>
>> static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp,
>> 				  unsigned long mask, unsigned long delay)
>> {
>> 	int cpu;
>>
>> 	for (cpu = snp->grplo; cpu <= snp->grphi; cpu++) {
>> 		if (!(mask & (1 << (cpu - snp->grplo))))
>> 			continue;
>> 		srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, cpu), delay);
>> 	}
>> }
>>
>> As long as at least one bit is set in the result of 1<<N, and as long
>> as the compiler always does the same thing for a given N, then this loop
>> will make the right thing happen.
>>
>> But even that is not relied upon, because the calling code looks like
>> this:
>>
>> 			spin_lock_irq_rcu_node(snp);
>> 			cbs = false;
>> 			last_lvl = snp >= sup->level[rcu_num_lvls - 1];
>> 			if (last_lvl)
>> 				cbs = ss_state < SRCU_SIZE_BIG || snp->srcu_have_cbs[idx] == gpseq;
>> 			snp->srcu_have_cbs[idx] = gpseq;
>> 			rcu_seq_set_state(&snp->srcu_have_cbs[idx], 1);
>> 			sgsne = snp->srcu_gp_seq_needed_exp;
>> 			if (srcu_invl_snp_seq(sgsne) || ULONG_CMP_LT(sgsne, gpseq))
>> 				WRITE_ONCE(snp->srcu_gp_seq_needed_exp, gpseq);
>> 			if (ss_state < SRCU_SIZE_BIG)
>> 				mask = ~0;
>> 			else
>> 				mask = snp->srcu_data_have_cbs[idx];
>> 			snp->srcu_data_have_cbs[idx] = 0;
>> 			spin_unlock_irq_rcu_node(snp);
>> 			if (cbs)
>> 				srcu_schedule_cbs_snp(ssp, snp, mask, cbdelay);
>>
>> So that last_lvl check means that the srcu_schedule_cbs_snp() function
>> is invoked only for leaf srcu_node structures.  Which by default limit
>> the shift to 16.
>>
>> So this bug appears to have no effect for default RCU setups, even on very
>> large 64-bit systems, which is consistent with field experience.  Even if
>> this is the case, it still should be fixed, to avoid confusion if nothing
>> else.  Or just in case someone decides to set CONFIG_RCU_FANOUT_LEAF=32.
>> Which actually happened the other day due to someone trusting ChatGPT's
>> opinion about RCU Kconfig options...
> 
> And I therefore queued Denis's v3 patch with an edited commit log.
> Of course, if anyone sees some other way that the bug could manifest
> other than in a 64-bit kernel built with CONFIG_RCU_FANOUT_LEAF greater
> than 30 on a system with at least 31 CPUs, please let me know so that
> I can adjust.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit ed083b0e22f1396dee3599896249a3f218845298
> Author: Denis Arefev <arefev@...mel.ru>
> Date:   Mon Sep 4 15:21:14 2023 +0300
> 
>      Fix srcu_struct node grpmask overflow on 64-bit systems
>      
>      The value of an arithmetic expression 1 << (cpu - sdp->mynode->grplo)

AFAIU, the overflow resides in the "bitwise expression" and not
the arithmetic expression.

Other than this, please add my

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>


>      is subject to overflow due to a failure to cast operands to a larger
>      data type before performing arithmetic.
>      
>      The maximum result of this subtraction is defined by the RCU_FANOUT_LEAF
>      Kconfig option, which on 64-bit systems defaults to 16 (resulting in a
>      maximum shift of 15), but which can be set up as high as 64 (resulting
>      in a maximum shift of 63).  A value of 31 can result in sign extension,
>      resulting in 0xffffffff80000000 instead of the desired 0x80000000.
>      A value of 31 or greater triggers undefined behavior per the C standard.
>      
>      This bug has not been known to cause issues because almost all kernels
>      take the default CONFIG_RCU_FANOUT_LEAF=16.  Furthermore, as long as
>      a given compiler gives a deterministic result for 1<<N for N>=32, the
>      code correctly invokes all SRCU callbacks, albeit wasting CPU time along
>      the way.
>      
>      This commit therefore substitutes the correct 1UL for the buggy 1.
>      
>      Found by Linux Verification Center (linuxtesting.org) with SVACE.
>      
>      Signed-off-by: Denis Arefev <arefev@...mel.ru>
>      Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>      Cc: David Laight <David.Laight@...lab.com>
>      Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 833a8f848a90..5602042856b1 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -223,7 +223,7 @@ static bool init_srcu_struct_nodes(struct srcu_struct *ssp, gfp_t gfp_flags)
>   				snp->grplo = cpu;
>   			snp->grphi = cpu;
>   		}
> -		sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
> +		sdp->grpmask = 1UL << (cpu - sdp->mynode->grplo);
>   	}
>   	smp_store_release(&ssp->srcu_sup->srcu_size_state, SRCU_SIZE_WAIT_BARRIER);
>   	return true;
> @@ -835,7 +835,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
>   	int cpu;
>   
>   	for (cpu = snp->grplo; cpu <= snp->grphi; cpu++) {
> -		if (!(mask & (1 << (cpu - snp->grplo))))
> +		if (!(mask & (1UL << (cpu - snp->grplo))))
>   			continue;
>   		srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, cpu), delay);
>   	}

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com