lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f0007398-cd35-4e9e-8ee5-7541de92821f@bytedance.com>
Date:   Mon, 20 Nov 2023 20:24:37 +0800
From:   "wuqiang.matt" <wuqiang.matt@...edance.com>
To:     "Masami Hiramatsu (Google)" <mhiramat@...nel.org>
Cc:     linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
        mattwu@....com
Subject: Re: [PATCH v1] lib: objpool: fix head overrun on RK3588 SBC

On 2023/11/20 13:18, Masami Hiramatsu (Google) wrote:
> On Tue, 14 Nov 2023 19:51:48 +0800
> "wuqiang.matt" <wuqiang.matt@...edance.com> wrote:
> 
>> objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
>> following kernel warnings:
>>
>>      WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100
>>
>> This message is from objpool.c:168:
>>
>>      WARN_ON_ONCE(tail - head > pool->nr_objs);
>>
>> The overrun test case is to validate the case that pre-allocated objects
>> are insufficient: 8 objects are pre-allocated for each node and consumer
>> thread per node tries to grab 16 objects in a row. The testing system is
>> OrangePI 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When
>> disabling either all 4 big or 4 little cores, the overrun tests run well,
>> and once with big and little cores mixed together, the overrun test would
>> always cause an overrun loop. It's likely the memory timing differences
>> of big and little cores cause this trouble. Here are the debugging data
>> of objpool_try_get_slot after try_cmpxchg_release:
>>
>>      objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278
>>
>> The local copies of 'head' and 'last' were 278 and 276, and reloading of
>> 'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release
>> 'slot->head' became 'head + 1', which is correct. But what's wrong here
>> is the stale value of 'last', and that stale value of 'last' finally led
>> the overrun of 'head'.
> 
> Ah, good catch! So even if the ring size is enough, the head/tail update
> value is not updated locally, it can cause the overrun!

It's really confusing at the first glance of such an issue. I was assuming
the order between 'last' and 'head' should be implicitly maintained, but
after more digging, then found that wasn't true actually, the order should
be explicitly guaranteed by pop().

I also verified with Amlogic A311D which has 6 cores (4x A73 and 4x A53),
and got same results. I think I just need re-discover the differences of
HMP (heterogeneous multiprocessing) for big.LITTLE or P/E cores cpus.

>>
>> Memory updating of 'last' and 'head' are performed in push() and pop()
>> independently, which could be the culprit leading this out of order
>> visibility of 'last' and 'head'. So for objpool_try_get_slot(), it's
>> not enough only checking the condition of 'head != slot', the implicit
>> condition 'last - head <= nr_objs' must also be explicitly asserted to
>> guarantee 'last' is always behind 'head' before the object retrieving.
> 
> Indeed. Thanks for the investigation!
> 
>>
>> This patch will check and try reloading of 'head' and 'last' to ensure
>> 'last' is behind 'head' at the time of object retrieving. Performance
>> testings show the average impact is about 0.1% for X86_64 and 1.12% for
>> ARM64. Here are the results:
>>
>>      OS: Debian 10 X86_64, Linux 6.6rc
>>      HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
>>                        1T         2T         4T         8T        16T
>>      native:     49543304   99277826  199017659  399070324  795185848
>>      objpool:    29909085   59865637  119692073  239750369  478005250
>>      objpool+:   29879313   59230743  119609856  239067773  478509029
>>                       32T        48T        64T        96T       128T
>>      native:   1596927073 2390099988 2929397330 3183875848 3257546602
>>      objpool:   957553042 1435814086 1680872925 2043126796 2165424198
>>      objpool+:  956476281 1434491297 1666055740 2041556569 2157415622
>>
>>      OS: Debian 11 AARCH64, Linux 6.6rc
>>      HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
>>                        1T         2T         4T         8T        16T
>>      native:     30890508   60399915  123111980  242257008  494002946
>>      objpool:    14742531   28883047   57739948  115886644  232455421
>>      objpool+:   14107220   29032998   57286084  113730493  232232850
>>                       24T        32T        48T        64T        96T
>>      native:    746406039 1000174750 1493236240 1998318364 2942911180
>>      objpool:   349164852  467284332  702296756  934459713 1387898285
>>      objpool+:  348388180  462750976  696606096  927865887 1368402195
>>
> 
> OK, looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@...nel.org>
> 
> And let me pick it.
> 
>> Signed-off-by: wuqiang.matt <wuqiang.matt@...edance.com>
> 
> BTW, this is a real bugfix, so it should have Fixes tag :)
> 
> Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")
> 

Oh, right! Thanks for your kind reminder. I'll keep that in mind.

> Thank you!

Best regards.

> 
>> ---
>>   lib/objpool.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/lib/objpool.c b/lib/objpool.c
>> index ce0087f64400..cfdc02420884 100644
>> --- a/lib/objpool.c
>> +++ b/lib/objpool.c
>> @@ -201,6 +201,23 @@ static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
>>   	while (head != READ_ONCE(slot->last)) {
>>   		void *obj;
>>   
>> +		/*
>> +		 * data visibility of 'last' and 'head' could be out of
>> +		 * order since memory updating of 'last' and 'head' are
>> +		 * performed in push() and pop() independently
>> +		 *
>> +		 * before any retrieving attempts, pop() must guarantee
>> +		 * 'last' is behind 'head', that is to say, there must
>> +		 * be available objects in slot, which could be ensured
>> +		 * by condition 'last != head && last - head <= nr_objs'
>> +		 * that is equivalent to 'last - head - 1 < nr_objs' as
>> +		 * 'last' and 'head' are both unsigned int32
>> +		 */
>> +		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
>> +			head = READ_ONCE(slot->head);
>> +			continue;
>> +		}
>> +
>>   		/* obj must be retrieved before moving forward head */
>>   		obj = READ_ONCE(slot->entries[head & slot->mask]);
>>   
>> -- 
>> 2.40.1
>>
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ