Message-ID: <1e976d6e-b685-4adb-8bfb-6f7f910e5baf@alu.unizg.hr>
Date:   Fri, 27 Oct 2023 17:51:02 +0200
From:   Mirsad Todorovac <mirsad.todorovac@....hr>
To:     Yury Norov <yury.norov@...il.com>,
        Rasmus Villemoes <linux@...musvillemoes.dk>
Cc:     kernel test robot <oliver.sang@...el.com>, Jan Kara <jack@...e.cz>,
        oe-lkp@...ts.linux.dev, lkp@...el.com,
        linux-kernel@...r.kernel.org, ying.huang@...el.com,
        feng.tang@...el.com, fengwei.yin@...el.com,
        Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
        Mirsad Todorovac <mirsad.todorovac@....unizg.hr>,
        Matthew Wilcox <willy@...radead.org>,
        linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps

On 10/27/2023 5:51 AM, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
>> On 25/10/2023 09.18, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
>>
>> So with that, can we please just finally say "yeah, let's make the
>> generic bitmap library functions correct
> 
> They are all correct already.
> 
>> and usable in more cases"
> 
> See below.
> 
>> instead of worrying about random micro-benchmarks that just show
>> you-win-some-you-lose-some.
> 
> That I agree with. I don't worry about either a +2% or a -3% benchmark,
> and I don't think that they alone can justify or rule out such a radical
> change as making all find_bit functions volatile and shutting down a
> newborn KCSAN.
> 
> Keeping that in mind, my best guess is that Jan's and Mirsad's test
> that shows +2% was against stable bitmaps; and what robot measured
> is most likely against heavily concurrent access to some bitmap in
> the kernel.
> 
> I didn't look at either test's sources, but that at least makes some
> sense: if GCC correctly optimizes code against properly described
> memory, this is exactly what we would expect.
> 
>> Yes, users will have to treat results from the find routines carefully
>> if their bitmap may be concurrently modified. They do. Nobody wins if
>> those users are forced to implement their own bitmap routines for their
>> lockless algorithms.
> 
> Again, I agree with this point, and I'm trying to address exactly this.
> 
> I'm working on a series that introduces lockless find_bit functions
> based on existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
> 
> https://github.com/norov/linux/commits/find_and_bit
> 
> Now that we've got a test that presumably works faster if find_bit()
> functions are all switched to be volatile, it would be great if we get
> into details and understand:
>   - what find_bit function or functions gives that gain in performance;
>   - on what bitmap(s);

I am positive that your per-bit test_and_clear_bit() loop wouldn't 
improve performance. No offense, but for each bit it has to:

- LOCK the bus
- READ the (byte/word/longword/quadword) from memory
- TEST the bit and remember the result in FLAGS
- CLEAR the bit
- WRITE back the entire byte/word/longword/quadword

So, instead of a speedup, you'd end up doing 64 locked read-modify-write 
cycles in the worst case, where only the last bit in the quadword is set.
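
Just for illustration, a rough user-space sketch of that cost (my own toy
code with GCC atomic builtins, not the kernel helpers):

	#include <stdint.h>
	#include <stdio.h>

	static uint64_t bitmap = 1ULL << 63;	/* worst case: only the last bit set */

	int main(void)
	{
		/* One bus-locked read-modify-write per bit position, even
		 * for the 63 positions that are already clear. */
		for (int bit = 0; bit < 64; bit++) {
			uint64_t mask = 1ULL << bit;

			if (__atomic_fetch_and(&bitmap, ~mask,
					       __ATOMIC_RELAXED) & mask)
				printf("cleared bit %d after %d locked RMWs\n",
				       bit, bit + 1);
		}
		return 0;
	}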

What we need is an ffs() that expands to the BSF (bit scan forward)
assembly instruction

  [1] https://www.felixcloutier.com/x86/bsf

followed by a BTR (bit test and reset)

  [2] https://www.felixcloutier.com/x86/btr
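
In the same toy style, the shape I mean would read roughly like this
(__builtin_ctzll standing in for BSF and an atomic AND playing the role
of BTR; a sketch, not the kernel API):

	#include <stdint.h>
	#include <stdio.h>

	static uint64_t bitmap = 1ULL << 63;

	int main(void)
	{
		/* one plain read of the whole word */
		uint64_t snap = __atomic_load_n(&bitmap, __ATOMIC_RELAXED);

		while (snap) {
			int bit = __builtin_ctzll(snap);  /* BSF: lowest set bit */

			/* one locked RMW per *set* bit only (BTR-like) */
			__atomic_fetch_and(&bitmap, ~(1ULL << bit),
					   __ATOMIC_RELAXED);
			snap &= snap - 1;	/* drop it from the snapshot */
			printf("cleared bit %d\n", bit);
		}
		return 0;
	}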

Ideally, we'd have a single instruction that does both atomically, or a 
way to LOCK the bus across two instructions. But bad guys could use the 
latter to stall all cores indefinitely:

DON'T DO THIS!
-----------------------------
      LOCK
loop:
      BSF r16, m16/m32/m64
      BTR m16/m32/m64, r16
      JMP loop
      UNLOCK
-----------------------------

This would work better with a hardware-assisted CAM locking device than 
by stopping all cores from reading and writing on each memory barrier.

But this is a long story.

>   - is the reason in concurrent memory access (guess yes), and if so,
>   - can we refactor the code to use lockless find_and_bit() functions
>     mentioned above;
>   - if not, how else can we address this.
> 
> If you or someone else has some extra time to dig deeper into that,
> I'll be really thankful.

I don't know whether the Linux kernel takes advantage of a transactional 
memory device if it finds one.

The idea of each mutex() or even user-space futex() stalling all cores 
with an "asm LOCK CMPXCHG m8/m16/m32/m64, r8/r16/r32/r64" simply doesn't 
scale well with, say, 128 cores that have to wait for one long 
read-modify-write cycle that bypasses the cache and talks to very slow memory.
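
To make the contention explicit, here is a toy user-space sketch (C11
atomics; not the actual futex code) of what every waiter does to the same
cache line:

	#include <stdatomic.h>

	static atomic_int lock;		/* 0 = free, 1 = held */

	/* Each core that wants the lock hammers one cache line with a
	 * bus-locked compare-and-swap until it wins. */
	static void toy_lock(void)
	{
		int expected = 0;

		while (!atomic_compare_exchange_weak_explicit(&lock, &expected,
							      1,
							      memory_order_acquire,
							      memory_order_relaxed))
			expected = 0;	/* lost the race: another locked RMW */
	}

	static void toy_unlock(void)
	{
		atomic_store_explicit(&lock, 0, memory_order_release);
	}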

I see some progress with per-CPU variables.

[3] 
https://stackoverflow.com/questions/58664496/locking-on-per-cpu-variable/67997961#67997961

On a multiprocessor system, to protect per-CPU variables we can simply 
disable preemption and do local_irq_save(); this way we avoid taking a 
spinlock. A spinlock requires atomicity across multiple CPUs; with 
per-CPU variables that is not required:

	local_irq_save(flags);
	preempt_disable();
	/* ... modify the percpu variable ... */
	preempt_enable();
	local_irq_restore(flags);

That compiles roughly to (an x86-flavoured sketch):

	unsigned long flags;

	/* local_irq_save(flags): save IF, then disable interrupts */
	asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
	asm volatile("cli" : : : "memory");
	/* preempt_disable() */
	preempt_count_inc();
	__asm__ __volatile__("" : : : "memory");	/* barrier() */

	/* --- your stuff here --- */

	/* preempt_enable() */
	__asm__ __volatile__("" : : : "memory");	/* barrier() */
	preempt_count_dec();
	/* local_irq_restore(flags): re-enable IRQs only if they were on */
	if (flags & X86_EFLAGS_IF)
		asm volatile("sti" : : : "memory");

With this code, other CPUs can work in parallel; none of them spins.

If we took a spinlock instead, we'd be modifying an atomic variable, and 
any other CPU that comes along has to wait/spin for as long as the 
spinlock is held.

Best regards,
Mirsad


> Thanks,
> Yury
