Message-ID: <1e976d6e-b685-4adb-8bfb-6f7f910e5baf@alu.unizg.hr>
Date: Fri, 27 Oct 2023 17:51:02 +0200
From: Mirsad Todorovac <mirsad.todorovac@....hr>
To: Yury Norov <yury.norov@...il.com>,
Rasmus Villemoes <linux@...musvillemoes.dk>
Cc: kernel test robot <oliver.sang@...el.com>, Jan Kara <jack@...e.cz>,
oe-lkp@...ts.linux.dev, lkp@...el.com,
linux-kernel@...r.kernel.org, ying.huang@...el.com,
feng.tang@...el.com, fengwei.yin@...el.com,
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
Mirsad Todorovac <mirsad.todorovac@....unizg.hr>,
Matthew Wilcox <willy@...radead.org>,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps
On 10/27/2023 5:51 AM, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
>> On 25/10/2023 09.18, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
>>
>> So with that, can we please just finally say "yeah, let's make the
>> generic bitmap library functions correct
>
> They are all correct already.
>
>> and usable in more cases"
>
> See below.
>
>> instead of worrying about random micro-benchmarks that just show
>> you-win-some-you-lose-some.
>
> That I agree with. I don't worry about either a +2% or a -3% benchmark, and
> don't think that they alone can justify (or rule out) such a radical
> change as making all find_bit functions volatile and shutting down
> a newborn KCSAN.
>
> Keeping that in mind, my best guess is that Jan's and Mirsad's test
> that shows +2% was against stable bitmaps, and what the robot measured
> is most likely against heavily concurrent access to some bitmap in
> the kernel.
>
> I didn't look at either test's sources, but that at least makes some
> sense, because if GCC correctly optimizes code against properly described
> memory, this is exactly what we would expect.
>
>> Yes, users will have to treat results from the find routines carefully
>> if their bitmap may be concurrently modified. They do. Nobody wins if
>> those users are forced to implement their own bitmap routines for their
>> lockless algorithms.
>
> Again, I agree with this point, and I'm trying to address exactly this.
>
> I'm working on a series that introduces lockless find_bit functions
> based on existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
>
> https://github.com/norov/linux/commits/find_and_bit
>
> Now that we've got a test that presumably works faster if the find_bit()
> functions are all switched to be volatile, it would be great to get
> into the details and understand:
> - which find_bit function or functions give that gain in performance;
> - on what bitmap(s);
I am positive that a per-bit test_and_clear_bit() loop wouldn't
improve performance. No offence meant, but for each bit it has to:
- LOCK the bus
- READ the (byte/word/longword/quadword) from memory
- TEST the bit and remember the result in FLAGS
- CLEAR the bit
- WRITE back the entire byte/word/longword/quadword
So, instead of speeding things up, you'd end up doing 64 locked
read-modify-write cycles in the worst case, where only the last bit in
the quadword is set.
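To illustrate the pattern I have in mind (a sketch only; the word being
drained and process() are hypothetical):
-----------------------------
#include <linux/bitops.h>       /* test_and_clear_bit(), BITS_PER_LONG */

extern void process(unsigned int bit);  /* hypothetical consumer */

/* every iteration is a LOCKed read-modify-write,
 * even for bits that are already clear */
static void drain_word_per_bit(unsigned long *word)
{
        unsigned int bit;

        for (bit = 0; bit < BITS_PER_LONG; bit++)
                if (test_and_clear_bit(bit, word))
                        process(bit);
}
-----------------------------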
What we need is an ffs() that expands to the BSF (bit scan forward)
assembly instruction
[1] https://www.felixcloutier.com/x86/bsf
followed by a BTR (bit test and reset)
[2] https://www.felixcloutier.com/x86/btr
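On a word that nobody else modifies concurrently, that pair is essentially
the following (a sketch with GCC builtins, and non-atomic; the compiler
emits BSF/TZCNT for __builtin_ctzl() and a BTR or an AND for the clear):
-----------------------------
/* non-atomic find-first-set-and-clear on a single word */
static inline int find_and_clear_first(unsigned long *word)
{
        int bit;

        if (*word == 0)
                return -1;              /* no bit set; BSF is undefined on 0 */

        bit = __builtin_ctzl(*word);    /* index of the lowest set bit */
        *word &= ~(1UL << bit);         /* clear exactly that bit */
        return bit;
}
-----------------------------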
Ideally, we'd have a single instruction that does both atomically, or a
way to LOCK the bus across two instructions. But bad guys could use the
latter to stall all cores indefinitely:
DON'T DO THIS!
-----------------------------
LOCK
loop:
BSF r16, m16/m32/m64
BTR m16/m32/m64, r16
JMP loop
UNLOCK
-----------------------------
This would work better with a hardware-assisted CAM locking device than
by stopping all cores from reading and writing on each memory barrier.
But that is a long story.
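Short of such hardware, the closest safe thing I can see in software is a
find-then-atomically-claim retry loop built on the existing primitives,
roughly like this (a sketch only; find_and_claim_bit() is a made-up name,
and I'm not claiming this is what the find_and_bit branch implements):
-----------------------------
#include <linux/bitops.h>       /* test_and_clear_bit() */
#include <linux/find.h>         /* find_next_bit() */

/* find a set bit and claim it; skip ahead if somebody cleared it first */
static unsigned long find_and_claim_bit(unsigned long *bitmap,
                                        unsigned long size)
{
        unsigned long bit;

        for (bit = find_next_bit(bitmap, size, 0);
             bit < size;
             bit = find_next_bit(bitmap, size, bit + 1))
                if (test_and_clear_bit(bit, bitmap))
                        return bit;     /* we own this bit now */

        return size;                    /* nothing left to claim */
}
-----------------------------
The retry is what keeps it correct when another CPU clears a bit between
the find and the claim.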
> - is the reason in concurrent memory access (guess yes), and if so,
> - can we refactor the code to use lockless find_and_bit() functions
> mentioned above;
> - if not, how else can we address this.
>
> If you or someone else has an extra time slot to dig deeper into
> that, I'll be really thankful.
I don't know whether the Linux kernel takes advantage of a transactional
memory device if it finds one.
The idea of each mutex() or even user-space futex() stalling all cores
on an "asm LOCK CMPXCHG m8/m16/m32/m64, r8/r16/r32/r64" simply doesn't
scale well to e.g. 128 cores that all have to wait for one long
read-modify-write cycle that bypasses the cache and talks to very slow memory.
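To make the contention concrete, every CAS-based lock more or less boils
down to each waiting core hammering the same cache line with LOCK CMPXCHG
(a toy sketch, not the real mutex/futex slow path; toy_lock/toy_unlock are
made up):
-----------------------------
#include <linux/atomic.h>
#include <asm/processor.h>      /* cpu_relax() */

/* toy spinlock: every waiter issues LOCK CMPXCHG on the same cache line */
static void toy_lock(atomic_t *lock)
{
        while (atomic_cmpxchg(lock, 0, 1) != 0)
                cpu_relax();    /* all the other cores pile up right here */
}

static void toy_unlock(atomic_t *lock)
{
        atomic_set_release(lock, 0);
}
-----------------------------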
I see some progress with per-CPU variables.
[3]
https://stackoverflow.com/questions/58664496/locking-on-per-cpu-variable/67997961#67997961
On a multiprocessor system, to protect percpu variables we can just
disable preemption and do local_irq_save(). This way we avoid taking a
spinlock. A spinlock requires atomicity across multiple CPUs; with
per-CPU variables that is not required.
local_irq_save(flags);
preempt_disable();
-- Modify the percpu variable
preempt_enable();
local_irq_restore(flags);
On x86, that compiles roughly to:
unsigned long flags;
/* local_irq_save(flags): save RFLAGS first, then disable interrupts */
asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
asm volatile("cli" : : : "memory");
preempt_count_inc();
__asm__ __volatile__("" : : : "memory");        /* barrier() */
/* --- your stuff here --- */
__asm__ __volatile__("" : : : "memory");        /* barrier() */
preempt_count_dec();
/* local_irq_restore(flags): re-enable only if they were enabled before */
if (flags & X86_EFLAGS_IF)
        asm volatile("sti" : : : "memory");
With this code, other CPUs can work in parallel; none of them spins. If
we take a spinlock instead, we modify an atomic variable, and any other
CPU that comes along has to wait/spin as long as the spinlock is held.
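For example, a statistics counter kept per CPU needs no LOCK prefix at
all; on x86, this_cpu_inc() becomes a single %gs-relative INC (a sketch;
my_counter and the helpers around it are made up for illustration):
-----------------------------
#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(unsigned long, my_counter);       /* hypothetical */

static void count_event(void)
{
        this_cpu_inc(my_counter);       /* preemption-safe, no LOCK prefix */
}

static unsigned long count_total(void)
{
        unsigned long sum = 0;
        int cpu;

        /* a reader sums the per-CPU copies; possibly stale but lock-free */
        for_each_possible_cpu(cpu)
                sum += per_cpu(my_counter, cpu);
        return sum;
}
-----------------------------
Readers may see a slightly stale total, but no CPU ever waits for another.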
Best regards,
Mirsad
> Thanks,
> Yury