lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGudoHEd9FL+Mhk8GdFm+3gvBk35ho-BTX-f7jn=O5Lz2mij-Q@mail.gmail.com>
Date: Sat, 20 Jul 2024 09:57:56 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: Bharata B Rao <bharata@....com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	nikunj@....com, "Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>, 
	Andrew Morton <akpm@...ux-foundation.org>, David Hildenbrand <david@...hat.com>, willy@...radead.org, 
	vbabka@...e.cz, kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system

On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@...gle.com> wrote:
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?
>

With my corporate employee hat on, I would like to note a couple of
three things.

1. there are definitely bugs here and someone(tm) should sort them out(R)

however....

2. the real goal is presumably to beat the kernel into shape where
production kernels no longer suffer lockups running this workload on
this hardware
3. the flamegraph (to be found in [1]) shows expensive debug enabled,
notably for preemption count (search for preempt_count_sub to see)
4. I'm told the lruvec problem is being worked on (but no ETA) and I
don't think the above justifies considering any hacks or otherwise
putting more pressure on it

It is plausible eliminating the aforementioned debug will be good enough.

Apart from that I note percpu_counter_add_batch (+ irq debug) accounts
for 5.8% cpu time. This will of course go down if irq tracing is
disabled, but so happens I optimized this routine to be faster
single-threaded (in particular by dodging the interrupt trip). The
patch is hanging out in the mm tree [2] and is trivially applicable
for testing.

Even if none of the debug opts can get modified, this should drop
percpu_counter_add_batch to 1.5% or so, which may or may not have a
side effect of avoiding the lockup problem.

[1]: https://lore.kernel.org/lkml/584ecb5e-b1fc-4b43-ba36-ad396d379fad@amd.com/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=51d821654be4286b005ad2b7dc8b973d5008a2ec

-- 
Mateusz Guzik <mjguzik gmail.com>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ