Message-ID: <ZOOvOT4dL1SCHQDz@mit.edu>
Date: Mon, 21 Aug 2023 14:38:49 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: "Lu, Davina" <davinalu@...zon.com>
Cc: "Bhatnagar, Rishabh" <risbhat@...zon.com>, Jan Kara <jack@...e.cz>,
"jack@...e.com" <jack@...e.com>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Park, SeongJae" <sjpark@...zon.com>
Subject: Re: Tasks stuck jbd2 for a long time
On Mon, Aug 21, 2023 at 01:10:58AM +0000, Lu, Davina wrote:
>
> > [2] https://lore.kernel.org/r/53153bdf0cce4675b09bc2ee6483409f@amazon.com
>
> Thanks for pointing that out; I had almost forgotten I did this
> version 2. How to replicate this issue: the CPU is x86_64, 64 cores,
> 2.50 GHz, and memory is 256 GB (it is a VM, though), attached to one
> NVMe device (no LVM, DRBD, etc.) with 64000 IOPS and 16 GiB. I can
> also replicate it with a 10000 IOPS, 1000 GiB NVMe volume....
Thanks for the details.  This is something that I am interested in
potentially merging, since for a sufficiently conversion-heavy
workload (assuming the conversion is happening across multiple
inodes, and not just a huge number of random writes into a single
fallocated file), limiting the number of kernel threads to one CPU
isn't always going to be the right thing.  The reason we did it this
way was that, at the time, the only choices we had were a single
kernel thread or spawning a kernel thread for every single CPU ---
which, on a very high-core-count system, consumed a huge amount of
system resources.  This is no longer the case with the new
Concurrency Managed Workqueue (cmwq), but we never did the experiment
to make sure cmwq didn't have surprising gotchas.
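
For reference, the allocation in fs/ext4/super.c currently looks
roughly like the following (quoting from memory, so the exact flags
may differ slightly):

	/*
	 * max_active == 1 on this unbound workqueue means only one
	 * rsv_conversion work item runs at a time, so unwritten-extent
	 * conversion is effectively serialized filesystem-wide.
	 */
	EXT4_SB(sb)->rsv_conversion_wq =
		alloc_workqueue("ext4-rsv-conversion",
				WQ_MEM_RECLAIM | WQ_UNBOUND, 1);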
> > Finally, I'm a bit nervous about setting the internal __WQ_ORDERED
> > flag with max_active > 1. What was that all about, anyway?
>
> Yes, you are correct. I didn't use "__WQ_ORDERED" carefully; it is
> better not to use it with max_active > 1. My purpose was to try to
> guarantee that the work queue would execute sequentially on each core.
I won't have time to look at this before the next merge window, but
what I'm hoping to look at is your patch at [2], with two changes:
a) Drop the __WQ_ORDERED flag, since it is an internal flag.
b) Just pass in 0 for max_active instead of "num_active_cpus() > 1 ?
   num_active_cpus() : 1", for two reasons: num_active_cpus() doesn't
   take CPU hotplug into account (for example, if you have a
   dynamically adjustable VM shape where the number of active CPUs
   might change over time), and passing 0 simply lets the workqueue
   core pick its default concurrency limit.  Is there a reason why we
   need to set that limit explicitly?  (See the sketch below for what
   I have in mind.)
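
That is, something like this (untested, just to illustrate the shape
of the change):

	/*
	 * No __WQ_ORDERED, and max_active == 0 lets the workqueue
	 * core pick its default concurrency limit (WQ_DFL_ACTIVE)
	 * rather than sizing it from num_active_cpus() at mount time.
	 */
	EXT4_SB(sb)->rsv_conversion_wq =
		alloc_workqueue("ext4-rsv-conversion",
				WQ_MEM_RECLAIM | WQ_UNBOUND, 0);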
Do you see any potential problem with these changes?
Thanks,
- Ted