Message-ID: <20211018212045.GA24360@quack2.suse.cz>
Date:   Mon, 18 Oct 2021 23:20:45 +0200
From:   Jan Kara <jack@...e.cz>
To:     Theodore Ts'o <tytso@....edu>
Cc:     xueqingwen <xueqingwen@...du.com>, adilger.kernel@...ger.ca,
        linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
        zhaojie <zhaojie17@...du.com>, jimyan <jimyan@...du.com>
Subject: Re: [PATCH] ext4: start the handle later in ext4_writepages() to
 avoid unnecessary wait

On Wed 13-10-21 22:31:37, Theodore Ts'o wrote:
> On Thu, Sep 23, 2021 at 08:12:04PM +0800, xueqingwen wrote:
> >   ....
> > Therefore, the handle was delayed to start until finding the pages that
> > need mapping in ext4_writepages(). With this patch, the above problem did
> > not recur. We had looked this patch over pretty carefully, but another pair
> > of eyes would be appreciated. Please help to review whether there are
> > defects and whether it can be merged to upstream.
> 
> Hi,
> 
> I've tried tests against this patch, and it's causing a large number
> of hangs.  For most of the hangs, it's while running generic/269,
> although there were a few other tests which would cause the kernel to
> hang.
> 
> I don't have time to try to figure out why your patch might be
> failing, at least not this week.  So if you could take a look at
> the test artifacts in this xz compressed tarfile, I'd appreciate it.
> The "report" file contains a summary report, and the *.serial files
> contain the output from the serial console of the VM's which were
> hanging with your patch applied.  Perhaps you can determine what needs
> to be fixed to prevent the kernel hangs?

Well, I guess the problem is that the proper lock ordering is transaction
start -> page lock, and this patch inverts it, which creates all sorts of
deadlock possibilities. Lockdep will not catch this problem because the page
lock is not tracked by it.
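
To make the inversion concrete, here is a minimal userspace sketch of an
order checker, loosely in the spirit of lockdep. It is not kernel code and
not how lockdep is implemented; the lock class names and every function in
it are made up for illustration. It only shows that the two orderings
conflict, and that a checker which does not track one of the two locks has
nothing to complain about.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, for illustration only. */
enum lock_class { TRANSACTION, PAGE_LOCK, NR_CLASSES };

struct order_checker {
	bool tracked[NR_CLASSES];            /* classes the checker can see */
	bool before[NR_CLASSES][NR_CLASSES]; /* before[a][b]: a was held when b was taken */
	bool held[NR_CLASSES];
};

void checker_init(struct order_checker *c, bool track_page_lock)
{
	memset(c, 0, sizeof(*c));
	c->tracked[TRANSACTION] = true;
	c->tracked[PAGE_LOCK] = track_page_lock;
}

/* Returns false if taking 'cls' now inverts an order recorded earlier. */
bool checker_acquire(struct order_checker *c, enum lock_class cls)
{
	bool ok = true;

	assert(cls < NR_CLASSES);
	if (!c->tracked[cls])
		return true;	/* untracked locks are invisible to the checker */
	for (int h = 0; h < NR_CLASSES; h++) {
		if (!c->held[h])
			continue;
		if (c->before[cls][h])
			ok = false;	/* cls -> h was seen before, now h -> cls */
		c->before[h][cls] = true;
	}
	c->held[cls] = true;
	return ok;
}

void checker_release(struct order_checker *c, enum lock_class cls)
{
	c->held[cls] = false;
}

/* Run the established order first, then the inverted one. */
bool run_both_orders(bool track_page_lock)
{
	struct order_checker c;
	bool ok;

	checker_init(&c, track_page_lock);

	/* Established order: transaction start -> page lock. */
	checker_acquire(&c, TRANSACTION);
	checker_acquire(&c, PAGE_LOCK);
	checker_release(&c, PAGE_LOCK);
	checker_release(&c, TRANSACTION);

	/* Inverted order: page lock -> transaction start. */
	checker_acquire(&c, PAGE_LOCK);
	ok = checker_acquire(&c, TRANSACTION);
	checker_release(&c, TRANSACTION);
	checker_release(&c, PAGE_LOCK);
	return ok;
}
```

With the page lock tracked, run_both_orders() reports the inversion; with
it untracked, both orders pass silently.
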

I do understand the problem description, but this just isn't a viable
solution to it. There are some possible solutions, such as locking the first
page outside of the transaction, unlocking it, starting a transaction, and
then only trylocking pages in mpage_prepare_extent_to_map(), but this tends
to result in pretty ugly code. Also, we'd need to make sure we don't call
submit_bio() while a transaction is started (as that is where throttling
happens); any such place may cause the described latency problems. It's going
to be rather difficult to find and address all such places.
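
A rough userspace sketch of the control flow I mean, under heavy
assumptions: struct sketch_page, struct sketch_handle and all sketch_*()
functions are hypothetical stand-ins, not ext4 or jbd2 interfaces, and
stopping at the first contended page is just one possible policy. It does
not model submit_bio() or throttling at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a page and its lock bit; not the kernel's struct page. */
struct sketch_page {
	atomic_flag locked;
	bool mapped;
};

/* Stand-in for a running transaction handle. */
struct sketch_handle {
	bool started;
};

void sketch_page_init(struct sketch_page *p)
{
	atomic_flag_clear(&p->locked);
	p->mapped = false;
}

bool sketch_trylock_page(struct sketch_page *p)
{
	/* test_and_set returns the old value: false means we got the lock. */
	return !atomic_flag_test_and_set(&p->locked);
}

void sketch_lock_page(struct sketch_page *p)
{
	while (!sketch_trylock_page(p))
		;	/* may wait, so only do this with no transaction running */
}

void sketch_unlock_page(struct sketch_page *p)
{
	atomic_flag_clear(&p->locked);
}

/* Returns how many pages were handled before hitting a contended one. */
size_t sketch_writepages(struct sketch_page *pages, size_t n)
{
	struct sketch_handle handle = { .started = false };
	size_t done = 0;

	if (n == 0)
		return 0;

	/* 1. Wait for the first page outside of the transaction, then drop it. */
	sketch_lock_page(&pages[0]);
	sketch_unlock_page(&pages[0]);

	/* 2. Start the transaction. */
	handle.started = true;

	/* 3. Under the transaction, never block on a page lock: trylock only. */
	for (size_t i = 0; i < n; i++) {
		assert(handle.started);
		if (!sketch_trylock_page(&pages[i]))
			break;	/* contended: give up instead of waiting */
		pages[i].mapped = true;
		done++;
		sketch_unlock_page(&pages[i]);
	}

	handle.started = false;	/* stop the transaction */
	return done;
}
```

The only blocking page lock happens before the handle is started, so the
transaction start -> page lock ordering is never violated; the price is
that the caller has to cope with pages being skipped.
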

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR