Date:   Thu, 14 Mar 2019 14:37:55 -0600
From:   Ross Zwisler <zwisler@...gle.com>
To:     Dave Chinner <david@...morbit.com>
Cc:     linux-ext4@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>,
        Jan Kara <jack@...e.com>, Jens Axboe <axboe@...nel.dk>,
        linux-block@...r.kernel.org, Ross Zwisler <zwisler@...nel.org>
Subject: Re: question about writeback

On Thu, Mar 14, 2019 at 2:18 PM Dave Chinner <david@...morbit.com> wrote:
> On Thu, Mar 14, 2019 at 02:03:08PM -0600, Ross Zwisler wrote:
> > Hi,
> >
> > I'm trying to understand a failure I'm seeing with both v4.14 and
> > v4.19 based kernels, and I was hoping you could point me in the right
> > direction.
> >
> > What seems to be happening is that under heavy I/O we get into a
> > situation where for a given inode/mapping we eventually reach a steady
> > state where one task is continuously dirtying pages and marking them
> > for writeback via ext4_writepages(), and another task is continuously
> > completing I/Os via ext4_end_bio() and clearing the
> > PAGECACHE_TAG_WRITEBACK flags.  So, we are making forward progress as
> > far as I/O is concerned.
> >
> > The problem is that another task calls filemap_fdatawait_range(), and
> > that call never returns because it always finds pages that are tagged
> > for writeback.  I've added some prints to __filemap_fdatawait_range(),
> > and the total number of pages tagged for writeback seems pretty
> > constant.  It goes up and down a bit, but does not seem to move
> > towards 0.  If we halt I/O, the system eventually recovers, but if we
> > keep I/O going we can block the task waiting in
> > __filemap_fdatawait_range() long enough for the system to reboot due
> > to what it perceives as a hung task.
> >
> > My question is: Is there some mechanism that is supposed to prevent
> > this sort of situation?  Or is it expected that with slow enough
> > storage and a high enough I/O load, we could block inside of
> > filemap_fdatawait_range() indefinitely since we never run out of dirty
> > pages that are marked for writeback?
>
> So your problem is that you are doing an extending write, and then
> doing __filemap_fdatawait_range(end = LLONG_MAX), and while it
> blocks on the pages under IO, the file is further extended and so
> the next radix tree lookup finds more pages past that page under
> writeback?
>
> i.e. because it is waiting for pages to complete, it never gets
> ahead of the extending write or writeback and always ends up with
> more pages to wait on and so never reaches the end of the file as
> directed?
>
> So perhaps the caller should be waiting on a specific range to bound
> the wait (e.g.  isize as the end of the wait) rather than using the
> default "keep going until the end of file is reached" semantics?

The call to __filemap_fdatawait_range() is happening via the jbd2 code:

jbd2_journal_commit_transaction()
  journal_finish_inode_data_buffers()
    filemap_fdatawait_keep_errors()
      __filemap_fdatawait_range(end = LLONG_MAX)
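
For reference, the wait loop in __filemap_fdatawait_range() boils down
to roughly this (paraphrasing the v4.19 code, with error handling and
locking details trimmed):

static void __filemap_fdatawait_range(struct address_space *mapping,
                                      loff_t start_byte, loff_t end_byte)
{
        pgoff_t index = start_byte >> PAGE_SHIFT;
        pgoff_t end = end_byte >> PAGE_SHIFT;  /* huge when end_byte is LLONG_MAX */
        struct pagevec pvec;

        pagevec_init(&pvec);
        while (index <= end) {
                unsigned i, nr_pages;

                /* Find the next batch of pages still tagged for writeback. */
                nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index,
                                end, PAGECACHE_TAG_WRITEBACK);
                if (!nr_pages)
                        break;  /* only way out: no tagged pages remain */

                for (i = 0; i < nr_pages; i++) {
                        /* While we sleep here, the dirtying task can tag
                         * more pages at offsets beyond 'index'. */
                        wait_on_page_writeback(pvec.pages[i]);
                        ClearPageError(pvec.pages[i]);
                }
                pagevec_release(&pvec);
                cond_resched();
        }
}

As long as the other task stays ahead of 'index', the tagged lookup
never comes back empty and we never hit the break.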

Would it have to be an extending write?  Or could the same thing
happen with one thread just moving forward through a very large file,
dirtying pages, so that the __filemap_fdatawait_range() call keeps
finding newly tagged pages as it advances through the file?
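
Something like this hypothetical reproducer is what I have in mind
(the mount point, sizes, and the fsync pressure are all made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE       (8LL << 30)     /* placeholder: 8 GiB */
#define CHUNK           (1L << 20)      /* placeholder: 1 MiB writes */

static int fd;

static void *writer(void *arg)
{
        static char buf[CHUNK];
        off_t off = 0;

        memset(buf, 'x', sizeof(buf));
        for (;;) {
                /* March forward through the file, dirtying pages. */
                if (pwrite(fd, buf, sizeof(buf), off) != sizeof(buf))
                        break;
                off = (off + CHUNK) % FILE_SIZE;
        }
        return NULL;
}

int main(void)
{
        pthread_t t;

        fd = open("/mnt/ext4/bigfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
                return 1;
        pthread_create(&t, NULL, writer, NULL);
        for (;;)        /* each fsync() forces a journal commit, which
                           lands in the fdatawait call chain above */
                fsync(fd);
        return 0;
}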

In either case, I think your description of the problem is correct.
Is this just a "well, don't do that" type situation, or is this
supposed to have a different result?
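
(If bounding the wait is the answer, I'd picture something like this
sketch in journal_finish_inode_data_buffers().  filemap_fdatawait_range()
is real, but a range-based variant with the keep-errors semantics the
commit path currently relies on would still need to be invented:)

        /* Sketch only: wait on [0, i_size) sampled at commit time
         * instead of filemap_fdatawait_keep_errors(), which walks
         * all the way out to LLONG_MAX. */
        struct inode *inode = jinode->i_vfs_inode;
        loff_t isize = i_size_read(inode);

        if (isize)
                err = filemap_fdatawait_range(inode->i_mapping,
                                              0, isize - 1);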

- Ross
