linux-ext4 - Re: Discussion: is it time to remove dioread

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200107004338.GB125832@mit.edu>
Date:   Mon, 6 Jan 2020 19:43:38 -0500
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     Ritesh Harjani <riteshh@...ux.ibm.com>
Cc:     Xiaoguang Wang <xiaoguang.wang@...ux.alibaba.com>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Jan Kara <jack@...e.cz>, joseph.qi@...ux.alibaba.com,
        Liu Bo <bo.liu@...ux.alibaba.com>
Subject: Re: Discussion: is it time to remove dioread_nolock?

On Mon, Jan 06, 2020 at 05:54:56PM +0530, Ritesh Harjani wrote:
> > The initial reason we use dioread_nolock is that it'll also allocate
> > unwritten extents for buffered write, and normally the corresponding
> > inode won't be added to jbd2 transaction's t_inode_list, so while
> > commiting transaction, it won't flush inodes' dirty pages, then
> > transaction will commit quickly, otherwise in extream case, the time
> 
> I do notice this in ext4_map_blocks(). We add inode to t_inode_list only
> in case if we allocate written blocks. I guess this was done to avoid
> stale data exposure problem. So now due to ordered mode, we may end up
> flushing all dirty data pages in committing transaction before the
> metadata is flushed.
> 
> Do you have any benchmarks or workload where we could see this problem?
> And could this actually be a problem with any real world workload too?

After thinking about this some more, I've changed my mind.

I think this is something which *can* be very noticeable in some real
world workloads.  If the workload is doing a lot of allocating,
buffered writes to an inode, and the writeback thread starts doing the
writeback for that inode right before a commit starts, then the commit
can take a long time.  The problem is that if the storage device is
particularly slow --- for example, a slow USB drive, or a 32 GiB
Standard Persistent Disk in a Google Compute Environment (which has a
max sustained throughput of 3 MiB/s), it doesn't take a lot of queued
writeback I/O to trigger a hung task warning.  Even if hung task panic
isn't enabled, if the commit thread is busied out for a minute or two,
anything that is blocked on a commit completing --- such a fsync(2)
system call, could end up getting blocked for a long time, and that
could easily make a userspace application sad.

> Jan/Ted, your opinion on this pls?
> 
> I do see that there was a proposal by Ted @ [1] which should
> also solve this problem. I do have plans to work on Ted's proposal, but
> meanwhile, should we preserve this mount option for above mentioned use
> case? Or should we make it a no-op now?

> [1] - https://marc.info/?l=linux-ext4&m=157244559501734&w=2

I agree that this was not the original intent of dioread_nolock, but I
until we can implement [1], dioread_nolock is the only workaround real
workaround we have today.  (Well, data=writeback also works, but that
has other problems.)

If dropping dioread_nolock makes it easier to implement [1], we can
certainly make that one of the first patches in a patch series which
changes how we ext4_writepages() works so it writes the data blocks
before it updates the metadata blocks.  But unless there are some real
downsides to keeping the code around in the kernel until then, I'm not
sure it's worth it to take away the diorad_nolock functionality until
we have a good replacement --- even if that wasn't the original
purpose of the code.

What do other folks think?

					- Ted