Message-ID: <CAFgt=MCQttE5Vv_4W=hFWU_L_FvELnY6bnFQ3uSOu07VkDm6rA@mail.gmail.com>
Date:	Tue, 16 Aug 2011 14:32:12 -0700
From:	Jiaying Zhang <jiayingz@...gle.com>
To:	Tao Ma <tm@....ma>
Cc:	Jan Kara <jack@...e.cz>, Michael Tokarev <mjt@....msk.ru>,
	linux-ext4@...r.kernel.org, sandeen@...hat.com
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@....ma> wrote:
> On 08/16/2011 09:53 PM, Jan Kara wrote:
>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@....msk.ru> wrote:
>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>> [....]
>>>>
>>>> So, it looks like this (starting with cold cache):
>>>>
>>>> 1. rename the redologs and copy them over - this will
>>>>   make a hot copy of the redologs
>>>> 2. start up oracle - it will complain that the redologs aren't
>>>>   valid redologs because the header is corrupt
>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>
>>>> If you issue sync(1) between 1 and 2, everything works.
>>>> When shutting down, oracle calls fsync(), so that's like
>>>> running sync(1) again.
>>>>
>>>> If some time passes between 1 and 2, everything
>>>> works too.
>>>>
>>>> Without dioread_nolock I can't trigger the problem no matter
>>>> what I try.
>>>>
>>>>
>>>> A smaller test case.  I used the redo1.odf file (one of the
>>>> redologs) as a test file; any file will work.
>>>>
>>>>  $ cp -p redo1.odf temp
>>>>  $ dd if=temp of=foo iflag=direct count=20
>>> Isn't this the expected behavior here? When doing
>>> 'cp -p redo1.odf temp', data is copied to temp through
>>> buffered writes, but there is no guarantee when that data
>>> will actually be written to disk. Then with 'dd if=temp of=foo
>>> iflag=direct count=20', data is read directly from disk.
>>> Very likely, the written data hasn't been flushed to disk
>>> yet, so ext4 returns zeros in this case.
>>   No, it's not. Buffered and direct IO are supposed to work correctly
>> (although not fast) together. In this particular case we take care to flush
>> dirty data from the page cache before performing the direct IO read... But
>> something in this path is obviously broken.
I see. Thanks a lot for the explanation.
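
For reference, here is a minimal userspace sketch of the pattern Michael's
test exercises: a buffered write immediately followed by an O_DIRECT read of
the same range, with no sync in between. The file name, size and the 4KiB
O_DIRECT alignment are illustrative assumptions, not taken from his report.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096                /* assumed O_DIRECT alignment */

int main(void)
{
        char *buf;
        int fd;

        if (posix_memalign((void **)&buf, BLK, BLK))
                return 1;

        /* buffered write of a recognisable pattern, deliberately no fsync */
        fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return 1;
        memset(buf, 0xab, BLK);
        if (write(fd, buf, BLK) != BLK)
                perror("write");
        close(fd);

        /* O_DIRECT read of the same block right away */
        fd = open("testfile", O_RDONLY | O_DIRECT);
        if (fd < 0)
                return 1;
        memset(buf, 0, BLK);
        if (read(fd, buf, BLK) != BLK)
                perror("read");
        close(fd);

        /* should print 0xab; stale zeroes indicate the race discussed above */
        printf("first byte read back: 0x%02x\n", (unsigned char)buf[0]);
        free(buf);
        return 0;
}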

I wonder whether the following patch will solve the problem:

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 6c27111..ca90d73 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
        }

 retry:
-       if (rw == READ && ext4_should_dioread_nolock(inode))
+       if (rw == READ && ext4_should_dioread_nolock(inode)) {
+               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
+                       mutex_lock(&inode->i_mutex);
+                       ext4_flush_completed_IO(inode);
+                       mutex_unlock(&inode->i_mutex);
+               }
                ret = __blockdev_direct_IO(rw, iocb, inode,
                                 inode->i_sb->s_bdev, iov,
                                 offset, nr_segs,
                                 ext4_get_block, NULL, NULL, 0);
-       else {
+       } else {
                ret = blockdev_direct_IO(rw, iocb, inode,
                                 inode->i_sb->s_bdev, iov,
                                 offset, nr_segs,

I tested the patch a little bit and it seems to resolve the race
on dioread_nolock in my case. Michael, I would very much appreciate it
if you could try this patch with your test case and see whether it works.
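
In case it helps with verification, a rough comparison harness along these
lines could be run against the 'temp' copy from the test case above; the
4KiB alignment and the number of blocks compared are assumptions:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK     4096            /* assumed O_DIRECT alignment */
#define NBLK    4               /* compare the first 16KiB; adjust as needed */

int main(int argc, char **argv)
{
        char *dbuf, *bbuf;
        int dfd, bfd, i, bad = 0;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        if (posix_memalign((void **)&dbuf, BLK, BLK) ||
            posix_memalign((void **)&bbuf, BLK, BLK))
                return 1;

        dfd = open(argv[1], O_RDONLY | O_DIRECT);       /* direct read */
        bfd = open(argv[1], O_RDONLY);                  /* page-cache read */
        if (dfd < 0 || bfd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < NBLK; i++) {
                if (read(dfd, dbuf, BLK) != BLK ||
                    read(bfd, bbuf, BLK) != BLK)
                        break;
                if (memcmp(dbuf, bbuf, BLK)) {
                        fprintf(stderr, "mismatch in block %d\n", i);
                        bad = 1;
                }
        }
        printf("direct and buffered reads %s\n", bad ? "differ" : "match");
        return bad;
}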

>>
>> I don't have time to dig into this in detail now, but what seems to be the
>> problem is that with the dioread_nolock option we don't acquire i_mutex for
>> direct IO reads anymore. Thus these reads can compete with
>> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
>> i_mutex so without dioread_nolock the race cannot happen).
>>
>> Hmm, the new writepages code seems to be broken in combination with direct
>> IO. The direct IO code expects that when filemap_write_and_wait() finishes,
>> the data is on disk, but with the new bio submission code this is not true:
>> we clear the PageWriteback bit (which is what filemap_fdatawait() waits for)
>> in ext4_end_io_buffer_write() but only do the extent conversion afterwards
>> in the conversion workqueue. So the race has been there all along; without
>> dioread_nolock the window is just much smaller.
I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
So I think we should be ok with the default mount options.

Jiaying

> You are absolutely right. The real problem is that ext4_direct_IO
> begins to work *after* we clear the page writeback flag and *before* we
> convert the unwritten extent to a valid state. Some of my traces do show
> that. I am working on it now.
>>
>> Fixing this is going to be non-trivial - I'm not sure we can really move
>> the clearing of the PageWriteback bit to the conversion workqueue. I think
>> we already tried that once but it caused deadlocks for some reason...
> I just did as you described, and yes, I ran into another problem that I am
> trying to resolve now. Once it is OK, I will send out the patch.
>
> Thanks
> Tao
>
