Message-ID: <4E4AEF13.7070504@msgid.tls.msk.ru>
Date:	Wed, 17 Aug 2011 02:28:35 +0400
From:	Michael Tokarev <mjt@....msk.ru>
To:	Jiaying Zhang <jiayingz@...gle.com>
CC:	Tao Ma <tm@....ma>, Jan Kara <jack@...e.cz>,
	linux-ext4@...r.kernel.org, sandeen@...hat.com
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

17.08.2011 01:32, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@....ma> wrote:
>> On 08/16/2011 09:53 PM, Jan Kara wrote:
>>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@....msk.ru> wrote:
>>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>>> [....]
>>>>>
>>>>> So, it looks like this (starting with cold cache):
>>>>>
>>>>> 1. rename the redologs and copy them over - this will
>>>>>   make a hot copy of redologs
>>>>> 2. startup oracle - it will complain that the redologs aren't
>>>>>   redologs, the header is corrupt
>>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>>
>>>>> If you issue sync(1) between 1 and 2, everything will work.
>>>>> When shutting down, oracle calls fsync(), so that's like
>>>>> sync(1) again.
>>>>>
>>>>> If enough time passes between 1 and 2, everything
>>>>> will work too.
>>>>>
>>>>> Without dioread_nolock I can't trigger the problem no matter
>>>>> how hard I try.
>>>>>
>>>>>
>>>>> A smaller test case.  I used redo1.odf file (one of the
>>>>> redologs) as a test file, any will work.
>>>>>
>>>>>  $ cp -p redo1.odf temp
>>>>>  $ dd if=temp of=foo iflag=direct count=20
>>>> Isn't this the expected behavior here? When doing
>>>> 'cp -p redo1.odf temp', data is copied to temp through
>>>> buffer write, but there is no guarantee when data will be
>>>> actually written to disk. Then with 'dd if=temp of=foo
>>>> iflag=direct count=20', data is read directly from disk.
>>>> Very likely, the written data hasn't been flushed to disk
>>>> yet so ext4 returns zero in this case.
>>>   No it's not. Buffered and direct IO are supposed to work correctly
>>> (although not fast) together. In this particular case we take care to flush
>>> dirty data from page cache before performing direct IO read... But
>>> something is broken in this path obviously.
> I see. Thanks a lot for the explanation.
> 
> I wonder whether the following patch will solve the problem:
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 6c27111..ca90d73 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c

The code is in inode.c in 3.0, not in indirect.c (there's no such file in
3.0).  But the patch applies there (after de-MIMEfying it) with almost no
modifications... ;)

> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>         }
> 
>  retry:
> -       if (rw == READ && ext4_should_dioread_nolock(inode))
> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> +                       mutex_lock(&inode->i_mutex);
> +                       ext4_flush_completed_IO(inode);
> +                       mutex_unlock(&inode->i_mutex);
> +               }
>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>                                  inode->i_sb->s_bdev, iov,
>                                  offset, nr_segs,
>                                  ext4_get_block, NULL, NULL, 0);
> -       else {
> +       } else {
>                 ret = blockdev_direct_IO(rw, iocb, inode,
>                                  inode->i_sb->s_bdev, iov,
>                                  offset, nr_segs,
> 
> I tested the patch a little bit and it seems to resolve the race
> on dioread_nolock in my case. Michael, I would very much appreciate it
> if you could try this patch with your test case and see whether it works.

So I tried it just now.  I tried hard, several times, to hit the problem
again, and couldn't: I don't see the "corruption" anymore, neither with
the simple dd case nor with the oracle version.  I added a printk into
the new if-unlikely path, and it triggers several times - almost always
when I run my "oracle test-case", i.e. starting the database after
freshly copying the redologs.

So it appears you've hit the right place there.  At least for this
test case ;)  I'm not sure it's exactly the right way to go - maybe
some optimisation is possible in there, for the case where the completed
list is not empty but doesn't overlap with the range our I/O touches -
but that's probably overkill, dunno.
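(If someone wants to play with that idea, I imagine it'd look roughly like
the helper below - a completely untested sketch, and I'm only guessing that
the io_end entries carry the offset/size of the range awaiting conversion
and that the list is protected by i_completed_io_lock:

	/* return 1 if any completed-but-unconverted IO touches
	 * [offset, offset + count) */
	static int ext4_completed_io_overlaps(struct inode *inode,
					      loff_t offset, size_t count)
	{
		struct ext4_inode_info *ei = EXT4_I(inode);
		ext4_io_end_t *io;
		unsigned long flags;
		int overlap = 0;

		spin_lock_irqsave(&ei->i_completed_io_lock, flags);
		list_for_each_entry(io, &ei->i_completed_io_list, list)
			if (io->offset < offset + (loff_t)count &&
			    io->offset + io->size > offset) {
				overlap = 1;
				break;
			}
		spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);

		return overlap;
	}

The flush could then be skipped when it returns 0.  But again, probably
not worth the extra complexity.)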

>>> Hmm, the new writepages code seems to be broken in combination with direct
>>> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
>>> data is on disk but with new bio submission code this is not true because
>>> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
>>> ext4_end_io_buffer_write() but do extent conversion only after that in
>>> convert workqueue. So the race seems to be there all the time, just without
>>> dioread_nolock it's much smaller.

> I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
> So I think we should be ok with the default mount options.

Am I right that the intention was to eventually enable dioread_nolock by
default, or even to keep only its behaviour, without the mount option at all?

And thank you -- all -- for your work and support!

/mjt
