linux-kernel - Re: [PATCH 1/3] aio: fix oops because of extra IO control block freeing.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <A5667440-009D-4ACB-9EA9-7462341DD704@zabbo.net>
Date:	Wed, 7 Mar 2007 14:48:19 -0800
From:	Zach Brown <zab@...bo.net>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Leonid Ananiev <leonid.i.ananiev@...el.com>,
	Leonid Ananiev <leonid.i.ananiev@...ux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-aio <linux-aio@...ck.org>,
	Chris Mason <chris.mason@...cle.com>
Subject: Re: [PATCH 1/3] aio: fix oops because of extra IO control block freeing.

On Mar 7, 2007, at 2:14 PM, Andrew Morton wrote:

> On Mon, 05 Mar 2007 17:23:33 +0300
> Leonid Ananiev <leonid.i.ananiev@...ux.intel.com> wrote:
>
>>  From Leonid Ananiev
>>
>> The patch fixes oops because of extra IO control block freeing.
>> IO is retried if page could not be invalidated.

This patch is incorrect and shouldn't be merged.

>>
>> Signed-off-by: Leonid Ananiev <leonid.i.ananiev@...el.com>
>>
>> The patch fixes oops "Kernel BUG at fs/aio.c:509" archived at
>> http://lkml.org/lkml/2007/2/8/337
>> The number of IO control block (iocb)users < 0.
>> If page could not be invalidated by invalidate_inode_pages2_range()
>> than EIO is returned. It happens if journal_try_to_free_buffers()  
>> fails
>> to drop_buffers().
>> This EIO is not differing from real IO competition with EIO and
>> aio_complete() is called.

(I'm going to read this as "This EIO is misinterpreted as real IO  
completion with -EIO and aio_complete() is called.")

>> Later aio_complete() is called from dio_bio_end_aio() and frees iocb
>> once more.

This analysis is correct.  Nothing can clobber -EIOCBQUEUED as it is  
returned up from fs/direct-io.c to fs/aio.c.  This -EIO from  
invalidation is one of the two places that currently break this rule.

The fix I had hoped for, invalidating down in fs/direct-io.c before  
returning -EIOCBQUEUED, doesn't work because it ends up getting the  
ordering between the journal lock and the page lock backwards.   
Sigh.  (note to self: help lockdep warn us about that ordering)

>> After patch generic_file_direct_IO() sets PgBusy flag in iocb
>> if page could not be invalidated. iocb is retried after IO  
>> competition.
>> The process is waked up if IO is SYNC else iocb is kicked.
>> The lines ___if (ret != -EIOCBRETRY)___ is deleted because
>> nothing set to EIOCBRETRY.

True, but this is gratuitously cruel to external users of - 
EIOCBRETRY.  Silently breaking them doesn't sound like a great plan.   
If we really decide to remove EIOCBRETRY support we'd get rid of all  
the retry infrastructure and remove the EIOCBRETRY errno so their  
builds failed.  That's a separate issue that shouldn't be confused  
with this EIOCBQUEUED clobbering.  Just removing some pieces of the  
infrastructure willy-nilly isn't acceptable.

>
> Please copy linux-aio@...ck.org on AIO patches.
>
>> Next patches 2/3 and 3/3 do cleanup only.
>
> I cannot find those patches.

Me either.  I was waiting for them to arrive before responding.

>> diff -upr linux-2.6.20/mm/filemap.c linux-2.6.20-aio2/mm/filemap.c
>> --- linux-2.6.20/mm/filemap.c	2007-02-04 21:44:54.000000000 +0300
>> +++ linux-2.6.20-aio2/mm/filemap.c	2007-03-04 21:46:10.000000000  
>> +0300
>> @@ -2413,10 +2413,9 @@ generic_file_direct_IO(int rw, struct ki
>>   		if (rw == WRITE && mapping->nrpages) {
>>   			pgoff_t end = (offset + write_len - 1)
>>   						>> PAGE_CACHE_SHIFT;
>> -			int err = invalidate_inode_pages2_range(mapping,
>> -					offset >> PAGE_CACHE_SHIFT, end);
>> -			if (err)
>> -				retval = err;
>> +			if (invalidate_inode_pages2_range(mapping,
>> +					offset >> PAGE_CACHE_SHIFT, end))
>> +				kiocbSetPgBusy(iocb);

There are two problems behind this bug:

- ext3_releasepage() returns failure if it races with kjournald  
holding a reference while it waits for a transaction to commit.

- generic_file_direct_IO() causes an oops if it clobbers -EIOCBQUEUED  
with the return code from invalidate_inode_pages2_range()..- 
 >releasepage().

This fix makes the incorrect assertion that *any* failure from  
invalidate_inode_pages2_range(), which might not have anything to do  
with this race you're currently seeing, is transitory and should  
trigger a retry.  That's wrong, it should be returning the error.

Now, getting ext3_releasepage() to not fail if this race hits to  
begin with is another story.  Chris has some ideas about reworking  
the page laundering helper to make that more reliable.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/