linux-kernel - Re: Bug with "fix partial page writes"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGBYx2b4+s+YSKFoPbNSw=vWVaCZyxHjQdKGMHhCztTppETT9Q@mail.gmail.com>
Date:	Mon, 21 Nov 2011 09:59:55 +0800
From:	Yongqiang Yang <xiaoqiangnk@...il.com>
To:	Hugh Dickins <hughd@...gle.com>
Cc:	Allison Henderson <achender@...ux.vnet.ibm.com>,
	Curt Wohlgemuth <curtw@...gle.com>, "Ted Ts'o" <tytso@....edu>,
	linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: Bug with "fix partial page writes"

Hi,

I am curious about the reason we need this operation in write_begin
functions.    I had a look at the commit log just now.   The commit
log explains the intention is to handle writes on a hole and writes on
EOF.   Two cases can be handled successfully by block_write_begin.


Yongqiang.

On Mon, Nov 21, 2011 at 4:59 AM, Hugh Dickins <hughd@...gle.com> wrote:
> We've seen no response to this, so Cc'ing Ted and linux-kernel,
> and I'll fill in some more detail below.
>
> On Tue, 8 Nov 2011, Curt Wohlgemuth wrote:
>> It appears that there's a bug with this patch:
>>
>> -------------------------------------------
>> commit 02fac1297eb3f471a27368271aadd285548297b0
>> Author: Allison Henderson <achender@...ux.vnet.ibm.com>
>> Date:   Tue Sep 6 21:53:01 2011 -0400
>>
>>     ext4: fix partial page writes
>> ...
>> -------------------------------------------
>>
>> Hugh Dickins found a bug with some nasty testing and lockdep that
>
> It's the tmpfs swapping test that I've been running, with variations,
> for years.  System booted with mem=700M and 1.5G swap, two repetitious
> make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> the balance of built to unbuilt source grows smaller with later kernels),
> one directly in a tmpfs (irrelevant in this case, except for the added
> pressure it generates), the other in a 1k-block ext2 (that I drive with
> ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file.
>
> The first oops I got was indeed down in lockdep, but I've since seen
> crashes from the same cause without lockdep configured in.  I've not
> bothered to write down the stacks, beyond noting ext4_da_write_end()'s
> call to ext4_discard_partial_page_buffers_no_lock() in them, since the
> code there is clearly at fault as Curt describes.
>
>> crashed in ext4_da_write_end(), and after looking at the code with
>> him, it appears that the call to
>> ext4_discard_partial_page_buffers_no_lock() in this routine is
>> manipulating an unlocked, and possibly non-existent page:
>>
>>
>> -------------------------------------------
>> ...
>>       ret2 = generic_write_end(file, mapping, pos, len, copied,
>>                                                       page, fsdata);
>>
>>       page_len = PAGE_CACHE_SIZE -
>>                       ((pos + copied - 1) & (PAGE_CACHE_SIZE - 1));
>>
>>       if (page_len > 0) {
>>               ret = ext4_discard_partial_page_buffers_no_lock(handle,
>>                       inode, page, pos + copied - 1, page_len,
>>                       EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED);
>>       }
>> ...
>> -------------------------------------------
>>
>> Note that generic_write_end() will unlock and release the page before
>> it returns.
>
> Exactly.  And clearly the loop-on-tmpfs aspect of the test is
> irrelevant, except in generating more pressure to trigger it.
>
>>
>> I've no good answer for how to fix this properly, but I wanted to let
>> Allison know about this, if she hadn't already.  I looked but didn't
>> see any related email on the linux-ext4 list for this problem.
>
> There was a second problem I was seeing, more elusive and much harder
> to attribute: occasionally the build on ext2 would fail with errors
> from ld (almost always of the kind "In function `no symbol': multiple
> definition of `no symbol'" and "Warning: size of symbol `' changed":
> I don't know if there's anything to be deduced from that).  I took
> these to indicate an error in filesystem or loop or tmpfs or swap.
>
> First suspect was loop changes from hch in 3.2-rc1, but backing those
> out made no difference.  I thought I was facing a week's bisection
> (since it would need at least a day to conclude any stage good), but
> took a gamble on backing out *both* parts of 02fac1297eb3: page_len
> additions to ext4_da_write_begin() as well as page_len additions to
> ext4_da_write_end().
>
> That gamble paid off: the test then showed no problems in several
> days running on two machines.  So, both parts of 02fac1297eb3 are
> bad, but it's not so easy to see what's wrong with the write_begin.
>
> My *guess* is that the partial page fixes have interfered with the
> subtle page-dirty buffer-dirty protocol in some way, which manifests
> only under memory pressure.
>
> It's conceivable that loop and tmpfs and swap play a part in this
> further error, but I don't think so: I have no evidence for that,
> and no such problem was seen before 3.2-rc1.
>
> ---
>
> I wanted to find you an easier way to reproduce the problem, so I
> tried fsx (I'm still using a pretty old fsx, no fallocate or punch
> hole), run in ext2 on a kernel booted with mem=700M.  Sorry, I did
> this a week ago, then didn't find time to write it up, and failed to
> note when my ext2 was in /dev/loop0 and when it was directly on disk.
>
> fsx foo -q -c 100 -l 100000000 &
> while :
> do      # memory hog mmaps and touches each page of 800MB private area
>        swapout 800
> done
>
> I did not reproduce either problem above with that.  Instead I found
> that backing out 02fac1297eb3 made fsx on 3.2-rc1 fail in a few minutes.
> But leaving 02fac1297eb3 in, fsx still failed in 20 minutes or an hour.
> On 3.1, fsx failed in a few minutes.  On 3.0, fsx failed in half an hour.
> On 2.6.39, fsx failed in a few minutes.  I had to go back to 2.6.38 for
> fsx to run successfully under memory pressure for more than two hours.
>
> It looks as if ext4 testing has not been running fsx under memory
> pressure recently.  And although I didn't reproduce my main problems
> that way, it could well be that getting fsx to run reliably again
> under memory pressure will be the way to fix those problems.
>
> Thanks,
> Hugh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Best Wishes
Yongqiang Yang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/