Message-ID: <53B2A0E0.3000503@suse.cz>
Date: Tue, 01 Jul 2014 13:52:00 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Hugh Dickins <hughd@...gle.com>
CC: Johannes Weiner <hannes@...xchg.org>,
Konstantin Khlebnikov <koct9i@...il.com>,
Sasha Levin <sasha.levin@...cle.com>,
Dave Jones <davej@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: mm: shm: hang in shmem_fallocate
On 06/27/2014 07:36 AM, Hugh Dickins wrote:
> [Cc Johannes: at the end I have a particular question for you]
>
> On Thu, 26 Jun 2014, Vlastimil Babka wrote:
>> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
>>> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>>>
>>> Sorry for the slow response: I have got confused, learnt more, and
>>> changed my mind, several times in the course of replying to you.
>>> I think this reply will be stable... though not final.
>>
>> Thanks a lot for looking into it!
>>
>>>>
>>>> since this got a CVE,
>>>
>>> Oh. CVE-2014-4171. Couldn't locate that yesterday but see it now.
>>
>> Sorry, I should have mentioned it explicitly.
>>
>>> Looks overrated to me
>>
>> I'd bet it would pass unnoticed if you didn't use the sentence "but whether
>> it's a serious matter in the scale of denials of service, I'm not so sure" in
>> your first reply to Sasha's report :) I wouldn't be surprised if people grep
>> for this.
>
> Hah, you're probably right,
> I better choose my words more carefully in future.
>
>>
>>> (and amusing to see my pompous words about a
>>> "range notification mechanism" taken too seriously), but of course
>>> we do need to address it.
>>>
>>>> I've been looking at a backport to an older kernel where
>>>
>>> Thanks a lot for looking into it. I didn't think it was worth a
>>> Cc: stable@...r.kernel.org myself, but admit to being both naive
>>> and inconsistent about that.
>>>
>>>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>>>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and
>>>> since
>>>
>>> Yes, that mechanism could be ported back pre-v3.5,
>>> but I agree with your preference not to.
>>>
>>>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>>>
>>> Right, I don't think we have formally documented the instant of "full hole"
>>> that I strove for there, and it's probably not externally verifiable, nor
>>> guaranteed by other filesystems. I just thought it a good QoS aim, but
>>> it has given us this problem.
>>>
>>>> truncate really everything. Then I realized that maybe it would work
>>>> for current kernel as well, without having to add any checks in the
>>>> page fault path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE)
>>>> might look different from madvise(MADV_REMOVE), but it seems to me
>>>> that as long as it does discard the old data from the range, it's
>>>> fine from any information leak point of view. If someone races page
>>>> faulting, it IMHO doesn't matter if he gets a new zeroed page before
>>>> the parallel truncate has ended, or right after it has ended.
>>>
>>> Yes. I disagree with your actual patch, for more than one reason,
>>> but it's in the right area; and I found myself growing to agree with
>>> you, that it's better to have one kind of fix for all these releases,
>>> than one for v3.5..v3.15 and another for v3.1..v3.4. (The CVE cites
>>> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
>>
>> I was looking at our 3.0 based kernel, but it could be due to backported
>> patches on top.
>
> And later you confirm that 3.0.101 vanilla is okay: thanks, that fits.
>
>>
>>> If I'd realized that we were going to have to backport, I'd have spent
>>> longer looking for a patch like yours originally. So my inclination
>>> now is to go your route, make a new patch for v3.16 and backports,
>>> and revert the f00cdc6df7d7 that has already gone in.
>>>
>>>> So I'm posting it here as a RFC. I haven't thought about the
>>>> i915_gem_object_truncate caller yet. I think that this path
>>>> wouldn't satisfy
>>>
>>> My understanding is that i915_gem_object_truncate() is not a problem,
>>> that i915's dev->struct_mutex serializes all its relevant transitions,
>>> plus the object wouldn't even be interestingly accessible to the user.
>>>
>>>> the new "lstart < inode->i_size" condition, but I don't know if it's
>>>> "vulnerable"
>>>> to the problem.
>>>
>>> I don't think i915 is vulnerable, but if it is, that condition would
>>> be fine for it, as would be the patch I'm now thinking of.
>>>
>>>>
>>>> -----8<-----
>>>> From: Vlastimil Babka <vbabka@...e.cz>
>>>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole
>>>> punching
>>>>
>>>> ---
>>>> mm/shmem.c | 19 +++++++++++++++++++
>>>> 1 file changed, 19 insertions(+)
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index f484c27..6d6005c 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>> if (!pvec.nr) {
>>>> if (index == start || unfalloc)
>>>> break;
>>>> + /*
>>>> + * When this condition is true, it means we were
>>>> + * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>>>> + * To prevent a livelock when someone else is faulting
>>>> + * pages back, we are content with single pass and do
>>>> + * not retry with index = start. It's important that
>>>> + * previous page content has been discarded, and
>>>> + * faulter(s) got new zeroed pages.
>>>> + *
>>>> + * The other callsites are shmem_setattr (for
>>>> + * truncation) and shmem_evict_inode, which set i_size
>>>> + * to truncated size or 0, respectively, and then call
>>>> + * us with lstart == inode->i_size. There we do want to
>>>> + * retry, and livelock cannot happen for other reasons.
>>>> + *
>>>> + * XXX what about i915_gem_object_truncate?
>>>> + */
>>>
>>> I doubt you have ever faced such a criticism before, but I'm going
>>> to speak my mind and say that comment is too long! A comment of that
>>> length is okay above or just inside or at a natural break in a function,
>>> but here it distracts too much from what the code is actually doing.
>>
>> Fair enough. The reasoning should have gone into commit log, not comment.
>>
>>> In particular, the words "this condition" are so much closer to the
>>> condition above than the condition below, that it's rather confusing.
>>>
>>> /* Single pass when hole-punching to not livelock on racing faults */
>>> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>>>
>>>> + if (lstart < inode->i_size)
>>>
>>> For a long time I was going to suggest that you leave i_size out of it,
>>> and use "lend > 0" instead. Then suddenly I realized that this is the
>>> wrong place for the test.
>>
>> Well my first idea was to just add a flag about how persistent it should be.
>> And set it false for the punch hole case. Then I wondered if there's already
>> some bit that distinguishes it. But it makes it more subtle.
>>
>>> And then that it's not your fault, it's mine,
>>> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
>>> Wow, that really pessimized the hole-punch case!
>>>
>>> When is pvec.nr 0? When we've reached the end of the file. Why should
>>> we go to the end of the file, when punching a hole at the start? Ughh!
>>
>> Ah, I see (I think). But I managed to reproduce this problem when there was
>> only an extra page between lend and the end of file, so I doubt this is the
>> only problem. AFAIU it's enough to try punching a large enough hole, then the
>> loop can only do a single pagevec worth of pages per iteration, which gives
>> enough time for somebody faulting pages back?
>
> That's useful info, thank you: I just wasn't trying hard enough then;
> and you didn't even need 1024 cpus to show it either. Right, we have
> to revert my pincer, certainly on shmem. And I think I'd better do the
> same change on generic filesystems too (nobody has bothered to implement
> hole-punch on ramfs, but if they did, they would hit the same problem):
> though that part of it doesn't need a backport to -stable.
>
>>
>>>> + break;
>>>> index = start;
>>>> continue;
>>>> }
>>>> --
>>>> 1.8.4.5
>>>
>>> But there is another problem. We cannot break out after one pass on
>>> shmem, because there's a possibility that a swap entry in the radix_tree
>>> got swizzled into a page just as it was about to be removed - your patch
>>> might then leave that data behind in the hole.
>>
>> Thanks, I didn't notice that. Do I understand correctly that this could mean
>> info leak for the punch hole call, but wouldn't be a problem for madvise? (In
>> any case, that means the solution is not general enough for all kernels, so
>> I'm asking just to be sure).
>
> It's exactly the same issue for the madvise as for the fallocate:
> data that is promised to have been punched out would still be there.
AFAIK madvise doesn't promise anything. But never mind.
> Very hard case to trigger, though, I think: since by the time we get
> to this loop, we have already made one pass down the hole, getting rid
> of everything that wasn't page-locked at the time, so the chance of
> catching any swap in this loop is lower.
>
>>
>>> As it happens, Konstantin Khlebnikov suggested a patch for that a few
>>> weeks ago, before noticing that it's already handled by the endless loop.
>>> If we make that loop no longer endless, we need to add in Konstantin's
>>> "if (shmem_free_swap) goto retry" patch.
>>>
>>> Right now I'm thinking that my idiocy in d0823576bf4b may actually
>>> be the whole of Trinity's problem: patch below. If we waste time
>>> traversing the radix_tree to end-of-file, no wonder that concurrent
>>> faults have time to put something in the hole every time.
>>>
>>> Sasha, may I trespass on your time, and ask you to revert the previous
>>> patch from your tree, and give this patch below a try? I am very
>>> interested to learn if in fact it fixes it for you (as it did for me).
>>
>> I will try this, but as I explained above, I doubt that alone will help.
>
> And afterwards you confirmed, thank you.
>
>>
>>> However, I am wasting your time, in that I think we shall decide that
>>> it's too unsafe to rely solely upon the patch below (what happens if
>>> 1024 cpus are all faulting on it while we try to punch a 4MB hole at
>>
>> My reproducer is 4MB file, where the puncher tries punching everything except
>> first and last page. And there are 8 other threads (as I have 8 logical
>> CPU's) that just repeatedly sweep the same range, reading only the first byte
>> of each page.
>>
>>> end of file? if we care). I think we shall end up with the optimization
>>> below (or some such: it can be written in various ways), plus reverting
>>> d0823576bf4b's "index == start && " pincer, plus Konstantin's
>>> shmem_free_swap handling, rolled into a single patch; and a similar
>>
>> So that means no retry in any case (except the swap thing)? All callers can
>> handle that? I guess shmem_evict_inode would be ok, as nobody else
>> can be accessing that inode. But what about shmem_setattr? (i.e. straight
>> truncation) As you said earlier, faulters will get a SIGBUS (which AFAIU is
>> due to i_size being updated before we enter shmem_undo_range). But could
>> possibly a faulter already pass the i_size test, and proceed with the fault
>> only when we are already in shmem_undo_range and have passed the page in
>> question?
>
> We still have to retry indefinitely in the truncation case, as you
> rightly guess. SIGBUS beyond i_size makes it a much easier case to
> handle, and there's no danger of "indefinitely" becoming "infinitely"
> as in the punch-hole case. But, depending on how the filesystem
> handles its end, there is still some possibility of a race with faulting,
> which some filesystems may require pagecache truncation to resolve.
>
> Does shmem truncation itself require that? Er, er, it would take me
> too long to work out the definitive answer: perhaps it doesn't, but for
> safety I certainly assume that it does require that - that is, I never
> even considered removing the indefinite loop from the truncation case.
>
>>
>>> patch (without the swap part) for several functions in truncate.c.
>>>
>>> Hugh
>>>
>>> --- 3.16-rc2/mm/shmem.c 2014-06-16 00:28:55.124076531 -0700
>>> +++ linux/mm/shmem.c 2014-06-25 10:28:47.063967052 -0700
>>> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>>> for ( ; ; ) {
>>> cond_resched();
>>>
>>> + index = min(index, end);
>>> pvec.nr = find_get_entries(mapping, index,
>>> min(end - index, (pgoff_t)PAGEVEC_SIZE),
>>> pvec.pages, indices);
>
> So let's all forget that patch, although it does help to highlight my
> mistake in d0823576bf4b. (Oh, hey, let's all forget my mistake too!)
What patch? What mistake? :)
> Here's the 3.16-rc2 patch that I've now settled on (which will also
> require a revert of current git's f00cdc6df7d7; well, not require the
> revert, but this makes that redundant, and cannot be tested with it in).
>
> I've not yet had time to write up the patch description, nor to test
> it fully; but thought I should get the patch itself into the open for
> review and testing before then.
It seems to work here (tested 3.16-rc1 which didn't have f00cdc6df7d7 yet).
Checking for end != -1 is indeed a much more elegant solution than i_size.
Thanks. So you can add my Tested-by.
> I've checked against v3.1 to see how it works out there: certainly
> wouldn't apply cleanly (and beware: prior to v3.5's shmem_undo_range,
> "end" was included in the range, not excluded), but the same
> principles apply. Haven't checked the intermediates yet, will
> probably leave those until each stable wants them - but if you've a
> particular release in mind, please ask, or ask me to check your port.
I will try, thanks.
> I've included the mm/truncate.c part of it here, but that will be a
> separate (not for -stable) patch when I post the finalized version.
>
> Hannes, a question for you please, I just could not make up my mind.
> In mm/truncate.c truncate_inode_pages_range(), what should be done
> with a failed clear_exceptional_entry() in the case of hole-punch?
> Is that case currently depending on the rescan loop (that I'm about
> to revert) to remove a new page, so I would need to add a retry for
> that rather like the shmem_free_swap() one? Or is it irrelevant,
> and can stay unchanged as below? I've veered back and forth,
> thinking first one and then the other.
>
> Thanks,
> Hugh
>
> ---
>
> mm/shmem.c | 19 ++++++++++---------
> mm/truncate.c | 14 +++++---------
> 2 files changed, 15 insertions(+), 18 deletions(-)
>
> --- 3.16-rc2/mm/shmem.c 2014-06-16 00:28:55.124076531 -0700
> +++ linux/mm/shmem.c 2014-06-26 15:41:52.704362962 -0700
> @@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
> return;
>
> index = start;
> - for ( ; ; ) {
> + while (index < end) {
> cond_resched();
>
> pvec.nr = find_get_entries(mapping, index,
> min(end - index, (pgoff_t)PAGEVEC_SIZE),
> pvec.pages, indices);
> if (!pvec.nr) {
> - if (index == start || unfalloc)
> + /* If all gone or hole-punch or unfalloc, we're done */
> + if (index == start || end != -1)
> break;
> + /* But if truncating, restart to make sure all gone */
> index = start;
> continue;
> }
> - if ((index == start || unfalloc) && indices[0] >= end) {
> - pagevec_remove_exceptionals(&pvec);
> - pagevec_release(&pvec);
> - break;
> - }
> mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> @@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
> if (radix_tree_exceptional_entry(page)) {
> if (unfalloc)
> continue;
> - nr_swaps_freed += !shmem_free_swap(mapping,
> - index, page);
> + if (shmem_free_swap(mapping, index, page)) {
> + /* Swap was replaced by page: retry */
> + index--;
> + break;
> + }
> + nr_swaps_freed++;
> continue;
> }
>
> --- 3.16-rc2/mm/truncate.c 2014-06-08 11:19:54.000000000 -0700
> +++ linux/mm/truncate.c 2014-06-26 16:31:35.932433863 -0700
> @@ -352,21 +352,17 @@ void truncate_inode_pages_range(struct a
> return;
>
> index = start;
> - for ( ; ; ) {
> + while (index < end) {
> cond_resched();
> if (!pagevec_lookup_entries(&pvec, mapping, index,
> - min(end - index, (pgoff_t)PAGEVEC_SIZE),
> - indices)) {
> - if (index == start)
> + min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
> + /* If all gone or hole-punch, we're done */
> + if (index == start || end != -1)
> break;
> + /* But if truncating, restart to make sure all gone */
> index = start;
> continue;
> }
> - if (index == start && indices[0] >= end) {
> - pagevec_remove_exceptionals(&pvec);
> - pagevec_release(&pvec);
> - break;
> - }
> mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>