linux-kernel - Re: mm: shm: hang in shmem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53ABE479.3080508@suse.cz>
Date:	Thu, 26 Jun 2014 11:14:33 +0200
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Hugh Dickins <hughd@...gle.com>
CC:	Konstantin Khlebnikov <koct9i@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	Dave Jones <davej@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: mm: shm: hang in shmem_fallocate

On 06/26/2014 12:36 AM, Hugh Dickins wrote:
> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
>>> On Thu, 12 Jun 2014, Sasha Levin wrote:
>>>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>>>> in pages from the hole while it is being punched?
>>>>>
>>>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>>>
>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>> Does this happen with other holepunch filesystems?  If it does not,
>>>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>>>> But we don't want to make the tmpfs path heavier to match them.
>>>>>
>>>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>>>> punching in other filesystems and I make sure to get a bunch of those
>>>>> mounted before starting testing.
>>>>
>>>> Just pinging this one again. I still see hangs in -next where the hang
>>>> location looks same as before:
>>>>
>>>
>>> Please give this patch a try.  It fixes what I can reproduce, but given
>>> your unexplained page_mapped() BUG in this area, we know there's more
>>> yet to be understood, so perhaps this patch won't do enough for you.
>>>
>>
>> Hi,
>
> Sorry for the slow response: I have got confused, learnt more, and
> changed my mind, several times in the course of replying to you.
> I think this reply will be stable... though not final.

Thanks a lot for looking into it!

>>
>> since this got a CVE,
>
> Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.

Sorry, I should have mentioned it explicitly.

> Looks overrated to me

I'd bet it would pass unnoticed if you didn't use the sentence "but 
whether it's a serious matter in the scale of denials of service, I'm 
not so sure" in your first reply to Sasha's report :) I wouldn't be 
surprised if people grep for this.

> (and amusing to see my pompous words about a
> "range notification mechanism" taken too seriously), but of course
> we do need to address it.
>
>> I've been looking at backport to an older kernel where
>
> Thanks a lot for looking into it.  I didn't think it was worth a
> Cc: stable@...r.kernel.org myself, but admit to being both naive
> and inconsistent about that.
>
>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
>
> Yes, that mechanism could be ported back pre-v3.5,
> but I agree with your preference not to.
>
>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>
> Right, I don't think we have formally documented the instant of "full hole"
> that I strove for there, and it's probably not externally verifiable, nor
> guaranteed by other filesystems.  I just thought it a good QoS aim, but
> it has given us this problem.
>
>> truncate really everything. Then I realized that maybe it would work for
>> current kernel as well, without having to add any checks in the page fault
>> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
>> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
>> the old data from the range, it's fine from any information leak point of view.
>> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
>> page before the parallel truncate has ended, or right after it has ended.
>
> Yes.  I disagree with your actual patch, for more than one reason,
> but it's in the right area; and I found myself growing to agree with
> you, that's it's better to have one kind of fix for all these releases,
> than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)

I was looking at our 3.0 based kernel, but it could be due to backported 
patches on top.

> If I'd realized that we were going to have to backport, I'd have spent
> longer looking for a patch like yours originally.  So my inclination
> now is to go your route, make a new patch for v3.16 and backports,
> and revert the f00cdc6df7d7 that has already gone in.
>
>> So I'm posting it here as a RFC. I haven't thought about the
>> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
>
> My understanding is that i915_gem_object_truncate() is not a problem,
> that i915's dev->struct_mutex serializes all its relevant transitions,
> plus the object woudn't even be interestingly accessible to the user.
>
>> the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
>> to the problem.
>
> I don't think i915 is vulnerable, but if it is, that condition would
> be fine for it, as would be the patch I'm now thinking of.
>
>>
>> -----8<-----
>> From: Vlastimil Babka <vbabka@...e.cz>
>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
>>
>> ---
>>   mm/shmem.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index f484c27..6d6005c 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		if (!pvec.nr) {
>>   			if (index == start || unfalloc)
>>   				break;
>> +                        /*
>> +                         * When this condition is true, it means we were
>> +                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>> +                         * To prevent a livelock when someone else is faulting
>> +                         * pages back, we are content with single pass and do
>> +                         * not retry with index = start. It's important that
>> +                         * previous page content has been discarded, and
>> +                         * faulter(s) got new zeroed pages.
>> +                         *
>> +                         * The other callsites are shmem_setattr (for
>> +                         * truncation) and shmem_evict_inode, which set i_size
>> +                         * to truncated size or 0, respectively, and then call
>> +                         * us with lstart == inode->i_size. There we do want to
>> +                         * retry, and livelock cannot happen for other reasons.
>> +                         *
>> +                         * XXX what about i915_gem_object_truncate?
>> +                         */
>
> I doubt you have ever faced such a criticism before, but I'm going
> to speak my mind and say that comment is too long!  A comment of that
> length is okay above or just inside or at a natural break in a function,
> but here it distracts too much from what the code is actually doing.

Fair enough. The reasoning should have gone into commit log, not comment.

> In particular, the words "this condition" are so much closer to the
> condition above than the condition below, that it's rather confusing.
>
> /* Single pass when hole-punching to not livelock on racing faults */
> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>
>> +                        if (lstart < inode->i_size)
>
> For a long time I was going to suggest that you leave i_size out of it,
> and use "lend > 0" instead.  Then suddenly I realized that this is the
> wrong place for the test.

Well my first idea was to just add a flag about how persistent it should 
be. And set it false for the punch hole case. Then I wondered if there's 
already some bit that distinguishes it. But it makes it more subtle.

> And then that it's not your fault, it's mine,
> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
> Wow, that really pessimized the hole-punch case!
>
> When is pvec.nr 0?  When we've reached the end of the file.  Why should
> we go to the end of the file, when punching a hole at the start?  Ughh!

Ah, I see (I think). But I managed to reproduce this problem when there 
was only an extra page between lend and the end of file, so I doubt this 
is the only problem. AFAIU it's enough to try punching a large enough 
hole, then the loop can only do a single pagevec worth of pages per 
iteration, which gives enough time for somebody faulting pages back?

>> +                                break;
>>   			index = start;
>>   			continue;
>>   		}
>> --
>> 1.8.4.5
>
> But there is another problem.  We cannot break out after one pass on
> shmem, because there's a possiblilty that a swap entry in the radix_tree
> got swizzled into a page just as it was about to be removed - your patch
> might then leave that data behind in the hole.

Thanks, I didn't notice that. Do I understand correctly that this could 
mean info leak for the punch hole call, but wouldn't be a problem for 
madvise? (In any case, that means the solution is not general enough for 
all kernels, so I'm asking just to be sure).

> As it happens, Konstantin Khlebnikov suggested a patch for that a few
> weeks ago, before noticing that it's already handled by the endless loop.
> If we make that loop no longer endless, we need to add in Konstantin's
> "if (shmem_free_swap) goto retry" patch.
>
> Right now I'm thinking that my idiocy in d0823576bf4b may actually
> be the whole of Trinity's problem: patch below.  If we waste time
> traversing the radix_tree to end-of-file, no wonder that concurrent
> faults have time to put something in the hole every time.
>
> Sasha, may I trespass on your time, and ask you to revert the previous
> patch from your tree, and give this patch below a try?  I am very
> interested to learn if in fact it fixes it for you (as it did for me).

I will try this, but as I explained above, I doubt that alone will help.

> However, I am wasting your time, in that I think we shall decide that
> it's too unsafe to rely solely upon the patch below (what happens if
> 1024 cpus are all faulting on it while we try to punch a 4MB hole at

My reproducer is 4MB file, where the puncher tries punching everything 
except first and last page. And there are 8 other threads (as I have 8 
logical CPU's) that just repeatedly sweep the same range, reading only 
the first byte of each page.

> end of file? if we care).  I think we shall end up with the optimization
> below (or some such: it can be written in various ways), plus reverting
> d0823576bf4b's "index == start && " pincer, plus Konstantin's
> shmem_free_swap handling, rolled into a single patch; and a similar

So that means no retry in any case (except the swap thing)? All callers 
can handle that? I guess shmem_evict_inode would be ok, as nobody else
can be accessing that inode. But what about shmem_setattr? (i.e. 
straight truncation) As you said earlier, faulters will get a SIGBUS 
(which AFAIU is due to i_size being updated before we enter 
shmem_undo_range). But could possibly a faulter already pass the i_size 
test, and proceed with the fault only when we are already in 
shmem_undo_range and have passed the page in question?

> patch (without the swap part) for several functions in truncate.c.
>
> Hugh
>
> --- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
> +++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>   	for ( ; ; ) {
>   		cond_resched();
>
> +		index = min(index, end);
>   		pvec.nr = find_get_entries(mapping, index,
>   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>   				pvec.pages, indices);
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/