Message-ID: <53AC39FD.9060202@suse.cz>
Date: Thu, 26 Jun 2014 17:19:25 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Hugh Dickins <hughd@...gle.com>
CC: Konstantin Khlebnikov <koct9i@...il.com>,
Sasha Levin <sasha.levin@...cle.com>,
Dave Jones <davej@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: mm: shm: hang in shmem_fallocate
On 06/26/2014 11:14 AM, Vlastimil Babka wrote:
> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
>> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>>> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
>>>> On Thu, 12 Jun 2014, Sasha Levin wrote:
>>>>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>>>>> in pages from the hole while it is being punched?
>>>>>>
>>>>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>>>>
>>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>>> Does this happen with other holepunch filesystems? If it does not,
>>>>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>>>>> But we don't want to make the tmpfs path heavier to match them.
>>>>>>
>>>>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>>>>> punching in other filesystems and I make sure to get a bunch of those
>>>>>> mounted before starting testing.
>>>>>
>>>>> Just pinging this one again. I still see hangs in -next where the hang
>>>>> location looks the same as before:
>>>>>
>>>>
>>>> Please give this patch a try. It fixes what I can reproduce, but given
>>>> your unexplained page_mapped() BUG in this area, we know there's more
>>>> yet to be understood, so perhaps this patch won't do enough for you.
>>>>
>>>
>>> Hi,
>>
>> Sorry for the slow response: I have got confused, learnt more, and
>> changed my mind, several times in the course of replying to you.
>> I think this reply will be stable... though not final.
>
> Thanks a lot for looking into it!
>
>>>
>>> since this got a CVE,
>>
>> Oh. CVE-2014-4171. Couldn't locate that yesterday but see it now.
>
> Sorry, I should have mentioned it explicitly.
>
>> Looks overrated to me
>
> I'd bet it would pass unnoticed if you didn't use the sentence "but
> whether it's a serious matter in the scale of denials of service, I'm
> not so sure" in your first reply to Sasha's report :) I wouldn't be
> surprised if people grep for this.
>
>> (and amusing to see my pompous words about a
>> "range notification mechanism" taken too seriously), but of course
>> we do need to address it.
>>
>>> I've been looking at a backport to an older kernel where
>>
>> Thanks a lot for looking into it. I didn't think it was worth a
>> Cc: stable@...r.kernel.org myself, but admit to being both naive
>> and inconsistent about that.
>>
>>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
>>
>> Yes, that mechanism could be ported back pre-v3.5,
>> but I agree with your preference not to.
>>
>>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>>
>> Right, I don't think we have formally documented the instant of "full hole"
>> that I strove for there, and it's probably not externally verifiable, nor
>> guaranteed by other filesystems. I just thought it a good QoS aim, but
>> it has given us this problem.
>>
>>> truncate really everything. Then I realized that maybe it would work for
>>> the current kernel as well, without having to add any checks in the page fault
>>> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
>>> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
>>> the old data from the range, it's fine from any information leak point of view.
>>> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
>>> page before the parallel truncate has ended, or right after it has ended.
>>
>> Yes. I disagree with your actual patch, for more than one reason,
>> but it's in the right area; and I found myself growing to agree with
>> you, that it's better to have one kind of fix for all these releases,
>> than one for v3.5..v3.15 and another for v3.1..v3.4. (The CVE cites
>> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
>
> I was looking at our 3.0 based kernel, but it could be due to backported
> patches on top.
OK, seems I cannot reproduce this on 3.0.101 vanilla.
>> If I'd realized that we were going to have to backport, I'd have spent
>> longer looking for a patch like yours originally. So my inclination
>> now is to go your route, make a new patch for v3.16 and backports,
>> and revert the f00cdc6df7d7 that has already gone in.
>>
>>> So I'm posting it here as an RFC. I haven't thought about the
>>> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
>>
>> My understanding is that i915_gem_object_truncate() is not a problem,
>> that i915's dev->struct_mutex serializes all its relevant transitions,
>> plus the object wouldn't even be interestingly accessible to the user.
>>
>>> the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
>>> to the problem.
>>
>> I don't think i915 is vulnerable, but if it is, that condition would
>> be fine for it, as would be the patch I'm now thinking of.
>>
>>>
>>> -----8<-----
>>> From: Vlastimil Babka <vbabka@...e.cz>
>>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
>>>
>>> ---
>>> mm/shmem.c | 19 +++++++++++++++++++
>>> 1 file changed, 19 insertions(+)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index f484c27..6d6005c 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>  		if (!pvec.nr) {
>>>  			if (index == start || unfalloc)
>>>  				break;
>>> +			/*
>>> +			 * When this condition is true, it means we were
>>> +			 * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>>> +			 * To prevent a livelock when someone else is faulting
>>> +			 * pages back, we are content with single pass and do
>>> +			 * not retry with index = start. It's important that
>>> +			 * previous page content has been discarded, and
>>> +			 * faulter(s) got new zeroed pages.
>>> +			 *
>>> +			 * The other callsites are shmem_setattr (for
>>> +			 * truncation) and shmem_evict_inode, which set i_size
>>> +			 * to truncated size or 0, respectively, and then call
>>> +			 * us with lstart == inode->i_size. There we do want to
>>> +			 * retry, and livelock cannot happen for other reasons.
>>> +			 *
>>> +			 * XXX what about i915_gem_object_truncate?
>>> +			 */
>>
>> I doubt you have ever faced such a criticism before, but I'm going
>> to speak my mind and say that comment is too long! A comment of that
>> length is okay above or just inside or at a natural break in a function,
>> but here it distracts too much from what the code is actually doing.
>
> Fair enough. The reasoning should have gone into commit log, not comment.
>
>> In particular, the words "this condition" are so much closer to the
>> condition above than the condition below, that it's rather confusing.
>>
>> /* Single pass when hole-punching to not livelock on racing faults */
>> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>>
>>> +			if (lstart < inode->i_size)
>>
>> For a long time I was going to suggest that you leave i_size out of it,
>> and use "lend > 0" instead. Then suddenly I realized that this is the
>> wrong place for the test.
>
> Well my first idea was to just add a flag about how persistent it should
> be. And set it false for the punch hole case. Then I wondered if there's
> already some bit that distinguishes it. But it makes it more subtle.
>
>> And then that it's not your fault, it's mine,
>> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
>> Wow, that really pessimized the hole-punch case!
>>
>> When is pvec.nr 0? When we've reached the end of the file. Why should
>> we go to the end of the file, when punching a hole at the start? Ughh!
>
> Ah, I see (I think). But I managed to reproduce this problem when there
> was only an extra page between lend and the end of file, so I doubt this
> is the only problem. AFAIU it's enough to punch a large enough hole;
> then each iteration of the loop can only process a single pagevec's
> worth of pages, which gives somebody else enough time to fault pages back in?
>
>>> +				break;
>>>  			index = start;
>>>  			continue;
>>>  		}
>>> --
>>> 1.8.4.5
>>
>> But there is another problem. We cannot break out after one pass on
>> shmem, because there's a possibility that a swap entry in the radix_tree
>> got swizzled into a page just as it was about to be removed - your patch
>> might then leave that data behind in the hole.
>
> Thanks, I didn't notice that. Do I understand correctly that this could
> mean an info leak for the punch-hole call, but wouldn't be a problem for
> madvise? (In any case, that means the solution is not general enough for
> all kernels, so I'm asking just to be sure).
>
>> As it happens, Konstantin Khlebnikov suggested a patch for that a few
>> weeks ago, before noticing that it's already handled by the endless loop.
>> If we make that loop no longer endless, we need to add in Konstantin's
>> "if (shmem_free_swap) goto retry" patch.
>>
>> Right now I'm thinking that my idiocy in d0823576bf4b may actually
>> be the whole of Trinity's problem: patch below. If we waste time
>> traversing the radix_tree to end-of-file, no wonder that concurrent
>> faults have time to put something in the hole every time.
>>
>> Sasha, may I trespass on your time, and ask you to revert the previous
>> patch from your tree, and give this patch below a try? I am very
>> interested to learn if in fact it fixes it for you (as it did for me).
>
> I will try this, but as I explained above, I doubt that alone will help.
Yep, it didn't help here.
>> However, I am wasting your time, in that I think we shall decide that
>> it's too unsafe to rely solely upon the patch below (what happens if
>> 1024 cpus are all faulting on it while we try to punch a 4MB hole at
>
> My reproducer is a 4MB file, where the puncher tries punching everything
> except the first and last page. And there are 8 other threads (as I have 8
> logical CPUs) that just repeatedly sweep the same range, reading only
> the first byte of each page.
>
>> end of file? if we care). I think we shall end up with the optimization
>> below (or some such: it can be written in various ways), plus reverting
>> d0823576bf4b's "index == start && " pincer, plus Konstantin's
>> shmem_free_swap handling, rolled into a single patch; and a similar
>
> So that means no retry in any case (except the swap thing)? All callers
> can handle that? I guess shmem_evict_inode would be ok, as nobody else
> can be accessing that inode. But what about shmem_setattr? (i.e.
> straight truncation) As you said earlier, faulters will get a SIGBUS
> (which AFAIU is due to i_size being updated before we enter
> shmem_undo_range). But could a faulter possibly pass the i_size test
> first, and then proceed with the fault only once we are already in
> shmem_undo_range and have passed the page in question?
>
>> patch (without the swap part) for several functions in truncate.c.
>>
>> Hugh
>>
>> --- 3.16-rc2/mm/shmem.c 2014-06-16 00:28:55.124076531 -0700
>> +++ linux/mm/shmem.c 2014-06-25 10:28:47.063967052 -0700
>> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>>  	for ( ; ; ) {
>>  		cond_resched();
>>
>> +		index = min(index, end);
>>  		pvec.nr = find_get_entries(mapping, index,
>>  				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>>  				pvec.pages, indices);
>>
>
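
For reference, here is a rough sketch of the reproducer described above
(error handling mostly omitted; the tmpfs path, page size and reader
thread count are just what my test setup happens to use; build with
-pthread, and it needs a glibc that exposes FALLOC_FL_PUNCH_HOLE):

/*
 * One thread keeps punching a hole covering everything except the first
 * and last page of a 4MB tmpfs file, while NR_READERS threads repeatedly
 * read the first byte of each page in that range, faulting zeroed pages
 * straight back in.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE	(4 << 20)	/* 4MB backing file on tmpfs */
#define PAGE		4096		/* assumes 4k pages */
#define NR_READERS	8

static char *map;

static void *reader(void *arg)
{
	volatile char c;
	size_t off;

	(void)arg;
	for (;;)
		for (off = PAGE; off < FILE_SIZE - PAGE; off += PAGE)
			c = map[off];	/* fault the page back in */
	return NULL;
}

int main(void)
{
	/* assumes /dev/shm is a tmpfs mount */
	int fd = open("/dev/shm/punch-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
	pthread_t tid[NR_READERS];
	int i;

	if (fd < 0 || ftruncate(fd, FILE_SIZE))
		return 1;
	map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	memset(map, 'x', FILE_SIZE);	/* instantiate all pages */

	for (i = 0; i < NR_READERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);

	/* punch everything except the first and last page, forever */
	for (;;)
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      PAGE, FILE_SIZE - 2 * PAGE))
			perror("fallocate");
	return 0;
}

With the readers running, the fallocate() calls tend not to make progress
here, matching the shmem_fallocate hang this thread is about.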