Message-ID: <53AC39FD.9060202@suse.cz>
Date: Thu, 26 Jun 2014 17:19:25 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Hugh Dickins <hughd@...gle.com>
CC: Konstantin Khlebnikov <koct9i@...il.com>,
Sasha Levin <sasha.levin@...cle.com>,
Dave Jones <davej@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: mm: shm: hang in shmem_fallocate
On 06/26/2014 11:14 AM, Vlastimil Babka wrote:
> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
>> On Tue, 24 Jun 2014, Vlastimil Babka wrote:
>>> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
>>>> On Thu, 12 Jun 2014, Sasha Levin wrote:
>>>>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>>> Would trinity be likely to have a thread or process repeatedly faulting
>>>>>>> in pages from the hole while it is being punched?
>>>>>>
>>>>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>>>>
>>>>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>>>>>>> Does this happen with other holepunch filesystems? If it does not,
>>>>>>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>>>>>>> is lighter than a consistent disk-based filesystem's has to be.
>>>>>>> But we don't want to make the tmpfs path heavier to match them.
>>>>>>
>>>>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>>>>> punching in other filesystems and I make sure to get a bunch of those
>>>>>> mounted before starting testing.
>>>>>
>>>>> Just pinging this one again. I still see hangs in -next where the hang
>>>>> location looks the same as before:
>>>>>
>>>>
>>>> Please give this patch a try. It fixes what I can reproduce, but given
>>>> your unexplained page_mapped() BUG in this area, we know there's more
>>>> yet to be understood, so perhaps this patch won't do enough for you.
>>>>
>>>
>>> Hi,
>>
>> Sorry for the slow response: I have got confused, learnt more, and
>> changed my mind, several times in the course of replying to you.
>> I think this reply will be stable... though not final.
>
> Thanks a lot for looking into it!
>
>>>
>>> since this got a CVE,
>>
>> Oh. CVE-2014-4171. Couldn't locate that yesterday but see it now.
>
> Sorry, I should have mentioned it explicitly.
>
>> Looks overrated to me
>
> I'd bet it would pass unnoticed if you didn't use the sentence "but
> whether it's a serious matter in the scale of denials of service, I'm
> not so sure" in your first reply to Sasha's report :) I wouldn't be
> surprised if people grep for this.
>
>> (and amusing to see my pompous words about a
>> "range notification mechanism" taken too seriously), but of course
>> we do need to address it.
>>
>>> I've been looking at a backport to an older kernel where
>>
>> Thanks a lot for looking into it. I didn't think it was worth a
>> Cc: stable@...r.kernel.org myself, but admit to being both naive
>> and inconsistent about that.
>>
>>> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
>>> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
>>
>> Yes, that mechanism could be ported back pre-v3.5,
>> but I agree with your preference not to.
>>
>>> it doesn't guarantee anything, it seems simpler just to give up retrying to
>>
>> Right, I don't think we have formally documented the instant of "full hole"
>> that I strove for there, and it's probably not externally verifiable, nor
>> guaranteed by other filesystems. I just thought it a good QoS aim, but
>> it has given us this problem.
>>
>>> truncate really everything. Then I realized that maybe it would work for
>>> the current kernel as well, without having to add any checks in the page fault
>>> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
>>> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
>>> the old data from the range, it's fine from any information leak point of view.
>>> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
>>> page before the parallel truncate has ended, or right after it has ended.
>>
>> Yes. I disagree with your actual patch, for more than one reason,
>> but it's in the right area; and I found myself growing to agree with
>> you, that it's better to have one kind of fix for all these releases,
>> than one for v3.5..v3.15 and another for v3.1..v3.4. (The CVE cites
>> v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
>
> I was looking at our 3.0 based kernel, but it could be due to backported
> patches on top.
OK, seems I cannot reproduce this on 3.0.101 vanilla.
>> If I'd realized that we were going to have to backport, I'd have spent
>> longer looking for a patch like yours originally. So my inclination
>> now is to go your route, make a new patch for v3.16 and backports,
>> and revert the f00cdc6df7d7 that has already gone in.
>>
>>> So I'm posting it here as an RFC. I haven't thought about the
>>> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
>>
>> My understanding is that i915_gem_object_truncate() is not a problem,
>> that i915's dev->struct_mutex serializes all its relevant transitions,
>> plus the object wouldn't even be interestingly accessible to the user.
>>
>>> the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
>>> to the problem.
>>
>> I don't think i915 is vulnerable, but if it is, that condition would
>> be fine for it, as would be the patch I'm now thinking of.
>>
>>>
>>> -----8<-----
>>> From: Vlastimil Babka <vbabka@...e.cz>
>>> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching
>>>
>>> ---
>>> mm/shmem.c | 19 +++++++++++++++++++
>>> 1 file changed, 19 insertions(+)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index f484c27..6d6005c 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>  		if (!pvec.nr) {
>>>  			if (index == start || unfalloc)
>>>  				break;
>>> +			/*
>>> +			 * When this condition is true, it means we were
>>> +			 * called from fallocate(FALLOC_FL_PUNCH_HOLE).
>>> +			 * To prevent a livelock when someone else is faulting
>>> +			 * pages back, we are content with single pass and do
>>> +			 * not retry with index = start. It's important that
>>> +			 * previous page content has been discarded, and
>>> +			 * faulter(s) got new zeroed pages.
>>> +			 *
>>> +			 * The other callsites are shmem_setattr (for
>>> +			 * truncation) and shmem_evict_inode, which set i_size
>>> +			 * to truncated size or 0, respectively, and then call
>>> +			 * us with lstart == inode->i_size. There we do want to
>>> +			 * retry, and livelock cannot happen for other reasons.
>>> +			 *
>>> +			 * XXX what about i915_gem_object_truncate?
>>> +			 */
>>
>> I doubt you have ever faced such a criticism before, but I'm going
>> to speak my mind and say that comment is too long! A comment of that
>> length is okay above or just inside or at a natural break in a function,
>> but here it distracts too much from what the code is actually doing.
>
> Fair enough. The reasoning should have gone into commit log, not comment.
>
>> In particular, the words "this condition" are so much closer to the
>> condition above than the condition below, that it's rather confusing.
>>
>> /* Single pass when hole-punching to not livelock on racing faults */
>> would have been enough (yes, I've cheated, that would be 2 or 4 lines).
>>
>>> +			if (lstart < inode->i_size)
>>
>> For a long time I was going to suggest that you leave i_size out of it,
>> and use "lend > 0" instead. Then suddenly I realized that this is the
>> wrong place for the test.
>
> Well my first idea was to just add a flag about how persistent it should
> be. And set it false for the punch hole case. Then I wondered if there's
> already some bit that distinguishes it. But it makes it more subtle.
>
>> And then that it's not your fault, it's mine,
>> in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
>> Wow, that really pessimized the hole-punch case!
>>
>> When is pvec.nr 0? When we've reached the end of the file. Why should
>> we go to the end of the file, when punching a hole at the start? Ughh!
>
> Ah, I see (I think). But I managed to reproduce this problem when there
> was only an extra page between lend and the end of file, so I doubt this
> is the only problem. AFAIU it's enough to punch a large enough hole;
> then each iteration of the loop can only process a single pagevec's
> worth of pages, which gives somebody else enough time to fault pages back in?
>
>>> +				break;
>>>  			index = start;
>>>  			continue;
>>>  		}
>>> --
>>> 1.8.4.5
>>
>> But there is another problem. We cannot break out after one pass on
>> shmem, because there's a possibility that a swap entry in the radix_tree
>> got swizzled into a page just as it was about to be removed - your patch
>> might then leave that data behind in the hole.
>
> Thanks, I didn't notice that. Do I understand correctly that this could
> mean an info leak for the punch-hole call, but wouldn't be a problem for
> madvise? (In any case, that means the solution is not general enough for
> all kernels, so I'm asking just to be sure).
>
>> As it happens, Konstantin Khlebnikov suggested a patch for that a few
>> weeks ago, before noticing that it's already handled by the endless loop.
>> If we make that loop no longer endless, we need to add in Konstantin's
>> "if (shmem_free_swap) goto retry" patch.
>>
>> Right now I'm thinking that my idiocy in d0823576bf4b may actually
>> be the whole of Trinity's problem: patch below. If we waste time
>> traversing the radix_tree to end-of-file, no wonder that concurrent
>> faults have time to put something in the hole every time.
>>
>> Sasha, may I trespass on your time, and ask you to revert the previous
>> patch from your tree, and give this patch below a try? I am very
>> interested to learn if in fact it fixes it for you (as it did for me).
>
> I will try this, but as I explained above, I doubt that alone will help.
Yep, it didn't help here.
>> However, I am wasting your time, in that I think we shall decide that
>> it's too unsafe to rely solely upon the patch below (what happens if
>> 1024 cpus are all faulting on it while we try to punch a 4MB hole at
>
> My reproducer is a 4MB file, where the puncher tries punching everything
> except the first and last page. And there are 8 other threads (as I have 8
> logical CPUs) that just repeatedly sweep the same range, reading only
> the first byte of each page.
>
>> end of file? if we care). I think we shall end up with the optimization
>> below (or some such: it can be written in various ways), plus reverting
>> d0823576bf4b's "index == start && " pincer, plus Konstantin's
>> shmem_free_swap handling, rolled into a single patch; and a similar
>
> So that means no retry in any case (except the swap thing)? All callers
> can handle that? I guess shmem_evict_inode would be ok, as nobody else
> can be accessing that inode. But what about shmem_setattr? (i.e.
> straight truncation) As you said earlier, faulters will get a SIGBUS
> (which AFAIU is due to i_size being updated before we enter
> shmem_undo_range). But could a faulter possibly pass the i_size test
> first, and then proceed with the fault only once we are already in
> shmem_undo_range and have passed the page in question?
>
>> patch (without the swap part) for several functions in truncate.c.
>>
>> Hugh
>>
>> --- 3.16-rc2/mm/shmem.c 2014-06-16 00:28:55.124076531 -0700
>> +++ linux/mm/shmem.c 2014-06-25 10:28:47.063967052 -0700
>> @@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
>>  	for ( ; ; ) {
>>  		cond_resched();
>>
>> +		index = min(index, end);
>>  		pvec.nr = find_get_entries(mapping, index,
>>  				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>>  				pvec.pages, indices);
>>
>
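
For reference, here is a rough sketch of the reproducer described above
(error handling mostly omitted; the tmpfs path, page size and reader
thread count are just what my test setup happens to use; build with
-pthread, and it needs a glibc that exposes FALLOC_FL_PUNCH_HOLE):

/*
 * One thread keeps punching a hole covering everything except the first
 * and last page of a 4MB tmpfs file, while NR_READERS threads repeatedly
 * read the first byte of each page in that range, faulting zeroed pages
 * straight back in.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE	(4 << 20)	/* 4MB backing file on tmpfs */
#define PAGE		4096		/* assumes 4k pages */
#define NR_READERS	8

static char *map;

static void *reader(void *arg)
{
	volatile char c;
	size_t off;

	(void)arg;
	for (;;)
		for (off = PAGE; off < FILE_SIZE - PAGE; off += PAGE)
			c = map[off];	/* fault the page back in */
	return NULL;
}

int main(void)
{
	/* assumes /dev/shm is a tmpfs mount */
	int fd = open("/dev/shm/punch-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
	pthread_t tid[NR_READERS];
	int i;

	if (fd < 0 || ftruncate(fd, FILE_SIZE))
		return 1;
	map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	memset(map, 'x', FILE_SIZE);	/* instantiate all pages */

	for (i = 0; i < NR_READERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);

	/* punch everything except the first and last page, forever */
	for (;;)
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      PAGE, FILE_SIZE - 2 * PAGE))
			perror("fallocate");
	return 0;
}

With the readers running, the fallocate() calls tend not to make progress
here, matching the shmem_fallocate hang this thread is about.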