Message-ID: <09e63000-97fd-dbc3-6a3b-c606e0d73e15@oracle.com>
Date: Mon, 28 Aug 2017 10:45:58 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: Nadav Amit <namit@...are.com>
Cc: Nadia Yvette Chambers <nyc@...omorphy.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Eric Biggers <ebiggers3@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Michal Hocko <mhocko@...nel.org>
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in
hugetlbfs_fallocate()

Adding Andrew, Michal on CC

On 08/27/2017 01:08 PM, Nadav Amit wrote:
> Mike Kravetz <mike.kravetz@...cle.com> wrote:
>
>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>> hugetlbfs_fallocate() currently performs put_page() before unlock_page().
>>> This scenario opens a small time window, from the time the page is added
>>> to the page cache, until it is unlocked, in which the page might be
>>> removed from the page-cache by another core. If the page is removed
>>> during this time window, it might cause memory corruption, as the
>>> wrong page will be unlocked.
>>>
>>> It is arguable whether this scenario can happen in a real system, and
>>> there are several mitigating factors. The issue was found by code
>>> inspection (actually grep), and not by actually triggering the flow.
>>> Yet, since putting the page before unlocking is incorrect it should be
>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>> code.
>>>
>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>
>>> cc: Eric Biggers <ebiggers3@...il.com>
>>> cc: Mike Kravetz <mike.kravetz@...cle.com>
>>>
>>> Signed-off-by: Nadav Amit <namit@...are.com>
>>
>> Thank you Nadav.
>
> No problem.
>
>>
>> Reviewed-by: Mike Kravetz <mike.kravetz@...cle.com>
>>
>> Since hugetlbfs is an in-memory filesystem, the only way one 'should' be
>> able to remove a page (file content) is through an inode operation such as
>> truncate, hole punch, or unlink. That was the basis for my response that
>> the inode lock would be required for page freeing.
>>
>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>> I was expecting to see a check for hugetlbfs pages and exit (without
>> modification) if encountered. A quick review of the code did not find
>> any such checks.
>>
>> I'll take a closer look to determine exactly how hugetlbfs files are
>> handled. IMO, there should be something similar to the DAX check where
>> the routine quickly exits.
>
> I did not cc stable when submitting the patch, based on your previous
> response. Let me know if you want me to send v2 which does so.

I still do not believe there is a need to change this in stable. Your patch
should be sufficient to ensure we do the right thing going forward.
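
For readers following the ordering argument, here is a minimal user-space
sketch of why the lock should be released before the reference is dropped.
It is an analogy only, not the hugetlbfs code: struct obj, obj_get(),
obj_put() and obj_unlock_and_put() are hypothetical names standing in for
struct page, get_page(), put_page() and the unlock_page()/put_page() pair.
The point it illustrates is that once our reference is gone, the object may
be freed by whoever holds the last one, so touching its lock afterwards
would be a use-after-free.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page: a lock plus a reference count. */
struct obj {
	pthread_mutex_t lock;
	atomic_int refcount;
};

/* Counts frees for demonstration only; not thread safe. */
static int objs_freed;

/* Allocate an object holding one reference (the caller's). */
static struct obj *obj_alloc(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	pthread_mutex_init(&o->lock, NULL);
	atomic_init(&o->refcount, 1);
	return o;
}

/* Take an extra reference, as a cache holding the object would. */
static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);
}

/* Drop a reference; the last one to go frees the object. */
static void obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);
	if (old == 1) {
		pthread_mutex_destroy(&o->lock);
		free(o);
		objs_freed++;
	}
}

/*
 * Safe order: unlock while our reference still pins the object, then
 * drop the reference. The reverse order could unlock freed memory if
 * another thread dropped the last remaining reference in between.
 */
static void obj_unlock_and_put(struct obj *o)
{
	pthread_mutex_unlock(&o->lock);
	obj_put(o);
}
```

A caller would lock o->lock, do its work, and finish with
obj_unlock_and_put(o); after that call it must not touch o again.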

Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
hugetlbfs does indeed show a more general problem. One can use
sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
not go through the special hugetlbfs page handling code, but rather the
normal mm paths. As a result hugetlbfs accounting (like reserve counts)
gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!

I will address this issue in a separate patch.
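
The user-space side of the fadvise path described above can be sketched
roughly as follows. This is not the test that was actually run; it only
shows the standard posix_fadvise() call, which is the usual libc wrapper for
the fadvise64 system call on Linux. The helper names drop_cached_range() and
drop_cached_path() are made up for illustration, and any hugetlbfs file path
passed in is an assumption about the local setup.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel to drop cached pages for [offset, offset + len) of fd.
 * posix_fadvise() returns 0 on success or an error number directly;
 * it does not set errno.
 */
static int drop_cached_range(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}

/*
 * Hypothetical convenience wrapper: open path read-only and issue the
 * advice for the whole file (a len of 0 means "through end of file").
 * Returns 0 on success or an error number.
 */
static int drop_cached_path(const char *path)
{
	int fd = open(path, O_RDONLY);
	int err;

	if (fd < 0)
		return errno;
	err = drop_cached_range(fd, 0, 0);
	close(fd);
	return err;
}
```

Pointing drop_cached_path() at a populated file on a hugetlbfs mount (the
mount point and file name depend on the system) would exercise the path in
question; what the kernel then does with the pages depends on the kernel
version.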
--
Mike Kravetz