linux-kernel - Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <938a065b-ecae-5905-b394-6cbed0bfe946@collabora.com>
Date:   Mon, 17 Apr 2023 16:11:18 +0500
From:   Muhammad Usama Anjum <usama.anjum@...labora.com>
To:     Michał Mirosław <emmir@...gle.com>
Cc:     Muhammad Usama Anjum <usama.anjum@...labora.com>,
        Mike Rapoport <rppt@...nel.org>, Peter Xu <peterx@...hat.com>,
        David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrei Vagin <avagin@...il.com>,
        Danylo Mocherniuk <mdanylo@...gle.com>,
        Paul Gofman <pgofman@...eweavers.com>,
        Cyrill Gorcunov <gorcunov@...il.com>,
        Nadav Amit <namit@...are.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Shuah Khan <shuah@...nel.org>,
        Christian Brauner <brauner@...nel.org>,
        Yang Shi <shy828301@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Yun Zhou <yun.zhou@...driver.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Alex Sierra <alex.sierra@....com>,
        Matthew Wilcox <willy@...radead.org>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        "Gustavo A . R . Silva" <gustavoars@...nel.org>,
        Dan Williams <dan.j.williams@...el.com>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
        Greg KH <gregkh@...uxfoundation.org>, kernel@...labora.com
Subject: Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and
 optionally clear info about PTEs

On 4/11/23 2:29 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 13:11, Muhammad Usama Anjum
> <usama.anjum@...labora.com> wrote:
>> On 4/7/23 3:14 PM, Michał Mirosław wrote:
>>> On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
>>> <usama.anjum@...labora.com> wrote:
>>>> On 4/7/23 12:34 PM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
>>>>> <usama.anjum@...labora.com> wrote:
>>>>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
>>>>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
>>>>>>> <usama.anjum@...labora.com> wrote:
>>> [...]
>>>>>>>>>> +       /*
>>>>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>>>>>>>> +        * we want to return output to user in compact form where no two
>>>>>>>>>> +        * consecutive regions should be continuous and have the same flags.
>>>>>>>>>> +        * So store the latest element in p.cur between different walks and
>>>>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>>>>>>>> +        */
>>>>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>>>>>>>> +                             GFP_KERNEL);
>>>>>>>>>> +       if (!p.vec)
>>>>>>>>>> +               return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +       walk_start = walk_end = start;
>>>>>>>>>> +       while (walk_end < end && !ret) {
>>>>>>>>>
>>>>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>>>>>>>> error will be lost) - is it intended?
>>>>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
>>>>>>>> there was more memory to walk over. We don't treat this error. So when
>>>>>>>> buffer gets full, we stop walking over further as user buffer has gotten
>>>>>>>> full and return as success.
>>>>>>>
>>>>>>> Thanks. What's the difference between -ENOSPC and
>>>>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
>>>>>>> flow).
>>>>>> -ENOSPC --> user buffer has been filled completely
>>>>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>>>>>>                             still have more space
>>>>>
>>>>> What is the difference in code behaviour when those two cases are
>>>>> compared? (I'd expect none.)
>>>> There is difference:
>>>> We add data to user buffer. If it succeeds with return code 0, we engage
>>>> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
>>>> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
>>>> to the user buffer.
>>>
>>> Thanks! I see it now. I see a few more corner cases here:
>>> 1. If we did engage WP but fail to copy the vector we return -EFAULT
>>> but the WP is already engaged. I'm not sure this is something worth
>>> guarding against, but documenting that would be helpful I think.
>> Sure.
>>
>>> 2. If uffd_wp_range() fails, but we have already processed pages
>>> earlier, we should treat the error like ENOSPC and back out the failed
>>> range (the earier changes would be lost otherwise).
>> Backing out is easier to do for hugepages. But for normal pages, we'll have
>> to write some code to find where the current data was added (in cur or in
>> vec) and back out from that. I'll have to write some more code to avoid the
>> side-effects as well.
> 
> If I read the code correctly, the last page should always be in `cur`
> and on failure only a single page is needed to be backed out. Did I
> miss something?
I'm leaving using uffd_wp_range() in next revision as it is costing
performance. This will not be needed in next revision.

> 
>> But aren't we going over-engineering here? Error occurred and we are trying
>> to keep the previously generated correct data and returning successfully
>> still to the user? I don't think we should do this. An error is error. We
>> should return the error simply even if the memory flags would get lost. We
>> don't know what caused the error in uffd_wp_range(). Under normal
>> situation, we there shouldn't have had error.
> 
> In this case it means that on (intermittent) allocation error we get
> inconsistent or non-deterministic results. I wouldn't want to be the
> one debugging this later - I'd prefer either the syscall be
> "exception-safe" (give consistent and predictable output) or kill the
> process.
> 
> Best Regards
> Michał Mirosław

-- 
BR,
Muhammad Usama Anjum