[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a2615158-a0a6-9c2f-b04a-964dfa932aec@collabora.com>
Date: Wed, 24 May 2023 16:26:33 +0500
From: Muhammad Usama Anjum <usama.anjum@...labora.com>
To: Peter Xu <peterx@...hat.com>
Cc: Muhammad Usama Anjum <usama.anjum@...labora.com>,
linux-mm@...ck.org, Paul Gofman <pgofman@...eweavers.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Shuah Khan <shuah@...nel.org>,
Christian Brauner <brauner@...nel.org>,
Yang Shi <shy828301@...il.com>,
Vlastimil Babka <vbabka@...e.cz>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Yun Zhou <yun.zhou@...driver.com>,
Cyrill Gorcunov <gorcunov@...il.com>,
Michał Mirosław <emmir@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Andrei Vagin <avagin@...il.com>,
Alex Sierra <alex.sierra@....com>,
Matthew Wilcox <willy@...radead.org>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Danylo Mocherniuk <mdanylo@...gle.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
"Gustavo A . R . Silva" <gustavoars@...nel.org>,
David Hildenbrand <david@...hat.com>,
Dan Williams <dan.j.williams@...el.com>,
linux-kernel@...r.kernel.org, Mike Rapoport <rppt@...nel.org>,
linux-fsdevel@...r.kernel.org, linux-kselftest@...r.kernel.org,
Greg KH <gregkh@...uxfoundation.org>, kernel@...labora.com,
Nadav Amit <namit@...are.com>
Subject: Re: [PATCH RESEND v15 2/5] fs/proc/task_mmu: Implement IOCTL to get
and optionally clear info about PTEs
On 5/24/23 12:43 AM, Peter Xu wrote:
> Hi, Muhammad,
>
> On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote:
>> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote:
>>> On 4/26/23 7:13 PM, Peter Xu wrote:
>>>> Hi, Muhammad,
>>>>
>>>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
>>>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
>>>>>> +/* Supported flags */
>>>>>> +#define PM_SCAN_OP_GET (1 << 0)
>>>>>> +#define PM_SCAN_OP_WP (1 << 1)
>>>>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
>>>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
>>>>> be specified as need. But PM_SCAN_OP_WP cannot be specified without
>>>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
>>>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
>>>>>
>>>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
>>>>> vs
>>>>> 2) UFFDIO_WRITEPROTECT
>>>>>
>>>>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
>>>>> getting really good performance which is comparable just like we are
>>>>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
>>>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
>>>>> performance and behavior wise.
>>>>>
>>>>> I've got the results from someone else that UFFDIO_WRITEPROTECT block
>>>>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
>>>>> as I don't have tests comparing them one-to-one.
>>>>>
>>>>> What are your thoughts about it? Have you thought about making
>>>>> UFFDIO_WRITEPROTECT perform better?
>>>>>
>>>>> I'm sorry to mention the word "performance" here. Actually we want better
>>>>> performance to emulate Windows syscall. That is why we are adding this
>>>>> functionality. So either we need to see what can be improved in
>>>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
>>>>> pagemap_ioctl?
>>>>
>>>> I'm fine if you want to add it back if it works for you. Though before
>>>> that, could you remind me why there can be a difference on performance?
>>> I've looked at the code again and I think I've found something. Lets look
>>> at exact performance numbers:
>>>
>>> I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used
>>> for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured
>>> the average write time to the same memory which is being WP-ed and total
>>> time of execution of these APIs:
>
> What is the steps of the test? Is it as simple as "writeprotect",
> "unprotect", then write all pages in a single thread?
>
> Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
>
> Maybe you can attach the test program here too.
I'd not attached the test earlier as I thought that you wouldn't be
interested in running the test. I've attached it now. The test has multiple
threads where one thread tries to get status of flags and reset them, while
other threads write to that memory. In main(), we call the pagemap_scan
ioctl to get status of flags and reset the memory area as well. While in N
threads, the memory is written.
I usually run the test by following where memory area is of 100000 * pages:
./win2_linux 8 100000 1 1 0
I'm running tests on real hardware. The results are pretty consistent. I'm
also testing only on x86_64. PM_SCAN_OP_WP wins every time as compared to
UFFDIO_WRITEPROTECT.
The PM_SCAN_OP_WP op doesn't work exclusively on v15. So please find the
updated WIP code here:
https://gitlab.collabora.com/usama.anjum/linux-mainline/-/commits/memwatchv16/
>
>>>
>>> **avg write time:**
>>> | No of pages | 2000 | 8192 | 100000 | 500000 |
>>> |------------------------|------|------|--------|--------|
>>> | UFFDIO_WRITEPROTECT | 2200 | 2300 | 4100 | 4200 |
>>> | PM_SCAN_OP_WP | 2000 | 2300 | 2500 | 2800 |
>>>
>>> **Execution time measured in rdtsc:**
>>> | No of pages | 2000 | 8192 | 100000 | 500000 |
>>> |------------------------|------|-------|--------|--------|
>>> | UFFDIO_WRITEPROTECT | 3200 | 14000 | 59000 | 58000 |
>>> | PM_SCAN_OP_WP | 1900 | 7000 | 38000 | 40000 |
>>>
>>> Avg write time for UFFDIO_WRITEPROTECT is 1.3 times slow. The execution
>>> time is 1.5 times slower in the case of UFFDIO_WRITEPROTECT. So
>>> UFFDIO_WRITEPROTECT is making writes slower to the pages and execution time
>>> is also slower.
>>>
>>> This proves that PM_SCAN_OP_WP is better than UFFDIO_WRITEPROTECT. Although
>>> PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT have been implemented differently. We
>>> should have seen no difference in performance. But we have quite a lot of
>>> difference in performance here. PM_SCAN_OP_WP takes read mm lock, uses
>>> walk_page_range() to walk over pages which finds VMAs from address ranges
>>> to walk over them and pagemap_scan_pmd_entry() is handling most of the work
>>> including tlb flushing. UFFDIO_WRITEPROTECT is also taking the mm lock and
>>> iterating from all the different page directories until a pte is found and
>>> then flags are updated there and tlb is flushed for every pte.
>>>
>>> My next deduction would be that we are getting worse performance as we are
>>> flushing tlb for one page at a time in case of UFFDIO_WRITEPROTECT. While
>>> we flush tlb for 512 pages (moslty) at a time in case of PM_SCAN_OP_WP.
>>> I've just verified this by adding some logs to the change_pte_range() and
>>> pagemap_scan_pmd_entry(). Logs are attached. I've allocated memory of 1000
>>> pages and write-protected it with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP.
>>> The logs show that UFFDIO_WRITEPROTECT has flushed tlb 1000 times of size 1
>>> page each time. While PM_SCAN_OP_WP has flushed only 3 times of bigger
>>> sizes. I've learned over my last experience that tlb flush is very
>>> expensive. Probably this is what we need to improve if we don't want to add
>>> PM_SCAN_OP_WP?
>>>
>>> The UFFDIO_WRITEPROTECT uses change_pte_range() which is very generic
>>> function and I'm not sure if can try to not do tlb flushes if uffd_wp is
>>> true. We can try to do flush somewhere else and hopefully we should do only
>>> one flush if possible. It will not be so straight forward to move away from
>>> generic fundtion. Thoughts?
>> I've just tested this theory of not doing per pte flushes and only did one
>> flush on entire range in uffd_wp_range(). But it didn't improve the
>> situation either. I was wrong that tlb flushes may be the cause.
>
> I had a feeling that you were trapping tlb_flush_pte_range(), which is
> actually not really sending any TLB flushes but updating mmu_gather object
> for the addr range for future invalidations.
>
> That's probably why it didn't show an effect when you comment it out.
Yeah, probably.
>
> I am not sure whether the wr-protect path difference can be caused by the
> arch hooks, namely arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode().
>
> On x86 I saw that it's actually hooked onto some PV calls. I had a feeling
> that this is for optimization only, but maybe it's still a good idea you
> also take that into your new code:
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> PVOP_VCALL0(mmu.lazy_mode.enter);
> }
I've just looked into it. It isn't making any difference. But I think I
should include it in the code. It must be helpful for hypervisors etc.
>
> The other thing is I think you're flushing tlb outside pgtable lock in your
> new code. IIUC that's racy, see:
>
> commit 6ce64428d62026a10cb5d80138ff2f90cc21d367
> Author: Nadav Amit <namit@...are.com>
> Date: Fri Mar 12 21:08:17 2021 -0800
>
> mm/userfaultfd: fix memory corruption due to writeprotect
>
> So you may want to put it at least into pgtable lock critical section, or
> IIUC you can also do inc_tlb_flush_pending() then dec_tlb_flush_pending()
> just like __tlb_gather_mmu(), to make sure do_wp_page() will properly flush
> the page when unluckily hit some of the page.
Good point. I'll release page table lock after tlb flushing. I've just
added it to next WIP v16.
>
> That's also the spot (the flush_tlb_page() in 6ce64428d) that made me think
> on whether it caused the slowness on writting to those pages. But it
> really depends on your test program, e.g. if it's a single threaded I don't
> think it'll trigger because when writting mm_tlb_flush_pending() should
> start to return 0 already, so the tlb should logically not be needed. If
> you want maybe you can double check that.
>
> So in short, I had a feeling that the new PM_SCAN_OP_WP just misses
> something here and there so it's faster - it means even if it's faster it
> may also be prone to race conditions etc so we'd better figure it out..
The test program is multi-threaded. The performance number cannot be
reproduced with single-threaded application.
>
> Thanks,
>
--
BR,
Muhammad Usama Anjum
View attachment "win2_linux.c" of type "text/x-csrc" (13016 bytes)
Powered by blists - more mailing lists