Message-ID: <a8fce124-149f-ee46-6b23-a1805a32ebc1@oracle.com>
Date: Wed, 6 Jul 2022 14:33:09 -0600
From: Khalid Aziz <khalid.aziz@...cle.com>
To: Andy Lutomirski <luto@...nel.org>, Barry Song <21cnbao@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Aneesh Kumar <aneesh.kumar@...ux.ibm.com>,
Arnd Bergmann <arnd@...db.de>,
Jonathan Corbet <corbet@....net>,
Dave Hansen <dave.hansen@...ux.intel.com>,
David Hildenbrand <david@...hat.com>, ebiederm@...ssion.com,
hagen@...u.net, jack@...e.cz, Kees Cook <keescook@...omium.org>,
kirill@...temov.name, kucharsk@...il.com, linkinjeon@...nel.org,
linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>, longpeng2@...wei.com,
markhemm@...glemail.com, Peter Collingbourne <pcc@...gle.com>,
Mike Rapoport <rppt@...nel.org>, sieberf@...zon.com,
sjpark@...zon.de, Suren Baghdasaryan <surenb@...gle.com>,
tst@...oebel-theuer.de, Iurii Zaikin <yzaikin@...gle.com>
Subject: Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
On 7/3/22 14:54, Andy Lutomirski wrote:
> On 6/29/22 10:38, Khalid Aziz wrote:
>> On 5/30/22 22:24, Barry Song wrote:
>>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@...cle.com> wrote:
>>>>
>>>> mshare'd PTEs should not be removed when a task exits. These PTEs
>>>> are removed when the last task sharing the PTEs exits. Add a check
>>>> for shared PTEs and skip them.
>>>>
>>>> Signed-off-by: Khalid Aziz <khalid.aziz@...cle.com>
>>>> ---
>>>> mm/memory.c | 22 +++++++++++++++++++---
>>>> 1 file changed, 19 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index c77c0d643ea8..e7c5bc6f8836 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>> } else {
>>>> /*
>>>> * Optimization: gather nearby vmas into one call down
>>>> + * as long as they all belong to the same mm (that
>>>> + * may not be the case if a vma is part of mshare'd
>>>> + * range)
>>>> */
>>>> while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>>>> - && !is_vm_hugetlb_page(next)) {
>>>> + && !is_vm_hugetlb_page(next)
>>>> + && vma->vm_mm == tlb->mm) {
>>>> vma = next;
>>>> next = vma->vm_next;
>>>> unlink_anon_vmas(vma);
>>>> unlink_file_vma(vma);
>>>> }
>>>> - free_pgd_range(tlb, addr, vma->vm_end,
>>>> - floor, next ? next->vm_start : ceiling);
>>>> + /*
>>>> + * Free pgd only if pgd is not allocated for an
>>>> + * mshare'd range
>>>> + */
>>>> + if (vma->vm_mm == tlb->mm)
>>>> + free_pgd_range(tlb, addr, vma->vm_end,
>>>> + floor, next ? next->vm_start : ceiling);
>>>> }
>>>> vma = next;
>>>> }
>>>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>> pgd_t *pgd;
>>>> unsigned long next;
>>>>
>>>> + /*
>>>> + * If this is an mshare'd page, do not unmap it since it might
>>>> + * still be in use.
>>>> + */
>>>> + if (vma->vm_mm != tlb->mm)
>>>> + return;
>>>> +
>>>
>>> Except for unmap, have you ever tested reverse mapping in vmscan,
>>> especially folio_referenced()? Are all the vmas in those processes
>>> sharing the page table still in the rmap of the shared page?
>>> Without shared PTEs, if 1000 processes share one page, we are reading
>>> 1000 PTEs; with them, are we reading just one PTE, or the same PTE
>>> 1000 times? Have you tested it?
>>>
>>
>> We are treating an mshare'd region the same way as threads sharing an
>> address space. There is one PTE that is used by all processes, and the
>> vma maintained in the separate mshare mm_struct, which also holds the
>> shared PTE, is the one that gets added to rmap. What is different in
>> this model is that mshare adds an mm_struct that is separate from the
>> mm_structs of the processes that refer to the vma and PTE in the
>> mshare mm_struct. Do you see issues with rmap in this model?
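>>
>> Roughly, the relationships look like this (a simplified sketch; the
>> struct and field names are illustrative, not the exact code in the
>> series):
>>
>>     /* One mm_struct is created for the mshare'd region itself. */
>>     struct mshare_info {
>>         struct mm_struct *mm;   /* holds the shared vma and PTEs */
>>     };
>>
>>     /*
>>      * The shared vma lives in that separate mm, so for that vma
>>      * vma->vm_mm == mshare_info->mm rather than any caller's mm,
>>      * and rmap sees only this one vma for the shared pages.
>>      */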
>
> I think this patch is actually the most interesting bit of the series by far. Most of the rest is defining an API
> (which is important!) and figuring out semantics. This patch changes something rather fundamental about how user
> address spaces work: what vmas live in them. So let's figure out its effects.
>
> I admit I'm rather puzzled about what vm_mm is for in the first place. In current kernels (without your patch), I think
> it's a pretty hard requirement for vm_mm to equal the mm for all vmas in an mm. After a quick and incomplete survey,
> vm_mm seems to be mostly used as a somewhat lazy way to find the mm. Let's see:
>
> file_operations->mmap doesn't receive an mm_struct. Instead it infers the mm from vm_mm. (Why? I don't know.)
>
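> Schematically, the pattern is (an illustrative sketch, not any
> particular driver):
>
>     static int foo_mmap(struct file *file, struct vm_area_struct *vma)
>     {
>         /* No mm argument; the mm is inferred from the vma. */
>         struct mm_struct *mm = vma->vm_mm;
>
>         /* ... set up the mapping against mm ... */
>         return 0;
>     }
>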
> Some walk_page_range users seem to dig the mm out of vm_mm instead of mm_walk.
>
> Some manual address space walkers start with an mm, don't bother passing it around, and dig it back out of vm_mm.
> For example, unuse_vma() and all its helpers.
>
> The only real exception I've found so far is rmap: AFAICS (on quick inspection -- I could be wrong), rmap can map from a
> folio to a bunch of vmas, and the vmas' mms are not stored separately but instead determined by vm_mm.
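>
> (Schematically, each step of the rmap walk does
>
>     struct mm_struct *mm = vma->vm_mm;
>
> and there is no other record of which mm that vma belongs to.)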
>
>
>
> Your patch makes me quite nervous. You're potentially breaking any kernel code path that assumes that mms only contain
> vmas that have vm_mm == mm. And you're potentially causing rmap to be quite confused. I think that if you're going to
> take this approach, you need to clearly define the new semantics of vm_mm and audit or clean up every user of vm_mm in
> the tree. This may be nontrivial (especially rmap), although a cleanup of everything else to stop using vm_mm might be
> valuable.
>
> But I'm wondering if it would be better to attack this from a different direction. Right now, there's a hardcoded
> assumption that an mm owns every page table it references. That's really the thing you're changing. To me, it seems
> that a magical vma that shares page tables should still be a vma that belongs to its mm_struct -- munmap() and
> potentially other m***() operations should all work on it, existing find_vma() users should work, etc.
>
> So maybe instead there should be new behavior (by a VM_ flag or otherwise) that indicates that a vma owns its PTEs. It
> could even be a vm_operation, although if anyone ever wants regular file mappings to share PTEs, then a vm_operation
> doesn't really make sense.
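>
> Roughly, something like this (a hypothetical sketch; VM_NO_PTE_OWNERSHIP
> and the helper name are made up for illustration):
>
>     static inline bool vma_owns_ptes(struct vm_area_struct *vma)
>     {
>         /* A vma marked as sharing page tables does not own its PTEs. */
>         return !(vma->vm_flags & VM_NO_PTE_OWNERSHIP);
>     }
>
> free_pgtables() and unmap_page_range() would then check this instead of
> comparing vma->vm_mm against tlb->mm.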
>
Hi Andy,
You are absolutely right. Dragons lie on the path to changing the sense
of vm_mm. Dave H pointed out potential issues with the rb_tree as well.
As I have looked at more code, it seems that breaking the assumption
that vm_mm always points to the containing mm_struct opens up the
possibility of code breaking in strange ways in odd places. As a
result, I have changed the code in the v2 patches to preserve this
assumption about vm_mm, and instead rewrote the code to use the vm flag
VM_SHARED_PT and vm_private_data everywhere it needs to find the mshare
mm_struct. All the vmas belonging to the new mm_struct for the mshare
region also have their vm_mm pointing to the mshare mm_struct, which
keeps all vma operations working normally.
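
Roughly, code that needs the mshare mm_struct now does something like
this (a sketch; mshare_data is illustrative here, not necessarily the
exact v2 structure):

	if (vma->vm_flags & VM_SHARED_PT) {
		struct mshare_data *info = vma->vm_private_data;
		struct mm_struct *mshare_mm = info->mm;

		/* operate on the mshare mm instead of a process mm */
	}
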
Thanks,
Khalid