Message-ID: <877fgwul3v.fsf@linux.vnet.ibm.com>
Date: Mon, 21 Mar 2016 16:57:32 +0530
From: "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
To: Jérôme Glisse <jglisse@...hat.com>,
akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, joro@...tes.org,
Mel Gorman <mgorman@...e.de>, "H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Johannes Weiner <jweiner@...hat.com>,
Larry Woodman <lwoodman@...hat.com>,
Rik van Riel <riel@...hat.com>,
Dave Airlie <airlied@...hat.com>,
Brendan Conoboy <blc@...hat.com>,
Joe Donohue <jdonohue@...hat.com>,
Christophe Harle <charle@...dia.com>,
Duncan Poole <dpoole@...dia.com>,
Sherry Cheung <SCheung@...dia.com>,
Subhash Gutti <sgutti@...dia.com>,
John Hubbard <jhubbard@...dia.com>,
Mark Hairgrove <mhairgrove@...dia.com>,
Lucien Dunning <ldunning@...dia.com>,
Cameron Buschardt <cabuschardt@...dia.com>,
Arvind Gopalakrishnan <arvindg@...dia.com>,
Haggai Eran <haggaie@...lanox.com>,
Shachar Raindel <raindel@...lanox.com>,
Liran Liss <liranl@...lanox.com>,
Roland Dreier <roland@...estorage.com>,
Ben Sander <ben.sander@....com>,
Greg Stoner <Greg.Stoner@....com>,
John Bridgman <John.Bridgman@....com>,
Michael Mantor <Michael.Mantor@....com>,
Paul Blinzer <Paul.Blinzer@....com>,
Leonid Shamis <Leonid.Shamis@....com>,
Laurent Morichetti <Laurent.Morichetti@....com>,
Alexander Deucher <Alexander.Deucher@....com>,
Jérôme Glisse <jglisse@...hat.com>
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.
Jérôme Glisse <jglisse@...hat.com> writes:
> To migrate memory back we first need to lock the HMM special CPU page
> table entries, so we know no one else might try to migrate those
> entries back. The helper also allocates the new pages into which data
> will be copied back from the device. Then we can proceed with the
> device DMA operation.
>
> Once DMA is done we can update the CPU page table again to point to
> the new pages that hold the content copied back from device memory.
>
> Note that we do not need to invalidate the range as we are only
> modifying non-present CPU page table entries.
>
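The way I read the intended driver-side usage of these two helpers is
roughly the following (the device DMA function below is only a
placeholder of mine, not anything from this series):

	int ret;

	/* Lock the HMM entries and allocate destination pages. */
	ret = mm_hmm_migrate_back(mm, vma, new_pte, start, end);
	if (ret)
		return ret;

	/* Driver copies device memory back into the new pages via DMA. */
	my_device_dma_copy_back(dev, new_pte, start, end);

	/* Make the new pages visible through the CPU page table. */
	mm_hmm_migrate_back_cleanup(mm, vma, new_pte, hmm_pte, start, end);

Is that the expected sequence? A couple of sentences along these lines in
the commit message or above the helpers would help.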
> Changed since v1:
>   - Save the memcg against which each page is precharged, as it might
>     change along the way.
>
> Signed-off-by: Jérôme Glisse <jglisse@...hat.com>
> ---
> include/linux/mm.h | 12 +++
> mm/memory.c | 257 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 269 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c5c062e..1cd060f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2392,6 +2392,18 @@ static inline void hmm_mm_init(struct mm_struct *mm)
> {
> mm->hmm = NULL;
> }
> +
> +int mm_hmm_migrate_back(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + pte_t *new_pte,
> + unsigned long start,
> + unsigned long end);
> +void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + pte_t *new_pte,
> + dma_addr_t *hmm_pte,
> + unsigned long start,
> + unsigned long end);
> #else /* !CONFIG_HMM */
> static inline void hmm_mm_init(struct mm_struct *mm)
> {
> diff --git a/mm/memory.c b/mm/memory.c
> index 3cb3653..d917911a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3513,6 +3513,263 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> }
> EXPORT_SYMBOL_GPL(handle_mm_fault);
>
> +
> +#ifdef CONFIG_HMM
> +/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
> + *
> + * @mm: The mm struct.
> + * @vma: The vm area struct the range is in.
> + * @new_pte: Array of new CPU page table entry value.
> + * @start: Start address of the range (inclusive).
> + * @end: End address of the range (exclusive).
> + *
> + * This function will lock HMM page table entries and allocate a new page
> + * for each entry it successfully locked.
> + */
Can you add more comments around this?
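Something along these lines would already help (just a sketch of the level
of detail I have in mind, taken from my reading of the code):

	/*
	 * For each pte in [start, end) that is a non-present, unlocked HMM
	 * swap entry, replace it with a locked HMM entry so no other thread
	 * can race with this migration, allocate a zeroed anonymous page,
	 * charge it to a memcg (saved in page->s_mem for now) and stash the
	 * pte to install later in new_pte[i].  On allocation or charge
	 * failure, unlock the entries of the current chunk that did not get
	 * a page and return -ENOMEM.  mm_hmm_migrate_back_cleanup() installs
	 * the new_pte entries once the device DMA is done.
	 */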
> +int mm_hmm_migrate_back(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + pte_t *new_pte,
> + unsigned long start,
> + unsigned long end)
> +{
> + pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
> + unsigned long addr, i;
> + int ret = 0;
> +
> + VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
> +
> + if (unlikely(anon_vma_prepare(vma)))
> + return -ENOMEM;
> +
> + start &= PAGE_MASK;
> + end = PAGE_ALIGN(end);
> + memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
> +
> + for (addr = start; addr < end;) {
> + unsigned long cstart, next;
> + spinlock_t *ptl;
> + pgd_t *pgdp;
> + pud_t *pudp;
> + pmd_t *pmdp;
> + pte_t *ptep;
> +
> + pgdp = pgd_offset(mm, addr);
> + pudp = pud_offset(pgdp, addr);
> + /*
> + * Some other thread might already have migrated back the entry
> + * and freed the page table. Unlikely though.
> + */
> + if (unlikely(!pudp)) {
> + addr = min((addr + PUD_SIZE) & PUD_MASK, end);
> + continue;
> + }
> + pmdp = pmd_offset(pudp, addr);
> + if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
> + pmd_trans_huge(*pmdp))) {
> + addr = min((addr + PMD_SIZE) & PMD_MASK, end);
> + continue;
> + }
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
> + next = min((addr + PMD_SIZE) & PMD_MASK, end);
> + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> + swp_entry_t entry;
> +
> + entry = pte_to_swp_entry(*ptep);
> + if (pte_none(*ptep) || pte_present(*ptep) ||
> + !is_hmm_entry(entry) ||
> + is_hmm_entry_locked(entry))
> + continue;
> +
> + set_pte_at(mm, addr, ptep, hmm_entry);
> + new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> + vma->vm_page_prot));
> + }
> + pte_unmap_unlock(ptep - 1, ptl);
I guess this is fixing up all the ptes in the CPU page table that sit under
one pmd entry. But then what is the loop below doing?
> +
> + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> + addr < next; addr += PAGE_SIZE, i++) {
Your use of the variable addr, with multiple loops updating it, also makes
this complex. We should definitely add more comments here. I guess we are
going through the same range we iterated over above.
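To check my understanding, the per-pmd structure seems to be the following
(stripped down to the control flow, the comments are mine):

	for (addr = start; addr < end;) {
		/* ... walk to the pte level, take ptl ... */
		next = min((addr + PMD_SIZE) & PMD_MASK, end);

		/* Pass 1 (under ptl): lock every unlocked HMM entry in
		 * [cstart, next) and park a zero-pfn pte in new_pte[i]. */
		for (cstart = addr; addr < next; addr += PAGE_SIZE, i++)
			;

		/* Pass 2 (ptl dropped): replace each zero-pfn marker with a
		 * freshly allocated, memcg-charged page. */
		for (addr = cstart; addr < next; addr += PAGE_SIZE, i++)
			;

		/* Pass 3 (only on -ENOMEM, ptl retaken): unlock the entries
		 * that never got a page, then bail out. */
	}

If that is the intent, spelling it out like this in comments, and maybe
using a separate iterator instead of reusing addr, would make it much
easier to follow.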
> + struct mem_cgroup *memcg;
> + struct page *page;
> +
> + if (!pte_present(new_pte[i]))
> + continue;
What is that check for? We set that using pte_mkspecial() above?
> +
> + page = alloc_zeroed_user_highpage_movable(vma, addr);
> + if (!page) {
> + ret = -ENOMEM;
> + break;
> + }
> + __SetPageUptodate(page);
> + if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> + &memcg)) {
> + page_cache_release(page);
> + ret = -ENOMEM;
> + break;
> + }
> + /*
> + * We can safely reuse the s_mem/mapping field of page
> + * struct to store the memcg as the page is only seen
> + * by HMM at this point and we can clear it before it
> + * is public see mm_hmm_migrate_back_cleanup().
> + */
> + page->s_mem = memcg;
> + new_pte[i] = mk_pte(page, vma->vm_page_prot);
> + if (vma->vm_flags & VM_WRITE) {
> + new_pte[i] = pte_mkdirty(new_pte[i]);
> + new_pte[i] = pte_mkwrite(new_pte[i]);
> + }
Why mark it dirty if vm_flags has VM_WRITE set?
> + }
> +
> + if (!ret)
> + continue;
> +
> + hmm_entry = swp_entry_to_pte(make_hmm_entry());
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
Again we loop through the same range?
> + for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> + unsigned long pfn = pte_pfn(new_pte[i]);
> +
> + if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
> + continue;
What is that check for?
> +
> + set_pte_at(mm, addr, ptep, hmm_entry);
> + pte_clear(mm, addr, &new_pte[i]);
What is that pte_clear() for? The handling of new_pte needs more code
comments.
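As far as I can tell, new_pte[i] goes through three states, and a short
comment next to its handling would capture that (this is just my reading,
please correct me if it is wrong):

	/*
	 * new_pte[i] states:
	 *   pte_none          - pte was not a migratable HMM entry, skip it
	 *   special zero-pfn  - HMM entry locked, no page allocated yet
	 *                       (cleared again here if allocation failed)
	 *   regular pte       - page allocated and charged, ready for
	 *                       mm_hmm_migrate_back_cleanup()
	 */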
> + }
> + pte_unmap_unlock(ptep - 1, ptl);
> + break;
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(mm_hmm_migrate_back);
> +
-aneesh