Date:	Wed, 23 Mar 2016 11:09:20 +0100
From:	Jerome Glisse <j.glisse@...il.com>
To:	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
Cc:	Jérôme Glisse <jglisse@...hat.com>,
	akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, Linus Torvalds <torvalds@...ux-foundation.org>,
	joro@...tes.org, Mel Gorman <mgorman@...e.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Johannes Weiner <jweiner@...hat.com>,
	Larry Woodman <lwoodman@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Dave Airlie <airlied@...hat.com>,
	Brendan Conoboy <blc@...hat.com>,
	Joe Donohue <jdonohue@...hat.com>,
	Christophe Harle <charle@...dia.com>,
	Duncan Poole <dpoole@...dia.com>,
	Sherry Cheung <SCheung@...dia.com>,
	Subhash Gutti <sgutti@...dia.com>,
	John Hubbard <jhubbard@...dia.com>,
	Mark Hairgrove <mhairgrove@...dia.com>,
	Lucien Dunning <ldunning@...dia.com>,
	Cameron Buschardt <cabuschardt@...dia.com>,
	Arvind Gopalakrishnan <arvindg@...dia.com>,
	Haggai Eran <haggaie@...lanox.com>,
	Liran Liss <liranl@...lanox.com>,
	Roland Dreier <roland@...estorage.com>,
	Ben Sander <ben.sander@....com>,
	Greg Stoner <Greg.Stoner@....com>,
	John Bridgman <John.Bridgman@....com>,
	Michael Mantor <Michael.Mantor@....com>,
	Paul Blinzer <Paul.Blinzer@....com>,
	Leonid Shamis <Leonid.Shamis@....com>,
	Laurent Morichetti <Laurent.Morichetti@....com>,
	Alexander Deucher <Alexander.Deucher@....com>,
	Jatin Kumar <jakumar@...dia.com>
Subject: Re: [PATCH v12 08/29] HMM: add device page fault support v6.

On Wed, Mar 23, 2016 at 12:22:23PM +0530, Aneesh Kumar K.V wrote:
> Jérôme Glisse <jglisse@...hat.com> writes:
> 
> > This patch adds helpers for device page faults. These helpers will fill
> > the mirror page table using the CPU page table, synchronizing with any
> > update to the CPU page table.
> >
> > Changed since v1:
> >   - Add comment about directory lock.
> >
> > Changed since v2:
> >   - Check for mirror->hmm in hmm_mirror_fault()
> >
> > Changed since v3:
> >   - Adapt to HMM page table changes.
> >
> > Changed since v4:
> >   - Fix PROT_NONE, ie do not populate from protnone pte.
> >   - Fix huge pmd handling (start address may != pmd start address)
> >   - Fix missing entry case.
> >
> > Signed-off-by: Jérôme Glisse <jglisse@...hat.com>
> > Signed-off-by: Sherry Cheung <SCheung@...dia.com>
> > Signed-off-by: Subhash Gutti <sgutti@...dia.com>
> > Signed-off-by: Mark Hairgrove <mhairgrove@...dia.com>
> > Signed-off-by: John Hubbard <jhubbard@...dia.com>
> > Signed-off-by: Jatin Kumar <jakumar@...dia.com>
> > ---
> 
> 
> ....
> ....
> 
> > +static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
> > +				 struct hmm_event *event,
> > +				 struct vm_area_struct *vma,
> > +				 struct hmm_pt_iter *iter,
> > +				 pmd_t *pmdp,
> > +				 struct hmm_mirror_fault *mirror_fault,
> > +				 unsigned long start,
> > +				 unsigned long end)
> > +{
> > +	struct page *page;
> > +	unsigned long addr, pfn;
> > +	unsigned flags = FOLL_TOUCH;
> > +	spinlock_t *ptl;
> > +	int ret;
> > +
> > +	ptl = pmd_lock(mirror->hmm->mm, pmdp);
> > +	if (unlikely(!pmd_trans_huge(*pmdp))) {
> > +		spin_unlock(ptl);
> > +		return -EAGAIN;
> > +	}
> > +	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
> > +	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
> > +	pfn = page_to_pfn(page);
> > +	spin_unlock(ptl);
> > +
> > +	/* Just fault in the whole PMD. */
> > +	start &= PMD_MASK;
> > +	end = start + PMD_SIZE - 1;
> > +
> > +	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
> > +		return -ENOENT;
> > +
> > +	for (ret = 0, addr = start; !ret && addr < end;) {
> > +		unsigned long i, next = end;
> > +		dma_addr_t *hmm_pte;
> > +
> > +		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
> > +		if (!hmm_pte)
> > +			return -ENOMEM;
> > +
> > +		i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
> > +
> > +		/*
> > +		 * The directory lock protect against concurrent clearing of
> > +		 * page table bit flags. Exceptions being the dirty bit and
> > +		 * the device driver private flags.
> > +		 */
> > +		hmm_pt_iter_directory_lock(iter);
> > +		do {
> > +			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
> > +				hmm_pte[i] = hmm_pte_from_pfn(pfn);
> > +				hmm_pt_iter_directory_ref(iter);
> 
> I looked at that, and it is actually:
> static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
> {
> 	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
> 	hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
> }
> 
> static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
> 					struct page *ptd)
> {
> 	if (!atomic_inc_not_zero(&ptd->_mapcount))
> 		/* Illegal this should not happen. */
> 		BUG();
> }
> 
> What is the mapcount update about?

Unlike the regular CPU page table, we do not rely on unmap to prune the HMM
mirror page table. Rather, we free/prune it aggressively once the device no
longer has anything mirrored in a given range.

As such, mapcount is used to keep track of how many valid entries there are
per directory.

Moreover, mapcount is also used to protect against concurrent pruning: as you
walk through the page table you increment the refcount by one along the way,
and when you are done walking you decrement it.

Because of that last aspect, the mapcount can never reach zero just because we
unmap a page; it can only reach zero once we clean up after the page table walk.
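In other words, the count is "number of valid entries + number of in-flight
walkers". A rough user-space sketch of that pattern (made-up names, C11
atomics standing in for the kernel's atomic_t; not the actual HMM code):

#include <stdatomic.h>
#include <stdbool.h>

struct directory {
	/* Counts valid entries plus one per in-flight walker. */
	atomic_int refcount;
};

/* Walker entering: pin the directory so a concurrent prune cannot free
 * it under us. Equivalent of the kernel's atomic_inc_not_zero(). */
static bool dir_walk_begin(struct directory *d)
{
	int old = atomic_load(&d->refcount);
	do {
		if (old == 0)
			return false;	/* already being freed */
	} while (!atomic_compare_exchange_weak(&d->refcount, &old, old + 1));
	return true;
}

/* New valid entry added under the directory lock: plain increment. */
static void dir_entry_added(struct directory *d)
{
	atomic_fetch_add(&d->refcount, 1);
}

/* Drop a reference (walker done, or an entry cleared); returns true
 * once the count hits zero and the directory can be freed. */
static bool dir_put(struct directory *d)
{
	return atomic_fetch_sub(&d->refcount, 1) == 1;
}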

> 
> > +			}
> > +			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
> > +			if (pmd_write(*pmdp))
> > +				hmm_pte_set_write(&hmm_pte[i]);
> > +		} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
> > +		hmm_pt_iter_directory_unlock(iter);
> > +		mirror_fault->addr = addr;
> > +	}
> > +
> 
> So we don't have huge page mappings in the hmm page table?

No, we don't right now. The first reason is that I wanted to keep things
simple for device drivers. The second motivation is to keep the first
patchset simpler, especially the page migration code.

Memory overhead is 2MB per GB of virtual memory mirrored. There is no TLB
here. I believe adding huge page support can be done as part of a later
patchset if it makes sense.
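
For reference, the 2MB-per-GB figure follows directly from the entry size
(assuming 4KB pages and 8-byte dma_addr_t entries, matching the hmm_pte
array above):

	1GB / 4KB        = 262144 pages to mirror
	262144 * 8 bytes = 2MB of mirror page table per GB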

Cheers,
Jérôme
