linux-kernel - Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160321143058.GB4518@gmail.com>
Date:	Mon, 21 Mar 2016 15:30:59 +0100
From:	Jerome Glisse <j.glisse@...il.com>
To:	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
Cc:	Jérôme Glisse <jglisse@...hat.com>,
	akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, Linus Torvalds <torvalds@...ux-foundation.org>,
	joro@...tes.org, Mel Gorman <mgorman@...e.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Johannes Weiner <jweiner@...hat.com>,
	Larry Woodman <lwoodman@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Dave Airlie <airlied@...hat.com>,
	Brendan Conoboy <blc@...hat.com>,
	Joe Donohue <jdonohue@...hat.com>,
	Christophe Harle <charle@...dia.com>,
	Duncan Poole <dpoole@...dia.com>,
	Sherry Cheung <SCheung@...dia.com>,
	Subhash Gutti <sgutti@...dia.com>,
	John Hubbard <jhubbard@...dia.com>,
	Mark Hairgrove <mhairgrove@...dia.com>,
	Lucien Dunning <ldunning@...dia.com>,
	Cameron Buschardt <cabuschardt@...dia.com>,
	Arvind Gopalakrishnan <arvindg@...dia.com>,
	Haggai Eran <haggaie@...lanox.com>,
	Shachar Raindel <raindel@...lanox.com>,
	Liran Liss <liranl@...lanox.com>,
	Roland Dreier <roland@...estorage.com>,
	Ben Sander <ben.sander@....com>,
	Greg Stoner <Greg.Stoner@....com>,
	John Bridgman <John.Bridgman@....com>,
	Michael Mantor <Michael.Mantor@....com>,
	Paul Blinzer <Paul.Blinzer@....com>,
	Leonid Shamis <Leonid.Shamis@....com>,
	Laurent Morichetti <Laurent.Morichetti@....com>,
	Alexander Deucher <Alexander.Deucher@....com>
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when
 migrating memory back v2.

On Mon, Mar 21, 2016 at 07:18:41PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@...il.com> writes:
> > [ text/plain ]
> > On Mon, Mar 21, 2016 at 04:57:32PM +0530, Aneesh Kumar K.V wrote:
> >> Jérôme Glisse <jglisse@...hat.com> writes:

[...]

> >> > +		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> >> > +		for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
> >> > +		     next = min((addr + PMD_SIZE) & PMD_MASK, end);
> >> > +		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
> >> > +			swp_entry_t entry;
> >> > +
> >> > +			entry = pte_to_swp_entry(*ptep);
> >> > +			if (pte_none(*ptep) || pte_present(*ptep) ||
> >> > +			    !is_hmm_entry(entry) ||
> >> > +			    is_hmm_entry_locked(entry))
> >> > +				continue;
> >> > +
> >> > +			set_pte_at(mm, addr, ptep, hmm_entry);
> >> > +			new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> >> > +						   vma->vm_page_prot));
> >> > +		}
> >> > +		pte_unmap_unlock(ptep - 1, ptl);
> >> 
> >> 
> >> I guess this is fixing all the ptes in the cpu page table mapping a pmd
> >> entry. But then what is below ?
> >
> > Because we are dealing with special swap entry we know we can not have huge pages.
> > So we only care about HMM special swap entry. We record entry we want to migrate
> > in the new_pte array. The loop above is under pmd spin lock, the loop below does
> > memory allocation and we do not want to hold any spin lock while doing allocation.
> >
> 
> Can this go as code comment ?

Yes of course, i should have added more comment in first place.


> >> > +
> >> > +		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> >> > +		     addr < next; addr += PAGE_SIZE, i++) {
> >> 
> >> Your use of vairable addr with multiple loops updating then is also
> >> making it complex. We should definitely add more comments here. I guess
> >> we are going through the same range we iterated above here.
> >
> > Correct we are going over the exact same range, i am keeping the addr only
> > for alloc_zeroed_user_highpage_movable() purpose.
> >
> 
> Can we use a different variable name there ?

If you have suggestion for name ? I am just lacking imagination but i can use
a different name like vaddr.


> >> > +			struct mem_cgroup *memcg;
> >> > +			struct page *page;
> >> > +
> >> > +			if (!pte_present(new_pte[i]))
> >> > +				continue;
> >> 
> >> What is that checking for ?. We set that using pte_mkspecial above ?
> >
> > Not all entry in the range might match the criteria (ie special unlocked HMM swap
> > entry). We want to allocate pages only for entry that match the criteria.
> >
> 
> Since we did in the beginning, 
> 	memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
> 
> we should not find present bit set ? using present there is confusing,
> may be pte_none(). Also with comments around explaining the details ?

Yes pte_none() will works too, i will use that and add comments.


> >> > +			page = alloc_zeroed_user_highpage_movable(vma, addr);
> >> > +			if (!page) {
> >> > +				ret = -ENOMEM;
> >> > +				break;
> >> > +			}
> >> > +			__SetPageUptodate(page);
> >> > +			if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> >> > +						  &memcg)) {
> >> > +				page_cache_release(page);
> >> > +				ret = -ENOMEM;
> >> > +				break;
> >> > +			}
> >> > +			/*
> >> > +			 * We can safely reuse the s_mem/mapping field of page
> >> > +			 * struct to store the memcg as the page is only seen
> >> > +			 * by HMM at this point and we can clear it before it
> >> > +			 * is public see mm_hmm_migrate_back_cleanup().
> >> > +			 */
> >> > +			page->s_mem = memcg;
> >> > +			new_pte[i] = mk_pte(page, vma->vm_page_prot);
> >> > +			if (vma->vm_flags & VM_WRITE) {
> >> > +				new_pte[i] = pte_mkdirty(new_pte[i]);
> >> > +				new_pte[i] = pte_mkwrite(new_pte[i]);
> >> > +			}
> >> 
> >> Why mark it dirty if vm_flags is VM_WRITE ?
> >
> > It is a left over of some debuging i was doing, i missed it.

I actually remember why i set the dirty bit, i wanted to change the driver
API to have driver clear the dirty bit if they did not write instead on
relying on them to set it if they did. I thought it was a safer to cope with
potentialy buggy driver. I might update patchset to do that.

[...]

> >> > +		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> >> > +		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
> >> > +			unsigned long pfn = pte_pfn(new_pte[i]);
> >> > +
> >> > +			if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
> >> > +				continue;
> >> 
> 
> So here we are using the fact that we had set new pte using zero pfn in
> the firs loop and hence if we find a present new_pte with zero pfn, it implies we
> failed to allocate a page for that ?

Yes that's correct. I could use a another pte flag instead on relying on zero
pfn.

[...]

> >> > +
> >> > +			set_pte_at(mm, addr, ptep, hmm_entry);
> >> > +			pte_clear(mm, addr, &new_pte[i]);
> >> 
> >> what is that pte_clear for ?. Handling of new_pte needs more code comments.
> >> 
> >
> > Entry for which we failed to allocate memory we clear the special HMM swap entry
> > as well as the new_pte entry so that migration code knows it does not have to do
> > anything here.
> >
> 
> So that pte_clear is not expecting to do any sort of tlb flushes etc ? The
> idea is to put new_pte = 0 ?.  

Correct, no tlb flushing needed, new_pte is a private array use only during
migration and never expose to outside world. I will change to new_pte[i] = 0
instead.


> 
> Can we do all those conditionals without using pte bits ? A check like
> pte_present, is_zero_pfn etc confuse the reader. Instead can
> we do
> 
> if (pte_state[i] == SKIP_LOOP_FIRST)
> 
> if (pte_state[i] == SKIP_LOOP_SECOND)
> 
> I understand that we want to return new_pte array with valid pages, so
> may be the above will make code complex, but atleast code should have
> more comments explaining each step

Well another point of new_pte is that we can directly use the new_pte
value to update the CPU page table in the final migration step. But i
can define some HMM_PTE_MIGRATE, HMM_PTE_RESTORE as alias of existing pte
flag and they will be clear along the way depending on outcomes of each
step.

Jérôme