linux-kernel - Re: [PATCH -mm -v9 1/3] mm, THP, swap: Delay splitting THP during swap out

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Fri, 21 Apr 2017 10:08:12 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Balbir Singh <bsingharora@...il.com>
Cc:     "Huang, Ying" <ying.huang@...el.com>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org,
        Andrea Arcangeli <aarcange@...hat.com>,
        Ebru Akagunduz <ebru.akagunduz@...il.com>,
        Michal Hocko <mhocko@...nel.org>, Tejun Heo <tj@...nel.org>,
        Hugh Dickins <hughd@...gle.com>, Shaohua Li <shli@...nel.org>,
        Minchan Kim <minchan@...nel.org>,
        Rik van Riel <riel@...hat.com>, cgroups@...r.kernel.org
Subject: Re: [PATCH -mm -v9 1/3] mm, THP, swap: Delay splitting THP during
 swap out

On Fri, Apr 21, 2017 at 04:11:36PM +1000, Balbir Singh wrote:
> > In the future of THP swap optimization, some information of the
> > swapped out THP (such as compound map count) will be recorded in the
> > swap_cluster_info data structure.
> > 
> > The mem cgroup swap accounting functions are enhanced to support
> > charge or uncharge a swap cluster backing a THP as a whole.
> 
> Thanks and in the future it will be good to add stats to indicate
> the number of THP swapped out for tracking.

Agreed. Huang, can you include stats from your test that show how many
times we succeed and how many times we fall back to splitting?

> > With the patch, the swap out throughput improves 11.5% (from about
> > 3.73GB/s to about 4.16GB/s) in the vm-scalability swap-w-seq test case
> > with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
> > device used is a RAM simulated PMEM (persistent memory) device. 
> 
> I am not sure if RAM simulating PMEM is a fair way to test, its just
> memcpy and no swap out.

It would be good to know exactly what simulating PMEM means. Does it
add some artificial delay, or is it a simple ramfs that holds a swap
file?

IMO, this patch isn't a pure cycles-optimization anyway in the "reduce
instructions involved in this path" sense. It's the groundwork to make
swapping THP-native. So slimming down the cycles is great, but the
infrastructure work to later do 2MB TLB shootdown and swap with 2MB IO
holds more weight for me.

> > @@ -326,11 +326,14 @@ PAGEFLAG_FALSE(HighMem)
> >  #ifdef CONFIG_SWAP
> >  static __always_inline int PageSwapCache(struct page *page)
> >  {
> > +#ifdef CONFIG_THP_SWAP
> > +	page = compound_head(page);
> > +#endif
> 
> Can we please add a static inline THPSwapPage() that returns page_compound(page)
> for CONFIG_THP_SWAP and page otherwise?

Where else would it be used?

I think it'd be preferable to leave this #ifdef wart as a reminder
that we need to revisit this.

> > @@ -5929,25 +5929,26 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
> >  	memcg = mem_cgroup_id_get_online(memcg);
> >  
> >  	if (!mem_cgroup_is_root(memcg) &&
> > -	    !page_counter_try_charge(&memcg->swap, 1, &counter)) {
> > +	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
> >  		mem_cgroup_id_put(memcg);
> >  		return -ENOMEM;
> >  	}
> >  
> > -	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
> > +	if (nr_pages > 1)
> > +		mem_cgroup_id_get_many(memcg, nr_pages - 1);
> 
> The nr_pages -1 is not initutive, a comment about mem_cgroup_id_get_online()
> getting 1 would help.

Agreed. Something like this might help:

	/* Get references for the tail pages, too */

> > @@ -1066,6 +1155,33 @@ void swapcache_free(swp_entry_t entry)
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_THP_SWAP
> > +void swapcache_free_cluster(swp_entry_t entry)
> > +{
> > +	unsigned long offset = swp_offset(entry);
> > +	unsigned long idx = offset / SWAPFILE_CLUSTER;
> > +	struct swap_cluster_info *ci;
> > +	struct swap_info_struct *si;
> > +	unsigned char *map;
> > +	unsigned int i;
> > +
> > +	si = swap_info_get(entry);
> > +	if (!si)
> > +		return;
> > +
> > +	ci = lock_cluster(si, offset);
> > +	map = si->swap_map + offset;
> > +	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> > +		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
> > +		map[i] = 0;
> > +	}
> > +	unlock_cluster(ci);
> > +	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> > +	swap_free_cluster(si, idx);
> > +	spin_unlock(&si->lock);
> > +}
> > +#endif /* CONFIG_THP_SWAP */
> > +
> >  static int swp_entry_cmp(const void *ent1, const void *ent2)
> >  {
> >  	const swp_entry_t *e1 = ent1, *e2 = ent2;
> 
> 
> This is a massive patch, I presume you've got recommendations to keep it
> this way?

It used to be split into patches that introduce API and helpers on one
hand and patches that use these functions on the other hand. That was
impossible to review, because you had to jump between emails.

If you have ideas about which parts could be split out and be
stand-alone changes in their own right, I'd be all for that.