Message-Id: <20091211093032.db7fdd91.kamezawa.hiroyu@jp.fujitsu.com>
Date:	Fri, 11 Dec 2009 09:30:32 +0900
From:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	Christoph Lameter <cl@...ux-foundation.org>
Cc:	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	minchan.kim@...il.com, mingo@...e.hu
Subject: Re: [RFC mm][PATCH 2/5] percpu cached mm counter


Thank you for the review.

On Thu, 10 Dec 2009 11:51:24 -0600 (CST)
Christoph Lameter <cl@...ux-foundation.org> wrote:

> On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Now, mm's counter information is updated by atomic_long_xxx() functions if
> > USE_SPLIT_PTLOCKS is defined. This causes cache-miss when page faults happens
> > simultaneously in prural cpus. (Almost all process-shared objects is...)
> 
> s/prural cpus/multiple cpus simultaneously/?
> 
Ah, I see. I often make that mistake, sorry.

> > This patch implements per-cpu mm cache. This per-cpu cache is loosely
> > synchronized with mm's counter. Current design is..
> 
> Some more explanation about the role of the per cpu data would be useful.
> 
I see.
> For each cpu we keep a set of counters that can be incremented using per
> cpu operations. curr_mc points to the mm struct that is currently using
> the per cpu counters on a specific cpu?
> 
Yes, precisely. The per-cpu curr_mmc.mm points to the mm_struct of the
current thread once a page fault has occurred since the last schedule().
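
For reference, pieced together from the hunks quoted below, the per-cpu
object is roughly the following; the counter type and exact layout are my
guess, only the names curr_mmc, mm, counters[] and NR_MM_COUNTERS come
from the patch itself:

	/* per-cpu cache of mm counter deltas (sketch) */
	struct pcp_mm_cache {
		struct mm_struct *mm;		/* mm accounted since the last schedule(), or NULL */
		long counters[NR_MM_COUNTERS];	/* deltas not yet folded back into mm */
	};

	DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);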


> >   - prepare per-cpu object curr_mmc. curr_mmc containes pointer to mm and
> >     array of counters.
> >   - At page fault,
> >      * if curr_mmc.mm != NULL, update curr_mmc.mm counter.
> >      * if curr_mmc.mm == NULL, fill curr_mmc.mm = current->mm and account 1.
> >   - At schedule()
> >      * if curr_mm.mm != NULL, synchronize and invalidate cached information.
> >      * if curr_mmc.mm == NULL, nothing to do.
> 
> Sounds like a very good idea that could be expanded and used for other
> things like tracking the amount of memory used on a specific NUMA node in
> the future. Through that we may get to a schedule that can schedule with
> an awareness where the memory of a process is actually located.
> 
Hmm, expanding it into a per-node statistic?
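
If so, the cache could presumably just grow a per-node array next to the
existing counters and be folded back the same way at schedule() time.
Purely illustrative, the field name is invented:

	struct pcp_mm_cache {
		struct mm_struct *mm;
		long counters[NR_MM_COUNTERS];
		long node_pages[MAX_NUMNODES];	/* pages faulted in per node this period */
	};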

>  > By this.
> >   - no atomic ops, which tends to cache-miss, under page table lock.
> >   - mm->counters are synchronized when schedule() is called.
> >   - No bad thing to read-side.
> >
> > Concern:
> >   - added cost to schedule().
> 
> That is only a simple check right?
Yes.

> Are we already touching that cacheline in schedule?

0000000000010040 l     O .data.percpu   0000000000000050 vmstat_work
00000000000100a0 g     O .data.percpu   0000000000000030 curr_mmc
00000000000100e0 l     O .data.percpu   0000000000000030 vmap_block_queue

Hmm... judging from the per-cpu section layout above, it is not touched
unless a page fault occurs.

> Or place that structure near other stuff touched by the scheduer?
> 

I'll think about that.


> >
> > +#if USE_SPLIT_PTLOCKS
> > +
> > +DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> > +
> > +void __sync_mm_counters(struct mm_struct *mm)
> > +{
> > +	struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
> > +	int i;
> > +
> > +	for (i = 0; i < NR_MM_COUNTERS; i++) {
> > +		if (mmc->counters[i] != 0) {
> 
> Omit != 0?
> 
> if you change mmc->curr_mc then there is no need to set mmc->counters[0]
> to zero right? add_mm_counter_fast will set the counter to 1 next?
> 
Yes, I can omit that.
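
For example, the loop could shrink to something like this (a sketch
against this series, where add_mm_counter() takes the counter index;
whether the zeroing can also be dropped, as you suggest, I will
double-check):

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		add_mm_counter(mm, i, mmc->counters[i]);
		mmc->counters[i] = 0;
	}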


> > +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> > +{
> > +	struct mm_struct *cached = percpu_read(curr_mmc.mm);
> > +
> > +	if (likely(cached == mm)) { /* fast path */
> > +		percpu_add(curr_mmc.counters[member], val);
> > +	} else if (mm == current->mm) { /* 1st page fault in this period */
> > +		percpu_write(curr_mmc.mm, mm);
> > +		percpu_write(curr_mmc.counters[member], val);
> > +	} else /* page fault via side-path context (get_user_pages()) */
> > +		add_mm_counter(mm, member, val);
> 
> So get_user pages will not be accellerated.
> 
Yes, but I don't think that is a fast path. I'll mention that in the
patch description.
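
(For illustration: a typical fast-path caller would be the fault handlers
in mm/memory.c, something like the line below, assuming the counter index
introduced earlier in the series is named MM_ANONPAGES. A fault driven by
get_user_pages() against another process's mm fails the mm == current->mm
test and falls through to the plain add_mm_counter().)

	/* e.g. in the anonymous-fault path, after the pte is installed */
	add_mm_counter_fast(mm, MM_ANONPAGES, 1);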


> > Index: mmotm-2.6.32-Dec8/kernel/sched.c
> > ===================================================================
> > --- mmotm-2.6.32-Dec8.orig/kernel/sched.c
> > +++ mmotm-2.6.32-Dec8/kernel/sched.c
> > @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
> >  	trace_sched_switch(rq, prev, next);
> >  	mm = next->mm;
> >  	oldmm = prev->active_mm;
> > +
> >  	/*
> >  	 * For paravirt, this is coupled with an exit in switch_to to
> >  	 * combine the page table reload and the switch backend into
> 
> Extraneous new line.
> 
will fix.

> > @@ -5477,6 +5478,11 @@ need_resched_nonpreemptible:
> >
> >  	if (sched_feat(HRTICK))
> >  		hrtick_clear(rq);
> > +	/*
> > +	 * sync/invalidate per-cpu cached mm related information
> > +	 * before taking rq->lock. (see include/linux/mm.h)
> > +	 */
> > +	sync_mm_counters_atomic();
> >
> >  	spin_lock_irq(&rq->lock);
> >  	update_rq_clock(rq);
> 
> Could the per cpu counter stuff be placed into rq to avoid
> touching another cacheline?
> 
I will check whether that can be done without annoying people.
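
For reference, since its body is not quoted above: the hook called from
schedule() boils down to roughly the following reconstruction, i.e. a
single NULL check in the common case:

	void sync_mm_counters_atomic(void)
	{
		struct mm_struct *mm = percpu_read(curr_mmc.mm);

		if (mm) {
			__sync_mm_counters(mm);
			percpu_write(curr_mmc.mm, NULL);
		}
	}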

Thanks,
-Kame

