lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Wed, 11 Jan 2012 18:10:25 +0200
From:	Gilad Ben-Yossef <gilad@...yossef.com>
To:	Milton Miller <miltonm@....com>
Cc:	Christoph Lameter <cl@...ux.com>,
	Michal Nazarewicz <mina86@...a86.com>,
	Mel Gorman <mel@....ul.ie>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	linux-kernel@...r.kernel.org, Chris Metcalf <cmetcalf@...era.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Russell King <linux@....linux.org.uk>, linux-mm@...ck.org,
	Pekka Enberg <penberg@...nel.org>,
	Matt Mackall <mpm@...enic.com>, Rik van Riel <riel@...hat.com>,
	Andi Kleen <andi@...stfloor.org>,
	Sasha Levin <levinsasha928@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	linux-fsdevel@...r.kernel.org, Avi Kivity <avi@...hat.com>
Subject: Re: [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist

On Wed, Jan 11, 2012 at 9:04 AM, Milton Miller <miltonm@....com> wrote:

>> > > @@ -1097,7 +1105,19 @@ void drain_local_pages(void *arg)
>> > >   */
>> > >  void drain_all_pages(void)
>> > >  {
>> > > - on_each_cpu(drain_local_pages, NULL, 1);
>> > > + int cpu;
>> > > + struct per_cpu_pageset *pcp;
>> > > + struct zone *zone;
>> > > +
>> > > + for_each_online_cpu(cpu)
>> > > +         for_each_populated_zone(zone) {
>> > > +                 pcp = per_cpu_ptr(zone->pageset, cpu);
>> > > +                 if (pcp->pcp.count)
>> > > +                         cpumask_set_cpu(cpu, cpus_with_pcps);
>> > > +                 else
>> > > +                         cpumask_clear_cpu(cpu, cpus_with_pcps);
>> > > +         }
>> > > + on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
>> >
>
> This will only select cpus that have pcp pages in the last zone,
> not pages in any populated zone.

Oy! you are very right.

>
> Call cpu_mask_clear before the for_each_online_cpu loop, and only
> set cpus in the loop.
>
> Hmm, that means we actually need a lock for using the static mask.

We don't have to clear before the loop or lock. We can do something like this:

        int cpu;
        struct per_cpu_pageset *pcp;
        struct zone *zone;

        for_each_online_cpu(cpu) {
                bool has_pcps = false;
                for_each_populated_zone(zone) {
                        pcp = per_cpu_ptr(zone->pageset, cpu);
                        if (pcp->pcp.count) {
                                has_pcps = true;
                                break;
                        }
                }
                if (has_pcps)
                        cpumask_set_cpu(cpu, cpus_with_pcps);
                else
                        cpumask_clear_cpu(cpu, cpus_with_pcps);
        }
        on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);


>
> -- what would happen without any lock?
> [The last to store would likely see all the cpus and send them IPIs
> but the earlier, racing callers would miss some cpus (It will test
> our smp_call_function_many locking!), and would retry before all the
> pages were drained from pcp pools].  The locking should be better
> than racing -- one cpu will peek and pull all per-cpu data to itself,
> then copy the mask to its smp_call_function_many element, then wait
> for all the cpus it found cpus to process their list pulling their
> pcp data back to the owners.   Not much sense in the racing cpus
> figighting to bring the per-cpu data to themselves to write to the
> now contended static mask while pulling the zone pcp data from the
> owning cpus that are trying to push to the buddy lists.
>

That is a very good point and I guess you are right that the locking
saves cache bounces and probably also some IPIs.

I am not 100% it is a win though. The lockless approach is simpler,
has zero risk of dead locks. I also fear that any non interruptable lock
(be it spinlock or  interruptable lock non mutex) might delay an OOM
in progress.

Having the lock there is not a correctness issue - it is a performance
enhancement. Because this is not a fast path, i would go for simple
and less risk and not have the lock there at all.

> The benefit numbers in the changelog need to be measured again
> after this correction.
>

Yes, of course. Preliminary numbers with the lockless version above
still looks good.

I am guessing though that Mel's patch will make all this moot anyway
since if we're not doing a drain_all in the direct reclaim we're left with
very rare code paths (memory error and memory migration) that call
the drain_all and they wont tend to do that concurrently like the direct
reclaim.

>
>
> Disabling preemption around online loop and the call will prevent
> races with cpus going offline, but we do not that we care as the
> offline notification will cause the notified cpu to drain that pool.
> on_each_cpu_mask disables preemption as part of its processing.
>
> If we document the o-e-c-m behaviour then we should consider putting a
> comment here, or at least put the previous paragraph in the change log.

I noticed that Mel's patch already added get_online_cpus() in drain_all.


>
>> > >  }
>> > >
>> > >  #ifdef CONFIG_HIBERNATION
>> > > @@ -3601,6 +3621,10 @@ static void setup_zone_pageset(struct zone *zone)
>> > >  void __init setup_per_cpu_pageset(void)
>> > >  {
>> > >   struct zone *zone;
>> > > + int ret;
>> > > +
>> > > + ret = zalloc_cpumask_var(&cpus_with_pcps, GFP_KERNEL);
>> > > + BUG_ON(!ret);
>> > >
>
> Switching to cpumask_t will eliminate this hunk.   Even if we decide
> to keep it a cpumask_var_t we don't need to pre-zero it as we set
> the entire mask so alloc_cpumask_var would be sufficient.
>

hmm... then we can make it a static local variable and it doesn't even
make the kernel image really bigger since it's in the BSS. Cool.

Thanks,

Gilad



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@...yossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ