lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110722074227.GW5349@suse.de>
Date:	Fri, 22 Jul 2011 08:42:27 +0100
From:	Mel Gorman <mgorman@...e.de>
To:	Minchan Kim <minchan.kim@...il.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	P?draig Brady <P@...igbrady.com>,
	James Bottomley <James.Bottomley@...senpartnership.com>,
	Colin King <colin.king@...onical.com>,
	Andrew Lutomirski <luto@....edu>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	linux-mm <linux-mm@...ck.org>,
	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat
 when reclaiming successfully

On Fri, Jul 22, 2011 at 09:21:57AM +0900, Minchan Kim wrote:
> On Fri, Jul 22, 2011 at 2:01 AM, Mel Gorman <mgorman@...e.de> wrote:
> > On Fri, Jul 22, 2011 at 01:36:49AM +0900, Minchan Kim wrote:
> >> > > > <SNIP>
> >> > > > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> >> > > >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >> > > >       set_freezable();
> >> > > >
> >> > > > -     order = 0;
> >> > > > -     classzone_idx = MAX_NR_ZONES - 1;
> >> > > > +     order = new_order = 0;
> >> > > > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >> > > >       for ( ; ; ) {
> >> > > > -             unsigned long new_order;
> >> > > > -             int new_classzone_idx;
> >> > > >               int ret;
> >> > > >
> >> > > > -             new_order = pgdat->kswapd_max_order;
> >> > > > -             new_classzone_idx = pgdat->classzone_idx;
> >> > > > -             pgdat->kswapd_max_order = 0;
> >> > > > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> >> > > > +             /*
> >> > > > +              * If the last balance_pgdat was unsuccessful it's unlikely a
> >> > > > +              * new request of a similar or harder type will succeed soon
> >> > > > +              * so consider going to sleep on the basis we reclaimed at
> >> > > > +              */
> >> > > > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> >> > > > +                     new_order = pgdat->kswapd_max_order;
> >> > > > +                     new_classzone_idx = pgdat->classzone_idx;
> >> > > > +                     pgdat->kswapd_max_order =  0;
> >> > > > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> >> > > > +             }
> >> > > > +
> >> > >
> >> > > But in this part.
> >> > > Why do we need this?
> >> >
> >> > Lets say it's a fork-heavy workload and it is routinely being woken
> >> > for order-1 allocations and the highest zone is very small. For the
> >> > most part, it's ok because the allocations are being satisfied from
> >> > the lower zones which kswapd has no problem balancing.
> >> >
> >> > However, by reading the information even after failing to
> >> > balance, kswapd continues balancing for order-1 due to reading
> >> > pgdat->kswapd_max_order, each time failing for the highest zone. It
> >> > only takes one wakeup request per balance_pgdat() to keep kswapd
> >> > awake trying to balance the highest zone in a continual loop.
> >>
> >> You made balace_pgdat's classzone_idx as communicated back so classzone_idx returned
> >> would be not high zone and in [1/4], you changed that sleeping_prematurely consider only
> >> classzone_idx not nr_zones. So I think it should sleep if low zones is balanced.
> >>
> >
> > If a wakeup for order-1 happened during the last pgdat, the
> > classzone_idx as communicated back from balance_pgdat() is lost and it
> > will not sleep in this ordering of events
> >
> > kswapd                                                                  other processes
> > ======                                                                  ===============
> > order = balance_pgdat(pgdat, order, &classzone_idx);
> >                                                                        wakeup for order-1
> > kswapd balances lower zone
> >                                                                        allocate from lower zone
> > balance_pgdat fails balance for highest zone, returns
> >        with lower classzone_idx and possibly lower order
> > new_order = pgdat->kswapd_max_order      (order == 1)
> > new_classzone_idx = pgdat->classzone_idx (highest zone)
> > if (order < new_order || classzone_idx > new_classzone_idx) {
> >        order = new_order;
> >        classzone_idx = new_classzone_idx; (failure from balance_pgdat() lost)
> > }
> > order = balance_pgdat(pgdat, order, &classzone_idx);
> >
> > The wakup for order-1 at any point during balance_pgdat() is enough to
> > keep kswapd awake even though the process that called wakeup_kswapd
> > would be able to allocate from the lower zones without significant
> > difficulty.
> >
> > This is why if balance_pgdat() fails its request, it should go to sleep
> > if watermarks for the lower zones are met until woken by another
> > process.
> 
> Hmm.
> 
> The role of kswapd is to reclaim pages by background until all of zone
> meet HIGH_WMARK to prevent costly direct reclaim.(Of course, there is
> another reason like GFP_ATOMIC).

kswapd does not necessarily have to balance every zone to prevent direct
reclaim. Again, if the highest zone is small, it does not remain
balanced for very long because it's often the first choice for
allocating from. It gets used very quickly but direct reclaim does not
stall because there are the lower zones.

> So it's not wrong to consume many cpu
> usage by design unless other tasks are ready.

It wastes power while not making the system run any faster. It will
look odd to any user or administrator that is running top and generates
bug reports.

> It would be balanced or
> unreclaimable at last so it should end up. However, the problem is
> small part of highest zone is easily [set|reset] to be
> all_unreclaimabe so the situation could be forever like our example.
> So fundamental solution is to prevent it that all_unreclaimable is
> set/reset easily, I think.
> Unfortunately it have no idea now.

One way would be to have the allocator skip over it easily and
implement a placement policy that relocates only long-lived and very
old pages to the highest zone and then leave them there and have
kswapd ignore the zone. We don't have anything like this at the moment.

> In different viewpoint,  the problem is that it's too excessive
> because kswapd is just best-effort and if it got fails, we have next
> wakeup and even direct reclaim as last resort. In such POV, I think
> this patch is right and it would be a good solution. Then, other
> concern is on your reply about KOSAKI's question.
> 
> I think below your patch is needed.
> 
> Quote from
> "
> 1. Read for balance-request-A (order, classzone) pair
> 2. Fail balance_pgdat
> 3. Sleep based on (order, classzone) pair
> 4. Wake for balance-request-B (order, classzone) pair where
>   balance-request-B != balance-request-A
> 5. Succeed balance_pgdat
> 6. Compare order,classzone with balance-request-A which will treat
>   balance_pgdat() as fail and try go to sleep
> 
> This is not the same as new_classzone_idx being "garbage" but is it
> what you mean? If so, is this your proposed fix?
> 

That was the proposed fix but discussion died. I'll pick it up again
later and am keeping an eye out for any bugs that could be attributed to
it.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ