lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110901041458.GA30123@localhost>
Date:	Thu, 1 Sep 2011 12:14:58 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Stefan Priebe - Profihost AG <s.priebe@...fihost.ag>
Cc:	Zhu Yanhai <zhu.yanhai@...il.com>,
	Pekka Enberg <penberg@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mel@....ul.ie>, Jens Axboe <jaxboe@...ionio.com>,
	Linux Netdev List <netdev@...r.kernel.org>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Subject: Re: slow performance on disk/network i/o full speed after
 drop_caches

Hi Stefan,

On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
> Hi Fengguang,
> Hi Yanhai,
> 
> > you're abssolutely corect zone_reclaim_mode is on - but why?
> > There must be some linux software which switches it on.
> >
> > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > ~#
> >
> > also
> > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > ~#
> >
> > tells us nothing.
> >
> > I've then read this:
> >
> > "zone_reclaim_mode is set during bootup to 1 if it is determined that
> > pages from remote zones will cause a measurable performance reduction.
> > The page allocator will then reclaim easily reusable pages (those page
> > cache pages that are currently not used) before allocating off node pages."
> >
> > Why does the kernel do that here in our case on these machines.
> 
> Can nobody help why the kernel in this case set it to 1?

It's determined by RECLAIM_DISTANCE.

build_zonelists():

                /*
                 * If another node is sufficiently far away then it is better
                 * to reclaim pages in a zone before going off node.
                 */
                if (distance > RECLAIM_DISTANCE)
                        zone_reclaim_mode = 1;

Since Linux v3.0 RECLAIM_DISTANCE is increased from 20 to 30 by this commit.
It may well help your case, too.

commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
Author: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Date:   Wed Jun 15 15:08:20 2011 -0700

    mm: increase RECLAIM_DISTANCE to 30
    
    Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
    that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
    Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
    a very traditional single-process model.
    
      * a master process which reads config files and manages the other
        process
      * multiple imapd processes, one per connection
      * multiple pop3d processes, one per connection
      * multiple lmtpd processes, one per connection
      * periodical "cleanup" processes.
    
    There are thousands of independent processes.  The problem is, recent
    Intel motherboard turn on zone_reclaim_mode by default and traditional
    prefork model software don't work well on it.  Unfortunatelly, such models
    are still typical even in the 21st century.  We can't ignore them.
    
    This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
    any specific meaning.  but 20 means that one-hop QPI/Hypertransport and
    such relatively cheap 2-4 socket machine are often used for traditional
    servers as above.  The intention is that these machines don't use
    zone_reclaim_mode.
    
    Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
    This patch doesn't change such high-end NUMA machine behavior.
    
    Dave Hansen said:
    
    : I know specifically of pieces of x86 hardware that set the information
    : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
    : behavior which that implies.
    :
    : They've done performance testing and run very large and scary benchmarks
    : to make sure that they _want_ this turned on.  What this means for them
    : is that they'll probably be de-optimized, at least on newer versions of
    : the kernel.
    :
    : If you want to do this for particular systems, maybe _that_'s what we
    : should do.  Have a list of specific configurations that need the
    : defaults overridden either because they're buggy, or they have an
    : unusual hardware configuration not really reflected in the distance
    : table.

    And later said:
    
    : The original change in the hardware tables was for the benefit of a
    : benchmark.  Said benchmark isn't going to get run on mainline until the
    : next batch of enterprise distros drops, at which point the hardware where
    : this was done will be irrelevant for the benchmark.  I'm sure any new
    : hardware will just set this distance to another yet arbitrary value to
    : make the kernel do what it wants.  :)
    :
    : Also, when the hardware got _set_ to this initially, I complained.  So, I
    : guess I'm getting my way now, with this patch.  I'm cool with it.

diff --git a/include/linux/topology.h b/include/linux/topology.h
index b91a40e..fc839bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
  * (in whatever arch specific measurement units returned by node_distance())
  * then switch on zone reclaim on boot.
  */
-#define RECLAIM_DISTANCE 20
+#define RECLAIM_DISTANCE 30
 #endif
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS     (1)

Thanks,
Fengguang

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ