Date:	Thu, 01 Sep 2011 07:41:57 +0200
From:	Stefan Priebe - Profihost AG <s.priebe@...fihost.ag>
To:	Wu Fengguang <fengguang.wu@...el.com>
CC:	Zhu Yanhai <zhu.yanhai@...il.com>,
	Pekka Enberg <penberg@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mel@....ul.ie>, Jens Axboe <jaxboe@...ionio.com>,
	Linux Netdev List <netdev@...r.kernel.org>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Subject: Re: slow performance on disk/network i/o full speed after drop_caches

Thanks!

On 01.09.2011 06:14, Wu Fengguang wrote:
> Hi Stefan,
>
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
>> Hi Fengguang,
>> Hi Yanhai,
>>
>>> you're absolutely correct, zone_reclaim_mode is on - but why?
>>> There must be some Linux software which switches it on.
>>>
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> also
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> tells us nothing.
>>>
>>> I've then read this:
>>>
>>> "zone_reclaim_mode is set during bootup to 1 if it is determined that
>>> pages from remote zones will cause a measurable performance reduction.
>>> The page allocator will then reclaim easily reusable pages (those page
>>> cache pages that are currently not used) before allocating off node pages."
>>>
>>> Why does the kernel do that here, in our case, on these machines?
>>
>> Can nobody explain why the kernel set it to 1 in this case?
>
> It's determined by RECLAIM_DISTANCE.
>
> build_zonelists():
>
>                  /*
>                   * If another node is sufficiently far away then it is better
>                   * to reclaim pages in a zone before going off node.
>                   */
>                  if (distance > RECLAIM_DISTANCE)
>                          zone_reclaim_mode = 1;
>
> As of Linux v3.0, RECLAIM_DISTANCE has been raised from 20 to 30 by the
> commit below. It may well help your case, too.
>
> commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
> Author: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
> Date:   Wed Jun 15 15:08:20 2011 -0700
>
>      mm: increase RECLAIM_DISTANCE to 30
>
>      Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
>      that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
>      Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
>      a very traditional single-process model.
>
>        * a master process which reads config files and manages the other
>          process
>        * multiple imapd processes, one per connection
>        * multiple pop3d processes, one per connection
>        * multiple lmtpd processes, one per connection
>        * periodical "cleanup" processes.
>
>      There are thousands of independent processes.  The problem is, recent
>      Intel motherboard turn on zone_reclaim_mode by default and traditional
>      prefork model software don't work well on it.  Unfortunately, such models
>      are still typical even in the 21st century.  We can't ignore them.
>
>      This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
>      any specific meaning.  but 20 means that one-hop QPI/Hypertransport and
>      such relatively cheap 2-4 socket machine are often used for traditional
>      servers as above.  The intention is that these machines don't use
>      zone_reclaim_mode.
>
>      Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
>      This patch doesn't change such high-end NUMA machine behavior.
>
>      Dave Hansen said:
>
>      : I know specifically of pieces of x86 hardware that set the information
>      : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
>      : behavior which that implies.
>      :
>      : They've done performance testing and run very large and scary benchmarks
>      : to make sure that they _want_ this turned on.  What this means for them
>      : is that they'll probably be de-optimized, at least on newer versions of
>      : the kernel.
>      :
>      : If you want to do this for particular systems, maybe _that_'s what we
>      : should do.  Have a list of specific configurations that need the
>      : defaults overridden either because they're buggy, or they have an
>      : unusual hardware configuration not really reflected in the distance
>      : table.
>
>      And later said:
>
>      : The original change in the hardware tables was for the benefit of a
>      : benchmark.  Said benchmark isn't going to get run on mainline until the
>      : next batch of enterprise distros drops, at which point the hardware where
>      : this was done will be irrelevant for the benchmark.  I'm sure any new
>      : hardware will just set this distance to another yet arbitrary value to
>      : make the kernel do what it wants.  :)
>      :
>      : Also, when the hardware got _set_ to this initially, I complained.  So, I
>      : guess I'm getting my way now, with this patch.  I'm cool with it.
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index b91a40e..fc839bf 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
>    * (in whatever arch specific measurement units returned by node_distance())
>    * then switch on zone reclaim on boot.
>    */
> -#define RECLAIM_DISTANCE 20
> +#define RECLAIM_DISTANCE 30
>   #endif
>   #ifndef PENALTY_FOR_NODE_WITH_CPUS
>   #define PENALTY_FOR_NODE_WITH_CPUS     (1)
>
> Thanks,
> Fengguang
>
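
For anyone wanting to see which side of the threshold a machine falls on, the build_zonelists() check quoted above can be approximated from userspace. This is only a rough sketch, not what the kernel does internally: it assumes the node distance table is exposed under /sys/devices/system/node/node*/distance and that comparing every reported distance against the threshold is close enough to the kernel's per-node walk. The helper names (parse_distances, reclaim_would_be_enabled, read_distance_rows) are made up for illustration.

```python
import glob
import os


def parse_distances(text):
    """Parse one sysfs 'distance' line such as '10 21' into a list of ints."""
    return [int(tok) for tok in text.split()]


def reclaim_would_be_enabled(distance_rows, reclaim_distance):
    """Approximate the quoted check: is any node distance > RECLAIM_DISTANCE?

    Assumption: looking at all node pairs at once is a simplification of
    the kernel's per-node zonelist construction.
    """
    return any(d > reclaim_distance for row in distance_rows for d in row)


def read_distance_rows(sysfs_root="/sys/devices/system/node"):
    """Read one row of distances per NUMA node from sysfs (may be empty)."""
    rows = []
    pattern = os.path.join(sysfs_root, "node[0-9]*", "distance")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            rows.append(parse_distances(f.read()))
    return rows


if __name__ == "__main__":
    rows = read_distance_rows()
    print("node distances:", rows)
    # 20 is the threshold before the commit above, 30 the one after it.
    for threshold in (20, 30):
        print("RECLAIM_DISTANCE", threshold, "->",
              reclaim_would_be_enabled(rows, threshold))
    try:
        with open("/proc/sys/vm/zone_reclaim_mode") as f:
            print("current zone_reclaim_mode:", f.read().strip())
    except OSError:
        print("zone_reclaim_mode not readable on this system")
```

With a two-node table where the remote distance is 21 (the BIOS value Dave Hansen mentions), the check is true against the old threshold of 20 and false against 30, which matches the intent described in the commit message.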