Message-ID: <4E5F1B25.8040800@profihost.ag>
Date: Thu, 01 Sep 2011 07:41:57 +0200
From: Stefan Priebe - Profihost AG <s.priebe@...fihost.ag>
To: Wu Fengguang <fengguang.wu@...el.com>
CC: Zhu Yanhai <zhu.yanhai@...il.com>,
Pekka Enberg <penberg@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mel@....ul.ie>, Jens Axboe <jaxboe@...ionio.com>,
Linux Netdev List <netdev@...r.kernel.org>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Subject: Re: slow performance on disk/network i/o full speed after drop_caches
Thanks!
On 01.09.2011 06:14, Wu Fengguang wrote:
> Hi Stefan,
>
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
>> Hi Fengguang,
>> Hi Yanhai,
>>
>>> you're absolutely correct, zone_reclaim_mode is on - but why?
>>> There must be some linux software which switches it on.
>>>
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> also
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> tells us nothing.
>>>
>>> I've then read this:
>>>
>>> "zone_reclaim_mode is set during bootup to 1 if it is determined that
>>> pages from remote zones will cause a measurable performance reduction.
>>> The page allocator will then reclaim easily reusable pages (those page
>>> cache pages that are currently not used) before allocating off node pages."
>>>
>>> Why does the kernel do that in our case, on these machines?
>>
>> Can nobody explain why the kernel sets it to 1 in this case?
>
> It's determined by RECLAIM_DISTANCE.
>
> build_zonelists():
>
>         /*
>          * If another node is sufficiently far away then it is better
>          * to reclaim pages in a zone before going off node.
>          */
>         if (distance > RECLAIM_DISTANCE)
>                 zone_reclaim_mode = 1;
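
A minimal userspace sketch of the check quoted above, to make the effect of
the threshold concrete. This is not the kernel code: the function name is
hypothetical, and passing the threshold as a parameter is an assumption made
for illustration (in the kernel RECLAIM_DISTANCE is a compile-time constant
and the distances come from node_distance()).

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function): returns 1 if any node
 * distance in the array exceeds the threshold, mirroring the
 * "distance > RECLAIM_DISTANCE" test quoted above, and 0 otherwise.
 */
int zone_reclaim_would_be_enabled(const int *distances, size_t n,
                                  int reclaim_distance)
{
    for (size_t i = 0; i < n; i++) {
        if (distances[i] > reclaim_distance)
            return 1;
    }
    return 0;
}
```

For a two-node machine reporting distances {10, 21}, this returns 1 with a
threshold of 20 and 0 with a threshold of 30.
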
>
> As of Linux v3.0, RECLAIM_DISTANCE has been raised from 20 to 30 by this commit.
> It may well help your case, too.
>
> commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
> Author: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
> Date: Wed Jun 15 15:08:20 2011 -0700
>
> mm: increase RECLAIM_DISTANCE to 30
>
> Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
> that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
> Xeon E5520 + Intel S5520UR MB). He is using Cyrus IMAPd and it's built on
> a very traditional single-process model.
>
> * a master process which reads config files and manages the other
> processes
> * multiple imapd processes, one per connection
> * multiple pop3d processes, one per connection
> * multiple lmtpd processes, one per connection
> * periodical "cleanup" processes.
>
> There are thousands of independent processes. The problem is that recent
> Intel motherboards turn on zone_reclaim_mode by default, and traditional
> prefork-model software doesn't work well with it. Unfortunately, such models
> are still typical even in the 21st century. We can't ignore them.
>
> This patch raises the zone_reclaim_mode threshold to 30. The value 30 has no
> specific meaning, but 20 corresponds to one-hop QPI/HyperTransport, and
> such relatively cheap 2-4 socket machines are often used for traditional
> servers as above. The intention is that these machines don't use
> zone_reclaim_mode.
>
> Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
> This patch doesn't change such high-end NUMA machine behavior.
>
> Dave Hansen said:
>
> : I know specifically of pieces of x86 hardware that set the information
> : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> : behavior which that implies.
> :
> : They've done performance testing and run very large and scary benchmarks
> : to make sure that they _want_ this turned on. What this means for them
> : is that they'll probably be de-optimized, at least on newer versions of
> : the kernel.
> :
> : If you want to do this for particular systems, maybe _that_'s what we
> : should do. Have a list of specific configurations that need the
> : defaults overridden either because they're buggy, or they have an
> : unusual hardware configuration not really reflected in the distance
> : table.
>
> And later said:
>
> : The original change in the hardware tables was for the benefit of a
> : benchmark. Said benchmark isn't going to get run on mainline until the
> : next batch of enterprise distros drops, at which point the hardware where
> : this was done will be irrelevant for the benchmark. I'm sure any new
> : hardware will just set this distance to another yet arbitrary value to
> : make the kernel do what it wants. :)
> :
> : Also, when the hardware got _set_ to this initially, I complained. So, I
> : guess I'm getting my way now, with this patch. I'm cool with it.
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index b91a40e..fc839bf 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
> * (in whatever arch specific measurement units returned by node_distance())
> * then switch on zone reclaim on boot.
> */
> -#define RECLAIM_DISTANCE 20
> +#define RECLAIM_DISTANCE 30
> #endif
> #ifndef PENALTY_FOR_NODE_WITH_CPUS
> #define PENALTY_FOR_NODE_WITH_CPUS (1)
>
> Thanks,
> Fengguang
>
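
The grep over /etc/sysctl.* only shows configured values. The value the
running kernel is actually using is exposed at /proc/sys/vm/zone_reclaim_mode
on NUMA-enabled kernels. A small sketch of reading it from C follows; both
function names are hypothetical helpers introduced for illustration.

```c
#include <stdio.h>

/*
 * Hypothetical helper: reads one integer from a procfs/sysfs-style file.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * file cannot be opened or does not start with an integer.
 */
int read_int_file(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Prints the current setting, or a note if it cannot be read. */
void print_zone_reclaim_mode(void)
{
    int mode;
    if (read_int_file("/proc/sys/vm/zone_reclaim_mode", &mode) == 0)
        printf("vm.zone_reclaim_mode = %d\n", mode);
    else
        printf("zone_reclaim_mode not readable on this system\n");
}
```

A non-zero value here with nothing set in /etc/sysctl.* is consistent with
the kernel having switched it on at boot, as described above.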