linux-kernel - RE: [PATCH 4/4] zone_reclaim


Message-ID: <4D05DB80B95B23498C72C700BD6C2E0B2EF6E29A@pdsmsx502.ccr.corp.intel.com>
Date:	Tue, 19 May 2009 11:38:26 +0800
From:	"Zhang, Yanmin" <yanmin.zhang@...el.com>
To:	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	"Wu, Fengguang" <fengguang.wu@...el.com>
CC:	LKML <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Christoph Lameter <cl@...ux-foundation.org>
Subject: RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default

>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@...fujitsu.com]
>>Sent: May 19, 2009 10:54
>>To: Wu, Fengguang
>>Cc: kosaki.motohiro@...fujitsu.com; LKML; linux-mm; Andrew Morton; Rik van
>>Riel; Christoph Lameter; Zhang, Yanmin
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>> >
>>> > Current linux policy is, if the machine has large remote node distance,
>>> >  zone_reclaim_mode is enabled by default because we've be able to assume

>>
>>OK, let me explain the zone reclaim design and its performance tendencies.
>>
>>Firstly, we can roughly classify the Linux ecosystem:
>> - HPC
>> - high-end server
>> - volume server
>> - desktop
>> - embedded
>>
>>These are distinguished mainly by their typical workloads.
>>
>>Secondly, zone_reclaim means "I dislike remote node access much more than
>>disk access".
>>That fits HPC workloads very well, because
>>  - an HPC workload typically runs as many processes (or threads) as there
>>    are cpus.
>>    IOW, the workload typically uses memory equally on each node.
>>  - an HPC workload is typically a CPU-bound job. CPU migration is rare.
>>  - an HPC workload is typically long lived (possibly >1 year).
>>    IOW, a remote node allocation leads to _very_ _very_ many remote node
>>    accesses.
>>
>>But zone_reclaim doesn't fit typical server workloads.
>>  - a server workload often uses a thread pool, and some threads sleep until
>>    a request is received.
>>    IOW, when a thread wakes up, it might move to another cpu.
>>    Node distance doesn't mean much for a workload with weak cpu locality.
>>
>>Plus, the disk cache is a file server's identity. We shouldn't treat it as
>>unimportant.
>>Plus, DB software can consume almost all system memory, and (in general) RDB
>>data is harder to split equally across nodes than HPC data.
>>
>>Desktop workloads are special. Desktop people can run various workloads beyond
>>our assumptions, so we shouldn't assume any particular workload for desktop
>>people.
>>However, AFAIK almost all desktop software uses memory as UMA.
>>
>>We don't need to care about embedded. It is typically UMA.
>>
>>
>>IOW, the benefit of zone reclaim depends on "strong cpu locality",
>>"the workload is cpu bound" and "threads are long lived".
>>But many workloads don't meet the above requirements. IOW, zone reclaim is
>>a workload-dependent feature (as Wu said).
>>
>>
>>In general, a workload-dependent feature doesn't fit as a default option.
>>We can't know what workload the end user runs anyway.
>>
>>Fortunately (or unfortunately), typical workload and machine size have been
>>strongly correlated.
>>Thus, the current default setting calculation worked well in the past.
[YM] Your analysis is clear and deep.
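
For reference, the "default setting calculation" mentioned above can be sketched roughly as follows. This is only an illustration, not the kernel code: the threshold of 20 is an assumption about RECLAIM_DISTANCE in kernels of this era, the helper names are hypothetical, and the distance rows are made-up values in the format of /sys/devices/system/node/nodeN/distance.

```python
# Rough sketch of the boot-time default: zone reclaim is turned on when any
# node is "far enough" from another node.
# ASSUMPTION: threshold of 20 (believed to match RECLAIM_DISTANCE at the time).
RECLAIM_DISTANCE = 20


def parse_distance_row(text):
    """Parse one sysfs distance row such as '10 21' into a list of ints."""
    return [int(tok) for tok in text.split()]


def would_enable_zone_reclaim(rows, threshold=RECLAIM_DISTANCE):
    """Return True if any node distance exceeds the threshold."""
    return any(d > threshold for row in rows for d in row)


# Made-up two-node examples, not measurements from any real machine.
small_distance = [parse_distance_row("10 20"), parse_distance_row("20 10")]
large_distance = [parse_distance_row("10 21"), parse_distance_row("21 10")]

print(would_enable_zone_reclaim(small_distance))
print(would_enable_zone_reclaim(large_distance))
```

The point of the sketch is that the decision looks only at the reported node distance, not at the workload, which is why it stops working once workload and machine size no longer go together.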

>>
>>Now it is broken. What should we do?
>>Yanmin, we know 99% of linux people use intel cpus, and you are one of the
>>most persistent testers on lkml and have done a lot of testing.
[YM] It's very easy to reproduce them on my machines. :) Sometimes that is because
the issues only exist on machines with lots of cpus, while other community
developers have no such environments.

>>May I ask which machines and benchmarks you tested?
[YM] Usually I run lots of benchmark tests against the latest kernel, but
this issue was first reported by a customer. The customer runs apache
on Nehalem machines to access lots of files, so the issue is an example of a
file server workload.

BTW, I found that many fio test cases show a big drop after I upgraded the BIOS
of one Nehalem machine. By checking vmstat data, I found that almost half of the
memory is always free. It's also related to zone_reclaim_mode, because the new
BIOS changes the node distance to a large value. I use numactl --interleave=all
to work around the problem temporarily.
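
A minimal sketch of how the "memory is always free" symptom could be checked per node, assuming the usual "Node N Field: value kB" layout of /sys/devices/system/node/nodeN/meminfo. The helper names are hypothetical and the sample text is invented for illustration; it is not data from the machines above.

```python
import re

# Matches lines such as "Node 1 MemFree:  5662310 kB".
_LINE = re.compile(r"Node\s+(\d+)\s+(\w+):\s+(\d+)\s*kB")


def parse_node_meminfo(text):
    """Return {field_name: kilobytes} for one node's meminfo text."""
    fields = {}
    for match in _LINE.finditer(text):
        fields[match.group(2)] = int(match.group(3))
    return fields


def free_fraction(fields):
    """Fraction of the node's memory that is currently free."""
    return fields["MemFree"] / fields["MemTotal"]


# Invented sample: a node left mostly unused while another node reclaims.
sample = """Node 1 MemTotal:        6291456 kB
Node 1 MemFree:         5662310 kB
Node 1 FilePages:        102400 kB
"""

info = parse_node_meminfo(sample)
print(f"node free fraction: {free_fraction(info):.2f}")
```

On a real system the text would be read from each node's meminfo file, and the current setting from /proc/sys/vm/zone_reclaim_mode. A node that stays mostly free while page cache is being dropped elsewhere is consistent with the behaviour described above.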

I have no HPC environment.

>>
>>If workloads that favor zone_reclaim=0 outnumber workloads that favor
>>zone_reclaim=1,
>> we can drop our fears and we would prioritize your opinion, of course.
So it seems only file servers have the issue currently.

Yanmin