linux-kernel - Re: [PATCH 0/2] Disable zone_reclaim

Message-ID: <alpine.DEB.2.10.1404081752390.16708@nuc>
Date:	Tue, 8 Apr 2014 17:58:21 -0500 (CDT)
From:	Christoph Lameter <cl@...ux.com>
To:	Robert Haas <robertmhaas@...il.com>
cc:	Vlastimil Babka <vbabka@...e.cz>, Mel Gorman <mgorman@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Josh Berkus <josh@...iodbs.com>,
	Andres Freund <andres@...quadrant.com>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>, sivanich@....com
Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default

On Tue, 8 Apr 2014, Robert Haas wrote:

> Well, as Josh quite rightly said, the hit from accessing remote memory
> is never going to be as large as the hit from disk.  If and when there
> is a machine where remote memory is more expensive to access than
> disk, that's a good argument for zone_reclaim_mode.  But I don't
> believe that's anywhere close to being true today, even on an 8-socket
> machine with an SSD.

I am not sure how disk figures into this?

The tradeoff is zone reclaim vs. the aggregate performance
degradation of the remote memory accesses. That depends on the
cacheability of the app and the scale of memory accesses.
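
As a minimal sketch of that tradeoff, assume a one-time cost for reclaiming
a local page and a fixed extra latency for every remote access. The function
names and all numbers below are hypothetical and only illustrate the
arithmetic; real costs depend on the hardware and on how much of the working
set stays in the CPU caches.

```python
def break_even_accesses(reclaim_cost_ns, local_ns, remote_ns):
    """Number of uncached accesses at which a one-time local reclaim
    costs the same as the accumulated remote-access penalty."""
    penalty_ns = remote_ns - local_ns  # extra latency per remote access
    if penalty_ns <= 0:
        # Remote memory is no slower than local: reclaim never pays off.
        return float("inf")
    return reclaim_cost_ns / penalty_ns


def zone_reclaim_pays_off(accesses, reclaim_cost_ns, local_ns, remote_ns):
    """True if the expected number of accesses exceeds the break-even point."""
    return accesses > break_even_accesses(reclaim_cost_ns, local_ns, remote_ns)


if __name__ == "__main__":
    # Hypothetical figures: 10 us to reclaim, 100 ns local, 150 ns remote.
    print(break_even_accesses(10_000, 100, 150))
    print(zone_reclaim_pays_off(500, 10_000, 100, 150))
```

In this model a larger remote penalty (as on large-scale NUMA machines) lowers
the break-even point, while a cache-friendly application rarely reaches it.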

The reason that zone reclaim is on by default is that off-node accesses
are a big performance hit on large-scale NUMA systems (like ScaleMP and
SGI). Zone reclaim was written *because* those systems experienced severe
performance degradation.

On the tightly coupled 4 and 8 node systems there does not seem to
be a benefit from what I hear.
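
To see which setting a given machine is running with, the current value can
be read from /proc/sys/vm/zone_reclaim_mode and decoded. The sketch below
assumes the bit meanings given in the kernel's vm sysctl documentation
(1 = zone reclaim on, 2 = write out dirty pages, 4 = swap pages); the helper
names are made up for illustration.

```python
from pathlib import Path

# Bit meanings as described in the kernel's vm sysctl documentation.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value):
    """Return the list of behaviours enabled by a zone_reclaim_mode value."""
    return [label for bit, label in ZONE_RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read the current setting, or return None if the file is absent
    (for example on a non-Linux system or a kernel built without NUMA)."""
    sysctl = Path(path)
    if not sysctl.exists():
        return None
    return int(sysctl.read_text().strip())


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("zone_reclaim_mode is not available on this system")
    else:
        print(mode, decode_zone_reclaim_mode(mode) or ["zone reclaim off"])
```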

> Now, perhaps the fear is that if we access that remote memory
> *repeatedly* the aggregate cost will exceed what it would have cost to
> fault that page into the local node just once.  But it takes a lot of
> accesses for that to be true, and most of the time you won't get them.
>  Even if you do, I bet many workloads will prefer even performance
> across all the accesses over a very slow first access followed by
> slightly faster subsequent accesses.

Many HPC workloads prefer the opposite.

> In an ideal world, the kernel would put the hottest pages on the local
> node and the less-hot pages on remote nodes, moving pages around as
> the workload shifts.  In practice, that's probably pretty hard.
> Fortunately, it's not nearly as important as making sure we don't
> unnecessarily hit the disk, which is infinitely slower than any memory
> bank.

Shifting pages involves tradeoffs similar to those of zone reclaim vs.
remote allocations.