Date:	Fri, 12 Nov 2010 17:20:21 +0100
From:	Peter Schüller <scode@...tify.com>
To:	linux-kernel@...r.kernel.org
Cc:	Mattias de Zalenski <zalenski@...tify.com>
Subject: Sudden and massive page cache eviction

Hello,

We have been seeing sudden and repeated evictions of huge amounts of
page cache on some of our servers for reasons that we cannot explain.
We are hoping that someone familiar with the vm subsystem may be able
to shed some light on the issue and perhaps confirm whether it is
plausibly a kernel bug or not. I will try to present the information
most-important-first, but this post will unavoidably be a bit long -
sorry.

First, here is a good example of the symptom (more graphs later on):

   http://files.spotify.com/memcut/b_daily_allcut.png

After looking into this we have seen similar incidents on servers
running completely different software; but in this particular case
this machine is running a service which is heavily dependent on the
buffer cache to deal with incoming request load. The direct effect of
these evictions is that we end up in complete I/O saturation (average
queue depth goes to 150-250 and stays there indefinitely, or until we
actively intervene, e.g. by warming up caches). Our interpretation is
that the eviction is not the result of something along the lines of a
large file being removed; given the effect on I/O load it is clear
that the data being evicted is in fact part of the active set used by
the service running on the machine.

The I/O load on these systems comes mainly from three things:

  (1) Seek-bound I/O generated by lookups in a BDB (b-tree traversal).
  (2) Seek-bound I/O generated by traversal of prefix directory trees
(i.e., 00/01/0001334234...., a poor man's b-tree on top of ext3).
  (3) Seek-bound I/O reading small segments of small-to-medium sized
files contained in the prefix tree.

The prefix tree consists of 8*2^16 directory entries in total, with
the individual files numbering in the tens of millions per server (a
rough sketch of the path layout follows).
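
To make item (2) above concrete: assuming the prefix directories are
derived from the leading characters of the object name, the mapping
looks roughly like this in Python (the depth, fan-out and the helper
name prefix_path are illustrative only, not the exact production
scheme):

    import os

    def prefix_path(root, name, levels=2, width=2):
        """Place a file named e.g. '0001334234...' under nested prefix
        directories derived from its leading characters, giving paths
        like <root>/00/01/0001334234... (a poor man's b-tree on top of
        ext3). Depth and fan-out here are illustrative guesses, not
        the exact production layout."""
        parts = [name[i * width:(i + 1) * width] for i in range(levels)]
        return os.path.join(root, *(parts + [name]))

    # Example (hypothetical object name):
    #   prefix_path("/var/store", "0001334234abcd")
    #   -> "/var/store/00/01/0001334234abcd"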

We initially ran 2.6.32-bpo.5-amd64 (Debian backports kernel) and have
subsequently upgraded some of them to 2.6.36-rc6-amd64 (Debian
experimental repo). While the newer kernel initially looked like it
was behaving better, it slowly reverted to making no difference
(possibly as a function of uptime, but we have not had the opportunity
to test this by rebooting some of the machines, so that remains an
untested hypothesis).

Most of the activity on this system (ignoring the usual stuff like
ssh/cron/syslog/etc) is coming from Python processes that consume
non-trivial amounts of heap space, plus the disk activity and some
POSIX shared memory caching utilized by the BDB library.

We have correlated the incidence of these page evictions with higher
load on the system; i.e., it tends to happen during high-load periods,
and in addition we tend to see additional machines having problems as
a result of us "fixing" a machine that experienced an eviction (we
have some limited cascading effects that cause slightly higher load
on other servers in the cluster when we do that).

We believe that for an application bug to plausibly trigger this
behavior, the application would have to (1) allocate the memory and
(2) actually touch the pages. We believe this to be unlikely in this
case because:

  (1) We see similar sudden evictions on various other servers, which
we noticed when we started looking for them.
  (2) The fact that it tends to be triggered in correlation with load
suggests that it is not a functional bug in the service as such, since
higher load is in this case unlikely to exercise any code paths that
do anything unique with respect to memory allocation, in particular
because the domain logic is all Python and none of it really deals
with data chunks.
  (3) Even if we did manage to allocate something in the Python heap,
we would have to be "lucky" (or unlucky) for Python to consistently be
able to munmap()/brk() the memory back down afterwards.

Some additional "sample" graphs showing a few occurrences of the problem:

   http://files.spotify.com/memcut/a_daily.png
   http://files.spotify.com/memcut/a_weekly.png
   http://files.spotify.com/memcut/b_daily_allcut.png
   http://files.spotify.com/memcut/c_monthly.png
   http://files.spotify.com/memcut/c_yearly.png
   http://files.spotify.com/memcut/d_monthly.png
   http://files.spotify.com/memcut/d_yearly.png
   http://files.spotify.com/memcut/a_monthly.png
   http://files.spotify.com/memcut/a_yearly.png
   http://files.spotify.com/memcut/c_daily.png
   http://files.spotify.com/memcut/c_weekly.png
   http://files.spotify.com/memcut/d_daily.png
   http://files.spotify.com/memcut/d_weekly.png

And here is an example from a server running only PostgreSQL (where a
sudden drop of gigabytes of page cache is hard to explain, because we
are not DROPping tables, we do not have multi-gigabyte WAL archives,
and we have no use case that would imply ftruncate() on table files):

   http://files.spotify.com/memcut/postgresql_weekly.png

As you can see it is not as significant there, but at least visually
it seems to be the same "type" of effect. We have seen similar behavior
on various machines, although depending on the service running it may
or may not be explainable by regular file removal.

Further, we have observed the kernel's unwillingness to retain data
in the page cache under interesting circumstances:

(1) a page cache eviction happens
(2) we warm up our BDB files by cat-ing them (simple but effective; a
rough Python equivalent is sketched after this list)
(3) within a matter of minutes, while there are still several GB of
truly free memory (not page cache), the files are evicted again (as
evidenced by re-cat-ing them a little while later)
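
For reference, the warm-up in step (2) is nothing more sophisticated
than reading the files front to back. A rough Python equivalent (the
chunk size and the optional fadvise hint are choices made for this
sketch, not something the production setup depends on):

    import os

    CHUNK = 1 << 20  # read in 1 MiB chunks

    def warm_file(path):
        """Pull a file into the page cache by reading it sequentially,
        roughly what `cat file > /dev/null` does."""
        fd = os.open(path, os.O_RDONLY)
        try:
            # Optional hint that the whole file will be needed soon
            # (os.posix_fadvise is available on Python 3.3+ on Linux).
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            while os.read(fd, CHUNK):
                pass
        finally:
            os.close(fd)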

We understand that this last observation may be due to NUMA-related
allocation issues, and we should probably try using numactl to ask for
a more even allocation across nodes (one possible invocation is
sketched below). We have not yet tried this. However, it is not clear
how any issue of that kind would cause sudden eviction of data already
*in* the page cache (on whichever node).
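
If we do experiment with NUMA placement, one possible invocation
(assuming it is acceptable to interleave the service's allocations
round-robin across all nodes) would be along the lines of:

    numactl --interleave=all <service command>

but again, this is untested on our side.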

-- 
/ Peter Schuller aka scode
