netdev - Re: [PATCH 00/28] Swap over NFS -v16

Message-ID: <18384.46967.583615.711455@notabene.brown>
Date:	Fri, 7 Mar 2008 14:33:11 +1100
From:	Neil Brown <neilb@...e.de>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	netdev@...r.kernel.org, trond.myklebust@....uio.no
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Tuesday March 4, a.p.zijlstra@...llo.nl wrote:
> 
> On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> > 
> > Those skbs we allocated - they are either sitting in the fragment
> > cache, or have been attached to a SK_MEMALLOC socket, or have been
> > freed - correct?  If so, then there is already a limit to how much
> > memory they can consume.
> 
> Not really, there is no natural limit to the amount of packets that can
> be in transit between RX and socket demux. So we need the extra (skb)
> accounting to impose that.

Isn't there?  A brief look at the code suggests that (except for
fragment handling) there is a fairly straight path from
network-receive to socket demux.  No queues along the way.
That suggests the number of in-transit skbs should be limited by the
number of CPUs.  Did I miss something?  Or is the number of CPUs
potentially too large to be a suitable limit (seems unlikely)?

While looking at the code it also occurred to me that:
  1/ tcpdump could be sent incoming packets.  Is there a limit
     to the number of packets that can be in-kernel waiting for
     tcpdump to collect them?  Should this limit be added to the base
     reserve?
  2/ If the host is routing network packets, then incoming packets
     might go on an outbound queue.  Is this space limited, and is it
     included in the reserve?

Not major points, but I thought I would mention them.

> > I also don't see the value of tracking pages to see if they are
> > 'reserve' pages or not.  The decision to drop an skb that is not for
> > an SK_MEMALLOC socket should be based on whether we are currently
> > short on memory.  Not whether we were short on memory when the skb was
> > allocated.
> 
> That comes from accounting, once you need to account data you need to
> know when to start accounting, and keep state so that you can properly
> un-account.
> 

skbs are the main (only?) thing you do accounting on, so focusing on
those:

Suppose that every time you allocate memory for an skb, you check
if the allocation had to dip into emergency reserves, and account the
memory if so - releasing the memory and dropping the packet if we are
over the limit.
And any time you free memory associated with an skb, you check if the
accounts currently say '0', and if not subtract the size of the
allocation from the accounts.

Then you have quite workable accounting that doesn't need to tag every
piece of memory with its 'reserve' status, and only pays the
accounting cost (presumably a spinlock) when running out of memory, or
just recovering.

This more relaxed approach to accounting reserved vs non-reserved
memory has a strong parallel in your slub code (which I now
understand).  When sl[au]b first gets a ->reserve page, it sets the
->reserve flag on the memcache and leaves it set until it sometime
later gets a non-"->reserve" page.  Any memory freed in the mean time
(whether originally reserved or not) is treated as reserve memory in
that it will only be returned for ALLOC_NO_WATERMARKS allocations.
I think this is a good way of working with reserved memory.  It isn't
precise, but it is low-cost and good enough to get you through the
difficult patch.

Your netvm-skbuff-reserve.patch has some code to make sure that all
the allocations in an skb have the same 'reserve' status.   I don't
think that is needed and just makes the code messy - plus it requires
the 'overcommit' flag to mem_reserve_kmalloc_charge which is a bit of
a wart on the interface.

I would suggest getting rid of that.  Just flag the whole skb if any
part gets a 'reserve' allocation, and use that flag to decide to drop
packets arriving at non-SK_MEMALLOC sockets.



So: I think I now really understand what your code is doing, so I will
try to explain it in terms that even I understand... This text is
explicitly available under GPLv2 in case you want it.

It actually describes something a bit different to what your code
currently does, but I think it is very close to the spirit.  Some
differences follow from my observations above.  Others arose because the
way that seemed to make sense while describing the problem and solution
differed slightly from what I saw the code doing.  Obviously the code
and the description should be aligned one way or another before being
finalised.  
The description is a bit long ... sorry about that.  But I wanted to
make sure I included motivation and various assumptions.  Some of my
understanding may well be wrong, but I present it here anyway.  It is
easier for you to correct if it is clearly visible:-)

Problem:
   When Linux needs to allocate memory it may find that there is
   insufficient free memory so it needs to reclaim space that is in
   use but not needed at the moment.  There are several options:

   1/ Shrink a kernel cache such as the inode or dentry cache.  This
      is fairly easy but provides limited returns.
   2/ Discard 'clean' pages from the page cache.  This is easy, and
      works well as long as there are clean pages in the page cache.
      Similarly clean 'anonymous' pages can be discarded - if there
      are any.
   3/ Write out some dirty page-cache pages so that they become clean.
      The VM limits the number of dirty page-cache pages to e.g. 40%
      of available memory so that (among other reasons) a "sync" will
      not take excessively long.  So there should never be excessive
      amounts of dirty pagecache.
      Writing out dirty page-cache pages involves work by the
      filesystem which may need to allocate memory itself.  To avoid
      deadlock, filesystems use GFP_NOFS when allocating memory on the
      write-out path.  When this is used, cleaning dirty page-cache
      pages is not an option so if the filesystem finds that memory
      is tight, another option must be found.
   4/ Write out dirty anonymous pages to the "Swap" partition/file.
      This is the most interesting for a couple of reasons.
      a/ Unlike dirty page-cache pages, there is no need to write anon
         pages out unless we are actually short of memory.  Thus they
         tend to be left to last.
      b/ Anon pages tend to be updated randomly and unpredictably, and
         flushing them out of memory can have a very significant
         performance impact on the process using them.  This contrasts
         with page-cache pages which are often written sequentially
         and often treated as "write-once, read-many".
      So anon pages tend to be left until last to be cleaned, and may
      be the only cleanable pages while there are still some dirty
      page-cache pages (which are waiting on a GFP_NOFS allocation).

[I don't find the above wholly satisfying.  There seems to be too much
 hand-waving.  If someone can provide better text explaining why
 swapout is a special case, that would be great.]

So we need to be able to write to the swap file/partition without
needing to allocate any memory ... or only a small well controlled
amount.

The VM reserves a small amount of memory that can only be allocated
for use as part of the swap-out procedure.  It is only available to
processes with the PF_MEMALLOC flag set, which is typically just the
memory cleaner.

Traditionally swap-out is performed directly to block devices (swap
files on block-device filesystems are supported by examining the
mapping from file offset to device offset in advance, and then using
the device offsets to write directly to the device).  Block devices
are (required to be) written to pre-allocate any memory that might be
needed during write-out, and to block when the pre-allocated memory is
exhausted and no other memory is available.  They can be sure not to
block forever as the pre-allocated memory will be returned as soon as
the data it is being used for has been written out.  The primary
mechanism for pre-allocating memory is called "mempools".
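
As a rough user-space illustration of the mempool idea just described
(this is not the kernel's mempool API; every toy_* name and the
memory_is_tight parameter are invented for the sketch): a fixed stock of
elements is allocated up front, used only when the normal allocator
cannot help, and refilled first when memory is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_MIN 4              /* elements the pool guarantees */

/* Toy mempool: a small stock of pre-allocated elements held in reserve. */
struct toy_mempool {
    void  *spare[TOY_POOL_MIN];     /* pre-allocated elements not in use */
    int    nr_spare;                /* number of valid entries in spare[] */
    size_t elem_size;
};

/* Allocate the whole reserve up front.  Returns 0 on success, -1 on failure. */
int toy_mempool_init(struct toy_mempool *p, size_t elem_size)
{
    p->nr_spare = 0;
    p->elem_size = elem_size;
    while (p->nr_spare < TOY_POOL_MIN) {
        void *e = malloc(elem_size);
        if (e == NULL) {
            while (p->nr_spare > 0)
                free(p->spare[--p->nr_spare]);
            return -1;
        }
        p->spare[p->nr_spare++] = e;
    }
    return 0;
}

/*
 * Try the ordinary allocator first; memory_is_tight simulates that
 * allocator failing.  Fall back to the reserve.  NULL means the reserve
 * is exhausted too: a real block driver would sleep here until an
 * in-flight write completes and returns its element.
 */
void *toy_mempool_alloc(struct toy_mempool *p, int memory_is_tight)
{
    if (!memory_is_tight) {
        void *e = malloc(p->elem_size);
        if (e != NULL)
            return e;
    }
    if (p->nr_spare > 0)
        return p->spare[--p->nr_spare];
    return NULL;
}

/* Refill the reserve first; only surplus goes back to the allocator. */
void toy_mempool_free(struct toy_mempool *p, void *e)
{
    assert(e != NULL);
    if (p->nr_spare < TOY_POOL_MIN)
        p->spare[p->nr_spare++] = e;
    else
        free(e);
}
```

The point of the sketch is the last function: because a completed write
always hands its element back to the pool before anything else, a
writer that found the pool empty is guaranteed to make progress
eventually.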

This approach does not work for writing anonymous pages
(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.


The main reason that it does not work is that when data from an anon
page is written to the network, we must wait for a reply to confirm
the data is safe.  Receiving that reply will consume memory and,
significantly, we need to allocate memory to an incoming packet before
we can tell if it is the reply we are waiting for or not.

The secondary reason is that the network code is not written to use
mempools and in most cases does not need to use them.  Changing all
allocations in the networking layer to use mempools would be quite
intrusive, and would waste memory, and probably cause a slow-down in
the common case of not swapping over the network.

These problems are addressed by enhancing the system of memory
reserves used by PF_MEMALLOC and requiring any in-kernel networking
client that is used for swap-out to indicate which sockets are used
for swapout so they can be handled specially in low memory situations.

There are several major parts to this enhancement:

1/ PG_emergency, GFP_MEMALLOC

  To handle low memory conditions we need to know when those
  conditions exist.  Having a global "low on memory" flag seems easy,
  but its implementation is problematic.  Instead we make it possible
  to tell if a recent memory allocation required use of the emergency
  memory pool.
  For pages returned by alloc_page, the new page flag PG_emergency
  can be tested.  If this is set, then a low memory condition was
  current when the page was allocated, so the memory should be used
  carefully.

  For memory allocated using slab/slub: If a page that is added to a
  kmem_cache is found to have PG_emergency set, then a ->reserve
  flag is set for the whole kmem_cache.  Further allocations will only
  be returned from that page (or any other page in the cache) if they
  are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
  Non-emergency allocations will block in alloc_page until a
  non-reserve page is available.  Once a non-reserve page has been
  added to the cache, the ->reserve flag on the cache is removed.
  When memory is returned by slab/slub, PG_emergency is set on the page
  holding the memory to match the ->reserve flag on that cache.

  After memory has been returned by kmem_cache_alloc or kmalloc, the
  page's PG_emergency flag can be tested.  If it is set, then the most
  recent allocation from that cache required reserve memory, so this
  allocation should be used with care.

  It is not safe to test the cache's ->reserve flag immediately after
  an allocation as that flag is in per-cpu data, and the process could
  have been rescheduled to a different cpu if preemption is enabled.
  Thus the use of PG_emergency to carry this information.

  This allows us to
   a/ request use of the emergency pool when allocating memory
     (GFP_MEMALLOC), and 
   b/ to find out if the emergency pool was used.
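
  The interplay between the page flag and the per-cache flag can be
  sketched in a few lines of user-space C.  This is only a model of the
  behaviour described above, assuming a cache backed by a single page;
  the toy_* types and functions are made up and are not the slab/slub
  interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool pg_emergency;        /* stands in for the PG_emergency page flag */
};

struct toy_cache {
    bool reserve;             /* stands in for the cache's ->reserve flag */
    struct toy_page *page;    /* the one page backing this toy cache */
};

/* A new backing page arrives: the cache inherits its emergency status. */
void toy_cache_add_page(struct toy_cache *c, struct toy_page *pg)
{
    assert(pg != NULL);
    c->page = pg;
    c->reserve = pg->pg_emergency;
}

/*
 * Hand out an object from the cache.  Returns the page the object lives
 * on, or NULL if the request is refused.  emergency_ctx models
 * "PF_MEMALLOC or GFP_MEMALLOC is set".
 */
struct toy_page *toy_cache_alloc(struct toy_cache *c, bool emergency_ctx)
{
    if (c->page == NULL)
        return NULL;
    if (c->reserve && !emergency_ctx)
        return NULL;          /* real code would wait in alloc_page instead */
    /* Stamp the page so the caller can test it without touching the cache. */
    c->page->pg_emergency = c->reserve;
    return c->page;
}
```

  A caller then tests the returned page's flag, not the cache's flag,
  to decide whether the memory must be treated with care.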

2/ SK_MEMALLOC, sk_buff->emergency.

  When memory from the reserve is used to store incoming network
  packets, the memory must be freed (and the packet dropped) as soon
  as we find out that the packet is not for a socket that is used for
  swap-out. 
  To achieve this we have an ->emergency flag for skbs, and an
  SK_MEMALLOC flag for sockets.
  When memory is allocated for an skb, it is allocated with
  GFP_MEMALLOC (if we are currently swapping over the network at
  all).  If a subsequent test shows that the emergency pool was used,
  ->emergency is set.
  When the skb is finally attached to its destination socket, the
  SK_MEMALLOC flag on the socket is tested.  If the skb has
  ->emergency set, but the socket does not have SK_MEMALLOC set, then
  the skb is immediately freed and the packet is dropped.
  This ensures that reserve memory is never queued on a socket that is
  not used for swapout.

  Similarly, if an skb is ever queued for delivery to user-space, for
  example by netfilter, the ->emergency flag is tested and the skb is
  released if ->emergency is set.

  This ensures that memory from the emergency reserve can be used to
  allow swapout to proceed, but will not get caught up in any other
  network queue.
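
  A minimal model of that drop decision, assuming (as suggested earlier
  in this mail) that the whole skb is flagged as soon as any one of its
  allocations came from the reserve.  The toy_* structures below are
  illustrative stand-ins, not struct sk_buff or struct sock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_skb {
    bool emergency;           /* models skb->emergency */
};

struct toy_sock {
    bool memalloc;            /* models the SK_MEMALLOC socket flag */
    int  queued;              /* packets accepted onto this socket */
};

/* Record one allocation made for the skb; the flag is sticky. */
void toy_skb_note_alloc(struct toy_skb *skb, bool used_reserve)
{
    if (used_reserve)
        skb->emergency = true;
}

/* Socket demux: returns true if queued, false if the packet is dropped. */
bool toy_sock_queue(struct toy_sock *sk, const struct toy_skb *skb)
{
    assert(sk != NULL && skb != NULL);
    if (skb->emergency && !sk->memalloc)
        return false;         /* reserve memory must not sit on this socket */
    sk->queued++;
    return true;
}

/* Queueing towards user-space: reserve-backed skbs are never handed over. */
bool toy_may_deliver_to_userspace(const struct toy_skb *skb)
{
    return !skb->emergency;
}
```

  Note that an skb built entirely from ordinary memory is never dropped
  by these checks, so the common case is unaffected.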


3/ pages_emergency

  The above would be sufficient if the total memory below the lowest
  memory watermark (i.e. the size of the emergency reserve) were known
  to be enough to hold all transient allocations needed for writeout.
  I'm a little blurry on how big the current emergency pool is, but it
  isn't big and certainly hasn't been sized to allow network traffic
  to consume any.

  We could simply make the size of the reserve bigger. However in the
  common case that we are not swapping over the network, that would be
  a waste of memory.

  So a new "watermark" is defined: pages_emergency.  This is
  effectively added to the current low water marks, so that pages from
  this emergency pool can only be allocated if one of PF_MEMALLOC or
  GFP_MEMALLOC is set.

  pages_emergency can be changed dynamically based on need.  When
  swapout over the network is required, pages_emergency is increased
  to cover the maximum expected load.  When network swapout is
  disabled, pages_emergency is decreased.
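
  In rough terms the watermark test becomes something like the sketch
  below.  It assumes a single zone and a single existing watermark
  (called pages_min here); the toy_zone structure and both function
  names are invented for illustration and gloss over how the real
  allocator checks its watermarks.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_zone {
    long free_pages;          /* pages currently free */
    long pages_min;           /* the existing lowest watermark */
    long pages_emergency;     /* extra reserve, sized on demand */
};

/*
 * Ordinary allocations must leave both reserves untouched.  Emergency
 * contexts (PF_MEMALLOC or GFP_MEMALLOC) may use anything that is free.
 */
bool toy_may_allocate(const struct toy_zone *z, bool emergency_ctx)
{
    if (emergency_ctx)
        return z->free_pages > 0;
    return z->free_pages > z->pages_min + z->pages_emergency;
}

/* Grow or shrink the reserve as network swapout is enabled or disabled. */
void toy_adjust_emergency(struct toy_zone *z, long delta)
{
    z->pages_emergency += delta;
    assert(z->pages_emergency >= 0);
}
```

  With pages_emergency at zero the test is exactly the old one, which
  is why systems that never swap over the network pay nothing.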

  To determine how much to increase it by, we introduce reservation
  groups....

3a/ reservation groups

  The memory used transiently for swapout can be in a number of
  different places.  e.g. the network route cache, the network
  fragment cache, in transit between network card and socket, or (in
  the case of NFS) in sunrpc data structures awaiting a reply.
  We need to ensure each of these is limited in the amount of memory
  they use, and that the maximum is included in the reserve.

  The memory required by the network layer only needs to be reserved
  once, even if there are multiple swapout paths using the network
  (e.g. NFS and NBD and iSCSI, though using all three for swapout at
  the same time would be unusual).

  So we create a tree of reservation groups.  The network might
  register a collection of reservations, but not mark them as being in
  use.  NFS and sunrpc might similarly register a collection of
  reservations, and attach it to the network reservations as it
  depends on them.
  When swapout over NFS is requested, the NFS/sunrpc reservations are
  activated which implicitly activates the network reservations.

  The total new reservation is added to pages_emergency.
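
  One way to picture the tree is with a parent pointer and a user count
  per reservation, so that a shared parent (the network) is counted
  only once however many children activate it.  This is a sketch under
  that assumption; struct toy_resv, the global toy_pages_emergency and
  the two functions are hypothetical and do not mirror the patch's
  data structures.

```c
#include <assert.h>
#include <stddef.h>

struct toy_resv {
    const char      *name;
    long             pages;   /* this reservation's own requirement */
    struct toy_resv *parent;  /* reservation it depends on, or NULL */
    int              users;   /* active children, plus 1 if activated directly */
};

long toy_pages_emergency = 0; /* models the pages_emergency watermark */

/* Activate r and, implicitly, every ancestor that is not yet active. */
void toy_resv_activate(struct toy_resv *r)
{
    for (; r != NULL; r = r->parent) {
        if (r->users++ > 0)
            break;            /* already active: ancestors already counted */
        toy_pages_emergency += r->pages;
    }
}

/* Undo one activation, releasing ancestors that have no users left. */
void toy_resv_deactivate(struct toy_resv *r)
{
    for (; r != NULL; r = r->parent) {
        assert(r->users > 0);
        if (--r->users > 0)
            break;            /* someone else still needs this one */
        toy_pages_emergency -= r->pages;
    }
}
```

  For example, with a network reservation of 100 pages and NFS and NBD
  children of 20 and 10 pages, activating both children raises the
  watermark by 130 pages, not 230.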

  Provided each memory usage stays beneath the registered limit (at
  least when allocating memory from reserves), the system will never
  run out of emergency memory, and swapout will not deadlock.

  It is worth noting here that it is not critical that each usage
  stays beneath the limit 100% of the time.  Occasional excess is
  acceptable provided that the memory will be freed again within a
  short amount of time that does *not* require waiting for any event
  that itself might require memory.
  This is because, at all stages of transmit and receive, it is
  acceptable to discard all transient memory associated with a
  particular writeout and try again later.  On transmit, the page can
  be re-queued for later transmission.  On receive, the packet can be
  dropped assuming that the peer will resend after a timeout.

  Thus allocations that are truly transient and will be freed without
  blocking do not strictly need to be reserved for.  Doing so might
  still be a good idea to ensure forward progress doesn't take too
  long. 

4/ lo-mem accounting

  Most places that might hold on to emergency memory (e.g. route
  cache, fragment cache etc) already place a limit on the amount of
  memory that they can use.  This limit can simply be reserved using
  the above mechanism and no more needs to be done.

  However some memory usage might not be accounted with sufficient
  firmness to allow an appropriate emergency reservation.  The
  in-flight skbs for incoming packets are (claimed to be) one such
  example.

  To support this, a low-overhead mechanism for accounting memory
  usage against the reserves is provided.  This mechanism uses the
  same data structure that is used to store the emergency memory
  reservations through the addition of a 'usage' field.

  When memory allocation for a particular purpose succeeds, the memory
  is checked to see if it is 'reserve' memory.  If it is, the size of
  the allocation is added to the 'usage'.  If this exceeds the
  reservation, the usage is reduced again and the memory that was
  allocated is freed.

  When memory that was allocated for that purpose is freed, the
  'usage' field is checked again.  If it is non-zero, then the size of
  the freed memory is subtracted from the usage, making sure the usage
  never becomes less than zero.

  This provides adequate accounting with minimal overheads when not in
  a low memory condition.  When a low memory condition is encountered
  it does add the cost of a spin lock necessary to serialise updates
  to 'usage'.
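
  The charge and uncharge steps above could look roughly like this.  It
  is a single-threaded sketch: the spin lock is only marked by
  comments, and struct toy_reserve, toy_charge and toy_uncharge are
  names made up here, not the interface in the patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_reserve {
    size_t limit;             /* bytes reserved for this purpose */
    size_t usage;             /* bytes of reserve memory charged so far */
};

/*
 * Call after an allocation succeeded.  from_reserve says whether the
 * memory came out of the emergency pool.  Returns false if the caller
 * must free the memory again (and, for an skb, drop the packet).
 */
bool toy_charge(struct toy_reserve *r, size_t size, bool from_reserve)
{
    assert(r != NULL);
    if (!from_reserve)
        return true;          /* normal case: no accounting, no lock */
    /* a real implementation would take the spin lock here */
    r->usage += size;
    if (r->usage > r->limit) {
        r->usage -= size;     /* over the reservation: back out */
        return false;
    }
    return true;
}

/* Call whenever memory allocated for this purpose is freed. */
void toy_uncharge(struct toy_reserve *r, size_t size)
{
    assert(r != NULL);
    if (r->usage == 0)
        return;               /* nothing charged: skip the lock entirely */
    /* a real implementation would take the spin lock here */
    r->usage = (size > r->usage) ? 0 : r->usage - size;
}
```

  Because toy_uncharge does not know whether the freed memory was ever
  charged, the usage count is only approximate, which is the deliberate
  trade-off: it errs towards recovering quickly once memory is being
  freed again.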
  


5/ swapfile/swap_out/swap_in

  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
  any network socket that it uses, and can know when to account
  reserve memory carefully, new address_space_operations are
  available.
  "swapfile" requests that an address space (i.e. a file) be made ready
  for swapout.  swap_out and swap_in request the actual IO.  They
  together must ensure that each swap_out request can succeed without
  allocating more emergency memory than was reserved by swapfile.


Thanks for reading this far.  I hope it made sense :-)

NeilBrown