netdev - Re: [PATCH 01/30] swap over network documentation

Message-Id: <20080320142032.9279e288.randy.dunlap@oracle.com>
Date:	Thu, 20 Mar 2008 14:20:32 -0700
From:	Randy Dunlap <randy.dunlap@...cle.com>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	netdev@...r.kernel.org, trond.myklebust@....uio.no, neilb@...e.de,
	miklos@...redi.hu, penberg@...helsinki.fi
Subject: Re: [PATCH 01/30] swap over network documentation

On Thu, 20 Mar 2008 21:10:43 +0100 Peter Zijlstra wrote:

> Document describing the problem and proposed solution
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> ---
>  Documentation/network-swap.txt |  270 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 270 insertions(+)
> 
> Index: linux-2.6/Documentation/network-swap.txt
> ===================================================================
> --- /dev/null
> +++ linux-2.6/Documentation/network-swap.txt
> @@ -0,0 +1,270 @@

...

> +There are several major parts to this enhancement:
> +
> +1/ page->reserve, GFP_MEMALLOC

...

> +  For memory allocated using slab/slub: If a page that is added to a
> +  kmem_cache is found to have page->reserve set, then a  s->reserve

                                                    then an

> +  flag is set for the whole kmem_cache.  Further allocations will only
> +  be returned from that page (or any other page in the cache) if they
> +  are emergency allocation (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).

                   allocations

> +  Non-emergency allocations will block in alloc_page until a
> +  non-reserve page is available.  Once a non-reserve page has been
> +  added to the cache, the s->reserve flag on the cache is removed.
> +
> +  Because slab objects have no individual state its hard to pass

                                                   it's (or "it is")

> +  reserve state along, the current code relies on a regular alloc

                          so the

> +  failing. There are various allocation wrappers help here.

                                           wrappers to help here.  (?)

> +
> +  This allows us to
> +   a/ request use of the emergency pool when allocating memory
> +     (GFP_MEMALLOC), and
> +   b/ to find out if the emergency pool was used.
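
For illustration, the s->reserve behaviour described above can be
modelled in a few lines of userspace C.  This is only a sketch under
stated assumptions: struct cache_sketch, cache_add_page() and
cache_alloc() are hypothetical names that do not appear in the patch,
and a refused non-emergency allocation returns false here where the
real code would block in alloc_page.

```c
#include <stdbool.h>

/* Hypothetical model of a kmem_cache; only the reserve flag matters here. */
struct cache_sketch {
	bool reserve;              /* mirrors s->reserve */
	unsigned int free_objects; /* objects available without a new page */
};

/*
 * A new page joins the cache.  page_reserve mirrors page->reserve:
 * a reserve page sets the flag for the whole cache, and a
 * non-reserve page clears it again.
 */
static void cache_add_page(struct cache_sketch *s, bool page_reserve,
			   unsigned int objs)
{
	s->reserve = page_reserve;
	s->free_objects += objs;
}

/*
 * Hand out one object.  'emergency' stands for "PF_MEMALLOC or
 * GFP_MEMALLOC is set".  Returns false when nothing may be handed out
 * (assumption: the real code would block here for non-emergency
 * callers instead of failing).
 */
static bool cache_alloc(struct cache_sketch *s, bool emergency)
{
	if (s->free_objects == 0)
		return false;
	if (s->reserve && !emergency)
		return false;
	s->free_objects--;
	return true;
}
```

While the flag is set, only emergency callers get objects, even from
pages in the cache that did not come from the reserve.
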
> +
> +2/ SK_MEMALLOC, sk_buff->emergency.
> +
...
> +
> +  Similarly, if an skb is ever queued for delivery to user-space for

                                                         user-space, for

> +  example by netfilter, the ->emergency flag is tested and the skb is
> +  released if ->emergency is set. (so obviously the storage route may
> +  not pass through a userspace helper, otherwise the packets will never
> +  arrive and we'll deadlock)
> +
> +  This ensures that memory from the emergency reserve can be used to
> +  allow swapout to proceed, but will not get caught up in any other
> +  network queue.
> +
> +
> +3/ pages_emergency
> +
...
> +
> +  So a new "watermark" is defined: pages_emergency.  This is
> +  effectively added to the current low water marks, so that pages from
> +  this emergency pool can only be allocated if one of PF_MEMALLOC or
> +  GFP_MEMALLOC are set.

                  is set.

> +
> +  pages_emergency can be changed dynamically based on need.  When
> +  swapout over the network is required, pages_emergency is increased
> +  to cover the maximum expected load.  When network swapout is
> +  disabled, pages_emergency is decreased.
> +
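
One possible reading of the pages_emergency watermark, as a small
userspace C sketch.  Everything named here is an assumption made for
illustration: struct zone_marks, may_allocate() and the SKETCH_* flag
bits are hypothetical, and the real PF_MEMALLOC and GFP_MEMALLOC flags
live in different flag words (task flags and gfp mask).

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for PF_MEMALLOC / GFP_MEMALLOC. */
#define SKETCH_PF_MEMALLOC  0x1u
#define SKETCH_GFP_MEMALLOC 0x2u

/* Hypothetical, much simplified per-zone state. */
struct zone_marks {
	unsigned long free_pages;
	unsigned long pages_low;       /* ordinary low watermark */
	unsigned long pages_emergency; /* raised while network swap is on */
};

/*
 * Ordinary allocations must leave pages_low + pages_emergency free
 * pages untouched.  Emergency allocations may dig into that pool
 * (assumption: down to the last free page).
 */
static bool may_allocate(const struct zone_marks *z, unsigned int flags)
{
	if (flags & (SKETCH_PF_MEMALLOC | SKETCH_GFP_MEMALLOC))
		return z->free_pages > 0;
	return z->free_pages > z->pages_low + z->pages_emergency;
}
```

Setting pages_emergency back to 0 in this model corresponds to
disabling network swapout: the ordinary watermark applies again.
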
> +  To determine how much to increase it by, we introduce reservation
> +  groups....
> +
> +3a/ reservation groups
> +
> +  The memory used transiently for swapout can be in a number of
> +  different places.  e.g. the network route cache, the network

               places, e.g.,

> +  fragment cache, in transit between network card and socket, or (in
> +  the case of NFS) in sunrpc data structures awaiting a reply.
> +  We need to ensure each of these is limited in the amount of memory
> +  they use, and that the maximum is included in the reserve.
> +

...

> +
> +4/ low-mem accounting
> +
> +  Most places that might hold on to emergency memory (e.g. route
> +  cache, fragment cache etc) already place a limit on the amount of

            fragment cache, etc.)

> +  memory that they can use.  This limit can simply be reserved using
> +  the above mechanism and no more needs to be done.
> +
> +  However some memory usage might not be accounted with sufficient

     However,

> +  firmness to allow an appropriate emergency reservation.  The
> +  in-flight skbs for incoming packets is on such example.

                                            one

> +
> +  To support this, a low-overhead mechanism for accounting memory
> +  usage against the reserves is provided.  This mechanism uses the
> +  same data structure that is used to store the emergency memory
> +  reservations through the addition of a 'usage' field.
> +
> +  Before we attempt allocation from the memory reserves, we much check

s/much/must/ ?

> +  if the resulting 'usage' is below the reservation. If so, we increase
> +  the usage and attempt the allocation (which should succeed). If
> +  the projected 'usage' exceeds the reservation we'll either fail the
> +  allocation, or wait for 'usage' to decrease enough so that it would
> +  succeed, depending on __GFP_WAIT.
> +
> +  When memory that was allocated for that purpose is freed, the
> +  'usage' field is checked again.  If it is non-zero, then the size of
> +  the freed memory is subtracted from the usage, making sure the usage
> +  never becomes less than zero.
> +
> +  This provides adequate accounting with minimal overheads when not in
> +  a low memory condition.  When a low memory condition is encountered
> +  it does add the cost of a spin lock necessary to serialise updates
> +  to 'usage'.
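
The charge/uncharge accounting described above could look roughly like
the following userspace C sketch.  The names (struct mem_reserve,
reserve_charge(), reserve_uncharge()) are hypothetical and not taken
from the patch.  Further assumptions: a C11 atomic_flag stands in for
the spin lock, "below the reservation" is read as "not above" it, the
lock is taken unconditionally (no lock-free fast path), and only the
fail branch is modelled, not the waiting that __GFP_WAIT would allow.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reservation record; only the 'usage' idea is from the text. */
struct mem_reserve {
	atomic_flag lock; /* toy spin lock serialising updates to usage */
	size_t limit;     /* bytes reserved */
	size_t usage;     /* bytes currently charged against the reserve */
};

#define MEM_RESERVE_INIT(bytes) { ATOMIC_FLAG_INIT, (bytes), 0 }

static void reserve_lock(struct mem_reserve *res)
{
	while (atomic_flag_test_and_set_explicit(&res->lock,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void reserve_unlock(struct mem_reserve *res)
{
	atomic_flag_clear_explicit(&res->lock, memory_order_release);
}

/*
 * Charge 'size' bytes before attempting the allocation.  Returns false
 * if the projected usage would exceed the reservation.  usage never
 * exceeds limit, so the subtraction below cannot wrap.
 */
static bool reserve_charge(struct mem_reserve *res, size_t size)
{
	bool ok = false;

	reserve_lock(res);
	if (size <= res->limit - res->usage) {
		res->usage += size;
		ok = true;
	}
	reserve_unlock(res);
	return ok;
}

/* Give back 'size' bytes on free; usage is clamped so it never underflows. */
static void reserve_uncharge(struct mem_reserve *res, size_t size)
{
	reserve_lock(res);
	if (res->usage != 0)
		res->usage -= (size < res->usage) ? size : res->usage;
	reserve_unlock(res);
}
```

A caller would pair each successful reserve_charge() with a later
reserve_uncharge() of the same size when the memory is freed.
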
> +
> +
> +
> +5/ swapon/swapoff/swap_out/swap_in
> +
> +  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
> +  any network socket that it uses, and can know when to account
> +  reserve memory carefully, new address_space_operations are
> +  available.
> +  "swapon" requests that an address space (i.e a file) be make ready

                                             (i.e.
s/make/made/

> +  for swapout.  swap_out and swap_in request the actual IO.  They
> +  together must ensure that each swap_out request can succeed without
> +  allocating more emergency memory that was reserved by swapon. swapoff
> +  is used to reverse the state changes caused by swapon when we disable
> +  the swap file.
> +
> +
> +Thanks for reading this far.  I hope it made sense :-)
> +
> +Neil Brown (with updates from Peter Zijlstra)


Thanks.

---
~Randy