netdev - Re: [PATCH v2 net-next 1/6] bpf: LRU List

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20161115015035.GC8080@ast-mbp.thefacebook.com>
Date:   Mon, 14 Nov 2016 17:50:37 -0800
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Martin KaFai Lau <kafai@...com>
Cc:     netdev@...r.kernel.org, David Miller <davem@...emloft.net>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v2 net-next 1/6] bpf: LRU List

On Fri, Nov 11, 2016 at 10:55:06AM -0800, Martin KaFai Lau wrote:
> Introduce bpf_lru_list which will provide LRU capability to
> the bpf_htab in the later patch.
> 
> * General Thoughts:
> 1. Target use case.  Read is more often than update.
>    (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
>    If bpf_prog does a bpf_lookup_elem() first and then an in-place
>    update, it still counts as a read operation to the LRU list concern.
> 2. It may be useful to think of it as a LRU cache
> 3. Optimize the read case
>    3.1 No lock in read case
>    3.2 The LRU maintenance is only done during bpf_update_elem()
> 4. If there is a percpu LRU list, it will lose the system-wise LRU
>    property.  A completely isolated percpu LRU list has the best
>    performance but the memory utilization is not ideal considering
>    the work load may be imbalance.
> 5. Hence, this patch starts the LRU implementation with a global LRU
>    list with batched operations before accessing the global LRU list.
>    As a LRU cache, #read >> #update/#insert operations, it will work well.
> 6. There is a local list (for each cpu) which is named
>    'struct bpf_lru_locallist'.  This local list is not used to sort
>    the LRU property.  Instead, the local list is to batch enough
>    operations before acquiring the lock of the global LRU list.  More
>    details on this later.
> 7. In the later patch, it allows a percpu LRU list by specifying a
>    map-attribute for scalability reason and for use cases that need to
>    prepare for the worst (and pathological) case like DoS attack.
>    The percpu LRU list is completely isolated from each other and the
>    LRU nodes (including free nodes) cannot be moved across the list.  The
>    following description is for the global LRU list but mostly applicable
>    to the percpu LRU list also.
> 
> * Global LRU List:
> 1. It has three sub-lists: active-list, inactive-list and free-list.
> 2. The two list idea, active and inactive, is borrowed from the
>    page cache.
> 3. All nodes are pre-allocated and all sit at the free-list (of the
>    global LRU list) at the beginning.  The pre-allocation reasoning
>    is similar to the existing BPF_MAP_TYPE_HASH.  However,
>    opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
>    the LRU map.
> 
> * Active/Inactive List (of the global LRU list):
> 1. The active list, as its name says it, maintains the active set of
>    the nodes.  We can think of it as the working set or more frequently
>    accessed nodes.  The access frequency is approximated by a ref-bit.
>    The ref-bit is set during the bpf_lookup_elem().
> 2. The inactive list, as its name also says it, maintains a less
>    active set of nodes.  They are the candidates to be removed
>    from the bpf_htab when we are running out of free nodes.
> 3. The ordering of these two lists is acting as a rough clock.
>    The tail of the inactive list is the older nodes and
>    should be released first if the bpf_htab needs free element.
> 
> * Rotating the Active/Inactive List (of the global LRU list):
> 1. It is the basic operation to maintain the LRU property of
>    the global list.
> 2. The active list is only rotated when the inactive list is running
>    low.  This idea is similar to the current page cache.
>    Inactive running low is currently defined as
>    "# of inactive < # of active".
> 3. The active list rotation always starts from the tail.  It moves
>    node without ref-bit set to the head of the inactive list.
>    It moves node with ref-bit set back to the head of the active
>    list and then clears its ref-bit.
> 4. The inactive rotation is pretty simply.
>    It walks the inactive list and moves the nodes back to the head of
>    active list if its ref-bit is set. The ref-bit is cleared after moving
>    to the active list.
>    If the node does not have ref-bit set, it just leave it as it is
>    because it is already in the inactive list.
> 
> * Shrinking the Inactive List (of the global LRU list):
> 1. Shrinking is the operation to get free nodes when the bpf_htab is
>    full.
> 2. It usually only shrinks the inactive list to get free nodes.
> 3. During shrinking, it will walk the inactive list from the tail,
>    delete the nodes without ref-bit set from bpf_htab.
> 4. If no free node found after step (3), it will forcefully get
>    one node from the tail of inactive or active list.  Forcefully is
>    in the sense that it ignores the ref-bit.
> 
> * Local List:
> 1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
>    batch enough operations before acquiring the lock of the
>    global LRU.
> 2. A local list has two sub-lists, free-list and pending-list.
> 3. During bpf_update_elem(), it will try to get from the free-list
>    of (the current CPU local list).
> 4. If the local free-list is empty, it will acquire from the
>    global LRU list.  The global LRU list can either satisfy it
>    by its global free-list or by shrinking the global inactive
>    list.  Since we have acquired the global LRU list lock,
>    it will try to get at most LOCAL_FREE_TARGET elements
>    to the local free list.
> 5. When a new element is added to the bpf_htab, it will
>    first sit at the pending-list (of the local list) first.
>    The pending-list will be flushed to the global LRU list
>    when it needs to acquire free nodes from the global list
>    next time.
> 
> * Lock Consideration:
> The LRU list has a lock (lru_lock).  Each bucket of htab has a
> lock (buck_lock).  If both locks need to be acquired together,
> the lock order is always lru_lock -> buck_lock and this only
> happens in the bpf_lru_list.c logic.
> 
> In hashtab.c, both locks are not acquired together (i.e. one
> lock is always released first before acquiring another lock).
> 
> Signed-off-by: Martin KaFai Lau <kafai@...com>

thanks for detailed commit log.
I think it's worth adding it to bpf_lru_list.c as design documentation.
Acked-by: Alexei Starovoitov <ast@...nel.org>