Message-ID: <642362a1403ee_286af20850@john.notmuch>
Date: Tue, 28 Mar 2023 14:56:49 -0700
From: John Fastabend <john.fastabend@...il.com>
To: Jakub Sitnicki <jakub@...udflare.com>,
John Fastabend <john.fastabend@...il.com>
Cc: cong.wang@...edance.com, daniel@...earbox.net, lmb@...valent.com,
edumazet@...gle.com, bpf@...r.kernel.org, netdev@...r.kernel.org,
ast@...nel.org, andrii@...nel.org, will@...valent.com
Subject: Re: [PATCH bpf v2 02/12] bpf: sockmap, convert schedule_work into
delayed_work
Jakub Sitnicki wrote:
> On Mon, Mar 27, 2023 at 10:54 AM -07, John Fastabend wrote:
> > Sk_buffs are fed into sockmap verdict programs either from a strparser
> > (when the user might want to decide how framing of skb is done by attaching
> > another parser program) or directly through tcp_read_sock. The
> > tcp_read_sock is the preferred method for performance when the BPF logic is
> > a stream parser.
> >
> > The flow for Cilium's common use case with a stream parser is,
> >
> >  tcp_read_sock()
> >   sk_psock_verdict_recv
> >     ret = bpf_prog_run_pin_on_cpu()
> >     sk_psock_verdict_apply(sock, skb, ret)
> >      // if system is under memory pressure or app is slow we may
> >      // need to queue skb. Do this queuing through ingress_skb and
> >      // then kick timer to wake up handler
> >      skb_queue_tail(ingress_skb, skb)
> >      schedule_work(work);
> >
> >
> > The work queue is wired up to sk_psock_backlog(). This will then walk the
> > ingress_skb skb list that holds our sk_buffs that could not be handled,
> > but should be OK to run at some later point. However, it's possible that
> > the workqueue doing this work still hits an error when sending the skb.
> > When this happens the skbuff is requeued on a temporary 'state' struct
> > kept with the workqueue. This is necessary because it's possible to
> > partially send an skbuff before hitting an error and we need to know how
> > and where to restart when the workqueue runs next.
> >
> > Now for the trouble, we don't rekick the workqueue. This can cause a
> > stall where the skbuff we just cached on the state variable might never
> > be sent. This happens when it's the last packet in a flow and no further
> > packets come along that would cause the system to kick the workqueue from
> > that side.
> >
> > To fix we could do a simple schedule_work(), but while under memory pressure
> > it makes sense to back off a bit instead of retrying repeatedly. So
> > instead, to fix this, convert schedule_work to schedule_delayed_work and add
> > backoff logic to reschedule from the backlog queue on errors. It's not obvious
> > though what a good backoff is, so use '1'.
> >
> > To test, we observed some flakes while running the NGINX compliance test with
> > sockmap; we attributed these failed tests to this bug and a subsequent issue.
> >
> > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> > Tested-by: William Findlay <will@...valent.com>
> > Signed-off-by: John Fastabend <john.fastabend@...il.com>
> > ---
[...]
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -481,7 +481,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> > }
> > out:
> > if (psock->work_state.skb && copied > 0)
> > - schedule_work(&psock->work);
> > + schedule_delayed_work(&psock->work, 0);
> > return copied;
> > }
> > EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
> > @@ -639,7 +639,8 @@ static void sk_psock_skb_state(struct sk_psock *psock,
> >
> > static void sk_psock_backlog(struct work_struct *work)
> > {
> > - struct sk_psock *psock = container_of(work, struct sk_psock, work);
> > + struct delayed_work *dwork = to_delayed_work(work);
> > + struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
> > struct sk_psock_work_state *state = &psock->work_state;
> > struct sk_buff *skb = NULL;
> > bool ingress;
> > @@ -679,6 +680,10 @@ static void sk_psock_backlog(struct work_struct *work)
> > if (ret == -EAGAIN) {
> > sk_psock_skb_state(psock, state, skb,
> > len, off);
> > +
> > + // Delay slightly to prioritize any
> > + // other work that might be here.
> > + schedule_delayed_work(&psock->work, 1);
>
> Do I understand correctly that this means we can back out changes from commit bec217197b41
> ("skmsg: Schedule psock work if the cached skb exists on the psock")?
Yeah, I think so; this is a more direct way to get the same result. I'm also
thinking this check,
  if (psock->work_state.skb && copied > 0)
      schedule_work(&psock->work)
is not correct: copied == 0, which can happen on an empty queue, could be the
result of an skb stuck in the backlog after this -EAGAIN error.
I think it's OK to revert that patch in a separate patch. And ideally we
could get some way to load up the stack to hit these corner cases without
running long stress tests.
WDYT?
>
> Nit: Comment syntax.
Yep happy to fix.