[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20201123123745.GH15658@gauss3.secunet.de>
Date: Mon, 23 Nov 2020 13:37:45 +0100
From: Steffen Klassert <steffen.klassert@...unet.com>
To: Nic Dade <nic.dade@...il.com>
CC: <netdev@...r.kernel.org>
Subject: Re: ESN, seqhi and out-of-order calls to advance()
Hi,
I've Cced netdev, maybe other people have an opinion on this too.
On Thu, Nov 19, 2020 at 01:39:29PM -0800, Nic Dade wrote:
> I've been investigating a problem which happens when I use IPsec
> (strongswan in userspace), ESN, the default anti-replay window (32
> seqnums), on a multi-core CPU. The small window size isn't necessary, it
> just makes it easier to reproduce.
Using the same SA on multiple cores has synchronization problems anyway.
We curently try to solve that on the protocol level:
https://datatracker.ietf.org/doc/html/draft-pwouters-multi-sa-performance
>
> It looks like it's the same problem which was found and patched in commit
> 3b59df46a4, but that commit only applies to async case, and doesn't quite
> solve the problem when it is the CPU cores which are causing it.
>
> The trouble is that xfrm_replay_seqhi() is called twice for each packet,
> once to decide what seqhi value to use for authentication, and a second
> time to decide what seq num to advance to for anti-replay. Sometimes they
> are not the same value.
>
> Here's what happens to cause this: two packets belonging to the same ESP SA
> are being received, and the packet's seqnums are > window apart. The
> packets pass through xfrm_input() which does the 1st call to
> xfrm_replay_seqhi(). Once the packet and seqhi have been authenticated,
> x->repl->advance() == xfrm_replay_advance_esn() is called. If the higher
> seqnum packet gets to advance() first, then it moves the auti-replay window
> forward. Then when the second packet arrives at advance(), the call
> xfrm_replay_advance_esn() makes to xfrm_replay_seqhi() thinks the seqnum <
> bottom means 32-bit seqnums wrapped, and it returns seqhi+1.
> xfrm_replay_advance_esn() stores that away in the xfrm_replay_state_esn for
> the future. From now until the ESP SA renews, or the sender gets to the
> point where seqhi really did increment, all packets will fail
> authentication.
Well analyzed! Yes, that can happen if the packets are processed on
different cpus.
>
> This is b/c the seqhi which was authenticated != the seqhi which advance()
> believed to be correct.
>
> I think it would be safer if the authenticated seqhi computed in
> xfrm_input() was passed to the advance() function as an argument, rather
> than having advance() recompute seqhi. That would also fix the async case.
> Or the non-async case also needs to recheck.
Right, using the authenticated seqhi in the advance() function makes
sense. We can do away the recheck() if that fixes the problem entirely.
But I did not look into this problem for some years, so not absolutely
sure if that is sufficient.
>
> Also it seems to be that calling xfrm_replay_seqhi() while not holding
> x->lock (as xfrm_input() does) is not safe, since both x->seq_hi and x->seq
> are used by xfrm_replay_seqhi(), and that can race with the updates to seq
> and seq_hi in xfrm_replay_advance_esn()
True, indeed.
> ----------------------------------------------------------
>
> Note I still have some mysteries. I do know for sure (from instrumenting
> the code and reproducing) that the 2nd call to xfrm_replay_seqhi() does
> result in a different answer than the 1st call, and that this happens when
> the packets are processed simultaneously and out of order as described
> above. My mystery is that my instrumentation also indicates it's the same
> CPU core (by smp_processor_id()) processing the two packets, which I cannot
> explain.
That is odd. Should not happen if the packets are processed on the same cpu
with a sync cipher.
Powered by blists - more mailing lists