linux-kernel - Re: [PATCH v9 net-next 01/23] net/tcp: Prepare tcp_md5sig

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <937d4a14-050a-56f1-8af9-2c7acdd5ae86@arista.com>
Date:   Fri, 11 Aug 2023 22:02:47 +0100
From:   Dmitry Safonov <dima@...sta.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     David Ahern <dsahern@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
        Jakub Kicinski <kuba@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        linux-kernel@...r.kernel.org,
        Andy Lutomirski <luto@...capital.net>,
        Ard Biesheuvel <ardb@...nel.org>,
        Bob Gilligan <gilligan@...sta.com>,
        Dan Carpenter <error27@...il.com>,
        David Laight <David.Laight@...lab.com>,
        Dmitry Safonov <0x7f454c46@...il.com>,
        Donald Cassidy <dcassidy@...hat.com>,
        Eric Biggers <ebiggers@...nel.org>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Francesco Ruggeri <fruggeri05@...il.com>,
        "Gaillardetz, Dominik" <dgaillar@...na.com>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Ivan Delalande <colona@...sta.com>,
        Leonard Crestez <cdleonard@...il.com>,
        Salam Noureddine <noureddine@...sta.com>,
        "Tetreault, Francois" <ftetreau@...na.com>, netdev@...r.kernel.org,
        Steen Hegelund <Steen.Hegelund@...rochip.com>
Subject: Re: [PATCH v9 net-next 01/23] net/tcp: Prepare tcp_md5sig_pool for
 TCP-AO

Hi Eric,

On 8/8/23 10:58, Eric Dumazet wrote:
> On Wed, Aug 2, 2023 at 7:27 PM Dmitry Safonov <dima@...sta.com> wrote:
[..]
>> -       hash = crypto_alloc_ahash("md5", 0, CRYPTO_ALG_ASYNC);
>> -       if (IS_ERR(hash))
>> -               return;
>> -
>> -       for_each_possible_cpu(cpu) {
>> -               void *scratch = per_cpu(tcp_md5sig_pool, cpu).scratch;
>> -               struct ahash_request *req;
>> -
>> -               if (!scratch) {
>> -                       scratch = kmalloc_node(sizeof(union tcp_md5sum_block) +
>> -                                              sizeof(struct tcphdr),
>> -                                              GFP_KERNEL,
>> -                                              cpu_to_node(cpu));
>> -                       if (!scratch)
>> -                               return;
>> -                       per_cpu(tcp_md5sig_pool, cpu).scratch = scratch;
>> -               }
>> -               if (per_cpu(tcp_md5sig_pool, cpu).md5_req)
>> -                       continue;
>> -
>> -               req = ahash_request_alloc(hash, GFP_KERNEL);
>> -               if (!req)
>> -                       return;
>> -
>> -               ahash_request_set_callback(req, 0, NULL, NULL);
>> -
>> -               per_cpu(tcp_md5sig_pool, cpu).md5_req = req;
>> +       scratch_size = sizeof(union tcp_md5sum_block) + sizeof(struct tcphdr);
>> +       ret = tcp_sigpool_alloc_ahash("md5", scratch_size);
>> +       if (ret >= 0) {
>> +               tcp_md5_sigpool_id = ret;
> 
> tcp_md5_alloc_sigpool() can be called from multiple cpus,
> yet you are writing over tcp_md5_sigpool_id here without any
> spinlock/mutex/annotations ?

Yeah, it's writing-only: as long as there was a TCP-MD5 key in
the system, sigpool_id returned by tcp_sigpool_alloc_ahash() would stay
the same. The only time when it writes a different value - iff all
previous MD5 keys were released.

> KCSAN would eventually file a report, and a compiler might very well
> transform this to
> 
> tcp_md5_sigpool_id = random_value;
> <window where readers might catch garbage>
> tcp_md5_sigpool_id = ret;

Agree, haven't thought about the optimizing compiler nightmares.
Will add WRITE_ONCE().

[..]
>> +
>> +/**
>> + * sigpool_reserve_scratch - re-allocates scratch buffer, slow-path
>> + * @size: request size for the scratch/temp buffer
>> + */
>> +static int sigpool_reserve_scratch(size_t size)
>> +{
>> +       struct scratches_to_free *stf;
>> +       size_t stf_sz = struct_size(stf, scratches, num_possible_cpus());
> 
> This is wrong.   num_possible_cpus() could be 2, with two cpus numbered 0 and 8.
> 
> Look for nr_cpu_ids instead.

Hmm, but it seems num_possible_cpus() was exactly what I wanted?

As free_old_scratches() does:

:	while (stf->cnt--)
:		kfree(stf->scratches[stf->cnt]);

So, that's just the number of previously allocated scratches, not
per-CPU index.

>> +       int cpu, err = 0;
>> +
>> +       lockdep_assert_held(&cpool_mutex);
>> +       if (__scratch_size >= size)
>> +               return 0;
>> +
>> +       stf = kmalloc(stf_sz, GFP_KERNEL);
>> +       if (!stf)
>> +               return -ENOMEM;
>> +       stf->cnt = 0;
>> +
>> +       size = max(size, __scratch_size);
>> +       cpus_read_lock();
>> +       for_each_possible_cpu(cpu) {
>> +               void *scratch, *old_scratch;
>> +
>> +               scratch = kmalloc_node(size, GFP_KERNEL, cpu_to_node(cpu));
>> +               if (!scratch) {
>> +                       err = -ENOMEM;
>> +                       break;
>> +               }
>> +
>> +               old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
>> +                                       scratch, lockdep_is_held(&cpool_mutex));
>> +               if (!cpu_online(cpu) || !old_scratch) {
>> +                       kfree(old_scratch);
>> +                       continue;
>> +               }
>> +               stf->scratches[stf->cnt++] = old_scratch;
>> +       }
> 
> 
> I wonder why you are not simply using something around __alloc_percpu()

I presume that was a heritage of __tcp_alloc_md5sig_pool() that
allocated scratches this way. In pre-crypto-cloning versions per-CPU
ahash_requests were allocated with alloc_percpu().

But now looking closer, if I read pcpu allocator code correctly, it does
pcpu_mem_zalloc() for populating per-CPU chunks which depending on
pcpu_chunk_struct_size may be allocated with vmalloc() or kzalloc().

IOW, as long as CONFIG_NEED_PER_CPU_KM != n, which is:

: config NEED_PER_CPU_KM
:	depends on !SMP || !MMU

The memory, returned by pcpu allocator has no guarantee of being from
direct-map. Which I presume defeats the purpose of having any
scratch/temporary area for sg/crypto.

Am I missing something?

>> +       cpus_read_unlock();
>> +       if (!err)
>> +               __scratch_size = size;
>> +
>> +       call_rcu(&stf->rcu, free_old_scratches);
>> +       return err;
>> +}
>> +
>> +static void sigpool_scratch_free(void)
>> +{
>> +       int cpu;
>> +
>> +       for_each_possible_cpu(cpu)
>> +               kfree(rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
>> +                                         NULL, lockdep_is_held(&cpool_mutex)));
> 
> I wonder why bothering about freeing some space ? One scratch buffer
> per cpu is really small.

When all TCP-MD5/TCP-AO keys removed, why not free some memory?

[..]
>> +       /* slow-path */
>> +       mutex_lock(&cpool_mutex);
>> +       ret = sigpool_reserve_scratch(scratch_size);
>> +       if (ret)
>> +               goto out;
>> +       for (i = 0; i < cpool_populated; i++) {
>> +               if (!cpool[i].alg)
>> +                       continue;
>> +               if (strcmp(cpool[i].alg, alg))
>> +                       continue;
>> +
>> +               if (kref_read(&cpool[i].kref) > 0)
>> +                       kref_get(&cpool[i].kref);
>> +               else
>> +                       kref_init(&cpool[i].kref);
> 
> This looks wrong. kref_init() should be called once, after allocation,
> not based on kref_read().

Well, as long as it's the slow-path and cpool_mutex was grabbed, this
essentially re-allocating an entry that had it's last reference
released, but wasn't yet destroyed by __cpool_free_entry().

I think, if I interpreted it correctly, it was suggested-by Jakub:
https://lore.kernel.org/all/20230106175326.2d6a4dcd@kernel.org/T/#u

[..]
>> +static void __cpool_free_entry(struct sigpool_entry *e)
>> +{
>> +       crypto_free_ahash(e->hash);
>> +       kfree(e->alg);
>> +       memset(e, 0, sizeof(*e));
>> +}
>> +
>> +static void cpool_cleanup_work_cb(struct work_struct *work)
> 
> Really this looks over complicated to me.
> 
> All this kref maintenance and cpool_mutex costs :/

Hmm, unsure: I can remove this sigpool thing and just allocate a crypto
tfm for every new setsockopt(). Currently, this allocates one tfm per
hash algorithm, and without it it would be one tfm per setsockopt()/key.

The supported scale currently is thousands of keys per-socket, which
means that with sizeof(struct crypto_ahash) == 104, that will increase
memory consumption by hundreds of kilobytes to megabytes. I thought
that's a reasonable thing to do.

Should I proceed TCP-AO patches without this manager thing?

>> +{
>> +       bool free_scratch = true;
>> +       unsigned int i;
>> +
>> +       mutex_lock(&cpool_mutex);
>> +       for (i = 0; i < cpool_populated; i++) {
>> +               if (kref_read(&cpool[i].kref) > 0) {
>> +                       free_scratch = false;
>> +                       continue;
>> +               }
>> +               if (!cpool[i].alg)
>> +                       continue;
>> +               __cpool_free_entry(&cpool[i]);
>> +       }
>> +       if (free_scratch)
>> +               sigpool_scratch_free();
>> +       mutex_unlock(&cpool_mutex);
>> +}
[..]

Thanks,
         Dmitry