linux-kernel - race condition in crypto larval handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGXu5jJvuAC3paV5k_bmoHHrHVgc0Hi6ShSDk3iQUt1Lf0cioQ@mail.gmail.com>
Date:	Fri, 6 Sep 2013 16:20:50 -0700
From:	Kees Cook <keescook@...omium.org>
To:	Herbert Xu <herbert@...dor.apana.org.au>
Cc:	"David S. Miller" <davem@...emloft.net>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-crypto@...r.kernel.org, Tyler Hicks <tyhicks@...onical.com>
Subject: race condition in crypto larval handling

Hi,

I've tracked down a race condition and ref counting problem in the
crypto API internals. We've been seeing it under Chrome OS, but it
seems it's not isolated to just us:

https://code.google.com/p/chromium/issues/detail?id=244581
http://marc.info/?l=linux-crypto-vger&m=135429403609373&w=2
https://bugzilla.redhat.com/show_bug.cgi?id=983682
http://www.mail-archive.com/linux-cifs@vger.kernel.org/msg07933.html

I think I've found the basic origin of the problem.
crypto_larval_add() has synchronization to make sure only a single
larval can ever be created. That logic seems totally fine. However,
this means that crypto_larval_lookup() from two threads can return the
same larval, which means that crypto_alg_mod_lookup() runs the risk of
calling crypto_larval_kill() on the same larval twice, which does not
appear to be handled sanely.

I can easily crash the kernel by forcing a synchronization point just
before the "return crypt_larval_add(...)" call in
crypto_larval_lookup(). Basically, I added this sloppy sync code (I
feel like there must be a better way to do this):

+static atomic_t global_sync = ATOMIC_INIT(0);
...
struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask)
...
+       if (strncmp(name, "asdf", 4) == 0) {
+               int value;
+
+               value = atomic_add_return(1, &global_sync);
+               if (value == 1) {
+                       /* I was first to stall here, wait for inc. */
+                       while (atomic_read(&global_sync) != 2)
+                               cpu_relax();
+               } else if (value == 2) {
+                       /* I was second to stall here, wait for dec. */
+                       while (atomic_read(&global_sync) != 1)
+                               cpu_relax();
+               } else {
+                       BUG();
+               }
+               atomic_dec(&global_sync);
+       }

        return crypto_larval_add(name, type, mask);
 }

And then ran code from userspace that did, effectively:

    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
    };
    strcpy(sa.salg_name, argv[1]);

    fork();
    ...
    sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(sds[0], (struct sockaddr *) &sa, sizeof(sa));

And ran this as "./tickle asdf1" to generate the two threads trying to
find an invalid alg. The race looks possible even with valid algs, but
this was the fastest path I could see to tickle the issue.

With added printks to the kernel, it was clear that crypto_larval_kill
was being called twice on the same larval, and the list_del() call was
doing bad things. When I fixed that, the refcnt bug became very
obvious. Here's the change I made for crypto_larval_kill:

@@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg)
        struct crypto_larval *larval = (void *)alg;

        down_write(&crypto_alg_sem);
-       list_del(&alg->cra_list);
+       if (!list_empty(&alg->cra_list))
+               list_del_init(&alg->cra_list);
        up_write(&crypto_alg_sem);
        complete_all(&larval->completion);
        crypto_alg_put(alg);

It seems that there are too many "put" calls (mixed between
crypto_alg_put() and crypto_mod_put(), which also calls
crypto_alg_put()). See this annotated portion of
crypto_alg_mod_lookup:

        /* This can (correctly) return the same larval for two threads. */
        larval = crypto_larval_lookup(name, type, mask);
        if (IS_ERR(larval) || !crypto_is_larval(larval))
                return larval;

        ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval);

        if (ok == NOTIFY_STOP)
                /* This calls crypto_mod_put(), but sometimes also get?? */
                alg = crypto_larval_wait(larval);
        else {
                /* This directly calls crypto_mod_put */
                crypto_mod_put(larval);
                alg = ERR_PTR(-ENOENT);
        }
        /* This calls crypto_alg_put */
        crypto_larval_kill(larval);
        return alg;

In the two-thread situation, the first thread gets a larval with
refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
decrements the ref count twice.

It seems to me like either each call to crypto_larval_lookup() should
result in a refcount bumped by two, or crypto_alg_mod_lookup() should
decrement only once, and the initial refcount should be 1 not 2 from
crypto_larval_add. But it's not clear to me which is sensible here.

What's the right solution here?

Thanks,

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/