linux-kernel - Re: race condition in crypto larval handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGXu5j+m5dtV78K1h4bgquhEM0PM2vkw4BE7XLqcT8QdWY8bpQ@mail.gmail.com>
Date:	Sat, 7 Sep 2013 10:10:17 -0700
From:	Kees Cook <keescook@...omium.org>
To:	Neil Horman <nhorman@...driver.com>
Cc:	Herbert Xu <herbert@...dor.apana.org.au>,
	"David S. Miller" <davem@...emloft.net>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-crypto <linux-crypto@...r.kernel.org>,
	Tyler Hicks <tyhicks@...onical.com>
Subject: Re: race condition in crypto larval handling

On Sat, Sep 7, 2013 at 7:39 AM, Neil Horman <nhorman@...driver.com> wrote:
> On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
>> Hi,
>>
>> I've tracked down a race condition and ref counting problem in the
>> crypto API internals. We've been seeing it under Chrome OS, but it
>> seems it's not isolated to just us:
>>
>> https://code.google.com/p/chromium/issues/detail?id=244581
>> http://marc.info/?l=linux-crypto-vger&m=135429403609373&w=2
>> https://bugzilla.redhat.com/show_bug.cgi?id=983682
>> http://www.mail-archive.com/linux-cifs@vger.kernel.org/msg07933.html
>>
>> I think I've found the basic origin of the problem.
>> crypto_larval_add() has synchronization to make sure only a single
>> larval can ever be created. That logic seems totally fine. However,
>> this means that crypto_larval_lookup() from two threads can return the
>> same larval, which means that crypto_alg_mod_lookup() runs the risk of
>> calling crypto_larval_kill() on the same larval twice, which does not
>> appear to be handled sanely.
>>
>> I can easily crash the kernel by forcing a synchronization point just
>> before the "return crypt_larval_add(...)" call in
>> crypto_larval_lookup(). Basically, I added this sloppy sync code (I
>> feel like there must be a better way to do this):
>>
>> +static atomic_t global_sync = ATOMIC_INIT(0);
>> ...
>> struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask)
>> ...
>> +       if (strncmp(name, "asdf", 4) == 0) {
>> +               int value;
>> +
>> +               value = atomic_add_return(1, &global_sync);
>> +               if (value == 1) {
>> +                       /* I was first to stall here, wait for inc. */
>> +                       while (atomic_read(&global_sync) != 2)
>> +                               cpu_relax();
>> +               } else if (value == 2) {
>> +                       /* I was second to stall here, wait for dec. */
>> +                       while (atomic_read(&global_sync) != 1)
>> +                               cpu_relax();
>> +               } else {
>> +                       BUG();
>> +               }
>> +               atomic_dec(&global_sync);
>> +       }
>>
>>         return crypto_larval_add(name, type, mask);
>>  }
>>
>> And then ran code from userspace that did, effectively:
>>
>>     struct sockaddr_alg sa = {
>>         .salg_family = AF_ALG,
>>         .salg_type   = "hash",
>>     };
>>     strcpy(sa.salg_name, argv[1]);
>>
>>     fork();
>>     ...
>>     sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0);
>>     bind(sds[0], (struct sockaddr *) &sa, sizeof(sa));
>>
>> And ran this as "./tickle asdf1" to generate the two threads trying to
>> find an invalid alg. The race looks possible even with valid algs, but
>> this was the fastest path I could see to tickle the issue.
>>
>> With added printks to the kernel, it was clear that crypto_larval_kill
>> was being called twice on the same larval, and the list_del() call was
>> doing bad things. When I fixed that, the refcnt bug became very
>> obvious. Here's the change I made for crypto_larval_kill:
>>
>> @@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg)
>>         struct crypto_larval *larval = (void *)alg;
>>
>>         down_write(&crypto_alg_sem);
>> -       list_del(&alg->cra_list);
>> +       if (!list_empty(&alg->cra_list))
>> +               list_del_init(&alg->cra_list);
>>         up_write(&crypto_alg_sem);
>>         complete_all(&larval->completion);
>>         crypto_alg_put(alg);
>>
>> It seems that there are too many "put" calls (mixed between
>> crypto_alg_put() and crypto_mod_put(), which also calls
>> crypto_alg_put()). See this annotated portion of
>> crypto_alg_mod_lookup:
>>
>>         /* This can (correctly) return the same larval for two threads. */
>>         larval = crypto_larval_lookup(name, type, mask);
>>         if (IS_ERR(larval) || !crypto_is_larval(larval))
>>                 return larval;
>>
>>         ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval);
>>
>>         if (ok == NOTIFY_STOP)
>>                 /* This calls crypto_mod_put(), but sometimes also get?? */
>>                 alg = crypto_larval_wait(larval);
>>         else {
>>                 /* This directly calls crypto_mod_put */
>>                 crypto_mod_put(larval);
>>                 alg = ERR_PTR(-ENOENT);
>>         }
>>         /* This calls crypto_alg_put */
>>         crypto_larval_kill(larval);
>>         return alg;
>>
>> In the two-thread situation, the first thread gets a larval with
>> refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
>> larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
>> the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
>> decrements the ref count twice.
>>
>> It seems to me like either each call to crypto_larval_lookup() should
>> result in a refcount bumped by two, or crypto_alg_mod_lookup() should
>> decrement only once, and the initial refcount should be 1 not 2 from
>> crypto_larval_add. But it's not clear to me which is sensible here.
>>
>> What's the right solution here?
>>
>> Thanks,
>>
>> -Kees
>>
>> --
>> Kees Cook
>> Chrome OS Security
>>
> I've been tracking this problem too:
> https://bugzilla.redhat.com/show_bug.cgi?id=1002351
>
> Though I'd just started so I've not gotten anywhere near this far.  It seems,
> given your analysis above that we really need to rework the refcounting here.
> At the very least we probably need to:
>
> 1) only modify the module reference count when a larval for a given alg is
> created, or when its last refcount is decremented.
>
> 2) fix up the cra_refcount to start at one and increment for each lookup, and
> decrement for each kill.

What I haven't been able to figure out is the "expected" behavior of a
larval. crypto_larval_kill() seems to be the only thing that removes
it from the alg list, so that seems like the only place a "put" should
be happening. Though that put should likely be the mod_put not the
alg_put. I don't think larval_wait should be doing a put, but it also
can perform a "get", so I'm baffled by that too. :)

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/