Message-Id: <20190722151426.5266-1-mplaneta@os.inf.tu-dresden.de>
Date:   Mon, 22 Jul 2019 17:14:16 +0200
From:   Maksym Planeta <mplaneta@...inf.tu-dresden.de>
To:     Moni Shoua <monis@...lanox.com>,
        Doug Ledford <dledford@...hat.com>,
        Jason Gunthorpe <jgg@...pe.ca>, linux-rdma@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     Maksym Planeta <mplaneta@...inf.tu-dresden.de>
Subject: [PATCH 00/10] Refactor rxe driver to remove multiple race conditions

This patchset removes the following race conditions:

  1. Tasklet functions incremented the reference count only after the
  tasklet had already started running.
  2. A pointer to a reference-counted object (kref) was obtained without
  protecting kref_put with a lock.
  3. QP cleanup sometimes scheduled cleanup for later execution in
  rxe_qp_do_cleanup, although the QP's memory could be freed immediately after
  returning from rxe_qp_cleanup.
  4. Non-atomic cleanup functions could be called in SoftIRQ context.
  5. Reference counters were manipulated inside a critical section that
  could be entered both inside and outside of SoftIRQ context. Such behavior
  may end up in a deadlock.

The easiest way to observe these problems is to compile the kernel with KASAN
and lockdep and abruptly stop an application using SoftRoCE during the
communication phase. On my system this often resulted in a kernel crash or a
deadlock inside the kernel.

To fix the above-mentioned problems, this patchset does the following:

  1. Replaces tasklets with workqueues
  2. Adds locks to kref_put
  3. Acquires references in an appropriate place
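
A minimal userspace sketch of the idea behind item 2, not the actual rxe
code: the lookup and the final put take the same lock, so a lookup can never
hand out a pointer to an object whose last reference is being dropped. All
names here (struct pool, pool_get_index, pool_put) are hypothetical, a C11
atomic_flag spinlock stands in for the kernel lock, and a plain counter
stands in for kref.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 8

struct obj {
	int refcnt;	/* protected by pool->lock in this sketch */
	int index;	/* slot in pool->table */
	int freed;	/* set on release; real code would free the memory */
};

struct pool {
	atomic_flag lock;
	struct obj *table[POOL_SIZE];
};

static void pool_lock(struct pool *p)
{
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;	/* spin until the lock is released */
}

static void pool_unlock(struct pool *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

void pool_init(struct pool *p)
{
	atomic_flag_clear(&p->lock);
	for (size_t i = 0; i < POOL_SIZE; i++)
		p->table[i] = NULL;
}

/* Publish an object with an initial reference; returns 0 or -1. */
int pool_add(struct pool *p, struct obj *o, int index)
{
	int ret = -1;

	if (index < 0 || index >= POOL_SIZE)
		return -1;
	pool_lock(p);
	if (!p->table[index]) {
		o->refcnt = 1;
		o->index = index;
		o->freed = 0;
		p->table[index] = o;
		ret = 0;
	}
	pool_unlock(p);
	return ret;
}

/* Look up an object and take a reference while holding the lock. */
struct obj *pool_get_index(struct pool *p, int index)
{
	struct obj *o;

	if (index < 0 || index >= POOL_SIZE)
		return NULL;
	pool_lock(p);
	o = p->table[index];
	if (o)
		o->refcnt++;
	pool_unlock(p);
	return o;
}

/* Drop a reference; the last put unpublishes the object under the lock. */
void pool_put(struct pool *p, struct obj *o)
{
	int last;

	pool_lock(p);
	assert(o->refcnt > 0);
	last = (--o->refcnt == 0);
	if (last)
		p->table[o->index] = NULL;
	pool_unlock(p);
	if (last)
		o->freed = 1;	/* no lookup can reach the object any more */
}
```

Without the lock in pool_put, a concurrent pool_get_index could read the
table entry just before the count reaches zero and return a pointer to
memory that is about to be released.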

As a drawback, performance is slightly reduced, because instead of trying to
execute the tasklet function directly, the new version always puts it onto
the queue.
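
The difference can be sketched in single-threaded userspace C. This is only
an illustration under assumptions, not the rxe_task code: the names
(struct work_queue, run_task_old, run_task_new, wq_drain) and the `sched`
flag are hypothetical. The old-style path may call the handler inline, while
the new-style path always defers it to a later drain by the worker, which
costs an extra queueing step.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 16

typedef void (*task_fn)(void *arg);

struct task {
	task_fn fn;
	void *arg;
};

struct work_queue {
	struct task items[QUEUE_CAP];
	size_t head;	/* next item to run */
	size_t tail;	/* next free slot */
};

/* Append a task; returns 0 on success, -1 if the queue is full. */
int wq_enqueue(struct work_queue *q, struct task t)
{
	if (q->tail - q->head == QUEUE_CAP)
		return -1;
	q->items[q->tail % QUEUE_CAP] = t;
	q->tail++;
	return 0;
}

/* Old style: run the handler inline unless the caller asks to defer. */
int run_task_old(struct work_queue *q, struct task t, int sched)
{
	if (sched)
		return wq_enqueue(q, t);
	t.fn(t.arg);
	return 0;
}

/* New style: always defer to the queue. */
int run_task_new(struct work_queue *q, struct task t)
{
	return wq_enqueue(q, t);
}

/* Worker side: run everything queued so far; returns the number run. */
size_t wq_drain(struct work_queue *q)
{
	size_t n = 0;

	while (q->head != q->tail) {
		struct task t = q->items[q->head % QUEUE_CAP];

		q->head++;
		assert(t.fn != NULL);
		t.fn(t.arg);
		n++;
	}
	return n;
}

/* Example handler: counts how many times it was called. */
void count_call(void *arg)
{
	++*(int *)arg;
}
```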

TBH, I'm not sure that I removed all of the problems, but the driver
definitely behaves much more stably now. I would be glad to get some
help with additional testing.

 drivers/infiniband/sw/rxe/rxe_comp.c        |  38 ++----
 drivers/infiniband/sw/rxe/rxe_cq.c          |  17 ++-
 drivers/infiniband/sw/rxe/rxe_hw_counters.c |   1 -
 drivers/infiniband/sw/rxe/rxe_hw_counters.h |   1 -
 drivers/infiniband/sw/rxe/rxe_loc.h         |   3 +-
 drivers/infiniband/sw/rxe/rxe_mcast.c       |  22 ++--
 drivers/infiniband/sw/rxe/rxe_mr.c          |  10 +-
 drivers/infiniband/sw/rxe/rxe_net.c         |  21 ++-
 drivers/infiniband/sw/rxe/rxe_pool.c        |  40 ++++--
 drivers/infiniband/sw/rxe/rxe_pool.h        |  16 ++-
 drivers/infiniband/sw/rxe/rxe_qp.c          | 130 +++++++++---------
 drivers/infiniband/sw/rxe/rxe_recv.c        |   8 +-
 drivers/infiniband/sw/rxe/rxe_req.c         |  17 +--
 drivers/infiniband/sw/rxe/rxe_resp.c        |  54 ++++----
 drivers/infiniband/sw/rxe/rxe_task.c        | 139 +++++++-------------
 drivers/infiniband/sw/rxe/rxe_task.h        |  40 ++----
 drivers/infiniband/sw/rxe/rxe_verbs.c       |  81 ++++++------
 drivers/infiniband/sw/rxe/rxe_verbs.h       |   8 +-
 18 files changed, 302 insertions(+), 344 deletions(-)

-- 
2.20.1
