Message-ID: <20080813071221.GB4872@verge.net.au>
Date: Wed, 13 Aug 2008 17:12:22 +1000
From: Simon Horman <horms@...ge.net.au>
To: Sven Wegener <sven.wegener@...aler.net>
Cc: lvs-devel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [RFC,PATCH] ipvs: Fix race condition in lblc and lblcr
schedulers
On Wed, Aug 13, 2008 at 08:59:32AM +0200, Sven Wegener wrote:
> On Wed, 13 Aug 2008, Simon Horman wrote:
>
> > On Tue, Aug 12, 2008 at 03:07:48PM +0200, Sven Wegener wrote:
> > > On Tue, 12 Aug 2008, Sven Wegener wrote:
> > >
> > > > On Tue, 12 Aug 2008, Simon Horman wrote:
> > > >
> > > > > On Tue, Aug 12, 2008 at 12:57:21AM +0200, Sven Wegener wrote:
> > > > > > Both schedulers have a race condition that happens in the following
> > > > > > situation:
> > > > > >
> > > > > > We have an entry in our table that has already expired according to its
> > > > > > last use time. Then we need to schedule a new connection that uses this
> > > > > > entry.
> > > > > >
> > > > > > CPU 1                          CPU 2
> > > > > >
> > > > > > ip_vs_lblc_schedule()
> > > > > >   ip_vs_lblc_get()
> > > > > >     lock table for read
> > > > > >     find entry
> > > > > >     unlock table
> > > > > >                                ip_vs_lblc_check_expire()
> > > > > >                                  lock table for write
> > > > > >                                  kfree() expired entry
> > > > > >                                  unlock table
> > > > > >     return invalid entry
> > > > > >
> > > > > > The problem is that we assign the last use time outside of our critical
> > > > > > region. We can make hitting this race more difficult, if not impossible,
> > > > > > if we assign the last use time while still holding the lock for reading.
> > > > > > That gives us six minutes during which it's safe to use the entry, which
> > > > > > should be enough for our use case, as we're going to use it immediately
> > > > > > and don't keep a long reference to it.
> > > > > >
> > > > > > We're holding the lock for reading and not for writing. The last use time
> > > > > > is an unsigned long, so the assignment should be atomic by itself. And we
> > > > > > don't care if some other user sets it to a slightly different value. The
> > > > > > read_unlock() implies a barrier so that other CPUs see the new last use
> > > > > > time during cleanup, even if we're just using a read lock.
> > > > > >
> > > > > > Other solutions would be: 1) protect the whole ip_vs_lblc_schedule() with
> > > > > > write_lock()ing the lock, 2) add reference counting for the entries, 3)
> > > > > > protect each entry with its own lock. And all are bad for performance.
> > > > > >
> > > > > > Comments? Ideas?
> > > > >
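For reference, here is a minimal sketch of that first idea, with the
lastuse assignment pulled inside the read-side critical section of
ip_vs_lblc_get(). This is a sketch only, not the patch that follows:

	static inline struct ip_vs_lblc_entry *
	ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 addr)
	{
		unsigned hash = ip_vs_lblc_hashkey(addr);
		struct ip_vs_lblc_entry *en;

		read_lock(&tbl->lock);
		list_for_each_entry(en, &tbl->bucket[hash], list) {
			if (en->addr == addr) {
				/* word-sized store done under the lock; the
				 * read_unlock() barrier publishes it to the
				 * expiry path */
				en->lastuse = jiffies;
				read_unlock(&tbl->lock);
				return en;
			}
		}
		read_unlock(&tbl->lock);
		return NULL;
	}
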
> > > > > Is there a pathological case here if sysctl_ip_vs_lblc_expiration is
> > > > > set to be very short and we happen to hit ip_vs_lblc_full_check()?
> > > >
> > > > Yes.
> > > >
> > > > > To be honest I think that I like the reference count approach best,
> > > > > as it seems safe and simple. Is it really going to be horrible
> > > > > for performance?
> > > >
> > > > Probably not, I guess the sentence was a bit pessimistic.
> > > >
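For completeness, the reference-counting variant (option 2 above) would
look something like the sketch below. The refcnt field on the entry is
hypothetical; the actual entries don't have one:

	/* Hypothetical refcounted lookup: the entry can't be freed between
	 * lookup and use, at the cost of two atomic ops per packet and an
	 * explicit put by every caller.  en->refcnt does not exist today. */
	static inline struct ip_vs_lblc_entry *
	ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 addr)
	{
		unsigned hash = ip_vs_lblc_hashkey(addr);
		struct ip_vs_lblc_entry *en;

		read_lock(&tbl->lock);
		list_for_each_entry(en, &tbl->bucket[hash], list) {
			if (en->addr == addr) {
				atomic_inc(&en->refcnt);
				read_unlock(&tbl->lock);
				return en;
			}
		}
		read_unlock(&tbl->lock);
		return NULL;
	}

	static inline void ip_vs_lblc_put(struct ip_vs_lblc_entry *en)
	{
		/* last user frees; the expiry path would only drop its ref */
		if (atomic_dec_and_test(&en->refcnt)) {
			atomic_dec(&en->dest->refcnt);
			kfree(en);
		}
	}
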
> > > > > If so, I wonder if a workable solution would be to provide a more fine-grained
> > > > > lock on tbl. Something like the way that ct_read_lock/unlock() works.
> > > >
> > > > Also possible. But I guess I was overcomplicating things last night. What
> > > > I was after with the "protect the whole ip_vs_lblc_schedule() with
> > > > write_lock()ing the lock" was also to simply prevent someone adding
> > > > duplicate entries. If we just extend the read_lock() region to cover the
> > > > whole usage of the entry and do an additional duplicate check during
> > > > inserting the entry under write_lock(), we fix the issue and also fix the
> > > > race that someone may add duplicate entries. We have a bit of overhead,
> > > > because we unlock/lock and there is also a chance of doing one or more
> > > > useless __ip_vs_wlc_schedule() calls, but in the end this should work. More
> > > > fine-grained locking could help to lower the impact.
> > >
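A striped lock array in the style of ct_read_lock() could look roughly
like this; the names and stripe count are made up for illustration:

	/* Hypothetical lock striping for the lblc table, modelled on the
	 * connection table's ct_read_lock()/ct_read_unlock() helpers. */
	#define LBLC_LOCKARRAY_SIZE	32
	#define LBLC_LOCKARRAY_MASK	(LBLC_LOCKARRAY_SIZE - 1)

	static rwlock_t lblc_lock_array[LBLC_LOCKARRAY_SIZE];

	static inline void lblc_read_lock(unsigned hash)
	{
		read_lock(&lblc_lock_array[hash & LBLC_LOCKARRAY_MASK]);
	}

	static inline void lblc_read_unlock(unsigned hash)
	{
		read_unlock(&lblc_lock_array[hash & LBLC_LOCKARRAY_MASK]);
	}

	/* writers (insert/expire) take the same stripe with write_lock() */
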
> > > I wondered if this whole thing can ever be made totally race-condition
> > > free without changing how destinations are purged from the trash. An
> > > initial version of the patch is below. Basically it pushes the locking
> > > up into ip_vs_lblc_schedule() and rearranges the code so that we are
> > > guaranteed to have a valid destination.
> >
> > Hi Sven,
> >
> > I think that I am happy with this, except for updating lastuse
> > under a read lock: is it really safe? Do you still have any concerns?
> >
> > I've annotated the code with my thoughts on how your locking is working
> > - thinking aloud if you like.
>
> Your locking description is exactly how I intended it.
Great, I wanted to make sure that we are both on the same page.
> > > diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
> > > index 7a6a319..67f7b04 100644
> > > --- a/net/ipv4/ipvs/ip_vs_lblc.c
> > > +++ b/net/ipv4/ipvs/ip_vs_lblc.c
> > > @@ -123,31 +123,6 @@ static ctl_table vs_vars_table[] = {
> > >
> > > static struct ctl_table_header * sysctl_header;
> > >
> > > -/*
> > > - * new/free a ip_vs_lblc_entry, which is a mapping of a destionation
> > > - * IP address to a server.
> > > - */
> > > -static inline struct ip_vs_lblc_entry *
> > > -ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
> > > -{
> > > - struct ip_vs_lblc_entry *en;
> > > -
> > > - en = kmalloc(sizeof(struct ip_vs_lblc_entry), GFP_ATOMIC);
> > > - if (en == NULL) {
> > > - IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> > > - return NULL;
> > > - }
> > > -
> > > - INIT_LIST_HEAD(&en->list);
> > > - en->addr = daddr;
> > > -
> > > - atomic_inc(&dest->refcnt);
> > > - en->dest = dest;
> > > -
> > > - return en;
> > > -}
> > > -
> > > -
> > > static inline void ip_vs_lblc_free(struct ip_vs_lblc_entry *en)
> > > {
> > > list_del(&en->list);
> > > @@ -189,10 +164,8 @@ ip_vs_lblc_hash(struct ip_vs_lblc_table *tbl, struct ip_vs_lblc_entry *en)
> > > */
> > > hash = ip_vs_lblc_hashkey(en->addr);
> > >
> > > - write_lock(&tbl->lock);
> > > list_add(&en->list, &tbl->bucket[hash]);
> > > atomic_inc(&tbl->entries);
> > > - write_unlock(&tbl->lock);
> > >
> > > return 1;
> > > }
> > > @@ -209,23 +182,54 @@ ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 addr)
> > >
> > > hash = ip_vs_lblc_hashkey(addr);
> > >
> > > - read_lock(&tbl->lock);
> > > -
> > > list_for_each_entry(en, &tbl->bucket[hash], list) {
> > > if (en->addr == addr) {
> > > /* HIT */
> > > - read_unlock(&tbl->lock);
> > > return en;
> > > }
> > > }
> > >
> > > - read_unlock(&tbl->lock);
> > > -
> > > return NULL;
> > > }
> > >
> > >
> > > /*
> > > + * new/free an ip_vs_lblc_entry, which is a mapping of a destination
> > > + * IP address to a server.
> > > + */
> > > +static inline struct ip_vs_lblc_entry *
> > > +ip_vs_lblc_new(struct ip_vs_lblc_table *tbl, __be32 daddr,
> > > + struct ip_vs_dest *dest)
> > > +{
> > > + struct ip_vs_lblc_entry *en;
> > > +
> > > + en = ip_vs_lblc_get(tbl, daddr);
> > > + if (!en) {
> > > + en = kmalloc(sizeof(*en), GFP_ATOMIC);
> > > + if (!en) {
> > > + IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> > > + return NULL;
> > > + }
> > > +
> > > + INIT_LIST_HEAD(&en->list);
> > > + en->addr = daddr;
> > > + en->lastuse = jiffies;
> > > +
> > > + atomic_inc(&dest->refcnt);
> > > + en->dest = dest;
> > > +
> > > + ip_vs_lblc_hash(tbl, en);
> > > + } else if (en->dest != dest) {
> > > + atomic_dec(&en->dest->refcnt);
> > > + atomic_inc(&dest->refcnt);
> > > + en->dest = dest;
> > > + }
> > > +
> > > + return en;
> > > +}
> > > +
> > > +
> > > +/*
> > > * Flush all the entries of the specified table.
> > > */
> > > static void ip_vs_lblc_flush(struct ip_vs_lblc_table *tbl)
> > > @@ -484,7 +488,7 @@ is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
> > > static struct ip_vs_dest *
> > > ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
> > > {
> > > - struct ip_vs_dest *dest;
> > > + struct ip_vs_dest *dest = NULL;
> > > struct ip_vs_lblc_table *tbl;
> > > struct ip_vs_lblc_entry *en;
> > > struct iphdr *iph = ip_hdr(skb);
> > > @@ -492,38 +496,43 @@ ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
> > > IP_VS_DBG(6, "ip_vs_lblc_schedule(): Scheduling...\n");
> > >
> > > tbl = (struct ip_vs_lblc_table *)svc->sched_data;
> > > + /* First look in our cache */
> > > + read_lock(&tbl->lock);
> > > en = ip_vs_lblc_get(tbl, iph->daddr);
> > > - if (en == NULL) {
> > > - dest = __ip_vs_wlc_schedule(svc, iph);
> > > - if (dest == NULL) {
> > > - IP_VS_DBG(1, "no destination available\n");
> > > - return NULL;
> > > - }
> > > - en = ip_vs_lblc_new(iph->daddr, dest);
> > > - if (en == NULL) {
> > > - return NULL;
> > > - }
> > > - ip_vs_lblc_hash(tbl, en);
> > > - } else {
> > > + if (en) {
> > > dest = en->dest;
> > > - if (!(dest->flags & IP_VS_DEST_F_AVAILABLE)
> > > - || atomic_read(&dest->weight) <= 0
> > > - || is_overloaded(dest, svc)) {
> > > - dest = __ip_vs_wlc_schedule(svc, iph);
> > > - if (dest == NULL) {
> > > - IP_VS_DBG(1, "no destination available\n");
> > > - return NULL;
> > > - }
> > > - atomic_dec(&en->dest->refcnt);
> > > - atomic_inc(&dest->refcnt);
> > > - en->dest = dest;
> > > - }
> > > + en->lastuse = jiffies;
> >
> > Setting en->lastuse with a read lock held makes me feel uncomfortable.
> > Are you sure it is safe?
>
> It looks suspicious, but word-sized writes should be atomic by themselves.
> So even doing it under a read lock should not lead to corrupted values. We
> don't care if the values that are set concurrently are overwritten,
> because they should only differ by a couple of jiffies. Might be good to
> have someone else comment on the issue too.
Ok. I was only worried about the atomicity. I agree that having the
value rewritten and potentially slightly inaccurate is not a concern.
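To spell out the ordering we are relying on, in the style of the
diagram from the original mail:

	CPU 1                          CPU 2

	lock table for read
	en->lastuse = jiffies
	unlock table (implies barrier,
	              store is published)
	                               ip_vs_lblc_check_expire()
	                                 lock table for write
	                                 sees fresh lastuse, keeps entry
	                                 unlock table
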
> > > +
> > > + /*
> > > + * If the destination is not available, i.e. it's in the trash,
> > > + * ignore it, as it may be removed from under our feet
> > > + */
> > > +
> > > + if (!(dest->flags & IP_VS_DEST_F_AVAILABLE))
> > > + dest = NULL;
> > > }
> > > - en->lastuse = jiffies;
> >
> > At this point we know that dest exists and is not in the trash.
> > We know that dest can't disappear, given that it's not already
> > in the trash and our caller holds a read lock on __ip_vs_svc_lock.
> > So dest will be safe until after ip_vs_lblc_schedule() returns.
> >
> > Dest seems ok :-)
> >
> > Ok, that seems complex but non-racy to me :-)
> >
> > Perhaps a longer comment would be in order.
>
> Yeah, that's the point where we prevent the race with the trash purging.
> We could change the purging of destinations from the trash to be under
> __ip_vs_svc_lock write locked, then we know that all destinations, even
> the ones in the trash are valid. Might make more sense than duplicating
> this logic in other schedulers.
That sounds like a good way to simplify things.
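Roughly, the purge would then be wrapped like this (a sketch;
purge_trash_dest() is a stand-in name, not an existing function):

	/* Suggested simplification: purge trashed destinations only while
	 * holding __ip_vs_svc_lock for writing.  Schedulers run with it
	 * held for reading, so every dest they can see stays valid. */
	write_lock_bh(&__ip_vs_svc_lock);
	purge_trash_dest(dest);		/* stand-in: unlink from trash and free */
	write_unlock_bh(&__ip_vs_svc_lock);
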
> > > + read_unlock(&tbl->lock);
> >
> > After the lock is released we may now lose en, but we don't care,
> > as it's not used again inside ip_vs_lblc_schedule().
> >
> > So far so good :-)
> >
> > > +
> > > + /* If the destination has a weight and is not overloaded, use it */
> > > + if (dest && atomic_read(&dest->weight) > 0 && !is_overloaded(dest, svc))
> > > + goto out;
> > > +
> > > + /* No cache entry or it is invalid, time to schedule */
> > > + dest = __ip_vs_wlc_schedule(svc, iph);
> > > + if (!dest) {
> > > + IP_VS_DBG(1, "no destination available\n");
> > > + return NULL;
> > > + }
> >
> > This new dest must also not be in the trash by virtue of
> > being returned by __ip_vs_wlc_schedule() and deletions from
> > svc->destinations being protected by __ip_vs_svc_lock, which
> > is held by the caller.
> >
> > Also good :-)
> >
> > BTW, the name of __ip_vs_wlc_schedule really ought to be changed
> > to __ip_vs_lblc_schedule. I'll make a trivial patch for that.
> >
> > > +
> > > + /* If we fail to create a cache entry, we'll just use the valid dest */
> > > + write_lock(&tbl->lock);
> > > + ip_vs_lblc_new(tbl, iph->daddr, dest);
> > > + write_unlock(&tbl->lock);
> >
> > This write lock looks obviously correct as the destructors
> > do the same thing.
> >
> > > +out:
> > > IP_VS_DBG(6, "LBLC: destination IP address %u.%u.%u.%u "
> > > "--> server %u.%u.%u.%u:%d\n",
> > > - NIPQUAD(en->addr),
> > > + NIPQUAD(iph->daddr),
> > > NIPQUAD(dest->addr),
> > > ntohs(dest->port));
> > >