netdev - Question regarding netpoll

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <201410112036.CGG26070.FOHSFMtOQOVLFJ@I-love.SAKURA.ne.jp>
Date:	Sat, 11 Oct 2014 20:36:21 +0900
From:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:	ebiederm@...ssion.com, nhorman@...driver.com, davem@...emloft.net
Cc:	netdev@...r.kernel.org
Subject: Question regarding netpoll_info != NULL check.

Hello.

A 2.6.32-220.4.2.el6 system crashed at poll_one_napi() when the administrator
was updating network configuration and by error issued ifconfig command with
a wrong argument.

His system was using a bonding device as netconsole's the ethernet device to
send console messages out of, and poll_one_napi() was called due to printk()
by changing status of a bonding device.

---------- linux-2.6.32-220.4.2.el6.x86_64/net/core/netpoll.c ----------
128:static int poll_one_napi(struct netpoll_info *npinfo,
129:                     struct napi_struct *napi, int budget)
130:{
131:    int work;
132:
133:    /* net_rx_action's ->poll() invocations and our's are
134:     * synchronized by this test which is only made while
135:     * holding the napi->poll_lock.
136:     */
137:    if (!test_bit(NAPI_STATE_SCHED, &napi->state))
138:            return budget;
139:
140:    npinfo->rx_flags |= NETPOLL_RX_DROP;
141:    atomic_inc(&trapped);
142:    set_bit(NAPI_STATE_NPSVC, &napi->state);
143:
144:    work = napi->poll(napi, budget);
145:    trace_napi_poll(napi);
146:
147:    clear_bit(NAPI_STATE_NPSVC, &napi->state);
148:    atomic_dec(&trapped);
149:    npinfo->rx_flags &= ~NETPOLL_RX_DROP;
150:
151:    return budget - work;
152:}
153:
154:static void poll_napi(struct net_device *dev)
155:{
156:    struct napi_struct *napi;
157:    int budget = 16;
158:
159:    list_for_each_entry(napi, &dev->napi_list, dev_list) {
160:            if (napi->poll_owner != smp_processor_id() &&
161:                spin_trylock(&napi->poll_lock)) {
162:                    budget = poll_one_napi(dev->npinfo, napi, budget);
163:                    spin_unlock(&napi->poll_lock);
164:
165:                    if (!budget)
166:                            break;
167:            }
168:    }
169:}
170:
171:static void service_arp_queue(struct netpoll_info *npi)
172:{
173:    if (npi) {
174:            struct sk_buff *skb;
175:
176:            while ((skb = skb_dequeue(&npi->arp_tx)))
177:                    arp_reply(skb);
178:    }
179:}
180:
181:void netpoll_poll_dev(struct net_device *dev)
182:{
183:    const struct net_device_ops *ops;
184:
185:    if (!dev || !netif_running(dev))
186:            return;
187:
188:    ops = dev->netdev_ops;
189:    if (!ops->ndo_poll_controller)
190:            return;
191:
192:    /* Process pending work on NIC */
193:    ops->ndo_poll_controller(dev);
194:
195:    poll_napi(dev);
196:
197:    if (dev->priv_flags & IFF_SLAVE) {
198:            if (dev->npinfo) {
199:                    struct net_device *bond_dev = dev->master;
200:                    struct sk_buff *skb;
201:                    while ((skb = skb_dequeue(&dev->npinfo->arp_tx))) {
202:                            skb->dev = bond_dev;
203:                            skb_queue_tail(&bond_dev->npinfo->arp_tx, skb);
204:                    }
205:            }
206:    }
207:
208:    service_arp_queue(dev->npinfo);
209:
210:    zap_completion_queue();
211:}
---------- linux-2.6.32-220.4.2.el6.x86_64/net/core/netpoll.c ----------

We can see that while line 198 did dev->npinfo != NULL check, line 162 called
by line 195 did not do dev->npinfo != NULL check and thus npinfo was NULL at
line 140 called by line 162.

What is strange, we kept this dev->npinfo != NULL check until commit 18b37535
("netpoll: Consolidate neigh_tx processing in service_neigh_queue") and that
commit depends on dev->npinfo != NULL assumption which commit ca99ca14
("netpoll: protect napi_poll and poll_controller during dev_[open|close]")
is also assuming. But the above-mentioned system crashed due to
dev->npinfo == NULL. Why we don't do dev->npinfo != NULL check?

Excuse me, I didn't do a full-fledged study because the size of vmcore was
too huge to receive. Therefore I asked the administrator to do a superficial
analysis by feeding predefined commands to the crash utility. If the reason
we don't need dev->npinfo != NULL check is that dev->npinfo != NULL is always
true, how should I verify that dev->npinfo == NULL was a race condition?

Regards.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html