Bluetooth: HCI cmd response timeout #34659
Comments
Thanks for the reference to this open patch. I ported it to Sadly, I still experience the crashes, but it looks a bit different now:
Another time it looked as following:
|
I think if this is the fix needed the symptom should be possible to see in a sniffer log, it should be locked with NACKS at the LL. The USAGE fault could mean a stack overflow, or calling of an invalid function pointer. |
Thanks for the tip. I did a trace using an nRF52840-DK and Wireshark. The tests were performed with an unpatched Zephyr
Measure Mode: Data notifications (handle 0x28, 241 bytes of payload) at high rate
Read other Characteristic (Handle 0x39) while in Measure Mode
The first and the third black lines are the read requests; the second black line and the blue lines are the responses.
Watchdog Reset after some minutes
Reading the log Characteristic (Handle 0x39) while in Measure Mode works for some minutes but then crashes with a Watchdog Reset (marked black, 60 seconds window):
Device Log
|
@caco3 so to summarize:
Is this correct? |
I am not 100% sure if the patch did help. Fact is, running stock 2.5.0 or 1cc7307 leads to
Then I ported the mentioned patch to 2.5.0 and gave it another try. The system still crashes but the trace looks a bit different, so I am not sure if it is the same issue:
Is there a way to get more output (printk) in case of such a crash? Note 1: Note 2: |
@joerchan will the TX thread time out if the num completes do not arrive for a long time, up to the supervision timeout? |
No, TX thread waits for num completes indefinitely. |
@caco3 we will need more information in order to try and figure out the root cause. Ideally we'd need a small sample that reproduces this on an Nordic DK so we can then analyze the issue. |
@caco3 when you come back please do comment on this issue, since Zephyr 2.6.0 is currently being stabilized and ideally we'd like to include a fix for this. |
@carlescufi I will look into this after I get back to work. However, this will only be at the beginning of July, most likely too late for the |
Just a little update: I will now try to build a minimal example to reproduce the issue. |
Thanks @caco3. Please do let us know when you have the sample so we can analyze it. |
We are trying hard to trace the issue down far enough that we can provide a minimal example. I ported the mentioned patch to Zephyr It will need further testing, but I strongly believe that this patch helps against the crashing! We still have stability issues, but maybe they are no longer directly related to this issue. |
I feel we need to clarify the setup here, because we are confused. |
Sorry for the confusion. I try to describe it better below: Setup A:
Setup B: The Measurement Device sends data at around 625 kbit/s to the Processing Host using GATT Notify. Without the patch the Output of `btmon` (run on the Cortex-A8):
Now, when we replace the
Output of `btmon` (run on the x86 machine):
Connection Parameters (Measurement Device):
|
@caco3 Thanks for the clarification. This issue seems to always happen when your ring buffer fills up:
I have a feeling that the fix you backported (#26057) is only masking the real issue, which seems to be related to your ring buffer being full and then somehow preventing the scheduler from running the controller thread. Can you confirm that the issue only happens when your ring buffer is full? |
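One way to confirm or rule out the "only when the ring buffer is full" correlation is to count rejected writes and report the counter alongside the stall. The following is a minimal sketch in plain C, not the application's buffer and not Zephyr's ring buffer API; `struct ring`, `RB_CAPACITY`, `ring_put()` and `ring_get()` are hypothetical names, and the code is not safe for concurrent producer and consumer without additional locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RB_CAPACITY 8u /* hypothetical size, chosen only for illustration */

struct ring {
	uint32_t slots[RB_CAPACITY];
	size_t head;             /* index of the next slot to write */
	size_t tail;             /* index of the next slot to read */
	size_t count;            /* number of occupied slots */
	unsigned long overflows; /* number of writes rejected because the ring was full */
};

static void ring_init(struct ring *rb)
{
	assert(rb != NULL);
	rb->head = 0;
	rb->tail = 0;
	rb->count = 0;
	rb->overflows = 0;
}

/* Returns 0 on success, -1 if the ring is full (the overflow is counted). */
static int ring_put(struct ring *rb, uint32_t value)
{
	assert(rb != NULL);
	if (rb->count == RB_CAPACITY) {
		rb->overflows++;
		return -1;
	}
	rb->slots[rb->head] = value;
	rb->head = (rb->head + 1) % RB_CAPACITY;
	rb->count++;
	return 0;
}

/* Returns 0 on success, -1 if the ring is empty. */
static int ring_get(struct ring *rb, uint32_t *out)
{
	assert(rb != NULL && out != NULL);
	if (rb->count == 0) {
		return -1;
	}
	*out = rb->slots[rb->tail];
	rb->tail = (rb->tail + 1) % RB_CAPACITY;
	rb->count--;
	return 0;
}
```

If `overflows` stays at zero up to the moment of the stall, the full-buffer hypothesis becomes less likely; if it always increments just before, that supports it.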
@jfischer-no , @carlescufi FYI Thanks to the great support of @cvinayak , we could trace it further down to a stalling in the USB subsystem. It seems to hang at the I added
I tested this with commit
|
Seems to depend on USB host controller.
I guess it times out in a loop; not sure why you do not see a 'p'. These lines were changed by cd74614; before that it was just
printk (over UART?) is too slow (synchronous if not via logging) and "too much printk" is not suitable for tracing something in USB device support. |
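To put a rough number on "too slow": the sketch below estimates how long a synchronous print blocks on a UART. It assumes 115200 baud with 8N1 framing (10 bits on the wire per character); the actual console settings are not stated in this thread, and `uart_line_time_ms()` is a hypothetical helper.

```c
#include <assert.h>

/* Time in milliseconds to push `chars` characters through a UART,
 * assuming 8N1 framing: 1 start bit + 8 data bits + 1 stop bit.
 */
static double uart_line_time_ms(unsigned int chars, unsigned int baud)
{
	assert(baud > 0);
	return (double)chars * 10.0 * 1000.0 / (double)baud;
}
```

Under these assumptions a 40-character line takes about 3.47 ms, which is longer than the 3 ms block period implied by the 333 Hz rate in the bug description. A single blocking print per block would therefore already disturb the timing being traced, which is why toggling GPIO pins is the less intrusive option.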
you do not need |
@jfischer-no There are not many HCI events, only ACL Data is being received as fast as possible in the scenario. Unrelated, |
But usb_transfer_sync is not finished when k_sem_take times out; it tries again and again until it is successful or cancelled. |
@jfischer-no Not on the hardware we use. Maybe on a different board, but that might also use a different USB-Controller.
@jfischer-no I used the ACM-UART via SEGGER Programmer. |
I recorded the
In normal operation, At the start of the stalling, Because
Below the modified usb_transfer_sync():
int usb_transfer_sync(uint8_t ep, uint8_t *data, size_t dlen, unsigned int flags)
{
	struct usb_transfer_sync_priv pdata;
	int ret;

	/* PIN3 is high while usb_transfer() queues the transfer */
	DEBUG_PORT->OUTSET = DEBUG_PIN3;
	k_sem_init(&pdata.sem, 0, 1);
	ret = usb_transfer(ep, data, dlen, flags, usb_transfer_sync_cb, &pdata);
	DEBUG_PORT->OUTCLR = DEBUG_PIN3;
	if (ret) {
		return ret;
	}

	/* Semaphore will be released by the transfer completion callback
	 * which might not be called when transfer was cancelled
	 */
	while (1) {
		struct usb_transfer_data *trans;

		/* PIN4 is high while waiting for the completion semaphore */
		DEBUG_PORT->OUTSET = DEBUG_PIN4;
		ret = k_sem_take(&pdata.sem, K_MSEC(USB_TRANSFER_SYNC_TIMEOUT));
		DEBUG_PORT->OUTCLR = DEBUG_PIN4;
		if (ret == 0) {
			break;
		}

		/* PIN5 pulses on every timeout, while the transfer state is checked */
		DEBUG_PORT->OUTSET = DEBUG_PIN5;
		trans = usb_ep_get_transfer(ep);
		DEBUG_PORT->OUTCLR = DEBUG_PIN5;
		if (!trans || trans->status != -EBUSY) {
			LOG_WRN("Sync transfer cancelled, ep 0x%02x", ep);
			return -ECANCELED;
		}
	}

	return pdata.tsize;
}
|
usb_ep_get_transfer() is there to check if the transfer is still valid and not cancelled, for example by a detach from the host (see cd74614; before that it was just k_sem_take(&pdata.sem, K_FOREVER);).
See my comment above:
Depending on what tools you are equipped with, you could listen on the bus, e.g. with a USB protocol analyzer. If the host is still sending IN tokens for the endpoint some time after the stall, then the problem is on the Zephyr OS side. |
I do not have a USB analyzer at hand, but we ordered one. However, it will take some weeks until I have it in my hands. I ran bluetoothd in the foreground, in debug mode. I can see messages while the device gets connected, but during the streaming and around the stalling not a single message is shown!
bluetoothd -n -d | tee bluetooth.log
Start and connecting:
Sep 1 07:03:43 wlreceiver daemon.debug bluetoothd[1076]: ../bluez-5.54/src/adapter.c:start_discovery() sender :1.11
Start streaming data:
Sep 1 07:03:49 wlreceiver daemon.debug bluetoothd[1076]: ../bluez-5.54/src/adapter.c:new_conn_param() hci0 00:07:29:4B:43:B4 (1) min 0x0006 max 0x0006 latency 0x0000 timeout 0x000a
Stalling starts after some seconds (shown time is wrong), but nothing is shown in the log:
Sep 1 07:03:47 wlreceiver daemon.warn wrcd-clientd[599]: fpgainject: WARNING: FIFO underflow!
After a while, BlueZ shows the device as disconnected; however, the device does not indicate any disconnect!
Sep 1 07:04:17 wlreceiver daemon.debug bluetoothd[1076]: ../bluez-5.54/src/device.c:att_disconnected_cb()
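For readers decoding the new_conn_param() line: assuming the logged values are the raw fields as defined by the Bluetooth Core Specification (connection interval in units of 1.25 ms, supervision timeout in units of 10 ms), they can be converted as sketched below. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Connection interval: raw value in units of 1.25 ms. */
static double conn_interval_ms(uint16_t raw)
{
	return raw * 1.25;
}

/* Supervision timeout: raw value in units of 10 ms. */
static double supervision_timeout_ms(uint16_t raw)
{
	return raw * 10.0;
}

/* Average number of notifications that must fit into one connection event
 * to sustain `rate_hz` notifications per second.
 */
static double notifications_per_event(double interval_ms, double rate_hz)
{
	assert(interval_ms > 0.0);
	return rate_hz * interval_ms / 1000.0;
}
```

Under that assumption, min/max 0x0006 is a 7.5 ms connection interval and timeout 0x000a is a 100 ms supervision timeout. At the roughly 333 notifications per second from the bug description, that is an average of about 2.5 notifications per connection event.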
|
To add information to this issue for future reference:
|
👍 Just for clarification:
The stalling issue only occurs once I start to do excessive GATT reads from the central to the peripheral while the peripheral still transmits data at the high throughput |
Can you please attach the trace once you have it? |
Sorry for the late response. Below some information about the system:
Our system is based on an AM335x. The USB controller is built in.
I went through the AM335x Errata and found one issue on page 18 that might be related:
I hope to be able to provide some USB traces in the next 1..2 weeks. |
I have an AM335x SBC somewhere. Can you please tell who is the manufacturer of your SoM/SBC, gladly privately if it is confidential. |
The module is designed by us, thus very specific and proprietary. |
Downgrading this from Medium to Low since it now looks like it could be a USB Host issue on the Linux SBC side. The actual Bluetooth bug that started this issue is already covered by #25917, and is listed as a known issue for now, with a workaround in #26057. Discussion can continue here to try and narrow down whether the USB issue can be attributed exclusively to the Linux side or whether there is potentially an issue on the Zephyr side. |
@jfischer-no Sorry for the late response 😒 . I finally got the logic analyzer. However, I had to realize it only supports USB 1.1 but the BLE dongle runs with USB 2.0. So at the moment, I am not able to provide a logic analyzer trace. Were you able to investigate something with your board? |
No, TBH it would take a lot of time which I do not currently have. |
@caco3 it would be good if you could narrow down the problem to find out whether this can indeed be Zephyr related or not. At the moment this doesn't look like it, so I would like to close the issue if possible. |
The only way I see is to get a USB 2.0 sniffer. Sadly we don't have one, and this issue is not very high on the priority list on our side atm. 😒 |
Dear @jfischer-no and @carlescufi Some new information:
|
@caco3 Thank you for the information. I hope that in the future we are able to reproduce the issue in a controlled setup and fix it (and that I remember this issue if any new related issue is created). |
Describe the bug
We have an application based on an nRF52840. We see a `k_sem_take failed with err -11` assertion when we stress the device with a lot of notifications and read requests at the same time.
The system is used for a measurement device. We run an external ADC at 10 kSamples/s and read its data through SPI DMA transfers. After 30 samples (=> 333 Hz) we trigger an interrupt handler which copies the data into another buffer. Then we release a semaphore. A kernel thread is waiting for this semaphore so it can grab the data from the buffer and call `bt_gatt_notify()` with it. DLE is enabled and the data length is set to 251.
Code snippets:
Log:
What is the best way to trace this further down?
prj.conf (sanitized):
We also played with the following parameters but did not see an improvement:
and changed `TIMER_IRQ_PRIORITY` to 3
Environment:
2.5.0 (but also tested with 1cc7307)
Reference to code:
zephyr/subsys/bluetooth/host/hci_core.c
Line 333 in fe7c2ef
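The rates implied by the description above can be checked with a little arithmetic. This is a sketch under two assumptions: each 30-sample block is sent as exactly one notification, and each notification carries the 241-byte payload mentioned in the sniffer comment. The function names are hypothetical.

```c
#include <assert.h>

/* Blocks per second when one block is produced every `samples_per_block` samples. */
static double block_rate_hz(double sample_rate_hz, unsigned int samples_per_block)
{
	assert(samples_per_block > 0);
	return sample_rate_hz / (double)samples_per_block;
}

/* Payload throughput in bit/s when every block is sent as one notification. */
static double payload_bitrate(double rate_hz, unsigned int payload_bytes)
{
	return rate_hz * (double)payload_bytes * 8.0;
}
```

With 10 kSamples/s and 30 samples per block this gives about 333 blocks per second, and with 241 bytes per notification roughly 643 kbit/s of payload, which is in the same range as the "around 625 kbit/s" reported in the thread.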