Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subsys/mgmt/hawkbit: Unable to finish download if CPU blocking function (i.e. flash_img_buffered_write) is used #37324

Closed
ycsin opened this issue Jul 30, 2021 · 32 comments
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@ycsin
Copy link
Member

ycsin commented Jul 30, 2021

Update 2

The upgrade works consistently after I enabled SPI async API as that offload the writing of the firmware from the CPU to the DMA, freeing the CPU to handle the TCP packets.

Update 1

CPU blocking functions such as flash_img_buffered_write() or log immediate mode can cause this issue. Happens to the big_http_download sample as well.


Describe the bug
I'm currently trying the hawkbit sample for the OTA on my custom board. For some reason the download process is extremely long and the gap between incoming data is increasingly long. The first handful chunks of data is normal, after that the next chunk would arrive at ~30s later, then ~1 min later, then almost exactly about 2 mins later (like there's a pattern) and eventually it would fail after downloading about 20kB (it usually fails at exactly 20 kB) or after ~17 minutes.

After trying a few things, I found that the download is slow only if it tries to write the downloaded buffer into the flash here. If I comment that line out, the download is actually pretty fast and typically managed to download everything (251kB) under 40 seconds. If I replace that line with a k_msleep(60), it also downloads just as fast.

To determine the time flash_img_buffered_write() took to finish, I tried to calculate the elapsed time using k_uptime_get() before and after the function call, and the log system tells me that it took 6-41ms. And now I'm pretty much lost and not sure how to debug this and get it working.

From my current understanding, the hawkbit will do a http GET request to the download link using the http_client_req then the connection/socket should be opened and the image chunks will be pushed continuously from the server without requiring ack from the client (I could be wrong here). The hawkbit client will simply write the received data into the flash until the HTTP_DATA_FINAL is received. I don't know how a function that literally returned immediately can affect this download process.

The setup that I use:

  • STM32G0B1RE custom board
  • Quectel EC21 LTE modem (gsm_ppp driver)
  • image-0 partition is 416 kB in the internal flash
  • image-1 partition is 416 kB in the external spi-nor flash

To Reproduce
Steps to reproduce the behavior:

  1. Build and flash hawkbit sample

Expected behavior
I expect the download to finish under 40 seconds.

Impact
Unable to use hawkbit

Logs and console output
hawkbit_log.txt

Environment (please complete the following information):

  • OS: Windows 10
  • Toolchain: GNUARMEMB
  • v2.6.0-rc2-107-gebe282b02d9f
    • I applied this commit manually.

Extra context

This is the log file when I replace the flash_img_buffered_write() with k_msleep(60): hawkbit_test_log.txt

@ycsin ycsin added the bug The issue is a bug, or the PR is fixing a bug label Jul 30, 2021
@ycsin
Copy link
Member Author

ycsin commented Jul 30, 2021

@jukkar Any comments on the http_client part?
@Navin-Sankar @burumaj Any idea?
@nvlsianpu Any comments on the flash_img_buffered_write()?

@ycsin
Copy link
Member Author

ycsin commented Jul 30, 2021

There was a time where it downloaded 20+% of the image before a hard fault (current thread: sysworkq, not sure why tho), and doesn't reach that far again after a reset, quite random.

@ycsin ycsin changed the title subsys/mgmt/hawkbit: Unable to finish download subsys/mgmt/hawkbit: Unable to finish download if flash_img_buffered_write() is used Jul 31, 2021
@nashif nashif added the priority: low Low impact/importance bug label Aug 3, 2021
@ycsin
Copy link
Member Author

ycsin commented Aug 5, 2021

To verify the write speed of the spi-nor, I wrote a simple program to write a total of 251 kB of 512 bytes dummy string into the spi-nor flash's image-1. At the end of write it will read back from the flash and memcmp with the dummy string.

Here are the results:

Write that requires a page erase: 43ms
Write that doesn't require a page erase: 8ms

The write speed matches what I observe in the hawkbit sample. So now it is known that:

  • Hawkbit can download normally when it doesn't write the downloaded data into anywhere.
    • Manage to download 251 kb in 40 seconds, or about 7 kiloBytes per second
    • Doing k_msleep(60) (to simulate the write) between the receive will not effect the download, apart from taking longer than 40 seconds Not a good test since this isn't blocking
  • Hawkbit can download normally when it writes the downloaded data into the internal flash.
  • Hawkbit isn't able to download properly when it writes the downloaded data into external spi-nor flash.
    • The external spi-nor flash write speed looks fine, but for some reason there's a large gap between incoming data.
  • A test program that just writes dummy data into external spi-nor flash, similar to the hawkbit, works just fine. (Write total 251kiB of 512 bytes chunks)

How does writing data into external spi-nor flash affect the download?


Log:

[00:00:11.298,000] <inf> main: ==================================================
[00:00:11.298,000] <inf> main: Write test ended for AT25DF081A
[00:00:11.298,000] <inf> main: ==================================================
[00:00:11.298,000] <inf> main: Total elapsed time: 6 s (6291 ms)
[00:00:11.298,000] <inf> main: Write speed: 41 kb/s
[00:00:11.298,000] <inf> main: ==================================================
[00:00:11.298,000] <inf> main: Reading back to verify written data
[00:00:13.151,000] <inf> main: ==================================================
[00:00:13.151,000] <inf> main: Data matches
[00:00:13.151,000] <inf> main: ==================================================
[00:00:13.151,000] <inf> main: Total elapsed time: 1 s (1853 ms)
[00:00:13.151,000] <inf> main: Read speed: 251 kb/s
[00:00:13.151,000] <inf> main: ==================================================

Speed test program
main.c.txt

Full log
spi-nor_speed-test.txt

@ycsin
Copy link
Member Author

ycsin commented Aug 5, 2021

@pabigot , just wondering if you have any clue about this?

@ycsin
Copy link
Member Author

ycsin commented Aug 5, 2021

I increased the CONFIG_IMG_BLOCK_BUF_SIZE to 4096, which is the sector size of my external serial flash, then now it is able to download the whole firmware.


So now this becomes very weird.
Increasing the CONFIG_IMG_BLOCK_BUF_SIZE caused the flash_img_buffered_write() to reduce its frequency of writing the spi-nor, but each write now takes longer time (bigger size) and somehow this quicken the download.

@de-nordic
Copy link
Collaborator

@ycsin You are right there is a bug: the target image for upload is taken from the DTS, as the partition defined with "image-1" label, but the write size for the device is determined for the flash selected by chosen "zephyr,flash", which may be different device, as it probably happens here.
The BUILD_ASSERT, at the top of flash_img.c, fails to catch that the CONFIG_IMG_BLOCK_BUF_SIZE has not been set properly, because check is done against incorrect flash, as explained above.

@ycsin
Copy link
Member Author

ycsin commented Aug 5, 2021

@de-nordic thanks for the info, I will look more closely to it later! After this I encountered another bug where if I flash an image to the primary slot over the updated image, it will result in a hard fault inside the bootloader. I debugged and traced this just now, it seems to happen in the swap scratch function. So for some reason the bootloader is thinking that there's an upgrade and then hard fault. I've left my workplace, will update this comment later.

I think there's another bug: after it booted up the updated image, which is unconfirmed, if I reset the device, the bootloader will boot right into the primary instead of doing a swap. (Likely to be another issue)


EDIT:
You are right that the flash_img.c probably shouldn't just determine the write size with zephyr,flash, but the default 512 bytes is a valid write size for spi-nor, which has a write_block_size of 1.

The hard fault of the bootloader happens inside this while loop.

@ycsin
Copy link
Member Author

ycsin commented Aug 11, 2021

I increased the CONFIG_IMG_BLOCK_BUF_SIZE to 4096, which is the sector size of my external serial flash, then now it is able to download the whole firmware.

I encountered the issue again today even after this change and removing the flash-write part didn't help, I tried the big_http_download again and it is still able to download the firmware in full, but the speed dropped to 3kBps somehow.

Downloading the file using my laptop is still very fast, still not sure where the issue lies. I guess I might have to work around this using partial downloads.

@ycsin
Copy link
Member Author

ycsin commented Aug 11, 2021

@Navin-Sankar Any comments?

@d3zd3z
Copy link
Collaborator

d3zd3z commented Aug 12, 2021

That setting the download block size to match the page size eliminates the problem suggests to me that we are running into a situation where instead of just writing a page in pieces, we are erasing it and rewrite it for each of the smaller blocks.

Whether a page can be partially written does depend on the flash device in question. Most NOR flashes are able to write to parts of a page just without problem. Other devices do require the whole page to be written once between erases. If that is the case, I'm not sure there is a very good solution other than to buffer an entire page.

I didn't readily find the information about the STM part in question, but most of these allow a minimum write size of 8 or 16 bytes, so it should be possible to write to different 16-byte parts of a page without needing to erase the page.

@ycsin
Copy link
Member Author

ycsin commented Aug 13, 2021

@d3zd3z after a few days of testing, I'm guessing that the cause of this issue might be networking/socket related, instead of the flash_img_buffered_write function itself. The spi-nor flash that I'm using should be able to write parts of a page without problem, the STM flash minimum write size should be 8 bytes IIRC.

I do encounter a bootloader-related issue as mentioned in this comment, I think I should create a separate issue for it, after testing for a few more times to confirm the pattern.

@ycsin
Copy link
Member Author

ycsin commented Aug 17, 2021

After further testing I found that the download is slow when something is blocking during the reception of the data, such as flash_img_buffered_write, or if the log is set to immediate mode.

Increasing CONFIG_IMG_BLOCK_BUF_SIZE causes the write to happen less frequently and therefore improves the download speed, but this seems to be a workaround instead of a fix.

Will doing the flash_img_buffered_write in workq fixes the issue?

@d3zd3z
Copy link
Collaborator

d3zd3z commented Aug 20, 2021

After further testing I found that the download is slow when something is blocking during the reception of the data, such as flash_img_buffered_write, or if the log is set to immediate mode.

This makes me think of something. I haven't studied the STM driver all that much, but I know that some flash drivers make the flash device itself inaccessible during the flash operation (the flash goes away while being programmed). They generally run code out of RAM to do this, and run with interrupts masked. If well written (and the hardware supports it), they can interrupt the flash operation, make the code reappear, and handle the interrupt. Again I don't know the behavior here.

But, is it possible to figure out if the flash_img_buffered_write() might be blocking for a while and causing packets to be dropped? Dropped packets would slow down the transfer significantly, waiting for retransmits from the server.

@ycsin
Copy link
Member Author

ycsin commented Aug 26, 2021

This makes me think of something. I haven't studied the STM driver all that much, but I know that some flash drivers make the flash device itself inaccessible during the flash operation (the flash goes away while being programmed). They generally run code out of RAM to do this, and run with interrupts masked. If well written (and the hardware supports it), they can interrupt the flash operation, make the code reappear, and handle the interrupt. Again I don't know the behavior here.

Not sure about STM32 flash either, but I'm using an external spi-nor flash here

But, is it possible to figure out if the flash_img_buffered_write() might be blocking for a while and causing packets to be dropped? Dropped packets would slow down the transfer significantly, waiting for retransmits from the server.

I used wireshark to sniff the transfer on my laptop, seems like the client does not need to acknowledge every packet for the host to continue its transfer, the acknowledgement only happens every 3-4 packets. I used a uart dongle to read the TX of the modem and there's nothing much going there, since it seems like acknowledge isn't required for the transfer to continue, I'd expect it to continue output data but it isn't. But I'm not sure about the networking part. Maybe @mniestroj @rlubos @jukkar can shed some light here?

@ycsin ycsin changed the title subsys/mgmt/hawkbit: Unable to finish download if flash_img_buffered_write() is used subsys/mgmt/hawkbit: Unable to finish download if flash_img_buffered_write() (CPU blocking function) is used Aug 26, 2021
@rlubos
Copy link
Contributor

rlubos commented Sep 13, 2021

@ycsin From the networking point of view, it could be the case, that due to increased CPU consumption the application thread is not able to consume the data fast enough, so we fill up all of the RX buffers. If the low-level network driver does not specify a timeout for the allocation, or the timeout is too small, it'd drop the packet, which would result in retransmission at TCP level and decreased performance.

We had a similar issue with one of the Ethernet drivers not long ago (see #36891 (comment)) - the application did not catch up with the incoming traffic, which resulted in a packet drop. Adding a timeout for the net_pkt_rx_alloc_with_buffer() function in the Ethernet driver solved the problem.

Now, I'm not very familiar with the PPP implementation, but there seems to be a similar case with the ppp driver. Probalby worth checking if it's related?

@ycsin
Copy link
Member Author

ycsin commented Sep 13, 2021

@ycsin From the networking point of view, it could be the case, that due to increased CPU consumption the application thread is not able to consume the data fast enough, so we fill up all of the RX buffers. If the low-level network driver does not specify a timeout for the allocation, or the timeout is too small, it'd drop the packet, which would result in retransmission at TCP level and decreased performance.

We had a similar issue with one of the Ethernet drivers not long ago (see #36891 (comment)) - the application did not catch up with the incoming traffic, which resulted in a packet drop. Adding a timeout for the net_pkt_rx_alloc_with_buffer() function in the Ethernet driver solved the problem.

Now, I'm not very familiar with the PPP implementation, but there seems to be a similar case with the ppp driver. Probalby worth checking if it's related?

Thanks heaps for the pointer. I'm not familiar with PPP nor networking, but this does sound like something I'm facing, I will try to test that in the ppp driver.

@ycsin
Copy link
Member Author

ycsin commented Sep 14, 2021

I think there are a few ways that can solve this issue, from the networking perspective:

  1. As mentioned by @rlubos, increase the delay in ppp driver so that it waits longer for allocation.

and from the application perspective, in this case the hawkbit/stream_flash:

  1. Format the slot-1 before starting to download so that the application will only have to write and do not have to erase, which reduce the CPU time spent in the application.
  2. Increase the CONFIG_IMG_BLOCK_BUF_SIZE so that the application writes the flash less often, which also reduce the CPU time spent in the application, at the cost of RAM.

I will try the PPP fix as soon as I can and do a PR if it does improve/solve the issue. For the hawkbit part, I guess I will make a PR to erase the slot-1 prior to download at some point.

But all these seems to be some workarounds, I think the real issue is that both the PPP and the Hawkbit/application are using too much CPU time for the entire operation. I wonder if the data transmission from modem to memory in PPP and the data transmission from memory to SPI flash can be offloaded to DMA?

@jukkar @mniestroj any idea for the PPP part?
@nvlsianpu any idea for the flash/stream_flash part?

@nvlsianpu
Copy link
Collaborator

Maybe worth to consider moving calls to flash_img_buffer_write to other thread with lower priority than image collection thread, or to the worqueue. (this maight be done in the flash steram module as well).As image you are collecting is in external flash this should does the job.

@ycsin
Copy link
Member Author

ycsin commented Sep 14, 2021

(this maight be done in the flash steram module as well)

@nvlsianpu is it better to offload this in hawkbit or in stream_flash?

@ycsin ycsin changed the title subsys/mgmt/hawkbit: Unable to finish download if flash_img_buffered_write() (CPU blocking function) is used subsys/mgmt/hawkbit: Unable to finish download if CPU blocking function (i.e. flash_img_buffered_write) is used Oct 4, 2021
@ycsin
Copy link
Member Author

ycsin commented Oct 4, 2021

@rlubos some update on this, I've moved to a new version of my custom board and that uses AT45 flash with a page of 512B, the previous one (AT25) was 4KiB, this makes the format to happen quite frequently and caused the upgrade to fail, even with the ppp delay at 200 msec. What is the trade off if I continue to increase this delay?

I tried to enable SPI DMA and sometimes I get this error.

@ycsin
Copy link
Member Author

ycsin commented Oct 5, 2021

Some more updates:
@nvlsianpu I tried to offload flash_img_buffer_write part (L850 - L865) to a work, but it doesn't seem to help.

I think the CPU is simply too occupied by the networking stuff and UART interrupts due to firmware transfer: at least 7kiB per second, up to ~15 KiB per second from my observation. This is when I comment out the flash_img_buffer_write part. And it caused the log subsys to lag when the modem is transferring data.

@rlubos
Copy link
Contributor

rlubos commented Oct 5, 2021

What is the trade off if I continue to increase this delay?

I think it depends on the driver/modem really, if packets keep coming and they are not consumed due to lack of RX buffer, the modem will eventually drop them. It could be the case that the respective driver/modem has buffering capabilities on it's own, but they're not unlimited.

So if we're not talking about sporadic CPU business causing delay, but a general CPU overload we should rather think of limiting the data flow at TCP level. Typically, you can limit the amount of data sent by the server by decreasing the Recieve Window size sent in the ACK messge. Unfortunately, from what I've seen, TCP2 implementation does not implement the feature, and sends a fixed value of IPv6 MTU all the time, effectively giving no limits to the server. I suggest to open an enhancement issue for the feature, as IMO it's quite valuable for constrained devices, and it seems to be a regressinon compared to the previous TCP implementation (where Recieve window handling was implemented).

@ycsin
Copy link
Member Author

ycsin commented Oct 5, 2021

IPv6 MTU all the time

Is this feature limited to IPv6 only? or including IPv4?

@rlubos
Copy link
Contributor

rlubos commented Oct 5, 2021

IPv6 MTU all the time

Is this feature limited to IPv6 only? or including IPv4?

Well, I don't want to say for the author, but it seems that IPv6 MTU was just an arbitrary value chosen for the Window Size. It should not matter whether IPv4 or IPv6 is used at the IP layer.

@ycsin
Copy link
Member Author

ycsin commented Oct 13, 2021

I workaround this by enabling SPI async, I wonder if #39275 would be able to fix this for SOCs that doesnt support SPI async by limiting the download speed to what the CPU can gracefully process?

@rlubos
Copy link
Contributor

rlubos commented Oct 14, 2021

I workaround this by enabling SPI async, I wonder if #39275 would be able to fix this for SOCs that doesnt support SPI async by limiting the download speed to what the CPU can gracefully process?

After testing that PR a bit it turned out that recv window handling will require a bit more effort to implement than proposed in the PR. So this is still an open topic.

@ycsin
Copy link
Member Author

ycsin commented Nov 18, 2021

I have tried to use hawkBit on a new board that uses the esp_at driver. And it seems to suffer the problem even though I already enabled SPI_ASYNC.

It stucked at 66% by default, if I increase the CONFIG_WIFI_ESP_AT_MDM_RING_BUF_SIZE to 2048* then it stucks at 90%+, so I guess it's the data overloading the CPU again.

I think the esp_at driver uses some form of offloaded networking instead of the native TCP2?

@rlubos @mniestroj any idea?

EDIT:

  • I think I increased it to 4096 instead of 2048

@mniestroj
Copy link
Member

I think the esp_at driver uses some form of offloaded networking instead of the native TCP2?

Correct. TCP stack is implemented on ESP chip.

@ycsin Do you use UART hardware flow control?

@ycsin
Copy link
Member Author

ycsin commented Nov 22, 2021

@ycsin Do you use UART hardware flow control?

@mniestroj thanks for the reply.
No, I do not use hw flow control:

/* WiFi */
&usart1 {
	pinctrl-0 = <&usart1_tx_pc4 &usart1_rx_pc5>;
	current-speed = <115200>;
	status = "okay";
	esp8266 {
		compatible = "espressif,esp-at";
		label = "esp8266";
		power-gpios = <&gpioc 6 GPIO_ACTIVE_HIGH>;
		reset-gpios = <&gpioc 10 GPIO_ACTIVE_HIGH>;
		status = "okay";
	};
};

Are you aware of anything that can be configured which might be able to workaround this issue?
Is hardware flow control able to solve this issue, if so I might be able to raise this up for my next board's revision

EDIT:
Seems like hw flow ctrl is able to stop the incoming bytes (in modem_iface_uart_isr) at hw layer, would be nice to have confirmation from experts.

@mniestroj
Copy link
Member

@ycsin Having HW flow control is quite critical, as it will suspend sending/receiving data stream when the other end is busy (e.g. processing previous packet). Without that, part of the frame can be easily dropped and there is no easy way to recover from that.

@github-actions
Copy link

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@ycsin
Copy link
Member Author

ycsin commented Mar 1, 2022

I believe this issue is now solved by #43018 and/or #39275. I'm able to download a 300+kB firmware and write that into flash using SPI interrupt API. Previously this is only possible using SPI DMA.

@ycsin ycsin closed this as completed Mar 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
Development

No branches or pull requests

7 participants