Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NACKs observed while communicating with fan controllers #383

Open
ryan-summers opened this issue Apr 9, 2021 · 37 comments
Open

NACKs observed while communicating with fan controllers #383

ryan-summers opened this issue Apr 9, 2021 · 37 comments

Comments

@ryan-summers
Copy link

Booster has been observed to sporadically indicate NACKs from the fan controllers. Reference quartiq/booster#140 and #379

@hartytp
Copy link
Collaborator

hartytp commented Jun 14, 2021

@gkasprow this seems to be the last reliability issue with Booster.

The issue seems to be (using the current quartiq firmware as a reference implementation) the fan controllers give infrequent (typically only once per ~days) NACK errors when operating at 100kHz bus frequency.

I don't see anything obviously wrong in the design, but could you quickly stick a scope on them and verify the timing/voltage characteristics please? It might be that something is a little marginal...

image

@gkasprow
Copy link
Member

I observe similar issues in another project
I have two Texas chips on the same bus
One is a battery charger, other is BMS
I decided to separate the buses
because observed errors on the bus from time to time
Could be an issue here as well. It seems that SMBus is not fully compatible with I2C.

@gkasprow
Copy link
Member

I looked at the signals with the scope and see nothing suspicious.
What if we move the fan control SMBus to a dedicated I2C controller?
We have a free I2C controller available on pins PA8 and PC9.
PC9 is used by onboard LED but we don't have to use it.

@gkasprow
Copy link
Member

@jordens how do you use the I2C peripherals in STM32?
Are you toggling the I2C controller mode between I2C and SMBus before talking to the ICs?

@hartytp
Copy link
Collaborator

hartytp commented Jun 24, 2021

@gkasprow are you happy that none of the timings / voltage levels look marginal?

@gkasprow
Copy link
Member

I tried with open-source firmware. I tried to convert the bin file from Quartiq's release to dfu but for some reason, it doesn't work. Do you have a working dfu file?

@jordens
Copy link
Member

jordens commented Jun 25, 2021

We don't switch to SMBus since we don't use or need any of the SMBus features in host mode that would be available at the peripheral level. And the rest is a subset of i2c. That's the same as for the old firmware.
The command to do a dfu upload is dfu-util -a 0 -s 0x08000000:leave --download booster.bin. I don't know how to generate other dfu file formats. The only information required other than the bin is where to write it: 0x08000000.

@gkasprow
Copy link
Member

The only information required other than the bin is where to write it: 0x08000000.
thanks, that was what I did wrong :) Now it works.

@gkasprow
Copy link
Member

the signals look OK. The edges are fast, setup-hold times are met.
tek00021

@ryan-summers
Copy link
Author

It looks like there are some spurious pulses present, although they appear right at the falling edge, so I suspect this is just transition delay. However, I'm still somewhat surprised to see them.

@jordens
Copy link
Member

jordens commented Jun 25, 2021

Those are ACKs

@gkasprow
Copy link
Member

gkasprow commented Feb 7, 2022

Michal spent a lot of time with I2C debugger and scope trying to catch these NACKs. It looks it appears only in the STM32 core but not on the I2C bus. Maybe we should consider replacing fan controllers with Microchip ones?

@gkasprow
Copy link
Member

gkasprow commented Feb 7, 2022

What we can do is to move the fan control to a dedicated I2C bus used only by the EEPROM. What do you think @jordens
We can change HWREV so the CPU knows about it.

@gkasprow
Copy link
Member

gkasprow commented Feb 7, 2022

Since neither I nor Michal managed to catch any of these NACKs using regular firmware, I'd need a small modification to the firmware that would trigger the scope once it detects NACK. I have a scope with a very long memory and will be able to decode all transfers that appeared before. I'd like to look at it before we release the new HW.

@jordens
Copy link
Member

jordens commented Feb 8, 2022

@ryan-summers Maybe you can help and spin that special firmware.

@jordens
Copy link
Member

jordens commented Feb 8, 2022

The NACks are rare, It's possible that you don't see them for many days. it's best to provoke the explicitly by continuously running transfers on that chip like quartiq/booster#140 (comment) etc. And I don't think the old firmware does a lot of traffic.
Changing the hardware should be done once we know reasonably well what the problem is. But yes, it it's I2C SI then reorganize the buses.

@gkasprow
Copy link
Member

gkasprow commented Feb 8, 2022

So can you prepare such firmware that generates a lot of traffic on these chips and triggers scope when NACK?

@ryan-summers
Copy link
Author

ryan-summers commented Feb 8, 2022

@gkasprow I've developed a version of the 0.3 release candidate that will toggle the mainboard LEDs (LD9-LD11) on for a very brief moment after the NACK is observed. The LEDs will be de-asserted as soon as the transaction succeeds on the re-attempt.

Let me know directly on that PR if you need different behavior for your capture and I'll get it set up.

Edit: The PR is quartiq/booster#202 and the branch is rs/led-toggle-on-bus-nack

@gkasprow
Copy link
Member

gkasprow commented Feb 8, 2022

Thanks,
I'll look at that once I get my Booster back.
How to compile that release? Or, can you generate the dfu file for me?

@jordens
Copy link
Member

jordens commented Feb 8, 2022

The actions run should have binaries and there are instructions on how to use them.

@jordens
Copy link
Member

jordens commented Feb 8, 2022

Ah. It doesn't. But see that PR for them.

@ryan-summers
Copy link
Author

I've just updated that branch to upload the DFU file. Should be available at https://github.com/quartiq/booster/actions/runs/1812387122 now.

@gkasprow
Copy link
Member

gkasprow commented Feb 8, 2022

@jordens I'm preparing a new HW revision. What about adding the option of routing a dedicated I2C bus to the fan controller? I will add 0R resistors so one can always revert to the original solution.
I can also add Microchip fan controllers as the assembly option to play with them and possibly mitigate chip shortages.

@jordens
Copy link
Member

jordens commented Feb 8, 2022

Ok. But populate the old i2c bus connectivity by default, not that new one. then modify a device and test modified firmware.

@gkasprow
Copy link
Member

gkasprow commented Feb 9, 2022

ACK.
@jordens Since we are using only 5 fans and external temperature sensor, what about adding EMC2305 or MAX31785. I'd add it as an assembly option to test.

@jordens
Copy link
Member

jordens commented Feb 9, 2022

Both sound OK to me on a quick glance.

@gkasprow
Copy link
Member

gkasprow commented Feb 9, 2022

Let's go for EMC2305. It's simpler, cheaper and from different vendor so has different I2C IP-core in it :)

@gkasprow
Copy link
Member

gkasprow commented Feb 9, 2022

We could theoretically solder both fan controllers and enable one of them in software :)

@jordens
Copy link
Member

jordens commented Mar 9, 2022

@gkasprow any results from testing that firmware and debugging the NAKs?

@gkasprow
Copy link
Member

gkasprow commented Mar 9, 2022

@michalgaska was testing it with two channels installed. Any conclusion?

@gkasprow
Copy link
Member

Michal says he is able to catch the issues. He will post diagrams.

@michalgaska
Copy link

I2C before and after NAK:
IMG20220314112442

@gkasprow
Copy link
Member

please dump here raw data (10M points). The image is not saying anything.

@michalgaska
Copy link

Here is the link to the raw data in .CSV:
https://drive.google.com/file/d/1cZY_hTal67UYRY69qNyjCySIduJB1WnA/view?usp=sharing

@ryan-summers
Copy link
Author

ryan-summers commented Mar 16, 2022

I imported the data into Sigrok's PulseView for analysis:

sigrok-cli -I csv:header=yes:start_line=15:column_formats=t,a,a,a,*- -i .\tek0002ALL.csv -o all.sr
// Then, open all.sr in PulseView

It looks like there's a ton of noise on the I2C clock line in the raw CSV data:
image

Closer view on the SCL line without overlays:
image

Do these artifacts also appear on the actual oscilloscope display? If so, that definitely is a cause for concern.

@ryan-summers
Copy link
Author

ryan-summers commented Mar 16, 2022

image
image

An additional comment here is that the MAX6639 appears to require a minimum 4.7 uS clock high period, but the current capture appears to indicate that we may be violating that with a high period of only 4.5 uS.

We may benefit by slightly decreasing the I2C bus speed in firmware as a result.

@gkasprow
Copy link
Member

it must be some artifact. The signals look clean on the scope/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants