BusManager - Safety board communication and rescheduling
@bz reported below:
I took a look at the new PC/104 over in Ruby's area. I was able to reproduce the bus manager timeout several times (see attached CAN log).
As you can see, in each failure the WAM control loop waited one full second between control periods instead of waiting 2 milliseconds.
This failure was always immediately preceded by a request for the safety board's status. That request looks like a necessary condition but not a sufficient one, because the control loop often ran successfully after asking the safety board for its status.
Sometimes it took only a few seconds of running ex04 to observe the failure. Sometimes it took several minutes. This might explain why it seemed to work last week, but not this week. Maybe you were just lucky last week!
I'd bet the bug has something to do with missing the opportunity for a "2 ms control window rescheduling point" due to the extra safety board (and/or FT) communication, and instead rescheduling for a full second later.
When it does this, the safety board (correctly) forces an E-Stop condition after 25 ms due to the lack of communication from the WAM PC. And this E-Stop causes the pucks to go offline, so when the WAM PC does finally send out a request for motor positions, nobody responds, and the software crashes.
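To put the numbers above side by side, here is a rough arithmetic sketch. The three timing values come from this report; the constant and function names are made up for illustration and are not part of the WAM software.

```python
# Timing figures quoted in this report, in integer microseconds.
CONTROL_PERIOD_US = 2_000     # nominal control period (2 ms)
ESTOP_TIMEOUT_US = 25_000     # safety board forces E-Stop after 25 ms of silence
OBSERVED_GAP_US = 1_000_000   # gap between control periods in each failure (1 s)


def periods_skipped(gap_us, period_us=CONTROL_PERIOD_US):
    """Whole control periods that should have run inside a gap but did not."""
    return max(0, gap_us // period_us - 1)


def trips_estop(gap_us, timeout_us=ESTOP_TIMEOUT_US):
    """True if a gap between messages is longer than the E-Stop timeout."""
    return gap_us > timeout_us


# Whole control periods that fit inside the E-Stop timeout.
periods_within_timeout = ESTOP_TIMEOUT_US // CONTROL_PERIOD_US
```

With these figures, a one-second gap skips 499 control periods and is 40 times longer than the E-Stop timeout, while a cycle that runs a fraction of a millisecond late stays well inside it.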
It is okay to miss the 2 ms window by a little bit, but the control thread MUST be rescheduled ASAP, not 1 full second later! This is the bug.

Attachment: Comm_Fault.txt
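The rescheduling behavior asked for here could look roughly like the sketch below. It is only a model of the policy, not the actual BusManager or control-thread code: `next_wakeup_us` and `run_times_us` are hypothetical names, and the per-cycle costs in the usage note are invented numbers standing in for the extra safety board (and/or FT) traffic.

```python
PERIOD_US = 2_000  # nominal control period (2 ms), from the report


def next_wakeup_us(prev_start_us, now_us, period_us=PERIOD_US):
    """Absolute time at which the control thread should run next.

    prev_start_us: when the cycle that just finished started.
    now_us: the current time, after that cycle's work is done.
    """
    nominal = prev_start_us + period_us
    if now_us < nominal:
        return nominal  # on schedule: sleep until the normal deadline
    return now_us       # window missed: run right away, never defer further


def run_times_us(cycle_costs_us, period_us=PERIOD_US):
    """Start time of each cycle, given how long each cycle's work took."""
    starts = []
    start = 0
    for cost in cycle_costs_us:
        starts.append(start)
        start = next_wakeup_us(start, start + cost, period_us)
    return starts
```

For example, `run_times_us([500, 2600, 500, 500])` models one cycle that overruns its window by 0.6 ms. The following cycle starts immediately at 4600 us instead of waiting, so the largest gap between cycles is 2.6 ms, far below the 25 ms E-Stop timeout. This sketch re-anchors the schedule to the late start; catching up to the original 2 ms grid would be another reasonable choice.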