Message sending error. Interrupt or txbuffer error? #78

megax · 2024-07-09T10:27:30Z

I would like to report a suspected bug. I am seeing data corruption in transmitted data: messages do not disappear, they arrive damaged. I can't share equivalent example code, because my program is complex and I discovered the problem by accident. What I can share is what I observed, my begin/loop setup, and the change I made that improved things. The FIFO interrupt is disabled on purpose, because QNEthernet causes an error when it is active. Platform: Teensy 4.1

FlexCAN_T4<CAN1, RX_SIZE_512, TX_SIZE_256> _can;

begin:
    _can.begin();
    _can.setClock(CLK_60MHz);
    _can.setBaudRate(250000);
    _can.setMaxMB(16); // note: FIFO seems to use 8 mailboxes, so this needs to be at least 8 (more for TX)
    _can.enableFIFO();
    _can.disableFIFOInterrupt();

loop:
    _can.events();

    ...

    CAN_message_t msg;

    if(_can.readFIFO(msg)) {
        // read code
    }

write:
    CAN_message_t msg;
    msg.id = message.id;
    msg.flags.extended = message.extended;
    msg.flags.remote = message.rtr;
    msg.len = message.len;
    //msg.seq = 1; // No MailBox
    memmove(&msg.buf, message.data, message.len * sizeof(uint8_t));

    int error = _can.write(MB14, msg);

I use a GPS RTK based system in which RTCM messages (https://www.tersus-gnss.com/tech_blog/new-additions-in-rtcm3-and-What-is-msm) are relayed to several devices. As the data arrives, I forward it in 8-byte blocks. About 800 bytes arrive per second, so I send roughly 15-25 CAN messages at a time. (If I add a millis-based delay of 1-5 ms between messages, the error still occurs; slowing down only postpones it.)

I can see the error both on the Teensy at the other end and on an external CAN bus adapter; I checked with two different adapters. The data is corrupted: messages get mixed together, or bytes are swapped within the data. I am attaching information about this. I also checked that everything is fine as long as TX does not start buffering. (It gets worse if I don't write to MB14 explicitly and only set seq, which is why the write is fixed to one mailbox. So I use only one mailbox; when seq is used with several mailboxes, errors are common even with my modification.)

So when the frame is actually sent in writeTxMailbox, I see something different from what I passed to write(). I printed the data to Serial there, and I see exactly the same corruption as when reading from the network.
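For reference, forwarding a byte stream in 8-byte blocks can be sketched as below. This is only an illustration that runs on a PC: `Frame` and `splitIntoFrames` are hypothetical names, and `Frame` is a simplified stand-in, not the library's `CAN_message_t`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified CAN frame (not the FlexCAN_T4 CAN_message_t).
struct Frame {
    uint32_t id = 0;
    uint8_t len = 0;
    uint8_t buf[8] = {0};
};

// Split `size` bytes into frames carrying at most 8 data bytes each.
// The last frame is shorter when `size` is not a multiple of 8.
std::vector<Frame> splitIntoFrames(uint32_t id, const uint8_t *data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::vector<Frame> frames;
    for (std::size_t offset = 0; offset < size; offset += 8) {
        Frame f;
        f.id = id;
        f.len = static_cast<uint8_t>(std::min<std::size_t>(8, size - offset));
        std::memcpy(f.buf, data + offset, f.len);
        frames.push_back(f);
    }
    return frames;
}
```

With this split, any frame that is altered, reordered, or merged with a neighbour on the way out damages the reassembled RTCM message, which is why corruption in the TX path shows up as invalid RTCM on the receiving side.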

I assume the ISR (flexcan_isr_can) is causing the error, or rather that the main code and the interrupt both access the circular buffer at the same time. I made a modification that clearly reduces the number of errors, and calling _can.events() reduces them further. The errors are almost negligible at the start of a run; as time passes they become more and more frequent, and eventually they are permanent. Even if I switch the affected part off and on again (I can do this with a command), the errors continue. The TX circular buffer never fills up; I checked this as well, which is why its size is 256. The load on the CAN bus also has an effect, although after the modification this no longer matters (the error frequency does not change in proportion to it). I am not including data on that now, because the error exists regardless.
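The suspected mechanism can be modeled on a PC with the deterministic sketch below. It is not the FlexCAN_T4 code and does not prove where the library's problem is; `ModelTxQueue`, `isr`, and `maskInterrupts` are hypothetical names. It only shows how a queue slot that becomes visible before its copy is finished yields a torn frame when an interrupt handler dequeues it mid-copy, and how masking the interrupt during the write avoids that.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

struct Frame { uint8_t buf[8]; };

// Hypothetical model of a TX queue shared by main code and an interrupt handler.
class ModelTxQueue {
public:
    std::function<void()> isr;   // models the TX interrupt handler

    bool push(const Frame &f, bool maskInterrupts) {
        if (count_ == kSlots) return false;
        masked_ = maskInterrupts;
        Frame &slot = slots_[head_];
        head_ = (head_ + 1) % kSlots;
        ++count_;                                  // slot is visible before it is complete
        assert(count_ <= kSlots);
        std::memcpy(slot.buf, f.buf, 4);           // first half of the payload
        raiseInterrupt();                          // an interrupt arrives mid-copy
        std::memcpy(slot.buf + 4, f.buf + 4, 4);   // second half of the payload
        masked_ = false;
        if (pending_) {                            // a deferred interrupt runs after unmasking
            pending_ = false;
            if (isr) isr();
        }
        return true;
    }

    bool pop(Frame &out) {
        if (count_ == 0) return false;
        out = slots_[tail_];
        tail_ = (tail_ + 1) % kSlots;
        --count_;
        return true;
    }

private:
    void raiseInterrupt() {
        if (masked_) { pending_ = true; return; }  // held back until unmasked
        if (isr) isr();
    }

    static constexpr std::size_t kSlots = 4;
    Frame slots_[kSlots] = {};
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
    bool masked_ = false, pending_ = false;
};
```

If `isr` pops from the queue, a call to `push(frame, false)` hands the handler a frame whose last four bytes are still the old slot contents, while `push(frame, true)` delivers the complete frame once the write has finished.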

Numerical data (I can only count valid RTCM messages, not the individual data errors, which I would have to analyze message by message, but this is still a good metric):
Without modification:
For 10,000 good RTCM messages, 60,000 are bad. ~4-5 hours of use

Modification (without events()):
For 58,000 good RTCM messages, 120 are bad. ~4-5 hours of use

Modification (using events()):
For 58,000 good RTCM messages, 77 are bad. ~4-5 hours of use
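The good/bad counts rely on being able to tell a valid RTCM message from a damaged one. The thread does not say how this is checked; one common way, sketched here as an assumption, is to verify the CRC-24Q that every RTCM 3 frame carries after its payload. The function names `crc24q` and `isValidRtcm3Frame` are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// CRC-24Q as used by RTCM 3: polynomial 0x1864CFB, initial value 0, no bit reflection.
uint32_t crc24q(const uint8_t *data, std::size_t len) {
    assert(data != nullptr || len == 0);
    uint32_t crc = 0;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<uint32_t>(data[i]) << 16;
        for (int bit = 0; bit < 8; ++bit) {
            crc <<= 1;
            if (crc & 0x1000000u) crc ^= 0x1864CFBu;
        }
    }
    return crc & 0xFFFFFFu;
}

// Returns true if `frame` holds exactly one complete RTCM 3 frame:
// 0xD3 preamble, 6 reserved bits + 10-bit payload length, payload, 3-byte CRC.
bool isValidRtcm3Frame(const uint8_t *frame, std::size_t size) {
    if (frame == nullptr || size < 6 || frame[0] != 0xD3) return false;
    std::size_t payloadLen = (static_cast<std::size_t>(frame[1] & 0x03) << 8) | frame[2];
    if (size != payloadLen + 6) return false;
    uint32_t expected = (static_cast<uint32_t>(frame[size - 3]) << 16) |
                        (static_cast<uint32_t>(frame[size - 2]) << 8) |
                        static_cast<uint32_t>(frame[size - 1]);
    // The CRC covers the 3 header bytes and the payload.
    return crc24q(frame, size - 3) == expected;
}
```

A counter of frames that pass or fail this check gives the kind of good/bad metric listed above without inspecting individual CAN frames.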

Here is what I changed. I disable and re-enable nvicIrq inside write(), similar to what events() does. (Without this modification, using events() increases the number of errors. In that case events() is harmful, but this is how I discovered that something like this could be the problem.) I suspect that, due to the complexity of my program, it may run in the background for some time. I don't use any threading. The only interrupts are the CAN bus interrupt and the QNEthernet interrupt.

Here is the modification:

FCTP_FUNC int FCTP_OPT::write(FLEXCAN_MAILBOX mb_num, const CAN_message_t &msg) {
  if ( mb_num < mailboxOffset() ) return 0; /* FIFO doesn't transmit */
  volatile uint32_t *mbxAddr = &(*(volatile uint32_t*)(_bus + 0x80 + (mb_num * 0x10)));
  if ( !((FLEXCAN_get_code(mbxAddr[0])) >> 3) ) return 0; /* not a transmit mailbox */

  NVIC_DISABLE_IRQ(nvicIrq);

  if ( msg.seq ) {
    int first_tx_mb = getFirstTxBox();
    if ( FLEXCAN_get_code(FLEXCANb_MBn_CS(_bus, first_tx_mb)) == FLEXCAN_MB_CODE_TX_INACTIVE ) {
      writeTxMailbox(first_tx_mb, msg);
      NVIC_ENABLE_IRQ(nvicIrq);
      return 1; /* transmit entry accepted */
    }
    else {
      CAN_message_t msg_copy = msg;
      msg_copy.mb = first_tx_mb;
      int b = struct2queueTx(msg_copy); /* queue if no mailboxes found */
      NVIC_ENABLE_IRQ(nvicIrq);
      return b; /* queue if no mailboxes found */
    }
  }
  if ( FLEXCAN_get_code(mbxAddr[0]) == FLEXCAN_MB_CODE_TX_INACTIVE ) {
    writeTxMailbox(mb_num, msg);
    NVIC_ENABLE_IRQ(nvicIrq);
    return 1;
  }

  CAN_message_t msg_copy = msg;
  msg_copy.mb = mb_num;
  int b = struct2queueTx(msg_copy); /* queue if no mailboxes found */
  NVIC_ENABLE_IRQ(nvicIrq);
  return b; /* queue if no mailboxes found */
}
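The modification above re-enables the interrupt by hand on every return path. A scoped guard is one way to make that harder to get wrong, sketched below as an assumption rather than as the library's approach. The `irq*` functions are hypothetical stand-ins so the example runs on a PC; on a Teensy they would wrap the core's NVIC macros for `nvicIrq`. The guard also restores the previous state instead of enabling unconditionally.

```cpp
#include <cassert>

// Hypothetical stand-ins for the hardware interrupt mask (PC-only model).
static bool g_irqEnabled = true;
inline bool irqIsEnabled() { return g_irqEnabled; }
inline void irqDisable()   { g_irqEnabled = false; }
inline void irqEnable()    { g_irqEnabled = true; }

// Disables the interrupt on construction and restores the previous state on
// destruction, so every return path of the enclosing function is covered.
class ScopedIrqMask {
public:
    ScopedIrqMask() : wasEnabled_(irqIsEnabled()) { irqDisable(); }
    ~ScopedIrqMask() { if (wasEnabled_) irqEnable(); }
    ScopedIrqMask(const ScopedIrqMask &) = delete;
    ScopedIrqMask &operator=(const ScopedIrqMask &) = delete;

private:
    bool wasEnabled_;
};

// Model of a write() with two exits; both run with the interrupt masked.
int guardedWrite(bool mailboxFree) {
    ScopedIrqMask mask;
    assert(!irqIsEnabled());
    if (mailboxFree) return 1;   // direct-write path
    return 2;                    // queued path
}
```

Whether masking the interrupt for the whole of write() is acceptable depends on how long the masked section runs; the sketch only addresses the bookkeeping, not the timing.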

I'm not sure my fix is the right one, because it takes a long time to find out whether a modification has any effect. I have been hunting this error since Thursday, and this modification has been running since Friday evening. In the longest run there were 440 errors out of 180,000 messages.

Finally, I am attaching pictures of the errors. If the description is a bit confusing: I asked someone to help me translate my thoughts into English. This is a company project, and in 1 year of use I had never used this part on the Teensy (there is also a Particle port; we switched from that to the Teensy, and there was never such a problem there). The error only appeared this week, because I wanted to move things to RAM2 and test that, and this is one of my subsystems that uses the circular buffer. I have a modification for that as well, but I tested everything thoroughly with your unmodified code, and then also with my modification; the RAM modification has no effect. I only mention it to make clear that changes in my fork are not what causes the errors. I ruled that out as well. The RAM2 change matters to me in the long run for sparing RAM1, because the project is so large that I don't have enough RAM1.
I also found a similar error at a baud rate of 500,000, but there I couldn't pinpoint exactly what was wrong. (That was a tractor's internal CAN bus network, loaded over 50%, and I only forwarded a few messages (12) to another network, but after 3-4 hours random errors occurred there too. I think the two cases are related; that one is just harder for me to analyze than this one, because of the amount of data, because there is only one Teensy in it, and because I don't have anything like RTCM's error checking there.)

I will keep using it with this modification for now. If I can, I will put together a smaller sample program later (I can't promise, because I don't have much time, but I'll try).

Please look at my reasoning if you can, and contact me if you have any ideas. My knowledge of FlexCAN is limited; my idea is based on the main code being interrupted and the data ending up incorrect. It is possible that this covers only part of the picture, that I noticed just one symptom, and that there is still an error somewhere else. (In fact that is certain, because the number of errors is not 0 even though the circular buffer never fills up.)

Regards,
Csaba

tonton81 · 2024-07-09T14:18:28Z

i would try a different transceiver if you have data issues. the payload is not modified on reception or transmission: the arrays are memmoved or memcpy'd exactly as received from hardware and as sent by the user. you can also check the error registers; if the counters are higher than normal, you have a transceiver/baudrate/termination issue causing this
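As background for reading the error counters: the CAN fault confinement rules map the transmit and receive error counters (TEC and REC) to a node state. The sketch below applies those thresholds on a PC. The names are illustrative and this is not the FlexCAN_T4 API; on the actual controller the counters are read from its error registers, and the hardware reports the state itself.

```cpp
#include <cassert>
#include <cstdint>

enum class CanErrorState { ErrorActive, ErrorPassive, BusOff };

// CAN fault confinement thresholds:
//   TEC above 255        -> bus-off
//   TEC or REC above 127 -> error-passive
//   otherwise            -> error-active (normal operation)
CanErrorState classify(uint16_t tec, uint16_t rec) {
    if (tec > 255) return CanErrorState::BusOff;
    if (tec > 127 || rec > 127) return CanErrorState::ErrorPassive;
    return CanErrorState::ErrorActive;
}

const char *toString(CanErrorState state) {
    switch (state) {
        case CanErrorState::ErrorActive:  return "error-active";
        case CanErrorState::ErrorPassive: return "error-passive";
        case CanErrorState::BusOff:       return "bus-off";
    }
    assert(false && "unreachable");
    return "unknown";
}
```

Counters that stay near zero over a long run point away from a transceiver, bit-rate, or termination problem; counters that keep climbing point toward one.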

megax · 2024-07-10T07:50:18Z

Thank you for your response! The error function returns all of that data, if I read it correctly, right? I'll look into it and try to verify that.

My problem is what I see in the writeTxMailbox function. I log the data there in the same way as I log it in my own system before calling write(). Of course it runs later once buffering starts, but what I print there differs from what I printed before write(). I changed the code as shown above because the content taken out of the circular buffer had already changed before it was sent. So at that point nothing has been put on the CAN bus yet. And when it is copied/sent (when the FlexCAN part actually runs and the registers are written), the messages printed in writeTxMailbox match the data read on the CAN bus network.

I will check what you suggested; I just wanted to share that I have already done this experiment. If this were a network error, I don't think the interrupt modification would help. There is one more point: we also have a Particle Photon 1 device. The code originally ran on that device; we only switched to the Teensy later. With the same network there is no such failure. Of course it's a different platform, but it is a good control/comparison point.

Particle test: For 50000 good rtcm messages, 0 are bad (Same user code)
