[SOLVED] Bluz restarts after connecting


#1

I’ve been having a strange problem lately where my Bluz restarts around the point when the connection with the cloud is made. I was working on reducing the code to the bare minimum that caused this when I thought I’d found a solution: switching from code that checks Particle.connected() several times in my loop to code that checks it once and stores the result in a variable, which the loop then uses, reduced the incidence so dramatically that I moved on. But now it’s back as I’ve been optimizing my code to be efficient enough for battery power. Unfortunately, when it’s in this restarting state I can’t flash new code; I have to go through safe mode, which means I can only test a new version once every 5-10 min. So I was wondering if there is a known issue that may be causing this, and/or whether anyone has ideas to help get to the bottom of it more quickly?

One other difference with my code around the time this started happening is I removed a line that prevented System.sleep if Particle.connected() is false. I think there was a note about this being required to connect in some test code I saw on the forums but I did a test without and it worked, plus for a battery powered bluz it seems wasteful not to sleep and would most likely completely drain my battery within hours during an outage.

I’m still on beta 4, I was thinking if I could reproduce this I could try with earlier firmware to see if this is a new bug. My understanding is if I upgrade to beta 5 then it’ll no longer be possible/easy to do that. Would upgrading to beta 5 mean that invoking safe mode manually would cause it to reset to beta 5 or would it go back to 1.1.47 like it does now? Not having to wait for a system firmware update for safe mode could save me some time during this process.

Here’s minimal code that should cause it but doesn’t seem to. It will at least illustrate what I’m doing and where it crashes. Perhaps the extra includes (onewire & dallas) in my main program are to blame, though they aren’t being initialized.

unsigned long lastConnectTime = 0;
int lastBLEStatus = 0;

void setup() {
    Serial1.begin(38400);
    Serial1.println("Staring up, v1.01...");
}

void loop() { 
    checkBLE();
    System.sleep(SLEEP_MODE_CPU);
}

void checkBLE()
{
    BLEState state = BLE.getState();

    if (lastBLEStatus != state)
    { 
        if (state == BLE_CONNECTED)
            Serial1.println("BLE Connected");
        else 
            Serial1.println("BLE NOT Connected; State=" + String(state));
        lastBLEStatus = state;
    }
    if (state == BLE_CONNECTED && Particle.connected()){ 
        if (lastConnectTime == 0) {
            Serial1.println("Particle Connected");
            lastConnectTime = millis(); 
            Serial1.println("Bluz will typically restart by this point");
        }
    }
}

The serial output for a program like this would be:

Staring up, v1.01...
BLE NOT Connected; State=1
BLE NOT Connected; State=3
Staring up, v1.01...
BLE NOT Connected; State=1
BLE NOT Connected; State=3
Staring up, v1.01...
[and so on]

So it never actually prints “Particle Connected”. I did have a version that prints a line for each loop, with a small delay so I could monitor the timing. I found that when Particle connects, with or without the crash, there is a significant delay in the output. It appears the connection event blocks for several seconds and occurs between loop iterations. So I’d see ‘loop begin’, ‘loop end’, [massive delay], [restart or ‘loop begin’]


#2

That type of behavior is usually caused by too much RAM usage by the system firmware.

What happens is, when the device connects it must start the encryption process with the cloud. This uses RSA to send session keys back and forth, and the process uses malloc to eat up a bunch of RAM. If there isn’t enough RAM available, the system restarts. Normally there should be plenty of RAM but if the system firmware has used too much before the connection happens, it can cause this behavior.

Normally this is caused by modifying the system firmware, but I suppose it could also be caused somehow by making the right combination of calls to the system parts.

Would it be possible to post your entire code to a GitHub Gist and post a link to it here? Or you can share it with me directly if you don’t want it public.


#3

Interesting. Adding the libraries to my bare program along with the init for the dallas sensor allowed me to reproduce it. It doesn’t happen if I unplug the sensor so perhaps that library is using a lot of the RAM? The one I put in the gist is basically the same onewire/dallas one you get through the WebIDE but with necessary bluz compatibility tweaks.

I just realized if a sensor is detected I output the address via sprintf but at some point I added extra info to that output and forgot to increase the size of the char array. So perhaps that is part of the issue? If I increase the size of that then Bluz doesn’t restart. With past devices if I try to store more than the initial size of the array it comes out gibberish.
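For anyone hitting the same thing, here is a minimal sketch of the kind of fix I mean. It is not the code from my gist: the helper name formatSensorAddress, the label text and the buffer sizes are made up for illustration. The idea is to use snprintf with the real buffer capacity instead of sprintf, so output that is too long gets truncated rather than written past the end of the array, and to check the return value so the truncation is noticed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical helper (not from the gist): writes "<label> <16 hex digits>"
// for an 8-byte 1-Wire ROM address into out, never writing more than
// capacity bytes. Returns true if the whole string fit, false if it had
// to be truncated.
bool formatSensorAddress(char *out, std::size_t capacity,
                         const std::uint8_t addr[8], const char *label)
{
    assert(out != nullptr && capacity > 0);

    // snprintf always null-terminates when capacity > 0 and returns the
    // length the full string would have had, which lets us detect truncation.
    int needed = std::snprintf(
        out, capacity, "%s %02X%02X%02X%02X%02X%02X%02X%02X", label,
        static_cast<unsigned>(addr[0]), static_cast<unsigned>(addr[1]),
        static_cast<unsigned>(addr[2]), static_cast<unsigned>(addr[3]),
        static_cast<unsigned>(addr[4]), static_cast<unsigned>(addr[5]),
        static_cast<unsigned>(addr[6]), static_cast<unsigned>(addr[7]));

    return needed >= 0 && static_cast<std::size_t>(needed) < capacity;
}
```

Calling it as formatSensorAddress(buf, sizeof(buf), addr, "Sensor") keeps the size in one place, so adding extra info to the format string later can at worst truncate the output instead of overrunning the array.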


#4

That could certainly be the issue. The setup() function may not get called until the device connects. So if the buffer is getting overrun it could cause issues.

If you increase the size of the array, is the issue fixed?


#5

Setup and loop do run before the particle connection. After fixing the array size I haven’t seen the issue return. Thus another reason I like using String() instead of sprintf :grin:. I’ll upgrade to beta 5 in a few days and go back to that.

You think the connection stuff was created in memory before setup() and my sprintf call corrupted the same part of memory thus causing the restart when the connection code went back to use it? In a prior test I prevented my bad sprintf call until after particle was connected and that did not show any problems. I know the Photon/Core devices by default do not run setup() until after the connection so that could explain why I’ve never seen this issue on those.


#6

The system and user parts of RAM are separate, so one shouldn’t corrupt the other.

I am going to mark this as solved; it sounds like it was just the buffer overrun issue. That can always lead to unpredictable problems, so it would cause strange behavior sometimes and possibly nothing other times.