How to troubleshoot a stalled/hung Bluz program?


#1

I’ve found my Bluz if left long enough will eventually stop responding. Serial and publish communication stop as well as a led which I pulse at the end of the loop. The main led remains blinking blue but the Gateway serial shows no future traffic to/from it. Sometimes it’ll take 12hrs to get to this state, but more typical is a few hours. I’m using 1.1.47 but I didn’t try 1.0.47 long enough to know if this is a new issue.

With extra serial printlns I appear to have narrowed it down to this line, at least with my current program. Most if not all times when it hangs the last output will be part of what’s generated by this:
Serial1.println("Temp is: " + String(fahrenheit,2) + "F. Success: " + String(readPercentage,2) + "%. Total Attempts: " + String(totalAttempts) + ". Total Time: " + String(totalTime) + "ms. Avg time for success: " + String(averageTime, 1) + "ms" );

The last successful output of this line was:
Temp is: 80.82F. Success: 99.96%. Total Attempts: 10567. Total Time: 1410794ms. Avg time for success: 133.0ms

The data types of those values are: double, float, int, unsigned long, and float.

Any ideas on how to troubleshot this further?


#2

Is it possible to post the code for your entire app? It is pretty hard to debug from that limited set.

Serial1.print can be unreliable if called from an interrupt in 1.1.47, are you calling that line from inside one?

Otherwise, there could be other issues present and that just happens to show as the last output. Full code would be useful to try and PO pint the issue.


#3

No interrupt, just using the normal loop. Basically my program is Luke’s DS18B20 temp example with some tweaks to improve the efficiency and success rate. It’s roughly 10x faster at getting the temp with near perfect accuracy but I wanted to resolve this hanging issue before sending a pull request.

#include "OneWire.h"
#include "DS18B20.h"

#define ledPin D7
#define oneWirePin D2
DS18B20 ds18b20 = DS18B20(oneWirePin);

float pubTemp;
double celsius;
double fahrenheit = 0;
unsigned long metricPublishRate = 30000;
unsigned long lastMetricPublishTime = 35000;
unsigned long lastDSSampleTime = 20000;
unsigned long dsSampleRate = 2000;
bool firstPass = true;
bool radioON;
int failedAttempts = 0;
int successAttempts = 0;
int totalAttempts = 0;
unsigned long totalTime = 0;
float averageTime = 0;

int dsResolution = 9; //Higher resolution takes longer to calculate; 9 (.9F) < 94ms; 10 (.45F) < 188ms; 11 (.23F) < 375ms; 12 (.11F) < 750ms

void getTemp();
void publishData();
void radioCallbackHandler(bool radio_active);

/* executes once at startup */
void setup() {
    pinMode(ledPin, OUTPUT);
    pinMode(oneWirePin, INPUT);
    Serial1.begin(38400);
    Serial1.println("================================================");
    Serial1.println("               Bluz-Temp - v1.48                ");
    Serial1.println("================================================");
    Particle.variable("tempDS18B20", & fahrenheit, DOUBLE);
    BLE.registerNotifications(radioCallbackHandler);
}

/* executes continuously after setup() runs */
void loop() {
    if (Particle.connected()) {
        digitalWrite(ledPin, HIGH);
        System.sleep(SLEEP_MODE_CPU);
        digitalWrite(ledPin, LOW);

        if (millis() - lastDSSampleTime > dsSampleRate && !radioON) {
            Serial1.println("starting read");
            long before = millis();
            getTemp();
            long delta = millis() - before;
            lastDSSampleTime = millis();
            Serial1.println("time taken: " + String(delta));
            totalTime += delta;
            if (ds18b20.crcCheck()) {
                averageTime = (totalTime / successAttempts);
                Serial1.println("Temp is: " + String(fahrenheit, 2) + "F. Success: " + String(readPercentage, 2) + "%. Total Attempts: " + String(totalAttempts) + ".  Total Time: " + String(totalTime) + "ms. Avg time for success: " + String(averageTime, 1) + "ms");
            }
            Serial1.println("ending read");
        }
        if (millis() - lastMetricPublishTime > metricPublishRate && !radioON) {
            publishData();
            lastMetricPublishTime = millis();
        }
    }
}

void radioCallbackHandler(bool radio_active) {
    radioON = radio_active;
}

void publishData() {
    Serial1.println("starting publish");
    if (ds18b20.crcCheck()) {
        Particle.publish("bluzTemp", String(millis()) + " - DS18B20 Temp is: " + String(fahrenheit, 3) + "; Read Success: " + String(readPercentage, 1) + "% of " + String(totalAttempts) + " attempts", 60, PRIVATE);
    }
    Serial1.println("ending publish");
}

void getTemp() {
    char szRom[24];
    if (firstPass) {
        ds18b20.search();
        ds18b20.resetsearch();
    }
    celsius = ds18b20.getTemperature();
    if (!ds18b20.crcCheck()) {
        failedAttempts++;
    } else {
        if (firstPass) {
            firstPass = false; //valid result so rom must be correct
            ds18b20.setResolution(dsResolution); //Set after getting the address
            ds18b20.getROM(szRom); //get address
            Serial1.print("Rom: ");
            Serial1.println(szRom);
            //ignore the first reads so we can get an accurate percentage early on and not have to wait awhile to balance out the first 2 bad reads
            totalAttempts = 1;
            failedAttempts = 0;
        }
        successAttempts++;
    }
    fahrenheit = ds18b20.convertToFahrenheit(celsius);
    totalAttempts = successAttempts + failedAttempts;
    readPercentage = ((float) successAttempts / (float) totalAttempts) * 100.0;
}

You can get the Onewire and DS18B20 code from here:

In prior versions I have ruled out everything in getTemp(). I was turning on D7 when it started and off when it ended. After the program stopped responding the led was off. This current version lights up when the Bluz is sleeping and last night when the program stopped responding the led was off. It’s possible the LED would be off regardless so I guess it’s worth testing the reverse so we can see it fail with the LED left on.

In my current test I have commented out the big println I singled out before, it’s been running 3-4 hours now without a problem but I’d probably need to see it go 24hrs before believing that’s the cause.


#4

It hung again, even with that println removed, the last output was a few characters from:
Serial1.println(“ending read”);


#5

I ran another test with all of the temperature and publish calls disabled, so basically just the serial println calls and it still hung. I’m now running a test of the reverse, everything enabled except the serial calls.


#6

Sorry for the delay. Sounds like you’re taking the right approach.

There was a bug around this which we have fixed in the develop branch, so it would still happen in the currently released versions: https://github.com/bluzDK/bluzDK-firmware/issues/2

It should be relegated only to specific cases though, so I’m a little surprised you would see it. Still, if that is causing the issue, your current test will pass and we already have a fix as well.

Let me know how it goes, this is a good check before we do firmware release 2.0?


#7

It’s been going 2 days without hanging so I’d say the serial output was the cause. In your testing for that issue do you remember if it still happened when the RX/TX wasn’t hooked up? Wondering if I should move all my serial output to a separate function so I can easily disable it when not using it or if just not having the hardware pins hooked up is enough.


#8

Never mind, it’s very easy to reproduce by using serial output every loop. The software call is the cause, it didn’t matter if nothing was hooked up to RX/TX.


#9

Thanks. This is definitely fixed in the current develop branch, so we will be releasing a beta very soon that will solve this for you.