[SOLVED] Gateway goes "out in the weeds"


#1

For some unknown reason, it seems that my gateway stops working every few days. I’m now wondering if this contributed to short battery life on the battery shields, although it’s tough to know if the gateway stopped working and that caused the DKs to run their battery down trying to re-connect or if the battery simply died far earlier than expected for some other reason.

I’m using the same gateway code as before (it’s about a month old) and it’s unmodified, with the exception of adding a function that calls System.reset() so I can remotely reset the gateway. Once I call that function and the gateway restarts, the DKs reconnect and everything runs fine for a few days.

From my perspective, this is casting serious doubt on the reliability of the platform for any “important” application.

If new(er) gateway code is available, I’m more than happy to flash it (after adding my simple reset function).


#2

Can you provide a few more details on the setup? I assume you are using a Photon; what version of Particle firmware is loaded onto it?

How many bluz DK are connected to the gateway at a time? Just one?

I assume this is the same code on the DK as posted in the other thread, correct?

When you say the gateway stops working, what are the symptoms? Are the bluz DKs flashing magenta? When this happens, is the gateway still online? Can you send it commands via the Tinker app?

I have heard intermittent issues about this, but have never been able to successfully reproduce.

We are so, so, so close to finalizing all the Kickstarter and pre-order shipments (which makes me very, very, very happy :smile:) and I am looking forward to jumping back into the releases and fixing issues as opposed to packing envelopes and programming/assembling hardware. I will move this up the list to try and investigate!


#3

Hi @eric, I am using a Photon and assume it’s something like 0.4.9, but I’m not totally sure.

There are 2 DKs connected and both stop reporting via Particle.publish(). Yes, it’s fundamentally the same code as the other day, but cleaned up just a bit.

I don’t know what the DKs are doing, but I believe they flash magenta (from memory, before I shut off the LED). I was about 15 miles away when I noticed they were not reporting, and a simple reset of the gateway Photon cleared up the problem.

The next time it happens, I’ll wait until I get home and put eyes on the DKs to see what they’re showing, if anything on their RGB LED. I do have it turning off once they connect, though. I imagine I won’t be able to see anything.


#4

Well…it just happened again, so I recompiled with 0.5.1 and re-flashed it. We’ll see if that makes a difference.


#5

I have noticed similar behaviour, but have not had the time or inclination to roll my sleeves up and dig into this.

All I have noticed is that endpoints stop reporting, the white LED on the GW stops blinking, and that endpoint battery life does seem to suffer. I had not reported it because, as I said, I “have not had the time or inclination to roll my sleeves up and dig into this”, but I wanted you to know you are not alone in seeing something.


#6

I am investigating this. I have 3 DKs publishing every second through a gateway, so hopefully this setup will catch the issue quickly and I can work towards a resolution. If there is an issue here, it is probably a race condition that we can hopefully clear up quickly.


#7

Still working on this. I did get it to happen one time, and it corresponded with the gateway having to reconnect to the Particle cloud. So the gateway lost its connection for some reason (possibly the socket closed on the Photon) and it initiated a cloud reconnect. Directly after this happened, it seemed to get stuck, but I was unable to determine where.

So I added more to the logging and started it again.

This time I also set the gateway to publish every second, thinking this could somehow have to do with the gateway data and bluz data getting into a race condition. I ran it all day yesterday fine, but this morning it was in an even weirder state as the Photon was blinking blue (like it would be in WiFi setup mode). Not really sure what that means?

Had to restart and try again, hopefully will get a more definite answer today.


#8

@eric, I appreciate the continued attention to this issue. FWIW, since I recompiled the gateway Photon code with 0.5.1, I’ve not had a problem. I’m pushing data from each DK every 10 minutes or door status change, whichever comes first. I have a latency event set up on Grovestreams that alerts me if it hasn’t received data from either DK for 25 minutes (2 updates @ 10 min max). I don’t believe I’ve seen these alerts since the recompile.
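
For anyone who wants to replicate that kind of latency check outside Grovestreams, here is a minimal sketch in Python. It is not how Grovestreams implements its events; the function name, the dictionary of last-seen times, and the example timestamps are all made up for illustration. Only the 25 minute threshold comes from my setup.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# 25 minutes = two missed 10-minute updates plus some slack (value from the post).
ALERT_AFTER = timedelta(minutes=25)


def stale_devices(last_seen: Dict[str, datetime],
                  now: datetime,
                  threshold: timedelta = ALERT_AFTER) -> List[str]:
    """Return the names of devices whose last update is older than threshold."""
    return sorted(name for name, seen in last_seen.items()
                  if now - seen > threshold)


if __name__ == "__main__":
    # Hypothetical timestamps, purely for illustration.
    now = datetime(2016, 6, 1, 12, 30)
    last_seen = {
        "bluz1": datetime(2016, 6, 1, 12, 0),   # 30 minutes ago -> stale
        "bluz2": datetime(2016, 6, 1, 12, 10),  # 20 minutes ago -> still fine
    }
    print(stale_devices(last_seen, now))
```

The check is deliberately one-sided: a device is only flagged once it has been silent for longer than the threshold, so a single late or dropped update does not raise an alert.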


#9

Looks like it happened, again, last evening. One of the DKs stopped publishing at 11:51PM (EDT) and the other at 11:54. So…this is interesting…that one continued to publish data after the other stopped. I’m at work now and can’t see the status of the LEDs. If that would be helpful information, I’ll report back once I get home in 10 hours. Otherwise, I’ll go ahead and reset the gateway remotely and get them logging, again.


#10

I was able to reproduce this again with my constant-publish setup, so there is no need to check the colors, you can reset your gateway.

This definitely seems like a race condition when both the DK and the gateway want to send data at the same time. Still trying to track down the root cause, but now that I can reproduce it, I should be able to get to the bottom of it. Hopefully I can have some firmware for you to test out soon!


#11

OK. Remote reset worked, again. I’m glad I have that function and it works!


#12

Don’t know if this helps, but the gateway went out into the weeds, again, and it appears to have occurred right after a publish().


#13

I have found a way to reproduce this with my setup. It seems I can always get it to happen within about 4-6 hours if I have the max number of DKs attached and publishing data aggressively, and also have the gateway publishing data at the same time. What seems to happen is the gateway just sort of locks up.

I am probably being over-optimistic to mention this, but I did find one area of concern in the code and fixed it last night. I set it up and ran things again last night, and other issues led to the bluz boards not being online this morning. However, the gateway was working still, and was still connected to the cloud and publishing. So I reset everything and will let it run again today, if things seem to be working out, I will try and get you a preliminary build to test.


#14

@ctmorrison I think I may have fixed the issue and was hoping you could test this out. I was able to reproduce this before, though sometimes unreliably, and now I can’t seem to anymore. The fix I made makes logical sense, I am just not 100% certain it was the cause of your issue. Therefore, I think it would be good to have you test this out.

You can update the code on your gateway from the staging console here: http://staging-console.bluz.io/

Just login with your Particle credentials and you should see your device show up with a green Update button below it. Once you click that, it will download the latest code to your gateway, and it may take a few minutes and the gateway will reboot in between.

This is still a bit experimental, but I have had good success with it. If something doesn’t work, it should show you an error message and you can try again. It is possible the update can get stuck, meaning the white LED will stay lit or unlit for more than a minute. If this happens, you need to reset the device or wait for the WDT to kill the update, which can take up to 10 minutes. Then you can try again. This shouldn’t happen, but it is possible, so you may want to be near the device when you do the update.

Let me know how it goes


#15

@eric I just did the upgrade. I was a bit surprised when I first brought up the webpage and the button was labeled “claim.” So…I did a reset function on the gateway and went back to the web page and the button was labeled “update” and I pushed it. What I later realized is that the gateway had hung, again. It will be a good test and I’ll let you know how it goes. I believe it has been more reliable with 0.5.1, but I can’t be absolutely certain, as I’m lacking any real empirical data about runtime, hangs, etc. Sorry about my lack of formality in testing.


#16

Do you happen to remember what the device ID said when the Claim button appeared? Was it the proper device ID?

Also, did the upgrade go smoothly? Did you have to reboot? We’re testing out that feature as well for many new things down the road.

Hopefully this helps and/or fixes our issue entirely. I am going to keep testing from my end so we can hopefully clear this all up soon. Let me know if you see the issue again. Thanks


#17

I don’t recall the device ID when it had the “Claim” button and assumed it was the right one. And (obviously), I don’t remember it changing after I rebooted the gateway.

The upgrade seemed to go smoothly. I don’t recall if I rebooted after the upgrade, but I do believe it rebooted on its own, based upon looking at the log and doing a “particle list”.

Still running and I’ll update you if I see it hang again.


#18

It’s still misbehaving, but now seems to clear itself up, rather than staying offline. I got a notice data had not been received, so I brought up the dashboard. Coincidentally(?), the gateway started working, again. I’m attaching a screenshot of the dashboard, showing disconnects and reconnects. The bluzGatewayBT is the BLE device. bluz1 and bluz2 are the two DKs. There’s another device bluzGateway that’s the Photon on the gateway, but it’s not shown–this seems unusual from recollection.


#19

That can happen, basically if the Photon loses connectivity with the Particle cloud, even for an instant, then the Photon can lose the TCP sockets. If it does, then all devices under it would need to reconnect.

I have seen this just observing Photons in general, they sometimes lose the connection and go back to the blinking cyan state and then reconnect. It happens so fast on a Photon that you wouldn’t realize it or probably even notice, but it can lead to some downtime for the bluz network running from the Photon if it happens. Worst case, it could take a bluz board up to 60 seconds to realize it is offline then (X+1) * 15 seconds to reconnect, where X is the number of bluz boards on the gateway. So in your case, it could have been offline for nearly two minutes if this happened.
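
To make that arithmetic concrete, here is a small Python sketch of the worst-case estimate. The function and constant names are hypothetical and not part of the bluz firmware; only the 60 second and 15 second figures come from the description above.

```python
# Worst-case downtime estimate using the figures quoted above.
# Names are illustrative only; they do not come from the bluz code base.

OFFLINE_DETECT_SECONDS = 60   # up to 60 s for a bluz board to realize it is offline
RECONNECT_SLOT_SECONDS = 15   # each of the (X + 1) reconnect slots takes 15 s


def worst_case_downtime_seconds(num_boards: int) -> int:
    """Return 60 + (X + 1) * 15, where X is the number of bluz boards."""
    if num_boards < 0:
        raise ValueError("num_boards must be non-negative")
    return OFFLINE_DETECT_SECONDS + (num_boards + 1) * RECONNECT_SLOT_SECONDS


if __name__ == "__main__":
    # Two DKs on the gateway: 60 + 3 * 15 = 105 seconds.
    print(worst_case_downtime_seconds(2))
```

With two boards this gives 105 seconds, which is where the "nearly two minutes" figure comes from.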

You mentioned you received a notice that the data hadn’t been received. What triggers that notice? Is the timing super strict, where if it misses one publish within some short interval you would get the notice? Or would the device have to be offline for hours?


#20

The DKs push data every time a door opens/closes OR every 10 minutes, whichever comes first. If a door opens/closes, the 10 minute cycle is reset. The alert is currently set up to fire if 25 minutes go by without an update being received.
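
In case the rule is easier to follow as code, here is a rough Python model of it. This is not the actual DK firmware, just the decision logic; the class name, method name, and the use of plain seconds for time are my own assumptions for the sketch.

```python
PUBLISH_INTERVAL_S = 10 * 60  # 10 minute heartbeat (value from the post)


class DoorReporter:
    """Decides when to publish: on any door change, or after 10 quiet minutes."""

    def __init__(self, initial_open: bool, now_s: float) -> None:
        self.last_open = initial_open
        self.last_publish_s = now_s

    def poll(self, door_open: bool, now_s: float) -> bool:
        """Return True if an update should be published at time now_s."""
        changed = door_open != self.last_open
        timed_out = now_s - self.last_publish_s >= PUBLISH_INTERVAL_S
        if changed or timed_out:
            # Any publish, whatever its cause, restarts the 10 minute cycle.
            self.last_open = door_open
            self.last_publish_s = now_s
            return True
        return False


if __name__ == "__main__":
    reporter = DoorReporter(initial_open=False, now_s=0)
    for t, door in [(300, False), (400, True), (999, True), (1000, True)]:
        print(t, door, reporter.poll(door, t))
```

In the loop above, nothing is published at 300 s (no change, timer not expired), the door opening at 400 s publishes and restarts the timer, and the next heartbeat is then due at 1000 s rather than 600 s.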