Giving Bluz TCPClient via BluzGW -- for Blynk and other things


#1

@rgm and all …

I have taken up the gauntlet to get Blynk working on Bluz, via the BluzGW.

This is a placeholder post for discussions relating to that, as we proceed.


#2

I’m basing my work on the vshymanskyy/blynk-library-spark repository, which describes itself as …

Blynk library for Particle Core, Photon, Electron, P0, P1, RedBear Duo etc. http://www.blynk.cc/

So far, I think I’ve got my head around what is required for moving data back and forth to the Blynk network. That part should be fairly painless, using @eric’s existing system for transporting user data via the gateway.

There’s bound to be more than just that – like the Blynk library making calls to Particle functions the Bluz doesn’t yet have and what not.

Anyway, my plan is to have the Bluz contain the complete Blynk Library, as in #include "blynk/blynk.h" and add the required data transport to that, in combination with the BluzGW doing the TCP/IP connection to Blynk’s network.

The Bluz gateway’s Photon runs on an OTA flashable sketch, making it silly easy to change or update. So we could probably just have a separate version of that sketch, for people who want Blynk support.

Another option would be to use the official version of the BluzGW Particle sketch and add to it the capability of “hearing” a special escape sequence to initiate the Blynk connection and transport layer. That way, it will still work for all things non-Blynk. But it will certainly be the former idea, for a start.

I’ll dive deeper tomorrow.


#3

Really like your last option, @gruvin. How about a generic Bluz-to-GW layer for TCP, so we could do any TCP stuff we wanted? The Blynk lib could then just run over that.

If there is a standard, then it could be added to all the gateways (phone apps, GW shield, @mumblepins GW, etc.).


#4

@gruvin Love where you are headed with this.

The best way to make this work is to actually create the TCPClient library on bluz. The Blynk protocol uses that under the hood (as @rgm pointed out), so if that gets implemented then Blynk will just work.

This shouldn’t be too tough. The TCPClient library is just a wrapper around socket, which is already working. If you look in platform/MCU/NRF51/Spark… then you will see a socket_manager class and a socket class. All you need to do is make a TCPClient class for bluz HAL that uses the same underlying socket class.

The entire communication pipe from bluz to the gateway is already implemented and handles things like different socket numbers. There is definitely some testing on the gateway side, and I know for a fact we don’t parse the IP address handed up, but it shouldn’t be too bad.

This is the best way, you would get to use all the underlying socket protocols that already exist on bluz and the gateway. Plus, at the end, you would have Blynk support AND a TCPClient class so that bluz could send TCP messages to ANY server, not just the Blynk one.
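A minimal sketch of the "thin wrapper around socket" idea, for anyone following along. Everything here is an assumption made for illustration: the `fake_socket_*` functions are stand-ins, not the real bluz or Particle socket HAL signatures, and `SketchTCPClient` is a hypothetical name. It only shows the shape of a client class that forwards every call to an underlying socket layer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// --- Stand-in socket layer (hypothetical; NOT the real bluz socket HAL) ---
constexpr int kMaxSockets = 2;      // e.g. one cloud socket plus one user socket
constexpr int kInvalidSocket = -1;

struct FakeSocket {
    bool open = false;
    bool connected = false;
    uint8_t ip[4] = {0, 0, 0, 0};
    uint16_t port = 0;
    size_t bytes_sent = 0;
};

static FakeSocket g_sockets[kMaxSockets];

// Claims the first free slot, or returns kInvalidSocket when all are in use.
int fake_socket_open() {
    for (int i = 0; i < kMaxSockets; i++) {
        if (!g_sockets[i].open) {
            g_sockets[i] = FakeSocket{};
            g_sockets[i].open = true;
            return i;
        }
    }
    return kInvalidSocket;
}

bool fake_socket_connect(int id, const uint8_t ip[4], uint16_t port) {
    if (id < 0 || id >= kMaxSockets || !g_sockets[id].open) return false;
    std::memcpy(g_sockets[id].ip, ip, 4);
    g_sockets[id].port = port;
    g_sockets[id].connected = true;
    return true;
}

int fake_socket_send(int id, const uint8_t* data, size_t len) {
    if (id < 0 || id >= kMaxSockets || !g_sockets[id].connected) return -1;
    (void)data;  // a real implementation would hand the bytes to the gateway
    g_sockets[id].bytes_sent += len;
    return static_cast<int>(len);
}

void fake_socket_close(int id) {
    if (id >= 0 && id < kMaxSockets) g_sockets[id] = FakeSocket{};
}

// --- Thin client class that only forwards to the socket layer ---
class SketchTCPClient {
public:
    bool connect(const uint8_t ip[4], uint16_t port) {
        stop();  // release any socket this client already holds
        sock_ = fake_socket_open();
        if (sock_ == kInvalidSocket) return false;
        if (!fake_socket_connect(sock_, ip, port)) { stop(); return false; }
        return true;
    }
    int write(const uint8_t* data, size_t len) {
        assert(data != nullptr || len == 0);
        return fake_socket_send(sock_, data, len);
    }
    bool connected() const {
        return sock_ != kInvalidSocket && g_sockets[sock_].connected;
    }
    void stop() {
        if (sock_ != kInvalidSocket) {
            fake_socket_close(sock_);
            sock_ = kInvalidSocket;
        }
    }
private:
    int sock_ = kInvalidSocket;
};
```

The point of the design is that the client class holds nothing but a socket handle, so anything built on a TCPClient-style interface (Blynk included) would only ever see `connect`, `write`, `connected` and `stop`.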


#5

Wow. OK. I had no idea all that was working on the Bluz side already … though now you mention it, it’s kind of obvious. :slight_smile:

I’ll spin the wheel of the HMS Gruvin and see how she turns.


#6

@eric OK. I’ve actually read the Particle gateway code now and see how it all works.

Thinking out loud …

So, I guess if I just move the domain/IP and port parameters over to the Bluz side, so they can be passed to the gateway (instead of the presently hard coded CLOUD_DOMAIN, 5683) then that will take care of that part.

MAX_CLIENTS is a little more interesting. Ah … I see. That’s just the maximum number of clients the NRF can connect to over BLE at any one time. So, it doesn’t affect how many TCP sockets we can open on the gateway side. OK. I’ll come up with something to separate those out.

Now I need to turn my attention to the TCPClient end of things. Oh right, so TCPClient is already part of the Particle framework. That’s new since I last looked at their docs, I think (couple of years?). Cool.

So really then, seems all I have to do there is write a new HAL/BLE<–>BluzGW layer and plug that into existing code (after I port it?) – just like @eric suggested. Neatoes.

On with the show then.


#7

The current socket framework actually hands over the IP address and port, the gateway just ignores it at the moment and hard codes it.

So all you really need to do is have the gateway parse the proper IP address and you will be good.
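A rough sketch of what "parse the proper IP address" could look like on the gateway side. The payload layout used here (4 address bytes followed by a big-endian 2-byte port) is purely an assumption for illustration, and `ConnectTarget` and `parse_connect_request` are hypothetical names. The real bluz connect message may be laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical connect-request payload: 4 bytes of IPv4 address followed by a
// 2-byte port in network (big-endian) order. The real bluz layout may differ.
struct ConnectTarget {
    uint8_t ip[4];
    uint16_t port;
};

// Returns true and fills `out` when the payload is long enough to parse.
bool parse_connect_request(const uint8_t* payload, size_t len, ConnectTarget* out) {
    assert(out != nullptr);
    if (payload == nullptr || len < 6) return false;
    for (int i = 0; i < 4; i++) out->ip[i] = payload[i];
    out->port = static_cast<uint16_t>((payload[4] << 8) | payload[5]);
    return true;
}
```

With something like this in place, the gateway could pass the parsed address and port to its connect call instead of the hard-coded CLOUD_DOMAIN and 5683, and fall back to those defaults when parsing fails.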

On the TCPClient side, you just need to implement the HAL functions and call the socket functions: https://github.com/bluzDK/bluzDK-firmware/blob/develop/hal/src/nrf51/socket_hal.cpp#L22. Everything underneath is handled for you, you shouldn’t need to change anything in the platform/ folder at all, just implement the TCPClient file with the current socket HAL.

Yes, I think we hard coded the number of system sockets to 1, so you need to increase that by 1.


#8

@eric … This is going to be a little bit of a bigger job than first met our eyes … if only for implementing inet_gethostbyname() and ping(), the former being the one we may most want.

Those currently require the wiced library on the Photon, which in turn may be part of Particle’s RTOS. It has its own RTOS folder. I haven’t dug into that.

For now, I’m just going to proceed without a working inet_gethostbyname() or ping() and just use IP addresses for connect(). {shrug} We’ll cross that bridge when we get to it. Rolling our own is not out of the question. Resolving a hostname is essentially just a DNS query on port 53, after all (normally over UDP, with TCP as a fallback). Ping … well, there are different ways that gets implemented … ICMP with or without UDP. {shrug} Long time since I wrote anything IP down at that level. Could be fun. hehe


#9

Oooh ooh ooh, I ported a gethostbyname library ages ago for the Core, when the DNS IP got corrupted … might save you a bit of work to see how it was done … it uses UDP though …


#10

@gruvin I haven’t looked at this closely yet, but we shouldn’t need to do those things on bluz. If there is any IP work to do, such as lookup of IP addresses or pings, then the gateway can do it. So if the user on bluz wants to connect to “api.abc.xyz”, then we just hand that domain to the gateway and let them do the lookup. The Photon already does a great job of this, so we continue to let it do the IP work.

Same with ping. It can be a different socket type or whatever we need to do, but bluz should just hand over what it wants to the gateway “ping api.abc.xyz ttl=64” or whatever, the Photon does the work and just hands back the answer.

Maybe that is too simplified? I am not sure since I haven’t looked too closely, but we should make the gateway do any IP work. If a library or capability already exists on the Photon to do something, push the work there.


#11

@eric … yes, I thought of that last night, just before I fell asleep. “Get the gateway to do it!”. :wink:

No, not too simplified. I’m on it. :wink:

I’m pushed for time with other stuff going on atm. But I have made a start. More to come.


#12

@eric … making progress on the TCPClient thing.

EDIT: I don’t think this below is actually the problem. We’re already in HAL land at this point, from socket_hal.cpp. Hmmm. I’ll keep on it.

Got to the point of trying to make the second connection and was getting an SOS. Just had a bit of a giggle …

void DataManagementLayer::sendData(int16_t length, uint8_t *data)
{
"// a bit of a hack for now, should HAL this out, but it'll work for the time being"

(quotes added for dramatic effect)

So glad you [I presume] put that comment there. It would have been a loooong time before I figured that out for myself (again!)

Looks like I get to play with my old enemy again … the HALish Dynalib! “I’ll get you this time Dynalib! Mwaaaahahaaaa!!”

[Ed. Too much coffee? Not enough sleep? Both?]


#13

@eric (anyone?) … I’m stuck.

A call into an NRF51_StdPeriph_Driver, with seemingly valid data, is causing an SOS upon my TCPClient’s socket_connect() – a HAL function in socket_hal.cpp, so all pointer references are safely adjusted. I have checked this – my IP address for example comes across just fine in the same data buffer.

This call to socket_connect proceeds down the chain until it gets to the following XXX markers in platform/MCU/NRF51/SPARK_Firmware_Driver/src/ble_scs.c, where it SOS’s …

uint32_t scs_data_send(scs_t * p_scs, uint8_t *data, uint16_t len)
{
    ble_gatts_hvx_params_t params;
    //uint16_t len = sizeof(data);
    int error = NRF_SUCCESS;
    uint8_t buffer[20];

    for (int i = 0; i < len; i+=20) {
        uint16_t size = (len-i > 20 ? 20 : len-i);
        memcpy(buffer, data+i, size);

        memset(&params, 0, sizeof(params));
        params.type = BLE_GATT_HVX_NOTIFICATION;
        params.handle = p_scs->data_up_handles.value_handle;
        params.p_data = buffer;
        params.p_len = &size;

        // XXX TCPClient::connect SOS's HERE XXX
        int error = sd_ble_gatts_hvx(p_scs->conn_handle, &params); // NRF51_StdPeriph_Driver ... can't go changing stuff there
        // if (data[16] == 112) return BLE_ERROR_NO_TX_BUFFERS; // XXX DEBUG
...

I cannot fault any of the code preceding this. All data appears to be correct. So I’m totally stumped. I have not looked into the actual value contained in p_scs->data_up_handles.value_handle. But I can see no reason it could be wrong, since p_scs simply references the global uint16_t m_scs and if that were corrupt, nothing would work. And on that note, the main client socket_connect works just fine.

The rest of params is set locally in this function and again, looks perfectly fine to me.
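For readers less familiar with that loop, here is a small stand-alone sketch of just the chunking arithmetic: the data is walked in steps of 20 bytes and the last chunk carries whatever is left. It does not call the SoftDevice and says nothing about the cause of the SOS. `split_into_chunks` is a hypothetical helper, and recording chunks into a vector stands in for the real notification call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kChunkSize = 20;  // same chunk size as the loop quoted above

// Splits `data` into chunks of at most kChunkSize bytes and returns them.
// Collecting the chunks stands in for sending each one as a notification.
std::vector<std::vector<uint8_t>> split_into_chunks(const uint8_t* data, uint16_t len) {
    assert(data != nullptr || len == 0);
    std::vector<std::vector<uint8_t>> chunks;
    for (size_t i = 0; i < len; i += kChunkSize) {
        // Full chunk unless fewer than kChunkSize bytes remain.
        size_t size = (len - i > kChunkSize) ? kChunkSize : (len - i);
        chunks.emplace_back(data + i, data + i + size);
    }
    return chunks;
}
```

A 45-byte message, for example, goes out as chunks of 20, 20 and 5 bytes.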

I cannot see how this could be a user/system-part boundary crossing issue. The very first call to socket_connect() from TCPClient::connect is already diving into the HAL.

Utterly stumped. :-/

Right now, I’m just asking if you have any, “Ah ha!” comments to make. But if you’d like to look into my work-in-progress on this, it’s here.


#14

Is the socket_connect being called from an interrupt context at all? Did you increase the MAX_SOCKETS in socket_manager.h to be 2 instead of 1?

What is the SOS code that flashes?

Just a few things off the top of my head. I can try and take a look at the code when I am back in front of my laptop.


#15

Thanks for quick response. Take your time with this though. I may yet figure it out myself.

Is the socket_connect being called from an interrupt context at all?

Nope. Certainly not that I can tell. It’s being called from wiring/bluz_wiring_tcpclient.cpp. Shouldn’t be any interrupts happening there … unless there’s an NRF callback involved in some bad way.

But all the same code works just fine for the usual client connect and the data that arrives looks right. That’s what’s really got me.

Did you increase the MAX_SOCKETS in socket_manager.h to be 2 instead of 1?

Yes.

What is the SOS code that flashes?

I cannot tell. The flashing is broken up by something else. It’s a long sequence though, so I suspect it’s 8 flashes.

      I think the cause of the LED thing may be that the while(1) loop in
      app_setup_and_loop_passive() is still operating, even in SOS state –
      thus fighting over the LED with the SOS routine. Just an educated guess
      at this stage … but I digress.

Just a few things off the top of my head. I can try and take a look at the code when I am back in front of my laptop.

Just when you get time. I’m going to dive into StdPeriph_Driver to try and see exactly where it bombs.

I have J-Link Debugger (Segger software, Mac version). So I’ll try to hook that up to the ELF and C source files, some time. No doubt all the interrupts firing away will kill that exercise, though. We shall see.


#16

UPDATE

Preface
Comments below concern effectively non-public code, just until I get a beta version ready.

Summary
After chasing my tail for days trying to find a bug that didn’t exist (long, boring story) …

Outgoing TCP connections are working.

No data transport support yet.

Details
EDIT: updated below to reflect decision changes.

On the Gateway Photon, each m_client now contains an array of TCPClient sockets[MAX_CLIENT_SOCKETS] – currently 2 – one for the cloud link and one for the user app.

So each BluzDK can have up to two (presently) sockets connected. One will of course be the cloud.

In theory, if a DK loses its cloud socket connection, it could still retain one to a local LAN address. Interesting. {shrug}

Next on the list, in order …

  1. Implement DNS name resolution, using a custom BLE data service …

    void spi_data_process(...)
    {
        switch (serviceID) {
            ....
            case RESOLVER_DATA_SERVICE:
                /* take supplied domain name, resolve its IP address and return
                   the result to the 'DK client */
                break;
        }
    }
    
  2. Finish the mission critical stuff, by getting data to flow back and forth transparently, over the gateway’d auxiliary socket link(s).

  3. Debug, rinse and repeat.
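To make item 1 a bit more concrete, here is a sketch of what the body of a resolver case might do: take the host name from the request payload, ask a resolver for the address, and build a small reply for the DK. All of it is assumed for illustration. The `ResolveFn` hook, the 5-byte reply layout and `handle_resolver_request` are hypothetical, and on a real gateway the hook would be whatever DNS lookup the Photon already provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical resolver hook: returns true and writes 4 address bytes on success.
using ResolveFn = bool (*)(const std::string& host, uint8_t ip_out[4]);

// Hypothetical reply layout: [status][ip0][ip1][ip2][ip3], status 0 = OK, 1 = failed.
constexpr size_t kResolverReplyLen = 5;

// Handles one resolver request whose payload is the raw (not NUL-terminated)
// host name. Always writes a full reply and returns its length.
size_t handle_resolver_request(const uint8_t* payload, size_t len,
                               ResolveFn resolve, uint8_t reply[kResolverReplyLen]) {
    assert(resolve != nullptr && reply != nullptr);
    for (size_t i = 0; i < kResolverReplyLen; i++) reply[i] = 0;
    reply[0] = 1;  // assume failure until the lookup succeeds
    if (payload == nullptr || len == 0) return kResolverReplyLen;

    std::string host(reinterpret_cast<const char*>(payload), len);
    uint8_t ip[4] = {0, 0, 0, 0};
    if (resolve(host, ip)) {
        reply[0] = 0;
        for (int i = 0; i < 4; i++) reply[1 + i] = ip[i];
    }
    return kResolverReplyLen;
}
```

Keeping the lookup behind a function pointer means the same handler can be exercised on a desktop with a fake resolver before it ever touches the gateway.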


#17

If anyone is interested, I have just published an early Beta of my proposed new Photon Gateway sketch.

This is the one that gateways additional user application TCPClient connections from connected BluzDKs. The firmware on the BluzDK end of things is still very buggy. Wanted to get the gateway how I wanted it, first.


#18

@eric – Beating my head against a wall here, as noted in “How to get system debug using DEBUG_BUILD=y to come out Serial1” …

EDIT: This is all just social FYI, really. Getting it off my chest. I’m sure I’ll figure it out eventually. :wink:

I have spent three days now, basically coming up with this …

I suspect Dynalib issues. But I actually don’t see how it could be. All the internal data buffer pointers appear to me to be on the system-part1 side of the wall. On the user-part side, I only get as far as calling socket_open(..) – a HAL function, which succeeds just fine.

The overarching problem from then on is that ALL data coming down the pipe is being fed to Socket 0 – aka sparkSocket. A Socket 1 now exists, but nothing inbound ever gets to it. That and only that is all I’ve been trying to figure out for three days.

I looked at the Gateway NRF51 firmware. Far as I can see, that passes all data in pristine form, most importantly leaving the socketID intact.

The socket ID as sent from the Gateway’s Photon is correct. Pretty darn hard to have it not be, really …

  // process messages coming FROM the cloud going TO the BLE network
  for (uint8_t clientId = 0; clientId < MAX_CLIENTS; clientId++) {
    for (uint8_t socketId = 0; socketId < MAX_CLIENT_SOCKETS; socketId++) {
    ...
          rx_buffer[3] = SOCKET_DATA_SERVICE;
          rx_buffer[4] = (SOCKET_DATA << 4) | (socketId & 0x0F);
          spi_send(rx_buffer, rx_buffer_filled);
  

Note the references to ‘socketID’.
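As a sanity aid, here is a tiny sketch of the packing used in `rx_buffer[4]` and the matching unpacking a receiver would need in order to route data to the right socket. The numeric value of `kSocketDataType` is an assumption (the real SOCKET_DATA constant lives in the gateway code), and `route_to_socket` is a hypothetical helper, not something from the bluz firmware.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration only; not taken from the gateway source.
constexpr uint8_t kSocketDataType = 0x01;

// Packs a 4-bit message type and a 4-bit socket ID into one header byte,
// mirroring the expression (SOCKET_DATA << 4) | (socketId & 0x0F).
constexpr uint8_t pack_header(uint8_t type, uint8_t socket_id) {
    return static_cast<uint8_t>(((type & 0x0F) << 4) | (socket_id & 0x0F));
}

constexpr uint8_t header_type(uint8_t header) {
    return static_cast<uint8_t>(header >> 4);
}

constexpr uint8_t header_socket_id(uint8_t header) {
    return static_cast<uint8_t>(header & 0x0F);
}

// Picks the socket index for an inbound packet, or -1 if the ID is out of range.
int route_to_socket(uint8_t header, int socket_count) {
    assert(socket_count > 0);
    int id = header_socket_id(header);
    return (id < socket_count) ? id : -1;
}
```

If every inbound packet ends up on socket 0, one thing this suggests checking is whether the receiving side reads the low nibble of that byte at all, or reads it from a different offset than the sender wrote it to.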

Just as a final sanity check … this part is in socket_manager.h

 static const int32_t MAX_NUMBER_OF_SOCKETS = 2;

All that side of things is confirmed working anyway. The second socket from application.h's TCPClient.connect() is happening as expected.

Oh … I should also mention, that debug checks of the contents of the inbound data buffer on the DK side – at some point – do appear corrupted. The data coming back from the test socket on 10.1.0.112:22 (namely, “SSH-2.0-OpenSSH_6.9\r\n”) leaves the Photon and hits the GW-NRF SPI bus running – SPI and BLE headers attached, of course.

After that … well I don’t confidently know yet. DEBUG() is crashing. Data seems to arrive. But it appears corrupted, maybe. Yet through all of this, the sparkSocket data gets through just fine! That’s what’s got my head spinning. :confused:


#19

Not sure about the socket ID issue, but putting DEBUG statements in callbacks will lead to issues. All the code that copies data to the socket buffer is called in the highest BLE priority interrupt, DEBUG statements shouldn’t be used in that chain of events. Essentially, the UART system can’t run properly since you have locked up the highest interrupt level waiting on interrupts of lower priority.

Not sure that is what is happening, I don’t know exactly where you have put DEBUG statements, but something to keep in mind.


#20

Ah! Thanks @eric. That’s actually great news – and something I should have realised, had I stepped back from the forest. I was beginning to fear some kind of prior buffer overrun / stack trashing event. Whew.

I shall store data and get the debug info elsewhere, then. No sweat. hehe
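One common way to "store data and get the debug info elsewhere" is a small fixed-size log that the high-priority handler writes into and the main loop drains and prints later. The sketch below is a generic single-producer, single-consumer ring buffer under that assumption. `DebugRecord` and `DeferredLog` are hypothetical names, and on real hardware the indices would need volatile or atomic treatment, which is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One small record per event; kept tiny so pushing is cheap inside a handler.
struct DebugRecord {
    uint8_t socket_id;
    uint16_t length;
};

// Fixed-size ring buffer: written from the handler, drained from the main loop.
// One slot is always left empty, so it holds at most kCapacity - 1 records.
class DeferredLog {
public:
    static constexpr size_t kCapacity = 8;

    // Called from the handler: never blocks, drops the record when full.
    bool push(const DebugRecord& rec) {
        size_t next = (head_ + 1) % kCapacity;
        if (next == tail_) { dropped_++; return false; }
        records_[head_] = rec;
        head_ = next;
        return true;
    }

    // Called from the main loop, where printing is safe.
    bool pop(DebugRecord* out) {
        assert(out != nullptr);
        if (tail_ == head_) return false;
        *out = records_[tail_];
        tail_ = (tail_ + 1) % kCapacity;
        return true;
    }

    size_t dropped() const { return dropped_; }

private:
    DebugRecord records_[kCapacity] = {};
    size_t head_ = 0;
    size_t tail_ = 0;
    size_t dropped_ = 0;
};
```

The handler would call `push()` with the socket ID and length it saw, and the application loop would call `pop()` until it returns false and print each record from there.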

I’m really looking forward to finding the simple one line of code that’s causing this weirdness. I have a feeling it’s gonna be a smack-on-the-forehead event. :confounded: