Emc race and cppcheck warnings #3734

Open

smoe wants to merge 3 commits into LinuxCNC:master from smoe:emc_race_and_cpcheck_warnings

Conversation

@smoe
Collaborator

@smoe smoe commented Jan 23, 2026

Took the patch proposed by @BsAtHome for #3700 and prepared this PR from it. It looks fine to me, but I also did not test it. Any takers?

The most important bits are

  • elimination of dynamic memory allocation
  • the declaration of the sessions variable as atomic. It is decremented when the client can no longer be read (within a thread), and incremented in sockMain when a new thread is started (outside a thread) and also when the return value of the thread is != 0 (outside a thread).

There are some meta talking points:

  • There is no pthread_join or pthread_detach anywhere.
  • Is the exit(0) upon an acceptance failure (the immediate stop of the controller with no explanation whatsoever) what we truly want?

@smoe smoe added the bug, 2.9-must-fix (must be fixed for the next 2.9 point release) and 2.10-must-fix labels on Jan 23, 2026
@smoe smoe marked this pull request as draft January 23, 2026 11:51
@BsAtHome
Contributor

Good to see that you fixed the off-by-one in read() :-)

Seeing pthread_join() normally indicates that the application is well behaved. It clearly is not when exit() is called upon failure. The program is mostly written as a C program masked in a C++ source file. There are many things that should be streamlined, but I'm not sure whether this has any real value at this time. There are many interactions to consider, for example with NML.
That said, the missing pthread_join() does make it necessary to add pthread_detach() after creation, so that the resources are released automatically when a thread ends. It can be added as an else-clause to the if(res != 0) just after the thread is created.
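
Roughly this shape, as a sketch (the names here are made up, not the actual identifiers in schedrmt.cc):

```cpp
#include <pthread.h>
#include <cstdio>

// Illustrative stand-in for the real per-session handler.
static void *readClient(void *arg)
{
    (void)arg;
    return NULL;
}

int main()
{
    pthread_t thrd;
    int res = pthread_create(&thrd, NULL, readClient, NULL);
    if (res != 0) {
        // no thread was created; handle the failure as before
        fprintf(stderr, "pthread_create: %d\n", res);
    } else {
        // there is no pthread_join() anywhere, so detach right after creation;
        // the thread's resources are then released automatically when it exits
        pthread_detach(thrd);
    }
    pthread_exit(NULL);  // let any detached thread finish before the process ends
}
```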

There are also no allowances for signals causing interrupted syscalls. That would simply cause the application to bail and exit.

The exit(0) on a failed accept() is interesting because it ignores all the network errors that should be treated as EAGAIN (see accept(2)). Your machine will stop accepting (remote) connections when the network wobbles. Is this good/bad/whatever?
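
For illustration, an accept() loop that treats those conditions as retryable rather than fatal could look something like this (sketch only, not the current code):

```cpp
#include <sys/socket.h>
#include <cerrno>

// Sketch only: retry accept() on transient conditions instead of exit(0).
// Socket setup and session handling are assumed to exist elsewhere.
int acceptLoop(int listenFd)
{
    for (;;) {
        int client = accept(listenFd, NULL, NULL);
        if (client < 0) {
            switch (errno) {
            case EINTR:        // interrupted by a signal
            case EAGAIN:
            case ENETDOWN:     // accept(2): already-pending network errors
            case EPROTO:       //            should be treated like EAGAIN
            case ENOPROTOOPT:
            case EHOSTDOWN:
            case ENONET:
            case EHOSTUNREACH:
            case EOPNOTSUPP:
            case ENETUNREACH:
                continue;      // transient: try again
            default:
                return -1;     // genuine failure: report it instead of exit(0)
            }
        }
        // ... hand `client` to a session thread as before ...
    }
}
```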

That brings us to a general LCNC issue: error handling. There are many more instances in LCNC programs and utilities that are not exactly well behaved when we talk about error resilience and recovery. But that would require a complete review and a well-defined strategy for how to cope.

@rmu75
Collaborator

rmu75 commented Jan 23, 2026

Is this stuff even thread safe? It looks like it uses emcsched.cc, and addProgram there just uses push_back on a std::list... did I miss a lock? Seems incorrect.

@smoe
Collaborator Author

smoe commented Jan 23, 2026

I like where this cppcheck warning has brought us. pthread_detach should mark the pthread_t so that its resources can be reclaimed once the thread returns, but correct and direct me.

At one of the very last lines, just prior to the invocation of sockMain, another pthread_create is performed. That thread should be detached, too, but I am waiting for clarification on where this pthread_detach should go.

@rmu75, I also just skimmed through emcsched.cc and emcsched.hh - the queueStatus is an enum that may indeed immediately benefit from some thread safety, as in, testing for a condition and changing the status should not be interruptible. And addProgram changes the queue and then triggers a queue sort, which may come as a disturbing inconvenience to anyone iterating through the queue at the same time. Should we have this as a separate issue?

@BsAtHome
Contributor

Is this stuff even thread safe? ... Seems incorrect.

You are right! It is incorrect and not thread safe. There are probably many more problems with thread safety. These would not have been seen because the "normal" use is to attach only one remote client to send stuff.

Seen in this light, the choice is to limit to one client only (or use a select/poll construct to serialize access) or to review and fix the entire code path to be thread safe.

@BsAtHome
Contributor

I like where this cppcheck warning has brought us. pthread_detach should mark the pthread_t so that its resources can be reclaimed once the thread returns, but correct and direct me.

A detached thread will have its resources purged when it exits. That means that the return value cannot be retrieved, for which you would have to use a join. In this case, it doesn't matter because no return value is used. Therefore, a simple detach should suffice.

At one of the very last lines, just prior to the invocation of sockMain, another pthread_create is performed. That thread should be detached, too, but I am waiting for clarification on where this pthread_detach should go.

That doesn't matter. That specific thread is the "queue runner" and there is only one during the lifetime of the program. Any resources associated with it will automatically die when the program dies.

@smoe
Collaborator Author

smoe commented Jan 23, 2026

We have two separate threads already. One is started with `res = pthread_create(&updateThread, NULL, checkQueue, (void *)NULL);` and the other from within the sockMain we are editing. And both threads are changing the queue, one via parseCommand -> commandSet -> ... -> addProgram, the second via `q.remove(q.front());`.

@smoe
Collaborator Author

smoe commented Jan 23, 2026

That doesn't matter. That specific thread is the "queue runner" and there is only one during the lifetime of the program. Any resources associated with it will automatically die when the program dies.

Did it anyway; it may eventually avoid false-positive memory-leak reports in valgrind.

@rmu75
Collaborator

rmu75 commented Jan 23, 2026

I think the likelihood that concurrency problems are triggered in real applications of schedrmt is very low, and in this case it would be acceptable to just document the shortcomings. Introducing a global mutex that is held while processing data from read() should be possible and relatively easy. As no significant amounts of data are moved, that should have no performance implications.

The proper way to fix this amounts to a rewrite with boost::asio or equivalent (which would also get rid of the threads, btw), or a port to Python.

@BsAtHome
Contributor

With a global mutex, I think only invocation of parseCommand() in schedrmt.cc:1177 and updateQueue() in schedrmt.cc:1141 need to be protected.

@rmu75
Collaborator

rmu75 commented Jan 23, 2026

with something like this

```cpp
#include <pthread.h>

// RAII guard around a single global mutex shared by all sessions.
class MutexGuard {
    static pthread_mutex_t mutex_;
public:
    MutexGuard() { while (pthread_mutex_lock(&mutex_)) /* sleep? exit? */; }
    ~MutexGuard() { pthread_mutex_unlock(&mutex_); }
};

pthread_mutex_t MutexGuard::mutex_ = PTHREAD_MUTEX_INITIALIZER;
```

it should only be a matter of creating a local MutexGuard object at the start of the functions to be mutexed. Don't pthread_exit, though.
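
For example, using it roughly like this (the signature is only illustrative):

```cpp
// Using the MutexGuard above; the function signature is only illustrative.
static int parseCommand(char *buf)
{
    MutexGuard guard;   // lock held for the whole call, released on every return path
    // ... existing command parsing and queue manipulation ...
    (void)buf;
    return 0;
}
```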

@BsAtHome
Contributor

There are no thread-die-exit-or-else constructs afaics. So just mutexing the two functions should handle the problem.

Then, thisQuit() is called from a signal handler and uses exit(0). However, you should use _exit(0) when called from a signal handler.
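
The async-signal-safe pattern, as a sketch with a made-up handler name (not the actual thisQuit()):

```cpp
#include <csignal>
#include <unistd.h>

// Hypothetical handler name, not the actual thisQuit(); the point is that
// _exit() is async-signal-safe while exit() (atexit handlers, stdio flushing) is not.
static void onQuitSignal(int)
{
    _exit(0);
}

int main()
{
    signal(SIGTERM, onQuitSignal);
    for (;;)
        pause();   // normal (non-signal) shutdown paths may still use exit()
}
```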

@smoe smoe force-pushed the emc_race_and_cpcheck_warnings branch 2 times, most recently from e95461e to fc012ca on January 25, 2026 14:58
@BsAtHome
Contributor

@smoe, did you see my review comments?

@smoe smoe force-pushed the emc_race_and_cpcheck_warnings branch from fc012ca to b820de0 on January 25, 2026 15:43
@smoe
Collaborator Author

smoe commented Jan 25, 2026

@smoe, did you see my review comments?

Well, yes and no; I had not mentally parsed the exit -> _exit one. This is all a bit beyond my daily routines - what is the signal meant to stop? I was wrong to remove the exit from the implementation of shutdown, from how I interpret this all now, but what flavour of exit is appropriate when the shutdown code also frees everything?

@smoe
Collaborator Author

smoe commented Jan 26, 2026

With a global mutex, I think only invocation of parseCommand() in schedrmt.cc:1177 and updateQueue() in schedrmt.cc:1141 need to be protected.

Don't we need to expand any such global mutex also to emcsched.cc, etc?

@rmu75
Collaborator

rmu75 commented Jan 26, 2026

I would keep the mutexes in the main loop of the threads. Blocking read returns data -> acquire mutex -> do stuff -> release mutex -> go back to blocking read of socket.
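
As a sketch of that loop shape, reusing the MutexGuard class from above (names are illustrative):

```cpp
#include <unistd.h>

// Illustrative per-session thread body, using the MutexGuard from the earlier comment.
static void *readClient(void *arg)
{
    int fd = *(int *)arg;
    char buf[1600];
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf) - 1);   // blocking read
        if (len <= 0)
            break;                                      // client gone, or error
        buf[len] = '\0';
        {
            MutexGuard guard;     // serialize access to the shared queue/state
            // parseCommand(buf); // "do stuff" while holding the lock
        }
        // lock released here; loop back to the blocking read
    }
    return NULL;
}
```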

@smoe
Collaborator Author

smoe commented Jan 26, 2026

@rmu75, @BsAtHome, please kindly help me with a patch.

@smoe
Collaborator Author

smoe commented Jan 26, 2026

With a global mutex, I think only invocation of parseCommand() in schedrmt.cc:1177 and updateQueue() in schedrmt.cc:1141 need to be protected.

Don't we need to expand any such global mutex also to emcsched.cc, etc?

I can answer that question myself now - no, we don't, since schedrmt is an application on its own, so there is no other route to invoke the same lower-level routines. Sorry, this seems rather obvious, but somehow I needed that.

@smoe
Collaborator Author

smoe commented Feb 1, 2026

Do you have ideas for including (some of) your tests as part of the CI?

@BsAtHome
Contributor

BsAtHome commented Feb 1, 2026

Do you have ideas for including (some of) your tests as part of the CI?

Not yet, but yes, they need to be added.

@BsAtHome
Contributor

BsAtHome commented Feb 5, 2026

An update on the linuxcncrsh rewrite: Most code now just works. I've also re-enabled the test in test/linuxcncrsh and fixed a lot of errors in the test itself.

The idea that you can make linuxcncrsh single threaded is not supported by the code. The problem is that NML will stall the entire process when you run synchronously (WAIT_MODE DONE). That can only be solved by having the session (i.e. a connected socket) run in its own thread.

The next thing is that linuxcncrsh mostly is single threaded because it uses an enableConn variable tracking which session is allowed to talk to NML (which, of course, currently has a race condition). That means that only one session can talk to the machine at any time. That session is the one which successfully executed SET ENABLE ....

Now, a real question is whether it makes sense, at all, to have multiple sessions to linuxcncrsh that can control a single machine. There may be scenarios, but I'd need someone to tell me which scenario that would be. However, if so, then this probably needs to be multiplexed onto one single NML connection, but that is speculation on my part at this time.
It would be easier to have only one session controlling the machine and then, if the interface allows it, allow other sessions only to observe a few things. Or maybe you want to be able to kick one (stuck) session from another session (but that may lead to other problems).

BTW, there is no guard against instantiating linuxcncrsh multiple times on different ports and connecting a session from each instance. I have no idea how that may lead to catastrophic machine pulp, broken iron, lost limbs and other unfortunate faults.

How interactions between multiple sessions are to be handled needs to be discussed.

FWIW, I'm currently stuck on issuing SET JOG_STOP. It apparently aborts the entire test process and I have no clue why.

@smoe
Collaborator Author

smoe commented Feb 5, 2026

To help Andy a bit with his decision making towards accepting your work:

  • Is anything working less reliably than in the prior implementation? You mention SET JOG_STOP as your end boss.
  • To be on the safe side when a second access is requested: A warning that linuxcncrsh is already in use and terminate?

And .. maybe https://linuxcnc.org/docs/html/man/man1/linuxcncrsh.1.html is not meant to be ultimately safe in the first place? It can be started in parallel to a GUI, so you can already unknowingly give contradictory (limb-cutting) commands to whatever the machine is doing. It is a bit funny for us to care about thread-safety when that same thread-safe tool may indeed ruin any real-live thread (and limb) with ease when interfering with a running program?!

@BsAtHome
Contributor

BsAtHome commented Feb 5, 2026

Is anything working less reliably than in the prior implementation?

No, it is working much better now :-)
Well, it is not yet finished entirely, but the framework is there and most commands perform as they should. A lot of extra checks have also been implemented to tell the user about invalid input.

It is just that I've been digging and discovered all these issues as I went along. Many I could solve without discussion, but some need discussion and this is the time to ask some of those questions and share reflections.

The test in tests/linuxcncrsh-tcp works just fine. I am working on getting the test in tests/linuxcncrsh to work because it is much more comprehensive (currently disabled in master).

You mention SET JOG_STOP as your end boss.

Yeah, that is a very odd situation. It appears as if something in LCNC terminates when the command is issued and that wraps it all up. Need to do some more digging.

It reminds me a bit of the failure of multiple spindles I was tracing. It took me a whole day to realize that the wrong hal-file was referenced in the INI. And the INI has the SPINDLES variable in the wrong place. Then, homing was again a problem. It needs a small sleep pause before issuing the next home command. Otherwise the homing process is still running and you get an error.

To be on the safe side when a second access is requested: A warning that linuxcncrsh is already in use and terminate?

I'd not have it terminate immediately. There is an argument for being able to kill a session from another session. Whether that can be done safely and with valid machine-state needs to be determined. There already is a limited set of available commands when not in "enable" mode. This can be useful for more things.

And .. maybe linuxcncrsh is not meant to be ultimately safe in the first place? It can be started in parallel to a GUI [snip]

Indeed, you have a very valid point. OTOH, it would be nice to prevent some obvious risk or at least warn about it. You can rebuild a crashed machine. Regrowing limbs is not (yet) common practice ;-)

@rmu75
Collaborator

rmu75 commented Feb 5, 2026

Now, a real question is whether it makes sense, at all, to have multiple sessions to linuxcncrsh that can control a single machine. There may be scenarios, but I'd need someone to tell me which scenario that would be. However, if so, then this probably needs to be multiplexed onto one single NML connection, but that is speculation on my part at this time.
It would be easier to have only one session controlling the machine and then, if the interface allows it, allow other sessions only to observe a few things. Or maybe you want to be able to kick one (stuck) session from another session (but that may lead to other problems).

It can make sense in case you have multiple HMIs / operator stations. The only real problem is that error messages and user messages can only be received by one endpoint, and if you connect multiple GUIs it is more or less non-deterministic which GUI receives a particular message. The status, OTOH, is polled, and while in theory there could be races where one GUI receives the status another GUI polled, that is harmless in my experience. I wrote a debug-tool GUI that polls at screen refresh rate, and it is no problem to run that in parallel to some "proper" GUI; in fact, I use it also as a kind of DRO on a second screen on a mill. In a sense, multiple sessions / GUIs / operator stations are like attaching a pendant that allows jogging and "cycle run".

On a real machine, all safety-relevant stuff has to be external to the linuxcnc control anyways. Also, linuxcncrsh is something that has to be explicitly enabled / started.

@rmu75
Collaborator

rmu75 commented Feb 5, 2026

multiplexing multiple sessions onto one NML session is probably a bad idea.

@BsAtHome
Contributor

BsAtHome commented Feb 5, 2026

multiplexing multiple sessions onto one NML session is probably a bad idea.

Ah, but then shcom.{cc,hh} needs to become a proper class that can be instantiated for every connection to encapsulate separate NML sessions within the same process. That is no problem, I already had a feeling it would be necessary. There is also other work in those files because I detected static buffers that need to go anyway.

When that is done, then there should be no problem in accepting multiple sessions.
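
Purely as a sketch of that direction (not the current shcom API; only the libnml channel class names are real):

```cpp
// Forward declarations of the existing libnml channel classes.
class RCS_CMD_CHANNEL;
class RCS_STAT_CHANNEL;
class NML;

// Sketch only: one instance per connected session, instead of the
// file-scope globals currently in shcom.{cc,hh}.
class ShcomSession {
public:
    explicit ShcomSession(const char *nmlfile) : nmlfile_(nmlfile) {
        // open emcCommand (write), emcStatus (read) and emcError (read)
        // for this session here, the way shcom.cc opens its globals today
    }

    // per-session state that is currently static in shcom.{cc,hh}
    // (serial numbers, text buffers, wait/timeout settings, ...) lives here

private:
    const char       *nmlfile_;
    RCS_CMD_CHANNEL  *command_ = nullptr;  // replaces the shared command channel
    RCS_STAT_CHANNEL *status_  = nullptr;  // replaces the shared status channel
    NML              *error_   = nullptr;  // replaces the shared error channel
};
```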

@rmu75
Collaborator

rmu75 commented Feb 5, 2026

@BsAtHome
Contributor

BsAtHome commented Feb 5, 2026

see https://github.com/rmu75/cockpit/blob/main/src/shcom.cpp and https://github.com/rmu75/cockpit/blob/main/include/shcom.hh
I also have variants that talk zeroMQ.

Thanks, that is a good start. I'll have a deeper look and see what I can "steal" ;-)
There are a few things that need refactoring, like getting rid of the shared char buffer(s) that kill threaded applications. But I see that a lot of your changes fit with my thoughts.

@rmu75
Collaborator

rmu75 commented Feb 6, 2026

You should probably connect to user and error channels only once. Or not connect at all -- it won't work correctly if there is a real GUI in parallel.

@BsAtHome
Contributor

BsAtHome commented Feb 6, 2026

There are three NML channels: emcCommand(W), emcStatus(R) and emcError(R). All are connected to namespace xemc.

Which of these is the 'user' channel?

The problem is that if two processes send a command on emcCommand, then the first reader of the emcError will retrieve any current error. Interestingly, the emcStatus channel is only peek'ed at and never actually read.

I am a bit baffled how this is supposed to work at all...

@rmu75
Collaborator

rmu75 commented Feb 6, 2026

It has been some time since I last looked at that stuff; it seems I misremembered about a "user" channel.

I didn't want to look too closely, there are too many things that induce headaches. Like https://github.com/LinuxCNC/linuxcnc/blob/master/src/libnml/nml/nml.cc#L1260 and https://github.com/LinuxCNC/linuxcnc/blob/master/src/libnml/nml/nml.cc#L1281. Maybe there is some subtle stuff I miss, but to me it looks like a copy/paste error. I assume that every code path that is not used by linuxcnc contains some subtle bugs (like messages containing long values truncating those to 32 bits, or the really strange length calculations in the ascii and display updaters).

@BsAtHome
Contributor

BsAtHome commented Feb 6, 2026

Yeah, I tried going down that rabbit hole...

At least I can explain the peek() in the status channel. It is because the nml-file shows the buffer as not queued. Both emcCommand and emcError are queued buffers, hence the write/read.

Unfortunately, I've not seen an unread(). Anyway, multiple processes may interfere with each other on the basis of the serial number. It seems that a lot of the synchronization between the write command and the read status is based on the serial number. No one is checking the serial number of the error, so I don't even know whether any error is correlated with the last command.

@andypugh
Collaborator

andypugh commented Feb 6, 2026

the first reader of the emcError will retrieve any current error.
...
I am a bit baffled how this is supposed to work at all...

This bit has been known to be a problem for decades. The first thing to "look" gets the error status, and might or might not inform the user.

@BsAtHome
Contributor

BsAtHome commented Feb 6, 2026

the first reader of the emcError will retrieve any current error.
I am a bit baffled how this is supposed to work at all...

This bit has been known to be a problem for decades. The first thing to "look" gets the error status, and might or might not inform the user.

Ok then. My suspicions confirmed.

The real problem is that there are separate uncorrelated queued read and write channels that share the backing store.

Analysis: The emc.emcCommand(R) (emctaskmain.cc) receives the commands from any of emcsrv.emcCommand(W) or xemc.emcCommand(W) (for the default nml file). However, it cannot reply to the appropriate ???.emcError channel because these are shared for both emcsrv and xemc.

To fix this, you need a back channel for emcError that matches the command connection. This is not a scalable solution because you need to create complex nml configs with an error buffer for each connection and also track that in the task handling, which is a near impossibility. The changes required are not worth the effort.

The way it looks is that NML is inappropriate for handling this particular case. For this you'd want a socket connection and use sendmsg()/recvmsg(). That means zeroMQ would be much better than NML for handling the communication, including status, because it makes a simple bidirectional connection.
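
Just to illustrate the bidirectional point, a minimal cppzmq request/reply sketch (endpoint and message text made up, not a concrete protocol proposal):

```cpp
#include <zmq.hpp>
#include <iostream>
#include <string>

// Sketch of the "simple bidirectional connection" idea with cppzmq (REQ/REP).
int main()
{
    zmq::context_t ctx(1);
    zmq::socket_t cmd(ctx, zmq::socket_type::req);
    cmd.connect("tcp://localhost:5027");   // made-up endpoint

    const std::string request = "set mode mdi";
    cmd.send(zmq::buffer(request.data(), request.size()), zmq::send_flags::none);

    zmq::message_t reply;
    (void)cmd.recv(reply, zmq::recv_flags::none);   // the reply belongs to this request
    std::cout << std::string(static_cast<const char *>(reply.data()), reply.size())
              << std::endl;
}
```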

@rmu75 you said something about files using zeroMQ. Have you hacked a LCNC version, ripped out NML and replaced it with zeroMQ?

@rmu75
Collaborator

rmu75 commented Feb 6, 2026

@rmu75 you said something about files using zeroMQ. Have you hacked a LCNC version, ripped out NML and replaced it with zeroMQ?

I didn't rip out anything, but added all nml messages as flatbuf structures and played around a bit stuffing those into zeromq sockets. See https://github.com/rmu75/linuxcnc/tree/rs/zmq-experiments and https://github.com/rmu75/cockpit/tree/rs/zmq.

It is just a hack, nothing is integrated into the build system, and generated files are checked in. If you want to try that, you will need to install flatbuffer and zeromq dev libs and invoke flatc by hand so generated stuff matches your installed flatbuf libs. flatc needs "--gen-object-api" IIRC.

I did look at machinekit and their machinetalk stuff (google protobuf and zeromq), and I wondered why they didn't go the full length and rip out NML but instead built this "bridge". Turned out it is not that hard to feed an alternate message format into task. I didn't like protobuf though; flatbuf is more efficient and doesn't suffer from a problematic compatibility story and a 2-to-3 version transition reminiscent of Python's.

IIRC, in my zeromq implementation, if you connect to the error "channel", you receive error messages. Also, status updates are not "peeked"; messages are sent to each connection of the status socket.

@smoe
Collaborator Author

smoe commented Feb 8, 2026

The ROS community has now adopted Zenoh (https://docs.ros.org/en/kilted/Installation/RMW-Implementations/Non-DDS-Implementations/Working-with-Zenoh.html). It is open source and faster than ZMQ - and it could possibly help build bridges to the ROS community.

@rmu75
Collaborator

rmu75 commented Feb 8, 2026 via email

@BsAtHome
Contributor

BsAtHome commented Feb 8, 2026

If I am reading the docs correctly, then Zenoh seems to be a key/value based system that looks a bit like MQTT.
[ninja'ed :-)]

The part that is bothering me is that I do not know how to map the NML functionality onto it. This is a very non-trivial problem because we have both queued (commands) and non-queued (status) communication. The odd one is the NML error channel, which is also queued but should be mapped one-to-many, though only up to a point.

@rmu75
Collaborator

rmu75 commented Feb 8, 2026

We need some kind of remote procedure call (maybe with useful return values, which don't exist now) and a channel to report status and errors.

Error messages could be put into an LRU-like structure that keeps the last 10 or some other number of error messages.
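
A minimal sketch of that idea (not existing LCNC code):

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// Sketch of the "keep the last N error messages" idea.
class ErrorLog {
public:
    explicit ErrorLog(std::size_t limit = 10) : limit_(limit) {}

    void push(const std::string &msg) {
        std::lock_guard<std::mutex> lk(mtx_);
        if (log_.size() == limit_)
            log_.pop_front();          // drop the oldest entry
        log_.push_back(msg);
    }

    std::deque<std::string> snapshot() const {
        std::lock_guard<std::mutex> lk(mtx_);
        return log_;                   // copy, so every client sees the same recent history
    }

private:
    mutable std::mutex mtx_;
    std::deque<std::string> log_;
    std::size_t limit_;
};
```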

Status could in principle be published into a key/value structure like MQTT, but I think by default we need something that can deal with 1 kHz or even 4 kHz of updates.

https://zenoh.io/blog/2023-03-21-zenoh-vs-mqtt-kafka-dds/

@smoe
Collaborator Author

smoe commented Feb 8, 2026

Status could in principle be published into a key/value structure like MQTT, but I think by default we need something that can deal with 1 kHz or even 4 kHz of updates.

https://zenoh.io/blog/2023-03-21-zenoh-vs-mqtt-kafka-dds/

4 kHz - that is 250 µs between updates. That suggests looking for the option with minimal latency. MQTT and all the other contestants are already distributed, if I get this right, so some external error monitoring would be supported?

@BsAtHome
Contributor

BsAtHome commented Feb 8, 2026

The fact that the realtime thread runs this fast does not necessarily mean that the non-realtime system must be updated this fast. Actually, when you look at the non-realtime side, like GUIs, they only update about 10 times per second. For example, [DISPLAY]CYCLE_TIME is set to 0.1 seconds (sim's axis.ini).

Another thing to remember is that status updates can be done in an incremental fashion. But, for the sake of argument, let a status update happen at 4 kHz with a 1400-byte payload (about one Ethernet frame, excl. overhead) and let it be handled by a network connection (very unlikely). That would mean we're sending ~5.3 MiB/s of data. Well, my internet connection is 4 times faster and I manage to saturate it with little CPU effort :-)

What is important is data sequence and correlation. Every client (non-realtime) needs to be able to track the server (realtime) and not get confused because multiple clients are acting on the same connection (like we have with the error channel). That is the big challenge to solve so that we have consistency, predictability and repeatability for all clients.

@rmu75
Collaborator

rmu75 commented Feb 8, 2026

4kHz is probably on the extreme end, but it could also cover cases where you want to synchronise multiple linuxcnc instances or implement remote access to hal.

@rmu75
Collaborator

rmu75 commented Feb 8, 2026

What is important is data sequence and correlation. Every client (non-realtime) needs to be able to track the server (realtime) and not get confused because multiple clients are acting on the same connection (like we have with the error channel). That is the big challenge to solve so that we have consistency, predictability and repeatability for all clients.

I don't understand what you mean with this. Now NML just broadcasts a command into emctask. No result is communicated back except for changes in status. IMO it can remain that way.

Status can also be more or less broadcast to all interested parties. Error or user messages can be put in a list. Error messages only have informal character and don't need to be acked per se. If something really goes wrong then the machine state will reflect that.

So in principle, we can treat the communication channels as a room where every client can shout in commands through a window and observe what is happening; who is shouting isn't interesting, and state has to be handled by emctask anyway. It might make sense to limit acceptable commands depending on the channel.

@smoe
Collaborator Author

smoe commented Feb 8, 2026

What is important is data sequence and correlation. Every client (non-realtime) needs to be able to track the server (realtime) and not get confused because multiple clients are acting on the same connection (like we have with the error channel). That is the big challenge to solve so that we have consistency, predictability and repeatability for all clients.

I don't understand what you mean with this.

I got this as "among all the many independently created sensor or position or whatever status updates you need to know at what moment what error kicked in and who triggered it".

Now NML just broadcasts a command into emctask. No result is communicated back except for changes in status. IMO it can remain that way.

Status can also be more or less broadcast to all interested parties. Error or user messages can be put in a list. Error messages only have informal character and don't need to be acked per se. If something really goes wrong then the machine state will reflect that.

There is the kind of error info that is printed to stdout/stderr. That is lost to the system. Such messages are both informal and informative :-) But lost.

So in principle, we can treat the communication channels as a room where every client can shout in commands through a window and observe what is happening; who is shouting isn't interesting, and state has to be handled by emctask anyway. It might make sense to limit acceptable commands depending on the channel.

It may be a constructive misunderstanding on my side, but once we have a bit of a more formal representation of errors, then those errors could also be interpreted by the controller, beyond some "-1" return value, and affect the system's further actions - and human monitoring alike. I think we should start out with a diagram of how NML is used today and ornament our documentation with it. And then think more about what we all want. My instant hunch is that we want to borrow more from what ROS2 is doing.

@rmu75
Collaborator

rmu75 commented Feb 9, 2026

It may be a constructive misunderstanding on my side, but once we have a bit of a more formal representation of errors, then those errors could also be interpreted by the controller, beyond some "-1" return value, and affect the system's further actions - and human monitoring alike. I think we should start out with a diagram of how NML is used today and ornament our documentation with it. And then think more about what we all want. My instant hunch is that we want to borrow more from what ROS2 is doing.

A typical error that is communicated via error channel is "probe move stopped without contact" or similar. That is a global situation that can and should be communicated to each operator station. But there is not much else to do there.

You only need to look at emctask.cc and emctaskmain.cc. In a nutshell, emccanon sends stuff from the interpreter as NML into task, and the other thing is the GUI that sends commands.

I would proceed step by step; a new API can be bolted on later, and for me it is a separate issue.

@BsAtHome
Contributor

BsAtHome commented Feb 9, 2026

What is important is data sequence and correlation. Every client (non-realtime) needs to be able to track the server (realtime) and not get confused because multiple clients are acting on the same connection (like we have with the error channel). That is the big challenge to solve so that we have consistency, predictability and repeatability for all clients.

I don't understand what you mean with this. Now NML just broadcasts a command into emctask. No result is communicated back except for changes in status. IMO it can remain that way.

No, NML does not broadcast. It consists of two queues (command and error) and one (virtually) shared memory region (status).

There is, at this time, no way to correlate the command issued with the error generated. Even worse, the error channel carries a mix of operator error, text and display messages. You have no way to relate that to what the machine was executing at the time the message was generated. And more disastrous, multiple clients reading the error queue will prevent consistency between readers because only the first-to-read will actually get the information.

When you connect two DROs that can display messages, then you really want to have them display the same thing, don't you?

Status can also be more or less broadcast to all interested parties. Error or user messages can be put in a list. Error messages only have informal character and don't need to be acked per se. If something really goes wrong then the machine state will reflect that.

Status is a non-queued broadcast. It is the virtually shared memory segment containing the machine's state.

Errors are not necessarily a list nor merely informal. These "error" messages may be broadcast to all clients but correlating this with the current status remains a problem because status is asynchronous from the queues. There needs to be more information. Maybe the status channel could contain some of the actual status, display and error messages and the "error" channel is a broadcast that "something important happened", which listeners can use to act upon. But there also needs to be (possibly historic) information on the error channel about which client/command caused what situation.

So in principle, we can treat the communication channels as a room where every client can shout in commands through a window and observe what is happening; who is shouting isn't interesting, and state has to be handled by emctask anyway. It might make sense to limit acceptable commands depending on the channel.

I disagree that clients/servers just (should) shout at each other. It must be a very coordinated communication that guarantees both consistency and repeatability.

I do agree that we might want to limit commands on some channel depending on the function/intent of the client.

@rmu75
Collaborator

rmu75 commented Feb 9, 2026

I don't understand what you mean with this. Now NML just broadcasts a command into emctask. No result is communicated back except for changes in status. IMO it can remain that way.

No, NML does not broadcast. It consists of two queues (command and error) and one (virtually) shared memory region (status).

Maybe I formulated that a bit too informally. The effect is the same: commands are fed into a big switch statement, and no distinction is made as to who sent the command.

There is, at this time, no way to correlate the command issued with the error generated. Even worse, the error channel carries a mix of operator error, text and display messages. You have no way to relate that to what the machine was executing at the time the message was generated.

That is a design issue of linuxcnc / emc task, not a comms interface issue.

And more disastrous, multiple clients reading the error queue will prevent consistency between readers because only the first-to-read will actually get the information.

This is a comms issue and can be fixed (but probably not with NML and acceptable dev effort).

When you connect two DROs that can display messages, then you really want to have them display the same thing, don't you?

yes absolutely.

@rmu75
Collaborator

rmu75 commented Feb 9, 2026

Errors are not necessarily a list nor merely informal. These "error" messages may be broadcast to all clients but correlating this with the current status remains a problem because status is asynchronous from the queues. There needs to be more information. Maybe the status channel could contain some of the actual status, display and error messages and the "error" channel is a broadcast that "something important happened", which listeners can use to act upon.

It could make sense to attach some sort of "state tag" to error messages created in the interpreter that contains information like filename, line number, etc., in a non-sprintf way.

But there also needs to be (possibly historic) information on the error channel about which client/command caused what situation.

Why? If something happens, the machine will go into either "Idle", "Off" or "ESTOP", depending on severity. Some message should indicate the cause, of course. I agree that if G-code or remapped code is involved, the file/line number could/should be indicated. Historic error information IMO needs to go into a separate component that logs problems; I wouldn't put that into emctask.

What would actually be helpful: a message that says "machine can't be enabled because cover X is not closed" or "machine can't be enabled because spindle coolant is out of safe temperature range" or "machine can't be enabled because vacuum pump motor protection tripped", but those things are usually in the HAL domain and would need customization. AFAIK there is currently no way to communicate such info from HAL to a GUI. Usually in these situations you just get an enable button that doesn't enable.
