Forward attribute not receiving change events properly.

Hi All,

Pre-requisite:
We have around 8 TANGO Java DS, say "DS-A, DS-B, DS-C, DS-D, DS-E, DS-F, DS-G & DS-H". Among them, DS-A has an attribute "x" which is the root for a forwarded attribute on each of the other 7 (DS-B, DS-C, DS-D, DS-E, DS-F, DS-G & DS-H) Tango Java DS. In other words, each of the 7 Tango Java DS has an attribute whose value comes from DS-A\x via change event.

Scenario 1:
Many times it is observed that the value of "x" on DS-A changes and creates a change event, however, the value does not get updated on other DS (DS-B, DS-C, DS-D, DS-E, DS-F, DS-G & DS-H). Attached are the screenshots which depict that the values are received sometimes & other times it gives an error.

Pain area: When the attribute (forwarded attribute of DS-A\x) on DS-B, DS-C, DS-D, DS-E, DS-F, DS-G & DS-H gives an error, the Taurus label attached to it does not update, thereby showing a misleading value.
Screenshot: 1) Forward attribute fluctuating_1 - For all these screenshots please refer to the boxes in red. (time & attribute (agn_gab_owner) value)
2) Forward attribute fluctuating_2
3) Forward attribute fluctuating_3
4) Forward attribute fluctuating_4

Scenario 2:
Many a time after restarting the TANGO DS (DS-B, DS-C, DS-D, DS-E, DS-F, DS-G & DS-H), randomly one of the DS gives an error. Screenshot attached. Currently, I am unable to understand why this happens. Correct me if I am wrong, but as per the error message, it appears that this error occurred in the "initDevice()" method of the Java TANGO DS.

I am suspecting that this is happening because of usage of the forward attribute (i.e. one root attribute having more than one forwarded attribute, in this case, there are 7 TANGO DS which should receive the change event), not sure though !!!

Pain area: The DS thereby does not perform its normal tasks properly.
Screenshot: 1) Tango DS Init failed:- In this screenshot, mnc/cmc/cpx (in the error portion) is the DS-A from the above pre-requisite.

Scenario 3:
Many times the Tango DS process or the Starter DS process goes into a defunct state.

Pain area: The process is still reachable, however, it does not perform the intended task. Therefore when we try to check if a DS is alive it returns "true" but actually it is not the real status.
Screenshot: 1) Tango DS process defunct - process id 3765 is a defunct process
2) Starter process defunct - In this screenshot, process id 27993 is a defunct process.

TANGO version: 9.2.2
OS: Ubuntu 16.04 LTS
Regards,
TCS_GMRT
Hi TCS GMRT Team,

Please let us know what is the problem that you are trying to solve by having a single attribute forwarded on 7 different devices. It may be possible to work out a different solution that removes the need to have so many forwarded attributes.

Regards,
Vatsal Trivedi
Hi,

I think your problem might be due to the current implementation of the forwarded attributes in JTango.
I think you might be facing the following bug:
https://github.com/tango-controls/JTango/issues/44

Your problem might be due to this:
the Java device server, when receiving an event from the root attribute, is ignoring the attribute value contained in the received event, and is doing a synchronous read of the root attribute before pushing an event with this value.

As I wrote in the JTango issue:
I think the Java device server should simply forward the attribute value it received in the event coming from the root attribute instead of doing a synchronous read of the root attribute in this case.

I personally consider this a critical bug and I would advise against using forwarded attributes in Java Tango device servers in the current situation.

Since you have 7 attributes pointing to the same root attribute, when there is a change of the root attribute, a change event is sent to the 7 other devices and when the event is received, all the devices are doing a synchronous read on the root attribute, more or less at the same time.
Maybe there are some cases where this is causing a problem.
Do you have polling enabled on your root attribute or are you pushing change events manually from the code for this root attribute?
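
To make the read amplification described above concrete, here is a small plain-Java simulation. It is only a sketch: it uses no Tango or JTango classes, and every name in it (ForwardingSimulation, RootAttribute, ForwardedAttribute, readsOnRoot) is made up for illustration. It compares a forwarded attribute that re-reads the root on every event with one that simply forwards the value carried by the event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

/** Plain-Java simulation; none of these types are JTango classes. */
final class ForwardingSimulation {

    /** Hypothetical stand-in for the root attribute "x" on DS-A. */
    static final class RootAttribute {
        private volatile int value;
        private final AtomicInteger synchronousReads = new AtomicInteger();
        private final List<IntConsumer> subscribers = new ArrayList<>();

        void subscribe(IntConsumer listener) {
            subscribers.add(listener);
        }

        /** Models a synchronous (network) read of the root attribute. */
        int read() {
            synchronousReads.incrementAndGet();
            return value;
        }

        /** Models a change event: the new value travels inside the event. */
        void change(int newValue) {
            value = newValue;
            for (IntConsumer subscriber : subscribers) {
                subscriber.accept(newValue);
            }
        }

        int synchronousReadCount() {
            return synchronousReads.get();
        }
    }

    /** Hypothetical stand-in for a forwarded attribute on DS-B ... DS-H. */
    static final class ForwardedAttribute {
        private int lastValue;

        ForwardedAttribute(RootAttribute root, boolean reReadOnEvent) {
            // reReadOnEvent = true mimics the behaviour described in the issue:
            // ignore the event payload and read the root again.
            root.subscribe(eventValue ->
                    lastValue = reReadOnEvent ? root.read() : eventValue);
        }

        int lastValue() {
            return lastValue;
        }
    }

    /** Counts the synchronous reads that hit the root after the given number of change events. */
    static int readsOnRoot(int forwardedCount, int changes, boolean reReadOnEvent) {
        RootAttribute root = new RootAttribute();
        for (int i = 0; i < forwardedCount; i++) {
            new ForwardedAttribute(root, reReadOnEvent);
        }
        for (int c = 1; c <= changes; c++) {
            root.change(c);
        }
        return root.synchronousReadCount();
    }

    public static void main(String[] args) {
        System.out.println("re-read on event:    " + readsOnRoot(7, 10, true));
        System.out.println("forward event value: " + readsOnRoot(7, 10, false));
    }
}
```

In this toy model, 7 forwarded attributes and 10 change events produce 70 extra reads on the root when each event triggers a re-read, and none when the event value is forwarded. Real device servers add network latency and timeouts on top of that, which the simulation does not attempt to model.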

It would be interesting to see the errors reported by ATK when you see the forwarded attributes in error (ATKPanel View Menu –> Error History).

About the other issues you are reporting (defunct processes), it is not clear to me what is happening from the details and the screenshots you provided.

How do you identify that the Starter or Java DS is in a defunct state?
The screenshots you provided do not seem to show that.
In the screenshots, ps says their state is "Sl", which means, according to ps documentation:
  • S: interruptible sleep (waiting for an event to complete)
  • l: is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)

In our control system, ps is returning exactly the same state (Sl) most of the time for our running Starters.
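
As a rough illustration of how the ps state column can be interpreted, the sketch below decodes a STAT string such as "Sl" and reports whether it denotes a defunct (zombie) process, which ps marks with a leading "Z". The class and method names (PsState, isDefunct, describe) are hypothetical; the descriptions are paraphrased from the ps documentation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that decodes the STAT column printed by ps on Linux. */
final class PsState {

    /** A process is defunct (zombie) only when its main state letter is 'Z'. */
    static boolean isDefunct(String stat) {
        return !stat.isEmpty() && stat.charAt(0) == 'Z';
    }

    /** Returns one human-readable description per character of the STAT string. */
    static List<String> describe(String stat) {
        List<String> parts = new ArrayList<>();
        for (char c : stat.toCharArray()) {
            switch (c) {
                case 'D': parts.add("uninterruptible sleep (usually IO)"); break;
                case 'R': parts.add("running or runnable"); break;
                case 'S': parts.add("interruptible sleep (waiting for an event to complete)"); break;
                case 'T': parts.add("stopped by job control signal"); break;
                case 'Z': parts.add("defunct (zombie) process"); break;
                case '<': parts.add("high-priority"); break;
                case 'N': parts.add("low-priority"); break;
                case 'L': parts.add("has pages locked into memory"); break;
                case 's': parts.add("is a session leader"); break;
                case 'l': parts.add("is multi-threaded"); break;
                case '+': parts.add("is in the foreground process group"); break;
                default:  parts.add("unrecognised flag '" + c + "'"); break;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String stat = args.length > 0 ? args[0] : "Sl";
        System.out.println(stat + " -> " + describe(stat) + ", defunct=" + isDefunct(stat));
    }
}
```

The STAT value for a given process can be obtained with, for example, `ps -o stat= -p <pid>` and passed to the helper as its argument.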

Moreover, you are writing the process is still reachable so it means it is not defunct!?
You are also writing:
"it does not perform the intended task"

What are the tasks it does not perform?

When you are restarting your 7 device servers, are you restarting them all at once at the same time?
If this is the case, could you try to restart them one after the other and see if this solves the timeout issue?

Hoping this helps a bit.
Reynald
Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new.
"Do you have polling enabled on your root attribute or are you pushing change events manually from the code for this root attribute?"
Yes, polling is enabled on the root attribute. We do not push change events manually on the root attribute.

"How do you identify that the Starter or Java DS is in a defunct state?"
Most of the time it is observed that when the TTY field has the value "?", the process is seen as running (via the ps command) but actually it is not.

"Moreover, you are writing the process is still reachable so it means it is not defunct!?"
When a process is checked using "ping" it returns "true", therefore the logic assumes that the DS is running. However, when we send a command to the DS, it throws an exception.

"When you are restarting your 7 device servers, are you restarting them all at once at the same time?"
All 7 DS are started in a loop that iterates over an array containing the details of the DS to be started.

"If this is the case, could you try to restart them one after the other and see if this solves the timeout issue?"
We tried adding a delay between iterations, however, that did not help in fixing the problem.
Regards,
TCS_GMRT
TCS_GMRT wrote:
"Most of the times it is observed when TTY field has a value "?" the process is seen as running (via ps command) but actually, it is not."

It all depends on how the process was started.
If the Starter is started as a service when your computer is booting (which is the recommended way), no TTY will be associated to the process. This is expected. All the device servers started by the Starter will behave the same way (no TTY will be associated to them) because of the way the Starter is starting the other device servers.
So I think your method to identify a defunct process is not correct.

You are writing your device servers are answering correctly to the ping command, which is a command available on the admin device.
This means your admin device is still working as expected.
In your case, it seems you have some situations where the admin device is answering correctly but other devices in the same device server instance are no longer answering as expected. This can happen if there is a deadlock somewhere in the code of this device class, for instance.
What kind of error do you get in this case when you execute a command/read or write an attribute on these "defunct" devices?
Is it just a timeout error?
If this is the case, could you try to execute the command/read or write an attribute with a much longer timeout on your client side just to see whether your problem is not due to the fact that some commands on your devices/attributes are taking too long to be executed/read/written?
For instance, with the Device Panel (Test Device from jive), you can set this client timeout from the admin tab.
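
The distinction drawn above (ping answered by the admin device, real requests not answered by the device itself) could be built into the monitoring logic roughly as follows. This is only a sketch in plain Java: the two Callable arguments are stand-ins for whatever Tango client calls you actually use (for example a ping, and a read of an attribute on the suspect device), and the names DeviceHealthCheck, Health and check are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical health check that does not trust ping alone. */
final class DeviceHealthCheck {

    enum Health { UNREACHABLE, PING_ONLY, RESPONSIVE }

    /**
     * ping:     stand-in for the ping call (answered by the admin device).
     * realCall: stand-in for a real request on the device, e.g. an attribute read.
     */
    static Health check(Callable<?> ping, Callable<?> realCall, long timeoutMillis) {
        // Daemon threads so that a hung call cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        });
        try {
            if (!completesInTime(pool, ping, timeoutMillis)) {
                return Health.UNREACHABLE;
            }
            return completesInTime(pool, realCall, timeoutMillis)
                    ? Health.RESPONSIVE
                    : Health.PING_ONLY;
        } finally {
            pool.shutdownNow();
        }
    }

    /** True only if the call returns normally within the timeout. */
    private static boolean completesInTime(ExecutorService pool, Callable<?> call, long timeoutMillis) {
        Future<?> future = pool.submit(call);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        }
    }
}
```

Running the check twice, once with the usual timeout and once with a much longer one, would show whether a PING_ONLY result is a genuine hang (such as a deadlock) or simply a slow command.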

Kind regards,
Reynald
On similar lines today I have the following scenario:

The DS "MNC/CMC/CPX" is not reachable over ping. However, Jive reports something else (attached).
I am still investigating at my end how the DS stopped, although we did not ask the DS to stop.
Regards,
TCS_GMRT
Dear TCS_GMRT,

I conclude that jive says the admin device is not reachable because the information about polling is obtained via the admin device. This is coherent with the ping not responding. The device server is not accessible. I do not know the reason either. Have you looked at the server with some Java debugging tools?

Andy
 