Event loss issue

Dear all,

I'm experiencing problems with the event system in my application and I would like to ask for your suggestions on how to debug and understand the error.
I have 3 C++ device servers with some attributes (created dynamically at device startup with Yat) and a C++ device client subscribing to CHANGE/USER/PERIODIC events from those attributes (about 20 attributes in total) with a single callback. The maximum polling period for those attributes is 1000 ms. The client has a user thread where events are processed (i.e. event data received in the callback are passed to the worker thread).
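
To make the hand-off between the callback and the worker thread concrete, here is a minimal sketch in plain standard C++. It does not use the Tango API: EventRecord and EventQueue are hypothetical names, and a real callback would fill the record from the Tango event data before calling push(). The point is only that the callback copies the data and returns quickly, while all processing happens in the worker.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the data extracted from one received event.
struct EventRecord {
    std::string attr_name;
    double value;
};

// Thread-safe FIFO shared by the event callback and the worker thread.
class EventQueue {
public:
    // Called from the event callback: store the record and return at once.
    void push(EventRecord rec) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(rec));
        }
        cond_.notify_one();
    }

    // Called from the worker thread: blocks until a record is available
    // or close() was called. Returns false once closed and drained.
    bool pop(EventRecord& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) {
            return false;
        }
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }

    // Wakes up the worker so it can finish after draining the queue.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cond_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<EventRecord> queue_;
    bool closed_ = false;
};
```

The worker thread would then simply loop on pop() until it returns false. Since the callback never waits on the worker, a slow processing step cannot block the thread that delivers the events.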

The problem I'm facing is that after the client has run for a while some of the events are not received anymore. I suspect that I have some deadlock (+memory issue?).

One of the problems in the client, if I'm not wrong, can be attributed to code that pushes events manually from a user thread.
In fact, I occasionally saw these error messages in the client:


Tango exception   
Severity = ERROR   
Error reason = API_CommandTimedOut   
Desc : Not able to acquire serialization (dev, class or process) monitor   
Origin : TangoMonitor::get_monitor

and also:


Tango::ZmqEventConsumer::push_heartbeat_event() timeout on channel monitor of (dserver address)

and these (apparently) disappeared when I disabled event pushing in the user threads (i.e. left only event generation from the polling thread).
Could you give me more details on these Tango core messages?

However, the problem of events no longer arriving did not disappear.
I understand that it is very difficult to help without code sketches or additional details, so I would first like to ask whether you have ever experienced the same (Tango version 9.2.5a, zmq v4.2.0, omni v4.2.1) and how to debug the problem. Is there any flag (e.g. related to zmq) I can turn on to explore the issue in detail both in the servers and in the client? Could it be a zmq issue (e.g. some queue full)?
Right now I'm preparing a very simple device client that just subscribes to the same events and monitors the received event rate to understand more, but if you have hints that would help me a lot.
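
The event-rate monitoring mentioned above could look roughly like the following sketch. It is plain standard C++ with no Tango calls: EventRateMonitor is a hypothetical helper, and record() is assumed to be called from the event callback with the attribute name. It counts the events per attribute and reports the attributes whose last event is older than a chosen gap (for example a few polling periods).

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical helper: per-attribute event counter with staleness check.
class EventRateMonitor {
public:
    using Clock = std::chrono::steady_clock;

    // Call once per received event (e.g. from the event callback).
    void record(const std::string& attr,
                Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> lock(mutex_);
        Stats& s = stats_[attr];
        ++s.count;
        s.last_seen = now;
    }

    // Number of events seen so far for one attribute (0 if never seen).
    unsigned long count(const std::string& attr) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = stats_.find(attr);
        return it == stats_.end() ? 0UL : it->second.count;
    }

    // Attributes whose last event is older than max_gap at time `now`.
    std::vector<std::string> stale(Clock::duration max_gap,
                                   Clock::time_point now = Clock::now()) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<std::string> out;
        for (const auto& kv : stats_) {
            if (now - kv.second.last_seen > max_gap) {
                out.push_back(kv.first);
            }
        }
        return out;
    }

private:
    struct Stats {
        unsigned long count = 0;
        Clock::time_point last_seen;
    };

    mutable std::mutex mutex_;
    std::map<std::string, Stats> stats_;
};
```

Calling stale() periodically from a separate thread and logging the result with a timestamp would show which attributes stop first and when, which can then be compared with the server logs.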

Thanks for your support,

Simone





****************************************************************
Simone Riggi
INAF, Osservatorio Astrofisico di Catania
Via S. Sofia 78
95123, Catania - Italy
phone: +39 095 7332 extension 282
e-mail: simone.riggi@gmail.com,
sriggi@oact.inaf.it
skype: simone.riggi
****************************************************************
Dear all,

I did some further tests and produced a simple example to reproduce the issue (see attachment).
Basically there are 3 device servers with a few attributes and a client that just subscribes to the server attributes.
To run the example, configure the client (EventListener device) properties (rx_proxy_name, spf_proxy_name, ds_proxy_name) with the device server names (as configured on your host) and run the four servers.

In the Tango configuration of my laptop (Ubuntu 16.04, Tango v9.2.5a, omniORB v4.2.1, zmq v4.2.0), after a while (a few minutes) I no longer receive the events with the smallest polling period (1000 ms).
I ran the same test on another Tango host (Scientific Linux 6, Tango v9.2.2, zmq v4.2.1) and did not see the issue, so there must be something wrong in my laptop configuration or a very trivial error on my side. Any hints or suggestions on how to find what's wrong? Could it be related to zmq (e.g. PUB events dropped, or a bug)?

If you have time could you run the same test on your Tango host (latest Tango release) to confirm if you see the same problem?

Many thanks for your help,

Simone
Dear Simone,

I will try your code today and let you know if I get the same results. It will be interesting to know if someone else observes the same.

Cheers

Andy
Hi guys,
we sometimes face the reported exception as well. As you know, it is caused by contention on the Tango monitor. This may be caused by too many clients (e.g. the polling thread(s) and additional client(s) reading any non-polled attributes) accessing the device, or by a polling period that is too short with respect to the method execution time and the number of attributes.
Moreover, my feeling is that even the push-event-by-code approach acquires some locks in the TANGO core… but this can be better confirmed by someone actively working on it.
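
To illustrate the kind of contention I mean, here is a toy sketch in plain standard C++. It is not the Tango implementation: ToyMonitor and second_caller_gets_in are hypothetical names, and the monitor is just a std::timed_mutex with an acquisition timeout. One thread holds the monitor (think of a slow attribute read) while a second caller (think of a thread pushing an event by code) tries to enter and gives up after the timeout.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a device-level serialization monitor.
class ToyMonitor {
public:
    explicit ToyMonitor(std::chrono::milliseconds timeout)
        : timeout_(timeout) {}

    // Runs `work` while holding the monitor. Returns false if the
    // monitor could not be acquired within the timeout.
    template <typename Work>
    bool run_exclusive(Work work) {
        std::unique_lock<std::timed_mutex> lock(mutex_, std::defer_lock);
        if (!lock.try_lock_for(timeout_)) {
            return false;  // analogous to a "not able to acquire" error
        }
        work();
        return true;
    }

private:
    std::timed_mutex mutex_;
    std::chrono::milliseconds timeout_;
};

// One thread holds the monitor for `hold`; a second caller then tries
// to enter with the given timeout. Returns whether the second got in.
bool second_caller_gets_in(std::chrono::milliseconds hold,
                           std::chrono::milliseconds timeout) {
    ToyMonitor monitor(timeout);
    std::promise<void> holder_inside;
    std::thread holder([&] {
        monitor.run_exclusive([&] {
            holder_inside.set_value();          // signal: monitor is taken
            std::this_thread::sleep_for(hold);  // simulate slow work
        });
    });
    holder_inside.get_future().wait();  // wait until the holder owns it
    bool ok = monitor.run_exclusive([] {});
    holder.join();
    return ok;
}
```

In this toy model the second caller fails whenever the holder keeps the monitor longer than the timeout, which matches the pattern above: work done under the lock that lasts too long compared with how often other threads need the same lock.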
Cheers,
Lorenzo
Simone,
I have your servers running on my laptop. I will let you know if I see the same problem or not.
Cheers
Andy
Dear all,

many thanks for your support!
Yesterday I re-installed Tango+DB on my laptop with a previous zmq version (v4.0.8) and I did not notice the same error. I will repeat tests today.

Thanks again,

Simone
I am running on Ubuntu 16.04 with Tango 9.2.5a and zmq 5 (from Ubuntu repository) and so far I have not seen the issue either.
How soon does it happen when it does happen? I am counting the events too to see if there is any correlation.
Cheers
Andy
Dear Andy,

many thanks for the prompt report.
I saw the issue after a few minutes (<15 minutes). If I'm not wrong, you should be using zmq 4.1.4-7. I will run the check with this library as well.

Cheers,

Simone
Hi Simone,
I ran your servers on my laptop yesterday for 6 hours without any event loss. More than 340000 events were received by EventListener. This morning I increased the event rate to 10 Hz to see if I can reproduce your problem. So far I cannot. I will let you know the result.
I checked on Ubuntu 16.04 and indeed I am using version 4.1.4-7 of zmq from the Ubuntu repository, i.e. the standard one installed by Ubuntu.
I haven't noticed anything strange in your code - it looks very well written to me!
Cheers
Andy
Hi,
I confirm what Lorenzo wrote: the push-event-by-code approach does acquire some locks in the TANGO core.
Cheers,
Reynald
 