Very slow response causing 3s timeout error

Hello,

we have implemented, for the first time, a device server aimed at controlling a dome (a protective enclosure for a telescope).

I describe the environment to explain the situation:
- we have 2 machines on which Tango 8.something is installed as recommended
- machine 1, called the workstation (Ubuntu 14.04 LTS), hosts the database (TANGO_HOST)
- machine 2, called the LCU (Local Control Unit), is connected to the hardware (Ubuntu 14.04 LTS; it also hosts a Windows virtual machine, because the dome driver is only provided for Windows)
- we wrapped the dome driver in Python, so that we can test the driver itself independently of Tango (among other things)
- we implemented a device server containing a Tango class that controls the dome through the wrapper
- this dome device server runs on the LCU (machine 2)
- we wrote a PyTango script to test the dome device server (attribute access, command execution, …)
- this script is executed from the workstation (machine 1)

When executing this script, we observe the following symptoms:
- sometimes, occasionally for long stretches (2-3 hours), the script executes correctly
- but sometimes we encounter failures: the script stops because of the 3 s Tango timeout
- in most cases it stops on a State or Status command (we have not overloaded them; we use the inherited ones), but it can also stop on an attribute read (implemented by us, and not made asynchronous, because it should reply much faster than 3 s)
- the script does not always fail at the same point
So the problem is erratic and not reproducible, and seems to be linked to the environment.

- The machines are powerful and dedicated to our control experiment
- The network is Gigabit Ethernet
- Endurance tests (Python scripts) have been run several times, for several hours each, using the dome wrapper alone (without Tango), and they pass

- When testing the State and Status commands of our device server from the Jive test device utility on the LCU, the commands take between 150 ms and 1800 ms

We did some extra tests with device servers not written by us:
- the TangoTest DS provided with the Tango distribution, written in C++
+ if the DS runs on the Linux OS of the LCU => it is very fast; the State command takes 0 ms from the Jive test device utility
+ if the DS runs on the Windows virtual machine of the LCU => it is very fast; the State command takes 0 ms from the Jive test device utility
- the DevTest.py test DS found on GitHub (https://github.com/tango-cs/PyTango/blob/master/tests/DevTest.py), written in Python
+ if the DS runs on the Linux OS of the LCU => it is very slow and fails with the 3 s Tango timeout error when retrieving the attributes
+ if the DS runs on the Windows virtual machine of the LCU => it is very slow and fails with the 3 s Tango timeout error when retrieving the attributes

It seems that:
- the virtual machine does not affect the behaviour
- the language (C++ or Python) does affect the behaviour

but why?

Have other PyTango users (or anyone else) already encountered this problem, or this kind of problem?
Does someone have an idea to help us with it?

Thanks in advance.
Laurence.


Hi,

very strange. This kind of configuration (and ones with many more hosts) works well at the ESRF and at many sites. I have not tested the DevTest.py device server but it should not slow down in the way you describe. It seems a mystery. Maybe there is a problem with the version of PyTango. How did you install it? Have you checked to see if the python device server is using lots of memory / cpu? I would switch on logging in the python device server and send it to a file to try to see what is happening by correlating it with logging messages from the client.

At what frequency does the client interrogate the server?

Andy
Hi,

thanks for your fast answer.
When you say "This kind of configuration (and ones with many more hosts) works well at the ESRF and at many sites", I suppose you are right, and indeed something must be wrong in our configuration.
I will only be at IPAG (in front of the machines) on Thursday.

Laurence.
Hi Laurence,

There is a good chance that you have some kind of "deadlock" in your device server process. To check this, increase the timeout
value in your client to 3.5 s (DeviceProxy::set_timeout_millis() method) and check the error message you receive.
There is a chance that it will no longer be "Timeout" but rather something like "Not able to acquire serialization monitor".
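A minimal sketch of this check, assuming PyTango 8.x (the version used in this thread) and the device name toto/python/test that appears later; the helper function and its name are mine, not code from the thread:

```python
try:
    import PyTango  # PyTango 8.x module name, as used in this thread
except ImportError:
    PyTango = None  # lets the sketch run on a machine without Tango

def probe_error_message(device_name="toto/python/test", timeout_ms=3500):
    """Read State() with an enlarged client timeout and return either the
    state or the Tango error description, so a deadlock ("Not able to
    acquire serialization monitor") can be told apart from a plain timeout."""
    if PyTango is None:
        return "PyTango not installed"
    try:
        proxy = PyTango.DeviceProxy(device_name)
        proxy.set_timeout_millis(timeout_ms)  # default client timeout is 3000 ms
        return str(proxy.state())
    except PyTango.DevFailed as err:
        return err.args[0].desc

print(probe_error_message())
```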

Could you confirm that the error message you now get is this "not able to …"?
Could you also tell us if you have your own thread(s) within your Tango class?

Regards

Emmanuel
I will not be at work today, so I can't test right now.
What I can say now is that
- in the device server we develop, we have our own threads
- in the DevTest.py test DS found on GitHub (https://github.com/tango-cs/PyTango/blob/master/tests/DevTest.py), written in Python, the thread module is also used

I attach three files (described in the first post) we used for our device server, in case it helps:
- the dome driver wrapper: its functions were made synchronous, to handle the end of each "action"
- the Tango class included in the device server: its functions were made asynchronous, using the Python thread module, to avoid the 3 s Tango timeout error
- the PyTango script we use to test the device server; since we call asynchronous functions, we wait for the end of each operation (command, …) with the function waitForState(state, delay), which waits as long as the state is still the given one
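The waitForState(state, delay) helper itself is not shown in the thread; a minimal sketch of such a polling loop, following the description above ("waits as long as the state is still the given one"), with a plain callable standing in for a real DeviceProxy.state() call:

```python
import time

def wait_for_state(get_state, state, delay=0.1, timeout=30.0):
    """Poll get_state() every `delay` seconds while it still returns
    `state`; return True once the device has left that state, or False
    if `timeout` seconds elapse first.  `get_state` stands in for e.g.
    DeviceProxy.state() in a real client; this is a guess at what the
    thread's waitForState(state, delay) does, not the authors' code."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() != state:
            return True
        time.sleep(delay)
    return False

# Stub "device" that reports MOVING twice, then ON.
states = iter(["MOVING", "MOVING", "ON"])
print(wait_for_state(lambda: next(states), "MOVING", delay=0.01))  # True
```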

I will try the other tests tomorrow if possible.

Thanks, regards,
Laurence.
Hi Manu and Andy,

PyTango version is 8.1.1.
PyTango was installed from the provided Ubuntu package, as recommended.

On our LCU, in the Windows virtual machine:
- Windows PowerShell (powershell.exe): 38,148 K
- python.exe *32: 35,616 K

Performance:
- CPU usage: generally between 0% and 5%, with peaks around 20%
- Memory: 773 MB


I increased the timeout to 3.5 s while testing the DevTest DS (a Python test DS):
- I do not get the "Not able to acquire serialization monitor" error, but still the timeout error
- I do not get any "not able to …" error message
- whether the DS runs in the LCU VM, on the LCU's Linux, or even on the workstation (an extra test, running the DS on the same machine as the TANGO_HOST), the error is:
+ DevError[
desc = Timeout (3000 mS) exceeded on device toto/python/test (if the timeout is set to 3 s)
….
or
+ DevError[
desc = Timeout (3500 mS) exceeded on device toto/python/test (if the timeout is set to 3.5 s)

I attach the log file as well as some error screenshots.
The error is produced when reading the attr_asyn_to attribute (output on the console where the DS is started):

# Start DevTest DS and run the client test script
xtramgr@xtworkstation:~$ ./DevTest.py test
Failed to import EventChannelFactory notifd/factory/xtworkstation from the Tango database
In DevTestClass __init__
In DevTest __init__
In Python init_device method
In read_attr_hardware
[DevTest::read_attr] attribute name Long_attr
[DevTest::read_attr] attribute name DefUserAttr
[DevTest::read_attr] attribute name DefClassUserAttr
In read_attr_hardware
[DevTest::read_attr] attribute name attr_wrong_type
In read_attr_hardware
[DevTest::read_attr] attribute name Float_attr
In read_attr_hardware
[DevTest::read_attr] attribute name UShort_ima_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name Short_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name UChar_spec_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name Short_spec_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name ULong64_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name attr_asyn_to
Leaving reading attr_asyn_to attribute

and I can reproduce the timeout error when reading this attr_asyn_to attribute in the Jive test device utility.

thanks for your help, regards
Laurence.
Hi Laurence,

Here is the code of the attr_asyn_to attribute from your link (https://github.com/tango-cs/PyTango/blob/master/tests/DevTest.py):

    def read_attr_asyn_to(self, the_att):
        print "[DevTest::read_attr] attribute name attr_asyn_to"
        time.sleep(4)
        the_att.set_value(5.55)
        print "Leaving reading attr_asyn_to attribute"

On line 1919, there is:
 time.sleep(4)

So it takes more than 4 seconds to read this attribute.
It is therefore normal to get a timeout exception if your client timeout is set to 3 s or 3.5 s.

Increasing the timeout, or changing the code of the read_attr_asyn_to attribute so that it returns in less than 3 s or 3.5 s, should solve this problem. :)

Hoping this helps,
Reynald
Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new.
Hi Reynald,

thanks; I've modified the code of the DevTest DS by reducing the sleep time to under 3 s, and of course it now works.

I still have to find out what goes wrong in OUR DS (see first post), but first I have to reproduce the problem and rebuild a system (because the "hardware has left").

I will investigate as soon as possible and ask if needed.

Thanks again.

Laurence.
Dear all,

In the same thread: I randomly get a "desc = IMP_LIMIT CORBA system exception: Unknown minor code: a" error, even after setting our timeout to 30 s…

Any idea what that means?

Best regards,
Sylvain
 