Event Error as Server-side Exception: resource limit reached

# 7 years ago
TCS_GMRT

Hello,

I am subscribing to a spectrum attribute of a DS (a/b/c). Sometimes I get the following error:

org.omg.CORBA.TRANSIENT: Server-side Exception: resource limit reached vmcid: OMG minor code: 1 completed: No
Severity: PANIC
Reason: TangoApi_CANNOT_IMPORT_DEVICE
Origin: Connection.dev_import(a/b/c)

Can anybody tell me what exactly this error indicates and what the probable reasons for it are?

Regards,
TCS_GMRT

# 7 years ago

Reynald

Hi,

Good question.
I don't think I have ever encountered this error, but the error message suggests that the Tango DatabaseDS device server didn't have enough resources (RAM, open file descriptor limit, …?) to execute your request properly when trying to import a Tango device.

Is the computer where your TANGO database server is running under heavy load?
Do you have a huge number of clients doing a huge number of queries in parallel at high frequency? (A large number of client applications trying to connect to many TANGO devices which are currently not running?)

Are you creating DeviceProxies all the time at high frequency, without reusing the created DeviceProxy objects?

You can use the Database Monitoring tool (DBBench) to monitor the number and type of commands sent to the Database Server.
You can start this tool from Astor by right-clicking on the TANGO_HOST database and selecting "Database Monitoring".

In a similar way, you can also use the Database BlackBox right-click menu item to see the last 50 queries sent to the Database server. This could help identify a client that sends a huge or abnormal number of commands to the Database server.

Kind regards,
Reynald

Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new.
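
To illustrate the question about reusing DeviceProxy objects, here is a minimal Python sketch of a cache that creates each proxy once and returns the same object on later requests. `ProxyCache` and its `factory` argument are hypothetical names introduced for this example. With PyTango, the factory would presumably be `tango.DeviceProxy`; the client in this thread appears to be Java-based, where the same pattern would apply.

```python
import threading
from typing import Any, Callable, Dict


class ProxyCache:
    """Create each device proxy once and hand back the same object afterwards."""

    def __init__(self, factory: Callable[[str], Any]) -> None:
        # `factory` builds a proxy from a device name (assumption: something
        # like tango.DeviceProxy in a PyTango client).
        self._factory = factory
        self._proxies: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, device_name: str) -> Any:
        # Normalise the key so "a/b/c" and "A/B/C" share one proxy.
        key = device_name.lower()
        with self._lock:
            proxy = self._proxies.get(key)
            if proxy is None:
                # Only reached on the first request for this device. If the
                # factory raises, nothing is cached and the error propagates.
                proxy = self._factory(device_name)
                self._proxies[key] = proxy
            return proxy
```

Every part of the client would then call `cache.get("a/b/c")` instead of constructing a new proxy, so repeated lookups of the same device do not trigger repeated import requests from this client.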

# 6 years ago

TCS_GMRT

Hi Team,

We are still facing this issue. These are some of the artefacts that we have. I will post the other artefacts shortly.

> Is the computer where your TANGO database server is running under heavy load?

Yes, it has load, but system resources are still available according to the output of the htop command.

> Do you have a huge number of clients doing a huge number of queries in parallel at high frequency?

We have approx. 20 Tango DS running in parallel on one host. (The same host has the TANGO Facility.)

> A big number of client applications trying to connect to many TANGO devices which are currently not running?

Yes, there may be 30+ TANGO DS running on 30+ TANGO Facilities. The TANGO DS on the host of interest (where we get the "resource limit reached" error) tries to reach these 30+ DS(s), which may or may not be alive.

> Are you creating DeviceProxies all the time at high frequency, without reusing the created DeviceProxy objects?

No. We re-use the DeviceProxy once it is created. If we do not get it, we may retry getting the proxy, but once we have the proxy we do not re-create it.

> You can use the Database Monitoring tool (DBBench) to monitor the number and type of commands sent to the Database Server.

I will try to do this and get back with the updates.

Regards,
TCS_GMRT

# 6 years ago

lorenzo

> We are still facing this issue. These are some of the artefacts that we have. I will post the other artefacts shortly.
>
> > Is the computer where your TANGO database server is running under heavy load?
>
> Yes, it has load, but the system resources are available as per the output of htop command

Yes, there are still resources available, but it does look like a fairly loaded system…

> > Do you have a huge number of clients doing a huge number of queries in parallel at high frequency?
>
> We have approx. 20 Tango DS running in parallel on one Host. (The same host has TANGO Facility)

20 is not a large number of devices running on a host. But the specific question is whether you have a large number of clients querying at high frequency…

> > A big number of client applications trying to connect to many TANGO devices which are currently not running?
>
> Yes, there is a possibility of having more than 30+ TANGO DS running on 30+ TANGO Facility. The TANGO DS on the host of interest (where we get resource limit reached error) tries to reach to these 30+ DS(s) which may or may not be alive.

This is typically something you want to avoid. It is really preferable to have TANGO devices always running, maybe just sitting idle, rather than starting and stopping services. Clients hitting non-running devices create an unwanted load on the database server which, depending on how badly behaved the client is, can turn into a heavy load.

> > Are you creating DeviceProxies all the time at high frequency, without reusing the created DeviceProxy objects?
>
> No. We re-use the device Proxy once created. If we do not get it we may re-try getting the proxy but once we get the proxy we do not re-create it.
>
> > You can use the Database Monitoring tool (DBBench) to monitor the number and type of commands sent to the Database Server.
>
> I will try to do this and get back with the updates.

This can provide some more detail to look at.

Cheers,
Lorenzo
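
One way to limit the load that retries against non-running devices put on the database server is to space the attempts out. The sketch below is only an illustration under stated assumptions: `connect_with_backoff`, its parameters and the `factory` callable are hypothetical names, and the broad `except Exception` stands in for whatever connection failure the real client library raises.

```python
import time
from typing import Any, Callable, Optional


def connect_with_backoff(
    factory: Callable[[str], Any],
    device_name: str,
    max_attempts: int = 6,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Try to build a proxy, doubling the wait after each failed attempt."""
    delay = initial_delay
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return factory(device_name)
        except Exception as exc:  # assumption: narrow this to the real failure type
            last_error = exc
            if attempt == max_attempts:
                break
            # Wait before the next attempt, then double the wait up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ConnectionError(
        f"{device_name}: gave up after {max_attempts} attempts"
    ) from last_error
```

With the defaults, a device that stays down is tried six times over roughly 31 seconds (waits of 1, 2, 4, 8 and 16 seconds) instead of being polled in a tight loop. The `sleep` parameter exists only so the waiting can be replaced in tests.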

# 6 years ago
TCS_GMRT

Thanks, Lorenzo, for the prompt reply.

We are actively looking at this issue so that we will be able to fix it. I will keep you posted with the artefacts as and when my testing produces some.

Regards,
TCS_GMRT

# 6 years ago
Andy

My quick answer is: check the number of open file descriptors for the database server, and check your operating system's limit on open file descriptors per process. If the process reaches that limit, this could explain your problem. The solution is then to increase the number of file descriptors allowed per process.

Andy
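
The comparison suggested above can be sketched in Python for a Linux host, where `/proc/<pid>/fd` lists the open descriptors and `/proc/<pid>/limits` holds the per-process limits. The function names and the `proc_root` parameter are hypothetical and exist only for this example; `proc_root` lets the code be pointed at a test directory.

```python
import os
from typing import Optional, Tuple


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Number of entries in <proc_root>/<pid>/fd, i.e. descriptors currently open."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_fd_limit(pid: int, proc_root: str = "/proc") -> Optional[int]:
    """Soft 'Max open files' limit of the process, or None if unlimited or absent."""
    with open(os.path.join(proc_root, str(pid), "limits")) as handle:
        for line in handle:
            if line.startswith("Max open files"):
                # Columns: "Max open files" <soft limit> <hard limit> <units>
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid: int, proc_root: str = "/proc") -> Tuple[int, Optional[int]]:
    """Return (open descriptor count, soft limit) for one process."""
    return count_open_fds(pid, proc_root), soft_fd_limit(pid, proc_root)
```

Calling `fd_usage(pid)` with the PID of the database server returns the current count and the soft limit as a pair. A count close to the limit would support the file descriptor explanation. Reading another user's `/proc/<pid>/fd` usually requires running as that user or as root.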

# 6 years ago
TCS_GMRT

> My quick answer is check the number of open file descriptors for the database server and the limit of your operating system for open file descriptors per process.

Andy, I will check this.

Please find attached the images from the Database Monitoring feature in Astor, as suggested by Reynald.

Regards,
TCS_GMRT

# 6 years ago
Andy

How many file descriptors are open?

You should also check how many import calls are being made per second. You can do that using the database timing attributes. If there are a lot of calls per second, you should check which client is doing this and why. It might be a badly configured client. We have many more devices for one database, so there shouldn't be a problem.

Andy

# 6 years ago

Reynald

Hi,

The exception you get:

org.omg.CORBA.TRANSIENT: Server-side Exception: resource limit reached vmcid: OMG minor code: 1

means the following according to the CORBA specifications (https://www.omg.org/spec/CORBA/3.0/):

TRANSIENT OMG minor code 1 means:

Request discarded because of resource exhaustion in POA, or because POA is in discarding state.

So my guess would be that there are far too many connection attempts to this device server and JacORB is no longer able to handle more requests. Either a queue is full somewhere, or the process has reached the maximum number of allowed open file descriptors.

You can check the number of file descriptors opened by your device server by executing the following shell command on cmsserver (Thanks Emmanuel Taurel for the tip):

`ls /proc/<DEVICE_SERVER_PID>/fd | wc -l`

So, in the case of your jive screenshot (if you didn't restart Node/AGN0 device server since that screenshot), it would be:

`ls /proc/6828/fd | wc -l`

Could it be that you have some clients which are creating new DeviceProxy objects without deleting old DeviceProxy objects?

Do you receive this exception all the time?
If you start from a clean state (no client and device server restarted), how long does it take to reach this state where you get this error?

I would advise you to stop all the clients of this device server (you can see the clients using the blackbox feature) and to try to restart from a clean state and to add clients one by one slowly until you hopefully see which client is triggering the problem (if the problem comes from a specific client, this still needs to be proven).

Hoping this helps,

Reynald

# 6 years ago

TCS_GMRT

> Could it be that you have some clients which are creating new DeviceProxy objects without deleting old DeviceProxy objects?

I do not think so, as we are re-using the proxy.

> Do you receive this exception all the time?

No, we do not get it all the time, but most of the time.

> If you start from a clean state (no client and device server restarted), how long does it take to reach this state where you get this error?

No pattern observed. Sometimes it happens immediately, sometimes it takes some hours.

> I would advise you to stop all the clients of this device server (you can see the clients using the blackbox feature) and to try to restart from a clean state and to add clients one by one slowly until you hopefully see which client is triggering the problem (if the problem comes from a specific client, this still needs to be proven).

I tried this, but it does not seem to help: when I repeated the test, the error appeared at an altogether different client count.

I have put some checks in place to see whether the file descriptor count increases over time. I will post the results once they are available.

Regards,
TCS_GMRT
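
A check of this kind, watching whether the file descriptor count grows over time, could look like the following Python sketch for Linux. It is not the check actually used in this thread. `sample_fd_counts`, `looks_like_leak` and the `min_growth` threshold are hypothetical names and values chosen for illustration.

```python
import os
import time
from typing import Callable, List


def _count_fds(pid: int) -> int:
    # Linux only: one entry per open descriptor in /proc/<pid>/fd.
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fd_counts(
    pid: int,
    samples: int,
    interval_s: float,
    count_fds: Callable[[int], int] = _count_fds,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Record the descriptor count `samples` times, `interval_s` seconds apart."""
    counts: List[int] = []
    for i in range(samples):
        counts.append(count_fds(pid))
        if i < samples - 1:
            sleep(interval_s)
    return counts


def looks_like_leak(counts: List[int], min_growth: int = 10) -> bool:
    """Heuristic: the count never drops and rises by at least `min_growth` overall."""
    if len(counts) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(counts, counts[1:]))
    return never_drops and counts[-1] - counts[0] >= min_growth
```

For example, `sample_fd_counts(pid, samples=60, interval_s=60)` would record one value per minute for an hour. A steadily rising series points to descriptors that are never released. A series that rises and falls fits ordinary connection churn better.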