[OpenStack-Infra] translate-dev wildfly/zanata/trove issues
Ian Wienand
iwienand at redhat.com
Mon Jan 16 03:03:24 UTC 2017
Hi,
I was alerted to translate-dev performance issues today. Indeed,
things were going crazy, with the java wildfly process sucking up all
CPU.
At first there didn't seem to be anything in the logs. Java was
clearly going mad, however, with the following threads running flat-out.
--- gone crazy processes ---
14807 wildfly 20 0 9.892g 4.729g 34232 R 93.7 60.7 26:56.73 java
14804 wildfly 20 0 9.892g 4.729g 34232 R 92.4 60.7 26:51.91 java
14806 wildfly 20 0 9.892g 4.729g 34232 R 92.4 60.7 26:53.92 java
14808 wildfly 20 0 9.892g 4.729g 34232 R 92.4 60.7 26:53.97 java
14810 wildfly 20 0 9.892g 4.729g 34232 R 92.4 60.7 26:56.28 java
14809 wildfly 20 0 9.892g 4.729g 34232 R 92.1 60.7 26:57.74 java
14803 wildfly 20 0 9.892g 4.729g 34232 R 91.1 60.7 26:54.90 java
14805 wildfly 20 0 9.892g 4.729g 34232 R 90.4 60.7 26:52.44 java
---
Hoping to find an easy smoking gun, I made the java process dump its
threads to see what these are doing:
--- thread dump ---
PIDs: 14807  14804  14806  14808  14810  14809  14803  14805
nids: 0x39d7 0x39d4 0x39d6 0x39d8 0x39da 0x39d9 0x39d3 0x39d5
"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007fb2a8026000 nid=0x39d7 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007fb2a8021000 nid=0x39d4 runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007fb2a8024800 nid=0x39d6 runnable
"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007fb2a8028000 nid=0x39d8 runnable
"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007fb2a802b800 nid=0x39da runnable
"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007fb2a8029800 nid=0x39d9 runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007fb2a801f000 nid=0x39d3 runnable
---
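As an aside, the "nid" values in the dump are just the Linux thread
IDs from top rendered in hex, so matching the two up can be scripted
(a sketch using the TIDs above; the dump commands are the standard
ones, not what I actually ran):

```shell
# jstack's "nid" is the Linux thread ID (the PID column in `top -H`) in
# hex, so converting the hot TIDs lets you grep the dump for them directly.
for tid in 14807 14804 14806 14808 14810 14809 14803 14805; do
    printf '%d -> nid=0x%x\n' "$tid" "$tid"
done
# To produce the dump itself, either of these works against the main PID:
#   kill -3 <java-pid>         # dump lands in wildfly's console/stdout log
#   jstack <java-pid> > threads.txt
```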
Unfortunately, these all appear to be GC threads so there's nothing
really obvious there.
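Since every hot thread is a GC worker, one way to confirm the JVM is
thrashing in garbage collection would be jstat. The sample output line
below is fabricated for illustration, not from translate-dev:

```shell
# On a live system you would run:  jstat -gcutil <wildfly-pid> 5000
# O (old gen) pinned near 100% together with a huge FGC (full-GC count)
# means the heap is exhausted and the collector is spinning, which would
# match the top output.  Here we just feed awk a made-up sample line.
OUT=$(awk 'NR == 2 && $4 > 95 && $9 > 100 {
    printf "GC thrashing: old gen %s%% full, %s full GCs\n", $4, $9
}' <<'EOF'
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00   0.00 100.00  99.97  95.12  90.33   4072   93.511   812  1674.383 1767.894
EOF
)
echo "$OUT"
```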
However, eventually the log starts filling with entries like
---
2017-01-16T01:58:00,844Z SEVERE [javax.enterprise.resource.webcontainer.jsf.application] (default task-43) Error Rendering View[/project/project.xhtml]: javax.el.ELException: /WEB-INF/layout/project/settings-tab-languages.xhtml @117,88 rendered="#{not projectHome.hasLocaleAlias(locale)}": javax.persistence.PersistenceException: org.hibernate.HibernateException: Transaction was rolled back in a different thread!
...
2017-01-16T01:08:48,805Z WARN [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (pool-5-thread-1) SQL Error: 0, SQLState: null
2017-01-16T01:08:48,806Z ERROR [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (pool-5-thread-1) javax.resource.ResourceException: IJ000460: Error checking for a transaction
2017-01-16T01:08:48,813Z ERROR [org.hibernate.AssertionFailure] (pool-5-thread-1) HHH000099: an assertion failure occured (this may indicate a bug in Hibernate, but is more likely due to unsafe use of the session): org.hibernate.exception.GenericJDBCException: Could not open connection
2017-01-16T01:08:48,813Z WARN [com.arjuna.ats.jta] (pool-5-thread-1) ARJUNA016029: SynchronizationImple.afterCompletion - failed for org.hibernate.engine.transaction.synchronization.internal.RegisteredSynchronization at 3fc123aa with exception: org.hibernate.AssertionFailure: Exception releasing cache locks
---
There are literally hundreds of lines of traceback across various
exceptions, but at the bottom this really seems to point to mysql
being at the root of all this?
In this case, mysql is running in RAX's cloud DB service. I logged
into the RAX control panel and everything for the DB was green. Using
the same credentials as zanata, I can connect to the db from
translate-dev.o.o and do a few basic things:
---
MySQL [(none)]> USE zanata;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MySQL [zanata]> show tables;
+----------------------------------+
| Tables_in_zanata |
+----------------------------------+
| Activity |
| DATABASECHANGELOG |
...
---
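For the record, the same client session could also look for stuck or
long-running queries via information_schema, even before root access
is sorted out. The host placeholder and the assumption that the zanata
user can read the processlist are mine, not verified:

```shell
# Hypothetical check for long-running queries; <db-host> stands in for the
# RAX cloud DB endpoint, and the zanata user is assumed to have SELECT
# access on information_schema.processlist (cloud DBs often restrict this).
QUERY='SELECT id, user, time, state, LEFT(info, 80) AS query
         FROM information_schema.processlist
        WHERE command <> "Sleep" AND time > 60
        ORDER BY time DESC'
# echo "$QUERY" | mysql -h <db-host> -u zanata -p zanata
echo "$QUERY"
```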
So it's not as simple as "db not working" :( I have copied the console
log to [1] in case that makes any sense to zanata people.
If this persists, and isn't a known Zanata issue, I think we might
have to enable root access for the db via the API, which would allow
us to do things like debug long-running queries. Otherwise, there
might be something in the mysql logs available to RAX support people
that would help identify the issue. OTOH, maybe it's as simple as
upping the java heap or something.
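If it does come down to heap sizing, the knob lives in WildFly's
standalone.conf; the path and the sizes here are illustrative guesses,
not what translate-dev currently runs:

```shell
# $JBOSS_HOME/bin/standalone.conf (path varies by install)
# Setting -Xms equal to -Xmx avoids heap-resize churn; the sizes are
# examples only and would need to fit the instance's RAM (top above shows
# the JVM already resident at ~4.7g, ~61% of memory).
JAVA_OPTS="-Xms6g -Xmx6g -XX:MaxMetaspaceSize=512m $JAVA_OPTS"
```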
In the meantime, I have restarted wildfly, which seems to get things
going for a while.
-i
[1] http://people.redhat.com/~iwienand/console.log.translate-dev-2017-01-16.log.gz