[openstack-dev] Asking for ask.openstack.org

Ian Wienand iwienand at redhat.com
Thu Apr 5 04:04:40 UTC 2018


On 04/05/2018 08:30 AM, Paul Belanger wrote:
> We likely need to reduce the number of days we retain database
> backups / http logs or look to attach a volume to increase storage.

We've long had problems with this host and I've looked at it before
[1].  It often drops out.

It seems there's enough interest that we should dive a bit deeper.  Here's
what I've found out:

askbot
------

The askbot site itself seems under control, except for an unbounded
session log file; a fix is proposed in [2]:

 root at ask:/srv# du -hs *
 2.0G	askbot-site
 579M	dist
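
([2] is the real change; purely for illustration, bounding that session
log is the sort of thing a logrotate stanza handles.  The path below is
an assumption on my part, not the actual location:)

 # illustrative only -- the real change is [2]; path is assumed
 /srv/askbot-site/log/*.log {
     weekly
     rotate 4          # keep roughly a month of logs
     compress
     missingok
     notifempty
     copytruncate      # the app keeps the file open, so truncate in place
 }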

overall
-------

The major consumer is /var, where we've got:

 3.9G	log
 5.9G	backups
 9.4G	lib

backups
-------

The backups seem under control, at least; we rotate them out, keep 10,
and the size is pretty consistently around 600MB:

 root at ask:/var/backups/pgsql_backups# ls -lh
 total 5.9G
 -rw-r--r-- 1 root root 599M Apr  5 00:03 askbotdb.sql.gz
 -rw-r--r-- 1 root root 598M Apr  4 00:03 askbotdb.sql.gz.1
 ...

We could reduce the backup rotation to just one if we like -- the
server is backed up nightly via bup, so at any point we can get
previous dumps from there.  bup should de-duplicate everything, but
still, keeping 10 local copies is probably not necessary.
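
(For the record, pulling an old dump back out of bup looks roughly like
the below; the repository path and save name here are assumptions, not
the actual backup layout:)

 # on the backup server; repo path and save name are hypothetical
 bup -d /opt/backups/bup-ask ls ask.openstack.org/latest/var/backups/pgsql_backups/
 bup -d /opt/backups/bup-ask restore -C /tmp/restore \
     ask.openstack.org/2018-04-01-000000/var/backups/pgsql_backups/askbotdb.sql.gz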

The database directory was sitting at ~9GB:

 root at ask:/var/lib/postgresql# du -hs
 8.9G	.

AFAICT, autovacuum is running OK on the busy tables:

 askbotdb=# select relname,last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables where last_autovacuum is not NULL;
      relname      | last_vacuum |        last_autovacuum        |         last_analyze          |       last_autoanalyze        
 ------------------+-------------+-------------------------------+-------------------------------+-------------------------------
  django_session   |             | 2018-04-02 17:29:48.329915+00 | 2018-04-05 02:18:39.300126+00 | 2018-04-05 00:11:23.456602+00
  askbot_badgedata |             | 2018-04-04 07:19:21.357461+00 |                               | 2018-04-04 07:18:16.201376+00
  askbot_thread    |             | 2018-04-04 16:24:45.124492+00 |                               | 2018-04-04 20:32:25.845164+00
  auth_message     |             | 2018-04-04 12:29:24.273651+00 | 2018-04-05 02:18:07.633781+00 | 2018-04-04 21:26:38.178586+00
  djkombu_message  |             | 2018-04-05 02:11:50.186631+00 |                               | 2018-04-05 02:14:45.22926+00

Out of interest I did run a manual

 su - postgres -c "vacuumdb --all --full --analyze"

That dropped a fair amount:

 root at ask:/var/lib/postgresql# du -hs
 8.9G	.    (before)
 5.8G	.    (after)
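
If we wanted to see where the remaining space actually lives, per-table
sizes can be pulled out with something like this (query shown for
reference only; I haven't included its output here):

 askbotdb=# SELECT relname,
                   pg_size_pretty(pg_total_relation_size(relid)) AS total_size
              FROM pg_statio_user_tables
          ORDER BY pg_total_relation_size(relid) DESC
             LIMIT 10;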

I installed pg_activity and watched for a while; nothing seemed to be
really stressing it.

Ergo, I'm not sure there's much to do at the database layer.

logs
----

This leaves the logs:

 1.1G	jetty
 2.9G	apache2

The jetty logs are cleaned up regularly.  I think they could be made
quieter, but at least they appear to be bounded.

Apache logs are rotated but never cleaned up; surely logs from 2015
aren't useful.  A cleanup is proposed in [3].
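
([3] is the proper fix; for illustration, the effect is roughly along
the lines of the below, with the retention period an arbitrary choice
on my part:)

 # illustrative only -- the real cleanup is proposed in [3]
 find /var/log/apache2 -name '*.gz' -mtime +90 -delete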

Random offline
--------------

[4] is an example of a user reporting the site offline.  Looking at
the logs, it seems that puppet found httpd not running at 07:14 and
restarted it:

 Apr  4 07:14:40 ask puppet-user[20737]: (Scope(Class[Postgresql::Server])) Passing "version" to postgresql::server is deprecated; please use postgresql::globals instead.
 Apr  4 07:14:42 ask puppet-user[20737]: Compiled catalog for ask.openstack.org in environment production in 4.59 seconds
 Apr  4 07:14:44 ask crontab[20987]: (root) LIST (root)
 Apr  4 07:14:49 ask puppet-user[20737]: (/Stage[main]/Httpd/Service[httpd]/ensure) ensure changed 'stopped' to 'running'
 Apr  4 07:14:54 ask puppet-user[20737]: Finished catalog run in 10.43 seconds

Which explains why, by the time I looked, it seemed OK.  Checking the
Apache error logs we have:

 [Wed Apr 04 07:01:08.144746 2018] [:error] [pid 12491:tid 140439253419776] [remote 176.233.126.142:43414] mod_wsgi (pid=12491): Exception occurred processing WSGI script '/srv/askbot-site/config/django.wsgi'.
 [Wed Apr 04 07:01:08.144870 2018] [:error] [pid 12491:tid 140439253419776] [remote 176.233.126.142:43414] IOError: failed to write data
 ... more until ...
 [Wed Apr 04 07:15:58.270180 2018] [:error] [pid 17060:tid 140439253419776] [remote 176.233.126.142:43414] mod_wsgi (pid=17060): Exception occurred processing WSGI script '/srv/askbot-site/config/django.wsgi'.
 [Wed Apr 04 07:15:58.270303 2018] [:error] [pid 17060:tid 140439253419776] [remote 176.233.126.142:43414] IOError: failed to write data

and the restart was logged as:

 [Wed Apr 04 07:14:48.912626 2018] [core:warn] [pid 21247:tid 140439370192768] AH00098: pid file /var/run/apache2/apache2.pid overwritten -- Unclean shutdown of previous Apache run?
 [Wed Apr 04 07:14:48.913548 2018] [mpm_event:notice] [pid 21247:tid 140439370192768] AH00489: Apache/2.4.7 (Ubuntu) OpenSSL/1.0.1f mod_wsgi/3.4 Python/2.7.6 configured -- resuming normal operations
 [Wed Apr 04 07:14:48.913583 2018] [core:notice] [pid 21247:tid 140439370192768] AH00094: Command line: '/usr/sbin/apache2'
 [Wed Apr 04 14:59:55.408060 2018] [mpm_event:error] [pid 21247:tid 140439370192768] AH00485: scoreboard is full, not at MaxRequestWorkers

This does not appear to be disk-space related; the cacti graphs for
that period show the disk is full-ish, but not full [5].

What caused the I/O errors?  dmesg has nothing in it since 30 Mar, and
kern.log is empty.
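
Next time it drops out, it might be worth grabbing Apache's own view of
the scoreboard before puppet restarts it.  Assuming mod_status is
enabled locally (which I haven't checked), that's something like:

 # assumes mod_status/ExtendedStatus are enabled; not verified on this host
 curl -s 'http://localhost/server-status?auto'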

Server
------

Most importantly, this server wants a Xenial upgrade.  At the very
least, the Apache shipped there is known to handle the "scoreboard is
full" issue better.

We should also make sure we use a bigger instance; the current one is
dipping into swap:

 postgres at ask:~$ free -h
              total       used       free     shared    buffers     cached
 Mem:          3.9G       3.6G       269M       136M        11M       819M
 -/+ buffers/cache:       2.8G       1.1G
 Swap:         3.8G       259M       3.6G
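
If we want to know what is actually sitting in swap before sizing the
replacement, per-process VmSwap is easy enough to pull out of /proc:

 # rough per-process swap usage, largest first
 grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -nr | head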

tl;dr
-----

I don't think there's anything runaway-bad going on, but the server
is undersized and needs a system upgrade.

Since I've got this far with it, over the next few days I'll see where
the puppet stands for a Xenial upgrade, and whether we can get a
migration underway.

Thanks,

-i

[1] https://review.openstack.org/406670
[2] https://review.openstack.org/558977
[3] https://review.openstack.org/558985
[4] http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-04-04.log.html#t2018-04-04T07:11:22
[5] http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=2547&rra_id=0&view_type=tree&graph_start=1522859103&graph_end=1522879839


