[Openstack-operators] [openstack-dev] Asking for ask.openstack.org
Jimmy McArthur
jimmy at openstack.org
Thu Apr 5 13:39:08 UTC 2018
Ian, thanks for digging in and helping sort out some of these issues!
> Ian Wienand <mailto:iwienand at redhat.com>
> April 4, 2018 at 11:04 PM
>
> We've long had problems with this host and I've looked at it before
> [1]. It often drops out.
>
> It seems there's enough interest we should dive a bit deeper. Here's
> what I've found out:
>
> askbot
> ------
>
> The askbot site itself seems under control, except for an unbounded
> session log file. Fix proposed in [2].
>
> root at ask:/srv# du -hs *
> 2.0G askbot-site
> 579M dist
>
> overall
> -------
>
> The major consumer is /var, where we've got
>
> 3.9G log
> 5.9G backups
> 9.4G lib
>
> backups
> -------
>
> The backups seem under control at least; we're rotating them out and we
> keep 10, and the size is consistently around 600MB each:
>
> root at ask:/var/backups/pgsql_backups# ls -lh
> total 5.9G
> -rw-r--r-- 1 root root 599M Apr 5 00:03 askbotdb.sql.gz
> -rw-r--r-- 1 root root 598M Apr 4 00:03 askbotdb.sql.gz.1
> ...
>
> We could reduce the backup rotations to just one if we like -- the
> server is backed up nightly via bup, so at any point we can get
> previous dumps from there. bup should de-duplicate everything, but
> still, keeping ten local copies is probably not necessary.
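
A minimal Python sketch of the "keep fewer rotations" idea, for illustration only. It assumes the elided entries in the listing continue the numeric-suffix pattern (askbotdb.sql.gz.2, .3, ...); the helper names rotation_index and dumps_to_prune are hypothetical and are not part of the existing backup tooling. It only computes which dumps would be dropped and deletes nothing.

```python
import re


def rotation_index(name, base="askbotdb.sql.gz"):
    """Return 0 for the newest dump, N for 'base.N', or None for unrelated files."""
    if name == base:
        return 0
    match = re.fullmatch(re.escape(base) + r"\.(\d+)", name)
    return int(match.group(1)) if match else None


def dumps_to_prune(names, keep, base="askbotdb.sql.gz"):
    """Return the dump names beyond the `keep` newest, ordered newest to oldest."""
    indexed = [(rotation_index(name, base), name) for name in names]
    # Drop files that are not dumps, then sort by rotation number (0 = newest).
    dumps = sorted((index, name) for index, name in indexed if index is not None)
    return [name for _, name in dumps[keep:]]
```

With keep=1, only the unsuffixed askbotdb.sql.gz would be retained locally, and older dumps would have to be recovered from the nightly bup backup as described above.
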
>
> The db directory was sitting at ~9GB:
>
> root at ask:/var/lib/postgresql# du -hs
> 8.9G .
>
> AFAICT, it seems like the autovacuum is running OK on the busy tables
>
> askbotdb=# select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
> from pg_stat_user_tables where last_autovacuum is not NULL;
>      relname      | last_vacuum |        last_autovacuum        |         last_analyze          |       last_autoanalyze
> ------------------+-------------+-------------------------------+-------------------------------+-------------------------------
>  django_session   |             | 2018-04-02 17:29:48.329915+00 | 2018-04-05 02:18:39.300126+00 | 2018-04-05 00:11:23.456602+00
>  askbot_badgedata |             | 2018-04-04 07:19:21.357461+00 |                               | 2018-04-04 07:18:16.201376+00
>  askbot_thread    |             | 2018-04-04 16:24:45.124492+00 |                               | 2018-04-04 20:32:25.845164+00
>  auth_message     |             | 2018-04-04 12:29:24.273651+00 | 2018-04-05 02:18:07.633781+00 | 2018-04-04 21:26:38.178586+00
>  djkombu_message  |             | 2018-04-05 02:11:50.186631+00 |                               | 2018-04-05 02:14:45.22926+00
>
> Out of interest I did run a manual
>
> su - postgres -c "vacuumdb --all --full --analyze"
>
> That reclaimed roughly 3G:
>
> root at ask:/var/lib/postgresql# du -hs
> 8.9G . (before)
> 5.8G . (after)
>
> I installed pg_activity and watched for a while; nothing seemed to be
> really stressing it.
>
> Ergo, I'm not sure if there's much to do in the db layers.
>
> logs
> ----
>
> This leaves the logs
>
> 1.1G jetty
> 2.9G apache2
>
> The jetty logs are cleaned regularly. I think they could be made more
> quiet, but they seem to be bounded.
>
> Apache logs are rotated but never cleaned up. Surely logs from 2015
> aren't useful. Proposed [3]
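
The change proposed in [3] is not shown in this thread. As a rough illustration of age-based cleanup, the Python sketch below lists rotated log files older than a cutoff. The function name find_stale_logs, the 30-day figure in the usage note, and the assumption that rotated files contain ".log." in their names (access.log.1, error.log.2.gz, and so on) are all hypothetical. It only reports candidates and removes nothing.

```python
import time
from pathlib import Path


def find_stale_logs(log_dir, max_age_days, now=None):
    """Return rotated log files in log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for path in sorted(Path(log_dir).iterdir()):
        # Skip directories and live logs such as access.log; only rotated
        # files (name contains ".log.") are considered for cleanup.
        if not path.is_file() or ".log." not in path.name:
            continue
        if path.stat().st_mtime < cutoff:
            stale.append(path)
    return stale
```

Calling find_stale_logs("/var/log/apache2", 30) on the host would list what a 30-day retention policy would remove; in practice the same effect is usually achieved through the log rotation configuration rather than a separate script.
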
>
> Random offline
> --------------
>
> [4] is an example of a user reporting the site was offline. Looking
> at the logs, it seems that puppet found httpd not running at 07:14 and
> restarted it:
>
> Apr 4 07:14:40 ask puppet-user[20737]: (Scope(Class[Postgresql::Server])) Passing "version" to postgresql::server is deprecated; please use postgresql::globals instead.
> Apr 4 07:14:42 ask puppet-user[20737]: Compiled catalog for ask.openstack.org in environment production in 4.59 seconds
> Apr 4 07:14:44 ask crontab[20987]: (root) LIST (root)
> Apr 4 07:14:49 ask puppet-user[20737]: (/Stage[main]/Httpd/Service[httpd]/ensure) ensure changed 'stopped' to 'running'
> Apr 4 07:14:54 ask puppet-user[20737]: Finished catalog run in 10.43 seconds
>
> That explains why the site seemed OK when I first looked. Checking the
> apache logs we have:
>
> [Wed Apr 04 07:01:08.144746 2018] [:error] [pid 12491:tid 140439253419776] [remote 176.233.126.142:43414] mod_wsgi (pid=12491): Exception occurred processing WSGI script '/srv/askbot-site/config/django.wsgi'.
> [Wed Apr 04 07:01:08.144870 2018] [:error] [pid 12491:tid 140439253419776] [remote 176.233.126.142:43414] IOError: failed to write data
> ... more until ...
> [Wed Apr 04 07:15:58.270180 2018] [:error] [pid 17060:tid 140439253419776] [remote 176.233.126.142:43414] mod_wsgi (pid=17060): Exception occurred processing WSGI script '/srv/askbot-site/config/django.wsgi'.
> [Wed Apr 04 07:15:58.270303 2018] [:error] [pid 17060:tid 140439253419776] [remote 176.233.126.142:43414] IOError: failed to write data
>
> and the restart logged
>
> [Wed Apr 04 07:14:48.912626 2018] [core:warn] [pid 21247:tid 140439370192768] AH00098: pid file /var/run/apache2/apache2.pid overwritten -- Unclean shutdown of previous Apache run?
> [Wed Apr 04 07:14:48.913548 2018] [mpm_event:notice] [pid 21247:tid 140439370192768] AH00489: Apache/2.4.7 (Ubuntu) OpenSSL/1.0.1f mod_wsgi/3.4 Python/2.7.6 configured -- resuming normal operations
> [Wed Apr 04 07:14:48.913583 2018] [core:notice] [pid 21247:tid 140439370192768] AH00094: Command line: '/usr/sbin/apache2'
> [Wed Apr 04 14:59:55.408060 2018] [mpm_event:error] [pid 21247:tid 140439370192768] AH00485: scoreboard is full, not at MaxRequestWorkers
>
> This does not appear to be disk-space related; see the cacti graphs
> for that period that show the disk is full-ish, but not full [5].
>
> What caused the I/O errors? dmesg has nothing in it since 30/Mar.
> kern.log is empty.
>
> Server
> ------
>
> Most importantly, this server needs a Xenial upgrade. At the very
> least, the apache version there is known to handle the "scoreboard
> is full" issue better.
>
> We should ensure that we use a bigger instance; it's using up some
> swap
>
> postgres at ask:~$ free -h
>              total       used       free     shared    buffers     cached
> Mem:          3.9G       3.6G       269M       136M        11M       819M
> -/+ buffers/cache:       2.8G       1.1G
> Swap:         3.8G       259M       3.6G
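
For readers unfamiliar with this older free layout: the "-/+ buffers/cache" row is derived from the "Mem:" row by treating buffers and page cache as reclaimable. The short Python sketch below redoes that arithmetic from the rounded figures above, so the results are approximate; the helper to_mib is a hypothetical name used only for this illustration.

```python
def to_mib(value):
    """Convert a `free -h` style figure such as '3.6G' or '819M' to MiB."""
    units = {"M": 1.0, "G": 1024.0}
    return float(value[:-1]) * units[value[-1]]


# Figures copied from the "Mem:" row of the output above.
used = to_mib("3.6G")
free = to_mib("269M")
buffers = to_mib("11M")
cached = to_mib("819M")

# Memory held by applications: "used" minus reclaimable buffers and cache.
app_used_gib = (used - buffers - cached) / 1024
# Memory obtainable without swapping: "free" plus the same reclaimable memory.
app_free_gib = (free + buffers + cached) / 1024

print(f"{app_used_gib:.1f}G used, {app_free_gib:.1f}G free")
```

The result agrees with the 2.8G and 1.1G reported by free, meaning applications alone occupy most of the 3.9G of RAM, which is consistent with the host dipping into swap.
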
>
> tl;dr
> -----
>
> I don't think there's anything run-away bad going on, but the server
> is undersized and needs a system update.
>
> Since I've got this far with it, over the next few days I'll see where
> we are with the puppet for a Xenial upgrade and see if we can't get a
> migration underway.
>
> Thanks,
>
> -i
>
> [1] https://review.openstack.org/406670
> [2] https://review.openstack.org/558977
> [3] https://review.openstack.org/558985
> [4] http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-04-04.log.html#t2018-04-04T07:11:22
> [5] http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=2547&rra_id=0&view_type=tree&graph_start=1522859103&graph_end=1522879839
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> Paul Belanger <mailto:pabelanger at redhat.com>
> April 4, 2018 at 5:30 PM
>
> We also have a 2nd issue where the ask.o.o server doesn't appear to be
> large enough any more to handle the traffic. A few times over the last
> few weeks we've had outages due to the HDD being full.
>
> We likely need to reduce the number of days we retain database backups
> / http logs or look to attach a volume to increase storage.
>
> Paul
>
> Jimmy McArthur <mailto:jimmy at openstack.org>
> April 4, 2018 at 4:26 PM
> Hi everyone!
>
> We have a very robust and vibrant community at ask.openstack.org
> <https://ask.openstack.org/>. There are literally dozens of posts a
> day. However, many of them don't receive knowledgeable answers. I'm
> really worried about this becoming a vacuum where potential community
> members get frustrated and don't realize how to get more involved with
> the community.
>
> I'm looking for thoughts/ideas/feelings about this tool as well as
> potential admin volunteers to help us manage the constant influx of
> technical and not-so-technical questions around OpenStack.
>
> For those of you already contributing there, Thank You! For those
> that are interested in becoming a moderator (instant AUC status!) or
> have some additional ideas around fostering this community, please
> respond.
>
> Looking forward to your thoughts :)
>
> Thanks!
> Jimmy
> irc: jamesmcarthur