Open Stack

Mon Jan 2 23:03:07 UTC 2017

On 2016-12-30 12:38:02 +0800 (+0800), Tom Fifield wrote:
> This problem (legitimate users having all posts flatly rejected as
> spam) is still happening. Any progress?

We're basically ready to retry the previous upgrade now that some
issues have been identified/corrected by Marton. Change
https://review.openstack.org/408657 to trigger it is waiting on
https://review.openstack.org/416072 to grant him a shell account on
the production server for improved reactivity if it fails again, so
that he can collaborate with us more directly troubleshooting before
we have to make the call to fix or revert.

> I've been doing what I can without access to the server, but my
> latest attempt - completely deleting an account so it could be
> re-created by the affected user - was thwarted by a 500 error. Did
> that appear in the logs?

Given that I don't know the exact time you tried nor your IP
address, and am unsure what that failure would look like in the logs
aside from coming with a 500 error code this Apache access log entry
18 minutes before your E-mail's timestamp stands out:

    [30/Dec/2016:04:20:40 +0000] "GET /admin/askbot/post/25842/
    HTTP/1.1" 500 917
    "https://ask.openstack.org/admin/auth/user/2253/delete/"
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101
    Firefox/50.0"

I don't see anything related around that timeframe in the Apache
error log files, the Askbot application log, the Solr Jetty log...
though there's this in dmesg which is suspiciously close (timestamps
in dmesg aren't entirely reliable, so this could have happened 5
minutes earlier):

    [Fri Dec 30 04:25:25 2016] apache2 invoked oom-killer:
    gfp_mask=0x200da, order=0, oom_score_adj=0

And indeed, there's a spike in swap utilization around that time
which, given the five-minute granularity could certainly support the
notion that a runaway process ate all available virtual memory on
the system:

<URL: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=2546&rra_id=2&view_type=&graph_start=1483029900&graph_end=1483156284&graph_height=120&graph_width=500&title_font_size=10 >

The RAM usage graph suggests we were doing okay with a target
utilization of ~50% so something eating an additional 6GiB of memory
in a matter of a few minutes would definitely count as anomalous:

<URL: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=2544&rra_id=2&view_type=&graph_start=1483029900&graph_end=1483156284&graph_height=120&graph_width=500&title_font_size=10 >

Was the memory spike caused by that deletion? Or was the deletion
error coincidental and caused by a memory event which just happened
to be going on at the same time? I have insufficient knowledge of
the system to be able to say either way. It's possible there are
additional logs I don't know to look at which could tell us.
-- 
Jeremy Stanley

Open Stack

[OpenStack-Infra] Logs for Ask.O.o - chasing false positive spam labeling

OpenStack

Community

Documentation

Branding & Legal