[Openstack-operators] Ceph crashes with larger clusters and denser hardware

Warren Wang warren at wangspeed.com
Thu Aug 28 21:23:26 UTC 2014


It happened upon initial creation of pools on an empty cluster. Every
single Ceph process crashed on every single node. This is on extremely
dense hardware with 40GbE connectivity. After this tweak, the cluster has
been stable. There are a number of other performance issues that still
need attention, but as far as stability goes, this was the key. Sadly,
the kernel logs no message at all, which makes this extremely difficult
to track down if you don't happen to be on the host at the time.
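
If you want to catch this before it bites, one rough check is to compare the
PID ceiling against the number of threads actually in use. Something along
these lines works (just a sketch for a typical Linux host; ceph-osd is simply
the obvious daemon to look at):

cat /proc/sys/kernel/pid_max                     # current ceiling
ps -eLf | tail -n +2 | wc -l                     # threads in use host-wide
grep Threads /proc/$(pidof -s ceph-osd)/status   # threads for one ceph-osd

If the second number gets anywhere near the first, it's time to raise the
limit.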

Warren
@comcastwarren


On Thu, Aug 28, 2014 at 5:20 PM, David Moreau Simard <dmsimard at iweb.com>
wrote:

>   BTW the tracker link is http://tracker.ceph.com/issues/6142
>
>  This is an interesting issue, I'm definitely curious.
>
>  May I ask whether this happened to you during recovery as well, as
> described in the tracker issue?
> Also, if you divide the number of placement groups by the number of OSDs,
> what number do you get?
>
>  If this happens mostly during recovery, I'm curious whether the number of
> placement groups (in addition to the thread config) plays a role in the
> number of threads required for healing and replication.
>
>  Thanks.
>  --
> David Moreau Simard
>
>   From: "Fischer, Matt" <matthew.fischer at twcable.com>
> Date: Thu, 28 Aug 2014 16:51:18 -0400
> To: Warren Wang <warren at wangspeed.com>,
> "openstack-operators at lists.openstack.org"
> <openstack-operators at lists.openstack.org>
> Subject: Re: [Openstack-operators] Ceph crashes with larger clusters and
> denser hardware
>
>   What version of Ceph was this seen on?
>
>   From: Warren Wang <warren at wangspeed.com>
> Date: Thursday, August 28, 2014 10:38 AM
> To: "openstack-operators at lists.openstack.org"
> <openstack-operators at lists.openstack.org>
> Subject: [Openstack-operators] Ceph crashes with larger clusters and
> denser hardware
>
>  One of my colleagues here at Comcast just returned from the Operators
> Summit and mentioned that multiple folks experienced Ceph instability with
> larger clusters, so I wanted to send out a note and save some folks a
> headache.
>
>  If you up the number of threads per OSD, there are situations where many
> threads can be spawned very quickly. You must also raise the maximum number
> of PIDs available to the OS, otherwise you essentially get fork bombed.
> Every single Ceph process will crash, and you might see a message in your
> shell about "Cannot allocate memory".
>
> In your sysctl.conf:
>
> # For Ceph
> kernel.pid_max=4194303
>
>  Then run "sysctl -p". In 5 days on a lab Ceph box, we have mowed through
> nearly 2 million PIDs. There's a tracker issue open about getting this
> added to the ceph.com docs.
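>
> For what it's worth, a crude way to gauge how quickly PIDs are being chewed
> up is to look at the PID that a brand-new process gets, for example:
>
> bash -c 'echo $$'    # prints the PID assigned to the new shell
>
> Sample that a while apart and the difference gives a rough consumption
> rate, since Linux hands out PIDs sequentially and only wraps around at
> kernel.pid_max.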
>
> Warren
>  @comcastwarren
>