We recently upgraded our Swift servers from 18.04 to 20.04. Since the upgrades completed, the ssync receiver keeps crashing on one server. The crash is preventing handoffs from being cleaned up in that cluster (50 servers). I am looking for advice on how to fix or work around the issue. We run 20 clusters but see the problem in only one of them, and it seems strange that only one server out of the ~1000 we run hits this ssync receiver error. The receiver logs this traceback:
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swift/obj/ssync_receiver.py", line 166, in __call__
    for data in self.missing_check():
  File "/usr/lib/python3/dist-packages/swift/obj/ssync_receiver.py", line 340, in missing_check
    line = self.fp.readline(self.app.network_chunk_size)
  File "/usr/lib/python3/dist-packages/eventlet/wsgi.py", line 226, in readline
    return self._chunked_read(self.rfile, size, True)
  File "/usr/lib/python3/dist-packages/eventlet/wsgi.py", line 211, in _chunked_read
    raise ChunkReadError(err)
eventlet.wsgi.ChunkReadError: invalid literal for int() with base 16: b''
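For what it's worth, the inner ValueError is exactly what Python raises when an empty byte string is parsed as a hexadecimal chunk size, which would be consistent with the chunked request body ending before the receiver got the next chunk header. The sketch below only reproduces that message. It is not the eventlet code, and `read_chunk_size` is a hypothetical helper written for illustration:

```python
import io


def read_chunk_size(rfile):
    """Parse the hex size line that starts each chunk of a chunked body.

    Hypothetical helper for illustration only, not eventlet's implementation.
    """
    line = rfile.readline()  # returns b'' once the stream has no more data
    size_field = line.split(b";", 1)[0].strip()  # drop any chunk extension
    return int(size_field, 16)  # int(b'', 16) raises ValueError


# A well-formed chunk header parses to its size.
print(read_chunk_size(io.BytesIO(b"1a\r\n")))

# An exhausted stream reproduces the message from the traceback.
try:
    read_chunk_size(io.BytesIO(b""))
except ValueError as err:
    print(err)
```

If that reading is right, the receiver is reacting to a stream that stopped early rather than to malformed data, so the sender-side output is probably the more useful place to look.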
On the sender side, I see this output when I run the reconstructor manually with debug logging:
root@use1-saas-p6-paco-9:~# swift-object-reconstructor object-server.conf -p 27999 -v -o
object-reconstructor: Starting 271723
object-reconstructor: Spawned worker 271755 with {'override_partitions': [27999], 'override_devices': ['d10', 'd17', 'd19', 'd9', 'd29', 'd15', 'd2', 'd24', 'd32', 'd1', 'd8', 'd14', 'd11', 'd28', 'd27', 'd6', 'd7', 'd16', 'd13', 'd25', 'd26', 'd20', 'd18', 'd12', 'd4', 'd0', 'd22', 'd3', 'd5', 'd21', 'd33', 'd23', 'd30', 'd34', 'd31'], 'multiprocess_worker_index': 0}
object-reconstructor: [worker 1/1 pid=271755] Running object reconstructor in script mode.
object-reconstructor: [worker 1/1 pid=271755] Run listdir on /srv/node/d22/objects-4/27999
object-reconstructor: [worker 1/1 pid=271755] recieverIP:6200/d31/27999 10.0 seconds: connect receive
object-reconstructor: [worker 1/1 pid=271755] 1/26820 (0.00%) partitions reconstructed in 10.14s (0.10/sec, 75h remaining)
object-reconstructor: [worker 1/1 pid=271755] Object reconstruction complete (once). (0.17 minutes)
object-reconstructor: Forked worker 271755 finished
object-reconstructor: Worker 271755 exited
object-reconstructor: Finished 271723
object-reconstructor: Exited 271723