Hi, I regularly get the above errors when running plonectl start on a FreeBSD buildout.
My investigations so far have uncovered the following:
For the start operation, zdctl produces 14 exceptions while making socket calls, for each of the zeoserver and the clients.
The first three are raised as a "No such file or directory" exception during sock.connect. For the other 11 a "socket is not connected" exception is raised during sock.shutdown(1).
The site still comes up, however (it looks like sending a start is not really necessary). But I can't stop it, because the get_status() calls behave as if the daemon does not exist (a stop call gives a "not connected" exception).
There are very rare cases when it does connect, though, in which case the only exceptions are the first three "No such file or directory" exceptions. These three, I imagine, come from the constructor, the first get_status in do_start, and the get_status in awhile that runs before the while loop (and its 1 second timer). It looks like this is just from having to wait for zdrun.py to set up the socket. I have no idea why the socket can't connect.
My setup is FreeBSD 11.1 64-bit and I'm installing via the Unified installer tar (v. 5.0.7)
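To tell the two failure modes apart outside of zdctl, something like the following standalone probe might help. It is only a sketch in Python 3 (not the Python that ships with the installer), it is not part of zdctl, and the function name probe_control_socket and its return labels are made up for illustration. You would pass it whatever control socket path your zdctl configuration uses.

```python
import errno
import socket


def probe_control_socket(path, timeout=1.0):
    """Try to connect to a Unix-domain socket and classify the outcome.

    Returns "connected", "missing" (no socket file yet), "refused"
    (file exists but nothing is listening), or "error: <errno name>".
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return "missing"
        if exc.errno == errno.ECONNREFUSED:
            return "refused"
        # Anything else: report the symbolic errno name if there is one.
        return "error: " + errno.errorcode.get(exc.errno, str(exc.errno))
    else:
        return "connected"
    finally:
        sock.close()
```

A "missing" result would match the "No such file or directory" case (the socket file has not been created yet), while "refused" or another error would point at something different from a simple startup delay.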
# tail -f var/zeoserver/zeoserver.log
2017-11-05T21:03:15 daemonizing the process
2017-11-05T21:03:15 set current directory: '/usr/local/www/plone/zeocluster/parts/zeoserver'
2017-11-05T21:03:15 daemon manager started
2017-11-05T21:03:15 spawned process pid=801
2017-11-05T21:03:15 (801) created PID file '/usr/local/www/plone/zeocluster/var/zeoserver/zeoserver.pid'
2017-11-05T21:03:15 (801) opening storage '1' using FileStorage
2017-11-05T21:03:15 StorageServer created RW with storages: 1:RW:/usr/local/www/plone/zeocluster/var/filestorage/Data.fs
2017-11-05T21:03:16 (801) listening on ('127.0.0.1', 8100)
2017-11-05T21:04:05 new connection ('127.0.0.1', 48198): <ManagedServerConnection ('127.0.0.1', 48198)>
2017-11-05T21:04:05 (127.0.0.1:48198) received handshake 'Z3101'
2017-11-05T21:04:05 new connection ('127.0.0.1', 25996): <ManagedServerConnection ('127.0.0.1', 25996)>
2017-11-05T21:04:05 (127.0.0.1:25996) received handshake 'Z3101'
# tail -f var/client1/event.log
2017-11-05T21:03:47 INFO ZServer HTTP server started at Sun Nov 5 21:03:47 2017
Hostname: 0.0.0.0
Port: 8080
... (snip patching)
2017-11-05T21:04:05 INFO ZEO.ClientStorage zeostorage ClientStorage (pid=806) created RW/normal for storage: '1'
------
2017-11-05T21:04:05 INFO ZEO.cache created temporary cache file '<fdopen>'
------
2017-11-05T21:04:05 INFO ZEO.zrpc.Connection(C) (127.0.0.1:8100) received handshake 'Z3101'
------
2017-11-05T21:04:05 INFO ZEO.ClientStorage zeostorage Testing connection <ManagedClientConnection ('127.0.0.1', 8100)>
------
2017-11-05T21:04:05 INFO ZEO.ClientStorage zeostorage Server authentication protocol None
------
2017-11-05T21:04:05 INFO ZEO.ClientStorage zeostorage Connected to storage: ('localhost', 8100)
------
2017-11-05T21:04:05 INFO ZEO.ClientStorage zeostorage No verification necessary -- empty cache
------
2017-11-05T21:04:32 INFO Plone OpenID system packages not installed, OpenID support not available
------
2017-11-05T21:04:42 INFO PloneFormGen Patching plone.app.portlets ColumnPortletManagerRenderer to not catch Retry exceptions
------
2017-11-05T21:04:42 INFO Zope Ready to handle requests
There's some issue with your plonectl script (not sure what). But to debug Plone startup it helps to run the zeoserver and clients in fg, e.g. bin/zeoserver fg in one terminal and bin/client1 fg in another and watch what happens. See https://docs.plone.org/manage/troubleshooting/basic.html
Sometimes I see a similar behaviour in which the shipped startup script fails but it's because I'd already started the clients and/or server in another terminal.
I would run ps auxwww | grep Python, or maybe grep for part of the path where you've got Plone installed.
I'm now testing with different numbers of cores.
The above was from running on a single core. With four cores I get it all communicating fine (zeoserver and 2 clients). With 2 cores I get the first client most of the time but never get the second. Three cores gets the first client fine, but the second comes awfully close to timing out.
There are occasions where they all seem to come up OK but I get "daemon manager not running" errors for one or both clients (I think from timing out). Rerunning the stop command usually gets it.
This seems to go against the rule of thumb of two instances per core. I'm not too sure why the interprocess communication would be so laggy.
I did notice that earlier but was getting different errors.
I'm keeping a good eye on it now and regularly clear things out with killall -u plone_daemon -m .
Trouble is the logging in the scripts does not go far enough to help with diagnosis in this case. E.g. the send_action function in zdctl.py catches exceptions from the socket function, but returns without logging the message.
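As an illustration of the kind of logging that would help here, below is a hypothetical stand-in written in Python 3. It is not the real send_action from zdctl.py. It follows the connect and shutdown(1) sequence mentioned earlier in the thread; the send and recv details, the logger name, and the function name send_action_logged are my assumptions. The point is only that the failing step and the exception text get logged before returning None.

```python
import logging
import socket

logger = logging.getLogger("zdctl_debug")  # hypothetical logger name


def send_action_logged(sockname, action, timeout=5.0):
    """Send one action line over a Unix socket and return the reply.

    Returns None on any socket error, but logs which step failed and why.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    step = "connect"
    try:
        sock.connect(sockname)
        step = "send"
        sock.sendall(action.encode("utf-8") + b"\n")
        step = "shutdown"
        sock.shutdown(socket.SHUT_WR)  # same as shutdown(1): done writing
        step = "recv"
        chunks = []
        while True:
            data = sock.recv(1000)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8", "replace")
    except OSError as exc:
        # Record the failure instead of swallowing it silently.
        logger.warning("%r failed during %s on %s: %s",
                       action, step, sockname, exc)
        return None
    finally:
        sock.close()
```

With logging.basicConfig() enabled, each failed call would then say whether it died in connect or in shutdown, which is the distinction between the 3 and the 11 exceptions described above.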
This doesn't sound like the right approach. You shouldn't kill Plone (or any) processes like that except under rare circumstances, in this case because it means you must have something trying to start them repeatedly in a way that is not working correctly. You should stop that first, then diagnose why they're failing to start correctly.
I'm not following your comments about the number of cores.
The scripts are pretty simple. Forget using the scripts for now. Instead, start your zeoserver and client[12] manually and ensure they work correctly that way, to eliminate one set of issues.
Like I said in the opening post, I haven't been able to stop it. zdctl isn't communicating with zdrun due to not being able to communicate over the sockets. So the usual mechanism for stopping it is out.
Issues re cores:
1 core - standard behaviour is that server and client 1-2 can't connect with get_status() => daemon manager not running (it may very well be running but zdctl can't tell so that is the error message it spits out)
2 cores - standard behaviour is that the server starts up and most of the time client 1 but not client 2
3 cores - all start up though sometimes there's still some issues
4 cores - pretty stable behaviour
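To put numbers on the per-core differences listed above, a small timing loop along these lines could measure how long a control socket takes to become connectable after a start. This is a rough Python 3 sketch under my own assumptions: wait_for_socket, its deadline and its polling interval are invented for illustration and are not anything zdctl provides.

```python
import socket
import time


def wait_for_socket(path, deadline=30.0, interval=0.5):
    """Poll until a Unix-domain socket accepts a connection.

    Returns the seconds waited on success, or None if the deadline passed.
    """
    start = time.monotonic()
    while True:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return time.monotonic() - start
        except OSError:
            pass  # not ready yet (file missing, connection refused, ...)
        finally:
            sock.close()
        if time.monotonic() - start >= deadline:
            return None
        time.sleep(interval)
```

Running it once per core count right after a start, with the deadline set well above the expected wait, would show whether the socket eventually appears on 1 or 2 cores or never does.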