The major change in this commit is reworking striker-update-cluster so that it no longer expects anvil-daemon to be running on target machines. Similarly, the tools it calls had to be able to work when the Striker DBs were not available. This accounts for cases where the Striker dashboards have been updated and the schema has changed, preventing the not-yet-updated DR hosts and subnodes from using the DB. To do this, anvil-safe-stop, anvil-update-system, and anvil-shutdown-server were updated to support the new --no-db switch, which tells them to run without the database being available.
* Updated Server->shutdown_virsh() to work without a database connection.
* Updated System->reboot_needed() to store/read from a cache file when the database is not available.
* Updated anvil-safe-start to remove the old --enable/disable/status switches, now that we use the anvil-safe-start.service systemd unit.
* Reworked anvil-safe-stop to work without a database connection, and to work on DR hosts.
* Updated anvil-special-operations to add new tasks, but it's likely these new tasks aren't needed and will be removed very shortly.
* Added/updated multiple man pages.
Signed-off-by: digimer <mkelly@alteeve.ca>
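
The cache-file fallback described for System->reboot_needed() can be pictured with the minimal sketch below. It is not the code from this commit: the sub signature, the hash reference standing in for the database, and the caller-supplied cache file path are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sketch: keep a "reboot needed" flag in the database when one is
# reachable, and in a local cache file when it is not. The 'db' hash reference
# stands in for a real database handle; 'cache_file' is chosen by the caller.
sub reboot_needed
{
    my ($args)     = @_;
    my $db         = $args->{db};         # undef means "no database connection"
    my $cache_file = $args->{cache_file};
    my $set        = $args->{set};        # undef = read only, 0 or 1 = store

    if (defined $set)
    {
        my $value = $set ? 1 : 0;
        if ($db)
        {
            $db->{reboot_needed} = $value;
        }
        else
        {
            open(my $fh, '>', $cache_file) or die "Failed to write: [".$cache_file."]: ".$!."\n";
            print {$fh} $value, "\n";
            close($fh);
        }
    }

    # Read the flag back from whichever store is available.
    if ($db)
    {
        return $db->{reboot_needed} ? 1 : 0;
    }
    return 0 if not -e $cache_file;

    open(my $fh, '<', $cache_file) or die "Failed to read: [".$cache_file."]: ".$!."\n";
    my $value = <$fh>;
    close($fh);
    $value = "" if not defined $value;
    chomp $value;
    return $value eq "1" ? 1 : 0;
}
```

In this sketch a missing cache file is treated as "no reboot needed", which is an assumption rather than documented behaviour.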
@ -23,10 +23,10 @@ When logging, record sensitive data, like passwords.
Set the log level to 1, 2 or 3 respectively. Be aware that level 3 generates a significant amount of log data.
.SS "Commands:"
.TP
\fB\-\-job-uuid\fR <uuid>
\fB\-\-job\-uuid\fR <uuid>
This is set to the job UUID when the request to boot is coming from a database job. When set, the referenced job will be updated and marked as complete / failed when the run completes.
.TP
\fB\-\-no-wait\fR
\fB\-\-no\-wait\fR
This controls whether the request to boot the server waits for the server to actually boot up before returning. Normally, the program will check every couple of seconds to see if the server has actually booted before returning. Setting this tells the program to return as soon as the request to boot the server has been passed on to the resource manager.
.TP
\fB\-\-server\fR <all|name|uuid>
@ -34,7 +34,7 @@ This is either 'all', the name, or server UUID (as set in the definition XML) of
.TP
When set to 'all', all servers assigned to the local sub-cluster are booted. Servers on other Anvil! nodes are not started.
.TP
\fB\-\-server-uuid\fR <uuid>
\fB\-\-server\-uuid\fR <uuid>
This is the server UUID of the server to boot. Generally this isn't needed, except when two servers somehow share the same name. This should not be possible, but this option exists in case it happens anyway.
anvil-safe-start \- This program safely joins an Anvil! subnode to a node.
.SH SYNOPSIS
.B anvil-safe-start
\fI\,<command> \/\fR[\fI\,options\/\fR]
.SH DESCRIPTION
This program will safely join an Anvil! subnode to an Anvil! node. If both nodes are starting, it will communicate with the peer, once available. This includes booting hosted servers.
.TP
NOTE: This tool runs at boot (when enabled) via the 'anvil-safe-start.service' systemd unit.
.TP
\-?, \-h, \fB\-\-help\fR
Show this man page.
.TP
\fB\-\-log-secure\fR
When logging, record sensitive data, like passwords.
.TP
\-v, \-vv, \-vvv
Set the log level to 1, 2 or 3 respectively. Be aware that level 3 generates a significant amount of log data.
.SS "Commands:"
.TP
NOTE: This tool takes no specific commands.
.IP
.SH AUTHOR
Written by Madison Kelly, Alteeve staff and the Anvil! project contributors.
anvil-safe-stop \- This program safely stops a subnode in an Anvil! node, or a DR host
.SH SYNOPSIS
.B anvil-safe-stop
\fI\,<command> \/\fR[\fI\,options\/\fR]
.SH DESCRIPTION
This program will safely withdraw a subnode from an Anvil! node, and safely stop DR hosts. Optionally, it can also power off the machine.
.TP
\-?, \-h, \fB\-\-help\fR
Show this man page.
.TP
\fB\-\-log-secure\fR
When logging, record sensitive data, like passwords.
.TP
\-v, \-vv, \-vvv
Set the log level to 1, 2 or 3 respectively. Be aware that level 3 generates a significant amount of log data.
.SS "Commands:"
.TP
\fB\-\-no\-db\fR
.TP
This tells this program to run without connecting to the Striker databases. This should only be used if the Strikers are not available (either they're off, or they've been updated and this host hasn't been, and can't use them until this host is also updated).
.TP
NOTE: This is generally only used by 'striker-update-cluster'.
.TP
\fB\-\-poweroff\fR, \fB\-\-power\-off\fR
.TP
By default, the host will remain powered on when this program exits. Using this switch will have the host power off once the host is safely stopped.
.TP
\fB\-\-stop\-reason\fR <user, power, thermal>
.TP
Optionally used to set the 'system::stop_reason' for this host. Valid values are 'user' (default), 'power' and 'thermal'. If set to 'user', ScanCore will not turn this host back on. If 'power', ScanCore will reboot the host once the power under the host looks safe again. If 'thermal', ScanCore will reboot the host once temperatures are back within safe levels.
.TP
\fB\-\-stop\-servers\fR
.TP
By default, on Anvil! subnodes, any servers running on this host will be migrated to the peer subnode. If the peer isn't available, this program will refuse to stop. Using this switch will instead tell the system to stop all servers running on this host.
.TP
NOTE: On DR hosts, any running servers are always stopped.
.IP
.SH AUTHOR
Written by Madison Kelly, Alteeve staff and the Anvil! project contributors.
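
The server-handling and stop-reason rules in the man page above can be summarised as a small decision function. This is a hedged sketch only: the sub name, the 'node' and 'dr' host type labels, and the returned hash layout are hypothetical and are not taken from anvil-safe-stop itself.

```perl
use strict;
use warnings;

# Hypothetical sketch of the documented rules: DR hosts always stop their
# servers; subnodes migrate servers to the peer unless --stop-servers is set,
# and refuse to stop when the peer is unavailable.
sub plan_safe_stop
{
    my ($args)       = @_;
    my $host_type    = $args->{host_type};              # 'node' (subnode) or 'dr'; labels are assumptions
    my $stop_servers = $args->{stop_servers} ? 1 : 0;
    my $peer_up      = $args->{peer_available} ? 1 : 0;
    my $stop_reason  = $args->{stop_reason} // "";
       $stop_reason  = "user" if $stop_reason eq "";    # 'user' is the documented default

    my %valid = map { $_ => 1 } qw(user power thermal);
    return { error => "invalid stop reason: [".$stop_reason."]" } if not $valid{$stop_reason};

    my $servers;
    if    ($host_type eq "dr") { $servers = "stop"; }
    elsif ($stop_servers)      { $servers = "stop"; }
    elsif ($peer_up)           { $servers = "migrate"; }
    else                       { return { error => "peer unavailable, refusing to stop" }; }

    return { servers => $servers, stop_reason => $stop_reason };
}
```

For example, `plan_safe_stop({host_type => "node", peer_available => 1})` yields a plan to migrate servers with the stop reason 'user'.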
anvil-shutdown-server \- This program shuts down servers hosted on the Anvil! cluster.
.SH SYNOPSIS
.B anvil-shutdown-server
\fI\,<command> \/\fR[\fI\,options\/\fR]
.SH DESCRIPTION
This program shuts down a server that is running on an Anvil! node or DR host. It can optionally stop all servers.
.TP
\-?, \-h, \fB\-\-help\fR
Show this man page.
.TP
\fB\-\-log-secure\fR
When logging, record sensitive data, like passwords.
.TP
\-v, \-vv, \-vvv
Set the log level to 1, 2 or 3 respectively. Be aware that level 3 generates a significant amount of log data.
.SS "Commands:"
.TP
\fB\-\-no\-db\fR
.TP
This tells the program to run without connecting to any databases. This is used mainly when the host is being taken down as part of a cluster-wide upgrade.
.TP
\fB\-\-no\-wait\fR
.TP
This tells the program to call the shut down, but not wait for the server to actually stop. By default, when shutting down one specific server, this program will wait for the server to be off before it returns.
.TP
\fB\-\-server\fR {<name>,all}
.TP
This is the name of the server to shut down. Optionally, this can be 'all' to shut down all servers on this host.
.TP
\fB\-\-server\-uuid\fR <uuid>
.TP
This is the server UUID of the server to shut down. NOTE: This can not be used with \fB\-\-no\-db\fR.
.TP
\fB\-\-wait\fR
.TP
This tells the program to wait for the server(s) to stop before returning. By default, when '\fB\-\-server all\fR' is used, the shutdown will NOT wait. Using this switch makes the shutdowns sequential.
.IP
.SH AUTHOR
Written by Madison Kelly, Alteeve staff and the Anvil! project contributors.
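
The switch interactions documented above (no UUID lookups without a database, and different wait defaults for one server versus 'all') can be sketched as a validation step. The sub name and return layout are hypothetical, and letting --no-wait win when both --wait and --no-wait are given is an assumption, since the man page does not say.

```perl
use strict;
use warnings;

# Hypothetical sketch of validating anvil-shutdown-server style switches.
sub check_shutdown_switches
{
    my ($switches)  = @_;
    my $server      = $switches->{server}        // "";
    my $server_uuid = $switches->{'server-uuid'} // "";
    my $no_db       = $switches->{'no-db'} ? 1 : 0;

    # One of --server or --server-uuid is required.
    if (($server eq "") && ($server_uuid eq ""))
    {
        return { error => "no server specified" };
    }
    # Resolving a UUID needs the database, so it can't be combined with --no-db.
    if (($server_uuid ne "") && ($no_db))
    {
        return { error => "--server-uuid is not available without a database" };
    }

    # Default: wait for a single server, don't wait for 'all'.
    my $wait = $server eq "all" ? 0 : 1;
    $wait = 1 if $switches->{wait};
    $wait = 0 if $switches->{'no-wait'};    # Assumption: --no-wait wins if both are set.

    return { wait => $wait, use_db => $no_db ? 0 : 1 };
}
```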
@ -25,6 +25,18 @@ Set the log level to 1, 2 or 3 respectively. Be aware that level 3 generates a s
This is the task being requested. Current options are:
.IP refresh-drbd-resource
This requires \fB\-\-resource <new name>\fR, and will call 'drbdadm adjust <resource>' as a background task and then return immediately. This is required when adding a new volume to an existing resource as 'drbdadm adjust <res>' will hold until it is called on all active DRBD nodes. This blocks the caller after the first remote host call.
.TP
.IP safe-stop
This implies \fB\-\-no\-db\fR, and will call 'anvil-safe-stop' as a background task. This is designed to ensure that subnodes leave the subcluster, and that DR hosts shut down their servers. This is done when the host is not yet updated, and the Striker dashboards have been upgraded with a new database schema.
.TP
.IP update-system
This implies \fB\-\-no\-db\fR, and will call 'anvil-update-system' as a background task. This allows remote machines to call for the update without risk of timing out the network connection.
.TP
Note: \fB\-\-no\-reboot\fR, \fB\-\-clear\-cache\fR, and \fB\-\-reboot\fR are all available here and passed to 'anvil-update-system'. See its man page for usage information.
.TP
\fB\-\-no\-db\fR
.TP
This tells the program to run without a database connection.
.IP
.SH AUTHOR
Written by Madison Kelly, Alteeve staff and the Anvil! project contributors.
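
As a rough illustration of the 'update-system' task, the sketch below assembles the call that would be handed to 'anvil-update-system', forwarding only the three pass-through switches named above. The sub name is hypothetical, and forwarding \-\-no\-db to the called program is an assumption drawn from "implies \-\-no\-db", not something the man page states outright.

```perl
use strict;
use warnings;

# Hypothetical sketch: build the argument list for the background call.
sub build_update_system_call
{
    my ($switches) = @_;

    # Assumption: the implied --no-db is passed on to anvil-update-system.
    my @call = ("anvil-update-system", "--no-db");

    # Only these switches are documented as being passed through.
    foreach my $switch ("no-reboot", "clear-cache", "reboot")
    {
        push @call, "--".$switch if $switches->{$switch};
    }
    return @call;
}
```

Returning a list rather than a single string lets the caller hand it to `system` or `exec` in list form, which avoids shell quoting problems.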
@ -29,6 +29,12 @@ Set the log level to 1, 2 or 3 respectively. Be aware that level 3 generates a s
.TP
This will force the dnf cache to be cleared before the OS update is started. This slows the update down a bit, but ensures the latest updates are installed.
.TP
\fB\-\-no\-db\fR
.TP
This tells the update tool to run without a database connection. This is needed if the Striker dashboards are already updated, and the local system may no longer be able to talk to them.
.TP
NOTE: After the OS update is complete, an attempt will be made to connect to the database(s). This allows for registering a request to reboot if needed.
.TP
\fB\-\-no\-reboot\fR
.TP
If the kernel is updated, the system will normally be rebooted. This switch prevents the reboot from occurring.
@ -54,6 +54,10 @@ See \fB\-\-reboot\fR for rebooting if anything is updated.
Normally, the system will only reboot if the kernel is updated. If this is used, and if any packages are updated, then a reboot will be performed. This is recommended in most cases.
.TP
Must be used with \fB\-\-reboot\-self\fR to reboot the local system. Otherwise, it is passed along to target machines via their anvil-update-system calls.
.TP
\fB\-\-timeout\fR <seconds>
.TP
When given, if a system update doesn't complete in this amount of time, error out and abort the update. By default, updates will wait forever.
.IP
.SH AUTHOR
Written by Madison Kelly, Alteeve staff and the Anvil! project contributors.
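
The \-\-timeout behaviour ("error out after this many seconds, otherwise wait forever") can be sketched as a polling loop with an optional deadline. This is illustrative only: the sub name, the 'is_done' callback and the 'interval' parameter are hypothetical and are not part of striker-update-cluster.

```perl
use strict;
use warnings;

# Hypothetical sketch: poll until the update finishes or the optional timeout
# expires. Returns 1 when done and 0 on timeout.
sub wait_for_update
{
    my ($args)   = @_;
    my $is_done  = $args->{is_done};         # Code reference, returns true once the update is finished
    my $timeout  = $args->{timeout};         # Seconds; undef or 0 means wait forever
    my $interval = $args->{interval} // 5;   # Seconds between checks

    my $wait_until = $timeout ? time + $timeout : 0;
    while (1)
    {
        return 1 if $is_done->();
        return 0 if $wait_until and time > $wait_until;
        sleep $interval;
    }
}
```

A caller would typically print an error and exit non-zero when this returns 0, much as the wait loop in the Perl hunk further down does on its own timeout.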
@ -366,12 +366,12 @@ The attempt to start the cluster appears to have failed. The return code '0' was
<key name="error_0257"><![CDATA[No server specified to boot. Please use '--server <name|all>' or '--server-uuid <UUID>'.]]></key>
<key name="error_0258">This host is not a node or DR, unable to boot servers.</key>
<key name="error_0259">The definition file: [#!variable!definition_file!#] doesn't exist, unable to boot the server.</key>
<key name="error_0260">This host is not in an Anvil! system, aborting.</key>
<key name="error_0260">This subnode is not in an Anvil! node yet, aborting.</key>
<key name="error_0261">The definition file: [#!variable!definition_file!#] exists, but the server: [#!variable!server!#] does not appear to be in the cluster. Unable to boot it.</key>
<key name="error_0262">The server: [#!variable!server!#] status is: [#!variable!status!#]. We can only boot servers that are off, not booting it.</key>
<key name="error_0263"><![CDATA[No server specified to shut down. Please use '--server <name|all>' or '--server-uuid <UUID>'.]]></key>
<key name="error_0264">This host is not a node or DR, unable to shut down servers.</key>
<key name="error_0265">This feature isn't enabled on DR hosts yet.</key>
<key name="error_0265">Specifying a server to shut down using a UUID is not available when there are no DB connections.</key>
<key name="error_0266">The server: [#!variable!server!#] does not appear to be in the cluster. Unable to shut it down.</key>
<key name="error_0267">The server: [#!variable!server!#] failed to boot. The reason why should be in the logs.</key>
<key name="error_0268">The server: [#!variable!server!#] failed to shut down. The reason why should be in the logs.</key>
@ -1562,7 +1562,7 @@ Note: This is a permanent action! If you protect this server again later, a full
<key name="job_0467">Update the base operating system.</key>
<key name="job_0468">This uses 'dnf' to do an OS update on the host. If this is run on a node, 'anvil-safe-stop' will be called to withdraw the subnode from the node's cluster. If the peer subnode is also offline, hosted servers will be shut down.</key>
<key name="job_0469">Update beginning. Verifying all known machines are accessible...</key>
<key name="job_0470"></key>
<key name="job_0470">This is a DR host, no migration possible.</key>
@ -2254,7 +2254,7 @@ The file: [#!variable!file!#] needs to be updated. The difference is:
<key name="log_0595">Updated the lvm.conf file to add the filter: [#!variable!filter!#] to prevent LVM from seeing the DRBD devices as LVM devices.</key>
<key name="log_0596">The host: [#!variable!host_name!#] last updated the database: [#!variable!difference!#] seconds ago, skipping power checks.</key>
<key name="log_0597">The host: [#!variable!host_name!#] has no entries in the 'updated' table, so ScanCore has likely never run. Skipping this host for now.</key>
<key name="log_0598">This host is not a node, this program isn't designed to run here.</key>
<key name="log_0598">This host is not an Anvil! sub node, this program isn't designed to run here.</key>
<key name="log_0599">Enabled 'anvil-safe-start' locally on this node.</key>
<key name="log_0600">Enabled 'anvil-safe-start' on both nodes in this Anvil! system.</key>
<key name="log_0601">Disabled 'anvil-safe-start' locally on this node.</key>
@ -2407,6 +2407,7 @@ The file: [#!variable!file!#] needs to be updated. The difference is:
<key name="log_0740">Running the scan-agent: [#!variable!agent!#] now to ensure that the database has an updated view of resources.</key>
<key name="log_0741">I was about to start: [#!variable!command!#] with the job UUID: [#!variable!this_job_uuid!#]. However, another job using the same command with the job UUID: [#!variable!other_job_uuid!#] is already running. To avoid race conditions, only one process with a given command is run at the same time.</key>
<key name="log_0742">The job with the command: [#!variable!command!#] and job UUID: [#!variable!job_uuid!#] is restarting.</key>
<key name="log_0743">Will run without connecting to the databases. Some features will be unavailable.</key>
<!-- Messages for users (less technical than log entries), though sometimes used for logs, too. -->
<key name="message_0001">The host name: [#!variable!target!#] does not resolve to an IP address.</key>
@ -2920,6 +2921,7 @@ Proceed? [y/N]</key>
<key name="message_0321">Removing the old drbd-kmod RPMs now.</key>
<key name="message_0322">Installing the latest DRBD kmod RPM now.</key>
<key name="message_0323">Retrying the OS update now.</key>
<key name="message_0324">Update almost complete. Picked this job up after a '--no-db' run, and now we have database access again.</key>
<!-- Translate names (protocols, etc) -->
<key name="name_0001">Normal Password</key><!-- none in mail-server -->
@ -29,19 +29,16 @@ if (($running_directory =~ /^\./) && ($ENV{PWD}))
$| = 1;
my $anvil = Anvil::Tools->new();
$anvil->data->{switches}{'job-uuid'} = "";
$anvil->data->{switches}{'poweroff'} = "";
$anvil->data->{switches}{'power-off'} = ""; # By default, the node is withdrawn. With this switch, the node will power off as well.
$anvil->data->{switches}{'stop-reason'} = ""; # Optionally used to set 'system::stop_reason' reason for this host. Valid values are 'user', 'power' and 'thermal'.
$anvil->data->{switches}{'stop-servers'} = ""; # Default behaviour is to migrate servers to the peer, if the peer is up. This overrides that and forces hosted servers to shut down.
$anvil->Get->switches;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => {
my $say_time = $anvil->Get->date_and_time({time_only => 1});
if ($pacemaker_up)
{
print "[ Warning ] - The job has not been picked up yet. Is 'anvil-daemon' running on: [".$short_host_name."]?\n";
print "[ Note ] - [".$say_time."] - The subnode is still in the cluster.\n";
}
else
{
print "[ Note ] - [".$anvil->Get->date_and_time({time_only => 1})."] - The job progress is: [".$anvil->data->{jobs}{job_progress}."], continuing to wait.\n";
print "[ Note ] - [".$say_time."] - The subnode is no longer in the cluster, good.\n";
}
foreach my $resource (sort {$a cmp $b} keys %{$anvil->data->{drbd}{status}{$short_host_name}{resource}})
{
print "[ Note ] - [".$say_time."] - The resource: [".$resource."] is still up.\n";
}
$next_log = time + 60;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { next_log => $next_log }});
print "- Will check again shortly\n";
}
sleep 5;
if (time > $wait_until)
{
# Timeout.
print "[ Error ] - Timed out while waiting for the subnode: [".$short_host_name."] to stop all DRBD resources and leave the cluster. Aborting the update.\n";
$anvil->nice_exit({exit_code => 1});
}
sleep 10;
}
}
# Record the start time so that we can be sure the subnode has rebooted (uptime is
# less than the current time minus this start time), if the host reboots as part of
# the update.
my $reboot_time = time;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => {
reboot_time => $reboot_time,
short_host_name => $short_host_name,
}});
# Do the OS update.
print "- Beginning OS update of: [".$short_host_name."]\n";