* Got the code in scan-server to the point where it _should_ now gracefully and automatically detect changes to a server's definition originating from the database (via Striker), from directly editing the on-disk definition file, or from editing via libvirt tools (like virt-manager). This still needs to be tested.
* Updated Server->migrate_virsh() to set 'servers' -> 'server_state' to 'migrating' and to clear it again once the migration completes. Also added support for cold (frozen) versus live migrations.
* Updated Cluster->parse_cib() to check whether a server whose server_state is 'migrating' is actually still migrating and, if it is not, to clear that state. This is needed because scan-server will blindly ignore/skip any migrating server, so if a migration call is interrupted, the state could get stuck.
* Updated the 'servers' database table (and associated Database methods) to add columns for;
** server_ram_in_use - tracking RAM used by a running server
** server_configured_ram - RAM allocated to a running server (used with the above to alert a user and track _currently_ available RAM)
** server_updated_by_user - To be set by Striker tools to indicate when the user made a change that needs to push out to nodes / running server.
** server_boot_time - Tracks the unixtime when the server booted (to track uptime even if the server migrates across nodes).
* Created Get->anvil_name_from_uuid() to easily convert an Anvil! UUID into a name. Also created ->host_uuid_from_name() to translate a host name into a host UUID.
* Created Server->get_runtime() that translates a server name into a process ID and then uses that to determine how long (in seconds) it has been running. This is used when a server transitions from 'shut off' to 'running' to determine exactly when the server booted (current time - runtime).
* Renamed all 'Server->parse_definition' calls that used 'from_memory' to 'from_virsh' to clarify the data source.
* Made scan-hardware smarter about RAM change alerts; it now distinguishes a drop in RAM from an increase and reports the difference.
* Updated scancore to load agent strings on startup so that processing pending alerts works properly.
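As a rough illustration of the boot time calculation described for Server->get_runtime() above (boot time = current time - runtime), here is a minimal sketch. The helper names 'elapsed_seconds' and 'boot_time' are hypothetical and are not the actual Anvil! methods. It assumes a 'ps' that supports the 'etimes' output field, and it uses the script's own PID as a stand-in for a server's process ID.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: ask 'ps' how many seconds a process has been running.
# Returns undef if the PID is invalid or the process can't be found.
sub elapsed_seconds
{
	my ($pid) = @_;
	
	# Only accept plain integers so nothing unexpected reaches the shell.
	return if ((not defined $pid) or ($pid !~ /^\d+$/));
	
	# 'etimes=' prints the elapsed time in seconds, without a header line.
	my $output = `ps -p $pid -o etimes= 2>/dev/null`;
	return if not defined $output;
	return $1 if $output =~ /^\s*(\d+)\s*$/;
	return;
}

# Hypothetical helper: boot time = current time - runtime.
sub boot_time
{
	my ($now, $runtime) = @_;
	return($now - $runtime);
}

# Example: this script's own PID stands in for the server's process.
my $runtime = elapsed_seconds($$);
if (defined $runtime)
{
	print "Booted at (unix time): ".boot_time(time, $runtime)."\n";
}
```

Storing the computed boot time, rather than the runtime, is what lets uptime be tracked when the server's process is restarted on another node during a migration.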
Signed-off-by: Digimer <digimer@alteeve.ca>
@ -271,7 +271,6 @@ DROP FUNCTION history_servers() CASCADE;
DROP TABLE history.servers;
DROP TABLE servers;
-- This stores servers made available to Anvil! systems and DR hosts.
CREATE TABLE servers (
server_uuid uuid not null primary key,
server_name text not null, -- This is the server's name. It can change without re-uploading the server.
@ -280,12 +279,16 @@ CREATE TABLE servers (
server_start_after_server_uuid uuid not null, -- This can be the server_uuid of another server. If set, this server will boot 'server_start_delay' seconds after the referenced server boots. A value of '00000000-0000-0000-0000-000000000000' will tell 'anvil-safe-start' to not boot the server at all. If a server is set not to start, any dependent servers will also stay off.
server_start_delay integer not null default 0, -- See above.
server_host_uuid uuid not null, -- This is the current hosts -> host_uuid for this server. If the server is off, this will be blank.
server_state text not null, -- This is the current state of this server.
server_state text not null, -- This is the current state of this server, as reported by 'virsh list --all' (see: man virsh -> GENERIC COMMANDS -> --list)
server_live_migration boolean not null default TRUE, -- When false, servers will be stopped and then rebooted when a migration is requested. Also, when false, preventative migrations will not happen.
server_pre_migration_file_uuid uuid not null, -- This is set to the files -> file_uuid of a script to run BEFORE migrating a server. If the file isn't found or can't run, the script is ignored.
server_pre_migration_arguments text not null, -- These are arguments to pass to the pre-migration script
server_post_migration_file_uuid uuid not null, -- This is set to the files -> file_uuid of a script to run AFTER migrating a server. If the file isn't found or can't run, the script is ignored.
server_post_migration_arguments text not null, -- These are arguments to pass to the post-migration script
server_ram_in_use numeric not null, -- This is the amount of RAM currently used by the server. If the server is off, then this is the amount of RAM last used when the server was running.
server_configured_ram numeric not null, -- This is the amount of RAM allocated to the server in the on-disk definition file. This should always match the table above, but allows us to track when a user manually updated the allocated RAM in the on-disk definition, but that hasn't yet been picked up by the server
server_updated_by_user numeric not null, -- This is set to a unix timestamp when the user last updated the definition. When set, scan-server will check this value against the age of the definition file on disk. If this is newer and the running definition is different from the database definition, the database definition will be used to update the on-disk definition.
server_boot_time numeric not null, -- This is the unix time (since epoch) when the server booted. It is calculated by checking the 'ps -p <pid> -o etimes=' when a server is seen to be running when it had been last seen as off. If a server that had been running is seen to be off, this is set back to 0.
running - The domain is currently running on a CPU
idle - The domain is idle, and not running or runnable. This can be caused because the domain is waiting on IO (a traditional wait state) or has gone to sleep because there was nothing else for it to do.
paused - The domain has been paused, usually occurring through the administrator running virsh suspend. When in a paused state the domain will still consume allocated resources like memory, but will not be eligible for scheduling by the hypervisor.
in shutdown - The domain is in the process of shutting down, i.e. the guest operating system has been notified and should be in the process of stopping its operations gracefully.
shut off - The domain is not running. Usually this indicates the domain has been shut down completely, or has not been started.
crashed - The domain has crashed, which is always a violent ending. Usually this state can only occur if the domain has been configured not to restart on crash.
pmsuspended - The domain has been suspended by guest power management, e.g. entered into s3 state.
=cut
### TODO: Should we treat 'idle' same as crashed?
if ($state eq "crashed")
@ -1205,8 +1215,7 @@ sub migrate_server
# The actual migration command will involve enabling dual primary, then beginning the migration. The
# virsh call will depend on if we're pushing or pulling. Once the migration completes, regardless of
# success or failure, dual primary will be disabled again.
my $migration_command = "";
my $migrated = 0;
my $migrated = 0;
if ($target)
{
# Can I even connect to the target?
@ -1527,7 +1536,7 @@ sub validate_storage
{
my ($anvil) = @_;
# When checking on a running server, use 'from_memory'.
# When checking on a running server, use 'from_virsh'.
my $server = $anvil->data->{environment}{OCF_RESKEY_name};
my $target = defined $anvil->data->{environment}{OCF_RESKEY_CRM_meta_migrate_target} ? $anvil->data->{environment}{OCF_RESKEY_CRM_meta_migrate_target} : "";
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => {
@ -1536,9 +1545,9 @@ sub validate_storage
}});
my $local_host = $anvil->Get->short_host_name();
my $xml_source = "from_disk";
if ($anvil->data->{server}{$local_host}{$server}{from_memory}{host})
if ($anvil->data->{server}{$local_host}{$server}{from_virsh}{host})
{
$xml_source = "from_memory";
$xml_source = "from_virsh";
}
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => {
<key name="scan_hardware_alert_0012">The amount of RAM (as reported by dmidecode) on the system has changed. If there was a hardware upgrade, then this is safe to ignore. If it was unexpected, a RAM module may have failed.
- New: [#!variable!new!#]
- Old: [#!variable!old!#]
<key name="scan_hardware_alert_0012">The amount of RAM (as reported by dmidecode) on the system has dropped. If it was unexpected, a RAM module may have failed.
- New: ...... [#!variable!new!#]
- Old: ...... [#!variable!old!#]
- Difference: [#!variable!difference!#]
</key>
<key name="scan_hardware_alert_0013">The amount of memory (as reported by /proc/meminfo) on the system has changed. If there was a hardware upgrade, then this is safe to ignore. If it was unexpected, a RAM module may have failed.
- New: [#!variable!new!#]
- Old: [#!variable!old!#]
<key name="scan_hardware_alert_0013">The amount of memory (as reported by /proc/meminfo) on the system has dropped.
- New: ...... [#!variable!new!#]
- Old: ...... [#!variable!old!#]
- Difference: [#!variable!difference!#]
</key>
<key name="scan_hardware_alert_0014">The amount of memory (as reported by /proc/meminfo) on the system has changed. If there was a hardware upgrade, then this is safe to ignore. If it was unexpected, a RAM module may have failed.
- New: [#!variable!new!#]
- Old: [#!variable!old!#]
<key name="scan_hardware_alert_0014">The amount of swap (as reported by /proc/meminfo) on the system has dropped.
- New: ...... [#!variable!new!#]
- Old: ...... [#!variable!old!#]
- Difference: [#!variable!difference!#]
</key>
<key name="scan_hardware_alert_0015">The ID LED (identification light) state has changed;
- New: [#!variable!new!#]
@ -88,11 +91,11 @@ The differences are:
- New: [#!variable!new!#]
- Old: [#!variable!old!#]
</key>
<key name="scan_hardware_alert_0018">The amount of available memory (as reported by /proc/meminfo) has changed (this is common and expected);
<key name="scan_hardware_alert_0018">The amount of free memory (as reported by /proc/meminfo) has changed (this is common and expected);
- New: [#!variable!new!#]
- Old: [#!variable!old!#]
</key>
<key name="scan_hardware_alert_0019">The amount of available swap space (as reported by /proc/meminfo) has changed (this is common and expected);
<key name="scan_hardware_alert_0019">The amount of free swap space (as reported by /proc/meminfo) has changed (this is common and expected);
- New: [#!variable!new!#]
- Old: [#!variable!old!#]
</key>
@ -146,6 +149,11 @@ If the RAM is being updated, this alert will clear once this node has been upgra
- Peer's RAM: [#!variable!peer_ram!#]
</key>
<key name="scan_hardware_alert_0028">The amount of RAM on both nodes is back to being the same. They both have: [#!variable!ram!#] now.</key>
<key name="scan_hardware_alert_0029">The amount of RAM (as reported by dmidecode) on the system has increased. Likely the system was upgraded.
my $anvil_uuid = $anvil->Cluster->get_anvil_uuid({debug => 2});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { anvil_uuid => $anvil_uuid }});
my ($output, $return_code) = $anvil->System->call({shell_call => $anvil->data->{path}{exe}{virsh}." list --all", source => $THIS_FILE, line => __LINE__});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => { output => $output, return_code => $return_code }});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { output => $output, return_code => $return_code }});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { line => $line }});
### TODO: We may want to handle crashed / idle servers.
=cut
* Server states;
running - The domain is currently running on a CPU
idle - The domain is idle, and not running or runnable. This can be caused because the domain is waiting on IO (a traditional wait state) or has gone to sleep because there was nothing else for it to do.
paused - The domain has been paused, usually occurring through the administrator running virsh suspend. When in a paused state the domain will still consume allocated resources like memory, but will not be eligible for scheduling by the hypervisor.
in shutdown - The domain is in the process of shutting down, i.e. the guest operating system has been notified and should be in the process of stopping its operations gracefully.
shut off - The domain is not running. Usually this indicates the domain has been shut down completely, or has not been started.
crashed - The domain has crashed, which is always a violent ending. Usually this state can only occur if the domain has been configured not to restart on crash.
pmsuspended - The domain has been suspended by guest power management, e.g. entered into s3 state.
- Ones we set:
migrating - Set and cleared by Server->migrate_virsh();
DELETED - Marks a server as no longer existing
=cut
if ($line =~ /^\d+ (.*?) (.*)$/)
{
my $server = $1;
my $state = $2;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => {
server => $server,
'state' => $state,
my $server_name = $1;
my $server_state = $2;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => {
my $seen_state = $anvil->data->{'scan-server'}{server}{$server_name}{server_state};
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => $debug, list => { seen_state => $seen_state }});
# If it needs to be undefined, do so now.
if (($anvil->data->{'scan-server'}{server}{$server_name}{undefine}) && ($anvil->data->{scancore}{'scan-server'}{'auto-undefine'}))
{
# Something went wrong.
next;
# It's off, and should be undefined. Dump the definition to archive and undefine.
my $backup_file = $anvil->data->{path}{directories}{shared}{archives}."/".$server_name.".pre-undefine.".$anvil->Get->date_and_time({file_name => 1}).".xml";
my $shell_call = $anvil->data->{path}{exe}{virsh}." dumpxml --inactive ".$server_name." > ".$backup_file;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => $debug, list => { shell_call => $shell_call }});
my ($output, $return_code) = $anvil->System->call({shell_call => $anvil->data->{path}{exe}{virsh}." list --all", source => $THIS_FILE, line => __LINE__});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { output => $output, return_code => $return_code }});
# This reads the definition file from the database and parses it.
sub get_and_parse_database_definition
{
my ($anvil, $server_name, $server_uuid) = @_;
my $database_definition = $anvil->data->{server_definitions}{server_definition_server_uuid}{$server_uuid}{server_definition_xml};
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { database_definition => $database_definition }});
$anvil->Server->parse_definition({
debug => 2,
server_name => $server_name,
source => "from_db",
definition => $database_definition,
});
return($database_definition);
}
# This reads the definition file for a given server and parses it, returning the definition XML.
sub get_and_parse_disk_definition
{
my ($anvil, $server_name) = @_;
my $xml_file = $anvil->data->{path}{directories}{shared}{definitions}."/".$server_name.".xml";
my $on_disk_definition = $anvil->Storage->read_file({file => $xml_file});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { on_disk_definition => $on_disk_definition }});
$anvil->Server->parse_definition({
debug => 2,
server_name => $server_name,
source => "from_disk",
definition => $on_disk_definition,
});
return($on_disk_definition);
}
# This dumps the definition for a given server and parses it, returning the definition XML.
sub get_and_parse_virsh_definition
{
my ($anvil, $server_name) = @_;
my ($virsh_definition, $return_code) = $anvil->System->call({shell_call => $anvil->data->{path}{exe}{virsh}." dumpxml --inactive ".$server_name, source => $THIS_FILE, line => __LINE__});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { output => $virsh_definition, return_code => $return_code }});
$anvil->Server->parse_definition({
debug => 2,
server_name => $server_name,
source => "from_virsh",
definition => $virsh_definition,
});
return($virsh_definition);
}
# This defines the server using the on-disk XML file. Effectively, this updates the 'inactive' XML definition
# in virsh.
sub redefine_server_from_disk
{
my ($anvil, $server_name) = @_;
my $xml_file = $anvil->data->{path}{directories}{shared}{definitions}."/".$server_name.".xml";
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { xml_file => $xml_file }});
# Push the new definition into virsh. It likely won't take effect until a reboot, but it will update
# the 'inactive' definition immediately.
my ($output, $return_code) = $anvil->System->call({shell_call => $anvil->data->{path}{exe}{virsh}." define ".$xml_file, source => $THIS_FILE, line => __LINE__});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { output => $output, return_code => $return_code }});
# Now undefine the server again so it disappears when stopped.
@ -15,19 +15,62 @@ NOTE: All string keys MUST be prefixed with the agent name! ie: 'scan_server_log
<!-- Alert entries -->
<key name="scan_server_alert_0001">
The definition file for the server: [#!variable!server!#] has changed relative to the database record. This is usually harmless.
The differences are:
====
#!variable!difference!#
====
The definition for the server: [#!variable!server!#] was changed via Striker.
- Pushing the new version to the on-disk definition file [#!variable!definition_file!#]
- Also updating definition used by the hypervisor.
- Note: You may need to reboot or power cycle for the changes to take effect.
- The changes are:
==[ Disk ]============
#!variable!disk_difference!#
==[ Hypervisor ]======
#!variable!virsh_difference!#
==[ New Definition ]==
#!variable!new_difference!#
======================
</key>
<key name="scan_server_alert_0002">
The definition file for the server: [#!variable!server!#] has changed relative to disk record. This is usually harmless.
The differences are:
The on-disk definition for the server: [#!variable!server!#] was directly edited.
- Pushing the new version to the database definition.
- Also updating definition used by the hypervisor.
- Note: You may need to reboot or power cycle for the changes to take effect.
- The changes are:
==[ Database ]========
#!variable!db_difference!#
==[ Hypervisor ]======
#!variable!virsh_difference!#
==[ New Definition ]==
#!variable!new_difference!#
======================
</key>
<key name="scan_server_alert_0003">
The definition for the server: [#!variable!server!#] was edited outside of the Anvil! system. This usually means it was updated using Virtual Machine Manager (or another libvirt tool like the virsh shell).
- Pushing the new version to the on-disk definition file [#!variable!definition_file!#]
- Pushing the new version to the database definition as well.
- Note: You may need to reboot or power cycle for the changes to take effect.
- The changes are:
==[ Disk ]============
#!variable!disk_difference!#
==[ Database ]========
#!variable!db_difference!#
==[ New Definition ]==
#!variable!new_difference!#
======================
</key>
<key name="scan_server_alert_0004">The active definition for the server: [#!variable!server!#] now matches the definition on disk. Changes should now be applied.</key>
server_start_after_server_uuid uuid not null, -- This can be the server_uuid of another server. If set, this server will boot 'server_start_delay' seconds after the referenced server boots. A value of '00000000-0000-0000-0000-000000000000' will tell 'anvil-safe-start' to not boot the server at all. If a server is set not to start, any dependent servers will also stay off.
server_start_delay integer not null default 0, -- See above.
server_host_uuid uuid not null, -- This is the current hosts -> host_uuid for this server. If the server is off, this will be blank.
server_state text not null, -- This is the current state of this server.
server_live_migration boolean not null default TRUE, -- When false, servers will be stopped and then rebooted when a migration is requested. Also, when false, preventative migrations will not happen.
server_state text not null, -- This is the current state of this server, as reported by 'virsh list --all' (see: man virsh -> GENERIC COMMANDS -> --list)
server_live_migration boolean not null default TRUE, -- When false, servers will be frozen for a migration, instead of being migrated while the server is running. During a cold migration, the server will be unresponsive, so connections to it could time out. However, by being frozen the migration will complete faster.
server_pre_migration_file_uuid uuid not null, -- This is set to the files -> file_uuid of a script to run BEFORE migrating a server. If the file isn't found or can't run, the script is ignored.
server_pre_migration_arguments text not null, -- These are arguments to pass to the pre-migration script
server_post_migration_file_uuid uuid not null, -- This is set to the files -> file_uuid of a script to run AFTER migrating a server. If the file isn't found or can't run, the script is ignored.
server_post_migration_arguments text not null, -- These are arguments to pass to the post-migration script
server_ram_in_use numeric not null, -- This is the amount of RAM currently used by the server. If the server is off, then this is the amount of RAM last used when the server was running.
server_configured_ram numeric not null, -- This is the amount of RAM allocated to the server in the on-disk definition file. This should always match the table above, but allows us to track when a user manually updated the allocated RAM in the on-disk definition, but that hasn't yet been picked up by the server
server_updated_by_user numeric not null, -- This is set to a unix timestamp when the user last updated the definition (via striker). When set, scan-server will check this value against the age of the definition file on disk. If this is newer, the on-disk definition will be updated. On the host with the server (if any), the new definition will be loaded into virsh as well.
server_boot_time numeric not null, -- This is the unix time (since epoch) when the server booted. It is calculated by checking the 'ps -p <pid> -o etimes=' when a server is seen to be running when it had been last seen as off. If a server that had been running is seen to be off, this is set back to 0.