anvil/tools/scancore

480 lines
19 KiB
Plaintext
Raw Normal View History

#!/usr/bin/perl
#
# This is the main ScanCore program. It is managed by systemd
#
# Examples;
#
# Exit codes;
# 0 = Normal exit.
# 1 = Not running as root.
# 2 =
#
# TODO:
# - Decide if it's worth having a separate ScanCore.log file or just feed into anvil.log.
# - Examine limits in: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#LimitCPU=
# - Use 'nvme-cli' to write a scan-nvme scan agent, can get thermal and wear data
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
# - Record how long a server's migration took in the past, and use that to determine which node to evacuate
# during load shed. Also, track how long it takes for servers to stop to determine when to initiate a total
# shutdown.
# - Add a '--silence-alerts --anvil <name>' and '--restore-alerts --anvil <name>' to temporarily
# disable/re-enable alerts. This is to allow for quiet maintenance without stopping scancore itself.
#
# - Disable resync checks by default, and have a resync check happen on scancore startup, anvil-daemon
# startup, and during configuration.
# - Delete records from temperature and power tables that are older than 48 hours, checking periodically in
# scancore on strikers only. Delete jobs records that are 100% complete for 48 hours or more.
#
use strict;
use warnings;
use Anvil::Tools;
use Data::Dumper;
# Disable buffering
$| = 1;
# Prevent a discrepency between UID/GID and EUID/EGID from throwing an error.
$< = $>;
$( = $);
my $THIS_FILE = ($0 =~ /^.*\/(.*)$/)[0];
my $running_directory = ($0 =~ /^(.*?)\/$THIS_FILE$/)[0];
if (($running_directory =~ /^\./) && ($ENV{PWD}))
{
$running_directory =~ s/^\./$ENV{PWD}/;
}
my $anvil = Anvil::Tools->new();
# Make sure we're running as 'root'
# $< == real UID, $> == effective UID
if (($< != 0) && ($> != 0))
{
# Not root
print $anvil->Words->string({key => "error_0005"})."\n";
$anvil->nice_exit({exit_code => 1});
}
$anvil->data->{scancore} = {
threshold => {
warning_temperature => 5,
warning_critical => 5,
},
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
power => {
safe_boot_percentage => 35,
},
};
$anvil->Storage->read_config();
# Read switches
$anvil->data->{switches}{purge} = "";
$anvil->data->{switches}{'run-once'} = "";
$anvil->Get->switches;
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 1, key => "log_0115", variables => { program => $THIS_FILE }});
# If purging, also set 'run-once'.
if ($anvil->data->{switches}{purge})
{
$anvil->data->{switches}{'run-once'} = 1;
}
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => {
"switches::purge" => $anvil->data->{switches}{purge},
"switches::run-once" => $anvil->data->{switches}{'run-once'},
}});
# Calculate my sum so that we can exit if it changes later.
* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 06:10:23 +00:00
$anvil->Storage->record_md5sums();
# Connect to DBs.
wait_for_database($anvil);
# If we're not configured, sleep.
wait_until_configured($anvil);
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
# Startup tasks.
startup_tasks($anvil);
* Got the code in scan-server to the point where it _should_ now gracefully and automatically detect changes to a server's definition originatin from the database (via Striker), directly editing the on-disk definition file, or editing via libvirt tools (like virt-manager). Still needs to be tested though. * Updated Server->migrate_virsh() to set 'servers' -> 'server_state' to 'migrating' and clear it again once the migation completes. Also added support for cold (frozen) versus live migrations. * Updated Cluster->parse_cib() to check if a server with the server_state set to 'migrating' isn't actually migrating anymore and, if not, to clear that state. This is needed as scan-server will blindly ignore/skip any migrating server, and if a migration call is interrupted, the state could get stuck. * Updated the 'servers' database table (and associated Database methods) to add columns for; ** server_ram_in_use - tracking RAM used by a running server ** server_configured_ram - RAM allocated to a running server (used with the above to alert a user and track _currently_ available RAM) ** server_updated_by_user - To be set by Striker tools to indicate when the user made a change that needs to push out to nodes / running server. ** server_boot_time - Tracks the unixtime when the server booted (to track uptime even if the server migrates across nodes). * Created Get->anvil_name_from_uuid() to easily convert an Anvil! UUID into a name. Also created ->host_uuid_from_name() to translate a host name into a host UUID. * Created Server->get_runtime() that translates a server name into a process ID and then uses that to determine how long (in seconds) it has been running. This is used when a server transitions from 'shut off' to 'running' to determine exactly when the server booted (current time - runtime). * Renamed all 'Server->parse_definition' calls that used 'from_memory' to 'from_virsh' to clarify the data source. * Made scan-hardware smarter about RAM change alerts. * Updated scancore to load agent strings on startup so that processing pending alerts works properly. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-02 06:13:34 +00:00
# Load the strings from all the agents we know about before we process alerts so that we can include their
# messages in any emails we're going to send.
* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 06:10:23 +00:00
$anvil->Words->load_agent_strings();
* Got the code in scan-server to the point where it _should_ now gracefully and automatically detect changes to a server's definition originatin from the database (via Striker), directly editing the on-disk definition file, or editing via libvirt tools (like virt-manager). Still needs to be tested though. * Updated Server->migrate_virsh() to set 'servers' -> 'server_state' to 'migrating' and clear it again once the migation completes. Also added support for cold (frozen) versus live migrations. * Updated Cluster->parse_cib() to check if a server with the server_state set to 'migrating' isn't actually migrating anymore and, if not, to clear that state. This is needed as scan-server will blindly ignore/skip any migrating server, and if a migration call is interrupted, the state could get stuck. * Updated the 'servers' database table (and associated Database methods) to add columns for; ** server_ram_in_use - tracking RAM used by a running server ** server_configured_ram - RAM allocated to a running server (used with the above to alert a user and track _currently_ available RAM) ** server_updated_by_user - To be set by Striker tools to indicate when the user made a change that needs to push out to nodes / running server. ** server_boot_time - Tracks the unixtime when the server booted (to track uptime even if the server migrates across nodes). * Created Get->anvil_name_from_uuid() to easily convert an Anvil! UUID into a name. Also created ->host_uuid_from_name() to translate a host name into a host UUID. * Created Server->get_runtime() that translates a server name into a process ID and then uses that to determine how long (in seconds) it has been running. This is used when a server transitions from 'shut off' to 'running' to determine exactly when the server booted (current time - runtime). * Renamed all 'Server->parse_definition' calls that used 'from_memory' to 'from_virsh' to clarify the data source. * Made scan-hardware smarter about RAM change alerts. * Updated scancore to load agent strings on startup so that processing pending alerts works properly. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-02 06:13:34 +00:00
# Send a startup message immediately
$anvil->Email->check_config();
* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 06:10:23 +00:00
$anvil->Alert->register({alert_level => "notice", message => "message_0179", set_by => $THIS_FILE, sort_position => 0});
$anvil->Email->send_alerts();
# Disconnect. We'll reconnect inside the loop
$anvil->Database->disconnect();
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 3, key => "log_0203"});
# The main loop
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 2, key => "log_0248"});
while(1)
{
# Do the various pre-run tasks.
my $start_time = time;
prepare_for_run($anvil);
The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :) * Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively. * Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf. * Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state. * Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested. * Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times. * Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan. * Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-01 02:58:01 +00:00
# Set our sleep time
my $run_interval = 60;
The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :) * Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively. * Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf. * Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state. * Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested. * Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times. * Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan. * Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-01 02:58:01 +00:00
if ((exists $anvil->data->{scancore}{timing}{run_interval}) && ($anvil->data->{scancore}{timing}{run_interval} =~ /^\d+$/))
{
$run_interval = $anvil->data->{scancore}{timing}{run_interval};
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { run_interval => $run_interval }});
The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :) * Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively. * Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf. * Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state. * Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested. * Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times. * Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan. * Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-01 02:58:01 +00:00
}
# If we're in maintenance mode, do nothing.
my $maintenance_mode = $anvil->System->maintenance_mode();
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { maintenance_mode => $maintenance_mode }});
The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :) * Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively. * Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf. * Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state. * Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested. * Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times. * Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan. * Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-01 02:58:01 +00:00
if ($maintenance_mode)
{
# Sleep and skip.
sleep($run_interval);
next;
}
# Do we have at least one database?
my $agent_runtime = 0;
if ($anvil->data->{sys}{database}{connections})
{
# Run the normal tasks
$anvil->ScanCore->call_scan_agents();
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
# Do post-scan analysis.
$anvil->ScanCore->post_scan_analysis({debug => 2});
}
else
{
# No databases available, we can't do anything this run. Sleep for a couple of seconds and
# then try again.
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 0, key => "log_0202"});
my $db_retry_interval = 2;
if ((exists $anvil->data->{scancore}{timing}{db_retry_interval}) && ($anvil->data->{scancore}{timing}{db_retry_interval} =~ /^\d+$/))
{
$db_retry_interval = $anvil->data->{scancore}{timing}{db_retry_interval};
}
sleep($db_retry_interval);
next;
}
* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 06:10:23 +00:00
# Send alerts.
$anvil->Email->send_alerts({debug => 2});
# Exit if 'run-once' selected.
if ($anvil->data->{switches}{'run-once'})
{
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 0, key => "message_0055"});
$anvil->Alert->register({set_by => $THIS_FILE, alert_level => "notice", message => "message_0055"});
$anvil->Email->send_alerts();
$anvil->nice_exit({exit_code => 0});
}
# Clean up
cleanup_after_run($anvil);
# Check how much RAM we're using.
check_ram($anvil);
# Sleep until it's time to run again.
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 1, key => "log_0249", variables => {
run_interval => $run_interval,
runtime => (time - $start_time),
}});
sleep($run_interval);
# In case something has changed, exit.
exit_if_sums_changed($anvil);
}
$anvil->nice_exit({exit_code => 0});
#############################################################################################################
# Functions #
#############################################################################################################
# If we're using too much ram, send an alert and exit.
sub check_ram
{
my ($anvil) = @_;
# Problem 0 == ok, 1 == too much ram used, 2 == no pid found
my ($problem, $ram_used) = $anvil->System->check_ram_use({program => $THIS_FILE});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => {
problem => $problem,
ram_used => $anvil->Convert->add_commas({number => $ram_used})." (".$anvil->Convert->bytes_to_human_readable({'bytes' => $ram_used}).")",
}});
if ($problem)
{
# Send an alert and exit.
$anvil->Alert->register({alert_level => "notice", message => "error_0357", variables => {
program => $THIS_FILE,
ram_used => $anvil->Convert->bytes_to_human_readable({'bytes' => $ram_used}),
ram_used_bytes => $anvil->Convert->add_commas({number => $ram_used}),
}, set_by => $THIS_FILE, sort_position => 0});
$anvil->Email->send_alerts();
# Log the same
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, level => 0, priority => "err", key => "error_0357", variables => {
program => $THIS_FILE,
ram_used => $anvil->Convert->bytes_to_human_readable({'bytes' => $ram_used}),
ram_used_bytes => $anvil->Convert->add_commas({number => $ram_used}),
}});
# Exit with RC0 so that systemctl restarts
$anvil->nice_exit({exit_code => 0});
}
return(0);
}
# This cleans things up after a scan run has completed.
sub cleanup_after_run
{
my ($anvil) = @_;
# Delete what we know about existing scan agents so that the next scan freshens the data.
foreach my $agent_name (sort {$a cmp $b} keys %{$anvil->data->{scancore}{agent}})
{
# Remove the agent's words file data.
my $agent_words = $anvil->data->{scancore}{agent}{$agent_name}.".xml";
delete $anvil->data->{words}{$agent_words};
}
# Disconnect from the database(s) and sleep now.
$anvil->Database->disconnect();
* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 06:10:23 +00:00
exit_if_sums_changed($anvil);
# Now delete all the remaining agent data.
delete $anvil->data->{scancore}{agent};
}
# This checks to see if any files on disk have changed and, if so, exits.
sub exit_if_sums_changed
{
my ($anvil) = @_;
* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 06:10:23 +00:00
my $changed = $anvil->Storage->check_md5sums();
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { changed => $changed }});
if ($changed)
{
# NOTE: We exit with '0' to prevent systemctl from showing a scary red message.
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 0, priority => "alert", key => "message_0014"});
$anvil->nice_exit({exit_code => 0});
}
return(0);
}
# Handle pre-run tasks.
sub prepare_for_run
{
my ($anvil) = @_;
# Reload defaults, re-read the config and then connect to the database(s)
$anvil->_set_paths();
$anvil->_set_defaults();
$anvil->Storage->read_config();
* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 06:10:23 +00:00
$anvil->Get->switches();
$anvil->Words->read();
$anvil->Database->connect({check_for_resync => 1});
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 2, key => "log_0132"});
# See if the mail server needs to be updated.
$anvil->Email->check_config;
return(0);
}
# This loops until it can connect to at least one database.
sub wait_for_database
{
my ($anvil) = @_;
# Don't check for resync here as we may need to load agent schemas. We'll check for resync in the
# main loop.
$anvil->Database->connect({check_for_resync => 0});
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, level => 2, key => "log_0132"});
if (not $anvil->data->{sys}{database}{connections})
{
# No databases, sleep until one comes online.
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 1, key => "log_0244"});
until($anvil->data->{sys}{database}{connections})
{
# Disconnect, Sleep, then check again.
$anvil->Database->disconnect();
sleep 60;
$anvil->_set_paths();
$anvil->_set_defaults();
$anvil->Storage->read_config();
$anvil->Database->connect({check_for_resync => 0});
if ($anvil->data->{sys}{database}{connections})
{
# We're good
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 1, key => "log_0245"});
}
else
{
# Not yet...
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 2, key => "log_0244"});
}
# In case something has changed, exit.
exit_if_sums_changed($anvil);
}
}
return(0);
}
# Wait until the local system has been configured.
sub wait_until_configured
{
my ($anvil) = @_;
my $configured = $anvil->System->check_if_configured;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => { configured => $configured }});
if (not $configured)
{
# Sleep for a minute and check again.
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, level => 1, key => "log_0246"});
until ($configured)
{
# Sleep, then check.
sleep 60;
$configured = $anvil->System->check_if_configured;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 2, list => { configured => $configured }});
if ($configured)
{
# We're good
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 1, key => "log_0247"});
}
else
{
# Not yet...
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 2, key => "log_0246"});
}
# In case something has changed, exit.
exit_if_sums_changed($anvil);
}
}
return(0);
}
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
# Things we need to do at startup.
sub startup_tasks
{
my ($anvil) = @_;
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 2, key => "log_0572"});
* THe get_cpu endpoint was completed. * The get_mmeory endpoint was completed. * The get_replicated_storage endpoint was completed, though it requires testing and likely has issues. To prepare for the get_status endpoint work, I needed to update ScanCore and modules to track the host_status. This commit contains the work needed for this. * Updated ScanCore->post_scan_analysis_striker() to use configured fence devices (except PDUs) to check if a target host is off or on, in there is no host_ipmi interface. In all cases, if a machine can be confirmed on or off, the host_status is now updated. * To support the above fence based power checks, updated scan-cluster to store the on-disk CIB in the new scan_cluster -> scan_cluster_cib colume. * Updated ScanCore->parse_cib() to map stonith primitive IDs to fence agents. Updated ->parse_crm_mon() to not call if the executable doesn't exist to avoid unhelpful error messages in the logs when called from a Striker. * Update DRBD->gather_data() to get the size data from /sys/block/drbd<minor>/size' x '/sys/block/drbd<minor>/queue/logical_block_size so it works when a device is Secondary (and can't be promoted). * Updated Database->get_hosts_info() to record the short host name as well as the stored host name. Created ->update_host_status() as a wrapper to ->insert_or_update_hosts() that only updates the host status. * Updated anvil-join-anvil to disabled ksm and ksmtuned daemons. * Updated scancore and anvil-daemon to set the host_status to 'online' on startup. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-10 00:51:29 +00:00
# Make sure all agents schemas are loaded so that resyncs where a table on one DB doesn't exist on
# another, causing a fault.
$anvil->ScanCore->_scan_directory({directory => $anvil->data->{path}{directories}{scan_agents}});
foreach my $scan_agent (sort {$a cmp $b} keys %{$anvil->data->{scancore}{agent}})
{
my $schema_file = $anvil->data->{path}{directories}{scan_agents}."/".$scan_agent."/".$scan_agent.".sql";
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => {
scan_agent => $scan_agent,
schema_file => $schema_file,
}});
if (-e $schema_file)
{
my $tables = $anvil->Database->get_tables_from_schema({debug => 3, schema_file => $schema_file});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => { tables => $tables }});
my $table_count = @{$tables};
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => { table_count => $table_count }});
# It's possible that some agents don't have a database (or use core database tables only)
if (@{$tables} > 0)
{
$anvil->Database->check_agent_data({
agent => $scan_agent,
tables => $tables,
});
}
}
}
* THe get_cpu endpoint was completed. * The get_mmeory endpoint was completed. * The get_replicated_storage endpoint was completed, though it requires testing and likely has issues. To prepare for the get_status endpoint work, I needed to update ScanCore and modules to track the host_status. This commit contains the work needed for this. * Updated ScanCore->post_scan_analysis_striker() to use configured fence devices (except PDUs) to check if a target host is off or on, in there is no host_ipmi interface. In all cases, if a machine can be confirmed on or off, the host_status is now updated. * To support the above fence based power checks, updated scan-cluster to store the on-disk CIB in the new scan_cluster -> scan_cluster_cib colume. * Updated ScanCore->parse_cib() to map stonith primitive IDs to fence agents. Updated ->parse_crm_mon() to not call if the executable doesn't exist to avoid unhelpful error messages in the logs when called from a Striker. * Update DRBD->gather_data() to get the size data from /sys/block/drbd<minor>/size' x '/sys/block/drbd<minor>/queue/logical_block_size so it works when a device is Secondary (and can't be promoted). * Updated Database->get_hosts_info() to record the short host name as well as the stored host name. Created ->update_host_status() as a wrapper to ->insert_or_update_hosts() that only updates the host status. * Updated anvil-join-anvil to disabled ksm and ksmtuned daemons. * Updated scancore and anvil-daemon to set the host_status to 'online' on startup. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-10 00:51:29 +00:00
# Update our status
$anvil->Database->get_hosts({debug => 3});
my $host_uuid = $anvil->Get->host_uuid();
$anvil->Database->insert_or_update_hosts({
debug => 2,
* THe get_cpu endpoint was completed. * The get_mmeory endpoint was completed. * The get_replicated_storage endpoint was completed, though it requires testing and likely has issues. To prepare for the get_status endpoint work, I needed to update ScanCore and modules to track the host_status. This commit contains the work needed for this. * Updated ScanCore->post_scan_analysis_striker() to use configured fence devices (except PDUs) to check if a target host is off or on, in there is no host_ipmi interface. In all cases, if a machine can be confirmed on or off, the host_status is now updated. * To support the above fence based power checks, updated scan-cluster to store the on-disk CIB in the new scan_cluster -> scan_cluster_cib colume. * Updated ScanCore->parse_cib() to map stonith primitive IDs to fence agents. Updated ->parse_crm_mon() to not call if the executable doesn't exist to avoid unhelpful error messages in the logs when called from a Striker. * Update DRBD->gather_data() to get the size data from /sys/block/drbd<minor>/size' x '/sys/block/drbd<minor>/queue/logical_block_size so it works when a device is Secondary (and can't be promoted). * Updated Database->get_hosts_info() to record the short host name as well as the stored host name. Created ->update_host_status() as a wrapper to ->insert_or_update_hosts() that only updates the host status. * Updated anvil-join-anvil to disabled ksm and ksmtuned daemons. * Updated scancore and anvil-daemon to set the host_status to 'online' on startup. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-10 00:51:29 +00:00
host_ipmi => $anvil->data->{hosts}{host_uuid}{$host_uuid}{host_ipmi},
host_key => $anvil->data->{hosts}{host_uuid}{$host_uuid}{host_key},
host_name => $anvil->data->{hosts}{host_uuid}{$host_uuid}{host_name},
host_type => $anvil->data->{hosts}{host_uuid}{$host_uuid}{host_type},
host_uuid => $host_uuid,
host_status => "online",
});
# Make sure our stop reason is cleared.
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
my $variable_uuid = $anvil->Database->insert_or_update_variables({
variable_name => 'system::stop_reason',
variable_value => '',
variable_default => '',
variable_description => 'striker_0279',
variable_section => 'system',
The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :) * Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively. * Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf. * Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state. * Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested. * Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times. * Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan. * Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting. Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-01 02:58:01 +00:00
variable_source_uuid => $anvil->Get->host_uuid,
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
variable_source_table => 'hosts',
});
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => { variable_uuid => $variable_uuid }});
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
# If this is a node and we've been up for less than ten minutes, call anvil-safe-start as a
# background process. It will exit if it is disabled.
my $host_type = $anvil->Get->host_type();
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => { host_type => $host_type }});
if ($host_type eq "node")
{
my $uptime = $anvil->Get->uptime;
$anvil->Log->variables({source => $THIS_FILE, line => __LINE__, level => 3, list => { uptime => $uptime }});
if ($uptime < 600)
{
# Run it as a background task
my $shell_call = $anvil->data->{path}{exe}{'anvil-safe-start'}.$anvil->Log->switches;
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 1, key => "log_0210", variables => { command => $shell_call }});
$anvil->System->call({shell_call => $shell_call, background => 1});
}
else
{
# Log that we've been up too long to auto-start the cluster.
my $say_uptime = $anvil->Convert->time({'time' => $uptime});
$anvil->Log->entry({source => $THIS_FILE, line => __LINE__, 'print' => 1, level => 1, key => "log_0620", variables => { uptime => $say_uptime }});
}
}
elsif ($host_type eq "striker")
{
# We're a striker, so we're going to check for / remove transient database records on tables
# that always grow (temperature, power, etc) and whose data loses value as it ages.
}
* Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call. * Created Database->get_power() that loads data from the special 'power' table. * Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables. * Updated Database->get_anvils() to record the machine types when recording host information. * Updated Database->get_hosts_info() to also load the 'host_ipmi' column. * Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available. * Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon. * Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned. * Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete. ** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful. * Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic. * Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored. * Updated scancore to clear the 'system::stop_reason' variable. * Added missing packages to striker-manage-install-target. Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 21:00:35 +00:00
return(0);
}
=pod
"I'm sorry, but I don't want to be an emperor. That's not my business. I don't want to rule or conquer anyone. I should like to help everyone if possible - Jew, Gentile - black man - white.
We all want to help one another. Human beings are like that. We want to live by each other's happiness - not by each other's misery. We don't want to hate and despise one another. In this world there's room for everyone and the good earth is rich and can provide for everyone.
The way of life can be free and beautiful, but we have lost the way. Greed has poisoned men's souls - has barricaded the world with hate - has goose-stepped us into misery and bloodshed. We have developed speed, but we have shut ourselves in. Machinery that gives abundance has left us in want. Our knowledge has made us cynical; our cleverness, hard and unkind. We think too much and feel too little. More than machinery we need humanity. More than cleverness, we need kindness and gentleness. Without these qualities, life will be violent and all will be lost.
The aeroplane and the radio have brought us closer together. The very nature of these inventions cries out for the goodness in man - cries for universal brotherhood - for the unity of us all. Even now my voice is reaching millions throughout the world - millions of despairing men, women, and little children - victims of a system that makes men torture and imprison innocent people. To those who can hear me, I say: 'Do not despair.' The misery that is now upon us is but the passing of greed - the bitterness of men who fear the way of human progress. The hate of men will pass, and dictators die, and the power they took from the people will return to the people. And so long as men die, liberty will never perish.
Soldiers! Don't give yourselves to brutes - men who despise you and enslave you - who regiment your lives - tell you what to do - what to think and what to feel! Who drill you - diet you - treat you like cattle, use you as cannon fodder. Don't give yourselves to these unnatural men - machine men with machine minds and machine hearts! You are not machines! You are not cattle! You are men! You have the love of humanity in your hearts. You don't hate, only the unloved hate - the unloved and the unnatural!
Soldiers! Don't fight for slavery! Fight for liberty! In the seventeenth chapter of St Luke, it is written the kingdom of God is within man not one man nor a group of men, but in all men! In you! You, the people, have the power - the power to create machines. The power to create happiness! You, the people, have the power to make this life free and beautiful - to make this life a wonderful adventure. Then in the name of democracy - let us use that power - let us all unite. Let us fight for a new world - a decent world that will give men a chance to work - that will give youth a future and old age a security.
By the promise of these things, brutes have risen to power. But they lie! They do not fulfil that promise. They never will! Dictators free themselves but they enslave the people. Now let us fight to fulfil that promise! Let us fight to free the world - to do away with national barriers - to do away with greed, with hate and intolerance. Let us fight for a world of reason - a world where science and progress will lead to all men's happiness. Soldiers, in the name of democracy, let us unite!"
=pod