Commit Graph

18 Commits

Author SHA1 Message Date
Digimer
8abb5b46e0 * Added support for setting per-agent log-level and log secure values in amvil.conf.
* Moved the check for an agent being disabled into ScanCore->agent_startup()

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-18 23:07:15 -04:00
Digimer
15d8309095 This commit adds scan agent DB connection info caching to help minimize the number of unnecessary DB resync checks that happen.
* Created ScanCore->agent_shutdown() that writes out the time the scan agent last ran, and how many databases were available when it last ran.
* Updated ScanCore->agent_startup() to read the the last run data created above.
* Updated Database->connect() to set 'sys::database::last_db_count' to the scan agent's recorded last DB count.
* Updated all agents to call ScanCore->agent_shutdown() at the end of their run.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-11 21:14:35 -04:00
Digimer
daca6c887b * This contains a fairly major change to how time stamps are handled. All INSERT and UPDATE calls now generate a new timestamp via Database->refresh_timestamp, instead of using 'sys::database::timestamp'. This was done in responce to finding a bug where tables in a database differed in both counts of public and private schemas (ip_addresses table, specifically) that failed to resync because the timestamps were re-used too often.
* WIP - Continuing work on the new anvil-manage-server tool.
* Updated Database->get_anvils() to load information on the files available on each Anvil! system.
* Updated Database->insert_or_update_network_interfaces() to no longer take the 'timestamp' parameter.
* Removed all logging from Database->refresh_timestamp() to speed it up, given how often it will be called now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 15:23:15 -04:00
Digimer
6abe06f125 The theme of these commits is improving DB responsiveness.
* Created Database->_age_out_data() to delete records from the database that are old enough to no longer be useful. This is designed to significantly reduce the size of the database, allowing a better focus on performance.
* Changed Database->connect() to default to NOT check for resync, reworking the old 'no_resync' to 'check_for_resync', so that resync checks happen on demard, instead of by default.
* Updated get_tables_from_schema() to now allow 'schema_file' to be set to 'all', which then loads the schema files of all scan agents as well as the core anvil schema file. Fixed a bug where commented out tables were being counted.
* Re-enabled triggering resyncs on 'last_updated' differences.
* Fixed a bug in scan-ipmitool where the history_id column in history.scan_ipmitool_value was incorrect.
* Created a new tool called striker-show-db-counts that shows the number of records in all public and history schema tables for all databases.
* Updated anvil-update-states to detect when a libvirtd NAT'ed bridge exists and to delete it when found.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-29 23:34:22 -04:00
Digimer
ad4a1ecc78 * Increaded the scancore agent run timeout to 60 seconds.
* Updated anvil-safe-start to start DRBD resources when the peer's DRBD resourcs is 'Connecting',
* Updated fence_pacemaker to more intelligently check the list of host names related to an IP address when looking for the peer host name

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-15 00:12:43 -04:00
Digimer
473c728117 * Updated scan-ipmitool to use 'jump' thresholds for a common sensor name where duplicates with the hex address appended may exist.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-13 23:40:50 -04:00
Digimer
44864ce321 * Updated Database->resync_databases() to set a default schema of 'public'. Also fixed a bug where, when the difference in record numbers between two line was > 999, it would not trigger a resync.
* Updated the scan agent timeout to 60 seconds. Also made the scan agent exit code log entries more helpful.
* Updated System->collect_ipmi_data() to now better handle duplicate sensor names. Now, instead of simply appending an integer, we find the hex address and use that in the sensor name when duplicates exist. This solves the problem of the sensor names not being consistently shown in order.
* Fixed message bugs (bad variable insertions) in scan-apc-pdu and scan-apc-ups.
* Fixed schema procedure bugs in the 'temperature' and 'ip_address' tables where the columns were in bad order, causing constanty updates.

Incomplete work;
* Create the shell of 'anvil-manage-storage', but virtually no logic exists in it yet.
* Started work on anvil-safe-start to deal with an issue where DRBD resources don't start when a server is running on a peer.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-13 23:27:38 -04:00
Digimer
f833c311ba * To address issues with scancore debugging, we needed a tool to purge old anvils and hosts from the database. The 'test.pl' in this commit contains the new logic that will be merged into tools/striker-purge-host shortly.
* Created Database->find_host_uuid_columns() and ->_find_column() to create a list of tables and column names in the proper order to allow deletion of foreign keys to that deeply nested primary keys can be deleted. Specifically, this was meant for hosts -> host_uuid and anvils -> anvil_uuid, though it should work for other tables.
* Updated html/jquery-ui-1.12.1/package.json to address CVE-2020-7729
* Fixed a bug in the temperature table's history procedure where temperature_weight wasn't being copied.
* Updated anvil-provision-server to support '--anvil' that can take either the anvil-uuid or anvil-name.
* Updated anvil-safe-stop to default the stop-reason to 'user'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-08 02:02:46 -04:00
Digimer
ca7052dd53 The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :)
* Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively.
* Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf.
* Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state.
* Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested.
* Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times.
* Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan.
* Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-30 22:58:01 -04:00
Digimer
59b867cc25 * Updated DRBD->gather_data() to check if drbdadm exists before trying to call it to avoid scary errors in the logs. Also moved some strings that pulled from the scan-drbd agent into the main words file.
* Fixed a bug in ScanCore->agent_startup() where a (thankfully broken) check to append tables to the 'sys::database::check_tables' would cause an infinite loop as both were pointers to the same anonymous array.
* Fixed a bug in scan-ipmitool where the scan_ipmitool_variables table didn't use a host_uuid reference, causing resyncs of that table to sync for all hosts and cause DB errors when the scan_ipmitool record from another host wasn't sync'ed yet.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-31 01:36:11 -04:00
Digimer
a1eede2757 * Added new jumps to scan-ipmitool to make it less likely to trigger a jump alert for 'Temp{1..4}' sensors.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 18:54:12 -05:00
Digimer
6009590352 * Fixed a bug in scan-apc-ups where changes in the transfer reason were not being recorded.
* Cleaned up a log of logging to reduce the amount of log entries when running at log level 1.
* Bumped the scan-ipmitool default 'jump' range to 10c.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 18:11:29 -05:00
Digimer
8d0f873912 * Updated scan-storcli to check if a MegaRAID controlled exists and neither storcli64 or perccli64 exist. If a controller is found but no RPM is installed, it checks to see if the host is Dell and then decides to try and install perccli or storcli.
* Reworked scan-ipimitool so that on nodes and dr hosts, it only scans itself. On strikers, it scans all hosts found in active Anvil! systems with a host_ipmi entry. `
* For all agents, reduced log verbosity to not push too much noise into anvil.log while scancore is running in the background.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-05 23:34:51 -05:00
Digimer
1d03a386d3 * Created Database->get_bridges() that, surprise, loads data from the 'bridges' table.
* Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers.
* Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file.
* Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors.
* Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums.
* When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory.
* Moved the 'scancore' daemon's 'load_agent_strings' to 'Words'
* Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly.
* Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed.
* Fixed a bug in scan-storcli where a while loop was broken, preventing execution.
* Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 01:10:23 -05:00
Digimer
96bc1f0b78 * Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call.
* Created Database->get_power() that loads data from the special 'power' table.
* Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables.
* Updated Database->get_anvils() to record the machine types when recording host information.
* Updated Database->get_hosts_info() to also load the 'host_ipmi' column.
* Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available.
* Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon.
* Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned.
* Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete.
** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful.
* Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic.
* Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored.
* Updated scancore to clear the 'system::stop_reason' variable.
* Added missing packages to striker-manage-install-target.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 16:00:35 -05:00
Digimer
cda51e562d * Finished porting scan-hpacucli, the last M2 scan agent!
* Updated Database->insert_or_update_temperature() to accespt a 'delete' parameter (which, surprise, deletes a record).
* Updated (but not yet tested) the RPM .spec to require that 'core' and the other three packages are required to be the same version.
* Updated scan-ipmitool and scan-storcli scan agents to now delete temperature data belonging to lost sensors.
* Fixed tools/striker-manage-install-target to add multiple missing packages.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-25 19:06:21 -05:00
Digimer
1c00060d6e * Finished porting scan-storcli! This was the largest scan agent to migrate to M3.
* Updated Alert->register() to take message variables using the 'variables' parameter.
* Added a 'cache' parameter to Database->insert_or_update_health() and ->insert_or_update_temperature(). When set, the SQL UPDATE/INSERT calls and pushed into the array reference set in 'cache'. This is to allow performance improvements when processing a large amount of sensor/device data.
* Updated Log->variables() to take a 'prefix' parameter that, when set, will prefix the string to each variable line.
* Updated scan-ipmitool to use Database->insert_or_update_health() and ->insert_or_update_temperature().

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-23 11:18:00 -05:00
Digimer
51de6c721f * Created scan-ipmitool (needs more testing but seems to work now). Logical straight port from M2.
* Fixed a bug if Get->free_memory() where host_type was still being called from the old System->host_type method.
* Added global support for '--log-secure' and '--log-db' switches to enable logging of secure data and DB transactions, respectively.
* Created Database->get_tables_from_schema() that parses a SQL schema file and returns an array reference of tables found, in the order they were found.
* Updated ScanCore->agent_startup() to no longer require manually defining database tables, using Database->get_tables_from_schema() when not manually set.. Updated all existing agents to use this.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-17 01:48:45 -05:00