Commit Graph

189 Commits

Author SHA1 Message Date
Digimer
056f8edf48 * Moved the scan-drbd function 'gather_drbd_date' and moved it into DRBD->gather_date().
* Finished DRBD->get_next_resource() that returns the next available minor and the next free TCP port (with two free ports available after it).
* Created Storage->get_storage_group_details() that pulls together the LVM, storage group members and storage groups into one block of data.
* Made more progress on tools/anvil-provision-server. It now gets up to the point of creating LVs, creating DRBD resource files, loading them, creating metadata and up'ing the resource. It doesn't yet (successfully) force a new resource to primary.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-15 03:10:16 -05:00
Digimer
162f4913b1 * Started work on DRBD->get_next_resource(), that will eventually return the next free DRBD minor and TCP port numbers.
* Fixed a bug in scan-drbd that was still looking for the scan_drbd_resource_uuid from the resource config file. Also added a check to see if 'scan-drbd::resource_status' directory exists before trying to read the files in it.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-13 01:31:44 -05:00
Digimer
8f823d3b86 * Switched out the static list of core table to use the array generated by Database->get_tables_from_schema().
* Fixed bugs around creating and filtering storage groups.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-28 21:59:32 -05:00
Digimer
4f33eeef2e * This commit introduces a new concept called "Storage Groups". Given that LVs back DRBD resources in M3, there needed to be a way to determine which VGs would be used when creating the backing DRBD resources, and how large those LVs could be (based on the minimum free space of the VGs in a group). A new, as yet incomplete Get->available_resources() method will handle determining what resources are available to grow exist
ing or create new servers.
* Created Database->insert_or_update_storage_groups() and ->insert_or_update_storage_group_members() to manage the new associated tables. Together, these tables create storage groups and track the VGs that are members of the group.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-28 01:22:21 -05:00
Digimer
1d03a386d3 * Created Database->get_bridges() that, surprise, loads data from the 'bridges' table.
* Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers.
* Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file.
* Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors.
* Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums.
* When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory.
* Moved the 'scancore' daemon's 'load_agent_strings' to 'Words'
* Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly.
* Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed.
* Fixed a bug in scan-storcli where a while loop was broken, preventing execution.
* Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 01:10:23 -05:00
Digimer
96bc1f0b78 * Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call.
* Created Database->get_power() that loads data from the special 'power' table.
* Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables.
* Updated Database->get_anvils() to record the machine types when recording host information.
* Updated Database->get_hosts_info() to also load the 'host_ipmi' column.
* Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available.
* Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon.
* Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned.
* Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete.
** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful.
* Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic.
* Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored.
* Updated scancore to clear the 'system::stop_reason' variable.
* Added missing packages to striker-manage-install-target.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 16:00:35 -05:00
Digimer
d8fc660a8b * Finished scan-drbd. Removed the code to track resources by UUID via resource config file. The UUID tracking it fairly easy messed up by manual edits/rsyncs, and the code to handle updating the UUID in the resource config files was complicated enough that it negated any benefit of avoiding resources being marked as removed / newly created on name change. Left the DRBD->resource_uuid() method though, in case we decide to revisit this.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-06 18:11:26 -05:00
Digimer
20b75310b7 * Created DRBD->resource_uuid() which reads (or adds) the 'scan_drbd_uuid' to the resource configuration file, which will allow us to track resources when they're renamed.
* Got scan-drbd's read_last_scan() and process_drbd() functions done (but untested so far).

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-01 23:59:54 -05:00
Digimer
7bf4c25f9b * Cleaned up the parsing of collected data in preperation for insertion/updating of the DB.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-01 04:25:15 -05:00
Digimer
4e843baa17 * Started work on parsing the gathered DRBD data.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-01 03:37:43 -05:00
Digimer
4d5ec72026 * Started work on the scan-drbd scan agent. Got it to the point that it is gathering needed data.
* Fixed a bug in Database->check_agent_data() where the list of tables wasn't passed in, and thus the table list wasn't then passed on to Database->_find_behind_databases().
* Started work on a new method called Storage->parse_lsblk().

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-30 04:18:16 -05:00
Digimer
cda51e562d * Finished porting scan-hpacucli, the last M2 scan agent!
* Updated Database->insert_or_update_temperature() to accespt a 'delete' parameter (which, surprise, deletes a record).
* Updated (but not yet tested) the RPM .spec to require that 'core' and the other three packages are required to be the same version.
* Updated scan-ipmitool and scan-storcli scan agents to now delete temperature data belonging to lost sensors.
* Fixed tools/striker-manage-install-target to add multiple missing packages.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-25 19:06:21 -05:00
Digimer
1c00060d6e * Finished porting scan-storcli! This was the largest scan agent to migrate to M3.
* Updated Alert->register() to take message variables using the 'variables' parameter.
* Added a 'cache' parameter to Database->insert_or_update_health() and ->insert_or_update_temperature(). When set, the SQL UPDATE/INSERT calls and pushed into the array reference set in 'cache'. This is to allow performance improvements when processing a large amount of sensor/device data.
* Updated Log->variables() to take a 'prefix' parameter that, when set, will prefix the string to each variable line.
* Updated scan-ipmitool to use Database->insert_or_update_health() and ->insert_or_update_temperature().

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-23 11:18:00 -05:00
Digimer
51de6c721f * Created scan-ipmitool (needs more testing but seems to work now). Logical straight port from M2.
* Fixed a bug if Get->free_memory() where host_type was still being called from the old System->host_type method.
* Added global support for '--log-secure' and '--log-db' switches to enable logging of secure data and DB transactions, respectively.
* Created Database->get_tables_from_schema() that parses a SQL schema file and returns an array reference of tables found, in the order they were found.
* Updated ScanCore->agent_startup() to no longer require manually defining database tables, using Database->get_tables_from_schema() when not manually set.. Updated all existing agents to use this.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-17 01:48:45 -05:00
Digimer
9c92b6bbb8 * Created Database->get_tables_from_schema() that returns an ordered list of tables in a database schema file, removing the need for scan agents to manually provide a list for agent start-up / DB purge.
* Updated ScanCore->agent_startup() to use Database->get_tables_from_schema() when the 'tables' parameter isn't passed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-13 03:57:45 -05:00
Digimer
7e6e345513 * Updated tools/striker-manage-install-target to now check to see if the striker is a RHEL host. If not, the packages in the RHEL High Availability add-on are merged into the main package list. If it is RHEL though, a search is made for nodes that it can access and once one is found that is a matching RHEL version / arch, and has Internet access, it is used as a proxy to download the packages in the HA add on and then pulls those packages to the local repo.
* Updated Get->os_type() to work on local and remote hosts.
* Fixed a but in tools/striker-initialize-host where calls to Network->find_matches() where being checked properly for success/failure.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-13 03:14:36 -05:00
Digimer
713f77bc78 * Finally finished scan-apc-ups! Proved way harder than anticipated... (over a solid week of work!) In M3, this agent is no longer host-bound, and the UPSes to scan based on entries in 'upses' using this scan agent.
* Fixed a bug in Database->insert_or_update_power() where the check to see if 'power_ups_uuid' was passed in was reversed. Also fixed a bug where the convertion of the value to TRUE/FALSE for the old value wasn't being set correctly.
* Updated Server->get_definition() to only translate the host name to a uuid if the host uuid wasn't passed in. Added a sanity check on the UUID as well.
* Cleaned up how existing UPSes are displayed in Striker when managing UPSes. Also renamed the form's scan agents to match the real agent names.
* Fixed alert sorting in scan-apc-pdu.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-12 00:35:51 -05:00
Digimer
7516047c15 * Created Convert->celsius_to_fahrenheit, ->fahrenheit_to_celsius and ->format_mmddyy_to_yymmdd.
* Created (but not yet finished) scan-apc-ups.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-11-02 11:43:35 -05:00
Digimer
18eba9bb55 * Updated Database->write() to record DB transactions when 'debug' is set appropriately.
* Finished scan-apc-pdu! Unlike M2, it tracks PDUs without host-binding, and tracks them by their fences entry / scan_apc_pdu_uuid instead of by serial number.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-27 00:04:02 -04:00
Digimer
d677d19ca0 * Moved Database->check_condition_age to Alert.
* Created (but not finished) scan-apc-pdu
* Added support to tracking maintenance-mode for nodes in Cluster->parse_cib
* Created Remote->read_snmp_oid().
* Created Server->get_definition.
* Updated Server->get_status() to write-out server XML files on-demand.
* Finished scan-cluster.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-23 01:28:21 -04:00
Digimer
5d89357c16 * Updated scan-lvm to use a generated UUID as the reported UUIDs are pre-RFC 4122 and not usable as UUIDs in the database.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-09 15:25:13 -04:00
Digimer
aaa330dd41 * Finished (though known bugs exist) scan-lvm.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-09 02:13:12 -04:00
Digimer
2f4a06f2e0 * Updated System->call() to take the 'timeout' parameter which, when set, prepends the call with 'timeout X <shell_call>' to make it easier to deal with calls that could potentially hang.
* Renamed scan-storage to scan-lvm as we only really care about LVM data in this agent. A dedicated scan-drbd will be created later. Got the agent to parse the pvs/vgs/lvs data.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-08 01:38:41 -04:00
Digimer
be88be6d30 * Did a bunch of testing / bugfixes for scan-server.
* Updated the servers table to remove the 'not null' constraints on the server_start_after_server_uuid, server_pre_migration_file_uuid and server_post_migration_file_uuid columns.
* Updated ScanCore->agent_startup() to connect to the database(s) when there isn't a table list.
* Updated Server->migrate_virsh() to test for DB access before making DB calls (to allow ocf:alteeve:server to function even if all ScanCore DBs are offline).
* Updated ocf:alteeve:server to connect to the databases (though work without it), and changed '$FILE_NAME' to be 'ocf:alteeve:server' (to make logging more legible)
* Created the skeleton for 'scan-storage'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-05 23:47:56 -04:00
Digimer
262cbccb35 * Finished scan-server, though lots of testing needed.
* Renamed servers -> 'server_clean_stop' to 'server_user_stop' to make it clearer what the column represents.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-05 00:15:44 -04:00
Digimer
46f1a05789 * Got the code in scan-server to the point where it _should_ now gracefully and automatically detect changes to a server's definition originatin from the database (via Striker), directly editing the on-disk definition file, or editing via libvirt tools (like virt-manager). Still needs to be tested though.
* Updated Server->migrate_virsh() to set 'servers' -> 'server_state' to 'migrating' and clear it again once the migation completes. Also added support for cold (frozen) versus live migrations.
* Updated Cluster->parse_cib() to check if a server with the server_state set to 'migrating' isn't actually migrating anymore and, if not, to clear that state. This is needed as scan-server will blindly ignore/skip any migrating server, and if a migration call is interrupted, the state could get stuck.
* Updated the 'servers' database table (and associated Database methods) to add columns for;
** server_ram_in_use      - tracking RAM used by a running server
** server_configured_ram  - RAM allocated to a running server (used with the above to alert a user and track _currently_ available RAM)
** server_updated_by_user - To be set by Striker tools to indicate when the user made a change that needs to push out to nodes / running server.
** server_boot_time       - Tracks the unixtime when the server booted (to track uptime even if the server migrates across nodes).
* Created Get->anvil_name_from_uuid() to easily convert an Anvil! UUID into a name. Also created ->host_uuid_from_name() to translate a host name into a host UUID.
* Created Server->get_runtime() that translates a server name into a process ID and then uses that to determine how long (in seconds) it has been running. This is used when a server transitions from 'shut off' to 'running' to determine exactly when the server booted (current time - runtime).
* Renamed all 'Server->parse_definition' calls that used 'from_memory' to 'from_virsh' to clarify the data source.
* Made scan-hardware smarter about RAM change alerts.
* Updated scancore to load agent strings on startup so that processing pending alerts works properly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-02 02:13:34 -04:00
Digimer
1a1fa7ce88 * Created Cluster->get_anvil_uuid() that returns the 'anvil_uuid' of a given 'host_uuid'.
* Renamed the 'defitintions' table to 'server_definitions' to clarify the purpose, and made all the 'server' columns have then 'not null' constraint.
* Created Database->insert_or_update_servers(), ->get_servers(), ->insert_or_update_server_definitions() and ->get_server_definitions().
* Updated scancore, anvil-daemon, and scan agents to not run unless they're run with root privs.
* Got scan-server to update the servers / server_definition tables and the on-disk file when needed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-28 00:20:13 -04:00
Digimer
e6e4c7d530 * Moved Server->_parse_definition() to -> parse_definition() to make it a publid method.
* Began (but haven't finished) Database->insert_or_update_servers().
* Created Storage->get_file_stats() to collect the (l)stat information for a file.
* Got more work done on scan-server.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-26 01:45:08 -04:00
Digimer
e240a32a19 * Created Cluster->parse_crm_mon and updated Cluster->parse_cib() to determine what state a server is in and which host has a server.
* Added support in anvil.conf to disable scan agents with 'scancore::<agent_name>::disable', and added handling this to agents. Also allowed for '--force' to override this setting.
* Updated ScanCore->agent_startup() to allow for empty scan agent table lists.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-24 18:16:05 -04:00
Digimer
4dfe0cb5a0 * Created Cluster->boot_server, ->shutdown_server and ->migrate_server methods that handle booting, migrating and shutting down servers. Also created the private method ->_set_server_constraint which is used by migrate and boot to set resource constraints to control where a server boots or migrates to.
* Did more work on parsing server data out of the CIB. There is still an issue with determining which node currently hosts a resource, however.
* Renamed Server->boot to ->boot_virsh, ->shutdown to ->shutdown_virsh and ->migrate to ->migrate_virsh to clarify that these methods work on the raw virsh calls, outside of pacemaker (indeed, they are what the pacemaker RA uses to do what pacemaker asks).
* Got more work done on the scan-cluster SA.
* Created the empty files for the pending scan-server SA.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-24 02:09:18 -04:00
Digimer
0f7267eae1 * Moved the '_host_name', '_short_host_name', and '_domain_name' private methods in Tools.pm over to Get.pm (removing the leading '_' in the method names).
* Created 'Cluster->which_node' that returns 'node1' or 'node2' to indicate which node a host is.
* Continued working on scan_cluster; decided to make it not host-dependent.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-20 00:27:36 -04:00
Madison Kelly
b86251c9d6 * Touch more work on scan-cluster
Signed-off-by: Madison Kelly <mkelly@alteeve.ca>
2020-09-17 21:07:34 -04:00
Digimer
44dc4f4b47 * Fixed a bug in Words->string() where not having the parameter 'file' set caused the default 'words.xml' to be specified, preventing strings from scan agents from being used.
* Started work on the new scan-cluster scan agent that will parse out and store data from the pacemaker CIB.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-16 19:32:53 -04:00
Digimer
4be943ebf3 * Finished (initial) testing of scan-hardware. The first M3 scan agent is done!
* Updated Alert->check_alert_sent() and the 'alert_sent' table to remove 'alert_name' as it really isn't helpful given record_locator.
* Created 'Database->purge_data' that takes an array of tables and, in reverse, purges them from the database(s). This also disables archiving and resync functions now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-16 01:19:21 -04:00
Digimer
165117edfe * Finished scan-hardware! (though totally untested, expect bug fixes).
* Created Database->insert_or_update_updated() to manage entries in the special 'updated' table.
* Updated Alert->check_alert_sent() to not use the 'type' parameter, but instead use the 'clear' parameter, to be more consistent with other methods.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-15 02:16:35 -04:00
Digimer
0a1dc809a2 * Created the ScanCore.pm module with the first 'agent_startup' method which generalized scan agent start up.
* Updated Alert->register to take a hash reference for message variables to simplify when a caller plans to log and register an alert at the same time.
* Updated Convert->bytes_to_human_readable() to name the 'size' variable used internally for 'bytes' to actually be 'bytes' for better consistency.
* Created multiple new Database methods;
** ->check_condition_age() is meant to be used by scan agents to see how long a given condition has been in play (ie: how long ago power was lost to a UPS or a sensor became unreadable).
** ->insert_or_update_health() handles recording data to the new 'health' table, used for determining ideal hosts for servers between nodes.
** ->insert_or_update_power() handles recording data to the new 'power' table, used for determining how power events are handled.
** ->insert_or_update_temperature() handles recording temperature data to the new 'temperature' table, used to determine how thermal events are handled.
* Got a lot more done on the scan-hardware scan agent. Only part left now is post-scan health processing.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-15 01:18:36 -04:00
Digimer
28ac266024 * Updated Alert->check_alert_sent() to remove the 'modified_date' parameter and changed the 'set' value to 'changed' to be cleared.
* Created Dataase->check_for_schema() that scancore agents can use to check/load the SQL schema
* Updated Database->write to take the 'transaction' parameter.
* Used scan-hardware as the test mule for scan agent SQL handling.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-09 01:48:51 -04:00
Digimer
925664762a * Created Database->check_for_schema() (not finished) that will check/add a schema for a scan agent.
* Renamed the scan-network skeleton scan agent to scan-hardware and started work on it based on the M2 version.
* Updated Database->get_recipients() to take the 'include_deleted' parameter, and changed the default behaviour to only return active records.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-08 01:07:23 -04:00
Digimer
946fce018a * Renamed the 'ScanCore' executable to just 'scancore', moved it into the standard 'tools' directory and changed the agents directory to '/usr/sbin/scancore-agents'.
* Got scancore scanning the agents directory, and properly holding on startup until at least one database is available (instead of exiting), and holding on startup until the local system is configured.
* Created the skeleton of the first scan agent; scan-network.
* Fixed a bug in Storage->check_md5sums() where dynamically loaded modules, loaded after the initial md5sum calcs, would cause the calling daemon to exit (possibly on every invocation).
* Created the scancore.README that will eventually be the main scan agent guide / API document.

Signed-off-by: Digimer <digimer@alteeve.ca>
2018-12-28 18:17:59 -04:00