Commit Graph

344 Commits

Author SHA1 Message Date
Digimer
33101f969a * Fixed several bugs related to tracking server boots, migrations and shut downs in the anvil database. The 'ocf:alteeve:server' now has (mostly?) safe integration with the Anvil! database. This was mostly done by updating Servers->boot_virsh(), ->shutdown_virsh() and ->migrate_server().
* Updates servers -> server_host_uuid to drop the 'NOT NULL' constraint.
* Created the new Get->server_uuid_from_name() that does what it says on the tin. Fixed a bug in ->host_uuid_from_name() where the host name was being returned instead of the UUID.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-07 02:40:31 -04:00
Digimer
46f1a05789 * Got the code in scan-server to the point where it _should_ now gracefully and automatically detect changes to a server's definition originatin from the database (via Striker), directly editing the on-disk definition file, or editing via libvirt tools (like virt-manager). Still needs to be tested though.
* Updated Server->migrate_virsh() to set 'servers' -> 'server_state' to 'migrating' and clear it again once the migation completes. Also added support for cold (frozen) versus live migrations.
* Updated Cluster->parse_cib() to check if a server with the server_state set to 'migrating' isn't actually migrating anymore and, if not, to clear that state. This is needed as scan-server will blindly ignore/skip any migrating server, and if a migration call is interrupted, the state could get stuck.
* Updated the 'servers' database table (and associated Database methods) to add columns for;
** server_ram_in_use      - tracking RAM used by a running server
** server_configured_ram  - RAM allocated to a running server (used with the above to alert a user and track _currently_ available RAM)
** server_updated_by_user - To be set by Striker tools to indicate when the user made a change that needs to push out to nodes / running server.
** server_boot_time       - Tracks the unixtime when the server booted (to track uptime even if the server migrates across nodes).
* Created Get->anvil_name_from_uuid() to easily convert an Anvil! UUID into a name. Also created ->host_uuid_from_name() to translate a host name into a host UUID.
* Created Server->get_runtime() that translates a server name into a process ID and then uses that to determine how long (in seconds) it has been running. This is used when a server transitions from 'shut off' to 'running' to determine exactly when the server booted (current time - runtime).
* Renamed all 'Server->parse_definition' calls that used 'from_memory' to 'from_virsh' to clarify the data source.
* Made scan-hardware smarter about RAM change alerts.
* Updated scancore to load agent strings on startup so that processing pending alerts works properly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-02 02:13:34 -04:00
Digimer
1a1fa7ce88 * Created Cluster->get_anvil_uuid() that returns the 'anvil_uuid' of a given 'host_uuid'.
* Renamed the 'defitintions' table to 'server_definitions' to clarify the purpose, and made all the 'server' columns have then 'not null' constraint.
* Created Database->insert_or_update_servers(), ->get_servers(), ->insert_or_update_server_definitions() and ->get_server_definitions().
* Updated scancore, anvil-daemon, and scan agents to not run unless they're run with root privs.
* Got scan-server to update the servers / server_definition tables and the on-disk file when needed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-28 00:20:13 -04:00
Digimer
e6e4c7d530 * Moved Server->_parse_definition() to -> parse_definition() to make it a publid method.
* Began (but haven't finished) Database->insert_or_update_servers().
* Created Storage->get_file_stats() to collect the (l)stat information for a file.
* Got more work done on scan-server.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-26 01:45:08 -04:00
Digimer
e240a32a19 * Created Cluster->parse_crm_mon and updated Cluster->parse_cib() to determine what state a server is in and which host has a server.
* Added support in anvil.conf to disable scan agents with 'scancore::<agent_name>::disable', and added handling this to agents. Also allowed for '--force' to override this setting.
* Updated ScanCore->agent_startup() to allow for empty scan agent table lists.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-24 18:16:05 -04:00
Digimer
4dfe0cb5a0 * Created Cluster->boot_server, ->shutdown_server and ->migrate_server methods that handle booting, migrating and shutting down servers. Also created the private method ->_set_server_constraint which is used by migrate and boot to set resource constraints to control where a server boots or migrates to.
* Did more work on parsing server data out of the CIB. There is still an issue with determining which node currently hosts a resource, however.
* Renamed Server->boot to ->boot_virsh, ->shutdown to ->shutdown_virsh and ->migrate to ->migrate_virsh to clarify that these methods work on the raw virsh calls, outside of pacemaker (indeed, they are what the pacemaker RA uses to do what pacemaker asks).
* Got more work done on the scan-cluster SA.
* Created the empty files for the pending scan-server SA.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-24 02:09:18 -04:00
Digimer
0f7267eae1 * Moved the '_host_name', '_short_host_name', and '_domain_name' private methods in Tools.pm over to Get.pm (removing the leading '_' in the method names).
* Created 'Cluster->which_node' that returns 'node1' or 'node2' to indicate which node a host is.
* Continued working on scan_cluster; decided to make it not host-dependent.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-20 00:27:36 -04:00
Digimer
44dc4f4b47 * Fixed a bug in Words->string() where not having the parameter 'file' set caused the default 'words.xml' to be specified, preventing strings from scan agents from being used.
* Started work on the new scan-cluster scan agent that will parse out and store data from the pacemaker CIB.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-16 19:32:53 -04:00
Digimer
4be943ebf3 * Finished (initial) testing of scan-hardware. The first M3 scan agent is done!
* Updated Alert->check_alert_sent() and the 'alert_sent' table to remove 'alert_name' as it really isn't helpful given record_locator.
* Created 'Database->purge_data' that takes an array of tables and, in reverse, purges them from the database(s). This also disables archiving and resync functions now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-16 01:19:21 -04:00
Digimer
0a1dc809a2 * Created the ScanCore.pm module with the first 'agent_startup' method which generalized scan agent start up.
* Updated Alert->register to take a hash reference for message variables to simplify when a caller plans to log and register an alert at the same time.
* Updated Convert->bytes_to_human_readable() to name the 'size' variable used internally for 'bytes' to actually be 'bytes' for better consistency.
* Created multiple new Database methods;
** ->check_condition_age() is meant to be used by scan agents to see how long a given condition has been in play (ie: how long ago power was lost to a UPS or a sensor became unreadable).
** ->insert_or_update_health() handles recording data to the new 'health' table, used for determining ideal hosts for servers between nodes.
** ->insert_or_update_power() handles recording data to the new 'power' table, used for determining how power events are handled.
** ->insert_or_update_temperature() handles recording temperature data to the new 'temperature' table, used to determine how thermal events are handled.
* Got a lot more done on the scan-hardware scan agent. Only part left now is post-scan health processing.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-15 01:18:36 -04:00
Digimer
925664762a * Created Database->check_for_schema() (not finished) that will check/add a schema for a scan agent.
* Renamed the scan-network skeleton scan agent to scan-hardware and started work on it based on the M2 version.
* Updated Database->get_recipients() to take the 'include_deleted' parameter, and changed the default behaviour to only return active records.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-08 01:07:23 -04:00
Digimer
dc5ec9c264 * Added checking the email server config to anvil-daemon. Email works now!
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-07 00:39:43 -04:00
Digimer
fe7cdb18fb * Updated all methods to add (or fix) logging the method entry.
* More work done on Email->send_email() to, well, actually send email (which it isn't doing yet, but it's close).
* Updated Words->key() to include the bad key name when no entry for the requested key exists in the words.xml file.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-06 01:52:03 -04:00
Digimer
911523dfce * Got a lot of work done in generating emails. Doesn't work yet, but the code to generate emails for recipients using their preferred language and alert level is done (though limited testing so far).
* Dropped support for supporting imperial measurements in generated emails.
* Created Database->get_alerts() to read in alert data and ->get_recipients() to get the list of alert recipients.
* SQL Schema changes;
** Added 'alert_processed' to 'alerts' to track what alerts have been processed.
** Changed 'recipient_new_level' to 'recipient_level' now that we're only using 'notifications' as a per-host override for user/hosts alert levels.
** Removed 'recipient_units' as we're no longer supporting non-metric values.
* Updated Alert->register() to take strings for the alert level (which gets translated to integers).
* Created Email->get_current_server() to returned the mail_server_uuid of the active mail server (if any). Created ->send_alerts() to process unprocessed alerts and send emails to recipients.
* Updated Words->parse_banged_string() to take the 'language' parameter.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-05 01:18:42 -04:00
Digimer
4f39272d9a * Fixed a big in Jobs->get_job_details() where jobs weren't being found via 'switches::job-uuid'.
* Enabled anvil-join-anvil debugging of stonith handling to later catch an 'uninitialized value' warning (despite seeming to complete configuration of the Anvil! successfully).

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-04 01:14:22 -04:00
Digimer
82acb4e104 * Fixed a resync bug where bridges needed to sync before bonds
* Re-enabled user-selected BCN subnet ranges (needs more testing).

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-03 22:52:38 -04:00
Digimer
49682a01d7 * Fixed a bug in Database->disconnect() where the database idenitification number wasn't being removed, so connecting again triggered the duplicate DB connection check.
* Fixed a bug in Tools->_set_defaults where the order the tables were sync'ed it caused primary/foreign keys would trigger DB errors when resync'ing in some cases.
* Created Database->log_connections to make it easier to log which databases are actively in use and other data about the connections.
* Fixed bugs in striker-manage-peers that (partly because of the above bugs) failed to connect to new peers properly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-03 01:36:11 -04:00
Digimer
aad68b8ed0 * Got email reconfiguration when a mail queue is over ten minutes old (needs testing).
* Updated Storage->write_file to remove the cache for a written file if it exists.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-28 02:32:27 -04:00
Digimer
767148b538 * Updated Database->get_mail_servers() to clear old stored data, and to pull out the list of when a mail server was last used.
* Got email server configuration under way. A mail server can now be configured via Email->_configure_for_server(), but more work is needed on when to switch between configs.
* Fixed some logging of passwords that wasn't being checked to see if secure logging was enabled or not.
* Fixed a bug in Striker where the back arrow in email config sub-sections weren't going back to the main email menu.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-27 02:09:21 -04:00
Digimer
b2c7fd95fb * Renamed the ScanCore unit file to scancore.
* Added support to parsing location contraints to Cluster->parse_cib

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-24 12:54:31 -04:00
Digimer
1498e1b53c * Got server migration working using ocf:alteeve:server in a test environment!
* Converted most 'eval { }' calls to localize $@ and test the output of the eval, instead of checking to see if $@ was set.
* Converted all 'local' hash references to instead use the short host name of the local machine as a new standard.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-19 18:54:09 -04:00
Madison Kelly
30f2b3fa8e * Switched all hash 'local' keys to be the host's short user name. Untested, likely bugs to be fixed in the next commit.
Signed-off-by: Madison Kelly <mkelly@alteeve.ca>
2020-08-18 19:34:08 -04:00
Digimer
47203490a9 * Working on getting live migration to work with ocf:anvil:striker using the environment variables that pacemaker sets. Incomplete, but getting close.
* Added support to Cluster->parce_cib() to track if maintenance mode is set.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-17 11:55:49 -04:00
Digimer
e35800c413 * Fixed up (though more testing/work needed) to ocf:alteeve:server to get it working with DRBD resources referenced using '/dev/drbd/by-res/...'.
* Added the anvil.conf option 'sys::privacy::strong' that controls if the Anvil! ever "calls home". Initially, this controls DRBD's usage flag.
* Updated DRBD->get_devices() to track resources by their 'by-res' names as well and by the normal '/dev/drbdX' devices.
* To mitigate https://bugzilla.redhat.com/show_bug.cgi?id=1868467, updated Get->bridges() to parse the normal (non-JSON) data if we get invalid JSON output.
* Updated anvil-join-anvil to not disable, and in fact enable, libvirtd on boot. With DRBD 9, the original fear of a user accidentally booting a VM that's running on the peer no longer is an issue. By enabling it and leaving it on, Striker dashboard users won't lose their virtual machine manager link unless the node powers off. Also enabled actually updating the job progress, completing this tool!

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-13 00:12:20 -04:00
Digimer
39b4a912af * Remember in the last commit how I said that DRBD->update_global_common() was done? Well that was cute, 'cause it was quite broken. Now it's working.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-10 15:42:05 -04:00
Digimer
d647014ad1 * Created (finished but not yet tested) DRBD->update_global_common() to update DRBD's global_common.conf file.
* Cleaned up a lot of log levels.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-07 23:12:11 -04:00
Digimer
ef208fd3fb * Finished the logic for adding stonith devices and levels to pacemaker! More testing is needed though, bugs expected, but it adds them.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-06 17:45:43 -04:00
Digimer
c27cc7507f * Renamed striker-parse-fence-agents to anvil-parse-fence-agents and changed anvil-daemon to run it on all machines.
* Cleaned up a lot of logging.
* Updated Cluster->parse_cib() to track if a stonith device has 'delay' set.
* Got a lot more work done on anvil-join-anvil's stonith processing, but it still isn't complete. Updated it to change shell user passwords as well.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-06 01:20:53 -04:00
Digimer
61f4dcc41f * Updated Cluster->parse_cib() to pull out fencing (stonith) devices and levels.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-31 01:45:20 -04:00
Digimer
3c2f25a860 * Added 'fence_delay' fence agent to handle the corner cases where an IPMI BMC had crashed until a power cycle, and PDU fencing was effected, but failed to report as such.
* Updated Cluster->parse_cib() to take a CIB as a parameter.
* Fixed a bug in Database->get_hosts() where loading the host_ipmi value was filtered through Log->is_secure.
* Updated Striker->get_fence_data() to parse the switches to make it easier to map a fence agent's command line switches to STDIN arguments.
* Created System->parse_arguments() that converts a series of command line switches and their values into a hash. It's similar to Get->switches(), but works on any string.
* Continued work on anvil-join-anvil's fence configuration logic.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-30 00:23:47 -04:00
Digimer
d2d5d7b460 * Fixed a bug in Striker->load_manifest() where fences were parsed twice, the second time missing a hash reference.
* Updated striker to now only offer gateway for IFN networks. EL8 seems to ignore 'GATEWAY="x"' in interface configs which caused anvil-join-anvil to always think an interface needs to be updated. Updated as well to remove DNS entries set in interfaces that are not the default gateway.
* Fixed a bug where DNS entries were being missed, causing entries to be repeatedly added to the interface that was the gateway interface.
* In anvil-update-states, added Get->switches() so that verbosity switches are used.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-28 00:59:02 -04:00
Digimer
1bf71f8428 * Updated Database->get_hosts() to run host_ipmi the Log->is_secure if the string contains 'passw'.
* Fixed Database->get_ip_addresses() to clear stale IP addresses.
* Finished (for now, more testing needed) System->configure_ipmi! Also created System->test_ipmi() that handles trying lanplus and various password lengths, updating hosts -> host_ipmi on successful check.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-24 16:22:27 -04:00
Digimer
ef70f90ba4 * Updated Log->entry() to set the log file handle to UTF-8 when opened.
* Got more work done to System->configure_ipmi() to warm reset HP IPMI BMCs. It also now finds the IPMI user have started the password management.
* Created Words->shorten_string() that shortens a string to a number of bytes (as opposed to shortening to a character length).

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-22 22:32:33 -04:00
Digimer
dcfdf1127c * Got more work done on System->configure_ipmi(). It should now configure the IP address, subnet mask and default gateway using information from the manifest and anvil-join-anvil data.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-22 03:00:44 -04:00
Digimer
1fa63d2ea3 * Added 'anvil_uuid' as a set parameter in Database->get_hosts().
* Added calling 'debug => $debug' in System->X methods.
* Got more work done on System->configure_ipmi(). It should now determine if a BMC exists and pull the OEM and network details automatically.
* Updated anvil-configure-host to log more data in an attempt to find a reproducer for an odd bug where (apparently) a host was picking up the wrong job data meant for another host.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-21 02:00:56 -04:00
Digimer
99afd2e936 * Fixed a bug in Database->manage_anvil_conf() where initializing a host set the DB information with the wrong DB port and password.
* Fixed a bug in Get->host_type() the type wasn't being set for nodes and dr hosts.
* Fixed a bug in Validate->host_name() where the wrong method was being called.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-17 22:49:25 -04:00
Digimer
345d2e33d4 * Updated Cluster->parse_cib() to pre-fill some hashes to avoid undefined errors.
* Updated Storage->rsync() to only create the rsync wrapper if a password was given, allowing for rsync to work to/from a remote system when passwordless SSH is enabled.
* Updated anvil-join-anvil to disable/stop drbd.service, and to properly wait until both nodes are online.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-16 12:40:53 -04:00
Digimer
01974d7efe * Finished (though testing is needed) the updated ocf:alteeve:server resource agent. It now handles starting and stopping libvirtd and drbd daemons on-demand.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-16 00:33:00 -04:00
Digimer
dcd1fd1492 * Created Cluster->check_node_status() that checks the status of a node (in pacemaker).
* Created Cluster->get_peers() that figures out who the peer node (and DR host, if applicable) are.
* Updated Cluster->parse_cib() to dig out more information.
* Created Cluster->start_cluster() to start pacemaker (via pcsd) locally or on all (both) nodes.
* Started working on ocf:alteeve:server to start/stop the libvirtd/drbd daemons as needed, instead of having pacemaker do it.
* Got more work done on anvil-join-anvil. Node 2 now waits for the cluster to start, and node 1 will do setup as needed, then wait for the cluster to start as well.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-15 18:40:37 -04:00
Digimer
62d0a2aa39 * Created Cluster->parse_cib() that parses pacemaker's CIB (cluster information base) XML. This also switches to the XML::LibXML, starting the replacement of XML::Simple. It's far from finished, but parses out basic node data and fence data.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-11 23:26:19 -04:00
Digimer
597d9413a5 * Created the skeleton Cluster.pm.
* Got anvil-join-anvil to the point where is initializes and starts the cluster.
* Deleted the old ssh key handling logic in anvil-daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-10 00:49:30 -04:00
Digimer
16157f7697 * Finished System->update_hosts().
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-07 16:37:27 -04:00
Digimer
de43ea3ac1 * Renamed all Validate->is_X to Validate->X. Also created Validate->ipv6() to validate IPv6 addresses using Data::Validate::IP (and added it as a requirement to the .spec base RPM).
* Added the fix from the last commit for System->call to handle returned data without an ending newline to Remote->call.
* Got more work done on System->update_hosts(). It's able to add new hosts, but misses the short and FQDN host names. Need to fix that and the verify existing / manual entries aren't molested.

Signed-off-by: digimer <digimer@pulsar.alteeve.com>
2020-07-07 01:18:38 -04:00
Digimer
76b6550ac6 * Created Database->get_ip_addresses() that pulls the IPs out and stores them in a hash that allows for easy referencing to associated interfaces and networks.
* Created Get->trusted_hosts() that finds the dashboards the host uses and, if the host is in an Anvil!, the peers in the same anvil.
* Created (but not finished yet) System->update_hosts() that will add and edit entries for all IPs to trusted hosts.
* Fixed a logging bug in Striker->load_manifest().
* Fixed a bug in System->call where, the the output from the shell call didn't end in a new-line, it would not parse the return code and lease the return code string appended to the shell output.
* Fixed a big in System->change_shell_user_password() where a new-line (\n) meant for the shell call wasn't escaped properly. There was also a duplicate 'return_code' variable preventing the actual return code from being read.
* Got more work done on anvil-join-anvil to update the hacluster password (when needed).

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-03 18:11:56 -04:00
Digimer
aa2fdfb609 * Fixed a bug in Database->get_anvils() that was clearing the manifests hash.
* got anvil-join-anvil working up to the start of configuring pacemaker.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-03 00:49:08 -04:00
Digimer
453f5c6223 * Fixed a bug where $anvil->nice_exit() was being passed 'exit' instead of 'exit_code' as a parameter.
* Update striker manifest run to add an entry into the 'anvils' table, and pass the anvil_uuid to the jobs rather than the various host_uuid's.
* Fixed a bug in the 'anvils' SQL procedure that copied data into the history schema (a few columns were missing).
* Updated anvil-configure-host to reboot when finished to be certain network changes have taken effect. Also updated the handling of virsh bridges to delete the autostart symlinks if libvirtd daemon isn't running.
* Added some logic to anvil-daemon to call 'anvil-update-states' with the -v{1,3} flag depending on the active debug level.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-06-24 00:39:56 -04:00
Digimer
1e89ef55f3 * Updated Network->get_ips() to (again) record the MAC addresses and to create a MAC to interface name lookup hash. This was (accidentally?) removed back when the ->get_ips() was changed to store the data in a host-specific hash.
* Updated tools/anvil-configure-host to use the new MAC to interface name lookup hash.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-06-19 12:01:29 -04:00
Digimer
bd9e74a7b1 * Added missing packages to the RPM spec and the RPMs to provide on Striker.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-06-18 18:46:04 -04:00
Digimer
b076463b5f * Fixed a bug in multiple System daemon control methods where the return_code wasn't being returned properly.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-06-18 15:17:34 -04:00
Digimer
9ab03e1f17 * Added a check to verify that connections to multiple databases are not, in fact, going to the same host.
* Added a global switch --resync-db which takes a UUID and forces that DB to be marked as needing a resync.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-06-17 20:11:20 -04:00