Commit Graph

577 Commits

Author SHA1 Message Date
digimer
74ddb7f3a9 Updated Database-get_files() to detect/remove duplicate file entries.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-29 21:28:49 -04:00
digimer
fcbace6713 Updated anvil-join-anvil to hold if either node is still running anvil-configure-host
* Fixed a minor bug and added logging of maintenance_mode calls in anvil-configure-host.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-28 16:01:32 -04:00
digimer
582a8b292c Added more job updates to anvil-manage-power.
* This is a test to see if the job waiting for the uptime to be 300s,
  leaving the job_progress as 0, was causing the job to be repeatedly
  called.
* This is related to issue #479

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-28 01:19:52 -04:00
digimer
ef042eef25 Cleaned up logging while waiting for subnodes.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-28 00:15:14 -04:00
digimer
5d5270486e Added a wait loop when forming node clusters.
* This adds a check where anvil-join-anvil waits until both subnodes are
  marked as configured and not in maintenance mode.
* Should address issue #479 (maybe, this shouldn't trigger reboots, but
  it was certainly a race condition found while investigating).

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-27 22:38:07 -04:00
Digimer
745081b649
Merge branch 'main' into patch-screenshot 2023-09-22 23:28:12 -04:00
digimer
c039c58128 * This commit moves taking screenshots of hosted servers onto the strikers using the Sys::Virt module. This was needed because the screenshots were being taken by scan-server, and that was causing it to take a long time to run. It should never have been handled by the scan agent anyway. This update requires a WebUI fix to use the new screenshot tool. This tool also adds holding multiple screenshots to allow users to "scrub" through screenshots up to 10 hours in the past.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-22 17:15:09 -04:00
digimer
8925dabb9d * Updated anvil-shutdown-server to take the new '--immediate' switch which forces a server to shut down immediately (akin to pulling the power on a traditional machine). This is needed to allow a user to recover a crash or hung server.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-21 18:56:12 -04:00
digimer
580980717d This commit covers the convertion of 'virsh' shell calls to using 'Sys::Virt' module, and fixes several small bugs related to scan-server;
* Switched all calls to virsh to use Sys::Virt to deal with contention of simultaneous virsh calls.
* Removed collecting screenshots from scan-server.
* Fixed a bad variable substitution in an alert.
* Fixed a bug where a server's boot time wasn't being recorded properly.
* Reworked how we determine which server definition was most recently updated and propogated.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-21 15:59:43 -04:00
digimer
3c9086d1f3 Fixed bugs related to running jobs.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-07 20:06:00 -04:00
digimer
e8a84e1c97 Added job handling to anvil-manage-server-storage (needs more testing though).
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-07 15:37:31 -04:00
digimer
2f429d2bc7 Fixed bugs related to adding drives and extending drives to servers.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-05 22:53:52 -04:00
digimer
e895e1f264 * Finished writting the anvil-manage-server-storage.
* Fixed handling --eject and --insert to work without a device target specified when only one exists, or to find the file path when only the file name is given.
* Updated anvil-manage-server-storage to show files when processing an optical devices without a file being passed.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-05 16:53:08 -04:00
digimer
17078347ee Reworked anvil-manage-server-storage to use the translation system.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-03 16:02:47 -04:00
digimer
02de75a6ab * Improved log messaging to not log of a potential boot failure when the local DRBD volume(s) are all UpToDate and the peer is offline.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-14 12:58:00 -04:00
digimer
3ee30e6e24 * Updated DRBD->allow_two_primaries() to gracefully fail if the peer isn't connected.
* Updated DRBD->manage_resource() to check if the host is StandAlone when asked to 'up' a resource and, if so, connect first. Also updated this to error out gracefully if the call to allow_two_primaries() returns non-zero.
* Update Server->migrate_virsh() to error out gracefully if the DRBD->allow_two_primaries() returns non-zero.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-08 22:52:16 -04:00
digimer
88af919142 * Fixed bugs in ocf:alteeve:server
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-08 11:52:36 -04:00
digimer
6ee2ad75db * Updated anvil-delete-server to actively check for and delete any drbd-fenced attributes left over in the CIB after a server is deleted. This addresses issue #374.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-25 21:45:34 -04:00
digimer
be290bf561 This commit fixes a bug where the drbd kernel module build was being killed mid-compile, leaving DBRD unusable.
* Created System->wait_on_dnf() which was plucked from anvil-daemon, and now also called in scancore and anvil-safe-start.
* Updated scancore and anvil-safe-start to check on start that DRBD's kernel module is available (and build if not).

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-24 22:32:41 -04:00
digimer
d68adb5b4e * Updated anvil-manage-power to not reboot if anvil-version-changes is running (which, if it's taking time, is generating new kmods).
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-24 20:44:40 -04:00
digimer
66c82e5e22 * Fixed a bug in anvil-update-system where updating a single package with --reboot wouldn't request a reboot. Finished reworking it so that a check is made to see if the kernel or DRBD kmod will be updated and, if so, removes the kmod-drbd RPMs prior to doing the update (as opposed to the sloppier check-on-error method).
* Fixed a bug in System->reboot_needed() where the cache file path had a typo in the hash key.
* Updated anvil-daemon to use the full path to dnf when determining if a dnf process was running.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-23 21:43:26 -04:00
digimer
e278de4b5a The main change in this commit deals with anvil-daemon startup. During OS updates, it would pick up the queued update job and run it while the other --no-db one was still running. This could become an issue for other tasks in the future, so updated anvil-daemon to not run any jobs for the first minute after startup. Also updated it to see if an OS update is underway (given how it can start mid-RPM update, before packages like kmod-drbd are ready to build). While doing this, implemented caching of daily tasks (like agine out data, archiving data, network scans, etc) to only run once per day, period. As it was before, they would always run on anvil-daemon startup, then wait 24 hours.
Note that work has started it reworking anvil-update-system, but it is incomplete (and broken) in this commit.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-23 21:43:26 -04:00
digimer
b0c54b6dae * Updated anvil-update-system to check if another instance of anvil-update-system is running and, if so, exit.
* Removed the new tasks from anvil-special-operations.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-22 20:03:39 -04:00
digimer
7bd76c10dc Major thing in this commit is reworking striker-update-cluster to work without expecting anvil-daemon to be running on target machines. Similarly, they had to be able to work when the Striker DBs were not available. This is to account for cases where the Striker dashboards have updated, and the schema has changed, preventing the not-yet-updated DR hosts and subnodes from being able to use the DB. To do this, anvil-safe-stop, anvil-update-system, and anvil-shutdown-server had to be updated to use the new --no-db switch, which tells then to run without the database being available.
* Updated Server->shutdown_virsh() to work without a database connection.
* Updated System->reboot_needed() to store/read from a cache file when the database is not available.
* Updated anvil-safe-start to remove the old --enable/disable/status switches, now that we use anvil-safe-start.service systemd unit.
* Reworked anvil-safe-stop to work without a database connection, and to work on DR hosts.
* Updated anvil-special-operations to add new tasks, but it's likely these new tasks aren't needed and will be removed very shortly.
* Added/updated multiple man pages.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-22 18:09:01 -04:00
digimer
9bc78860a6 * Updated anvil-update-system to detect kmod-drbd upgrade problems and fix them.
* Updated striker-update-cluster and anvil-update-system to take '--reboot' to request a reboot if any packages are updated.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-16 20:45:47 -04:00
digimer
42b44ac864 * Updated the log showing why anvil-daemon isn't exiting when a job is running with the job's current progress.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-16 00:08:53 -04:00
digimer
d741f4aa6f * Updated anvil-daemon to not exit on high RAM use is any job is running.
* Updated anvil-update-system to reboot a target whose kernel updated using an anvil-manage-power job,
* Started making striker-update-cluster run as a job (not at all complete). Fixed a bug where the wrong IP was being used when finding access to a target.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-15 22:23:30 -04:00
digimer
751687129a * Updated anvil-daemon to not exit on RAM use if anvil-update-system is running.
* Fixed a bug in anvil-safe-stop where it wouldn't trigger a migration when the peer is online.
* Updated anvil-update-system to set job_data to 'failed' and exit with rc 4 if the os update failed.
* Got striker-update-cluster to error out and exit if a called 'anvil-update-system' job failed.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-15 16:23:38 -04:00
digimer
3016fb875b * Reworded striker-update-cluster to use anvil-update-system for on-system OS updates.
* Updated DRBD->get_status() to take the new 'host' paramter to allow the caller to define the hash key string used in the stored data.
* Updated Get->anvil_version() (and a few other places) to use the new 'striker-ui-api' shell user, replacing the 'apache' user.
* Updated Remote->test_access() to take the new 'close' parameter to close the SSH session used when testing access to the target.
* Fixed a logging bug in anvil-manage-power.
* Updated anvil-update-system to take the '--no-reboot' and 'clear-cache' command line switches.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 22:29:07 -04:00
digimer
1b8b0bc493 * Created the new 'anvil-manage-server-storage' with the first role of reload a DRBD resource.
* Updated Remote->call() to remove the 'background' parameter as it wasn't working.
* Updated anvil-manage-server-storage to use 'anvil-manage-server-storage' to adjust resources in a way that doesn't block.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-30 21:02:30 -04:00
digimer
ea95d26cc5 * Fixed a bug in DRBD->get_next_resource() where reserved minor numbers were not being released. Also added a new parameter, "minor_only", that returns the next minor number but doesn't bother processing TCP ports.
* Did more work on adding support for adding new disk drives to servers in anvil-manage-server-storage.
* Updated anvil-manage-storage-groups To check for / delete duplicate storage groups with the same name.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-26 23:55:19 -04:00
digimer
88cc76914d This is an attempt to fix issue #341. It replaces the search for SN IPs from Network->find_matches() to Network->find_access(). The later of which doesn't care about the interface the IP was found on.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-24 21:24:37 -04:00
digimer
c9e11fbbfc * Added checks to anvil-provision-server to fail out if either of the SN IPs are not found when generating a DRBD resource config.
* Added logging to anvil-provision-server and anvil-daemon to try to find the cause of jobs being re-run after completing. May have fixed with a fix to job_progress updates going to 100 too early in some cases.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-19 21:44:45 -04:00
digimer
156a0ca201 Updated anvil-daemon's new job launching logic to allow the restart of a running job that failed out early.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-16 11:43:49 -04:00
digimer
47f7a35df3 The main purpose of this commit is to add serial execution of similar jobs to help reduce race conditions for scripted jobs, like multiple server creation.
* Fixed a small logging bug in DRBD->allow_two_primaries().
* Updated Database->get_jobs() to record jobs sorted by modified_date so that jobs can be run in the order they were recorded.
* Updated anvil-daemon to track which commands need to be run, and when two or more of the same command need to be run, they're run serially, with each subsequent run starting after the previous one completes.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-15 21:13:53 -04:00
digimer
b6a249d5e7 * Updated Cluster->add_server() to set the preferred host based first on if the server is running on a node, and if not, on the primary node (where before it defaulted to node 1).
* Updated DRBD->delete_resource() to call scan-drbd and scan-lvm to ensure that the database is updated with the newly freed resources.
* Updated anvil-delete-server and anvil-provision-server to call select scan agents to ensure freed resources are immediately recorded.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-11 23:46:21 -04:00
digimer
b7abc481e6 Updated scan-cluster to check to see that migrate_to and migrate_from are given a timeout of 600s and an on-fail of "block". Updated Cluster->add_server() to set migrate_from to timeout=600s and on-fail=block as well.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-08 20:30:25 -04:00
digimer
c82bd9d73a * Created the new anvil-watch-power tool that shows the status of UPSes known on the system, including their "on battery" state, charge percentage, estimated hold up time, etc.
* Updated Database->get_power() and ->get_upses() to store both the time stamp and unix time stamps.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-06 23:40:15 -04:00
digimer
0e57836c8f This commit addresses (hopefully) issue #329.
* Updated DRBD->get_status() to attempt to recompile the drbd kernel module if the drbdsetup status fails. If it continues to fail, it exits gracefully now.
* Updated ocf:alteeve:server to test access over a given IP before calling Server->find to avoid timeouts when the peer is down. Also updated it to set the constraints to keep the server on the new host when the old host returns to the cluster.
* Fixed a bug in scan-cluster where a server that is FAILED but not running is now properly recovered.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-05 22:53:34 -04:00
digimer
110dceb55e * Added a check to make sure files were ready before provisioning a server.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-04 01:15:08 -04:00
digimer
895f1ec262 This fixes a race condition when multiple servers are provisioned at (nearly) the same time.
* In DRBD->get_next_resource(), implemented a "hold" system where the DRBD minor and TCP port(s) returned are marked as being held for one minute. So subsequent calls won't use the same numbers.
* In anvil-daemon, added a check in run_jobs() where only one instance of a given job command will be started per 2-second loop. This should help reduce the chance of simultaneous race confitions in general.
* Removed from anvil-provision-server and most other tools the call to Job->get_job_uuid(). If the program is called without the job_uuid, don't try to find it. This allows a human (or script) to make repeated calls to a program without one of those calls running a pending job instead.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-28 00:19:53 -04:00
digimer
0874ad571a Updated anvil-safe-start to not give up on starting corosync/pacemaker if it fails on the first try.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-18 14:33:58 -04:00
digimer
83a527f4fa * Removed enabling anvil-safe-start out of the RPM and into anvil-join-anvil.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-18 11:18:42 -04:00
digimer
89eae7098e NOTE: This updates the reserved RAM to 8 GiB from 4 GiB!
* Adds support for 'anvil_resources:🐏:reserved' that can be set to a number of MiB to override the default 8192.
* Adds support for 'anvil::<anvil_uuid>::resources:🐏:reserved' to allow for per-Anvil! node override on the reserved RAM default, and over the 'anvil_resources:🐏:reserved' option.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-17 20:43:28 -04:00
digimer
f9689a7106 Updated ocf:alteeve:server to look for /tmp/<resource>.fail' and, if that file exists, exits with rc:1. This is done to allow for testing.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-10 17:40:46 -04:00
digimer
cf73d8ed36 * Updated System->configure_ipmi() to auto-configure DR hosts once they've been assigned a BCN IP address.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-05 15:04:39 -04:00
Fabio M. Di Nitto
a6f2c2271e Fix typo in log message
Resolves: https://github.com/ClusterLabs/anvil/issues/294

Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2023-03-22 07:38:10 +01:00
digimer
b144976853 This resolves Issue #310.
* Updated Database->get_file_locations() to record files available on Anvil! nodes by tracking hosts in Anvil! systems (needed after reworking how DR hosts are linked).
* Updated Get->available_resources() to call Database->get_files() and ->get_file_locations() to restore tracking files available on Anvil! nodes.
* Fixed a couple display bugs in anvil-provision-server when called with --ci-test --options.
* Continued work on anvil-manage-server-storage.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-20 23:43:40 -04:00
digimer
645f54ab89 This commit has more changes than I would normally like, but it's all linked to changing file uploads to rsync serially.
* To update file handling for the new DR host linking mechanism, file_locations -> file_location_anvil_uuid was changed to file_location_host_uuid.
  This required a fair number of changes elsewhere to handle this, with a particular noted change to Database->get_anvils() to look at host_uuid's for the subnodes in an Anvil! and, if either is marked as needing a file, make sure the peer is as well. Similarly, any linked DRs are set to have the file as well.
* Created a new Network->find_access that simply takes a target host name or UUID, and it returns a list of networks and IPs that the target can be accessed by.
* Updated Network->load_ips() to find the network interface being used for traffic so that things like the interface speed can be recorded, even when an IP is on a bridge or bond.

Unrelated, but in this commit, is a restoration of calling scan agents with a timeout now that the virsh hang issue has been resolved.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-14 02:29:40 -05:00
digimer
7773e5f9b8 * Updated logging in DRBD->get_devices().
* Added a check and exit if anvil-manage-dr is asked to protect a server on a machine that doesn't know about that server.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-01-30 11:30:36 -05:00
digimer
f8743a7435 * Further work on anvil-manage-dr. Now properly sanity checks that a valid server is passed.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-01-19 22:14:17 -05:00
digimer
1a217d21cf * Updated anvil-manage-dr to provide the ability to link anvil nodes to dr hosts. Also began work on making it work with the new DR links system.
* Created Database->get_anvil_uuid_from_string(), Database->get_host_uuid_from_string() and Database->get_server_uuid_from_string() to simplify the process of converting --anvil <string>, --host <string> and --server <string> respectively.
* Fixed bugs in Database->get_dr_links() and Database->insert_or_update_dr_links().
* Updated Database->insert_or_update_states() to make direct calls to hosts instead of using get_hosts to drop out if a host_uuid doesn't yet exist in a DB.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-19 19:41:02 -05:00
digimer
17863404e3 * Updated Database->_age_out_data() to only run once per day, unless explicitely called with --age-out-database.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 20:33:28 -05:00
digimer
ff69916a85 * Applied typo fixed from PR #286 (thanks, Deezzir!). Also moved all the raw prints into words.xml.
* Updated Convert->human_readable_to_bytes() to return an empty string if passed an empty string.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-16 20:23:29 -05:00
digimer
9d2f9c4d88 * Fixed a string key name typo.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-15 19:53:57 -05:00
digimer
b8b4352117 * Added support for Migration Network configs in old striker and anvil-configure-host
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-15 01:24:26 -05:00
digimer
b27a43eaf7 * Updated striker to only require 6 interfaces when configuring a node.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-14 16:59:41 -05:00
digimer
0fa6ddebc5 Updated scan-network to see an interface state of 'activated' as up (used to check specifically for 'active').
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-14 16:22:51 -05:00
digimer
a3988cc3e5 * Added System->configure_logind() to ensure that nodes are configured to ignore ACPI power button events so that IPMI-based fences work immediately.
* Added call to System->configure_logind() to anvil-join-anvil and anvil-version-changes.
* Updated fence_pacemaker to add '--reboot' to the 'stonith_admin' call to ensure DRBD-triggered fence requests reboot instead of just turning nodes off.
This commit address issue #279.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-13 21:42:10 -05:00
digimer
dfa93a1837 * Added 'setsid' to all 'virsh' calls as nested calls (ie: crm_resource -> ocf:alteeve:server -> virsh) would fail because virsh couldn't connect to a terminal. See:
** https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background
* Added explicity setting of $ENV{PATH} when it's null (as it is when pacemaker calls our tools).
* Updated the copyright to 2023.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-12 21:52:26 -05:00
digimer
b666caec64 * Updated anvil-provision-server to handle startup when the peer doesn't create/connect it's DRBD resource (ie: node is offline).
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-06 03:00:38 -05:00
digimer
a5cee52153 * Fixed a bug in DRBD->get_devices() where old test host UUIDs were left hard-coded.
* Fixed a duplicate header in words.xml
* Fixed display bugs in anvil-report-usage and removed the old DR host display info.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-04 22:58:28 -05:00
Digimer
6d59399c73 * Updated the short OS list.
* Created Get->virsh_list_net() and Get->virsh_list_os() that call and parse osinfo-query directly to create lists of supported network interfaces and OS optimization options used when provisioning VMs. The later of which is used to replace the old language list of OSes, which was clunky and prone to missing valid options.
* Updated Get->available_resources() to remove the old anvil_dr1_host_uuid mechanism of finding and referencing DR resources.
* Started adding --network support to anvil-provision-server to allow users to specify a specific network bridge, MAC address and model to use for a new VM.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-24 10:08:06 -05:00
Digimer
f9ca6fb170 * This adds the new anvil-version-change tool which anvil-daemon will call on startup to handle checks for changes made over releases/updates.
* Added the new 'dr_link_note" column to the dr_links tables so that links can be marked as DELETED.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-13 17:36:43 -05:00
Digimer
02e371ac56 Updated virsh OS list.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-06 18:08:52 -05:00
Digimer
4ba1982183 This is the start of a set of changes needed to rework how we handle DRBD fence requests, so that they create location constraints instead of triggering a full stonith fence.
* In Cluster->parse_cib(), added parsers for node attributes and resource rules. Also stored the existence of and details of each under the server resources for easier referencing.
* Updated scan-server to check for / add DRBD fence rules as needed.

Scancore APC agent bugs;
* For clarity, converted all '#!no_value!#' and '#!no_connection!#' to use '!!' instead in APC scan agents.
* Fixed a bug to set/clear alerts related to phases disappearing to deal with concurrent logins from different hosts triggering false phase loss alerts.
* Fixed missing variables not being passed to alerts/log entries.

Started more work on anvil-manage-server, but on hold again while the DRBD fencing work is completed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-29 22:17:12 -05:00
Digimer
6eb99a2168 * FInished the anvil-manage-alerts tool. It can now send test alerts at a user-requested alert level.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-22 01:10:53 -05:00
Digimer
8b7a44cf75 * Finished cleaning up the output of Machines.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-22 00:19:00 -05:00
Digimer
3e53c87a6b Formatted the output of anvil-manage-alerts data (not yet machines) to be more presentable.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-17 23:28:50 -05:00
Digimer
622fb84652 * Renamed the 'notifications' table to 'alert-override', better reflecting what it does.
* Got anvil-manage-alerts managing alert overrides.
* Created, but for now commented out, the new 'audit' table.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-17 00:34:52 -05:00
Digimer
586ce6e5b9 * Got recipints working in anvil-manage-alerts().
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-15 22:17:12 -05:00
Digimer
35cf0c37fb * Updated System->check_ram_use() to set the maximum RAM based on the host type, and set those values in _set_default() so that the user can override if they want.
* Got anvil-manage-alerts to the point where you can add, edit and delete mail servers.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-14 17:17:30 -05:00
Digimer
a6cd5c6604 * Starting work in the new anvil-manage-alerts, which will (when done), allow for management of mail servers, alert recipients, notification over-rides and to trigger test alerts.
* Updated Database->get_recipients() to record recipients by name for better sorting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-28 20:00:53 -04:00
Digimer
bde0b2e7ec * Fixed a bug where deleting ports from a fence device in an Install Manifest would not cause the fence methods to be removed from the associated cluster.
* Created Get->anvil_from_switch and Get->server_from_switch() (both need testing) that takes a string that could be either a name or UUID, figures out which it is, finds the entry in the DB and started the X_uuid and X_name switch variables.
* Started work on a second attempt at anvil-manage-server.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-20 22:33:41 -04:00
Digimer
93427a7a38 * Updated Get->switches() to always support job-uuid.
* Updated striker-initialize-host to support calls from command line switches, and wrote the man page for it.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-18 19:16:32 -04:00
Digimer
c23c79cdf0 Added 'system::all::configured' to anvil-join-anvil to mark an explicit end of config.
Started updating striker-initialize-host to handle the new anvil repo config.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-18 10:56:58 -04:00
Digimer
596855405f * Added variables to record when pacemaker and DRBD are configured.
* Added verify-alg to DRBD configs.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-17 21:57:00 -04:00
Digimer
3b721b849c * Fixed a bug in anvil-configure-host where if the same MAC address was assigned to two interfaces, it would cause an endless reboot loop.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-28 19:20:23 -04:00
Digimer
599373816f * Fixed bugs that came up in testing. Was now able to setup long-throw DR!
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-22 16:40:40 -04:00
Digimer
2fab7bc1b7 This adds support (testing needed) for "Long-Throw" DR; which is a wrapper for using 'drbd-proxy' to provide larger transmit buffers so slow/high-latency DR hosts.
* Created DRBD->check_proxy_license() to do (some level of) sanity checks on the DRBD proxy license file.
* Updated DRBD->gather_data() to parse out the inside and outside ports for resource configs using proxy.
* Reworked DRBD->get_next_resource() to return 1, 3 or 7 TCP ports depending, with the new long_throw_ports parameter triggering the 7 ports.
* Added 'tcpdump' to the anvil-core requires list.
* Reworked scan-drbd to record the ports used in proxy configs. This required adding a check to change the 'scan_drbd_peer_tcp_port' column type to 'text' to support CSVs.
* Reworked anvil-manage-dr (needs testing!) to support "long-throw" DR configs.
* Updated anvil-safe-stop to check if the nodes are in the cluster before trying to migrate.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-21 23:35:06 -04:00
Digimer
c8ee75420d * Updated anvil-manage-dr to check if a server is protected before processing a --connect or --disconnect request. Also made it smarter if an attempt to connect a resource fails.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-01 16:09:37 -04:00
Digimer
e90dae96f7 * In Server->shutdown_virsh(), disabled trying to resume a paused VM. Also updated the logging around not waiting for a VM to stop.
* Updated anvil-safe-stop to check for VMs running, even if the cluster is stopped, when --stop-servers is used.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-31 18:12:07 -04:00
Digimer
d271ffec26 * Updated Cluster->parse_crm_mon() to record the role of stonith resources.
* Fixed a bug in System->parse_arguments() where a quoted password without spaces was returned without being recorded in the hash. Also updated logging to log 'suppressed' for passwords when secure logging is disabled.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-31 12:57:01 -04:00
Digimer
89121a2b3b * Fixed a bug in Alert->check_condition_age() where not setting a host_uuid caused the returned age to always be 0.
* Updated scan_apc_pdu to not report a lost PDU unless it's been gone for ten minutes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-29 17:30:52 -04:00
Digimer
99a6593fe6 * Fixed a bug when connecting to databases when one DB has no variable entries, making it seem like a DB was disabled.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-25 21:43:21 -04:00
Digimer
b8bb7cc423 * Changed the default trigger of live migrations to require a health score difference of 2 or higher. This can be user-adjusted using the new 'feature::scancore::threshold::preventative-live-migration' anvil.conf option.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-25 12:43:51 -04:00
Digimer
9675ebf986 * Added --remove support to anvil-manage-dr, completing all the features for this tool.
* Updated DRBD.pm to move the logic to wipe and delete an LV into a new method called 'remove_backing_lv'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-24 22:08:48 -04:00
Digimer
93e6a59841 * Added 'vnc-server' to the list of firewall services enabled on strikers.
* Created the anvil-manage-dr man page.
* Reworked anvil-manage-dr's --protect logic to search for which network works with the DR host, instead of assuming it's the SN.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-22 13:38:46 -04:00
Digimer
29a28ee97a * Fixed a bug with anvil-provision-server where running the command line menu from a Striker would not assign the job to the target Anvil!.
* Updated Server->parse_definition() to check if a failed 'virsh list' output was passed in. Also changed it to not exit if the XML can't be parsed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-16 19:01:36 -04:00
Digimer
cbb441759e * Fixed a couple bugs in anvil-manage-files where a file moved from incoming to files or definitions wasn't having the directory updated properly in the database. Also made an explicit check when looking for missing files to check to see if the file exists in another managed directory and, if so and if a striker, update the DB.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-15 23:27:40 -04:00
Digimer
a81478f2bc * Updated 'db_in_use' state to add the caller's name to the state name. This is pulled out when logging stale locks that are being reaped, to help debug where stale locks are coming from.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-09 00:29:03 -04:00
Digimer
e7cf8ac789 * Got more work done on anvil-manage-files. It now picks up new files on nodes/dr hosts in an Anvil! and downloads them if needed.
* Updated anvil-daemon to call anvil-manage-files on a per-minute basis to handle files added outside of the WebUI.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-09 00:08:19 -04:00
Digimer
be84a23924 * There were still references in anvil-manage-files to 'file_locations' -> 'file_location_host_uuid'. Had to rework some logic to get things working. More testing needed, but so far at least the "missing file" function is working again.
* Added missing always-available switchs in Get->switches
* Create Storage->_wait_if_changing() to check to see if a file's size is changing and, if so, not return until it stops.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-08 21:31:56 -04:00
Digimer
15aadc3a4e * Updated scan-network to check for inactive or activating interfaces and manually bring them up, if the uptime is less than 10 minutes.
* Fixed a bug in scancore-agents/Makefile.am where scan-network was missing.
* Started work on anvil-delete-server.8. Incomplete at this time.
* Updated Network->get_ips() to record the interface status.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-03 23:38:56 -04:00
Digimer
171ea74000 * There is a fix in this commit to resolve a race condition where, when reconfiguring the network, the request to set a job to reboot would fail because the connections to all Strikers could be lost, causing Database->_test_access() would error out, blocking the reboot. When restarted, the network would not be changed, so no reboot would be requested, leaving the machine in an innaccesible state.
* Updated anvil-boot-server when called with '--all' to honour boot ordering, delays and condtions.
* Updated Database->get_servers() to collect the server's XML as well as data from the 'servers' table.
* Updated anvil-provision-server to make a new DRBD resource 'secondary' after forcing it to primary to begin the initial sync.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-06 19:22:28 -04:00
Digimer
bce9e2caaf This is the first attempt at enabling firewalld completely. There is a decent chance that problems exist, so it won't be a surprise if a few more commits are needed to this branch before things work.
* Added multiple new private methods to Network that help in managing the firewall.
* Updated Server->boot_server to manage the firewall after the server boots. Updated ->migrate_server to create a job, if a database connection exists, for the migration target to update it's firewall as soon after the server appears as possible.
* Updated ocf:server:alteeve to manage the firewall when called post-migration, in case there was no DB connection and the job above didn't run. Fixed a bug where the disk state wasn't being evaluated properly.
* Updated scan-server to check that the firewall is managed when a server state has changed.
* Updated anvil-daemon to run Network->manage_firewall on startup.
* Heavily reworked 'anvil-manage-server' to either just run 'Network->manage_firewall', or if passed '--server X', to wait for the server to appear for up to 1 minute, then to check that the firewall is managed (to capture servers being migrated to the host.)
* Removed firewall management from striker-prep-database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-02 17:06:04 -04:00
Digimer
b2ea4f9adc * Moved System->manage_firewall() to Network->manage_firewall(). Started working on actually implementing it, which involves basically fully rewritting it.
* Updated tools/Makefile.am and scancore-agents/Makefile.am to add missing files.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-30 00:01:50 -04:00
Digimer
f2d06fa9b1 * Updated striker-parse-oui to only run if/when the system has been running for at least one hour.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-23 21:21:38 -04:00
Digimer
ab9b00a2f7 * Updated anvil-daemon, in its daily checks, to disable ksm and ksmtuned daemons.
* Updated scan-drbd to purge peer records that no longer have corresponding LVM data.
* Updated System->{en,dis}able-service to take the 'now' paramter which, when passed, causes the action to take immediate effect.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-21 22:25:07 -04:00
Digimer
b77bb81343 * Found a bug where, if a record was deleted from the public schema but not from the history schema, and then later a resync was performed, the record would be added to the peer database's public schema (while still not existing locally). This condition should never occur as data in history should only exist to track the public record. This update checks for this condition and purges those records prior to resync'ng a database table.
* Continued work on fixing issues with striker-purge-target (which led the the discovery of the above bug). Added expliit checks to purge file_location and storage_group data when purging an sub-anvil from the database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-20 20:52:07 -04:00