Commit Graph

888 Commits

Author SHA1 Message Date
digimer
edc544255e Rebased with main and resolved conflicts.
This branch resolves issue #462; Auto growing PVs. Specifically, it looks at the LVM PVs on the host and checks to see if there is unused free space after the backing partition. If there is, it auto-grows the partition and then resizes the PV. This featu
re is designed to make life easier for users who deleted the auto-created '/home' partition during the anaconda disk partitioning tool.
* Created Storage->auto_grow_pv() that does the above.
* Added the missing hidden method name _create_rsync_wrapper in the Storage module POD.
* Added a call to Storage->auto_grow_pv() in anvil-configure-host and anvil-version-changes for nodes and DR.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-08 12:00:48 -05:00
digimer
556e892002 Added 'sessions' to the age-out table list
* To help mitigate issue #520.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-03 20:13:47 -04:00
digimer
09fcc45c4a Skipped checking 'sessions' when considering a DB resync.
* This relatesto issue #520

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-03 19:51:00 -04:00
digimer
e0a4bd4d69 Undid the Remote->call to System->call piping
Builds have been failing since this change. Debugging is needed but it's
too large a task for just now.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-02 23:23:25 -04:00
digimer
081c5ea90e Possibly fixed the anvil-delete-server hang bug.
* Updated Server->connect_to_libvirt() to check that the target URI's
  SSH fingerprint is recorded before connecting. Also added an alarm
  wrapper around the Sys::Virt->new() call.
* Continued work on anvil-manage-server-system, working on the boot
  order section now.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-02 15:01:08 -04:00
digimer
747e759248 Updated node restart logic to boot unless power or temps are known bad
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-01 21:01:25 -04:00
digimer
8c97f478a8 Updated Server->update_definition() to undefine a server when needed.
* Boosted logging to debug anvil-delete-server hang in jenkins.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-01 16:06:59 -04:00
digimer
9b55504872 Updated anvil-manage-server-system to update defined servers.
* Updated Server->locate() to take the new 'anvil' parameter to speed up
  searches.
* Updated Server->update_definition() to use Server->locate() to find
  where updates are needed. It now also defines the server with the new
  config.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-01 00:15:13 -04:00
digimer
fd461f940d Fixed a bug in Remote->call()
* If the call to Remote-call() set the target that was actually the
  local short hostname, it would fail to make the call at all. Now if
  the 'target' is local, the shell call is instead passed to
  System->call() instead.
* Cleaned up logging.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-31 13:16:39 -04:00
digimer
49f194eac6 Fixed issue #515; anvil-join-anvil updates hostnames properly now
* Updated Get->host_name() to accept the new 'refresh' parameter. This
  forces a reread of the hostname, instead of using the cached value.
* Updated System->host_name() so that, when it's updating the hostname,
  it updates the database and cached variables.
* Updated Words->center_text() to avoid undefinied parameter issues.
* Updated anvil-join-anvil to ensure the 'sys::host_name' variable.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-27 19:58:37 -04:00
digimer
85f86cda8a Updated Storage->read_file() to test if the target file exists.
This allows for better logging if the file to be copied doesn't exist.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-25 21:21:52 -04:00
digimer
4f6fa4b6ed Working on a bug where broken manifests are saved.
* Updated Striker->generate_manifest() to add pod and make the prefix,
  sequence and domain parameters required.
* Created the check_for_broken_manifests() function for anvil-daemon to
  detect/remove broken manifests.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-24 13:36:30 -04:00
digimer
a5f424d340 Test fix for variable deletion bug.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-23 12:58:15 -04:00
digimer
8bc35b322f More logging to debug variable deletion bug.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-23 00:24:23 -04:00
digimer
56bb18951a Hopefully fixed an empty variable bug in duplicate variable searches.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-22 00:11:24 -04:00
digimer
4d1528f614 Added more logging to debug variable deletion bug.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-21 23:10:31 -04:00
digimer
9ee8f782ee Continuing to try to resolve duplicate variables bug.
* Added a called to Database->_check_for_duplicates to Database->resync_databases
* Added 'check_for_resync => 1' to anvil-configure-host.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-21 14:31:14 -04:00
digimer
2a3f0bab24 Reworked how and when duplicate variables are checked/cleared.
Moved the logic to a new private method, and call it now from the active
Striker in the once per minute loop. The duplicate variable issue seems
to be not entirely uncommon.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-21 13:33:14 -04:00
digimer
ed269aa450 Fixed a bug where duplicate variables were being created in some cases.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-20 22:05:33 -04:00
digimer
be89bfb438 Updated striker-get-screenshots to create the screenshot directory.
Also fixed a typo in the POD for Storage->make_directory().

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-19 21:36:15 -04:00
digimer
5ec395c53a Reworked DB resync logic.
With this new system, a 'primary_db' is chosen (first connected DB UUID when sorted) and only it does resyncs. Further, resyncs have been pulled from all tools except anvil-daemon. So with this new system, the chances of duplicate, simultaneous resyncs should be removed (hopefully for real this time).

* Database->check_agent_data() no longer calls a resync after loading a
  schema.
* Removed the Database->coonnect() 'all' parameter
* The database used to read from is now always the same as the primary,
  even if there is a local DB.
* Database->connect() 'check_for_resync' parameter can now be set to
  '2', which means "check for resync _if_ I am primary", where '1' still
  checks for resync no matter what.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-19 20:41:57 -04:00
digimer
b1f89c2723 Finished initial version of striker-show-jobs
* Updated Database->get_jobs() to take 'job_host_uuid = all' to allow
  loading jobs from all cluster machines. Also updated it to record the
  'job_host_uuid' and the unix timestamp version of 'modified_date'.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-18 20:51:20 -04:00
digimer
0014cc591d Re-enabled DB connections in ocf:alteeve:server.
Added DB connections to ocf:alteeve:server when starting or stopping
servers. This is to ensure that the servers -> server_state are updated
properly.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-18 20:51:20 -04:00
digimer
b3c067b016 Fixed a bug in anvil-manage-files where missing files weren't being downloaded.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-12 01:01:31 -04:00
digimer
7545df1e55 Fixed a bug in which host runs an anvil-delete-server job.
* Updated anvil-delete-server to use the new Server->locate method. This
  was done as the old Server->locate() was failing to find the server
running on the peer when anvil-delete-server was running on the backup
subnode.
* Updated Server->locate() to search hosts for XML definition and DRBD
  configs so that it can record where the server is recorded to run,
even if the server isn't running or defined at the time the locate ran.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 22:22:06 -04:00
digimer
68521cdab7 Updated striker-get-screenshots to set permissions properly.
This updates the /opt/alteeve/screenshot directories and the screenshots
in them to be readible by the WebUI.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
55b1380031 Finished (but need more testing) of Server->locate().
This includes the changes in PR#492.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
829ae546a2 Beginning work on new Server->locate() method to find servers across an
Anvil! cluster.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
f12e001ac2 Finished Server->connect_to_virsh().
* Now, connecting to virsh can detect when still-open connections
  already exist.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
245f75de9b Added Server->update_definition()
* This takes a server and new definition XML and updated the database and any available hosts. Does not yet update defined or running servers.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
e361d0b424 More progress on anvil-manage-server-system
* It now edits the XML to change the boot menu, but doesn't save yet.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
201cd53265 Improved the logic behind Network->find_target_ip()
* This adds the new 'networks' and 'test_access' parameters to allow
  restricting/ordering matched networks, and adds 'test_access' to
  validate the link is working.
* Continued work on anvil-manage-server-system

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
ed91435211 Fixed a bug where duplicate files weren't actually being deleted.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-30 00:13:56 -04:00
digimer
74ddb7f3a9 Updated Database-get_files() to detect/remove duplicate file entries.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-29 21:28:49 -04:00
digimer
c3fe39e6a8 Added a check file unlinked files.
* On subnodes and DR hosts, a check is made now in Storage->check_files() for files not linked in file_locations. Any found are added, with a check to see if the file already exists locally and, if so, that the md5sum is accurate or not (to set if the file is ready for use or not).

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-29 00:05:29 -04:00
Digimer
745081b649
Merge branch 'main' into patch-screenshot 2023-09-22 23:28:12 -04:00
digimer
c039c58128 * This commit moves taking screenshots of hosted servers onto the strikers using the Sys::Virt module. This was needed because the screenshots were being taken by scan-server, and that was causing it to take a long time to run. It should never have been handled by the scan agent anyway. This update requires a WebUI fix to use the new screenshot tool. This tool also adds holding multiple screenshots to allow users to "scrub" through screenshots up to 10 hours in the past.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-22 17:15:09 -04:00
digimer
8925dabb9d * Updated anvil-shutdown-server to take the new '--immediate' switch which forces a server to shut down immediately (akin to pulling the power on a traditional machine). This is needed to allow a user to recover a crash or hung server.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-21 18:56:12 -04:00
digimer
4646f4a030 * Quieted logging.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-21 16:39:18 -04:00
digimer
580980717d This commit covers the convertion of 'virsh' shell calls to using 'Sys::Virt' module, and fixes several small bugs related to scan-server;
* Switched all calls to virsh to use Sys::Virt to deal with contention of simultaneous virsh calls.
* Removed collecting screenshots from scan-server.
* Fixed a bad variable substitution in an alert.
* Fixed a bug where a server's boot time wasn't being recorded properly.
* Reworked how we determine which server definition was most recently updated and propogated.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-21 15:59:43 -04:00
digimer
3c9086d1f3 Fixed bugs related to running jobs.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-07 20:06:00 -04:00
digimer
e895e1f264 * Finished writting the anvil-manage-server-storage.
* Fixed handling --eject and --insert to work without a device target specified when only one exists, or to find the file path when only the file name is given.
* Updated anvil-manage-server-storage to show files when processing an optical devices without a file being passed.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-05 16:53:08 -04:00
digimer
d255adc7b4 * Updated anvil-daemon to set the mode of /mnt/shared/* to 0777 during creation and to check that that mode is set for existing sub-directories. This resolves issue #443.
* Cleaned up anvil-manage-dr.8 hyphen escapes.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-17 22:14:40 -04:00
digimer
f23768ac32 * Updated Striker->generate_manifest() to remove integral DR support and add migration-network support. This helps address issue #440.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-17 18:58:12 -04:00
digimer
4a6ee0c283 Updated Database->get_lvm_data to ignore data from deleted hosts.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-10 16:21:55 -04:00
digimer
79ff96cee5 * Fixed a bug in DRBD->manage_resource() that prevented new resources from being created.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-09 11:23:57 -04:00
digimer
3ee30e6e24 * Updated DRBD->allow_two_primaries() to gracefully fail if the peer isn't connected.
* Updated DRBD->manage_resource() to check if the host is StandAlone when asked to 'up' a resource and, if so, connect first. Also updated this to error out gracefully if the call to allow_two_primaries() returns non-zero.
* Update Server->migrate_virsh() to error out gracefully if the DRBD->allow_two_primaries() returns non-zero.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-08 22:52:16 -04:00
digimer
55aaf7876e Starting work on checking if the peer is connected before managing allow-two-primaries.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-08 17:33:39 -04:00
digimer
cc71df686b Added a pcs wrapper to serialize pcs constraint calls.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-08 14:32:33 -04:00
digimer
59ade94124 * Added PID logging as an option, and enabled it in ocf:alteeve:server
* Updated DRBD->manage_resource() to take the task 'adjust'.
* Updated ocf:alteeve:server's start_drbd_resource() to call adjust if startup of a resource isn't needd.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-07 22:28:13 -04:00
Fabio M. Di Nitto
824e3e07e3 virsh: add wrapper to serialize calls to virsh list
avoid storm of virsh list that overloads libvirtd API causing
unnecessary timeouts during pcmk monitoring operations.

Resolves: https://github.com/ClusterLabs/anvil/issues/395

Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2023-08-07 08:35:08 +02:00
Fabio M. Di Nitto
fc75bda6ef ocf:alteeve:server: add support for log levels and bump timeouts
also improve logging for migrations

Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2023-08-07 08:22:50 +02:00
digimer
a0ff080741 * Deleted some old unused code from Cluster->assemble_storage_groups().
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-25 19:30:53 -04:00
digimer
ed480cf1cb * Fixed a double-$ bug in Remote->_check_known_hosts_for_target()
* Updated striker-update-cluster to take '--timeout' and a number of seconds, or 'Xm' or 'Xh' for minutes or hourse, respectively. Also updated to show the remaining time while waiting, and added waiting timeout to the rest of the while loops that prior had no time limit. This addresses issue #383 and issue #382.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-25 19:13:41 -04:00
digimer
be290bf561 This commit fixes a bug where the drbd kernel module build was being killed mid-compile, leaving DBRD unusable.
* Created System->wait_on_dnf() which was plucked from anvil-daemon, and now also called in scancore and anvil-safe-start.
* Updated scancore and anvil-safe-start to check on start that DRBD's kernel module is available (and build if not).

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-24 22:32:41 -04:00
digimer
d68adb5b4e * Updated anvil-manage-power to not reboot if anvil-version-changes is running (which, if it's taking time, is generating new kmods).
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-24 20:44:40 -04:00
digimer
556e91238d * Updated Network->find_access() to clear the data from previous scans, which fixes a bug where checking multiple hosts could return stale data for the previous host.
* Updated anvil-manage-server-storage, striker-collect-debug, and striker-update-cluster to be able to find a connection on an interface when none were found on preferred networks.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-24 15:43:54 -04:00
digimer
66c82e5e22 * Fixed a bug in anvil-update-system where updating a single package with --reboot wouldn't request a reboot. Finished reworking it so that a check is made to see if the kernel or DRBD kmod will be updated and, if so, removes the kmod-drbd RPMs prior to doing the update (as opposed to the sloppier check-on-error method).
* Fixed a bug in System->reboot_needed() where the cache file path had a typo in the hash key.
* Updated anvil-daemon to use the full path to dnf when determining if a dnf process was running.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-23 21:43:26 -04:00
Tsu-ba-me
9163cf4513 chore: remove anvil-manage-tunnel 2023-07-23 21:43:26 -04:00
Tsu-ba-me
25f5c38ade chore: rename striker-manage-vnc-pipes->anvil-manage-vnc-pipe 2023-07-23 21:43:26 -04:00
Tsu-ba-me
c46ff969f3 fix: add UUID to server process during find in Server.pm 2023-07-23 21:43:26 -04:00
Tsu-ba-me
4bdd206e0c fix: replace ps|grep with pgrep to reduce run time 2023-07-23 21:43:25 -04:00
Tsu-ba-me
ea345a0476 fix: add ss, websockify paths to Tools.pm 2023-07-23 21:43:25 -04:00
Tsu-ba-me
711cb5b696 refactor: rename striker-open-ssh-tunnel->anvil-manage-tunnel 2023-07-23 21:43:25 -04:00
Tsu-ba-me
54c98f89ab fix: allow extend remote call with openssh options 2023-07-23 21:43:25 -04:00
Tsu-ba-me
7f9b418de1 fix: add signal INT, TERM hooks to Tools.pm 2023-07-23 21:43:25 -04:00
digimer
7bd76c10dc Major thing in this commit is reworking striker-update-cluster to work without expecting anvil-daemon to be running on target machines. Similarly, they had to be able to work when the Striker DBs were not available. This is to account for cases where the Striker dashboards have updated, and the schema has changed, preventing the not-yet-updated DR hosts and subnodes from being able to use the DB. To do this, anvil-safe-stop, anvil-update-system, and anvil-shutdown-server had to be updated to use the new --no-db switch, which tells then to run without the database being available.
* Updated Server->shutdown_virsh() to work without a database connection.
* Updated System->reboot_needed() to store/read from a cache file when the database is not available.
* Updated anvil-safe-start to remove the old --enable/disable/status switches, now that we use anvil-safe-start.service systemd unit.
* Reworked anvil-safe-stop to work without a database connection, and to work on DR hosts.
* Updated anvil-special-operations to add new tasks, but it's likely these new tasks aren't needed and will be removed very shortly.
* Added/updated multiple man pages.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-22 18:09:01 -04:00
Digimer
c1e4380a64
Merge branch 'main' into anvil-tools-dev 2023-07-15 00:06:49 -04:00
digimer
458cb267da * Fixed a bug in Cluster->get_primary_host_uuid() where servers were not loaded before trying to calculate RAM use.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-15 00:04:12 -04:00
digimer
4dc1b0e117 * Added a check to Network->get_company_from_mac() to manually set the company to KVM/qemu if the prefix is 52:54:00.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 23:00:16 -04:00
digimer
02c3d204ea * Updated anvil-update-system to set 'job_data' to track reboots, and striker-update-cluster to read it.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 22:52:51 -04:00
digimer
3016fb875b * Reworded striker-update-cluster to use anvil-update-system for on-system OS updates.
* Updated DRBD->get_status() to take the new 'host' paramter to allow the caller to define the hash key string used in the stored data.
* Updated Get->anvil_version() (and a few other places) to use the new 'striker-ui-api' shell user, replacing the 'apache' user.
* Updated Remote->test_access() to take the new 'close' parameter to close the SSH session used when testing access to the target.
* Fixed a logging bug in anvil-manage-power.
* Updated anvil-update-system to take the '--no-reboot' and 'clear-cache' command line switches.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 22:29:07 -04:00
Tsu-ba-me
a7751da153 fix: rename, relocate function to find qemu-kvm processes 2023-07-12 18:40:11 -04:00
Tsu-ba-me
c3c69733d9 fix: correct base port check, server info extract, vnc alive assign in Server.pm 2023-07-12 18:27:50 -04:00
Tsu-ba-me
3cce3c39b8 fix: add Server subroutine to extract server VM info from qemu-kvm process(es) 2023-07-12 02:27:26 -04:00
digimer
d56b7f9a84 * Created (but not finished!) the new striker-update-cluster tool.
* Updated Cluster->get_primary_host_uuid() to only load anvils if not already loaded.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-07 17:54:57 -04:00
digimer
a7ebe45f76 This adds the new 'striker-collect-debug' tool that collects all potentially useful debug info into a single tarball.
* Fixed a bug in Get->anvil_from_switch() to work when the Anvil! name is passed.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-05 21:04:05 -04:00
digimer
1b8b0bc493 * Created the new 'anvil-manage-server-storage' with the first role of reload a DRBD resource.
* Updated Remote->call() to remove the 'background' parameter as it wasn't working.
* Updated anvil-manage-server-storage to use 'anvil-manage-server-storage' to adjust resources in a way that doesn't block.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-30 21:02:30 -04:00
digimer
7fbed10864 * Updated Remote->call() to take the new 'background' parameter.
* Continues work on adding new disks (DRBD volumes) to anvil-manage-server-storage.
* Updated DRBD->get_status() to record the peer-role.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-29 22:17:58 -04:00
digimer
ae55ca9187 * Applied the fix for TCP ports aging out reserved TCP ports properly to DRBD->get_next_resource().
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-27 00:04:48 -04:00
digimer
ea95d26cc5 * Fixed a bug in DRBD->get_next_resource() where reserved minor numbers were not being released. Also added a new parameter, "minor_only", that returns the next minor number but doesn't bother processing TCP ports.
* Did more work on adding support for adding new disk drives to servers in anvil-manage-server-storage.
* Updated anvil-manage-storage-groups To check for / delete duplicate storage groups with the same name.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-26 23:55:19 -04:00
digimer
65af56d5bd * Updated Database->insert_or_update_jobs() to not look for jobs that are complete when no job_uuid is passed.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-25 23:20:03 -04:00
digimer
e0316da88b * Got anvil-manage-server-storage working enough to grow existing disk's hard drive sizes, and to insert/eject optical disks.
* Hit a bug where a server's definition file was written to disk while not being valid. Added logging in case it happens again, and additional safe-guards to help avoid it from recurring.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-23 23:09:55 -04:00
digimer
1d12fb32b4 * Completed the new anvil-watch-drbd which replaces watch_drbd.
* Updated Email->get_current_server() to always load mail server data from the database.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-22 20:43:46 -04:00
Digimer
8f491e01ed
Merge branch 'main' into anvil-tools-dev 2023-06-20 20:00:10 -04:00
digimer
0aa72498db * This adds the new tool 'striker-check-machines' which simply walks through all known physical machines and checks to see if they're accessible and powered on.
* Updated Get->uptime() to work on remote targets.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-20 19:57:21 -04:00
Tsu-ba-me
b3f2644d07 fix: allow parameter to overwrite cgi input in Account->login 2023-06-20 00:48:21 -04:00
Tsu-ba-me
226c423af0 fix: allow param override in generate_manifest in Striker.pm 2023-06-19 15:15:32 -04:00
digimer
156a0ca201 Updated anvil-daemon's new job launching logic to allow the restart of a running job that failed out early.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-16 11:43:49 -04:00
digimer
cc15eca6fb * Added anvil-watch-power to git.
* Added a check to cleanup size input to Convert->human_readable_to_bytes() when passed pre-processed strings.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-15 21:35:42 -04:00
digimer
47f7a35df3 The main purpose of this commit is to add serial execution of similar jobs to help reduce race conditions for scripted jobs, like multiple server creation.
* Fixed a small logging bug in DRBD->allow_two_primaries().
* Updated Database->get_jobs() to record jobs sorted by modified_date so that jobs can be run in the order they were recorded.
* Updated anvil-daemon to track which commands need to be run, and when two or more of the same command need to be run, they're run serially, with each subsequent run starting after the previous one completes.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-15 21:13:53 -04:00
digimer
dda0fbd7d5 * Updated DRBD->allow_two_primaries() to be more careful at evaluating peer-node-id.
* Updated DRBD->manage_resource() to set allow-two-primaries=no when up'ing a resource (as no migration can be in progress during an up command).
* Updated scan-drbd to look for StandAlone resources and call DRBD->manage_resource({task = 'up'}) if a connection to a peer node is StandAlone or if the local disk state is detached.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-14 23:34:05 -04:00
digimer
b6a249d5e7 * Updated Cluster->add_server() to set the preferred host based first on if the server is running on a node, and if not, on the primary node (where before it defaulted to node 1).
* Updated DRBD->delete_resource() to call scan-drbd and scan-lvm to ensure that the database is updated with the newly freed resources.
* Updated anvil-delete-server and anvil-provision-server to call select scan agents to ensure freed resources are immediately recorded.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-11 23:46:21 -04:00
digimer
b03587967b * Updated Cluster->add_server() to batch the creation of the server and the location constraints in one commit to the CIB.
* Updated scan-lvm to look for and delete duplicate entries.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-09 19:59:23 -04:00
digimer
b7abc481e6 Updated scan-cluster to check to see that migrate_to and migrate_from are given a timeout of 600s and an on-fail of "block". Updated Cluster->add_server() to set migrate_from to timeout=600s and on-fail=block as well.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-08 20:30:25 -04:00
digimer
c82bd9d73a * Created the new anvil-watch-power tool that shows the status of UPSes known on the system, including their "on battery" state, charge percentage, estimated hold up time, etc.
* Updated Database->get_power() and ->get_upses() to store both the time stamp and unix time stamps.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-06 23:40:15 -04:00
digimer
bc3d04ad2e * Updated Cluster->add_server() to wait up to 15 seconds for a server to appear to ensure that the pcs call to add the server with the right requested running state.
* Updated Cluster->recover_server() to set the desired recovery state before calling the crm_resource refresh.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-06 14:34:02 -04:00
digimer
0e57836c8f This commit addresses (hopefully) issue #329.
* Updated DRBD->get_status() to attempt to recompile the drbd kernel module if the drbdsetup status fails. If it continues to fail, it exits gracefully now.
* Updated ocf:alteeve:server to test access over a given IP before calling Server->find to avoid timeouts when the peer is down. Also updated it to set the constraints to keep the server on the new host when the old host returns to the cluster.
* Fixed a bug in scan-cluster where a server that is FAILED but not running is now properly recovered.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-05 22:53:34 -04:00
digimer
c50a1936c0 * This adds the new 'file_locations' -> 'file_location_ready' column and associated methods. This is set to TRUE/1 when the file referenced is found on disk and it is the expected size and md5sum. This is meant to allow programs to wait/watch or a file to be ready if they need to use it. Files are now checked periodically via anvil-daemon.
* Removed hard-coded log levels in anvil-provision-server and anvil-manage-storage-groups.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-04 00:05:56 -04:00
digimer
26fa3c7e32 Fixed a bug where Get->available_resources() was missing LVM/storage group data in some cases.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-02 16:28:05 -04:00
digimer
510db70253 Another attempt to resolve the stoage group race condition. This moves the check for auto-assembly to scan-lvm. It only works for the first assemble, after that the user can/should use anvil-manage-storage-groups.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-02 00:07:40 -04:00
digimer
e483840ceb Second attempt to fix the storage group race condition. This time, we only let node 1 assemble storage groups.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-01 20:29:20 -04:00
digimer
d64044c7d1 Test fix for storage group race condition.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-01 13:48:27 -04:00
digimer
9a58f4d1ff * This is a small commit to increase logging while chasing down a race condition issue with assembling storage groups.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-30 19:47:58 -04:00
digimer
895f1ec262 This fixes a race condition when multiple servers are provisioned at (nearly) the same time.
* In DRBD->get_next_resource(), implemented a "hold" system where the DRBD minor and TCP port(s) returned are marked as being held for one minute. So subsequent calls won't use the same numbers.
* In anvil-daemon, added a check in run_jobs() where only one instance of a given job command will be started per 2-second loop. This should help reduce the chance of simultaneous race confitions in general.
* Removed from anvil-provision-server and most other tools the call to Job->get_job_uuid(). If the program is called without the job_uuid, don't try to find it. This allows a human (or script) to make repeated calls to a program without one of those calls running a pending job instead.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-28 00:19:53 -04:00
digimer
e7537b0ca3 * Fixed a bug where, when DRBD->gather_data() calls 'drbdadm dump-xml' and the output includes usage data, it breaks XML parsing.
* Fixed a bug in Get->available_resources() where DELETED servers were being counted in the used resources math.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-25 13:12:13 -04:00
digimer
dc7b909bfc More logging to debug storage group race condition
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-18 19:14:59 -04:00
digimer
bd575c6a7d Bumped logging for storage group management.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-18 19:02:51 -04:00
digimer
89eae7098e NOTE: This updates the reserved RAM to 8 GiB from 4 GiB!
* Adds support for 'anvil_resources:🐏:reserved' that can be set to a number of MiB to override the default 8192.
* Adds support for 'anvil::<anvil_uuid>::resources:🐏:reserved' to allow for per-Anvil! node override on the reserved RAM default, and over the 'anvil_resources:🐏:reserved' option.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-17 20:43:28 -04:00
digimer
025c2a6f54 * Updated Email->get_next_server() to ignore DELETED mail servers, and it now loads mail servers if not yet in memory.
This resolves issue #306.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-13 00:26:32 -04:00
digimer
1afa7ce09e * Created Cluster->recover_server() that uses crm_resource to try to recover a server that has entered a FAILED state.
* Updated (not not yet completed) scan-cluster's check_resources() function to check if a FAILED server is ready to try to recover.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-10 23:04:15 -04:00
digimer
f9689a7106 Updated ocf:alteeve:server to look for /tmp/<resource>.fail' and, if that file exists, exits with rc:1. This is done to allow for testing.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-10 17:40:46 -04:00
Digimer
660f38ac16
Merge branch 'main' into anvil-tools-dev 2023-04-05 16:11:01 -04:00
digimer
cf73d8ed36 * Updated System->configure_ipmi() to auto-configure DR hosts once they've been assigned a BCN IP address.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-05 15:04:39 -04:00
digimer
1c274ba58d * Fixed a bug in anvil-delete-server that was preventing the complete deletion of a server if the DRBD resource had already been removed.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-03 12:40:58 -04:00
Deezzir
109aa1ba3d docs: added annotation for the new arg 2023-04-03 12:40:58 -04:00
Deezzir
7d5f18b20d fix: introduced optional arg for clean_spaces 2023-04-03 12:40:58 -04:00
Deezzir
9241b5ef6a docs: added annotation for the new arg 2023-03-30 21:03:05 -04:00
Deezzir
deac1fc6a8 fix: introduced optional arg for clean_spaces 2023-03-30 20:57:17 -04:00
digimer
ddc6965b60 * Fixed a bug where references to files on Anvil! nodes was broken in anvil-provision-server and anvil-manage-files.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-30 17:33:49 -04:00
digimer
efebd135eb * Removed more references to 'dr1_host_uuid' from the old way of linking DR hosts to Anvil! nodes.
* Fixed a bug where servers protected by DR hosts aren't deleted when the server itself is deleted.
* Updated DRBD->delete_resource() to remove the server's XML file if the host is a DR host.
* Updated anvil-version-change and anvil.sql to enable update_audits and the audits table.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-30 12:50:44 -04:00
digimer
41fb8baeda * Fixed a bug in Database->get_storage_group_data() that was deleting DR host storage group members.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-27 12:35:22 -04:00
digimer
8ff40ec42c * Fixed a SQL query bug in Database->get_drbd_data().
* Got more work done on anvil-manage-server-storage; Now shows DRBD resource size, backing LV and size, and calculates/displayes metadata size.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-26 02:09:52 -04:00
digimer
040bc02e26 * This adds the new Database->get_drbd_data() that, like ->get_lvm_data, collates the DRBD data collected by scan-drbd into more readibly parsable data structure.
* Updated DRBD->parse_resource() to add references to a resource name and volume for a given backing disk.
* Comtinued work on anvil-manage-server-storage.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-24 19:45:47 -04:00
digimer
8e0e51544c * Continued work on anvil-manage-server-storage.
* Created the new Database->get_lvm_data to compile LVM data from scan-lvm
* Updated DRBD->parse_resource to call Database->get_lvm_data if needed, and to track backing devices to Storage Groups.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-22 22:57:26 -04:00
digimer
b144976853 This resolves Issue #310.
* Updated Database->get_file_locations() to record files available on Anvil! nodes by tracking hosts in Anvil! systems (needed after reworking how DR hosts are linked).
* Updated Get->available_resources() to call Database->get_files() and ->get_file_locations() to restore tracking files available on Anvil! nodes.
* Fixed a couple display bugs in anvil-provision-server when called with --ci-test --options.
* Continued work on anvil-manage-server-storage.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-20 23:43:40 -04:00
digimer
fea10e5bb1 * Prefixed all 'virsh' calls with 'setsid --wait' to help prevent future hangs if the call happens without a shell.
* Updated anvil-manage-server-storage to the point where it can now insert and eject optical disks!
* Updated System->call to log parameters if 'shell_call' isn't set.
* Fixed a bug in anvil-manage-server process_interactive where an $anvil->data reference was being scoped.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-03 14:42:28 -05:00
digimer
7891c9b2b1 * Fixed a bug in Network->load_ips() where interfaces were being marked as type 'bridge' or 'bond'.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-27 12:31:33 -05:00
digimer
5dbdd20d7e * Fixed a bug in Network->load_ips() where the IP address on a bridge or bond was having the device name recorded incorrectly.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-23 00:39:43 -05:00
digimer
ab3e8afe6e Fixed a bug in Storage->push_file() where file path wasn't updated from incoming to files, preventing the push to other hosts from working. Also fixed a minor issue where the file size was sometimes 0, making transfer calculations useless.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-22 13:21:29 -05:00
digimer
254f7ef4e2 This should fix the tracking of what files belong where, using the new DR links system. It also should finish (though testing is still needed) the serial rsync issue.
* Created Database->track_files() as a dedicated method as trying to verify the existence of file_locations during Database->load_anvils() was fragile and prone to recursive loops.
* Updated Database->insert_or_update_file_locations() to take an anvil_uuid and recursively call for each host, to maintain compatibility with the old ways, and make it simpler to add an entry for both sub-nodes in an Anvil!.
* Created Storage->push_file() that takes a file and rsync's it to all other machines, or creates a job for the file to be pulled if the target can't be accessed.
* Updated anvil-manage-files and anvil-sync-shared to use the new Storage->push_files and Database->track_files methods.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-22 02:13:19 -05:00
digimer
645f54ab89 This commit has more changes than I would normally like, but it's all linked to changing file uploads to rsync serially.
* To update file handling for the new DR host linking mechanism, file_locations -> file_location_anvil_uuid was changed to file_location_host_uuid.
  This required a fair number of changes elsewhere to handle this, with a particular noted change to Database->get_anvils() to look at host_uuid's for the subnodes in an Anvil! and, if either is marked as needing a file, make sure the peer is as well. Similarly, any linked DRs are set to have the file as well.
* Created a new Network->find_access that simply takes a target host name or UUID, and it returns a list of networks and IPs that the target can be accessed by.
* Updated Network->load_ips() to find the network interface being used for traffic so that things like the interface speed can be recorded, even when an IP is on a bridge or bond.

Unrelated, but in this commit, is a restoration of calling scan agents with a timeout now that the virsh hang issue has been resolved.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-14 02:29:40 -05:00
digimer
7710d9d109 * Created the new anvil-manage-server-storage tool which will specifically handle managing a server's disks.
* Created DRBD->parse_resource() to pass a specific DRBD resource's XML data.
* Fixed a bug in Get->available_resources() so that if the threads is lower than CPU cores, the cores are used as the total available to VMs.
* Fixed bugs in Get->server_from_switch() where it just wasn't working properly.
* Updated scan_drbd to not reset a resource's size to 0-bytes when a resource goes offline.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-03 22:05:34 -05:00
digimer
9751c883cb * Updated Cluster->assemble_storage_groups() to remove refrences to anvil_dr1_host_uuid. Also added the logic for auto-adding DR host's VGs to a storage group. Commented it out though as, for now, this might be a bad idea. Needs more thought.
* Fixed a bug in Database->get_storage_group_data() to load hosts data when needed. Also fixed a bug where new members didn't return the new storage_group_member_uuid.
* Updated anvil-manage-host to use the new switch handler.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-01 23:19:38 -05:00
digimer
7773e5f9b8 * Updated logging in DRBD->get_devices().
* Added a check and exit if anvil-manage-dr is asked to protect a server on a machine that doesn't know about that server.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-01-30 11:30:36 -05:00
digimer
d88fde7733 Updated DRBD->delete_resource() to use '--force' instead of 'echo Yes' (which no longer works).
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-01-20 23:16:30 -05:00
digimer
e012d6016c Tha major point of this commit is to add the new 'anvil-manage-storage-groups' program that, well, manages storage groups.
* Updated the storage_group_members table to add the 'storage_group_member_note' that can be set to 'DELETED' to track when a member is deleted. Updated anvil-version-changes to check for and add this column as needed. Updated the anvil.sql schema for the same.
* Updated Cluster->insert_or_update_storage_group_members to add the new column.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-01-20 22:10:15 -05:00
digimer
1a217d21cf * Updated anvil-manage-dr to provide the ability to link anvil nodes to dr hosts. Also began work on making it work with the new DR links system.
* Created Database->get_anvil_uuid_from_string(), Database->get_host_uuid_from_string() and Database->get_server_uuid_from_string() to simplify the process of converting --anvil <string>, --host <string> and --server <string> respectively.
* Fixed bugs in Database->get_dr_links() and Database->insert_or_update_dr_links().
* Updated Database->insert_or_update_states() to make direct calls to hosts instead of using get_hosts to drop out if a host_uuid doesn't yet exist in a DB.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-19 19:41:02 -05:00
digimer
16fc4e131c * Fixed a bug where, if a specific request to do a DB resync was made but the active_uuid wasn't matching the host, it wouldn't resync. This broke peering Strikers when the peer source was not the active_uuid.
* Updated anvil-manage-dr to check and delete duplicate dr_link entries.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 22:53:15 -05:00
digimer
985338a064 Fixed typo that broke compilation.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 21:15:09 -05:00
digimer
17863404e3 * Updated Database->_age_out_data() to only run once per day, unless explicitely called with --age-out-database.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 20:33:28 -05:00
digimer
3d6f71f27e * Updated Database->connect to clean up duplicates on setting the read UUID and database handle.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 16:45:26 -05:00
digimer
26a1fe1491 * Updated Database->connect() to allow local reads on strikers, regardless of the active DB.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 15:47:34 -05:00
digimer
98c3868870 * Updated fence_pacemaker to no longer use stonith_admin and instead use pcs. This should resolve the main part of issue #279
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 00:22:03 -05:00
digimer
5fcbb1643c * Updated Database->connect() to set an 'active_uuid', and the host with that UUID will be the only one to do resyncs. This might help with frequent resyncs, which could be caused by simultaneous resyncs happening on both nodes stepping on each other. This should help with issue #276
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-17 17:51:49 -05:00
digimer
6ca0e0da90 * Updated Database->connect() to only try to load from dump files if 2+ databases are configured in striker.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-16 20:51:29 -05:00
digimer
ff69916a85 * Applied typo fixed from PR #286 (thanks, Deezzir!). Also moved all the raw prints into words.xml.
* Updated Convert->human_readable_to_bytes() to return an empty string if passed an empty string.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-16 20:23:29 -05:00
digimer
9d2f9c4d88 * Fixed a string key name typo.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-15 19:53:57 -05:00
digimer
383a6df7c5 Updated Convert->bytes_to_human_readable() to accept already human-readable sizes and return that.
This resolves issue #282.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-13 23:04:33 -05:00
digimer
a3988cc3e5 * Added System->configure_logind() to ensure that nodes are configured to ignore ACPI power button events so that IPMI-based fences work immediately.
* Added call to System->configure_logind() to anvil-join-anvil and anvil-version-changes.
* Updated fence_pacemaker to add '--reboot' to the 'stonith_admin' call to ensure DRBD-triggered fence requests reboot instead of just turning nodes off.
This commit address issue #279.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-13 21:42:10 -05:00