Commit Graph

721 Commits

Author SHA1 Message Date
Digimer
c1e4380a64
Merge branch 'main' into anvil-tools-dev 2023-07-15 00:06:49 -04:00
digimer
458cb267da * Fixed a bug in Cluster->get_primary_host_uuid() where servers were not loaded before trying to calculate RAM use.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-15 00:04:12 -04:00
digimer
4dc1b0e117 * Added a check to Network->get_company_from_mac() to manually set the company to KVM/qemu if the prefix is 52:54:00.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 23:00:16 -04:00
digimer
02c3d204ea * Updated anvil-update-system to set 'job_data' to track reboots, and striker-update-cluster to read it.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 22:52:51 -04:00
digimer
3016fb875b * Reworded striker-update-cluster to use anvil-update-system for on-system OS updates.
* Updated DRBD->get_status() to take the new 'host' parameter to allow the caller to define the hash key string used in the stored data.
* Updated Get->anvil_version() (and a few other places) to use the new 'striker-ui-api' shell user, replacing the 'apache' user.
* Updated Remote->test_access() to take the new 'close' parameter to close the SSH session used when testing access to the target.
* Fixed a logging bug in anvil-manage-power.
* Updated anvil-update-system to take the '--no-reboot' and 'clear-cache' command line switches.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 22:29:07 -04:00
Tsu-ba-me
a7751da153 fix: rename, relocate function to find qemu-kvm processes 2023-07-12 18:40:11 -04:00
Tsu-ba-me
c3c69733d9 fix: correct base port check, server info extract, vnc alive assign in Server.pm 2023-07-12 18:27:50 -04:00
Tsu-ba-me
3cce3c39b8 fix: add Server subroutine to extract server VM info from qemu-kvm process(es) 2023-07-12 02:27:26 -04:00
digimer
d56b7f9a84 * Created (but not finished!) the new striker-update-cluster tool.
* Updated Cluster->get_primary_host_uuid() to only load anvils if not already loaded.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-07 17:54:57 -04:00
digimer
a7ebe45f76 This adds the new 'striker-collect-debug' tool that collects all potentially useful debug info into a single tarball.
* Fixed a bug in Get->anvil_from_switch() to work when the Anvil! name is passed.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-05 21:04:05 -04:00
digimer
1b8b0bc493 * Created the new 'anvil-manage-server-storage' with the first role of reloading a DRBD resource.
* Updated Remote->call() to remove the 'background' parameter as it wasn't working.
* Updated anvil-manage-server-storage to use 'anvil-manage-server-storage' to adjust resources in a way that doesn't block.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-30 21:02:30 -04:00
digimer
7fbed10864 * Updated Remote->call() to take the new 'background' parameter.
* Continues work on adding new disks (DRBD volumes) to anvil-manage-server-storage.
* Updated DRBD->get_status() to record the peer-role.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-29 22:17:58 -04:00
digimer
ae55ca9187 * Applied the fix for properly aging out reserved TCP ports to DRBD->get_next_resource().
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-27 00:04:48 -04:00
digimer
ea95d26cc5 * Fixed a bug in DRBD->get_next_resource() where reserved minor numbers were not being released. Also added a new parameter, "minor_only", that returns the next minor number but doesn't bother processing TCP ports.
* Did more work on adding support for adding new disk drives to servers in anvil-manage-server-storage.
* Updated anvil-manage-storage-groups to check for / delete duplicate storage groups with the same name.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-26 23:55:19 -04:00
digimer
65af56d5bd * Updated Database->insert_or_update_jobs() to not look for jobs that are complete when no job_uuid is passed.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-25 23:20:03 -04:00
digimer
e0316da88b * Got anvil-manage-server-storage working enough to grow the size of existing disks, and to insert/eject optical disks.
* Hit a bug where a server's definition file was written to disk while not being valid. Added logging in case it happens again, and additional safeguards to help prevent it from recurring.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-23 23:09:55 -04:00
digimer
1d12fb32b4 * Completed the new anvil-watch-drbd which replaces watch_drbd.
* Updated Email->get_current_server() to always load mail server data from the database.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-22 20:43:46 -04:00
Digimer
8f491e01ed
Merge branch 'main' into anvil-tools-dev 2023-06-20 20:00:10 -04:00
digimer
0aa72498db * This adds the new tool 'striker-check-machines' which simply walks through all known physical machines and checks to see if they're accessible and powered on.
* Updated Get->uptime() to work on remote targets.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-20 19:57:21 -04:00
Tsu-ba-me
b3f2644d07 fix: allow parameter to overwrite cgi input in Account->login 2023-06-20 00:48:21 -04:00
Tsu-ba-me
226c423af0 fix: allow param override in generate_manifest in Striker.pm 2023-06-19 15:15:32 -04:00
digimer
156a0ca201 Updated anvil-daemon's new job launching logic to allow the restart of a running job that failed out early.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-16 11:43:49 -04:00
digimer
cc15eca6fb * Added anvil-watch-power to git.
* Added a check to cleanup size input to Convert->human_readable_to_bytes() when passed pre-processed strings.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-15 21:35:42 -04:00
digimer
47f7a35df3 The main purpose of this commit is to add serial execution of similar jobs to help reduce race conditions for scripted jobs, like multiple server creation.
* Fixed a small logging bug in DRBD->allow_two_primaries().
* Updated Database->get_jobs() to record jobs sorted by modified_date so that jobs can be run in the order they were recorded.
* Updated anvil-daemon to track which commands need to be run, and when two or more of the same command need to be run, they're run serially, with each subsequent run starting after the previous one completes.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-15 21:13:53 -04:00
digimer
dda0fbd7d5 * Updated DRBD->allow_two_primaries() to be more careful at evaluating peer-node-id.
* Updated DRBD->manage_resource() to set allow-two-primaries=no when up'ing a resource (as no migration can be in progress during an up command).
* Updated scan-drbd to look for StandAlone resources and call DRBD->manage_resource({task => 'up'}) if a connection to a peer node is StandAlone or if the local disk state is detached.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-14 23:34:05 -04:00
digimer
b6a249d5e7 * Updated Cluster->add_server() to set the preferred host based first on if the server is running on a node, and if not, on the primary node (where before it defaulted to node 1).
* Updated DRBD->delete_resource() to call scan-drbd and scan-lvm to ensure that the database is updated with the newly freed resources.
* Updated anvil-delete-server and anvil-provision-server to call select scan agents to ensure freed resources are immediately recorded.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-11 23:46:21 -04:00
digimer
b03587967b * Updated Cluster->add_server() to batch the creation of the server and the location constraints in one commit to the CIB.
* Updated scan-lvm to look for and delete duplicate entries.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-09 19:59:23 -04:00
digimer
b7abc481e6 Updated scan-cluster to check to see that migrate_to and migrate_from are given a timeout of 600s and an on-fail of "block". Updated Cluster->add_server() to set migrate_from to timeout=600s and on-fail=block as well.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-08 20:30:25 -04:00
digimer
c82bd9d73a * Created the new anvil-watch-power tool that shows the status of UPSes known on the system, including their "on battery" state, charge percentage, estimated hold up time, etc.
* Updated Database->get_power() and ->get_upses() to store both the time stamp and unix time stamps.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-06 23:40:15 -04:00
digimer
bc3d04ad2e * Updated Cluster->add_server() to wait up to 15 seconds for a server to appear, to ensure that the pcs call adds the server with the requested running state.
* Updated Cluster->recover_server() to set the desired recovery state before calling the crm_resource refresh.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-06 14:34:02 -04:00
digimer
0e57836c8f This commit addresses (hopefully) issue #329.
* Updated DRBD->get_status() to attempt to recompile the drbd kernel module if the drbdsetup status fails. If it continues to fail, it exits gracefully now.
* Updated ocf:alteeve:server to test access over a given IP before calling Server->find to avoid timeouts when the peer is down. Also updated it to set the constraints to keep the server on the new host when the old host returns to the cluster.
* Fixed a bug in scan-cluster so that a server that is FAILED but not running is now properly recovered.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-05 22:53:34 -04:00
digimer
c50a1936c0 * This adds the new 'file_locations' -> 'file_location_ready' column and associated methods. This is set to TRUE/1 when the file referenced is found on disk and it is the expected size and md5sum. This is meant to allow programs to wait/watch for a file to be ready if they need to use it. Files are now checked periodically via anvil-daemon.
* Removed hard-coded log levels in anvil-provision-server and anvil-manage-storage-groups.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-04 00:05:56 -04:00
digimer
26fa3c7e32 Fixed a bug where Get->available_resources() was missing LVM/storage group data in some cases.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-02 16:28:05 -04:00
digimer
510db70253 Another attempt to resolve the storage group race condition. This moves the check for auto-assembly to scan-lvm. It only works for the first assemble; after that, the user can/should use anvil-manage-storage-groups.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-02 00:07:40 -04:00
digimer
e483840ceb Second attempt to fix the storage group race condition. This time, we only let node 1 assemble storage groups.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-01 20:29:20 -04:00
digimer
d64044c7d1 Test fix for storage group race condition.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-01 13:48:27 -04:00
digimer
9a58f4d1ff * This is a small commit to increase logging while chasing down a race condition issue with assembling storage groups.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-30 19:47:58 -04:00
digimer
895f1ec262 This fixes a race condition when multiple servers are provisioned at (nearly) the same time.
* In DRBD->get_next_resource(), implemented a "hold" system where the DRBD minor and TCP port(s) returned are marked as being held for one minute, so subsequent calls won't use the same numbers.
* In anvil-daemon, added a check in run_jobs() where only one instance of a given job command will be started per 2-second loop. This should help reduce the chance of simultaneous race conditions in general.
* Removed from anvil-provision-server and most other tools the call to Job->get_job_uuid(). If the program is called without the job_uuid, don't try to find it. This allows a human (or script) to make repeated calls to a program without one of those calls running a pending job instead.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-28 00:19:53 -04:00
digimer
e7537b0ca3 * Fixed a bug where, when DRBD->gather_data() calls 'drbdadm dump-xml' and the output includes usage data, it breaks XML parsing.
* Fixed a bug in Get->available_resources() where DELETED servers were being counted in the used resources math.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-25 13:12:13 -04:00
digimer
dc7b909bfc More logging to debug storage group race condition
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-18 19:14:59 -04:00
digimer
bd575c6a7d Bumped logging for storage group management.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-18 19:02:51 -04:00
digimer
89eae7098e NOTE: This updates the reserved RAM to 8 GiB from 4 GiB!
* Adds support for 'anvil_resources::ram::reserved' that can be set to a number of MiB to override the default 8192.
* Adds support for 'anvil::<anvil_uuid>::resources::ram::reserved' to allow for per-Anvil! node override on the reserved RAM default, and over the 'anvil_resources::ram::reserved' option.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-17 20:43:28 -04:00
digimer
025c2a6f54 * Updated Email->get_next_server() to ignore DELETED mail servers, and it now loads mail servers if not yet in memory.
This resolves issue #306.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-13 00:26:32 -04:00
digimer
1afa7ce09e * Created Cluster->recover_server() that uses crm_resource to try to recover a server that has entered a FAILED state.
* Updated (but not yet completed) scan-cluster's check_resources() function to check if a FAILED server is ready to try to recover.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-10 23:04:15 -04:00
digimer
f9689a7106 Updated ocf:alteeve:server to look for '/tmp/<resource>.fail' and, if that file exists, exit with rc:1. This is done to allow for testing.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-10 17:40:46 -04:00
Digimer
660f38ac16
Merge branch 'main' into anvil-tools-dev 2023-04-05 16:11:01 -04:00
digimer
cf73d8ed36 * Updated System->configure_ipmi() to auto-configure DR hosts once they've been assigned a BCN IP address.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-05 15:04:39 -04:00
digimer
1c274ba58d * Fixed a bug in anvil-delete-server that was preventing the complete deletion of a server if the DRBD resource had already been removed.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-03 12:40:58 -04:00
Deezzir
109aa1ba3d docs: added annotation for the new arg 2023-04-03 12:40:58 -04:00
Deezzir
7d5f18b20d fix: introduced optional arg for clean_spaces 2023-04-03 12:40:58 -04:00