Commit Graph

179 Commits

Author SHA1 Message Date
digimer
a27773a69d scan-network now records interfaces, bonds and bridges!
* Much testing still needed, but this is a significant milestone.

Signed-off-by: digimer <mkelly@alteeve.ca>
2024-01-27 15:39:01 -05:00
digimer
9c67b97fdd Fixed a bug in initializing DROP'ed DBs.
* Got more work done on adding network_interfaces to the database in
  scan-server.

Signed-off-by: digimer <mkelly@alteeve.ca>
2024-01-27 15:39:01 -05:00
digimer
ec11335197 Fixed DB initialization bugs.
* More work done on the new network stack also.

Signed-off-by: digimer <mkelly@alteeve.ca>
2024-01-27 15:39:01 -05:00
digimer
52e7875252 Bumoed logging to find '!!error!!' related parsing errors.
Signed-off-by: digimer <mkelly@alteeve.ca>
2024-01-27 15:39:01 -05:00
digimer
c5e72797fd Made the match for the partition 'swap' more flexible.
Signed-off-by: digimer <mkelly@alteeve.ca>
2024-01-20 16:43:35 -05:00
digimer
e4fc831284 Added a missing variable to an alert.
Signed-off-by: digimer <mkelly@alteeve.ca>
2024-01-20 16:39:08 -05:00
digimer
822854f0c3 Fixed a bug in scan-ipmitool that was causing duplicate history entries
* Increased logging to scan-ipmitool and scan-network to help trace a
  duplicate DB entry bug.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-07 15:32:57 -05:00
digimer
51978e1609 Update scan-server to only alert on large boot time changes
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-11-05 17:07:26 -05:00
digimer
a11b87458e Gracefully handle errors from changed node host names in scan-cluster.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-20 19:15:04 -04:00
digimer
5ec395c53a Reworked DB resync logic.
With this new system, a 'primary_db' is chosen (first connected DB UUID when sorted) and only it does resyncs. Further, resyncs have been pulled from all tools except anvil-daemon. So with this new system, the chances of duplicate, simultaneous resyncs should be removed (hopefully for real this time).

* Database->check_agent_data() no longer calls a resync after loading a
  schema.
* Removed the Database->coonnect() 'all' parameter
* The database used to read from is now always the same as the primary,
  even if there is a local DB.
* Database->connect() 'check_for_resync' parameter can now be set to
  '2', which means "check for resync _if_ I am primary", where '1' still
  checks for resync no matter what.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-19 20:41:57 -04:00
digimer
b1f89c2723 Finished initial version of striker-show-jobs
* Updated Database->get_jobs() to take 'job_host_uuid = all' to allow
  loading jobs from all cluster machines. Also updated it to record the
  'job_host_uuid' and the unix timestamp version of 'modified_date'.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-18 20:51:20 -04:00
digimer
829ae546a2 Beginning work on new Server->locate() method to find servers across an
Anvil! cluster.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-10-11 17:22:06 -04:00
digimer
122816255d * Fixed a bug where a sensor value of '0' was being interpretted as the value not existing.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-21 17:33:01 -04:00
digimer
580980717d This commit covers the convertion of 'virsh' shell calls to using 'Sys::Virt' module, and fixes several small bugs related to scan-server;
* Switched all calls to virsh to use Sys::Virt to deal with contention of simultaneous virsh calls.
* Removed collecting screenshots from scan-server.
* Fixed a bad variable substitution in an alert.
* Fixed a bug where a server's boot time wasn't being recorded properly.
* Reworked how we determine which server definition was most recently updated and propogated.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-09-21 15:59:43 -04:00
digimer
a81a110261 * Remove forced log level and secure logging. This addresses issue #386
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-09 18:20:14 -04:00
digimer
a0cb791f47 This contains fixes needed for beta from additional testing.
* Updated the pcs wrapper to flock anything but status calls.
* Updated scan-apc-pdu to purge regardless of the host it's called on any host.
* Fixed a bug striker-purge-target that wouldn't purge anvil nodes in various cases.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-09 18:07:03 -04:00
digimer
6ee2ad75db * Updated anvil-delete-server to actively check for and delete any drbd-fenced attributes left over in the CIB after a server is deleted. This addresses issue #374.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-25 21:45:34 -04:00
digimer
6a7c9923ad * Fixed second variable replacement bug, re issue #338.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-25 10:37:41 -04:00
digimer
9ebe192306 This fixes a variable substitution but, addressing issue #338.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-23 15:37:23 -04:00
digimer
7258781712 * Updated scan-cluster to detect stale drbd-fenced attributes in the CIB, generally left after a server is deleted. This addresses issue #374.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-23 15:28:43 -04:00
Tsu-ba-me
dac247f66e fix(scancore-agents): get screenshot of server(s) running on local node in scan-server 2023-07-13 01:52:15 -04:00
digimer
e0316da88b * Got anvil-manage-server-storage working enough to grow existing disk's hard drive sizes, and to insert/eject optical disks.
* Hit a bug where a server's definition file was written to disk while not being valid. Added logging in case it happens again, and additional safe-guards to help avoid it from recurring.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-23 23:09:55 -04:00
digimer
dda0fbd7d5 * Updated DRBD->allow_two_primaries() to be more careful at evaluating peer-node-id.
* Updated DRBD->manage_resource() to set allow-two-primaries=no when up'ing a resource (as no migration can be in progress during an up command).
* Updated scan-drbd to look for StandAlone resources and call DRBD->manage_resource({task = 'up'}) if a connection to a peer node is StandAlone or if the local disk state is detached.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-14 23:34:05 -04:00
digimer
929806cef7 Fixed variable substitution names in scan-server.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-09 20:08:38 -04:00
digimer
b03587967b * Updated Cluster->add_server() to batch the creation of the server and the location constraints in one commit to the CIB.
* Updated scan-lvm to look for and delete duplicate entries.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-09 19:59:23 -04:00
digimer
b7abc481e6 Updated scan-cluster to check to see that migrate_to and migrate_from are given a timeout of 600s and an on-fail of "block". Updated Cluster->add_server() to set migrate_from to timeout=600s and on-fail=block as well.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-08 20:30:25 -04:00
digimer
bc3d04ad2e * Updated Cluster->add_server() to wait up to 15 seconds for a server to appear to ensure that the pcs call to add the server with the right requested running state.
* Updated Cluster->recover_server() to set the desired recovery state before calling the crm_resource refresh.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-06 14:34:02 -04:00
digimer
0e57836c8f This commit addresses (hopefully) issue #329.
* Updated DRBD->get_status() to attempt to recompile the drbd kernel module if the drbdsetup status fails. If it continues to fail, it exits gracefully now.
* Updated ocf:alteeve:server to test access over a given IP before calling Server->find to avoid timeouts when the peer is down. Also updated it to set the constraints to keep the server on the new host when the old host returns to the cluster.
* Fixed a bug in scan-cluster where a server that is FAILED but not running is now properly recovered.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-05 22:53:34 -04:00
digimer
510db70253 Another attempt to resolve the stoage group race condition. This moves the check for auto-assembly to scan-lvm. It only works for the first assemble, after that the user can/should use anvil-manage-storage-groups.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-05-02 00:07:40 -04:00
digimer
83aa4e6a5f Updated scan-cluster to check for FAILED resources (servers) and, if found, attempt to recover it.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-11 16:32:31 -04:00
digimer
1afa7ce09e * Created Cluster->recover_server() that uses crm_resource to try to recover a server that has entered a FAILED state.
* Updated (not not yet completed) scan-cluster's check_resources() function to check if a FAILED server is ready to try to recover.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-10 23:04:15 -04:00
digimer
c7a923fdfb * Fixed a bug in scan-server where DELETED servers were being set to 'shut off'.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-03 14:19:31 -04:00
digimer
bf2e3e25fb * Added a check for undefined variable/value pairs in cachevault data that was causeing SQL UPDATE errors.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-03 13:17:01 -04:00
Deezzir
7d5f18b20d fix: introduced optional arg for clean_spaces 2023-04-03 12:40:58 -04:00
digimer
efebd135eb * Removed more references to 'dr1_host_uuid' from the old way of linking DR hosts to Anvil! nodes.
* Fixed a bug where servers protected by DR hosts aren't deleted when the server itself is deleted.
* Updated DRBD->delete_resource() to remove the server's XML file if the host is a DR host.
* Updated anvil-version-change and anvil.sql to enable update_audits and the audits table.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-30 12:50:44 -04:00
digimer
fea10e5bb1 * Prefixed all 'virsh' calls with 'setsid --wait' to help prevent future hangs if the call happens without a shell.
* Updated anvil-manage-server-storage to the point where it can now insert and eject optical disks!
* Updated System->call to log parameters if 'shell_call' isn't set.
* Fixed a bug in anvil-manage-server process_interactive where an $anvil->data reference was being scoped.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-03 14:42:28 -05:00
digimer
7710d9d109 * Created the new anvil-manage-server-storage tool which will specifically handle managing a server's disks.
* Created DRBD->parse_resource() to pass a specific DRBD resource's XML data.
* Fixed a bug in Get->available_resources() so that if the threads is lower than CPU cores, the cores are used as the total available to VMs.
* Fixed bugs in Get->server_from_switch() where it just wasn't working properly.
* Updated scan_drbd to not reset a resource's size to 0-bytes when a resource goes offline.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-03 22:05:34 -05:00
digimer
76c8088aee * Updated scan-apc-pdu to only run on the active striker DB (as set during Database->connect()) to prevent contention from simultaneous scan agent runs from different machines.
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-17 18:02:38 -05:00
digimer
0fa6ddebc5 Updated scan-network to see an interface state of 'activated' as up (used to check specifically for 'active').
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-14 16:22:51 -05:00
Digimer
eae2ab4d9f * Undid the #!no_value!# -> !!no_value!! change as it broke language processing.
* Fixed a bug in scan-apc-pdu that was preventing it from compiling.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-07 21:52:14 -05:00
Digimer
4528f07508 * Fixed a bug where fence-handler was repeatedly added by scan-drbd.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-06 21:30:16 -05:00
Digimer
4ba1982183 This is the start of a set of changes needed to rework how we handle DRBD fence requests, so that they create location constraints instead of triggering a full stonith fence.
* In Cluster->parse_cib(), added parsers for node attributes and resource rules. Also stored the existence of and details of each under the server resources for easier referencing.
* Updated scan-server to check for / add DRBD fence rules as needed.

Scancore APC agent bugs;
* For clarity, converted all '#!no_value!#' and '#!no_connection!#' to use '!!' instead in APC scan agents.
* Fixed a bug to set/clear alerts related to phases disappearing to deal with concurrent logins from different hosts triggering false phase loss alerts.
* Fixed missing variables not being passed to alerts/log entries.

Started more work on anvil-manage-server, but on hold again while the DRBD fencing work is completed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-29 22:17:12 -05:00
Digimer
13b0f5bdcc Bumped 'Exhaust Temp' jump threshold to 30c in scan-ipmitool.
Adjusted some logging.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-08 20:34:09 -04:00
Digimer
a4ef93404c * Fixed a bug in DRBD->gather_data() to remove trailing commas for existing TCP ports.
* Added the missing 'clear-mapping' switch to Get->switches in anvil-daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-05 20:15:32 -04:00
Digimer
ac8135709a Fixed a bug where scan-server faulted with a divide by zero error when the host had no swap.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-27 00:40:30 -04:00
Digimer
2fab7bc1b7 This adds support (testing needed) for "Long-Throw" DR; which is a wrapper for using 'drbd-proxy' to provide larger transmit buffers so slow/high-latency DR hosts.
* Created DRBD->check_proxy_license() to do (some level of) sanity checks on the DRBD proxy license file.
* Updated DRBD->gather_data() to parse out the inside and outside ports for resource configs using proxy.
* Reworked DRBD->get_next_resource() to return 1, 3 or 7 TCP ports depending, with the new long_throw_ports parameter triggering the 7 ports.
* Added 'tcpdump' to the anvil-core requires list.
* Reworked scan-drbd to record the ports used in proxy configs. This required adding a check to change the 'scan_drbd_peer_tcp_port' column type to 'text' to support CSVs.
* Reworked anvil-manage-dr (needs testing!) to support "long-throw" DR configs.
* Updated anvil-safe-stop to check if the nodes are in the cluster before trying to migrate.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-21 23:35:06 -04:00
Digimer
89121a2b3b * Fixed a bug in Alert->check_condition_age() where not setting a host_uuid caused the returned age to always be 0.
* Updated scan_apc_pdu to not report a lost PDU unless it's been gone for ten minutes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-29 17:30:52 -04:00
Digimer
4ecc6097d3 * Cleaned up some old 'die' calls with better nice_exit() calls to help avoid dangling db_in_use flags.
* Reworked Network->bridge_info() to use 'ip' to get the list of bridges, and 'bridge' to find interfaces connected to the bridge.
* Added 'test' messages to Words->string().
* Fixed a bug in scan-lvm where mdadm based PVs didn't read the sector size properly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-12 16:32:20 -04:00
Digimer
ef3ac86162 * Fixed a bug where setting the db_in_use flag without a valid $ENV{_}.
* Added a nice_exit call to tools/striker-access-database

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-09 15:45:10 -04:00
Fabio M. Di Nitto
7decdb2887 scan-network: fix path to script
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2022-08-04 19:52:46 +02:00