Commit Graph

76 Commits

Author SHA1 Message Date
digimer
79ff96cee5 * Fixed a bug in DRBD->manage_resource() that prevented new resources from being created.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-09 11:23:57 -04:00
digimer
3ee30e6e24 * Updated DRBD->allow_two_primaries() to gracefully fail if the peer isn't connected.
* Updated DRBD->manage_resource() to check if the host is StandAlone when asked to 'up' a resource and, if so, connect first. Also updated this to error out gracefully if the call to allow_two_primaries() returns non-zero.
* Update Server->migrate_virsh() to error out gracefully if the DRBD->allow_two_primaries() returns non-zero.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-08 22:52:16 -04:00
digimer
55aaf7876e Starting work on checking if the peer is connected before managing allow-two-primaries.
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-08 17:33:39 -04:00
digimer
59ade94124 * Added PID logging as an option, and enabled it in ocf:alteeve:server
* Updated DRBD->manage_resource() to take the task 'adjust'.
* Updated ocf:alteeve:server's start_drbd_resource() to call adjust if startup of a resource isn't needd.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-08-07 22:28:13 -04:00
digimer
be290bf561 This commit fixes a bug where the drbd kernel module build was being killed mid-compile, leaving DBRD unusable.
* Created System->wait_on_dnf() which was plucked from anvil-daemon, and now also called in scancore and anvil-safe-start.
* Updated scancore and anvil-safe-start to check on start that DRBD's kernel module is available (and build if not).

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-24 22:32:41 -04:00
digimer
3016fb875b * Reworded striker-update-cluster to use anvil-update-system for on-system OS updates.
* Updated DRBD->get_status() to take the new 'host' paramter to allow the caller to define the hash key string used in the stored data.
* Updated Get->anvil_version() (and a few other places) to use the new 'striker-ui-api' shell user, replacing the 'apache' user.
* Updated Remote->test_access() to take the new 'close' parameter to close the SSH session used when testing access to the target.
* Fixed a logging bug in anvil-manage-power.
* Updated anvil-update-system to take the '--no-reboot' and 'clear-cache' command line switches.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-07-14 22:29:07 -04:00
digimer
7fbed10864 * Updated Remote->call() to take the new 'background' parameter.
* Continues work on adding new disks (DRBD volumes) to anvil-manage-server-storage.
* Updated DRBD->get_status() to record the peer-role.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-29 22:17:58 -04:00
digimer
ae55ca9187 * Applied the fix for TCP ports aging out reserved TCP ports properly to DRBD->get_next_resource().
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-27 00:04:48 -04:00
digimer
ea95d26cc5 * Fixed a bug in DRBD->get_next_resource() where reserved minor numbers were not being released. Also added a new parameter, "minor_only", that returns the next minor number but doesn't bother processing TCP ports.
* Did more work on adding support for adding new disk drives to servers in anvil-manage-server-storage.
* Updated anvil-manage-storage-groups To check for / delete duplicate storage groups with the same name.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-26 23:55:19 -04:00
digimer
47f7a35df3 The main purpose of this commit is to add serial execution of similar jobs to help reduce race conditions for scripted jobs, like multiple server creation.
* Fixed a small logging bug in DRBD->allow_two_primaries().
* Updated Database->get_jobs() to record jobs sorted by modified_date so that jobs can be run in the order they were recorded.
* Updated anvil-daemon to track which commands need to be run, and when two or more of the same command need to be run, they're run serially, with each subsequent run starting after the previous one completes.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-15 21:13:53 -04:00
digimer
dda0fbd7d5 * Updated DRBD->allow_two_primaries() to be more careful at evaluating peer-node-id.
* Updated DRBD->manage_resource() to set allow-two-primaries=no when up'ing a resource (as no migration can be in progress during an up command).
* Updated scan-drbd to look for StandAlone resources and call DRBD->manage_resource({task = 'up'}) if a connection to a peer node is StandAlone or if the local disk state is detached.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-14 23:34:05 -04:00
digimer
b6a249d5e7 * Updated Cluster->add_server() to set the preferred host based first on if the server is running on a node, and if not, on the primary node (where before it defaulted to node 1).
* Updated DRBD->delete_resource() to call scan-drbd and scan-lvm to ensure that the database is updated with the newly freed resources.
* Updated anvil-delete-server and anvil-provision-server to call select scan agents to ensure freed resources are immediately recorded.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-11 23:46:21 -04:00
digimer
0e57836c8f This commit addresses (hopefully) issue #329.
* Updated DRBD->get_status() to attempt to recompile the drbd kernel module if the drbdsetup status fails. If it continues to fail, it exits gracefully now.
* Updated ocf:alteeve:server to test access over a given IP before calling Server->find to avoid timeouts when the peer is down. Also updated it to set the constraints to keep the server on the new host when the old host returns to the cluster.
* Fixed a bug in scan-cluster where a server that is FAILED but not running is now properly recovered.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-06-05 22:53:34 -04:00
digimer
895f1ec262 This fixes a race condition when multiple servers are provisioned at (nearly) the same time.
* In DRBD->get_next_resource(), implemented a "hold" system where the DRBD minor and TCP port(s) returned are marked as being held for one minute. So subsequent calls won't use the same numbers.
* In anvil-daemon, added a check in run_jobs() where only one instance of a given job command will be started per 2-second loop. This should help reduce the chance of simultaneous race confitions in general.
* Removed from anvil-provision-server and most other tools the call to Job->get_job_uuid(). If the program is called without the job_uuid, don't try to find it. This allows a human (or script) to make repeated calls to a program without one of those calls running a pending job instead.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-28 00:19:53 -04:00
digimer
e7537b0ca3 * Fixed a bug where, when DRBD->gather_data() calls 'drbdadm dump-xml' and the output includes usage data, it breaks XML parsing.
* Fixed a bug in Get->available_resources() where DELETED servers were being counted in the used resources math.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-04-25 13:12:13 -04:00
digimer
efebd135eb * Removed more references to 'dr1_host_uuid' from the old way of linking DR hosts to Anvil! nodes.
* Fixed a bug where servers protected by DR hosts aren't deleted when the server itself is deleted.
* Updated DRBD->delete_resource() to remove the server's XML file if the host is a DR host.
* Updated anvil-version-change and anvil.sql to enable update_audits and the audits table.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-30 12:50:44 -04:00
digimer
040bc02e26 * This adds the new Database->get_drbd_data() that, like ->get_lvm_data, collates the DRBD data collected by scan-drbd into more readibly parsable data structure.
* Updated DRBD->parse_resource() to add references to a resource name and volume for a given backing disk.
* Comtinued work on anvil-manage-server-storage.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-24 19:45:47 -04:00
digimer
8e0e51544c * Continued work on anvil-manage-server-storage.
* Created the new Database->get_lvm_data to compile LVM data from scan-lvm
* Updated DRBD->parse_resource to call Database->get_lvm_data if needed, and to track backing devices to Storage Groups.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-22 22:57:26 -04:00
digimer
7710d9d109 * Created the new anvil-manage-server-storage tool which will specifically handle managing a server's disks.
* Created DRBD->parse_resource() to pass a specific DRBD resource's XML data.
* Fixed a bug in Get->available_resources() so that if the threads is lower than CPU cores, the cores are used as the total available to VMs.
* Fixed bugs in Get->server_from_switch() where it just wasn't working properly.
* Updated scan_drbd to not reset a resource's size to 0-bytes when a resource goes offline.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-02-03 22:05:34 -05:00
digimer
7773e5f9b8 * Updated logging in DRBD->get_devices().
* Added a check and exit if anvil-manage-dr is asked to protect a server on a machine that doesn't know about that server.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-01-30 11:30:36 -05:00
digimer
d88fde7733 Updated DRBD->delete_resource() to use '--force' instead of 'echo Yes' (which no longer works).
Signed-off-by: digimer <mkelly@alteeve.ca>
2023-01-20 23:16:30 -05:00
digimer
dfa93a1837 * Added 'setsid' to all 'virsh' calls as nested calls (ie: crm_resource -> ocf:alteeve:server -> virsh) would fail because virsh couldn't connect to a terminal. See:
** https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background
* Added explicity setting of $ENV{PATH} when it's null (as it is when pacemaker calls our tools).
* Updated the copyright to 2023.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-12 21:52:26 -05:00
digimer
a5cee52153 * Fixed a bug in DRBD->get_devices() where old test host UUIDs were left hard-coded.
* Fixed a duplicate header in words.xml
* Fixed display bugs in anvil-report-usage and removed the old DR host display info.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-04 22:58:28 -05:00
Digimer
4528f07508 * Fixed a bug where fence-handler was repeatedly added by scan-drbd.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-06 21:30:16 -05:00
Digimer
4fa8d7a446 * This completes the rework of DRBD triggered fencing to use / clear location constraints instead of triggering a power fence.
* Added the new unfence_pacemaker DRBD unfence handler.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-30 16:13:38 -05:00
Digimer
a4ef93404c * Fixed a bug in DRBD->gather_data() to remove trailing commas for existing TCP ports.
* Added the missing 'clear-mapping' switch to Get->switches in anvil-daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-10-05 20:15:32 -04:00
Digimer
2fab7bc1b7 This adds support (testing needed) for "Long-Throw" DR; which is a wrapper for using 'drbd-proxy' to provide larger transmit buffers so slow/high-latency DR hosts.
* Created DRBD->check_proxy_license() to do (some level of) sanity checks on the DRBD proxy license file.
* Updated DRBD->gather_data() to parse out the inside and outside ports for resource configs using proxy.
* Reworked DRBD->get_next_resource() to return 1, 3 or 7 TCP ports depending, with the new long_throw_ports parameter triggering the 7 ports.
* Added 'tcpdump' to the anvil-core requires list.
* Reworked scan-drbd to record the ports used in proxy configs. This required adding a check to change the 'scan_drbd_peer_tcp_port' column type to 'text' to support CSVs.
* Reworked anvil-manage-dr (needs testing!) to support "long-throw" DR configs.
* Updated anvil-safe-stop to check if the nodes are in the cluster before trying to migrate.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-21 23:35:06 -04:00
Digimer
9675ebf986 * Added --remove support to anvil-manage-dr, completing all the features for this tool.
* Updated DRBD.pm to move the logic to wipe and delete an LV into a new method called 'remove_backing_lv'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-24 22:08:48 -04:00
Digimer
bce9e2caaf This is the first attempt at enabling firewalld completely. There is a decent chance that problems exist, so it won't be a surprise if a few more commits are needed to this branch before things work.
* Added multiple new private methods to Network that help in managing the firewall.
* Updated Server->boot_server to manage the firewall after the server boots. Updated ->migrate_server to create a job, if a database connection exists, for the migration target to update it's firewall as soon after the server appears as possible.
* Updated ocf:server:alteeve to manage the firewall when called post-migration, in case there was no DB connection and the job above didn't run. Fixed a bug where the disk state wasn't being evaluated properly.
* Updated scan-server to check that the firewall is managed when a server state has changed.
* Updated anvil-daemon to run Network->manage_firewall on startup.
* Heavily reworked 'anvil-manage-server' to either just run 'Network->manage_firewall', or if passed '--server X', to wait for the server to appear for up to 1 minute, then to check that the firewall is managed (to capture servers being migrated to the host.)
* Removed firewall management from striker-prep-database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-02 17:06:04 -04:00
Digimer
142be7674e * Fixed a bug in striker-scan-network where the scan wasn't running properly when no network was specifically given.
* Updated DRBD->get_devices() to store information about the nodes for each resource.
* Got more work done on anvil-report-usage.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-04-08 18:54:55 -04:00
Digimer
4751c6e747 Updated DRBD->get_devices() and Server->parse_definition() to take 'anvil_uuid' so that server data can be parsed from anywhere.
Created, but not finished, tools/anvil-report-usage that will print a report of server resource allocation and Anvil! resource availability.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-04-06 00:16:32 -04:00
Digimer
bc39c3fe5c Updated ocf:alteeve:server to better handle multi-peer DRBD configurations.
Cleaned up some logging in DRBD->get_status.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-02-21 19:18:28 -05:00
Digimer
3346d31194 * Created Get->kernel_release() that returns the current kernel release (version) in use on the host or on a remote system.
* Created DRBD->_initialize_drbd() to makes sure the DRBD kernel module can load and tries to build the module, if necessary. This is meant to provide support for clients that can't access needed internet resource (or the internet at all).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-12-07 20:03:39 -05:00
Digimer
a034583213 * Updated DRBD->gather_data() to record TCP/IP data between connections of two hosts.
* Updated anvil-manage-dr to use the TCP ports already configured for a resource when re-configuring a DR resource that has been previously configured.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-20 22:34:36 -04:00
Digimer
6664c5b77f * Fixed a bug where scan-drbd, with DR configured, was not recording TCP ports assigned to connections properly.
* More bugs fixed in anvil-manage-dr, tested repeatedly as a job and so far, so good. Other functionality still to come.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-12 23:34:25 -04:00
Digimer
5b35204af4 * Updated DRBD->get_next_resource() to take the new 'dr_tcp_ports' ports which, if set, returns two free TCP ports.
* Got anvil-manage-dr to the point where it writes the updated resource configuration to enable DR support. (untexted)

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-09 23:07:03 -04:00
Digimer
606bd8f1f0 Continuing work on anvil-manage-server.
Created Storage->get_storage_group_from_path() that takes a block device path and tried to find the Storage Group it belongs to.
Updated Storage->get_storage_group_data() to make it possible to look up a storage group UUID using the SG's name.
Updated DRBD->gather_data() to take a pre-generated XML via the new 'xml' parameter.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-06 14:21:11 -04:00
Digimer
d7d418ee1b * Fixed a bug in DRBD->gather_data() where the peer node's data was being recorded where the local node's data should have been saved.
* Fixed a bug in anvil-delete-server where, if a server was off already, the server would not be removed from pacemaker.
* WIP - continuing on scan-network

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-30 14:58:36 -04:00
Digimer
607c097fc8 * Fixed a bug where, once a DRBD resource was allowed to be dual-primary for migration, that wasn't properly disabled post-migration.
* Updated DRBD->allow_two_primaries() to take the 'set_to' parameter which can be 'yes' to all and 'no' to disallow dual-primary.
* Updated ocf:alteeve:server to call allow_two_primaries() with 'set_to' = 'no' instead of calling 'adjust' after a migration completes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-25 00:45:52 -04:00
Digimer
d155c2eb66 * Fixed a bug where 'timeout' would repeatedly get added to drbd's global-common.conf file.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-24 20:03:17 -04:00
Digimer
41cd1e0319 * Several bugs fixed and enhancements;
* DRBD is now configured to a ping-timeout of 3 seconds.
* Created Log->switches() that returnes the command line switches used by Anvil! tool command line calls based on the active log levels / secure logging. Appended this to all invocations of our tools.
* Updated Database->resync_databases() to now only skip 'jobs' and 'variables' tables with less than 10 record differences. All other differences will trigger a resync.
* Created System->_check_anvil_conf() that, as you might guess, checks in anvil.conf exists and created it (using defaults), if not. It also checks to see if the 'admin' group and user exists and creates them, if not.
* Updated anvil-daemon to check anvil.conf on start up and in each loop. Created the function check_journald() that checks (and sets, if needed) that journald logging is persistent.
* Made striker-manage-peers to check_if_configured on the Database->connect() when updating anvil.conf and the target UUID is the local machine. Also created a loop to make the reconnection a lot more robust.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-24 00:09:32 -04:00
Digimer
49890762b9 * Fixed a missing semi-colon that broke Database.pm.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-22 12:53:49 -04:00
Digimer
ca7052dd53 The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :)
* Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively.
* Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf.
* Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state.
* Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested.
* Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times.
* Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan.
* Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-30 22:58:01 -04:00
Digimer
f202187c34 * anvil-safe-stop is complete! Testing still needed, of course.
* Updated DRBD->manage_resource() to call 'drbdadm adjust <res>' when starting a resource to help deal with a periodic issue where the 'allow-two-primary' option on the peer doesn't match the local setting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-23 11:56:11 -04:00
Digimer
2e37691116 * Updated DRBD->gather_data() to store data on peers so that the peer's LV path and backing disk is recorded. Also fixed a bug in ->get_status() where the return code for local calls was stored as a host name.
* Added the scan-hpacucli scan agent. It's been done for a while and should have been added ages ago.
* Updated anvil-rename-server to get to the point where it will take down the DRBD resources on all machines, but waits if there is a sync under way. It also verifies that the server is off on all systems from virsh's perspective.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-20 22:46:51 -04:00
Digimer
fb0836f912 * THe get_cpu endpoint was completed.
* The get_mmeory endpoint was completed.
* The get_replicated_storage endpoint was completed, though it requires testing and likely has issues.

To prepare for the get_status endpoint work, I needed to update ScanCore and modules to track the host_status. This commit contains the work needed for this.
* Updated ScanCore->post_scan_analysis_striker() to use configured fence devices (except PDUs) to check if a target host is off or on, in there is no host_ipmi interface. In all cases, if a machine can be confirmed on or off, the host_status is now updated.
* To support the above fence based power checks, updated scan-cluster to store the on-disk CIB in the new scan_cluster -> scan_cluster_cib colume.
* Updated ScanCore->parse_cib() to map stonith primitive IDs to fence agents. Updated ->parse_crm_mon() to not call if the executable doesn't exist to avoid unhelpful error messages in the logs when called from a Striker.
* Update DRBD->gather_data() to get the size data from /sys/block/drbd<minor>/size' x '/sys/block/drbd<minor>/queue/logical_block_size so it works when a device is Secondary (and can't be promoted).
* Updated Database->get_hosts_info() to record the short host name as well as the stored host name. Created ->update_host_status() as a wrapper to ->insert_or_update_hosts() that only updates the host status.
* Updated anvil-join-anvil to disabled ksm and ksmtuned daemons.
* Updated scancore and anvil-daemon to set the host_status to 'online' on startup.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-09 20:51:29 -04:00
Digimer
59b867cc25 * Updated DRBD->gather_data() to check if drbdadm exists before trying to call it to avoid scary errors in the logs. Also moved some strings that pulled from the scan-drbd agent into the main words file.
* Fixed a bug in ScanCore->agent_startup() where a (thankfully broken) check to append tables to the 'sys::database::check_tables' would cause an infinite loop as both were pointers to the same anonymous array.
* Fixed a bug in scan-ipmitool where the scan_ipmitool_variables table didn't use a host_uuid reference, causing resyncs of that table to sync for all hosts and cause DB errors when the scan_ipmitool record from another host wasn't sync'ed yet.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-31 01:36:11 -04:00
Digimer
4b9ec56106 * Updated DRBD->delete_resource() to return a success if asked to delete a non-existent resource (as can happen when partial anvil-delete-server runs are re-run).
* Reworked DRBD->get_next_resource() to pull from the database, and to no longer do that increments-of-three nonsense. Avoidable complexity. Also added a call to Cluster->get_anvil_uuid() if the 'anvil_uuid' parameter wasn't passed.
* Updated Database->get_host_from_uuid() and ->get_hosts() to now take 'include_deleted' parameter and default to not returning deleted hosts. This fixed issues where anvil-{delete,provision}-server calls could assign jobs to now-deleted hosts with reused host names.
* Updated anvil-delete-server to print log entries to STDOUT. Also updated it to not wait of shutdown of a server in pacemaker to complete, and instead to destroy it after calling pacemaker's resource stop. Updated to also check to see if the server being deleted is already out of pacemaker and, if so, skip that step and directly try to destroy the server, if it's running.
* Updated anvil-provision-server to force 'peer_mode' runs to pull their TCP Port and DRBD minor numbers from the job. This fixes a bug where the same resource on two machines could use different TCP ports.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-02 23:09:47 -05:00
Digimer
1081645893 * Added parameters to DRBD->get_next_resource to allow for a resource to be searched and either error out if a resource is found, or return the first DRBD minor and tcp port if found.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-31 02:32:12 -05:00
Fabio M. Di Nitto
8f9892650b [build] first pass at adding a build system to integrate with CI
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-01-30 20:16:30 +01:00