anvil

Author	SHA1	Message	Date
Fabio M. Di Nitto	fc75bda6ef	ocf:alteeve:server: add support for log levels and bump timeouts also improve logging for migrations Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>	1 year ago
digimer	a0ff080741	* Deleted some old unused code from Cluster->assemble_storage_groups(). Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	ed480cf1cb	* Fixed a double-$ bug in Remote->_check_known_hosts_for_target() * Updated striker-update-cluster to take '--timeout' and a number of seconds, or 'Xm' or 'Xh' for minutes or hours, respectively. Also updated to show the remaining time while waiting, and added waiting timeout to the rest of the while loops that previously had no time limit. This addresses issue #383 and issue #382. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	d68adb5b4e	* Updated anvil-manage-power to not reboot if anvil-version-changes is running (which, if it's taking time, is generating new kmods). Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	458cb267da	* Fixed a bug in Cluster->get_primary_host_uuid() where servers were not loaded before trying to calculate RAM use. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	d56b7f9a84	* Created (but not finished!) the new striker-update-cluster tool. * Updated Cluster->get_primary_host_uuid() to only load anvils if not already loaded. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	dda0fbd7d5	* Updated DRBD->allow_two_primaries() to be more careful at evaluating peer-node-id. * Updated DRBD->manage_resource() to set allow-two-primaries=no when up'ing a resource (as no migration can be in progress during an up command). * Updated scan-drbd to look for StandAlone resources and call DRBD->manage_resource({task = 'up'}) if a connection to a peer node is StandAlone or if the local disk state is detached. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	b6a249d5e7	* Updated Cluster->add_server() to set the preferred host based first on if the server is running on a node, and if not, on the primary node (where before it defaulted to node 1). * Updated DRBD->delete_resource() to call scan-drbd and scan-lvm to ensure that the database is updated with the newly freed resources. * Updated anvil-delete-server and anvil-provision-server to call select scan agents to ensure freed resources are immediately recorded. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	b03587967b	* Updated Cluster->add_server() to batch the creation of the server and the location constraints in one commit to the CIB. * Updated scan-lvm to look for and delete duplicate entries. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	b7abc481e6	Updated scan-cluster to check to see that migrate_to and migrate_from are given a timeout of 600s and an on-fail of "block". Updated Cluster->add_server() to set migrate_from to timeout=600s and on-fail=block as well. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	bc3d04ad2e	* Updated Cluster->add_server() to wait up to 15 seconds for a server to appear to ensure that the pcs call to add the server with the right requested running state. * Updated Cluster->recover_server() to set the desired recovery state before calling the crm_resource refresh. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	e483840ceb	Second attempt to fix the storage group race condition. This time, we only let node 1 assemble storage groups. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	d64044c7d1	Test fix for storage group race condition. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	895f1ec262	This fixes a race condition when multiple servers are provisioned at (nearly) the same time. * In DRBD->get_next_resource(), implemented a "hold" system where the DRBD minor and TCP port(s) returned are marked as being held for one minute, so subsequent calls won't use the same numbers. * In anvil-daemon, added a check in run_jobs() where only one instance of a given job command will be started per 2-second loop. This should help reduce the chance of simultaneous race conditions in general. * Removed from anvil-provision-server and most other tools the call to Job->get_job_uuid(). If the program is called without the job_uuid, don't try to find it. This allows a human (or script) to make repeated calls to a program without one of those calls running a pending job instead. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	bd575c6a7d	Bumped logging for storage group management. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	1afa7ce09e	* Created Cluster->recover_server() that uses crm_resource to try to recover a server that has entered a FAILED state. * Updated (but not yet completed) scan-cluster's check_resources() function to check if a FAILED server is ready to try to recover. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	f9689a7106	Updated ocf:alteeve:server to look for '/tmp/<resource>.fail' and, if that file exists, exit with rc:1. This is done to allow for testing. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	efebd135eb	* Removed more references to 'dr1_host_uuid' from the old way of linking DR hosts to Anvil! nodes. * Fixed a bug where servers protected by DR hosts aren't deleted when the server itself is deleted. * Updated DRBD->delete_resource() to remove the server's XML file if the host is a DR host. * Updated anvil-version-change and anvil.sql to enable update_audits and the audits table. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	9751c883cb	* Updated Cluster->assemble_storage_groups() to remove references to anvil_dr1_host_uuid. Also added the logic for auto-adding DR host's VGs to a storage group. Commented it out though as, for now, this might be a bad idea. Needs more thought. * Fixed a bug in Database->get_storage_group_data() to load hosts data when needed. Also fixed a bug where new members didn't return the new storage_group_member_uuid. * Updated anvil-manage-host to use the new switch handler. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	e012d6016c	The major point of this commit is to add the new 'anvil-manage-storage-groups' program that, well, manages storage groups. * Updated the storage_group_members table to add the 'storage_group_member_note' that can be set to 'DELETED' to track when a member is deleted. Updated anvil-version-changes to check for and add this column as needed. Updated the anvil.sql schema for the same. * Updated Cluster->insert_or_update_storage_group_members to add the new column. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	9d2f9c4d88	* Fixed a string key name typo. Signed-off-by: digimer <digimer@gravitar.alteeve.com>	2 years ago
digimer	a3988cc3e5	* Added System->configure_logind() to ensure that nodes are configured to ignore ACPI power button events so that IPMI-based fences work immediately. * Added call to System->configure_logind() to anvil-join-anvil and anvil-version-changes. * Updated fence_pacemaker to add '--reboot' to the 'stonith_admin' call to ensure DRBD-triggered fence requests reboot instead of just turning nodes off. This commit addresses issue #279. Signed-off-by: digimer <digimer@gravitar.alteeve.com>	2 years ago
Digimer	4ba1982183	This is the start of a set of changes needed to rework how we handle DRBD fence requests, so that they create location constraints instead of triggering a full stonith fence. * In Cluster->parse_cib(), added parsers for node attributes and resource rules. Also stored the existence of and details of each under the server resources for easier referencing. * Updated scan-server to check for / add DRBD fence rules as needed. Scancore APC agent bugs; * For clarity, converted all '#!no_value!#' and '#!no_connection!#' to use '!!' instead in APC scan agents. * Fixed a bug to set/clear alerts related to phases disappearing to deal with concurrent logins from different hosts triggering false phase loss alerts. * Fixed missing variables not being passed to alerts/log entries. Started more work on anvil-manage-server, but on hold again while the DRBD fencing work is completed. Signed-off-by: Digimer <digimer@alteeve.ca>	2 years ago
Tsu-ba-me	c413e62798	fix(striker-ui-api): pass Remote->test_access() user to Cluster->get_primary_host_uuid()	2 years ago
Digimer	bde0b2e7ec	* Fixed a bug where deleting ports from a fence device in an Install Manifest would not cause the fence methods to be removed from the associated cluster. * Created Get->anvil_from_switch and Get->server_from_switch() (both need testing) that takes a string that could be either a name or UUID, figures out which it is, finds the entry in the DB and started the X_uuid and X_name switch variables. * Started work on a second attempt at anvil-manage-server. Signed-off-by: Digimer <digimer@alteeve.ca>	2 years ago
Digimer	d271ffec26	* Updated Cluster->parse_crm_mon() to record the role of stonith resources. * Fixed a bug in System->parse_arguments() where a quoted password without spaces was returned without being recorded in the hash. Also updated logging to log 'suppressed' for passwords when secure logging is disabled. Signed-off-by: Digimer <digimer@alteeve.ca>	2 years ago
Digimer	d8f31d9d84	* Added the anvil-boot-server man page. Signed-off-by: Digimer <digimer@alteeve.ca>	2 years ago
Digimer	1e159f548e	Added a couple notes for later dev. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	0c77736dc8	* Fixed a bug in Cluster->manage_fence_delay() where removing the 'delay="15"' attribute was failing, now set it to 0 instead. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	7e7b91b286	* Updates anvil-join-anvil to update corosync.conf to use the BCN1 link as the main knet network with the SN1 link as the backup link. * Fixed a bug in Cluster->parse_cib() where the local machine's ready state was being set to the node name. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	607c097fc8	* Fixed a bug where, once a DRBD resource was allowed to be dual-primary for migration, that wasn't properly disabled post-migration. * Updated DRBD->allow_two_primaries() to take the 'set_to' parameter which can be 'yes' to allow and 'no' to disallow dual-primary. * Updated ocf:alteeve:server to call allow_two_primaries() with 'set_to' = 'no' instead of calling 'adjust' after a migration completes. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	d3052c0229	* Finished Cluster->check_server_constraints() and added it to scan-cluster. This now makes sure servers don't roll back to their old host after it has been fenced and recovers. * Completely disabled Network->check_network(), it's causing more problems than it solves. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	5a343d6d75	* WIP; Started work on Cluster->check_server_constraints() that will track when a server's location constraint needs to be updated when the old preferred node is lost. * Removed (for now) setting MTU in the ifcfg-X files during anvil-configure-host runs. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	b71ed28f64	* Added Cluster->manage_fence_delay() that reports back and, optionally, sets a preferred node in a fence race. * Updated scan-cluster to check / set which node should be preferred if a netsplit causes a fence race. * Fixed a bug in Server->shutdown_virsh() where a shutdown timeout would go into a loop. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	80bdac8e34	* Updated the pacemaker server config to drop the stop timeout to 5 minutes and the migration timeout to 10 minutes. This will avoid blocking the entire cluster when a stop or migrate operation times out. Will update scan-server to clean these up when they happen. * Updated Database->archive_table() and ->_find_behind_databases() to loop through connected databases, instead of configured databases. * Updated Network->get_ips() to only record the real MAC addresses on network interfaces (not bonds or bridges) in the "network::${host}::interface::${in_iface}::mac_address" hash. This should help avoid reboot loops caused by anvil-configure-host thinking the network needs to be reconfigured when it doesn't actually need to be. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	16c20ae69c	* Updated Tools->catch_sig() to use return code 0 instead of 255 so that systemd doesn't think our daemons failed on stop. * Updated Cluster->parse_cib() to not require a database connection (part of the work to make ocf:alteeve:server run without a DB) * WIP: Continuing work on the ocf:alteeve:server RA to run without database connections. * Updated the scancore daemon to explicitly check that all scan agent schemas are loaded in all databases on startup. This is to resolve resync issues on rebuilt strikers that may not yet have some schemas loaded when a DB resync runs. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	fc0954d0c8	* Started work on, but not at all finished, anvil-manage-server which will allow manipulation of a server's resources. * Changed the alteeve repo RPM to the new community/enterprise repo * Fixed a bug where 'fence_data::updated' was causing the fences web page to break. * Fixed a bug in Database->insert_or_update_network_interfaces() where certain interfaces were being repeatedly added to the database. * Fixed a bug in Database->_find_behind_databases() that was marking DBs as behind even though they had less than 10 columns off. * Fixed a bug in Get->host_name() where, if the host name was changed on disk but the environment variable was still the old name, it would cause the hostname to waffle back and forth and cause constant updates to /etc/hosts. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	4a87ee71db	* This commit started with work on webui endpoint set_power, but then switched to scancore debugging and I neglected to switch branches. * Created Cluster->check_stonith_config() that checks and, if needed, reconfigures a cluster's fencing (stonith) config. * Updated scan-cluster to call Cluster->check_stonith_config() at the end of each call. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	416f51323a	* Created tools/striker-boot-machine to, well, boot machines. It uses host_ipmi or, failing that, other fence methods when available to boot a node. * Created Cluster->get_fence_methods() that parses all fence methods out of a recorded CIB and stores them in a hash for a given host_uuid. * Fixed a bug in ScanCore->post_scan_analysis_striker() where the short_host_name was not being stored correctly. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	ca7052dd53	The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :) * Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively. * Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf. * Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state. * Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once fleshed out and tested. * Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times. * Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan. * Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	3a6902d899	* Made good progress on anvil-safe-stop. It will now stop or migrate servers (testing needed). * Updated Server->shutdown_virsh() to change the parameter 'wait' to 'wait_time' to clarify its use. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	711a04999e	* Finished anvil-migrate-server and anvil-safe-start! Lots of testing still needed for both though, and 'anvil-safe-start' doesn't run as a job yet, but the logic is all there. * Fixed a bug in Cluster->migrate_server() where waiting for the server to migrate would never exit. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	eec14cb013	* Finished tools/anvil-boot-server and tools/anvil-shutdown-server. * Fixed a bug where, in rare cases, $anvil->hostname() would call 'hostnamectl' and get a dbus error during shutdown, which would then cause the hostname to be changed to the error in the database. * Fixed a bug in Cluster->boot_server() where it would never verify that a server has started successfully. * Updated Database->get_ip_addresses() to store the IPs we manage in 'ip_addresses::<ip_address_address>::X'. * Updated ocf:alteeve:server to work from command line calls, though more testing is still needed. * Started work on 'anvil-rename-server', but haven't gotten far with it yet. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	a480357049	* Fixed a bug in Cluster->assemble_storage_groups() where, if a group is created during an anvil-provision-server run, the group would get created multiple times. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	b36093671b	* Updated Database queries that were passing 'debug => $debug' to not do that, as it was causing far too much (useless) noise in the logs. * Turned on print to console for logging in anvil-provision-server. Also updated it to check if the cluster is running and hold until it is. * Cleaned up some code in Get->available_resources() that proved hard to debug. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	e036515df3	* Got anvil-safe-start to the point where it starts the cluster stack. Need to create the 'anvil-boot-server' and 'anvil-shutdown-server' before it can be completed, so those files have been added. * Created Cluster->parse_quorum() to check if a node is quorate as 'have-quorum' in the pacemaker CIB doesn't appear to be super accurate during startup. * Fixed a bug in striker-manage-install-target where if a node didn't have any registered IPs, it would break before generating the repo data. * Fixed a bug in anvil-join-anvil where if the database had to be reconnected, the job data was lost. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	fb0836f912	* The get_cpu endpoint was completed. * The get_memory endpoint was completed. * The get_replicated_storage endpoint was completed, though it requires testing and likely has issues. To prepare for the get_status endpoint work, I needed to update ScanCore and modules to track the host_status. This commit contains the work needed for this. * Updated ScanCore->post_scan_analysis_striker() to use configured fence devices (except PDUs) to check if a target host is off or on, if there is no host_ipmi interface. In all cases, if a machine can be confirmed on or off, the host_status is now updated. * To support the above fence based power checks, updated scan-cluster to store the on-disk CIB in the new scan_cluster -> scan_cluster_cib column. * Updated ScanCore->parse_cib() to map stonith primitive IDs to fence agents. Updated ->parse_crm_mon() to not call if the executable doesn't exist to avoid unhelpful error messages in the logs when called from a Striker. * Updated DRBD->gather_data() to get the size data from '/sys/block/drbd<minor>/size' x '/sys/block/drbd<minor>/queue/logical_block_size' so it works when a device is Secondary (and can't be promoted). * Updated Database->get_hosts_info() to record the short host name as well as the stored host name. Created ->update_host_status() as a wrapper to ->insert_or_update_hosts() that only updates the host status. * Updated anvil-join-anvil to disable ksm and ksmtuned daemons. * Updated scancore and anvil-daemon to set the host_status to 'online' on startup. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	5536e8ff47	* Updated Cluster->assemble_storage_groups() and Cluster->anvil_name_from_uuid() and ->available_resources() to try to detect the anvil_uuid if not passed in. * Updated Database->insert_or_update_storage_group_members() to use the host_uuid when trying to find existing members. * Added the skeleton of a bunch of new json endpoints for the new UI features. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	0ec1bf6b6a	* Updated DRBD->delete_resource() to return a success if asked to delete a non-existent resource (as can happen when partial anvil-delete-server runs are re-run). * Reworked DRBD->get_next_resource() to pull from the database, and to no longer do that increments-of-three nonsense. Avoidable complexity. Also added a call to Cluster->get_anvil_uuid() if the 'anvil_uuid' parameter wasn't passed. * Updated Database->get_host_from_uuid() and ->get_hosts() to now take 'include_deleted' parameter and default to not returning deleted hosts. This fixed issues where anvil-{delete,provision}-server calls could assign jobs to now-deleted hosts with reused host names. * Updated anvil-delete-server to print log entries to STDOUT. Also updated it to not wait for the shutdown of a server in pacemaker to complete, and instead to destroy it after calling pacemaker's resource stop. Updated to also check to see if the server being deleted is already out of pacemaker and, if so, skip that step and directly try to destroy the server, if it's running. * Updated anvil-provision-server to force 'peer_mode' runs to pull their TCP Port and DRBD minor numbers from the job. This fixes a bug where the same resource on two machines could use different TCP ports. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	4b9ec56106	* Updated DRBD->delete_resource() to return a success if asked to delete a non-existent resource (as can happen when partial anvil-delete-server runs are re-run). * Reworked DRBD->get_next_resource() to pull from the database, and to no longer do that increments-of-three nonsense. Avoidable complexity. Also added a call to Cluster->get_anvil_uuid() if the 'anvil_uuid' parameter wasn't passed. * Updated Database->get_host_from_uuid() and ->get_hosts() to now take 'include_deleted' parameter and default to not returning deleted hosts. This fixed issues where anvil-{delete,provision}-server calls could assign jobs to now-deleted hosts with reused host names. * Updated anvil-delete-server to print log entries to STDOUT. Also updated it to not wait for the shutdown of a server in pacemaker to complete, and instead to destroy it after calling pacemaker's resource stop. Updated to also check to see if the server being deleted is already out of pacemaker and, if so, skip that step and directly try to destroy the server, if it's running. * Updated anvil-provision-server to force 'peer_mode' runs to pull their TCP Port and DRBD minor numbers from the job. This fixes a bug where the same resource on two machines could use different TCP ports. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago

78 Commits (f681f6f47a06c6f7d14615844ccfdd885c5590d9)