anvil

Commit Graph

Author	SHA1	Message	Date
digimer	0014cc591d	Re-enabled DB connections in ocf:alteeve:server. Added DB connections to ocf:alteeve:server when starting or stopping servers. This is to ensure that the servers -> server_state are updated properly. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	829ae546a2	Beginning work on new Server->locate() method to find servers across an Anvil! cluster. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	02de75a6ab	* Improved log messaging to not log a potential boot failure when the local DRBD volume(s) are all UpToDate and the peer is offline. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	3ee30e6e24	* Updated DRBD->allow_two_primaries() to gracefully fail if the peer isn't connected. * Updated DRBD->manage_resource() to check if the host is StandAlone when asked to 'up' a resource and, if so, connect first. Also updated this to error out gracefully if the call to allow_two_primaries() returns non-zero. * Update Server->migrate_virsh() to error out gracefully if the DRBD->allow_two_primaries() returns non-zero. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	88af919142	* Fixed bugs in ocf:alteeve:server Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	59ade94124	* Added PID logging as an option, and enabled it in ocf:alteeve:server * Updated DRBD->manage_resource() to take the task 'adjust'. * Updated ocf:alteeve:server's start_drbd_resource() to call adjust if startup of a resource isn't needed. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
Fabio M. Di Nitto	fc75bda6ef	ocf:alteeve:server: add support for log levels and bump timeouts also improve logging for migrations Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>	1 year ago
Fabio M. Di Nitto	f71b8dabf0	ocf:alteeve:server: fix return code to match ocf standards Resolves: https://github.com/ClusterLabs/anvil/issues/392 Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>	1 year ago
Fabio M. Di Nitto	1b4ac8ab56	ocf:alteeve:server: fix typo in the description Resolves: https://github.com/ClusterLabs/anvil/issues/390 Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>	1 year ago
digimer	dda0fbd7d5	* Updated DRBD->allow_two_primaries() to be more careful at evaluating peer-node-id. * Updated DRBD->manage_resource() to set allow-two-primaries=no when up'ing a resource (as no migration can be in progress during an up command). * Updated scan-drbd to look for StandAlone resources and call DRBD->manage_resource({task = 'up'}) if a connection to a peer node is StandAlone or if the local disk state is detached. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	bc3d04ad2e	* Updated Cluster->add_server() to wait up to 15 seconds for a server to appear to ensure that the pcs call to add the server with the right requested running state. * Updated Cluster->recover_server() to set the desired recovery state before calling the crm_resource refresh. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	0e57836c8f	This commit addresses (hopefully) issue #329 . * Updated DRBD->get_status() to attempt to recompile the drbd kernel module if the drbdsetup status fails. If it continues to fail, it exits gracefully now. * Updated ocf:alteeve:server to test access over a given IP before calling Server->find to avoid timeouts when the peer is down. Also updated it to set the constraints to keep the server on the new host when the old host returns to the cluster. * Fixed a bug in scan-cluster where a server that is FAILED but not running is now properly recovered. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	284a2957d6	* Fixes issue #329 ; When multiple attributes exist when checking if we're in maintenance mode in fence_pacemaker, the expected hash reference was actually an array reference. * Fixed a bug in anvil-version-changes where update_file_location_ready() needed to be called before update_file_locations(). * Added a bit more logging for future debugging. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	f9689a7106	Updated ocf:alteeve:server to look for /tmp/<resource>.fail' and, if that file exists, exits with rc:1. This is done to allow for testing. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	fea10e5bb1	* Prefixed all 'virsh' calls with 'setsid --wait' to help prevent future hangs if the call happens without a shell. * Updated anvil-manage-server-storage to the point where it can now insert and eject optical disks! * Updated System->call to log parameters if 'shell_call' isn't set. * Fixed a bug in anvil-manage-server process_interactive where an $anvil->data reference was being scoped. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	c5fbf20615	* This inverts the --live logic on migrations in Server->migrate_virsh() to default to live. * Adds a "sensitive" DB connection to ocf:alteeve:server when migrating a VM. This is needed so that migrations can be done cold or live, based on servers -> server_live_migration. This resolves issue #284. Signed-off-by: digimer <digimer@gravitar.alteeve.com>	2 years ago
digimer	dfa93a1837	* Added 'setsid' to all 'virsh' calls as nested calls (ie: crm_resource -> ocf:alteeve:server -> virsh) would fail because virsh couldn't connect to a terminal. See: ** https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background * Added explicit setting of $ENV{PATH} when it's null (as it is when pacemaker calls our tools). * Updated the copyright to 2023. Signed-off-by: digimer <digimer@gravitar.alteeve.com>	2 years ago
digimer	b666caec64	* Updated anvil-provision-server to handle startup when the peer doesn't create/connect its DRBD resource (ie: node is offline). Signed-off-by: digimer <digimer@gravitar.alteeve.com>	2 years ago
Digimer	bce9e2caaf	This is the first attempt at enabling firewalld completely. There is a decent chance that problems exist, so it won't be a surprise if a few more commits are needed to this branch before things work. * Added multiple new private methods to Network that help in managing the firewall. * Updated Server->boot_server to manage the firewall after the server boots. Updated ->migrate_server to create a job, if a database connection exists, for the migration target to update its firewall as soon after the server appears as possible. * Updated ocf:alteeve:server to manage the firewall when called post-migration, in case there was no DB connection and the job above didn't run. Fixed a bug where the disk state wasn't being evaluated properly. * Updated scan-server to check that the firewall is managed when a server state has changed. * Updated anvil-daemon to run Network->manage_firewall on startup. * Heavily reworked 'anvil-manage-server' to either just run 'Network->manage_firewall', or if passed '--server X', to wait for the server to appear for up to 1 minute, then to check that the firewall is managed (to capture servers being migrated to the host.) * Removed firewall management from striker-prep-database. Signed-off-by: Digimer <digimer@alteeve.ca>	2 years ago
Digimer	1b70b49cf8	* Updated Network->find_matches() to try to populate the first and second parameters if they're not passed in. * Updated Network->load_ips() to load extra information about the interfaces. * Updated ocf:alteeve:server to not check libvirtd daemon state on server start. * Updated scan-hardware to check for duplicate entries and purge if found. * Updated scan-network to check for the 'default' virbr0 interface by checking if the config file exists instead of calling virsh. * Updated scan-server to have better logging. * Created the new (and incomplete) anvil-test-alerts tool * Updated scancore to support --purge to pass to all agents and then exit. * Updated ScanCore->call_scan_agents() to no longer use 'timeout' as it was causing issues with virsh calls. Signed-off-by: Digimer <digimer@alteeve.ca>	2 years ago
Digimer	1dbca79dde	* Created Network->get_ip_from_mac() which takes a MAC address and returns an IP address. * Updated ocf:alteeve:server to always try to bring up the peer's DRBD resource, even when the local resource is up. * Fixed a bug in scan-network where purging duplicate bridges failed in some cases. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	7023ffb56b	Further improved startup DRBD logic in ocf:alteeve:server. Specifically, it will startup if a local resource/volume is sync'ing. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	bc39c3fe5c	Updated ocf:alteeve:server to better handle multi-peer DRBD configurations. Cleaned up some logging in DRBD->get_status. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	e62e5d7b0c	* Updated ocf:alteeve:server to better handle starting up DRBD resources before trying to boot a VM. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	3c0435a455	* Updated ocf:alteeve:server to better handle starting up DRBD resources before trying to boot a VM. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	e6d7ac7038	Fixed a bug in ocf:alteeve:server's new migration network support. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	0fc394b294	Updated ocf:alteeve:server to see if the target for a migration has a '<shortname>.mn1' host name, and if so, and if the target can be reached on that address, it will be used for the live migration. This is to allow for inexpensive 10 Gbps live migration speeds. Removed the stub Server->provision method that was never used. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	607c097fc8	* Fixed a bug where, once a DRBD resource was allowed to be dual-primary for migration, that wasn't properly disabled post-migration. * Updated DRBD->allow_two_primaries() to take the 'set_to' parameter which can be 'yes' to allow and 'no' to disallow dual-primary. * Updated ocf:alteeve:server to call allow_two_primaries() with 'set_to' = 'no' instead of calling 'adjust' after a migration completes. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	5b4bfa747c	* Reworked the anvil-join-anvil job parsing to help diagnose occasional faults. Also changed a fatal parse error to one that allows the run to be retried. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	96fffb0b96	* Finished updating ocf:alteeve:server to no longer require a database connection. To do this, and still be able to track live migration times, the Server->migrate_virsh() method now writes out the server name and migration time to a /tmp/anvil/migration-duration.<server_name>.<unix_time> file. This file is checked for by the scan-server resource agent and, when found, is parsed and the migration duration is recorded, then the file is purged. * Updated anvil-daemon to have a new function called "handle_special_cases" called during startup that does any weird bug mitigation required. For now, this is used to mitigate against rhbz#1961562, though certainly it will be used for other reasons later. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	16c20ae69c	* Updated Tools->catch_sig() to use return code 0 instead of 255 so that systemd doesn't think our daemons failed on stop. * Updated Cluster->parse_cib() to not require a database connection (part of the work to make ocf:alteeve:server run without a DB) * WIP: Continuing work on the ocf:alteeve:server RA to run without database connections. * Updated the scancore daemon to explicitly check that all scan agent schemas are loaded in all databases on startup. This is to resolve resync issues on rebuilt strikers that may not yet have some schemas loaded when a DB resync runs. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	24ec17f8f7	* Added a new parameter called 'sensitive' to Database->connect() that returns after connections before any ancillary checks are done, minimizing connect time. * Fixed a problem with Database->insert_or_update_variables() where variable_source_uuid being set to an empty string wasn't converted to NULL. * Fixed Database->locking() where the way the lock variable was set was rather broken. * Created Striker->check_httpd_conf() which configured apache to handle the integration of the new WebUI for Anvil! management with the existing WebUI. * Updated System->update_hosts() to specifically set the 127.0.0.1 and ::1 lines to handle how cloud-init overrides /etc/hosts and breaks CI/CD tests. * Removed the old index.html as it's now used for the new WebUI. * Began work on removing DB connection requirements from ocf:alteeve:server. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	7abbc938af	* Renamed tools/striker-purge-host to tools/striker-purge-target and moved the code from test.pl over to it. No longer provides interactive selection, but now does work with Anvil! systems as well as hosts. * Fixed a bug in Database->get_tables_from_schema where history.X and X tables were being stored in the table list. * Updated ocf:alteeve:server to not do resyncs on DB connect. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	4a87ee71db	* This commit started with work on webui endpoint set_power, but then switched to scancore debugging and I neglected to switch branches. * Created Cluster->check_stonith_config() that checks and, if needed, reconfigures a cluster's fencing (stonith) config. * Updated scan-cluster to call Cluster->check_stonith_config() at the end of each call. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	3a6902d899	* Made good progress on anvil-safe-stop. It will now stop or migrate servers (testing needed). * Updated Server->shutdown_virsh() to change the parameter 'wait' to 'wait_time' to clarify its use. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	2e37691116	* Updated DRBD->gather_data() to store data on peers so that the peer's LV path and backing disk is recorded. Also fixed a bug in ->get_status() where the return code for local calls was stored as a host name. * Added the scan-hpacucli scan agent. It's been done for a while and should have been added ages ago. * Updated anvil-rename-server to get to the point where it will take down the DRBD resources on all machines, but waits if there is a sync under way. It also verifies that the server is off on all systems from virsh's perspective. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	711a04999e	* Finished anvil-migrate-server and anvil-safe-start! Lots of testing still needed for both though, and 'anvil-safe-start' doesn't run as a job yet, but the logic is all there. * Fixed a bug in Cluster->migrate_server() where waiting for the server to migrate would never exit. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	eec14cb013	* Finished tools/anvil-boot-server and tools/anvil-shutdown-server. * Fixed a bug where, in rare cases, $anvil->hostname() would call 'hostnamectl' and get a dbus error during shutdown, which would then cause the hostname to be changed to the error in the database. * Fixed a bug in Cluster->boot_server() where it would never verify that a server has started successfully. * Updated Database->get_ip_addresses() to store the IPs we manage in 'ip_addresses::<ip_address_address>::X'. * Updated ocf:alteeve:server to work from command line calls, though more testing is still needed. * Started work on 'anvil-rename-server', but haven't gotten far with it yet. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Fabio M. Di Nitto	8f9892650b	[build] first pass at adding a build system to integrate with CI Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>	4 years ago
Digimer	413a4f73c2	* Updated Tools->_anvil_version() and Get->anvil_version() to now pick up a SchemaVersion from anvil.sql. This will change only when the schema changes and is used when Database->connect() is checking compatibility with other anvil database hosts. This will make it only break connection when there is a reason to do so. The anvil_version still remains as an informational version that will help when supporting users later. * Updated Cluster->add_server() to now set failure timeouts to actual numbers instead of INFINITY after discovering that INFINITY doesn't work in those cases. * Updated Database->get_hosts to now check if other entries have the same host name, and if so, to set their host_key to 'DELETED'. This should make it easier to handle when a hardware machine is replaced by new hardware but uses the same host_name. * Updated Email->check_queue() to start and enable postfix.service if it's found to not be running. * Updated Get->available_resources() to return '!!no_data!!' when a given host hasn't got any data in scan_lvm_vgs. Now use this in anvil-provision-server to exit if a node or dr host hasn't run scancore yet. * Fixed a bug in scan-lvm where the pvs_uuid wasn't being loaded properly, preventing lost PVs, VGs and LVs from being flagged as deleted. * Started work on anvil-migrate-server, though it's far from complete. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	549dbad635	* Created Cluster->delete_server(), which deletes a server resource from pacemaker (stopping it first, if needed). * Fixed a bug in Cluster->parse_cib() when a server that is off wasn't setting 'status'. * Renamed 'server::location::<server>::host' to '...::host_name' in several places. * Got more work done on anvil-delete-server, up to the point where it calls the new Cluster->delete_server() method. * Updated fence_pacemaker to call 'drbdadm adjust all' to dampen an issue where in-memory fence configs seem to change, preventing reconnection of the peer after it reboots from the fence. More testing needed on this issue. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	05b1fccdb3	* Created Cluster->add_server() which, well, adds a server to a pacemaker cluster, including sorting out location constraints to favour the node the server is running on, if it's running. * Removed the exit-if-no-DB check in ocf:alteeve:server so that (hopefully, needs testing), running servers won't be impacted if the nodes lost contact with both/all strikers. * Updated scan-server to make an explicit check for missing XML definition files on startup and write them if needed. * Very beginning work on anvil-delete-server has been started. * Updated anvil-provision-server to wait when it's running in peer mode until the new XML definition is in the DB and then write it out to disk before exiting. Also updated it to add the new server to pacemaker before exiting. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	1d03a386d3	* Created Database->get_bridges() that, surprise, loads data from the 'bridges' table. * Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers. * Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file. * Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors. * Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums. * When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory. * Moved the 'scancore' daemon's 'load_agent_strings' to 'Words' * Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly. * Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed. * Fixed a bug in scan-storcli where a while loop was broken, preventing execution. * Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	d677d19ca0	* Moved Database->check_condition_age to Alert. * Created (but not finished) scan-apc-pdu * Added support for tracking maintenance-mode for nodes in Cluster->parse_cib * Created Remote->read_snmp_oid(). * Created Server->get_definition. * Updated Server->get_status() to write-out server XML files on-demand. * Finished scan-cluster. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	33101f969a	* Fixed several bugs related to tracking server boots, migrations and shut downs in the anvil database. The 'ocf:alteeve:server' now has (mostly?) safe integration with the Anvil! database. This was mostly done by updating Servers->boot_virsh(), ->shutdown_virsh() and ->migrate_server(). * Updates servers -> server_host_uuid to drop the 'NOT NULL' constraint. * Created the new Get->server_uuid_from_name() that does what it says on the tin. Fixed a bug in ->host_uuid_from_name() where the host name was being returned instead of the UUID. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	be88be6d30	* Did a bunch of testing / bugfixes for scan-server. * Updated the servers table to remove the 'not null' constraints on the server_start_after_server_uuid, server_pre_migration_file_uuid and server_post_migration_file_uuid columns. * Updated ScanCore->agent_startup() to connect to the database(s) when there isn't a table list. * Updated Server->migrate_virsh() to test for DB access before making DB calls (to allow ocf:alteeve:server to function even if all ScanCore DBs are offline). * Updated ocf:alteeve:server to connect to the databases (though work without it), and changed '$FILE_NAME' to be 'ocf:alteeve:server' (to make logging more legible) * Created the skeleton for 'scan-storage'. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	46f1a05789	* Got the code in scan-server to the point where it _should_ now gracefully and automatically detect changes to a server's definition originating from the database (via Striker), directly editing the on-disk definition file, or editing via libvirt tools (like virt-manager). Still needs to be tested though. * Updated Server->migrate_virsh() to set 'servers' -> 'server_state' to 'migrating' and clear it again once the migration completes. Also added support for cold (frozen) versus live migrations. * Updated Cluster->parse_cib() to check if a server with the server_state set to 'migrating' isn't actually migrating anymore and, if not, to clear that state. This is needed as scan-server will blindly ignore/skip any migrating server, and if a migration call is interrupted, the state could get stuck. * Updated the 'servers' database table (and associated Database methods) to add columns for; server_ram_in_use - tracking RAM used by a running server server_configured_ram - RAM allocated to a running server (used with the above to alert a user and track _currently_ available RAM) server_updated_by_user - To be set by Striker tools to indicate when the user made a change that needs to push out to nodes / running server. server_boot_time - Tracks the unixtime when the server booted (to track uptime even if the server migrates across nodes). * Created Get->anvil_name_from_uuid() to easily convert an Anvil! UUID into a name. Also created ->host_uuid_from_name() to translate a host name into a host UUID. * Created Server->get_runtime() that translates a server name into a process ID and then uses that to determine how long (in seconds) it has been running. This is used when a server transitions from 'shut off' to 'running' to determine exactly when the server booted (current time - runtime). * Renamed all 'Server->parse_definition' calls that used 'from_memory' to 'from_virsh' to clarify the data source. * Made scan-hardware smarter about RAM change alerts. * Updated scancore to load agent strings on startup so that processing pending alerts works properly. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	4dfe0cb5a0	* Created Cluster->boot_server, ->shutdown_server and ->migrate_server methods that handle booting, migrating and shutting down servers. Also created the private method ->_set_server_constraint which is used by migrate and boot to set resource constraints to control where a server boots or migrates to. * Did more work on parsing server data out of the CIB. There is still an issue with determining which node currently hosts a resource, however. * Renamed Server->boot to ->boot_virsh, ->shutdown to ->shutdown_virsh and ->migrate to ->migrate_virsh to clarify that these methods work on the raw virsh calls, outside of pacemaker (indeed, they are what the pacemaker RA uses to do what pacemaker asks). * Got more work done on the scan-cluster SA. * Created the empty files for the pending scan-server SA. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	0f7267eae1	* Moved the '_host_name', '_short_host_name', and '_domain_name' private methods in Tools.pm over to Get.pm (removing the leading '_' in the method names). * Created 'Cluster->which_node' that returns 'node1' or 'node2' to indicate which node a host is. * Continued working on scan_cluster; decided to make it not host-dependent. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	b2c7fd95fb	* Renamed the ScanCore unit file to scancore. * Added support for parsing location constraints to Cluster->parse_cib Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago

103 Commits (65f7b020e3674629b3edd686eea09a2bf5129a43)