anvil

Commit Graph

Author	SHA1	Message	Date
digimer	4398ffe70c	Updated striker-boot-machine to support booting all machines. * Wrote the man page for striker-boot-machine, changing --host-name to --host, and adding the '--host all' support. * Updated anvil-manage-host to support checking/enabling/disabling network mapping mode. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	55b1380031	Finished (but need more testing) of Server->locate(). This includes the changes in PR#492. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	f12e001ac2	Finished Server->connect_to_virsh(). * Now, connecting to virsh can detect when still-open connections already exist. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	245f75de9b	Added Server->update_definition() * This takes a server and new definition XML and updates the database and any available hosts. Does not yet update defined or running servers. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	62fe62a44b	* Continued work on anvil-manage-server-system. It now displays the boot devices, CPU and RAM info. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	74ddb7f3a9	Updated Database->get_files() to detect/remove duplicate file entries. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	fcbace6713	Updated anvil-join-anvil to hold if either node is still running anvil-configure-host * Fixed a minor bug and added logging of maintenance_mode calls in anvil-configure-host. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	582a8b292c	Added more job updates to anvil-manage-power. * This is a test to see if the job waiting for the uptime to be 300s, leaving the job_progress as 0, was causing the job to be repeatedly called. * This is related to issue #479 Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	ef042eef25	Cleaned up logging while waiting for subnodes. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	5d5270486e	Added a wait loop when forming node clusters. * This adds a check where anvil-join-anvil waits until both subnodes are marked as configured and not in maintenance mode. * Should address issue #479 (maybe, this shouldn't trigger reboots, but it was certainly a race condition found while investigating). Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	c039c58128	* This commit moves taking screenshots of hosted servers onto the strikers using the Sys::Virt module. This was needed because the screenshots were being taken by scan-server, and that was causing it to take a long time to run. It should never have been handled by the scan agent anyway. This update requires a WebUI fix to use the new screenshot tool. This tool also adds holding multiple screenshots to allow users to "scrub" through screenshots up to 10 hours in the past. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	8925dabb9d	* Updated anvil-shutdown-server to take the new '--immediate' switch which forces a server to shut down immediately (akin to pulling the power on a traditional machine). This is needed to allow a user to recover a crash or hung server. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	580980717d	This commit covers the conversion of 'virsh' shell calls to use the 'Sys::Virt' module, and fixes several small bugs related to scan-server; * Switched all calls to virsh to use Sys::Virt to deal with contention of simultaneous virsh calls. * Removed collecting screenshots from scan-server. * Fixed a bad variable substitution in an alert. * Fixed a bug where a server's boot time wasn't being recorded properly. * Reworked how we determine which server definition was most recently updated and propagated. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	3c9086d1f3	Fixed bugs related to running jobs. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	e8a84e1c97	Added job handling to anvil-manage-server-storage (needs more testing though). Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	2f429d2bc7	Fixed bugs related to adding drives and extending drives to servers. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	e895e1f264	* Finished writing the anvil-manage-server-storage. * Fixed handling --eject and --insert to work without a device target specified when only one exists, or to find the file path when only the file name is given. * Updated anvil-manage-server-storage to show files when processing an optical device without a file being passed. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	17078347ee	Reworked anvil-manage-server-storage to use the translation system. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	02de75a6ab	* Improved log messaging to not log a potential boot failure when the local DRBD volume(s) are all UpToDate and the peer is offline. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	3ee30e6e24	* Updated DRBD->allow_two_primaries() to gracefully fail if the peer isn't connected. * Updated DRBD->manage_resource() to check if the host is StandAlone when asked to 'up' a resource and, if so, connect first. Also updated this to error out gracefully if the call to allow_two_primaries() returns non-zero. * Update Server->migrate_virsh() to error out gracefully if the DRBD->allow_two_primaries() returns non-zero. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	88af919142	* Fixed bugs in ocf:alteeve:server Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	6ee2ad75db	* Updated anvil-delete-server to actively check for and delete any drbd-fenced attributes left over in the CIB after a server is deleted. This addresses issue #374 . Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	be290bf561	This commit fixes a bug where the drbd kernel module build was being killed mid-compile, leaving DRBD unusable. * Created System->wait_on_dnf() which was plucked from anvil-daemon, and now also called in scancore and anvil-safe-start. * Updated scancore and anvil-safe-start to check on start that DRBD's kernel module is available (and build if not). Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	d68adb5b4e	* Updated anvil-manage-power to not reboot if anvil-version-changes is running (which, if it's taking time, is generating new kmods). Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	66c82e5e22	* Fixed a bug in anvil-update-system where updating a single package with --reboot wouldn't request a reboot. Finished reworking it so that a check is made to see if the kernel or DRBD kmod will be updated and, if so, removes the kmod-drbd RPMs prior to doing the update (as opposed to the sloppier check-on-error method). * Fixed a bug in System->reboot_needed() where the cache file path had a typo in the hash key. * Updated anvil-daemon to use the full path to dnf when determining if a dnf process was running. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	e278de4b5a	The main change in this commit deals with anvil-daemon startup. During OS updates, it would pick up the queued update job and run it while the other --no-db one was still running. This could become an issue for other tasks in the future, so updated anvil-daemon to not run any jobs for the first minute after startup. Also updated it to see if an OS update is underway (given how it can start mid-RPM update, before packages like kmod-drbd are ready to build). While doing this, implemented caching of daily tasks (like aging out data, archiving data, network scans, etc) to only run once per day, period. As it was before, they would always run on anvil-daemon startup, then wait 24 hours. Note that work has started on reworking anvil-update-system, but it is incomplete (and broken) in this commit. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	b0c54b6dae	* Updated anvil-update-system to check if another instance of anvil-update-system is running and, if so, exit. * Removed the new tasks from anvil-special-operations. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	7bd76c10dc	Major thing in this commit is reworking striker-update-cluster to work without expecting anvil-daemon to be running on target machines. Similarly, they had to be able to work when the Striker DBs were not available. This is to account for cases where the Striker dashboards have updated, and the schema has changed, preventing the not-yet-updated DR hosts and subnodes from being able to use the DB. To do this, anvil-safe-stop, anvil-update-system, and anvil-shutdown-server had to be updated to use the new --no-db switch, which tells them to run without the database being available. * Updated Server->shutdown_virsh() to work without a database connection. * Updated System->reboot_needed() to store/read from a cache file when the database is not available. * Updated anvil-safe-start to remove the old --enable/disable/status switches, now that we use anvil-safe-start.service systemd unit. * Reworked anvil-safe-stop to work without a database connection, and to work on DR hosts. * Updated anvil-special-operations to add new tasks, but it's likely these new tasks aren't needed and will be removed very shortly. * Added/updated multiple man pages. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	9bc78860a6	* Updated anvil-update-system to detect kmod-drbd upgrade problems and fix them. * Updated striker-update-cluster and anvil-update-system to take '--reboot' to request a reboot if any packages are updated. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	42b44ac864	* Updated the log showing why anvil-daemon isn't exiting when a job is running with the job's current progress. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	d741f4aa6f	* Updated anvil-daemon to not exit on high RAM use if any job is running. * Updated anvil-update-system to reboot a target whose kernel updated using an anvil-manage-power job. * Started making striker-update-cluster run as a job (not at all complete). Fixed a bug where the wrong IP was being used when finding access to a target. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	751687129a	* Updated anvil-daemon to not exit on RAM use if anvil-update-system is running. * Fixed a bug in anvil-safe-stop where it wouldn't trigger a migration when the peer is online. * Updated anvil-update-system to set job_data to 'failed' and exit with rc 4 if the os update failed. * Got striker-update-cluster to error out and exit if a called 'anvil-update-system' job failed. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	3016fb875b	* Reworded striker-update-cluster to use anvil-update-system for on-system OS updates. * Updated DRBD->get_status() to take the new 'host' parameter to allow the caller to define the hash key string used in the stored data. * Updated Get->anvil_version() (and a few other places) to use the new 'striker-ui-api' shell user, replacing the 'apache' user. * Updated Remote->test_access() to take the new 'close' parameter to close the SSH session used when testing access to the target. * Fixed a logging bug in anvil-manage-power. * Updated anvil-update-system to take the '--no-reboot' and 'clear-cache' command line switches. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	1b8b0bc493	* Created the new 'anvil-manage-server-storage' with the first role of reloading a DRBD resource. * Updated Remote->call() to remove the 'background' parameter as it wasn't working. * Updated anvil-manage-server-storage to use 'anvil-manage-server-storage' to adjust resources in a way that doesn't block. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	ea95d26cc5	* Fixed a bug in DRBD->get_next_resource() where reserved minor numbers were not being released. Also added a new parameter, "minor_only", that returns the next minor number but doesn't bother processing TCP ports. * Did more work on adding support for adding new disk drives to servers in anvil-manage-server-storage. * Updated anvil-manage-storage-groups to check for / delete duplicate storage groups with the same name. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	88cc76914d	This is an attempt to fix issue #341 . It replaces the search for SN IPs from Network->find_matches() to Network->find_access(), the latter of which doesn't care about the interface the IP was found on. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	c9e11fbbfc	* Added checks to anvil-provision-server to fail out if either of the SN IPs are not found when generating a DRBD resource config. * Added logging to anvil-provision-server and anvil-daemon to try to find the cause of jobs being re-run after completing. May have fixed with a fix to job_progress updates going to 100 too early in some cases. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	156a0ca201	Updated anvil-daemon's new job launching logic to allow the restart of a running job that failed out early. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	47f7a35df3	The main purpose of this commit is to add serial execution of similar jobs to help reduce race conditions for scripted jobs, like multiple server creation. * Fixed a small logging bug in DRBD->allow_two_primaries(). * Updated Database->get_jobs() to record jobs sorted by modified_date so that jobs can be run in the order they were recorded. * Updated anvil-daemon to track which commands need to be run, and when two or more of the same command need to be run, they're run serially, with each subsequent run starting after the previous one completes. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	b6a249d5e7	* Updated Cluster->add_server() to set the preferred host based first on if the server is running on a node, and if not, on the primary node (where before it defaulted to node 1). * Updated DRBD->delete_resource() to call scan-drbd and scan-lvm to ensure that the database is updated with the newly freed resources. * Updated anvil-delete-server and anvil-provision-server to call select scan agents to ensure freed resources are immediately recorded. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	b7abc481e6	Updated scan-cluster to check to see that migrate_to and migrate_from are given a timeout of 600s and an on-fail of "block". Updated Cluster->add_server() to set migrate_from to timeout=600s and on-fail=block as well. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	c82bd9d73a	* Created the new anvil-watch-power tool that shows the status of UPSes known on the system, including their "on battery" state, charge percentage, estimated hold up time, etc. * Updated Database->get_power() and ->get_upses() to store both the time stamp and unix time stamps. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	0e57836c8f	This commit addresses (hopefully) issue #329 . * Updated DRBD->get_status() to attempt to recompile the drbd kernel module if the drbdsetup status fails. If it continues to fail, it exits gracefully now. * Updated ocf:alteeve:server to test access over a given IP before calling Server->find to avoid timeouts when the peer is down. Also updated it to set the constraints to keep the server on the new host when the old host returns to the cluster. * Fixed a bug in scan-cluster where a server that is FAILED but not running is now properly recovered. Signed-off-by: digimer <mkelly@alteeve.ca>	1 year ago
digimer	110dceb55e	* Added a check to make sure files were ready before provisioning a server. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	895f1ec262	This fixes a race condition when multiple servers are provisioned at (nearly) the same time. * In DRBD->get_next_resource(), implemented a "hold" system where the DRBD minor and TCP port(s) returned are marked as being held for one minute. So subsequent calls won't use the same numbers. * In anvil-daemon, added a check in run_jobs() where only one instance of a given job command will be started per 2-second loop. This should help reduce the chance of simultaneous race conditions in general. * Removed from anvil-provision-server and most other tools the call to Job->get_job_uuid(). If the program is called without the job_uuid, don't try to find it. This allows a human (or script) to make repeated calls to a program without one of those calls running a pending job instead. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	0874ad571a	Updated anvil-safe-start to not give up on starting corosync/pacemaker if it fails on the first try. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	83a527f4fa	* Removed enabling anvil-safe-start out of the RPM and into anvil-join-anvil. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	89eae7098e	NOTE: This updates the reserved RAM to 8 GiB from 4 GiB! * Adds support for 'anvil_resources::ram::reserved' that can be set to a number of MiB to override the default 8192. * Adds support for 'anvil::<anvil_uuid>::resources::ram::reserved' to allow for per-Anvil! node override on the reserved RAM default, and over the 'anvil_resources::ram::reserved' option. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	f9689a7106	Updated ocf:alteeve:server to look for '/tmp/<resource>.fail' and, if that file exists, exit with rc:1. This is done to allow for testing. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago
digimer	cf73d8ed36	* Updated System->configure_ipmi() to auto-configure DR hosts once they've been assigned a BCN IP address. Signed-off-by: digimer <mkelly@alteeve.ca>	2 years ago


532 Commits (78b75c649a46898f438759d2c91944aba66730f2)