Commit Graph

215 Commits

Author SHA1 Message Date
digimer
fea10e5bb1 * Prefixed all 'virsh' calls with 'setsid --wait' to help prevent future hangs if the call happens without a shell.
* Updated anvil-manage-server-storage to the point where it can now insert and eject optical disks!
* Updated System->call to log parameters if 'shell_call' isn't set.
* Fixed a bug in anvil-manage-server process_interactive where an $anvil->data reference was being scoped.

Signed-off-by: digimer <mkelly@alteeve.ca>
2023-03-03 14:42:28 -05:00
digimer
98c3868870 * Updated fence_pacemaker to no longer use stonith_admin and instead use pcs. This should resolve the main part of issue #279
Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-18 00:22:03 -05:00
digimer
a3988cc3e5 * Added System->configure_logind() to ensure that nodes are configured to ignore ACPI power button events so that IPMI-based fences work immediately.
* Added call to System->configure_logind() to anvil-join-anvil and anvil-version-changes.
* Updated fence_pacemaker to add '--reboot' to the 'stonith_admin' call to ensure DRBD-triggered fence requests reboot instead of just turning nodes off.
This commit address issue #279.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-13 21:42:10 -05:00
digimer
dfa93a1837 * Added 'setsid' to all 'virsh' calls as nested calls (ie: crm_resource -> ocf:alteeve:server -> virsh) would fail because virsh couldn't connect to a terminal. See:
** https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background
* Added explicity setting of $ENV{PATH} when it's null (as it is when pacemaker calls our tools).
* Updated the copyright to 2023.

Signed-off-by: digimer <digimer@gravitar.alteeve.com>
2023-01-12 21:52:26 -05:00
Digimer
6d59399c73 * Updated the short OS list.
* Created Get->virsh_list_net() and Get->virsh_list_os() that call and parse osinfo-query directly to create lists of supported network interfaces and OS optimization options used when provisioning VMs. The later of which is used to replace the old language list of OSes, which was clunky and prone to missing valid options.
* Updated Get->available_resources() to remove the old anvil_dr1_host_uuid mechanism of finding and referencing DR resources.
* Started adding --network support to anvil-provision-server to allow users to specify a specific network bridge, MAC address and model to use for a new VM.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-24 10:08:06 -05:00
Digimer
9194eb3d09 * Updated System->check_if_configured() to record that a host is configured in /etc/anvil to make the system auto-mark as configured if the host is removed from the DB (or, more specifically, variables -> system::configured is lost).
* Updated Database->get_anvils() to record dr_links to reference DR hosts to Anvil! systems.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-15 19:28:00 -05:00
Digimer
f9ca6fb170 * This adds the new anvil-version-change tool which anvil-daemon will call on startup to handle checks for changes made over releases/updates.
* Added the new 'dr_link_note" column to the dr_links tables so that links can be marked as DELETED.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-12-13 17:36:43 -05:00
Digimer
561fa1a9ec
Merge branch 'main' into anvil-tools-dev 2022-12-13 01:21:40 -05:00
Digimer
4fa8d7a446 * This completes the rework of DRBD triggered fencing to use / clear location constraints instead of triggering a power fence.
* Added the new unfence_pacemaker DRBD unfence handler.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-30 16:13:38 -05:00
Digimer
4ba1982183 This is the start of a set of changes needed to rework how we handle DRBD fence requests, so that they create location constraints instead of triggering a full stonith fence.
* In Cluster->parse_cib(), added parsers for node attributes and resource rules. Also stored the existence of and details of each under the server resources for easier referencing.
* Updated scan-server to check for / add DRBD fence rules as needed.

Scancore APC agent bugs;
* For clarity, converted all '#!no_value!#' and '#!no_connection!#' to use '!!' instead in APC scan agents.
* Fixed a bug to set/clear alerts related to phases disappearing to deal with concurrent logins from different hosts triggering false phase loss alerts.
* Fixed missing variables not being passed to alerts/log entries.

Started more work on anvil-manage-server, but on hold again while the DRBD fencing work is completed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-29 22:17:12 -05:00
Tsu-ba-me
e5fc75f306 fix(tools): fetch and send server screenshot from node to striker that made the request 2022-11-28 14:37:18 -05:00
Tsu-ba-me
e14b1fc93e fix(tools): use absolute paths in anvil-get-server-screenshot 2022-11-28 14:37:18 -05:00
Digimer
35cf0c37fb * Updated System->check_ram_use() to set the maximum RAM based on the host type, and set those values in _set_default() so that the user can override if they want.
* Got anvil-manage-alerts to the point where you can add, edit and delete mail servers.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-11-14 17:17:30 -05:00
Digimer
2fab7bc1b7 This adds support (testing needed) for "Long-Throw" DR; which is a wrapper for using 'drbd-proxy' to provide larger transmit buffers so slow/high-latency DR hosts.
* Created DRBD->check_proxy_license() to do (some level of) sanity checks on the DRBD proxy license file.
* Updated DRBD->gather_data() to parse out the inside and outside ports for resource configs using proxy.
* Reworked DRBD->get_next_resource() to return 1, 3 or 7 TCP ports depending, with the new long_throw_ports parameter triggering the 7 ports.
* Added 'tcpdump' to the anvil-core requires list.
* Reworked scan-drbd to record the ports used in proxy configs. This required adding a check to change the 'scan_drbd_peer_tcp_port' column type to 'text' to support CSVs.
* Reworked anvil-manage-dr (needs testing!) to support "long-throw" DR configs.
* Updated anvil-safe-stop to check if the nodes are in the cluster before trying to migrate.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-09-21 23:35:06 -04:00
Digimer
b8bb7cc423 * Changed the default trigger of live migrations to require a health score difference of 2 or higher. This can be user-adjusted using the new 'feature::scancore::threshold::preventative-live-migration' anvil.conf option.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-25 12:43:51 -04:00
Digimer
4ecc6097d3 * Cleaned up some old 'die' calls with better nice_exit() calls to help avoid dangling db_in_use flags.
* Reworked Network->bridge_info() to use 'ip' to get the list of bridges, and 'bridge' to find interfaces connected to the bridge.
* Added 'test' messages to Words->string().
* Fixed a bug in scan-lvm where mdadm based PVs didn't read the sector size properly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-12 16:32:20 -04:00
Digimer
e7cf8ac789 * Got more work done on anvil-manage-files. It now picks up new files on nodes/dr hosts in an Anvil! and downloads them if needed.
* Updated anvil-daemon to call anvil-manage-files on a per-minute basis to handle files added outside of the WebUI.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-08-09 00:08:19 -04:00
Digimer
508e278359 Added the new 'anvil-network-profiler' tool.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-19 21:16:46 -04:00
Digimer
bce9e2caaf This is the first attempt at enabling firewalld completely. There is a decent chance that problems exist, so it won't be a surprise if a few more commits are needed to this branch before things work.
* Added multiple new private methods to Network that help in managing the firewall.
* Updated Server->boot_server to manage the firewall after the server boots. Updated ->migrate_server to create a job, if a database connection exists, for the migration target to update it's firewall as soon after the server appears as possible.
* Updated ocf:server:alteeve to manage the firewall when called post-migration, in case there was no DB connection and the job above didn't run. Fixed a bug where the disk state wasn't being evaluated properly.
* Updated scan-server to check that the firewall is managed when a server state has changed.
* Updated anvil-daemon to run Network->manage_firewall on startup.
* Heavily reworked 'anvil-manage-server' to either just run 'Network->manage_firewall', or if passed '--server X', to wait for the server to appear for up to 1 minute, then to check that the firewall is managed (to capture servers being migrated to the host.)
* Removed firewall management from striker-prep-database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-02 17:06:04 -04:00
Digimer
b2ea4f9adc * Moved System->manage_firewall() to Network->manage_firewall(). Started working on actually implementing it, which involves basically fully rewritting it.
* Updated tools/Makefile.am and scancore-agents/Makefile.am to add missing files.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-30 00:01:50 -04:00
Digimer
3fd0db15bf * This rather heavily reworks how database shutdowns works. It adds much more intelligent shutdown, tracking who is using the database, being able to mark a database as "offline" and waiting for users of the database to disconnect before it shuts down.
* Also removed the variables for the database name and DB user name, setting them statically now.
* Created Database->shutdown() to more kindly stop a local database server.
* Added 'check_db_in_use_states()' to anvil-daemon to clean any stale entries marking a database as in use.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-03-14 16:41:37 -04:00
Digimer
a633ab7f63 Added a periodic check to ensure all users can ping. This fixes a bug where a local striker dashboard whose DB was stopped wouldn't work.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-01-21 14:27:58 -05:00
Digimer
3346d31194 * Created Get->kernel_release() that returns the current kernel release (version) in use on the host or on a remote system.
* Created DRBD->_initialize_drbd() to makes sure the DRBD kernel module can load and tries to build the module, if necessary. This is meant to provide support for clients that can't access needed internet resource (or the internet at all).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-12-07 20:03:39 -05:00
Digimer
0fcde483be
Merge branch 'master' into anvil-tools-dev 2021-09-20 11:44:30 -04:00
Digimer
72b17ff1f9 * Reworked how databases are stopped, now being handled in anvil-daemon. This way, initial starts will still do traditional resyncs, then shut down. This should allow the best of both worlds, where data is not lost on striker start/stop loss/recovery, but operate normally otherwise without delays.
* Updated Database->archive_database() to return the full path to the dump file.
* Disabled enabling the postgresql daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-18 22:33:31 -04:00
Madison Kelly
922899ea78 * WIP: Working on a new method of failing over between which Striker is the active database, instead of running N-number of databases all the time.
* Created Database->backup_database() that creates a pg_dump of the active database.
* Created Database->load_database() that loads the database from a flat file, optionally creating a backup before doing so, and using iptables to block access during the process.
* Updated Database->configure_pgsql() to not start the postgresql daemon unless it just initialized the DB.
* Much work, not yet complete, to Database->connect() to stop after the first successful connection. Added logic that, if not connection was established and the host is a Striker, to load a peer's backup, if it exists, and then start the local daemon.
* Updated anvil-daemon to now have a section to run tasks on a ten minute cycle, which will later be used for the primary Striker to dump / copy its database to peer(s).

Signed-off-by: Madison Kelly <mkelly@alteeve.ca>
2021-09-16 23:10:55 -07:00
Digimer
9edf698c37 Updated Database->get_storage_group_data() to determine when a node or DR host needs to be removed from a Storage group, or when a member of an Anvil! needs to be added to a storage group.
Created Storage->get_vg_name() to assist with anvil-manage-dr, which is still a WIP.
Continued work on anvil-manage-dr (which exposed the issue that required the update to Database->get_storage_group_data().

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-08 00:50:45 -04:00
Tsu-ba-me
d195a53ba2 feat(cgi-bin): add endpoint for fetching server screenshot 2021-09-03 16:21:55 -04:00
Digimer
c449e2edf0 Resetting scan agent timeout to 30 seconds, 60 didn't help with a random
hang issue.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-12 15:11:46 -04:00
Digimer
38f95870bb Changed the agent runtime timeout to 60 seconds.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-11 22:39:23 -04:00
Digimer
15d8309095 This commit adds scan agent DB connection info caching to help minimize the number of unnecessary DB resync checks that happen.
* Created ScanCore->agent_shutdown() that writes out the time the scan agent last ran, and how many databases were available when it last ran.
* Updated ScanCore->agent_startup() to read the the last run data created above.
* Updated Database->connect() to set 'sys::database::last_db_count' to the scan agent's recorded last DB count.
* Updated all agents to call ScanCore->agent_shutdown() at the end of their run.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-11 21:14:35 -04:00
Tsu-ba-me
1d61c8fff7 fix(cgi-bin): modify manage_vnc_pipes endpoint to trigger a job 2021-08-04 13:38:28 -04:00
Tsu-ba-me
d5724c1457 chore(tools): rename striker-start-ssh-tunnel->striker-open-ssh-tunnel 2021-08-04 13:34:58 -04:00
Tsu-ba-me
23d818cfff fix(cgi-bin): avoid direct SSH calls 2021-08-04 13:34:58 -04:00
Tsu-ba-me
3de9912f51 fix(Anvil): use augeas to modify Apache conf 2021-07-28 18:41:28 -04:00
Digimer
11b1900e1b Note: Continuing to resolve the build issues with network startup. Expect breakage.
* Upped the aging of jobs and alerts data from 2 to 24 hours. Also added a check to prevent deleting a job of any age that is incomplete.
* Major update to anvil-configure-host to not touch the network unless something has actually changed. Not yet tested on a fresh system, will verify nothing broke in the CI tests this commit will trigger. Also changed it so that, if after reconfiguring the network it times out trying to reconnect to a database, it calls a reboot instead of simply exiting. Further, a reboot is now not called on exit unless something changed to require it.
* Updated Network->check_bonds() to return '1' if anything was done to heal a bond.
* Updated anvil-update-states to be more careful about clearing virsh bridges. Specifically, it checks to see if virsh is running and that the returned bridges aren't actually error codes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-14 01:58:25 -04:00
Digimer
16c20ae69c * Updated Tools->catch_sig() to use return code 0 instead of 255 so that systemd doesn't think our daemons failed on stop.
* Updated Cluster->parse_cib() to not require a database connection (part of the work to make ocf:alteeve:server run without a DB)
* WIP: Continuing work on the ocf:alteeve:server RA to run without database connections.
* Updated the scancore daemon to explcitely check that all scan agent schemas are loaded in all databases on startup. This is to resolve resync issues on rebuilt strikers that may not yet have some schemas loaded when a DB resync runs.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-05 14:32:26 -04:00
Digimer
73267a8ea9 * WIP - Slowly working on anvil-manage-server
* Updated the scancore interval to 60 seconds.
* Updated Database->insert_or_update_health() so that 'delete' can find the health_uuid.
* Updated Convert->time() to return silently when passed '-1'.
* Fixed a bug scan-hardware to call Convert->round(). Also fixed it so it didn't set health scores of 0 for mismatch RAM when the RAM was not mismatched.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-02 14:08:55 -04:00
Digimer
4dcd505753 * Biggest change in this commit; scan-apc-pdu and scan-apc-ups now only run on Striker dashboards! This was because we found that if two machines ran their agents at the same time, the reponce time from SNMP read requests grew a lot. This meant it was likely a third, fourth and so on machne would also then have their scan agent runs while the existing runs were still trying to process, causing the SNMP reads to get slower still until timeouts popped.
* Bumped scancore's scan delay from 30 seconds to 60.
* Shorted the age-out time to 24 hours and again boosted the archive thresholds. As we get a feel for the amount of data collected on multi-Anvil! systems over time, we may continue to tune this.l
* Moved Database->archive_database() to be called daily by anvil-daemon, instead of during '->connect' calls.
* Added locking to Database->_age_out_data to avoid resyncs mid-purge. Also moved the power, temperature and ip_address columns into the same 'to_clean' hash as it was duplicate logic.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-31 13:34:49 -04:00
Digimer
6abe06f125 The theme of these commits is improving DB responsiveness.
* Created Database->_age_out_data() to delete records from the database that are old enough to no longer be useful. This is designed to significantly reduce the size of the database, allowing a better focus on performance.
* Changed Database->connect() to default to NOT check for resync, reworking the old 'no_resync' to 'check_for_resync', so that resync checks happen on demard, instead of by default.
* Updated get_tables_from_schema() to now allow 'schema_file' to be set to 'all', which then loads the schema files of all scan agents as well as the core anvil schema file. Fixed a bug where commented out tables were being counted.
* Re-enabled triggering resyncs on 'last_updated' differences.
* Fixed a bug in scan-ipmitool where the history_id column in history.scan_ipmitool_value was incorrect.
* Created a new tool called striker-show-db-counts that shows the number of records in all public and history schema tables for all databases.
* Updated anvil-update-states to detect when a libvirtd NAT'ed bridge exists and to delete it when found.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-29 23:34:22 -04:00
Digimer
bbad058b33 * Created a new tool, anvil-watch-bonds, which is a live monitor of bonds and interfaces designed to be run from the command line on a given host.
* Created Words->center_text that takes a string (or string key) and centers it to a given string length, padding white spaces on either side of the string as needed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-26 20:24:05 -04:00
Digimer
42ffc200bc * Updated remainder pointers to the old repos to the new repos. Added support for the new alteeve-repo-setup.
* Removed the checks for resync that limited resyncs on jobs and variables tables. That approach to minimize unnecessary resyncshas proven faulty, will find another way later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-24 14:34:15 -04:00
Digimer
41cd1e0319 * Several bugs fixed and enhancements;
* DRBD is now configured to a ping-timeout of 3 seconds.
* Created Log->switches() that returnes the command line switches used by Anvil! tool command line calls based on the active log levels / secure logging. Appended this to all invocations of our tools.
* Updated Database->resync_databases() to now only skip 'jobs' and 'variables' tables with less than 10 record differences. All other differences will trigger a resync.
* Created System->_check_anvil_conf() that, as you might guess, checks in anvil.conf exists and created it (using defaults), if not. It also checks to see if the 'admin' group and user exists and creates them, if not.
* Updated anvil-daemon to check anvil.conf on start up and in each loop. Created the function check_journald() that checks (and sets, if needed) that journald logging is persistent.
* Made striker-manage-peers to check_if_configured on the Database->connect() when updating anvil.conf and the target UUID is the local machine. Also created a loop to make the reconnection a lot more robust.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-24 00:09:32 -04:00
Digimer
fc0954d0c8 * Started work on, but not at all finished, anvil-manage-server which will allow manipulation of a server's resources.
* Changed the alteeve repo RPM to the new cimmunity/enterprise repo
* Fixed a bug where 'fence_data::updated' was causing the fences web page to break.
* Fixed a bug in Database->insert_or_update_network_interfaces() where certain interfaces were being repeatedly added to the database.
* Fixed a bug in Database->_find_behind_databases() was marking DBs as behind even though they had less than 10 columns off.
* Fixed a bug in Get->host_name() where, if the host name was changed on disk but the environment variable was still the old name, it would cause the hostname to waffle back and forth and cause constant updated to /etc/hosts.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-20 00:16:09 -04:00
Digimer
ad4a1ecc78 * Increaded the scancore agent run timeout to 60 seconds.
* Updated anvil-safe-start to start DRBD resources when the peer's DRBD resourcs is 'Connecting',
* Updated fence_pacemaker to more intelligently check the list of host names related to an IP address when looking for the peer host name

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-15 00:12:43 -04:00
Digimer
ca7052dd53 The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :)
* Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively.
* Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf.
* Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state.
* Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested.
* Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times.
* Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan.
* Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-30 22:58:01 -04:00
Digimer
27259d1d53 * Finished anvil-rename-server!
* Created Storage->delete_file() that, well, deletes files (locally or on a peer).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-22 13:29:50 -04:00
Digimer
711a04999e * Finished anvil-migrate-server and anvil-safe-start! Lots of testing still needed for both though, and 'anvil-safe-start' does run as a job yet, but the logic is all there.
* Fixed a bug in Cluster->migrate_server() where waiting for the server to migate would never exit.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-19 00:32:13 -04:00
Digimer
eec14cb013 * Finished tools/anvil-boot-server and tools/anvil-shutdown-server.
* Fixed a bug where, in rare cases, $anvil->hostname() would call 'hostnamectl' and get a dbus error during shutdown, which would then cause the hostname to be changed to the error in the database.
* Fixed a bug in Cluster->boot_server() where it would never verify that a server has started successfully.
* Updated Database->get_ip_addresses() to store the IPs we manage in 'ip_addresses::<ip_address_address>::X'.
* Updated ocf:alteeve:server to work from command line calls, though more testing is still needed.
* Started work on 'anvil-rename-server', but haven't gotten far with it yet.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-18 19:54:58 -04:00
Digimer
e036515df3 * Got anvil-safe-start to the point where is starts the cluster stack. Need to create the 'anvil-boot-server' and 'anvil-shutdown-server' before it can be completed, so those files have been added.
* Created Cluster->parse_quorum() to check if a node is quorate as 'have-quorum' in the pacemaker CIB doesn't appear to be super accurate during startup.
* Fixed a bug in striker-manage-install-target where if a node didn't have any registered IPs, it would break before generating the repo data.
* Fixed a bug in anvil-join-anvil where if the database had to be reconnected, the job data was lost.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-14 00:26:06 -04:00