Commit Graph

511 Commits

Author SHA1 Message Date
Tsu-ba-me
7d9013a60b fix(tools): allow striker-manage-vnc-pipes to be executed as a job 2021-08-04 13:38:26 -04:00
Tsu-ba-me
0935b9a990 feat(tools): move manage_vnc_pipes endpoint core logic to separate script 2021-08-04 13:34:58 -04:00
Tsu-ba-me
5459e610aa fix(tools): auto-end tunnel script when connection breaks 2021-08-04 13:34:58 -04:00
Tsu-ba-me
d5724c1457 chore(tools): rename striker-start-ssh-tunnel->striker-open-ssh-tunnel 2021-08-04 13:34:58 -04:00
Tsu-ba-me
23d818cfff fix(cgi-bin): avoid direct SSH calls 2021-08-04 13:34:58 -04:00
Digimer
e3d65d654c * Continuing work on anvil-manage-server.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-29 17:15:02 -04:00
Digimer
3f1c2dd38f * Couple of small cleanups for fence_delay.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-26 15:47:26 -04:00
Digimer
8d2e454d69 * Updated fence_delay to set the ownership of the log file to 'hacluster:haclient'. This should address https://github.com/digimer/fence_delay/issues/1
* WIP - COntinuing work on anvil-manage-server, far from done yet.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-26 15:35:04 -04:00
Digimer
bc8b9274cb WIP; Reworked anvil-manage-server to have a more interactive menu system (for the sections done so far).
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-26 15:35:04 -04:00
Digimer
28865780f8 * Updated Database->get_server_definitions() to take a specific server UUID, allowing just the one definition to be loaded. Also had it clear previous loads.
* Updated Server->parse_definition() to call DRBD->get_devices() so that referenced LVs can be loaded properly.
* Continued WIP in anvil-manage-server

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-20 23:19:29 -04:00
Digimer
623dbb0863 WIP; Restarted work on anvil-manage-server.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-18 16:21:00 -04:00
Digimer
548c52701a Updates Jobs->update_progress() to take a 'variables' hash reference, and to support logging as well.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-16 15:07:07 -04:00
Digimer
1e159f548e Added a couple notes for later dev.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-16 11:13:48 -04:00
Digimer
39236e9b3f Switched default graphics for new servers to 'vnc' instead of spice.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-16 00:08:02 -04:00
Digimer
cebae28716 * WIP - Fixing a bug in scan-network where vnet devices aren't being recorded against their bridge.
* Updated scan-server to record the VNC port it is using in the database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-15 00:42:47 -04:00
Digimer
7e7b91b286 * Updates anvil-join-anvil to update corosync.conf to use the BCN1 link as the main knet network with the SN1 link as the backup link.
* Fixed a bug in Cluster->parse_cib() where the local machine's ready state was being set to the node name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-14 12:17:19 -04:00
Digimer
d7d418ee1b * Fixed a bug in DRBD->gather_data() where the peer node's data was being recorded where the local node's data should have been saved.
* Fixed a bug in anvil-delete-server where, if a server was off already, the server would not be removed from pacemaker.
* WIP - continuing on scan-network

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-30 14:58:36 -04:00
Digimer
a697011b08 * Disabled debug logging in anvil-daemon.
* WIP - working on new scan-network scan agent.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-30 02:36:06 -04:00
Digimer
f7b8a053b0
Merge branch 'master' into scancore-debugging 2021-06-28 20:22:37 -04:00
Digimer
6777104398 * Fixed a bug in anvil-daemon where, when an anvil-manage-power reboot run had triggered a reboot, anvil-daemon didn't set the job_progress to '100', causing constant reboots. Also fixed a bug where the log level was hard-set to '1' instead of '2' needed during debugging.
* Updated Jobs->get_job_uuid() to accept the new 'incomplete' parameter that, when set, will look for jobs whose progress is > 1 and < 100.
* Updated ScanCore-agent_startup() to take the new 'no_db_ok' parameter which returns with '0' if no DB is available and that parameter is set to '1'.
* Fixed a logging bug in 'anvil-join-anvil'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-28 20:04:11 -04:00
Fabio M. Di Nitto
7aea5e1b11 Switch to kmod-drdb
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-06-27 07:36:36 +02:00
Digimer
04f7571097 * Fixed a typo causing anvil-manage-power to not compile.
* Updated anvil-configure-host to register a reboot job when needed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-25 21:40:55 -04:00
Digimer
0c475d2a2e * Fixed a couple logging bugs.
* Updated scan-cluster to get the CIB from pcs instead of reading the CIB from disk.
* Updated anvil-daemon to always call striker-prep-database at log level 2 while trying to find the cause of rare postgres config failures. Also updated striker-prep-database to use the new method of initializing the DB.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 18:22:55 -04:00
Digimer
d3052c0229 * Finished Cluster->check_server_constraints() and added it to scan-cluster. This now makes sure servers don't roll back to their old host after it has been fenced and recovers.
* Completely disabled Network->check_network(), it's causing more problems than it solves.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 14:19:58 -04:00
Digimer
e7a06fce72 * Disabling the periodic network health check in anvil-daemon.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 11:01:33 -04:00
Digimer
30f478267a * Forced anvil-daemon to log-level 2 and to enable secure logging to continue debugging setup issues.
* Fixed a undefined variable warning.
* Removed a debugging die from Database->resync_databases().

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 19:41:00 -04:00
Digimer
47fa126a3c * Fixed a typo that blocked anvil-daemon from starting.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 19:00:26 -04:00
Digimer
023f43eda9 * In the never-ending attempt to resolve the build consistency issues, this commit enables extra debugging logging and, hopefully, implements a fix in anvil-daemon where a job could be started repeatedly.
* Renamed the special job status 'scancore_startup' to 'anvil_startup', given it's handled by anvil-daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 16:12:12 -04:00
Digimer
5a343d6d75 * WIP; Started work on Cluster->check_server_constraints() that will track when a server's location constraint needs to be updated when the old preferred node is lost.
* Removed (for now) setting MTU in the ifcfg-X files during anvil-configure-host runs.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-21 23:22:48 -04:00
Digimer
76689aa245 * I've decided that live reconfiguring of NetworkManager interfaces is too unreliable. This commit disables all attempts to reconfigure the network while it's up, and simply reboots on changes.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-20 12:03:35 -04:00
Digimer
629c2b8e8c * Moved up when the reboot happens, when it's needed, avoiding a network reload when a reboot is going to happen anyway.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-19 14:56:28 -04:00
Digimer
bbee77d265 * Re-enabled reboot
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 23:18:53 -04:00
Digimer
08a958ec60 * Finished updating Network->check_network() to check/heal bridges.
* Updated anvil-configure-host to not reboot on network chane (will verify when this commit is function tested).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 22:42:10 -04:00
Digimer
6a8a192cfd * Added an explicit delete call when network changes.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 21:06:24 -04:00
Digimer
bd24c1c5bb * I _might_ have fixed the network configuration issue in anvil-configure-host... Updated it so that if 'nmcli' doesn't report a valid device name, it looks for it in the ifcfg-X file, and uses 'X' if not found there.
* Added the 'print' parameter to Log->variables() to allow printing to STDOUT when set.
* Renamed Network->check_bonds() to Network->check_networks() in anticipation of adding bridge monitoring / repair to it later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 19:37:37 -04:00
Digimer
c7c6c8dee5 * Reworked the attempt to repair the network in anvil-daemon to not touch the network until the machine has been running for at least two minutes.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-15 12:04:27 -04:00
Digimer
11b1900e1b Note: Continuing to resolve the build issues with network startup. Expect breakage.
* Upped the aging of jobs and alerts data from 2 to 24 hours. Also added a check to prevent deleting a job of any age that is incomplete.
* Major update to anvil-configure-host to not touch the network unless something has actually changed. Not yet tested on a fresh system, will verify nothing broke in the CI tests this commit will trigger. Also changed it so that, if after reconfiguring the network it times out trying to reconnect to a database, it calls a reboot instead of simply exiting. Further, a reboot is now not called on exit unless something changed to require it.
* Updated Network->check_bonds() to return '1' if anything was done to heal a bond.
* Updated anvil-update-states to be more careful about clearing virsh bridges. Specifically, it checks to see if virsh is running and that the returned bridges aren't actually error codes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-14 01:58:25 -04:00
Digimer
a1b06e4355 * Continuing to try to get the network to reliably start during configuration...
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 14:59:39 -04:00
Digimer
1e7847d4dd * Added a call to Network->check_bonds() to be called while non-Striker machines wait to connect to a database.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 14:14:37 -04:00
Digimer
3f32a56d0c * Created Network->check_bonds() that checks to see if any bonds are down, or if any interfaces configured to be in a bond are not actually in it. It accepts a 'heal' parameter that, by default, will bring up a bond with no active links, but leaves degraded bonds alone. It call also take 'all' and will try to bring up any missing interfaces. This distinction exists so that if a link is flaky and someone takes it down manually until it can be repaired, it doesn't get turned back on.
* Updated anvil-daemon to call Network->check_bonds() with 'all' on startup, then woth 'down_only' once per minute to try to heal down'ed bonds.
* Updated anvil-watch-bonds to take a 'run-once' switch and exit after one report, if set.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 13:33:51 -04:00
Digimer
0dd92a08c5 * Small change to variable name to help make logs clearer.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-11 21:47:10 -04:00
Digimer
0b6a9e37fa * Added scan_lvm_pv_sector_size to the scan_lvm_pvs table in the scan-lvm. This will be used later for growing a requested disk size for the DRBD metadata.
* Added a 1 minute delay to anvil-configure-host before calling a reboot.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-11 19:57:30 -04:00
Digimer
80bdac8e34 * Updated the pacemaker server config to drop the stop timeout to 5 minutes and the migration timeout to 10 minutes. This will avoid blocking the entire cluster when a stop or migrate operation times out. Will update scan-server to clean these up when they happen.
* Updated Database->archive_table() and ->_find_behind_databases() to loop through connected databases, instead of configured databases.
* Updated Network->get_ips() to only record the real MAC addresses on network interfaces (not bonds or bridges) in the "network::${host}::interface::${in_iface}::mac_address" hash. This should help avoid reboot loops caused by anvil-configure-host thinking the network needs to be reconfigured when it doesn't actually need to be.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-11 03:17:07 -04:00
Digimer
19c41c9171 * Added more logging while chasing a function test bug.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 23:56:48 -04:00
Digimer
0f43961568 * This commit lowers the logging levels of some debug log entries. It's to help diagnose occassional function test failures with an unknown source.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 21:57:51 -04:00
Digimer
daca6c887b * This contains a fairly major change to how time stamps are handled. All INSERT and UPDATE calls now generate a new timestamp via Database->refresh_timestamp, instead of using 'sys::database::timestamp'. This was done in responce to finding a bug where tables in a database differed in both counts of public and private schemas (ip_addresses table, specifically) that failed to resync because the timestamps were re-used too often.
* WIP - Continuing work on the new anvil-manage-server tool.
* Updated Database->get_anvils() to load information on the files available on each Anvil! system.
* Updated Database->insert_or_update_network_interfaces() to no longer take the 'timestamp' parameter.
* Removed all logging from Database->refresh_timestamp() to speed it up, given how often it will be called now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 15:23:15 -04:00
Digimer
5b4bfa747c * Reworked the anvil-join-anvil job parsing to help diagnose occassional faults. Also changed a fatal parse error to one that allows the run to be retried.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-06 01:54:28 -04:00
Digimer
96fffb0b96 * Finished updating ocf:alteeve:server to no longer require a database connection. To do this, and still be able to track live migration times, the Server->migrate_virsh() method now writes out the server name and migration time to a /tmp/anvil/migration-duration.<server_name>.<unix_time> file. This file is checked for by the scan-server resource agent and, when found, is parsed and the migration duration is recorded, then the file is purged.
* Updated anvil-daemon to have a new function called "handle_special_cases" called during startup that does any weird bug mitigation required. For now, this is used to mitigate against rhbz#1961562, though certainly it will be used for other reasons later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-06 00:01:11 -04:00
Digimer
e15c1651ed * Fixed a bug with deleting bad keys where jobs to delete keys on non-dashboard machine wasn't being assigned to the proper target machine.
* Fixed a bug with anvil-manage-keys where a state_uuid entry recorded on one database may not be read from a machine reading from another database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-05 19:07:25 -04:00
Digimer
16c20ae69c * Updated Tools->catch_sig() to use return code 0 instead of 255 so that systemd doesn't think our daemons failed on stop.
* Updated Cluster->parse_cib() to not require a database connection (part of the work to make ocf:alteeve:server run without a DB)
* WIP: Continuing work on the ocf:alteeve:server RA to run without database connections.
* Updated the scancore daemon to explcitely check that all scan agent schemas are loaded in all databases on startup. This is to resolve resync issues on rebuilt strikers that may not yet have some schemas loaded when a DB resync runs.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-05 14:32:26 -04:00