Commit Graph

704 Commits

Author SHA1 Message Date
Digimer
4e9882812d * Fixed a bug where the periodic database dumps on the primary database Striker were not sync'ing to peers. Also fixed a bug where these periodic dumps weren't running at all.
* Updated anvil-daemon->prep_database() to only run if the database dump file doesn't exist. (If it does, it's clearly configured).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-18 23:18:06 -04:00
Digimer
72b17ff1f9 * Reworked how databases are stopped, now being handled in anvil-daemon. This way, initial starts will still do traditional resyncs, then shut down. This should allow the best of both worlds, where data is not lost on striker start/stop loss/recovery, but operate normally otherwise without delays.
* Updated Database->archive_database() to return the full path to the dump file.
* Disabled enabling the postgresql daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-18 22:33:31 -04:00
Madison Kelly
922899ea78 * WIP: Working on a new method of failing over between which Striker is the active database, instead of running N-number of databases all the time.
* Created Database->backup_database() that creates a pg_dump of the active database.
* Created Database->load_database() that loads the database from a flat file, optionally creating a backup before doing so, and using iptables to block access during the process.
* Updated Database->configure_pgsql() to not start the postgresql daemon unless it just initialized the DB.
* Much work, not yet complete, to Database->connect() to stop after the first successful connection. Added logic that, if no connection was established and the host is a Striker, loads a peer's backup, if it exists, and then starts the local daemon.
* Updated anvil-daemon to now have a section to run tasks on a ten minute cycle, which will later be used for the primary Striker to dump / copy its database to peer(s).

Signed-off-by: Madison Kelly <mkelly@alteeve.ca>
2021-09-16 23:10:55 -07:00
Digimer
6664c5b77f * Fixed a bug where scan-drbd, with DR configured, was not recording TCP ports assigned to connections properly.
* More bugs fixed in anvil-manage-dr, tested repeatedly as a job and so far, so good. Other functionality still to come.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-12 23:34:25 -04:00
Digimer
da9dc03d04 Updated anvil-manage-dr to update the job progress and convert prints into strings.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-12 15:14:03 -04:00
Digimer
ffd15406e0 * anvil-manage-dr can now protect a server! Still lots to do though.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-11 19:42:55 -04:00
Digimer
20a784baa2 * Continuing work on anvil-manage-dr. Got it to the point where it should (but doesn't yet) create the new DRBD config and the LV(s) on DR.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-11 16:41:23 -04:00
Digimer
5b35204af4 * Updated DRBD->get_next_resource() to take the new 'dr_tcp_ports' ports which, if set, returns two free TCP ports.
* Got anvil-manage-dr to the point where it writes the updated resource configuration to enable DR support. (untested)

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-09 23:07:03 -04:00
Digimer
9edf698c37 Updated Database->get_storage_group_data() to determine when a node or DR host needs to be removed from a Storage group, or when a member of an Anvil! needs to be added to a storage group.
Created Storage->get_vg_name() to assist with anvil-manage-dr, which is still a WIP.
Continued work on anvil-manage-dr (which exposed the issue that required the update to Database->get_storage_group_data()).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-08 00:50:45 -04:00
Tsu-ba-me
f61527edf7 fix(tools): save screenshots to states table 2021-09-03 16:21:55 -04:00
Tsu-ba-me
c1859bc8d8 fix(tools): use netpbm tools instead of imagemagick 2021-09-03 16:21:55 -04:00
Tsu-ba-me
65613f501b fix(tools): add option to resize server screenshot 2021-09-03 16:21:55 -04:00
Tsu-ba-me
7467036054 build(tools): add anvil-get-server-screenshot script to build 2021-09-03 16:21:55 -04:00
Tsu-ba-me
da6b4d39c6 fix(tools): disable line wrap in image Base64 output 2021-09-03 16:21:55 -04:00
Tsu-ba-me
4ef231b567 fix(tools): prevent too frequent inserts of server VM screenshots 2021-09-03 16:21:55 -04:00
Tsu-ba-me
1014299d38 fix(tools): enable anvil-get-server-screenshot to be a job 2021-09-03 16:21:55 -04:00
Tsu-ba-me
f97a820b48 feat(tools): add script to take screenshot of server VM 2021-09-03 16:21:55 -04:00
Digimer
2f8b1fb72e Updated anvil-provision-server so that when the OS type is 'win7', set the disk to sata and the NIC to e1000e. Also updated it to store the virt-install call in the 'variables' table and write it out to /mnt/shared/provision.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-31 00:32:15 -04:00
Digimer
4427fe9f0d * Found the source of the vnet constantly cycling back to 'up' bug. The anvil-update-state tool was marking the vnet device operational state back to 'unknown' and scan-network was marking it back up.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-30 17:59:39 -04:00
Digimer
e40d0e2444 Fixed a bug where if a database is pingable but the pgsql database is down, and it's the first database tested (or local), then the DB handle used to read / quote fails.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-26 23:26:03 -04:00
Digimer
4c7bb45ab9 Fixed a race condition where configuring the IPMI BMC would appear to fail because the BMC wouldn't report the user list after a cold reset.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-25 21:02:00 -04:00
Digimer
6cbdc388d4 Fixed a bug where corosync's configuration of a backup ring was broken.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-24 15:52:44 -04:00
Digimer
04cb116c1b Updated anvil-parse-fence-agents to validate each fence agent's metadata is valid before adding it to the unified XML.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-19 00:58:26 -04:00
Digimer
8abb5b46e0 * Added support for setting per-agent log-level and log secure values in anvil.conf.
* Moved the check for an agent being disabled into ScanCore->agent_startup()

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-18 23:07:15 -04:00
Digimer
3674a47179 WIP - Working on a tool to manually load updated server definition files.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-16 11:39:16 -04:00
Digimer
aec22bb79c Added a check in scan-network that finds/removes duplicate network interface names.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-11 12:17:01 -04:00
Digimer
4800f7181f * Updated ScanCore to boot a node that is off without a stop reason.
* Fixed a bug where anvil-safe-stop was not recording the stop-reason. Also made '--poweroff' an alias for '--power-off'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-07 14:01:14 -04:00
Digimer
acaacd9a86 * Created Storage->get_size_of_block_device() that takes a block device path and returns the size of the path, if it's found in the database.
* More work on the storage management of anvil-manage-server.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-06 22:46:02 -04:00
Digimer
7a504467ef Merge branch 'master' into anvil-tools-dev 2021-08-06 14:51:44 -04:00
Digimer
606bd8f1f0 Continuing work on anvil-manage-server.
Created Storage->get_storage_group_from_path() that takes a block device path and tries to find the Storage Group it belongs to.
Updated Storage->get_storage_group_data() to make it possible to look up a storage group UUID using the SG's name.
Updated DRBD->gather_data() to take a pre-generated XML via the new 'xml' parameter.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-08-06 14:21:11 -04:00
Tsu-ba-me
063840ecb6 fix(tools): correct message_* string keys in striker-manage-vnc-pipes 2021-08-04 13:53:44 -04:00
Tsu-ba-me
8da318c933 fix(tools): patch failure to fix 2nd pipe after server migration 2021-08-04 13:40:02 -04:00
Tsu-ba-me
0f1c3d2435 chore(tools): remove unused function from striker-manage-vnc-pipes 2021-08-04 13:40:02 -04:00
Tsu-ba-me
cdb66019d3 fix(tools): avoid port conflict 2021-08-04 13:40:02 -04:00
Tsu-ba-me
7e447000b4 fix(cgi-bin): use unspecified instead of loopback address in SSH tunnel 2021-08-04 13:40:02 -04:00
Tsu-ba-me
b3b6da8259 chore(cgi-bin): remove debug log level from manage_vnc_pipes and its support scripts 2021-08-04 13:40:02 -04:00
Tsu-ba-me
549758b2f2 build(tools): include support scripts for manager_vnc_pipes endpoint into makefile 2021-08-04 13:40:02 -04:00
Tsu-ba-me
e50bfc7308 fix(tools): correct typo in passing server_uuid to get_vnc_info() 2021-08-04 13:40:02 -04:00
Tsu-ba-me
3a8f4c339b fix(tools): use VNC port in variables table if available 2021-08-04 13:40:02 -04:00
Tsu-ba-me
e4436be17b fix(tools): do checks and kills as root 2021-08-04 13:40:02 -04:00
Tsu-ba-me
bb155a5786 fix(tools): update job progress in catch-all case 2021-08-04 13:40:00 -04:00
Tsu-ba-me
ffc1fb096a fix(tools): correct switch name typo in striker-manage-vnc-pipes 2021-08-04 13:38:28 -04:00
Tsu-ba-me
1fec288ad0 fix(tools): make striker-manage-vnc-pipes executable 2021-08-04 13:38:28 -04:00
Tsu-ba-me
7d9013a60b fix(tools): allow striker-manage-vnc-pipes to be executed as a job 2021-08-04 13:38:26 -04:00
Tsu-ba-me
0935b9a990 feat(tools): move manage_vnc_pipes endpoint core logic to separate script 2021-08-04 13:34:58 -04:00
Tsu-ba-me
5459e610aa fix(tools): auto-end tunnel script when connection breaks 2021-08-04 13:34:58 -04:00
Tsu-ba-me
d5724c1457 chore(tools): rename striker-start-ssh-tunnel->striker-open-ssh-tunnel 2021-08-04 13:34:58 -04:00
Tsu-ba-me
23d818cfff fix(cgi-bin): avoid direct SSH calls 2021-08-04 13:34:58 -04:00
Digimer
e3d65d654c * Continuing work on anvil-manage-server.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-29 17:15:02 -04:00
Digimer
3f1c2dd38f * Couple of small cleanups for fence_delay.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-26 15:47:26 -04:00
Digimer
8d2e454d69 * Updated fence_delay to set the ownership of the log file to 'hacluster:haclient'. This should address https://github.com/digimer/fence_delay/issues/1
* WIP - Continuing work on anvil-manage-server, far from done yet.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-26 15:35:04 -04:00
Digimer
bc8b9274cb WIP; Reworked anvil-manage-server to have a more interactive menu system (for the sections done so far).
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-26 15:35:04 -04:00
Digimer
28865780f8 * Updated Database->get_server_definitions() to take a specific server UUID, allowing just the one definition to be loaded. Also had it clear previous loads.
* Updated Server->parse_definition() to call DRBD->get_devices() so that referenced LVs can be loaded properly.
* Continued WIP in anvil-manage-server

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-20 23:19:29 -04:00
Digimer
623dbb0863 WIP; Restarted work on anvil-manage-server.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-18 16:21:00 -04:00
Digimer
548c52701a Updates Jobs->update_progress() to take a 'variables' hash reference, and to support logging as well.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-16 15:07:07 -04:00
Digimer
1e159f548e Added a couple notes for later dev.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-16 11:13:48 -04:00
Digimer
39236e9b3f Switched default graphics for new servers to 'vnc' instead of spice.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-16 00:08:02 -04:00
Digimer
cebae28716 * WIP - Fixing a bug in scan-network where vnet devices aren't being recorded against their bridge.
* Updated scan-server to record the VNC port it is using in the database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-15 00:42:47 -04:00
Digimer
7e7b91b286 * Updates anvil-join-anvil to update corosync.conf to use the BCN1 link as the main knet network with the SN1 link as the backup link.
* Fixed a bug in Cluster->parse_cib() where the local machine's ready state was being set to the node name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-07-14 12:17:19 -04:00
Digimer
d7d418ee1b * Fixed a bug in DRBD->gather_data() where the peer node's data was being recorded where the local node's data should have been saved.
* Fixed a bug in anvil-delete-server where, if a server was off already, the server would not be removed from pacemaker.
* WIP - continuing on scan-network

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-30 14:58:36 -04:00
Digimer
a697011b08 * Disabled debug logging in anvil-daemon.
* WIP - working on new scan-network scan agent.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-30 02:36:06 -04:00
Digimer
f7b8a053b0 Merge branch 'master' into scancore-debugging 2021-06-28 20:22:37 -04:00
Digimer
6777104398 * Fixed a bug in anvil-daemon where, when an anvil-manage-power reboot run had triggered a reboot, anvil-daemon didn't set the job_progress to '100', causing constant reboots. Also fixed a bug where the log level was hard-set to '1' instead of '2' needed during debugging.
* Updated Jobs->get_job_uuid() to accept the new 'incomplete' parameter that, when set, will look for jobs whose progress is > 1 and < 100.
* Updated ScanCore->agent_startup() to take the new 'no_db_ok' parameter which returns with '0' if no DB is available and that parameter is set to '1'.
* Fixed a logging bug in 'anvil-join-anvil'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-28 20:04:11 -04:00
Fabio M. Di Nitto
7aea5e1b11 Switch to kmod-drbd
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-06-27 07:36:36 +02:00
Digimer
04f7571097 * Fixed a typo causing anvil-manage-power to not compile.
* Updated anvil-configure-host to register a reboot job when needed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-25 21:40:55 -04:00
Digimer
0c475d2a2e * Fixed a couple logging bugs.
* Updated scan-cluster to get the CIB from pcs instead of reading the CIB from disk.
* Updated anvil-daemon to always call striker-prep-database at log level 2 while trying to find the cause of rare postgres config failures. Also updated striker-prep-database to use the new method of initializing the DB.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 18:22:55 -04:00
Digimer
d3052c0229 * Finished Cluster->check_server_constraints() and added it to scan-cluster. This now makes sure servers don't roll back to their old host after it has been fenced and recovers.
* Completely disabled Network->check_network(), it's causing more problems than it solves.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 14:19:58 -04:00
Digimer
e7a06fce72 * Disabling the periodic network health check in anvil-daemon.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 11:01:33 -04:00
Digimer
30f478267a * Forced anvil-daemon to log-level 2 and to enable secure logging to continue debugging setup issues.
* Fixed an undefined variable warning.
* Removed a debugging die from Database->resync_databases().

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 19:41:00 -04:00
Digimer
47fa126a3c * Fixed a typo that blocked anvil-daemon from starting.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 19:00:26 -04:00
Digimer
023f43eda9 * In the never-ending attempt to resolve the build consistency issues, this commit enables extra debugging logging and, hopefully, implements a fix in anvil-daemon where a job could be started repeatedly.
* Renamed the special job status 'scancore_startup' to 'anvil_startup', given it's handled by anvil-daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 16:12:12 -04:00
Digimer
5a343d6d75 * WIP; Started work on Cluster->check_server_constraints() that will track when a server's location constraint needs to be updated when the old preferred node is lost.
* Removed (for now) setting MTU in the ifcfg-X files during anvil-configure-host runs.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-21 23:22:48 -04:00
Digimer
76689aa245 * I've decided that live reconfiguring of NetworkManager interfaces is too unreliable. This commit disables all attempts to reconfigure the network while it's up, and simply reboots on changes.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-20 12:03:35 -04:00
Digimer
629c2b8e8c * Moved up when the reboot happens, when it's needed, avoiding a network reload when a reboot is going to happen anyway.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-19 14:56:28 -04:00
Digimer
bbee77d265 * Re-enabled reboot
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 23:18:53 -04:00
Digimer
08a958ec60 * Finished updating Network->check_network() to check/heal bridges.
* Updated anvil-configure-host to not reboot on network change (will verify when this commit is function tested).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 22:42:10 -04:00
Digimer
6a8a192cfd * Added an explicit delete call when network changes.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 21:06:24 -04:00
Digimer
bd24c1c5bb * I _might_ have fixed the network configuration issue in anvil-configure-host... Updated it so that if 'nmcli' doesn't report a valid device name, it looks for it in the ifcfg-X file, and uses 'X' if not found there.
* Added the 'print' parameter to Log->variables() to allow printing to STDOUT when set.
* Renamed Network->check_bonds() to Network->check_networks() in anticipation of adding bridge monitoring / repair to it later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 19:37:37 -04:00
Digimer
c7c6c8dee5 * Reworked the attempt to repair the network in anvil-daemon to not touch the network until the machine has been running for at least two minutes.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-15 12:04:27 -04:00
Digimer
11b1900e1b Note: Continuing to resolve the build issues with network startup. Expect breakage.
* Upped the aging of jobs and alerts data from 2 to 24 hours. Also added a check to prevent deleting a job of any age that is incomplete.
* Major update to anvil-configure-host to not touch the network unless something has actually changed. Not yet tested on a fresh system, will verify nothing broke in the CI tests this commit will trigger. Also changed it so that, if after reconfiguring the network it times out trying to reconnect to a database, it calls a reboot instead of simply exiting. Further, a reboot is now not called on exit unless something changed to require it.
* Updated Network->check_bonds() to return '1' if anything was done to heal a bond.
* Updated anvil-update-states to be more careful about clearing virsh bridges. Specifically, it checks to see if virsh is running and that the returned bridges aren't actually error codes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-14 01:58:25 -04:00
Digimer
a1b06e4355 * Continuing to try to get the network to reliably start during configuration...
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 14:59:39 -04:00
Digimer
1e7847d4dd * Added a call to Network->check_bonds() to be called while non-Striker machines wait to connect to a database.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 14:14:37 -04:00
Digimer
3f32a56d0c * Created Network->check_bonds() that checks to see if any bonds are down, or if any interfaces configured to be in a bond are not actually in it. It accepts a 'heal' parameter that, by default, will bring up a bond with no active links, but leaves degraded bonds alone. It can also take 'all' and will try to bring up any missing interfaces. This distinction exists so that if a link is flaky and someone takes it down manually until it can be repaired, it doesn't get turned back on.
* Updated anvil-daemon to call Network->check_bonds() with 'all' on startup, then with 'down_only' once per minute to try to heal down'ed bonds.
* Updated anvil-watch-bonds to take a 'run-once' switch and exit after one report, if set.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 13:33:51 -04:00
Digimer
0dd92a08c5 * Small change to variable name to help make logs clearer.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-11 21:47:10 -04:00
Digimer
0b6a9e37fa * Added scan_lvm_pv_sector_size to the scan_lvm_pvs table in scan-lvm. This will be used later for growing a requested disk size for the DRBD metadata.
* Added a 1 minute delay to anvil-configure-host before calling a reboot.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-11 19:57:30 -04:00
Digimer
80bdac8e34 * Updated the pacemaker server config to drop the stop timeout to 5 minutes and the migration timeout to 10 minutes. This will avoid blocking the entire cluster when a stop or migrate operation times out. Will update scan-server to clean these up when they happen.
* Updated Database->archive_table() and ->_find_behind_databases() to loop through connected databases, instead of configured databases.
* Updated Network->get_ips() to only record the real MAC addresses on network interfaces (not bonds or bridges) in the "network::${host}::interface::${in_iface}::mac_address" hash. This should help avoid reboot loops caused by anvil-configure-host thinking the network needs to be reconfigured when it doesn't actually need to be.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-11 03:17:07 -04:00
Digimer
19c41c9171 * Added more logging while chasing a function test bug.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 23:56:48 -04:00
Digimer
0f43961568 * This commit lowers the logging levels of some debug log entries. It's to help diagnose occasional function test failures with an unknown source.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 21:57:51 -04:00
Digimer
daca6c887b * This contains a fairly major change to how time stamps are handled. All INSERT and UPDATE calls now generate a new timestamp via Database->refresh_timestamp, instead of using 'sys::database::timestamp'. This was done in response to finding a bug where tables in a database differed in both counts of public and private schemas (ip_addresses table, specifically) that failed to resync because the timestamps were re-used too often.
* WIP - Continuing work on the new anvil-manage-server tool.
* Updated Database->get_anvils() to load information on the files available on each Anvil! system.
* Updated Database->insert_or_update_network_interfaces() to no longer take the 'timestamp' parameter.
* Removed all logging from Database->refresh_timestamp() to speed it up, given how often it will be called now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 15:23:15 -04:00
Digimer
5b4bfa747c * Reworked the anvil-join-anvil job parsing to help diagnose occasional faults. Also changed a fatal parse error to one that allows the run to be retried.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-06 01:54:28 -04:00
Digimer
96fffb0b96 * Finished updating ocf:alteeve:server to no longer require a database connection. To do this, and still be able to track live migration times, the Server->migrate_virsh() method now writes out the server name and migration time to a /tmp/anvil/migration-duration.<server_name>.<unix_time> file. This file is checked for by the scan-server resource agent and, when found, is parsed and the migration duration is recorded, then the file is purged.
* Updated anvil-daemon to have a new function called "handle_special_cases" called during startup that does any weird bug mitigation required. For now, this is used to mitigate against rhbz#1961562, though certainly it will be used for other reasons later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-06 00:01:11 -04:00
Digimer
e15c1651ed * Fixed a bug with deleting bad keys where jobs to delete keys on non-dashboard machines weren't being assigned to the proper target machine.
* Fixed a bug with anvil-manage-keys where a state_uuid entry recorded on one database may not be read from a machine reading from another database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-05 19:07:25 -04:00
Digimer
16c20ae69c * Updated Tools->catch_sig() to use return code 0 instead of 255 so that systemd doesn't think our daemons failed on stop.
* Updated Cluster->parse_cib() to not require a database connection (part of the work to make ocf:alteeve:server run without a DB)
* WIP: Continuing work on the ocf:alteeve:server RA to run without database connections.
* Updated the scancore daemon to explicitly check that all scan agent schemas are loaded in all databases on startup. This is to resolve resync issues on rebuilt strikers that may not yet have some schemas loaded when a DB resync runs.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-05 14:32:26 -04:00
Digimer
24ec17f8f7 * Added a new parameter called 'sensitive' to Database->connect() that returns after connections before any ancillary checks are done, minimizing connect time.
* Fixed a problem with Database->insert_or_update_variables() where variable_source_uuid being set to an empty string wasn't converted to NULL.
* Fixed Database->locking() where the way the lock variable was set was rather broken.
* Created Striker->check_httpd_conf() which configured apache to handle the integration of the new WebUI for Anvil! management with the existing WebUI.
* Updated System->update_hosts() to specifically set the 127.0.0.1 and ::1 lines to handle how cloud-init overrides /etc/hosts and breaks CI/CD tests.
* Removed the old index.html as it's now used for the new WebUI.
* Began work on removing DB connection requirements from ocf:alteeve:server.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-03 22:25:36 -04:00
Digimer
73267a8ea9 * WIP - Slowly working on anvil-manage-server
* Updated the scancore interval to 60 seconds.
* Updated Database->insert_or_update_health() so that 'delete' can find the health_uuid.
* Updated Convert->time() to return silently when passed '-1'.
* Fixed a bug in scan-hardware to call Convert->round(). Also fixed it so it didn't set health scores of 0 for mismatched RAM when the RAM was not mismatched.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-02 14:08:55 -04:00
Digimer
4dcd505753 * Biggest change in this commit; scan-apc-pdu and scan-apc-ups now only run on Striker dashboards! This was because we found that if two machines ran their agents at the same time, the response time from SNMP read requests grew a lot. This meant it was likely that a third, fourth and so on machine would also then have their scan agent runs while the existing runs were still trying to process, causing the SNMP reads to get slower still until timeouts popped.
* Bumped scancore's scan delay from 30 seconds to 60.
* Shortened the age-out time to 24 hours and again boosted the archive thresholds. As we get a feel for the amount of data collected on multi-Anvil! systems over time, we may continue to tune this.
* Moved Database->archive_database() to be called daily by anvil-daemon, instead of during '->connect' calls.
* Added locking to Database->_age_out_data to avoid resyncs mid-purge. Also moved the power, temperature and ip_address columns into the same 'to_clean' hash as it was duplicate logic.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-31 13:34:49 -04:00
Digimer
8807915bb7 The theme of this commit is database cleanup and fixes.
* Updated Database->_age_out_data() to check for certain scan agent tables and, for those found, purge out old records. This should go a long way to keeping the database data responsive.
* Fixed a bug in Jobs->update_progress() where the 'job_picked_up_by' column was being set to '0' instead of '$$' when clearing the job.
* Fixed a bug in System->update_hosts() where '127.0.0.1' would be used in hosts for the actual host name.
* Updated the default trigger, count and division values in anvil.conf to 100,000, 50,000 and 75,000 respectively. In combination with the aging of data, this should go a long way to minimizing database sizes and overheads.
* Updated anvil-daemon to call $anvil->Database->_age_out_data(); in its daily tasks.
* Updated various striker-X tools to specifically request a DB resync on Database->connect calls.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-30 15:16:25 -04:00
Digimer
6abe06f125 The theme of these commits is improving DB responsiveness.
* Created Database->_age_out_data() to delete records from the database that are old enough to no longer be useful. This is designed to significantly reduce the size of the database, allowing a better focus on performance.
* Changed Database->connect() to default to NOT check for resync, reworking the old 'no_resync' to 'check_for_resync', so that resync checks happen on demand, instead of by default.
* Updated get_tables_from_schema() to now allow 'schema_file' to be set to 'all', which then loads the schema files of all scan agents as well as the core anvil schema file. Fixed a bug where commented out tables were being counted.
* Re-enabled triggering resyncs on 'last_updated' differences.
* Fixed a bug in scan-ipmitool where the history_id column in history.scan_ipmitool_value was incorrect.
* Created a new tool called striker-show-db-counts that shows the number of records in all public and history schema tables for all databases.
* Updated anvil-update-states to detect when a libvirtd NAT'ed bridge exists and to delete it when found.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-29 23:34:22 -04:00
Digimer
49a700d68f * Fixed a bug in anvil-join-anvil where the desired DNS servers were not matching the existing list of used DNS servers, even when they were already the same.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-27 17:36:12 -04:00
Digimer
bbad058b33 * Created a new tool, anvil-watch-bonds, which is a live monitor of bonds and interfaces designed to be run from the command line on a given host.
* Created Words->center_text that takes a string (or string key) and centers it to a given string length, padding white spaces on either side of the string as needed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-26 20:24:05 -04:00
Digimer
ff65712fd9 * Created the function check_daemons() in anvil-daemon to check that needed daemons are running when it starts. This was specifically added to address a periodic issue with machines booting without NetworkManager running.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-24 15:27:10 -04:00
Digimer
42ffc200bc * Updated remaining pointers to the old repos to the new repos. Added support for the new alteeve-repo-setup.
* Removed the checks for resync that limited resyncs on jobs and variables tables. That approach to minimize unnecessary resyncs has proven faulty, will find another way later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-24 14:34:15 -04:00
Digimer
41cd1e0319 * Several bugs fixed and enhancements;
* DRBD is now configured to a ping-timeout of 3 seconds.
* Created Log->switches() that returns the command line switches used by Anvil! tool command line calls based on the active log levels / secure logging. Appended this to all invocations of our tools.
* Updated Database->resync_databases() to now only skip 'jobs' and 'variables' tables with less than 10 record differences. All other differences will trigger a resync.
* Created System->_check_anvil_conf() that, as you might guess, checks if anvil.conf exists and creates it (using defaults), if not. It also checks to see if the 'admin' group and user exist and creates them, if not.
* Updated anvil-daemon to check anvil.conf on start up and in each loop. Created the function check_journald() that checks (and sets, if needed) that journald logging is persistent.
* Made striker-manage-peers to check_if_configured on the Database->connect() when updating anvil.conf and the target UUID is the local machine. Also created a loop to make the reconnection a lot more robust.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-24 00:09:32 -04:00
Digimer
a846f9ecbc * Fix to the database resync logic. The previous change to only resync if 10+ lines differed broke striker-manage-peers as the difference in host counts is what triggered the pairing of strikers.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-22 12:25:29 -04:00
Digimer
41d528418d * Increased the trigger point for database archiving. The current values were too low and causing frequent archive -> resync cycles.
* Fixed a bug in that archiving defaulting to not store on disk was not working properly. Now acts as described in anvil.conf.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-22 11:57:29 -04:00
Digimer
48956d94fb * Fixed anvil-manage-system file name in Makefile.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-20 00:27:34 -04:00
Digimer
fc0954d0c8 * Started work on, but not at all finished, anvil-manage-server which will allow manipulation of a server's resources.
* Changed the alteeve repo RPM to the new community/enterprise repo.
* Fixed a bug where 'fence_data::updated' was causing the fences web page to break.
* Fixed a bug in Database->insert_or_update_network_interfaces() where certain interfaces were being repeatedly added to the database.
* Fixed a bug in Database->_find_behind_databases() where it was marking DBs as behind even though they had less than 10 columns off.
* Fixed a bug in Get->host_name() where, if the host name was changed on disk but the environment variable was still the old name, it would cause the hostname to waffle back and forth and cause constant updates to /etc/hosts.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-20 00:16:09 -04:00
Digimer
ad4a1ecc78 * Increased the scancore agent run timeout to 60 seconds.
* Updated anvil-safe-start to start DRBD resources when the peer's DRBD resource is 'Connecting'.
* Updated fence_pacemaker to more intelligently check the list of host names related to an IP address when looking for the peer host name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-15 00:12:43 -04:00
Digimer
44864ce321 * Updated Database->resync_databases() to set a default schema of 'public'. Also fixed a bug where, when the difference in record numbers between two line was > 999, it would not trigger a resync.
* Updated the scan agent timeout to 60 seconds. Also made the scan agent exit code log entries more helpful.
* Updated System->collect_ipmi_data() to now better handle duplicate sensor names. Now, instead of simply appending an integer, we find the hex address and use that in the sensor name when duplicates exist. This solves the problem of the sensor names not being consistently shown in order.
* Fixed message bugs (bad variable insertions) in scan-apc-pdu and scan-apc-ups.
* Fixed schema procedure bugs in the 'temperature' and 'ip_address' tables where the columns were in bad order, causing constant updates.

Incomplete work;
* Created the shell of 'anvil-manage-storage', but virtually no logic exists in it yet.
* Started work on anvil-safe-start to deal with an issue where DRBD resources don't start when a server is running on a peer.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-13 23:27:38 -04:00
Digimer
309aa13684 * Updated the name of striker-purge-target in the makefile.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-11 14:25:27 -04:00
Digimer
7abbc938af * Renamed tools/striker-purge-host to tools/striker-purge-target and moved the code from test.pl over to it. No longer provides interactive selection, but now does work with Anvil! systems as well as hosts.
* Fixed a bug in Database->get_tables_from_schema where history.X and X tables were being stored in the table list.
* Updated ocf:alteeve:server to not do resyncs on DB connect.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-11 14:14:00 -04:00
Digimer
f833c311ba * To address issues with scancore debugging, we needed a tool to purge old anvils and hosts from the database. The 'test.pl' in this commit contains the new logic that will be merged into tools/striker-purge-host shortly.
* Created Database->find_host_uuid_columns() and ->_find_column() to create a list of tables and column names in the proper order to allow deletion of foreign keys so that deeply nested primary keys can be deleted. Specifically, this was meant for hosts -> host_uuid and anvils -> anvil_uuid, though it should work for other tables.
* Updated html/jquery-ui-1.12.1/package.json to address CVE-2020-7729
* Fixed a bug in the temperature table's history procedure where temperature_weight wasn't being copied.
* Updated anvil-provision-server to support '--anvil' that can take either the anvil-uuid or anvil-name.
* Updated anvil-safe-stop to default the stop-reason to 'user'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-08 02:02:46 -04:00
Digimer
3fb81c1a0a * Updated Convert->time() to silently return if the given time was '--'.
* Added a new parameter to Database->connect() called 'no_resync' that, if set, prevents a resync check being performed. Updated ->resync_databases() to find a uuid_column where the table name ends in 'ies' and the UUID column is 'y_uuid'. Updated ->resync_databases() to not fire on updated table age anymore, and to trigger only if the number of rows differ in a given table by more than 10.
* Updated Log->entry() to prefix a tool's name, when the new 'log::scan_agent' value is set. Also set this value in ScanCore->agent_startup(), to help differentiate log entries.
* Fixed a bug in scancore's main loop where it logged the sleep message at the start of the run.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-04 12:33:31 -04:00
Digimer
4a87ee71db * This commit started with work on webui endpoint set_power, but then switched to scancore debugging and I neglected to switch branches.
* Created Cluster->check_stonith_config() that checks and, if needed, reconfigures a cluster's fencing (stonith) config.
* Updated scan-cluster to call Cluster->check_stonith_config() at the end of each call.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-02 21:40:48 -04:00
Digimer
416f51323a * Created tools/striker-boot-machine to, well, boot machines. It uses host_ipmi or, failing that, other fence methods when available to boot a node.
* Created Cluster->get_fence_methods() that parses all fence methods out of a recorded CIB and stores them in a hash for a given host_uuid.
* Fixed a bug in ScanCore->post_scan_analysis_striker() where the short_host_name was not being stored correctly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-05-01 19:49:27 -04:00
Digimer
35e926c52b
Merge branch 'master' into anvil-tools-dev 2021-04-30 23:02:49 -04:00
Digimer
ca7052dd53 The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :)
* Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively.
* Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf.
* Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state.
* Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once fleshed out and tested.
* Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times.
* Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan.
* Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-30 22:58:01 -04:00
Fabio M. Di Nitto
2214866156 Update to kmod-drbd91
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-04-28 06:17:15 +02:00
Digimer
f202187c34 * anvil-safe-stop is complete! Testing still needed, of course.
* Updated DRBD->manage_resource() to call 'drbdadm adjust <res>' when starting a resource to help deal with a periodic issue where the 'allow-two-primary' option on the peer doesn't match the local setting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-23 11:56:11 -04:00
Digimer
3a6902d899 * Made good progress on anvil-safe-stop. It will now stop or migrate servers (testing needed).
* Updated Server->shutdown_virsh() to change the parameter 'wait' to 'wait_time' to clarify its use.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-23 00:04:20 -04:00
Digimer
27259d1d53 * Finished anvil-rename-server!
* Created Storage->delete_file() that, well, deletes files (locally or on a peer).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-22 13:29:50 -04:00
Digimer
2e37691116 * Updated DRBD->gather_data() to store data on peers so that the peer's LV path and backing disk is recorded. Also fixed a bug in ->get_status() where the return code for local calls was stored as a host name.
* Added the scan-hpacucli scan agent. It's been done for a while and should have been added ages ago.
* Updated anvil-rename-server to get to the point where it will take down the DRBD resources on all machines, but waits if there is a sync under way. It also verifies that the server is off on all systems from virsh's perspective.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-20 22:46:51 -04:00
Digimer
711a04999e * Finished anvil-migrate-server and anvil-safe-start! Lots of testing still needed for both though, and 'anvil-safe-start' doesn't run as a job yet, but the logic is all there.
* Fixed a bug in Cluster->migrate_server() where waiting for the server to migrate would never exit.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-19 00:32:13 -04:00
Digimer
eec14cb013 * Finished tools/anvil-boot-server and tools/anvil-shutdown-server.
* Fixed a bug where, in rare cases, $anvil->hostname() would call 'hostnamectl' and get a dbus error during shutdown, which would then cause the hostname to be changed to the error in the database.
* Fixed a bug in Cluster->boot_server() where it would never verify that a server has started successfully.
* Updated Database->get_ip_addresses() to store the IPs we manage in 'ip_addresses::<ip_address_address>::X'.
* Updated ocf:alteeve:server to work from command line calls, though more testing is still needed.
* Started work on 'anvil-rename-server', but haven't gotten far with it yet.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-18 19:54:58 -04:00
Digimer
b36093671b * Updated Database queries that were passing 'debug => $debug' to not do that, as it was causing far too much (useless) noise in the logs.
* Turned on print to console for logging in anvil-provision-server. Also updated it to check if the cluster is running and hold until it is.
* Cleaned up some code in Get->available_resources() that proved hard to debug.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-15 02:35:58 -04:00
Digimer
798518ba5e * While working on the boot/shutdown server tools, ran into and fixed a bug where files uploaded before an Anvil! was added could not have those files sync'ed. This was fixed through the new Database->check_file_locations() method.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-14 22:56:18 -04:00
Digimer
e036515df3 * Got anvil-safe-start to the point where it starts the cluster stack. Need to create the 'anvil-boot-server' and 'anvil-shutdown-server' before it can be completed, so those files have been added.
* Created Cluster->parse_quorum() to check if a node is quorate as 'have-quorum' in the pacemaker CIB doesn't appear to be super accurate during startup.
* Fixed a bug in striker-manage-install-target where if a node didn't have any registered IPs, it would break before generating the repo data.
* Fixed a bug in anvil-join-anvil where if the database had to be reconnected, the job data was lost.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-14 00:26:06 -04:00
Digimer
faf1399440 * Continued work on anvil-safe-start. Got it to the point where it detects shared networks with its peer node and waits for all networks to be up.
* Fixed a bug in scan-drbd where the volume_uuid wasn't being stored in the proper hash, breaking insertions into scan_drbd_peers in some cases.
* Updated System->pids() to work with remote targets (will be used later to check for parallel runs of anvil-safe-start).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-12 20:46:30 -04:00
Digimer
15e71768a1 * Started work on anvil-safe-start. The enable/disable logic and how it runs automatically is controlled by the database and the tool can be used to control anvil-safe-start on both the local and peer node. It will be started by ScanCore, if scancore starts within 10 minutes of the node booting. It will always be able to run manually.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-12 00:28:24 -04:00
Digimer
5f0b7740e2 * Fixed a typo that broke compiling anvil-daemon in the last commit. Yay for CI/CD!
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-10 01:28:12 -04:00
Digimer
fb0836f912 * The get_cpu endpoint was completed.
* The get_memory endpoint was completed.
* The get_replicated_storage endpoint was completed, though it requires testing and likely has issues.

To prepare for the get_status endpoint work, I needed to update ScanCore and modules to track the host_status. This commit contains the work needed for this.
* Updated ScanCore->post_scan_analysis_striker() to use configured fence devices (except PDUs) to check if a target host is off or on, if there is no host_ipmi interface. In all cases, if a machine can be confirmed on or off, the host_status is now updated.
* To support the above fence based power checks, updated scan-cluster to store the on-disk CIB in the new scan_cluster -> scan_cluster_cib column.
* Updated ScanCore->parse_cib() to map stonith primitive IDs to fence agents. Updated ->parse_crm_mon() to not call if the executable doesn't exist to avoid unhelpful error messages in the logs when called from a Striker.
* Updated DRBD->gather_data() to get the size data from '/sys/block/drbd<minor>/size' x '/sys/block/drbd<minor>/queue/logical_block_size' so it works when a device is Secondary (and can't be promoted).
* Updated Database->get_hosts_info() to record the short host name as well as the stored host name. Created ->update_host_status() as a wrapper to ->insert_or_update_hosts() that only updates the host status.
* Updated anvil-join-anvil to disable the ksm and ksmtuned daemons.
* Updated scancore and anvil-daemon to set the host_status to 'online' on startup.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-09 20:51:29 -04:00
Digimer
cd87c0f521 * Fixed a bug that caused striker-initialize-host to not compile / run.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-04-01 11:35:44 -04:00
Digimer
70dc0598f2 * Created Storage->manage_lvm_conf() that checks / updates lvm.conf to add a filter to avoid seeing DRBD devices as LVM components. This is now called from striker-initialize-host and scan-drbd.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-31 23:59:19 -04:00
Digimer
59b867cc25 * Updated DRBD->gather_data() to check if drbdadm exists before trying to call it to avoid scary errors in the logs. Also moved some strings that pulled from the scan-drbd agent into the main words file.
* Fixed a bug in ScanCore->agent_startup() where a (thankfully broken) check to append tables to the 'sys::database::check_tables' would cause an infinite loop as both were pointers to the same anonymous array.
* Fixed a bug in scan-ipmitool where the scan_ipmitool_variables table didn't use a host_uuid reference, causing resyncs of that table to sync for all hosts and cause DB errors when the scan_ipmitool record from another host wasn't sync'ed yet.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-31 01:36:11 -04:00
Digimer
ec192d9041 * Fixed a bug where backing up a file on a remote machine returned a failure, if the target backup directory had to be created (even if it was created successfully).
* Fixed a bug in a mini bash command to chmod / chown a directory being created on a remote machine.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-23 00:36:17 -04:00
Digimer
b90b18f151
Merge branch 'master' into webui_anvil_page 2021-03-22 15:24:40 -04:00
Digimer
3ed857bacd * Bumped logging for striker-auto-initialize-all debugging.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-22 15:22:46 -04:00
Digimer
296556328b * Fixed a bug in Convert->bytes_to_human_readable() to handle being passed in bytes (with the size units of 'b' or 'bytes').
* Fixed a bug in Database->_find_behind_databases() where the logic to figure out which column was the host_uuid reference was too liberal, causing the wrong column to be selected in some cases. Also added a check to not look for host_uuid columns on specific tables where a match would be made, but the column is allowed to be null (like server_host_uuid that indicates the host of the server).
* Started work on the scan-filesystems scan agent, which is needed to record the data that the file system UI endpoints will need.
* Removed the 'not null' constraint from 'servers' -> 'server_host_uuid'.
* Fixed a bug in 'anvil-provision-server' where the driver ISO being 'none' caused the provision script to use the driver ISO switch.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-18 00:37:17 -04:00
Digimer
5536e8ff47 * Updated Cluster->assemble_storage_groups() and Cluster->anvil_name_from_uuid() and ->available_resources() to try to detect the anvil_uuid if not passed in.
* Updated Database->insert_or_update_storage_group_members() to use the host_uuid when trying to find existing members.
* Added the skeleton of a bunch of new json endpoints for the new UI features.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-16 02:40:50 -04:00
Fabio M. Di Nitto
a2e2b5b235 build: move striker-auto-initialize-all to proper location
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-03-13 15:04:46 +01:00
Fabio M. Di Nitto
6709efe33b testing: distribute striker auto setup tool outside of normal PATH
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-03-12 06:56:41 +01:00
Digimer
5e9e7e4dde * Removed debug logging from tools.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-10 21:14:56 -05:00
Digimer
54496cbeb0 * Added a check to Database->get_ip_addresses() to check if a hash is set before using it, to help avoid uninitialized variable messages.
* Updated Remote->test_access() to not use cached SSH access.
* Updated anvil-configure-host to abort if the host is in a cluster.
* Updated anvil-join-anvil to clean up some variable checks to help avoid uninitialized variable messages.
* Updated striker-initialize-host to check if an anvil RPM is installed and, if so, not install the Anvil! repo.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-10 20:35:05 -05:00
Digimer
5db09f565d * Updated anvil-join-anvil to actively call a cluster start once per minute while waiting for initial startup.
* Added a check to striker-initialize-host to see if an anvil-X RPM is already installed. If so, it will not install the Alteeve repo, even if it's not found.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-10 14:14:04 -05:00
Digimer
3733220b50 * Updated Log->entry() to prefix log lines with the short 'job-uuid', when the log entry is coming from a program running as a job. This is meant to make it easier to break up what log lines belong to what jobs, if multiple jobs are running at the same time (ie: when initializing multiple nodes / dr hosts in parallel).
* Updated Remote->call() to return ('!!error!!', '!!error!!', 9999) when an error hits. Made Remote->test_access() explicitly check for '1' to be returned in order to confirm access, fixing a bug where a bad target value caused false positives. Updated ->_check_known_hosts_for_target() to no longer explicitly check for 'ssh-rsa' so that machine keys using different cyphers are detected as being in known_hosts properly.
* Updated striker-auto-initialize-all to initialize nodes and DR hosts networks before trying to form them into an Anvil!. Fixed several other bugs as well. More testing is needed, but it works now.
* Updated striker-initialize-host to check for the alteeve repo and, if not found, check for access to alteeve.com. If accessible, it will install our repo now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-10 02:26:09 -05:00
Digimer
53eefee56c * Re-enabling tools/striker-auto-initialize-all.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-09 11:04:25 -05:00
Digimer
426b5f58f7 * Finished (but not yet tested) tools/striker-auto-initialize-all. Expect many bugs if used.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-09 00:49:09 -05:00
Digimer
e71c7d4966 * Finished getting tools/striker-auto-initialize-all to merge the built Strikers.
* Fixed a bug in Remote->call() where the output of the call not ending in a newline wasn't having the return code parsed off properly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-04 23:28:39 -05:00
Digimer
15fd0e5ce8 * Updated anvil-daemon (and Database->insert_or_update_jobs) to now recognize jobs with the job_status of 'scancore_startup' to run only when ScanCore starts.
* Finished initial Striker setup in tools/striker-auto-initialize-all. Started working on peering.
* Cleaned up the handling of converting UIDs to user names in Remote->add_target_to_known_hosts() and ->_call_ssh_keyscan().
* Did a bunch of white-space/alignment cleanup.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-04 01:41:33 -05:00
Digimer
0fb191c00f * Made more progress on tools/striker-auto-initialize-all, now to the point where it loads the variables needed to initialize Striker dashboard.
* Cleaned up / added some logging in various locations.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-03-02 01:18:18 -05:00
Digimer
45a9cb04b0 * Fixed a bug introduced in the last commit that made Get->os_type() fail when called locally.
* Made the error reported by Remote->call() more verbose when called without 'target' being set.
* Updated anvil-daemon to not call jobs more than once per minute.
* Started work on striker-auto-initialize-all, still very far from complete.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-23 01:56:12 -05:00
Digimer
1b65f53faa * Remove host-health from the 'hosts' table as it wasn't needed, given the 'health' table. Bumped the SQL version to 0.0.2
* Updated Get->os_type() to use 'cat' instead of Storage->read_file() because 'rsync' may not be available when it is called during striker-initialize-host calls.
* Updated Database methods to skip 'oui' and 'state' during resync.
* Updated striker-initialize-host to detect when it's initializing a CentOS Stream Node / DR Host and enable the HA repo.
* Created the tools/striker-auto-initialize-all tool, which is very much incomplete, that will allow for the rapid creation of a full Anvil! from freshly installed machines autonomously.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-22 19:22:47 -05:00
Digimer
e8efbab343 * The work on PXE / UEFI support is broken, and will be set aside for the time being. The commit here is working toward getting things fixed, but it's taking too much time away from more pressing issues.
* This commit includes two unrelated test files for UI work, cgi-bin/get_anvil_status and cgi-bin/get_anvils.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-16 17:32:43 -05:00
Digimer
2937afad26 * Got UEFI booting working up to the grub menu, though files formerly provided by anvil-striker-extra still need to be added to the main anvil-striker to work properly.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-10 23:49:11 -05:00
Digimer
25aa46c359 * Fixed initial UEFI PXE booting (doesn't work yet, but UEFI clients get an IP properly and get the boot image)
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-09 16:59:51 -05:00
Digimer
6f8f97b184 * Updated Get->os_type() to support detection of CentOS Stream separate from CentOS.
* Updated pxe.txt to start support for UEFI boot target.
* Updated update_install_source to be smarter about moving html and tftp directory names to better reflect the host OS, and to make it support converting OSes on the fly. Also added support to the package list for CentOS Stream.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-08 22:37:27 -05:00
digimer-bot
5b06cf5570
Merge branch 'master' into anvil-daemon-debugging 2021-02-08 15:23:31 -05:00
Digimer
06506ba5df * Removing (again) test.pl from Makefile.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-08 13:50:36 -05:00
Digimer
e8e042f0ae * Removed anvil-jobs from Makefile.am
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-08 13:42:56 -05:00
Digimer
1a520b03d5 * Cleaned up a lot of logging in anvil-daemon and tools it calls.
* Deleted anvil-jobs as it never ended up being used.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-08 13:39:34 -05:00
Digimer
482e4f41c2 * Removed 'test.pl' from Makefile.in/.am
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 19:00:03 -05:00
Digimer
a1eede2757 * Added new jumps to scan-ipmitool to make it less likely to trigger a jump alert for 'Temp{1..4}' sensors.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 18:54:12 -05:00
Digimer
1ec03c9718 * Removing 'test.pl' from git.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 18:42:34 -05:00
Digimer
6009590352 * Fixed a bug in scan-apc-ups where changes in the transfer reason were not being recorded.
* Cleaned up a lot of logging to reduce the amount of log entries when running at log level 1.
* Bumped the scan-ipmitool default 'jump' range to 10c.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 18:11:29 -05:00
Digimer
b2dab95459 * Updated DRBD->delete_resource() to return a success if asked to delete a non-existent resource (as can happen when partial anvil-delete-server runs are re-run).
* Reworked DRBD->get_next_resource() to pull from the database, and to no longer do that increments-of-three nonsense. Avoidable complexity. Also added a call to Cluster->get_anvil_uuid() if the 'anvil_uuid' parameter wasn't passed.
* Updated Database->get_host_from_uuid() and ->get_hosts() to now take 'include_deleted' parameter and default to not returning deleted hosts. This fixed issues where anvil-{delete,provision}-server calls could assign jobs to now-deleted hosts with reused host names.
* Updated anvil-delete-server to print log entries to STDOUT. Also updated it to not wait for the shutdown of a server in pacemaker to complete, and instead to destroy it after calling pacemaker's resource stop. Updated to also check to see if the server being deleted is already out of pacemaker and, if so, skip that step and directly try to destroy the server, if it's running.
* Updated anvil-provision-server to force 'peer_mode' runs to pull their TCP Port and DRBD minor numbers from the job. This fixes a bug where the same resource on two machines could use different TCP ports.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 18:11:29 -05:00
Digimer
2be14d93a6 * Added a check to anvil-delete-server to remove the XML definition file.
* Added checks to anvil-provision-server to see if an existing server name is flagged as DELETED, instead of outright rejecting a given server name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-07 18:11:29 -05:00
Digimer
9dbb39da5b * Added support for manually setting the server's UUID in anvil-provision-server. Also, if a server name existed before but was deleted, the old UUID is re-used to provide better continuity. The user can override this behaviour with the new --uuid switch.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-05 23:43:17 -05:00
Digimer
0ec1bf6b6a * Updated DRBD->delete_resource() to return a success if asked to delete a non-existent resource (as can happen when partial anvil-delete-server runs are re-run).
* Reworked DRBD->get_next_resource() to pull from the database, and to no longer do that increments-of-three nonsense. Avoidable complexity. Also added a call to Cluster->get_anvil_uuid() if the 'anvil_uuid' parameter wasn't passed.
* Updated Database->get_host_from_uuid() and ->get_hosts() to now take 'include_deleted' parameter and default to not returning deleted hosts. This fixed issues where anvil-{delete,provision}-server calls could assign jobs to now-deleted hosts with reused host names.
* Updated anvil-delete-server to print log entries to STDOUT. Also updated it to not wait for the shutdown of a server in pacemaker to complete, and instead to destroy it after calling pacemaker's resource stop. Updated to also check to see if the server being deleted is already out of pacemaker and, if so, skip that step and directly try to destroy the server, if it's running.
* Updated anvil-provision-server to force 'peer_mode' runs to pull their TCP Port and DRBD minor numbers from the job. This fixes a bug where the same resource on two machines could use different TCP ports.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-05 23:41:48 -05:00
Digimer
8d0f873912 * Updated scan-storcli to check if a MegaRAID controller exists and neither storcli64 nor perccli64 exist. If a controller is found but no RPM is installed, it checks to see if the host is Dell and then decides to try and install perccli or storcli.
* Reworked scan-ipmitool so that on nodes and dr hosts, it only scans itself. On strikers, it scans all hosts found in active Anvil! systems with a host_ipmi entry.
* For all agents, reduced log verbosity to not push too much noise into anvil.log while scancore is running in the background.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-05 23:34:51 -05:00
Digimer
50d529e07c * Added a check to anvil-delete-server to remove the XML definition file.
* Added checks to anvil-provision-server to see if an existing server name is flagged as DELETED, instead of outright rejecting a given server name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-03 19:21:15 -05:00
Digimer
e052c75e2f * Added a check to anvil-delete-server to remove the XML definition file.
* Added checks to anvil-provision-server to see if an existing server name is flagged as DELETED, instead of outright rejecting a given server name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-03 19:20:35 -05:00
Digimer
3f04c9031b * Fixed a bug in Words->parse_banged_string() where the flattened string wasn't being used for the variable substitutions.
* This resolves issue #23.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-01 12:27:05 -05:00
Digimer
1081645893 * Added parameters to DRBD->get_next_resource to allow for a resource to be searched and either error out if a resource is found, or return the first DRBD minor and tcp port if found.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-31 02:32:12 -05:00
Digimer
f4bf1fd54a * Removed some XML insertions into strings as they break inserting into strings.
Note: These changes below shouldn't have been in this branch... *sigh*
* Fixed an issue with tools/anvil-provision-server where a VM would be created but didn't boot. When this happens, an explicit boot is sent via virsh. Also bumped up the time it waits for a new server to start up.
* Added an explicit call to scan-drbd after a new resource is created to ensure that if any calls come after looking for the next free DRBD minor or port, they don't use the ones just used.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-31 01:39:52 -05:00
Digimer
3d7ce84c38 * Fixed a bug in Get->host_from_ip_address() where hosts that are no longer used are returned, meaning 2+ results could be returned after a node was replaced, meaning no host name was returned.
* Fixed a bug in anvil-provision-server where forcing initialization of a new DRBD resource when running on node 2 would fail because the node ID in the drbdsetup command was hard-coded to be run from node 1.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-31 00:00:23 -05:00
Digimer
e25a424eb4 * Typo fixed in striker-manage-install-target insertion variable.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-30 18:49:15 -05:00
Fabio M. Di Nitto
8f9892650b [build] first pass at adding a build system to integrate with CI
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-01-30 20:16:30 +01:00
Digimer
413a4f73c2 * Updated Tools->_anvil_version() and Get->anvil_version() to now pick up a SchemaVersion from anvil.sql. This will change only when the schema changes and is used when Database->connect() is checking compatibility with other anvil database hosts. This will make it only break connection when there is a reason to do so. The anvil_version still remains as an informational version that will help when supporting users later.
* Updated Cluster->add_server() to now set failure timeouts to actual numbers instead of INFINITY after discovering that INFINITY doesn't work in those cases.
* Updated Database->get_hosts() to now check if other entries have the same host name, and if so, to set their host_key to 'DELETED'. This should make it easier to handle when a hardware machine is replaced by new hardware but uses the same host_name.
* Updated Email->check_queue() to start and enable postfix.service if it's found to not be running.
* Updated Get->available_resources() to return '!!no_data!!' when a given host hasn't got any data in scan_lvm_vgs. Now use this in anvil-provision-server to exit if a node or dr host hasn't run scancore yet.
* Fixed a bug in scan-lvm where the pvs_uuid wasn't being loaded properly, preventing lost PVs, VGs and LVs from being flagged as deleted.
* Started work on anvil-migrate-server, though it's far from complete.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-30 14:03:13 -05:00
Digimer
89dec8e1f9 * Finished anvil-delete-server! (More testing needed though)
* Fixed a bug in Cluster->shutdown_server() where the wrong variable was being evaluated when checking the server state.
* Created DRBD->delete_resource() that deletes a resource's backing device and configuration. Note that this wipes the DRBD MD and FS signatures before removing the LV. Updated DRBD->gather_data() to record the backing devices for volumes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-26 01:45:17 -05:00
Digimer
549dbad635 * Created Cluster->delete_server(), which deletes a server resource from pacemaker (stopping it first, if needed).
* Fixed a bug in Cluster->parse_cib() where a server that is off wasn't setting 'status'.
* Renamed 'server::location::<server>::host' to '...::host_name' in several places.
* Got more work done on anvil-delete-server, up to the point where it calls the new Cluster->delete_server() method.
* Updated fence_pacemaker to call 'drbdadm adjust all' to dampen an issue where in-memory fence configs seem to change, preventing reconnection of the peer after it reboots from the fence. More testing needed on this issue.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-25 01:00:55 -05:00
Digimer
d9d347ce63 * Updated .spec for the new source location.
* Created a log disable flag to avoid deep recursion when logging at level 3.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-22 00:37:30 -05:00
Digimer
05b1fccdb3 * Created Cluster->add_server() which, well, adds a server to a pacemaker cluster, including sorting out location constraints to favour the node the server is running on, if it's running.
* Removed the exit-if-no-DB check in ocf:alteeve:server so that (hopefully, needs testing) running servers won't be impacted if the nodes lose contact with both/all strikers.
* Updated scan-server to make an explicit check for missing XML definition files on startup and write them if needed.
* Very beginning work on anvil-delete-server has been started.
* Updated anvil-provision-server to wait when it's running in peer mode until the new XML definition is in the DB and then write it out to disk before exiting. Also updated it to add the new server to pacemaker before exiting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-18 00:38:06 -05:00
Digimer
e0ceb5c65f * Provisioning servers is done!! This commit handles the virt-install call and adding the server to the database.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-17 02:54:41 -05:00
Digimer
8127c70237 * Support for OS type selection was missing in tools/anvil-provision-server, this commit adds support for this, as well as some infrastructure to support it. This includes a new 'sys::servers::os_short_list' variable that contains a CSV of main OSes to show in a "short" list (the full list is massive). This variable can be set by the user in anvil.conf. Also added job progress calls that were missing through the storage config.
* Created a new tools/striker-parse-os-list tool that parses 'osinfo-query os' and prints out entries for words.xml for any new OSes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-16 03:30:51 -05:00
Digimer
e82025ba61 * Got the DRBD configuration and start-up completed. Unlike M2, with M3 a server can be provisioned while the peer is disconnected or failed. Also cleaned up the 'run_jobs' function by breaking it up into a set of smaller function calls.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-16 00:47:14 -05:00
Digimer
056f8edf48 * Moved the scan-drbd function 'gather_drbd_date' into DRBD->gather_data().
* Finished DRBD->get_next_resource() that returns the next available minor and the next free TCP port (with two free ports available after it).
* Created Storage->get_storage_group_details() that pulls together the LVM, storage group members and storage groups into one block of data.
* Made more progress on tools/anvil-provision-server. It now gets up to the point of creating LVs, creating DRBD resource files, loading them, creating metadata and up'ing the resource. It doesn't yet (successfully) force a new resource to primary.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-15 03:10:16 -05:00
Digimer
162f4913b1 * Started work on DRBD->get_next_resource(), that will eventually return the next free DRBD minor and TCP port numbers.
* Fixed a bug in scan-drbd that was still looking for the scan_drbd_resource_uuid from the resource config file. Also added a check to see if 'scan-drbd::resource_status' directory exists before trying to read the files in it.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-13 01:31:44 -05:00
Digimer
a7f0676a0f * Got the 'anvil-provision-server' script to the point where it actually saves the new server job.
* Created the new method Cluster->get_primary_host_uuid() that returns the 'host_uuid' of the primary node in the given cluster. This is useful for external programs to figure out which node is primary. Example is provisioning a new server being assigned to the active node. Also created ->is_primary() that is a similar test to see if the active node is the primary node or not.
* Updated Cluster->parse_cib() and ->parse_crm_mon() to work on remote hosts.
* Updated Database->get_hosts() to store the short host names.
* Created Get->host_from_ip_address() that translates an IP address to a host_uuid and host_name, if it's an IP assigned currently to a known host.
* Created Network->find_target_ip() that simplifies finding which IP address to use when the caller wants to connect to a target host.
* Reworked anvil-join-anvil to parse fence_arguments in a way that handles passwords with spaces in them.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-12 21:29:01 -05:00
Digimer
68ea6da1d3 * Finished the web interface components of the Anvil! File Manager! Files can be purged, sync'ed or removed from specific Anvil! systems, renamed and their file types changed (and setting/removing the executable bits) as needed.
* Fixed a bug in Database->insert_or_update_jobs() where the 'job_host_uuid' being set to 'all' only translated to a job for the running host.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-06 01:31:09 -05:00
Digimer
6da2b3b17b * Got more work done on file management. A file name is now clickable and that loads a menu to rename, change the file type, purge (delete from everywhere) and select which Anvil! systems the file belongs on. Got the code done to purge a file, but it's not tested yet.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-04 02:05:05 -05:00
Digimer
ea84ba68eb * Fixed a bug in tools/anvil-update-states that was causing deleted interfaces to update the network_interfaces every pass, growing the DB excessively.
* Cleaned up the file manager;
** Got the jquery file uploader JS to be sane and altered it to be more useful.
** Got the list of existing files to be displayed (links clickable but not working yet).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-03 01:14:04 -05:00
Digimer
7d3c4371c7 * Renamed tools/striker-sync-shared to tools/anvil-sync-shared, as it's now designed to run on all machines. Got it to the point where it can be run on Anvil! members to pull down freshly uploaded files. When two or more strikers have the target file, it load balances so that one node downloads from one striker while the other node downloads from the other striker. If there are three strikers and a DR host, the DR host will download from the third striker. If there are 1 or 2 strikers, the DR host will wait to download until both nodes have finished downloading.
* Cleaned up upload.pl now that it isn't responsible for loading the file details into the database. It only sets a job for the local Striker to process the file and move it into /mnt/shared/files, copy it to peer dashboards, then load jobs for Anvil! members to sync the new file.
* Created Database->get_files() and ->get_file_locations() to load the respective data.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-02 02:23:33 -05:00
Digimer
1a36f37065 * Got file uploads working!
* Got tools/striker-sync-shared to pick up 'upload::move_incoming' jobs, move the uploaded file to /mnt/shared/files/, copy it to peer dashboards, add it to the 'files' table and add it to 'file_locations'.
* Reworked the 'file_locations' table to now map files to Anvil! systems, not hosts. It simply tracks if a given file should be on Anvil! members or not. Later, striker-sync-shared on the Anvil! members will pull the file down.
* Updated Storage->get_file_stats() to record the file's mimetype.
* Fixed up a few issues in cgi-bin/upload.pl.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-01 04:09:17 -05:00
Digimer
7002c6f7cf * Fixed a bug in Get->available_resources() where re-calling it caused the available space on storage groups to keep dropping. Also added recordings in the hash for the space reserved for the host and allocated to existing servers.
* Got tools/anvil-provision-server to the point where it asks up to the storage pool / size.
* Created the shell of 'tools/striker-sync-shared' that will sync /mnt/shared/files.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-30 23:02:33 -05:00
Digimer
f30cce3c5a * Created the new tools/anvil-provision-server tool which will handle provisioning new servers, as well as having an interactive menu system to provision servers from the command line.
* Created Cluster->assemble_storage_groups() and moved the logic to auto-assemble groups out of Get->available_resources().
* Created Cluster->get_anvil_name() that will return an Anvil! name for a given anvil_uuid, or the name of the Anvil! if the host is a member of an Anvil!.
* Updated Cluster->get_anvil_uuid() to return the 'anvil_uuid' if passed a specific 'anvil_name'.
* Updated Jobs->clear() to use 'switches::job-uuid' when a job_uuid is not passed but the value exists in 'switches::job-uuid'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-30 04:35:39 -05:00
Digimer
ddffc9d782 * Finished refining Get->available_resources(). It's now complete and can be used to show a user what is available and to validate new server creation commands (up next).
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-29 20:55:06 -05:00
Digimer
8f823d3b86 * Switched out the static list of core tables to use the array generated by Database->get_tables_from_schema().
* Fixed bugs around creating and filtering storage groups.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-28 21:59:32 -05:00
Digimer
1d03a386d3 * Created Database->get_bridges() that, surprise, loads data from the 'bridges' table.
* Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers.
* Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file.
* Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors.
* Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums.
* When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory.
* Moved the 'scancore' daemon's 'load_agent_strings' to 'Words'
* Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly.
* Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed.
* Fixed a bug in scan-storcli where a while loop was broken, preventing execution.
* Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 01:10:23 -05:00
Digimer
96bc1f0b78 * Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call.
* Created Database->get_power() that loads data from the special 'power' table.
* Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables.
* Updated Database->get_anvils() to record the machine types when recording host information.
* Updated Database->get_hosts_info() to also load the 'host_ipmi' column.
* Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available.
* Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon.
* Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimated hold-up time are returned. If not, the highest charge percentage is returned.
* Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete.
** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the mains-powered UPS(es) have reached the minimum charge percentage). If the node shut down for thermal reasons, IPMI is called and, so long as at least one thermal sensor is found and it/they are all OK, the node is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful.
* Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic.
* Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored.
* Updated scancore to clear the 'system::stop_reason' variable.
* Added missing packages to striker-manage-install-target.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 16:00:35 -05:00
Digimer
0b2407e78b * Added a really simple DRBD monitoring tool to the repo, will likely remove later.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-06 22:56:53 -05:00