Commit Graph

168 Commits

Author SHA1 Message Date
Digimer
cd220e97dc Disabled striker-prep-databas and set Database->configure_pgsql() calls to use debug => 2.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-20 20:32:18 -04:00
Digimer
7fd6185445 * Disabled firewalling for now. There appears to be an issue starting up with DRBD.
* Updated Convert->time() to return whatever was passed in instead of '#!error!#'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-09 19:46:38 -04:00
Digimer
bce9e2caaf This is the first attempt at enabling firewalld completely. There is a decent chance that problems exist, so it won't be a surprise if a few more commits are needed to this branch before things work.
* Added multiple new private methods to Network that help in managing the firewall.
* Updated Server->boot_server to manage the firewall after the server boots. Updated ->migrate_server to create a job, if a database connection exists, for the migration target to update it's firewall as soon after the server appears as possible.
* Updated ocf:server:alteeve to manage the firewall when called post-migration, in case there was no DB connection and the job above didn't run. Fixed a bug where the disk state wasn't being evaluated properly.
* Updated scan-server to check that the firewall is managed when a server state has changed.
* Updated anvil-daemon to run Network->manage_firewall on startup.
* Heavily reworked 'anvil-manage-server' to either just run 'Network->manage_firewall', or if passed '--server X', to wait for the server to appear for up to 1 minute, then to check that the firewall is managed (to capture servers being migrated to the host.)
* Removed firewall management from striker-prep-database.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-07-02 17:06:04 -04:00
Digimer
f2d06fa9b1 * Updated striker-parse-oui to only run if/when the system has been running for at least one hour.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-23 21:21:38 -04:00
Digimer
ab9b00a2f7 * Updated anvil-daemon, in its daily checks, to disable ksm and ksmtuned daemons.
* Updated scan-drbd to purge peer records that no longer have corresponding LVM data.
* Updated System->{en,dis}able-service to take the 'now' paramter which, when passed, causes the action to take immediate effect.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-21 22:25:07 -04:00
Digimer
911f7cfb6a This is another big commit with a lot of DB work. Getting closer to sorting out the frequent resyncs.
* Changes Database->connect to always use the first DB connected to, not the local one if that applies. This treats the first DB (sorted by UUID) as "primary" and the second (or third...) as more of a backup.
* Moved db_in_use and lock_request to use the 'states' table instead of the variables table. These are set and removed so often that it was messing up things with resync's when the data is transient anyway. Fixed multiple bugs with both to better set and clear properly.
* Created Database->read_state() to assist with the above changes.
* Updated Database->refresh_timestamp() to specifically check that the returned time stamp differs from the previously used one, looping until they differ if needed.
* Disabled striker-manage-install-target when called to update the repos, as the Install Target function doesn't work at this point.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-06-16 20:10:43 -04:00
Digimer
e6dcff1cf1 * Added a missing modified_date to ip_addresses in Database->get_ip_addresses().
* Updated scan-network to purge old historical ip_addresses when clearing duplicates now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-05-21 15:52:25 -04:00
Digimer
1b70b49cf8 * Updated Network->find_matches() to try to populate the first and second parameters if they're not passed in.
* Updated Network->load_ips() to load extra information about the interfaces.
* Updated ocf:alteeve:server to not check libvirtd daemon state on server start.
* Updated scan-hardware to check for duplicate entries and purge if found.
* Updated scan-network to check for the 'default' virbr0 interface by checking if the config file exists instead of calling virsh.
* Updated scan-server to have better logging.
* Created the new (and incomplete) anvil-test-alerts tool
* Updated scancore to support --purge to pass to all agents and then exit.
* Updated ScanCore->call_scan_agents() to no longer use 'timeout' as it was causing issues with virsh calls.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-05-20 10:28:21 -04:00
Digimer
142be7674e * Fixed a bug in striker-scan-network where the scan wasn't running properly when no network was specifically given.
* Updated DRBD->get_devices() to store information about the nodes for each resource.
* Got more work done on anvil-report-usage.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-04-08 18:54:55 -04:00
Digimer
0b41029db2 Reworked Database->_find_behind_databases to loop through tables, then databases when evaluating for resync. This is still racy but should be less racy as the time between counts of columns for a given table should be a lot shorter. Also re-enabled triggering resyncs based on the age of the most recent record.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-03-31 21:19:32 -04:00
Digimer
7212ea1c2f Fixed a bug where reaping db_in_use states wasn't restricted to the caller's host_uuid.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-03-29 17:12:26 -04:00
Digimer
74b7719cf5 * Created the new anvil-manage-host that can check/set if a host is configured. On Strikers, it can age out data, resync data, and check/set if the local database is active.
* Updated striker-prep-database to again enable the postgresql service.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-03-29 17:12:26 -04:00
Digimer
edf51adaec * Changed 'anvil-manage-power' to no longer set the job progress to 50 prior to calling a reboot. It now sets to 100 immediately. Also reduced the uptime timer to five minutes from ten.
* Updated striker-auto-initialize-all() to reconnect to DBs during waits to better detect when a DB is marked as offline.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-03-16 00:35:26 -04:00
Digimer
7b090e1623 * Updated Database->shutdown() to disconnect, stop the postgresql daemon, then reconnect.
* Updated anvil-daemon to not stop a database until both/all DB hosts are in both/all DB's hosts table.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-03-15 22:33:42 -04:00
Digimer
3fd0db15bf * This rather heavily reworks how database shutdowns works. It adds much more intelligent shutdown, tracking who is using the database, being able to mark a database as "offline" and waiting for users of the database to disconnect before it shuts down.
* Also removed the variables for the database name and DB user name, setting them statically now.
* Created Database->shutdown() to more kindly stop a local database server.
* Added 'check_db_in_use_states()' to anvil-daemon to clean any stale entries marking a database as in use.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-03-14 16:41:37 -04:00
Digimer
b234b79544 Updated anvil-daemon to check if anvil-sync-shared is running if the reported RAM use is too high. If so, it doesn't exit. This fixes an issue where anvil-sync-shared would loop forever as it would constantly be killed when downloading large files.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-02-27 21:29:30 -05:00
Digimer
68b1d12545 Updated anvil-daemon to not shutdown a striker DB until the striker host has been running for at least an hour.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-02-24 22:25:37 -05:00
Digimer
f77f486775 Fixed a typo in scan-network
Fixed a missing 'next' to prevent the first DB from disconnecting when down'ing excess DBs.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-02-09 15:52:21 -05:00
Digimer
d70b9a4956 Updated scancore and anvil-daemon to check their RAM use at the end of each loop and, if it's using more than 1 GiB of RAM, it sends an alert and exits.
* Updated Database->resync_databases() to never run on non-striker machines. On Strikers, before a resync, _age_out_data() is called to clear old data in long-off databases.
* Created System->check_memory() that is loosely based on anvil-check-memory, but checks to see if it's being controlled by a systemctl started daemon and, if so, reads the RAM in use from it's status output.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-02-05 22:08:06 -05:00
Digimer
a633ab7f63 Added a periodic check to ensure all users can ping. This fixes a bug where a local striker dashboard whose DB was stopped wouldn't work.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-01-21 14:27:58 -05:00
Digimer
e37f487704 Fixed a bug in System->check_ssh_keys where the 'admin' user's RSA keys were owned by root.
Signed-off-by: Digimer <digimer@alteeve.ca>
2022-01-20 14:13:27 -05:00
Digimer
892a475881 * Fixed a bug in Convert->format_mmddyy_to_yymmdd() where being passed '--' didn't return the same.
* Fixed a divide-by-zero bug in anvil-boot-server when no servers exist yet.
* Fixed a bug in anvil-daemon where the local databsae engine was being started when it shouldn't.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-01-18 02:38:50 -05:00
Digimer
652f87ec74 * Updated scan-network to also clean up the media type.
* Updated anvil-daemon to check for files in /mnt/shared/incoming on striker dashboards and add them to the media library if needed.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-01-12 23:27:44 -05:00
Digimer
72038e8358 * Fixed a bug where ethtool's Media type contained tab characters that broke JSON when configuring the netowrk interfaces.
* Updated the copywrite date to 2022.
* Updated the database resync to not run on machines host VMs to help reduce the chance of oom-killer terminating a VM.

Signed-off-by: Digimer <digimer@alteeve.ca>
2022-01-03 17:06:56 -05:00
Digimer
3346d31194 * Created Get->kernel_release() that returns the current kernel release (version) in use on the host or on a remote system.
* Created DRBD->_initialize_drbd() to makes sure the DRBD kernel module can load and tries to build the module, if necessary. This is meant to provide support for clients that can't access needed internet resource (or the internet at all).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-12-07 20:03:39 -05:00
Digimer
65dfc22a38 Added an eval{} call around Database->query()'s ->prepare() DBI call to better handle lost database handle.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-11-30 17:58:29 -05:00
Digimer
034c38fdeb Disabled calling striker-prep-database from the spec file, and enabled scancore.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-11-24 00:55:02 -05:00
Digimer
8e41814ca2 * Updated anvil-daemon->prep_database() to start the postgresql daemon if it's not running and no databases are available.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-11-23 21:49:24 -05:00
Digimer
b517117bc1 * Did more work on trying to figure out why iniital setup of the database was failing. I believe it was because, in anvil-daemon, after calling 'prep_database' we called ->connect() _without_ 'check_if_configured' set. Next round of function testing should help confirm is this was the case.
* Added 'configure_firewall()' to 'striker-prep-database' to explicitely open the postgresql service for all active zones.
* Did some general logging changes and cleanup around the same.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-11-23 20:41:29 -05:00
Digimer
3445d008d2 Removed a stray debug die.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-10-10 22:31:23 -04:00
Digimer
63c45430bb * Updated scan-network to clear duplicate IP addresses.
* Fixed a bug in anvil-daemon where striker-prep-database was always being called, when it shouldn't in some cases.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-10-10 22:27:54 -04:00
Digimer
e60a1b46b3 Fixed bugs related to automatic database startup and conditional backup loading.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-19 14:06:18 -04:00
Digimer
4e9882812d * Fixed a bug where the periodic database dumps on the primary database Striker were not sync'ing to peers. Also fixed a bug where these periodic dumps weren't running at all.
* Updated anvil-daemon->prep_database() to only run if the database dump file doesn't exist. (If it does, it's clearly configured).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-18 23:18:06 -04:00
Digimer
72b17ff1f9 * Reworked how databases are stopped, now being handled in anvil-daemon. This way, initial starts will still do traditional resyncs, then shut down. This should allow the best of both worlds, where data is not lost on striker start/stop loss/recovery, but operate normally otherwise without delays.
* Updated Database->archive_database() to return the full path to the dump file.
* Disabled enabling the postgresql daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-09-18 22:33:31 -04:00
Madison Kelly
922899ea78 * WIP: Working on a new method of failing over between which Striker is the active database, instead of running N-number of databases all the time.
* Created Database->backup_database() that creates a pg_dump of the active database.
* Created Database->load_database() that loads the database from a flat file, optionally creating a backup before doing so, and using iptables to block access during the process.
* Updated Database->configure_pgsql() to not start the postgresql daemon unless it just initialized the DB.
* Much work, not yet complete, to Database->connect() to stop after the first successful connection. Added logic that, if not connection was established and the host is a Striker, to load a peer's backup, if it exists, and then start the local daemon.
* Updated anvil-daemon to now have a section to run tasks on a ten minute cycle, which will later be used for the primary Striker to dump / copy its database to peer(s).

Signed-off-by: Madison Kelly <mkelly@alteeve.ca>
2021-09-16 23:10:55 -07:00
Digimer
a697011b08 * Disabled debug logging in anvil-daemon.
* WIP - working on new scan-network scan agent.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-30 02:36:06 -04:00
Digimer
6777104398 * Fixed a bug in anvil-daemon where, when an anvil-manage-power reboot run had triggered a reboot, anvil-daemon didn't set the job_progress to '100', causing constant reboots. Also fixed a bug where the log level was hard-set to '1' instead of '2' needed during debugging.
* Updated Jobs->get_job_uuid() to accept the new 'incomplete' parameter that, when set, will look for jobs whose progress is > 1 and < 100.
* Updated ScanCore-agent_startup() to take the new 'no_db_ok' parameter which returns with '0' if no DB is available and that parameter is set to '1'.
* Fixed a logging bug in 'anvil-join-anvil'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-28 20:04:11 -04:00
Digimer
0c475d2a2e * Fixed a couple logging bugs.
* Updated scan-cluster to get the CIB from pcs instead of reading the CIB from disk.
* Updated anvil-daemon to always call striker-prep-database at log level 2 while trying to find the cause of rare postgres config failures. Also updated striker-prep-database to use the new method of initializing the DB.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 18:22:55 -04:00
Digimer
d3052c0229 * Finished Cluster->check_server_constraints() and added it to scan-cluster. This now makes sure servers don't roll back to their old host after it has been fenced and recovers.
* Completely disabled Network->check_network(), it's causing more problems than it solves.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 14:19:58 -04:00
Digimer
e7a06fce72 * Disabling the periodic network health check in anvil-daemon.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-23 11:01:33 -04:00
Digimer
30f478267a * Forced anvil-daemon to log-level 2 and to enable secure logging to continue debugging setup issues.
* Fixed a undefined variable warning.
* Removed a debugging die from Database->resync_databases().

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 19:41:00 -04:00
Digimer
47fa126a3c * Fixed a typo that blocked anvil-daemon from starting.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 19:00:26 -04:00
Digimer
023f43eda9 * In the never-ending attempt to resolve the build consistency issues, this commit enables extra debugging logging and, hopefully, implements a fix in anvil-daemon where a job could be started repeatedly.
* Renamed the special job status 'scancore_startup' to 'anvil_startup', given it's handled by anvil-daemon.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-22 16:12:12 -04:00
Digimer
bd24c1c5bb * I _might_ have fixed the network configuration issue in anvil-configure-host... Updated it so that if 'nmcli' doesn't report a valid device name, it looks for it in the ifcfg-X file, and uses 'X' if not found there.
* Added the 'print' parameter to Log->variables() to allow printing to STDOUT when set.
* Renamed Network->check_bonds() to Network->check_networks() in anticipation of adding bridge monitoring / repair to it later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-18 19:37:37 -04:00
Digimer
c7c6c8dee5 * Reworked the attempt to repair the network in anvil-daemon to not touch the network until the machine has been running for at least two minutes.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-15 12:04:27 -04:00
Digimer
1e7847d4dd * Added a call to Network->check_bonds() to be called while non-Striker machines wait to connect to a database.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 14:14:37 -04:00
Digimer
3f32a56d0c * Created Network->check_bonds() that checks to see if any bonds are down, or if any interfaces configured to be in a bond are not actually in it. It accepts a 'heal' parameter that, by default, will bring up a bond with no active links, but leaves degraded bonds alone. It call also take 'all' and will try to bring up any missing interfaces. This distinction exists so that if a link is flaky and someone takes it down manually until it can be repaired, it doesn't get turned back on.
* Updated anvil-daemon to call Network->check_bonds() with 'all' on startup, then woth 'down_only' once per minute to try to heal down'ed bonds.
* Updated anvil-watch-bonds to take a 'run-once' switch and exit after one report, if set.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-13 13:33:51 -04:00
Digimer
19c41c9171 * Added more logging while chasing a function test bug.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 23:56:48 -04:00
Digimer
daca6c887b * This contains a fairly major change to how time stamps are handled. All INSERT and UPDATE calls now generate a new timestamp via Database->refresh_timestamp, instead of using 'sys::database::timestamp'. This was done in responce to finding a bug where tables in a database differed in both counts of public and private schemas (ip_addresses table, specifically) that failed to resync because the timestamps were re-used too often.
* WIP - Continuing work on the new anvil-manage-server tool.
* Updated Database->get_anvils() to load information on the files available on each Anvil! system.
* Updated Database->insert_or_update_network_interfaces() to no longer take the 'timestamp' parameter.
* Removed all logging from Database->refresh_timestamp() to speed it up, given how often it will be called now.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-08 15:23:15 -04:00
Digimer
96fffb0b96 * Finished updating ocf:alteeve:server to no longer require a database connection. To do this, and still be able to track live migration times, the Server->migrate_virsh() method now writes out the server name and migration time to a /tmp/anvil/migration-duration.<server_name>.<unix_time> file. This file is checked for by the scan-server resource agent and, when found, is parsed and the migration duration is recorded, then the file is purged.
* Updated anvil-daemon to have a new function called "handle_special_cases" called during startup that does any weird bug mitigation required. For now, this is used to mitigate against rhbz#1961562, though certainly it will be used for other reasons later.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-06-06 00:01:11 -04:00