anvil

Commit Graph

Author	SHA1	Message	Date
Digimer	0c475d2a2e	* Fixed a couple logging bugs. * Updated scan-cluster to get the CIB from pcs instead of reading the CIB from disk. * Updated anvil-daemon to always call striker-prep-database at log level 2 while trying to find the cause of rare postgres config failures. Also updated striker-prep-database to use the new method of initializing the DB. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	d3052c0229	* Finished Cluster->check_server_constraints() and added it to scan-cluster. This now makes sure servers don't roll back to their old host after it has been fenced and recovers. * Completely disabled Network->check_network(), it's causing more problems than it solves. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	87b31a16bb	* Clear out the bond health in Network->check_network(). Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	30f478267a	* Forced anvil-daemon to log-level 2 and to enable secure logging to continue debugging setup issues. * Fixed an undefined variable warning. * Removed a debugging die from Database->resync_databases(). Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	023f43eda9	* In the never-ending attempt to resolve the build consistency issues, this commit enables extra debugging logging and, hopefully, implements a fix in anvil-daemon where a job could be started repeatedly. * Renamed the special job status 'scancore_startup' to 'anvil_startup', given it's handled by anvil-daemon. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	5a343d6d75	* WIP; Started work on Cluster->check_server_constraints() that will track when a server's location constraint needs to be updated when the old preferred node is lost. * Removed (for now) setting MTU in the ifcfg-X files during anvil-configure-host runs. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	b71ed28f64	* Added Cluster->manage_fence_delay() that reports back and, optionally, sets a preferred node in a fence race. * Updated scan-cluster to check / set which node should be preferred if a netsplit causes a fence race. * Fixed a bug in Server->shutdown_virsh() where a shutdown timeout would go into a loop. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	08a958ec60	* Finished updating Network->check_network() to check/heal bridges. * Updated anvil-configure-host to not reboot on network change (will verify when this commit is function tested). Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	bd24c1c5bb	* I _might_ have fixed the network configuration issue in anvil-configure-host... Updated it so that if 'nmcli' doesn't report a valid device name, it looks for it in the ifcfg-X file, and uses 'X' if not found there. * Added the 'print' parameter to Log->variables() to allow printing to STDOUT when set. * Renamed Network->check_bonds() to Network->check_networks() in anticipation of adding bridge monitoring / repair to it later. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	11b1900e1b	Note: Continuing to resolve the build issues with network startup. Expect breakage. * Upped the aging of jobs and alerts data from 2 to 24 hours. Also added a check to prevent deleting a job of any age that is incomplete. * Major update to anvil-configure-host to not touch the network unless something has actually changed. Not yet tested on a fresh system, will verify nothing broke in the CI tests this commit will trigger. Also changed it so that, if after reconfiguring the network it times out trying to reconnect to a database, it calls a reboot instead of simply exiting. Further, a reboot is now not called on exit unless something changed to require it. * Updated Network->check_bonds() to return '1' if anything was done to heal a bond. * Updated anvil-update-states to be more careful about clearing virsh bridges. Specifically, it checks to see if virsh is running and that the returned bridges aren't actually error codes. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	a1b06e4355	* Continuing to try to get the network to reliably start during configuration... Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	3f32a56d0c	* Created Network->check_bonds() that checks to see if any bonds are down, or if any interfaces configured to be in a bond are not actually in it. It accepts a 'heal' parameter that, by default, will bring up a bond with no active links, but leaves degraded bonds alone. It can also take 'all' and will try to bring up any missing interfaces. This distinction exists so that if a link is flaky and someone takes it down manually until it can be repaired, it doesn't get turned back on. * Updated anvil-daemon to call Network->check_bonds() with 'all' on startup, then with 'down_only' once per minute to try to heal down'ed bonds. * Updated anvil-watch-bonds to take a 'run-once' switch and exit after one report, if set. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	1a8215a783	* Fixed a bug in Network->get_ips() bridge detection bashlet. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	80bdac8e34	* Updated the pacemaker server config to drop the stop timeout to 5 minutes and the migration timeout to 10 minutes. This will avoid blocking the entire cluster when a stop or migrate operation times out. Will update scan-server to clean these up when they happen. * Updated Database->archive_table() and ->_find_behind_databases() to loop through connected databases, instead of configured databases. * Updated Network->get_ips() to only record the real MAC addresses on network interfaces (not bonds or bridges) in the "network::${host}::interface::${in_iface}::mac_address" hash. This should help avoid reboot loops caused by anvil-configure-host thinking the network needs to be reconfigured when it doesn't actually need to be. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	daca6c887b	* This contains a fairly major change to how time stamps are handled. All INSERT and UPDATE calls now generate a new timestamp via Database->refresh_timestamp, instead of using 'sys::database::timestamp'. This was done in response to finding a bug where tables in a database differed in both counts of public and private schemas (ip_addresses table, specifically) that failed to resync because the timestamps were re-used too often. * WIP - Continuing work on the new anvil-manage-server tool. * Updated Database->get_anvils() to load information on the files available on each Anvil! system. * Updated Database->insert_or_update_network_interfaces() to no longer take the 'timestamp' parameter. * Removed all logging from Database->refresh_timestamp() to speed it up, given how often it will be called now. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	96fffb0b96	* Finished updating ocf:alteeve:server to no longer require a database connection. To do this, and still be able to track live migration times, the Server->migrate_virsh() method now writes out the server name and migration time to a /tmp/anvil/migration-duration.<server_name>.<unix_time> file. This file is checked for by the scan-server resource agent and, when found, is parsed and the migration duration is recorded, then the file is purged. * Updated anvil-daemon to have a new function called "handle_special_cases" called during startup that does any weird bug mitigation required. For now, this is used to mitigate against rhbz#1961562, though certainly it will be used for other reasons later. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	16c20ae69c	* Updated Tools->catch_sig() to use return code 0 instead of 255 so that systemd doesn't think our daemons failed on stop. * Updated Cluster->parse_cib() to not require a database connection (part of the work to make ocf:alteeve:server run without a DB) * WIP: Continuing work on the ocf:alteeve:server RA to run without database connections. * Updated the scancore daemon to explicitly check that all scan agent schemas are loaded in all databases on startup. This is to resolve resync issues on rebuilt strikers that may not yet have some schemas loaded when a DB resync runs. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	24ec17f8f7	* Added a new parameter called 'sensitive' to Database->connect() that returns after connections before any ancillary checks are done, minimizing connect time. * Fixed a problem with Database->insert_or_update_variables() where variable_source_uuid being set to an empty string wasn't converted to NULL. * Fixed Database->locking() where the way the lock variable was set was rather broken. * Created Striker->check_httpd_conf() which configured apache to handle the integration of the new WebUI for Anvil! management with the existing WebUI. * Updated System->update_hosts() to specifically set the 127.0.0.1 and ::1 lines to handle how cloud-init overrides /etc/hosts and breaks CI/CD tests. * Removed the old index.html as it's now used for the new WebUI. * Began work on removing DB connection requirements from ocf:alteeve:server. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	73267a8ea9	* WIP - Slowly working on anvil-manage-server * Updated the scancore interval to 60 seconds. * Updated Database->insert_or_update_health() so that 'delete' can find the health_uuid. * Updated Convert->time() to return silently when passed '-1'. * Fixed a bug in scan-hardware to call Convert->round(). Also fixed it so it didn't set health scores of 0 for mismatch RAM when the RAM was not mismatched. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	78f3fb7b10	* Updated System->configure_ipmi to pull the machine from the anvils table instead of looking for the original job, which isn't useful now that we purge old jobs. * Shortened up the log messages in scan-drbd Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	4dcd505753	* Biggest change in this commit; scan-apc-pdu and scan-apc-ups now only run on Striker dashboards! This was because we found that if two machines ran their agents at the same time, the response time from SNMP read requests grew a lot. This meant it was likely a third, fourth and so on machine would also then have their scan agent runs while the existing runs were still trying to process, causing the SNMP reads to get slower still until timeouts popped. * Bumped scancore's scan delay from 30 seconds to 60. * Shortened the age-out time to 24 hours and again boosted the archive thresholds. As we get a feel for the amount of data collected on multi-Anvil! systems over time, we may continue to tune this. * Moved Database->archive_database() to be called daily by anvil-daemon, instead of during '->connect' calls. * Added locking to Database->_age_out_data to avoid resyncs mid-purge. Also moved the power, temperature and ip_address columns into the same 'to_clean' hash as it was duplicate logic. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	8807915bb7	The theme of this commit is database cleanup and fixes. * Updated Database->_age_out_data() to check for certain scan agent tables and, for those found, purge out old records. This should go a long way to keeping the database data responsive. * Fixed a bug in Jobs->update_progress() where the 'job_picked_up_by' column was being set to '0' instead of '$$' when clearing the job. * Fixed a bug in System->update_hosts() where '127.0.0.1' would be used in hosts for the actual host name. * Updated the default trigger, count and division values in anvil.conf to 100,000, 50,000 and 75,000 respectively. In combination with the aging of data, this should go a long way to minimizing database sizes and overheads. * Updated anvil-daemon to call $anvil->Database->_age_out_data(); in its daily tasks. * Updated various striker-X tools to specifically request a DB resync on Database->connect calls. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	6abe06f125	The theme of these commits is improving DB responsiveness. * Created Database->_age_out_data() to delete records from the database that are old enough to no longer be useful. This is designed to significantly reduce the size of the database, allowing a better focus on performance. * Changed Database->connect() to default to NOT check for resync, reworking the old 'no_resync' to 'check_for_resync', so that resync checks happen on demand, instead of by default. * Updated get_tables_from_schema() to now allow 'schema_file' to be set to 'all', which then loads the schema files of all scan agents as well as the core anvil schema file. Fixed a bug where commented out tables were being counted. * Re-enabled triggering resyncs on 'last_updated' differences. * Fixed a bug in scan-ipmitool where the history_id column in history.scan_ipmitool_value was incorrect. * Created a new tool called striker-show-db-counts that shows the number of records in all public and history schema tables for all databases. * Updated anvil-update-states to detect when a libvirtd NAT'ed bridge exists and to delete it when found. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	bbad058b33	* Created a new tool, anvil-watch-bonds, which is a live monitor of bonds and interfaces designed to be run from the command line on a given host. * Created Words->center_text that takes a string (or string key) and centers it to a given string length, padding white spaces on either side of the string as needed. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	d155c2eb66	* Fixed a bug where 'timeout' would repeatedly get added to drbd's global-common.conf file. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	42ffc200bc	* Updated remaining pointers to the old repos to the new repos. Added support for the new alteeve-repo-setup. * Removed the checks for resync that limited resyncs on jobs and variables tables. That approach to minimize unnecessary resyncs has proven faulty, will find another way later. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	41cd1e0319	* Several bugs fixed and enhancements; * DRBD is now configured to a ping-timeout of 3 seconds. * Created Log->switches() that returns the command line switches used by Anvil! tool command line calls based on the active log levels / secure logging. Appended this to all invocations of our tools. * Updated Database->resync_databases() to now only skip 'jobs' and 'variables' tables with less than 10 record differences. All other differences will trigger a resync. * Created System->_check_anvil_conf() that, as you might guess, checks if anvil.conf exists and creates it (using defaults), if not. It also checks to see if the 'admin' group and user exists and creates them, if not. * Updated anvil-daemon to check anvil.conf on start up and in each loop. Created the function check_journald() that checks (and sets, if needed) that journald logging is persistent. * Made striker-manage-peers to check_if_configured on the Database->connect() when updating anvil.conf and the target UUID is the local machine. Also created a loop to make the reconnection a lot more robust. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	2f8becbb11	* Fixed (another) bug in Database->_archive_table() that was preventing Database.pm from compiling. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	49890762b9	* Fixed a missing semi-colon that broke Database.pm. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	a846f9ecbc	* Fix to the database resync logic. The previous change to only resync if 10+ lines differed broke striker-manage-peers as the difference in host counts is what triggered the pairing of strikers. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	41d528418d	* Increased the trigger point for database archiving. The current values were too low and causing frequent archive -> resync cycles. * Fixed a bug in that archiving defaulting to not store on disk was not working properly. Now acts as described in anvil.conf. Signed-off-by: Digimer <digimer@alteeve.ca>	3 years ago
Digimer	fc0954d0c8	* Started work on, but not at all finished, anvil-manage-server which will allow manipulation of a server's resources. * Changed the alteeve repo RPM to the new community/enterprise repo * Fixed a bug where 'fence_data::updated' was causing the fences web page to break. * Fixed a bug in Database->insert_or_update_network_interfaces() where certain interfaces were being repeatedly added to the database. * Fixed a bug in Database->_find_behind_databases() that was marking DBs as behind even though they had less than 10 columns off. * Fixed a bug in Get->host_name() where, if the host name was changed on disk but the environment variable was still the old name, it would cause the hostname to waffle back and forth and cause constant updates to /etc/hosts. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	ad4a1ecc78	* Increased the scancore agent run timeout to 60 seconds. * Updated anvil-safe-start to start DRBD resources when the peer's DRBD resource is 'Connecting'. * Updated fence_pacemaker to more intelligently check the list of host names related to an IP address when looking for the peer host name Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	44864ce321	* Updated Database->resync_databases() to set a default schema of 'public'. Also fixed a bug where, when the difference in record numbers between two line was > 999, it would not trigger a resync. * Updated the scan agent timeout to 60 seconds. Also made the scan agent exit code log entries more helpful. * Updated System->collect_ipmi_data() to now better handle duplicate sensor names. Now, instead of simply appending an integer, we find the hex address and use that in the sensor name when duplicates exist. This solves the problem of the sensor names not being consistently shown in order. * Fixed message bugs (bad variable insertions) in scan-apc-pdu and scan-apc-ups. * Fixed schema procedure bugs in the 'temperature' and 'ip_address' tables where the columns were in bad order, causing constant updates. Incomplete work; * Create the shell of 'anvil-manage-storage', but virtually no logic exists in it yet. * Started work on anvil-safe-start to deal with an issue where DRBD resources don't start when a server is running on a peer. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	7abbc938af	* Renamed tools/striker-purge-host to tools/striker-purge-target and moved the code from test.pl over to it. No longer provides interactive selection, but now does work with Anvil! systems as well as hosts. * Fixed a bug in Database->get_tables_from_schema where history.X and X tables were being stored in the table list. * Updated ocf:alteeve:server to not do resyncs on DB connect. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	f833c311ba	* To address issues with scancore debugging, we needed a tool to purge old anvils and hosts from the database. The 'test.pl' in this commit contains the new logic that will be merged into tools/striker-purge-host shortly. * Created Database->find_host_uuid_columns() and ->_find_column() to create a list of tables and column names in the proper order to allow deletion of foreign keys so that deeply nested primary keys can be deleted. Specifically, this was meant for hosts -> host_uuid and anvils -> anvil_uuid, though it should work for other tables. * Updated html/jquery-ui-1.12.1/package.json to address CVE-2020-7729 * Fixed a bug in the temperature table's history procedure where temperature_weight wasn't being copied. * Updated anvil-provision-server to support '--anvil' that can take either the anvil-uuid or anvil-name. * Updated anvil-safe-stop to default the stop-reason to 'user'. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	3fb81c1a0a	* Updated Convert->time() to silently return if the given time was '--'. * Added a new parameter to Database->connect() called 'no_resync' that, if set, prevents a resync check being performed. Updated ->resync_databases() to find a uuid_column where the table name ends in 'ies' and the UUID column is 'y_uuid'. Updated ->resync_databases() to not fire on updated table age anymore, and to trigger only if the number of rows differ in a given table by more than 10. * Updated Log->entry() to prefix a tool's name, when the new 'log::scan_agent' value is set. Also set this value in ScanCore->agent_startup(), to help differentiate log entries. * Fixed a bug in scancore's main loop where it logged the sleep message at the start of the run. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	4a87ee71db	* This commit started with work on webui endpoint set_power, but then switched to scancore debugging and I neglected to switch branches. * Created Cluster->check_stonith_config() that checks and, if needed, reconfigures a cluster's fencing (stonith) config. * Updated scan-cluster to call Cluster->check_stonith_config() at the end of each call. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	416f51323a	* Created tools/striker-boot-machine to, well, boot machines. It uses host_ipmi or, failing that, other fence methods when available to boot a node. * Created Cluster->get_fence_methods() that parses all fence methods out of a recorded CIB and stores them in a hash for a given host_uuid. * Fixed a bug in ScanCore->post_scan_analysis_striker() where the short_host_name was not being stored correctly. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	ca7052dd53	The core logic is done!!!! Still need to finish end-points for the WebUI to hook into, but the core of M3 is complete! Many, many bugs are expected, of course. :) * Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively. * Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf. * Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state. * Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once fleshed out and tested. * Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times. * Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan. * Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	15dab8aab7	* Started working on the node post-scan logic in ScanCore. Created ScanCore->check_temperature() to get a thermal score against a node. * Updated ScanCore->check_power() to not require the parameter values. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	f202187c34	* anvil-safe-stop is complete! Testing still needed, of course. * Updated DRBD->manage_resource() to call 'drbdadm adjust <res>' when starting a resource to help deal with a periodic issue where the 'allow-two-primary' option on the peer doesn't match the local setting. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	3a6902d899	* Made good progress on anvil-safe-stop. It will now stop or migrate servers (testing needed). * Updated Server->shutdown_virsh() to change the parameter 'wait' to 'wait_time' to clarify its use. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	27259d1d53	* Finished anvil-rename-server! * Created Storage->delete_file() that, well, deletes files (locally or on a peer). Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	2e37691116	* Updated DRBD->gather_data() to store data on peers so that the peer's LV path and backing disk is recorded. Also fixed a bug in ->get_status() where the return code for local calls was stored as a host name. * Added the scan-hpacucli scan agent. It's been done for a while and should have been added ages ago. * Updated anvil-rename-server to get to the point where it will take down the DRBD resources on all machines, but waits if there is a sync under way. It also verifies that the server is off on all systems from virsh's perspective. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	711a04999e	* Finished anvil-migrate-server and anvil-safe-start! Lots of testing still needed for both though, and 'anvil-safe-start' doesn't run as a job yet, but the logic is all there. * Fixed a bug in Cluster->migrate_server() where waiting for the server to migrate would never exit. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	eec14cb013	* Finished tools/anvil-boot-server and tools/anvil-shutdown-server. * Fixed a bug where, in rare cases, $anvil->hostname() would call 'hostnamectl' and get a dbus error during shutdown, which would then cause the hostname to be changed to the error in the database. * Fixed a bug in Cluster->boot_server() where it would never verify that a server has started successfully. * Updated Database->get_ip_addresses() to store the IPs we manage in 'ip_addresses::<ip_address_address>::X'. * Updated ocf:alteeve:server to work from command line calls, though more testing is still needed. * Started work on 'anvil-rename-server', but haven't gotten far with it yet. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	a480357049	* Fixed a bug in Cluster->assemble_storage_groups() where, if a group is created during an anvil-provision-server run, the group would get created multiple times. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	b36093671b	* Updated Database queries that were passing 'debug => $debug' to not do that, as it was causing far too much (useless) noise in the logs. * Turned on print to console for logging in anvil-provision-server. Also updated it to check if the cluster is running and hold until it is. * Cleaned up some code in Get->available_resources() that proved hard to debug. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago
Digimer	798518ba5e	* While working on the boot/shutdown server tools, ran into and fixed a bug where files uploaded before an Anvil! was added could not have those files sync'ed. This was fixed through the new Database->check_file_locations() method. Signed-off-by: Digimer <digimer@alteeve.ca>	4 years ago


757 Commits (753358a13bc2f0a1a4b78e25c65b74fca23d820c)