* Created Database->_age_out_data() to delete records from the database that are old enough to no longer be useful. This is designed to significantly reduce the size of the database, allowing a better focus on performance.
* Changed Database->connect() to default to NOT check for resync, reworking the old 'no_resync' to 'check_for_resync', so that resync checks happen on demard, instead of by default.
* Updated get_tables_from_schema() to now allow 'schema_file' to be set to 'all', which then loads the schema files of all scan agents as well as the core anvil schema file. Fixed a bug where commented out tables were being counted.
* Re-enabled triggering resyncs on 'last_updated' differences.
* Fixed a bug in scan-ipmitool where the history_id column in history.scan_ipmitool_value was incorrect.
* Created a new tool called striker-show-db-counts that shows the number of records in all public and history schema tables for all databases.
* Updated anvil-update-states to detect when a libvirtd NAT'ed bridge exists and to delete it when found.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Created Words->center_text that takes a string (or string key) and centers it to a given string length, padding white spaces on either side of the string as needed.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Removed the checks for resync that limited resyncs on jobs and variables tables. That approach to minimize unnecessary resyncshas proven faulty, will find another way later.
Signed-off-by: Digimer <digimer@alteeve.ca>
* DRBD is now configured to a ping-timeout of 3 seconds.
* Created Log->switches() that returnes the command line switches used by Anvil! tool command line calls based on the active log levels / secure logging. Appended this to all invocations of our tools.
* Updated Database->resync_databases() to now only skip 'jobs' and 'variables' tables with less than 10 record differences. All other differences will trigger a resync.
* Created System->_check_anvil_conf() that, as you might guess, checks in anvil.conf exists and created it (using defaults), if not. It also checks to see if the 'admin' group and user exists and creates them, if not.
* Updated anvil-daemon to check anvil.conf on start up and in each loop. Created the function check_journald() that checks (and sets, if needed) that journald logging is persistent.
* Made striker-manage-peers to check_if_configured on the Database->connect() when updating anvil.conf and the target UUID is the local machine. Also created a loop to make the reconnection a lot more robust.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Changed the alteeve repo RPM to the new cimmunity/enterprise repo
* Fixed a bug where 'fence_data::updated' was causing the fences web page to break.
* Fixed a bug in Database->insert_or_update_network_interfaces() where certain interfaces were being repeatedly added to the database.
* Fixed a bug in Database->_find_behind_databases() was marking DBs as behind even though they had less than 10 columns off.
* Fixed a bug in Get->host_name() where, if the host name was changed on disk but the environment variable was still the old name, it would cause the hostname to waffle back and forth and cause constant updated to /etc/hosts.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Updated anvil-safe-start to start DRBD resources when the peer's DRBD resourcs is 'Connecting',
* Updated fence_pacemaker to more intelligently check the list of host names related to an IP address when looking for the peer host name
Signed-off-by: Digimer <digimer@alteeve.ca>
* Created DRBD->check_if_syncsource() and ->check_if_synctarget() that return '1' if the target host is currently SyncSource or SyncTarget for any resource, respectively.
* Updated DRBD->update_global_common() to return the unified-format diff if any changes were made to global-common.conf.
* Created ScanCore->check_health() that returns the health score for a host. Created ->count_servers() that returns the number of servers on a host, how much RAM is used by those servers and, if available, the estimated migration time of the servers. Updated ->check_temperature() to set/clear/return the time that a host has been in a warning or critical temperature state.
* Finished ScanCore->post_scan_analysis_node()!!! It certainly has bugs, and much testing is needed, but the logic is all in place! Oh what a slog that was... It should be far more intelligent than M2 though, once flushed out and tested.
* Created Server->active_migrations() that returns '1' if any servers are in a migration on an Anvil! system. Updated ->migrate_virsh() to record how long a migration took in the "server::migration_duration" variable, which is averaged by ScanCore->count_servers() to estimate migration times.
* Updated scan-drbd to check/update the global-common.conf file's config at the end of a scan.
* Updated ScanCore itself to not scan when in maintenance mode. Also updated it to call 'anvil-safe-start' when ScanCore starts, so long as it is within ten minutes of the host booting.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Fixed a bug where, in rare cases, $anvil->hostname() would call 'hostnamectl' and get a dbus error during shutdown, which would then cause the hostname to be changed to the error in the database.
* Fixed a bug in Cluster->boot_server() where it would never verify that a server has started successfully.
* Updated Database->get_ip_addresses() to store the IPs we manage in 'ip_addresses::<ip_address_address>::X'.
* Updated ocf:alteeve:server to work from command line calls, though more testing is still needed.
* Started work on 'anvil-rename-server', but haven't gotten far with it yet.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Created Cluster->parse_quorum() to check if a node is quorate as 'have-quorum' in the pacemaker CIB doesn't appear to be super accurate during startup.
* Fixed a bug in striker-manage-install-target where if a node didn't have any registered IPs, it would break before generating the repo data.
* Fixed a bug in anvil-join-anvil where if the database had to be reconnected, the job data was lost.
Signed-off-by: Digimer <digimer@alteeve.ca>
* The get_mmeory endpoint was completed.
* The get_replicated_storage endpoint was completed, though it requires testing and likely has issues.
To prepare for the get_status endpoint work, I needed to update ScanCore and modules to track the host_status. This commit contains the work needed for this.
* Updated ScanCore->post_scan_analysis_striker() to use configured fence devices (except PDUs) to check if a target host is off or on, in there is no host_ipmi interface. In all cases, if a machine can be confirmed on or off, the host_status is now updated.
* To support the above fence based power checks, updated scan-cluster to store the on-disk CIB in the new scan_cluster -> scan_cluster_cib colume.
* Updated ScanCore->parse_cib() to map stonith primitive IDs to fence agents. Updated ->parse_crm_mon() to not call if the executable doesn't exist to avoid unhelpful error messages in the logs when called from a Striker.
* Update DRBD->gather_data() to get the size data from /sys/block/drbd<minor>/size' x '/sys/block/drbd<minor>/queue/logical_block_size so it works when a device is Secondary (and can't be promoted).
* Updated Database->get_hosts_info() to record the short host name as well as the stored host name. Created ->update_host_status() as a wrapper to ->insert_or_update_hosts() that only updates the host status.
* Updated anvil-join-anvil to disabled ksm and ksmtuned daemons.
* Updated scancore and anvil-daemon to set the host_status to 'online' on startup.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Finished scan-filesystems!
* Realized that filesystem UUIDs are not always actual UUIDs, and so created an additonal column in filesystems -> filesystem_internal_uuid and created a normal filesystem_uuid table.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Fixed a bug in Database->_find_behind_databases() _find_behind_databases() where the logic to figure out which column was the host_uuid reference was too liberal, causing the wrong column to be selected in some cases. Also added a check to not look for host_uuid columns on specific tables where a match would be made, but the column is allowed to be null (like server_host_uuid that indicates the host of the server).
* Started work on the scan-filesystems scan agent, which is needed to record the data that the file system UI endpoints will need.
* Removed the 'not null' constraint from 'servers' -> 'server_host_uuid'.
* Fixed a bug in 'anvil-provision-server' where the driver ISO being 'none' caused the provision script to use the driver ISO switch.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Updated Remote->call() to return ('!!error!!', '!!error!!', 9999) when an error hits. Made Remote->test_access() explicitely check for '1' to be returned in order to confirm access, fixing a bug where bad target value caused false positives. Updated ->_check_known_hosts_for_target() to no longer explicitely check for 'ssh-rsa' so that machine keys using different cyphers are detected as being in known_hosts properly.
* Updated striker-auto-initialize-all to initialize nodes and DR hosts networks before trying to form them into an Anvil!. Fixed several other bugs as well. More testing is needed, but it works now.
* Updated striker-initialize-host to check for the alteeve repo and, it not found, check for accress to alteeve.com. If access, it will install our repo now.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Finished initial Striker setup in tools/striker-auto-initialize-all. Started working on peering.
* Cleaned up the handling of converting UIDs to user names in Remote->add_target_to_known_hosts() and ->_call_ssh_keyscan().
* Did a bunch of white-space/alignment cleanup.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Updated Get->os_type() to use 'cat' instead of Storage->read_file() because 'rsync' may not be available when it is called during striker-initialize-host calls.
* Updated Database methods to skip 'oui' and 'state' during resync.
* Updatedb striker-initialize-host to detect when it's initializing a CentOS Stream Node / DR Host and enable the HA repo.
* Created the tools/striker-auto-initialize-all tool, which is very much incomplete, that will allow for the rapid creation of a full Anvil! from freshly installed machines autonomously.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Reworked scan-ipimitool so that on nodes and dr hosts, it only scans itself. On strikers, it scans all hosts found in active Anvil! systems with a host_ipmi entry. `
* For all agents, reduced log verbosity to not push too much noise into anvil.log while scancore is running in the background.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Updated Cluster->add_server() to now set failure timeouts to actual numbers instead of INFINITY after discovering that INFINITY doesn't work in those cases.
* Updated Databsae->get_hosts to now check if other entries have the same host name, and if so, to set their host_key to 'DELETED'. This should make it easier to handle when a hardware machine is replaced by new hardware but uses the same host_name.
* Updated Email->check_queue() to start and enable postfix.service if it's found to not be running.
* Updated Get->available_resources() to return '!!no_data!!' when a given host hasn't got any data in scan_lvm_vgs. Now use this in anvil-provision-server to exit if a node or dr host hasn't run scancore yet.
* Fixed a bug in scan-lvm where the pvs_uuid wasn't being loaded properly, preventing lost PVs, VGs and LVs from being flagged as deleted.
* Started work on anvil-migate-server, though it's far from complete.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Fixed a bug in Cluster->shutdown_server() where the wrong variable was being evaluated when checking the server state.
* Created DRBD->delete_resource() that deletes a resource's backing device and configuration. Note that this wipes the DRBD MD and and FS signatures before removing the LV. Updated DRBD->gather_data() to record the backing devices for volumes.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Fixed a bug in Cluster->parse_cib() when a server that is off wasn't setting 'status'.
* Renamed 'server::location::<server>::host' to '...::host_name' in several places.
* Got more work done on anvil-delete-server, up to the point where it calls the new Cluster->delete_server() method.
* Updated fence_pacemaker to call 'drbdadm adjust all' to dampen an issue where in-memory fence configs seem to change, preventing reconnection of the peer after it reboots from the fence. More testing needed on this issue.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Created a new tools/striker-parse-os-list tool that parses 'osinfo-query os' and prints out entries for words.xml for any new OSes.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Finished DRBD->get_next_resource() that returns the next available minor and the next free TCP port (with two free ports available after it).
* Created Storage->get_storage_group_details() that pulls together the LVM, storage group members and storage groups into one block of data.
* Made more progress on tools/anvil-provision-server. It now gets up to the point of creating LVs, creating DRBD resource files, loading them, creating metadata and up'ing the resource. It doesn't yet (successfully) force a new resource to primary.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Created the new method Cluster->get_primary_host_uuid() that returns the 'host_uuid' of the primary node in the given cluster. This is useful for external programs to figure out which node is primary. Example is provisioning a new server being assigned to the active node. Also created ->is_primary() that is a similar test to see if the active node is the primary node or not.
* Updated Cluster->parse_cib() and ->parse_crm_mon() to work on remote hosts.
* Updated Database->get_hosts() to store the short host names.
* Created Get->host_from_ip_address() that translates an IP address to a host_uuid and host_name, if it's an IP assigned currently to a known host.
* Created Network->find_target_ip() that simplifies finding which IP address to use when the caller wants to connect to a target host.
* Reworked the anvil-join-anvil to parse fence_arguments in a way that handles passwords with spaces in them.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Cleaned up upload.pl now that it isn't responsible for loading the file details into the database. It only sets a job for the local Striker to process the file and move it into /mnt/shared/files, copy it to peer dashboards, then load jobs for Anvil! members to sync the new file.
* Created Database->get_files() and ->get_file_locations() to load the respective data.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Got tools/striker-sync-shared to pick up 'upload::move_incoming' jobs, move the uploaded file to /mnt/shared/files/, copies it to peer dashboards, adds it to the 'files' table and adds it to 'file_locations'.
* Reworked the 'file_locations' table to now map files to Anvil! systems, not hosts. It simply tracks if a given file should be on Anvil! members or not. Later, striker-sync-shared on the Anvil! members will pull the file down.
* Updated Storage->get_file_stats() to record the file's mimetype.
* Fixed up a few issues in cgi-bin/upload.pl.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Fixed a bug in Database->check_agent_data() where the list of tables wasn't passed in, and thus the table list wasn't then passed on to Database->_find_behind_databases().
* Started work on a new method called Storage->parse_lsblk().
Signed-off-by: Digimer <digimer@alteeve.ca>
* Updated Database->insert_or_update_temperature() to accespt a 'delete' parameter (which, surprise, deletes a record).
* Updated (but not yet tested) the RPM .spec to require that 'core' and the other three packages are required to be the same version.
* Updated scan-ipmitool and scan-storcli scan agents to now delete temperature data belonging to lost sensors.
* Fixed tools/striker-manage-install-target to add multiple missing packages.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Updated Alert->register() to take message variables using the 'variables' parameter.
* Added a 'cache' parameter to Database->insert_or_update_health() and ->insert_or_update_temperature(). When set, the SQL UPDATE/INSERT calls and pushed into the array reference set in 'cache'. This is to allow performance improvements when processing a large amount of sensor/device data.
* Updated Log->variables() to take a 'prefix' parameter that, when set, will prefix the string to each variable line.
* Updated scan-ipmitool to use Database->insert_or_update_health() and ->insert_or_update_temperature().
Signed-off-by: Digimer <digimer@alteeve.ca>
* Created (but not finished) scan-apc-pdu
* Added support to tracking maintenance-mode for nodes in Cluster->parse_cib
* Created Remote->read_snmp_oid().
* Created Server->get_definition.
* Updated Server->get_status() to write-out server XML files on-demand.
* Finished scan-cluster.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Renamed scan-storage to scan-lvm as we only really care about LVM data in this agent. A dedicated scan-drbd will be created later. Got the agent to parse the pvs/vgs/lvs data.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Added support in anvil.conf to disable scan agents with 'scancore::<agent_name>::disable', and added handling this to agents. Also allowed for '--force' to override this setting.
* Updated ScanCore->agent_startup() to allow for empty scan agent table lists.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Created 'Cluster->which_node' that returns 'node1' or 'node2' to indicate which node a host is.
* Continued working on scan_cluster; decided to make it not host-dependent.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Updated Alert->register to take a hash reference for message variables to simplify when a caller plans to log and register an alert at the same time.
* Updated Convert->bytes_to_human_readable() to name the 'size' variable used internally for 'bytes' to actually be 'bytes' for better consistency.
* Created multiple new Database methods;
** ->check_condition_age() is meant to be used by scan agents to see how long a given condition has been in play (ie: how long ago power was lost to a UPS or a sensor became unreadable).
** ->insert_or_update_health() handles recording data to the new 'health' table, used for determining ideal hosts for servers between nodes.
** ->insert_or_update_power() handles recording data to the new 'power' table, used for determining how power events are handled.
** ->insert_or_update_temperature() handles recording temperature data to the new 'temperature' table, used to determine how thermal events are handled.
* Got a lot more done on the scan-hardware scan agent. Only part left now is post-scan health processing.
Signed-off-by: Digimer <digimer@alteeve.ca>
* More work done on Email->send_email() to, well, actually send email (which it isn't doing yet, but it's close).
* Updated Words->key() to include the bad key name when no entry for the requested key exists in the words.xml file.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Fixed a bug in Tools->_set_defaults where the order the tables were sync'ed it caused primary/foreign keys would trigger DB errors when resync'ing in some cases.
* Created Database->log_connections to make it easier to log which databases are actively in use and other data about the connections.
* Fixed bugs in striker-manage-peers that (partly because of the above bugs) failed to connect to new peers properly.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Got email server configuration under way. A mail server can now be configured via Email->_configure_for_server(), but more work is needed on when to switch between configs.
* Fixed some logging of passwords that wasn't being checked to see if secure logging was enabled or not.
* Fixed a bug in Striker where the back arrow in email config sub-sections weren't going back to the main email menu.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Added the anvil.conf option 'sys::privacy::strong' that controls if the Anvil! ever "calls home". Initially, this controls DRBD's usage flag.
* Updated DRBD->get_devices() to track resources by their 'by-res' names as well and by the normal '/dev/drbdX' devices.
* To mitigate https://bugzilla.redhat.com/show_bug.cgi?id=1868467, updated Get->bridges() to parse the normal (non-JSON) data if we get invalid JSON output.
* Updated anvil-join-anvil to not disable, and in fact enable, libvirtd on boot. With DRBD 9, the original fear of a user accidentally booting a VM that's running on the peer no longer is an issue. By enabling it and leaving it on, Striker dashboard users won't lose their virtual machine manager link unless the node powers off. Also enabled actually updating the job progress, completing this tool!
Signed-off-by: Digimer <digimer@alteeve.ca>
* Cleaned up a lot of logging.
* Updated Cluster->parse_cib() to track if a stonith device has 'delay' set.
* Got a lot more work done on anvil-join-anvil's stonith processing, but it still isn't complete. Updated it to change shell user passwords as well.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Fixed Database->get_ip_addresses() to clear stale IP addresses.
* Finished (for now, more testing needed) System->configure_ipmi! Also created System->test_ipmi() that handles trying lanplus and various password lengths, updating hosts -> host_ipmi on successful check.
Signed-off-by: Digimer <digimer@alteeve.ca>
* Got anvil-join-anvil to the point where is initializes and starts the cluster.
* Deleted the old ssh key handling logic in anvil-daemon.
Signed-off-by: Digimer <digimer@alteeve.ca>