Commit Graph

2315 Commits

Author SHA1 Message Date
Digimer
ac6531ddf2 * Fixed a bug where Words->load_agent_strings() wouldn't process strings without new-lines in them (caused in the last fix).
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-03 18:53:56 -05:00
digimer-bot
eef1e4fb20
Merge pull request #29 from ClusterLabs/dependencies
* Added 'tar' as a dependency because somehow I went three years with…
2021-02-03 14:49:05 -05:00
Digimer
1998ad946c * Added 'tar' as a dependency because somehow I went three years without this...
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-03 14:34:08 -05:00
Digimer
68d007c1bb
Merge pull request #28 from ClusterLabs/anvil-provision-server-fixes
Anvil provision server fixes
2021-02-03 13:03:31 -05:00
Digimer
ff3681c913 * Added support for manually setting the server's UUID in anvil-provision-server. Also, if a server name existed before but was deleted, the old UUID is re-used to provide better continuity. The user can override this behaviour with the new --uuid switch.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-03 12:58:18 -05:00
Digimer
4b9ec56106 * Updated DRBD->delete_resource() to return a success if asked to delete a non-existent resource (as can happen when partial anvil-delete-server runs are re-run).
* Reworked DRBD->get_next_resource() to pull from the database, and to no longer do that increments-of-three nonsense. Avoidable complexity. Also added a call to Cluster->get_anvil_uuid() if the 'anvil_uuid' parameter wasn't passed.
* Updated Database->get_host_from_uuid() and ->get_hosts() to now take 'include_deleted' parameter and default to not returning deleted hosts. This fixed issues where anvil-{delete,provision}-server calls could assign jobs to now-deleted hosts with reused host names.
* Updated anvil-delete-server to print log entries to STDOUT. Also updated it to not wait of shutdown of a server in pacemaker to complete, and instead to destroy it after calling pacemaker's resource stop. Updated to also check to see if the server being deleted is already out of pacemaker and, if so, skip that step and directly try to destroy the server, if it's running.
* Updated anvil-provision-server to force 'peer_mode' runs to pull their TCP Port and DRBD minor numbers from the job. This fixes a bug where the same resource on two machines could use different TCP ports.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-02 23:09:47 -05:00
Digimer
cb955f5370 Merge branch 'anvil-provision-server-fixes' of github.com:ClusterLabs/anvil into anvil-provision-server-fixes
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-01 13:11:32 -05:00
Digimer
3708575485 * Added a check to anvil-delete-server to remove the XML definition file.
* Added checks to anvil-provision-server to see if an existing server name is flagged as DELETED, instead of outright rejecting a given server name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-01 12:56:46 -05:00
Digimer
42dc688d04
Merge pull request #27 from ClusterLabs/string_bugs
* Fixed a bug in Words->parse_banged_string() Where the flattened str…
2021-02-01 12:37:33 -05:00
Digimer
3f04c9031b * Fixed a bug in Words->parse_banged_string() Where the flattened string wasn't being used for the variable substitutions.
* This resolves issue #23.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-01 12:27:05 -05:00
Digimer
86228e9d1d * Added a check to anvil-delete-server to remove the XML definition file.
* Added checks to anvil-provision-server to see if an existing server name is flagged as DELETED, instead of outright rejecting a given server name.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-02-01 12:06:40 -05:00
digimer-bot
030f4a7efc
Merge pull request #26 from ClusterLabs/string_bugs
String bugs
2021-01-31 05:27:31 -05:00
digimer-bot
5a720276d4
Merge pull request #25 from ClusterLabs/anvil-provision-server-fixes
* Typo fixed in striker-manage-install-target insertion variable.
2021-01-31 05:23:29 -05:00
Digimer
9ae6a6b00f
Merge pull request #24 from ClusterLabs/storage-groups-testing
* Finished fixing automatic building of Storage Groups on systems whe…
2021-01-31 05:19:22 -05:00
Digimer
1081645893 * Added parameters to DRBD->get_next_resource to allow for a resource to be searched and either error out if a resource is found, or return the first DRBD minor and tcp port if found.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-31 02:32:12 -05:00
Digimer
f4bf1fd54a * Removed some XML insertions into strings as the break inserting into strings.
Note: These changes below shouldn't have been in this branch... *sigh*
* Fixed an issue with tools/anvil-provision-server where a VM would be created but didn't boot. When this happens, an explicit boot is sent via virsh. Also bumped up the time it waits for a new server to start up.
* Added an explicit call to scan-drbd after a new resource is created to ensure that if any calls come after looking for the next free DRBD minor or port, they don't use the ones just used.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-31 01:39:52 -05:00
Digimer
3d7ce84c38 * Fixed a bug in Get->host_from_ip_address() where hosts that are no longer used are returned, meaning 2+ results could be returned after a node was replaced, meaning no host name was returned.
* Fixed a bug in anvil-provision-server where forcing initialization of a new DRBD resource when running on node 2 would fail because the node ID in the drbdsetup command was hard-coded to be run from node 1.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-31 00:00:23 -05:00
Digimer
218934bec8 * Fixed a bug with the path to anvil-provision-server.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-30 19:06:22 -05:00
Digimer
e25a424eb4 * Typo fixed in striker-manage-install-target insertion variable.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-30 18:49:15 -05:00
Digimer
864d67b0a7 * Finished fixing automatic building of Storage Groups on systems where VGs are deleted.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-30 18:25:24 -05:00
Digimer
4619c810d8
Merge pull request #20 from ClusterLabs/makefile
[build] first pass at adding a build system to integrate with CI
2021-01-30 14:33:50 -05:00
Fabio M. Di Nitto
8f9892650b [build] first pass at adding a build system to integrate with CI
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net>
2021-01-30 20:16:30 +01:00
Digimer
413a4f73c2 * Updated Tools->_anvil_version() and Get->anvil_version() to now pick up a SchemaVersion from anvil.sql. This will change only when the schema changes and is used when Database->connect() is checking compatibility with other anvil database hosts. This will make it only break connection when there is a reason to do so. The anvil_version still remains as an informational version that will help when supporting users later.
* Updated Cluster->add_server() to now set failure timeouts to actual numbers instead of INFINITY after discovering that INFINITY doesn't work in those cases.
* Updated Databsae->get_hosts to now check if other entries have the same host name, and if so, to set their host_key to 'DELETED'. This should make it easier to handle when a hardware machine is replaced by new hardware but uses the same host_name.
* Updated Email->check_queue() to start and enable postfix.service if it's found to not be running.
* Updated Get->available_resources() to return '!!no_data!!' when a given host hasn't got any data in scan_lvm_vgs. Now use this in anvil-provision-server to exit if a node or dr host hasn't run scancore yet.
* Fixed a bug in scan-lvm where the pvs_uuid wasn't being loaded properly, preventing lost PVs, VGs and LVs from being flagged as deleted.
* Started work on anvil-migate-server, though it's far from complete.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-30 14:03:13 -05:00
Digimer
89dec8e1f9 * Finished anvil-delete-server! (More testing needed though)
* Fixed a bug in Cluster->shutdown_server() where the wrong variable was being evaluated when checking the server state.
* Created DRBD->delete_resource() that deletes a resource's backing device and configuration. Note that this wipes the DRBD MD and and FS signatures before removing the LV. Updated DRBD->gather_data() to record the backing devices for volumes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-26 01:45:17 -05:00
Digimer
549dbad635 * Created Cluster->delete_server(), which deletes a server resource from pacemaker (stopping it first, if needed).
* Fixed a bug in Cluster->parse_cib() when a server that is off wasn't setting 'status'.
* Renamed 'server::location::<server>::host' to '...::host_name' in several places.
* Got more work done on anvil-delete-server, up to the point where it calls the new Cluster->delete_server() method.
* Updated fence_pacemaker to call 'drbdadm adjust all' to dampen an issue where in-memory fence configs seem to change, preventing reconnection of the peer after it reboots from the fence. More testing needed on this issue.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-25 01:00:55 -05:00
Digimer
d9d347ce63 * Updated .spec for the new source location.
* Created a log disable flag to avoid deep recursion when logging at level 3.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-22 00:37:30 -05:00
Digimer
05b1fccdb3 * Created Cluster->add_server() which, well, adds a server to a pacemaker cluster, including sorting out location constraints to favour the node the server is running on, if it's running.
* Removed the exit-if-no-DB check in ocf:alteeve:server so that (hopefully, needs testing), running servers won't be impacted if the nodes lost contact with both/all strikers.
* Updated scan-server to make an explicit check for missing XML definition files on startup and write them if needed.
* Very beginning work on anvil-delete-server has been started.
* Updated anvil-provision-server to wait when it's running in peer mode until the new XML definition is in the DB and then write it out to disk before exiting. Also updated it to add the new server to pacemaker before exiting.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-18 00:38:06 -05:00
Digimer
e0ceb5c65f * Provisioning servers is done!! This commit handles the virt-install call and adding the server to the database.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-17 02:54:41 -05:00
Digimer
8127c70237 * Support for OS type selection was missing in tools/anvil-provision-server, this commit adds support for this, as well as some infrastructure to support it. This includes a new 'sys::servers::os_short_list' variable that contains a CSV of main OSes to show in a "short" list (the full list is massive). This variable can be set by the user in anvil.conf. Also added job progress calls that were missing through the storage config.
* Created a new tools/striker-parse-os-list tool that parses 'osinfo-query os' and prints out entries for words.xml for any new OSes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-16 03:30:51 -05:00
Digimer
e82025ba61 * Got the DRBD configuation and start-up completed. Unlike M2, with M3 a server can be provisioned while the peer is disconnected or failed. Also cleaned up the 'run_jobs' function by breaking it up into a set of smaller function calls.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-16 00:47:14 -05:00
Digimer
056f8edf48 * Moved the scan-drbd function 'gather_drbd_date' and moved it into DRBD->gather_date().
* Finished DRBD->get_next_resource() that returns the next available minor and the next free TCP port (with two free ports available after it).
* Created Storage->get_storage_group_details() that pulls together the LVM, storage group members and storage groups into one block of data.
* Made more progress on tools/anvil-provision-server. It now gets up to the point of creating LVs, creating DRBD resource files, loading them, creating metadata and up'ing the resource. It doesn't yet (successfully) force a new resource to primary.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-15 03:10:16 -05:00
Digimer
162f4913b1 * Started work on DRBD->get_next_resource(), that will eventually return the next free DRBD minor and TCP port numbers.
* Fixed a bug in scan-drbd that was still looking for the scan_drbd_resource_uuid from the resource config file. Also added a check to see if 'scan-drbd::resource_status' directory exists before trying to read the files in it.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-13 01:31:44 -05:00
Digimer
a7f0676a0f * Got the 'anvil-provision-server' script to the point where it actually saves the new server job.
* Created the new method Cluster->get_primary_host_uuid() that returns the 'host_uuid' of the primary node in the given cluster. This is useful for external programs to figure out which node is primary. Example is provisioning a new server being assigned to the active node. Also created ->is_primary() that is a similar test to see if the active node is the primary node or not.
* Updated Cluster->parse_cib() and ->parse_crm_mon() to work on remote hosts.
* Updated Database->get_hosts() to store the short host names.
* Created Get->host_from_ip_address() that translates an IP address to a host_uuid and host_name, if it's an IP assigned currently to a known host.
* Created Network->find_target_ip() that simplifies finding which IP address to use when the caller wants to connect to a target host.
* Reworked the anvil-join-anvil to parse fence_arguments in a way that handles passwords with spaces in them.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-12 21:29:01 -05:00
Digimer
68ea6da1d3 * Finished the web interface components of the Anvil! File Manager! Files can be purged, sync'ed or removed from specific Anvil! systems, renamed and their file types changed (and setting/removing the executable bits) as needed.
* Fixed a bug in Database->insert_or_update_jobs() where the 'job_host_uuid' being set to 'all' only translated to a job for the running host.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-06 01:31:09 -05:00
Digimer
6da2b3b17b * Got more work done on file management. A file name is now clickable and that loads a menu to rename, change the file type, purge (delete from everywhere) and select which Anvil! systems the file belongs on. Got the code done to purge a file, but it's not tested yet.
Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-04 02:05:05 -05:00
Digimer
ea84ba68eb * Fixed a bug in tools/anvil-update-states that was causing deleted interfaces to update the network_interfaces every pass, growing the DB excessively.
* Cleaned up the file manager;
** Got the jquery file uploader JS to be sane and altered it to be more useful.
** Got the list of existing files to be displayed (links clickable but not working yet).

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-03 01:14:04 -05:00
Digimer
7d3c4371c7 * Renamed tools/striker-sync-shared to tools/anvil-sync-shared, as it's now designed to run on all machines. Got it to the point where it can be run on Anvil! members to pull down freshly uploaded files. It does so, when two or more strikers are available with the target file, load balancing such that one node downloads from one striker while another node downloads from the other striker. If there is three nodes, and if there is a DR host, the DR host will download from the third striker. If there are 1 or 2 strikers, the DR host will wait to download after both nodes have finished downloading.
* Cleaned up upload.pl now that it isn't responsible for loading the file details into the database. It only sets a job for the local Striker to process the file and move it into /mnt/shared/files, copy it to peer dashboards, then load jobs for Anvil! members to sync the new file.
* Created Database->get_files() and ->get_file_locations() to load the respective data.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-02 02:23:33 -05:00
Digimer
1a36f37065 * Got file uploads working!
* Got tools/striker-sync-shared to pick up 'upload::move_incoming' jobs, move the uploaded file to /mnt/shared/files/, copies it to peer dashboards, adds it to the 'files' table and adds it to 'file_locations'.
* Reworked the 'file_locations' table to now map files to Anvil! systems, not hosts. It simply tracks if a given file should be on Anvil! members or not. Later, striker-sync-shared on the Anvil! members will pull the file down.
* Updated Storage->get_file_stats() to record the file's mimetype.
* Fixed up a few issues in cgi-bin/upload.pl.

Signed-off-by: Digimer <digimer@alteeve.ca>
2021-01-01 04:09:17 -05:00
Digimer
7002c6f7cf * Fixed a bug in Get->available_resources() where re-calling it caused the available space on storage groups kept dropping. Also added recordings in the hash for the space reserved for the host and allocated to existing servers.
* Got tools/anvil-provision-server to the point where it askes up to the storage pool / size.
* Created the shell of 'tools/striker-sync-shared' that will sync /mnt/shared/files.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-30 23:02:33 -05:00
Digimer
f30cce3c5a * Created the new tools/anvil-provision-server tool which will handle provisioning new servers, as well as having an interactive menu system to provision servers from the command line.
* Created Cluster->assemble_storage_groups() and moved the logic to auto-assemble groups out of Get->available_resources().
* Created Cluster->get_anvil_name() that will return an Anvil! name for a given anvil_uuid, or the name of the Anvil! if the host is a member of an Anvil!.
* Updated Cluster->get_anvil_uuid() to return the 'anvil_uuid' if passed a specific 'anvil_name'.
* Updated Jobs->clear() to use 'switches::job-uuid' when a job_uuid is not passed but the value exists in 'switches::job-uuid'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-30 04:35:39 -05:00
Digimer
ddffc9d782 * Finished refining Get->available_resources(). It's now complete and can be used to show a user what is available and to validate new server creation commands (up next).
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-29 20:55:06 -05:00
Digimer
1ca695ff1b * Renamed Database->storage_group_data() to ->get_storage_group_data().
* Got Get->available_resources() finished, so now storage group free space is properly recorded as well.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-28 23:50:03 -05:00
Digimer
8f823d3b86 * Switched out the static list of core table to use the array generated by Database->get_tables_from_schema().
* Fixed bugs around creating and filtering storage groups.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-28 21:59:32 -05:00
Digimer
4f33eeef2e * This commit introduces a new concept called "Storage Groups". Given that LVs back DRBD resources in M3, there needed to be a way to determine which VGs would be used when creating the backing DRBD resources, and how large those LVs could be (based on the minimum free space of the VGs in a group). A new, as yet incomplete Get->available_resources() method will handle determining what resources are available to grow exist
ing or create new servers.
* Created Database->insert_or_update_storage_groups() and ->insert_or_update_storage_group_members() to manage the new associated tables. Together, these tables create storage groups and track the VGs that are members of the group.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-28 01:22:21 -05:00
Digimer
1d03a386d3 * Created Database->get_bridges() that, surprise, loads data from the 'bridges' table.
* Started work on Get->available_resources() that will take an 'anvil_uuid' and figure out what resources are still available for use by new servers or that can be added to existing servers.
* Fixed a bug in ScanCore->agent_startup() where tables weren't being generated properly from the agent's SQL file.
* Made Storage->change_mode() return silently if it's called without a mode being passed. This happens frequently and is harmless so it's not worth filling the logs with errors.
* Renamed the 'start_time' key to 'at_start' when recording files' MD5 sums in Storage->record_md5sums and ->check_md5sums.
* When we moved the directory scan logic out of the 'scancore' daemon and into 'Storage->scan_directory', the logic to record scan agent names in 'scancore::agent::<file>' was removed. This broke a few things and, so, it was restored when it was found that a file starts with 'scan-' and the directory matches the scancore agent directory.
* Moved the 'scancore' daemon's 'load_agent_strings' to 'Words'
* Updated Words->parse_banged_string() to look for variables in the format 'value=X:units=Y' and translate it properly.
* Fixed a bug in scan-ipmitool where discovered sensor INSERT SQL queries were queued, but not committed.
* Fixed a bug in scan-storcli where a while loop was broken, preventing execution.
* Fixed a bug in the 'scancore' daemon where it wouldn't exit if sums changed. Fixed a bug where alerts weren't being sent between loops. Fixed a bug where command-line log level wasn't surviving inside the main loop.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-23 01:10:23 -05:00
Digimer
96bc1f0b78 * Created Convert->fence_ipmilan_to_ipmitool() that takes a 'fence_ipmilan' call and converts it into a direct 'ipmitool' call.
* Created Database->get_power() that loads data from the special 'power' table.
* Fixed a bug in calls to Network->ping() where some weren't formatted properly for receiving two string variables.
* Updated Database->get_anvils() to record the machine types when recording host information.
* Updated Database->get_hosts_info() to also load the 'host_ipmi' column.
* Updated Database->get_upses() to store the link to the 'power' -> 'power_uuid', when available.
* Created ScanCore->call_scan_agents() that does the work of actually calling scan agents, moving the logic out from the scancore daemon.
* Created ScanCore->check_power() that takes a host and the anvil it is in and returns if it's on batteries or not. If it is, the time on batteries and estimate hold-up time is returned. If not, the highest charge percentage is returned.
* Created ScanCore->post_scan_analysis() that is a wrapper for calling the new ->post_scan_analysis_dr(), ->post_scan_analysis_node() and ->post_scan_analysis_striker(). Of which, _dr and _node are still empty, but _striker is complete.
** ->post_scan_analysis_striker() is complete. It now boots a node after a power loss if the UPSes powering it are OK (at least one has mains power, and the main-powered UPS(es) have reached the minimum charge percentage). If it's thermal, IPMI is called and so long as at least one thermal sensor is found and it/they are all OK, it is booted. For now, M2's thermal reboot delay logic hasn't been replicated, as it added a lot of complexity and didn't prove practically useful.
* Created System->collect_ipmi_data() and moved 'scan_ipmitool's ipmitool call and parse into that method. This was done to allow ScanCore->post_scan_analysis_striker() to also call IPMI on a remote machine during thermal down events without reimplementing the logic.
* Updated scan-ipmitool to only record temperature data for data collected locally. Also renamed 'machine' variables and hash keys to 'host_name' to clarify what is being stored.
* Updated scancore to clear the 'system::stop_reason' variable.
* Added missing packages to striker-manage-install-target.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-21 16:00:35 -05:00
Digimer
0b2407e78b * Added a really simple DRBD monitoring tool to the repo, will likely remove later.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-06 22:56:53 -05:00
Digimer
d8fc660a8b * Finished scan-drbd. Removed the code to track resources by UUID via resource config file. The UUID tracking it fairly easy messed up by manual edits/rsyncs, and the code to handle updating the UUID in the resource config files was complicated enough that it negated any benefit of avoiding resources being marked as removed / newly created on name change. Left the DRBD->resource_uuid() method though, in case we decide to revisit this.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-06 18:11:26 -05:00
Digimer
20b75310b7 * Created DRBD->resource_uuid() which reads (or adds) the 'scan_drbd_uuid' to the resource configuration file, which will allow us to track resources when they're renamed.
* Got scan-drbd's read_last_scan() and process_drbd() functions done (but untested so far).

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-01 23:59:54 -05:00
Digimer
7bf4c25f9b * Cleaned up the parsing of collected data in preperation for insertion/updating of the DB.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-12-01 04:25:15 -05:00