Commit Graph

60 Commits

Author SHA1 Message Date
Digimer
33101f969a * Fixed several bugs related to tracking server boots, migrations and shut downs in the anvil database. The 'ocf:alteeve:server' now has (mostly?) safe integration with the Anvil! database. This was mostly done by updating Servers->boot_virsh(), ->shutdown_virsh() and ->migrate_server().
* Updates servers -> server_host_uuid to drop the 'NOT NULL' constraint.
* Created the new Get->server_uuid_from_name() that does what it says on the tin. Fixed a bug in ->host_uuid_from_name() where the host name was being returned instead of the UUID.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-07 02:40:31 -04:00
Digimer
be88be6d30 * Did a bunch of testing / bugfixes for scan-server.
* Updated the servers table to remove the 'not null' constraints on the server_start_after_server_uuid, server_pre_migration_file_uuid and server_post_migration_file_uuid columns.
* Updated ScanCore->agent_startup() to connect to the database(s) when there isn't a table list.
* Updated Server->migrate_virsh() to test for DB access before making DB calls (to allow ocf:alteeve:server to function even if all ScanCore DBs are offline).
* Updated ocf:alteeve:server to connect to the databases (though work without it), and changed '$FILE_NAME' to be 'ocf:alteeve:server' (to make logging more legible)
* Created the skeleton for 'scan-storage'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-05 23:47:56 -04:00
Digimer
46f1a05789 * Got the code in scan-server to the point where it _should_ now gracefully and automatically detect changes to a server's definition originatin from the database (via Striker), directly editing the on-disk definition file, or editing via libvirt tools (like virt-manager). Still needs to be tested though.
* Updated Server->migrate_virsh() to set 'servers' -> 'server_state' to 'migrating' and clear it again once the migation completes. Also added support for cold (frozen) versus live migrations.
* Updated Cluster->parse_cib() to check if a server with the server_state set to 'migrating' isn't actually migrating anymore and, if not, to clear that state. This is needed as scan-server will blindly ignore/skip any migrating server, and if a migration call is interrupted, the state could get stuck.
* Updated the 'servers' database table (and associated Database methods) to add columns for;
** server_ram_in_use      - tracking RAM used by a running server
** server_configured_ram  - RAM allocated to a running server (used with the above to alert a user and track _currently_ available RAM)
** server_updated_by_user - To be set by Striker tools to indicate when the user made a change that needs to push out to nodes / running server.
** server_boot_time       - Tracks the unixtime when the server booted (to track uptime even if the server migrates across nodes).
* Created Get->anvil_name_from_uuid() to easily convert an Anvil! UUID into a name. Also created ->host_uuid_from_name() to translate a host name into a host UUID.
* Created Server->get_runtime() that translates a server name into a process ID and then uses that to determine how long (in seconds) it has been running. This is used when a server transitions from 'shut off' to 'running' to determine exactly when the server booted (current time - runtime).
* Renamed all 'Server->parse_definition' calls that used 'from_memory' to 'from_virsh' to clarify the data source.
* Made scan-hardware smarter about RAM change alerts.
* Updated scancore to load agent strings on startup so that processing pending alerts works properly.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-10-02 02:13:34 -04:00
Digimer
4dfe0cb5a0 * Created Cluster->boot_server, ->shutdown_server and ->migrate_server methods that handle booting, migrating and shutting down servers. Also created the private method ->_set_server_constraint which is used by migrate and boot to set resource constraints to control where a server boots or migrates to.
* Did more work on parsing server data out of the CIB. There is still an issue with determining which node currently hosts a resource, however.
* Renamed Server->boot to ->boot_virsh, ->shutdown to ->shutdown_virsh and ->migrate to ->migrate_virsh to clarify that these methods work on the raw virsh calls, outside of pacemaker (indeed, they are what the pacemaker RA uses to do what pacemaker asks).
* Got more work done on the scan-cluster SA.
* Created the empty files for the pending scan-server SA.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-24 02:09:18 -04:00
Digimer
0f7267eae1 * Moved the '_host_name', '_short_host_name', and '_domain_name' private methods in Tools.pm over to Get.pm (removing the leading '_' in the method names).
* Created 'Cluster->which_node' that returns 'node1' or 'node2' to indicate which node a host is.
* Continued working on scan_cluster; decided to make it not host-dependent.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-09-20 00:27:36 -04:00
Digimer
b2c7fd95fb * Renamed the ScanCore unit file to scancore.
* Added support to parsing location contraints to Cluster->parse_cib

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-24 12:54:31 -04:00
Digimer
14bf323627 * Fixed an issue with ocf:alteeve:server where, after a migration, the target host would invoke the RA as if it was trying to migrate, instead of verifying the server (resource) was OK post migration.
* Fixed a bug in Server->get_status() where the call to Storage->rsync's returned output checked for '!!errer!!' instead of '!!error!!'.
* Fixed a bug in Storage->rsync where, when no port was passed in, it would try to specify an empty port and fail.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-20 06:32:19 -04:00
Digimer
1498e1b53c * Got server migration working using ocf:alteeve:server in a test environment!
* Converted most 'eval { }' calls to localize $@ and test the output of the eval, instead of checking to see if $@ was set.
* Converted all 'local' hash references to instead use the short host name of the local machine as a new standard.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-19 18:54:09 -04:00
Madison Kelly
30f2b3fa8e * Switched all hash 'local' keys to be the host's short user name. Untested, likely bugs to be fixed in the next commit.
Signed-off-by: Madison Kelly <mkelly@alteeve.ca>
2020-08-18 19:34:08 -04:00
Digimer
47203490a9 * Working on getting live migration to work with ocf:anvil:striker using the environment variables that pacemaker sets. Incomplete, but getting close.
* Added support to Cluster->parce_cib() to track if maintenance mode is set.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-17 11:55:49 -04:00
Digimer
cc1e0e2f77 * Updated ocf:alteeve:server to properly report a server's status when a monitor action is called.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-14 23:16:34 -04:00
Digimer
6eeb4e48c7 * Added 'timeout' and 'wfc-timeout' to drbd'd global-common.conf in DRBD->update_global_common().
* Fixed a minor bug in ocf:alteeve:server found in testing the RA.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-14 00:24:02 -04:00
Digimer
e35800c413 * Fixed up (though more testing/work needed) to ocf:alteeve:server to get it working with DRBD resources referenced using '/dev/drbd/by-res/...'.
* Added the anvil.conf option 'sys::privacy::strong' that controls if the Anvil! ever "calls home". Initially, this controls DRBD's usage flag.
* Updated DRBD->get_devices() to track resources by their 'by-res' names as well and by the normal '/dev/drbdX' devices.
* To mitigate https://bugzilla.redhat.com/show_bug.cgi?id=1868467, updated Get->bridges() to parse the normal (non-JSON) data if we get invalid JSON output.
* Updated anvil-join-anvil to not disable, and in fact enable, libvirtd on boot. With DRBD 9, the original fear of a user accidentally booting a VM that's running on the peer no longer is an issue. By enabling it and leaving it on, Striker dashboard users won't lose their virtual machine manager link unless the node powers off. Also enabled actually updating the job progress, completing this tool!

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-08-13 00:12:20 -04:00
Digimer
01974d7efe * Finished (though testing is needed) the updated ocf:alteeve:server resource agent. It now handles starting and stopping libvirtd and drbd daemons on-demand.
Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-16 00:33:00 -04:00
Digimer
dcd1fd1492 * Created Cluster->check_node_status() that checks the status of a node (in pacemaker).
* Created Cluster->get_peers() that figures out who the peer node (and DR host, if applicable) are.
* Updated Cluster->parse_cib() to dig out more information.
* Created Cluster->start_cluster() to start pacemaker (via pcsd) locally or on all (both) nodes.
* Started working on ocf:alteeve:server to start/stop the libvirtd/drbd daemons as needed, instead of having pacemaker do it.
* Got more work done on anvil-join-anvil. Node 2 now waits for the cluster to start, and node 1 will do setup as needed, then wait for the cluster to start as well.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-07-15 18:40:37 -04:00
Digimer
726a4374d1 * Renamed the database table 'host_keys' to 'ssh_keys' to better represent what it stores.
* Updated 'variables' -> 'variable_source_uuid' to type 'uuid' and removed the 'not null' constraint.
* Updated Database->insert_or_update_variables() to check/update 'variables_source_table' and 'variables_source_uuid'.
* Created the 'trusts' database table which will, when done, tell anvil-daemon which users@machines to trust (setup passwordkess SSH).
* Created (but not finished) System->manage_authorized_keys() and moved the logic over to it from anvil-daemon.
* Changed the host types "dashboard" to "striker".
* Moved the following methods from 'System' to 'Get';
** System->get_host_type to Get->host_type
** System->get_bridges to Get->bridges
** System->get_free_memory to Get->free_memory
** System->get_os_type to Get->os_type
** System->get_uptime to Get->uptime
* Updated striker to include the host_uuid for the 'node1', 'node2' and (if chosen) 'dr1' when running a job manifest.

Signed-off-by: Digimer <digimer@alteeve.ca>
2020-06-10 18:26:50 -04:00
Digimer
934c9b1286 * Updated logging to now log anything with 'priority' set to a new 'anvil.alert.log' file (while still also logging as normal to anvil.log). This should make it easier to watch for alert messages.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-10-22 21:17:30 -04:00
Digimer
7cdd2f60e9 * Created Network->download() to handle downloading a file on the local system. Created ->bridge_info() to parse 'bridge' output. Created ->load_ips() to load IP address information from the database (as opposed to ->get_ips() which queries a system).
* Created Database->insert_or_update_bridge_interfaces() to handle updating the new bridge_interfaces database table that records which network interfaces are connected to which bridges.
* Added 'bridge_mac' and 'bridge_mtu' to the 'bridges' table.
* Started Server->map_network which will, eventually, try to map MAC addresses to IPs and record server -> vnetX -> bridge data. Made getting a server status target-dependent.
* Worked on anvil-update-states to parse and record bridge data.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-10-12 00:32:32 -04:00
Digimer
3a86bed694 * Fixed tools/striker-initialize-host so that it set the hostname on the target, not locally.
* Updated System->host_name to work locally and on remote targets.
* Renamed all 'hostname' instances to 'host_name' to standardize on a spelling throughout the program.
* Removed use of and dependency on 'hostname'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-10-02 15:39:21 -04:00
Digimer
bc341809ca * Finished (for now) ocf:alteeve:server! It can boot, migrate and stop a server cleanly. It still checks to see if DRBD needs to be started and does so when needed, but it won't stop it anymore.
* Fixed a couple typos in tools/anvil-check-memory that prevented it from running.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-09-04 18:58:43 -04:00
Digimer
8a2c86d088 * Renamed striker-configure-host (back) to anvil-configure-host, and started updating it to work on any machine type.
* Created tools/anvil-check-memory to report how much RAM is used by a given program.
* Added documentation for some previously undocumented methods.
* Updated Database->archive_database() to take the 'tables' parameter.
* Updated Storage->scan_directory() to record a directory's mode and type, even when recursive isn't used.
* Finished System->check_memory().
* Updated ocf:alteeve:server to now NOT stop a DRBD resource unless 'stop_drbd_resources'.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-09-03 14:07:39 -04:00
Digimer
c0dd34334e * Fixed another bug in making ocf:alteeve:server work in pacemaker.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-21 01:14:10 -04:00
Digimer
ed2e83a1a4 * Fixed a few more bugs in 'ocf:alteeve:anvil', but it's still failing when invoked by pacemaker.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-20 23:46:59 -04:00
Digimer
f5caec52dc * Made DRBD->allow_two_primaries() smarter about finding the 'target_node_id' when it wasn't passed.
* Fixed a couple bugs, and now ocf:alteeve:server properly can pull and push servers between nodes.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-20 00:59:41 -04:00
Digimer
113a44ecc6 * Got 'migrate_to' working in ocf:alteeve:server. 'migrate_from' still needs work.
* Created DRBD->allow_two_primaries() and ->reload_defaults() that enables (and resets/disables) dual-primary operation (allow-two-primaries=yes), used to enable live migration.
* Created Remote->test_access() that simply verifies that a remote target can be accessed (as a given user).
* Created Server->migrate() that actually migrates a server. It can push already, and pull will be added next.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-16 01:41:47 -04:00
Digimer
7db542b9b0 * Fixed a bug where definition files that used '<source file='X'/>' instead of '<source dev='X'/>' for the backing block device for disks.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-14 01:48:36 -04:00
Digimer
873ed3e2b0 * Fixed some typo bugs.
* Added stop_drbd_resources() to ocf:alteeve:server.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-12 23:15:13 -04:00
Digimer
dff74102db * Create (but not yet tested) Server->shutdown() to, well, shutdown servers.
* Modified Server->boot() to now only work locally. Also updated it to optionally take the XML definition path for the server to boot.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-10 01:35:08 -04:00
Digimer
d224be9344 * Created DRBD->manage_resource() that allows for up/down/primary/secondary'ing a resource on a local or remote system.
* Created Server->boot() that starts a server and verifies that it did actually start.
* Got ocf:anvil:server smarter about starting DRBD resources, properly handing resources where auto-promote isn't enabled. The 'start' process is now complete (baring bugs).

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-08 22:58:21 -04:00
Digimer
b1ddf945e2 * Got ocf:alteeve:server working again to boot servers. It's now smarter, knowing when the server is running locally already (success), running on the other node (hard error) and running on DR (fatal error).
* Updated DRBD->get_devices() to store data all under 'drbd::config::<host>::x'.
* Created Server->find() that takes a target and collects the servers running on it.
* Updated System->check_storage() to redirect all calls STDERR to /dev/null to supress errors about failing to open /dev/drbdX when LVM's filter isn't setup.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-08 01:10:38 -04:00
Digimer
324ef351fe * Updated DRBD->get_devices() to properly identify the peer node, when run on an actual node in the cluster (not DR or Striker).
* Created System->active_lv() that, surprise, activates an inactive logical volume. Also created ->check_storage() that parses out the LVM data.
* Fixed a bug in tools/fence_pacemaker that was preventing it from compiling and running.
* Updated ocf:alteeve:server to validate the target server's storage.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-06 23:31:35 -04:00
Digimer
16f79ca244 * Created System->get_bridges() that gets a list of bridges (and connected interfaces, and data). Also created ->get_free_memory() that returns the amount of available RAM.
* Got ocf:alteeve:server validating bridges.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-03 01:38:12 -04:00
Digimer
4a93682447 * Started rebuilding ocf:alteeve:server using the new module methods.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-08-02 00:06:06 -04:00
Digimer
7a7e3db0c1 * Created DRBD->get_devices() that finds and maps the /dev/drbdX devices to their resources and backing LVs.
* Got Server->get_status() and ->_parse_definition() pulling out all but the device bus data from the XML files. Under devices, started parsing, with 'channel' devices now being parsed.
* Fixed a bug in several Storage->X calls where the test to see if a 'target' was the local host or not wasn't smart enough.
* Purged 'ocs:alteeve:server' to prepare for the rewrite where a lot of the logic is being moved into the Anvil::Tools modules as their functionality will be needed elsewhere anyway.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-30 02:10:04 -04:00
Digimer
312c949648 * Actually added the new Anvil::Tools::DRBD module, as well as a new ::Server module that will handle anything related to the virtual servers.
* Started a re-write of the ocf:alteeve:server resource agent. It has bugs and is missing logic, and trying to clean it up given most of the bugs come from trying to clean up the original agent doesn't make sense. Moving much of the logic into module methods as the functions will be needed elsewhere anyway.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-25 01:12:25 -04:00
Digimer
0f873c45b5 * Created the new Anvil::Tools::DRBD moduke to hold all DRBD related stuff. Started working of ->get_status, still very much a work in progress.
* Started working on ocf:alteeve:server to make it smarter (and more patient) when bringing a DRBD resource up so we don't get false failures when we hit a race.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-23 02:37:14 -04:00
Digimer
f294d48777 * Updated notes with working pcs config example (enable, disable and migrate all work!)
* Minor fix to ocf/alteeve/server to not print anything except the XML meta information when requested

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-20 01:36:02 -04:00
Digimer
a5761df894 * Got migration (push and pull) working.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-19 01:05:30 -04:00
Digimer
0dc2e5d2e9 * Fixed up an issue with down'ing resources under a server after it's shut down.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-18 03:18:38 -04:00
Digimer
3cf1ad2ff8 * Fairly heavy rework of ocf:alteeve:server to update how it handles storage during server start. It now properly handles the new "one resource, multiple volumes" and "start resource, not daemon" approach.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-18 02:53:05 -04:00
Digimer
7e4a170382 * Fixed a bug where Tools.pm->_anvil_version() and Get->host_uuid() were storing values in the wrong $anvil hash.
* Fixed a bug where Get->host_uuid() wasn't reading from the host.uuid file.
* Updated Remote->call() to record a target's fingerprint when needed.
* The ocf:alteeve:server resource agent now properly stopps a server and the corresponding DRBD resource.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-17 02:41:05 -04:00
Digimer
b56fbf923c * Finished the initial convertion of ocf:alteeve:server to use Anvil::Tools.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-16 00:14:13 -04:00
Digimer
c0220c9635 * Continues work on migrating ocf:alteeve:server to use Anvil::Tools.
Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-14 01:01:50 -04:00
Digimer
9c0f6b8f79 * Added automatic 'echo return_code:$?' to System->call and Remote->call which is parsed out and returned automatically on all calls.
* Started porting ocf:alteeve:server to use the Anvil::Tools module and updating it for RHEL 8.

Signed-off-by: Digimer <digimer@alteeve.ca>
2019-07-13 04:16:03 -04:00
Digimer
eafd4fd3f7 * Fixed a couple bugs to get System->change_shell_user_password() working.
* Made logging between journald and a traditional file configurable via 'sys::log_file'. Also made the file handle unbuffered when logging to a file.
* Fixed a bug with loading the anvil.conf config file in a few locations.
* Created System->stty_echo() to handle enabling/disabling shell echo, and added restoring the echo to Tools->catch_sig.

Signed-off-by: Digimer <digimer@alteeve.ca>
2018-04-26 12:41:03 -04:00
Digimer
c21b326f1a * Changed all methods to take a 'debug' argument for setting log level on calls.
* Fixed a bug with resync, but others remain as resync is incomplete (at least for network_interfaces).
* Currently, tools/anvil-update-states is broken while working on the above issue.
* Reworked the jobs table and removed the units/anvil-jobs.service unit. Jobs will be invoked and backgrounded in all calls.
* Started adding missing hidden form fields.
* Updated the 'server' OCF resource agent version and metadata.

Signed-off-by: Digimer <digimer@alteeve.ca>
2018-03-07 03:11:55 -05:00
Digimer
92a1e29082 * The new resource agent works!
** Fixed a bug so that when the agent is invoked on the target node after a migration, it just does a quick check to see if the server is running and exists 0 if so.

Signed-off-by: Digimer <digimer@alteeve.ca>
2018-02-21 23:24:53 -05:00
Digimer
3b0659c5bf * Looks like the RA is done, though more testing is needed.
Signed-off-by: Digimer <digimer@alteeve.ca>
2018-02-21 13:54:49 -05:00
Digimer
f52d8196f6 * Migration is now sort of working. There is still an issue to sort out with enabling drbd dual-primary, but server can move is some cases now.
* Changed fence_pacemaker to exit with '1' on generic error as per LINBIT's comments.

Signed-off-by: Digimer <digimer@alteeve.ca>
2018-02-21 02:06:00 -05:00
Digimer
4e5dc9f1c2 * Started work on migration handling.
* Fixed a bug where a stop operation on a server already in shutdown would exit immediately instead of waiting for the server to actually shut off.

Signed-off-by: Digimer <digimer@alteeve.ca>
2018-02-20 02:14:59 -05:00