The Big Refactor
Currently we stack managed resources into two categories: a System Under Test (SUT), and its Processes. This stems from a historical context, where RabbitMQ's testing required a cluster of nodes, which was then abstracted into what we have now. The distinction is, for the most part, one of name alone.
Instead of maintaining this design, we will abstract out the managed resource into a concept of its own. A resource will have an identity, scope and life-cycle (events, to which handlers can still be attached) and will be managed and accessed globally per test controller.
We will introduce the concept of connectivity between resources, giving us the ability to declare relationships of various kinds, including dependencies and scope for re-use.
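To make the resource concept concrete, here is a minimal, language-neutral sketch in Python (not SysTest's Erlang code). It assumes one registry per test controller, resources looked up by identity, and relationships stored as typed links. The names `Resource`, `Registry`, `link` and the `depends_on` kind are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Handler = Callable[["Resource", str], None]


@dataclass
class Resource:
    """A managed resource: identity, scope and life-cycle event handlers."""
    identity: str
    scope: str  # hypothetical values, e.g. "suite" or "testcase"
    handlers: Dict[str, List[Handler]] = field(default_factory=dict)

    def on(self, event: str, handler: Handler) -> None:
        """Attach a handler to a life-cycle event."""
        self.handlers.setdefault(event, []).append(handler)

    def emit(self, event: str) -> None:
        """Run every handler attached to the given event, in order."""
        for handler in self.handlers.get(event, []):
            handler(self, event)


class Registry:
    """One registry per test controller; resources are found by identity."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}
        self._links: List[Tuple[str, str, str]] = []  # (source, kind, target)

    def register(self, resource: Resource) -> Resource:
        if resource.identity in self._resources:
            raise ValueError(f"duplicate resource: {resource.identity}")
        self._resources[resource.identity] = resource
        return resource

    def lookup(self, identity: str) -> Resource:
        return self._resources[identity]  # KeyError if unknown

    def link(self, source: str, kind: str, target: str) -> None:
        """Declare a relationship; both ends must already be registered."""
        self.lookup(source)
        self.lookup(target)
        self._links.append((source, kind, target))

    def dependencies(self, identity: str) -> List[str]:
        """Identities that the given resource depends on."""
        return [t for (s, k, t) in self._links
                if s == identity and k == "depends_on"]
```

Keeping links in the registry rather than inside each resource is one way to let relationships carry a life-cycle and policy of their own later on.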
In managing resources (and their inter-relationships) explicitly, we will move control flow from its current location (in the common-test hooks) into its own space. We also allow for life-cycle and policy management on relationships - viz, these become resources in their own right - such that we can establish short-lived actions that affect one or more dependent resources, such as running an admin tool/script as part of starting/stopping/managing an existing, running resource.
This presents numerous attractive possibilities for improvement:
- we avoid blocking the test_server and dead-locked CI runs
- we facilitate a cleaner separation of resource management from policy application
- we increase potential (re-)use across test frameworks
As part of this delivery, we will completely remove the systest_watchdog, as it is a known source of dead-locks in production tests at RabbitMQ.
The current SUT/Process configuration allows basic re-use of resources, but its behaviour is horrifically opaque and hard to even document, let alone understand. The only good thing about the current configuration is our ability to inject data into an object dynamically (from the surrounding environment, a registered configuration table, or settings file). Even this is far from ideal, as configured data is seldom visible when it's actually useful (e.g., in logging/debugging output, we tend to see the parameter keys, rather than the instantiated/parameterised values).
We will introduce the concept of a resource instance, which is simply the resource type fully populated and configured as it would be at the point of use (in a test case). Resource types and instances may reference one another inter-changeably, such that a type or instance can refer to a dependent (type or instance) cleanly. This will cater for situations where, for example, we wish to partially specify a resource that represents a set of dependents (e.g., a cluster of Erlang nodes) and realise this (type) many times, with differing sets of dependent resources (either types, or instances).
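The type/instance split might look something like the following Python sketch (language-neutral, not SysTest code). It assumes that a type carries parameter templates and a default list of dependents, and that realising it substitutes values from an environment. `ResourceType`, `ResourceInstance`, `realise` and the `${id}` placeholder are hypothetical names.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Dict, List, Mapping, Optional, Sequence, Union


@dataclass
class ResourceInstance:
    """A resource type fully populated, as it would be at the point of use."""
    identity: str
    type_name: str
    params: Dict[str, str]
    dependents: List["ResourceInstance"]


@dataclass
class ResourceType:
    """A partially specified resource; params may hold ${key} placeholders."""
    name: str
    params: Dict[str, str] = field(default_factory=dict)
    dependents: List[Union["ResourceType", ResourceInstance]] = field(
        default_factory=list)

    def realise(
        self,
        identity: str,
        env: Mapping[str, str],
        dependents: Optional[
            Sequence[Union["ResourceType", ResourceInstance]]] = None,
    ) -> ResourceInstance:
        # A caller may swap in a different set of dependents per realisation.
        chosen = self.dependents if dependents is None else dependents
        realised: List[ResourceInstance] = []
        for index, dep in enumerate(chosen):
            if isinstance(dep, ResourceType):
                # Types are realised on demand; instances are re-used as-is.
                dep = dep.realise(f"{identity}.{dep.name}{index}", env)
            realised.append(dep)
        scope = {**env, "id": identity}
        # substitute() raises KeyError on a missing key, so an instance
        # never carries an unresolved placeholder.
        params = {key: Template(value).substitute(scope)
                  for key, value in self.params.items()}
        return ResourceInstance(identity, self.name, params, realised)
```

Because an instance holds the substituted values rather than the keys, logging it shows what was actually configured.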
We will provide a simplified configuration API, designed to meet the configuration management needs of the resource concept. This refactoring task is at least as large/difficult as providing the resource management capability itself.
We will make configuration parsing a pluggable API, allowing for alternative configuration schemas (such as JSON, XML, etc).
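One possible shape for such a plug-in point, sketched in Python for illustration only: parsers are registered under a format name and must return a plain mapping. The names `register_parser` and `parse_config`, and the toy `kv` format, are hypothetical.

```python
import json
from typing import Any, Callable, Dict

Parser = Callable[[str], Dict[str, Any]]

_PARSERS: Dict[str, Parser] = {}


def register_parser(fmt: str, parser: Parser) -> None:
    """Register a parser under a (case-insensitive) format name."""
    _PARSERS[fmt.lower()] = parser


def parse_config(fmt: str, text: str) -> Dict[str, Any]:
    """Parse configuration text with whichever parser owns the format."""
    try:
        parser = _PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no configuration parser registered for {fmt!r}") from None
    config = parser(text)
    if not isinstance(config, dict):
        raise TypeError("a configuration parser must return a mapping")
    return config


def parse_key_value(text: str) -> Dict[str, Any]:
    """Toy 'key = value' format; blank lines and '#' comments are skipped."""
    config: Dict[str, Any] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got {line!r}")
        config[key.strip()] = value.strip()
    return config


register_parser("json", json.loads)
register_parser("kv", parse_key_value)
```

The rest of the configuration API only ever sees the returned mapping, so adding a schema means registering one more function.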
Currently the execution of hooks (i.e., behaviours) is baked into the SUT, proc and cli modules. We do have systest_hooks, which attempts to segregate this to some extent, but that is not a complete solution. The execution of hooks ought to reside completely outside of the resource implementations themselves. Resource implementations should provide only two things: (1) a defined life-cycle and (2) an implementation of zero or more of these defined life-cycle states.
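As a rough illustration of that split (a Python sketch, not SysTest code), the resource below only declares its life-cycle and implements some of the states, while a separate runner drives the transitions and executes hooks. `NodeResource`, `HookRunner` and the state names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


class NodeResource:
    """Declares a life-cycle and implements only some of its states."""

    lifecycle: Sequence[str] = ("starting", "started", "stopping", "stopped")

    def __init__(self) -> None:
        self.log: List[str] = []

    def starting(self) -> None:
        self.log.append("boot")

    def stopping(self) -> None:
        self.log.append("shutdown")

    # "started" and "stopped" are deliberately left unimplemented.


class HookRunner:
    """Lives outside the resource: owns the hooks and the control flow."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Callable[[object, str], None]]] = {}

    def add_hook(self, state: str, hook: Callable[[object, str], None]) -> None:
        self._hooks.setdefault(state, []).append(hook)

    def drive(self, resource) -> List[str]:
        """Walk the declared life-cycle, returning the states visited."""
        visited: List[str] = []
        for state in resource.lifecycle:
            implementation = getattr(resource, state, None)
            if callable(implementation):
                implementation()          # the resource's own behaviour
            for hook in self._hooks.get(state, []):
                hook(resource, state)     # hooks never live in the resource
            visited.append(state)
        return visited
```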
Take, for example, the SUT resource - this is essentially little more than a container for one or more resource dependencies. The SUT implementation ought to provide little more than a wiring between the life-cycle events of its children and a realisation (type or instance). A default wiring (based on what we currently have) would look like:
all-links::established/started => user::on_start
::on_join => all-links::on_join
For a typical RabbitMQ configuration, this looks something like:
nodes::started => make_cluster
on_join => nodes::connect_to_node (establish AMQP connectivity)
on_stop => nodes::disconnect_from_node (disconnect AMQP connectivity)
Currently this is wired into a module, but that wiring ought to be configuration driven for all resources, giving us the ability to define "SUT" using configuration data alone.
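To illustrate what configuration-driven wiring could mean, the Python sketch below parses rules of the form `[source::]event => [target::]action` into data and dispatches on them. The grammar is only a guess based on the RabbitMQ example above; it does not handle the `/` alternatives or the empty-scope form used in the default wiring, and `Rule`, `parse_rule`, `load_wiring` and `dispatch` are hypothetical names.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    source: Optional[str]  # None means "the resource being wired"
    event: str
    target: Optional[str]  # None means "the resource being wired"
    action: str


def _split(term: str) -> Tuple[Optional[str], str]:
    """Split 'scope::name' into (scope, name); a bare name has no scope."""
    scope, sep, name = term.strip().partition("::")
    if sep:
        return scope.strip(), name.strip()
    return None, scope.strip()


def parse_rule(line: str) -> Rule:
    left, sep, right = line.partition("=>")
    if not sep:
        raise ValueError(f"expected 'event => action', got {line!r}")
    source, event = _split(left)
    target, action = _split(right)
    return Rule(source, event, target, action)


def load_wiring(text: str) -> List[Rule]:
    return [parse_rule(line) for line in text.splitlines() if line.strip()]


def dispatch(
    rules: List[Rule],
    source: Optional[str],
    event: str,
    actions: Dict[Tuple[Optional[str], str], Callable[[], None]],
) -> List[Tuple[Optional[str], str]]:
    """Run every action wired to (source, event); return what was fired."""
    fired = []
    for rule in rules:
        if (rule.source, rule.event) == (source, event):
            actions[(rule.target, rule.action)]()
            fired.append((rule.target, rule.action))
    return fired


# The RabbitMQ wiring from the text, expressed purely as data.
RABBIT_WIRING = """
nodes::started => make_cluster
on_join => nodes::connect_to_node
on_stop => nodes::disconnect_from_node
"""
```

With the rules held as data, defining a new "SUT" is a matter of supplying a different wiring table rather than a new module.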
The CLI module is another case in point. This implementation should handle the execution (and life-cycle management) of an external program and nothing else. That RabbitMQ cli processes want to run an additional/separate program to terminate the resource (viz. rabbitmqctl stop is used as a shutdown hook) is incidental - this should be handled via a short-lived relationship with an additional resource!
Policy, ranging from error handling to hooks to timeouts, should be something that we can define independently of a resource and apply at any scope we like, attached to whatever life-cycle events (i.e., hooks) are compatible with that policy's interface.
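A small Python sketch of what "policy as a separate thing" might look like, using a retry-on-error policy as the example. This is one assumption about the interface, not a design decision from this page: a policy names the events it is compatible with and wraps whatever handler is attached to them. `RetryPolicy` and `attach` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical error-handling policy: retry a failing handler."""
    attempts: int
    compatible_events: FrozenSet[str]

    def __post_init__(self) -> None:
        if self.attempts < 1:
            raise ValueError("attempts must be at least 1")

    def wrap(self, handler: Callable[..., Any]) -> Callable[..., Any]:
        def guarded(*args: Any, **kwargs: Any) -> Any:
            for remaining in range(self.attempts - 1, -1, -1):
                try:
                    return handler(*args, **kwargs)
                except Exception:
                    if remaining == 0:
                        raise  # out of attempts: surface the last error
        return guarded


def attach(policy: RetryPolicy, event: str,
           handler: Callable[..., Any]) -> Callable[..., Any]:
    """Apply a policy to a handler, but only for events it is compatible with."""
    if event not in policy.compatible_events:
        raise ValueError(f"policy is not compatible with event {event!r}")
    return policy.wrap(handler)
```

The handler and the resource know nothing about the policy; it can be applied at whatever scope registers the wrapped handler.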
This task is characterised by the spaghetti inter-dependency between systest_proc and systest_cli. When regarding our fledgeling systest_resource implementation, it is clear that systest_proc is doing a lot of work that generalises well to all kinds of resources. On the other hand, systest_cli looks very much like a specialised kind of resource (for running external programs), but we don't really make any effort to differentiate between start/stop callbacks and the whole runtime/life-cycle management of the program itself. Classic example: RabbitMQ nodes as a proc/cli resource type.
- start: proc => cli is used to run rabbitmq-server start
- runtime: interaction takes place mainly using rpc calls (though cli wouldn't be amiss as such)
- stop: proc => cli is used to run rabbitmqctl stop, but this uses a simple open_port + recv loop!
Why doesn't the stop behaviour do the same fully fledged CLI life-cycle management as the start | running states do? There's really no good reason, and many of the problems we see in the RabbitMQ multi-node test suites are doubtless due to this inconsistency.
The CLI then, should become a resource type all of its own, still conforming to the API but not attempting to 'shell out' in order to run shutdown hooks.
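The idea can be sketched as follows, in Python rather than Erlang and purely as an illustration: the stop command is itself a managed program with the same life-cycle handling as the start command, held as a short-lived related resource. `ManagedProgram` and `CliResource` are hypothetical names; only the standard `subprocess` module is used.

```python
import subprocess
from typing import Optional, Sequence, Tuple


class ManagedProgram:
    """Runs one external program and tracks its whole life-cycle."""

    def __init__(self, argv: Sequence[str]) -> None:
        self.argv = list(argv)
        self.state = "created"
        self.exit_code: Optional[int] = None
        self._proc: Optional[subprocess.Popen] = None

    def start(self) -> None:
        self._proc = subprocess.Popen(
            self.argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        self.state = "running"

    def wait(self, timeout: float) -> int:
        """Wait for exit; kill the program if it outlives the timeout."""
        if self._proc is None:
            raise RuntimeError("program was never started")
        try:
            self.exit_code = self._proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self._proc.kill()
            self.exit_code = self._proc.wait()
        self.state = "stopped"
        return self.exit_code


class CliResource:
    """An external program plus a short-lived, equally managed stop program."""

    def __init__(self, start_argv: Sequence[str],
                 stop_argv: Sequence[str]) -> None:
        self.main = ManagedProgram(start_argv)
        self.stopper = ManagedProgram(stop_argv)

    def start(self) -> None:
        self.main.start()

    def stop(self, timeout: float = 10.0) -> Tuple[int, int]:
        """Run the stop program to completion, then reap the main program."""
        self.stopper.start()
        stop_code = self.stopper.wait(timeout)
        main_code = self.main.wait(timeout)
        return stop_code, main_code
```

For a RabbitMQ node, `start_argv` and `stop_argv` would hold the start and stop commands named above, and both would get the same exit-status and timeout handling.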