
feat: Added support for creating shared LVM setups #388

Merged: 2 commits merged into linux-system-roles:main from the main-shared_vg_support branch on Dec 12, 2023

Conversation

@japokorn (Collaborator) commented Oct 2, 2023

Enhancement:
Support for creating shared VGs

Reason:
Requested by GFS2

Result:

tests/test-verify-pool.yml: review thread (outdated, resolved)
@japokorn force-pushed the main-shared_vg_support branch from 468e363 to 592f745 on October 3, 2023 07:26
@codecov bot commented Oct 3, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (c4147d2) 13.67% compared to head (7d8b953) 13.65%.
Report is 15 commits behind head on main.

Files Patch % Lines
library/blivet.py 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #388      +/-   ##
==========================================
- Coverage   13.67%   13.65%   -0.03%     
==========================================
  Files           8        8              
  Lines        1733     1736       +3     
  Branches       79       79              
==========================================
  Hits          237      237              
- Misses       1496     1499       +3     
Flag Coverage Δ
sanity 16.54% <ø> (ø)

Flags with carried forward coverage won't be shown.


@richm (Contributor) commented Oct 11, 2023

ping - any updates?

README.md: review thread (resolved)
@japokorn force-pushed the main-shared_vg_support branch from 592f745 to a6ff2fe on October 12, 2023 15:07
@japokorn (Collaborator, Author) commented:

New blivet version has been released today. I have incorporated all suggestions into the code and uncommented the test.

@japokorn force-pushed the main-shared_vg_support branch 3 times, most recently from 9e201d3 to 8dc721a on October 12, 2023 15:25
@japokorn marked this pull request as ready for review on October 12, 2023 15:55
@japokorn force-pushed the main-shared_vg_support branch from 8dc721a to 988c73f on October 12, 2023 18:54
@richm (Contributor) commented Nov 9, 2023

With the suggested fixes, I can run the test up until here on centos-9:

TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] ***
task path: /home/rmeggins/linux-system-roles/storage/tests/roles/linux-system-roles.storage/tasks/main-blivet.yml:73
Thursday 09 November 2023  12:58:52 -0700 (0:00:00.017)       0:03:16.483 ***** 
fatal: [/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2]: FAILED! => {
    "actions": [],
    "changed": false,
    "crypts": [],
    "leaves": [],
    "mounts": [],
    "packages": [],
    "pools": [],
    "volumes": []
}
MSG:

failed to set up pool 'vg1': __init__() got an unexpected keyword argument 'shared'
    def _create(self):
        if not self._device:
            members = self._manage_encryption(self._create_members())
            try:
                pool_device = self._blivet.new_vg(name=self._pool['name'], parents=members, shared=self._pool['shared'])
            except Exception as e:
                raise BlivetAnsibleError("failed to set up pool '%s': %s" % (self._pool['name'], str(e)))

what version of blivet has the support for shared? Is it in centos9 yet?

@japokorn force-pushed the main-shared_vg_support branch 3 times, most recently from d6181b4 to 7003e18 on November 15, 2023 12:57
@japokorn (Collaborator, Author) commented:

what version of blivet has the support for shared? Is it in centos9 yet?

I have added a switch that skips the test if needed, based on the blivet version, as suggested by vtrefny in #388 (comment)

Review comment on this hunk of the test:

meta: end_host
when: inventory_hostname == "localhost"

- name: Gather package facts

Suggested change (Contributor): replace "- name: Gather package facts" with:

    - name: Run the role to install blivet
      include_role:
        name: linux-system-roles.storage
      vars:
        storage_pools: []
        storage_volumes: []

    - name: Gather package facts

@richm (Contributor) commented Nov 15, 2023

otherwise, the task Set blivet package name fails, as blivet is not installed.
Are you able to run this test locally on your laptop?

@japokorn (Collaborator, Author) replied:

I moved the initial storage role (which installs blivet) run before the check.
While running the test I also noticed that the HA cluster role's test_setup.yml contains this task:

  - name: Set node name to 'localhost' for single-node clusters
    set_fact:
      inventory_hostname: localhost  # noqa: var-naming
    when: ansible_play_hosts_all | length == 1

I am not sure what its purpose is in the role, but it messed up the test when I tried to run it on a single remote node. I replaced it with a task that changes the inventory name from 'localhost' to '127.0.0.1', and that seems to do the trick.

@japokorn force-pushed the main-shared_vg_support branch from 7003e18 to 0ed33f2 on November 28, 2023 18:12
- feature requested by GFS2
- adds support for creating shared VGs
- a shared LVM setup needs the lvmlockd service with the dlm lock manager running
- to test this change, the ha_cluster system role is used to set up a degenerate cluster on localhost
- the test will be skipped if run locally due to an issue with the underlying services
- requires a blivet version with shared LVM setup support (storaged-project/blivet#1123)
@japokorn force-pushed the main-shared_vg_support branch from 0ed33f2 to 79b1520 on November 28, 2023 18:21
@richm (Contributor) commented Nov 29, 2023

ok - but - is there some platform that has the correct version of blivet? Alternately - if you have some copr blivet build that you are using, can you attach the log output from running the test with the right version of blivet?

@japokorn (Collaborator, Author) commented:

ok - but - is there some platform that has the correct version of blivet? Alternately - if you have some copr blivet build that you are using, can you attach the log output from running the test with the right version of blivet?

I am running the test (not skipped) on Fedora 38 with the latest blivet package (python3-blivet-3.8.2-99.20231127115915812391.3.9.devel.64.gfc7f3fc5.fc38.noarch)

@richm (Contributor) commented Nov 29, 2023

[citest]

@@ -1527,7 +1527,7 @@ def _create(self):
         if not self._device:
             members = self._manage_encryption(self._create_members())
             try:
-                pool_device = self._blivet.new_vg(name=self._pool['name'], parents=members)
+                pool_device = self._blivet.new_vg(name=self._pool['name'], parents=members, shared=self._pool['shared'])

Review comment (Contributor):

Need some sort of logic here to avoid using the shared parameter if not supported

                if self._blivet.new_vg supports 'shared' parameter:
                    pool_device = self._blivet.new_vg(name=self._pool['name'], parents=members, shared=self._pool['shared'])
                else:
                    pool_device = self._blivet.new_vg(name=self._pool['name'], parents=members)

There's probably some way to use introspection to see if the new_vg method supports shared

or some other way to dynamically construct the new_vg arguments, e.g. new_vg_args = {}, then pass them like new_vg(**new_vg_args)

This is what is causing some of the test failures
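
A minimal sketch of that introspection idea (a hypothetical helper, not the module's actual code; it assumes new_vg() declares shared as a named parameter rather than swallowing it via **kwargs):

    import inspect

    def new_vg_compat(blivet_obj, name, parents, shared=False):
        """Create a VG, passing 'shared' only when this blivet's new_vg() accepts it.

        Hypothetical helper; names are illustrative, not the storage role's code.
        """
        kwargs = {"name": name, "parents": parents}
        # inspect.signature() lists the keyword parameters new_vg() declares
        if "shared" in inspect.signature(blivet_obj.new_vg).parameters:
            kwargs["shared"] = shared
        elif shared:
            # Asking for a shared VG from a blivet that cannot create one is an error
            raise RuntimeError("this blivet version does not support shared VGs")
        return blivet_obj.new_vg(**kwargs)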

@japokorn (Collaborator, Author) replied:

I have modified the condition so that the shared parameter is not used when its value is the default (false). This should fix the tests.

Older versions of blivet do not support the 'shared' parameter. This resulted in
failures in tests unrelated to shared VGs. This change fixes that
behavior, and also fixes a minor condition error in a test.
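
A sketch of the fix as described, using a hypothetical standalone helper instead of the real _create() method: the shared key is only forwarded when the pool actually requests it, so older blivet releases keep working for ordinary pools.

    def build_new_vg_kwargs(pool, members):
        """Hypothetical helper: include 'shared' only when the pool requests it."""
        kwargs = {"name": pool["name"], "parents": members}
        if pool.get("shared", False):
            # Ordinary pools never pass 'shared', so older blivet releases
            # that lack the parameter still work for them.
            kwargs["shared"] = True
        return kwargs

    # Inside _create() this would be used roughly as:
    #     pool_device = self._blivet.new_vg(**build_new_vg_kwargs(self._pool, members))
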
@japokorn (Collaborator, Author) commented:

[citest]

@richm (Contributor) commented Dec 6, 2023

Looks like fedora 39 has the right version of blivet. When I try your latest like this:
tox -e qemu-ansible-core-2.15 -- --image-name fedora-39 --log-level debug -- tests/tests_lvm_pool_shared.yml
I get this error:

TASK [fedora.linux_system_roles.ha_cluster : Create a corosync.conf file content using pcs-0.10] ***
...
fatal: [/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2]: FAILED! => {
    "changed": true,
    "cmd": [
        "pcs",
        "cluster",
        "setup",
        "--corosync_conf",
        "/tmp/ansible.cjhl1_x4_ha_cluster_corosync_conf",
        "--overwrite",
        "--no-cluster-uuid",
        "--",
        "rhel9-1node",
        "/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2"
    ],
    "delta": "0:00:01.327931",
    "end": "2023-12-06 18:25:28.852939",
    "rc": 1,
    "start": "2023-12-06 18:25:27.525008"
}

STDERR:

Warning: Unable to read the known-hosts file: No such file or directory: '/var/lib/pcsd/known-hosts'
No addresses specified for host '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2', using '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2'
Error: Unable to resolve addresses: '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2', use --force to override
Error: Errors have occurred, therefore pcs is unable to continue

The problem is that runqemu uses the file name of the qcow2 file as the hostname.

@richm (Contributor) commented Dec 6, 2023

If I add this to the test:

    - name: Set up test environment for the ha_cluster role
      include_role:
        name: fedora.linux_system_roles.ha_cluster
        tasks_from: test_setup.yml

    - name: Create cluster
...

Then I get much farther, until here:

    - name: >-
        Create a disk device; specify disks as non-list mounted on
        {{ mount_location }}

...

TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] ***
...
fatal: [/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2]: FAILED! => {
...
MSG:

Failed to commit changes to disk: Process reported exit code 3:   Using a shared lock type requires lvmlockd (lvm.conf use_lvmlockd.)
  Run `vgcreate --help' for more information.

I guess somewhere in the blivet module or blivet library it manages lvm.conf?

I think we need to change https://github.com/linux-system-roles/ha_cluster/blob/main/tasks/test_setup.yml#L9 to make it more generally applicable.

- name: Set node name to 'localhost' for single-node clusters
  set_fact:
    inventory_hostname: localhost  # noqa: var-naming
  when: ansible_play_hosts_all | length == 1

@tomjelinek @spetrosi I think the intention of this code is: "If inventory_hostname is not resolvable (i.e. it is a qcow2 path as used by tox -e qemu, or some sort of hostalias like sut as used by baseos ci), then use localhost, as it will always be resolvable."

The problem is that the test "is this hostname resolvable" is not easy to do, and even with getent hosts $name you don't know whether the user provided $name as some sort of alias that actually resolved to a real but incorrect hostname. In Jan's case, he is using an external managed host (not a local qcow2 image file) which has a real, resolvable hostname and IP address that he wants to use.

I think we need to introduce a flag like ha_cluster_test_use_given_hostname:

- name: Set node name to 'localhost' for single-node clusters
  set_fact:
    inventory_hostname: localhost  # noqa: var-naming
  when:
    - ansible_play_hosts_all | length == 1
    - not ha_cluster_test_use_given_hostname | d(false)

Then

  • all tox -e qemu tests, baseos ci, and downstream automated tests will work
  • Jan can provide -e ha_cluster_test_use_given_hostname=true or otherwise provide this parameter in his inventory when running his tests e.g.
tox -e qemu-ansible-core-2.15 -- --image-name fedora-39 --log-level debug -e ha_cluster_test_use_given_hostname=true -- tests/tests_lvm_pool_shared.yml

wdyt?

@tomjelinek (Member) commented:

@richm You got the intention absolutely right.

Adding the proposed flag works for me. It would be nice if it could be tested (@japokorn?) before merging it into the ha_cluster role. A comment explaining that the flag is meant for other roles, and thus must be kept in place even though it is not used anywhere in the ha_cluster role, would also be helpful. Feel free to open a PR after testing, or let me know and I will do it myself.

@richm (Contributor) commented Dec 7, 2023

@tomjelinek there's also an issue with lvmlockd - man lvmlockd

USAGE
   Initial set up
       Setting up LVM to use lvmlockd and a shared VG for the first time includes some one time set up steps:

   1. choose a lock manager
       dlm
       If dlm (or corosync) are already being used by other cluster software, then select dlm. dlm uses corosync which requires
       additional configuration beyond the scope of this document. See corosync and dlm documentation for instructions on
       configuration, set up and usage.

how to choose the lock manager? What additional configuration is required by corosync and dlm? Seems like this is something we need to add to the ha_cluster role.

   2. configure hosts to use lvmlockd
       On all hosts running lvmlockd, configure lvm.conf:
       use_lvmlockd = 1

@japokorn where/how is this done? seems like something the storage role/blivet should do?

   3. start lvmlockd
       Start the lvmlockd daemon.
       Use systemctl, a cluster resource agent, or run directly, e.g.
       systemctl start lvmlockd

this seems like something the ha_cluster role should do after it installs lvm2-lockd and dlm.

4. start lock manager
...
       dlm
       Start the dlm and corosync daemons.
       Use systemctl, a cluster resource agent, or run directly, e.g.
       systemctl start corosync dlm

This also seems like something the ha_cluster role should do.

   5. create VG on shared devices
       vgcreate --shared <vgname> <devices>

the storage role does this

   6. start VG on all hosts
       vgchange --lock-start

       Shared VGs must be started before they are used. Starting the VG performs lock manager initialization that is necessary to begin
       using locks (i.e. creating and joining a lockspace). Starting the VG may take some time, and until the start completes the VG may
       not be modified or activated.

@japokorn this seems like something the storage role should do?

   7. create and activate LVs
       Standard lvcreate and lvchange commands are used to create and activate LVs in a shared VG.

This also seems like something the storage role should do

   Normal start up and shut down
       After initial set up, start up and shut down include the following steps.  They can be performed directly or may be automated using
       systemd or a cluster resource manager/agents.

       • start lvmlockd
       • start lock manager
       • vgchange --lock-start
       • activate LVs in shared VGs

@tomjelinek this says ". . . may be automated using systemd or a cluster resource manager/agents." - is this something that the ha_cluster role can configure the cluster resource manager/agents to do?
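
For orientation, a minimal sketch of that bring-up sequence driven from Python; the VG name and device path are placeholders, lvm.conf is assumed to already contain use_lvmlockd = 1, and the corosync/dlm cluster is assumed to be configured already.

    import subprocess

    def run(cmd):
        """Run one setup command and fail loudly (illustrative helper)."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def bring_up_shared_vg(vg_name="vg1", devices=("/dev/vdb",)):
        """Walk steps 3-6 from the lvmlockd man page with placeholder names."""
        run(["systemctl", "start", "lvmlockd"])           # 3. start the lvmlockd daemon
        run(["systemctl", "start", "corosync", "dlm"])    # 4. start the dlm lock manager
        run(["vgcreate", "--shared", vg_name, *devices])  # 5. create the VG on shared devices
        run(["vgchange", "--lock-start"])                 # 6. start the VG on all hosts
        # 7. standard lvcreate/lvchange commands then create and activate LVs

    if __name__ == "__main__":
        bring_up_shared_vg()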

@tomjelinek (Member) commented:

how to choose the lock manager?

Well, the documentation says that dlm should be used if corosync is in use. HA cluster uses corosync.

What additional configuration is required by corosync and dlm? Seems like this is something we need to add to the ha_cluster role.

I'm not aware of any configuration options in corosync related to dlm. And I'm not aware of any required dlm configuration, just run with the defaults.

"... may be automated using systemd or a cluster resource manager/agents." - is this something that the ha_cluster role can configure the cluster resource manager/agents to do?

It means: create cluster resources. So you just need to instruct the ha_cluster role to create the appropriate resources, ocf:pacemaker:controld and ocf:heartbeat:lvmlockd.

@richm (Contributor) commented Dec 11, 2023

@tomjelinek afaict the test is setting the appropriate parameters/resources - https://github.com/linux-system-roles/storage/pull/388/files#diff-2892843b9952fe8a2e8f5867b7f5092369acfd8ae20990b1689a366c01b1584cR68-R82

Then maybe the reason it is working in Jan's testing is because he has a "real" hostname and a real IP address, but in the baseos ci and local qemu testing, the inventory_hostname is fake?

@tomjelinek (Member) commented:

@richm Yes, the variables look good. I have verified that the cluster is able to start dlm and lvmlockd resources with no issues with such settings, if it uses a real node name. If the cluster is set up with the 'localhost' node, dlm times out on start. I'm not sure why that happens. I already tried debugging this back in October but I was unable to get any useful info from dlm debug logs.

@richm merged commit eec6543 into linux-system-roles:main on Dec 12, 2023
17 of 19 checks passed