Skip to content

Commit

Permalink
Merge pull request rook#13314 from microyahoo/partition
Browse files Browse the repository at this point in the history
osd: support create osd with metadata partition
  • Loading branch information
satoru-takeuchi authored Feb 8, 2024
2 parents ebcb0a4 + ee05e82 commit f1aa4f3
Show file tree
Hide file tree
Showing 5 changed files with 475 additions and 54 deletions.
45 changes: 45 additions & 0 deletions .github/workflows/canary-integration-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -391,6 +391,51 @@ jobs:
with:
name: canary

osd-with-metadata-partition-device:
runs-on: ubuntu-20.04
if: "!contains(github.event.pull_request.labels.*.name, 'skip-ci')"
steps:
- name: checkout
uses: actions/checkout@v4
with:
fetch-depth: 0

- name: consider debugging
uses: ./.github/workflows/tmate_debug
with:
use-tmate: ${{ secrets.USE_TMATE }}

- name: setup cluster resources
uses: ./.github/workflows/canary-test-config

- name: validate-yaml
run: tests/scripts/github-action-helper.sh validate_yaml

- name: use local disk as OSD metadata partition
run: |
export BLOCK="/dev/$(tests/scripts/github-action-helper.sh find_extra_block_dev)"
tests/scripts/github-action-helper.sh use_local_disk
tests/scripts/create-bluestore-partitions.sh --disk "$BLOCK" --bluestore-type block.db --osd-count 1
- name: deploy cluster
run: |
tests/scripts/github-action-helper.sh deploy_cluster osd_with_metadata_partition_device
- name: wait for prepare pod
run: tests/scripts/github-action-helper.sh wait_for_prepare_pod 1

- name: wait for ceph to be ready
run: tests/scripts/github-action-helper.sh wait_for_ceph_to_be_ready osd 1

- name: check-ownerreferences
run: tests/scripts/github-action-helper.sh check_ownerreferences

- name: collect common logs
if: always()
uses: ./.github/workflows/collect-logs
with:
name: canary

osd-with-metadata-device:
runs-on: ubuntu-20.04
if: "!contains(github.event.pull_request.labels.*.name, 'skip-ci')"
Expand Down
6 changes: 5 additions & 1 deletion Documentation/CRDs/Cluster/ceph-cluster-crd.md
Original file line number Diff line number Diff line change
Expand Up @@ -477,7 +477,7 @@ See the table in [OSD Configuration Settings](#osd-configuration-settings) to kn

The following storage selection settings are specific to Ceph and do not apply to other backends. All variables are key-value pairs represented as strings.

* `metadataDevice`: Name of a device or lvm to use for the metadata of OSDs on each node. Performance can be improved by using a low latency device (such as SSD or NVMe) as the metadata device, while other spinning platter (HDD) devices on a node are used to store data. Provisioning will fail if the user specifies a `metadataDevice` but that device is not used as a metadata device by Ceph. Notably, `ceph-volume` will not use a device of the same device class (HDD, SSD, NVMe) as OSD devices for metadata, resulting in this failure.
* `metadataDevice`: Name of a device, [partition](#limitations-of-metadata-device) or lvm to use for the metadata of OSDs on each node. Performance can be improved by using a low latency device (such as SSD or NVMe) as the metadata device, while other spinning platter (HDD) devices on a node are used to store data. Provisioning will fail if the user specifies a `metadataDevice` but that device is not used as a metadata device by Ceph. Notably, `ceph-volume` will not use a device of the same device class (HDD, SSD, NVMe) as OSD devices for metadata, resulting in this failure.
* `databaseSizeMB`: The size in MB of a bluestore database. Include quotes around the size.
* `walSizeMB`: The size in MB of a bluestore write ahead log (WAL). Include quotes around the size.
* `deviceClass`: The [CRUSH device class](https://ceph.io/community/new-luminous-crush-device-classes/) to use for this selection of storage devices. (By default, if a device's class has not already been set, OSDs will automatically set a device's class to either `hdd`, `ssd`, or `nvme` based on the hardware properties exposed by the Linux kernel.) These storage classes can then be used to select the devices backing a storage pool by specifying them as the value of [the pool spec's `deviceClass` field](../Block-Storage/ceph-block-pool-crd.md#spec).
Expand All @@ -497,6 +497,10 @@ Allowed configurations are:
| crypt | | |
| mpath | | |

#### Limitations of metadata device
- If `metadataDevice` is specified in the global OSD configuration or in the node level OSD configuration, the metadata device will be shared between all OSDs on the same node. In other words, OSDs will be initialized by `lvm batch`. In this case, we can't use partition device.
- If `metadataDevice` is specified in the device local configuration, we can use partition as metadata device. In other words, OSDs are initialized by `lvm prepare`.

### Annotations and Labels

Annotations and Labels can be specified so that the Rook components will have those annotations / labels added to them.
Expand Down
144 changes: 94 additions & 50 deletions pkg/daemon/ceph/osd/volume.go
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,10 @@ const (
dbDeviceFlag = "--db-devices"
cephVolumeCmd = "ceph-volume"
cephVolumeMinDBSize = 1024 // 1GB

blockDBFlag = "--block.db"
blockDBSizeFlag = "--block.db-size"
dataFlag = "--data"
)

// These are not constants because they are used by the tests
Expand Down Expand Up @@ -665,6 +669,12 @@ func (a *OsdAgent) initializeDevicesLVMMode(context *clusterd.Context, devices *
}
metadataDevices[md]["devices"] = deviceArg
}
if metadataDevice.Type == sys.PartType {
if a.metadataDevice != "" && device.Config.MetadataDevice == "" {
return errors.Errorf("Partition device %s can not be specified as metadataDevice in the global OSD configuration or in the node level OSD configuration", md)
}
metadataDevices[md]["part"] = "true" // ceph-volume lvm batch only supports disk and lvm
}
deviceDBSizeMB := getDatabaseSize(a.storeConfig.DatabaseSizeMB, device.Config.DatabaseSizeMB)
if a.storeConfig.IsValidStoreType() && deviceDBSizeMB > 0 {
if deviceDBSizeMB < cephVolumeMinDBSize {
Expand Down Expand Up @@ -721,76 +731,110 @@ func (a *OsdAgent) initializeDevicesLVMMode(context *clusterd.Context, devices *

for md, conf := range metadataDevices {

// Do not change device names if udev persistent names are passed
mdPath := md
if !strings.HasPrefix(mdPath, "/dev") {
mdPath = path.Join("/dev", md)
}

var hasPart bool
mdArgs := batchArgs
osdsPerDevice := 1
if _, ok := conf["osdsperdevice"]; ok {
if part, ok := conf["part"]; ok && part == "true" {
hasPart = true
}
if hasPart {
// ceph-volume lvm prepare --data {vg/lv} --block.wal {partition} --block.db {/path/to/device}
baseArgs := []string{"-oL", cephVolumeCmd, "--log-path", logPath, "lvm", "prepare", storeFlag}
if a.storeConfig.EncryptedDevice {
baseArgs = append(baseArgs, encryptedFlag)
}
mdArgs = baseArgs
devices := strings.Split(conf["devices"], " ")
if len(devices) > 1 {
logger.Warningf("partition metadataDevice %s can only be used by one data device", md)
}
if _, ok := conf["osdsperdevice"]; ok {
logger.Warningf("`ceph-volume osd prepare` doesn't support multiple OSDs per device")
}
mdArgs = append(mdArgs, []string{
osdsPerDeviceFlag,
conf["osdsperdevice"],
dataFlag,
devices[0],
blockDBFlag,
mdPath,
}...)
v, _ := strconv.Atoi(conf["osdsperdevice"])
if v > 1 {
osdsPerDevice = v
if _, ok := conf["databasesizemb"]; ok {
mdArgs = append(mdArgs, []string{
blockDBSizeFlag,
conf["databasesizemb"],
}...)
}
} else {
if _, ok := conf["osdsperdevice"]; ok {
mdArgs = append(mdArgs, []string{
osdsPerDeviceFlag,
conf["osdsperdevice"],
}...)
v, _ := strconv.Atoi(conf["osdsperdevice"])
if v > 1 {
osdsPerDevice = v
}
}
if _, ok := conf["databasesizemb"]; ok {
mdArgs = append(mdArgs, []string{
databaseSizeFlag,
conf["databasesizemb"],
}...)
}
mdArgs = append(mdArgs, strings.Split(conf["devices"], " ")...)
mdArgs = append(mdArgs, []string{
dbDeviceFlag,
mdPath,
}...)
}

if _, ok := conf["deviceclass"]; ok {
mdArgs = append(mdArgs, []string{
crushDeviceClassFlag,
conf["deviceclass"],
}...)
}
if _, ok := conf["databasesizemb"]; ok {
mdArgs = append(mdArgs, []string{
databaseSizeFlag,
conf["databasesizemb"],
}...)
}
mdArgs = append(mdArgs, strings.Split(conf["devices"], " ")...)

// Do not change device names if udev persistent names are passed
mdPath := md
if !strings.HasPrefix(mdPath, "/dev") {
mdPath = path.Join("/dev", md)
}

mdArgs = append(mdArgs, []string{
dbDeviceFlag,
mdPath,
}...)

// Reporting
reportArgs := append(mdArgs, []string{
"--report",
}...)
if !hasPart {
// Reporting
reportArgs := append(mdArgs, []string{
"--report",
}...)

if err := context.Executor.ExecuteCommand(baseCommand, reportArgs...); err != nil {
return errors.Wrap(err, "failed ceph-volume report") // fail return here as validation provided by ceph-volume
}
if err := context.Executor.ExecuteCommand(baseCommand, reportArgs...); err != nil {
return errors.Wrap(err, "failed ceph-volume report") // fail return here as validation provided by ceph-volume
}

reportArgs = append(reportArgs, []string{
"--format",
"json",
}...)
reportArgs = append(reportArgs, []string{
"--format",
"json",
}...)

cvOut, err := context.Executor.ExecuteCommandWithOutput(baseCommand, reportArgs...)
if err != nil {
return errors.Wrapf(err, "failed ceph-volume json report: %s", cvOut) // fail return here as validation provided by ceph-volume
}
cvOut, err := context.Executor.ExecuteCommandWithOutput(baseCommand, reportArgs...)
if err != nil {
return errors.Wrapf(err, "failed ceph-volume json report: %s", cvOut) // fail return here as validation provided by ceph-volume
}

logger.Debugf("ceph-volume reports: %+v", cvOut)
logger.Debugf("ceph-volume reports: %+v", cvOut)

var cvReports []cephVolReportV2
if err = json.Unmarshal([]byte(cvOut), &cvReports); err != nil {
return errors.Wrap(err, "failed to unmarshal ceph-volume report json")
}
var cvReports []cephVolReportV2
if err = json.Unmarshal([]byte(cvOut), &cvReports); err != nil {
return errors.Wrap(err, "failed to unmarshal ceph-volume report json")
}

if len(strings.Split(conf["devices"], " "))*osdsPerDevice != len(cvReports) {
return errors.Errorf("failed to create enough required devices, required: %s, actual: %v", cvOut, cvReports)
}
if len(strings.Split(conf["devices"], " "))*osdsPerDevice != len(cvReports) {
return errors.Errorf("failed to create enough required devices, required: %s, actual: %v", cvOut, cvReports)
}

for _, report := range cvReports {
if report.BlockDB != mdPath && !strings.HasSuffix(mdPath, report.BlockDB) {
return errors.Errorf("wrong db device for %s, required: %s, actual: %s", report.Data, mdPath, report.BlockDB)
for _, report := range cvReports {
if report.BlockDB != mdPath && !strings.HasSuffix(mdPath, report.BlockDB) {
return errors.Errorf("wrong db device for %s, required: %s, actual: %s", report.Data, mdPath, report.BlockDB)
}
}
}

Expand Down
Loading

0 comments on commit f1aa4f3

Please sign in to comment.