Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a note to conf about the dangers of modifying dir at runtime #887

Merged
merged 11 commits into from
Dec 8, 2024

Conversation

enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Aug 11, 2024

We've had security issues in the past with it, which is why
we marked it as PROTECTED. But, modifying during runtime
is also a dangerous action. For example, when child processes
are running, persistent temp files and log files may have
unexpected effects.

A scenario for modifying dir at runtime is to migrate a disk
failure, such as using disk-based replication to migrate a node,
writing nodes.conf to save the cluster configuration.

We decided to leave it as is and add a note in the conf
about the dangers of modifying dir at runtime.

If a forked child process is running, modifying dir is not allowed.

I can think of two problems at present:
1. If dir is modified during aof rewrite, we will leave the tmp files
on the disk. Because the rename operation of the main process in
backgroundRewriteDoneHandler will fail. Of course, we can stop aof
rewrite when modifying dir, and restart aof rewrite after the modiry.

2. Both the parent and child processes may call serverLog to write
logs, and modifying dir will cause the child process to write logs
in the old dir and the parent process to write logs in the new dir.
Because they fopen the file every time they call serverLog. (PS. we
may need to change it to open only once in the future).

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin requested a review from madolson August 11, 2024 10:24
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also be okay making this config immutable for Valkey 8.0. We've had security issues in the past with it, which is why we marked it as PROTECTED. It seems fine to throw these temporary errors here as well.

src/config.c Outdated Show resolved Hide resolved
src/config.c Outdated Show resolved Hide resolved
src/config.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
Copy link

codecov bot commented Aug 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.84%. Comparing base (6df376d) to head (d90000d).
Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #887      +/-   ##
============================================
+ Coverage     70.82%   70.84%   +0.01%     
============================================
  Files           118      118              
  Lines         63561    63561              
============================================
+ Hits          45020    45029       +9     
+ Misses        18541    18532       -9     

see 13 files with indirect coverage changes

@enjoy-binbin
Copy link
Member Author

I'd also be okay making this config immutable for Valkey 8.0. We've had security issues in the past with it, which is why we marked it as PROTECTED. It seems fine to throw these temporary errors here as well.

I actually seem to prefer making it immutable now, WDYT? @valkey-io/core-team

@hwware
Copy link
Member

hwware commented Aug 28, 2024

I'd also be okay making this config immutable for Valkey 8.0. We've had security issues in the past with it, which is why we marked it as PROTECTED. It seems fine to throw these temporary errors here as well.

I actually seem to prefer making it immutable now, WDYT? @valkey-io/core-team

I am ok to change this config from MODIFIABLE_CONFIG | PROTECTED_CONFIG | DENY_LOADING_CONFIG to IMMUTABLE_CONFIG | DENY_LOADING_CONFIG

@madolson
Copy link
Member

I'm okay also making it immutable.

@PingXie
Copy link
Member

PingXie commented Aug 29, 2024

+1 on making dir immutable. I can't think of a reason why dir needs to be mutable. Have you guys seen otherwise?

@enjoy-binbin enjoy-binbin changed the title Not allow modify dir while a child process is running Making configuration item dir immutable Aug 29, 2024
@enjoy-binbin enjoy-binbin added release-notes This issue should get a line item in the release notes breaking-change Indicates a possible backwards incompatible change labels Aug 29, 2024
@enjoy-binbin enjoy-binbin requested a review from a team August 29, 2024 06:39
Signed-off-by: Binbin <[email protected]>
@zuiderkwast
Copy link
Contributor

Immutable is OK, as long as it works to set --dir multiple times at startup, in config file and on command line. I know the Fedora (or Debian?) maintainer guy was doing something like that.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin
Copy link
Member Author

Immutable is OK, as long as it works to set --dir multiple times at startup, in config file and on command line. I know the Fedora (or Debian?) maintainer guy was doing something like that.

ok, i tested it manually and it's ok (behavior is the same as unstable) and i also added some tests hope we can cover it. Maybe someone else needs to test it manually to make sure i am not breaking anythings...

@soloestoy
Copy link
Member

I have encountered scenarios where the dir needed to be modified, such as when the machine has multiple disks mounted in different directories, and the disk where the current directory is located becomes read-only due to a machine failure.

In such cases, users want to start a replica on another machine for migration. However, because the disk is read-only, a full synchronization cannot be performed (as RDB requires disk writes, which was not supported in earlier versions through diskless sync). In this situation, we recovered by modifying the dir to a usable disk.

This may not be a common case, but I am not sure if other users have other scenarios that require modifying the dir.

tests/assets/default.conf Outdated Show resolved Hide resolved
@enjoy-binbin
Copy link
Member Author

Yes, this is indeed one of the uses of dir modification, i suddenly forgot about it. Speaking of read-only disk error, it will cause cluster-config-file saving to fail and cause the node to exit. Like the cluster is doing some cluster stuff, and the node update nodes.conf and fail and exit due to disk error. We once wanted to save nodes.conf somewhere else by changing dir, but in the end we gave up for some reasons (when system disk fails, it is not reliable to hang on the disk) and we are using a module to hack it.

And btw, i would like to add a configuration item to control the killing itself thing, do you think it is necessary? I remember @PingXie seemed to have rejected it (but i think it is a configuration item that can be controlled by the administrator). Its suicide is passive and difficult to control.

void clusterSaveConfigOrDie(int do_fsync) {
    if (clusterSaveConfig(do_fsync) == C_ERR) {
        serverLog(LL_WARNING, "Fatal: can't update cluster config file.");
        exit(1);
    }
}

@zuiderkwast
Copy link
Contributor

In this situation, we recovered by modifying the dir to a usable disk.

Interesting, so maybe it's better to keep it mutable.

Can we prevent changing it when we have a child process running?

@PingXie
Copy link
Member

PingXie commented Aug 29, 2024

I have encountered scenarios where the dir needed to be modified, such as when the machine has multiple disks mounted in different directories, and the disk where the current directory is located becomes read-only due to a machine failure.

In such cases, users want to start a replica on another machine for migration. However, because the disk is read-only, a full synchronization cannot be performed (as RDB requires disk writes, which was not supported in earlier versions through diskless sync). In this situation, we recovered by modifying the dir to a usable disk.

Interesting scenario, indeed. However, as @soloestoy mentioned, it's not a typical use case, and the concerns raised by @enjoy-binbin, such as temporary file leaks and split logs, are not a concern here IMO. I agree that keeping this config mutable makes sense but I see little value or need in hardening this code path. Instead, we should document it in the dir config documentation for admin awareness. Blocking the dir mutation due to active child processes would complicate matters in urgent situations, as child processes can run for extended periods. In such cases, admins would have to identify and terminate the right child processes, which could lead to errors when trying to unblock the disk-based full sync quickly.

@PingXie
Copy link
Member

PingXie commented Aug 29, 2024

Yes, this is indeed one of the uses of dir modification, i suddenly forgot about it. Speaking of read-only disk error, it will cause cluster-config-file saving to fail and cause the node to exit. Like the cluster is doing some cluster stuff, and the node update nodes.conf and fail and exit due to disk error. We once wanted to save nodes.conf somewhere else by changing dir, but in the end we gave up for some reasons (when system disk fails, it is not reliable to hang on the disk) and we are using a module to hack it.

Right, nodes.conf is a single point of failure (SPOF) in the current design/implementation. The reliance of an in-memory server on disks, which typically have a lower mean time to failure (MTTF) compared to RAM, is unfortunate. However, this is a separate concern from the discussion of dir's mutability, so maybe open a new issue for discussion?

@PingXie
Copy link
Member

PingXie commented Aug 29, 2024

And btw, i would like to add a configuration item to control the killing itself thing, do you think it is necessary? I remember @PingXie seemed to have rejected it (but i think it is a configuration item that can be controlled by the administrator).

Sorry @enjoy-binbin I completely lost the context on this one. Do you have a thread?

@enjoy-binbin
Copy link
Member Author

ok, the link is #635, i take it wrong.

@enjoy-binbin
Copy link
Member Author

ok, now there seem to be three options now:

  1. Keep it mutable, add valkey.conf to explain the harm of modifying it
  2. Keep it mutable, block it when there are child processes (the original appearance of this PR)
  3. Keep it immutable (the current appearance of this PR)

vote?

@madolson
Copy link
Member

madolson commented Sep 1, 2024

In such cases, users want to start a replica on another machine for migration. However, because the disk is read-only, a full synchronization cannot be performed (as RDB requires disk writes, which was not supported in earlier versions through diskless sync). In this situation, we recovered by modifying the dir to a usable disk.

We have diskless replication now though, do we still need to worry about this case?

@PingXie
Copy link
Member

PingXie commented Sep 1, 2024

We have diskless replication now though, do we still need to worry about this case?

yeah. although we can't say disk-based replication will never be used, diskless has been the default (since 7.0?). so this is even less of a concern IMO. I would vote for option 1.

@hwware
Copy link
Member

hwware commented Sep 12, 2024

I look carefully which scenarios this parameter will be used again, there are 3 cases:

  1. rdb filename.
  2. aof filename.
  3. cluster config file.
  4. logging

@soloestoy describe a typical use case, and maybe more cases are coming. And we have no idea how the clients behavior is, thus I think we should keep it mutable, but in valkey.conf we can highlight it is dangerous to change it during runtime.

Thus, I vote 1.

@zuiderkwast
Copy link
Contributor

In Zhao's example, when the disk is full, it is now possible to replicate the data to another node. If disk-based sync is enabled, they can just change to diskless and replication will work.

But if the disk is full and they just want to use another disk for AOF, RDB and nodes.conf, then mutable is more useful, maybe. Nodes.conf is not written by a fork, so this one is no problem.

If it is risky to change the dir when a fork is running, then I think we can prevent it with option 2. But CONFIG SET failing at random time is also annoying. :)

What is the worst thing that can happen if dir is modified during a fork? We will leave temp files on disk, described in the first commit message of this PR:

  1. If dir is modified during aof rewrite, we will leave the tmp files
    on the disk. Because the rename operation of the main process in
    backgroundRewriteDoneHandler will fail.

Leaving temp files is maybe no disaster but it is a little annoying, if it is huge files.

@enjoy-binbin also mentioned logging: the child process will continue logging in the old dir. That seems like no problem to me. Will anything happen to RDB files? No crashes?

If it is only AOF files, then maybe we can do a similar trick as in #886? Allow dir to be changed, but store the dir that was used when the fork started.

@enjoy-binbin
Copy link
Member Author

@enjoy-binbin also mentioned logging: the child process will continue logging in the old dir. That seems like no problem to me. Will anything happen to RDB files? No crashes?

If it is only AOF files, then maybe we can do a similar trick as in #886? Allow dir to be changed, but store the dir that was used when the fork started.

I remember that the RDB files are ok, there is no crash and the RDB rename thing is ok (it is different than the AOF one). And i do remember it is only affect AOF files, yean i take try to store the dir and see if it can work. The logging one is not a real problem too i think, just a point that may be affected.

@enjoy-binbin enjoy-binbin changed the title Making configuration item dir immutable Add a note to conf about the dangers of modifying dir at runtime Dec 6, 2024
@enjoy-binbin enjoy-binbin removed release-notes This issue should get a line item in the release notes major-decision-pending Major decision pending by TSC team breaking-change Indicates a possible backwards incompatible change labels Dec 6, 2024
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now this PR is completely changed and only adds some comments. It is a very small and safe change.

@enjoy-binbin enjoy-binbin merged commit 176fafc into valkey-io:unstable Dec 8, 2024
47 checks passed
@enjoy-binbin enjoy-binbin deleted the config_set_dir branch December 8, 2024 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants