-
Notifications
You must be signed in to change notification settings - Fork 363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to disable AutoAddPolicy for SSH backend and use known_hosts and ssh config file #1683
Comments
I am no expert at SSH, so would be happy to see reasonable changes like this in a PR. Also, be aware that sshfs might be more fully featured. |
I have opened a PR. Thanks for the pointer to Python sshfs based on asyncssh. I didn't see this mentioned in the API reference's list of known implementations. It is confusing having so many alternatives. Should the built-in one be deprecated then (if and only if sshfs turns out to be much more feature-complete)? |
This is definitely an omission!
I don't directly maintain sshfs; I would be happy to have it supersede the paramiko implementation if they think there would be no lost functionality. |
Any opinions about this @shcheklein @efiop ? Is it feature complete? What was the motivation for that SSHFS implementation? Is it better than the one shipped with fsspec? Is it still being worked on? I was also wondering about the maintainers with access to the repository, but that seems to have been sorted out in fsspec/sshfs#45. As for the issues I have:
|
I'm fine with that. I don't remember all the reasons behind our decision to implement it, but from what I can recollect it was because of paramiko having some edge case that were hard to fix (e.g. configs) + performance. Some background: iterative/dvc#6064 and a bunch of other related discussions. @isidentical and @efiop should remember the details better.
it should be, yes. And it should be more or less straightforward to add other components if needed.
Yes, not full time. Driven by bugs / issues in DVC. But yes, we are trying to keep an eye on it. I can't guarantee any response time, etc though.
can we do git blame to see if it was there from the very beginning? partially, probably was driven by DVC needs (and overall the fact that it's a python lib). There is not interactive mode (like in the SSH CLI client) when you use it. wdyt? |
Thank you for the feedback. I obviously can confirm the config issues and it is sad to see that the indirectly linked issue is open for 8 years by now even though pull requests do exist for this feature. This issue from 6 years ago complaining about the amount of open issues and, worse, the hundreds of open pull requests mirrors the opinion I formed over the past few days trying to use paramiko. I can also confirm the performance issues and I should have done some simple benchmarks with fsspec vs, sshfs before trying to integrate paramiko into my project. I was simply put off by the async usage of asyncssh, and paramiko is 8x more popular on Github, and has a really nice SFTP API, which can be almost directly used to implement the FUSE callbacks. That's why I tried with paramiko first. After some initial slight trouble with the weird non-standard conforming file object, I noticed the poor performance and I don't see an easy way to work around this. Here is my workaround for the file object bug because it might also affect usage of fsspec.sftp:class SFTPFile(io.RawIOBase):
# For a reference implementation based on RawIOBase, see "class SocketIO(io.RawIOBase)" in:
# https://github.com/python/cpython/blob/00ffdf27367fb9aef247644a96f1a9ffb5be1efe/Lib/socket.py#L683
# or others implementations inside cpython:
# https://github.com/python/cpython/blob/00ffdf27367fb9aef247644a96f1a9ffb5be1efe/Lib/_compression.py#L33
# For the internals of RawIOBase and others, see:
# https://github.com/python/cpython/tree/main/Lib/_pyio.py#L619
# The reasons for this wrapper are:
# - paramiko.sftp_file.SFTPFile is incompatible to io.IOBase: https://github.com/paramiko/paramiko/issues/2452
# - I need the file object to inherit from io.RawIOBase or else the isinstance(fileobj, io.IOBase) check in
# compression.detectCompression will fail!
# - Add option to clean up resource MountSource when this file is closed.
def __init__(
self, fileObject: IO[bytes], mountSource: Optional[MountSource] = None, closeMountSource: bool = False
) -> None:
io.RawIOBase.__init__(self)
self.fileObject = fileObject
self.mountSource = mountSource
self.closeMountSource = closeMountSource
self.seekable = self.fileObject.seekable
self.readable = self.fileObject.readable
self.writable = self.fileObject.writable
self.readinto = self.fileObject.readinto
self.read = self.fileObject.read
self.tell = self.fileObject.tell
def __enter__(self):
return self
def __exit__(self, exception_type, exception_value, exception_traceback):
self.close()
@overrides(io.RawIOBase)
def close(self) -> None:
if self.closeMountSource and self.mountSource:
self.mountSource.close()
@overrides(io.RawIOBase)
def fileno(self) -> int:
# This is a virtual Python level file object and therefore does not have a valid OS file descriptor!
raise io.UnsupportedOperation()
@overrides(io.RawIOBase)
def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
# https://github.com/paramiko/paramiko/issues/2452
self.fileObject.seek(offset, whence)
return self.tell()
@overrides(io.RawIOBase)
def readall(self) -> bytes:
# It is necessary to implement this, or else the io.RawIOBase.readall implementation would use
# io.DEFAULT_BUFFER_SIZE (8 KiB). Notably, this would ignore the block size configured in BufferedReader,
# when calling read(-1) on it because it thinks that raw.readall is a fast-path, but in this case is ~100x
# slower than 4 MiB reads equal to the Lustre-advertised block size.
# https://github.com/python/cpython/issues/85624
chunks = []
while True:
result = self.read()
if not result:
break
chunks.append(result)
return b"".join(chunks) Here is a very simple performance comparison that might be already be better than nothing for fsspec/sshfs#2 : import time
import fsspec.implementations.sftp
fs = fsspec.implementations.sftp.SFTPFileSystem("127.0.0.1")
for i in range(5):
t0=time.time(); size=len(fs.open('silesia.tar.gz').read()); t1=time.time()
print(f"Read {size} in {t1-t0:.2f} s -> {size/(t1-t0)/1e6:.2f} MB/s")
# Read 68238807 in 16.93 s -> 4.03 MB/s
# Read 68238807 in 16.74 s -> 4.08 MB/s
# Read 68238807 in 16.74 s -> 4.08 MB/s
# Read 68238807 in 16.75 s -> 4.07 MB/s
# Read 68238807 in 16.70 s -> 4.09 MB/s
import sshfs
fs = sshfs.SSHFileSystem("127.0.0.1")
for i in range(5):
t0=time.time(); size=len(fs.open('silesia.tar.gz').read()); t1=time.time()
print(f"Read {size} in {t1-t0:.2f} s -> {size/(t1-t0)/1e6:.2f} MB/s")
# Read 68238807 in 2.06 s -> 33.18 MB/s
# Read 68238807 in 2.07 s -> 32.99 MB/s
# Read 68238807 in 2.04 s -> 33.43 MB/s
# Read 68238807 in 2.01 s -> 33.93 MB/s
# Read 68238807 in 2.04 s -> 33.48 MB/s So yeah, paramiko is ~8x slower than asyncssh for this most simple of simple read-only usages. This seems to be a problem with the
That sounds good enough. Thank you for your timely response to my impertinent questions.
Github can do git blame and it shows this refactor commit from 3 years ago, which is basically at project birth. So yeah, seems to have been there from the beginning.
The non-interactivity is indeed a problem. For my use case, the known hosts are not something that changes every day, so I would add the host as a known host by logging in once via (interactive) ssh before using any non-interactive tool. I can see that there are other use cases and that disabling this verification is simply the easiest way to make it work for all use cases, but I still think that it should at least be configurable. And if there is no interactivity, it can only succeed or fail (with a helpful error message suggesting to add the host as a known host). |
I'm not sure what I ought to do from our end here. @mxmlnkn , you would recommend raising DeprecationWarning when instantiating fsspec's paramiko backend, with a view to removing it in the future? |
Hm, I guess there are also arguments against deprecating it because paramiko and asyncssh are two very different backends. I guess I'm fine with nothing happening at all. I'll probably simply switch to fsspec/sshfs and set a fixed known_hosts and do the URI parsing and config loading myself. |
OK. Can I ask:
Is there any downside from this? The sync API should hide the implementation detail from you unless you need async. |
No, you are completely right that the sync API via sshfs perfectly hides the async API of asyncssh. But, I was trying to use and benchmark asyncssh directly to have a fairer comparison to paramiko, which I was already using directly, and I simply have not worked with async at all yet, so I failed to get asyncssh to work. I am not aware about any intrinsic downsides of async, e.g., in terms of performance. It's just that it is hard to use and has a "viral effect". Then again, how is asyncssh hiding the async API in the first place? I saw some solutions that had to spin up a separate thread. That could indeed have performance or thread-safety implications... |
|
AutoAddPolicy
makes the user vulnerable to man-in-the-middle attacks. I think, it should not be added by default and/or there should be an option to disable/enable it. If the user-wide~/.ssh/known_hosts
file is correctly loaded, then theAutoAddPolicy
should likely not be necessary in the first place. For SSH, this policy change needs to be enabled withssh -o StrictHostKeychecking=no ...
or via the SSH config file.Loading the SSH config file is yet another feature request of mine. For example, currently, things like
ssh cluster
works because I have configured an alias for that in my~/.ssh/config
, butfsspec.open("ssh://cluster")
does not work because paramiko does not automatically load the usual OpenSSH files even though it has helper APIs to load them manually.The text was updated successfully, but these errors were encountered: