Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get the hex string of the Insecure.MD5.hash digest without impacting the performance #298

Closed
2horse9sun opened this issue Nov 18, 2024 · 2 comments

Comments

@2horse9sun
Copy link

2horse9sun commented Nov 18, 2024

New API Proposal: Get the hex string of the Insecure.MD5.hash digest without impacting the performance

Motivation:

I'm wondering how to get the hex string of the Insecure.MD5.hash digest while trying to compute the md5 of files. However, I was only able to get the "description" of the md5 data as below:

let data = try testFilePath.read()
let hexString = Insecure.MD5.hash(data: data).description

// Outputs:
MD5 digest: dbe2a50e3babf7c9ac1c1ac25551f5a2

What I really needed is the "dbe2a50e3babf7c9ac1c1ac25551f5a2" part instead of with "MD5 digest: " included.

After some googling, one way to do that is to manually use map to construct the hex string as below:

let digest = Insecure.MD5.hash(data: data)
let hexString = digest.compactMap { String(format: "%02x", $0) }.joined()

However, after adopting this approach, the performance of our project was highly impacted compared with the old "CryptoSwift" lib (cuz md5 is heavily involved in our workflow). After some investigation, I was doubting that the above compactMap may actually be the root cause. Then I did a performance test as below:

let testFilePath = Path("/Users/bytedance/testFile")
let data = try testFilePath.read()
let times = 1000000

print(data.md5().toHexString())
print(Insecure.MD5.hash(data: data).description)

// Test CryptoSwift md5
let cryptoSwiftStartTime = Date.now
for _ in 0..<times {
    let _ = data.md5().toHexString()
}
let cryptoSwiftEndTime = Date.now
let cryptoSwiftInterval = cryptoSwiftEndTime.timeIntervalSince(cryptoSwiftStartTime)
print("CryptoSwift: \(cryptoSwiftInterval)")

// Test swift-crypto md5
let swiftCryptoStartTime = Date.now
for _ in 0..<times {
    let digest = Insecure.MD5.hash(data: data)
    let _ = digest.compactMap { String(format: "%02x", $0) }.joined()
}
let swiftCryptoEndTime = Date.now
let swiftCryptoInterval = swiftCryptoEndTime.timeIntervalSince(swiftCryptoStartTime)
print("swift-crypto: \(swiftCryptoInterval)")

It turned out that:

  1. For small files (<1KB), swift-crypto is more than 3x slower than CryptoSwift.
  2. For large files (> 1MB), swift-crypto is more than 3x faster than CryptoSwift.

However, our workflow mainly involves thousands of small files which causes the performance issue. To further make sure that the "map" operation makes it slower, I removed the "map" and did anther test as below:

let testFilePath = Path("/Users/bytedance/testFile")
let data = try testFilePath.read()
let times = 1000000

print(data.md5().toHexString())
print(Insecure.MD5.hash(data: data).description)

// Test CryptoSwift md5
let cryptoSwiftStartTime = Date.now
for _ in 0..<times {
    let _ = data.md5()
}
let cryptoSwiftEndTime = Date.now
let cryptoSwiftInterval = cryptoSwiftEndTime.timeIntervalSince(cryptoSwiftStartTime)
print("CryptoSwift: \(cryptoSwiftInterval)")

// Test swift-crypto md5
let swiftCryptoStartTime = Date.now
for _ in 0..<times {
    let digest = Insecure.MD5.hash(data: data)
}
let swiftCryptoEndTime = Date.now
let swiftCryptoInterval = swiftCryptoEndTime.timeIntervalSince(swiftCryptoStartTime)
print("swift-crypto: \(swiftCryptoInterval)")

Now, swift-crypto is faster than CryptoSwift in both cases.

It can be concluded that the transformation from digest to hex string actually has non-trivial impact on the performance.

Then I did a global search within swift-crypto to figure out how it handles hex strings. I found that there's an internal computed property hexString extended to DataProtocol which is used to generate hex strings. Since it's internal, I have to duplicate the PrettyBytes.swift file in out project. Another test shows that the hexString extension has no impact on the performance.

Therefore, the final solution is to duplicate the PrettyBytes.swift file in our project to get access to the hexString extension to solve the performance issue.

All in all, my questions are:

  1. Am I missing some existing APIs which transform the digest data into hex string without impacting the performance?
  2. If not, is it possible to make the hexString extension be a public API to support such use case?
  3. Is there any other recommended way to solve the above issue?

Thanks so much!

Importance:

This may not be a serious problem in small projects. However, for huge projects like what I'm currently working on, we have really strict performance requirements. The performance impact is unacceptable after migrating from "CryptoSwift" to "swift-crypto". Hopefully, this issue can be solved soon :)

@Lukasa
Copy link
Contributor

Lukasa commented Nov 18, 2024

Thanks for filing this!

In Crypto itself, we're not likely to offer the hex string extension itself as it's not part of the CryptoKit API surface. However, if you're willing to use _CryptoExtras, we could add the API as an extension in that module.

If that's not acceptable then I would recommend copying out the PrettyBytes.swift file as you have, or just the relevant pieces. In fact, even that is a diagnostic tool without any performance work done on it, so you can further speed it up:

let charA = UInt8(ascii: "a")
let char0 = UInt8(ascii: "0")

private func itoh(_ value: UInt8) -> UInt8 {
    assert(value < 16)
    return (value > 9) ? (charA &+ value &- 10) : (char0 &+ value)
}

extension DataProtocol {
    var hexString: String {
        let hexLen = self.count * 2
        return String(unsafeUninitializedCapacity: hexLen) { hexChars in
            var offset = 0
        
            self.regions.forEach { (_) in
                for i in self {
                    hexChars[Int(offset &* 2)] = itoh((i >> 4) & 0xF)
                    hexChars[Int(offset &* 2 &+ 1)] = itoh(i & 0xF)
                    offset &+= 1
                }
            }
            
            return offset &* 2
        }
    }
}

@2horse9sun
Copy link
Author

Thanks! Your suggestions are really insightful. I'll close the issue :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants