Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream encrypt using functor based chunk_storage #396

Merged
merged 41 commits into from
Dec 17, 2024

Conversation

dirvine
Copy link
Member

@dirvine dirvine commented Nov 28, 2024

Implement Streaming Encryption with Memory-Efficient Processing

Overview

This PR introduces a new streaming encryption implementation that processes file chunks efficiently without requiring all data to be loaded into memory at once. The implementation maintains the security properties of the original algorithm while improving memory usage for large files.

Key Changes

New Streaming Implementation
Added encrypt_stream function that processes chunks in a streaming fashion
Maintains correct encryption order for all chunks:
Chunk 0: Uses hashes from last two chunks
Chunk 1: Uses hash 0 and last hash
Chunks 2+: Uses previous two hashes
Leverages Rayon for parallel processing of chunks 2 and onwards
Uses get_pad_key_and_iv with sorted source hashes to ensure consistent encryption
Memory Efficiency Improvements
Processes chunks in order without requiring full file loading
Builds hashes incrementally while maintaining encryption requirements
Optimizes memory usage by processing chunks in parallel where possible

Code Organization

Simplified encryption logic by using a pre-sorted vector of source hashes
Removed redundant chunk count parameter
Improved error handling with proper Result types
Maintained compatibility with existing encryption algorithms and data structures

Testing

All tests are passing, including:

Small file encryption
Minimum size requirements
Large file handling
Cross-platform compatibility
Error handling scenarios
Concurrent access patterns
Performance Impact
Improved memory efficiency for large files
Maintained parallel processing capabilities where possible
No significant impact on encryption/decryption speed
Security Considerations
Maintains the same encryption algorithm and security properties
Preserves the original chunk dependency model
No changes to core cryptographic operations

Future Work

Consider additional optimizations for very large files
Potential for streaming decryption implementation
Further performance benchmarking with different file sizes

Breaking Changes
None. This implementation maintains backward compatibility with existing APIs and data formats.

When decrypting from storage, we now recursively retrieve and decrypt
parent data maps if a child data map is provided. This allows handling
of shrunk/nested data maps during decryption.

The decrypt_from_storage function will:
- Check if provided data map is a child
- Recursively get and decrypt parent data maps until root is found
- Use root data map for final decryption of content

This maintains backwards compatibility while adding support for
shrunk data maps in the storage decryption flow.
- Rename `decrypt` to `decrypt_sorted_set` to better reflect its purpose
  of handling pre-sorted encrypted chunks
- Change visibility from `pub` to `pub(crate)` to restrict usage to
  within the crate only, as this is an internal implementation detail

This change improves code clarity and enforces proper encapsulation by
preventing external crates from directly accessing this implementation
function.
Remove unused seek_info related code to simplify the public API. This code
was not being used by any consumers and added unnecessary complexity to
the codebase.

Changes:
- Remove seek_info related structs and functions
- Simplify API by focusing on core encryption/decryption functionality
- Maintain backwards compatibility for existing consumers

This change reduces maintenance burden and makes the API surface more
focused and easier to understand.
Move non-public utility functions from lib.rs to a new utils.rs module to improve code organization and readability. This change:

- Creates a new utils.rs module for internal helper functions
- Moves all non-public functions from lib.rs to utils.rs
- Adds pub(crate) visibility to utility functions
- Updates imports and function calls in lib.rs
- Maintains all existing functionality while improving code structure

The refactor makes lib.rs more focused on the public API while keeping implementation details in a dedicated utils module.
- **Moved stream methods to a separate `src/stream.rs` file:**
  - Extracted stream-related structs and implementations from `src/lib.rs` to `src/stream.rs` for better readability and maintainability.
  - Updated module imports and adjusted code to reflect the new structure.

- **Fixed test failures in `StreamSelfDecryptor`:**
  - Added the `finalize_decryption` method to move the decrypted partial output to the final output path once all chunks are processed.
  - Modified the `next_encrypted` method to call `finalize_decryption` when decryption is complete.
  - Ensured that the decrypted content is available at the expected output path, resolving the "No such file or directory" errors in tests.

- **Improved error handling in `src/stream.rs`:**
  - Updated error construction in `StreamSelfDecryptor` to use `std::io::Error` with appropriate `ErrorKind`, matching the expected type for `Error::Io`.
    - Used `IoError::new(ErrorKind::Other, message)` to create detailed IO errors.
  - Provided more descriptive error messages for file operations, aiding in debugging.

- **Resolved linter warnings and performed code cleanup:**
  - Handled unused `Result` from `insert` operations by assigning them to `_`, satisfying `#[deny(unused_results)]`.
  - Removed unused imports and organized existing imports for clarity.
  - Ensured that all paths and directories are correctly created before file operations, adding checks and error messages where necessary.
  - Replaced `?` operators with `unwrap()` in test functions to make failures more explicit during testing.
  - Added debug print statements in tests to trace execution flow and identify the exact point of failure when a test does not pass.

- **Updated test implementations in `src/tests.rs` and `src/stream/tests.rs`:**
  - Modified tests to verify directory existence before proceeding.
  - Ensured that parent directories are created before writing files.
  - Improved assertions and error messages to provide clearer feedback during test runs.
  - Kept `TempDir` instances alive for the duration of tests to prevent premature deletion of temporary directories.

- **General code enhancements:**
  - Maintained consistent use of file paths and error handling across the codebase.
  - Ensured that temporary files and directories are properly managed to prevent resource leaks or unexpected errors.
  - Improved code readability and maintainability by organizing code logically and adding helpful comments where appropriate.

Files affected:
- `src/stream.rs` (new file created for stream methods)
- `src/stream/tests.rs`
- `src/tests.rs`
- `src/lib.rs` (adjusted imports and removed moved code)
- `Cargo.toml` (updated dependencies if necessary)
Not currently used and requires a complete rewrite
This commit includes several significant improvements to chunk handling and testing:

Core Changes:
- Modified encrypt_from_file to store all chunks, including parent chunks
- Fixed decrypt_full_set to use efficient HashMap lookup for chunks
- Improved shrink_data_map to properly track and return all generated chunks
- Updated decrypt to handle parent chunks correctly using HashMap lookup

Test Improvements:
- Added comprehensive encryption/decryption test covering all combinations:
  * encrypt() -> decrypt()
  * encrypt() -> decrypt_from_storage()
  * encrypt_from_file() -> decrypt()
  * encrypt_from_file() -> decrypt_from_storage()
- Added test_encrypt_from_file_stores_all_chunks to verify chunk storage
- Fixed platform_specific_sizes test to handle parent chunks correctly
- Added detailed debug output in tests for better diagnostics
- Improved error handling tests to match actual behavior

The key insight was that we needed to:
1. Keep track of ALL chunks during encryption and shrinking
2. Use efficient HashMap lookups for chunk retrieval
3. Handle parent chunks correctly in all operations
4. Test all combinations of encryption/decryption methods

These changes ensure consistent behavior across all encryption/decryption paths
and improve the reliability of chunk handling throughout the codebase.
This commit includes significant improvements to the Python bindings and testing:

Core Changes:
- Fix type conversion issues between Python and Rust
- Implement proper PyBytes handling in py_decrypt
- Convert chunk hashes to hex strings for Python callbacks
- Add proper error handling for Python callbacks
- Fix memory management issues with PyEncryptedChunk

Python Interface Improvements:
- Return hex strings instead of raw bytes for chunk names
- Implement proper content() method for PyEncryptedChunk
- Add from_bytes classmethod for PyEncryptedChunk
- Fix type annotations for Python callback functions

Test Improvements:
- Add comprehensive encryption/decryption test suite
- Test all combinations of encryption and decryption methods
- Add tests for large file handling
- Add tests for data map shrinking
- Verify cross-backend compatibility
- Add detailed debug output in tests

The changes ensure consistent behavior between Rust and Python APIs while
maintaining proper memory safety and error handling. The comprehensive test
suite now verifies all major use cases and edge conditions.
…rove Testing

This commit introduces significant improvements to the Python bindings, adds streaming decryption functionality, and enhances test coverage across the codebase.

BREAKING CHANGES:
- Removed 'py_' prefix from Python binding function names for cleaner API
- Changed Python module structure to use `_self_encryption` as internal module name

Features:
- Added streaming decryption support with parallel chunk retrieval
- Implemented comprehensive Python bindings for all core functionality
- Added serialization/deserialization functions to public API
- Enhanced DataMap functionality with better child handling

Python Binding Improvements:
- Restructured Python package to follow best practices
- Added proper type hints and docstrings
- Implemented streaming decryption support in Python
- Fixed naming conflicts between Rust and Python modules
- Added comprehensive Python test suite

Testing Enhancements:
- Added comprehensive encryption/decryption test covering all combinations:
  * In-memory encryption -> All decryption methods
  * File-based encryption -> All decryption methods
  * Streaming decryption with parallel chunk retrieval
- Added cross-platform compatibility tests
- Enhanced error handling tests
- Added tests for DataMap child functionality
- Improved test coverage for edge cases

Documentation:
- Updated README with comprehensive Python usage examples
- Added detailed API documentation for Python bindings
- Improved function documentation and type annotations
- Added examples for streaming operations
- Updated implementation details section

Build System:
- Updated pyproject.toml for better Python package management
- Fixed maturin configuration for proper module naming
- Improved Python package structure
- Added proper dependencies in setup.py

The changes provide a more robust, well-tested, and user-friendly interface for both Rust and Python users, with particular emphasis on handling large files efficiently through streaming operations.
BREAKING_CHANGE :
This commit adds chunk verification functionality and improves the Python bindings:

Features:
- Add verify_chunk function to validate chunk content against expected hash
- Add comprehensive tests for chunk verification
- Clean up Python bindings naming and structure

Python Binding Changes:
- Remove 'py_' prefix from all Python-exposed functions for cleaner API
- Fix module structure to properly expose functionality
- Add chunk verification to Python API
- Update Python tests to cover all functionality including verification
- Fix module naming and import issues

Testing:
- Add test_verify_chunk to Python test suite
- Add Rust test for verify_chunk functionality
- Enhance comprehensive test coverage
- Add cross-platform compatibility tests

Documentation:
- Update README with chunk verification examples in both Rust and Python
- Clean up Python API documentation
- Add type hints and docstrings to Python bindings

Build System:
- Fix pyproject.toml configuration
- Update module naming in setup.py
- Fix Python package structure
- Add proper dependencies

The changes provide a more robust way to verify chunk integrity and improve the overall usability of the Python bindings.
This commit updates the benchmarks to use the public API instead of internal functions:

- Replace `decrypt_full_set` with public `decrypt` function in benchmarks
- Update imports in benches/lib.rs to use only public API
- Maintain same benchmarking functionality while using proper public interface

The change ensures that benchmarks follow the same patterns as regular usage
of the library by utilizing the public API instead of internal implementation
details. This makes the benchmarks more representative of real-world usage
and maintains proper API boundaries.
This commit enhances the Python bindings documentation and usability:

Core Changes:
- Add detailed module-level documentation to self_encryption/__init__.py
- Add comprehensive docstrings to all Python-exposed classes and functions
- Document all parameters, return types, and exceptions
- Add examples in docstrings for common use cases

Documentation Improvements:
- Add overview of library features and capabilities
- Document all public APIs with type hints and descriptions
- Add usage examples for basic and advanced features
- Include detailed explanations of key concepts
- Add cross-references between related functionality

Python Bindings:
- Add docstrings to PyDataMap, PyEncryptedChunk, and PyXorName classes
- Document all class methods and attributes
- Add parameter and return type documentation
- Include exception information in docstrings
- Add examples for common operations

The changes ensure that Python users can get comprehensive help using
the built-in help() function, making the library more accessible and
easier to use. Documentation follows Python conventions and provides
clear, practical examples for all functionality.
Add a comprehensive CLI that exposes all core functionality of the self-encryption
library, making it easily accessible from the terminal. The CLI is fully
integrated with the Python package and will be included in all releases.

New Features:
- Add `self-encryption` CLI command with subcommands:
  - encrypt-file: Encrypt files and store chunks
  - decrypt-file: Decrypt files using data maps and chunks
  - verify: Verify integrity of encrypted chunks
  - shrink: Optimize data maps by consolidating chunks
- Add comprehensive --help documentation for all commands
- Support both JSON and human-readable output formats
- Include streaming support for large file operations

Technical Changes:
- Add click>=8.0.0 as a dependency
- Configure entry points in both pyproject.toml and setup.py
- Integrate CLI module with package's public interface
- Add error handling with colored output

The CLI maintains full compatibility with existing Python bindings while
making the library more accessible to command-line users.
Improve documentation and clarity of the Python bindings and CLI interface,
with particular focus on streaming operations and API completeness.

Documentation Improvements:
- Add comprehensive docstrings for all Python functions
- Enhance streaming_decrypt_from_storage documentation with parallel processing details
- Add detailed examples using ThreadPoolExecutor for chunk retrieval
- Include minimum size requirements in encryption documentation
- Add CLI usage examples and command descriptions

Technical Details:
- Clarify get_chunks callback interface for streaming operations
- Add key features section for streaming functionality
- Improve error descriptions and type hints
- Add memory efficiency notes for large file operations

Version bump to 0.32.3 to reflect documentation improvements.
To add actual streaming next
- Replace get_pki with get_pad_key_and_iv to match original encryption algorithm
- Create sorted vector of source hashes upfront for consistent encryption
- Remove unused chunk_count parameter from encrypt_stream function
- Simplify code by removing manual hash lookups for each chunk type

All tests passing. This change maintains the same encryption algorithm while
processing chunks in a streaming fashion, improving memory efficiency for
large files.
@dirvine dirvine requested a review from a team as a code owner November 28, 2024 19:06
This commit introduces several improvements to the Python bindings and documentation:

- Add streaming_encrypt_from_file function for memory-efficient file encryption
- Consolidate Python documentation into a single, well-organized section
- Update examples to demonstrate both regular and streaming encryption methods
- Add installation instructions and improve API documentation structure
- Remove duplicate and outdated Python binding sections

The streaming encryption feature allows processing large files without loading
them entirely into memory, while maintaining the same security guarantees as
the regular encryption method.
@dirvine dirvine force-pushed the stream_encrypt branch 3 times, most recently from 67e1fa6 to 1714681 Compare November 29, 2024 01:02
Comment on lines 8 to 9
// Load the data map from file or another source
let data_map = load_data_map("path/to/data_map")?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this example run?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @grumpy I have added clap and made this use cmd line args now

src/encrypt.rs Outdated
Comment on lines 109 to 117
/// Encrypt chunks in a streaming fashion, processing them in the correct order to satisfy the
/// encryption requirements. Each chunk is encrypted using the hashes of two other chunks:
/// - For chunk 0: Uses hashes of the last two chunks
/// - For chunk 1: Uses hash of chunk 0 and the last chunk
/// - For chunks 2+: Uses hashes of the previous two chunks
pub(crate) fn encrypt_stream(
chunks: Vec<RawChunk>,
) -> Result<(DataMap, Vec<EncryptedChunk>)> {
// Create a sorted vector of all hashes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory impact of this one for HUGE files is HUGE x 2 as chunks stays in memory all run time and encrypted_chunks as well.
I don't understand how this is streaming?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the code is clean and I think this could be used as a regular all in mem algo but stream should maybe be stripped out of the name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, this was in another commit. I have added it in the code now and we actually do save chunks to the store as we process them and that should fix this one, but great catch

Comment on lines -156 to -159
pub struct StreamSelfEncryptor {
// File path for the encryption target.
file_path: PathBuf,
// List of `(start_position, end_position)` for each chunk for the target file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that we removed this buggy piece of code!

Comment on lines +5 to +7
/// Helper function to XOR a data with a pad (pad will be rotated to fill the length)
pub(crate) fn xor(data: &Bytes, &Pad(pad): &Pad) -> Bytes {
let vec: Vec<_> = data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having all these in a different utils.rs file is a great improvement!!

Comment on lines -67 to -76
// (NB: this hard-coded ref needs update if algorithm changes)
let ref_data_map = vec![
ChunkInfo {
src_hash: XorName([
219, 177, 84, 234, 189, 172, 82, 64, 169, 100, 5, 56, 3, 43, 142, 126, 51, 235,
194, 243, 30, 130, 132, 197, 137, 36, 170, 62, 46, 44, 176, 201,
]),
dst_hash: XorName([
248, 155, 46, 153, 173, 52, 226, 212, 133, 172, 107, 200, 72, 150, 41, 50, 116, 77,
85, 92, 67, 168, 25, 56, 93, 61, 209, 194, 65, 172, 227, 130,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the algo change? I feel like this kind of test was very useful as they are very strict

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as we now shrink data_maps and they all have length==3 this test became invalid. I think we should re-introduce such a test again when all this work is in place as it's very handy. In tis case I Could not differentiate the chunks from the file and data_map though.

What we did though was create a backend matrix that compares all encrypt and decrypt functions in a matrix to check we have not broken any existing algorithms. It's not as robust as a test like this one though and I agree we should introduce another hard coded one like this soon

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the issue is we now shrink the data maps and don't return any > Len == 3 so the test became useless at this time. We should look to create another such test for backwards compatibility again when all these changes are in place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was maybe added by accident?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it should be gone now.

Copy link
Member

@grumbach grumbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of great code quality improvements!
Not sure I see were the streaming is in the encrypt_stream fn, maybe we should just rename it to regular encrypt? And if it is better than what we had, maybe replace the original one?
Great to see some python support!

The streaming encryption implementation was holding all encrypted chunks in
memory, defeating the purpose of streaming. This change:

- Modified encrypt_stream to only return DataMap instead of (DataMap, Vec<EncryptedChunk>)
- Updated streaming_encrypt_from_file to encrypt and store chunks immediately
- Removed redundant chunk storage in encrypt_stream
- Maintains same encryption logic and chunk ordering

This reduces memory usage by processing and storing chunks as they are
encrypted rather than accumulating them in memory.
Added a test to verify that the encryption algorithm produces consistent
results when encrypting the same input data multiple times. The test:

- Creates identical input files with deterministic content
- Encrypts them using separate storage instances
- Verifies that:
  - Data maps have identical structure
  - Generated chunks have identical content
  - Chunk hashes match between runs

This helps ensure that changes to the streaming implementation don't
affect the underlying encryption algorithm's deterministic behavior.
Enhanced the parallel streaming decryptor example with proper CLI handling:

- Added required command line arguments for:
  - Data map file path (-d, --data-map)
  - Chunks directory path (-c, --chunks-dir)
  - Output file path (-o, --output)
- Added comprehensive path validation:
  - Checks data map file exists and is readable
  - Verifies chunks directory exists and is a directory
  - Validates output directory exists and is writable
- Added clap dependency with derive features
- Improved error messages and user feedback

The example now provides a more robust CLI interface with proper
validation before attempting decryption operations.
b-zee
b-zee previously approved these changes Dec 10, 2024
Rename the encryption module to aes to better reflect its implementation
using AES-128-CBC cipher. Update all module references across the codebase:
- Update lib.rs to use 'mod aes' instead of 'mod encryption'
- Update imports in utils.rs, encrypt.rs, and decrypt.rs
- Keep the same functionality while improving code clarity

This change makes the codebase more maintainable by using more specific
and accurate module naming.
Refactors streaming_encrypt_from_file to use a single pass through the file while
maintaining correct encryption. Key changes:

- Collects all source hashes while reading chunks
- Processes chunks 2+ immediately after getting their hashes
- Stores first two chunks for processing after all hashes are available
- Uses shrink_data_map to handle child map creation
- Returns shrunk data map for consistency with encrypt()

This maintains the streaming efficiency while ensuring correct encryption by:
1. Only reading each chunk once from disk
2. Keeping minimal data in memory
3. Preserving the encryption requirements (all hashes needed)
4. Properly handling the data map hierarchy

The function now matches the behavior of encrypt() while being more memory
efficient for large files.
@dirvine dirvine merged commit 3bd5314 into maidsafe:master Dec 17, 2024
6 of 10 checks passed
@dirvine dirvine deleted the stream_encrypt branch December 17, 2024 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants