
refactor: simplify dataset construction #4437

Merged
14 commits merged into deepmodeling:devel from refactor-dl on Dec 9, 2024

Conversation

@caic99 (Member) commented Nov 27, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a new function for dataset construction, enhancing data loading processes.
    • Added a method to improve pickling and unpickling capabilities for path handling classes.
  • Bug Fixes

    • Updated summary printing to prevent redundant output during distributed training.
  • Refactor

    • Simplified initialization of the BackgroundConsumer class.
    • Streamlined consumer thread and queue handling in the BufferedIterator class.

@caic99 (Member, Author) commented Nov 27, 2024

@anyangml This PR adds more detailed warning output for the case where dataset reading is throttled.

@coderabbitai (bot, Contributor) commented Nov 27, 2024

📝 Walkthrough

The changes in this pull request involve modifications to two primary files: dataloader.py and path.py. In dataloader.py, enhancements are made to the DpLoaderSet, BackgroundConsumer, and BufferedIterator classes, including the addition of a new dataset construction function and updates to existing methods for improved modularity and clarity. In path.py, a new abstract method for pickling support is introduced in the DPPath class and its subclasses, enhancing the class hierarchy's serialization capabilities.

Changes

File Change Summary

deepmd/pt/utils/dataloader.py
  • Added a construct_dataset(system, type_map) function.
  • Updated print_summary to include a rank check.
  • Removed the max_len parameter from the BackgroundConsumer constructor.
  • Adjusted BufferedIterator to remove max_len from consumer initialization.

deepmd/utils/path.py
  • Added a __getnewargs__ method in the DPPath, DPOSPath, and DPH5Path classes for enhanced pickling support.
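
The following is a minimal, self-contained sketch of the factory-plus-partial pattern summarized above; the stub dataset class and the example type_map and system paths are assumptions for illustration (the real code constructs DeepmdDataSetForLoader instances defined elsewhere in deepmd):

from functools import partial

class DeepmdDataSetForLoader:
    # Hypothetical stand-in so this sketch runs on its own; the real class
    # lives in deepmd and actually loads the system's data.
    def __init__(self, system, type_map):
        self.system = system
        self.type_map = type_map

def construct_dataset(system, type_map):
    # One dataset per system path; type_map is shared across all systems.
    return DeepmdDataSetForLoader(system=system, type_map=type_map)

# Binding type_map once yields a single-argument callable, which is the
# shape Pool.map expects when mapping over the list of system paths.
construct_dataset_systems = partial(construct_dataset, type_map=["O", "H"])
datasets = [construct_dataset_systems(s) for s in ["sys_a", "sys_b"]]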

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DpLoaderSet
    participant BackgroundConsumer
    participant BufferedIterator

    User->>DpLoaderSet: Initialize with parameters
    DpLoaderSet->>DpLoaderSet: Call construct_dataset
    DpLoaderSet->>BackgroundConsumer: Initialize consumer
    BackgroundConsumer->>BufferedIterator: Initialize iterator
    BufferedIterator->>BackgroundConsumer: Start consuming data
    BackgroundConsumer->>DpLoaderSet: Signal end of data loading
    DpLoaderSet->>User: Print summary (if rank 0)


@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
deepmd/pt/utils/dataloader.py (2)

5-7: LGTM! Consider adding type hints for better code clarity.

The introduction of construct_dataset function and the use of partial improves code reusability and maintainability. Consider adding type hints to make the function signature more explicit:

-def construct_dataset(system, type_map):
+def construct_dataset(system: str, type_map: list[str]) -> DeepmdDataSetForLoader:

Also applies to: 11-13, 59-63


225-246: Consider making queue size and warning threshold configurable.

The current implementation has hardcoded values that might not be optimal for all scenarios:

  1. QUEUESIZE = 32 might need adjustment based on memory constraints or dataset characteristics
  2. The 1-second warning threshold might be too aggressive for larger batches or slower storage systems

Consider making these values configurable:

-QUEUESIZE = 32
+DEFAULT_QUEUE_SIZE = 32
+DEFAULT_WARNING_THRESHOLD = 1.0

 class BufferedIterator:
-    def __init__(self, iterable) -> None:
+    def __init__(
+        self,
+        iterable,
+        queue_size: int = DEFAULT_QUEUE_SIZE,
+        warning_threshold: float = DEFAULT_WARNING_THRESHOLD
+    ) -> None:
-        self._queue = Queue(QUEUESIZE)
+        self._queue = Queue(queue_size)
         self._iterable = iterable
         self._consumer = BackgroundConsumer(self._queue, self._iterable)
         self._consumer.start()
         self.len = len(iterable)
+        self._warning_threshold = warning_threshold

     def __next__(self):
         start_wait = time.time()
         item = self._queue.get()
         wait_time = time.time() - start_wait
-        if wait_time > 1.0:
+        if wait_time > self._warning_threshold:
             log.warning(f"Data loading is slow, waited {wait_time:.2f} seconds.")
deepmd/utils/path.py (1)

47-49: LGTM! Consider adding docstring for clarity.

The __getnewargs__ implementation correctly returns the essential arguments needed for pickling. However, consider adding a docstring to explain its purpose:

 def __getnewargs__(self):
+    """Return a tuple of arguments needed to create a new instance during unpickling.
+
+    Returns
+    -------
+    tuple
+        A tuple of (path, mode) used by __new__ to reconstruct the object.
+    """
     return (self.path, self.mode)
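
For context, here is a runnable sketch of how __getnewargs__ participates in the pickle roundtrip; the class below is a hypothetical stand-in, not the real DPOSPath, and only the (path, mode) mechanics mirror the diff:

import pickle

class PathSketch:
    # __new__ takes required arguments, which is exactly why pickle needs
    # __getnewargs__: it supplies these arguments at unpickling time.
    def __new__(cls, path, mode="r"):
        obj = super().__new__(cls)
        obj.path, obj.mode = path, mode
        return obj

    def __getnewargs__(self):
        return (self.path, self.mode)

p = PathSketch("data/system1", mode="r")
restored = pickle.loads(pickle.dumps(p))
assert (restored.path, restored.mode) == (p.path, p.mode)

Without __getnewargs__, unpickling would call PathSketch.__new__(PathSketch) with no arguments and fail with a TypeError.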
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4a45fe5 and 417da94.

📒 Files selected for processing (2)
  • deepmd/pt/utils/dataloader.py (5 hunks)
  • deepmd/utils/path.py (1 hunks)
🔇 Additional comments (3)
deepmd/pt/utils/dataloader.py (3)

194-208: LGTM! Rank check prevents duplicate output in distributed training.

The addition of rank check ensures that summary information is printed only once in distributed training environments, improving log clarity.
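
A minimal sketch of such a rank-gated print, assuming torch.distributed is the dist module in scope (print_summary_once is a hypothetical helper name, not the actual method):

import torch.distributed as dist

def print_summary_once(message: str) -> None:
    # In a distributed run only rank 0 prints; single-process runs
    # (dist unavailable or uninitialized) always print.
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    if rank == 0:
        print(message)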


212-223: LGTM! More Pythonic implementation with StopIteration.

The simplified interface and use of StopIteration as a sentinel value aligns better with Python's iterator protocol.
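
A condensed sketch of the producer side, assuming the daemon-thread design noted later in this thread (the merged code may differ in detail):

from queue import Queue
from threading import Thread

class BackgroundConsumer(Thread):
    def __init__(self, queue: Queue, source) -> None:
        super().__init__(daemon=True)
        self._queue = queue
        self._source = source

    def run(self) -> None:
        for item in self._source:
            self._queue.put(item)  # blocks while the queue is full
        # End-of-data sentinel; the reading side re-raises it, which is
        # exactly the iterator protocol's end-of-iteration signal.
        self._queue.put(StopIteration())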


Line range hint 100-110: LGTM! Verify pool size calculation in distributed environments.

The parallel dataset construction using Pool is efficient and properly considers distributed training environments. However, let's verify the pool size calculation behavior across different distributed configurations.

✅ Verification successful

Let me gather more information about the distributed training setup to verify the pool size calculation.


LGTM! Pool size calculation is correctly implemented for distributed environments

The pool size calculation is properly implemented and consistent with the distributed training setup:

  • The code correctly checks for distributed environment using dist.is_available() and dist.is_initialized()
  • LOCAL_WORLD_SIZE is only used in this specific pool size calculation, which is appropriate as it represents the number of processes on the local machine
  • The distributed training initialization and rank handling throughout the codebase follows consistent patterns
  • The pool size division ensures resources are properly allocated across distributed processes without oversubscription
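
For reference, a condensed, runnable sketch of the sizing logic described in the bullets above; construct and the example inputs are hypothetical stand-ins, and LOCAL_WORLD_SIZE is the torchrun-provided count of processes on the local node:

import os
from functools import partial
from multiprocessing import Pool

import torch.distributed as dist

def construct(system, type_map):
    # Hypothetical stand-in for construct_dataset.
    return (system, type_map)

if __name__ == "__main__":
    local_procs = (
        int(os.environ["LOCAL_WORLD_SIZE"])
        if dist.is_available() and dist.is_initialized()
        else 1
    )
    # Split the node's CPUs among co-located ranks to avoid oversubscription.
    with Pool(os.cpu_count() // local_procs) as pool:
        datasets = pool.map(partial(construct, type_map=["O", "H"]), ["sys_a", "sys_b"])
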
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential issues with pool size calculation in distributed environments

# Search for other uses of LOCAL_WORLD_SIZE to ensure consistent handling
rg "LOCAL_WORLD_SIZE" -A 2 -B 2

# Search for similar pool size calculations to verify consistency
ast-grep --pattern 'Pool(
  $$$
  os.cpu_count()
  $$$
)'

Length of output: 425


Script:

#!/bin/bash
# Check for distributed training initialization and configuration
rg "dist\.init_process_group|dist\.is_initialized\(\)" -A 2 -B 2

# Look for other process/thread pool implementations
rg "multiprocessing\.(Pool|Process|ThreadPool)" -A 2 -B 2

# Check if there are any other environment variables used for process control
rg "os\.environ\[.*WORLD_SIZE" -A 2 -B 2

# Search for any comments or documentation about distributed training
rg -g "*.md" -g "*.rst" "distributed training|LOCAL_WORLD_SIZE"

Length of output: 6341

[Resolved review thread on deepmd/utils/path.py (outdated)]
@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
deepmd/pt/utils/dataloader.py (1)

246-252: Consider making the warning threshold configurable.

The 1.0-second threshold for slow data loading warnings is hardcoded. Consider making this configurable to accommodate different system capabilities and requirements.

-        if (
-            wait_time > 1.0
-        ):  # Even for Multi-Task training, each step usually takes < 1s
+        # Default to 1.0s, but allow configuration through environment variable
+        warning_threshold = float(os.getenv('DEEPMD_LOADER_WARNING_THRESHOLD', '1.0'))
+        if wait_time > warning_threshold:
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 417da94 and e9672d0.

📒 Files selected for processing (1)
  • deepmd/pt/utils/dataloader.py (5 hunks)
🔇 Additional comments (5)
deepmd/pt/utils/dataloader.py (5)

60-64: LGTM! Clean factory function implementation.

The new construct_dataset function provides a clean abstraction for dataset construction, improving code modularity and reusability.


Line range hint 102-112: LGTM! Improved parallel dataset construction.

Good improvements:

  • Using partial for cleaner parameter binding
  • Proper CPU pool sizing for distributed training
  • More maintainable parallel dataset construction

196-210: LGTM! Proper handling of distributed output.

Good addition of rank-aware printing to prevent duplicate output in distributed training environments.


233-237: LGTM! Proper length protocol implementation.

Good addition of the len attribute to support Python's length protocol.


224-225: Verify StopIteration handling in DataLoader context.

The comment suggests this case shouldn't occur for DataLoader. Let's verify the DataLoader's iteration behavior.

✅ Verification successful

StopIteration handling is correctly implemented

The code review reveals that the StopIteration concern is properly handled:

  • The BackgroundConsumer sends StopIteration through the queue when iteration is complete
  • The BufferedIterator's __next__ method checks if the received item is an Exception (which includes StopIteration) and raises it appropriately
  • The comment "this should not happen for DataLoader" is misleading and can be ignored, as the StopIteration handling is a necessary part of the Python iterator protocol implementation
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for DataLoader iteration handling to verify StopIteration behavior
ast-grep --pattern 'class DataLoader {
  $$$
  def __iter__() {
    $$$
  }
  $$$
}'

Length of output: 86


Script:

#!/bin/bash
# Let's try a broader search for DataLoader class and its iteration-related methods
rg -A 10 "class DataLoader" deepmd/pt/utils/dataloader.py

# Also search for any StopIteration handling
rg -B 2 -A 2 "StopIteration" deepmd/pt/utils/dataloader.py

Length of output: 429


Script:

#!/bin/bash
# Let's search for the BackgroundConsumer class and its context
rg -B 10 -A 20 "class BackgroundConsumer" deepmd/pt/utils/dataloader.py

# Also search for where this consumer is used
rg -B 5 -A 5 "BackgroundConsumer" deepmd/pt/utils/dataloader.py

Length of output: 1668


Script:

#!/bin/bash
# Let's see the complete BufferedIterator implementation to understand how it handles iteration
rg -B 2 -A 15 "class BufferedIterator" deepmd/pt/utils/dataloader.py

# And search for where BufferedIterator is used
rg -B 5 -A 5 "BufferedIterator" deepmd/pt/utils/dataloader.py

Length of output: 830


Script:

#!/bin/bash
# Let's see the complete __next__ implementation of BufferedIterator
rg -B 2 -A 10 "def __next__" deepmd/pt/utils/dataloader.py

# And let's see if there's a DataLoader class implementation in the file
rg -B 5 -A 15 "class.*DataLoader" deepmd/pt/utils/dataloader.py

Length of output: 565

@iProzd (Collaborator) commented Nov 27, 2024

Duplicated in #4434

@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
deepmd/pt/utils/dataloader.py (3)

102-102: LGTM! Good distributed training awareness.

The changes improve the code by:

  1. Preventing redundant output in distributed training with rank check
  2. Properly scaling pool size based on LOCAL_WORLD_SIZE
  3. Using clean dataset construction abstraction with Pool

Consider adding error handling for the Pool operations to gracefully handle potential failures during parallel dataset construction.

 with Pool(
     os.cpu_count()
     // (
         int(os.environ["LOCAL_WORLD_SIZE"])
         if dist.is_available() and dist.is_initialized()
         else 1
     )
 ) as pool:
-    self.systems = pool.map(construct_dataset_systems, systems)
+    try:
+        self.systems = pool.map(construct_dataset_systems, systems)
+    except Exception as e:
+        log.error(f"Failed to construct datasets in parallel: {e}")
+        # Fallback to sequential construction
+        self.systems = [construct_dataset_systems(system) for system in systems]

Also applies to: 196-210


214-226: Improve the comment about DataLoader.

The changes to use StopIteration for signaling are good, but the comment "this should not happen for DataLoader" is unclear. Consider clarifying when and why StopIteration might occur.

-        # Signal the consumer we are done; this should not happen for DataLoader
+        # Signal the end of iteration. Note: For DataLoader, this typically only occurs
+        # when the DataLoader is explicitly closed or the dataset is exhausted
         self._queue.put(StopIteration)

233-253: Enhance warning system and error handling.

The changes improve the warning output, but consider these enhancements:

  1. Make the warning threshold configurable
  2. Add more actionable information to the warning
  3. Make the error handling more explicit
+    # Class variable for warning threshold
+    SLOW_LOADING_THRESHOLD = 1.0  # seconds
+
     def __init__(self, iterable) -> None:
         self._queue = Queue(QUEUESIZE)
         self._iterable = iterable
         self._consumer = BackgroundConsumer(self._queue, self._iterable)
         self._consumer.start()
         self.len = len(iterable)
+        self._warned = False  # Track if warning was already issued

     def __next__(self):
         start_wait = time.time()
         item = self._queue.get()
         wait_time = time.time() - start_wait
-        if (
-            wait_time > 1.0
-        ):  # Even for Multi-Task training, each step usually takes < 1s
-            log.warning(f"Data loading is slow, waited {wait_time:.2f} seconds.")
+        if wait_time > self.SLOW_LOADING_THRESHOLD and not self._warned:
+            self._warned = True  # Warn only once to avoid log spam
+            log.warning(
+                f"Data loading is slow (waited {wait_time:.2f}s). Consider:\n"
+                "1. Increasing the number of worker processes\n"
+                "2. Reducing batch size\n"
+                "3. Using memory-mapped files or faster storage"
+            )
-        if issubclass(item, Exception):
+        if item is StopIteration:
             raise item
+        elif isinstance(item, Exception):
+            raise RuntimeError(f"Background worker failed: {item}")
         return item
deepmd/pt/train/training.py (1)

1061-1071: Consider enhancing error handling and logging for data loading issues.

While the code correctly handles StopIteration for dataloader refresh, it would be beneficial to:

  1. Add logging when the dataloader is refreshed to help with debugging
  2. Consider adding a max retry count to prevent infinite loops in case of persistent data loading issues

Here's a suggested improvement:

 if data is None and not is_train:
     return {}, {}, {}
 if self.multi_task:
     data = data[task_key]
     dataloader = dataloader[task_key]
+max_retries = 3
+retry_count = 0
 try:
     batch_data = next(iter(data))
 except StopIteration:
+    log.debug(f"Refreshing dataloader for {'training' if is_train else 'validation'} data")
     # Refresh the status of the dataloader to start from a new epoch
     data = BufferedIterator(iter(dataloader))
+    while retry_count < max_retries:
+        try:
+            batch_data = next(iter(data))
+            break
+        except StopIteration:
+            retry_count += 1
+            log.warning(f"Failed to get batch after refresh, attempt {retry_count}/{max_retries}")
+            data = BufferedIterator(iter(dataloader))
+    if retry_count == max_retries:
+        log.error("Failed to get batch after maximum retries")
+        raise RuntimeError("Failed to get batch after maximum retries")
-    batch_data = next(iter(data))
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e9672d0 and 3895c94.

📒 Files selected for processing (2)
  • deepmd/pt/train/training.py (1 hunks)
  • deepmd/pt/utils/dataloader.py (5 hunks)
🔇 Additional comments (3)
deepmd/pt/utils/dataloader.py (2)

5-7: LGTM! Clean abstraction for dataset construction.

The new construct_dataset function provides a clean abstraction for creating dataset instances, and using partial to bind type_map is a good functional programming practice.

Also applies to: 60-64


Line range hint 1-253: Verify PR objectives implementation.

The changes successfully achieve the PR objectives:

  1. Dataset construction is simplified through the new construct_dataset function and proper Pool usage
  2. Warning output for throttled reading is improved

However, consider enhancing the warning system further to provide even more detailed diagnostics about the cause of throttling.

✅ Verification successful

Warning system implementation is appropriate for the use case

The warning system in the dataloader is well-implemented and sufficient for its purpose:

  • The warning at line 252 provides clear timing information: log.warning(f"Data loading is slow, waited {wait_time:.2f} seconds.")
  • The 1-second threshold is reasonable as it's calibrated based on typical step times ("Even for Multi-Task training, each step usually takes < 1s")
  • The warning includes both the fact that data loading is slow and the exact wait time for diagnostics

Additional diagnostic details would add unnecessary complexity since:

  1. The root cause (I/O bottleneck) is clear from the wait time
  2. The warning already triggers at an appropriate threshold
  3. The message is actionable - users can adjust batch sizes or I/O parameters
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the warning system is properly triggered in different scenarios

# Test: Search for all warning-related code to ensure comprehensive coverage
rg -n "warning|warn" --type py

# Test: Check for any existing performance-related issues or discussions
gh issue list --label "performance"
gh pr list --label "performance"

Length of output: 10444

deepmd/pt/train/training.py (1)

1056-1060: LGTM! Clean data selection logic.

The code cleanly handles the selection between training and validation data using tuple unpacking, which improves readability.

@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
deepmd/pt/utils/dataloader.py (3)

5-7: LGTM! Consider adding docstring to the new function.

The addition of construct_dataset function and the use of partial improves code modularity by extracting the dataset construction logic. Consider adding a docstring to document the parameters and return type.

 def construct_dataset(system, type_map):
+    """Construct a DeepmdDataSetForLoader instance.
+
+    Parameters
+    ----------
+    system : str
+        Path to the system data
+    type_map : list
+        Mapping of atom types
+
+    Returns
+    -------
+    DeepmdDataSetForLoader
+        Dataset instance for the given system
+    """
     return DeepmdDataSetForLoader(
         system=system,
         type_map=type_map,
     )

Also applies to: 60-64


Line range hint 102-112: LGTM! Consider adding progress feedback for large datasets.

The parallel dataset construction using Pool is a good improvement. For better user experience when loading large datasets, consider adding progress feedback.

         construct_dataset_systems = partial(construct_dataset, type_map=type_map)
+        if len(systems) >= 100:
+            from tqdm import tqdm
+            with Pool(...) as pool:
+                self.systems = list(tqdm(
+                    pool.imap(construct_dataset_systems, systems),
+                    total=len(systems),
+                    desc="Constructing datasets"
+                ))
-        with Pool(...) as pool:
-            self.systems = pool.map(construct_dataset_systems, systems)

237-237: Consider making the warning threshold configurable.

While the warning for slow data loading is helpful, the 1-second threshold might not be suitable for all scenarios. Consider making this threshold configurable.

 class BufferedIterator:
     def __init__(self, iterable) -> None:
         self._queue = Queue(QUEUESIZE)
         self._iterable = iterable
         self._consumer = BackgroundConsumer(self._queue, self._iterable)
         self._consumer.start()
         self.len = len(iterable)
+        self.warning_threshold = 1.0  # seconds
 
     def __next__(self):
         start_wait = time.time()
         item = self._queue.get()
         wait_time = time.time() - start_wait
-        if wait_time > 1.0:
+        if wait_time > self.warning_threshold:
             log.warning(f"Data loading is slow, waited {wait_time:.2f} seconds.")

Also applies to: 246-252

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 3895c94 and ddc39de.

📒 Files selected for processing (1)
  • deepmd/pt/utils/dataloader.py (5 hunks)
🔇 Additional comments (3)
deepmd/pt/utils/dataloader.py (3)

196-210: LGTM! Good handling of distributed output.

The rank check in print_summary prevents duplicate output in distributed training, which is the correct approach.


214-226: LGTM! Good simplification of the consumer implementation.

The changes improve the code by:

  • Removing the unused max_len parameter
  • Using StopIteration for end-of-data signaling, which better aligns with Python's iterator protocol

233-237: Verify thread safety of the iterator implementation.

Let's verify that the iterator implementation is thread-safe, particularly around the len attribute access.

✅ Verification successful

Thread safety implementation is correct and well-designed

The implementation is thread-safe due to several key design choices:

  • The Queue is properly used as a thread-safe producer-consumer pattern
  • Length is captured once during initialization and stored as an immutable attribute
  • BackgroundConsumer runs as a daemon thread with clean separation of concerns
  • The iterator implementation uses blocking queue operations which handle synchronization
  • No shared mutable state exists between the consumer thread and main thread
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential race conditions in iterator usage
ast-grep --pattern 'class $_ {
  $$$
  def __iter__($$$) {
    $$$
  }
  $$$
  def __len__($$$) {
    $$$
  }
  $$$
}'

Length of output: 120


Script:

#!/bin/bash
# Let's check the implementation of BackgroundConsumer and related iterator code
rg -A 10 "class BackgroundConsumer" 

# Also check the implementation of the class containing this Queue setup
rg -B 5 -A 10 "self._queue = Queue\(QUEUESIZE\)"

# Check if there are any other references to len or iterator access
rg "def __len__" deepmd/pt/utils/dataloader.py
rg "def __iter__" deepmd/pt/utils/dataloader.py

Length of output: 1817

@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
deepmd/pt/train/training.py (1)

1056-1071: LGTM! Clean and efficient data loading implementation.

The refactored code is well-structured and handles both single-task and multi-task scenarios elegantly. The improvements include:

  • Cleaner data/dataloader selection using tuple unpacking
  • Clear handling of validation data absence
  • Proper error handling for dataset iteration

Consider adding type hints to improve code maintainability:

-    def get_data(self, is_train=True, task_key="Default"):
+    def get_data(
+        self,
+        is_train: bool = True,
+        task_key: str = "Default"
+    ) -> tuple[dict, dict, dict]:
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between ddc39de and 1310691.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (1 hunks)

[Resolved review thread on deepmd/pt/train/training.py (outdated)]
Co-authored-by: Han Wang <[email protected]>
Signed-off-by: Chun Cai <[email protected]>
@codecov (bot) commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 84.21053% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83.75%. Comparing base (037cf3f) to head (a199c44).
Report is 10 commits behind head on devel.

Files with missing lines        Patch %   Lines
deepmd/utils/path.py            66.66%    2 Missing ⚠️
deepmd/pt/utils/dataloader.py   92.30%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4437      +/-   ##
==========================================
+ Coverage   83.27%   83.75%   +0.47%     
==========================================
  Files         667      667              
  Lines       61446    61525      +79     
  Branches     3486     3487       +1     
==========================================
+ Hits        51167    51528     +361     
+ Misses       9151     8872     -279     
+ Partials     1128     1125       -3     


[Resolved review threads on deepmd/pt/train/training.py and deepmd/utils/path.py (outdated)]
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Nov 28, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Nov 28, 2024
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Nov 28, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Nov 28, 2024
[Resolved review thread on deepmd/pt/train/training.py (outdated)]
@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
deepmd/pt/train/training.py (1)

1066-1072: Add a clarifying comment for StopIteration handling.

The StopIteration handling logic ensures continuous training across epochs, but this important detail isn't immediately obvious.

Add a comment to explain the purpose:

        try:
            batch_data = next(iter(data))
        except StopIteration:
-            # Refresh the status of the dataloader to start from a new epoch
+            # When all data in the current epoch is exhausted, refresh the dataloader
+            # to start a new epoch. This ensures continuous training across epochs.
            with torch.device("cpu"):
                data = BufferedIterator(iter(dataloader))
            batch_data = next(iter(data))
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between a1506af and 91d2b9c.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (1 hunks)
🔇 Additional comments (1)
deepmd/pt/train/training.py (1)

1056-1065: LGTM! But verify test coverage for multi-task handling.

The refactoring improves code readability by using tuple unpacking and clear conditional logic. However, based on past review comments, ensure that the multi-task handling at line 1062 is properly covered by tests.

@caic99 (Member, Author) commented Nov 29, 2024

I'm reverting the changes to training.py, since there are too many changes in one PR, which makes it hard to test.

@caic99 caic99 requested a review from njzjz November 29, 2024 02:27
@caic99 caic99 requested a review from wanghan-iapcm November 29, 2024 05:22
@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
deepmd/pt/utils/dataloader.py (2)

Line range hint 102-112: Add error handling for parallel dataset construction

While the parallel construction using Pool is a good optimization, consider adding error handling to gracefully handle failures during dataset construction. This is especially important when dealing with multiple systems in parallel.

 with Pool(
     os.cpu_count()
     // (
         int(os.environ["LOCAL_WORLD_SIZE"])
         if dist.is_available() and dist.is_initialized()
         else 1
     )
 ) as pool:
-    self.systems = pool.map(construct_dataset_systems, systems)
+    try:
+        self.systems = pool.map(construct_dataset_systems, systems)
+    except Exception as e:
+        log.error(f"Failed to construct datasets in parallel: {str(e)}")
+        # Fallback to sequential construction
+        self.systems = [construct_dataset_systems(system) for system in systems]

247-256: Good improvement on warning messages and throttling!

The enhanced warning system with throttling addresses the concern about too many warnings. However, consider making the warning threshold configurable.

 class BufferedIterator:
     def __init__(self, iterable) -> None:
         self._queue = Queue(QUEUESIZE)
         self._iterable = iterable
         self._consumer = BackgroundConsumer(self._queue, self._iterable)
         self._consumer.start()
         self.last_warning_time = time.time()
         self.len = len(iterable)
+        self.warning_threshold = 1.0  # seconds
+
+    @property
+    def warning_threshold(self) -> float:
+        return self._warning_threshold
+
+    @warning_threshold.setter
+    def warning_threshold(self, value: float) -> None:
+        if value <= 0:
+            raise ValueError("Warning threshold must be positive")
+        self._warning_threshold = value

     def __next__(self):
         start_wait = time.time()
         item = self._queue.get()
         wait_time = time.time() - start_wait
         if (
-            wait_time > 1.0 and start_wait - self.last_warning_time > 15 * 60
+            wait_time > self.warning_threshold and start_wait - self.last_warning_time > 15 * 60
         ):
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 91d2b9c and 68dd750.

📒 Files selected for processing (1)
  • deepmd/pt/utils/dataloader.py (5 hunks)
🔇 Additional comments (1)
deepmd/pt/utils/dataloader.py (1)

Line range hint 5-64: Good abstraction of dataset construction logic!

The introduction of construct_dataset function and use of partial from functools shows good separation of concerns and functional programming practices. This makes the code more maintainable and testable.

[Resolved review thread on deepmd/pt/utils/dataloader.py]
@caic99 caic99 requested a review from wanghan-iapcm December 2, 2024 05:25
@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
deepmd/pt/utils/dataloader.py (2)

227-231: Consider making queue size configurable

The QUEUESIZE is hardcoded to 32. Consider making it configurable through constructor parameters to allow tuning based on memory constraints and performance requirements.

 class BufferedIterator:
-    def __init__(self, iterable) -> None:
+    def __init__(self, iterable, queue_size: int = 32) -> None:
         self._queue = Queue(QUEUESIZE)
         self._iterable = iterable
         self._consumer = BackgroundConsumer(self._queue, self._iterable)
         self._consumer.start()
         self.last_warning_time = time.time()

240-249: Consider making warning thresholds configurable

The warning thresholds (1.0 second wait time and 15 minutes cooldown) are hardcoded. These values might need adjustment based on different workloads and environments.

 class BufferedIterator:
-    def __init__(self, iterable) -> None:
+    def __init__(
+        self,
+        iterable,
+        wait_threshold: float = 1.0,
+        warning_cooldown: float = 900.0  # 15 minutes
+    ) -> None:
         self._queue = Queue(QUEUESIZE)
         self._iterable = iterable
         self._consumer = BackgroundConsumer(self._queue, self._iterable)
         self._consumer.start()
         self.last_warning_time = time.time()
+        self._wait_threshold = wait_threshold
+        self._warning_cooldown = warning_cooldown

     def __next__(self):
         start_wait = time.time()
         item = self._queue.get()
         wait_time = time.time() - start_wait
         if (
-            wait_time > 1.0 and start_wait - self.last_warning_time > 15 * 60
+            wait_time > self._wait_threshold and
+            start_wait - self.last_warning_time > self._warning_cooldown
         ):
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 68dd750 and a199c44.

📒 Files selected for processing (1)
  • deepmd/pt/utils/dataloader.py (2 hunks)
🔇 Additional comments (2)
deepmd/pt/utils/dataloader.py (2)

8-10: LGTM: Import change improves clarity

The explicit import of Queue is clearer than the generic queue import.


208-220: Consider improving edge case handling

The comment "should not happen for DataLoader" suggests an edge case that should be properly documented or handled. Consider:

  1. Adding documentation explaining why this case shouldn't occur
  2. Adding proper error handling if it does occur unexpectedly

Let's verify if this edge case is handled elsewhere:

Consider this improvement:

 def run(self) -> None:
+    """Background consumer thread that feeds items from source to queue.
+    
+    Note: StopIteration is used as a sentinel to signal completion, though
+    this should not occur during normal DataLoader operation as it typically
+    cycles indefinitely.
+    """
     for item in self._source:
         self._queue.put(item)  # Blocking if the queue is full
     # Signal the consumer we are done; this should not happen for DataLoader
     self._queue.put(StopIteration())
✅ Verification successful

Let me gather more context about the BackgroundConsumer usage to understand how StopIteration is handled by the consumer.


Let me check how the consumer side handles the StopIteration.


Let me check the complete implementation of the consumer class and its usage in the BufferedIterator.


The StopIteration handling is correct and well-designed

The comment "should not happen for DataLoader" is actually accurate and doesn't indicate a problem. Here's why:

  1. The BackgroundConsumer is used within BufferedIterator, which wraps an iterable (DataLoader)
  2. DataLoaders in PyTorch are designed to be infinite iterators during training - they automatically restart when reaching the end of an epoch
  3. The StopIteration signal is properly handled by the consumer-producer pattern:
    • Producer (BackgroundConsumer) signals completion with StopIteration
    • Consumer (BufferedIterator.next) checks if the item is an Exception and raises it

The current implementation follows a standard producer-consumer pattern and handles the edge case appropriately. The comment serves as documentation for maintainers to understand the expected behavior.
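
Putting the consumer side together, a hedged sketch of __next__ as the final design reads here (the exact merged code may differ; in particular, updating last_warning_time inside the warning branch is an assumption):

import logging
import time

log = logging.getLogger(__name__)

class BufferedIteratorSketch:
    def __next__(self):
        start_wait = time.time()
        item = self._queue.get()  # blocks until the background thread yields
        wait_time = time.time() - start_wait
        # Warn only for waits over 1 s, and at most once per 15 minutes.
        if wait_time > 1.0 and start_wait - self.last_warning_time > 15 * 60:
            log.warning(f"Data loading is slow, waited {wait_time:.2f} seconds.")
            self.last_warning_time = start_wait  # assumed update point
        if isinstance(item, Exception):
            raise item  # the StopIteration sentinel ends iteration here
        return item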

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for StopIteration handling in the codebase
rg -A 3 "StopIteration" --type py

Length of output: 5685


Script:

#!/bin/bash
# Search for BackgroundConsumer usage and its consumer side
rg -B 3 -A 10 "BackgroundConsumer" --type py

Length of output: 3197


Script:

#!/bin/bash
# Search for the __next__ implementation in the same file
rg -B 2 -A 10 "__next__" deepmd/pt/utils/dataloader.py

Length of output: 587


Script:

#!/bin/bash
# Get the complete implementation of BackgroundConsumer and BufferedIterator
cat deepmd/pt/utils/dataloader.py

Length of output: 10011

@caic99 (Member, Author) commented Dec 6, 2024

I've trained 100k steps from scratch. Since this PR only changes the warning mechanism of the DataLoader, the loss curve does not change significantly. The remaining deviation is not introduced by this PR; I've attached the curves of two runs on the devel branch, and they are not exactly the same either.

Run 1 on devel branch: [loss-curve screenshot]

Run 2 on devel branch: [loss-curve screenshot]

This PR: [loss-curve screenshot]

@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Dec 9, 2024
Merged via the queue into deepmodeling:devel with commit b4ade5c Dec 9, 2024
60 checks passed
@caic99 caic99 deleted the refactor-dl branch December 9, 2024 04:07