Extractor

A high-performance Rust library for filtering and processing large CSV files, with support for both streaming and indexed processing modes.

Features

🚀 High-performance parallel processing of CSV files
📑 Memory-mapped file handling for efficient I/O
🔍 Advanced filtering system with multiple condition types
📖 Optional indexed access mode for rapid filtering
💻 Multi-threaded processing support
🎯 Zero-copy parsing where possible
📊 Progress tracking and statistics
🛡️ Comprehensive error handling

Performance

Processes millions of rows per second on modern hardware
Memory-efficient streaming mode for large files
Optional indexing for repeated queries
Parallel processing with configurable thread count

Installation

Add this to your Cargo.toml:

[dependencies]
extractor = "0.1.0"

Quick Start

use extractor::{BioFilter, ColumnFilter, Config, FilterCondition, NumericCondition};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a filter with default configuration
    let mut filter = BioFilter::builder("input.csv", "output.csv")
        .with_config(Config::default())
        .build()?;

    // Add filters
    filter.add_filter(Box::new(ColumnFilter::new(
        "gene_expression",
        FilterCondition::Numeric(NumericCondition::GreaterThan(0.5))
    )?));

    // Process the file
    let stats = filter.process()?;
    println!("Processed {} rows, matched {}", stats.rows_processed, stats.rows_matched);

    Ok(())
}

Advanced Usage

Indexed Mode

For repeated queries on the same file, use indexed mode for better performance:

// Build an index
let index = FileIndex::builder("input.csv", "gene_id")
    .add_secondary_index("chromosome")
    .build()?;
index.save("data.index")?;

// Use the index
let mut filter = BioFilter::builder("input.csv", "output.csv")
    .with_index("data.index")
    .build()?;
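
The index format itself is not described here. As a rough illustration of why an index speeds up repeated queries, the sketch below uses only the standard library to map each value of a key column to the byte offsets of the rows that contain it, so a later lookup can jump straight to those rows. The names `build_offset_index` and `row_at` are hypothetical and not part of Extractor's API, and the parsing assumes plain comma-separated fields without quoting.

```rust
use std::collections::HashMap;

/// Maps each value of `key_column` to the byte offsets of the rows holding it.
/// Hypothetical helper for illustration only; assumes no quoted fields.
fn build_offset_index(data: &str, key_column: &str) -> Option<HashMap<String, Vec<usize>>> {
    // The first line is the header; locate the key column in it.
    let header_end = data.find('\n')?;
    let header = data[..header_end].trim_end_matches('\r');
    let key_idx = header.split(',').position(|name| name == key_column)?;

    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    let mut offset = header_end + 1;
    for line in data[offset..].split_inclusive('\n') {
        let row = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
        if !row.is_empty() {
            if let Some(key) = row.split(',').nth(key_idx) {
                index.entry(key.to_string()).or_default().push(offset);
            }
        }
        // Advance by the full line length, including its terminator.
        offset += line.len();
    }
    Some(index)
}

/// Returns the row that starts at `offset`, without its line terminator.
fn row_at(data: &str, offset: usize) -> &str {
    let rest = &data[offset..];
    let end = rest.find('\n').unwrap_or(rest.len());
    rest[..end].trim_end_matches('\r')
}
```

With such a map in hand, a query for one `gene_id` reads only the rows listed under that key instead of scanning the whole file, which is the trade-off indexed mode is meant to exploit.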

Custom Filters

Implement the Filter trait for custom filtering logic:

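use std::collections::HashMap;

struct CustomFilter {
    column: String,
}

impl Filter for CustomFilter {
    fn apply(&self, row: &[u8], headers: &HashMap<String, usize>) -> Result<bool> {
        // Custom filtering logic here; the value below is a placeholder
        // so that the example compiles
        Ok(true)
    }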

    fn column_name(&self) -> &str {
        &self.column
    }

    fn description(&self) -> String {
        format!("Custom filter on {}", self.column)
    }
}
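
To make the shape of an `apply` body more concrete, here is a self-contained sketch of one possible filter. It declares a local stand-in `Filter` trait mirroring the signatures above so that it compiles without the crate; the stand-in uses `Result<bool, String>` where the library uses its own `Result` alias, and the real trait may differ in other ways. `MinLengthFilter` is a hypothetical example, and the field lookup assumes an unquoted, comma-delimited row.

```rust
use std::collections::HashMap;

// Stand-in trait so the sketch compiles on its own. The library's real
// `Filter` trait and error type may differ from this.
trait Filter {
    fn apply(&self, row: &[u8], headers: &HashMap<String, usize>) -> Result<bool, String>;
    fn column_name(&self) -> &str;
    fn description(&self) -> String;
}

/// Hypothetical filter: matches rows whose column value is at least `min_len` bytes long.
struct MinLengthFilter {
    column: String,
    min_len: usize,
}

impl Filter for MinLengthFilter {
    fn apply(&self, row: &[u8], headers: &HashMap<String, usize>) -> Result<bool, String> {
        // Resolve the column name to its position in the row.
        let idx = *headers
            .get(&self.column)
            .ok_or_else(|| format!("unknown column: {}", self.column))?;

        // Strip a trailing line terminator, if any.
        let mut row = row;
        while let Some((&last, rest)) = row.split_last() {
            if last == b'\n' || last == b'\r' {
                row = rest;
            } else {
                break;
            }
        }

        // Assumption: fields are separated by plain commas, with no quoting.
        let field = row
            .split(|&b| b == b',')
            .nth(idx)
            .ok_or_else(|| format!("row has no field at position {}", idx))?;
        Ok(field.len() >= self.min_len)
    }

    fn column_name(&self) -> &str {
        &self.column
    }

    fn description(&self) -> String {
        format!("{} has at least {} bytes", self.column, self.min_len)
    }
}
```

Working on `&[u8]` rather than `&str` avoids UTF-8 validation for rows the filter ends up rejecting, which fits the zero-copy goal listed under Features.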

Available Filter Conditions

Exact match (Equals)
Substring (Contains)
Regular expression (Regex)
Numeric comparisons (GreaterThan, LessThan, Equal)
Range checks (Between)
Multiple values (OneOf)
Null checks (Empty, NotEmpty)
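
The semantics of these conditions can be sketched with a small standalone enum. This is not the library's `FilterCondition` type: the `Condition` enum and `matches` function below are hypothetical, the regular-expression variant is left out because it needs an external crate, and treating `Between` as inclusive and whitespace-only fields as empty are assumptions.

```rust
/// Hypothetical stand-in for the condition kinds listed above (regex omitted).
enum Condition {
    Equals(String),
    Contains(String),
    GreaterThan(f64),
    LessThan(f64),
    Between(f64, f64), // assumed inclusive on both ends
    OneOf(Vec<String>),
    Empty,
    NotEmpty,
}

/// Evaluates one condition against one field value.
fn matches(cond: &Condition, field: &str) -> bool {
    // Fields that do not parse as numbers never satisfy numeric conditions.
    let num = || field.trim().parse::<f64>().ok();
    match cond {
        Condition::Equals(v) => field == v.as_str(),
        Condition::Contains(s) => field.contains(s.as_str()),
        Condition::GreaterThan(t) => num().map_or(false, |x| x > *t),
        Condition::LessThan(t) => num().map_or(false, |x| x < *t),
        Condition::Between(lo, hi) => num().map_or(false, |x| x >= *lo && x <= *hi),
        Condition::OneOf(values) => values.iter().any(|v| v.as_str() == field),
        Condition::Empty => field.trim().is_empty(),
        Condition::NotEmpty => !field.trim().is_empty(),
    }
}
```

For instance, `matches(&Condition::GreaterThan(0.5), "0.73")` holds, while a non-numeric field such as `"n/a"` fails every numeric comparison instead of raising an error.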

Configuration Options

let config = Config {
    delimiter: b',',
    has_headers: true,
    chunk_size: 1024 * 1024,  // 1MB chunks
    parallel: true,
    use_index: false,
    num_threads: Some(4),
    progress: ProgressConfig::default(),
};

Performance Tips

Use indexed mode for repeated queries on the same file
Adjust chunk size based on your system's memory
Enable parallel processing for multi-core systems
Use memory mapping for large files
Consider pre-filtering columns when building indices
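
The interaction between chunk size and parallelism can be illustrated with a standard-library sketch. It is not how Extractor is implemented: `split_on_line_boundaries` and `count_matching_rows` are hypothetical names, and for brevity it spawns one scoped thread per chunk, whereas a real implementation would normally cap the number of workers at the configured thread count.

```rust
use std::thread;

/// Splits `data` into pieces of roughly `chunk_size` bytes, extending each
/// piece to the next newline so that no row is cut in half.
fn split_on_line_boundaries(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let tentative = (start + chunk_size.max(1)).min(data.len());
        // Search from the last byte of the tentative chunk for a newline.
        let end = match data[tentative - 1..].iter().position(|&b| b == b'\n') {
            Some(pos) => tentative + pos,
            None => data.len(),
        };
        chunks.push(&data[start..end]);
        start = end;
    }
    chunks
}

/// Counts the non-empty rows accepted by `pred`, one scoped thread per chunk.
fn count_matching_rows<F>(data: &[u8], chunk_size: usize, pred: F) -> usize
where
    F: Fn(&[u8]) -> bool + Sync,
{
    let chunks = split_on_line_boundaries(data, chunk_size);
    let pred = &pred;
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .into_iter()
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .split(|&b| b == b'\n')
                        .filter(|row| !row.is_empty() && pred(*row))
                        .count()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Larger chunks mean fewer hand-offs between threads but more memory in flight per worker, which is why the chunk size is worth tuning to the machine.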

Error Handling

The library provides detailed error types for different failure scenarios:

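match result {
    Ok(stats) => { /* Use the returned statistics */ }
    Err(ExtractorError::Io { source, path }) => { /* Handle I/O errors */ }
    Err(ExtractorError::Csv(e)) => { /* Handle CSV parsing errors */ }
    Err(ExtractorError::Index { kind, path }) => { /* Handle index-related errors */ }
    Err(e) => { /* Handle any remaining error variants */ }
}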

Contributing

Contributions are welcome! Please see our Contributing Guide for details.

License

This project is licensed under the GNU Affero General Public License; see the LICENCE file for details.
