A high-performance Rust library for filtering and processing large CSV files, with support for both streaming and indexed processing modes.
- 🚀 High-performance parallel processing of CSV files
- 📑 Memory-mapped file handling for efficient I/O
- 🔍 Advanced filtering system with multiple condition types
- 📖 Optional indexed access mode for rapid filtering
- 💻 Multi-threaded processing support
- 🎯 Zero-copy parsing where possible
- 📊 Progress tracking and statistics
- 🛡️ Comprehensive error handling
- Processes millions of rows per second on modern hardware
- Memory-efficient streaming mode for large files
- Optional indexing for repeated queries
- Parallel processing with configurable thread count
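To illustrate what a configurable thread count can look like, here is a minimal standard-library sketch. It is not the library's implementation: `count_matches_parallel` is a hypothetical helper that assumes rows are already split into strings and that the first comma-separated field is numeric.

```rust
use std::thread;

/// Counts rows whose first comma-separated field parses as a number greater
/// than `threshold`, spreading the work over at most `num_threads` threads.
/// Hypothetical helper for illustration; not part of the library's API.
fn count_matches_parallel(rows: &[&str], threshold: f64, num_threads: usize) -> usize {
    let num_threads = num_threads.max(1);
    // Ceiling division so every row lands in exactly one chunk.
    let chunk_len = ((rows.len() + num_threads - 1) / num_threads).max(1);

    thread::scope(|scope| {
        // One scoped worker per chunk; scoped threads may borrow `rows`.
        let handles: Vec<_> = rows
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .filter(|row| {
                            row.split(',')
                                .next()
                                .and_then(|field| field.trim().parse::<f64>().ok())
                                .map_or(false, |value| value > threshold)
                        })
                        .count()
                })
            })
            .collect();

        // Add up the per-chunk counts once every worker has finished.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Calling `count_matches_parallel(&rows, 0.5, 4)` uses up to four workers. Rows whose first field is not numeric are counted as non-matching.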
Add this to your `Cargo.toml`:
[dependencies]
extractor = "0.1.0"
// ColumnFilter and NumericCondition are assumed to be exported from the crate root
use extractor::{BioFilter, ColumnFilter, Config, FilterCondition, NumericCondition};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a filter with default configuration
    let mut filter = BioFilter::builder("input.csv", "output.csv")
        .with_config(Config::default())
        .build()?;

    // Add filters
    filter.add_filter(Box::new(ColumnFilter::new(
        "gene_expression",
        FilterCondition::Numeric(NumericCondition::GreaterThan(0.5)),
    )?));

    // Process the file
    let stats = filter.process()?;
    println!("Processed {} rows, matched {}", stats.rows_processed, stats.rows_matched);
    Ok(())
}
For repeated queries on the same file, use indexed mode for better performance:
// Build an index
let index = FileIndex::builder("input.csv", "gene_id")
    .add_secondary_index("chromosome")
    .build()?;
index.save("data.index")?;

// Use the index
let mut filter = BioFilter::builder("input.csv", "output.csv")
    .with_index("data.index")
    .build()?;
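Why an index helps with repeated queries can be sketched with the standard library alone. The example below is an assumption-laden stand-in, not the library's index format: `build_offset_index` and `row_at` are hypothetical helpers, and they assume `\n` line endings, a header row, unquoted fields, and unique key values.

```rust
use std::collections::HashMap;

/// Maps each value of `key_column` to the byte offset where its row starts.
/// Returns `None` if the header is missing or does not contain `key_column`.
fn build_offset_index(data: &[u8], key_column: &str) -> Option<HashMap<String, usize>> {
    let header_end = data.iter().position(|&b| b == b'\n')?;
    let header = std::str::from_utf8(&data[..header_end]).ok()?;
    let key_pos = header.split(',').position(|name| name.trim() == key_column)?;

    let mut index = HashMap::new();
    let mut offset = header_end + 1;
    while offset < data.len() {
        // The row ends at the next newline, or at end of input if none is left.
        let row_end = data[offset..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(data.len(), |rel| offset + rel);
        if let Ok(row) = std::str::from_utf8(&data[offset..row_end]) {
            if let Some(key) = row.split(',').nth(key_pos) {
                index.insert(key.trim().to_string(), offset);
            }
        }
        offset = row_end + 1;
    }
    Some(index)
}

/// Returns the row that starts at `offset`, without its trailing newline.
fn row_at(data: &[u8], offset: usize) -> &[u8] {
    let end = data[offset..]
        .iter()
        .position(|&b| b == b'\n')
        .map_or(data.len(), |rel| offset + rel);
    &data[offset..end]
}
```

Once the map exists, a lookup such as `row_at(data, index["TP53"])` jumps straight to the row instead of rescanning the file, which is the trade-off indexed mode makes: one pass up front, cheap queries afterwards.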
Implement the `Filter` trait for custom filtering logic:
use std::collections::HashMap;

struct CustomFilter {
    column: String,
}

impl Filter for CustomFilter {
    fn apply(&self, row: &[u8], headers: &HashMap<String, usize>) -> Result<bool> {
        // Custom filtering logic here
        todo!()
    }

    fn column_name(&self) -> &str {
        &self.column
    }

    fn description(&self) -> String {
        format!("Custom filter on {}", self.column)
    }
}
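As a concrete illustration, the sketch below fills in a working `apply` for a prefix match. To compile on its own it declares stand-ins for the `Filter` trait and the `Result` alias, so the real definitions in the library may differ. `PrefixFilter` is hypothetical, and the code assumes that `headers` maps a column name to its zero-based position and that fields are unquoted.

```rust
use std::collections::HashMap;

// Stand-ins so the sketch is self-contained; the library's own `Result`
// alias and `Filter` trait may be defined differently.
type Result<T> = std::result::Result<T, String>;

trait Filter {
    fn apply(&self, row: &[u8], headers: &HashMap<String, usize>) -> Result<bool>;
    fn column_name(&self) -> &str;
    fn description(&self) -> String;
}

/// Matches rows whose value in `column` starts with `prefix`.
struct PrefixFilter {
    column: String,
    prefix: Vec<u8>,
}

impl Filter for PrefixFilter {
    fn apply(&self, row: &[u8], headers: &HashMap<String, usize>) -> Result<bool> {
        // Assumption: `headers` maps a column name to its zero-based position.
        let position = *headers
            .get(&self.column)
            .ok_or_else(|| format!("unknown column: {}", self.column))?;
        // Naive split on ','; quoted fields are not handled in this sketch.
        let field = row
            .split(|&byte| byte == b',')
            .nth(position)
            .ok_or_else(|| format!("row has no field at position {position}"))?;
        Ok(field.starts_with(&self.prefix))
    }

    fn column_name(&self) -> &str {
        &self.column
    }

    fn description(&self) -> String {
        format!("Prefix filter on {}", self.column)
    }
}
```

Working on the raw `&[u8]` row avoids allocating a `String` per field, which fits the zero-copy goal listed in the features.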
- Exact match (`Equals`)
- Substring (`Contains`)
- Regular expression (`Regex`)
- Numeric comparisons (`GreaterThan`, `LessThan`, `Equal`)
- Range checks (`Between`)
- Multiple values (`OneOf`)
- Null checks (`Empty`, `NotEmpty`)
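The semantics of these conditions can be sketched as a single enum with a `matches` method. This is a hypothetical mirror of a subset of the list above, not the library's `FilterCondition` type. `Regex` is left out because the standard library has no regex engine, and the inclusive bounds on `Between` are an assumption.

```rust
/// Hypothetical mirror of a subset of the condition types listed above.
enum Condition {
    Equals(String),
    Contains(String),
    GreaterThan(f64),
    LessThan(f64),
    Between(f64, f64),
    OneOf(Vec<String>),
    Empty,
    NotEmpty,
}

impl Condition {
    /// Returns true when `field` satisfies the condition.
    fn matches(&self, field: &str) -> bool {
        // Fields that do not parse as numbers never satisfy a numeric condition.
        let number = || field.trim().parse::<f64>().ok();
        match self {
            Condition::Equals(expected) => field == expected.as_str(),
            Condition::Contains(needle) => field.contains(needle.as_str()),
            Condition::GreaterThan(limit) => number().map_or(false, |v| v > *limit),
            Condition::LessThan(limit) => number().map_or(false, |v| v < *limit),
            // Assumption: both bounds are inclusive.
            Condition::Between(low, high) => {
                number().map_or(false, |v| v >= *low && v <= *high)
            }
            Condition::OneOf(options) => options.iter().any(|o| o.as_str() == field),
            Condition::Empty => field.trim().is_empty(),
            Condition::NotEmpty => !field.trim().is_empty(),
        }
    }
}
```

For example, `Condition::GreaterThan(0.5).matches("0.7")` is true, while a non-numeric field such as `"n/a"` fails every numeric condition rather than raising an error.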
let config = Config {
    delimiter: b',',
    has_headers: true,
    chunk_size: 1024 * 1024, // 1MB chunks
    parallel: true,
    use_index: false,
    num_threads: Some(4),
    progress: ProgressConfig::default(),
};
- Use indexed mode for repeated queries on the same file
- Adjust chunk size based on your system's memory
- Enable parallel processing for multi-core systems
- Use memory mapping for large files
- Consider pre-filtering columns when building indices
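Chunk size matters because each chunk has to end on a row boundary before it can be handed to a worker. The sketch below shows one way to do that with the standard library. `split_on_row_boundaries` is a hypothetical helper, not the library's chunking code, and it assumes `\n` row terminators with no newlines inside quoted fields.

```rust
/// Splits `data` into chunks of roughly `chunk_size` bytes, extending each
/// chunk to the next '\n' so that no row is cut in half.
fn split_on_row_boundaries(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    let chunk_size = chunk_size.max(1);
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < data.len() {
        let tentative_end = start.saturating_add(chunk_size).min(data.len());
        let end = if tentative_end == data.len() {
            data.len()
        } else {
            // Search from the last byte of the tentative chunk so that a chunk
            // already ending in '\n' is not extended by another row.
            data[tentative_end - 1..]
                .iter()
                .position(|&b| b == b'\n')
                .map_or(data.len(), |rel| tentative_end + rel)
        };
        chunks.push(&data[start..end]);
        start = end;
    }
    chunks
}
```

The returned slices borrow from `data`, so nothing is copied. Smaller values of `chunk_size` produce more, smaller units of work, and larger values produce fewer, bigger ones. Because each chunk is extended to the next newline, a chunk can be longer than `chunk_size` by up to one row.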
The library provides detailed error types for different failure scenarios:
match result {
    Ok(_) => { /* Processing succeeded */ }
    Err(ExtractorError::Io { source, path }) => { /* Handle I/O errors */ }
    Err(ExtractorError::Csv(e)) => { /* Handle CSV parsing errors */ }
    Err(ExtractorError::Index { kind, path }) => { /* Handle index-related errors */ }
    Err(_) => { /* Handle any other variants */ }
}
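To show how errors shaped like these can carry context, here is a self-contained sketch. The enum below is a hypothetical stand-in that reuses the variant names from the match above. The field types (`String` for the CSV message and index kind) and the `read_input` helper are assumptions, not the library's definitions.

```rust
use std::fmt;
use std::io;
use std::path::PathBuf;

/// Hypothetical stand-in shaped like the variants matched above.
#[derive(Debug)]
enum ExtractorError {
    Io { source: io::Error, path: PathBuf },
    Csv(String),
    Index { kind: String, path: PathBuf },
}

impl fmt::Display for ExtractorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractorError::Io { source, path } => {
                write!(f, "I/O error on {}: {}", path.display(), source)
            }
            ExtractorError::Csv(message) => write!(f, "CSV error: {}", message),
            ExtractorError::Index { kind, path } => {
                write!(f, "index error ({}) on {}", kind, path.display())
            }
        }
    }
}

impl std::error::Error for ExtractorError {
    // Expose the underlying I/O error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            ExtractorError::Io { source, .. } => Some(source),
            _ => None,
        }
    }
}

/// Reads a file and attaches the offending path to any I/O failure.
fn read_input(path: &str) -> Result<Vec<u8>, ExtractorError> {
    std::fs::read(path).map_err(|source| ExtractorError::Io {
        source,
        path: PathBuf::from(path),
    })
}
```

Keeping the path next to the underlying `io::Error` lets a caller report which file failed without threading that information through separately.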
Contributions are welcome! Please see our Contributing Guide for details.
This project is licensed under the GNU Affero General Public License - see the LICENSE file for details.