AI Data Security is an open-source solution that uses Artificial Intelligence (AI) to automatically classify and encrypt your critical data. By replacing regex-based detection with AI-driven classification, it aims to reduce errors, improve accuracy, and identify both known and previously unknown sensitive information within your organization.
Designed as a last line of defense against adversaries, AI Data Security is meant to keep your sensitive data unreadable even if a breach occurs. Its classification algorithms reduce manual intervention, streamline data protection processes, and support compliance with data protection laws.
With AI Data Security, you can safeguard your data and future-proof your organization against evolving cyber threats.
Whether you're a researcher, student, or professional managing extensive documentation, this tool streamlines the process of categorizing and searching through your documents with ease.
⚠️ IMPORTANT DISCLAIMER: This project is currently a Proof of Concept and is under active development:
- Features are incomplete and actively being developed
- Bugs and breaking changes are expected
- Project structure and APIs may change significantly
- Documentation may be outdated or incomplete
- Not recommended for production use at this time
- Security features are still being implemented
We welcome all feedback and contributions, but please use at your own risk!
- Document Loading: Supports loading various document types from a specified input folder.
- Predefined Topic Assignment: Assigns documents to user-defined topics using advanced NLP techniques.
- Vector Embeddings: Generates high-quality embeddings for each document to enable efficient similarity searches.
- Database Integration: Utilizes Qdrant to store document embeddings and metadata for scalable and fast retrieval.
- User-Friendly UI: Built with Streamlit, offering an intuitive interface for processing documents, viewing statistics, and searching.
- Full Document View & Download: Allows users to preview document content and download full documents directly from the UI.
- Logging: Comprehensive logging to track application events and errors for easy debugging.
- Parallel Processing: Improving performance with concurrent task execution.
- Client Agent: Installing agents on client machines for direct communication with the core system.
- Data Encryption: Protecting sensitive data with robust encryption.
- OCR Capability: Extracting text from images efficiently.
- API Development: Enabling seamless integration with external systems.
- Advanced Admin Panel: Centralized control and system monitoring.
- File Logging: Tracking file activity.
- Ransomware Protection: Securing data against ransomware.
- 2FA Security: Adding an extra layer of authentication for file access.
- GPU Utilization: Accelerating processes with GPU power.
- Multi-Language Support: Making the system accessible globally.
- Post-Quantum Cryptography: Future-proofing data security with algorithms resistant to quantum computing threats.
- System Clustering: Boosting system reliability and optimizing processes by clustering resources, ensuring high availability and load balancing.
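To illustrate the predefined topic assignment and vector embedding features listed above, here is a minimal sketch of similarity-based assignment. It assumes every topic and document has already been converted into an embedding vector. The toy 3-dimensional vectors, the `assign_topic` helper, and the choice of cosine similarity are illustrative assumptions, not the project's actual implementation.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_topic(doc_vec: Vector, topic_vecs: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the (topic, score) pair whose embedding is most similar to the document."""
    if not topic_vecs:
        raise ValueError("at least one topic is required")
    scores = {name: cosine_similarity(doc_vec, vec) for name, vec in topic_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings standing in for real model output (hypothetical values).
topics = {
    "finance": [1.0, 0.0, 0.0],
    "legal": [0.0, 1.0, 0.0],
    "hr": [0.0, 0.0, 1.0],
}
topic, score = assign_topic([0.9, 0.1, 0.0], topics)
print(topic, round(score, 3))
```

In a real pipeline the vectors would come from an embedding model, and a minimum score threshold could be used to leave poorly matching documents unassigned.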
- Python 3.8+
- Docker (for running Qdrant)
- Git
- Clone the Repository: Clone the project repository to your local machine using Git.
- Create a Virtual Environment: It's recommended to use a virtual environment to manage dependencies.
- Install Dependencies: Install the required Python packages listed in requirements.txt.
- Set Up Qdrant: Run Qdrant using Docker to set up the vector database.
Configure Qdrant and embedding model settings using environment variables in a .env file.
Define predefined topics and folder paths in the config.yaml file.
Ensure that the input and output folders exist or can be created by the application, and verify that Qdrant is running and accessible with the specified credentials.
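The checks described above can be sketched as a small preflight script. This is only an illustration using the Python standard library: the `ensure_folder`, `is_port_open`, and `preflight` helpers are hypothetical names, the folder names are taken from the project structure, and port 6333 is the one published by the Docker command in this README. A plain TCP check confirms only that something is listening on the port; it does not validate credentials.

```python
import socket
from pathlib import Path


def ensure_folder(path: str) -> Path:
    """Create the folder (and any missing parents) if it does not exist yet."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(input_dir: str, output_dir: str,
              host: str = "localhost", port: int = 6333) -> bool:
    """Prepare both folders, then report whether something listens on the Qdrant port."""
    ensure_folder(input_dir)
    ensure_folder(output_dir)
    return is_port_open(host, port)


if __name__ == "__main__":
    ok = preflight("input_documents", "output_documents")
    print("Qdrant reachable" if ok else "Qdrant not reachable on port 6333")
```

Running a check like this before launching the app surfaces a missing Qdrant container early, rather than as a connection error in the middle of document processing.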
- Run the Application: Launch the Streamlit app to access the user interface.
- Input Configuration:
  - Predefined Topics: Enter each topic on a separate line.
  - Input Folder Path: Specify the directory containing your documents.
  - Output Folder Path: Specify where categorized documents will be stored.
  - Start processing to categorize your documents.
- View Database Statistics: Access statistics about the total number of documents and their distribution across topics.
- Search Documents: Select a topic and enter a query to search for similar documents. View previews and download full documents as needed.
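As a rough picture of what the statistics and search steps compute, the sketch below uses an in-memory list in place of the vector database. The record layout, file names, 2-dimensional vectors, and the `topic_statistics` and `search` helpers are all hypothetical, and the dot product is used here only as a simple similarity score.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical in-memory records standing in for what a vector database would store.
documents = [
    {"name": "q1_report.pdf", "topic": "finance", "vector": [0.9, 0.1]},
    {"name": "invoice.docx", "topic": "finance", "vector": [0.6, 0.8]},
    {"name": "nda.pdf", "topic": "legal", "vector": [0.1, 0.9]},
]


def topic_statistics(docs: List[dict]) -> Dict[str, int]:
    """Count how many documents were assigned to each topic."""
    return dict(Counter(doc["topic"] for doc in docs))


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(docs: List[dict], topic: str, query_vec: List[float], limit: int = 5) -> List[str]:
    """Keep only documents in the chosen topic, then rank them by dot-product score."""
    candidates = [doc for doc in docs if doc["topic"] == topic]
    ranked = sorted(candidates, key=lambda doc: dot(doc["vector"], query_vec), reverse=True)
    return [doc["name"] for doc in ranked[:limit]]


stats = topic_statistics(documents)
results = search(documents, "finance", [1.0, 0.0])
print(len(documents), stats, results)
```

In the application itself, the query text would first be embedded with the same model as the documents, and the filtering and ranking would be delegated to Qdrant rather than done in Python.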
AI-Data-Security/
│
├── app.py
├── main.py
├── config.py
├── config.yaml
├── requirements.txt
├── README.md
├── .env
├── logs/
│   └── document_loader.log
├── input_documents/
│   └── ... (Your input documents)
├── output_documents/
│   └── ... (Categorized documents)
├── DatabaseHandler/
│   └── database_handler.py
├── DocumentLoader/
│   └── document_loader.py
└── TopicModeler/
    └── topic_modeler.py
- app.py: Streamlit application handling the user interface.
- main.py: Core processing functions for document handling and categorization.
- config.py: Configuration settings for paths and Qdrant credentials.
- config.yaml: YAML configuration file for predefined topics and folder paths.
- .env: Environment variables for Qdrant configuration and embedding model.
- requirements.txt: Python dependencies required for the project.
- DatabaseHandler/database_handler.py: Handles interactions with the Qdrant vector database.
- DocumentLoader/document_loader.py: Loads documents from the input folder.
- TopicModeler/topic_modeler.py: Assigns topics to documents and generates embeddings.
- logs/: Directory containing log files for tracking events and errors.
- input_documents/: Directory where you place documents to be processed.
- output_documents/: Directory where categorized documents are stored.
Happy Document Categorizing! 🚀
If you encounter any issues or have suggestions for improvements, feel free to open an issue or submit a pull request. Your feedback is invaluable!
Clone the project repository to your local machine using Git:
git clone https://github.com/Mohammad-Mirasadollahi/AI-Data-Security.git
Navigate to the project directory and create a virtual environment to manage dependencies:
cd AI-Data-Security
python -m venv venv
Activate the virtual environment:
- Windows:
venv\Scripts\activate
- macOS/Linux:
source venv/bin/activate
Install the required Python packages:
pip install -r requirements.txt
Run Qdrant using Docker to set up the vector database:
docker run -p 6333:6333 qdrant/qdrant
- Environment Variables: Create a .env file with your Qdrant configuration.
- YAML Configuration: Define your predefined topics and folder paths in config.yaml.
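For the environment-variable step, the sketch below shows one way settings in a .env file could be read using only the standard library. The variable names `QDRANT_HOST` and `QDRANT_PORT`, the default values, and the `parse_env_text` and `load_settings` helpers are assumptions for illustration; check the project's config.py for the names it actually expects.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_text: str, environ: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Merge .env values with the process environment; real environment variables win."""
    if environ is None:
        environ = os.environ
    merged = {**parse_env_text(env_text), **environ}
    return {
        "qdrant_host": merged.get("QDRANT_HOST", "localhost"),
        "qdrant_port": int(merged.get("QDRANT_PORT", "6333")),
    }


# Hypothetical .env content; the variable names are for illustration only.
sample = """
# Qdrant connection
QDRANT_HOST=localhost
QDRANT_PORT=6333
"""
settings = load_settings(sample)
print(settings)
```

Libraries such as python-dotenv handle more of the .env syntax (quoting rules, variable expansion); the hand-rolled parser here only keeps the example dependency-free.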
Launch the Streamlit app:
streamlit run app.py
Access the application via the URL provided in the terminal, typically http://localhost:8501.