A (soon to be) powerful and modular web scraper that converts web content into well-structured Markdown files.
- 🌐 Scrapes any accessible website
- 📝 Converts HTML to clean Markdown format
- 🔄 Handles various HTML elements:
  - Headers (h1-h6)
  - Paragraphs
  - Links
  - Images
  - Lists
- 📋 Preserves document structure
- 🪵 Comprehensive logging
- ✅ Robust error handling
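To make the element handling listed above concrete, here is a minimal sketch of an HTML-to-Markdown conversion. It is not the project's implementation: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs, and the names `MarkdownSketch` and `html_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = HEADINGS | {"p", "li"}


class MarkdownSketch(HTMLParser):
    """Hypothetical converter for illustration; not the project's MarkdownScraper."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished Markdown blocks
        self.buffer = []   # text collected for the block being read
        self.prefix = ""   # Markdown prefix for the block being read
        self.href = None   # target of the link being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCKS:
            self.buffer = []
            if tag in HEADINGS:
                # <h2> becomes "## ", <h3> becomes "### ", and so on
                self.prefix = "#" * int(tag[1]) + " "
            elif tag == "li":
                # Simplification: ordered lists are rendered as bullets too
                self.prefix = "- "
            else:
                self.prefix = ""
        elif tag == "a":
            self.href = attrs.get("href") or ""
            self.buffer.append("[")
        elif tag == "img":
            alt = attrs.get("alt") or ""
            src = attrs.get("src") or ""
            self.buffer.append(f"![{alt}]({src})")

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.buffer.append(f"]({self.href})")
            self.href = None
        elif tag in BLOCKS:
            # Collapse runs of whitespace, then emit the finished block
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.buffer = []
            self.prefix = ""

    def handle_data(self, data):
        self.buffer.append(data)


def html_to_markdown(html):
    """Convert a small HTML fragment to Markdown, one block per paragraph."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


sample = (
    "<h1>Title</h1>"
    '<p>See <a href="https://example.com">the site</a>.</p>'
    "<ul><li>One</li><li>Two</li></ul>"
    '<p><img src="a.png" alt="Logo"></p>'
)
print(html_to_markdown(sample))
```

The sketch ignores nesting (for example a paragraph inside a list item) and any text outside the handled block elements, which a real converter would need to cover.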
git clone https://github.com/ursisterbtw/markdown_lab.git
cd markdown_lab
pip install -r requirements.txt
python main.py <url> -o <output_file>
Example:
python main.py https://www.example.com -o output.md
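The command line above implies a positional URL and an `-o` option for the output file. A minimal sketch of how such an interface could be declared with `argparse` follows; the function name `build_parser`, the long flag `--output`, and the default file name are assumptions for illustration, not taken from `main.py`.

```python
import argparse


def build_parser():
    """Build a parser matching `python main.py <url> -o <output_file>`."""
    parser = argparse.ArgumentParser(description="Convert a web page to Markdown.")
    parser.add_argument("url", help="address of the page to scrape")
    # The long form and the default value are assumptions
    parser.add_argument("-o", "--output", default="output.md",
                        help="path of the Markdown file to write")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try
args = build_parser().parse_args(["https://www.example.com", "-o", "output.md"])
print(args.url, args.output)
```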
from main import MarkdownScraper

scraper = MarkdownScraper()
# Fetch the page's HTML
html_content = scraper.scrape_website("https://example.com")
# Convert the HTML to Markdown
markdown_content = scraper.convert_to_markdown(html_content)
# Write the Markdown to a file
scraper.save_markdown(markdown_content, "output.md")
The project includes comprehensive unit tests. To run them:
pytest
- requests: HTTP requests for fetching pages
- beautifulsoup4: HTML parsing
- pytest: Testing framework
- argparse: CLI argument parsing (part of the Python standard library, no install needed)
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- BeautifulSoup4 for excellent HTML parsing capabilities
- Requests library for simplified HTTP handling
- Python community for continuous inspiration 🐍
- Add support for more HTML elements
- Implement custom markdown templates
- Add concurrent scraping for multiple URLs
- Include CSS selector support
- Add configuration file support
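As an illustration of the concurrent scraping item in the list above, here is one possible approach using a thread pool from the standard library. It is only a sketch of a planned feature, not existing project code: `scrape_many` and its `fetch` parameter are hypothetical, and `fetch` stands in for whatever callable downloads a single page (for example `MarkdownScraper.scrape_website`).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_many(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL in parallel.

    Returns (results, errors): two dicts keyed by URL, so one failing
    page does not abort the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs offline."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>{url}</html>"


results, errors = scrape_many(["https://a.test", "https://bad.test"], fake_fetch)
print(sorted(results), sorted(errors))
```

Threads suit this workload because scraping is mostly waiting on the network; a real version would probably also want per-host rate limiting.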
🐍🦀 ursister