Scraper uc #294

Open
wants to merge 226 commits into
base: scraper-uc
768aa06
feat(crawler): Enhance stealth and flexibility, improve error handling
unclecode Oct 17, 2024
dbb587d
Update gitignore
unclecode Oct 17, 2024
dd17ed0
Rename some flags name, introducing magic flag.
unclecode Oct 18, 2024
aab6ea0
Update requirements and switch to 0.3.8
unclecode Oct 18, 2024
b8147b6
chore: Bump version to 0.3.71 and improve error handling
unclecode Oct 18, 2024
b309bc3
Fix the model name in quick start example
unclecode Oct 18, 2024
4e2852d
[v0.3.71] Enhance chunking strategies and improve overall performance
unclecode Oct 19, 2024
e7cd8a1
Update Changelog
unclecode Oct 19, 2024
6ec4cb3
Enhance Markdown generation and external content control
unclecode Oct 20, 2024
1dd36f9
Refactor content scrapping strategy and improve error handling
unclecode Oct 20, 2024
04d16e6
Fix Base64 image parsing in WebScrappingStrategy (issue 182)
unclecode Oct 20, 2024
a5f627b
feat: customize crawl base directory
IdrisHanafi Oct 21, 2024
60ba131
[v0.3.72] Enhance content extraction and proxy support
unclecode Oct 22, 2024
32f57c4
Merge pull request #194 from IdrisHanafi/feat/customize-crawl-base-di…
unclecode Oct 24, 2024
bcfe83f
feat: enhance crawler with overlay removal and improved screenshot ca…
unclecode Oct 24, 2024
38474bd
Update version
unclecode Oct 24, 2024
4239654
Update Documentation
unclecode Oct 27, 2024
ff9149b
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Oct 27, 2024
ac9d83c
Update gitignore
unclecode Oct 27, 2024
d61615e
Merge branch '0.3.72'
unclecode Oct 27, 2024
c2a71a5
Update Docs folder, prepare branch for new version 0.3.73
unclecode Oct 27, 2024
d913e20
Update Readme
unclecode Oct 28, 2024
b2800fe
Add badges to README
unclecode Oct 28, 2024
d9e0b7a
Fix README badge
unclecode Oct 28, 2024
3529c2e
Update new tutorial documents and added to the docs folder.
unclecode Oct 29, 2024
e9f7d5e
Merge branch '0.3.73'
unclecode Oct 29, 2024
df9ee44
build: make requirements more flexible
mjvankampen Oct 30, 2024
605a827
fix dev requirements and lock playwright due to failing tests
mjvankampen Oct 30, 2024
9307c19
Update documents, upload new version of quickstart.
unclecode Oct 30, 2024
982d203
Merge branch '0.3.73'
unclecode Oct 30, 2024
47464ce
Update README
unclecode Oct 30, 2024
cb6f532
Update README
unclecode Oct 30, 2024
e97e8df
Update README: Fix typo in project name
unclecode Oct 30, 2024
19c3f3e
Refactor tutorial markdown files: Update numbering and formatting
unclecode Oct 30, 2024
0a09d78
chore(docs): fix documentation links + markdown lint
timoa Oct 31, 2024
6c7235d
Add mission.md file
unclecode Oct 31, 2024
d8eef02
Add link to mission statement in README
unclecode Oct 31, 2024
492ada0
Add mission diagram to MISSION.md
unclecode Oct 31, 2024
62a86db
Refactor mission section in README and add mission diagram
unclecode Oct 31, 2024
07f508b
Merge pull request #218 from timoa/main
unclecode Nov 3, 2024
de6b43f
Merge pull request #215 from mjvankampen/build/flexible-requirements
unclecode Nov 3, 2024
54d5a3a
Improved database management and error handling, updated README instr…
unclecode Nov 4, 2024
e28c49a
Refactor .gitignore.dev file: Add ignore patterns for various files a…
unclecode Nov 4, 2024
42f1c67
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0…
unclecode Nov 4, 2024
33d0e9e
Update dev gitignore
unclecode Nov 4, 2024
7b0cca4
Update gitignore
unclecode Nov 4, 2024
fbdf870
Update CHANGELOG
unclecode Nov 4, 2024
be8f4fc
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0…
unclecode Nov 4, 2024
e6c914d
Refactor version management and remove deprecated gitignore.dev file
unclecode Nov 4, 2024
c4c6227
Creating the API server component
unclecode Nov 4, 2024
0bba0e0
Preventing NoneType has no attribute get Errors
bizrockman Nov 4, 2024
a28046c
Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to epi…
bizrockman Nov 4, 2024
870296f
Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_…
bizrockman Nov 4, 2024
3a3c88a
Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Ext…
bizrockman Nov 4, 2024
796dbaf
Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_…
bizrockman Nov 4, 2024
67a23c3
feat(core): Release v0.3.73 with Browser Takeover and Docker Support
unclecode Nov 5, 2024
3cf19a1
chore(version): bump version to 0.3.73
unclecode Nov 5, 2024
43a2b26
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 5, 2024
1c20b81
docs(README): update Docker usage instructions and add deployment opt…
unclecode Nov 5, 2024
2a54f3c
refactor(core): remove main_v0.py file and associated functionality
unclecode Nov 5, 2024
1e7db0d
docs(README): update release notes for version 0.3.73 with new featur…
unclecode Nov 5, 2024
b512636
feat(api): add CORS support and static file serving, update root redi…
unclecode Nov 5, 2024
c5aa1be
Merge pull request #229 from bizrockman/main
unclecode Nov 6, 2024
9f5eef1
Refactored the `CustomHTML2Text` class in `content_scrapping_strategy…
unclecode Nov 6, 2024
2879344
Update README.md
devatnull Nov 6, 2024
f757423
Update API server request object. text_docker file and Readme
unclecode Nov 7, 2024
16f9186
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 7, 2024
b120965
Fixed issues with the Manage Browser, including its inability to conn…
unclecode Nov 7, 2024
bcdd809
Remove some old files.
unclecode Nov 8, 2024
f9a297e
Add Docker example script for testing Crawl4AI functionality
unclecode Nov 8, 2024
a098483
Update Roadmap
unclecode Nov 9, 2024
b6d6631
Enhance Async Crawler with Playwright support
unclecode Nov 12, 2024
8c22396
Merge pull request #234 from devatnull/patch-1
unclecode Nov 12, 2024
00026b5
feat(config): Adding a configurable way of setting the cache director…
Nov 12, 2024
bf91adf
fix: Resolve unexpected BrowserContext closure during crawl in Docker
unclecode Nov 13, 2024
61b93eb
Update change log
unclecode Nov 13, 2024
38044d4
Merge pull request #255 from maheshpec/feature/configure-cache-directory
unclecode Nov 13, 2024
c38ac29
perf(crawler): major performance improvements & raw HTML support
unclecode Nov 13, 2024
17913f5
feat(crawler): support local files and raw HTML input in AsyncWebCrawler
unclecode Nov 13, 2024
3d00fee
- In this commit, the library is updated to process file downloads. U…
unclecode Nov 14, 2024
7f1ae5a
Update changelog
unclecode Nov 14, 2024
1f269f9
test(content_filter): add comprehensive tests for BM25ContentFilter f…
unclecode Nov 15, 2024
ae7ebc0
chore: update .gitignore and enhance changelog with major feature add…
unclecode Nov 15, 2024
60670b2
Merge pull request #7 from aravindkarnam/main
aravindkarnam Nov 15, 2024
d0014c6
New async database manager and migration support
unclecode Nov 16, 2024
5098442
refactor: migrate versioning to __version__.py and remove deprecated …
unclecode Nov 16, 2024
90df692
feat(crawl_sync): add synchronous crawl endpoint and corresponding test
unclecode Nov 16, 2024
e62c807
feat(deploy): add Railway deployment configuration and setup instruct…
unclecode Nov 16, 2024
f77f06a
feat(deploy): add deployment configuration and templates for crawl4ai
unclecode Nov 16, 2024
fca1319
feat(docker): add MkDocs installation and build step for documentation
unclecode Nov 16, 2024
6f2fe59
feat(deploy): update instance size to professional-xs and add memory …
unclecode Nov 16, 2024
6b569cc
feat(deploy): update branch to 0.3.74 and change instance size to bas…
unclecode Nov 16, 2024
67edc2d
feat(deploy): update instance size to professional-xs and add memory …
unclecode Nov 16, 2024
5d0b132
feat(deploy): change instance size to professional-xs and update memo…
unclecode Nov 16, 2024
79feab8
refactor(deploy): remove memory utilization alert configuration from …
unclecode Nov 16, 2024
1961adb
refactor(docker): remove shared memory size configuration to streamli…
unclecode Nov 16, 2024
6360d05
feat(api): add API token authentication and update Dockerfile descrip…
unclecode Nov 16, 2024
9139ef3
feat(docker): update Dockerfile for improved installation process and…
unclecode Nov 16, 2024
4b45b28
feat(docs): enhance deployment documentation with one-click setup, AP…
unclecode Nov 16, 2024
3a66aa8
feat(cache): introduce CacheMode and CacheContext for enhanced cachin…
unclecode Nov 17, 2024
3a524a3
fix(docs): remove unnecessary blank line in README for improved reada…
unclecode Nov 17, 2024
2a82455
feat(crawl): implement direct crawl functionality and introduce Cache…
unclecode Nov 17, 2024
f9fe6f8
feat(database): implement version management and migration checks dur…
unclecode Nov 17, 2024
a59c107
Update changelog for 0.3.74
unclecode Nov 17, 2024
df63a40
feat(docs): update examples and documentation to replace bypass_cache…
unclecode Nov 17, 2024
152ac35
feat(docs): update README for version 0.3.74 with new features and im…
unclecode Nov 17, 2024
852729f
feat(docker): add Docker Compose configurations for local and hub dep…
unclecode Nov 18, 2024
b6af94c
Merge remote-tracking branch 'origin/main' into 0.3.74
unclecode Nov 18, 2024
73658c7
chore: update .gitignore to include manage-collab.sh
unclecode Nov 19, 2024
593c7ad
test: trying to push to main
Nov 19, 2024
3aae30e
test1: trying to push to main
Nov 19, 2024
2f19d38
Update .gitignore to include .gitboss/ and todo_executor.md
unclecode Nov 19, 2024
788c67c
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 19, 2024
fbcff85
Remove test files
unclecode Nov 19, 2024
a6dad3f
test: trying to push to 0.3.74
Nov 19, 2024
f2cb7d5
Delete test3.txt
unclecode Nov 19, 2024
b654c49
Update .gitignore to exclude additional scripts and files
unclecode Nov 19, 2024
2bdec1f
chore: add manage-collab.sh to .gitignore
unclecode Nov 19, 2024
7047422
Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0…
unclecode Nov 19, 2024
d418a04
Fix #260 prevent pass duplicated kwargs to scrapping_strategy (#269)
darwing1210 Nov 20, 2024
3439f78
fix: crawler strategy exception handling and fixes (#271)
NanmiCoder Nov 20, 2024
dbb751c
In this commit, we introduce the new concept of MakrdownGenerationStr…
unclecode Nov 21, 2024
006bee4
feat: enhance image processing capabilities
unclecode Nov 22, 2024
571dda6
Update Readme
unclecode Nov 22, 2024
24ad2fe
feat: enhance Markdown generation to include fit_html attribute
unclecode Nov 22, 2024
e02935d
chore: update README to reflect new features and improvements in vers…
unclecode Nov 22, 2024
8dea3f4
chore: update README to include new features and improvements for ver…
unclecode Nov 22, 2024
a5decaa
Merge branch '0.3.74'
unclecode Nov 22, 2024
d7a112f
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 22, 2024
0d0cef3
feat: add enhanced markdown generation example with citations and fil…
unclecode Nov 22, 2024
c179703
Fixed a few bugs, import errors and changed to asyncio wait_for inste…
aravindkarnam Nov 23, 2024
f8e85b1
Fixed a bug in _process_links, handled condition for when url_scorer …
aravindkarnam Nov 23, 2024
3d52b55
Merge pull request #8 from aravindkarnam/main
aravindkarnam Nov 23, 2024
2226ef5
fix: Exempting the start_url from can_process_url
aravindkarnam Nov 23, 2024
d729aa7
refactor: Add group ID to for images extracted from srcset.
unclecode Nov 23, 2024
829a1f7
feat: update version to 0.3.741 and enhance content filtering with he…
unclecode Nov 23, 2024
edad7b6
chore: remove Railway deployment configuration and related documentation
unclecode Nov 24, 2024
d7c5b90
feat: add support for arm64 platform in Docker commands and update IN…
unclecode Nov 24, 2024
de43505
feat: update version to 0.3.742
unclecode Nov 24, 2024
b09a86c
chore: remove deprecated Docker Compose configurations for crawl4ai s…
unclecode Nov 24, 2024
195c0cc
chore: remove deprecated Docker Compose configurations for crawl4ai s…
unclecode Nov 24, 2024
b13fd71
chore: 1. Expose process_external_links as a param
aravindkarnam Nov 26, 2024
ee3001b
fix: moved depth as a param to can_process_url and applying filter ch…
aravindkarnam Nov 26, 2024
a98d51a
Remove the can_process_url check from _process_links since it's alrea…
aravindkarnam Nov 26, 2024
a888c91
Fix "Future attached to a different loop" error by ensuring tasks are…
aravindkarnam Nov 26, 2024
155c756
<Future pending> issue fix was incorrect. Reverting
aravindkarnam Nov 26, 2024
9530ded
fixed the final scraper_quickstart.py example
aravindkarnam Nov 26, 2024
ff731e4
fixed the final scraper_quickstart.py example
aravindkarnam Nov 26, 2024
2f5e059
updated definition of can_process_url to include dept as an argument,…
aravindkarnam Nov 26, 2024
c6a0221
docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixi…
unclecode Nov 27, 2024
b5d4db0
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 27, 2024
73661f7
docs: enhance development installation instructions (#286)
nelzomal Nov 27, 2024
f998e9e
Fix: handled the cases where markdown_with_citations, references_mark…
HamzaFarhan Nov 27, 2024
24723b2
Enhance features and documentation
unclecode Nov 28, 2024
a1c7dc1
Merge branch 'next' of https://github.com/unclecode/crawl4ai into next
unclecode Nov 28, 2024
3ff0b0b
feat: update changelog for version 0.3.743 with new features, improve…
unclecode Nov 28, 2024
76bea6c
Merge branch 'main' into 0.3.743
unclecode Nov 28, 2024
c2d4784
fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit…
unclecode Nov 28, 2024
e4acd18
docs: update README for version 0.3.743 with new features, enhancemen…
unclecode Nov 28, 2024
ce7d494
docs: update README for version 0.3.743 with new features, enhancemen…
unclecode Nov 28, 2024
d556dad
docs: update README to keep details open for extraction capabilities,…
unclecode Nov 28, 2024
3abb573
docs: update README for version 0.3.743 with improved formatting and …
unclecode Nov 28, 2024
d583aa4
refactor: update cache handling in quickstart_async example to use Ca…
unclecode Nov 28, 2024
a69f7a9
fix: correct typo in function documentation for clarity and accuracy
unclecode Nov 28, 2024
ddfb670
docs: update README to reflect new branding and improve section headi…
unclecode Nov 28, 2024
3fda66b
docs: refine README content for clarity and conciseness, improving de…
unclecode Nov 28, 2024
efe93a5
docs: enhance README with development TODOs and refine mission statem…
unclecode Nov 28, 2024
0cbd594
Merge branch 'next' - Update README, and quickstart examples
unclecode Nov 28, 2024
0bccf23
docs: update quickstart_async.py to enable example function calls for…
unclecode Nov 28, 2024
a036b7f
feat: implement create_box_message utility for formatted error messag…
unclecode Nov 28, 2024
a9b6b65
chore: update version to 0.3.744 and add publish.sh to .gitignore
unclecode Nov 28, 2024
b14e83f
docs: fix link formatting for recent updates section in README
unclecode Nov 28, 2024
776efa7
docs: fix link formatting for recent updates section in README
unclecode Nov 28, 2024
48d43c1
docs: fix link formatting for recent updates section in README
unclecode Nov 28, 2024
9221c08
docs: fix link formatting for recent updates section in README
unclecode Nov 28, 2024
cf35cbe
CRAWL4_AI_BASE_DIRECTORY should be Path object instead of string (#298)
paulokuong Nov 28, 2024
1d83c49
Enhance setup process and update contributors list
unclecode Nov 28, 2024
652d396
chore: update version to 0.3.745
unclecode Nov 28, 2024
7d81c17
fix: improve handling of CRAWL4_AI_BASE_DIRECTORY environment variabl…
unclecode Nov 28, 2024
98c64f9
Merge branch 'next'
unclecode Nov 28, 2024
aa3e2d0
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 28, 2024
c848577
docs: update README to reflect latest version v0.3.745
unclecode Nov 28, 2024
c0e87ab
fix: update package versions in requirements.txt for compatibility
unclecode Nov 28, 2024
b0419ed
Update README.md (#300)
unclecode Nov 28, 2024
449dd7c
Migrating from the classic setup.py to a using PyProject approach.
unclecode Nov 29, 2024
12e73d4
refactor: remove legacy build hooks and setup files, migrate to setup…
unclecode Nov 29, 2024
d202f35
Enhance installation and migration processes
unclecode Nov 29, 2024
93bf3e8
Refactor Dockerfile and clean up main.py
unclecode Nov 29, 2024
f9c98a3
Enhance Docker support and improve installation process
unclecode Nov 29, 2024
1def53b
docs: update Raspberry Pi section to indicate upcoming support
unclecode Nov 29, 2024
569bdb6
Merge branch 'next'
unclecode Nov 29, 2024
1ed7c15
:adhesive_bandage: Page-evaluate navigation destroyed error (#304)
dvschuyl Nov 29, 2024
0780db5
fix: handle errors during image dimension updates in AsyncPlaywrightC…
unclecode Nov 29, 2024
8c76a8c
docs: add contributor entry for dvschuyl regarding AsyncPlaywrightCra…
unclecode Nov 29, 2024
3e83893
Enhance User-Agent Handling
unclecode Nov 30, 2024
80d58ad
bump version to 0.3.747
unclecode Nov 30, 2024
293f299
Add PruningContentFilter with unit tests and update documentation
unclecode Dec 1, 2024
95a4f74
fix: pass logger to WebScrapingStrategy and update score computation …
unclecode Dec 2, 2024
e9639ad
refactor: improve error handling in DataProcessor and optimize data p…
unclecode Dec 3, 2024
b02544b
docs: update README and blog for version 0.4.0 release, highlighting …
unclecode Dec 3, 2024
486db3a
Updated to version 0.4.0 with new features
unclecode Dec 4, 2024
56f82f3
Merge branch 'next'
unclecode Dec 4, 2024
a45b8b1
Merge issues with 0.4.0 is over
unclecode Dec 4, 2024
8c611dc
Refactored web scraping components
unclecode Dec 5, 2024
c51e901
feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light…
unclecode Dec 8, 2024
740214e
Merge branch 'next'
unclecode Dec 8, 2024
e3488da
fixing Readmen tap (#313)
olavohenrique03 Dec 9, 2024
ba3e808
fix: The extract method logs output only when self.verbose is set to …
lu4nx Dec 9, 2024
2d31915
Commit Message:
unclecode Dec 9, 2024
ded554d
Fixed typo (#324)
moamamun Dec 9, 2024
e130fd8
Implement new async crawler features and stability updates
unclecode Dec 10, 2024
5431fa2
Add PDF & screenshot functionality, new tutorial
unclecode Dec 10, 2024
7591648
Update async_webcrawler.py (#337)
lvzhengri Dec 10, 2024
5188b7a
Add full-page screenshot and PDF export features
unclecode Dec 10, 2024
0982c63
Enhance AsyncWebCrawler and related configurations
unclecode Dec 12, 2024
de1766d
Bump version to 0.4.2
unclecode Dec 12, 2024
3d69715
chore: Update .gitignore to include new files and directories
unclecode Dec 12, 2024
20d6f5f
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Dec 12, 2024
4a72c5e
Add release notes and documentation for version 0.4.2: Configurable C…
unclecode Dec 12, 2024
399af80
Merge branch 'next'
unclecode Dec 12, 2024
7af1d32
Update README for version 0.4.2: Reflect new features and enhancements
unclecode Dec 12, 2024
7524aa7
Feature: Add Markdown generation to CrawlerRunConfig
unclecode Dec 13, 2024
e9e5b56
Fix js_snipprt issue 0.4.21
unclecode Dec 15, 2024
ed7bc19
Bump version to 0.4.22
unclecode Dec 15, 2024
7c0fa26
Merge pull request #9 from aravindkarnam/main
aravindkarnam Dec 17, 2024
7a5f83b
fix: Added browser config and crawler run config from 0.4.22
aravindkarnam Dec 18, 2024
19 changes: 19 additions & 0 deletions .do/app.yaml
@@ -0,0 +1,19 @@
alerts:
- rule: DEPLOYMENT_FAILED
- rule: DOMAIN_FAILED
name: crawl4ai
region: nyc
services:
- dockerfile_path: Dockerfile
  github:
    branch: 0.3.74
    deploy_on_push: true
    repo: unclecode/crawl4ai
  health_check:
    http_path: /health
  http_port: 11235
  instance_count: 1
  instance_size_slug: professional-xs
  name: web
  routes:
  - path: /
22 changes: 22 additions & 0 deletions .do/deploy.template.yaml
@@ -0,0 +1,22 @@
spec:
  name: crawl4ai
  services:
    - name: crawl4ai
      git:
        branch: 0.3.74
        repo_clone_url: https://github.com/unclecode/crawl4ai.git
      dockerfile_path: Dockerfile
      http_port: 11235
      instance_count: 1
      instance_size_slug: professional-xs
      health_check:
        http_path: /health
      envs:
        - key: INSTALL_TYPE
          value: "basic"
        - key: PYTHON_VERSION
          value: "3.10"
        - key: ENABLE_GPU
          value: "false"
      routes:
        - path: /
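
Both deployment specs above point the platform health check at `/health` on port 11235. A minimal sketch of probing that endpoint by hand after a deploy might look like the following. The helper names `health_url` and `is_healthy` and the `localhost` address are assumptions made for illustration and are not part of this PR; the only thing the probe relies on is that the endpoint answers with a 2xx status when the service is up.

```python
import urllib.error
import urllib.request


def health_url(base_url: str, path: str = "/health") -> str:
    """Join a base URL and the health-check path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP error.
        return False


if __name__ == "__main__":
    # Hypothetical local address; 11235 is the http_port from the specs above.
    print(is_healthy("http://localhost:11235"))
```

Only the standard library is used, so the probe can run from any machine with Python installed and no extra dependencies.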
11 changes: 10 additions & 1 deletion .gitignore
@@ -199,13 +199,22 @@ test_env/
**/.DS_Store

todo.md
todo_executor.md
git_changes.py
git_changes.md
pypi_build.sh
git_issues.py
git_issues.md

.next/
.tests/
.issues/
.docs/
.issues/
.issues/
.gitboss/
todo_executor.md
protect-all-except-feature.sh
manage-collab.sh
publish.sh
combine.sh
combined_output.txt