Python implementation (#13)

* Python: Initial commit, Added extra files to specs folder for easy testing.
* fixed import statements
* fixed json schema regex
* optimized for python
* no longer abstract, changed comments
* changed return value
* test.py now only runs spec tests.
* removed specs for rebasing
* Python: Rebased and removed old specs.
* fixed import statements
* fixed json schema regex
* optimized for python
* no longer abstract, changed comments
* changed return value
* test.py now only runs spec tests.
* removed specs for rebasing
* purehtml python version 0.1.0
* Removed unused files, updated readme.md and driver script.
* style: format readme
* style: format code
* feat: renamed extract_from_str to extract, validate_string to is_valid_config_yaml
* docs: package name set
* feat: added .gitignore
* style: format code
* feat: renamed config factory methods, removed unnecessary helper functions, moved test helper to root
* fix: parent
* fix: parent
* refactor: separate test schema and configuration reading from validation
* feat: save relative paths
* simplify test output
* refactor : first selection no longer happens within the PureHTML class.
* refactor : renamed PureHTMLInitializer Class to BeautifulSoupBackend.
* style: All files reformatted and file names changed according to snake case naming convention.
* fix : removed duplicate ConfigFactory.py file
* test: html test case fixed to match cheerio's result
* test: test spec for the purehtml's basics added
* fix : changed return value from float to int.
* feat : changed parser to html5lib.
* fix: html5lib added to the dependencies
* build: setup python package

---------

Co-authored-by: onurc <[email protected]>
purescraps · Dec 9, 2024 · 746a1ed
1 parent 90bfa93
commit 746a1ed
Showing 42 changed files with 1,946 additions and 2 deletions.
diff --git a/.github/workflows/publish-python-package.yml b/.github/workflows/publish-python-package.yml
@@ -0,0 +1,22 @@
+on:
+  push:
+    branches: ["master"]
+    paths: ["python/**"]
+
+jobs:
+  pypi-publish:
+    # there should be a tag
+    if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/python-v')
+
+    name: Upload release to PyPI
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/purehtml-purescraps
+    permissions:
+      id-token: write # IMPORTANT: this permission is mandatory for trusted publishing
+    steps:
+      # retrieve your distributions here
+
+      - name: Publish package distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,2 @@
 /.vscode
-
+/.idea
diff --git a/python/.gitignore b/python/.gitignore
@@ -0,0 +1,163 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
diff --git a/python/LICENSE b/python/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Onur C. Güven
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/python/Makefile b/python/Makefile
@@ -0,0 +1,16 @@
+.PHONY: build
+
+build: dist
+
+.PHONY: clean
+
+clean:
+	rm -rf dist
+
+dist: purehtml/**/*.py
+	python3 -m build
+
+.PHONY: publish
+
+publish:
+	python3 -m twine upload dist/*
diff --git a/python/README.md b/python/README.md
@@ -0,0 +1,59 @@
+# PureHTML
+
+PureHTML is a parsing specification for extracting JSON from HTML.
+
+- [Documentation](https://purescraps.github.io/purehtml/)
+
+**Installation**:
+
+To install **PureHTML**, use `pip` with the following command:
+
+```bash
+  pip install purehtml
+```
+
+## Usage
+
+```python
+from purehtml import ConfigFactory, extract
+
+yaml_string = """
+selector: span
+type: array
+items: {}
+"""
+html_string = """
+<div>
+  <span>a</span>
+  <span>b</span>
+  <span>c</span>
+</div>
+"""
+expected_output = "['a', 'b', 'c']"
+
+config = ConfigFactory.from_yaml(yaml_string)
+
+print(f"Extracted output : {extract(config, html_string, 'http://example.com')}")
+print(f"Expected  output : {expected_output}")
+
+```
+
+## Development
+
+For testing, we use test specifications defined in `<purehtml root>/specs`.
+You may run `python3 run_tests.py` to check whether the Python implementation
+works correctly or has incompatibilities with the specs.
+
+#### Building
+
+1. Install the `build` package with pip: `python3 -m pip install --upgrade build`
+2. Execute `make clean build` to create a clean build output. The output will be placed in `dist/`.
+
+#### Publishing to Python Package Index
+
+1. Install the `twine` package: `python3 -m pip install --upgrade twine`
+2. Publish to PyPI with `python3 -m twine upload dist/*` or `make publish`.
+
+## License
+
+[MIT](https://choosealicense.com/licenses/mit/)
diff --git a/python/purehtml/__init__.py b/python/purehtml/__init__.py
@@ -0,0 +1,7 @@
+# __init__.py
+
+from purehtml.config_factory import ConfigFactory
+from purehtml.purehtml import extract
+from purehtml.purehtml_validator import is_valid_config_yaml
+
+__all__ = ["ConfigFactory", "extract", "is_valid_config_yaml"]
diff --git a/python/purehtml/backend/__init__.py b/python/purehtml/backend/__init__.py
diff --git a/python/purehtml/backend/backend.py b/python/purehtml/backend/backend.py
@@ -0,0 +1,89 @@
+from bs4 import BeautifulSoup
+
+
+class BeautifulSoupBackend:
+    """
+    A backend implementation using BeautifulSoup.
+    """
+
+    @staticmethod
+    def load(html):
+        """
+        Load an HTML string or buffer into a PureHTMLDocument.
+        """
+        return PureHTMLDocument(BeautifulSoup(html, "html5lib"))
+
+
+class PureHTMLDocument:
+    """
+    A document implementation using BeautifulSoup.
+    """
+
+    def __init__(self, soup):
+        self._soup = soup
+
+    def select(self, selector):
+        """
+        Select elements based on a CSS selector and wrap them in PureHTMLNode objects.
+        """
+        elements = self._soup.select(selector)
+        return [PureHTMLNode(self._soup, el) for el in elements]
+
+    def root(self):
+        """
+        Return the root element of the document.
+        """
+        return PureHTMLNode(self._soup, self._soup)
+
+
+class PureHTMLNode:
+    """
+    A node implementation using BeautifulSoup.
+    """
+
+    def __init__(self, soup, element):
+        self._soup = soup
+        self._element = element
+
+    def attr(self, name=None):
+        """
+        Get an attribute or all attributes of the node.
+        """
+        if name:
+            return self._element.get(name, None)
+        return self._element.attrs
+
+    def find(self, selector):
+        """
+        Find descendant elements based on a CSS selector and wrap them in PureHTMLNode objects.
+        """
+        # If `self._element` is the document root, search the whole document
+        if self._element is self._soup:
+            elements = self._soup.select(selector)
+        else:
+            # Otherwise, restrict the search to descendants of this element
+            elements = self._element.select(selector)
+        return [PureHTMLNode(self._soup, el) for el in elements]
+
+    def html(self):
+        """
+        Get the HTML content of the node.
+        """
+        return self._element.decode_contents()
+
+    def is_selector(self, selector):
+        """
+        Check if the node matches the selector.
+        """
+        # Elements anywhere in the document that match the selector
+        matched_elements = self._soup.select(selector)
+        # Compare by identity: BeautifulSoup's `==` is structural, so a plain
+        # `in` test would also accept a different element that happens to
+        # have identical markup.
+        return any(el is self._element for el in matched_elements)
+
+    def text(self):
+        """
+        Get the text content of the node.
+        """
+        return self._element.get_text()
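
Note on `is_selector`: when checking whether a node is among the results of `select`, it matters whether elements are compared by equality or by identity. BeautifulSoup treats two `Tag` objects as equal when they represent the same markup, so a membership test with `in` can match a different element that merely looks the same. The sketch below is a standalone illustration using `bs4` directly, not code from this commit. The variable names are hypothetical, and `html.parser` is swapped in for the commit's `html5lib` parser only so the example runs without an extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="a"><span>x</span></div>
<p><span>x</span></p>
"""

# Assumption: "html.parser" replaces the commit's "html5lib" parser here
# purely to keep the sketch free of an extra dependency.
soup = BeautifulSoup(html, "html.parser")

# All spans in the document, in document order
texts = [el.get_text() for el in soup.select("span")]

# Only the span inside <div class="a"> matches this selector
matched = soup.select("div.a span")

# The span inside <p> does NOT match "div.a span", but has identical markup
outside = soup.select_one("p span")

# Membership via `in` uses `==`, which BeautifulSoup defines structurally
equality_match = outside in matched

# Identity comparison only accepts the very same element object
identity_match = any(el is outside for el in matched)

print(texts, equality_match, identity_match)
```

The identity-based check is the stricter of the two: it reports a match only for the exact element that the selector returned.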