PyDomainExtractor (nsteinberg-r7/PyDomainExtractor)

Highly optimized domain name extraction library written in C++

About The Project

PyDomainExtractor is a library for quickly parsing domain names into their parts: subdomain, domain, and suffix. It is written in C++ for performance.

Built With

Performance

Extract From Domain

Measured on a file containing 10 million random domains from various TLDs (Sep. 24th, 2020).

Library	Function	Time
PyDomainExtractor	pydomainextractor.extract	2.30s
publicsuffix2	publicsuffix2.get_sld	25.77s
tldextract	__call__	34.22s
tld	tld.parse_tld	36.64s

Extract From URL

Measured on a file containing 1 million random URLs (Sep. 24th, 2020).

Library	Function	Time
PyDomainExtractor	pydomainextractor.extract	2.76s
publicsuffix2	publicsuffix2.get_sld	14.33s
tldextract	__call__	44.34s
tld	tld.parse_tld	79.13s
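The timings above are wall-clock totals for running one function over every line of an input file. A minimal sketch of such a measurement is shown here; `time_calls` and `naive_split` are hypothetical names, and `naive_split` is only a stand-in for the function under test, not any of the benchmarked libraries.

```python
import time


def time_calls(func, inputs):
    """Return the wall-clock seconds spent calling func on every input."""
    start = time.perf_counter()
    for item in inputs:
        func(item)
    return time.perf_counter() - start


def naive_split(domain):
    # Hypothetical stand-in for the function being benchmarked.
    return domain.rsplit('.', 1)


# Synthetic input; the original benchmark read domains from a file.
domains = ['example{}.com'.format(i) for i in range(10000)]
elapsed = time_calls(naive_split, domains)
print('{:.4f}s'.format(elapsed))
```

To compare libraries fairly, the same input list would be passed to each candidate function in turn.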

Prerequisites

To compile this package, you need GCC, libidn2, and the Python development package installed.

Fedora

sudo dnf install python3-devel libidn2-devel gcc-c++

Ubuntu 18.04

sudo apt install python3-dev libidn2-dev g++-9

Installation

pip3 install PyDomainExtractor

Usage

Extraction

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('google.com')
# Returns:
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'com'
# }

# Loads custom suffix list data. The data should follow the PublicSuffixList format.
domain_extractor = pydomainextractor.DomainExtractor(
    'tld\n'
    'custom.tld\n'
)

# 'com' is not in the custom list, so no suffix is recognized.
domain_extractor.extract('google.com')
# Returns:
# {
#     'subdomain': 'google',
#     'domain': 'com',
#     'suffix': ''
# }

domain_extractor.extract('google.custom.tld')
# Returns:
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'custom.tld'
# }
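The results above are consistent with longest-suffix matching against the loaded list. The pure-Python sketch below illustrates that idea under stated assumptions: `split_domain` is a hypothetical helper, it ignores the wildcard (`*.`) and exception (`!`) rules of the PublicSuffixList format, and it is not the library's C++ implementation.

```python
def split_domain(name, suffixes):
    """Split name into subdomain, domain, and suffix.

    suffixes is a set of plain suffix rules such as {'com', 'co.uk'}.
    Wildcard and exception rules are not handled in this sketch.
    """
    labels = name.lower().split('.')
    suffix = ''
    rest = labels
    # Try candidate suffixes from the longest to the shortest.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in suffixes:
            suffix = candidate
            rest = labels[:i]
            break
    # The label right before the suffix is the domain; anything
    # further left is the subdomain.
    domain = rest[-1] if rest else ''
    subdomain = '.'.join(rest[:-1])
    return {'subdomain': subdomain, 'domain': domain, 'suffix': suffix}


custom = {'tld', 'custom.tld'}
print(split_domain('google.custom.tld', custom))
print(split_domain('google.com', custom))
```

With the custom list, `google.com` matches no suffix, so the last label is reported as the domain and the suffix stays empty, which mirrors the output documented above.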

URL Extraction

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('http://google.com/')
# Returns:
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'com'
# }
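Only the host part of a URL matters for the extraction shown above. As an illustration of which part that is, the sketch below isolates the host with Python's standard `urllib.parse`; `host_from_input` is a hypothetical helper, and how the library itself parses URLs internally is not documented here.

```python
from urllib.parse import urlsplit


def host_from_input(value):
    """Return the lowercase host of a URL, or the value itself if it has no scheme."""
    if '://' not in value:
        # Treat the input as a bare domain name.
        return value.lower()
    # hostname drops userinfo and port, and is already lowercased.
    return urlsplit(value).hostname or ''


print(host_from_input('http://google.com/'))  # google.com
print(host_from_input('https://user@WWW.Example.co.uk:8443/path?q=1'))  # www.example.co.uk
```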

Validation

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.is_valid_domain('google.com')
# True

domain_extractor.is_valid_domain('domain.اتصالات')
# True

domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
# True

domain_extractor.is_valid_domain('domain-.com')
# False

domain_extractor.is_valid_domain('-sub.domain.com')
# False

domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
# False
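The ASCII cases above follow the usual hostname label rules: letters, digits, and hyphens only, no leading or trailing hyphen, at most 63 characters per label. The sketch below checks just those rules. `looks_like_valid_hostname` is a hypothetical helper that assumes the name is already in ASCII (Punycode) form; Unicode names such as `domain.اتصالات` would first need IDNA conversion, which this sketch does not attempt, and it does not check the suffix against a suffix list.

```python
import re

# One label: 1 to 63 letters, digits, or hyphens, not starting or ending with a hyphen.
LABEL_RE = re.compile(r'(?!-)[A-Za-z0-9-]{1,63}(?<!-)')


def looks_like_valid_hostname(name):
    """Check ASCII hostname syntax only."""
    if not name or len(name) > 253:
        return False
    return all(LABEL_RE.fullmatch(label) for label in name.split('.'))


print(looks_like_valid_hostname('google.com'))       # True
print(looks_like_valid_hostname('domain-.com'))      # False
print(looks_like_valid_hostname('-sub.domain.com'))  # False
```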

TLDs List

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.get_tld_list()
# Returns:
# [
#     'bostik',
#     'backyards.banzaicloud.io',
#     'biz.bb',
#     ...
# ]
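The returned list mixes single-label suffixes with multi-label ones. As one example of post-processing it, the sketch below groups suffixes by their number of labels; `group_by_label_count` is a hypothetical helper, and the sample list reuses only the three entries shown above rather than real library output.

```python
from collections import defaultdict


def group_by_label_count(suffixes):
    """Map the number of dot-separated labels to the suffixes having that many."""
    groups = defaultdict(list)
    for suffix in suffixes:
        groups[suffix.count('.') + 1].append(suffix)
    return dict(groups)


sample = ['bostik', 'backyards.banzaicloud.io', 'biz.bb']
groups = group_by_label_count(sample)
print(groups[1])  # ['bostik']
print(groups[3])  # ['backyards.banzaicloud.io']
```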

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - [email protected]

Project Link: https://github.com/Intsights/PyDomainExtractor

