PyDomainExtractor is a Python library for quickly parsing domain names into their parts (subdomain, domain, and suffix). It is written in C++ for performance.
Measured on a file containing 10 million random domains from various TLDs (Sep. 24th, 2020):
Library | Function | Time |
---|---|---|
PyDomainExtractor | pydomainextractor.extract | 2.30s |
publicsuffix2 | publicsuffix2.get_sld | 25.77s |
tldextract | __call__ | 34.22s |
tld | tld.parse_tld | 36.64s |
Measured on a file containing 1 million random URLs (Sep. 24th, 2020):
Library | Function | Time |
---|---|---|
PyDomainExtractor | pydomainextractor.extract | 2.76s |
publicsuffix2 | publicsuffix2.get_sld | 14.33s |
tldextract | __call__ | 44.34s |
tld | tld.parse_tld | 79.13s |
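The benchmark script itself is not shown here. As a rough sketch of how such a wall-clock measurement might be taken, the hypothetical helper `time_extractor` below times any callable over a list of inputs; the stand-in callable and the input list are placeholders, not the data used for the tables above.

```python
import time


def time_extractor(extract, inputs):
    """Return the wall-clock seconds spent calling extract on every input."""
    start = time.perf_counter()
    for item in inputs:
        extract(item)
    return time.perf_counter() - start


# Stand-in callable for illustration; in a real run this would be
# something like domain_extractor.extract, and the inputs would be
# read from the benchmark file.
elapsed = time_extractor(str.lower, ['Google.com'] * 1000)
print(f'{elapsed:.4f}s')
```

Timing the whole loop, rather than each call, keeps the overhead of the timer itself out of the result.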
To compile this package, you need GCC, libidn2, and the Python development package installed.
- Fedora
sudo dnf install python3-devel libidn2-devel gcc-c++
- Ubuntu 18.04
sudo apt install python3-dev libidn2-dev g++-9
pip3 install PyDomainExtractor
import pydomainextractor
# Loads the Public Suffix List snapshot bundled with the library. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'com'
>>> }
# Loads custom suffix list data, which should follow the Public Suffix List format.
domain_extractor = pydomainextractor.DomainExtractor(
'tld\n'
'custom.tld\n'
)
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': 'google',
>>> 'domain': 'com',
>>> 'suffix': ''
>>> }
domain_extractor.extract('google.custom.tld')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'custom.tld'
>>> }
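The outputs above are consistent with a longest-suffix match against the loaded list. The following is a minimal pure-Python sketch of that idea, not the library's C++ implementation. The function `split_domain` and its handling of names with no matching suffix are assumptions chosen to reproduce the results shown above, and the wildcard and exception rules of the Public Suffix List format are ignored.

```python
def split_domain(name, suffixes):
    """Split name into subdomain, domain, and suffix (illustrative only)."""
    labels = name.lower().split('.')
    suffix_start = len(labels)  # default: no suffix matched
    # Candidates are tried from longest to shortest, so the first hit
    # is the longest matching suffix.
    for i in range(len(labels)):
        if '.'.join(labels[i:]) in suffixes:
            suffix_start = i
            break
    head = labels[:suffix_start]
    return {
        'subdomain': '.'.join(head[:-1]),
        'domain': head[-1] if head else '',
        'suffix': '.'.join(labels[suffix_start:]),
    }


custom_suffixes = {'tld', 'custom.tld'}
print(split_domain('google.com', custom_suffixes))
print(split_domain('google.custom.tld', custom_suffixes))
```

With the custom list, `com` is not a known suffix, so it is treated as the domain and `google` becomes the subdomain, as in the example above.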
import pydomainextractor
# Loads the Public Suffix List snapshot bundled with the library. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.extract('http://google.com/')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'com'
>>> }
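Extracting from a URL implies isolating the host before the suffix lookup. How the library does this internally is not described here; as a sketch of what that step involves, the hypothetical helper `host_from_url` below uses only the standard library's `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def host_from_url(url):
    """Return the lowercase host part of a URL, or '' if there is none."""
    # .hostname drops any userinfo and port and lowercases the host.
    return urlsplit(url).hostname or ''


print(host_from_url('http://google.com/'))
print(host_from_url('https://user@Sub.Example.com:8443/path?q=1'))
```

Note that a bare name without a scheme, such as `google.com`, has no network location according to `urlsplit`, so this helper returns an empty string for it.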
import pydomainextractor
# Loads the Public Suffix List snapshot bundled with the library. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.is_valid_domain('google.com')
>>> True
domain_extractor.is_valid_domain('domain.اتصالات')
>>> True
domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
>>> True
domain_extractor.is_valid_domain('domain-.com')
>>> False
domain_extractor.is_valid_domain('-sub.domain.com')
>>> False
domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
>>> False
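The ASCII cases above follow the usual hostname label rules: letters, digits, and hyphens only, no hyphen at either end of a label. The sketch below checks those rules in pure Python. It is an assumption about the kind of check involved, not the library's actual rule set; the name `looks_like_valid_hostname` is hypothetical, and Unicode labels such as `اتصالات` would first need IDNA conversion, which this sketch omits.

```python
import re

# One DNS label: 1 to 63 characters, ASCII letters, digits, or hyphens,
# with no hyphen at either end.
LABEL_RE = re.compile(
    r'[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?',
    re.ASCII | re.IGNORECASE,
)


def looks_like_valid_hostname(name):
    """Check ASCII hostname syntax only (no IDNA, no suffix lookup)."""
    if not name or len(name) > 253:
        return False
    return all(LABEL_RE.fullmatch(label) for label in name.split('.'))


print(looks_like_valid_hostname('google.com'))
print(looks_like_valid_hostname('domain-.com'))
```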
import pydomainextractor
# Loads the Public Suffix List snapshot bundled with the library. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.get_tld_list()
>>> [
>>> 'bostik',
>>> 'backyards.banzaicloud.io',
>>> 'biz.bb',
>>> ...
>>> ]
Distributed under the MIT License. See LICENSE for more information.
Gal Ben David - [email protected]
Project Link: https://github.com/Intsights/PyDomainExtractor