Skip to content

Highly optimized domain name extraction library written in C++

License

Notifications You must be signed in to change notification settings

nsteinberg-r7/PyDomainExtractor

 
 

Repository files navigation

Logo

Highly optimized domain name extraction library written in C++

license Python Build PyPi

Table of Contents

About The Project

PyDomainExtractor is a library intended for parsing domain names into their parts fast. The library is written in C++ to achieve the highest performance possible.

Built With

Performance

Extract From Domain

Test was measured on a file containing 10 million random domains from various TLDs (Sep. 24th 2020)

Library Function Time
PyDomainExtractor pydomainextractor.extract 2.30s
publicsuffix2 publicsuffix2.get_sld 25.77s
tldextract __call__ 34.22s
tld tld.parse_tld 36.64s

Extract From URL

Test was measured on a file containing 1 million random urls (Sep. 24th 2020)

Library Function Time
PyDomainExtractor pydomainextractor.extract 2.76s
publicsuffix2 publicsuffix2.get_sld 14.33s
tldextract __call__ 44.34s
tld tld.parse_tld 79.13s

Prerequisites

In order to compile this package you should have GCC, libidn2, and Python development package installed.

  • Fedora
sudo dnf install python3-devel libidn2-devel gcc-c++
  • Ubuntu 18.04
sudo apt install python3-dev libidn2-dev g++-9

Installation

pip3 install PyDomainExtractor

Usage

Extraction

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('google.com')
>>> {
>>>     'subdomain': '',
>>>     'domain': 'google',
>>>     'suffix': 'com'
>>> }

# Loads a custom SuffixList data. Should follow PublicSuffixList's format.
domain_extractor = pydomainextractor.DomainExtractor(
    'tld\n'
    'custom.tld\n'
)

domain_extractor.extract('google.com')
>>> {
>>>     'subdomain': 'google',
>>>     'domain': 'com',
>>>     'suffix': ''
>>> }

domain_extractor.extract('google.custom.tld')
>>> {
>>>     'subdomain': '',
>>>     'domain': 'google',
>>>     'suffix': 'custom.tld'
>>> }

URL Extraction

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('http://google.com/')
>>> {
>>>     'subdomain': '',
>>>     'domain': 'google',
>>>     'suffix': 'com'
>>> }

Validation

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.is_valid_domain('google.com')
>>> True

domain_extractor.is_valid_domain('domain.اتصالات')
>>> True

domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
>>> True

domain_extractor.is_valid_domain('domain-.com')
>>> False

domain_extractor.is_valid_domain('-sub.domain.com')
>>> False

domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
>>> False

TLDs List

import pydomainextractor


# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.get_tld_list()
>>> [
>>>     'bostik',
>>>     'backyards.banzaicloud.io',
>>>     'biz.bb',
>>>     ...
>>> ]

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - [email protected]

Project Link: https://github.com/Intsights/PyDomainExtractor

About

Highly optimized domain name extraction library written in C++

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 91.3%
  • Python 8.7%