
List spring cleaning #91

Open

pappasadrian opened this issue Feb 23, 2021 · 4 comments

Comments

@pappasadrian
Contributor

Hello,

I was taking a brief look through the filter list. Since this project has been going for more than 10 years, it might be worth doing some spring cleaning to identify and remove:

  1. domains no longer in operation
  2. site-specific rules blocking items that no longer exist on the site
  3. duplicate or redundant rules

I can't think of a good methodology for this, other than manually checking.

Any thoughts and ideas? Is this even necessary?
I was thinking that reducing the number of rules might also help adblocker performance? IDK.
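For item 3, a first pass at exact duplicates could be as simple as a sort/uniq sweep. A minimal sketch, using made-up sample rules (the filename and rules here are placeholders, not entries from the real list):

```shell
# placeholder sample list with one duplicated rule
cat > sample-list.txt <<'EOF'
||ads.example.gr^
example.net###banner
||ads.example.gr^
EOF

# print every rule that occurs more than once (exact duplicates only)
sort sample-list.txt | uniq -d
```

This only catches byte-identical lines; near-duplicates (e.g. a rule shadowed by a broader one) would still need manual review.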

@kargig
Owner

kargig commented Feb 25, 2021

Good idea! Even if it's not strictly necessary, it's always good to clean things up every now and then.

  1. Domains no longer in operation should be easy to spot: parse the domain name out of each rule and check whether it resolves. If it does, keep it; if it doesn't, remove it.
  2. + 3. are far more difficult, especially the element hiding rules.

@kargig
Owner

kargig commented Feb 27, 2021

For #92, I extracted (most of) the domains from the rules file with:

grep -Po '.*?(//|\|\||@@\|\||@@|\~)\K.*?(?=/|#)' void-gr-filters.txt | sort -u > domain-list.txt
grep -Po '^[0-9a-zA-Z].*?(?=/|#)' void-gr-filters.txt | sort -u >> domain-list.txt
sort -u domain-list.txt > domain-list-final.txt
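To illustrate what the two extraction passes pick up, here is a run on a few made-up rules (placeholders, not entries from the real list):

```shell
# placeholder rules covering both cases
cat > sample-filters.txt <<'EOF'
||ads.example.gr/banner
@@||static.example.gr/allow
example.net###banner
EOF

# pass 1: rules with network markers (||, @@||, ...) before the hostname
grep -Po '.*?(//|\|\||@@\|\||@@|\~)\K.*?(?=/|#)' sample-filters.txt
# pass 2: rules that start directly with a hostname
grep -Po '^[0-9a-zA-Z].*?(?=/|#)' sample-filters.txt
```

Note that both passes require a `/` or `#` after the hostname, so rules like `||example.gr^` are missed, which is presumably why only "(most of)" the domains come out.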

and then ran this:

#!/bin/bash

DOMAIN_LIST="domain-list-final.txt"
#DOMAIN_LIST="testme.txt"
RESOLVER="1.1.1.1"
BAD_DOMAINS="bad_domains"
SUB_NO_RECORD="no_record"
WWW_EXISTS="www_exists"

rm -f "${BAD_DOMAINS}" "${SUB_NO_RECORD}" "${WWW_EXISTS}"

while read -r line; do
  # cleanups: strip any :port suffix, then drop entries that still contain
  # '/' or '|' or that start with a digit (IP addresses, leftover rule syntax)
  myline=$(echo "${line}" | awk -F':' '{ print $1 }')
  line=$(echo "${myline}" | grep -Ev '/|\|' | grep -Ev '^[0-9]')
  if [ -z "${line}" ]; then
    continue
  fi
  echo "Working on: ${line}"
  # Check if the subdomain exists
  if [ "$(dig "${line}" @${RESOLVER} +short)" = "" ]; then
  # Check if the subdomain with www prepended exists
    if [ "$(dig "www.${line}" @${RESOLVER} +short)" = "" ]; then
      domain=$(echo "${line}" | awk -F. '{ print $(NF-1) "." $NF }')
      # if the domain doesn't have NS records, the domain does not exist any more
      if [ "$(dig NS "${domain}" @${RESOLVER} +short)" = "" ]; then
        echo "${domain}" | tee -a "${BAD_DOMAINS}"
      # if the entry is a subdomain, we already know it doesn't have an A record
      elif [ "$(echo "${line}" | grep -o '\.' | wc -l)" -gt "1" ]; then
        echo "${line}" | tee -a "${SUB_NO_RECORD}"
      fi
    else
      echo "${line}" | tee -a "${WWW_EXISTS}"
    fi
  fi
done < "${DOMAIN_LIST}"

I double-checked all the "bad_domains" results manually.

@aldi
Collaborator

aldi commented Feb 28, 2021

@kargig Good stuff!

@krystian3w

krystian3w commented Jun 17, 2022

> redundant stuff

Some cosmetic filters have network-rule start characters: 044bc9f (added in 2016)

||www.dwrean.net##a[href="http://www.kratisinow.gr"]
||www.dwrean.net##iframe[src="http://kratisinow.digitup.eu/widget/widget-artists"]

AdGuard stopped supporting this in 2018: AdguardTeam/FiltersRegistry@a452d4d#diff-6472c0fcd53f81660278097de5b81a5a1cd70c38b8a5068d02039207a61d5726R93-R95
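If these are meant as plain element hiding rules, the fix would presumably be to drop the `||` network prefix (my assumption; the thread doesn't show the corrected form):

```
www.dwrean.net##a[href="http://www.kratisinow.gr"]
www.dwrean.net##iframe[src="http://kratisinow.digitup.eu/widget/widget-artists"]
```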
