Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wabac fuzzy rules - update + process #335

Merged
merged 1 commit into from
Jun 27, 2024
Merged

Wabac fuzzy rules - update + process #335

merged 1 commit into from
Jun 27, 2024

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Jun 25, 2024

Fix #216

Changes:

  • no new rules to update, since we do not support Twitter, Washington Post, WixStatic, Facebook
  • moved rules definitions to yaml (instead of json) so that we can add inline comment ... sic ... and we do not need to escape the string values
  • added information about where these rules comes from, a bit on how to update them, when it was done last time
  • properly install dependencies needed to generate rules

@benoit74 benoit74 self-assigned this Jun 25, 2024
@benoit74 benoit74 force-pushed the wabac_fuzzy_rules branch 4 times, most recently from 47747b0 to 7cf858d Compare June 25, 2024 12:44
@benoit74 benoit74 marked this pull request as ready for review June 25, 2024 12:47
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

moved rules definitions to yaml (instead of json)
so that we can add inline comment ... sic ...

I've seen and used fake keys for this:

{
    "fuzzyRules": [
        {
            "//": "Youtuble rule related comment",
            "pattern": ".*googlevideo.com/(videoplayback(?=\\?)).*[?&](id=[^&]+).*",
            "replace": "youtube.fuzzy.replayweb.page/\\1?\\2"
        }
    ]
}

and we do not need to escape the string values

It's not required. It's a convention.

That's just for the sake of the argument. YAML is fine.
Aren't we planning on using a single, shared, file with browsertrix though? JS world is obviously more JSON-oriented than YAML.

@benoit74
Copy link
Collaborator Author

Good points !

Aren't we planning on using a single, shared, file with browsertrix though? JS world is obviously more JSON-oriented than YAML.

Yes we are, but probably not in the coming weeks. Not sure which standard we might use then

Regarding escaping in JSON string, I agree we are not forced to do it, but it's required according to https://www.ietf.org/rfc/rfc4627.txt paragraph 2.5 (it is just a memo, so somehow we could ignore it, but still ...):

The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".

Alternatively, there are two-character sequence escape
representations of some popular characters. So, for example, a
string containing only a single reverse solidus character may be
represented more compactly as "\".

I will merge as-is, we can always rollback, it's not like it is a big change ^^

@benoit74 benoit74 merged commit a852cb3 into main Jun 27, 2024
5 checks passed
@benoit74 benoit74 deleted the wabac_fuzzy_rules branch June 27, 2024 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add new upstream wabac.js fuzzy rules and keep track of sync commit
2 participants