Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CloudflareProtectedEmailAddress #6

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ondenman
Copy link
Contributor

@ondenman ondenman commented May 25, 2017

This is a step towards fixing: everypolitician/everypolitician-data#34867

Member email addresses have been obfuscated and the scraper is currently capturing a string of javascript instead of the email address:

:email=>"[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */"

For example, when the obfuscated address is b3d2dfd7dcc5d6c1d2f3d7dac3c6c7d2d7dcc09dd4dcc59dc3ca, calling

CloudflareProtectedEmailAddress.new('b3d2dfd7dcc5d6c1d2f3d7dac3c6c7d2d7dcc09dd4dcc59dc3ca').human_readable

will return:
[email protected]

The javascript comes from the Paraguay member pages. The key part is:

a.substr(n,2)^r

r is the first character of the input string. The remainder is the obfuscated email. Two digit hexadecimals represent each character.

The next step will be to use this class within MemberPage.

This method of obfuscation is not particular to this site. For example, it is being used on the New Zealand site everypolitician/everypolitician-data#36492. The intention is to develop the class in this scraper before extracting it to a gem.

@ondenman ondenman force-pushed the add-cloudflare-protected-email-address branch from 26fc736 to d6d9c80 Compare May 25, 2017 12:57
@ondenman ondenman changed the title Add CloudflareProtectedEmailAddress class Add CloudflareProtectedEmailAddress May 25, 2017
@ondenman ondenman requested a review from tmtmtmtm May 25, 2017 14:03
Copy link
Contributor

@tmtmtmtm tmtmtmtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the plan is to get this working here, and then extract to a gem? Would be good to note that if so.

As per previous discussion, you also need to explain where the algorithm comes from.

require 'cgi'

class CloudflareProtectedEmailAddress
def initialize(obfuscated:)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your thinking behind making this require a named parameter, rather than just a positional one? If there are likely to be other arguments that might make sense, but here it seems like it just complicates the interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was leaning towards verbosity but I see why it complicates the interface.

end

def key
@key ||= obfuscated[0..1].hex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really expensive enough to need caching? (Especially compared to unobfuscated)

@key ||= obfuscated[0..1].hex
end

def unobfuscated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value returned from this still seems fairly obfuscated to me…

end

it 'will make an obfuscated email address human readable' do
email.must_equal '[email protected]'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to have another test here as well. I could make this test suite pass with a MUCH simpler implementation :D

@tmtmtmtm tmtmtmtm assigned ondenman and unassigned tmtmtmtm May 25, 2017
@ondenman ondenman force-pushed the add-cloudflare-protected-email-address branch from ae51cff to 2d0f184 Compare May 25, 2017 16:51
@ondenman
Copy link
Contributor Author

#human_readable returns the original string if the unobfuscated one does not look like at email (it tests to see if contains an @). I'm not sure if it's desirable behaviour but I think it may be more useful than returning something unintelligible if, for instance, the class was initialised with a plain email address.


def key
obfuscated[0..1].hex
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key and obfuscated_characters here are both taking responsibility for the same thing, but doing it in slightly different ways. This would be cleaner and more maintainable if those both used an underlying method that took care of the conversion e.g.

  def hex_pairs
    obfuscated.scan(/.{2}/).map(&:hex)
  end

  def obfuscated_characters
    hex_pairs.drop(1)
  end

  def key
    hex_pairs.first
  end

(not sure if hex_pairs is the right name here)

def escaped_email_address
obfuscated_characters.reduce('') do |email, encoded_character|
email + (encoded_character ^ key).to_s(16).prepend('%')
end
Copy link
Contributor

@tmtmtmtm tmtmtmtm May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit over-complicated.

Perhaps a simple map+join would be clearer?

   obfuscated_characters.map { |char| (char ^ key).to_s(16).prepend('%') }.join


def human_readable
return CGI.unescape(escaped_email_address) if CGI.unescape(escaped_email_address).include?('@')
obfuscated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has become much too complicated and confusing now, I fear. If you really need to do this check, I'd suggest at least factoring out that CGI.unescape(escaped_email_address) to a named method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of this also seems back-to-front. A guard clause should be used to break out in unusual cases, not be the main path, with a fall through to the odd ones.

Again, I'm not convinced you need this at all, but if you do, then it would be better as:

  return obfuscated unless unescaped.include? '@'
  unescaped

@tmtmtmtm tmtmtmtm removed their assignment May 25, 2017
@ondenman ondenman assigned tmtmtmtm and unassigned ondenman Jun 6, 2017
@ondenman ondenman requested a review from tmtmtmtm June 6, 2017 08:53
Oliver Denman added 2 commits June 8, 2017 18:02
The Paraguay members page contains obfuscated email addresses.

When initialized with an encrypted email address from the Paraguay members list page,  CloudFlareProtectedEmail#unobfuscated_email will return the human readable address.
@tmtmtmtm tmtmtmtm force-pushed the add-cloudflare-protected-email-address branch from 1488b48 to 9199f3c Compare June 8, 2017 17:03
attr_accessor :obfuscated

def components
obfuscated.scan(/.{2}/).map(&:hex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious here why you're using /.{2}/ rather than just ..


attr_accessor :obfuscated

def components
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

components is a bit too overloaded a term for me to be entirely comfortable with it here — I'm not sure it really conveys what's actually going on, and could be misleading. Is there a better name you can think of?

obfuscated.scan(/.{2}/).map(&:hex)
end

def obfuscated_characters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually quite a misleading name too, as it returns integers, not characters

components.first
end

def escaped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be easier to follow if it actually mentioned percent encoding (or hex encoding) somewhere. Ideally that would be obvious from the name of the method, though if that's not sensible, then a comment might be called for.

end

def escaped
obfuscated_characters.map { |char| (char ^ key).to_s(16).prepend('%') }.join
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I've worked out what's actually going on here, I'm not entirely sure that .to_s(16) gets you the correct format. This approach will convert 15 to '%f' when I think it's strictly meant to be '%0F'.

sprintf("#%02X" % i) should get you the correct value, and means you wouldn't need the prepend either.

@tmtmtmtm tmtmtmtm assigned ondenman and unassigned tmtmtmtm Jun 14, 2017
@ondenman ondenman force-pushed the add-cloudflare-protected-email-address branch from c7ef388 to ed9f546 Compare June 16, 2017 10:01
@ondenman ondenman assigned tmtmtmtm and unassigned ondenman Jun 16, 2017
@ondenman ondenman requested review from tmtmtmtm and removed request for tmtmtmtm June 16, 2017 11:24
@ondenman ondenman assigned ondenman and unassigned tmtmtmtm Jun 16, 2017
@ondenman ondenman force-pushed the add-cloudflare-protected-email-address branch from ed9f546 to 54f0000 Compare June 16, 2017 15:38
@ondenman ondenman requested a review from tmtmtmtm June 19, 2017 08:46
@ondenman ondenman assigned tmtmtmtm and unassigned ondenman Jun 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants