For MrJob, make unpacking archives optional #93

Open
anjackson opened this issue Mar 23, 2022 · 9 comments

@anjackson
Contributor

anjackson commented Mar 23, 2022

I'm developing a fork of MrJob that makes unpacking archives optional, here: https://github.com/ukwa/mrjob/tree/make-unpacking-archives-optional

It should be possible to install this into a venv using:

pip install -U git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional

If that works, then update the MrJob config as per the updated docs:

runners:
    hadoop:
        unpack_archives: false

Running the job with this configuration should skip the unpacking-archives step and leave the files as they were.

EDIT: If this works, I'll try to contribute the change back upstream.

@anjackson
Contributor Author

Can you try this out @GilHoggarth and see if it works?

@GilHoggarth
Contributor

GilHoggarth commented Mar 23, 2022

The installation line fails to pull in the patch, which I guess is due to the latest policy changes around GitHub access.

FYI, I'm running:

(venv) [hdfsadmin@nlsh3httpfs generate_checksums]$ python3 -m pip --proxy http://explorer.bl.uk:3127/ install -U git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional
Collecting git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional
  Cloning https://github.com/ukwa/mrjob.git (to revision make-unpacking-archives-optional) to /tmp/pip-req-build-jjqmh0b4
fatal: unable to access 'https://github.com/ukwa/mrjob.git/': Failed connect to github.com:443; Connection timed out
Command "git clone -q https://github.com/ukwa/mrjob.git /tmp/pip-req-build-jjqmh0b4" failed with error code 128 in None

This installation works with mrjob:

(venv) [hdfsadmin@nlsh3httpfs generate_checksums]$ python3 -m pip --proxy http://explorer.bl.uk:3127/ install mrjob
Requirement already satisfied: mrjob in ./venv/lib/python3.7/site-packages (0.7.4)
Requirement already satisfied: PyYAML>=3.10 in ./venv/lib/python3.7/site-packages (from mrjob) (6.0)

@GilHoggarth
Contributor

GilHoggarth commented Mar 23, 2022

Hacking the changes directly into bin.py, I now get:

Traceback (most recent call last):
  File "generate_checksums.py", line 20, in <module>
    MRGenerateChecksum.run()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 616, in run
    cls().execute()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 687, in execute
    self.run_job()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 636, in run_job
    runner.run()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/runner.py", line 503, in run
    self._run()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/hadoop.py", line 326, in _run
    self._create_setup_wrapper_scripts()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 446, in _create_setup_wrapper_scripts
    manifest=True)
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 495, in _write_setup_script
    setup, manifest=manifest, wrap_python=wrap_python)
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 595, in _setup_wrapper_script_content
    lines.extend(self._manifest_download_content())
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 693, in _manifest_download_content
    if self._opts['unpack_archives']:
KeyError: 'unpack_archives'

Quite understandably, you'll suspect I made a mess of adding your code!

@GilHoggarth
Contributor

GilHoggarth commented Mar 23, 2022

If I change line 693 of bin.py to:

if 'unpack_archives' in self._opts and self._opts['unpack_archives'] != False:

the job now starts, but it eventually fails on Hadoop as a MapReduce job:

map 0% reduce 0%
  Task Id : attempt_1645461135252_47322_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
	at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

  Task Id : attempt_1645461135252_47322_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
	at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

  Task Id : attempt_1645461135252_47322_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
	at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

   map 100% reduce 0%
  Job job_1645461135252_47322 failed with state FAILED due to: Task failed task_1645461135252_47322_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
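
The KeyError and the guard on line 693 above both come down to how a missing option is treated. As a minimal sketch in plain Python (an ordinary dict stands in for self._opts, and the helper names are hypothetical, not part of mrjob), the two ways of guarding the lookup behave differently when the key is absent:

```python
def guard_as_patched(opts):
    """The condition quoted in the comment above: an absent key gives False."""
    return 'unpack_archives' in opts and opts['unpack_archives'] != False


def guard_with_default(opts):
    """Alternative (an assumption, not the fork's code): an absent or
    None value is treated as True, i.e. archives are unpacked unless
    the option is explicitly switched off."""
    value = opts.get('unpack_archives')
    return True if value is None else bool(value)


for opts in ({}, {'unpack_archives': True}, {'unpack_archives': False}):
    print(opts, guard_as_patched(opts), guard_with_default(opts))
```

With the hand-patched install, where the option was evidently not registered (hence the KeyError), the first guard evaluates to False, so the unpack branch would be skipped whatever the config says. Whether that is related to the "subprocess failed with code 2" error is not established by this thread.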

@anjackson
Contributor Author

Installation seems to work if the https_proxy environment variable is set appropriately.

There were a few issues with the implementation, but it seems to work okay now, as of ukwa/mrjob@e3901a2.
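
A likely reason the --proxy flag alone was not enough: pip uses --proxy for its own HTTP requests, whereas the git clone it launches for a git+https:// requirement is a separate process that takes its proxy settings from its environment (or from git config). Below is a minimal Python sketch of running the install with https_proxy set. It assumes the proxy URL quoted in the earlier comment also handles HTTPS traffic to github.com, and the helper names are illustrative only:

```python
import os
import subprocess
import sys

# The fork and branch named at the top of this issue.
FORK = "git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional"


def proxied_env(proxy_url, base_env=None):
    """Return a copy of the environment with https_proxy set for child processes."""
    env = dict(os.environ if base_env is None else base_env)
    env["https_proxy"] = proxy_url
    return env


def pip_install_args(requirement):
    """Build the pip command line for the current interpreter (e.g. the venv's)."""
    return [sys.executable, "-m", "pip", "install", "-U", requirement]


def install_fork(proxy_url):
    """Run pip so that both pip and the git clone it spawns see the proxy."""
    subprocess.run(pip_install_args(FORK), env=proxied_env(proxy_url), check=True)
```

Calling install_fork("http://explorer.bl.uk:3127/") from inside the venv should be roughly equivalent to exporting https_proxy in the shell before running the pip command.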

@anjackson
Contributor Author

I've opened a PR (Yelp/mrjob#2215) but we can just install our branch for now.

@GilHoggarth
Contributor

Installed via pip and seen to be working.

@GilHoggarth
Contributor

For the purposes of our Hadoop data migration, this patch works successfully. However, you might wish to keep this ticket open whilst the patch is waiting to be included upstream. Consequently, I'm unassigning myself from this ticket.

@GilHoggarth GilHoggarth removed their assignment Mar 25, 2022
@anjackson
Contributor Author

Hm, I attempted to request a review at https://groups.google.com/g/mrjob but my post isn't turning up. Perhaps I messed up posting there somehow?
