Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't hardcode container interface in Docker vmms #76

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

cg2v
Copy link
Member

@cg2v cg2v commented Jul 9, 2015

Instead of putting all the file copying and su-ing in the docker run
command constructed by the vmm, include an interface script in the
container and use it as the container entrypoint. The only thing
passed from the vmm to the container are the resource limit parameters.

The script interface might be changed to pass in the volume name and/or the name of the output file so the script doesn't even to know that.

In addition, rewrite the dockerfile so it (1) includes fewer RUN lines, and
(2) does not need to fetch Tango from github in order to build autodriver;
Instead, the autodriver source is copied to the container using the ADD
instruction

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.50% when pulling 89ea64b on cg2v:docker_interface into 5ec90a2 on autolab:master.

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.11% when pulling 6ed7029 on cg2v:docker_interface into 5ec90a2 on autolab:master.

@mihirpandya
Copy link
Member

I agree this is a better approach than putting all the constants in the vmms itself. So let's iterate on this idea.

The only problem I noticed so far is that on jobs that should timeout, runJob exits with EXIT_OSERROR instead of EXIT_TIMEOUT. This is a problem because an OS error represents a VM that should no longer be used. So that triggers a catastrophic retry, meaning that the VM is destroyed, recreated, added back to the pool and the job is retried. The reason we are getting this error is because the autodriver is not able to kill the job and perform its clean up. I've pasted a sample output file.

Also, why did you move from Python to Perl? Seems like you had some Python 2 vs 3 compatibility issue? I think most of the developers are more familiar with Python and would like to keep things in Python as much as possible unless there is a very good reason to switch.

Autograder [Sun Aug  2 07:09:12 2015]: Received job test_job-1:2
Autograder [Sun Aug  2 07:10:35 2015]: Internal error: Unable to complete job after 2 tries. Pleae resubmit
Autograder [Sun Aug  2 07:10:35 2015]: Job status: waitVM=None copyIn=0 runJob=3 copyOut=0
Autograder [Sun Aug  2 07:10:35 2015]: Here is the output from the autograder:
---
Autodriver: Job timed out after 60 seconds
bash hello.sh
Hello from 7e59f0478238:/home/autograde/autolab
Autodriver: Gave up killing user processes at line 383

@cg2v
Copy link
Member Author

cg2v commented Aug 2, 2015

The python->perl change was a result of what was in the sample container (it only has python 3), assuming that the containers used in production would be similarly minimal, and trying to pick something that would work on both ubuntu and rhel based containers. As the sample Dockerfile is not actually representative, I would certainly be inclined to go back to the python script, as it's easier to read.

@cg2v
Copy link
Member Author

cg2v commented Aug 2, 2015

The timeout mismatch happened because there were zombie processes after autodriver did its first kill, and the wrapper, which is pid 1, didn't wait for them. Apparently, when a shell is pid 1, it does end up calling wait() and preventing such zombies from accumulating. I added a background thread to the wrapper which calls wait until autodriver exits (or wait fails).
i also decided to use exec directly, so the parent process gets direct access to the exit status.

@cg2v
Copy link
Member Author

cg2v commented Aug 3, 2015

the distDocker change has not been tested

@landscape-bot
Copy link

Code Health
Repository health increased by 0.51% when pulling bdb6a5e on cg2v:docker_interface into 5651be9 on autolab:master.

@landscape-bot
Copy link

Code Health
Repository health increased by 2% when pulling bdb6a5e on cg2v:docker_interface into a69fa16 on autolab:master.

Instead of putting all the file copying and su-ing in the docker run
command constructed by the vmm, include an interface script in the
container and use it as the container entrypoint. The only thing
passed from the vmm to the container are the resource limit parameters.

In addition, rewrite the dockerfile so it (1) includes fewer RUN lines, and
(2) does not need to fetch Tango from github in order to build autodriver;
Instead, the autodriver source is copied to the container using the ADD
instruction
Make the wrapper script compatible with python 2.x (x>=6), not
just 3.0
install python2 so interface script can use #!/usr/bin/python
Install all of C0's files, and use a wrapper shell script for cc0 so
the compiler is always invoked with an absolute path
- spawn a thread that calls wait(). We are pid 1 in the container
  and need to collect all zombies, otherwise autodriver might get
  confused
- Don't double fork. it's not easy to pass the full exit status
  of autodriver through a double fork, so use os.execvp instead
  of subprocess.call. This means we have to do some of the file
  descriptor munging directly.
- Don't bother creating a temporary output file. Directly pipe
  autodriver's output to the real output file
master now has 2 dockerfiles, one targeting 122 with C0, and one not
Apply updates (fewer RUN lines, do not fetch autodriver from github) to
both of them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants