titanv CI broken #261

Closed
ghost opened this issue Feb 3, 2020 · 23 comments

@ghost

ghost commented Feb 3, 2020

Stable build (build #360):

./headercvt stub.c -- -I /opt/clpy/llvm-7.1.0/lib/clang/7.1.0/include -I/usr/local/cuda/include

Issue #260 (build #361):

./headercvt stub.c -- -I /opt/clpy/llvm-7.1.0/lib/clang/7.1.0/include

The function build.get_cuda_path() may fail to find the CUDA path (possibly because of a deleted symlink?).

@vorj vorj changed the title from _titanv_ CI broken to titanv CI broken on Feb 3, 2020
@vorj

vorj commented Feb 3, 2020

> The function build.get_cuda_path() may fail to find the CUDA path (possibly because of a deleted symlink?).

cuda_path = build.get_cuda_path()
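
For reference, a minimal shell check of that hypothesis (assuming /usr/local/cuda is the conventional symlink that the CUDA path lookup relies on):

```shell
# If the versioned toolkit exists but the /usr/local/cuda symlink is gone,
# a lookup based on that symlink comes up empty and the CUDA include path
# (-I/usr/local/cuda/include) never reaches headercvt.
ls -ld /usr/local/cuda*                               # installed toolkits and the symlink
readlink /usr/local/cuda || echo "symlink missing"
ls /usr/local/cuda/include/CL/cl.h || echo "CL/cl.h not reachable via the symlink"
```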

@vorj

vorj commented Feb 3, 2020

It seems that @nsakabe-fixstars's opinion is right.
cuda-10.2 has been installed on titanv, so the symlink may have disappeared during the reboot last weekend (last week, cuda-9.2 was still available on titanv).

@LWisteria
Member

/usr/local/cuda-10.2/include/CL/cl.h?

@ghost
Author

ghost commented Feb 3, 2020

The path problem has been resolved.

However, there is another issue. The CUDA 10.2 compiler does not accept our carray.clh.
#260 (review)

See Jenkins' log.

fp16.clh:35:10: error: loading directly from pointer to type 'const __attribute__((address_space(16776963))) half' is not allowed
E     return *(const half*)&ret;
carray.clh:559:13: error: declaring function return value of type 'half' is not allowed; did you forget * ?
E   static half clpy_nextafter_fp16(half x1, half x2){

@vorj

vorj commented Feb 3, 2020

> The path problem has been resolved.

✌️

For Fixstars developers: I edited the build script in the Jenkins configuration to fix this issue, so if we hit the same problem in the future, rewrite ${CUDA_PATH} in the script.
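
A minimal sketch of what that change could look like (the exact Jenkins script contents and the versioned path are assumptions, not the actual configuration):

```shell
# Hypothetical fragment of the Jenkins build script: point CUDA_PATH at the
# versioned install directory instead of relying on the /usr/local/cuda symlink.
export CUDA_PATH=/usr/local/cuda-10.2
./headercvt stub.c -- -I /opt/clpy/llvm-7.1.0/lib/clang/7.1.0/include -I"${CUDA_PATH}/include"
```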

> However, there is another issue. The CUDA 10.2 compiler does not accept our carray.clh.

Ugh, so this issue is related to #224, which means we need to fix that first... This is negligence on NVIDIA's part, isn't it?

@LWisteria
Member

The primary problem here is the CUDA version update on the CI machine (titanv); it's not caused by ClPy itself.
Therefore, the right solution is to pin the CUDA version used for CI.
You may just remove titanv from CI until this is solved.

One of the ideal solutions would be to fix the half/fp16 behavior in ClPy, not just to disable it.
I will open an issue about that.

@ybsh
Collaborator

ybsh commented Feb 6, 2020

I'm installing Docker on titanv so we can run ClPy with cuda-9.2.

@ybsh
Collaborator

ybsh commented Feb 6, 2020

I installed Docker Engine (19.03) and nvidia-docker on titanv.
I tested nvidia-smi with the following command and it worked:

# docker run --gpus all nvidia/cuda:9.2-base nvidia-smi

@ybsh
Collaborator

ybsh commented Feb 14, 2020

I was careless not to notice this, but with the above command nvidia-smi doesn't seem to be running against cuda-9.2.
It says CUDA Version: 10.2. I haven't found out why yet.

# docker run --gpus all nvidia/cuda:9.2-base nvidia-smi
Fri Feb 14 07:23:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:65:00.0 Off |                  N/A |
| 59%   80C    P2   177W / 250W |   7183MiB / 12064MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

@ybsh
Collaborator

ybsh commented Feb 21, 2020

It turned out the desired version had been installed after all.
The reason nvidia-smi showed v10.2 is that it reports the version supported by the driver API, which is a separate set of interfaces from the runtime API.
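
A quick way to see both numbers (assuming the devel image, which ships nvcc, is available):

```shell
# Driver API version, reported by nvidia-smi and determined by the host driver:
docker run --gpus all nvidia/cuda:9.2-base nvidia-smi | grep "CUDA Version"

# Runtime/toolkit version actually installed inside the image:
docker run --gpus all nvidia/cuda:9.2-devel nvcc --version
```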

@ybsh
Collaborator

ybsh commented Feb 21, 2020

Trying to build clpy with this Dockerfile:

FROM nvidia/cuda:9.2-devel
RUN apt-get update && apt-get install -y \
    clang-6.0 \
    libclang-6.0-dev \
    cmake \
    git \
    python3 \
    python3-pip \
    wget \
    vim
RUN pip3 install \
    cython \
    numpy \
    chainer==3.3.0 \
    pytest
WORKDIR /env
RUN wget https://github.com/CNugteren/CLBlast/archive/1.4.1.tar.gz
RUN tar -zxvf 1.4.1.tar.gz  \
    && rm *.gz \
    && cd CLBlast-1.4.1 \
    && mkdir -p build \
    && cd build \
    && cmake -DCMAKE_BUILD_TYPE=Release .. \
    && make -j8
ENV CLBLAST="/env/CLBlast-1.4.1"
ENV C_INCLUDE_PATH="${CLBLAST}/include:${C_INCLUDE_PATH}"
ENV CPLUS_INCLUDE_PATH="${CLBLAST}/include:${CPLUS_INCLUDE_PATH}"
ENV LIBRARY_PATH="${CLBLAST}/build:${LIBRARY_PATH}"
ENV LD_LIBRARY_PATH="${CLBLAST}/build:${LD_LIBRARY_PATH}"
WORKDIR /app
COPY ./app /app
COPY ./train_mnist.py /app
WORKDIR /app/clpy
RUN sh -c 'python3 setup.py develop 2>&1 | tee build.log'

To run this Dockerfile (consolidated commands below):

  • Create a directory foo and place this Dockerfile inside it
  • cd to foo
  • mkdir -p app and place the ClPy directory inside it
  • Build the image, for example with # docker build -t clpy_test .
  • Run it with # docker run --gpus all -it -d --name test clpy_test /bin/bash
  • Attach with # docker attach test
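
The same steps as a single hedged shell sketch (the location of the ClPy checkout is an assumption):

```shell
# Build context layout: foo/Dockerfile, foo/app/clpy (ClPy checkout),
# foo/train_mnist.py (Chainer's MNIST example script).
mkdir -p foo/app
cd foo
docker build -t clpy_test .                                   # build the image
docker run --gpus all -it -d --name test clpy_test /bin/bash  # start a detached container
docker attach test                                            # attach to it
```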

@ybsh
Collaborator

ybsh commented Feb 21, 2020

The build context contains clpy/ and train_mnist.py taken from Chainer.

@ybsh
Collaborator

ybsh commented Feb 21, 2020

$ python3 -m clpy train_mnist.py -g 0
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 174, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.5/runpy.py", line 133, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.5/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/app/clpy/clpy/__init__.py", line 17, in <module>
    from clpy import core  # NOQA
  File "/app/clpy/clpy/core/__init__.py", line 1, in <module>
    from clpy.core import core  # NOQA
  File "clpy/backend/function.pxd", line 4, in init clpy.core.core
  File "/app/clpy/clpy/backend/__init__.py", line 3, in <module>
    from clpy.backend import compiler  # NOQA
  File "clpy/backend/function.pxd", line 4, in init clpy.backend.compiler
  File "clpy/backend/device.pxd", line 4, in init clpy.backend.function
  File "clpy/backend/device.pyx", line 1, in init clpy.backend.device
  File "clpy/backend/opencl/env.pyx", line 82, in init clpy.backend.opencl.env
  File "clpy/backend/opencl/api.pyx", line 17, in clpy.backend.opencl.api.GetPlatformIDs
  File "clpy/backend/opencl/exceptions.pyx", line 24, in clpy.backend.opencl.exceptions.check_status
clpy.backend.opencl.exceptions.OpenCLRuntimeError: UNKNOWN ERROR: -1001

It seems the build failed...
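
For what it's worth, error -1001 is usually CL_PLATFORM_NOT_FOUND_KHR from the OpenCL ICD loader, i.e. no OpenCL platform was visible at runtime. A few hedged sanity checks inside the container (clinfo is not in the image by default and would need to be installed):

```shell
# Is an NVIDIA OpenCL ICD registered in the container?
ls /etc/OpenCL/vendors/             # expect nvidia.icd
cat /etc/OpenCL/vendors/nvidia.icd  # usually just "libnvidia-opencl.so.1"

# Are the driver's OpenCL libraries mounted by nvidia-docker?
ldconfig -p | grep -i opencl

# List visible platforms/devices (apt-get install -y clinfo first)
clinfo | head -n 20
```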

@ybsh
Collaborator

ybsh commented Feb 21, 2020

Here is the build log of ClPy: build.log

@ybsh
Collaborator

ybsh commented Feb 21, 2020

I observed that at least the CLBlast build was successful.

@ybsh
Collaborator

ybsh commented Feb 21, 2020

At the head of build.log:

readlink: missing operand
Try 'readlink --help' for more information.
dirname: missing operand
Try 'dirname --help' for more information.
nm: '/../lib/libclangTooling.a': No such file
make: 'ultima' is up to date.
make: Nothing to be done for 'build'.
make: Nothing to be done for 'deploy'.
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
building ultima started
building headercvt
launching headercvt (converting cl.h)...
Options: {'linetrace': False, 'profile': False, 'annotate': False}
Include directories: ['/usr/local/cuda/include']
Library directories: ['/usr/local/cuda/lib64']
building without Cython

"building without Cython" does not look normal. Does someone have an instant thought of what mistake I might have made?

@ybsh
Collaborator

ybsh commented Feb 21, 2020

> nm: '/../lib/libclangTooling.a': No such file

# which clang

The environment does not have clang on its PATH.
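
For context, a hedged guess at where the Ubuntu packages put things (paths assumed from the clang-6.0/libclang-6.0-dev packages, not verified in this image):

```shell
# Ubuntu installs versioned clang binaries and the static clang libraries
# under /usr/lib/llvm-6.0, which is not on the default PATH/LIBRARY_PATH:
ls /usr/bin/clang-6.0                        # versioned launcher
ls /usr/lib/llvm-6.0/bin/clang               # actual binary
ls /usr/lib/llvm-6.0/lib/libclangTooling.a   # the library nm failed to locate
```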

@ybsh
Collaborator

ybsh commented Feb 21, 2020

FROM nvidia/cuda:9.2-devel

RUN apt-get update && apt-get install -y \
    clang-6.0 \
    libclang-6.0-dev \
    cmake \
    git \
    python3 \
    python3-pip \
    wget \
    vim

RUN pip3 install \
    cython \
    numpy \
    chainer==3.3.0 \
    pytest

WORKDIR /env
RUN wget https://github.com/CNugteren/CLBlast/archive/1.4.1.tar.gz
RUN tar -zxvf 1.4.1.tar.gz  \
    && rm *.gz \
    && cd CLBlast-1.4.1 \
    && mkdir -p build \
    && cd build \
    && cmake -DCMAKE_BUILD_TYPE=Release .. \
    && make -j8

ENV CLBLAST="/env/CLBlast-1.4.1"
ENV C_INCLUDE_PATH="${CLBLAST}/include:${C_INCLUDE_PATH}"
ENV CPLUS_INCLUDE_PATH="${CLBLAST}/include:${CPLUS_INCLUDE_PATH}"
ENV LIBRARY_PATH="${CLBLAST}/build:${LIBRARY_PATH}"
ENV LD_LIBRARY_PATH="${CLBLAST}/build:${LD_LIBRARY_PATH}"

WORKDIR /app
COPY ./app /app
COPY ./train_mnist.py /app

ENV CLANG="/usr/lib/llvm-6.0"
ENV PATH="${CLANG}/bin:${PATH}"
ENV CPLUS_INCLUDE_PATH="${CLANG}/include:${CPLUS_INCLUDE_PATH}"
ENV LIBRARY_PATH="${CLANG}lib:${LIBRARY_PATH}"
ENV LD_LIBRARY_PATH="${CLANG}/lib:${LD_LIBRARY_PATH}"

WORKDIR /app/clpy
RUN sh -c 'python3 setup.py develop 2>&1 | tee build.log' # It might be easier to do this in an interactive shell

As you can see, I added clang to PATH and the environment now knows where clang is (confirmed with which), but I still get the same runtime error when running train_mnist.py.
Here again is the log from building ClPy:
build.log

@LWisteria
Member

@ybsh I don't understand why you're writing a Dockerfile. Using an interactive shell seems easier.

@ybsh
Collaborator

ybsh commented Feb 21, 2020

@LWisteria I use the Dockerfile to automate all the steps before compiling ClPy, and do the compilation itself in an interactive shell.

@LWisteria LWisteria added this to the v2.1.0rc2 milestone Feb 22, 2020
@vorj

vorj commented Mar 2, 2020

@ybsh @LWisteria The original problem has been solved by #269.
Will you keep this issue open to work on a good Dockerfile? (Even so, I think it would be better to open a new issue for that rather than continuing this one.)

@ybsh
Collaborator

ybsh commented Mar 2, 2020

@vorj Thank you very much for the fix (which obviated the need to create a separate CUDA 9.2 environment).
I don't think I will, because there is no longer any urgent reason to do that, and I'm working on the bottleneck elimination issue #153.

@vorj

vorj commented Mar 2, 2020

OK, so let's close this.

@vorj vorj closed this as completed Mar 2, 2020