Python example file's data location does not meet Lambda's expectation #7

Open
habemusne opened this issue Apr 9, 2019 · 0 comments


I am using the Python example python/ml/kmeans_example.py. This file hard-codes the relative data path 'data/mllib/sample_kmeans_data.txt'.

When I run ./bin/spark-submit --master lambda://test examples/src/main/python/ml/kmeans_example.py from the driver folder, Spark's log shows java.io.FileNotFoundException: File file:/home/ec2-user/driver/data/mllib/sample_kmeans_data.txt does not exist.

I was told that the data file location string needs to be consistent between Lambda and Spark. Your Lambda code expects the data file to be somewhere under /tmp/lambda, so I looked at what was actually under /tmp/lambda and found a spark folder. My workaround was to create a temporary /tmp/lambda/spark/data/mllib/ directory on my EC2 instance, copy my data file there, and then point spark.read at that copy. Specifically, I changed line 42 to

    import os
    import shutil

    # Copy the sample data under /tmp/lambda, where the Lambda code expects it.
    data_folder = '/home/ec2-user/driver/data/mllib'
    lambda_folder = '/tmp/lambda/spark/data/mllib'
    filename = 'sample_kmeans_data.txt'
    if not os.path.isdir(lambda_folder):
        os.makedirs(lambda_folder)
    shutil.copy(os.path.join(data_folder, filename), os.path.join(lambda_folder, filename))
    dataset = spark.read.format("libsvm").load(os.path.join(lambda_folder, filename))

And then it worked fine.
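
Since other examples probably need the same treatment, the workaround could be wrapped in a small helper. The following is only a sketch: stage_for_lambda is a hypothetical name, not part of Spark or of this project, and it assumes that mirroring the relative path under the Lambda root is enough, which is what worked on my instance.

```python
import os
import shutil


def stage_for_lambda(relative_path, driver_root, lambda_root):
    """Copy driver_root/relative_path to lambda_root/relative_path.

    Hypothetical helper: returns the path of the copy so it can be
    passed straight to spark.read...load().
    """
    src = os.path.join(driver_root, relative_path)
    dst = os.path.join(lambda_root, relative_path)
    dst_dir = os.path.dirname(dst)
    # Create the mirrored directory tree if it is not there yet.
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    shutil.copy(src, dst)
    return dst


# Intended use inside an example script (paths are the ones from my setup):
# path = stage_for_lambda('data/mllib/sample_kmeans_data.txt',
#                         '/home/ec2-user/driver', '/tmp/lambda/spark')
# dataset = spark.read.format("libsvm").load(path)
```

This keeps the hard-coded relative path in each example unchanged and only adds one call before the read.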

I suspect that some, or even many, of the Python example files have this problem, so it could be a barrier for Python users.
