Unusable after configuring per the manual in a spark 2.3.2 + hadoop 3.1.1 environment #3

Open
zmzeng opened this issue Jan 3, 2021 · 2 comments


zmzeng commented Jan 3, 2021

Problem description:
After configuring the plugin according to this repository's manual, accessing an obs path from code still fails, with a "Class org.apache.hadoop.fs.obs.OBSFileSystem not found" error.

Environment:
spark 2.3.2, no cluster deployment, invoked directly via spark-submit;
hadoop 3.1.1, single-node pseudo-distributed deployment;
Spark can access Hadoop normally; see the execution log at the end.

hadoop@ecs-c04d:~$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_275
Branch
Compiled by user jshao on 2018-09-16T12:15:32Z
Revision
Url
Type --help for more information.
hadoop@ecs-c04d:~$ hdfs version
Hadoop 3.1.1
Source code repository https://github.com/apache/hadoop -r 2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c
Compiled by leftnoteasy on 2018-08-02T04:26Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.1.1.jar

I downloaded hadoop-huaweicloud-3.1.1-hw-40.jar and esdk-obs-java-3.20.6.1.jar and placed them in the Spark and Hadoop dependency directories (a quick sanity check of the jars is sketched after this list):
/usr/local/spark/jars/
/usr/local/hadoop/share/hadoop/common/lib/
/usr/local/hadoop/share/hadoop/tools/lib/
/usr/local/hadoop/share/hadoop/hdfs/lib
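
A hedged diagnostic sketch (not from the original report): confirm the class is actually packaged inside the jar, and that both jars are present where Spark loads its dependencies. Paths and jar names are taken from this report; adjust if yours differ.

# Does the jar actually contain the OBSFileSystem class?
unzip -l /usr/local/spark/jars/hadoop-huaweicloud-3.1.1-hw-40.jar \
    | grep 'org/apache/hadoop/fs/obs/OBSFileSystem'
# Are both jars present in Spark's jars directory?
ls /usr/local/spark/jars/ | grep -i -e huaweicloud -e esdk-obs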

core-site.xml contents (the obs bucket is in the cn-north-4 / Beijing-4 region):

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>fs.obs.impl</name>
        <value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
    </property>
    <property>
        <name>fs.obs.access.key</name>
        <value>replaced here with my AK</value>
    </property>
    <property>
        <name>fs.obs.secret.key</name>
        <value>replaced here with my SK</value>
    </property>
    <property>
        <name>fs.obs.endpoint</name>
        <value>obs.cn-north-4.myhuaweicloud.com</value>
    </property>
    <property>
        <name>fs.obs.buffer.dir</name>
        <value>/home/hadoop/obs-buffer</value>
    </property>
</configuration>
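
One hedged variant worth noting (standard Spark behavior, not from this report): spark-submit forwards any --conf key prefixed with spark.hadoop. into the driver's Hadoop Configuration, so the same OBS settings can be passed on the command line in case the driver is reading a different core-site.xml than the one edited above:

spark-submit \
    --conf spark.hadoop.fs.obs.impl=org.apache.hadoop.fs.obs.OBSFileSystem \
    --conf spark.hadoop.fs.obs.access.key=<my-ak> \
    --conf spark.hadoop.fs.obs.secret.key=<my-sk> \
    --conf spark.hadoop.fs.obs.endpoint=obs.cn-north-4.myhuaweicloud.com \
    --jars hadoop-huaweicloud-3.1.1-hw-40.jar,esdk-obs-java-3.20.6.1.jar \
    obs_test.py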

Code:

# obs_test.py
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("obs test") \
    .getOrCreate()

df = spark.read.csv("obs://dev-modelarts/kaggle-CTR/data/data/train.csv", header=True, inferSchema=True)
df.printSchema()
df.show()

Run command:
spark-submit --jars hadoop-huaweicloud-3.1.1-hw-40.jar,esdk-obs-java-3.20.6.1.jar obs_test.py
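
A hedged alternative invocation (an assumption, untested against this setup): for FileSystem implementation classes, a commonly suggested workaround is to also put the connector jars on the driver's own classpath with --driver-class-path, since Hadoop's Configuration may resolve classes through a classloader that does not see jars shipped only via --jars:

spark-submit \
    --driver-class-path "hadoop-huaweicloud-3.1.1-hw-40.jar:esdk-obs-java-3.20.6.1.jar" \
    --jars hadoop-huaweicloud-3.1.1-hw-40.jar,esdk-obs-java-3.20.6.1.jar \
    obs_test.py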

Error log:

hadoop@ecs-c04d:~$ spark-submit --jars hadoop-huaweicloud-3.1.1-hw-40.jar,esdk-obs-java-3.20.6.1.jar obs_test.py
21/01/03 09:04:55 WARN Utils: Your hostname, ecs-c04d resolves to a loopback address: 127.0.1.1; using 192.168.0.230 instead (on interface eth0)
21/01/03 09:04:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/01/03 09:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/01/03 09:04:56 INFO SparkContext: Running Spark version 2.3.2
21/01/03 09:04:56 INFO SparkContext: Submitted application: obs test
21/01/03 09:04:56 INFO SecurityManager: Changing view acls to: hadoop
21/01/03 09:04:56 INFO SecurityManager: Changing modify acls to: hadoop
21/01/03 09:04:56 INFO SecurityManager: Changing view acls groups to:
21/01/03 09:04:56 INFO SecurityManager: Changing modify acls groups to:
21/01/03 09:04:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/01/03 09:04:56 INFO Utils: Successfully started service 'sparkDriver' on port 36681.
21/01/03 09:04:56 INFO SparkEnv: Registering MapOutputTracker
21/01/03 09:04:56 INFO SparkEnv: Registering BlockManagerMaster
21/01/03 09:04:56 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/01/03 09:04:56 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/01/03 09:04:56 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e294dec0-2f9d-4f3f-9e7e-3875c2b20d58
21/01/03 09:04:56 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
21/01/03 09:04:56 INFO SparkEnv: Registering OutputCommitCoordinator
21/01/03 09:04:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/01/03 09:04:57 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.230:4040
21/01/03 09:04:57 INFO SparkContext: Added JAR file:///home/hadoop/hadoop-huaweicloud-3.1.1-hw-40.jar at spark://192.168.0.230:36681/jars/hadoop-huaweicloud-3.1.1-hw-40.jar with timestamp 1609635897154
21/01/03 09:04:57 INFO SparkContext: Added JAR file:///home/hadoop/esdk-obs-java-3.20.6.1.jar at spark://192.168.0.230:36681/jars/esdk-obs-java-3.20.6.1.jar with timestamp 1609635897155
21/01/03 09:04:57 INFO SparkContext: Added file file:/home/hadoop/obs_test.py at file:/home/hadoop/obs_test.py with timestamp 1609635897166
21/01/03 09:04:57 INFO Utils: Copying /home/hadoop/obs_test.py to /tmp/spark-a51e2865-0465-4a1b-a6c5-1da954078da6/userFiles-6c5a6a09-bdbe-45bb-8629-a2041d237232/obs_test.py
21/01/03 09:04:57 INFO Executor: Starting executor ID driver on host localhost
21/01/03 09:04:57 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39711.
21/01/03 09:04:57 INFO NettyBlockTransferService: Server created on 192.168.0.230:39711
21/01/03 09:04:57 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/01/03 09:04:57 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.230:39711 with 366.3 MB RAM, BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO EventLoggingListener: Logging events to hdfs://localhost:9000/spark-logs/local-1609635897202
21/01/03 09:04:57 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/hadoop/spark-warehouse').
21/01/03 09:04:57 INFO SharedState: Warehouse path is 'file:/home/hadoop/spark-warehouse'.
21/01/03 09:04:58 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/01/03 09:04:58 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "/home/hadoop/obs_test.py", line 11, in <module>
    df = spark.read.csv("obs://dev-modelarts/kaggle-CTR/data/data/train.csv", header=True, inferSchema=True)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 441, in csv
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o55.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.obs.OBSFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2596)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3320)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
	at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:709)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:390)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:390)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:389)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.obs.OBSFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2500)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2594)
	... 30 more

21/01/03 09:04:58 INFO SparkContext: Invoking stop() from shutdown hook
21/01/03 09:04:58 INFO SparkUI: Stopped Spark web UI at http://192.168.0.230:4040
21/01/03 09:04:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/01/03 09:04:58 INFO MemoryStore: MemoryStore cleared
21/01/03 09:04:58 INFO BlockManager: BlockManager stopped
21/01/03 09:04:58 INFO BlockManagerMaster: BlockManagerMaster stopped
21/01/03 09:04:58 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/01/03 09:04:58 INFO SparkContext: Successfully stopped SparkContext
21/01/03 09:04:58 INFO ShutdownHookManager: Shutdown hook called
21/01/03 09:04:58 INFO ShutdownHookManager: Deleting directory /tmp/spark-a51e2865-0465-4a1b-a6c5-1da954078da6/pyspark-c89583cc-2dd6-419c-ae4a-c7119739455c
21/01/03 09:04:58 INFO ShutdownHookManager: Deleting directory /tmp/spark-c0e74089-56ef-4ee7-8944-446c3bd77482
21/01/03 09:04:58 INFO ShutdownHookManager: Deleting directory /tmp/spark-a51e2865-0465-4a1b-a6c5-1da954078da6
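
Since the stack trace shows the failure inside org.apache.hadoop.conf.Configuration on the driver, a hedged check outside Spark (assuming the same core-site.xml and the bucket path from the script above) can narrow things down: if the hdfs CLI also fails, the problem is in the Hadoop-side installation rather than in Spark.

# Is the connector jar on Hadoop's own (expanded) classpath?
hadoop classpath --glob | tr ':' '\n' | grep -i huaweicloud
# Can Hadoop itself resolve obs:// ?
hdfs dfs -ls obs://dev-modelarts/kaggle-CTR/data/data/
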
@iRitiLopes

Same issue

@chzzzyyyjjj

Hi, has this been resolved?
