This repository has been archived by the owner on Dec 20, 2022. It is now read-only.

ClassNotFoundException: org.apache.spark.shuffle.rdma.RdmaShuffleManager #36

Open
UntaggedRui opened this issue Sep 30, 2019 · 4 comments

Comments

@UntaggedRui

The spark-defaults.conf on both the master and the workers contains:

spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.driver.extraClassPath /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.executor.extraClassPath /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar

I have also placed libdisni.so in /usr/lib.
When I run TeraGen, built from spark-terasort, it works with --master spark://master:7077 --deploy-mode client. The full command is:

spark-submit  --master spark://master:7077 --deploy-mode client --class com.github.ehiggs.spark.terasort.TeraGen  /home/rui/software/spark-2.4.0-bin-hadoop2.7/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar  1g hdfs://master:9000/data/terasort_in1g

However, it fails with a ClassNotFoundException when I use --master spark://master:7077 --deploy-mode cluster. The full command is:

spark-submit  --master spark://master:7077 --deploy-mode cluster --class com.github.ehiggs.spark.terasort.TeraGen  /home/rui/software/spark-2.4.0-bin-hadoop2.7/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar  1g hdfs://master:9000/data/terasort_in1g

The error output is:

Launch Command: "/home/rui/software/jdk1.8.0_212/bin/java" "-cp" "/home/rui/software/spark-2.4.0-bin-hadoop2.7/conf/:/home/rui/software/spark-2.4.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.extraClassPath=/home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar" "-Dspark.driver.supervise=false" "-Dspark.submit.deployMode=cluster" "-Dspark.master=spark://master:7077" "-Dspark.driver.extraClassPath=/home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar" "-Dspark.jars=file:/home/rui/software/spark-2.4.0-bin-hadoop2.7/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar" "-Dspark.rpc.askTimeout=10s" "-Dspark.app.name=com.github.ehiggs.spark.terasort.TeraGen" "-Dspark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager" "org.apache.spark.deploy.worker.DriverWrapper" "spark://[email protected]:43489" "/home/rui/software/spark-2.4.0-bin-hadoop2.7/work/driver-20190930101138-0006/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar" "com.github.ehiggs.spark.terasort.TeraGen" "5g" "hdfs://master:9000/data/terasort_in5g2"
========================================

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
	at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.rdma.RdmaShuffleManager
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
	at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:259)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:323)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
	at com.github.ehiggs.spark.terasort.TeraGen$.main(TeraGen.scala:48)
	at com.github.ehiggs.spark.terasort.TeraGen.main(TeraGen.scala)
	... 6 more

How should I fix it?
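
One detail is visible in the launch command above: the RDMA jar shows up only as the value of the -Dspark.driver.extraClassPath system property, while the "-cp" argument lists just conf/ and jars/*. Whether that is the actual cause is not confirmed anywhere in this thread. The following is a minimal Python sketch, not part of the original post, for pulling both values out of a launch command so they can be compared. The helper names classpath_entries and system_property are hypothetical, and LAUNCH_COMMAND is an abbreviated copy of the logged command.

```python
import shlex

RDMA_JAR = "/home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar"

# Abbreviated copy of the Launch Command from the log (most -D options omitted).
LAUNCH_COMMAND = (
    '"/home/rui/software/jdk1.8.0_212/bin/java" "-cp" '
    '"/home/rui/software/spark-2.4.0-bin-hadoop2.7/conf/:'
    '/home/rui/software/spark-2.4.0-bin-hadoop2.7/jars/*" "-Xmx1024M" '
    '"-Dspark.driver.extraClassPath=' + RDMA_JAR + '" '
    '"org.apache.spark.deploy.worker.DriverWrapper"'
)


def classpath_entries(launch_command):
    """Return the entries of the -cp / -classpath argument of a java command line."""
    args = shlex.split(launch_command)
    for i, arg in enumerate(args[:-1]):
        if arg in ("-cp", "-classpath"):
            # Entries are separated by ':' on Linux.
            return args[i + 1].split(":")
    return []


def system_property(launch_command, name):
    """Return the value of -D<name>=<value>, or None if the property is absent."""
    prefix = "-D" + name + "="
    for arg in shlex.split(launch_command):
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


if __name__ == "__main__":
    entries = classpath_entries(LAUNCH_COMMAND)
    print("classpath entries:", entries)
    print("driver extraClassPath property:",
          system_property(LAUNCH_COMMAND, "spark.driver.extraClassPath"))
    print("RDMA jar listed in -cp:", RDMA_JAR in entries)
```

The sketch only compares literal classpath entries; it does not expand wildcard entries such as jars/*.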
What's more, in client deploy mode, when I generate 50 GB of data with TeraGen on stock Spark (no spark-defaults.conf), the transfer speed between master and slave is about 270 MB/s. However, when I set spark.shuffle.manager to org.apache.spark.shuffle.rdma.RdmaShuffleManager in spark-defaults.conf, the speed is still 270 MB/s.
Is this because the job writes to HDFS storage, which has nothing to do with the Spark shuffle?
Can you recommend a workload whose completion time improves significantly with SparkRDMA?
Thanks a lot!

@petro-rudenko
Member

Hi, thanks for using SparkRDMA.

  1. Make sure /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar is accessible from both master and executor. You can run something like: ./spark/sbin/slaves.sh ls -al /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar

  2. Can you please describe your cluster (how many nodes, how many CPUs you have, and which NIC you are using)?
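
For the first point, listing the file with ls confirms that it exists but not that it contains the shuffle manager class. The following is a minimal Python sketch for checking that as well, to be run on each node. The helper name jar_contains_class is hypothetical, and the entry name assumes the standard jar layout, in which a class is stored at its package path with a .class suffix.

```python
import zipfile
from pathlib import Path

RDMA_JAR = "/home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar"
RDMA_CLASS = "org.apache.spark.shuffle.rdma.RdmaShuffleManager"


def jar_contains_class(jar_path, class_name=RDMA_CLASS):
    """Return True if jar_path is a readable jar (zip) archive that holds class_name."""
    path = Path(jar_path)
    if not path.is_file() or not zipfile.is_zipfile(str(path)):
        return False
    # A jar stores a.b.C as the archive entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(str(path)) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    print(RDMA_JAR, "contains", RDMA_CLASS + ":", jar_contains_class(RDMA_JAR))
```

A True result on every node only shows that the jar is readable and complete there; it does not show that the driver or executor JVM actually has it on its classpath.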

Thanks,
Peter

@UntaggedRui
Author

Thanks for replying to my question.

  1. Yes, I am sure. The result is:
[rui@rdma-server-204 spark-2.4.0-bin-hadoop2.7]$ sbin/slaves.sh ls -al /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
master: -rwxr-xr-x 1 rui rui 478528 Nov 29  2018 /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
slave1: -rwxr-xr-x 1 rui rui 478528 Nov 29  2018 /home/rui/data/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
  2. There are two servers in my cluster. One server is both the master and a worker, and the other is a worker only. The master has 2*24 cores and the worker has 2*12 cores. The detailed information is in the attached image (cluster info).
    The NIC on my master node is:
[rui@rdma-server-204 spark-2.4.0-bin-hadoop2.7]$ lspci | grep Mell
09:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
09:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

The NICs on my worker node are:

[rui@supervisor-1 ~]$ lspci | grep Mell
05:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
05:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
06:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
06:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

@petro-rudenko
Member

So, --deploy-mode cluster is for yarn deployment mode only. You are using standalone mode, so this parameter is not needed. Also, 2 nodes could be too few to get the full benefit of RDMA. You could first try running either the DiSNI RdmaVsTCPBenchmark or the UCX bandwidth benchmark.

BTW, we're in the process of releasing Spark over UCX. It'll have all the functionality of SparkRDMA, plus better performance and support for other transports (cuda, shared memory) and protocols. Keep in touch: https://github.com/openucx/sparkucx/

@UntaggedRui
Author

Oh, thank you very much.
