Apache Spark is an open-source framework and general-purpose cluster computing system. Spark provides high-level APIs in Java, Scala, Python, and R that support general execution graphs, and it comes with built-in modules for streaming, SQL, machine learning, and graph processing. It can analyze large amounts of data, distribute the data across the cluster, and process it in parallel.
In this tutorial, we will explain how to install Apache Spark cluster computing stack on Ubuntu 20.04.
Prerequisites
- A server running Ubuntu 20.04.
- A root password configured on the server.
Getting Started
First, you will need to update your system packages to the latest version. You can update all of them with the following command:
apt-get update -y
Once all the packages are updated, you can proceed to the next step.
Install Java
Apache Spark is a Java-based application, so Java must be installed on your system. You can install it with the following command:
apt-get install default-jdk -y
Once Java is installed, verify the installed version with the following command:
java --version
You should see the following output:
openjdk 11.0.8 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
Install Scala
Apache Spark is written in Scala, so you will also need to install Scala on your system. You can install it with the following command:
apt-get install scala -y
After installing Scala, you can verify the Scala version using the following command:
scala -version
You should see the following output:
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Now, launch the Scala shell with the following command:
scala
You should get the following output:
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.8).
Type in expressions for evaluation. Or try :help.
Now, test Scala with the following command:
scala> println("Hitesh Jethva")
You should get the following output:
Hitesh Jethva
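You can also evaluate a few ordinary Scala expressions to confirm that the REPL works. The snippet below is only an illustrative example (the values are arbitrary and the res index may differ):
scala> val numbers = List(1, 2, 3, 4, 5)
numbers: List[Int] = List(1, 2, 3, 4, 5)
scala> numbers.map(_ * 2).sum
res0: Int = 30
When you are done, type :quit (or press CTRL+D) to exit the Scala shell.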
Install Apache Spark
First, you will need to download the latest version of Apache Spark from its official website. At the time of writing this tutorial, the latest version of Apache Spark is 2.4.6. You can download it to the /opt directory with the following command:
cd /opt
wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
Once the download is complete, extract the file with the following command:
tar -xvzf spark-2.4.6-bin-hadoop2.7.tgz
Next, rename the extracted directory to spark as shown below:
mv spark-2.4.6-bin-hadoop2.7 spark
Next, you will need to configure the Spark environment so you can easily run Spark commands. You can do so by editing the ~/.bashrc file:
nano ~/.bashrc
Add the following lines at the end of the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save and close the file, then activate the new environment with the following command:
source ~/.bashrc
Start Spark Master Server
At this point, Apache Spark is installed and configured. Now, start the Spark master server using the following command:
start-master.sh
You should see the following output:
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu2004.out
By default, the Spark master web interface listens on port 8080. You can verify it using the following command:
ss -tpln | grep 8080
You should see the following output:
LISTEN 0 1 *:8080 *:* users:(("java",pid=4930,fd=249))
Now, open your web browser and access the Spark web interface using the URL http://your-server-ip:8080. You should see the Spark master dashboard.
Start Spark Worker Process
The dashboard shows that the Spark master service is running at spark://your-server-ip:7077. You can use this address to start the Spark worker process with the following command:
start-slave.sh spark://your-server-ip:7077
You should see the following output:
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu2004.out
Now, go back to the Spark dashboard and refresh the page. You should see the new worker listed in the Workers section.
Working with Spark Shell
You can also work with Spark from the command line using the spark-shell command, as shown below:
spark-shell
Once connected, you should see the following output:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.6.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/29 14:35:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ubuntu2004:4040
Spark context available as 'sc' (master = local[*], app id = local-1598711719335).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.8)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
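As a quick sanity check, you can run a small job directly from the shell using the SparkContext that is available as sc. The snippet below is only a minimal sketch with arbitrary data:
scala> val data = sc.parallelize(1 to 100)
scala> data.filter(_ % 2 == 0).count()
res0: Long = 50
Note that, as the output above shows, spark-shell uses a local master (local[*]) by default. If you want to attach the shell to the standalone master you started earlier, launch it with the --master option, for example: spark-shell --master spark://your-server-ip:7077. Type :quit to exit the shell.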
If you want to use Python with Spark, you can use the pyspark command-line utility.
First, install Python 2 with the following command:
apt-get install python -y
Once installed, you can launch the PySpark shell with the following command:
pyspark
Once connected, you should get the following output:
Python 2.7.18rc1 (default, Apr 7 2020, 12:05:55)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.6.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/29 14:36:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
Using Python version 2.7.18rc1 (default, Apr 7 2020 12:05:55)
SparkSession available as 'spark'.
>>>
If you want to stop the master and worker (slave) processes, you can do so with the following commands:
stop-slave.sh
stop-master.sh
Conclusion
Congratulations! You have successfully installed Apache Spark on an Ubuntu 20.04 server. You should now be able to perform basic tests before you start configuring a full Spark cluster. Feel free to ask if you have any questions.