Lab 1: EC2 files
(DSCI 551, Spring 2023)
Due: 11:59pm, Friday, January 27, 2023
Points: 10
1. Add the following lines to the end of your ~/.bashrc file on EC2 and submit a screenshot showing these lines added to your file.
export JAVA_HOME=/usr/lib/jvm/default-java
export PATH=$PATH:~/spark-3.3.1-bin-hadoop3/bin
export PATH=$PATH:~/hadoop-3.3.4/bin:~/hadoop-3.3.4/sbin
2. Explain what each line added to your ~/.bashrc file in task 1 does.
3. Submit screenshot(s) showing examples of executing the following Linux commands on your EC2 instance.
ls, cd, mv, cp, rm, rmdir, mkdir, cat
4. Explain what each of the commands in task 3 does.
Submission: 1. Screenshots for question 1 and question 3
2. A PDF that includes your answers for question 2 and question 4
3. Zip the files mentioned above into a .zip file (not .rar!). Name the zipped file with your name. e.g. “John_Doe_Lab1.zip”
• Launch EC2 instance
o Select Ubuntu 20.04
o Create a key pair if you do not already have one, and download the *.pem key (make sure you
remember where you put it). Note: the screenshot shows dsci2024, but you can use any
other name you want (e.g., I am using dsci2023).
o 10-20GB of storage is sufficient (minimum is 8GB)
o Click Launch instance!
o Select your instance, and go to Connect. Find tab for SSH client:
o Copy the Example command
o Open a terminal with access to the ssh client program
o If you have Windows OS, use PowerShell or install Cygwin (see note at the end)
o If you have a Mac, just open a terminal window, which already has access to ssh
o “cd” to the place where you have downloaded the *.pem key.
o Execute: chmod 400
o (see the AWS screenshot)
o Paste the command you copied. The command looks like this:
o ssh -i
▪ again, see the screenshot for example
o Answer yes when ssh asks whether you want to continue connecting.
o You should now be connected to EC2.
o Note: when you restart the instance, its IP address changes. You need to copy the ssh
connection string from the EC2 web site again.
o Text editor on EC2:
▪ nano
▪ vi
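The key-permission step above can be tried safely on a stand-in file before touching the real key. This is a minimal sketch, not the exact command from the AWS console: the file name my-key.pem and the host placeholder are hypothetical, and the real values come from the Example command you copied. ssh refuses to use a private key that other users can read, which is why the key is made read-only for its owner first.

```shell
# Stand-in for the downloaded key; "my-key.pem" is a hypothetical name.
workdir=$(mktemp -d)
key="$workdir/my-key.pem"
touch "$key"

# 400 = read-only for the owner, no access for group or others.
chmod 400 "$key"

# Show the resulting permission string (first 10 characters of ls -l).
perms=$(ls -l "$key" | cut -c1-10)
echo "$perms"   # prints: -r--------

# The real connection then uses the command copied from the AWS console,
# with your own key file and instance address (both placeholders here):
#   ssh -i my-key.pem ubuntu@<public-dns-of-your-instance>
```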
• Install MySQL:
o sudo apt install mysql-server
o sudo mysql
o In MySQL prompt (mysql>):
▪ alter user 'root'@'localhost' identified with mysql_native_password by 'Dsci551';
▪ exit
o mysql -u root -p
▪ on password prompt, type the password you set above (Dsci551) and hit enter
o (note) The MySQL server consumes a lot of main memory
▪ Stop the server first before you run other programs, e.g., hdfs, spark, …
▪ Stop the server by executing:
• sudo service mysql stop
▪ You may start the server by executing:
• sudo service mysql start
• Install Java SDK:
o sudo apt install default-jdk
o (a configuration menu might pop up; just hit the tab key to select OK, and
hit enter).
o (add this line to the end of your ~/.bashrc file on EC2)
▪ export JAVA_HOME=/usr/lib/jvm/default-java
o log out and log in to EC2 again
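Since several steps in this handout add lines to ~/.bashrc, it is easy to end up with the same line twice. The sketch below shows one way to append a line only if it is not already there. The helper name append_once is hypothetical (it is not part of any tool), and the example writes to a scratch file rather than your real ~/.bashrc so it can be tried safely.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
append_once() {
  line=$1
  file=$2
  grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Try it on a scratch file first; use "$HOME/.bashrc" as the target once
# you are happy with the behavior.
TARGET=$(mktemp)
append_once 'export JAVA_HOME=/usr/lib/jvm/default-java' "$TARGET"
append_once 'export JAVA_HOME=/usr/lib/jvm/default-java' "$TARGET"  # no-op the second time

cat "$TARGET"
```

After editing the real ~/.bashrc, running echo $JAVA_HOME in a new login session should show the path that was set.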
• Install Spark:
o (note: before you install Spark, you need to install the Java SDK first)
o wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
o tar xvf spark-3.3.1-bin-hadoop3.tgz
o (add this line to ~/.bashrc)
▪ export PATH=$PATH:~/spark-3.3.1-bin-hadoop3/bin
o pyspark
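If pyspark is reported as "command not found", the usual cause is that the Spark bin directory is not on PATH in the current session. The following is a small sketch for checking that; the function name on_path is hypothetical, and the directory assumes Spark was unpacked in your home directory as in the steps above.

```shell
# Hypothetical helper: succeed if the given directory is an entry in PATH.
on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumes the tarball was extracted in the home directory.
SPARK_BIN="$HOME/spark-3.3.1-bin-hadoop3/bin"

if on_path "$SPARK_BIN"; then
  echo "Spark bin directory is on PATH"
else
  echo "Spark bin directory is not on PATH yet; re-check ~/.bashrc and log in again"
fi
```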
• Install Hadoop:
o wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
o tar xvf hadoop-3.3.4.tar.gz
o Follow the instructions in: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html on “Pseudo-Distributed Operation”. In
particular,
▪ Follow the configuration steps
▪ Follow the “set up passphraseless ssh” steps
▪ Edit the file: ~/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
• add the following line (you can edit line 54):
• export JAVA_HOME=/usr/lib/jvm/default-java
▪ follow the execution steps to format namenode, start dfs, etc.
o add this line to ~/.bashrc file:
▪ export PATH=$PATH:~/hadoop-3.3.4/bin:~/hadoop-3.3.4/sbin
o note:
▪ if namenode does not start, try to reformat the namenode
▪ if datanode does not come up, try:
• rm -rf /tmp/hadoop-ubuntu/dfs/data
o (note) this will remove the directory where hdfs stores its datanode content.
• Restart the dfs
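To see which HDFS daemons actually came up after starting the dfs, the jps tool that ships with the JDK lists the running Java processes by name. The sketch below reads jps-style output and prints any expected daemon that is missing. The function name missing_daemons is hypothetical, and the sample listing uses made-up process IDs purely for illustration.

```shell
# Hypothetical helper: read `jps`-style output ("<pid> <name>" per line)
# on stdin and print each expected HDFS daemon that does not appear in it.
missing_daemons() {
  input=$(cat)
  for d in NameNode DataNode SecondaryNameNode; do
    printf '%s\n' "$input" |
      awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' || echo "$d"
  done
}

# Illustrative input only (made-up process IDs), with the datanode down.
sample='2101 NameNode
2335 SecondaryNameNode
2467 Jps'
printf '%s\n' "$sample" | missing_daemons   # prints: DataNode

# On the EC2 instance you would run:
#   jps | missing_daemons
```

An empty result means all three daemons are running; if DataNode is listed, the datanode note above applies.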
• Install MongoDB:
o Follow the instructions in https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/
o Make sure to use the steps for Ubuntu 20.04
• Windows OS: If you are using Windows, you can either use PowerShell or download Cygwin.
o If you are using PowerShell
▪ No need to download additional software
▪ Make sure your .pem file is in the folder where you are executing ssh.
o If you want to use Cygwin
▪ Please go to Cygwin.com
▪ Download and execute setup-x86_64.exe
▪ Make sure you select openssh package when installing
▪ Your Cygwin default installation directory will be “c:\cygwin64”
• Note: your home directory in Cygwin will look like:
o c:\cygwin64\home\
• copy your *.pem file downloaded from AWS to this directory