Friday, May 15, 2015

GNU screen cheat sheet

Screen lets you disconnect from and reconnect to a running shell from multiple locations. If a process is already running outside screen, you can grab it into a screen session with reptyr: reptyr PID. Screen cheat sheet adapted from here: http://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/
yum install screen
screen
screen -r #reattach
“Ctrl-a” then “?” shows the screen help page.
“Ctrl-a” “c” creates a new window.
“Ctrl-a” “n” switches to the next window, “Ctrl-a” “p” to the previous one.
“Ctrl-a” “d” detaches from the session.
“Ctrl-a” “H” creates a running log of the session.
“Ctrl-a” “M” monitors the window for activity.
“Ctrl-a” “x” locks the session; a password will be required to access it again.
“Ctrl-a” “k” kills the current window; you will be asked to confirm.
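A typical workflow might look like this (the session name "build" is just an example):
screen -S build        # start a new named session
# run a long job, then detach with “Ctrl-a” “d”
screen -ls             # list running sessions
screen -r build        # reattach to the session, e.g. from another location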

Thursday, May 7, 2015

Cluster configuration and Apache Spark installation, configuration and start

Stand-alone cluster configuration notes

Skip this if you have already configured a Hadoop cluster.
Create users on all nodes:
useradd hduser
passwd hduser 
groupadd hadoop
usermod -a -G hadoop hduser
Login as the new user:
sudo su - hduser 
Spark nodes communicate over ssh, so password-less ssh should be enabled on all nodes:
ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Check that ssh works locally without password:
ssh localhost 
Copy public key from the master node to worker node:
ssh-copy-id -i ~/.ssh/id_rsa.pub user@worker_node 
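For several workers, the same can be done in a loop; the hostnames below are placeholders for your actual worker nodes:
# push the master's public key to every worker (hostnames are examples)
for node in my-node2.net my-node3.net; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@$node
done
ssh hduser@my-node2.net   # verify: should log in without a password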

Spark compilation notes

Spark needs Java (at least 1.7), Scala 2.10 (2.11 is not supported) and Maven. Hadoop is optional.
Install Java and Scala with a package manager (yum or apt-get), or download the rpm files from the corresponding sites and install them:
sudo rpm -i java.rpm
sudo rpm -i scala.rpm
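A quick check that the expected versions were installed (the exact output format differs between distributions):
java -version    # should report 1.7 or newer
scala -version   # should report 2.10.x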
Download and unzip Maven to /usr/local/maven. If needed, configure a proxy for Maven in maven/conf/settings.xml; another config file might be in ~/.m2/settings.xml.
Add to your ~/.bashrc
export M2_HOME=/usr/local/maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
If Java is < 1.8:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
If you need proxy:
export http_proxy=http://my-web-proxy.net:8088
export https_proxy=http://my-web-proxy.net:8088
Maven sometimes picks up these proxy settings, so check them if the build below fails because Maven cannot download something.
Clone from git, change the owner to hduser (the user with password-less ssh between nodes) so the build can run as that user, and compile:
sudo git clone https://github.com/apache/spark.git /usr/local/spark
sudo chown -R hduser:hadoop /usr/local/spark
cd /usr/local/spark
mvn -Dhadoop.version=1.2.1 -Pnetlib-lgpl -DskipTests clean package
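To verify the build, one option is to run one of the examples bundled with Spark in local mode (SparkPi ships with the standard distribution):
cd /usr/local/spark
./bin/run-example SparkPi 10    # should print an approximation of Pi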

Spark installation notes

Assume that Spark was compiled on the master node. It has to be copied to /usr/local/spark on all other nodes in the cluster, with its owner changed to hduser (as above).
Also, add to hduser ~/.bashrc on all nodes:
export SPARK_HOME=/usr/local/spark
export _JAVA_OPTIONS=-Djava.io.tmpdir=[Folder with a lot of space]
The latter option sets the Java temporary folder, which Spark uses when it writes shuffle data. By default it is /tmp, which is usually small.
Also, if there is a Hadoop installation, it is useful to make Spark read its configuration instead of the defaults (e.g. for the replication factor):
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
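A sketch of how the copy could be done from the master, assuming hduser can write to /usr/local on the workers (hostnames are placeholders):
# distribute the compiled Spark directory to every worker, then chown on each node as above
for node in my-node2.net my-node3.net; do
    rsync -az /usr/local/spark/ hduser@$node:/usr/local/spark/
done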

Spark configuration notes

Some theory:

  • Spark runs one Master and several Workers. It is not recommended to have the Master and a Worker on the same node. It is usually worth running a single Worker per node that owns all of the node's RAM and CPU cores, unless the node has many CPUs or the task is better solved by many Workers. 
  • When you submit a task, Spark creates a Driver on the Master node and Executors on the Worker nodes. 

It would be nice if you only had to configure the Master node and all options were propagated to the Workers, but that is not the case. There is, however, a minimal configuration that does not require touching each Worker's config: one Worker per node.
spark/conf/spark-defaults.conf:
spark.master    spark://mymaster.com:7077
spark.driver.memory     16g
spark.driver.cores      4
# no more memory than available on a worker, otherwise executors will fail to start
spark.executor.memory   16g
# shuffle directory, should be on a fast and big disk
spark.local.dir /home/hduser/tmp
# number of reducers for Spark SQL, default is 200
spark.sql.shuffle.partitions    2000
List all the Worker nodes in spark/conf/slaves:
my-node2.net
my-node3.net 
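For reference, the same resources can also be requested per application at submit time; the jar and class names below are purely hypothetical:
$SPARK_HOME/bin/spark-submit \
    --master spark://mymaster.com:7077 \
    --driver-memory 16g \
    --executor-memory 16g \
    --class com.example.MyApp \
    /home/hduser/myapp.jar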

Spark start

Start the Master and all Workers:
$SPARK_HOME/sbin/start-all.sh
You should be able to see the web interface at my-node1.net:8080 (the Spark Master web UI default port; it may differ if SPARK_MASTER_WEBUI_PORT is set).
Start Spark shell:
$SPARK_HOME/bin/spark-shell --master spark://my-node1.net:7077
Stop all nodes:
$SPARK_HOME/sbin/stop-all.sh
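A minimal smoke test of the running cluster is to pipe a one-line job into the shell (it just counts a small parallelized range and should report 1000):
echo 'sc.parallelize(1 to 1000).count()' | $SPARK_HOME/bin/spark-shell --master spark://my-node1.net:7077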


Monday, May 4, 2015

Hadoop free space and file sizes

It is useful to estimate how much space data will take before writing it to HDFS. The default block size in HDFS is 64MB and a file occupies a whole number of blocks, so even a small file is accounted as at least one 64MB block. The default replication factor is 3. The occupied size is therefore roughly:
3 * Sum_i( ceil(size_i / 64MB) * 64MB )
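For example (the numbers are just an illustration): a 100MB file occupies ceil(100/64) = 2 blocks, i.e. 128MB, and with 3x replication about 384MB of raw HDFS space.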
Check the block size and replication ratio:
$HADOOP_HOME/bin/hadoop fsck / 
Check the free space (plain free space, not taking into account replication or block size):
$HADOOP_HOME/bin/hadoop dfsadmin -report
How big a folder is (the raw disk usage is actually the replication factor times bigger):
$HADOOP_HOME/bin/hadoop dfs -dus [/some/folder]
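If some data does not need full redundancy, lowering its replication frees space; -setrep is a standard HDFS command and the path below is a placeholder:
$HADOOP_HOME/bin/hadoop fs -setrep -w 2 /some/folder   # re-replicate the folder with factor 2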

Git rebase (put your history on top of upstream history)

Git rebase is usually needed when you want to push commits to a branch that is ahead of yours and you don't want to add your merge messages to the branch history. It is considered good form. Rebase replays your history on top of the branch. Workflow for git rebase:
git rebase [upstream/master]
If there are conflicts, resolve them by hand, or take one side of the file. Note that during a rebase --ours refers to the branch you are rebasing onto and --theirs refers to your own commit being replayed:
git checkout --ours [filename]   # or --theirs to keep your version
git add [filename]
Continue:
git rebase --continue
If taking --ours left a commit with no changes, skip it:
git rebase --skip
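A fuller sequence, assuming a remote named upstream and a local branch master (the remote and branch names are just conventions, adjust to your setup):
git fetch upstream                          # get the latest upstream history
git rebase upstream/master                  # replay local commits on top of it
# resolve conflicts as described above, then continue
git push --force-with-lease origin master   # rewritten history needs a forced push if the branch was already published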