Quickstart

This section helps you get started with the platform quickly. For more details, see the rest of this guide, the Tutorials that we have prepared, and the Want to know more section.

Warning

We recommend that you always start the VPN before connecting. Without it, you will not have access to some services.

If for some reason you are not using the VPN, one alternative is to launch a remote desktop from the visualization platform and connect from there.

By far, the most common way to connect is by establishing an SSH session:

ssh username@hadoop3.cesga.es

Once connected, you will notice that there are two main filesystems:

HOME: The standard filesystem when you log in
HDFS: The distributed Hadoop filesystem
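
As a quick orientation, the following sketch lists the contents of both filesystems. It assumes that your HDFS home directory is /user/<username>, which matches the paths used in this guide but may differ for your account. The helper name hdfs_home is hypothetical and only used for illustration.

```shell
# Hypothetical helper: print the assumed HDFS home directory for a user.
# Assumption: HDFS home directories live under /user/<username>.
hdfs_home() {
    printf '/user/%s\n' "$1"
}

# Run the listings only where the Hadoop client is installed.
if command -v hdfs >/dev/null 2>&1; then
    me="${USER:-$(id -un)}"
    ls -la "$HOME"                    # HOME: the standard local filesystem
    hdfs dfs -ls "$(hdfs_home "$me")" # HDFS: the distributed filesystem
fi
```

Files in HOME are handled with the usual commands (ls, cp, rm), while files in HDFS are handled through hdfs dfs subcommands such as -ls, -put and -get.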

To migrate your HDFS data from the old platform to the new one, you can use a command similar to the following:

hadoop distcp -i -pat -update hdfs://10.121.13.19:8020/user/uscfajlc/wcresult hdfs://nameservice1/user/uscfajlc/wcresult
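
In the command above, -i tells distcp to ignore failures and keep copying, -pat preserves ACLs (a) and timestamps (t), and -update copies only files that are missing or different at the destination, so the command can be rerun safely. The sketch below builds the same command for any directory under your HDFS home and prints it for review before you run it. The helper name distcp_cmd is hypothetical, and the two addresses are simply taken from the example above.

```shell
# Source and destination taken from the example above.
OLD_NN="hdfs://10.121.13.19:8020"   # namenode of the old platform
NEW_NS="hdfs://nameservice1"        # nameservice of the new platform

# Hypothetical helper: print the distcp command for one directory.
#   $1 = user name, $2 = path relative to /user/<username>
distcp_cmd() {
    printf 'hadoop distcp -i -pat -update %s/user/%s/%s %s/user/%s/%s\n' \
        "$OLD_NN" "$1" "$2" "$NEW_NS" "$1" "$2"
}

# Print the command so it can be checked before it is executed.
distcp_cmd uscfajlc wcresult
```

Printing the command first is a deliberate choice: it lets you verify the source and destination paths before any data is copied.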

Note

We recommend launching the distcp command inside a screen session, so that the copy keeps running even if your SSH connection is interrupted.
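
One possible way to do this is sketched below. It starts the copy in a detached, named screen session. The session name is a hypothetical choice, and the paths are the ones from the example above.

```shell
SESSION="distcp-migration"   # hypothetical session name

# Start the copy only where both tools are available (the login node).
if command -v screen >/dev/null 2>&1 && command -v hadoop >/dev/null 2>&1; then
    # -dmS starts a detached session with the given name running the command.
    screen -dmS "$SESSION" hadoop distcp -i -pat -update \
        hdfs://10.121.13.19:8020/user/uscfajlc/wcresult \
        hdfs://nameservice1/user/uscfajlc/wcresult
fi

# List sessions:       screen -ls
# Reattach later:      screen -r distcp-migration
# Detach again:        press Ctrl-a, then d
```

A session started this way closes when the command finishes. If you want to keep the output on screen, start an interactive session with screen -S distcp-migration instead and run the distcp command inside it.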

See the Migrating Data section for more details about how to migrate your data from the previous platform.

You can then start using the tools you are interested in, such as Spark or Hive.

Note

The default version of Spark is 2.4.0. Take this into account if you plan to run code written for Spark 1.6.
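
A minimal sketch of a version check follows. It flags code written for a different major release than the platform default, and prints the installed version where spark-submit is available. The helper names major_version and needs_review are hypothetical, and comparing major versions is only a rough heuristic, not a compatibility guarantee.

```shell
DEFAULT_SPARK="2.4.0"   # default version stated in the note above

# Hypothetical helper: print the part before the first dot ("2" for "2.4.0").
major_version() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeeds when code written for version $1 targets a
# different major release than the platform default.
needs_review() {
    [ "$(major_version "$1")" != "$(major_version "$DEFAULT_SPARK")" ]
}

if needs_review "1.6"; then
    echo "Code written for Spark 1.6 should be reviewed before running on Spark $DEFAULT_SPARK"
fi

# On the platform, confirm the installed version (the banner may go to stderr).
if command -v spark-submit >/dev/null 2>&1; then
    spark-submit --version 2>&1
fi
```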

There is also a web user interface that you can use to get started with the platform. For more information, see the BD|CESGA WebUI and HUE: A nice graphical interface to Hadoop sections.