2. Installing and configuring DataStage

Stephen Eisenhauer edited this page Feb 13, 2014 · 1 revision

DataStage has been tested on Ubuntu Linux 11.10 (Oneiric Ocelot) and 12.04 (Precise Pangolin), and the virtual machines work with VMware Fusion 4.x and VMware Player.

Note: Ubuntu 12.10 is missing key modules required by DataStage. It may be possible to get it to work, but it's a huge hassle to track down all the missing pieces.

DataStage is designed to be installed at the command line, but don't let this put you off! These step-by-step instructions should be sufficient even if you have no Linux or command-line experience.

You can run DataStage as a virtual machine (i.e. a mini-Linux system, sitting inside another computer which could be running any operating system), or on a specially dedicated Linux server. We recommend setting up a dedicated server for continued use, but a VM is a good way to quickly test-drive the system. Your IT department should be able to help you if you get stuck, but if you install VMware Fusion or VMware Player, you can probably follow the instructions by yourself to get it running.

Useful installation info

(note this information as you go along, for easy reference)

  • IP address
  • "Root" (super-administrator) username and password
  • System username and password (if different -- by default the username and password are the same as root)
  • DNS entry (web address) you wish to link to DataStage -- you will need to register a domain to do this. Talk to your University IT support, or use a commercial web hosting company (Google will give you loads of options).
  • Usernames and passwords (note: users can change their passwords later, but the administrator setting things up for the first time will need to know the initial configuration)

Debian installation of DataStage (at the command line)

Step one: install Linux Ubuntu

Set up your host system (VM or ordinary Linux installation) to run "Ubuntu 11.10":http://ubuntuguide.org/wiki/Ubuntu_Oneiric or "Ubuntu 12.04":http://releases.ubuntu.com/precise. We strongly recommend the "server" edition, because it is quicker to install and has fewer extra components (meaning fewer security vulnerabilities). You can use a desktop (GUI) installation instead, but you will still have to install DataStage via the "terminal" interface.

If you are creating a VM inside another system, set aside as much disk space as you think you will need to store your files (we suggest a minimum of 3GB; Ubuntu installation in VMware Player doesn’t work with less than 2GB). Store the virtual disk as a single file. If given the choice, download and install Linux tools as part of the installation process.

If you use OS X Lion and VirtualBox, we have some additional information on the issue tracker, "here":http://dataflow-jira.bodleian.ox.ac.uk/jira/browse/DATAFLOW-268.

Step two: prepare to install DataStage

When installing your operating system, if you choose the default installation settings, you can proceed as below. Expert users can customise their installation (you will need aptitude, chkconfig, apache, pwgen, samba, python-pylibacl at minimum).

If you have never worked at the command line before, here's how to read these instructions: Any line beginning with a dollar sign ("$ ") or a hash sign ("# ") is a command. Type the text that follows the prompt, verbatim, on the bottom line of whatever is visible on your screen, and press "ENTER". Most of the time, the prompt you see on your screen will be "$ " or "# ". Information in curly brackets {} is instructions or commentary -- do not enter this on your screen.

Log in, using your system credentials (see above).

$ sudo su {telling the system you want to be a "super user" and you don't want it to prompt you for your password again}
$ ifconfig {to get your **"IP address":http://en.wikipedia.org/wiki/IP_address**.  Look for "inet addr" for the ethernet port.  It will be a string of numbers in a format like "162.65.3.12".   Note this address.}
$ apt-get install ssh # to allow your computer to communicate with other computers via the SSH protocol
{when asked whether to continue, type Y and press ENTER}
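
If the ifconfig readout is hard to scan, a small filter can pull out just the addresses. This is a minimal sketch, not part of DataStage: the function name extract_inet_addrs is made up for illustration, and it assumes the readout uses either the older "inet addr:162.65.3.12" layout or the newer "inet 162.65.3.12" layout.

```shell
# Print every IPv4 address found in ifconfig-style text read from stdin,
# leaving out loopback addresses (127.x.x.x).
# extract_inet_addrs is a hypothetical helper name, not a DataStage command.
extract_inet_addrs() {
    sed -n 's/^[[:space:]]*inet \(addr:\)\{0,1\}\([0-9][0-9.]*\).*/\2/p' |
        grep -v '^127\.'
}

# On the server you would run:
#   ifconfig | extract_inet_addrs
```

Note the address it prints alongside your other installation details.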

Once your system is up and its IP address recorded, you can use a tool like PuTTY to administer your server from another computer. E.g. you could start your server at work, and finish installing the software from home. To do this, connect to your server via SSH, through port 22. Or just carry on with these instructions, directly at the command line of your server. Note: if you suspend or turn off your server partway through the installation process, SSH may stop working. To fix this:

$ start ssh

And check the firewall isn't blocking port 22:

$ iptables --list {look at the line with “Chain ufw-user-limit-accept…ACCEPT …”; it will tell you if anything is blocked}

To fix this:

$ ufw enable {to turn on the uncomplicated firewall interface}
$ ufw allow 22 {to enable port 22 -- you can use this command with different numbers, to turn on whichever ports you require}
$ service iptables start {to start up your firewall with the new settings}

Step three: Install DataStage

$ sudo su {if you're not already in super-user mode}
# nano /etc/apt/sources.list

You are now editing the list of apt repositories to include the DataStage download. Scroll to the bottom of the file. On a new line at the very end, type the following, paying attention to include the spaces:

deb http://apt-repo.bodleian.ox.ac.uk/datastage/ stable main

If you prefer the "unstable" release, change "stable" to "unstable", i.e. deb http://apt-repo.bodleian.ox.ac.uk/datastage/ unstable main.

Now press ctrl + x to exit, press Y when asked to save, and then press Enter to confirm the default filename to save to.
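
If you would rather not edit the file by hand, the same line can be appended from the command line. The sketch below is one possible approach, not the documented procedure: add_repo_line is a made-up helper name, and it only adds the line when an identical line is not already there, so running it twice does no harm.

```shell
# Append a repository line to an apt sources file unless it is already present.
# add_repo_line is a hypothetical helper, not part of DataStage or apt.
add_repo_line() {
    list_file=$1
    repo_line=$2
    # -x matches whole lines only; -F treats the text literally (no regex).
    if grep -qxF -- "$repo_line" "$list_file" 2>/dev/null; then
        echo "already present in $list_file"
        return 0
    fi
    # If the file is non-empty and lacks a final newline, add one first
    # so the new entry starts on its own line.
    if [ -s "$list_file" ] && [ -n "$(tail -c 1 "$list_file")" ]; then
        echo >> "$list_file"
    fi
    printf '%s\n' "$repo_line" >> "$list_file"
    echo "added to $list_file"
}

# On the server (as root) you would run:
#   add_repo_line /etc/apt/sources.list 'deb http://apt-repo.bodleian.ox.ac.uk/datastage/ stable main'
```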

# wget http://apt-repo.bodleian.ox.ac.uk/datastage/dataflow.gpg
# apt-key add dataflow.gpg
# aptitude update {this might take a while; there is a lot to update}
# aptitude install dataflow-datastage

Press Y to continue.

This will take some time: a lot of stuff will scroll by. Eventually, you will see your command prompt (probably something like: root@ubuntu:~#) again. DataStage is now installed, and its configuration tool, datastage-config, is available as a command from anywhere in the system.
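
A quick way to confirm the installation finished is to ask the shell whether it can find the configuration command. This is a small sketch; have_command is a made-up helper name, and the only DataStage-specific part is the command name datastage-config used in the next step.

```shell
# Return success if the named command can be found on the PATH.
# have_command is a hypothetical helper name used for illustration.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command datastage-config; then
    echo "datastage-config found"
else
    echo "datastage-config not found; the install may not have completed"
fi
```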

Step four: Start DataStage

# datastage-config {to get to the configuration menu}
  1. Run c to get into the configuration submenu.
  2. Run a to start apache2. Now read the readout that the system gives you and see if you can follow the instructions in purple text.
  3. Run da to start datastage.
  4. Run d (defaultsite) to reconfigure and restart apache. (This removes any existing apache configurations you have.)
  5. (Optional) Run c (confsamba) to configure samba if you want to make any changes to smb.conf.
  6. (Optional) Run fs to configure the filesystem.
  7. Run q to quit the apache submenu.
  8. Run q again to quit the config submenu.

Your DataStage website should be live! Go check it. Either enter the web address that you already registered, or enter the IP address directly into your web browser (e.g. http://192.35.41.12). You can't do much with your web interface yet -- but if you can't see it at all, something has gone wrong, and you can try the troubleshooting steps below.

Check that ports 22 and 80 are open:

# ufw allow ssh {this unblocks ssh through the firewall in the server - port 22}
# ufw allow http {this unblocks http through the firewall in the server - port 80}
# ufw allow samba {this unblocks samba through the firewall in the server - ports 137, 138, 139 and 445}
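
To confirm that a rule took effect, you can read the readout of "ufw status" and look for the port. The sketch below automates that check under some assumptions: port_allowed is a made-up helper name, and it expects the usual three-column "To / Action / From" layout with the port written as "22" or "22/tcp". Rules listed by application name (for example a Samba entry) are not recognised by this simple check.

```shell
# Read "ufw status" text on stdin and succeed if the given port has an ALLOW rule.
# port_allowed is a hypothetical helper; it only understands numeric port entries.
port_allowed() {
    awk -v p="$1" '
        $2 == "ALLOW" && ($1 == p || index($1, p "/") == 1) { found = 1 }
        END { exit !found }
    '
}

# On the server (as root) you would run, for example:
#   ufw status | port_allowed 22 && echo "port 22 is allowed"
#   ufw status | port_allowed 80 && echo "port 80 is allowed"
```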

Make sure you've started DataStage and apache ( # datastage-config ), and started "defaultsite" within the config submenu. If you still have trouble, and are running this with a VM, it's probably easiest to start again from scratch: destroy this VM and create a new one, using default Ubuntu settings.

Note: I have found that using VPN software can sometimes (but not always…) mess up my connection to the server (even when my computer is only talking to its own virtual machine).

Step five: Adding users

# datastage-config {You can follow the instructions in the submenu, thus:}
  1. Run u to get into the users submenu.
  2. Run a to add a new user. You will be prompted to provide some user information. For example, let's say the user's name is "Test Leader". We could use testlead as the username (no spaces or special characters are allowed here), Test_Leader as the name (no spaces allowed here), and any email address.
  3. Use l to designate this user as a "leader" when asked.
  4. Use y when asked if the information you entered is correct.

Let's add another user:

  1. Run a to add a new user. Let's say the user's name is "Test Member". We could use testmbr as the username, TestMember as the name, and any email address.
  2. Use m to designate this user as a "member" when asked.
  3. Use y when asked if the information you entered is correct.
  4. Use q to quit this submenu.
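
Because you cannot back out partway through adding a user, it can help to check the values before you start. The sketch below encodes the rules stated above as an assumption: usernames contain only letters and digits, and names contain no spaces. The exact rules enforced by datastage-config may differ, and valid_username and valid_name are made-up helper names.

```shell
# Succeed if the username is non-empty and contains only letters and digits.
# Assumption: this approximates "no spaces or special characters".
valid_username() {
    case $1 in
        ''|*[!A-Za-z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeed if the display name is non-empty and contains no whitespace.
valid_name() {
    case $1 in
        ''|*[[:space:]]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: check the two sample users from the steps above.
for pair in "testlead Test_Leader" "testmbr TestMember"; do
    set -- $pair
    if valid_username "$1" && valid_name "$2"; then
        echo "$1 / $2: looks fine"
    else
        echo "$1 / $2: please choose different values"
    fi
done
```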

Note: If you create a user with the "Collaborator" role, that user will be given no personal filesystem space and will only be able to access "collab" areas (no access to "shared" areas).

Other Notes:

  1. Do not use your “root” server credentials for any of the group members (i.e. don’t have the same username/password for the group leader and the server administrator - this will keep your data and your server much safer).
  2. If you make a mistake setting up a user, keep going: you can’t back out of the process of adding a user; you’ve got to get to the end. You could choose “Quit” or “no” (no = go back and change the entry), in place of the “Yes this is correct” command at the end of the process. Note: you cannot change a user's access permissions once you have created them. You can only delete the account or change the password.

To delete a user, follow the instructions on the “user” interface to remove a user. You will see that you are given the option to “purge” the user (remove the username and the associated files), or to “remove” only the username, which will keep the files but the files will be left in limbo – you will need to reassign them to somebody else.

Note: if you try to delete a user while that person is logged on via mapped drive or the web interface, it will break the web interface. Make sure that the relevant user has genuinely logged off (using the “log off” button on the web interface; simply closing the browser may not be enough) and has disconnected DataStage as a mapped drive.

I have also found that if I make changes to the server, although the web interface continues to “seem” to behave normally, the changes aren’t always reflected in the web interface (e.g. if I add a user, the new user isn’t able to log into the site). If this happens, just close and re-open your browser and start again.

Step six: Set passwords for Samba/http access

We also need to set passwords to allow the same users to access the server in different ways (e.g. for access via the web interface and the mapped drive).

If you are still in the datastage-config tool, run q as needed until you are back to the root command prompt (looks something like root@ubuntu:~#).

# smbpasswd <username> # use the username you picked when creating the user

You will be asked to provide a password for samba (this password will be used when logging into the Windows network shares), and then again to confirm it. You will not see your characters being typed on the screen for security purposes.

# passwd <username> # use the username you picked when creating the user

You will now be asked to provide a password for this user's unix account, and then again to confirm it. As before, you will not see your characters being typed on the screen for security purposes.

You need to repeat this process for every user you created.
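
With more than a couple of users it is easy to lose track of who still needs a password. As a simple aid, the sketch below prints the two commands for each username so you can work through them as a checklist. It deliberately runs nothing itself; password_commands is a made-up helper name, and testlead and testmbr are the sample usernames from step five.

```shell
# Print the smbpasswd and passwd commands needed for each username given.
# This only prints text; it does not change any passwords.
password_commands() {
    for user in "$@"; do
        printf 'smbpasswd %s\n' "$user"
        printf 'passwd %s\n' "$user"
    done
}

password_commands testlead testmbr
```

Run each printed command as root and answer the password prompts as described above.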

Step seven: Setting up a filesystem

DataStage is designed to make it easy to submit your files to a permanent archive or repository, once you are through with them. So you must set up a default external repository for submissions, even if you do not plan to use one.

Go to your DataStage web interface (either the direct IP address of the server, or wherever you have set up your web URL).

Log in (button in top-right corner, using credentials with Group Leader privileges)

Click on “admin” in the top-left corner. Only a logged-in group leader can get in; if you are logged in as any other sort of user, you will be refused access. If you get stuck, go back to your DataStage homepage, click the “log out” button in the top-right corner, and log back in as a group leader.

Go to Repositories and click the green plus button to add a repository.

  1. Enter a repository (use dummy info if necessary).
  2. Click “home” (tiny link at top left of screen) or just visit /admin/
  3. Use “choose default repository” to tag this repository as “default”
  4. Click save. You will now see “default repository.”

Other information

To turn off or restart your server:

$ sudo halt # if you’re using a virtual machine, power off the VM after the halt is complete

or

$ sudo reboot # to restart your server - use this if you're administering your server remotely -- if you just turn it off, you can't use your remote computer to turn it back on again!

You will need to re-start DataStage once your machine has restarted:

  1. {log in using your admin credentials}
  2. $ sudo datastage-config
  3. Run d to start DataStage
  4. Run q to quit the config tool.

Go to your web interface to check it has worked -- the DataStage homepage should appear pretty much instantly if everything has worked correctly.
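
If you administer the server remotely, the same check can be made from the command line. This is a sketch under assumptions: it needs curl, which may not be installed by default, site_up is a made-up helper name, and it treats any 2xx or 3xx response as "up".

```shell
# Succeed if the URL answers with a 2xx or 3xx HTTP status within 5 seconds.
# site_up is a hypothetical helper; it requires curl to be installed.
site_up() {
    code=$(curl --silent --output /dev/null --max-time 5 \
        --write-out '%{http_code}' "$1") || return 1
    case $code in
        2??|3??) return 0 ;;
        *) return 1 ;;
    esac
}

# Replace the address with your own server's IP address or web address:
#   site_up http://192.35.41.12/ && echo "DataStage homepage is responding"
```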

To delete and remove the DataStage installation (e.g. to reinstall it from scratch):

Note: I'm not sure this removes everything, but it seems to work OK. Updates welcome.

$ sudo apt-get --purge remove dataflow-datastage {to remove the main installation}
$ rm *.gpg* {to remove the underlying .gpg installation file stored in your system, so when you try to install the new version, you pull the latest version from the repository}

This should leave your (samba and UNIX) usernames and passwords from the last time around, and I'm pretty sure it leaves the filetree intact -- but I haven't thoroughly tested this. Let me know how you get on!

To change the look of your website

This is mostly for experts -- but a novice user can easily change the "DataStage" label on the front page, e.g. to have it read the name of your research group instead.

A quick-and-dirty method to re-name your front page:

$ nano /usr/share/pyshared/datastage/web/core/templates/base.html 

Look for a line like this:

<title>{% block title %}{% endblock %} | {{ settings.SITE_NAME }}</title>

Delete or overwrite this, so it reads:

<title>*YourWebsiteNameNoSpaces*</title> 

Look lower, for the line:

<a href="{% url core:index %}">{{ settings.SITE_NAME }}</a>

And replace it with

<a href="{% url core:index %}">*YourWebsiteNameNoSpaces*</a>
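
The same change can be scripted. The sketch below is a variation on the manual edit, not the method described above: it swaps only the {{ settings.SITE_NAME }} placeholder for a fixed label wherever it appears, leaving the rest of each line (including the title block) untouched. set_site_label is a made-up helper name, it keeps a .bak copy of the template, and it assumes the label has no spaces, slashes or ampersands.

```shell
# Replace every "{{ settings.SITE_NAME }}" placeholder in a template with a label.
# set_site_label is a hypothetical helper. A backup is kept as <template>.bak.
# Assumption: the label contains no spaces, "/" or "&" characters.
set_site_label() {
    template=$1
    label=$2
    cp "$template" "$template.bak" || return 1
    sed "s/{{ settings\.SITE_NAME }}/$label/g" "$template.bak" > "$template"
}

# On the server (as root) you would run, for example:
#   set_site_label /usr/share/pyshared/datastage/web/core/templates/base.html MyResearchGroup
```

To undo the change, copy the .bak file back over the template.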

For pros: The web interface is written using the "Django":https://www.djangoproject.com/ framework.

To change the webpage display, look under the filepath /usr/share/pyshared/datastage/web/. The template file is /usr/share/pyshared/datastage/web/core/templates/base.html.

The name in the website header (currently set to “DATASTAGE”) isn’t hardcoded; it’s picked up from Python. To change any of these settings, the developers say to adjust this file, but I (a non-developer) can't figure out what to do:

$ nano /usr/share/pyshared/datastage/web/settings.py

Configuration file

/usr/share/dataflow-datastage/conf/datastage.conf – configuration file with key value pairs, picked up by website templates (when Python needs access)

/etc/apache2/sites-available/datastage.conf

/etc/apache2/sites-enabled/datastage.conf – any changes must be reflected here (when Apache needs access)
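
One way to confirm that the two Apache files agree after an edit is to compare them byte for byte. This is a small sketch; configs_match is a made-up helper name. If sites-enabled/datastage.conf is a symbolic link to the sites-available copy, as is common on Ubuntu, the two will always match.

```shell
# Succeed if the two files have identical contents.
# configs_match is a hypothetical helper name used for illustration.
configs_match() {
    cmp -s "$1" "$2"
}

# On the server you would run:
#   if configs_match /etc/apache2/sites-available/datastage.conf \
#                    /etc/apache2/sites-enabled/datastage.conf; then
#       echo "Apache configuration files match"
#   else
#       echo "Apache configuration files differ"
#   fi
```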

Good luck!

Updates and corrections to this wiki are welcome.