In general, you will download databases from the Internet. Download URLs and detailed configuration information for many popular databases can be found on the Matrix Science web site at http://www.matrixscience.com/help/seq_db_setup.html
To get you started, and as a service to users who do not have a fast connection to the Internet, a selection of database files is supplied on DVD. Please note that these files will become increasingly out-of-date. If possible, you should download updates from the Internet at regular intervals. The databases on the DVD are:
1. contaminants (contaminants collection from MPI Martinsried)
2. cRAP (contaminants collection from GPM)
3. IPI_arabidopsis
4. IPI_bovine
In each case, the files have been renamed to include a version or date stamp, and placed into the recommended directory structure. The procedure to enable each database is as follows:
1. Choose a suitable location for the database files.
2. Unpack the files from the DVD archive.
3. In some cases, unpack the additional files required to create taxonomy indexes.
4. If necessary, modify the existing Mascot configuration to identify the location of the database files.
5. Enable the new database.
Note: If you are upgrading Mascot, and use this procedure to update one or more of your databases, first stop the Mascot service (Windows) or kill ms-monitor.exe (Unix), and delete the old database files. Once all the new files are in place and you have updated your Mascot configuration, you will need to restart the Mascot service (Windows) or ms-monitor.exe (Unix).
1. Choose a suitable location for the database files
The “default” location for databases is under the Mascot sequence directory, as illustrated in the directory structure, above. However, database files can be located on any local drive. If you decide to put the files in a different location, you will need to make a configuration change in step 4.
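Because some of the database files are very large, it can be worth confirming that a candidate location has enough room before unpacking anything. The following is a minimal sketch for Unix, not part of Mascot: the function name has_free_space and the idea of passing a required size in kilobytes are assumptions made for illustration.

```shell
# Hypothetical helper: succeed if the file system holding DIR has at least
# NEED_KB kilobytes available. Usage: has_free_space DIR NEED_KB
has_free_space() {
    dir="$1"
    need_kb="$2"

    # df -P gives the portable one-line-per-file-system layout, and -k
    # reports sizes in 1024-byte blocks. Column 4 of the second output
    # line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')

    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$need_kb" ]
}

# Example (not run here): require roughly 10 GB under the sequence directory.
# has_free_space /usr/local/mascot/sequence 10000000 || echo "not enough space"
```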
2. Unpack the files from the DVD archive
Unix: If your Mascot server is on a Unix platform, you can unpack the files using gzip and tar. If the Databases DVD is mounted as /mnt/dvdrom and you wish to unpack the IPI_human files to /usr/local/mascot/sequence:
cd /usr/local/mascot/sequence
gzip -dc /mnt/dvdrom/IPI_human.tar.gz | tar xvf -
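If you unpack several archives, the same gzip and tar pipeline can be wrapped in a small function that checks the archive first. This is only a sketch under stated assumptions: the function name unpack_db_archive and its argument order are hypothetical, and the archive must be given as an absolute path because the function changes directory before unpacking.

```shell
# Hypothetical helper: verify a database archive, then unpack it.
# Usage: unpack_db_archive ARCHIVE TARGET_DIR
unpack_db_archive() {
    archive="$1"    # e.g. /mnt/dvdrom/IPI_human.tar.gz (absolute path)
    target="$2"     # e.g. /usr/local/mascot/sequence

    # Refuse to continue if either path is missing.
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    [ -d "$target" ]  || { echo "target directory not found: $target" >&2; return 1; }

    # gzip -t tests the integrity of the compressed file without writing
    # any output, so a damaged archive is caught before unpacking starts.
    gzip -t "$archive" || return 1

    # Unpack in a subshell so the caller's working directory is unchanged.
    ( cd "$target" && gzip -dc "$archive" | tar xvf - )
}

# Example (not run here):
# unpack_db_archive /mnt/dvdrom/IPI_human.tar.gz /usr/local/mascot/sequence
```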
Chapter 5. Sequence Database Setup 57
Windows (GUI): Many people will prefer to use a graphical utility like WinZip to unpack these archives. You will need to use WinZip 9.0 or later to unpack NCBInr and UniRef100 because earlier versions cannot cope with files larger than 4 GB.
Before trying to unpack a database, launch WinZip and choose Configuration from the Options menu. On the Miscellaneous tab, clear the checkbox for ‘TAR file smart CR/LF conversion’. On the Folders tab, check the location of the ‘Temp folder’.
By default, this will be your temp directory, and is where WinZip will attempt to unpack the database files. Some of these are very large, so you may need to change this setting to a drive with sufficient free space.
Choose OK to save your changes, then use WinZip to open one of the archives, e.g. the IPI_human.tar.gz file. Usually, WinZip will have set your file associations so that you can open a file with a gz extension by double clicking it. When you open the archive, WinZip will ask whether it should decompress the tar file it contains to a temporary folder and open it.
Choose Yes, and the archive will be opened and displayed.
Choose Extract, and then select your Mascot sequence directory. Make sure that ‘All files’ and ‘Use folder names’ are selected.
Windows (Command line): Download and install BsdTar from http://gnuwin32.sourceforge.net/packages/bsdtar.htm. By default, the executable is installed into C:\Program Files\GnuWin32\bin\bsdtar.exe (32-bit) or C:\Program Files (x86)\GnuWin32\bin\bsdtar.exe (64-bit). If the Databases DVD is drive D: and you wish to unpack the NCBInr files to C:\Inetpub\mascot\sequence under 32-bit Windows, open a command prompt (Windows Start menu; Programs; Accessories), and enter the following:
C:
cd \Inetpub\mascot\sequence
"C:\Program Files\GnuWin32\bin\bsdtar.exe" -xvf D:\NCBInr.tar.gz
3. Unpack the additional files
NCBInr and UniRef100 are comprehensive databases containing entries from many different organisms. Mascot supports taxonomy filtering, which allows you to search a subset of the database entries, such as those for mammals or for Homo sapiens. Building a taxonomy index requires additional files, and these are supplied on the DVD in an archive called taxonomy.tar.gz. Each database uses a different mix of files, but the easiest thing is to copy all the files in one go, whichever database you plan to use. Unpack the files in taxonomy.tar.gz using the same approach as in step 2, but copying the files into the Mascot taxonomy folder. Choose to overwrite any existing files unless you know that the files on the hard disk are more recent than those from the DVD.
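On Unix, the choice between overwriting and keeping more recent files can be sketched as follows. This is an illustration, not a Mascot tool: the function name unpack_taxonomy, the keep-newer argument, and the example taxonomy folder /usr/local/mascot/taxonomy are assumptions. The --keep-newer-files option is provided by GNU tar and bsdtar; other tar implementations may not support it.

```shell
# Hypothetical helper: unpack the taxonomy archive into the taxonomy folder.
# Usage: unpack_taxonomy ARCHIVE TAXONOMY_DIR [keep-newer]
# ARCHIVE must be an absolute path, because the function changes directory.
unpack_taxonomy() {
    archive="$1"         # e.g. /mnt/dvdrom/taxonomy.tar.gz
    taxonomy_dir="$2"    # e.g. /usr/local/mascot/taxonomy (assumed location)
    mode="${3:-overwrite}"

    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    [ -d "$taxonomy_dir" ] || { echo "directory not found: $taxonomy_dir" >&2; return 1; }

    if [ "$mode" = "keep-newer" ]; then
        # Skip any file whose copy on disk is newer than the archived copy.
        ( cd "$taxonomy_dir" && gzip -dc "$archive" | tar --keep-newer-files -xvf - )
    else
        # Default tar behaviour: existing files are overwritten.
        ( cd "$taxonomy_dir" && gzip -dc "$archive" | tar xvf - )
    fi
}

# Example (not run here):
# unpack_taxonomy /mnt/dvdrom/taxonomy.tar.gz /usr/local/mascot/taxonomy
```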
4. Modify the Mascot configuration
From your local Mascot home page, choose Configuration Editor. From the menu, choose Database Maintenance. All of these databases are pre-configured, but are not enabled. If we are installing NCBInr, choose NCBInr from the Select drop-down list. It will look similar to this
If you copied the files to a different location in step 2, you will need to change the Path to match. Note that the slashes must be forward slashes, even under Windows, and that there must be a wild card before the fasta file extension. Otherwise, all you need to do is change Inactive to Active. Choose Test this definition, and you should see something similar to this
If there are any error messages, these must be investigated and fixed.
Choose Return to database definitions at the bottom of the page to return to the configuration form and choose Apply to save the changes.
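The two Path rules in this step (forward slashes only, and a wild card immediately before the fasta extension) are easy to get wrong when typing a Windows path. As a rough illustration, they can be expressed as a small shell check. The function name check_db_path and the example path are hypothetical, and the check covers only these two rules, not everything that Test this definition verifies.

```shell
# Hypothetical check of a database Path value against the two rules above.
# Prints "ok" and succeeds, or prints a reason and fails.
check_db_path() {
    case "$1" in
        # Any backslash anywhere in the value breaks the first rule.
        *\\*)
            echo "error: use forward slashes, even under Windows"
            return 1 ;;
        # The value must end in a literal wild card followed by .fasta.
        *\*.fasta)
            echo "ok" ;;
        *)
            echo "error: a wild card must precede the .fasta extension"
            return 1 ;;
    esac
}

# Example (illustrative path only):
# check_db_path 'C:/Inetpub/mascot/sequence/NCBInr/current/NCBInr_*.fasta'
```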
Follow the Database Status link to monitor the database as the files are compressed and tested. The database is not available for searching until the Status is ‘In Use’. (If you are upgrading Mascot, you’ll have to start the Mascot service or ms-monitor.exe to view Database Status and compress the new databases.)