The LIGO Laboratory LDAS Systems

ldasname | System | Gateway
---|---|---
ldas-dev | LDAS Development | ldas-dev.ligo.caltech.edu
ldas-test | LDAS Test | ldas-test.ligo.caltech.edu
ldas-cit | LDAS Archive | ldas-cit.ligo.caltech.edu
ldas-wa | LDAS Hanford | ldas.ligo-wa.caltech.edu
ldas-la | LDAS Livingston | ldas.ligo-la.caltech.edu
ldas-mit | LDAS MIT | ldas.mit.edu
A user wishing to run an LDAS system entirely within their own account does not need to create an /etc/ldasname file; the LDAS system will register itself as localhost when no /etc/ldasname file is found.
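For a site installation, /etc/ldasname presumably contains just the token from the ldasname column above; on the Hanford system, for example:

ldas-wa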
Note that the runLDAS script is where the LD_LIBRARY_PATH, PATH, DB2INSTANCE and any other required environment variables need to be set.
On Linux systems the required command lsof is located under /usr/sbin, so if this directory is not in the path of the user running LDAS, it will need to be added in the runLDAS script like this:
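A minimal sketch, assuming runLDAS sets its environment with Bourne-shell syntax (adjust to the script's actual form):

# make lsof reachable for LDAS
PATH=$PATH:/usr/sbin
export PATH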
Initial startup of the APIs as user "ldas":
/ldas/bin/runLDAS *****
(replace ***** with the LDAS system password)
and hit Return.
Several things will happen; startup activity will be recorded in:
/ldas_outgoing/manager.log
The manager.log file is a good place to look if, after running runLDAS, nothing seems to have happened.
The users.queue file created at startup on a new LDAS system does not contain any LDAS users capable of issuing user commands. It is necessary to use the control and monitor API client program cmonClient to add users who can issue commands.
On a brand new LDAS system with no defined users, it is necessary to add the users.queue entry for the admin LDAS user by hand. To do this, copy the file users.queue from the LDAS managerAPI installation directory into the managerAPI runtime directory. Then, using an editor, add a new user by copying the example NULL user in that file, replacing the NULL user's info with the correct info for the admin user and placing the md5sum of the admin user's password in the password field. (The NULL user has "ng" in the password field, which is an invalid md5 hash.)
Given the admin user's password "foo", create the md5 hash by running:
echo foo | md5sum
and paste the resulting hash value into the users.queue file.
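The output looks like this (note that the trailing newline appended by echo is part of the hashed string):

d3b07384d113edec49eaa6238ad5ff00  -

The first field is the 32-character hex digest to paste into the password field.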
The file will be updated with new users as they are added using the control and monitor API.
The current log file:
/ldas_outgoing/logs/LDASmanager.log.html
will be rotated out and archived, and a new log file will be opened.
A status page for the APIs named in ::API_LIST will be written to the file:
/ldas_outgoing/logs/APIstatus.html
in the exported file system. These files can be viewed on the web at each of the LIGO sites:
The second link is to a page which contains a historical
overview of the running LDAS system.
The information on this page is not particularly useful,
but future enhancements to this page will make it a
powerful tool for analyzing long term LDAS system health.
Log File Entries:
Each of the APIs at each of the sites writes useful
information into a log file. Log file entries are scored
using red, yellow, and green colored ball images:
LDAS system log files for each site are here:
The verbosity of the logging is controlled by the value of the ::DEBUG global variable. ::DEBUG may be set to 0 for minimal logging, 1 for moderate logging, or 2 for heavy logging and stderr messaging.
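For example, in the resource-file style used elsewhere in this document (the desc text here is illustrative):

;## desc=logging verbosity: 0 minimal, 1 moderate, 2 heavy plus stderr
set ::DEBUG 1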
There is a convenient utility, hotgrep, which can be used
to monitor the log files continuously. Example call to
hotgrep for monitoring the manager's log file (type it all
on one line as shown):
rxvt -fn 9x15 -sl 8192 -bg ivory -fg brown -title "`cat /etc/ldasname`"_mgr_log -e hotgrep logs/LDASmanager.log.html '.+' &
Processing User Commands:
The manager API's operator socket is at the installation's base address plus one (::BASEPORT+1). User commands can be issued to this socket from a telnet session, or from any script or program capable of making a socket connection. Connections should be made to the gateway host at that port.
All other service addresses are calculated by the APIs at runtime and are not to be accessed directly by anyone except the manager API or other LDAS APIs on the internal network; in any event, the other ports are bound to the internal network and are not accessible from outside computers.
Submitting LDAS user commands normally requires negotiating a Challenge/Response protocol. The client first sends the command with the -password option set to md5protocol:
ldasJob {-name mlei -password md5protocol -email 127.0.0.1:52124} {dataPipeline ... }
Then the manager returns a string consisting of the word md5salt followed by the salt value that the client should use when creating the hash:
md5salt [integer]
(Strictly speaking it is not really a salt in the normal sense; it is a
session key. The name is an artefact of an earlier implementation.)
The client then appends the salt value to the user's password and calculates an md5 hash of the combined password/salt string and returns this value to the manager like this:
md5digest af2058b1b115a2aec77e76e07b85d031
The manager then validates or rejects the request with an appropriate message, and closes the channel if the persistent communication method described in the next paragraph is not being used.
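A minimal client-side sketch of this exchange, assuming tcllib's md5 package; host, port (::BASEPORT+1 on the gateway), and password are hypothetical variables supplied by the caller, and the job text is the example from above:

package require md5

set sock [socket $host $port]
puts $sock {ldasJob {-name mlei -password md5protocol -email 127.0.0.1:52124} {dataPipeline ... }}
flush $sock

# the manager answers e.g. "md5salt 123456"
gets $sock reply
set salt [lindex $reply 1]

# hash the password with the session key appended, and return the digest
set digest [string tolower [md5::md5 -hex "${password}${salt}"]]
puts $sock "md5digest $digest"
flush $sock

# read the validation (or rejection) message
gets $sock status
close $sock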
Persistent socket client support (Deprecated. See below):
When the IP address and port number declared in the email address
option are identical to the IP and port of the initial client
connection, a persistent socket communication channel is
automatically configured, and if the client does not close its
socket after receiving the job status message from the manager,
the job result message will be sent via the same channel when the
job completes.
This optional behaviour mode was added to support clients behind
restrictive firewalls.
NEW Persistent Socket Client Support:
The persistent socket protocol described above was found to be
insufficient when the client was connecting from behind a NAT
router.
A new protocol, in which the email option is declared to be the
string persistent_socket, has been added. It can be used
transparently behind a NAT router, or for ANY OTHER persistent
socket communication, effectively deprecating the "identical IP
and port" syntax, which will however continue to work as before
(or not work, as before, from behind a NAT router).
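For example, the job shown earlier would be submitted for persistent socket delivery as:

ldasJob {-name mlei -password md5protocol -email persistent_socket} {dataPipeline ... }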
However, user commands can be submitted from within the martian network (defined in the LDASapi.rsc or LDASmanager.rsc file using the resource variable ::PRIVELEGED_IP_ADDRESSES) using a simplified cleartext protocol of the form key username password:
::MGRKEY ldasJob ... (where the -password option uses the cleartext password)
What to do if things REALLY aren't working
If the User Commands do not seem to be completing, or are
returning error messages which you cannot understand:
(The following information is becoming outdated as the
control and monitor API becomes more full-featured. It is
no longer necessary to telnet to the manager API's emergency
port to issue maintenance commands; the control and monitor
API has menu entries for the most commonly required actions.)
telnet localhost {::BASEPORT + 2}
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
::MGRKEY [username] [password] "mgr::bootstrapAPI [ API name ]"
Connection closed by foreign host.

The log messages should reflect that the API was restarted or that an error occurred.
Telnet to the manager, and issue the commands shown
in bold:
telnet localhost {::BASEPORT + 2}
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
::MGRKEY [username] [password] mgr::sHuTdOwN
Connection closed by foreign host.
If the metaserver is rebooted, the database server may need to be restarted; to restart it, type db2start.
Most of the mpiAPI actions are accomplished using rsh; however, because ssh provides much better error handling, the mpiAPI uses ssh for invoking mpirun. In order for this to work correctly, user ldas must be able to execute commands on the beowulf master node as any of the search users. This means that ssh must be configured to function non-interactively. There are several ways to do this; two of the most common are to use the ssh-agent facility, or to enable rhosts authentication in ssh and use hosts.equiv. Making use of either of these facilities requires a fair amount of understanding of ssh. See the docs for ssh.
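A quick check of non-interactive operation (node0 and search01 are hypothetical names for the master node and a search user): as user ldas on the gateway, run

ssh search01@node0 /bin/true

and confirm that it completes without prompting for a password or passphrase.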
Another thing to try is to log onto the beowulf gateway machine as the 'ldas' user, cd to the running directory of the mpiAPI (usually /ldas_outgoing/mpiAPI) and attempt to run the command:
recon -v bootschema.lam
If this fails to produce the well-known "Woohoo!" message and instead returns an error message, it will be necessary to read the error message carefully and to remedy (going one step at a time) all problems encountered. Rerun the recon command after each change is made to the configuration.
Once the recon command is working reliably for user ldas, repeat the procedure, using a search user as the guinea pig, and when all issues are resolved make sure that they are resolved for all search users.
A common problem seems to be that /ldcg/bin is not at the beginning of the path for the search users.
When recon is happy, the mpiAPI will likely begin to process jobs.
There is a new utility that can be used to detect Linux
system errors on the beowulf cluster. It is called
scancluster, and it is usually found in the
/ldas/bin directory.
The script should be examined and possibly modified before
use to set the base name of the beowulf node machines. The
default value is set to node.
The scancluster utility is run on the beowulf
gateway, and it produces a report called
scancluster.log.NNNNNNNNNN, where the file extension
is the unix timestamp.
The log file will have a separate section for each node on
the cluster, and there will be two subsections in each node
report. The first subsection will contain error conditions, and
the second subsection will contain the last 10 lines of the
/var/log/messages file.
The scancluster log is visually inspected for system errors.
Even a fairly large cluster can be examined relatively
quickly by this method.
The APIs rely on the existence and correct contents of several files:
and all files named as in these patterns:
Any of these may be overridden by files of the same name placed in /ldas_outgoing.
Variables which MUST BE SET per site installation are shown in RED in the following sections.
/ldas/bin/LDASapi.rsc
(there are default locations for most parts of the
system which are defined relative to ::LDAS,
but we redefine ::MOUNT_PT, ::PUBDIR, and
::LDASLOG for every running condition except
in-place testing of LDAS from a build directory).
/ldas/lib/mpiAPI/LDASmpi.rsc
There are five resource-file-defined variables that control
the distribution of executed processes among the nodes in the
beowulf cluster (a configuration sketch follows these descriptions):
The first name in the nodelist will be assumed to be the
gateway machine for the beowulf, and will, by default, be the
machine that all wrapper masters run on.
Nodes may be named multiple times in the nodelist, resulting
in the creation of multiple virtual nodes; this allows use of
the multiple cpus available on SMP nodes, or developmental
testing using a stand-alone computer with a virtual cluster
defined in this way.
Defining node reuse allows multiple processes to be run on
each virtual node defined via ::NODENAMES.
When ::MPI_MULTIPLE_NODES is defined, the number of processes
which can be started on a single virtual node will be limited
by the multiplier ::MPI_NODE_SHARE_LIMIT, which defaults to 2.
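An illustrative sketch (the node names are hypothetical; node1 is listed twice to create two virtual nodes on an SMP machine):

;## the first name is the beowulf gateway, where wrapper masters run
set ::NODENAMES [ list node0 node1 node1 node2 ]
;## allow more than one process per virtual node (0 or 1)
set ::MPI_MULTIPLE_NODES 1
;## maximum processes per virtual node (defaults to 2)
set ::MPI_NODE_SHARE_LIMIT 2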
The valid defined database names are:
Currently the metadataAPI can be connected to only one database
while it is running; to switch to another database (e.g. lho_test)
at the site, telnet to the manager and issue the commands shown
in bold:
New anonymous FTP directories can be created under the
::PUBDIR by the admin user through the user command
makeFtpDirectory.
contains the definitions of the Tcl variables that point to:
/ldas/lib/managerAPI/LDASmanager.rsc
(normally set to "NORMAL")
(should be set to something at least twice
the number of assistant managers permitted
so that buggy search code is allowed to decay
gracefully without tying up the system!)
(normally set to 60 seconds)
(normally set to 300 seconds)
(this is USUALLY the content of the file
/etc/ldasname)
(defaults to 100 rows)
(Now found in /ldas_outgoing/diskcacheAPI/LDASdiskcache.rsc)
(currently set to
/ldas/shared/macros)
(currently set to
/ldas)
contains the definitions of the Tcl variables that point to:
(nominally set to 10)
(This will generally be:
LD_PRELOAD=/ldas/lib/libdlmalloc.so USE_DB=$db)
contains the definitions of the Tcl variables that point to:
(lam or mpich)
(generally set to /ldcg for a compliant LDAS
installation)
(\$::LDAS/bin/wrapperAPI for most installations)
(something like -x LD_LIBRARY_PATH=\$::LDAS/lib:\$::LDAS/lib/genericAPI:/ldcg/lib)
(a list of names, e.g. [ list search01 search02... ])
(default rsc has this set to 120000)
/ldas/lib/metadataAPI/LDASdsnames.ini
(a list of names, e.g. [ list node0 node1 node2 node3... ])
This is done because it is not uncommon for the internal nodes
of a beowulf to be isolated from external networks, which would
cause wrapper masters running on the isolated nodes to be unable
to communicate with the datacond and eventmon APIs, with which
the wrappers must exchange data over the network.
(0 or 1)
If a physical node is defined once in ::NODENAMES and
::MPI_MULTIPLE_NODES is defined, then two jobs may both run
processes on the same physical node, but one job will not
run multiple processes on that node.
To get a single job to run multiple processes on a single
physical node, declare the node twice in ::NODENAMES; the
two virtual nodes may then be assigned to a single job.
(Defaults to 2)
(0 or 1)
contains the definition of the Tcl variable that points to:
(currently set to lho_1)
telnet localhost {::BASEPORT + 2}
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
::MGRKEY [username] [password] set ::USE_DB lho_test
Connection closed by foreign host.
Now command the managerAPI to restart the metadataAPI
as shown in bold:
telnet localhost {::BASEPORT + 2}
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
::MGRKEY [username] [password] mgr::bootstrapAPI metadata
Connection closed by foreign host.
To find out which database the metadataAPI is connected to,
telnet localhost {::BASEPORT + emergencyPortOffset}
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
::MGRKEY "puts $cid $::dbname"
lho_1
{metadata :emergency:executed: puts $cid $::dbname}
Connection closed by foreign host.
OpenSSH is required to run LDAS.
Instructions on building and installing OpenSSH can be found on
the Installation Notes page.
The Unix account of the user who will be running LDAS must be configured
according to the following steps.
ssh-keygen -t rsa

This should only be performed on the LDAS gateway machine (the one running the managerAPI). Accept the default name and location for the key file. For added security, it is recommended to provide a non-blank passphrase. Note, you will be prompted for this passphrase whenever a new ssh agent is started, so don't forget it. More on that later.
Append the ~/.ssh/id_rsa.pub file to the ~/.ssh/authorized_keys2 file:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2

Make sure this information is appended to the authorized_keys2 file on ALL machines used by the LDAS system. If you have a beowulf cluster, this must also be done in every search user account on the beowulf gateway, creating the ~/.ssh directory and authorized_keys2 files as necessary.

Set the permissions on the ~/.ssh directory and authorized_keys2 file:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys2

They must not have group or other write permissions.

Test the ssh connection to each machine:

ssh <machine_name>

Perform this from the LDAS gateway machine (the one running the managerAPI) and answer 'yes' to any prompts.
In order for the managerAPI to ssh into other machines, an ssh-agent must be started to hold the key you created earlier. The agent will authenticate the ssh connections without the need to provide a password.
The first time you run the runLDAS script,
it will start an ssh-agent for you and add your key.
You will be prompted for the passphrase, if you provided one
when you created your key. If LDAS is restarted, the runLDAS
script will attempt to attach to an existing agent. If it can't
find an agent already running, it will create a new one for you,
and you will be prompted to re-enter your passphrase.
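If you ever need to start an agent and load the key by hand (for example, to test the key outside of runLDAS), the standard OpenSSH commands are:

eval `ssh-agent`
ssh-add

ssh-add will prompt for the passphrase and load the default key created above.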
You need to set up a password for the cmonClient and cgi keys to make your LDAS system unique, instead of using the defaults in the installation directory.
The following fields should replace the existing entries in LDAScntlmon.rsc:
There is a tk script with a gui interface that accomplishes all of the steps to create custom copies of the LDAScntlmon.rsc and cmonClient.rsc files (see below).
In /ldas_outgoing/cntlmonAPI/LDAScntlmon.rsc, edit the following:
set ::CLIENTKEY <your cgi key>
In /ldas_outgoing/cntlmonAPI/LDAScntlmon.rsc, edit the following:
set ::CHALLENGE <your client key>
Edit your cmonClient rsc file, which can be in your running directory or the cmonClient.rsc file in your cmonClient tarball:
set ::CHALLENGE <your client key>
Privileged users are users that can execute authorized functions via cmonClient, e.g. rebooting LDAS APIs, updating resources, examining core files, etc. To enable a user as a privileged user, edit the global resource file /ldas_outgoing/LDASapi.rsc and add the user's login to ::control_group. After that, you need to restart the cntlmonAPI.
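A sketch of such an edit (jsmith is a hypothetical login; check the existing ::control_group entry in LDASapi.rsc for the exact list form):

set ::control_group [ list ldas jsmith ]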
Normally, user administration via cmonClient is not allowed, since there is a formal process for applying for ldas accounts. Installations that are not part of this process, however, may need to add user accounts by hand. Edit /ldas_outgoing/cntlmonAPI/LDAScntlmon.rsc and set the variable ::userAdminOK to 1 instead of 0. Reboot the cntlmonAPI and you are ready to add user accounts to your site.
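That is, in /ldas_outgoing/cntlmonAPI/LDAScntlmon.rsc:

set ::userAdminOK 1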
Edit the local cntlmon resource file /ldas_outgoing/cntlmonAPI/LDAScntlmon.rsc to set the following parameter to your site name:
;## desc=define http site name
set ::SITE_HTTP http://<your gateway name>.ligo.caltech.edu

cntlmonAPI starts a script periodically to remove old jobs and logs. Override the following defaults for your installation if necessary:
;## desc=periodic cleanup of logs and jobs every 3 hrs (millisecs)
set ::CLEAR_OUTPUT_PERIOD 10800000
;## desc=time to keep old jobs and logs (secs) 1 wk
set ::JOBS_PURGE_BEFORE 604800
;## desc=time to keep old jobs and logs (secs) 2 wks
set ::LOGS_PURGE_BEFORE 1209600
LDAS service certificates for the gateway box of your system must be obtained via the system administrator and placed in a designated directory specified by ldas resource variables (see General Setup).
Define the following global resource variables for managerAPI and cntlmonAPI:
For example, in LDASapi.rsc:
;## desc=tcl globus lib directory
set ::TCLGLOBUS_DIR /ldcg/lib
;## desc=define location of globus service key file
set ::X509_USER_KEY /ldas_outgoing/grid-security/ldaskey.pem
;## desc=define location of globus service cert file
set ::X509_USER_CERT /ldas_outgoing/grid-security/ldascert.pem
;## desc=define location of globus service certificates dir
set ::X509_CERT_DIR /ldas_outgoing/grid-security/certificates
;## desc=globus manager host in case there is more than 1 port
set ::GLOBUS_MANAGER_API_HOST ldas-dev.ligo.caltech.edu
;## desc=option to use globus tcl channel
set ::USE_GLOBUS_CHANNEL 1
Define port numbers for globus connections in managerAPI and cntlmonAPI different than those for tcl sockets:
For example, the managerAPI port definition in LDASapi.rsc:
;## desc=manager globus ports for receiving jobs via user cert
set ::TCLGLOBUS_USER_PORT 10031
;## desc=manager globus ports for receiving jobs via host cert
set ::TCLGLOBUS_HOST_PORT 10030
;## desc=globus manager host in case there is more than 1 port
set ::GLOBUS_MANAGER_API_HOST ldas-dev.ligo.caltech.edu

and in the cntlmonAPI resource file LDAScntlmon.rsc:
;## desc=globus port for cntlmon
set ::TCLGLOBUS_PORT 10032
When certain functions are requested via cmonClient (e.g. LDAS Tests, cache view, database, display mount point tree), cntlmonAPI submits these requests as ldas jobs to the managerAPI, thus acting as a tclglobus client; but since it runs on an ldas system as user ldas, there is no user proxy available. The following resources need to be set up in the LDAScntlmon.rsc file:
;## desc=service for globus tcl channel
set ::SERVICE_NAME ldas
;## enable gsi authentication in globus channel or disable (blanks)
set ::GSI_AUTH_ENABLED "-gsi_auth_enabled"

Note that in order to connect successfully to the manager, the ::GSI_AUTH_ENABLED setting must match that of the manager. If the manager does not support GSI authentication, the client must connect with
set ::GSI_AUTH_ENABLED ""
The cmonClient resource file in the user's home directory should be updated with the following tclglobus resources (this is done automatically if you download an updated version of cmonClient and select OK to updating the resources).

;## desc=tcl globus lib directory
set ::TCLGLOBUS_DIR /ldcg/lib
;## desc=tcl globus port in cntlmonAPI
set ::TCLGLOBUS_PORT 10032
;## enable gsi authentication in globus channel or disable (blanks)
set ::GSI_AUTH_ENABLED "-gsi_auth_enabled"

If cmonClient is run from an ldas machine by user ldas, it will automatically extract and configure the following resources, necessary for running with service certificates:
::X509_USER_KEY
::X509_USER_CERT
::X509_CERT_DIR
::GSI_AUTH_ENABLED
::SERVICE_NAME
::TCLGLOBUS_PORT (cntlmonAPI)
LDAS test scripts and test clients such as LDASJobH should have the following variables defined in order to connect properly to the managerAPI:

::TCLGLOBUS_DIR
::X509_USER_KEY
::X509_USER_CERT
::X509_CERT_DIR
::GSI_AUTH_ENABLED
::GLOBUS_MANAGER_API_HOST
::TCLGLOBUS_HOST_PORT (user ldas only)
::SERVICE_NAME (user ldas only)
::TCLGLOBUS_USER_PORT (running as an ldas user)

For more details on the use of the tclglobus package, see Globus enabled ldas.