# This file is part of Clusterix, Copyright (C) 2004 Alessandro Manzini,
# email: a.manzini@infogroup.it
#
# Clusterix is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# Clusterix is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Clusterix; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

#############
Clusterix(wsd)
#############

Clusterix is cluster software for Unix operating systems (Linux, FreeBSD,
Solaris). There are two types of Clusterix:

1) Clusterix: writes status information to a raw device on a shared disk.
2) Clusterixwsd: writes status information to a file on the local disk of
   each node.

It is intended to be used with two machines when you want to put a service
in high availability. "Service" here means IP addresses, a program (such as
a web server or database), and shared disks. For example, you can use it to
put a web server in high availability across two nodes, or a database with
a shared disk.

If you want to put more than one service in high availability, so that one
service runs on one node and the second on the other node, you can simply
make two installations of Clusterix in two different directories on the two
nodes.

When a service fails on one node, the cluster system moves the service to
the other node: it configures the IP addresses, publishes the MAC address
of the interface, mounts the disks listed in the configuration file (if
any), and finally starts the program (web server, database, etc.).
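The takeover sequence just described can be sketched as follows. Everything
here (interface alias, addresses, MAC, device, paths) is an illustrative
assumption; the real values come from clusterixwsd.conf, and a pluggable
runner lets you preview the commands instead of executing them:

```shell
#!/bin/sh
# Sketch of the takeover performed on failover. All values below are
# illustrative assumptions, not Clusterix's actual configuration.
RUN="${RUN:-echo}"   # RUN=echo previews the commands; set RUN= (empty)
                     # to really execute them (as root)

takeover() {
    $RUN ifconfig eth0:1 192.168.1.50 netmask 255.255.255.0 up   # virtual IP
    # Publish the new MAC via gratuitous ARP (Linux only; argument
    # format per "send_arp src_ip src_hw targ_ip targ_hw").
    $RUN send_arp 192.168.1.50 001122334455 192.168.1.50 ffffffffffff
    $RUN mount /dev/sdb1 /data              # shared disk, if configured
    $RUN /opt/clusterix/service.sh start    # finally, start the service
}
```

Running `RUN=echo takeover` prints the sequence without touching the
system, which is a safe way to sanity-check the configured values.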
If you have to mount shared disks and you use them intensively, it may be
better to use Clusterixwsd, because Clusterix writes its status information
to the shared disk (instead of exchanging it over the network as
Clusterixwsd does) and can take away resources needed by the service.

- The cluster software is composed of a main script (clusterixwsd.sh), a
  configuration file (clusterixwsd.conf), a script that controls the start
  and stop sequence of a service (script.sh), a script that starts and
  stops the service (service.sh), and two scripts that check the status of
  the service and switch the cluster if the service fails
  (control_process.sh and control_control.sh). Only the script
  clusterixwsd.sh and the configuration file clusterixwsd.conf are needed
  for the cluster to work correctly. The other files and scripts are
  optional; they are used to control and administer the service offered by
  the cluster.
- All the variables you have to set are in the configuration file.
- The cluster script and the configuration file must be the same on both
  nodes.
- Usually you only have to change clusterixwsd.conf, service.sh, and the
  conffile variable in clusterixwsd.sh. Leave the rest untouched.

HOW IT WORKS

a) Cluster with status information on a shared disk: Clusterix

                        Public Interface
       |-----------------------------------------------|
   ____|____        Private Interface             ____|____
  |         |        <--------->                 |         |
  |         |        <--------->                 |         |
   ---------    Backup Private Interface          ---------
       | Write                                        | Write
       |                raw device                    |
   ----------------------------------------------------------
  |           |               |              |                |
  |           | writes host 1 | write host 2 |                |
  |___________|_______________|______________|________________|

The status information is placed in one raw device of a shared disk.
Essentially, each node writes a timestamp that the other node reads. If
the timestamp is unchanged for a given number of seconds (variable
timeoutdisk), the reading node decides that the other node is down and
takes over the services if it does not already have them.
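The timestamp heartbeat can be sketched with dd, one of the tools the
cluster requires. The file path, block offset, and padding below are
illustrative assumptions; the real block layout is in the blocknum function
of clusterixwsd.sh, and a raw device path works the same way as the plain
file used here:

```shell
#!/bin/sh
# Sketch of the timestamp heartbeat with illustrative values.
QF=/tmp/quorumfile      # stand-in for the quorumfile/quorumdevice variable
BS=512                  # blocksize variable from clusterixwsd.conf
MYBLOCK=1               # hypothetical block where this node writes

write_timestamp() {
    # Pad the epoch seconds to one full block and write it in place.
    printf '%-512s' "$(date +%s)" |
        dd of="$QF" bs="$BS" seek="$MYBLOCK" count=1 conv=notrunc 2>/dev/null
}

read_timestamp() {
    # Read the block a node wrote and strip the padding.
    dd if="$QF" bs="$BS" skip="$1" count=1 2>/dev/null | tr -d ' \n\000'
}
```

If reading the peer's block returns the same value for more than
timeoutdisk seconds, the peer is considered down.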
The cluster also takes advantage of information taken from the network. It
can use zero, one, or two private interfaces to communicate with the other
node. To activate or deactivate them, set the variables PRNI and BNI.

Before checking the timestamp of the other node, the cluster checks
whether the public interface is communicating on the network. If yes, it
checks the timestamp of the other node. If no, it checks through the
private interfaces whether the public interface of the other node is up.
If yes, it stops the services here and starts them on the other node. If
no, it takes no action. If, beyond the public interface, the private
interface is also down, it tries to communicate over the backup private
interface. If that is down as well, the isolated node stops its services.

The private interfaces (or the public one, when there is no private
interface) are used to check whether the public interfaces on the two
nodes are up, and to launch various commands from one node to the other.
In this type of cluster, the status information (timestamp, etc.) is
written to the raw device, which is visible to both nodes at the same
time.

b) Cluster with status information on local disk: Clusterixwsd

                        Public Interface
       |-----------------------------------------------|
   ____|____        Private Interface             ____|____
  |         |        <--------->                 |         |
  |  Host 1 |        <--------->                 |  Host 2 |
   ---------    Backup Private Interface          ---------
    |       \                                   /       |
    | Write  \_________ Write host 1 __________/  Write |
    | + Read   ________ Write host 2 __________  + Read |
    | host 1  /                                \ host 2 |
    v        v                                  v       v
  Quorum file host 1                      Quorum file host 2

Every node writes its information and timestamp to certain blocks of a
file on its local disk, and also writes a timestamp to a specific block of
the file on the other node. Every node reads the timestamps from its own
local file.
In this case it is fundamental that the public or private network works in
order to determine the state of a node. The cluster can use zero, one, or
two private interfaces to communicate with the other node. To activate or
deactivate them, set the variables PRNI and BNI.

Before checking the timestamp of the other node, the cluster checks
whether the public interface is communicating on the network. If yes, it
checks the timestamp of the other node. If no, it checks through the
private interfaces whether the public interface of the other node is up.
If yes, it stops the services here and starts them on the other node. If
no, it takes no action. If, beyond the public interface, the private
interface is also down, it tries to communicate over the backup private
interface. If that is down as well, the isolated node stops its services.

The private interfaces (or the public one, when there is no private
interface) are used to check whether the public interfaces on the two
nodes are up, and to launch various commands from one node to the other.
In this type of cluster, the status information (timestamp, etc.) is
exchanged over the network.

For both types of cluster:

Depending on the settings, starting the virtual services implies starting
one or more IP addresses, starting a program, and mounting one or more
disks.

The differences in the variable settings between Linux and FreeBSD are
few, and you can find indications on how to set each variable in the
comments before it in the configuration file. Variables that can differ
depending on the OS you are using are date, ps, dd, ifconfig, the
definition of the aliases, and the mount/umount of the disks.

If you activate the external control of the services, two processes are
started: the first checks that the service is up and, if not, switches the
services to the other node. The second checks that the first process is up
and, if not, restarts it.
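The external control just described can be sketched as two shell functions
(a sketch assuming the service can be recognized by a ps pattern; the
function names are illustrative, and the real logic lives in
control_process.sh, control_control.sh, and the script set in the
controlscript variable):

```shell
#!/bin/sh
# Sketch of the external service control, with illustrative names.

# An availability check in the form the cluster expects: exit status 0
# for ok, 1 for ko. "gotest" is the bundled test service.
check_service() {
    pattern="${1:-gotest}"
    # grep -v grep drops the grep process itself from the candidates.
    ps auxw | grep "$pattern" | grep -v grep >/dev/null
}

# One round of the mutual watchdog: if the partner's pattern is missing
# from the process list, restart the partner command.
watchdog_once() {
    partner_pattern="$1"
    partner_cmd="$2"
    if ! ps auxw | grep "$partner_pattern" | grep -v grep >/dev/null; then
        $partner_cmd &      # partner is gone: restart it in background
    fi
}
```

Each control script would run its check in a loop (every checkprocfreq
seconds), each one watching the other, so that killing a single process of
the pair is not enough to disable the control.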
The first process restarts the second if it is not present, so the two
processes check each other.

If the two hosts have three network interfaces, you can set the variables
PRNI=on and BNI=on: one public interface, one private interface, and one
backup private interface. If we have only two interfaces, set PRNI=on and
BNI=off: one public interface and one private interface, without a backup
interface. If each host has a single interface, the cluster can only work
over the public interface (but it is better not to use it this way); in
that case set PRNI=off and BNI=off.

The cluster status command shows, for each of the two nodes, the
timestamp, the status, whether the node owns the services, when the
service was started, and finally the status of all the interfaces in the
configuration file. A detailed log can be found in the file set in the
variable log.

It is possible to set the timeout after which, if the timestamp field has
not changed, the node is considered down by the other; this is the
variable timeoutdisk. The variable checkfreq regulates how often the
timestamp is written and how often the timestamp written by the other node
is checked. The variable checkprocfreq regulates how often the
control_process script checks the availability of the service. The script
that checks the availability of the service can be any script that returns
0 for ok and 1 for ko.

When the stop of the service is executed, the cluster tries to umount the
disks listed in the configuration file, if any. If the umount command
fails, the cluster forces a crash of the node to prevent data corruption
(to disable this, leave the crashcommand variable unset).

BEFORE INSTALLING

You need any Unix OS among Linux, FreeBSD, and Solaris. You also need the
following commands: bash, hostname, ps, dd, ssh, date, kill, ifconfig,
ping, send_arp (only for Linux, not for FreeBSD).
If you want to mount and umount filesystems you also need: mount, umount,
lsof (for Linux and Solaris), fstat (for FreeBSD). For Solaris you need to
use ksh for the scripts.

For ping on Linux you have to install the ping program from iputils; on
FreeBSD the default program is fine. This is because ping must be able to
send broadcast requests. The ping syntax used to establish whether the
network is down is "ping -b -c 3 -f -w 2" on Linux, "ping -c 3 -f -t 2" on
FreeBSD, and "ping -c 1" on Solaris. All these programs are set in
variables in the configuration file and can be substituted with other
commands that do the same things.

The ssh program has to be configured so that root can launch commands on
the other node without any password, on all the interfaces. If openssh is
installed you can achieve this as follows:

ssh-keygen -t rsa

(for dsa it is ssh-keygen -t dsa; it works the same). This creates a key
pair in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. The key in ~/.ssh/id_rsa.pub
has to be placed in the file ~/.ssh/authorized_keys of the other node.

Example files:

~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA1brtV7H9V5A3yLDYxUG71eGO0nvHmJ2g2+U7
n+Ed5cs0C8mW3Ecb5PkQqCHmdErVQFnzs8BllZSoAcmfxMSjbH7DZKmlz/z0V3CcRgIc661o
TfrIFc/xk7GDxQiaNO8+VMw/BjrtWsYxPHT5vkzigPQPdLBhamFWKTYeTJAX7sE= root@be
llatrix.intra.it

~/.ssh/authorized_keys of the other node:
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA1brtV7H9V5A3yLDYxUG71eGO0nvHmJ2g2+U7
n+Ed5cs0C8mW3Ecb5PkQqCHmdErVQFnzs8BllZSoAcmfxMSjbH7DZKmlz/z0V3CcRgIc661o
TfrIFc/xk7GDxQiaNO8+VMw/BjrtWsYxPHT5vkzigPQPdLBhamFWKTYeTJAX7sE= root@be
llatrix.intra.it

Then repeat these steps with the two nodes inverted. Then edit the file
/etc/ssh/sshd_config, set the PermitRootLogin variable to yes, and restart
the ssh daemon. At this point you can launch commands as root on both
nodes without any password.

You also have to tell each node to ignore broadcast requests, because this
is how the cluster establishes whether the network is reachable or not.
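On Linux the broadcast-ignore setting can be applied like this (a sketch
using the standard sysctl tool; putting the line in /etc/sysctl.conf makes
it persistent across reboots):

```shell
# Apply at runtime (needs root); not part of Clusterix itself.
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=1

# Or write the /proc entry directly:
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

# Verify:
sysctl net.ipv4.icmp_echo_ignore_broadcasts
```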
In fact, the test that checks whether the network is reachable sends a
broadcast request to the network, so you do not want the machine itself to
reply to that request. On Linux, set:

net.ipv4.icmp_echo_ignore_broadcasts = 1

FreeBSD does not reply to broadcast requests by default.

So that the MAC address is published correctly to all machines when the
virtual IP passes from one node to the other, you have to put the nodes in
the same virtual LAN as the router on the switch. In this way the
gratuitous ARP request sent by the send_arp program reaches the router and
thus all the other networks. Otherwise you can see the virtual IP only on
its own network.

INSTALLATION

For a cluster with status information on local disk:

Create a directory for the cluster software:
mkdir /opt/clusterix

Create directories for the log and for the file that contains the status
information:
mkdir /opt/clusterix/log
mkdir /opt/clusterix/qf

Create the file for the status information:
touch /opt/clusterix/qf/quorumfile

Put the following files in the directory /opt/clusterix:

clusterixwsd.sh        Main cluster script
clusterixwsd.conf      Configuration file
control_control0_4.sh  Script that controls control_process0_4.sh
control_process0_4.sh  Script that controls control_control0_4.sh and that
                       launches the script that tests the availability of
                       the service
script.sh              Script that controls the start and stop of the
                       service
service.sh             Script that starts and stops the service (only for
                       testing the cluster at the beginning of the
                       installation)
controllo              Script that has to return 0 for ok and 1 for ko
gotest.sh              The service used for tests (only for testing the
                       cluster at the beginning of the installation)

Note: all the file names can be changed; if you change them, also change
the values of the corresponding variables in the configuration file.
Note: all the files except the configuration file have to be executable.
Note: the send_arp program has to take the following parameters:
usage: send_arp src_ip_addr src_hw_addr targ_ip_addr targ_hw_addr
I include that send_arp program for Linux in this archive. Note that you
do not need it for FreeBSD and Solaris.

Note: if for some reason you want to change the block numbers at which the
cluster writes information, either in the file or in the raw device, see
the blocknum function in clusterix(wsd).sh.

Note: for Solaris you have to use another date program to return the
number of seconds since 1 January 1970. I include in this tar datenum.c
and the compiled version datenum, which gives the right result.

Then change the values of the variables according to your system. In the
file clusterixwsd.sh you only have to set the conffile variable to the
path of the configuration file. Then set all the variables in the
configuration file clusterixwsd.conf according to your needs, following
the comment that precedes each variable. During installation it is better
to set the variables to use the test service shipped with the cluster, and
to substitute the test service with the real service only when you are
sure that everything works.

After you have set the variables in the configuration files, copy all
these files to the other node. Then you can give the command:

"/opt/clusterix/clusterixwsd.sh initialize"

This initializes the file in which the status information is stored. Then
start the cluster:

"/opt/clusterix/clusterixwsd.sh start"

Now check the log to see that everything is all right, give the command
"/opt/clusterix/clusterixwsd.sh status" to verify the state, check for the
presence of the test service with "ps auxw | grep gotest", and check for
the presence of the IP address and the mounted disks listed in the
configuration file, if any.
Then start the cluster on the other node:

"/opt/clusterix/clusterixwsd.sh remote start"

and check again with:

"/opt/clusterix/clusterixwsd.sh status"

At this point the cluster is up and working, and if there is a hardware
failure the cluster switches the service. But so far there is no check of
the service itself: if the service fails or you stop it, the cluster does
not know and does nothing. So start the external check of the service:

"/opt/clusterix/clusterixwsd.sh startcontrol"

and then check the control processes with:

"/opt/clusterix/clusterixwsd.sh statuscontrol"

At this point we can make some tests. For example, you can stop the
service with "/opt/clusterix/service.sh stop" (or "ps auxw | grep gotest"
and then kill the pid) and see that the cluster switches it to the other
node. You can stop the cluster with "/opt/clusterix/clusterixwsd.sh stop"
and see that the cluster starts the service on the other node. You can
reboot the machine and verify that the service goes to the other node, and
so on.

The same procedure is valid for a cluster with status information on a
shared disk, with the exception that instead of setting the quorumfile
variable to a file, you set the quorumdevice variable to a raw device on
the shared disk.

Once everything works, you can put your own service under the control of
the cluster:

- set the variable start_service_comand to a script that starts the
  service
- set the variable stop_service_comand to a script that stops the service
- set stop_service_comand_2 to a script that stops the service if the
  first script fails to stop it
- set includestringstart, excludestringstart, includestringstop, and
  excludestringstop according to your needs

Explanation of the variables in the configuration file:

version            Version of the cluster system
pathcluster        Path of the main script
operatingsystem    Specifies the operating system.
                   Possible choices: Linux, FreeBSD, Solaris. If you use
                   Solaris, also change the default shell to ksh in all
                   the scripts.
hostname           Program that has to return the host name without the
                   domain, like host1 and not host1.domain.org
node1              Names of the 2 nodes of the cluster, without the domain
node2
log                Cluster log file
quorumfile         File that contains the status information (for clusters
                   with status information on local disk)
quorumdevice       Raw device that contains the status information (for
                   clusters with status information on shared disk)
blocksize          Block size in bytes inside the status information file
                   or partition
timeoutdisk        Timeout accessing the status information file after
                   which a node is considered down by the other node
checkfreq          Frequency of checking and writing information on the
                   status information file
timeoutkillremotestop
                   Timeout to wait before killing the process that
                   remotely stops the cluster on the other node. This is
                   useful to give the disks time to umount, so it is
                   advisable that it is greater than timeoutstop +
                   timeoutstop2 + umountfailedsec * umountcountnum. This
                   way you are sure that when a node mounts the disks,
                   the other node has just umounted them.
script             Script that controls the start and stop of the service
servicename        Name of the service to start; it appears in the log
                   and in the email
start_service_comand
                   Command that starts the service. IMPORTANT: the
                   start_service_comand variable has to be different from
                   the includestringstart variable
stop_service_comand
                   Command that stops the service
stop_service_comand_2
                   Second stop command to launch if the first does not
                   stop the service
includestringstart
excludestringstart
includestringstop
excludestringstop  List of string patterns to match or exclude for
                   starting and stopping the service. Put the strings
                   separated by a "|", for example "pattern1|pattern2".
                   For the includes, the service starts only if all the
                   patterns are matched. For the excludes, all the
                   patterns are excluded from the matched ones, and you
                   can also use regular expressions for the exclude.
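The include/exclude semantics could be sketched like this (the function
name and the plain substring matching are assumptions for illustration;
the cluster applies such patterns to the process listing):

```shell
#!/bin/sh
# Sketch of the include/exclude pattern semantics: every "|"-separated
# include pattern must appear in the line, and any matching exclude
# pattern rejects it.
matches_patterns() {
    line=$1 includes=$2 excludes=$3 ok=0
    old_ifs=$IFS
    IFS='|'
    for pat in $includes; do
        case "$line" in *"$pat"*) ;; *) ok=1 ;; esac   # include must match
    done
    for pat in $excludes; do
        case "$line" in *"$pat"*) ok=1 ;; esac         # exclude must not
    done
    IFS=$old_ifs
    return $ok
}
```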
timeoutstop        Timeout after which the process is stopped forcibly,
                   without waiting for a normal stop
timeoutstop_2      The same timeout for the second stop script
timeoutstart       Timeout after which the service is tried to start again
countstartlimit    Number of tries to start the service
begincheck         Number of seconds after the start of the service before
                   beginning to check for the presence of the process
trycountnumquorum  Number of tries before deciding that the other member
                   is not polling the quorum device
trysecintquorum    Interval between one try and the next to see if the
                   other node is polling the quorum device
trymessagequorum   Message to display in the log for the quorum device
                   polling test
date               Has to return the date as "Wed Sep 21 10:51:35 CEST
                   2005"
datenum            Has to return the date as 1127292729 (seconds since
                   1 Jan 1970)
mailto             Destination address of the mail alert
control_process    Path to the control_process script
control_control    Path to the control_control script
controlscript      Path to the script that checks the availability of the
                   service. It has to return 0 for good, 1 for wrong
checkprocfreq      Frequency of the check of the availability of the
                   service
countfailedservicenum
                   Number of tries before deciding that a service is down
failedtrysec       Seconds between one failed try and the next
trymessageservice  Message to display in the log when trying this service
fsckbeforemount    Put ON if you want to make an fsck of the file system
                   before mounting it
umountcountnum     Number of tries to umount the device
umountfailedsec    Seconds between one failed try and the next
umountfaildmessage Message to display in the log for an umount failure
crashcommand       Command that forces a crash of the node: "reboot -f"
                   for Linux, "reboot -q" for FreeBSD. It is needed to be
                   sure that when you mount a file system on one node, it
                   is not mounted on the other node. Leave unset if you do
                   not want to force a crash of the node in case of an
                   umount failure.
numdevice          Number of devices to mount/umount when the cluster
                   starts/stops
devicetomount1     Device to mount
mountpoint1        Mount point
killprocess        Defines the utility used to kill the processes that
                   keep a file system busy before umounting it.
                   For Solaris and Linux:
                   for pidopenfile in `$lsof $1 | $awk '{print $2}' | $uniq | $grep -v PID`; do if [ ! -z "$pidopenfile" ]; then $kill -9 $pidopenfile; fi; done
                   For FreeBSD:
                   for pidopenfile in `$fstat -f $1 | $awk '{print $6}' | $grep -v INUM`; do if [ ! -z "$pidopenfile" ]; then $kill -9 $pidopenfile; fi; done
PRNI               Set to on if you want to use a private network
                   interface
node1ipprivatenetwork1
                   IP of the private network interface on node1
node2ipprivatenetwork1
                   IP of the private network interface on node2
BNI                Set BNI (backup network interface) to on if you have,
                   and want to use, a private backup interface
node1ipprivatenetwork2
                   IP of the backup private network interface on node1
node2ipprivatenetwork2
                   IP of the backup private network interface on node2
node1ippublicnetwork
                   IP of the public network interface on node1
node2ippublicnetwork
                   IP of the public network interface on node2
netmaskpublicnetwork
                   Netmask of the public network interface
broadcastpublicnetwork
                   Broadcast of the public network interface
interfacepublicnetwork
                   Name of the public network interface
trycountnumpubnet  Number of tries before deciding that the public
                   interface is down
trysecintpubnet    Interval between one try and the next to see if the
                   public interface is down
trymessagepubnet   Message to display in the log for the public interface
                   test
node1macaddress    MAC address of the public interface of node1
node2macaddress    MAC address of the public interface of node2; needed
                   for Linux, not for FreeBSD
numvip             Number of virtual IP addresses to activate on the
                   start of the service
useexternalvipfile Set to "on" if you want to read the VIP addresses from
                   an external file
externalvipfile    File containing the VIP addresses; if you use an
                   external file, put one IP per line
vipdeflinux        Set the IP, netmask, broadcast, interface, and
                   interface number for the virtual IP addresses.
                   If unused, leave the ip variable unset. If you use it,
                   also set the other variables (netmask, broadcast,
                   interface, interfacenumber). You have to set it when
                   useexternalvipfile="off".
vipdeffreebsd      The same for FreeBSD
vipdefsolaris      The same for Solaris

OPERATIVE MANUAL

Clusterix 4.6...
Usage: /opt/clusterix1/clusterixwsd.sh {start|stop|startforeground|startall|stopall|status|initialize|startservice|stopservice|stopcluster|stopclusterhere|startcontrol|stopcontrol|statuscontrol|writedatenow|writeactive|version}

start: start the cluster in background: service and check of the quorum device.
stop: stop the cluster: service and check of the quorum device.
status: retrieve status information of the 2 nodes.
initialize: initialize the quorum device.
startforeground: start the cluster in foreground: service and check of the quorum device.
startcontrol: start the processes that control the availability of the service.
stopcluster: stop the cluster system without stopping the service, on both nodes.
stopclusterhere: stop the cluster system without stopping the service, on this node.
stopcontrol: stop the processes that control the availability of the service.
statuscontrol: status of the processes that control the availability of the service.
startservice: start only the service, not the cluster system.
startserviceifnotactive: start only the service, not the cluster system, and only if the node is not active.
stopservice: stop only the service, not the check of the quorum device.
stopall: stop the service, the check of the quorum device, and the processes that control the availability of the service.
startall: start the service, the check of the quorum device, and the processes that control the availability of the service. Use this only if the cluster is down on the other node as well; otherwise use /opt/clusterix1/clusterixwsd.sh start.
remote {start|stop|startservice|stopservice}: start, stop, startservice, stopservice on the other node.
writeactive {yes|no}: write yes or no for the status on the quorum device.
writedatenow: write the current date on the quorum device.
version: program version.

clusterixwsd.sh initialize         - Initialize status information files
clusterixwsd.sh start              - Start cluster on this node
clusterixwsd.sh remote start       - Start cluster on the other node
clusterixwsd.sh stop               - Stop cluster on this node
clusterixwsd.sh remote stop        - Stop cluster on the other node
clusterixwsd.sh stopcluster        - Stop cluster system without stopping the service
clusterixwsd.sh status             - Check the status of the cluster
clusterixwsd.sh startservice       - Start only the service without the cluster (for emergency)
clusterixwsd.sh writeactive yes|no - Write yes or no in the block that contains the active information on the status file
clusterixwsd.sh writedatenow       - Write the date in the block that contains the timestamp information on the status file