Thursday, August 14, 2014

Script to manage MaxScale

MaxScale 1.0 from SkySQL is now in Beta and there are some cool features in it, I guess some adventurous people has already put it into production. There are still some rough edges and stuff to be fixed, but it is clearly close to GA. One thing missing though are something to manage starting and stopping MaxScale in a somewhat controlled way, which is what this blog is all about.

I have developed two simple scripts that should help you manage MaxScale in a reasonable way, but before we go into the actual scripts, there are a few things I need to tell you. To begin with, if you haven't yet downloaded MaxScale 1.0 beta, you can get it from MariaDB.com, just go to Resources->MaxScale and to get to the downloads you first need to register (which is free). Here are downloads to rpms and source, but if you are currently looking for a tarball, there seems to be none, well actually there is, the first link under "Source Tarball" actually is a binary tarball. I have reported this so by the time you read this, this might have been fixed. Of course you can always get the source from github and build it yourself.

Anyway, for MaxScale to start, you need a configuration file and you have to set the MaxScale home directory. If you are on CentOS or RedHat and install the rpms (which is how I set it us), MaxScale is installed in /usr/local/skysql/maxscale, and this is also what MAXSCALE_HOME needs to be set to. MaxScale can take a configuration file argument, but if this isn't passed, then MAXSCALE_HOME/etc/MaxScale.cnf will be used. In addition, you will probably have to add MAXSCALE_HOME/lib to your LD_LIBRARY_PATH variable.

All in all, there some environment variables to set before we can start MaxScale, and this is the job of the first script, maxenv.sh. For this to work, it has to be placed in the bin directory under MAXSCALE_HOME. In this script we also set MAXSCALE_USER, and this is used by MaxScale start / stop script to be explained later, and this is the linux user that will run MaxScale. You can set this to an empty string to run maxscale as the current user, which is the normal way that MaxScale runs, but in that case you need to make sure that the user in question has write access to MAXSCALE_HOME and subdirectories.

So, here we go, here is maxenv.sh, and you can copy this into the bin directory under your MAXSCALE_HOME directory and use it like that (note that I set the user to run MaxScale to mysql, so if you don't have that user, then create that or modify maxenv.sh accordingly):
#!/bin/bash
#
export MAXSCALE_HOME=$(cd `dirname ${BASH_SOURCE[0]}`/..; pwd)
export MAXSCALE_USER=mysql
PATH=$PATH:$MAXSCALE_HOME/bin
# Add MaxScale lib directory to LD_LIBRARY_PATH, unless it is already there.
if [ `echo $LD_LIBRARY_PATH | awk -v RS=: '{print $0}' | grep -c "^$MAXSCALE_HOME/lib$"` -eq 0 ]; then
   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MAXSCALE_HOME/lib
fi
export MAXSCALE_PIDFILE=$MAXSCALE_HOME/log/maxscale.pid


Now it's time for maxctl, which is the script that starts and stops MaxScale, and this also must be placed in the MAXSCALE_HOME/bin directory, and note that this script relies on maxenv.sh above, so to use maxctl you also need maxenv.sh as above. The script is rather long, but there are probably a thing or two missing anyway, but for me this has been useful:

#!/bin/bash
#
# Script to start and stop MaxScale.
#
# Set up the environment
. $(cd `dirname $0`; pwd -P)/maxenv.sh

# Set default variables
NOWAIT=0
HELP=0
QUIET=0
MAXSCALE_PID=0

# Get pid of MaxScale if it is running.
# Check that the pidfile exists.
if [ -e $MAXSCALE_PIDFILE ]; then
   MAXSCALE_PID=`cat $MAXSCALE_PIDFILE`
# Check if the process is running.
   if [ `ps --no-heading -p $MAXSCALE_PID | wc -l` -eq 0 ]; then
      MAXSCALE_PID=0
   fi
fi


# Function to print output
printmax() {
   if [ $QUIET -eq 0 ]; then
      echo $* >&2
   fi
}

# Function to print help
helpmax() {
    echo "Usage: $0 start|stop|status|restart"
    echo "Options:"
    echo "-f - MaxScale config file"
    echo "-h - Show this help"
    echo "-n - Don't wait for operation to finish before exiting"
    echo "-q - Quiet operation"
}


# Function to start maxscale
startmax() {
# Check if MaxScale is already running.
   if [ $MAXSCALE_PID -ne 0 ]; then
      printmax "MaxScale is already running"
      exit 1
   fi

# Check that we are running as root if a user to run as is specified.
   if [ "x$MAXSCALE_USER" != "x" -a `id -u` -ne 0 ]; then
      printmax "$0 must be run as root"
      exit 1
   fi

# Check that we can find maxscale
   if [ ! -e $MAXSCALE_HOME/bin/maxscale ]; then
      printmax "Cannot find MaxScale executable ($MAXSCALE_HOME/bin/maxscale)"
      exit 1
   fi

# Check that the config file exists, if specified.
   if [ "x$MAXSCALE_CNF" != "x" -a ! -e "$MAXSCALE_CNF" ]; then
      printmax "MaxScale configuration file ($MAXSCALE_CNF) not found"
      exit 1
   fi

# Start MaxScale
   if [ "x$MAXSCALE_USER" == "x" ]; then
      $MAXSCALE_HOME/bin/maxscale -c $MAXSCALE_HOME ${MAXSCALE_CNF:+-f $MAXSCALE_CNF}
   else
      su $MAXSCALE_USER -m -c "$MAXSCALE_HOME/bin/maxscale -c $MAXSCALE_HOME ${MAXSCALE_CNF:+-f $MAXSCALE_CNF}"
   fi
}


# Function to stop maxscale
stopmax() {
   NOWAIT=1
   if [ "x$1" == "-n" ]; then
      NOWAIT=0
   fi

# Check that we are running as root if a user to run as is specified.
   if [ "x$MAXSCALE_USER" != "x" -a `id -u` -ne 0 ]; then
      printmax "$0 must be run as root"
      exit 1
   fi

# Check that the pidfile exists.
   if [ ! -e $MAXSCALE_PIDFILE ]; then
      printmax "Can't find MaxScale pidfile ($MAXSCALE_PIDFILE)"
      exit 1
   fi
   MAXSCALE_PID=`cat $MAXSCALE_PIDFILE`

# Kill MaxScale
   kill $MAXSCALE_PID

   if [ $NOWAIT -ne 0 ]; then
# Wait for maxscale to die.
      while [ `ps --no-heading -p $MAXSCALE_PID | wc -l` -ne 0 ]; do
      usleep 100000
      done
      MAXSCALE_PID=0
   fi
}


# Function to show the status of MaxScale
statusmax() {
# Check that the pidfile exists.
   if [ $MAXSCALE_PID -ne 0 ]; then
      printmax "MaxScale is running (pid: $MAXSCALE_PID user: `ps -p $MAXSCALE_PID --no-heading -o euser`)"
      exit 0
   fi
   printmax "MaxScale is not running"
   exit 1
}

# Process options.
while getopts ":f:hnq" OPT; do
   case $OPT in
      f)
         MAXSCALE_CNF=$OPTARG
         ;;
      h)
         helpmax
         exit 0
         ;;
      n)
         NOWAIT=1
         ;;
      q)
         QUIET=1
         ;;
      \?)
         echo "Invalid option: -$OPTARG"
         ;;
   esac
done

# Process arguments following options.
shift $((OPTIND - 1))
OPER=$1

# Check that an operation was passed
if [ "x$1" == "x" ]; then
   echo "$0: your must enter an operation: start|stop|restart|status" >&2
   exit 1
fi


# Handle the operations.
case $OPER in
   start)
      startmax
      ;;
   stop)
      stopmax
      ;;
   status)
      statusmax
      ;;
   restart)
      if [ $MAXSCALE_PID -ne 0 ]; then
         NOWAITSAVE=$NOWAIT
         NOWAIT=0
         stopmax
         NOWAIT=$NOWAITSAVE
      fi
      startmax
      ;;
   *)
      echo "Unknown operation: $OPER. Use start|stop|restart|status"
      exit 1
esac


To use this script, you call it with one of the 4 main operations: start, stop, restart or status. If you are to run as a specific user, you have top run it as root (or using sudo):
sudo /usr/local/skysql/maxscale/maxctl start
Also, there are some options you can pass:
  • -h - Print help
  • -f - Run with the spcified configuration file instead of MAXSCALE_HOME/etc/MaxScale.cnf
  • -n - Don't want for stop to finish before returning. Stopping MaxScale by sending a SIGTERM is how it is to be done, but it takes a short time for MaxScale to stop completely. By default maxctl will wait until MaxScale is stopped before returning, by passing this option MaxScale will return immediately though.
  • -q - Quiet operation
/Karlsson

Monday, August 4, 2014

What is HandlerSocket? And why would you use it? Part 1

HandlerSocket is included with MariaDB and acts like a simple NoSQL interface to InnoDB, XtraDB and Spider and I will describe it a bit more in this and a few upcoming blogs.

So, what is HandlerSocket? Adam Donnison wrote a great blog on how to get started with it, but if you are developing MariaDB applications using C, C++, PHP or Java what good does HandlerSocket do you?

HandlerSocket in itself is a MariaDB plugin, of a type that is not that common as is is a daemon plugin. Adam shows in his blog how to enable it and install it, so I will not cover that here. Instead I will describe what it does, and doesn't do.

A daemon plugin is a process that runs "inside" the MariaDB. A daemon plugin can implement anything really, as long as it is relevant to MariaDB. One daemon plugin is for examples the Job Queue Daemon and another then is HandlerSocket. A MariaDB Plugin has access to the MariaDB internals, so there is a lot of things that can be implemented in a daemon plugin, even if it is a reasonable simple concept.

Inside MariaDB there is an internal "handler" API which is used as an interface to the MariaDB Storage Engines, although not all engines supports this interface. MariaDB then has a means to "bypass" the SQL layer and access the Handler interface directly, and this is done by using the HANDLER commands with MariaDB. Note that we are not talking about a different API to uise the handler commands, instead they are executed using the same SQL interface as you are used it's, it's just that you use a different set of commands. One limitation of these commands is that they only allow reads, even though the internal Handler interface supports writes as well as reads.

When you use the HANDLER commands, you have to know not only the tables name that you access, but also the index you will be using when you access that table (assuming then you want to use an index, but you probably do). A simple example of using the HANDLER commands follows here:

# Open the orders table, using the alias o
HANDLER orders OPEN AS o;

# Read an order record using the primary key.
HANDLER o READ `PRIMARY` = (156);

# Close the order table.
HANDLER o CLOSE;

In the above example, I guess this is just overcomplication something basically simple, as all this does is the same as
SELECT * FROM orders WHERE id = 156;

The advantage of using the handler interface though is performance, for large datasets using the Handler interface is much faster.

All this brings up three questions:
  • First, if we are bypassing the SQL layer, by using the HANDLER Commands, would it not be faster to bypass the SQL level protocol altogether, and just use a much simple protocol?
  • Secondly, while we are at it, why don't we allow writes, as this is the biggest issues, we can always speed reads by scaling out anyway?
  • And last, how much faster is this, really?
The answer to the first two questions then is the Handler Socket plugin, which was the whole deal with this blog, as this use a separate, very simple, protocol and allows writes! For the third question, this is where I come in, I have done some simple INSERT style benchmarks using HandlerSocket, and I have some results for you in my next blog. So don't touch that dial, I'll be right back!

/Karlsson

Friday, July 11, 2014

MariaDB Replication, MaxScale and the need for a binlog server

Introduction

This is an introduction to MariaDB Replication and to why we need a binlogs server and what this is. The first part is an introduction to replication basics, and if you know this already, then you want want to skip past the first section or two.

MariaDB Replication

MySQL and MariaDB has a simple but very effective replication system built into it. The replication system is asynchronous and is based on a pull, instead of a push, system. What this means in short is that the Master keeps track of the DML operations and other things that might change the state of the master database and this is stored in what is called the binlog. The slave on the other hand is responsible for getting the relevant information from the master to keep up to speed. The binlogs consist of a number of files that the master generates, and the traditional way of dealing with slaves is to point them to the master, specifying a starting point in the binlogs consisting of a filename and a position.

When a slave is started it gets the data from the binlogs, one record at the time, from the given position in the master binlogs and in the process updates the current binlogs file and position. So the master keep track of the transactions and the slave follows behind as fast as it can. The slave has two types of threads, the IO thread that gets data from the master and to a separate relay log on the slave, and an SQL thread that applies the data from the relay log to the slave database.
This really is less complicated than it sounds, in a way, but the implementation of it on the other hand is probably more complicated than one might think. There are also some issues with this setup, some which is fixed by the recent GTID implementation in MySQL 5.6 and more significantly in MariaDB 10.

In a simple setup with just 1 master and a few slaves, this is all there is to it: Take a backup from the master and take to keep track of the binlogs position when this is done, then recover this backup to a slave and then set the slave starting position at the position when the backup was executed. Now the slave will catch up with the operations that has happened since the backup was run and eventually it will catch up with the master and then poll for any new events.

One that that is not always the case with all database systems is that the master and the slave are only different in the sense of the configuration. Except this, these servers run the same software and the same operations can be applied to them, so running DML on a slave is no different than on a master, but if there is a collision between some data that was entered manually on the slave and some data the arrives through replication, say there is a duplicate primary key, then replication will stop.

What do we use replication for

Replication is typically used for one or more of three purposes:
•    Read Scale-out - In this case the slaves are used for serving data that is read, whereas all writes go to the master. As in most web applications there is a much larger amount of reads than writes, this makes for good scalability and we have fewer writes to be handled by the single master, whereas the many reads can be server by one or more slaves.
•    Backup - Using a slave for backup is usually a pretty good idea. This allows for cold backups even, as if we shut down the slave for backing it up, it will catch up with the master once restarted after the backup is done. Having a full database setup on a slave for backup also means that recovery times of we need to do that, is fast and also allows for partial recovery if necessary.
•    High Availability - As the master and Slave are kept in sync, at least with some delay, one can sure use the slave to fail over to should the master fail. The asynchronous nature of replication does mean though that failover is a bit more complex than one might think. There might be data in the relay slave log that has not yet been applied and might cause issues when the slave is treated as a master. Also, there might be data in the binlogs on the master which means that when the master is again brought on-line, it might be out of sync with the slave: some data that never was picked up by the slave might be on the master and data that entered the system after the slave was switched to a master is not on the old master.

The nature of replicated data

In the old days, the way replication worked was by just sending any statement that modified data on the master to the slave. This way of working is still available, but over time it was realized that this was a bit difficult, in particular with some storage engines.

This led to the introduction of Row-based Replication (RBR) where the data to be replicated is transferred not as a SQL statement but as a binary representation of the data to be modified.

Replicating the SQL statement is called Statement Based Replication (SBR). An example of a statement that can cause issues when using SBR is:
DELETE FROM test.tab1 WHERE id > 10 LIMIT 5
In this case RBR will work whereas when using statement based replication we cannot determine which rows will be deleted. There are more examples of such non-deterministic SQL statements where SBR fails but RBR works.

A third replication format mode is available, MIXED, where MariaDB decided ona statement by statement base which replication format is best.

Scaling replication

Eventually many users ended up having many slaves attached to that single master. And for a while, this was not a big issue, the asynchronous nature of things means that the load on the master was limited when using replication, but with enough nodes, eventually this turned in to being an issue.

The solution then was to introduce an "intermediate master". This is a slave that is also a master to other slaves, and this is configured having log_slave_updates on, which means that data that is applied on the slave from the relay log and into the slave, are also written to the binlog.

This is a pretty good idea, but there are some issues also. To begin with, on the intermediate master, data has to be written several times, once in the relay log, once in the database (and if InnoDB is used, a transaction log is also written) and then we have to write it to the binlog.

Another issue that is in effect here is the single threaded nature of replication (this is different in MariaDB 10 and MySQL 5.7 and up), which means that a slave on a master that runs many threads, might get into a situation where the slave can't keep up with the master, even though the slave is similarly configured as the master. Also, a run running statement on the master will hold up replication for as long as that statement runs on the slave, and if we have an intermediate master, then the delay will be doubled (once on the intermediate master and once on the actual slave).

The combined effect of the duplication of the delay and the requirement to write data so many times, leads to the result that an intermediate master maybe isn't such a good idea after all.
As for the replication use-cases, intermediate masters are sometimes used as alternative masters when failing over. This might seems like a good idea, but the issue is that the binlogs on the intermediate master doesn't look the same as the binlogs on the actual master. This is fixed by using Global Transaction IDs though, but these have different issues and unless you are running MariaDB 10 or MySQL 5.6, this isn't really an option (and even with MySQL 5.6, there are big issues with this).

What we need then is something else. Something that is a real intermediate master. Something that looks like a slave to the master and as a master to the slave, but doesn't have to write data three times first and that doesn't have to apply all the replication data itself so it doesn't introduce delays into the replication chain.

The slave that attaches to this server should see the same replication files as it would see it it connected to the real master.

MaxScale and the Plugin architecture

So let's introduce MaxScale then, and the plugin architecture. MaxScale has been described before, but one that that might not be fully clear is the role of the plugins. MaxScale relies much more on the plugins that most other architectures, fact is, without the plugins, MaxScale can't do anything, everything is a plugin!

The MaxScale core is a multi-threaded epoll based kernel with 5 different types of plugins (note that there might be more than one plugin of each type, and this is mostly the case actually:
•    Protocols - These implement communication protocols, including debugging and monitoring protocols. From this you realize that without appropriate protocol plugins, MaxScale will not be able to be accessed at all, so these modules are key. Among the current protocols are MariaDB / MySQL Client and Server protocols.
•    Authentication - This type of plugin authenticates users connecting to MaxScale. Currently MariaDB / MySQL Authentication is supported.
•    Router - This is a key type of module that determines how SQL traffic is routed an managed.
•    Filter - This is an optional type of pluging there the SQL traffic can be modfied, checked or rejected,
•    Monitor - This type of modules is there to monitor the servers that MaxScale connects to, and this data is used by the routing mode.

Before we end this discussion on MaxScale, note that there might be several configurations through one single MaxScale setup, so MaxScale can listen to one prot for one set up servers and routine setup, and on another port for a different setup.

With this we have an idea how MaxScale work, so let's see if we can tie it all up.

MaxScale as a Binlog server

As can be seen from the description of MaxScale a lot of what is needed to create a Binlog server to use as an intermediate server for slaves is there. What is needed is a router module that acts as a slave to the assigned master, downloads the binlogs from there, using the usual MariaDB / MySQL Replication protocol. This routing plugin also needs to serve the slaves with the downloaded binlogs files. In theory, and also in practice, the slaves will not know if it is connected to the real master or to MaxScale.

Using MaxScale this way as an intermediate Master, a slave that connects to the MaxScale can work from the same Binlog files and positions as when connected directly to the master, as the files are the same for all intents and purposes. There will be no extra delays for long running SQL statements as these aren't applied on MaxScale, the replication data is just copied from the master, plain and simple. As for parallel slaves, this should work better in when using MaxScale as a Binlog server, but this is yet to be tested.

So there should be many advantages to using MaxScale as a binlog server compared to using an intermediate MariaDB / MySQL server. On the other hand, this solution is not for everyone, many just doesn't drive replication that hard that the load on the master is an issue so that an intermediate Master is requited. On the other, many use an Intermediate Master also for HA, and in this case it would have be advantageous to use MaxScale instead of that Intermediate master, the latter which could still server the role as a fail-over HA server.

Now, there one issue with all of this that many of you might have spotted: That cool Binlog server plugin module for MaxScale doesn't exists. Well, I am happy to say that you are wrong, it does exist and it works. A Pilot for such a module has been developed by SkySQL together with Booking.com that had just this need for an intermediate server that wasn't just yet another MariaDB / MySQL server. For the details on the specific usecase, see the blog by Jean-François Gagné.