Monday, 31 March 2014

Live Kernel Patching Solutions

A large public cloud is quite a complex thing to manage, even with toolsets such as Openstack to deploy and manage it.  You often have many customers with compute instances on a single server, so security is a high priority.  This also introduces an interesting problem: when you need to deploy a Linux kernel security patch to the host (sometimes thousands of hosts), how do you do it with the least disruption?

One of the most widely accepted methods at the moment is to suspend the instances on a box, deploy the patch, reboot the host and resume the instances.  Whilst this works, you still have customers who have had their instances down for X minutes (even if it is planned and the customer is notified it can be inconvenient) and instance kernels that suddenly think "my clock just jumped, WTF just happened?".  This in turn can cause problems for long-running clients talking to servers as well as any time-sensitive applications.  Long-running clients will often think they are still connected to the server when they really are not (there are ways around this with TCP Keepalive).

There are a couple of old solutions to this and a couple of new ones, and as part of my work for HP's Advanced Technology Group I will be taking a deep dive into them in the coming weeks.  For now, here is a quick summary of what is around:

kexec


This is probably the oldest technology on the list.  It doesn't quite fall under the "Live Kernel Patching" umbrella but is close enough that it warrants a mention.  It works by basically ejecting the current kernel and userspace and starting a new kernel.  It effectively reboots the machine without a POST and BIOS/EFI initialisation.  Today this only really shaves a few seconds off the boot time and can leave hardware in an inconsistent state.  Booting a machine using the BIOS/EFI sets the hardware up in an initialised state; with kexec the hardware could be in the middle of reading/writing data at the point the new kernel is loaded, causing all sorts of issues.

Whilst this solution is very interesting, I personally would not recommend using it, as during a mass deployment you are likely to see failures.  More information on kexec can be found on the Wikipedia entry.

Ksplice


Ksplice was really the first toolset to implement Live Kernel Patching.  It was created by several MIT students who subsequently spun off a company from it which supplied patches on a subscription model.  In 2011 this company was acquired by Oracle and since then there have been no more official Open Source releases of the technology.  Github trees which have been updated to work with current kernels still exist.

The toolset works by taking a kernel patch and converting it into a module which applies the changes to the kernel's functions without requiring a reboot.  It also supports changes to data structures with some additional developer code.  It does temporarily pause the kernel whilst the patch is being applied (a very quick process), but this is far better than rebooting and should mean that the instances do not need suspending.

kpatch


Both Red Hat and SUSE realised that a more modern Open Source solution was needed for the problem, and whilst SUSE announced their solution (kGraft) first, Red Hat's kpatch was the first to show actual code.

Red Hat's kpatch solution gives you a toolset which creates a binary diff of a kernel object file before and after a patch has been applied.  It then turns this into a kernel module which can be applied to any machine running the same kernel (along with kpatch's own loaded module).  Like Ksplice, it does need to pause the kernel whilst patching the functions.  It also does not yet support changes to data structures.

It is still very early days for this solution but development has been progressing rapidly.  I believe the intention is to create a toolset that will take a unified diff file and turn it into a kpatch module automatically.

For more information on this solution take a look at their blog post and Github repository.

kGraft


SUSE Labs announced kGraft earlier this year but only very recently produced code to show their solution.

From the documentation I've seen so far their solution appears to work in a similar way to Red Hat's, but it has the unique ability to apply the patch to the kernel without pausing it.  Both the old and the replacement functions can exist at the same time: executions already in the old function will finish using it and new executions will use the new function.

This solution has gone down the route of bundling the code on top of a Linux kernel git tree, which means it took an entire night for me to download the git history.  I'm looking forward to digging through the code of this solution to see how it works.

The git tree for this can be found in the kgraft kernel repository (make sure to check out the origin/kgraft branch after you have cloned it) and SUSE's site on the technology can be found here.

Summary


All three of the live patching solutions above are very interesting.  Combined with a deployment technology such as Salt or Ansible they could mean the end of maintenance downtime for cloud compute instances.  As soon as I have done more research on the technologies I will be writing more details and hopefully even contributing where possible.

Saturday, 22 February 2014

Cloud Users: Don't put your eggs in one basket

In my previous blog post I briefly mentioned that a premature cloud account removal meant that the Drizzle project lost a lot of data, including backups.  I'll go into a bit more detail here so cloud users can learn from our mistakes.  I will try to avoid using cloud puns here as much as possible :)

As with everything in life, clouds fail.  No cloud I have seen so far can claim 100% uptime since hitting GA.  However much projects like Openstack simplify things for the operators of clouds, they are complex architectures and things can fail.

But there is one element of human failure I've seen several times from multiple cloud vendors, and it is the problem that crippled Drizzle: account deletion.

Most of Drizzle's websites and testing framework were running from a cloud account using compute resources.  Backups were made and automatically uploaded to the cloud file storage on the same account for archiving.  This was our mistake (and we have definitely learnt from it).  We knew that at some point in the future the cloud accounts used would be migrated to a different cloud and the current cloud account terminated.  Unfortunately the cloud account used was terminated prematurely.  This meant that all compute instances and file storage were instantly flushed down the toilet.  All our sites and backups were instantly destroyed.

This is not the only time I have seen this happen.  There have been two other instances I know of in the last year where an accidental deletion of a cloud account has meant that all data including backups were destroyed.  Luckily in both those cases the damage was relatively minor.  I actually also lost a web server due to this problem around the same time as Drizzle was hit.

The Openstack CI team do something quite clever but relatively simple to mitigate these problems and continue running.  They use multiple cloud vendors (last I checked it was HP Cloud and Rackspace).  When your commit is being tested in Jenkins it goes to a cloud compute instance in whichever cloud is available at the time.  So if a vendor goes down for any reason the CI can still continue.

I highly recommend a few things to any users of the cloud.  You should:

  1. Make regular offsite backups (and verify them)
  2. If uptime is important, use multiple cloud providers
  3. Use Salt, Ansible or similar technology so that you can quickly spin your cloud instances up again to your requirements at a moment's notice

Patrick Galbraith, a Principal Engineer who works with me at HP's Advanced Technology Group, is currently working on a way to enhance libcloud to work better with HP Cloud so that it is easy to seamlessly use multiple clouds.  We are also working on several enhancements to Salt and Ansible.  Both are very promising technologies when it comes to cloud automation.
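
To give a flavour of what using more than one provider looks like, here is a minimal libcloud sketch.  The provider constants, credentials and the simple loop are illustrative placeholders only (this is not the HP Cloud work mentioned above), and real drivers may need extra arguments such as a region or auth URL:

from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

# Placeholder credentials for two hypothetical accounts; any supported
# libcloud provider constant could be used here.
clouds = [
    (Provider.RACKSPACE, "rax-user", "rax-api-key"),
    (Provider.EC2, "aws-access-key", "aws-secret-key"),
]

drivers = [get_driver(provider)(user, key) for provider, user, key in clouds]

# Survey every provider so one outage (or one deleted account) does not
# take your whole estate with it.
for driver in drivers:
    for node in driver.list_nodes():
        print(node.name, node.state)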

The way I see it, no one should be putting all their cloud eggs in one basket.

Is Drizzle dead?

Yesterday someone opened a Launchpad question asking "is Drizzle dead?".  I have answered that question on Launchpad but wanted to blog about it to give a bit of background context.

As I am sure most of the people who read this know, Drizzle is an Open Source database which was originally forked from the alpha version of MySQL 6.0.  At the time it was an extremely radical approach to Open Source development: many features were stripped out and re-written as plugins to turn it into a micro-kernel style architecture.  Every merge request was automatically thoroughly tested on several platforms for regressions, memory leaks and even positive/negative performance changes.

In fact Drizzle has influenced many Open Source projects today.  Openstack's Continuous Integration was born from the advanced testing we did on Drizzle.  MariaDB's Java connector was originally based on Drizzle's Java connector.  Even MySQL itself picked up a few things from it.

Development of Drizzle started off as a "What if?" R&D project at Sun Microsystems spearheaded by Brian Aker.  Once Oracle acquired Sun Microsystems a new corporate sponsor was found for Drizzle: Rackspace.

Rackspace hired all the core developers (and that is the point where I joined) and development progressed through to the first GA release of Drizzle.  Unfortunately Rackspace decided to no longer sponsor the development of Drizzle and we had to disband.  I've heard many reasons for this decision, but I don't want to reflect on them here; I just want to thank Rackspace for that time.

Where are we now?  Of the core team whilst I was at Rackspace:

So, back to the core question: "Is Drizzle dead?".  The core team all work long hours in our respective jobs to make some awesome Open Source products, and in what little spare time we have we all work on many other Open Source projects.  Unfortunately splitting our time to work on Drizzle is hard, so the pace has dramatically slowed.  But it isn't dead.  We have been part of Google Summer of Code, we still get commits in from all over the place and Drizzle is still part of the SPI.

Having said this, Drizzle no longer has a corporate sponsor.  Whilst Drizzle can live and go on without one, it is unlikely to thrive without one.

Another thing that is frequently asked is: "What happened to the docs and wiki?".  Drizzle, being a cloud database, had all of its development and public documentation servers hosted in the cloud.  Unfortunately the kill switch was accidentally hit prematurely on the cloud account used.  This means we not only lost the servers but also the storage space being used for backups.  This also affected other Open Source projects such as Gearman.  The old wiki is dead; we cannot recover that content.  The docs were auto-generated from the reStructuredText documentation in the source; they were just automatically compiled and rendered for easy reading.

What I would personally like to see is the docs going to Read The Docs automatically (there is an attempt to do this, but it is currently failing to build) and the main site moved to DokuWiki similar to the new Gearman site.

As for Drizzle itself...  It was in my opinion pretty much exactly what an Open Source project should be, and indeed was developing into what I think an Open Source database should be.  It just needs a little sponsorship and a core team that is paid to develop it and mentor others who wish to contribute.  Given that it was designed from the ground up to be a multi-tenant in-cloud database (perfect for a DBaaS) I suspect that could still happen, especially now that projects like Docker are emerging for it to sit on.

Thursday, 13 February 2014

Caveats with Eventlet

The Stackforge Libra project, as with most Openstack-based projects, is written in Python.  As anyone who has used Python before probably knows, Python has something called the GIL (Global Interpreter Lock).  The GIL basically causes Python to execute only one thread at a time, context switching between the threads.  This means you can't really use threads for CPU-bound performance gains in Python.
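
To see the GIL in action, here is a small, self-contained experiment (nothing here is specific to Libra).  On CPython the threaded run is usually no faster than the serial one because only one thread executes bytecode at a time:

import threading
import time

def burn():
    # Pure CPU work; the GIL means only one thread can execute this at once.
    count = 0
    for _ in range(10000000):
        count += 1

start = time.time()
burn()
burn()
print("two calls in one thread:  %.2fs" % (time.time() - start))

start = time.time()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Typically about the same as (or slower than) the serial timing above.
print("two calls in two threads: %.2fs" % (time.time() - start))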

One solution to get a little more performance is to use Eventlet.  Eventlet is a library which uses what are called "Green Threads" and hacks on top of the networking libraries to give a multi-threaded feel to an application.  As part of this blogging series for HP's Advanced Technology Group I'll write about some of the things I found out the hard way about Eventlet, so hopefully you don't hit them.

What are Green Threads?


Green Threads are basically a way of doing multi-tasking on a single real thread.  They use what is called "Cooperative Yielding" to allow each other to run, rather than being explicitly scheduled.  This has the advantage of removing the need for locks in many cases and making asynchronous IO easier.  But they come with caveats which can hurt if you don't know about them.
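
Here is a tiny, purely illustrative sketch of cooperative yielding with Eventlet: each green thread explicitly gives the other a chance to run by calling eventlet.sleep(0), so the output interleaves even though everything runs on one real thread:

import eventlet

def chatty(name):
    for i in range(3):
        print(name, i)
        eventlet.sleep(0)  # cooperatively yield to any other runnable green thread

a = eventlet.spawn(chatty, "a")
b = eventlet.spawn(chatty, "b")
a.wait()
b.wait()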

Threading library patched


One of the first things you typically do with Eventlet is "Monkey Patch" the standard Python library functions so that they are compatible with cooperative yielding.  For example, you want the sleep() function to yield rather than hanging all the green threads up until it has finished.

The threading library is one of the libraries that is monkey patched, and its behaviour suddenly becomes slightly different.  When you try to spawn a thread, control will not return to the main thread until the child thread has finished execution.  So your loop that tries to spawn X threads will suddenly only spawn 1 thread and not spawn the next until that thread has finished.  It is recommended you use Eventlet's green thread calls instead (which will actually work as expected).
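
A minimal sketch of the Eventlet way (illustrative, not Libra code): eventlet.spawn() returns a GreenThread immediately, so the loop really does start all ten workers before waiting on any of them:

import eventlet

def worker(n):
    eventlet.sleep(0.1)  # cooperative yield while "working"
    return n * 2

# All ten green threads are created up front; wait() collects the results.
workers = [eventlet.spawn(worker, i) for i in range(10)]
print([w.wait() for w in workers])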

Application hangs


Cooperative yielding relies on the library functions being able to yield, which means that if you use functions that do not understand this, the yielding will not happen and all your green threads (including the main thread) will hang waiting.  Any unpatched system call (such as executing some C/C++ functions) falls into this category.

A common place you can see this is with the MySQLdb library, which is a wrapper for the MySQL C connector (libmysqlclient).  If you execute some complex query that will take some time, all green threads will wait.  If your MySQL connection hangs for any reason... well, you are stuck.  I recommend using one of the native Python MySQL connectors instead.
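
You can reproduce the effect without MySQL at all; this is a sketch assuming Linux and ctypes.  The sleep happens inside libc, bypassing the monkey patching, so the ticker green thread goes silent for the whole two seconds, exactly as it would during a long libmysqlclient call:

import eventlet
eventlet.monkey_patch()

import ctypes
import ctypes.util

def ticker():
    for i in range(5):
        print("tick", i)
        eventlet.sleep(0.2)  # patched sleep: yields to other green threads

gt = eventlet.spawn(ticker)
eventlet.sleep(0)  # let the ticker start

# A raw C-level sleep cannot yield, so every green thread stalls until it returns.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.sleep(2)

gt.wait()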

Another place I have seen this is with any library that relies on epoll.  Python-gearman is an example of this.  It seems that Eventlet only patches the select() calls, so anything that uses epoll.poll() is actually blocking with Eventlet.

In summary there are cases where Eventlet can be useful.  But be careful where you are using it or things can grind to a halt really quickly.

Tuesday, 11 February 2014

Why use double-fork to daemonize?

Something that I have been asked many times is why Linux / Unix daemons double-fork when they start.  Now that I am blogging for HP's Advanced Technology Group I can try to explain this as simply as I can.

Daemons double-fork

For those who were not aware (and hopefully are now by the title), almost every Linux / Unix service that daemonizes does so by using a double-fork technique.  This means that the application performs the following:

  1. The application (parent) forks a child process
  2. The parent process terminates
  3. The child forks a grandchild process
  4. The child process terminates
  5. The grandchild process is now the daemon

The technical reason


First I'll give the documented technical reason for this, and then I'll break it down into something a bit more consumable.  POSIX.1-2008 Section 11.1.3, "The Controlling Terminal", states the following:


The controlling terminal for a session is allocated by the session leader in an implementation-defined manner. If a session leader has no controlling terminal, and opens a terminal device file that is not already associated with a session without using the O_NOCTTY option (see open()), it is implementation-defined whether the terminal becomes the controlling terminal of the session leader. If a process which is not a session leader opens a terminal file, or the O_NOCTTY option is used on open(), then that terminal shall not become the controlling terminal of the calling process.

The breakdown

When you fork and the parent of the fork exits, the new child becomes a child of "init" (the main process of the system, given a PID of 1).  You may find others state on the interweb that the double-fork is needed for this to happen, but that isn't true; a single fork will do this.

What we are actually doing is a kind of safety thing to make sure that the daemon is completely detached from the terminal.  The real steps behind the double-fork are as follows:
  1. The parent forks the child
  2. The parent exits
  3. The child calls setsid() to start a new session with no controlling terminals
  4. The child forks a grandchild
  5. The child exits
  6. The grandchild is now the daemon
The reason we do steps 4 & 5 is that the child, being a session leader, could still acquire a controlling terminal (for example by opening a terminal device); the forked grandchild is not a session leader, so it cannot do this.
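
For anyone who prefers reading code, here is a minimal sketch of those steps in Python using os.fork() and os.setsid() (a real daemon would also redirect stdin/stdout/stderr and handle errors):

import os
import sys

def daemonize():
    # Steps 1 & 2: fork and let the parent exit; the child is re-parented
    # to init and is guaranteed not to be a process group leader.
    if os.fork() > 0:
        sys.exit(0)

    # Step 3: start a new session; the child becomes the session leader
    # with no controlling terminal.
    os.setsid()

    # Steps 4 & 5: fork again and let the session leader exit; the
    # grandchild is not a session leader, so it can never acquire a
    # controlling terminal.
    if os.fork() > 0:
        os._exit(0)

    # Step 6: the grandchild is now the daemon; typical housekeeping follows.
    os.chdir("/")
    os.umask(0)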

Put simply, it is a roundabout way of completely detaching the daemon from the terminal that started it.  It isn't a strict requirement to do this at all; many modern init systems can daemonize a process that stays in the foreground quite easily.  But it is useful for systems that can't do this and for anything that at some point is expected to be run without an init script.

Why VLAIS is bad

At the beginning of the month I was at FOSDEM, interfacing and watching talks on behalf of HP's Advanced Technology Group.  Since my core passion is working on and debugging C code I went to several talks on Clang, Valgrind and other similar technologies.  Unfortunately there were several talks I couldn't get into due to their sheer popularity, but I hope to catch up with the videos in the future.

One talk I went to was on getting the Linux kernel to compile with Clang.  It appears that many changes are needed, which come down to Clang being a minimum of C99 compliant and GCC supporting some non-standard language extensions.

The one language extension which stood out for me is called VLAIS, which stands for Variable Length Arrays In Structs.  Now, VLAs (Variable Length Arrays) are nothing new in C; they have been around a long time.  What we are talking about here are variable length arrays at any point in a struct.  For example:

void foo(int n) {
    struct {
        int x;
        char y[n];  /* a variable length array in the middle of the struct: the VLAIS extension */
        int z;
    } bar;
}

The char array y in this struct is what we are talking about here.  This kind of code is used in several places in the kernel; for the most part it is used in encryption algorithms.

How did it get added to GCC?


It came about around 2004 in what appears to be a conversion of a standard from Ada to C.  There is a mailing list post on it here.  It has since been used in the kernel, and I can understand the argument that the kernel was never intended to be compiled with anything other than GCC.  But the side of me that likes openness and portability is not so keen.  I suspect the problems that currently plague the kernel for Clang also affect compilers such as Intel's ICC and other CPU manufacturers' native compilers.

So why is it bad?


Well, to start with there is the portability issue.  Any code that uses this will not compile with other compilers.  If you turn on the pedantic C99 flags in GCC it won't compile either (if you aren't doing this then you really should; it shakes out lots of bugs).  Once Linus Torvalds found out about its usage in the kernel he called it an abomination and asked for it to be dropped.

Next there is debugging.  I'm not even sure if debuggers understand this, and if they do I can well imagine it being difficult to work with, especially if you need to track z in the example above.

There is also the possibility of alignment issues.  Some architectures work much better when struct members are aligned to a certain byte width.  This will be difficult to achieve with a VLAIS in the middle of the struct.

In general I just don't think it is clean code, and if it were me I would be using pointers and allocated memory instead.

Of course this is all my opinion and I'm sure people have other views that I haven't thought of.  Please use the comments box to let me know what you think about VLAIS and its use in the kernel.  You can find out more in the Linux Plumbers Conference 2013 slides.

Monday, 10 February 2014

HAProxy logs byte counts incorrectly

Continuing my LBaaS look-back series of blog posts for HP's Advanced Technology Group, I am today looking into an issue that tripped us up with HAProxy.

Whilst we were working with HAProxy we naturally had many automated tests going through a Jenkins server.  One such test checked that the byte count in the logs tallied with the bytes received; this was to be used for billing purposes.

Unfortunately we always found our byte counts a little off.  At first we found it was due to dropped log messages.  Even after this problem was solved we were still not getting an exact tally.

After some research and reading of the code I found out that, despite what the manual says, the outgoing byte count is measured from the backend server to HAProxy, not the bytes leaving HAProxy.  This means that injected headers are not in the byte count, and if HAProxy is doing HTTP compression for you the count will be way off.

My findings were backed by this post from the HAProxy developer.

On average every log entry for us was off by around 30 bytes due to injected headers and cookies.

Given the link above, this appears to be something the developer is looking into, but I doubt it will be a trivial fix.