Thursday, 26 June 2014

CoreOS Review

I have spent a few days now playing with CoreOS and helping other members of HP's Advanced Technology Group get it running on their setups.

Today I thought I would write about the good and the bad of CoreOS so far.  Many components are in an alpha or beta state, so things may change over the coming months.  Also, as a disclaimer, the views in this post are my own and not necessarily those of HP.

Installation


As stated in my blog post yesterday, I have been using CoreOS on my Macbook Pro using Vagrant and VirtualBox.  This made it extremely easy to set up the CoreOS cluster on my Mac.  I made a minor mistake to start with, which was not configuring the unique discovery URL required for Etcd correctly.  A colleague of mine made the same mistake on his first try, so it is likely a common one to make.

I initially had VirtualBox configured to use a Mac-formatted USB drive I have hooked up.  Vagrant tried to create my CoreOS cluster there, and during the setup Vagrant's Ruby process kept spinning in a disk-read routine and never completed the setup.  Debug output didn't help find the cause, so I switched to using the internal SSD instead.

CoreOS


CoreOS itself appears to be derived from Chrome OS, which itself is a fork of Gentoo.  It is incredibly minimal; it doesn't even ship with a package manager.  But that is the whole point.  It is designed so that Docker containers run on top of it to provide the application support, almost completely isolating the underlying OS from the applications running on it.  This also provides excellent isolation between, say, MySQL and Apache in a LAMP stack.

It is a clean, fast OS using many modern concepts such as systemd and journald.  Some of these are only in the bleeding-edge distributions at the moment so many people may not be familiar with using them.  Luckily one of my machines is running Fedora 20 so I've had a play with these technologies before.

Etcd


CoreOS provides a clustered key/value store system called 'etcd'.  The name confused many people I spoke to before we tried it: we all assumed it was a clustered file store for the /etc/ path on CoreOS.  We were wrong, although that is maybe the direction it will eventually take.  Communication with etcd actually happens over a REST-based interface.
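
For example, setting and then reading back a key is just a couple of HTTP requests against any node (a rough sketch; the port and API path assume an etcd release of around this era and may differ on yours):

$ curl -L -X PUT http://127.0.0.1:4001/v2/keys/message -d value="hello"
$ curl -L http://127.0.0.1:4001/v2/keys/message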

Etcd has been created pretty much from the ground up as a new project by the CoreOS team.  The project is written in Go and can be found on Github.  Creating a reliable clustered key/value store is hard, really hard.  There are so many edge cases that can cause horrific problems.  I cannot understand why the CoreOS team decided to roll their own instead of using one of the many well-tested alternatives.

Under the hood the nodes communicate with each other using what appears to be JSON (REST) for internal admin commands and Google Protobufs over HTTP for the Raft consensus algorithm library used.  Whilst I commend them for using Protobufs in a few places, HTTP and JSON are both bad ideas for what they are trying to achieve.  JSON will cause massive headaches for protocol upgrades/downgrades, and HTTP really wasn't designed for this purpose.

At the moment this appears to be designed more for very small-scale installations than for hundreds to thousands of instances.  Hopefully at some point it will gain its own protocol based on Protobufs or similar, and have code to work with the many edge cases of weird and wonderful machine and network configurations.

Fleet


Fleet is another service written in Go and created by the CoreOS team.  It is still a very new project aimed at being a scheduler for a CoreOS cluster.

To use Fleet you basically create a systemd configuration file with an optional extra section telling Fleet what CoreOS instance types it can run on and what it conflicts with.  Fleet communicates with Etcd and, via some handshaking, figures out a CoreOS instance to run the service on.  A daemon on that instance handles the rest.  The general idea is that you use a systemd file to manage Docker instances; there is also a kind-of hack, using a separate systemd file per service, so that it will notify/re-schedule when something has failed.

Whilst it is quite simple in design it has many flaws, and for me it was the most disappointing part of CoreOS so far.  Fleet breaks, a lot.  I must have found half a dozen bugs in it in the last few days, mainly around it getting completely confused as to which service is running on which instance.

Also, the way that configurations are expressed to Fleet is totally wrong in my opinion.  Say, for example, you want ten MySQL Docker containers across your CoreOS cluster.  To express this in Fleet you need to create ten separate systemd files and send them up, even though those files are likely identical.

This is how it should work in my opinion:  You create a YAML file which specifies what a MySQL docker container is and what an Apache/PHP container is.  In this YAML you group these and call them a LAMP stack.  Then in the YAML file you specify that your CoreOS cluster needs five LAMP stacks, and maybe two load balancers.
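
A rough sketch of what I have in mind (this format is entirely hypothetical, invented purely for illustration; Fleet does not understand anything like it today):

containers:
  mysql:
    image: mysql
  web:
    image: apache-php
  lb:
    image: haproxy
groups:
  lamp:
    members: [mysql, web]
deploy:
  lamp: 5
  lb: 2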

Not only would my method scale a lot better, it would also make it possible to build a front-end interface that could accept customers.

Conclusion


CoreOS is a very ambitious project that in some ways becomes the "hypervisor"/scheduler for a private Docker cloud.  It can easily sit on top of a cloud installation or on top of bare metal.  It requires a totally different way of thinking, and I really think the ideas behind it are the way forward.  Unfortunately it is a little too early to be using it for anything more than a few machines in production, and even then it is more work to manage than it should be.

Wednesday, 25 June 2014

Test drive of CoreOS from a Mac

Yes, I am still the LinuxJedi, and whilst I have Linux machines in my office for day-to-day work the main desktop I use is Mac OS.

I have recently been giving CoreOS a spin for HP's Advanced Technology Group and so for this blog post I'm going to run through how to setup a basic CoreOS cluster on a Mac.  These instructions should actually be very similar to how you would do it in Linux too.

Prerequisites


There are a few things you need to install before you can get things going:
  1. VirtualBox for the CoreOS virtual machines to sit on
  2. Vagrant to spin-up and configure the virtual machines
  3. Git to grab the vagrant files for CoreOS
  4. Homebrew to install 'fleetctl' (also generally really useful on a developer machine)
You then need to install 'fleetctl' which will be used to spin up services on the cluster:

$ brew update
$ brew install fleetctl

Configuring


Once everything is installed the git repository for the vagrant bootstrap is needed:

$ git clone git://github.com/coreos/coreos-vagrant/
$ cd coreos-vagrant/
$ cp config.rb.sample config.rb
$ cp user-data.sample user-data

The last two commands here copy the sample configuration files to configuration files we will use for Vagrant.

Edit the config.rb file and look for the line:

#$num_instances=1

Uncomment this and set the number to '3'.  When I was testing I also enabled the 'beta' channel in this file instead of the default 'alpha' channel.  This is entirely up to personal preference and both will work with these instructions at the time of writing.
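
After editing, the relevant lines in config.rb should look something like this (the channel line is only needed if you opted for beta, and the exact variable name should match what is in your config.rb.sample):

$num_instances=3
$update_channel='beta'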

With your web browser go to https://discovery.etcd.io/new.  This will give you a token URL.  This URL will be used to help synchronise Etcd, which is a clustered key/value store in CoreOS.  In the user-data YAML file uncomment the 'discovery' line and set the URL to the one the link above gave you.  It is important that you generate a new URL for every cluster and every time you burn down and spin up a cluster on your machine, otherwise Etcd will not behave as expected.
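
The relevant part of the user-data cloud-config ends up looking roughly like this, with your own token in place of the placeholder:

#cloud-config
coreos:
  etcd:
    discovery: https://discovery.etcd.io/<your token>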

Spinning Up


At this point everything is configured to spin-up the CoreOS cluster.  From the 'coreos-vagrant' directory run the following:

$ vagrant up

This may take a few minutes as it downloads the CoreOS base image and creates the virtual machines required.  It will also configure Etcd and set up SSH keys so that you can communicate with the virtual machines.

You can SSH into any of these virtual machines using the following and substituting '01' with the CoreOS instance you wish to connect to:

$ vagrant ssh core-01 -- -A

Using Fleet


Once you have everything up and running you can use 'fleet' to send systemd configurations to the cluster.  Fleet will schedule which CoreOS machine the configuration file will run on.  Fleet internally uses Etcd to discover all the nodes of your CoreOS cluster.

First of all we need to tell 'fleetctl' on the Mac how to talk to the CoreOS cluster.  This requires two things, the IP/port and the SSH key:

$ export FLEETCTL_TUNNEL=127.0.0.1:2222
$ ssh-add ~/.vagrant.d/insecure_private_key
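
A quick way to check the tunnel is working is to list the machines in the cluster; all three CoreOS instances should show up:

$ fleetctl list-machines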

For this example I'm going to spin up three memcached Docker instances inside CoreOS.  I'll let Fleet figure out where to put them, only telling it that two memcached instances cannot run on the same CoreOS machine.  To do this, create three identical files, 'memcached.1.service', 'memcached.2.service' and 'memcached.3.service', as follows:

[Unit]
Description=Memcached server
After=docker.service

[Service]
ExecStart=/usr/bin/docker run -p 11211:11211 fedora/memcached

[X-Fleet]
X-Conflicts=memcached*

This is a very simple systemd-based unit file which will tell Docker in CoreOS to pull a Fedora-based memcached container and run it.  The 'X-Conflicts' option tells Fleet that two memcached systemd configurations cannot run on the same CoreOS instance.

To run these we need to do:

$ fleetctl start memcached.1.service
$ fleetctl start memcached.2.service
$ fleetctl start memcached.3.service

This will take a while; the units will show as started whilst Docker downloads the container image and executes it.  Progress can be seen with the following commands:

$ fleetctl list-units
$ fleetctl journal memcached.1.service

Once complete, your three CoreOS machines should each be listening on port 11211 with a Docker-hosted memcached.
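
If you want to double-check, SSH into one of the machines and list the running containers (assuming a memcached unit was scheduled onto core-01):

$ vagrant ssh core-01 -- docker ps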

Thursday, 5 June 2014

Live Kernel Patching Video Demo

Earlier today I posted my summary of testing the live kernel patching solutions so far as part of my work for HP's Advanced Technology Group.  I have created a video demo of one of these technologies, kpatch.


This demo goes over the basics of how easy kpatch is to use.  The toolset will only work with Fedora 20 for now, but judging by the documentation Ubuntu support is coming very soon.

Live Kernel Patching Testing

At the end of March I gave a summary of some of the work I was doing for HP's Advanced Technology Group on evaluating kernel patching solutions.  Since then I have been trying out each one whilst also working on other projects and this blog post is to show my progress so far.

After my first blog post there were a few comments as to why such a thing would be required.  I'll be the first to admit that when Ksplice was still around I felt it better to have a redundant setup which you could perform rolling reboots on.  That is still true in many cases.  But with OpenStack-based public clouds a reboot means that customers sitting on top of that hardware will have their virtual machines shut down or suspended.  This is not ideal when, on big iron machines, you can have 20 or more customers on one server.  A good live kernel patching solution means that in such a scenario the worst case would be a kernel pause for a few seconds.

In all cases my initial testing was to create a patch which would modify the output of /proc/meminfo.  This is very simple and doesn't go as far as testing what would happen when patching a busy function under load, but it at least gets the basics out of the way.
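
A patch for this kind of test can be as simple as a one-line change to a format string in fs/proc/meminfo.c.  Treat the following as a sketch rather than something to apply verbatim, as the exact hunk and context depend on the kernel version:

--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ (context trimmed)
-               "VmallocChunk:   %8lu kB\n"
+               "VMALLOCCHUNK:   %8lu kB\n"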

Ksplice


As previously discussed, Ksplice was really the only toolkit to do live kernel patching in Linux for a long time.  Unfortunately, since the Oracle acquisition there have been no Open Source updates to the Ksplice codebase.  In my research I found a Git repository for Ksplice which had recent updates to it.  I tried using this with Fedora 20, but my guess is either Fedora does something unique with its kernels or the toolset is just too outdated for modern 3.x Linux kernels.  My attempts to compile the patches hit several problems inside the toolset which could not easily be resolved.

kGraft


After Ksplice I moved on to kGraft.  It takes a similar approach to Ksplice but uses much newer techniques.  Of all the solutions I believe the technology in kGraft is the most advanced.  It is the only solution that is capable of patching a kernel without pausing it first.  To use this solution there is currently a requirement to use SUSE's kGraft kernel tree.  I tested this on an OpenSUSE installation so I could be close to the systems it was designed on.  Unfortunately, whilst the internals are quite well documented, the toolset is not.  After several attempts I could not find a way to get this solution to work end-to-end.  I do think this solution has massive potential, especially if it gets added to the mainline kernel.  I hope the documentation improves in the future so I can revisit it.

kpatch


Finally I tried Red Hat's kpatch.  This solution was released around the same time as kGraft and likewise is in a development stage, with warnings that it probably shouldn't be used in production yet.  The technology is again similar to Ksplice with more modern approaches, but what really impressed me is the ease of use.  The kpatch source does not require compiling as part of the kernel; it can be compiled separately and is just a case of 'make && make install'.  The toolset will take a standard unified diff file and use it to compile the required parts of the kernel to create the patch module.  Whilst this sounds complex, it is all done with a single command.  Applying the patch to the kernel is equally easy: running a single command will do all the required steps and apply the patch.
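
Roughly, the whole workflow looks like this (the module name is derived from the patch file name, so yours will differ):

$ kpatch-build meminfo-string.patch
$ sudo kpatch load kpatch-meminfo-string.ko
$ sudo kpatch list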

Conclusion


If public clouds with shared hardware are going to continue to grow, this is a technology that should be invested in.  I was very quickly drawn to Red Hat's solution because they have put a high priority on how the user interacts with the toolset.  I'm sure many Devops engineers will appreciate that.  What I would love to see is kGraft's technology combined with kpatch's toolset.  This will certainly be an interesting field to watch as both projects mature.

Monday, 28 April 2014

Should voicemail be trusted?

As a member of HP's Advanced Technology Group I take security very seriously. It is a primary consideration in everything that engineers do at HP.

I will start this discussion with a disclaimer: Don't hack voicemail!  Not only is it a really nasty thing to do, it is illegal!

In the UK we have had a phone hacking scandal in our media for a long time. The short story is that reporters for tabloid newspapers were accessing the voicemail of celebrities to find gossip to sell papers.  Phone providers made this easy by having an easy-to-guess default PIN, or no PIN at all, for accessing voicemail remotely.

Whilst things have improved slightly in the wake of this, The Register recently proved you can still access the voicemail of others without a PIN very simply.  By the time you read this I suspect both providers affected will have closed the loophole, but there are bound to be other loopholes just waiting to be exploited.  This still raises several questions in my mind about the security of voicemail.

Judging by the data from a recent Data Genetics article, you could reasonably guess a PIN in three tries with around an 18% chance of getting it correct, since the three most common PINs (1234, 1111 and 0000) together account for roughly that share of all PINs in the dataset.  I've not tried to lock myself out of a voicemail system before, but I would hope it locks out after three attempts (if not then we should really worry).  If you have some information, such as memorable years or dates, about the person owning the number you could probably increase your success rate even further.  So in theory a hacker wouldn't have to try too many phone numbers until he/she got in.

If your PIN is not stored in an encrypted form it is likely to be vulnerable to some form of social engineering attack at the provider's end.  I also suspect that many people use their credit card PIN as their voicemail PIN to make it easy to remember, which adds another level of insecurity to the system.

I think it is very unlikely that the voicemail itself is stored in an encrypted form; it is much more likely a bunch of MP3s on a disk array with a database table pointing to your messages (or just blob data in the DB).  This brings its security down in line with email, or worse, because a PIN is easier to guess than a password.  Even if the voicemail data is encrypted, the provider holds the locks and the keys, rendering you powerless.

The general saying is that email should be considered public and you shouldn't send messages you wouldn't want the world reading without at least some form of encryption (such as PGP). I would say that exactly the same is true of voicemail, don't use it for messages that you don't want the world to hear.  Voicemail doesn't have anything like PGP encryption built-in.

Several of my friends and I have our voicemail greeting messages set to say that we don't listen to our voicemail ever and that messages should be left in a different form.  In this decade, where security is really under the magnifying glass, I think someone needs to start taking a serious look at a better way of doing voicemail.

Monday, 31 March 2014

Live Kernel Patching Solutions

A large public cloud is quite a complex thing to manage, even with toolsets such as Openstack to deploy and manage it.  You often have many customers with compute instances on a single server, so security is a high priority.  This also introduces an interesting problem: when you need to deploy a Linux kernel security patch to the host (sometimes thousands of hosts), how do you do it with the least disruption?

One of the most accepted methods currently is to suspend the instances on a box, deploy the patch, reboot the host and resume the instances.  Whilst this works, in many cases you have customers who have had their instances down for X minutes (even if it is planned and the customer is notified it can be inconvenient) and instance kernels that suddenly think "my clock just jumped, WTF just happened?".  This in turn can cause problems for long-running clients talking to servers as well as any time-sensitive applications.  Long-running clients will often think they are still connected to the server when they really are not (there are ways around this with TCP keepalive).
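
On an OpenStack compute host that flow looks roughly like this (a sketch only; instance IDs and the exact update and reboot steps will vary):

$ nova suspend <instance-id>             # repeat for each instance on the host
$ sudo yum update kernel && sudo reboot  # apply the kernel update and reboot the host
$ nova resume <instance-id>              # once the host is back up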

There are a couple of old solutions to this and a couple of new ones, and as part of my work for HP's Advanced Technology Group I will be taking a deep dive into them in the coming weeks.  For now, here is a quick summary of what is around:

kexec


This is probably the oldest technology on the list.  It doesn't quite fall under the "Live Kernel Patching" umbrella but is close enough that it warrants a mention.  It works by basically ejecting the current kernel and userspace and starting a new kernel, effectively rebooting the machine without a POST and BIOS/EFI initialisation.  Today this only really shaves a few seconds off the boot time and can leave hardware in an inconsistent state.  Booting a machine through the BIOS/EFI sets the hardware up in an initialised state; with kexec the hardware could be in the middle of reading/writing data at the point the new kernel is loaded, causing all sorts of issues.
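
For reference, a typical kexec-based "reboot" looks something like this (paths are distro-specific; a sketch only, not a recommendation):

$ sudo kexec -l /boot/vmlinuz-$(uname -r) --initrd=/boot/initramfs-$(uname -r).img --reuse-cmdline
$ sudo kexec -e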

Whilst this solution is very interesting, I personally would not recommend using it as during a mass deployment you are likely to see failures.  More information on kexec can be found on the Wikipedia entry.

Ksplice


Ksplice was really the first toolset to implement Live Kernel Patching.  It was created by several MIT students who subsequently spun off a company from it, which supplied patches on a subscription model.  In 2011 this company was acquired by Oracle and since then there have been no more official Open Source releases of the technology.  Github trees which have been updated to work with current kernels still exist.

The toolset works by taking a kernel patch and converting it into a module which will apply the changes to functions to the kernel without requiring a reboot.  It also supports changes to the data structures with some additional developer code.  It does temporarily pause the kernel whilst the patch is being applied (a very quick process), but this is far better than rebooting and should mean that the instances do not need suspending.

kpatch


Both Red Hat and SUSE realised that a more modern Open Source solution to the problem was needed, and whilst SUSE announced their solution (kGraft) first, Red Hat's kpatch was the first to show actual code.

Red Hat's kpatch solution gives you a toolset which creates a binary diff of a kernel object file before and after a patch has been applied.  It then turns this into a kernel module which can be applied to any machine with the same kernel (along with kpatch's core module loaded).  Like Ksplice, it does need to pause the kernel whilst patching the functions.  It also does not yet support changes to data structures.

It is still very early days for this solution but development has been progressing rapidly.  I believe the intention is to create a toolset that will take a unified diff file and turn it into a kpatch module automatically.

For more information on this solution take a look at their blog post and Github repository.

kGraft


SUSE Labs announced kGraft earlier this year but only very recently produced code to show their solution.

From the documentation I've seen so far, their solution appears to work in a similar way to Red Hat's, but it has a unique feature: the patch can be applied to the kernel without pausing it.  Both the old and the replacement functions can exist at the same time; old executions will finish using the old function and new executions will use the new function.

This solution seems to have gone down the route of bundling their code on top of a full Linux kernel git tree, which meant it took an entire night for me to download the git history.  I'm looking forward to digging through the code of this solution to see how it works.

The git tree for this can be found in the kgraft kernel repository (make sure to check out the origin/kgraft branch after you have cloned it) and SUSE's site on the technology can be found here.
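
Once cloned, getting onto the right branch is straightforward (substitute the repository URL from the link above):

$ git clone <kgraft kernel repository URL> kgraft
$ cd kgraft
$ git checkout -b kgraft origin/kgraft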

Summary


All three of the above solutions are very interesting.  Combined with a deployment technology such as Salt or Ansible, they could mean the end of maintenance downtime for cloud compute instances.  As soon as I have done more research on the technologies I will be writing more details and hopefully even contributing where possible.

Saturday, 22 February 2014

Cloud Users: Don't put your eggs in one basket

In my previous blog post I briefly mentioned that a premature cloud account removal meant that the Drizzle project lost a lot of data, including backups.  I'll go into a bit more detail here so cloud users can learn from our mistakes.  I will try to avoid using cloud puns here as much as possible :)

As with everything in life, clouds fail.  No cloud I have seen so far can claim 100% uptime since hitting GA.  However much projects like Openstack simplify things for the operators of clouds, they are complex architectures and things can fail.

But there is one element of human failure I've seen several times from multiple cloud vendors, and it is the problem that crippled Drizzle: account deletion.

Most of Drizzle's websites and testing framework were running from a cloud account using compute resources.  Backups were made and automatically uploaded to the cloud file storage on the same account for archiving.  This was our mistake (and we have definitely learnt from it).  We knew that at some point in the future the cloud accounts used would be migrated to a different cloud and the current cloud account terminated.  Unfortunately the cloud account was terminated prematurely.  This meant that all compute instances and file storage were instantly flushed down the toilet.  All our sites and backups were instantly destroyed.

This is not the only time I have seen this happen.  There have been two other instances I know of in the last year where an accidental deletion of a cloud account has meant that all data including backups were destroyed.  Luckily in both those cases the damage was relatively minor.  I actually also lost a web server due to this problem around the same time as Drizzle was hit.

The Openstack CI team do something quite clever but relatively simple to mitigate these problems and continue running.  They use multiple cloud vendors (last I checked it was HP Cloud and Rackspace).  When your commit is being tested in Jenkins it goes to a cloud compute instance in whatever cloud is available at the time.  So if a vendor goes down for any reason the CI can still continue.

I highly recommend a few things to any users of the cloud.  You should:

  1. Make regular offsite backups (and verify them)
  2. If uptime is important, use multiple cloud providers
  3. Use Salt, Ansible or similar technology so that you can quickly spin your cloud instances up again to your requirements at a moment's notice

Patrick Galbraith, a Principal Engineer who works with me at HP's Advanced Technology Group, is currently working on a way to enhance libcloud to work better with HP Cloud so that it is easy to seamlessly use multiple clouds.  We are also working on several enhancements to Salt and Ansible, both very promising technologies when it comes to cloud automation.

The way I see it, no one should be putting all their cloud eggs in one basket.