Friday, 28 January 2011

MySQL to Drizzle character set considerations

Drizzle supports one character set (unless you include binary) which is UTF8.  It is the character set used by most of the web and supporting many different character sets can lead to complications.  That is not to say that there are not advantages of supporting many different types of character sets like MySQL does, but more care is needed when using them.

As an example of this, a new Drizzle user came online today saying that drizzledump's MySQL to Drizzle conversion was turning 'è' to 'è'.  When drizzledump connects to a MySQL server it sets the connection to UTF8 so that the dump output is compatible with Drizzle.  After a bit of discussion it was discovered that the user's table was latin1 and connection was latin1 (PHP does this by default) but they were storing and retrieving UTF8 data.  Essentially their data was getting mangled but it happened to work.  The problem came when telling MySQL to export this data as UTF8, it was effectively doing a double UTF8 conversion of the data.

With this in mind we have added a new option to drizzledump so that it stops setting the character set for the connection in these situations, '--my-data-is-mangled'.

8 comments:

  1. [...] This post was mentioned on Twitter by drizzlenews, Andrew Hutchings. Andrew Hutchings said: MySQL to Drizzle character set considerations http://j.mp/gx4O81 [...]

    ReplyDelete
  2. That sounds like a killer feature to me. Unmangling data in MySQL is quite a challenge.

    ReplyDelete
  3. Just out of curiosity what happens when you try to dump data where the MySQL table or column charset differs from the base charset? MySQL permits this and it leads to some rather ugly problems with replication at least.

    ReplyDelete
  4. When a table or column differs from the connection character set, that is fine. MySQL should convert the data. The problem comes when the character set for the data differs from the character set for the connection. In the case I outlined in my blog post the default mysqldump should also give a bad dump (I think, judging by the manual, I haven't tested it).

    It has been a while since I have worked on MySQL and can't remember about character set issues in replication unfortunately.

    ReplyDelete
  5. A few years ago I wrote a blog post on character set settings: http://www.alberton.info/dbms_charset_settings_explained.html

    ReplyDelete
  6. [...] has added –my-data-is-mangled for handling certain MySQL charset migration [...]

    ReplyDelete