Understanding MySQL Binary Collations
Today I am going to explore MySQL’s Binary Collations. About a week ago, I wrote a post about an issue I was having with trailing whitespace not showing up as a different value in a GROUP BY query as I expected.
As a brief recap, this is because the MySQL character set I was using was PADSPACE, which means that all trailing whitespace is ignored in comparisons. The solution was to use the BINARY operator.
This issue led me to do more investigation into MySQL character sets and collations. I wanted to see if there were a way to make trailing spaces be significant by changing the column’s character definition, instead of using the BINARY operator. Reading through the MySQL manual on the BINARY operator did not encourage me:
BINARY (operator) also causes trailing spaces to be significant….The BINARY attribute in character column definitions has a different effect.
Ok, so defining the column as BINARY does not make trailing spaces significant. Instead, it changes the collation of the column to the _bin collation of the character set. As an example, if your column is defined as character set ‘utf8′, then defining the column ‘char(10) BINARY’ changes the collation to to ‘utf8_bin’. So what does that mean?
All nonbinary strings have a character set and collation. Examples of character sets are ‘utf8′ and ‘latin1′. They define the type of characters allowed in the column. Each character set has multiple collations associated only with that character set (utf8_general, utf8_bin). These collations define the rules for comparing and sorting the characters in the parent set.
The _bin collation will order the column based on the value of the string on the binary code values of the characters. These characters can be multi-byte (as in the case of utf8), and in this respect makes the _bin collation different from binary strings stored in BINARY, VARBINARY and BLOB column types.
‘_bin’ collations will still allow the lettercase to be modified (for example, calling LOWER() on a _bin collation will change ‘AA’ to ‘aa’) but ordering the values will ignore the lettercase. This ability to modify the lettercase of a _bin collation differs from a binary string, which does not allow lettercase conversion.
Unfortunately, ‘_bin’ collations are still considered nonbinary strings, making this rule apply:
Nonbinary strings have PADSPACE behavior for all collations, including _bin collations
Trailing spaces are still insignificant in comparisons: ‘a ‘ = ‘a’.
So while I was not able to avoid using the BINARY operator to make trailing spaces significant by changing the column definition, I did learn that _bin collation will allow me to sort the values of a column in a case-insensitive manner, while still maintaining lettercase significance for comparisons (‘a’ is not equal to ‘A’).
For more reading on the _bin collation and the difference from binary strings, there is some good information on the MySQL manual page.


What about varbinary type?
http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
“For VARBINARY, there is no padding on insert and no bytes are stripped on select. All bytes are significant in comparisons, including ORDER BY and DISTINCT operations. 0×00 bytes and spaces are different in comparisons, with 0×00 < space."
Does this suit your needs?
I've happened upon this issue myself and am exploring it a bit…
Very good Dave. The potential issues with VARBINARY are:
a) handling of multi-byte characters are passed to your operating system. This makes the behavior of multi-byte varbinary unreliable across different systems.
b) You cannot use functions like LOWER() and UPPER() to change the letter case (you have to use CONVERT to a different character set, which then introduces the trailing whitespace issue.)
But other than that, VARBINARY is a valid alternative as long as you know the caveats.