PDA

View Full Version : The JDBC character set problem[MYSQL]


JeremyJoung
08-24-2010, 04:26 AM
A old problem.

I use PHP 5.3.2 for Resin 4.0.7 with Quercus , and connect to MySQL 5.

My datebase is utf-8(chinese).
And although I set the character set to utf-8, but it force ruturn 'Latin1'
And result is full of the "???".

I have use the
"SET NAMES utf8"
"jdbc:mysql://localhost/sma?useUnicode=true&characterEncoding=UTF-8"

And when I check the character set I will get that.

php: "echo mysql_client_encoding($conn);"
> latin1

And the "mysql_set_charset('utf8', $conn);" (from PHP 5.3) will crashed.


"show variables like 'character_set_%'"
[Variable_name]character_set_client, [Value]latin1,
[Variable_name]character_set_connection, [Value]latin1,
[Variable_name]character_set_database, [Value]utf8,
[Variable_name]character_set_filesystem, [Value]binary,
[Variable_name]character_set_results, [Value]latin1,
[Variable_name]character_set_server, [Value]utf8,
[Variable_name]character_set_system, [Value]utf8,


The problem is just in PHP, JSP is right.
I see the source code.

When connected it automatic "SET NAMES 'latin1' "
Mysqli.java:222

Connection conn = jConn.getConnection();

if (! (conn instanceof QuercusConnection)) {
Statement stmt = conn.createStatement();

// php/1465
stmt.executeUpdate("SET NAMES 'latin1'");
stmt.close();
}

return jConn;

And getCharacterSetName is force return Latin1.
JdbcConnectionResource.java:348

public String getCharacterSetName()
{
return "latin1";
}

/**
* Alias for getCharacterSetName
*/
public String getClientEncoding()
{
return getCharacterSetName();
}


I have no idea if the Mysqli lib is used in jdbc/MySQL connection.
And I have no time to read all of the code.
But if I am not wrong. "SET NAMES utf8" can't get a binary stream, and will let the unicode character become "?".

Hope have some help.

Just that.

emil
08-24-2010, 05:26 PM
Hi,

This is a known problem with the MySQL driver and Quercus. Unfortunately because of the unique string handling requirements of converting back and forth between Java and PHP and some issues in the MySQL driver, there are situations in which non-latin1 encodings will not work at the moment. Our alternative is to write a new MySQL driver, which is on the task list, but not ready quite yet.

Thanks for your understanding,
Emil

JeremyJoung
08-25-2010, 08:25 AM
when I use
mysql_query("SET NAMES 'utf8'", $conn);
mysql_query("SET CHARACTER SET utf8", $conn);
mysql_query("SET character_set_connection=utf8", $conn);
mysql_query("SET character_set_client=utf8", $conn);
mysql_query("SET CHARACTER_SET_RESULTS='utf8'", $conn);


I wil get:

> echo mysql_client_encoding($conn);
latin1


> "show variables like 'character_set_%'"
[Variable_name]character_set_client, [Value]utf8,
[Variable_name]character_set_connection, [Value]utf8,
[Variable_name]character_set_database, [Value]utf8,
[Variable_name]character_set_filesystem, [Value]binary,
[Variable_name]character_set_results, [Value]utf8,
[Variable_name]character_set_server, [Value]utf8,
[Variable_name]character_set_system, [Value]utf8,


And output word is

"\u0080\u0000\u0080\u0080"

OCTAGRAM2
08-30-2010, 03:44 PM
The key is binary encoding. You should know that this is about shooting in your foot. However, it works. Thanks for the hint by the way. I have written a script that tries every combination of encoding, but nothing worked as desired before I had introduced binary into bruteforcer.

I was actually trying to use windows-cp1251 encoding, so my results are:

character_set_connection=cp1251 character_set_client=cp1251 character_set_results=binary
character_set_connection=cp1251 character_set_client=binary character_set_results=binary
character_set_connection=utf8 character_set_client=cp1251 character_set_results=binary
character_set_connection=binary character_set_client=latin1 character_set_results=binary
character_set_connection=binary character_set_client=cp1251 character_set_results=binary
character_set_connection=binary character_set_client=utf8 character_set_results=binary
character_set_connection=binary character_set_client=binary character_set_results=binary

Each of this triple has passed the check against SQL SELECT WHERE ... = 'russian cp1251 word', and the returned result table has field matching cp1251 word exactly.

I think, (cp1251, cp1251, binary) is the closest one.

UPD1
I have also observed that collation MATTERS! It is a luck that a table I was bruteforcing had cp1251_bin collation. Switching from cp1251_bin to cp1251_general_ci breaks every combination.

UPD2
I have also observed that character_set_connection must also be binary to enable cyrillic INSERT and UPDATE.

BTW...
http://caucho.com/resin-3.1/doc/quercus.xtp
mysql_connect("java:comp/env/jdbc/myDatabaseName");

This looks like a mistake. I have examined Quercus source code, and I can see no source code handling this. mysql_connect tries to resolve "java" as if it was a host name.

Allionn
09-12-2010, 03:54 AM
Hi, Emil.

I have this problem too.
I use Glassfish v.3, Quercus (I tried both 4.0.3 and 4.0.8 versions) and MySQL 5.
My database is utf-8(russian).

Is your new MySQL driver ready to use or not?

emil
09-13-2010, 04:45 PM
Hi Allionn,

Our MySQL JDBC driver is only in the planning stages, so no, it's not ready yet.

Emil

RiccardoCohen
11-23-2010, 09:35 AM
Hi Emil
I'd like to know if there is a workaround for utf8 chars in quercus, or any date for an update ?
Thanks
Riccardo