multipart/form-data POSTs are not Unicode-aware


OCTAGRAM2
05-23-2011, 04:45 PM
Unicode.semantics is on, but it is not working as expected. POST requests mostly work as usual, but to upload a file one has to use the multipart/form-data format. In that case, regardless of the current settings, form input fields are interpreted as Latin-1 (ISO-8859-1).

I've looked through the Quercus sources, and the problem is spread across the following files:

modules/quercus/src/com/caucho/quercus/env/Post.java

else if (filename == null) {
StringValue value = env.createStringBuilder();

value.appendReadAll(is, Integer.MAX_VALUE);


modules/quercus/src/com/caucho/quercus/env/StringValue.java

public int appendReadAll(ReadStream is, long length)
{
TempBuffer tBuf = TempBuffer.allocate();


TempBuffer is not Unicode-aware. The remainder of this method does nothing to interpret the ReadStream in the proper charset (although the ReadStream has its encoding property set to "utf-8").

Either StringValue should get an appendReadAllUtf8 (or appendReadAllChar) method that takes the ReadStream's encoding into account, or appendReadAll must be modified to support at least UTF-8. (The latter looks better, but I'm not sure it won't break something else.)
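
To illustrate what a charset-aware read could look like, here is a minimal sketch in plain java.io. It does not use the Quercus classes: the class CharsetAwareRead and the method appendReadAllChars are hypothetical names, a StringBuilder stands in for StringValue, and an InputStream stands in for ReadStream. Note that the length limit here counts decoded characters, not bytes, which may differ from what appendReadAll does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetAwareRead {
  // Hypothetical helper (not a Quercus method): decodes the stream with the
  // given charset and appends at most `length` characters to `sb`.
  // Returns the number of characters appended.
  static int appendReadAllChars(StringBuilder sb, InputStream is,
                                Charset charset, long length) {
    Reader reader = new InputStreamReader(is, charset);
    char[] buf = new char[1024];
    int total = 0;
    try {
      while (length > 0) {
        int max = (int) Math.min(buf.length, length);
        int n = reader.read(buf, 0, max);
        if (n < 0) {
          break; // end of stream
        }
        sb.append(buf, 0, n);
        total += n;
        length -= n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return total;
  }

  public static void main(String[] args) {
    // The Cyrillic word "proba", written with escapes so that the source
    // file encoding does not matter.
    String word = "\u043f\u0440\u043e\u0431\u0430";
    byte[] body = word.getBytes(StandardCharsets.UTF_8);

    StringBuilder sb = new StringBuilder();
    int appended = appendReadAllChars(sb, new ByteArrayInputStream(body),
        StandardCharsets.UTF_8, Integer.MAX_VALUE);

    System.out.println(appended + " chars, equal to input: "
        + sb.toString().equals(word));
  }
}
```

The point of the sketch is only that the decoding step has to happen somewhere between the raw bytes and the string builder; where exactly that belongs in Quercus is a separate question.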

Currently, all I can see is a StringValue.appendUtf8 stub that throws UnsupportedOperationException, and nothing else.

As a workaround, one can use <form accept-charset="iso-8859-1">. This forces modern browsers to encode Cyrillic letters as numeric HTML entities, which can then be decoded on the other side.
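
For illustration, decoding such entities on the receiving side could look roughly like the Java sketch below. It assumes the browser sends decimal references of the form &#1087; for characters outside ISO-8859-1; the class NumericEntityDecoder is a hypothetical helper, not something from Quercus. One caveat: text in which the user literally typed such a reference cannot be told apart from an encoded character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntityDecoder {
  // Matches decimal numeric character references such as "&#1087;".
  private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

  // Replaces every decimal reference with the character it names.
  // References that are not valid code points are left untouched.
  static String decode(String input) {
    Matcher m = DECIMAL_REF.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      int codePoint = Integer.parseInt(m.group(1));
      String replacement = Character.isValidCodePoint(codePoint)
          ? new String(Character.toChars(codePoint))
          : m.group();
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    String posted = "&#1087;&#1088;&#1086;&#1073;&#1072; ok";
    String decoded = decode(posted);
    System.out.println(decoded.equals("\u043f\u0440\u043e\u0431\u0430 ok"));
  }
}
```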

Or use
unicode_decode(unicode_encode($_POST['probe'], 'iso-8859-1'), 'utf-8')
to decode the misinterpreted UTF-8.

I've checked it, it works.
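
The reason the re-encoding trick is lossless can be shown with a small Java sketch (the class and method names are made up for the illustration). ISO-8859-1 maps each of the 256 byte values to exactly one character, so reading UTF-8 bytes as ISO-8859-1 garbles the text but does not destroy any bytes, and encoding the garbled string back to ISO-8859-1 restores the original byte sequence.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRoundTrip {
  // Simulates the bug: UTF-8 bytes are decoded as if they were ISO-8859-1.
  static String misread(String original) {
    byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
    return new String(utf8, StandardCharsets.ISO_8859_1);
  }

  // Mirrors the workaround: encode back to ISO-8859-1 to recover the
  // original bytes, then decode them as UTF-8.
  static String repair(String garbled) {
    byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
    return new String(raw, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // The Cyrillic word "proba", written with escapes so that the source
    // file encoding does not matter.
    String word = "\u043f\u0440\u043e\u0431\u0430";
    String garbled = misread(word);
    System.out.println("garbled length: " + garbled.length());
    System.out.println("repaired equals original: "
        + repair(garbled).equals(word));
  }
}
```

Each Cyrillic letter takes two bytes in UTF-8, so the garbled string is twice as long as the original until it is repaired.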