Causes of and solutions to garbled urllib.unquote output in Python


Discovering the problem

The urllib module in Python handles URL operations. Its unquote function corresponds to decodeURIComponent in JavaScript (or urldecode in PHP): it decodes a URL by replacing each “%xx” escape with the single character it represents. For example, “%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92” becomes “法国红酒” (French red wine) after decoding. Used incorrectly, however, the decoded result comes out garbled as “æ³•å›½çº¢é…’”.
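As a minimal sketch of the decoding step described above: the article uses Python 2's urllib.unquote, while the snippet below assumes Python 3, where the equivalent function lives in urllib.parse and decodes the escaped bytes as UTF-8 by default.

```python
from urllib.parse import quote, unquote

encoded = "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

# Each "%XX" escape stands for one byte; the resulting bytes are decoded as UTF-8.
decoded = unquote(encoded)
print(decoded)  # 法国红酒

# Encoding the text again gives back the original percent-encoded string.
assert quote(decoded) == encoded
```

Twelve escapes yield only four characters here, because each Chinese character occupies three bytes in UTF-8.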

The author ran into this problem in a real Tornado application. After the browser sent the request to the backend, the parameter value was read like this:

name = self.get_argument("name", "")
name = urllib.unquote(name)
# save to db

Printing the value of name before decoding gives “%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92”, which looks fine, yet what ends up in the database is the garbled string “æ³•å›½çº¢é…’”. This was really puzzling.

Cause analysis

So I tried passing the percent-encoded string in directly as a literal, to see whether the result would still be garbled:

name = '%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92'
name = urllib.unquote(name)
# save to db

Handled this way, there is no problem: after decoding, the value of name is “法国红酒”. After thinking for a while, the only remaining suspect was the call self.get_argument("name"). It turns out that get_argument returns a value of type unicode by default, and when unquote is given a unicode value it simply returns:

u'\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92'

Note: when unquote receives an argument of type unicode, its return value is also of type unicode, with each ‘%’ escape simply replaced by the corresponding ‘\x’ escape. That is,

u"%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

becomes:

u"\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92"

u"\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92" is a string in which every character is written as a hexadecimal escape, one character per byte. Let’s look at which characters ‘e6’, ‘b3’, … correspond to. According to http://www.ascii-code.com, ‘e6’ is an extended ASCII character in the range 128-255, and its symbol is ‘æ’:

DEC OCT HEX BIN Symbol
230 346 E6 11100110 æ
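The table row can be checked directly, and the behaviour described above can be imitated. This is a sketch assuming Python 3: passing encoding="latin-1" to urllib.parse.unquote maps each escaped byte to the character with the same code point, which mirrors what the article describes Python 2's unquote doing for unicode input.

```python
from urllib.parse import unquote

encoded = "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

# Imitate the behaviour described above: one character per escaped byte.
garbled = unquote(encoded, encoding="latin-1")
assert garbled == "\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92"
assert len(garbled) == 12  # 12 characters instead of the 4 expected

# The first character reproduces the table row for E6.
first = garbled[0]
row = (ord(first), format(ord(first), "o"), format(ord(first), "X"),
       format(ord(first), "08b"), first)
print(row)  # (230, '346', 'E6', '11100110', 'æ')
```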

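Now you should understand why garbled characters are generated: each byte of the UTF-8 encoding of “法国红酒” is treated as a separate character, so the text is displayed as:

æ³•å›½çº¢é…’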
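To see where the on-screen symbols come from, the sketch below (Python 3) decodes the UTF-8 bytes of the text with Windows-1252. That choice of code page is an assumption about how the garbled value was rendered; it is consistent with the æ shown in the table above.

```python
text = "\u6cd5\u56fd\u7ea2\u9152"  # 法国红酒
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92"

# Assumption: the viewer interprets every byte as a Windows-1252 character.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # æ³•å›½çº¢é…’
```

Four characters turn into twelve symbols because each three-byte UTF-8 sequence is split into three unrelated characters.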

The solution

After calling self.get_argument('name'), convert the returned value to type str:

name = self.get_argument("name", "")
name = str(name)
name = urllib.unquote(name)
# save to db

After conversion to type str, unquote returns the UTF-8 byte string, which decodes to the correct text:

>>> '\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92'.decode("utf-8")
u'\u6cd5\u56fd\u7ea2\u9152'
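The same idea in Python 3 terms, as a hedged sketch: decode the percent escapes to bytes first, then decode those bytes as UTF-8. The helper name repair_mojibake is hypothetical and only illustrates how a value that was already garbled (one character per byte) could be recovered.

```python
from urllib.parse import unquote_to_bytes

encoded = "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

# Step 1: percent escapes -> raw bytes. Step 2: bytes -> text, as UTF-8.
raw = unquote_to_bytes(encoded)
name = raw.decode("utf-8")
assert name == "\u6cd5\u56fd\u7ea2\u9152"  # 法国红酒


def repair_mojibake(garbled):
    """Hypothetical helper: undo a one-character-per-byte decode."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = "\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92"
assert repair_mojibake(garbled) == name
```

The repair only works when every garbled character has a code point below 256, which holds for the one-character-per-byte string discussed in this article.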

Conclusion