By themselves, URLs are nothing but alphanumeric strings, with some other symbols thrown in. The character set chosen to express a URL string consists of the following symbols:
| Symbols | Values |
|---|---|
| Alphanumeric symbols | A-Z, a-z, 0-9 |
| Reserved symbols | ; / ? : @ & = + $ , < > # % " |
| Other special characters | - _ . ! ~ * ' ( ) { } \| \ ^ [ ] ` |
For the most part, a URL string consists of letters, numbers, and reserved symbols that have special meaning within the URL string. Other special characters are found in some URL strings, although they don't have any special meaning as far as the URL is concerned. However, they may have special meaning for the Web server receiving the URL or the application that is requested via the Web server.
Interpretations of some of these special characters are presented in Table 5-2.
Characters such as * and ; and | and ` have special meanings as meta-characters in applications and scripts. These characters don't affect the URL in any way, but if they end up making their way into applications, they may change the meaning of the input altogether and sometimes create gaping security holes.
Many meta-characters are interpreted differently by different Web servers. Table 5-3 describes how various meta-characters are interpreted inside applications.
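One common defence against meta-characters reaching an application is to check each input against an allow-list before using it. The following Python sketch illustrates the idea. The function name `is_safe_value` and the particular set of permitted characters are assumptions made for this example, not something the text prescribes, and a real application would tailor the list to each field.

```python
import re

# Hypothetical allow-list: letters, digits, space, underscore, dot, hyphen.
# Anything else (including * ; | and `) causes the value to be rejected.
SAFE_VALUE = re.compile(r"[A-Za-z0-9 _.-]+")


def is_safe_value(value: str) -> bool:
    """Return True only if every character of value is on the allow-list."""
    return SAFE_VALUE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["snailmail", "snailmail; xterm", "report`id`", "a|b"]:
        print(repr(candidate), is_safe_value(candidate))
```

Rejecting everything that is not explicitly permitted tends to be more robust than trying to list every dangerous character, because the set of meta-characters differs from one interpreter to another.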
The question that arises now is, "What if we want to specify special characters such as % or ? or & or + without giving them any special meaning?" For example, suppose we want to pass two parameters, book=pride&prejudice and shipping=snailmail, on the Query String. In this case, the URL is:
http://mycheapbookshop.com/purchase.cgi?book=pride&prejudice&shipping=snailmail
Meta-Characters and Input Validation
The single most prominent cause of over 90% of all Web application vulnerabilities is lack of proper input validation. The concept of input validation isn't new. During our days of writing Fortran code in college, the instructor used to perform manual input validation before giving us credit for the code submitted. One of the programs to be written was to calculate the natural logarithm of a number. None of the students' code ever made it past the first input given by the instructor, "banana", when the program was expecting a number! When given unexpected input, the program would crash and dump core. In those days, little did we realize the importance of proper input validation. Making an xterm pop out by forcing meta-characters and Unix commands into a Web page form is perhaps the epitome of elegant Web hacks, attributed entirely to weak input validation.
The result is an ambiguous URL because the Query String contains two & symbols, and nothing distinguishes the one inside the book title from the one that separates the parameters. Most likely, a Web server would split such a Query String into three parameters instead of two: book=pride, prejudice= (with an empty value), and shipping=snailmail.
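The split can be reproduced with Python's standard `urllib.parse` module. This is only an illustration: the parser used by any particular Web server or CGI library is an assumption here, and its handling of a name without a value may differ.

```python
from urllib.parse import urlsplit, parse_qsl

url = ("http://mycheapbookshop.com/purchase.cgi"
       "?book=pride&prejudice&shipping=snailmail")

# Extract the Query String (everything after the "?").
query = urlsplit(url).query

# keep_blank_values=True keeps names that arrive without a value.
params = parse_qsl(query, keep_blank_values=True)

print(params)
# [('book', 'pride'), ('prejudice', ''), ('shipping', 'snailmail')]
```

The book title has been cut in two, and a parameter named prejudice that nobody intended to send has appeared.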
If we want to pass the & symbol as part of the parameter value, the URL specification allows us to express reserved and special characters in a two-digit hexadecimal encoded ASCII format, prefixed with a % symbol, as follows:
| Characters | Hex Values |
|---|---|
| All hex encoded characters | %XX (%00-%FF) |
| Control characters | %00-%1F, %7F |
| Upper 8-bit ASCII characters | %80-%FF |
| Space | %20 or + |
| Carriage return | %0d |
| Line feed | %0a |
In the preceding example, the ASCII value of the & symbol is 38 in decimal and 26 in hexadecimal. Therefore, if we want to express the & symbol, we can use %26 in its place. The URL in the example would become:
http://mycheapbookshop.com/purchase.cgi?book=pride%26prejudice&shipping=snailmail
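As a rough sketch, the same encoding can be produced with Python's standard `urllib.parse` functions. The variable names are illustrative only, and other languages offer equivalent helpers.

```python
from urllib.parse import quote, quote_plus, urlencode, parse_qsl

# The hexadecimal ASCII value of "&".
amp_hex = format(ord("&"), "02X")
print(amp_hex)  # 26

# Encode a single value; safe="" leaves no reserved character unencoded.
encoded_value = quote("pride&prejudice", safe="")
print(encoded_value)  # pride%26prejudice

# Build a whole Query String; each value is encoded before joining with "&".
query = urlencode({"book": "pride&prejudice", "shipping": "snailmail"})
print(query)  # book=pride%26prejudice&shipping=snailmail

# Decoding now yields the two parameters that were intended.
decoded = parse_qsl(query)
print(decoded)  # [('book', 'pride&prejudice'), ('shipping', 'snailmail')]

# A space may be written as %20 or as +, depending on the helper used.
space_forms = (quote("snail mail"), quote_plus("snail mail"))
print(space_forms)  # ('snail%20mail', 'snail+mail')
```

Encoding each value separately, before the parameters are joined, is what keeps the separator & distinct from an & that belongs to the data.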
Hexadecimal ASCII encoding serves most purposes, but it isn't broad enough to represent character sets larger than 256 symbols. Most modern operating systems and applications support multibyte representations of character sets for languages other than English. Microsoft's IIS Web server supports URLs containing characters encoded with the multibyte UCS Transformation Format (UTF-8), in addition to hexadecimal ASCII encoding.
The Universal Character Set (UCS) is defined by the International Organization for Standardization (ISO) in ISO 10646. Although UCS is maintained by ISO, a separate group was formed (primarily by software vendors) to allow representation of a variety of character sets with one unified scheme. This group came to be known as the Unicode Consortium (http://www.unicode.org). As the standards were developed, Unicode and UCS adopted a common representation scheme so that the computing world didn't have to deal with separate standards for the same thing. UTF-8 encoding is defined in ISO 10646-1:2000 and in RFC 2279. For operating systems designed around the ASCII character encoding scheme, UTF-8 allows for easy conversion: ASCII characters keep their single-byte values, while all other Unicode characters are represented as multibyte sequences.
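To make the ASCII compatibility concrete, the short Python sketch below encodes an ASCII letter and an accented letter as UTF-8, then hex encodes the resulting bytes for use in a URL. The sample word is an arbitrary choice for illustration.

```python
from urllib.parse import quote, unquote

# An ASCII character is a single byte in UTF-8, identical to its ASCII value.
ascii_bytes = "A".encode("utf-8")
print(ascii_bytes)  # b'A'

# A non-ASCII character (here U+00E9, e with acute accent) becomes two bytes.
accented_bytes = "\u00e9".encode("utf-8")
print(accented_bytes)  # b'\xc3\xa9'

# In a URL, each UTF-8 byte is hex encoded separately with a % prefix.
encoded = quote("caf\u00e9")
print(encoded)  # caf%C3%A9

# Decoding reverses both steps.
decoded = unquote(encoded)
print(decoded == "caf\u00e9")  # True
```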
Without going into the intricacies of how UTF-8 works, let's look at Unicode encoding from a URL's point of view. Two-byte Unicode characters are encoded by using %uXXYY, where XX and YY are hexadecimal values of the higher and lower byte respectively. For the standard ASCII characters %00 to %FF, the Unicode representation is %u0000 to %u00FF. The Web server decodes 16 bits at a time when dealing with Unicode encoded symbols.
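The %uXXYY form can be sketched in a few lines of Python. Python's standard library has no built-in support for this notation, so `encode_percent_u` and `decode_percent_u` below are hypothetical helpers written for this example. They handle only characters that fit in 16 bits and make no claim to match any Web server's decoder exactly.

```python
import re


def encode_percent_u(text: str) -> str:
    """Encode every character of text as %uXXYY (16-bit characters only)."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("character does not fit in 16 bits")
        # Four hex digits: high byte XX followed by low byte YY.
        parts.append("%u{:04X}".format(code))
    return "".join(parts)


_PERCENT_U = re.compile(r"%u([0-9A-Fa-f]{4})")


def decode_percent_u(text: str) -> str:
    """Replace each %uXXYY sequence with the character it represents."""
    return _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(encode_percent_u("&"))                     # %u0026
    print(decode_percent_u("pride%u0026prejudice"))  # pride&prejudice
```

Note that %u0026 and %26 both decode to the same & symbol, so one character can reach the application in more than one encoded form.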