Fun with domain names - The WWW
Hello nixers,
This thread is simply a write-up about a fun experiment we did on IRC, putting an ephemeral conversation into ink. Pasting any of the following in your browser URL bar redirects/resolves to nixers.net: Code: https://πππ©ππ£π€.πππ₯ Interesting, so at which level does the conversion happens, we know DNS only supports ascii and would actually convert unicode to punycode if it encounted some. The same behavior happens with curl and dig, and snooping it with wireshark shows that the actual request sent for A is nixers.net. Code: dig @8.8.8.8 A βββ§ββ‘β’.βββ£ My guess, is that DNS doesn't actually handle such wide characters but that the tool fallback on a library that does the normalization for them. Let's test if the DNS resolves by packing our own request and sniffing it with wireshark: Code: echo -n -e And nope, it doesn't work, we get a server error: So it's a local conversion to ascii, using normalization. From the command line you can test using the following: Code: > iconv -f utf-8 -t ascii//TRANSLIT <<<ππππππ.πππ And that explains our initial issue. However, I'm still wondering which library does the conversion in all these tools, let me know if you find it, my strace log was too big and I didn't want to parse it. EDIT: I've actually found where the DNS translation is done in all these tools, they rely on getaddrinfo(3) and other OS libs. So it start from getaddrinfo to then either calls __idna_to_dns_encoding or __idna_to_ascii_lz depending on the version (my guess), which relies on libidn. So libidn is doing all the dirty work. Code: ~ > ldd $(which curl) | grep libidn You can actually test it on the command line too, similar to uconv: Code: > idn --idna-to-ascii 'https://π§π’π±ππ«π¬.π§ππ' The relevant part of the getaddrinfo docs: Other interesting thing you can do with browsers and domain names:
Yep, the web is kind of complicated...
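If you want to poke at that EDIT yourself, here's a minimal C sketch relying on glibc's AI_IDN extension flag, which asks getaddrinfo(3) to run the IDN conversion on the node name. It's a sketch under some assumptions: a UTF-8 locale, and a glibc built with IDN support (older versions go through libidn, newer ones libidn2).

Code:
#define _GNU_SOURCE /* exposes the AI_IDN extension */
#include <netdb.h>
#include <stdio.h>
#include <locale.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>

int main(void)
{
    /* honour the environment's (assumed UTF-8) locale for the conversion */
    setlocale(LC_ALL, "");

    struct addrinfo hints = {0}, *res;
    hints.ai_family = AF_INET;
    hints.ai_flags  = AI_IDN; /* IDNA-encode the node name before resolving */

    /* the fancy glyphs NFKC-normalize down to plain "nixers.net" */
    int err = getaddrinfo("𝐧𝐢𝐱𝐞𝐫𝐬.𝐧𝐞𝐭", NULL, &hints, &res);
    if (err != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return 1;
    }

    struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
    printf("%s\n", inet_ntoa(sin->sin_addr)); /* same A record as nixers.net */
    freeaddrinfo(res);
    return 0;
}

Run it while sniffing with wireshark: the query on the wire should already be plain ASCII, just like with curl and dig.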
Well actually, the ascii art in my introduction thread gets displayed as emojis on some browsers, so where's my trophy?
(24-08-2020, 12:33 PM)sokx Wrote: so where's my trophy
You win! Though, technically, because let's be technical on a computer enthusiast forum, your thread was posted before I added support for 4B-wide UTF-8 unicode in the DB. Your heart ♥ pushed the limit boundary:

Code:
11100010 10011001 10100101

While the penguin 🐧 went further into the 4B space:

Code:
11110000 10011111 10010000 10100111

But everyone is a winner in my ♥.
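If you want to double-check those bit patterns, here's a small C sketch that dumps each UTF-8 byte of both characters in binary (the glyphs are embedded as raw byte escapes, so no particular source encoding is assumed):

Code:
#include <stdio.h>
#include <string.h>

/* print every byte of a UTF-8 string as 8 bits */
static void dump_bits(const char *label, const char *s)
{
    printf("%s:", label);
    for (size_t i = 0; i < strlen(s); i++) {
        unsigned char c = (unsigned char)s[i];
        putchar(' ');
        for (int b = 7; b >= 0; b--)
            putchar((c >> b) & 1 ? '1' : '0');
    }
    putchar('\n');
}

int main(void)
{
    dump_bits("heart",   "\xE2\x99\xA5");     /* ♥ U+2665, 3 bytes */
    dump_bits("penguin", "\xF0\x9F\x90\xA7"); /* 🐧 U+1F427, 4 bytes */
    return 0;
}

It prints exactly the two sequences above: three leading 1-bits in the first byte announce a 3-byte sequence, four announce a 4-byte one.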
Emojis. If you absolutely want to say something, but you cannot be bothered with grammar.
-- <mort> choosing a terrible license just to be spiteful towards others is possibly the most tux0r thing I've ever seen
How about we set up DNS with only emojis as TLDs then? I'd love to access the forums using "https://nixers.🐧"
By the way, I looked up the ICANN documentation about non-ASCII domains. They're supposed to be sent as "punycode". So it's weird that your requests got translated to ASCII text. For instance:

Code:
$ dig +noidnout nixers.💩 | grep -A1 QUESTION
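You can also get the punycode form without involving a resolver at all, straight from the libidn that got fingered earlier in the thread. A minimal C sketch (link with -lidn); the IDNA_ALLOW_UNASSIGNED flag is an assumption here, since the emoji postdates the Unicode version IDNA2003 was specified against:

Code:
#include <idna.h>   /* libidn */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *ascii = NULL;

    /* convert a UTF-8 host name to its IDNA/punycode ASCII form */
    int rc = idna_to_ascii_8z("nixers.🐧", &ascii, IDNA_ALLOW_UNASSIGNED);
    if (rc != IDNA_SUCCESS) {
        fprintf(stderr, "idna_to_ascii_8z: %s\n", idna_strerror(rc));
        return 1;
    }
    printf("%s\n", ascii); /* nixers.xn--... */
    free(ascii);
    return 0;
}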
(25-08-2020, 10:24 AM)z3bra Wrote: By the way, I looked up the ICANN documentation about non-ASCII domains. They're supposed to be sent as "punycode". So it's weird that your requests got translated to ASCII text.
Indeed, and I mentioned it in the original post:
Quote: Interesting, so at which level does the conversion happen? We know DNS only supports ASCII and would actually convert unicode to punycode if it encountered some.
The transliteration actually happens locally, as I've discovered.
Ah, I guess I must read more carefully. Sorry!
(24-08-2020, 08:25 AM)venam Wrote: representing the IP in different forms, such as a single 4B integer: http://2990468176, or multiple bytes.
Didn't know that. Makes me wonder: Why? What is/was a legitimate use case for this form? Maybe I'll try to find out later, unless someone posts an answer real quick. :)
(26-08-2020, 10:55 AM)vain Wrote: Why? What is/was a legitimate use case for this form?
While the previous IDN examples are valid ways to write a host, through the concept of domain-to-ASCII (IDNA), the IPv4 representation isn't actually valid as far as the URL standard goes, yet it is still interpreted by the browser, cURL, and others. You can even represent it as hex: http://0xb23eec50 or octal: http://026217566120
I have found at which level this is done: it's also in something getaddrinfo calls. The culprit is inet_aton(3). The relevant part of the manpage, though not compatible with the URL specs:

Code:
inet_aton() converts the Internet host address cp from the IPv4 numbers-and-dots
notation into binary form (in network byte order) and stores it in the structure
that inp points to. [...] The address supplied in cp can have one of the following
forms: a.b.c.d, a.b.c, a.b, a. [...] components of the dotted address can be
specified in decimal, octal (with a leading 0), or hexadecimal, with a leading 0X

One thing is sure in computing: standards are only ideals.
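Here's a minimal C sketch feeding inet_aton(3) the four spellings from this thread; all of them should come back as the same dotted quad:

Code:
#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
    const char *forms[] = {
        "178.62.236.80", /* dotted quad    */
        "2990468176",    /* 32-bit integer */
        "0xb23eec50",    /* hexadecimal    */
        "026217566120",  /* octal          */
    };
    struct in_addr in;

    for (int i = 0; i < 4; i++) {
        if (inet_aton(forms[i], &in))
            printf("%-14s -> %s\n", forms[i], inet_ntoa(in));
        else
            printf("%-14s -> rejected\n", forms[i]);
    }
    return 0;
}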
URLs and IP addresses are not the same thing!
An IP address is a 32-bit integer. inet_aton (and many other similar functions) is used to convert human-readable strings into their binary form. The kernel always deals with IP addresses in binary form (see getaddrinfo(3)), so when opening a connection to a remote host, you MUST specify it in binary form.
A URL is a human-readable string carrying several pieces of information, including the remote host address when appropriate. We usually represent it with an FQDN, as it's more practical. Specifying the dotted-quad IP address is valid too, as functions like inet_aton can do the conversion. However, when you specify it in hexadecimal or octal, the job is even easier! The program can pass it directly to the kernel, which explains why these notations are accepted. Note that they probably just pass it to an *_aton function anyway. They might not be in the spec, but at this point I think the spec might just be too rigid, rejecting these formats because they go against readability (which is what a URL is about).
(27-08-2020, 05:24 AM)z3bra Wrote: URLs and IP addresses are not the same thing!
That's not really the issue; we all know these are different things. Within the URL standard, IPv4 addresses are only allowed to be formatted a certain way, but relying on inet_aton to parse them allows all possible formats, which can lead to the misinterpretation that these formats are accepted everywhere, while they probably aren't, as this is OS specific.
(27-08-2020, 05:24 AM)z3bra Wrote: They might not be in the spec, but at this point I think the spec might just be too rigid, rejecting these formats because they go against readability (which is what a URL is about).
I agree, but standards are there for good practices, so it's probably better not to rely on these other forms in URLs.
I wrote a small script to do the exact opposite: from ASCII to other valid NFKD/IDNA forms (based on the map I found on unicode.org and converted to JavaScript). There's also the unorm library with the map here.
You can get pretty outputs such as: https://𝓷𝒾𝔁ℯ𝓇𝓈.𝓃ℯ𝓉. There are many other "fancy text" generators online, I even have one here, but it's the first time I've created one that generates only valid URLs.
EDIT: I remembered that upper and lower case are equivalent in domain names and updated the script with more glyphs.