Fun with domain names - The WWW

venam
Administrators
Hello nixers,
This thread is simply a write-up about a fun experiment we did on IRC, putting an ephemeral conversation into ink.
Pasting any of the following in your browser URL bar redirects/resolves to nixers.net:
Code:
https://𝕟𝕚𝕩𝕖𝕣𝕤.𝕟𝕖𝕥
https://ⓝⓘⓧⓔⓡⓢ.ⓝⓔⓣ
https://𝐧𝐢𝐱𝐞𝐫𝐬.𝐧𝐞𝐭
https://𝖓𝖎𝖝𝖊𝖗𝖘.𝖓𝖊𝖙

Interesting, so at which level does the conversion happen? We know DNS only supports ASCII and would actually convert Unicode to Punycode if it encountered some.
The same behavior happens with curl and dig, and snooping it with Wireshark shows that the A query actually sent is for nixers.net.

Code:
dig @8.8.8.8 A ⓝⓘⓧⓔⓡⓢ.ⓝⓔⓣ

My guess is that DNS doesn't actually handle such wide characters, but that the tools fall back on a library that does the normalization for them.

Let's test whether the DNS server resolves it by packing our own request and sniffing it with Wireshark:

Code:
echo -n -e "\x13\x37\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x18𝖓𝖎𝖝𝖊𝖗𝖘\x0c𝖓𝖊𝖙\x00\x00\x01\x00\x01" | nc -u -w1 8.8.8.8 53
(Replace Google's DNS with your favorite one)

And nope, it doesn't work, we get a server error:

[Image: tvx7ozx.png]
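
For anyone who wants to see how the payload above is laid out, here is a rough Python sketch that rebuilds the same bytes. The function name build_query is made up for this example, and the labels are deliberately left as raw UTF-8 with no IDNA step, which is exactly the thing the server rejects.

```python
import struct


def build_query(name: str, query_id: int = 0x1337) -> bytes:
    """Build a DNS A query whose labels are raw UTF-8 (no IDNA conversion)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, then three zero counts.
    header = struct.pack(">6H", query_id, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in name.split("."):
        raw = label.encode("utf-8")
        if not 0 < len(raw) <= 63:
            raise ValueError("a DNS label must be 1 to 63 bytes long")
        qname += bytes([len(raw)]) + raw  # length byte, then the label bytes
    qname += b"\x00"  # the empty root label terminates the name
    # QTYPE=1 (A record), QCLASS=1 (IN)
    return header + qname + struct.pack(">2H", 1, 1)


packet = build_query("𝖓𝖎𝖝𝖊𝖗𝖘.𝖓𝖊𝖙")
print(packet.hex(" "))
```

Each of those glyphs takes 4 bytes in UTF-8, which is where the \x18 (24) and \x0c (12) length bytes come from. The packet could be sent with a plain UDP socket and sendto() to port 53 instead of nc.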

So it's a local conversion to ascii, using normalization.
From the command line you can test using the following:

Code:
> iconv -f utf-8 -t ascii//TRANSLIT <<<𝖓𝖎𝖝𝖊𝖗𝖘.𝖓𝖊𝖙
nixers.net
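
The same folding can be poked at from Python. This is only a sketch: it uses Unicode compatibility normalization (NFKC) from the standard unicodedata module, which is my assumption of the closest stdlib equivalent and not the exact transliteration table iconv uses. The helper name to_plain is made up.

```python
import unicodedata


def to_plain(text: str) -> str:
    """Fold compatibility characters (fraktur, circled, bold...) to their base form."""
    return unicodedata.normalize("NFKC", text)


fancy = "𝖓𝖎𝖝𝖊𝖗𝖘.𝖓𝖊𝖙"
for ch in fancy[:3]:
    # decomposition() shows the mapping stored in the Unicode database
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicodedata.decomposition(ch)}")

print(to_plain(fancy))  # nixers.net
```

The decomposition field is what makes this work: each fancy glyph carries a compatibility mapping back to a plain ASCII letter.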

And that explains our initial issue. However, I'm still wondering which library does the conversion in all these tools. Let me know if you find it; my strace log was too big and I didn't want to parse it.

EDIT: I've actually found where the DNS translation is done in all these tools: they rely on getaddrinfo(3) and other OS libs.
So it starts from getaddrinfo, which then calls either __idna_to_dns_encoding or __idna_to_ascii_lz depending on the version (my guess), which relies on libidn. So libidn is doing all the dirty work.

Code:
~ > ldd $(which curl) | grep libidn
    libidn2.so.0 => /usr/lib/libidn2.so.0 (0x00007fe303ea0000)

You can actually test it on the command line too, similar to uconv:
Code:
> idn --idna-to-ascii 'https://𝐧𝐢𝐱𝐞𝐫𝐬.𝐧𝐞𝐭'
https://nixers.net

The relevant part of the getaddrinfo docs:
Code:
AI_IDN If  this flag is specified, then the node name given in node is converted
              to IDN format if necessary.  The source encoding is that of  the  current
              locale.

              If the input name contains non-ASCII characters, then the IDN encoding is
              used.  Those parts of the node name (delimited by dots) that contain non-
              ASCII characters are encoded using ASCII Compatible Encoding (ACE) before
              being passed to the name resolution functions.
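
A rough Python sketch of the per-label step that paragraph describes. This is not what glibc or libidn actually run; it approximates the mapping with NFKC plus lowercasing and uses Python's built-in punycode codec for whatever is still non-ASCII. The name to_ace is made up for the example.

```python
import unicodedata


def to_ace(hostname: str) -> str:
    """Approximate IDN handling: map each dot-separated label, then ACE-encode if needed."""
    labels = []
    for label in hostname.split("."):
        # Assumption: NFKC + lowercase stands in for the real IDNA mapping tables.
        mapped = unicodedata.normalize("NFKC", label).lower()
        if mapped.isascii():
            labels.append(mapped)  # nothing left to encode
        else:
            # ACE form: the "xn--" prefix followed by the Punycode of the label
            labels.append("xn--" + mapped.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(to_ace("𝐧𝐢𝐱𝐞𝐫𝐬.𝐧𝐞𝐭"))  # nixers.net
print(to_ace("nixers.💩"))     # nixers.xn--ls8h
```

So the fancy letters never reach Punycode at all: they are folded to ASCII first, and only labels that still contain non-ASCII characters get the xn-- treatment.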

Other interesting things you can do with browsers and domain names:
  • Point your domain name to 127.0.0.1 to trick users who currently have a server running locally
  • Representing the IP in a different form, such as a single 4B integer: http://2990468176 or multiple bytes, even though it isn't in the URL standard
  • The usual homoglyph attack, using characters that could be confused with other ones, for example: https://paypal.com@2990468176
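
To see what that last URL really points at, here is a small sketch using only Python's standard library. Note that in this particular example the trick is the @ sign: everything before it is userinfo, and the real host is the integer after it.

```python
import ipaddress
from urllib.parse import urlsplit

parts = urlsplit("https://paypal.com@2990468176")

print(parts.username)  # paypal.com  (just userinfo, not the host)
print(parts.hostname)  # 2990468176  (the actual host)

# The single 4B integer is the same address as the usual dotted form.
address = ipaddress.IPv4Address(int(parts.hostname))
print(address)       # 178.62.236.80
print(int(address))  # 2990468176
```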

Yep, the web is kind of complicated...

The forums now support unicode properly, it's magic!
z3bra
Grey Hair Nixers
(24-08-2020, 08:25 AM)venam Wrote: The forums now support unicode properly, it's magic!

Finally ! 💩
venam
Administrators
(24-08-2020, 09:23 AM)z3bra Wrote: Finally ! 💩
And here, ladies and gentlemen, we have the first emoji posted on the nixers forums.

Actually no, it was vain posting in this thread with the penguin emoji.
s0kx
Members
Well actually, the ascii art in my introduction thread gets displayed as emojis on some browsers so where's my trophy?
venam
Administrators
(24-08-2020, 12:33 PM)sokx Wrote: so where's my trophy

[Image: 6b08abb09fbb012f2fe600163e41dd5b]

You win!


Though, technically, because let's be technical on a computer enthusiast forum, your thread was posted before I added support for 4B wide UTF-8 unicode in the DB.

Your heart ♥ pushed right up to the 3B limit:
Code:
11100010 10011001 10100101

While the penguin 🐧 went further in the 4B space:
Code:
11110000 10011111 10010000 10100111

But everyone is a winner in my ♥.
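
For the curious, the two bit patterns above can be reproduced with a couple of lines of Python. The helper name utf8_bits is made up for this sketch.

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 encoding of a character as space-separated binary bytes."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))


print(utf8_bits("♥"))   # 11100010 10011001 10100101
print(utf8_bits("🐧"))  # 11110000 10011111 10010000 10100111
```

The leading 1110 marks a 3-byte sequence and the leading 11110 a 4-byte one, which is the difference the DB cared about.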
jkl
Long time nixers
Emojis. If you absolutely want to say something, but you cannot be bothered with grammar.

--
<mort> choosing a terrible license just to be spiteful towards others is possibly the most tux0r thing I've ever seen
z3bra
Grey Hair Nixers
How about we set up DNS with only emojis as TLDs then ? I'd love to access the forums using "https://nixers.🐧"

By the way, I looked up the ICANN documentation about non-ASCII domains. They're supposed to be sent as "punycode". So it's weird that your requests got translated to ASCII text.

For instance:

Code:
$ dig +noidnout nixers.💩 | grep -A1 QUESTION
;; QUESTION SECTION:
;nixers.xn--ls8h.               IN      A
venam
Administrators
(25-08-2020, 10:24 AM)z3bra Wrote: By the way, I looked up the ICANN documentation about non-ASCII domains. They're supposed to be sent as "punycode". So it's weird that your requests got translated to ASCII text.

Indeed and I've mentioned it in the original post:
Quote:Interesting, so at which level does the conversion happen? We know DNS only supports ASCII and would actually convert Unicode to Punycode if it encountered some.

The transliteration actually happens locally, as I've discovered.
z3bra
Grey Hair Nixers
Ah, I guess I must read more carefully. Sorry !
movq
Long time nixers
(24-08-2020, 08:25 AM)venam Wrote: Representing the IP in a different form, such as a single 4B integer: http://2990468176 or multiple bytes.
Didn’t know that. Makes me wonder: Why? What is/was a legitimate use case for this form?

Maybe I’ll try to find out later, unless someone posts an answer real quick. :)
venam
Administrators
(26-08-2020, 10:55 AM)vain Wrote: Why? What is/was a legitimate use case for this form?

While the previous IDN trick is a valid way of writing a host, through the concept of domain-to-ASCII or IDNA, the IPv4 representation isn't actually valid as far as the URL standard goes, but it still seems to be interpreted by browsers, cURL, and others.

You can even represent it as hex: http://0xb23eec50 or octal: http://026217566120

So I have found at which level it is done: it also comes from something getaddrinfo calls.
I've found the culprit, it is inet_aton(3).

Relevant part of the manpage, which is not compatible with the URL specs:

Code:
inet_aton() converts the Internet host address cp from the IPv4 numbers-and-dots
       notation into binary form (in network byte order) and stores it in the structure
       that inp points to.  inet_aton() returns nonzero if the address is  valid,  zero
       if not.  The address supplied in cp can have one of the following forms:

       a.b.c.d   Each  of  the  four numeric parts specifies a byte of the address; the
                 bytes are assigned in left-to-right order to produce  the  binary
                 address.

       a.b.c     Parts a and b specify the first two bytes of the binary address.  Part
                 c is interpreted as a 16-bit value  that  defines  the  rightmost  two
                 bytes of the binary address.  This notation is suitable for specifying
                 (outmoded) Class B network addresses.

       a.b       Part a specifies the first byte of the binary address.  Part b is
                 interpreted as a 24-bit value that defines the rightmost three bytes of
                 the binary address.  This notation is suitable  for  specifying
                 (outmoded) Class A network addresses.

       a         The  value  a is interpreted as a 32-bit value that is stored directly
                 into the binary address without any byte rearrangement.

       In all of the above forms, components of the dotted address can be specified  in
       decimal,  octal  (with  a  leading  0), or hexadecimal (with a leading 0X).
       Addresses in any of these forms are collectively termed IPV4 numbers-and-dots
       notation.  The form that uses exactly four decimal numbers is referred to as IPv4
       dotted-decimal notation (or sometimes: IPv4 dotted-quad notation).
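
The rules quoted above are short enough to sketch in pure Python. This is my own reading of the manpage, not glibc's code, and the names parse_part and aton are made up; it also inherits some leniency from Python's int() (underscores, for instance) that the C function does not have.

```python
import ipaddress


def parse_part(text: str) -> int:
    """Parse one component: hex with a leading 0x, octal with a leading 0, else decimal."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered[1:], 8)
    return int(lowered, 10)


def aton(address: str) -> str:
    """Accept the a, a.b, a.b.c and a.b.c.d forms and return the dotted-decimal form."""
    parts = [parse_part(p) for p in address.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected between 1 and 4 parts")
    *leading, last = parts
    # Each leading part is one byte; the last part fills all the remaining bytes.
    remaining_bits = 8 * (4 - len(leading))
    if any(not 0 <= p <= 0xFF for p in leading) or not 0 <= last < (1 << remaining_bits):
        raise ValueError("part out of range")
    value = 0
    for p in leading:
        value = (value << 8) | p
    value = (value << remaining_bits) | last
    return str(ipaddress.IPv4Address(value))


for text in ("2990468176", "0xb23eec50", "026217566120", "178.62.60496"):
    print(text, "->", aton(text))
```

All four inputs come out as 178.62.236.80, which is why the browser happily follows any of them.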

One thing is sure in computing, it's that standards are only ideals.
z3bra
Grey Hair Nixers
URL and IP addresses are not the same thing !

An IP address is a 32-bit integer. inet_aton (and many other similar functions) is used to convert human-readable strings into their binary form. The kernel always deals with IP addresses in binary form (see getaddrinfo(3)), so when opening a connection to a remote host, you MUST specify it in binary form.

A URL is a human-readable string specifying multiple pieces of information, including the remote host address when appropriate. We usually represent it with a FQDN, as it's more practical. Specifying the dotted-quad IP address is valid too, as functions like inet_aton can do the conversion. However, when you specify it in hexadecimal or octal, the job is even easier ! The program can pass it directly to the kernel, so this explains why these notations are accepted. Note that they probably just pass it to an *_aton function anyway.

They might not be in the spec, but at this point I think that the spec might be too rigid and doesn't accept these formats as they go against readability (which is what a URL is about).
venam
Administrators
(27-08-2020, 05:24 AM)z3bra Wrote: URL and IP addresses are not the same thing !

That's not really the issue, we all know that these are different things. Within the URL standard, IPv4 addresses are only allowed to be formatted a certain way, but relying on inet_aton to parse them allows all possible ways to format them, which can lead to the misinterpretation that these formats are also accepted everywhere, while they probably aren't, as this is OS specific.

(27-08-2020, 05:24 AM)z3bra Wrote: They might not be in the spec, but at this point I think that the spec might be too rigid and doesn't accept these formats as they go against readability (which is what a URL is about).

I agree, but standards are there for good practices, so it's probably better to not rely on these other forms in URLs.
venam
Administrators
I wrote a small script to do the exact opposite, from ASCII to other valid NFKD/IDNA forms (based on the map I found on unicode.org and converted to JavaScript). There's also the unorm library with the map here.

You can get pretty outputs such as: https://𝓷𝒾𝔁ₑ𝓇𝖘.𝓃ℯ𝓉.

There are many other "fancy text" generators online, I even have one here, but it's the first time I've created one that generates only valid URLs.

EDIT: I remembered upper and lower cases are equivalent and updated the script with more glyphs.
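
For reference, the idea can be sketched in Python without any external map. This is not the script mentioned above: it builds the reverse table by brute force from Python's own Unicode database, and it only checks NFKC equivalence, which is an approximation of what IDNA really accepts. The names build_variants and fancify are made up.

```python
import random
import string
import sys
import unicodedata


def build_variants() -> dict[str, list[str]]:
    """Collect every non-ASCII code point that NFKC + lowercase folds to one ASCII letter."""
    variants = {letter: [] for letter in string.ascii_lowercase}
    for cp in range(0x80, sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, they are not real characters
        folded = unicodedata.normalize("NFKC", chr(cp)).lower()
        if folded in variants:
            variants[folded].append(chr(cp))
    return variants


def fancify(hostname: str, variants: dict[str, list[str]], rng: random.Random) -> str:
    """Replace each ASCII letter with a random equivalent glyph, keeping dots and digits."""
    return "".join(
        rng.choice(variants[ch]) if ch in variants else ch
        for ch in hostname.lower()
    )


VARIANTS = build_variants()  # scans the whole code space, takes a moment
fancy = fancify("nixers.net", VARIANTS, random.Random())
print(fancy)
print(unicodedata.normalize("NFKC", fancy).lower())  # nixers.net
```

Since upper and lower cases are equivalent, lowercasing after normalization also picks up the capital-letter glyphs for free.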