
Latin1 vs UTF8

Latin1 (ISO-8859-1) was the early default character set for documents delivered via HTTP with MIME types beginning with text/.

Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use today.

A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages.
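For instance, here's a quick sanity check in Python (a minimal sketch, not from the example below): decode all 256 possible byte values with Latin1 and round-trip them back.

# Every byte value decodes cleanly under Latin1, and the result
# round-trips back to the original bytes without loss.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin1')  # never raises
assert text.encode('latin1') == all_bytes
print(len(text))  # 256 -- one character per byte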

The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts, but as a consequence, not every byte sequence is valid. This comes from UTF8's added complexity: characters beyond the ASCII range are encoded as multi-byte sequences with a strict structure, which is why you can't just throw any sequence of bytes at it and expect it to work. Parsing the UTF8 encoding scheme can be irritatingly problematic or even have security implications.
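To make that structure concrete, here is the euro sign (which also shows up in the example below) broken into its UTF8 bytes; the bit patterns are exactly what a decoder validates. A minimal sketch:

# '€' (U+20AC) is beyond ASCII, so UTF8 spends three bytes on it
euro = '€'.encode('utf-8')
print(euro)  # b'\xe2\x82\xac'
print([f'{b:08b}' for b in euro])
# ['11100010', '10000010', '10101100']
# The lead byte 1110xxxx announces a 3-byte sequence; each
# continuation byte must start with 10xxxxxx. A stray byte like
# \x80 breaks these rules, which is what the decoder rejects.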

sequences = [
    b'\x41\x42\x43',  # valid in Latin1 and UTF8 (plain ASCII)
    b'\xe2\x82\xac',  # valid in both, but decodes differently in each
    b'\x80\x81\x82',  # valid in Latin1, invalid in UTF8
    b'\x41\x42\x80',  # valid in Latin1, invalid in UTF8
]

def decode(sequences):
    """Try decoding each byte sequence as Latin1, then as UTF8."""
    for seq in sequences:
        print(f"Decoding: {seq}")

        try:
            # Latin1 maps all 256 byte values, so this never raises
            latin1_decoded = seq.decode('latin1')
            print(f"  Decoded with Latin1: {latin1_decoded}")
        except UnicodeDecodeError as e:
            print(f"  Error decoding with Latin1: {e}")

        try:
            utf8_decoded = seq.decode('utf-8')
            print(f"  Decoded with UTF8: {utf8_decoded}")
        except UnicodeDecodeError as e:
            print(f"  Error decoding with UTF8: {e}")

decode(sequences)

$ python3 latin.py
Decoding: b'ABC'
  Decoded with Latin1: ABC
  Decoded with UTF8: ABC
Decoding: b'\xe2\x82\xac'
  Decoded with Latin1: â‚¬
  Decoded with UTF8: €
Decoding: b'\x80\x81\x82'
  Decoded with Latin1: €‚
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Decoding: b'AB\x80'
  Decoded with Latin1: AB€
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 2: invalid start byte
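On the security note above: if you must accept bytes that may not be valid UTF8, Python's codec error handlers let you decode without raising. A minimal sketch (the bad variable is just an illustrative value):

bad = b'\x41\x42\x80'

# 'replace' swaps each invalid byte for U+FFFD, the replacement character
print(bad.decode('utf-8', errors='replace'))  # AB�

# 'surrogateescape' smuggles invalid bytes through losslessly, so the
# original bytes can be recovered by encoding with the same handler
recovered = bad.decode('utf-8', errors='surrogateescape')
assert recovered.encode('utf-8', errors='surrogateescape') == bad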

