Basics of Unicode in Perl

Dave Rolsky

Character Sets

Mapping of numbers to characters
ASCII - 0-127
ISO-8859-1 (aka Latin-1) - 0-255
Unicode - 0-0x10FFFF (1,114,112 code points)

Character Encoding

Mapping of byte patterns to characters
ASCII & ISO-8859-1 use a single byte per character
Unicode is not a character encoding!
UTF-8 and UTF-16 are variable-width encodings of the Unicode set; UTF-32 uses a fixed four bytes per character
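
To make the set vs encoding distinction concrete, here is a minimal sketch that encodes the same code points three ways and counts the bytes. The helper name encoded_length is made up for illustration, and the BE variants are used only so that no byte order mark is counted.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: number of bytes needed to store $string in $encoding.
sub encoded_length {
    my ( $encoding, $string ) = @_;
    return length encode( $encoding, $string );
}

# A, e-acute, the euro sign, and the cat face used in the later slides.
for my $code_point ( 0x41, 0xE9, 0x20AC, 0x1F638 ) {
    my $char = chr $code_point;
    printf "U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
        $code_point,
        map { encoded_length( $_, $char ) } qw( UTF-8 UTF-16BE UTF-32BE );
}
```

The code point is the same in every column; only the byte count changes with the encoding.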

Encoding vs Set Confusion

Often used interchangeably
Set is abstract
Encoding defines a concrete representation

Perl's Internals

Scalar contains bytes (0-255)
Bytes can be interpreted as UTF-8 characters
The "UTF-8 flag"
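
The flag can be inspected with utf8::is_utf8, which is handy for debugging. The following is a sketch for poking at the internals only; application code should track for itself which strings are decoded instead of testing the flag.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "\xf0\x9f\x98\xb8";           # four raw bytes
my $chars = decode( 'UTF-8', $bytes );    # one character, U+1F638

# Hypothetical helper: summarize a string's length and internal flag.
sub describe {
    my ($string) = @_;
    return sprintf 'length %d, flag %s', length $string,
        utf8::is_utf8($string) ? 'on' : 'off';
}

print 'bytes: ', describe($bytes), "\n";    # bytes: length 4, flag off
print 'chars: ', describe($chars), "\n";    # chars: length 1, flag on
```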

Bytes vs Characters


use strict;
use warnings;
use v5.16;
use Encode qw( decode );

my $bytes = join q{}, map { chr($_) } 240, 159, 152, 184;
say length $bytes; # 4

my $utf8 = decode('UTF-8', $bytes);
say length $utf8; # 1

binmode STDOUT, ':encoding(UTF-8)';
say $utf8;

Bytes vs Characters Output


$ perl code/bytes-vs-utf8
4
1
😸

`decode` and `encode`

decode - from any encoding to Perl's internal representation
encode - from Perl's internal representation to any encoding

When to `decode` and `encode`

Decode all incoming data
Encode all outgoing data
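
A minimal sketch of that rule: decode at the input boundary, work on characters, encode at the output boundary. The function name shout is invented for this example.

```perl
use strict;
use warnings;
use v5.16;
use Encode qw( decode encode );

# Hypothetical helper: takes UTF-8 bytes, upper-cases the text,
# and returns UTF-8 bytes again.
sub shout {
    my ($input_bytes) = @_;
    my $text = decode( 'UTF-8', $input_bytes );    # bytes -> characters
    my $loud = uc $text;                           # operate on characters
    return encode( 'UTF-8', $loud );               # characters -> bytes
}

my $in  = "caf\xc3\xa9";    # "café" as UTF-8 bytes
my $out = shout($in);       # "CAFÉ" as UTF-8 bytes: "CAF\xc3\x89"
say length $out;            # 5
```

Calling uc on the undecoded bytes would leave \xc3\xa9 alone, because Perl would see two unrelated bytes, not one accented letter.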

Handle (File) I/O


open my $fh, '<:encoding(UTF-8)', $file
    or die "Cannot open $file: $!";
# read_file as provided by File::Slurp (assumed)
my $content = read_file( $file, binmode => ':encoding(UTF-8)' );


use open ':encoding(UTF-8)';


use open ':std', ':encoding(UTF-8)';
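
As a quick check of what the encoding layer does, this sketch writes one character through a :encoding(UTF-8) handle and reads it back. The round_trip helper and the temporary file are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper: write $text to $path as UTF-8, then read it
# back as characters through a decoding layer.
sub round_trip {
    my ( $path, $text ) = @_;

    open my $out, '>:encoding(UTF-8)', $path
        or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot read $path: $!";
    my $content = do { local $/; <$in> };    # slurp the whole file
    close $in;

    return $content;
}

my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
close $tmp_fh;

my $back = round_trip( $tmp_path, "\x{1f638}" );
printf "%d character read, %d bytes on disk\n", length $back, -s $tmp_path;
```

The program sees one character; the file holds the four UTF-8 bytes.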

Web Pages & Services


my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
my $content = $response->decoded_content;


my $content = JSON->new->utf8->decode($json);

Caveat: decoded_content may or may not decode the character set the way you'd expect, depending on the content type.
It does for anything matching m{^text/} (newer HTTP::Message releases also cover XML types), but other types, such as application/json, come back as undecoded bytes.
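
For JSON that arrives as bytes, the ->utf8 option makes the JSON decoder do the UTF-8 decoding itself. A minimal sketch, using the core JSON::PP module as a stand-in for JSON (it offers the same new/utf8/decode calls); the payload and the name_from_json helper are made up for the example.

```perl
use strict;
use warnings;
use JSON::PP ();

# JSON text as it might arrive over the wire: UTF-8 encoded bytes.
my $json_bytes = qq({"name":"caf\xc3\xa9"});

# Hypothetical helper: ->utf8 tells the decoder that its input is
# UTF-8 bytes, so the returned values are character strings.
sub name_from_json {
    my ($bytes) = @_;
    return JSON::PP->new->utf8->decode($bytes)->{name};
}

my $name = name_from_json($json_bytes);
printf "%d characters\n", length $name;    # 4 characters
```

Passing an already decoded character string to a ->utf8 decoder would be a double decode, so pick one place to decode and stick to it.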

Databases


use DBI;
use DBD::Pg 3.0;    # 3.0+ decodes UTF-8 by default when client_encoding is UTF8
my $dbh = DBI->connect(...);

Unicode Characters in Your Code


use strict;
use warnings;
use v5.16;
my $bytes = "😸";
say length $bytes; # 4

use utf8;
my $utf8 = "😸";
say length $utf8; # 1

Unicode Characters in Your Code (Take Two)


use strict;
use warnings;
use v5.16;

my $utf8_by_code = "\x{1f638}";
say length $utf8_by_code; # 1

use charnames ':full';
my $utf8_by_name = "\N{GRINNING CAT FACE WITH SMILING EYES}";
say length $utf8_by_name; # 1

Regex Character Classes


use strict;
use warnings;
use v5.16;
use open ':std', ':encoding(UTF-8)';


my @strings = ( '12', "\x{ff11}\x{ff12}" );
for my $string (@strings) {
    if ( $string =~ /^\p{N}+$/ ) {
        say "Unicode Number $string";
    }

    if ( $string =~ /^\d+$/a ) {
        say "ASCII Number $string";
    }
}

Regex Character Classes Output


$ perl code/regex
Unicode Number 12
ASCII Number 12
Unicode Number １２

Advanced topics

Composing characters & normal forms
Sorting
Character properties
Unicode and fonts
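
As a small taste of the first topic, here is a sketch of why normal forms matter: the same visible text can be spelled with different code points, and only normalization makes the spellings compare equal. It uses the core Unicode::Normalize module; the same_text helper is invented for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $composed   = "\x{e9}";      # é as a single code point
my $decomposed = "e\x{301}";    # e followed by COMBINING ACUTE ACCENT

# Hypothetical helper: compare two strings after normalizing both to NFC.
sub same_text {
    my ( $left, $right ) = @_;
    return NFC($left) eq NFC($right);
}

print 'raw eq:     ', ( $composed eq $decomposed ? 'yes' : 'no' ), "\n";    # no
print 'normalized: ', ( same_text( $composed, $decomposed ) ? 'yes' : 'no' ), "\n";    # yes
print 'NFD length: ', length( NFD($composed) ), "\n";    # 2
```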
