ASCII & ISO-8859-1 use a single byte per character
Unicode is not a character encoding!
UTF-8, UTF-16, and UTF-32 are multibyte encodings for the Unicode set
Encoding vs Set Confusion
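The two terms are often used interchangeably
A character set is abstract: it assigns a code point to each character
An encoding defines a concrete representation of those code points as bytes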
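To make the set/encoding distinction concrete, here is a minimal sketch that encodes one code point in several encodings. The choice of U+00E9 and the %byte_count hash are illustrative assumptions, not part of the original slides.

```perl
use strict;
use warnings;
use v5.16;
use Encode qw( encode );

# One abstract character: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
my $char = "\x{e9}";

# Each encoding turns the same code point into a different byte sequence.
my %byte_count;
for my $encoding (qw( ISO-8859-1 UTF-8 UTF-16BE UTF-32BE )) {
    my $bytes = encode( $encoding, $char );
    $byte_count{$encoding} = length($bytes);
    say sprintf( '%-10s %d byte(s): %s',
        $encoding, length($bytes), unpack( 'H*', $bytes ) );
}
```

The character stays the same throughout; only its byte representation changes (1, 2, 2, and 4 bytes respectively).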
Perl's Internals
Scalar contains bytes (0-255)
Bytes can be interpreted as UTF-8 characters
The "UTF-8 flag"
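The flag can be observed with utf8::is_utf8, although it is an internal detail and application code should rarely depend on it. A small sketch, assuming the same four bytes used elsewhere in this talk:

```perl
use strict;
use warnings;
use v5.16;
use Encode qw( decode );

# Four bytes, written with \x escapes so the scalar holds plain bytes.
my $bytes = "\xF0\x9F\x98\xB8";

# Decoding produces a one-character string stored internally as UTF-8.
my $chars = decode( 'UTF-8', $bytes );

say utf8::is_utf8($bytes) ? 'flag on' : 'flag off';    # flag off
say utf8::is_utf8($chars) ? 'flag on' : 'flag off';    # flag on
```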
Bytes vs Characters
use strict;
use warnings;
use v5.16;
use Encode qw( decode );
my $bytes = join q{}, map { chr($_) } 240, 159, 152, 184;
say length $bytes; # 4
my $utf8 = decode('UTF-8', $bytes);
say length $utf8; # 1
binmode STDOUT, ':encoding(UTF-8)';
say $utf8;
Bytes vs Characters Output
$ perl code/bytes-vs-utf8
4
1
😸
decode and encode
decode - from any encoding to Perl's internal representation
encode - from Perl's internal representation to any encoding
When to decode and encode
Decode all incoming data
Encode all outgoing data
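The decode-in, encode-out rule can be sketched as a round trip. The input bytes and variable names below are made up for illustration.

```perl
use strict;
use warnings;
use v5.16;
use Encode qw( decode encode );

# Incoming data: the UTF-8 bytes for "café" (5 bytes, 4 characters).
my $input_bytes = "caf\xC3\xA9";

# Decode at the boundary, then work with characters.
my $text    = decode( 'UTF-8', $input_bytes );
my $shouted = uc $text;

# Encode again just before the data leaves the program.
my $output_bytes = encode( 'UTF-8', $shouted );

say length $input_bytes;     # 5
say length $text;            # 4
say length $output_bytes;    # 5
```

Because uc runs on the decoded string, the accented letter is uppercased as a character instead of being treated as two unrelated bytes.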
Handle (File) I/O
open my $fh, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
my $content = read_file( $file, binmode => ':encoding(UTF-8)' );
use open ':encoding(UTF-8)';
use open ':std', ':encoding(UTF-8)';
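A minimal sketch of what the encoding layer does, using a temporary file from File::Temp (an assumption for the sake of a self-contained example): the same file is read once as raw bytes and once through the layer.

```perl
use strict;
use warnings;
use v5.16;
use File::Temp qw( tempfile );

# Scratch file that is removed when the program exits.
my ( $tmp_fh, $file ) = tempfile( UNLINK => 1 );
close $tmp_fh or die "Cannot close $file: $!";

# Write one character; the layer encodes it to UTF-8 on the way out.
open my $out, '>:encoding(UTF-8)', $file or die "Cannot write $file: $!";
print {$out} "\x{1F638}";
close $out or die "Cannot close $file: $!";

# Read without decoding: we see the bytes on disk.
open my $raw, '<:raw', $file or die "Cannot read $file: $!";
my $bytes = do { local $/; <$raw> };
close $raw or die "Cannot close $file: $!";

# Read through the layer: we get the character back.
open my $in, '<:encoding(UTF-8)', $file or die "Cannot read $file: $!";
my $chars = do { local $/; <$in> };
close $in or die "Cannot close $file: $!";

say length $bytes;    # 4
say length $chars;    # 1
```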
Web Pages & Services
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
my $content = $response->decoded_content;
my $content = JSON->new->utf8->decode($json);
Caveat: decoded_content is kind of broken. Whether it actually decodes
the content the way you'd expect depends on the content type.
It decodes anything matching m{^text/}, but not other types.
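For JSON, the utf8 option is what tells the decoder to expect bytes. The sketch below uses the core JSON::PP module as a stand-in for JSON (an assumption so that the example has no non-core dependencies), and the payload is invented.

```perl
use strict;
use warnings;
use v5.16;
use JSON::PP;

# JSON as it arrives off the wire: UTF-8 encoded bytes.
my $json_bytes = qq({"cat":"\xF0\x9F\x98\xB8"});

# With ->utf8, the decoder treats its input as UTF-8 bytes
# and returns character strings.
my $decoded = JSON::PP->new->utf8->decode($json_bytes);
say length $decoded->{cat};      # 1

# Without ->utf8, each input byte is taken as one character.
my $undecoded = JSON::PP->new->decode($json_bytes);
say length $undecoded->{cat};    # 4
```

If the content has already been decoded to characters, the utf8 option should be left off; otherwise the data is decoded twice.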
Databases
use DBI;
use DBD::Pg 3.0;
my $dbh = DBI->connect(...);
Unicode Characters in Your Code
use strict;
use warnings;
use v5.16;
my $bytes = "😸";
say length $bytes; # 4
use utf8;
my $utf8 = "😸";
say length $utf8; # 1
Unicode Characters in Your Code (Take Two)
use strict;
use warnings;
use v5.16;
my $utf8_by_code = "\x{1f638}";
say length $utf8_by_code; # 1
use charnames ':full';
my $utf8_by_name = "\N{GRINNING CAT FACE WITH SMILING EYES}";
say length $utf8_by_name; # 1
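Both spellings produce the same one-character string. A quick sketch to check that, which also looks the name up from the code point with charnames::viacode:

```perl
use strict;
use warnings;
use v5.16;
use charnames ':full';

my $by_code = "\x{1f638}";
my $by_name = "\N{GRINNING CAT FACE WITH SMILING EYES}";

# The two literals are the same string.
say $by_code eq $by_name ? 'same' : 'different';    # same

# Going the other way: from code point to official name.
my $name = charnames::viacode( ord $by_code );
say sprintf( 'U+%04X %s', ord($by_code), $name );
```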
Regex Character Classes
use strict;
use warnings;
use v5.16;
use open ':std', ':encoding(UTF-8)';
my @strings = ( '12', "\x{ff11}\x{ff12}" );
for my $string (@strings) {
    if ( $string =~ /^\p{N}+$/ ) {
        say "Unicode Number $string";
    }
    if ( $string =~ /^\d+$/a ) {
        say "ASCII Number $string";
    }
}
Regex Character Classes Output
$ perl code/regex
Unicode Number 12
ASCII Number 12
Unicode Number １２
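The /a modifier matters because \d on its own is also Unicode-aware. A small sketch, with the Roman numeral added as an extra illustrative case that is not in the original example:

```perl
use strict;
use warnings;
use v5.16;

my $fullwidth = "\x{ff11}\x{ff12}";    # FULLWIDTH DIGIT ONE, TWO
my $roman     = "\x{2167}";            # ROMAN NUMERAL EIGHT

# Without /a, \d matches any Unicode decimal digit.
my $d_unicode = $fullwidth =~ /^\d+$/ ? 1 : 0;
# With /a, \d is restricted to ASCII 0-9.
my $d_ascii = $fullwidth =~ /^\d+$/a ? 1 : 0;

# \p{N} is broader than \d: it also covers letter-like numbers.
my $roman_is_n = $roman =~ /^\p{N}$/ ? 1 : 0;
my $roman_is_d = $roman =~ /^\d$/    ? 1 : 0;

say "fullwidth \\d: $d_unicode, fullwidth \\d with /a: $d_ascii";    # 1, 0
say "roman \\p{N}: $roman_is_n, roman \\d: $roman_is_d";             # 1, 0
```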