-
Notifications
You must be signed in to change notification settings - Fork 14
/
index.src.html
891 lines (722 loc) · 31.9 KB
/
index.src.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
<h1>The WTF-8 encoding</h1>
<pre class='metadata'>
Shortname: wtf-8
Status: LS
Boilerplate: omit feedback-header
!Issue tracking: <a href="https://github.com/SimonSapin/wtf-8/issues">On GitHub</a>
!Change history: <a href="https://github.com/SimonSapin/wtf-8/commits/gh-pages">On GitHub</a>
!Last updated: [DATE]
Editor: Simon Sapin, Mozilla https://www.mozilla.org/, https://exyr.org/
Abstract: WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of <a>UTF-8</a> that encodes <a>surrogate code points</a> if they are not in a <a lt="surrogate code point pair">pair</a>. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use <a>UTF-16</a> internally but don’t enforce the <a>well-formedness</a> invariant that surrogates must be paired.
</pre>
<link href="logo-wtf-8.svg" rel="icon">
<p class=logo><img src=logo-wtf-8.svg width=100 height=100>
<h2 id=intended-audience>
Intended audience</h2>
<a>WTF-8</a> is a hack intended to be used internally in self-contained systems
with components that need to support <a>potentially ill-formed UTF-16</a>
for legacy reasons.
Any <a>WTF-8</a> data <em>must</em> be converted
to a <a lt="Unicode text">Unicode</a> encoding at the system’s boundary
before being emitted.
<a>UTF-8</a> is recommended.
<a>WTF-8</a> <em>must not</em> be used to represent text
in a file format or for transmission over the Internet.
In particular,
the <a href="https://encoding.spec.whatwg.org/">Encoding Standard</a> [[ENCODING]]
defines <a>UTF-8</a> and other encodings for the Web.
There is no and will not be any
<a href="https://encoding.spec.whatwg.org/#names-and-labels">
encoding label</a> [[ENCODING]] or
<a href="http://www.iana.org/assignments/character-sets/character-sets.xhtml">
IANA charset alias</a> [[CHARSETS]]
for <a>WTF-8</a>.
<h2 id=motivation>
Background and motivation</h2>
<em>This section is non-normative.</em>
When <a href="http://www.unicode.org/versions/Unicode1.0.0/">Unicode 1.0</a>
was published in 1991,
it defined 65536 <a>code points</a> from U+0000 to U+FFFF
and assigned characters to around half of them.
Many software implementations chose the obvious memory representation
for Unicode text of 16 bits per <a>code point</a> / character.
At the time, “Unicode” was synonymous with that particular encoding.
To disambiguate, that encoding is now called <dfn>UCS-2</dfn>.
As subsequent versions of Unicode assigned more characters,
it became apparent that 65536 <a>code points</a> would not be sufficient.
Unicode was extended to 1114112 <a>code points</a> from U+0000 to U+10FFFF,
and the <a>UTF-16</a> encoding was introduced.
This encoding preserves compatibility with existing 16-bit based systems
and represents new (<a lt="supplementary code point">supplementary</a>) code points
as a <a lt="surrogate 16-bit code unit pair">pair</a> of “surrogates”.
<a>UTF-16</a> is designed to represent any Unicode text,
but it can not represent a <a>surrogate code point pair</a>
since the corresponding <a>surrogate 16-bit code unit pairs</a>
would instead represent a <a>supplementary code point</a>.
Therefore, the concept of <a>Unicode scalar value</a> was introduced
and <a>Unicode text</a> was restricted to not contain any <a>surrogate code point</a>.
(This was presumably deemed simpler that only restricting pairs.)
<a>UTF-16</a> was redefined to be <a>ill-formed</a>
if it contains <a>unpaired surrogate 16-bit code units</a>.
<a>UTF-8</a> was similarly redefined to be <a>ill-formed</a>
if it contains <a>surrogate byte sequences</a>.
Meanwhile, 16-bit based systems had little to no incentive
to do anything about surrogates:
For several years,
Unicode did not assign any character to <a>supplementary code points</a>,
and then (until emoji) only comparatively rare characters.
Additionally, the Unicode Standard does not require conforming implementations
to maintain <a>well-formedness</a> of <a>UTF-16</a> strings.
As a result, surrogates do occur in practice and need to be preserved.
For example:
<ul>
<li>
In ECMAScript (a.k.a. JavaScript),
<a href="http://www.ecma-international.org/ecma-262/5.1/#sec-4.3.16">
a <code>String</code> value</a>
is defined as a sequence of 16-bit integers
that <em>usually</em> represents <a>UTF-16</a> text
but may or may not be <a>well-formed</a>.
<li>
<a href="http://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx">
Windows applications normally use UTF-16</a>, but
<a href="http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx#maxpath">
the file system treats path and file names as an opaque sequence of <code>WCHAR</code>s</a>
(<a href="http://msdn.microsoft.com/en-us/library/windows/desktop/aa383751%28v=vs.85%29.aspx#WCHAR">16-bit
code units</a>).
</ul>
We say that strings in these systems are encoded in <a>potentially ill-formed UTF-16</a>
or <a>WTF-16</a>.
<a>Unpaired surrogate 16-bit code units</a> are the only case
where an arbitrary sequence of <a>16-bit code units</a> is <a>ill-formed</a> in <a>UTF-16</a>.
<a>UTF-8</a>, however, is more complex
and maintaining its <a>well-formedness</a> is arguably more valuable.
This specification defines <a>WTF-8</a>,
a superset of <a>UTF-8</a> that can losslessly represent
arbitrary sequences of <a>16-bit code unit</a> (even if <a>ill-formed</a> in <a>UTF-16</a>)
but preserves the other <a>well-formedness</a> constraints of <a>UTF-8</a>.
<h3 id=cesu-8>
Differences with CESU-8</h3>
Unicode defines a <a href="http://www.unicode.org/reports/tr26/">
Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)</a>.
<a>WTF-8</a> is different from CESU-8.
CESU-8 encodes <a>supplementary code points</a> as <a>surrogate pair byte sequences</a>
of six bytes,
whereas <a>WTF-8</a>, like <a>UTF-8</a>, encodes them as sequences of four bytes.
Therefore, CESU-8 is not a superset of <a>UTF-8</a>.
CESU-8 is also a mapping on <a>UTF-16</a> code units.
Therefore <a>unpaired surrogate byte sequences</a> are <a>ill-formed</a> in CESU-8,
whereas supporting them is the entire point of <a>WTF-8</a>.
<!--
CESU-8 was probably not designed,
but rather just happened when 16-bit based systems
tried to emit <a>UTF-8</a> but failed to handle <a>surrogate 16-bit code unit pairs</a> properly.
Unicode Technical Report #26, which defines CESU-8,
was written as after the fact to specify existing practices.
-->
<h2 id=terminology>
Terminology</h2>
These definitions correspond to those of the
<a href="http://www.unicode.org/glossary/">Glossary of Unicode Terms</a>. [[!UNICODE]]
A Unicode <dfn>code point</dfn> is any value in the Unicode codespace;
that is, the range of integers from 0 to 1114111.
It is noted with a “U+” prefix and four to six hexadecimal digits:
the first and last code points are U+0000 and U+10FFFF.
The <dfn>Basic Multilingual Plane</dfn> is
the range of <a>code points</a> from U+0000 to U+FFFF.
A <dfn>BMP code point</dfn> is a <a>code point</a> in the <a>Basic Multilingual Plane</a>.
A <dfn>supplementary code point</dfn>
is a <a>code point</a> not in the Basic Multilingual Plane.
That is, a <a>code point</a> in the range from U+10000 to U+10FFFF.
A <dfn>Unicode scalar value</dfn>
is a <a>code point</a> that is not a <a>surrogate code point</a>.
That is, a <a>code point</a> in the range from U+0000 to U+D7FF,
or in the range U+E000 to U+10FFFF.
A <dfn>BMP scalar value</dfn> is a <a>Unicode scalar value</a>
in the <a>Basic Multilingual Plane</a>.
That is, a <a>code point</a> in the range from U+0000 to U+D7FF,
or in the range U+E000 to U+FFFF.
<dfn>Unicode text</dfn> is a sequence of <a>Unicode scalar values</a>.
<dfn>UTF-8</dfn> is an encoding of <a>Unicode text</a> using 8-bit bytes.
Each <a>Unicode scalar value</a> is represented as a sequence of one to four bytes.
<dfn>UTF-16</dfn> is an encoding of <a>Unicode text</a> using <a>16-bit code units</a>.
<a>BMP scalar values</a> are represented as a single <a>16-bit code unit</a> with the same value.
<a>Supplementary code points</a> are represented as a <a>surrogate 16-bit code unit pair</a>.
Note: this specification is only concerned with the UTF-16
<a href="http://www.unicode.org/glossary/#character_encoding_form">encoding form</a>
(based on <a>16-bit code units</a>),
and not with the
<a href="http://www.unicode.org/glossary/#character_encoding_scheme">encoding scheme</a>
(based on bytes, with UTF-16BE and UTF-16LE variants).
A string is <dfn lt="well-formed|well-formedness">well-formed</dfn>
(not <dfn lt="ill-formed|ill-formedness">ill-formed</dfn>)
in a given encoding if it follows the specification of that encoding.
[[!UNICODE]] defines
<a href="http://www.unicode.org/glossary/#well_formed_code_unit_sequence">
Well-Formed Code Unit Sequence</a>
for <a>UTF-8</a> and <a>UTF-16</a>.
<div class=note>
In particular:
<ul>
<li><a>Unpaired surrogate 16-bit code units</a> are <a>ill-formed</a> in <a>UTF-16</a>.
<li><a>Surrogate byte sequences</a> are <a>ill-formed</a> in <a>UTF-8</a>.
<li><a>Surrogate pair byte sequences</a> are <a>ill-formed</a> in <a>WTF-8</a>.
</ul>
</div>
The <dfn>replacement character</dfn> is the <a>code point</a> U+FFFD
<code>REPLACEMENT CHARACTER</code> (�).
It is used as a substitute to replace <a>ill-formed</a> sub-sequences
during a conversion.
A <dfn>16-bit code unit</dfn> is a 16-bit integer used in <a>UTF-16</a>.
It is noted with a “0x” prefix and four hexadecimal digits:
the first and last 16-bit code units are 0x0000 and 0xFFFF.
Note: The byte serialization or memory representation of a <a>16-bit code unit</a>
(little-endian or big-endian)
is out of scope for this specification.
When an algorithm iterates over a sequence (“For every <var>i</var> in …”),
<dfn lt="consume">consuming</dfn> the next item means advancing in the sequence
such that that item will be skipped during the following iteration of the loop:
the item after the next becomes the next item.
<div class=example>
The following algorithm prints “1”, “2”, and “4”.
For every digit <var>i</var> in “1234”, run these substeps:
<ol>
<li>Print <var>i</var>
<li>If <var>i</var> is 2, <a>consume</a> the next digit.
</ol>
</div>
<h3 id=surrogates-code-points>
Surrogate code points</h3>
A <dfn>lead surrogate code point</dfn> or <dfn>high surrogate code point</dfn>
is a <a>code point</a> in the range from U+D800 to U+DBFF.
A <dfn>trail surrogate code point</dfn> or <dfn>low surrogate code point</dfn>
is a <a>code point</a> in the range from U+DC00 to U+DFFF.
A <dfn>surrogate code point</dfn> is
either a <a>lead surrogate code point</a> or a <a>trail surrogate code point</a>.
That is, a <a>code point</a> in the range from U+D800 to U+DFFF.
A <dfn>surrogate code point pair</dfn> is a sequence of
a <a>lead surrogate code point</a> followed by a <a>trail surrogate code point</a>.
An <dfn>unpaired surrogate code point</dfn> is a <a>surrogate code point</a>
that is not part of a <a>surrogate code point pair</a>.
<h3 id=surrogates-code-units>
Surrogate 16-bit code units</h3>
A <dfn>lead surrogate 16-bit code unit</dfn> or <dfn>high surrogate 16-bit code unit</dfn>
is a <a>16-bit code unit</a> in the range from 0xD800 to 0xDBFF.
A <dfn>trail surrogate 16-bit code unit</dfn> or <dfn>low surrogate 16-bit code unit</dfn>
is a <a>16-bit code unit</a> in the range from 0xDC00 to 0xDFFF.
A <dfn>surrogate 16-bit code unit</dfn> is
either a <a>lead surrogate 16-bit code unit</a> or a <a>trail surrogate 16-bit code unit</a>.
That is, a <a>16-bit code unit</a> in the range from 0xD800 to 0xDFFF.
A <dfn>surrogate 16-bit code unit pair</dfn> is a sequence of
a <a>lead surrogate 16-bit code unit</a> followed by a <a>trail surrogate 16-bit code unit</a>.
In <a>UTF-16</a>, it represents a <a>supplementary code point</a>.
An <dfn>unpaired surrogate 16-bit code unit</dfn> is a <a>surrogate 16-bit code unit</a>
that is not part of a <a>surrogate 16-bit code unit pair</a>.
<h3 id=surrogates-byte-sequences>
Surrogate byte sequences</h3>
Note: A <a>surrogate byte sequence</a>
(and therefore any byte sequence described in this section)
is <a>ill-formed</a> in <a>UTF-8</a>.
Decoders are required to treat it as an error.
A <dfn>lead surrogate byte sequence</dfn> or <dfn>high surrogate byte sequence</dfn>
is a sequence of three bytes
that represents a <a>lead surrogate code point</a> in <a>generalized UTF-8</a>.
A <dfn>trail surrogate byte sequence</dfn> or <dfn>low surrogate byte sequence</dfn>
is a sequence of three bytes
that represents a <a>trail surrogate code point</a> in <a>generalized UTF-8</a>.
A <dfn>surrogate byte sequence</dfn> is
either a <a>lead surrogate byte sequence</a> or a <a>trail surrogate byte sequence</a>.
That is, a sequence of three bytes
that represents a <a>surrogate code point</a> in <a>generalized UTF-8</a>.
<table class=data>
<caption>
Table 1. Surrogate byte sequences
Bytes noted in hexadecimal.
<tr>
<th>
<th>First byte
<th>Second byte
<th>Third byte
</tr>
<tr>
<th><a>Lead surrogate byte sequence</a>
<td>ED
<td>A0 to AF
<td>80 to BF
</tr>
<tr>
<th><a>Trail surrogate byte sequence</a>
<td>ED
<td>B0 to BF
<td>80 to BF
</tr>
<tr>
<th><a>Surrogate byte sequence</a>
<td>ED
<td>A0 to BF
<td>80 to BF
</tr>
</table>
A <dfn>surrogate pair byte sequence</dfn> is a sequence six bytes composed of
a <a>lead surrogate byte sequence</a> followed by a <a>trail surrogate byte sequence</a>.
An <dfn>unpaired surrogate byte sequence</dfn> is a <a>surrogate byte sequence</a>
that is not part of a <a>surrogate pair byte sequence</a>.
<h2 id=ill-formed-utf-16>
Potentially ill-formed UTF-16</h2>
A sequence of <a>16-bit code units</a> is <dfn>potentially ill-formed UTF-16</dfn>
if it is intended to be interpreted as <a>UTF-16</a>,
but is not necessarily <a>well-formed</a> in <a>UTF-16</a>.
It effectively encodes a sequence of <a>code points</a>
that do not contain any <a>surrogate code point pair</a>.
Note: Like <a>UTF-16</a>,
<a>potentially ill-formed UTF-16</a> can not represent a <a>surrogate code point pair</a>
since the corresponding <a>surrogate 16-bit code unit pair</a> would instead
represent a <a>supplementary code point</a>.
Unlike <a>well-formed</a> <a>UTF-16</a>, it might contain isolated <a>surrogate code points</a>.
Any sequence of <a>16-bit code units</a>
has an interpretation as <a>potentially ill-formed UTF-16</a>.
<dfn>WTF-16</dfn> is sometimes used as a shorter name for <a>potentially ill-formed UTF-16</a>,
especially in the context of systems that were originally designed for <a>UCS-2</a>
and later upgraded to <a>UTF-16</a> but never enforced <a>well-formedness</a>,
either by neglect or because of backward-compatibility constraints.
<h3 id=encoding-ill-formed-utf-16>
Encoding</h3>
To <dfn id=encode-ill-formed-utf-16>encode from <a>code points</a> to
<a>potentially ill-formed UTF-16</a></dfn>,
run these steps:
<ol>
<li>Let <var>result</var> be a sequence of <a>16-bit code units</a>, initially empty.
<li>For every <a>code point</a> <var>P</var> of the input, run these substeps:
<ol>
<li>
If <var>P</var> is a <a>supplementary code point</a>,
append to <var>result</var> two <a>16-bit code units</a> of values:
<ol>
<li><code>((<var>P</var> - 0x10000) >> 10) + 0xD800</code>
<li><code>((<var>P</var> - 0x10000) & 0x3FF) + 0xDC00</code>
</ol>
<li>Otherwise (<var>P</var> is a <a>BMP code point</a>),
append to <var>result</var> a <a>16-bit code unit</a> of value <var>P</var>.
</ol>
<li>Return <var>result</var>.
</ol>
Note: If the input is restricted to <a>Unicode text</a>,
this is identical to encoding to <a>UTF-16</a>
and the resulting sequence is <a>well-formed</a> in <a>UTF-16</a>.
<div class=note>
If, on the other hand, the input contains a <a>surrogate code point pair</a>,
the conversion will be incorrect and
the resulting sequence will not represent the original code points.
This situation should be considered an error,
but this specification does not define how to handle it.
Possibilities include aborting the conversion,
or replacing one of the <a>surrogate code points</a> of the pair
with a <a>replacement character</a>.
</div>
<h3 id=decoding-ill-formed-utf-16>
Decoding</h3>
To <dfn id=decode-ill-formed-utf-16>decode from <a>potentially ill-formed UTF-16</a>
to <a>code points</a></dfn>,
run these steps:
<ol>
<li>Let <var>result</var> be a sequence of <a>code points</a>, initially empty.
<li>For every <a>16-bit code unit</a> <var>U</var> of the input, run these substeps:
<ol>
<li>
If <var>U</var> is a <a>lead surrogate 16-bit code unit</a>,
<var>U</var> is not the last <a>16-bit code unit</a> of the input,
and the next <a>16-bit code unit</a> of the input <var>next</var>
is a <a>trail surrogate 16-bit code unit</a>,
then
<a>consume</a> <var>next</var>
and append to <var>result</var> a <a>code point</a> of value
<code>
0x10000 +
((<var>U</var> - 0xD800) << 10) +
(<var>next</var> - 0xDC00)</code>.
<li>
Otherwise,
append to <var>result</var> a <a>code point</a> of value <var>U</var>.
</ol>
<li>Return <var>result</var>.
</ol>
Note: By construction,
the resulting sequence does not contain a <a>surrogate code point pair</a>.
Note: If the input is <a>well-formed</a> in <a>UTF-16</a>,
this is identical to decoding <a>UTF-16</a>
and the resulting sequence is <a>Unicode text</a>.
<h2 id=generalized-utf8>
Generalized UTF-8</h2>
For the purpose of this specification,
<dfn>generalized UTF-8</dfn> is an encoding of sequences of <a>code points</a>
(not restricted to <a>Unicode scalar values</a>)
using 8-bit bytes,
based on the same underlying algorithm as <a>UTF-8</a>.
It is a strict superset of <a>UTF-8</a>
(like <a>UTF-8</a> is a strict superset of ASCII).
Each <a>code point</a> is encoded as a sequence of one to four bytes:
<table class=data>
<caption>
Table 2. Bit distribution
Bytes noted in binary, most significant bit first.
<code>x</code> bits represent the least significant bits of the code points.
<tr>
<th>Code point
<th>First byte
<th>Second byte
<th>Third byte
<th>Fourth byte
</tr>
<tr>
<td>U+0000 to U+007F
<td>0xxxxxxx
<td>
<td>
<td>
</tr>
<tr>
<td>U+0080 to U+07FF
<td>110xxxxx
<td>10xxxxxx
<td>
<td>
</tr>
<tr>
<td>U+0800 to U+FFFF
<td>1110xxxx
<td>10xxxxxx
<td>10xxxxxx
<td>
</tr>
<tr>
<td>U+10000 to U+10FFFF
<td>11110xxx
<td>10xxxxxx
<td>10xxxxxx
<td>10xxxxxx
</tr>
</table>
A byte sequence is <a>well-formed</a> in <a>generalized UTF-8</a>
if and only if:
<ul>
<li>It is the empty string, or
<li>
It is composed of a byte sequence that is <a>well-formed</a> in <a>generalized UTF-8</a>
followed by a byte sequence in the following table.
</ul>
<table class=data>
<caption>
Table 3. Well-formed byte sequences representing a single code point
Bytes noted in hexadecimal.
<tr>
<th>Code point
<th>First byte
<th>Second byte
<th>Third byte
<th>Fourth byte
</tr>
<tr>
<td>U+0000 to U+007F
<td>00 to 7F
<td>
<td>
<td>
</tr>
<tr>
<td>U+0080 to U+07FF
<td>C2 to DF
<td>80 to BF
<td>
<td>
</tr>
<tr>
<td>U+0800 to U+0FFF
<td>E0
<td><strong>A0</strong> to BF
<td>80 to BF
<td>
</tr>
<tr>
<td>U+1000 to U+FFFF
<td>E1 to EF
<td>80 to BF
<td>80 to BF
<td>
</tr>
<tr>
<td>U+10000 to U+3FFFF
<td>F0
<td><strong>90</strong> to BF
<td>80 to BF
<td>80 to BF
</tr>
<tr>
<td>U+40000 to U+FFFFF
<td>F1 to F3
<td>80 to BF
<td>80 to BF
<td>80 to BF
</tr>
<tr>
<td>U+100000 to U+10FFFF
<td>F4
<td>80 to <strong>8F</strong>
<td>80 to BF
<td>80 to BF
</tr>
</table>
<h2 id=the-wtf-8-encoding>
The WTF-8 encoding</h2>
<dfn>WTF-8</dfn> (Wobbly Transformation Format − 8-bit)
is an encoding of <a>code point</a> sequences
that do not contain any <a>surrogate code point pair</a>
using 8-bit bytes.
Note: Like <a>UTF-8</a> is artificially restricted to <a>Unicode text</a>
in order to match <a>UTF-16</a>,
<a>WTF-8</a> is artificially restricted to exclude <a>surrogate code point pairs</a>
in order to match <a>potentially ill-formed UTF-16</a>.
It is identical to <a>generalized UTF-8</a>,
with the additional <a>well-formedness</a> constraint that
a <a>surrogate pair byte sequence</a> is <a>ill-formed</a>.
It is a strict subset of <a>generalized UTF-8</a>
and a strict superset of <a>UTF-8</a>.
Note: Similarly, <a>UTF-8</a> is a strict superset of ASCII.
<a>WTF-8</a> <em>must not</em> be used for interchange.
See <a href="#intended-audience">Intended audience</a>.
<h3 id=encoding-wtf-8>
Encoding</h3>
To <dfn id=encode-to-wtf-8>encode from <a>code points</a>
to <a>well-formed</a> <a>WTF-8</a></dfn>,
run these steps:
<ol>
<li>Let <var>result</var> be a sequence of bytes, initially empty.
<li>For every <a>code point</a> <var>P</var> of the input, run these substeps:
<ol>
<li>
If <var>P</var> is a <a>lead surrogate code point</a>,
<var>P</var> is not the last <a>code point</a> of the input,
and the <var>next</var> <a>code point</a> is a <a>trail surrogate code point</a>,
<a>consume</a> the next code point and
set <var>P</var>’s value to:
<code>
0x10000 +
((<var>P</var> - 0xD800) << 10) +
(<var>next</var> - 0xDC00)</code>.
<li>Depending on <var>P</var>:
<dl class=switch>
<dt>U+0000 to U+007F
<dd>Append to <var>result</var> one byte of value <var>P</var>.
<dt>U+0080 to U+07FF
<dd>
Append to <var>result</var> two bytes of values:
<ol>
<li><code>0xC0 | (<var>P</var> >> 6)</code>
<li><code>0x80 | (<var>P</var> & 0x3F)</code>
</ol>
<dt>U+0800 to U+FFFF
<dd>
Append to <var>result</var> three bytes of values
<ol>
<li><code>0xE0 | (<var>P</var> >> 12)</code>
<li><code>0x80 | ((<var>P</var> >> 6) & 0x3F)</code>
<li><code>0x80 | (<var>P</var> & 0x3F)</code>
</ol>
<dt>U+10000 to U+10FFFF
<dd>
Append to <var>result</var> four bytes of values
<ol>
<li><code>0xF0 | (<var>P</var> >> 18)</code>
<li><code>0x80 | ((<var>P</var> >> 12) & 0x3F)</code>
<li><code>0x80 | ((<var>P</var> >> 6) & 0x3F)</code>
<li><code>0x80 | (<var>P</var> & 0x3F)</code>
</ol>
</dl>
</ol>
<li>Return <var>result</var>.
</ol>
Note: If the input contains a <a>surrogate code point pair</a>,
the resulting byte sequence will be not represent
the original sequence of <a>code points</a>.
Instead, it will represent the same <a>code points</a>
as if had been encoded in <a>potentially ill-formed UTF-16</a>.
This is also consistent with
encoding each <a>code point</a> to <a>WTF-8</a> individually,
and <a lt="concatenate two WTF-8 strings">concatenating</a>
the resulting <a>WTF-8</a> byte sequences.
<h3 id=decoding-wtf-8>
Decoding</h3>
To <dfn id=decode-from-wtf-8>decode from <a>well-formed</a> <a>WTF-8</a>
to <a>code points</a></dfn>,
run these steps:
Note: Since <a>WTF-8</a> <em>must not</em> be used for interchange
(see <a href="#intended-audience">Intended audience</a>),
this algorithm is deliberately not defined for arbitrary byte sequences.
It is only defined for byte sequences known to be <a>well-formed</a> in <a>WTF-8</a>,
such as sequences
<a lt="encode from code points to well-formed WTF-8">encoded from code points</a>,
<a lt="convert from potentially ill-formed UTF-16 to WTF-8">converted from UTF-16</a>, or
<a lt="concatenate two WTF-8 strings">concatenated</a>
from sequences themselves <a>well-formed</a> in WTF-8</a>.
<ol>
<li>Let <var>result</var> be a sequence of <a>code points</a>, initially empty.
<li>For every byte <var>B</var> of the input, depending on <var>B</var>:
<dl class=switch>
<dt>0x00 to 0x7F
<dd>Append to <var>result</var> a <a>code point</a> of value <var>B</var>.
<dt>0xC2 to 0xDF
<dd>
Let <var>B2</var> be the next byte and <a>consume</a> it.
Append to <var>result</var> a <a>code point</a> of value
<code>
((<var>B</var> & 0x1F) << 6) +
(<var>B2</var> & 0x3F)</code>
<dt>0xE0 to 0xEF
<dd>
Let <var>B2</var> and <var>B3</var> be the next two bytes,
and <a>consume</a> them.
Append to <var>result</var> a <a>code point</a> of value
<code>
((<var>B</var> & 0x0F) << 12) +
((<var>B2</var> & 0x3F) << 6) +
(<var>B3</var> & 0x3F)</code>
<dt>0xF0 to 0xF4
<dd>
Let <var>B2</var>, <var>B3</var>, and <var>B4</var> be the next three bytes,
and <a>consume</a> them.
Append to <var>result</var> a <a>code point</a> of value
<code>
((<var>B</var> & 0x07) << 18) +
((<var>B2</var> & 0x3F) << 12) +
((<var>B3</var> & 0x3F) << 6) +
(<var>B4</var> & 0x3F)</code>
</dl>
<li>Return <var>result</var>.
</ol>
Note: If the input is also <a>well-formed</a> in <a>UTF-8</a>,
this is identical to decoding <a>UTF-8</a>
and the resulting sequence is <a>Unicode text</a>.
<h3 id=converting-wtf-8-ill-formed-utf-16>
Converting between WTF-8 and potentially ill-formed UTF-16</h3>
To <dfn>convert from <a>potentially ill-formed UTF-16</a> to <a>WTF-8</a></dfn>,
run these steps:
<ul>
<li><a>Decode from potentially ill-formed UTF-16 to code points</a>
<li><a>Encode from code points to well-formed WTF-8</a>
</ul>
Note: This conversion never fails and is lossless.
To <dfn>convert from <a>WTF-8</a> to <a>potentially ill-formed UTF-16</a></dfn>,
run these steps:
<ul>
<li><a>Decode from well-formed WTF-8 to code points</a>
<li><a>Encode from code points to potentially ill-formed UTF-16</a>
</ul>
Note: This conversion never fails
and, if the input is <a>well-formed</a> in <a>WTF-8</a>, is lossless.
<h3 id=converting-wtf-8-utf-8>
Converting between WTF-8 and UTF-8</h3>
Since <a>WTF-8</a> is a superset of <a>UTF-8</a>,
any sequence of byte that is <a>well-formed</a> in <a>UTF-8</a>
is also <a>well-formed</a> in <a>WTF-8</a> and represents the same text.
To <dfn>convert from <a>UTF-8</a> to <a>WTF-8</a></dfn>,
return the input unchanged.
Note: This conversion never fails and is lossless.
To <dfn>convert lossily from <a>WTF-8</a> to <a>UTF-8</a></dfn>,
replace any <a>surrogate byte sequence</a>
with the sequence of three bytes <0xEF, 0xBF, 0xBD>,
the <a>UTF-8</a> encoding of the <a>replacement character</a>.
Note: Since <a>surrogate byte sequences</a> are also three bytes long,
this conversion can be done <em>in place</em>.
Note: This conversion never fails but is lossy.
To <dfn>convert strictly from <a>WTF-8</a> to <a>UTF-8</a></dfn>, run these steps:
<ol>
<li>If the input contains a <a>surrogate byte sequence</a>, return failure.
<li>Otherwise, return the input unchanged.
</ol>
Note: This conversion is lossless when it succeeds, but it can fail.
<h3 id=concatenating>
Concatenating WTF-8 strings</h3>
Concatenating <a>WTF-8</a> strings requires extra care to preserve <a>well-formedness</a>.
To <dfn>concatenate two <a>WTF-8</a> strings</dfn>, run these steps:
<ol>
<li>
If the left input string ends with a <a>lead surrogate byte sequence</a>
and the right input string starts with a <a>trail surrogate byte sequence</a>,
run these substeps:
<ol>
<li>
Let <var>lead</var> and <var>trail</var> be two <a>code points</a>,
the respective results of
<a lt="Decode from well-formed WTF-8 to code points">decoding from WTF-8</a>
these two <a>surrogate byte sequences</a>.
<li>
Let <var>supplementary</var> be the
<a lt="Encode from code points to well-formed WTF-8">encoding to WTF-8</a>
of a single code point of value
<code>0x10000 + ((<var>lead</var> - 0xD800) << 10) + (<var>trail</var> - 0xDC00)</code>
<li>
Let <var>left</var> be substring of the left input string
that removes the three final bytes.
<li>
Let <var>right</var> be substring of the right input string
that removes the three initial bytes.
<li>
Return the concatenation of
<var>left</var>, <var>supplementary</var>, and <var>right</var>.
</ol>
<li>Otherwise, return the concatenation of the two input byte sequences
</ol>
Note: This is equivalent to
converting both strings
<a lt="convert from WTF-8 to potentially ill-formed UTF-16">
to potentially ill-formed UTF-16</a>,
concatenating the resulting <a>16-bit code unit</a> sequences,
then converting the concatenation back
<a lt="convert from potentially ill-formed UTF-16 to WTF-8">to WTF-8</a>.
<h2 id=implementations>
Implementations</h2>
<em>This section is non-normative.</em>
<ul>
<li>
<a href="http://s48.org/">Scheme 48</a> uses an encoding they call <em>UTF-8of16</em>
to encode <a href="http://s48.org/1.8/manual/manual-Z-H-6.html#node_sec_5.15">filenames
and other operating system strings on Windows</a>.
This encoding is identical to <a>WTF-8</a>.
<li>
<a href="http://racket-lang.org/">Racket</a>
uses an encoding they call
<a href="http://docs.racket-lang.org/reference/bytestrings.html#%28part._.Bytes_to_.Bytes_.Encoding_.Conversion%29">
<em>platform-UTF-8</em></a> to
<a href="http://plt.eecs.northwestern.edu/snapshots/current/doc/reference/windowspaths.html#%28part._windowspathrep%29">
encode filenames on Windows</a>.
This encoding is identical to <a>WTF-8</a>.
<li>
<a href="https://github.com/mathiasbynens/wtf-8">wtf-8.js</a>
implements this specification in JavaScript.
<li>
<a href="https://github.com/SimonSapin/rust-wtf8">rust-wtf8</a>
implements this specification in <a href="http://www.rust-lang.org/">Rust</a>.
<li>
On Windows (which uses <a>potentially ill-formed UTF-16</a> in its APIs),
the Rust standard library
<a href="https://github.com/rust-lang/rfcs/blob/master/text/0517-io-os-reform.md#string-handling">uses WTF-8</a>
internally for <em>OS strings</em>,
but does not expose the <a>WTF-8</a> byte sequences.
</ul>
<h2 id=acknowledgments>
Acknowledgments</h2>
Thanks to Coralie Mercier for
<a href="https://twitter.com/koalie/status/506821684687413248">
coining the name WTF-8</a>.
Thanks for feedback and contributions from
Anne van Kesteren,
David Baron,
Dylan Petonke,
Guillaume Knispel,
Henri Sivonen,
Jacob Lifshay,
James Graham,
Lily Ballard,
Mathias Bynens,
Ms2ger,
Sam Tobin-Hochstadt,
Tab Atkins.