A string of unexpected lengths
When you start learning to program, or working in a new language, it’s often suggested that you build a simple program like Battleship or Tic-tac-toe. The games’ rules are well-defined and easy to grasp, and you only need to read and print text to get started. This frees you up to focus on the mechanics and ideas of the programming language you’re learning.
To create the game’s interface in the terminal,
you end up doing a lot of string formatting: board layout,
progress bars, announcements to the user.
The length of a string is useful when formatting for
terminals, since they usually use monospaced fonts.
For example, while writing a game of Battleship in Python
we might use the len() function explicitly for formatting math,
or implicitly in convenient built-in methods like center(),
to make exciting messages like the following:
>>> msg = 'battleship sunk!'
>>> len(msg)
16
>>> def underlined(msg):
...     return msg + '\n' + '-' * len(msg)
...
>>> print underlined(msg)
battleship sunk!
----------------
>>> print msg.center(30, '*')
*******battleship sunk!*******
However, the code above won’t always work as we expect because
the len()
of text isn’t necessarily the same as its width
when displayed in a terminal.
Let’s explore three ways these numbers can differ.
Multiple bytes for one character
Byte strings (known as “strings” in Python 2) have formatting methods like center()
which assume that the displayed width of a string is equal to the number of bytes it contains.
But this assumption doesn’t always hold!
The single visible character Ä
might be encoded as several bytes in a source file
or terminal.
>>> shipname = 'Ägir'
>>> shipname
'\xc3\x84gir'
>>> len(shipname)
5
>>> print shipname.center(10, '=')
==Ägir===
>>> print shipname + '\n' + '-' * len(shipname)
Ägir
-----
The number of bytes in this byte string doesn’t match the number of characters, so built-in formatting operations don’t behave correctly.
Fortunately, using Unicode strings instead of byte strings solves this problem because they usually report a length equal to the number of Unicode code points they contain.[1]
>>> shipname = u'Ägir'
>>> len(shipname)
4
>>> shipname.center(10, u'=')
u'===\xc4gir==='
>>> print shipname.center(10, u'=')
===Ägir===
>>> print shipname + '\n' + '-' * len(shipname)
Ägir
----
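The same byte-versus-code-point distinction can be sketched outside the REPL. The block below is a minimal illustration written for Python 3 (where `str` is already Unicode and `bytes` is the byte string), not the Python 2 used in the sessions above; `center_text` is a hypothetical helper name, and UTF-8 is assumed as the encoding.

```python
def center_text(raw, width, fillchar=' ', encoding='utf-8'):
    """Center text by code-point count, decoding bytes first if needed.

    Hypothetical helper: assumes byte input is encoded as `encoding`.
    """
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return text.center(width, fillchar)


encoded = 'Ägir'.encode('utf-8')      # b'\xc3\x84gir'
byte_count = len(encoded)             # 5: the character Ä takes two bytes in UTF-8
char_count = len(encoded.decode('utf-8'))  # 4: one code point per visible character
centered = center_text(encoded, 10, '=')   # '===Ägir==='
```

Decoding at the boundary and formatting only Unicode text keeps the padding arithmetic in terms of code points rather than bytes.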
ANSI escape code formatting
ANSI escape codes let us format text
by writing bytes like '\x1b[31m'
to start writing in red, and '\x1b[39m'
to stop. If we build a string containing these sequences,
the calculated length of our string won’t match its
displayed width in a terminal:
>>> s = '\x1b[31mhit!\x1b[0m'
>>> print s
hit!
>>> len(s)
13
>>> print s + '\n' + '-' * len(s)
hit!
-------------
>>> print s.center(14, '*')
hit!*
The colored string reports a length larger than its displayed width, causing problems for built-in text-alignment methods. Fortunately, there are several Python libraries that make it easier to work with colored string-like objects that don’t include formatting characters in their length calculations.
Clint’s colored strings have formatting methods that produce the output you expect:
>>> from clint.textui.colored import green
>>> len(green(u'ship'))
4
>>> green(u'ship').center(10)
<GREEN-string: u'   ship   '>
>>> print green(u'ship').center(10)
   ship   
but this no longer works once two colored strings are combined into a new colored string:
>>> from clint.textui.colored import blue, green
>>> len(green('ship') + blue('ocean'))
39
>>> green('ship') + blue('ocean')
'\x1b[32m\x1b[22mship\x1b[39m\x1b[34m\x1b[22mocean\x1b[39m'
>>> print (green('ship') + blue('ocean')).center(10, '*')
shipocean
My own attempt at solving this problem uses smart string objects which know how to concatenate:
>>> from curtsies.fmtfuncs import green, blue
>>> len(green(u'ship'))
4
>>> green(u'ship').center(10)
green("   ship   ")
>>> print green(u'ship').center(10)
   ship   
>>> s = green(u'ship') + blue(u'ocean')
>>> len(s)
9
>>> print s.center(13, '*')
**shipocean**
but it doesn’t correctly implement every formatting method yet: above, **shipocean**
has lost its color information because
a fallback implementation of center()
was used.[2]
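Without any library, the underlying idea (measure the text with its escape sequences removed) can be sketched with a regular expression. This is a minimal illustration in Python 3, not how Clint or curtsies work internally; `SGR_PATTERN`, `visible_len`, and `center_ansi` are hypothetical names, and only color/style sequences of the form `ESC [ ... m` are handled.

```python
import re

# Matches SGR (color/style) sequences such as '\x1b[31m' or '\x1b[0m'.
# Other kinds of escape codes (cursor movement etc.) are not handled here.
SGR_PATTERN = re.compile(r'\x1b\[[0-9;]*m')


def visible_len(s):
    """Length of s once SGR escape sequences are stripped out."""
    return len(SGR_PATTERN.sub('', s))


def center_ansi(s, width, fillchar=' '):
    """Center s based on its visible length, leaving escape codes intact."""
    pad = max(0, width - visible_len(s))
    left = pad // 2
    return fillchar * left + s + fillchar * (pad - left)


s = '\x1b[31mhit!\x1b[0m'
print(len(s))          # 13
print(visible_len(s))  # 4
print(repr(center_ansi(s, 14, '*')))
```

The fill characters are added outside the escape sequences, so in a terminal they appear in the default color while the text keeps its own.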
The Unicode jungle
Formatting methods of Python Unicode strings like center()
assume that the display width of a string is equal
to its character count. But this assumption doesn’t always hold!
What if we use fullwidth Unicode characters?
>>> battleship = u'扶桑'
>>> len(battleship)
2
>>> print battleship + '\n' + '-' * len(battleship)
扶桑
--
What about multiple Unicode code points that combine to display a single character?[3]
>>> battleship = u'Fuso\u0304'
>>> print battleship
Fusō
>>> len(battleship)
5
>>> print battleship.center(6, u'*')
Fusō*
The display width of a Unicode string can differ from the number of code points in it.
Fortunately, we can use the POSIX standard function wcswidth
to calculate the display width of a Unicode string.
We can use this function to rebuild our
basic formatting functionality.[4]
>>> from wcwidth import wcswidth
>>> wcswidth(battleship)
4
>>> def center(s, n, fillchar=' '):
...     pad = max(0, n - wcswidth(s))
...     lpad, rpad = (pad + 1) // 2, pad // 2
...     return lpad * fillchar + s + rpad * fillchar
...
>>> print center(battleship, 6, '*')
*Fusō*
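If installing the wcwidth package isn't an option, a rough approximation can be built from the standard library's `unicodedata` module. The sketch below is written for Python 3 and is only an approximation under stated assumptions: combining marks count as width 0, East Asian wide and fullwidth characters as width 2, and everything else as width 1. It ignores control characters, ambiguous-width characters, and emoji sequences, so it is not a replacement for wcswidth. `approx_width` and `underlined` are hypothetical names.

```python
import unicodedata


def approx_width(text):
    """Rough display width: 0 for combining marks, 2 for wide/fullwidth, else 1."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the preceding character
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2  # wide and fullwidth characters occupy two columns
        else:
            width += 1
    return width


def underlined(text):
    """Underline text with one dash per approximate display column."""
    return text + '\n' + '-' * approx_width(text)


print(approx_width('扶桑'))        # 4
print(approx_width('Fuso\u0304'))  # 4
```

With this measure, `underlined('扶桑')` produces four dashes, matching the two double-width characters.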
1. Unfortunately, for versions of Python earlier than 3.3 it’s still possible that the len() of a Unicode character like u'\U00010123' will be 2 if your Python was built to use the “narrow” internal representation of Unicode. You can check this with sys.maxunicode: if it’s less than the largest Unicode code point, 1114111 (0x10FFFF), as with the 65535 (0xFFFF) of narrow builds, some Unicode characters are going to have a len() other than 1.
2. Want to fix this? Pull requests are welcome! The fix would be pretty similar to the fix for this issue about .ljust and .rjust.
3. The Unicode spec calls this an extended grapheme cluster. Interestingly, the Character class in the Swift programming language represents an extended grapheme cluster and may be composed of multiple Unicode code points.
4. Here we’re using a pure Python implementation for compatibility and readability.