Using string and bytes values
Python string values are similar—in some respects—to simple numeric types. There are a few arithmetic-like operators available and all of the comparisons are defined. Strings are immutable: we cannot change a string. We can, however, easily build new strings from existing strings, making the mutability question as irrelevant for string objects as it is for number objects. Python has two kinds of string values:
- Unicode: These strings use the entire Unicode character set. These are the default strings Python uses. The input-output libraries are all capable of a wide variety of Unicode encoding and decoding. The name for this type is str. It's a built-in type, so it starts with a lowercase letter.
- Bytes: Many file formats and network protocols are defined over bytes, not Unicode characters. Python displays bytes values using ASCII characters where it can, and escape sequences for everything else. Special arrangements must be made to process bytes. The internal type name is bytes.
We can easily encode Unicode into a sequence of bytes. We can just as easily decode a sequence of bytes to see the Unicode characters. We'll show these two methods in the Converting between Unicode and bytes section, after we've looked at literals and operators.
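As a minimal sketch of the two types and of immutability, the following assumes nothing beyond the built-in str and bytes types; the variable names are ours, chosen only for illustration.

```python
# str holds Unicode characters; bytes holds raw byte values.
text = "π is a letter"
data = b"pi is a letter"

print(type(text).__name__)   # str
print(type(data).__name__)   # bytes

# Strings are immutable: assigning to a position raises TypeError.
try:
    text[0] = "p"
except TypeError as error:
    print("cannot change a string:", error)

# Building a new string from an existing one is the usual alternative.
new_text = "p" + text[1:]
print(new_text)              # p is a letter
```

The original string is untouched; new_text is a separate object built from a slice of it.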
Writing string literals
String literals are characters surrounded by string delimiters. Python offers a variety of string delimiters to solve a variety of problems. The most common literals create Unicode strings:
- Short string: Use either " or ' to surround the string. For example: "Don't Touch" has an embedded apostrophe. 'Speak "friend" and enter' has embedded quotes. In the rare cases where we have both, we can use \ to escape a quote: '"Don\'t touch," he said.' uses apostrophes as delimiters, and an escaped apostrophe within the string. While a short string literal must be complete on a single line, a '\n' will expand into a proper newline character internally.
- Long string: Use either """ or ''' to surround a multi-line string. The string can span as many lines as necessary. A long string can include any characters except for the terminating triple-quote or triple-apostrophe.
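The delimiter choices described above can be compared side by side. This is only a sketch; the variable names are illustrative.

```python
# Short strings: pick the delimiter that does not appear in the text.
apostrophe_inside = "Don't Touch"
quotes_inside = 'Speak "friend" and enter'

# When both appear, escape the one that matches the delimiter.
both = '"Don\'t touch," he said.'
print(both)

# A '\n' escape in a short string and a real line break in a long
# string produce the same newline character.
two_lines = "first\nsecond"
long_string = """first
second"""
print(two_lines == long_string)   # True
```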
Python has a moderate number of \
escape sequences to allow us to enter characters that aren't possible from a keyboard. If we use ordinary str
literals, Python replaces all the escape sequences with proper Unicode characters. In an ordinary bytes
literal, each escape sequence becomes a one-byte ASCII character.
Many Python programs are saved as pure ASCII text, but this is not a requirement. When saving a file in ASCII, escapes will be required for non-ASCII Unicode characters. When saving files in Unicode, then relatively few escapes are required, since any Unicode character available on our keyboard can be entered directly. Here are two examples of the same string:
>>> "String with π×r²"
>>> "String with \u03c0\u00d7r\N{superscript two}"
The first string uses Unicode characters; the file must be saved in the appropriate encoding, such as UTF-8, for this to work. The second string uses escape sequences to describe the Unicode characters. The \u
sequence is followed by a four-digit hex value. The \N{...}
escape allows the name of the character. A \U
escape—not shown in the example—requires an 8-digit hex value. The second example can be saved in any encoding, including ASCII.
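We can check that the three escape styles name the same characters. The sketch below assumes the source file is saved as UTF-8, which is Python 3's default; the long_form string is our own addition to illustrate the \U escape.

```python
direct = "String with π×r²"
escaped = "String with \u03c0\u00d7r\N{superscript two}"

# The \U escape takes exactly eight hex digits.
long_form = "String with \U000003c0\U000000d7r\U000000b2"

print(direct == escaped == long_form)   # True
```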
The most commonly-used escape sequences are \"
, \'
, \n
, \t
, and \\
to create a quote inside a quoted string, an apostrophe inside an apostrophe delimited string, a newline, a tab, and a \
character. There are a few others, but their meanings are so obscure that numeric codes usually make more sense. For example, \v
, should probably be written as \x0b
or \u000b
; the original meaning behind \v
is largely lost to history.
Note that '\u000b' is replaced by the actual Unicode character. We also have '\u240b' which is a Unicode glyph, '␋', that symbolizes the vertical tab character. Most of the non-printing ASCII control characters also have these symbolic glyphs.
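A short check, offered as a sketch, confirms that the three spellings of vertical tab are one character and that the symbolic glyph is a different, printable one. The unicodedata module is part of the standard library.

```python
import unicodedata

vertical_tab = "\v"
print(vertical_tab == "\x0b" == "\u000b")   # True

# The symbolic glyph is a separate, printable character.
symbol = "\u240b"
print(symbol != vertical_tab)               # True
print(unicodedata.name(symbol))
```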
Using raw string literals
Sometimes, we need to provide strings in which the \ character is not an escape character. When preparing regular expressions, for example, we prefer not to be forced to write \\ to represent a single \ character. Similarly, when working with Windows filenames, we don't want "C:\temp" to have an ASCII horizontal tab character ('\u0009') replace the '\t' sequence of characters in the middle of the string literal. We could write "C:\\temp" but it seems error-prone.
To avoid this escape processing, Python offers the raw string. We can prefix any of the previous four flavors of delimiters with the letter r
or R
. For example, r'\b[a-zA-Z_]\w+\b'
, is a raw string. The \
characters will be left intact by Python: the '\b
' sequences are not translated to '\u0008
' characters.
If we write this without the r prefix, we'll create a string literal equivalent to this: '\x08[a-zA-Z_]\\w+\x08'. This shows how the '\b' sequences are transformed to '\x08' characters in a non-raw string. Omitting the leading r leads to a string that does not represent the regular expression we intended.
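To see the difference in practice, here is a sketch that feeds both versions of the pattern to the re module. The sample text is made up for the illustration.

```python
import re

raw_pattern = r'\b[a-zA-Z_]\w+\b'

# Without the r prefix, each '\b' becomes a backspace character.
broken_pattern = '\b[a-zA-Z_]\\w+\b'
print(repr(broken_pattern[0]))              # '\x08'

sample = "total_1 = x + count"
print(re.findall(raw_pattern, sample))      # ['total_1', 'count']
print(re.findall(broken_pattern, sample))   # []
```

The single-letter name x is skipped because the pattern asks for a leading letter or underscore followed by at least one more word character.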
Using byte string literals
We may need to include byte strings in our programs as well as Unicode strings. In order to do this, we use a prefix of b
or B
in front of the string delimiter. A byte string is limited to ASCII characters and escape sequences that produce single-byte ASCII characters.
Generally, byte strings focus on the hexadecimal escape, \xhh, with two hex digits for byte strings. We can also use the octal escape, \ooo, with up to three octal digits.
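As a small sketch, the same three bytes can be written with hexadecimal escapes, octal escapes, or plain ASCII characters.

```python
hex_form = b"\x41\x42\x0a"
octal_form = b"\101\102\012"
plain_form = b"AB\n"

print(hex_form == octal_form == plain_form)   # True

# Iterating over bytes yields integers, not one-character strings.
print(list(hex_form))                         # [65, 66, 10]
```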
We can also prepare raw byte strings using any combination of r
or R
paired with b
or B
as a prefix to the string. Here's a regular expression in ASCII bytes:
>>> rb"\\x[0-9a-fA-F]+"
b'\\\\x[0-9a-fA-F]+'
The output is in Python's canonical notation using lengthy escapes for the '\\
' regular expression pattern.
To be fastidious, we are also able to use a u"
prefix to indicate that a given string is explicitly Unicode. This is relatively rare because it restates the default assumption. It can come in handy in a program where byte strings predominate; the use of u"some string"
can make the Unicode literal stand out from numerous b"bytes"
literals.
Using the string operators
Two of the arithmetic operators, +
and *
, are defined for both classes of string objects, str
and bytes
. We can use the +
operator to concatenate two string objects, creating a longer string. Interestingly, we can use the *
operator to multiply a string and an integer to create a longer string: "="*3
is '==='
.
Additionally, adjacent string literals are combined into a larger string during code parsing. Here's an example:
>>> "adjacent " 'literals'
'adjacent literals'
Since this happens at parse time, it only works for string literals. For variables or other expressions, there must be a proper +
operator.
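The following sketch puts the operators together. It also shows that a variable needs an explicit + where two literals would not; the names are illustrative.

```python
rule = "=" * 3
greeting = "hello" + " " + "world"
print(rule, greeting)              # === hello world

# Adjacent literals are merged when the code is parsed.
joined = "adjacent " 'literals'

# With a variable on one side, the + operator is required.
prefix = "adjacent "
combined = prefix + 'literals'
print(joined == combined)          # True

# The same operators work for bytes.
arrow = b"-" * 2 + b">"
print(arrow)                       # b'-->'
```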
All of the comparison operators work for strings. The comparison operators compare two strings, character by character. We'll look at this in detail in Chapter 5, Logic, Comparisons, and Conditions.
We cannot use string operators with mixed types of operands. Using "hello" + b"world"
will raise a TypeError
exception. We must either encode the Unicode str
into bytes
, or decode the bytes
into a Unicode str
object.
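A sketch of the failure and the two ways around it follows. We use the ASCII codec here only because the sample text is plain ASCII.

```python
try:
    "hello " + b"world"
except TypeError as error:
    print("mixed types:", error)

# Decode the bytes to get a str result...
as_text = "hello " + b"world".decode("ascii")
print(as_text)    # hello world

# ...or encode the str to get a bytes result.
as_bytes = "hello ".encode("ascii") + b"world"
print(as_bytes)   # b'hello world'
```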
Strings are sequence collections. We can extract characters and slices from them. Strings also work with the in
operator. We can ask if a particular character or a substring occurs in a string like this:
>>> "i" in "bankrupted"
False
>>> "bank" in "bankrupted"
True
The first example shows the typical use for the in
operator: checking to see if a given item is in the collection. This use of in
applies to many other kinds of collections. The second example shows a feature that is unique to strings: we're looking for a given substring in a longer string.
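Because strings are sequences, indexing and slicing work alongside the in operator. This is a brief sketch; note that indexing a bytes object gives an integer rather than a one-byte string.

```python
word = "bankrupted"

first = word[0]        # a single character
piece = word[0:4]      # a slice: positions 0 through 3
print(first, piece)    # b bank

print("bank" in word)  # True
print("i" in word)     # False

# Indexing bytes gives the numeric value of the byte.
first_byte = b"bank"[0]
print(first_byte)      # 98
```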
Converting between Unicode and bytes
Most of the Python I/O libraries are aware of OS file encodings. When working with text files, we rarely need to explicitly provide encoding. We'll examine the details of Python's input-output capabilities in Chapter 10, Files, Databases, Networks, and Contexts.
When we need to encode Unicode characters as a string of bytes, we use the encode()
method of a string. Here's an example:
>>> 'String with π×r²'.encode("utf-8")
b'String with \xcf\x80\xc3\x97r\xc2\xb2'
We've provided a literal Unicode string, and encoded this into UTF-8 bytes. Python has numerous encoding schemes, all defined in the codecs
module.
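The choice of encoding changes the bytes we get, and some encodings cannot represent every character. The sketch below compares three codecs on the same text; the variable names are ours.

```python
text = "String with π×r²"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")

# Sixteen characters become nineteen UTF-8 bytes: each of the
# three non-ASCII characters needs two bytes.
print(len(text), len(utf8_bytes))   # 16 19

# Decoding with the matching codec restores the original text.
print(utf16_bytes.decode("utf-16") == text)   # True

# ASCII cannot represent π, ×, or ².
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("not ASCII:", error.reason)
```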
To decode the Unicode string represented by a string of bytes, we use the decode()
method of the bytes. Here's an example:
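>>> b'very \xe2\x98\xba\xef\xb8\x8e'.decode('utf-8')
'very ☺︎'
We've provided a byte string of eleven bytes, six of them written individually as hex escapes. We decoded this to seven Unicode code points: the five characters of 'very ', the smiley U+263A, and the variation selector U+FE0E. Together they display as six visible characters.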
Note that there are several aliases for the supported encodings. Names such as "utf-8" and "UTF-8" refer to the same codec. There are still more explained in the codecs chapter of the Python Standard Library.
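One way to confirm that several names point at the same codec is to ask the codecs module for the canonical name. This is a sketch; the list of aliases tried here is a small sample, not the full set.

```python
import codecs

aliases = ["utf-8", "UTF-8", "utf8", "utf_8"]

# codecs.lookup() returns a CodecInfo object with a canonical name.
canonical = {codecs.lookup(alias).name for alias in aliases}
print(canonical)   # {'utf-8'}

# Every alias therefore produces the same bytes.
encoded = {"π".encode(alias) for alias in aliases}
print(encoded)     # {b'\xcf\x80'}
```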
The ASCII
codec is the most commonly used of these. In addition to ASCII
, many strings and text files are encoded in UTF-8
. When downloading data from the Internet, there's often a header or other indicator that provides the encoding, in the rare case that it's not UTF-8
.
In some cases, we have a document which is in bytes, written in traditional ASCII. To work with ASCII files, we convert the bytes from the ASCII encoding to Unicode characters. Similarly, we can encode a subset of Unicode characters using the ASCII encoding instead of UTF-8.
It's possible that a given sequence of bytes does not properly encode Unicode characters. This may be because the wrong encoding was used to decode the bytes. Or it could be because the bytes are incorrect. The decode() method has additional parameters to define what to do when the bytes cannot be decoded. The values for the errors argument are strings:
- "strict" means that exceptions are raised. This is the default.
- "ignore" means that invalid bytes will be skipped.
- "replace" means that a default character will be inserted. This is defined in the codecs module. The '\ufffd' character is the default replacement.
The choice of error handling is highly application-specific.
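The three error-handling choices can be compared on one deliberately mismatched input. In this sketch the sample bytes are Latin-1 text that we try to decode as UTF-8; the sample is invented for the illustration.

```python
# 'café' encoded as Latin-1; the final byte is not valid UTF-8.
bad = b"caf\xe9"

# "strict" is the default: the bad byte raises an exception.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as error:
    print("strict:", error.reason)

ignored = bad.decode("utf-8", errors="ignore")
replaced = bad.decode("utf-8", errors="replace")
print(repr(ignored))    # 'caf'
print(repr(replaced))   # 'caf\ufffd'

# Using the encoding the bytes were written in avoids the problem.
correct = bad.decode("latin-1")
print(correct)          # café
```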
Using string methods
A string object has a large number of method functions. Most of these apply both to str
and bytes
objects. These can be separated into four groups:
- Transformers: which create new strings from old strings
- Creators: which create a string from a non-string object(s)
- Accessors: which access a string and return a fact about that string
- Parsers: which examine a string and decompose the string, or create new data objects from the string
The transformer group of method functions includes capitalize()
, center()
, expandtabs()
, ljust()
, lower()
, rjust()
, swapcase()
, title()
, upper()
, and zfill()
. These methods all make general changes to the characters of a string to create a transformed result. Methods such as lower()
and upper()
are used frequently to normalize case for comparisons:
>>> "WoRd".lower()
'word'
Using this technique allows us to write programs which are more tolerant of character strings with minor errors.
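Here is a sketch of several transformers applied to small sample strings. Each call returns a new string and leaves the original unchanged.

```python
title = "python strings"

print(title.capitalize())          # Python strings
print(title.title())               # Python Strings
print(title.upper())               # PYTHON STRINGS
print("WoRd".swapcase())           # wOrD

# Padding and alignment take a width and an optional fill character.
print("42".zfill(5))               # 00042
print("ab".center(6, "*"))         # **ab**
print("ab".ljust(4, "."))          # ab..
print("ab".rjust(4, "."))          # ..ab

# Normalizing case makes a comparison tolerant of capitalization.
same_word = "WoRd".lower() == "WORD".lower()
print(same_word)                   # True
print(title)                       # python strings (unchanged)
```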
Additional transformers include functions such as strip()
, rstrip()
, lstrip()
, and replace()
. The functions in the strip family remove whitespace. It's common to use rstrip()
on input lines to remove any trailing spaces and the trailing newline character which might be present.
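The strip family can be sketched on a single input line. The optional argument shown at the end is a standard feature of these methods that lets us strip characters other than whitespace.

```python
line = "  some data  \n"

print(repr(line.rstrip()))   # '  some data'
print(repr(line.lstrip()))   # 'some data  \n'
print(repr(line.strip()))    # 'some data'

# An argument names the characters to strip instead of whitespace.
trimmed = "xxdataxx".strip("x")
print(trimmed)               # data
```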
The replace()
function replaces any substring with another substring. If we want to do multiple independent replacements, we can do something like this.
>>> "$12,345.00".replace("$","").replace(",","")
'12345.00'
This will create an intermediate string with the "$" removed. It will create a second intermediate string from that with the , character removed. This kind of processing is handy for cleaning up raw data.
Accessing the details of a string
We use accessor methods to determine facts about the string; the results may be Boolean or integer values. For example, the count()
method returns a count of the number of places an argument substring or character was found in the object string.
Some widely-used methods include the find()
, rfind()
, index()
, and rindex()
methods which will find the position of a substring in the object string. The find()
methods return a special value of -1
if the substring isn't found. The index()
methods raise a ValueError
exception if the substring isn't found. The "r" versions find the right-most occurrence of the target substring. All of these methods are available for both str
and bytes
objects.
The endswith()
and startswith()
methods are Boolean functions; they examine the beginning or ending of a string. Here are some examples:
>>> "pleonastic".endswith("tic")
True
>>> "rediscount".find("disc")
2
>>> "postlaunch".find("not")
-1
The first example shows how we can check the ending of a string with the endswith()
method. The second example shows how the find()
method locates the offset of a given substring in a longer string. The third example shows how the find() method returns a signal value of -1 if the substring can't be found.
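The difference between the find() and index() families is easiest to see side by side. This sketch uses a sample word of our own with a repeated substring so that the "r" variants give a different answer.

```python
word = "banana"

print(word.count("an"))    # 2
print(word.find("an"))     # 1  (left-most match)
print(word.rfind("an"))    # 3  (right-most match)

# find() signals a miss with -1; index() raises an exception.
missing = "postlaunch".find("not")
print(missing)             # -1

try:
    "postlaunch".index("not")
except ValueError as error:
    print("index failed:", error)

# The same methods are available for bytes.
print(b"banana".rfind(b"an"))   # 3
```

A -1 result is easy to misuse as an index, so index() can be the safer choice when a missing substring is genuinely an error.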
Additionally, there are seven Boolean pattern-matching functions. These are isalnum()
, isalpha()
, isdigit()
, islower()
, isspace()
, istitle()
, and isupper()
. These will return True
if the function matches a given pattern. For example, "13210".isdigit()
is True
.
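A few of these Boolean methods are sketched below on sample strings of our own. Note that they return False for an empty string.

```python
print("13210".isdigit())        # True
print("abc123".isalnum())       # True
print("abc123".isalpha())       # False
print("  \t".isspace())         # True
print("Title Case".istitle())   # True
print("UPPER".isupper())        # True
print("lower".islower())        # True

# An empty string matches none of the patterns.
empty_is_digit = "".isdigit()
print(empty_is_digit)           # False
```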
Parsing strings into substrings
There are a few method functions which we can use to decompose a string into substrings. We'll hold off on looking at split()
, join()
, and partition()
in detail until Chapter 3, Expressions and Output.
As a quick overview, we'll note that split()
splits a string into a sequence of strings based on locating a possibly repeating separator substring. We might use an expression such as '01.03.05.15'.split('.')
to create the sequence ['01', '03', '05', '15']
from the longer string, by splitting on the '.
' character. The join()
method is the inverse of split()
. That means that "-".join(['01', '03', '05', '15'])
will create a new string from the individual strings and the separator; the result is '01-03-05-15'. The partition() method can be viewed as a single-item split to separate the head of a string from the tail.
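The three methods can be sketched together on the same sample string. The partition() method returns a three-part tuple of head, separator, and tail, which unpacks neatly into three variables.

```python
stamp = '01.03.05.15'

parts = stamp.split('.')
print(parts)                  # ['01', '03', '05', '15']

joined = "-".join(parts)
print(joined)                 # 01-03-05-15

# partition() splits once, at the first separator it finds.
head, separator, tail = stamp.partition('.')
print(head, separator, tail)  # 01 . 03.05.15
```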
Python's assignment statement deals very gracefully with such a method that returns more than one value. In Chapter 4, Variables, Assignment and Scoping Rules, we'll look at multiple assignment more closely.
The split()
method should not be used to parse filenames, nor should the join()
method be used to build filenames. There's a separate module, os.path
, which handles this properly by applying OS-specific rules.
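As a sketch of the os.path alternative, the following builds and takes apart a path without assuming which separator the OS uses. The directory and file names are hypothetical.

```python
import os.path

# join() inserts the separator appropriate to the current OS.
path = os.path.join("data", "2024", "report.csv")
print(path)

# split() separates the directory from the final component.
directory, filename = os.path.split(path)
print(filename)              # report.csv

# splitext() separates the extension from the rest of the name.
stem, extension = os.path.splitext(filename)
print(stem, extension)       # report .csv
```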