Jared Foy Teach me good judgement and knowledge

Collection data types and more

Published December 13th, 2018 4:50 pm

There is a lot of fascinating stuff in here.

Conversions
Earlier we saw a conversion in action when we used decimal.Decimal(). Python can output an object in two forms: representational form and string form. The representational form is meant to be a string that, if interpreted by Python (for example through eval()), would recreate the object it represents. However, not all objects can be reproduced this way, and those objects are shown in an angle-bracket form instead, for example: <module 'sys' (built-in)>
This particular output is for the sys module, but other objects that cannot be recreated use the same angle-bracket style. The string form, as noted, is meant to be human readable. If a data type doesn't have a string form and a string is required, Python uses the representational form. Python's built-in data types are aware of str.format(), so when they are passed to this method they return a suitable string to represent themselves. We can also add str.format() support to custom data types, as we will see later on. We can also override a data type's normal behavior and force it to provide either its string or its representational form by adding a 'conversion specifier' to the field. There are three specifiers: 's' forces string form, 'r' forces representational form, and 'a' forces representational form but uses only ASCII characters. For example:
>>> import decimal
>>> '{0} {0!s} {0!r} {0!a}'.format(decimal.Decimal('93.4'))
"93.4 93.4 Decimal('93.4') Decimal('93.4')"
This outputs a single string in which each conversion specifier has been applied. !s forces string form, which looks the same as the default here because a Decimal already formats itself as a string. !r puts the representational form, Decimal('93.4'), inside the string, and !a does the same using only ASCII characters, which makes no visible difference here because the representational form is already pure ASCII.
If we format a string that contains Mandarin, we can use the curly brace notation with the !a conversion specifier to get output that uses only ASCII characters. Similarly, we can use the built-in ascii() function, which also produces ASCII-only output.
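To make the ASCII-only conversion concrete, here is a minimal sketch. The sample text is my own assumption (the two characters for "Chinese language"), not something from the book.

```python
# Hypothetical sample text: two Chinese characters (U+4E2D and U+6587).
text = '中文'

# !a asks for the representational form using only ASCII characters,
# so every non-ASCII character is written as an escape sequence.
via_format = '{0!a}'.format(text)

# The built-in ascii() function produces the same result.
via_ascii = ascii(text)

print(via_format)               # '\u4e2d\u6587'
print(via_format == via_ascii)  # True
```

Note that the quotes are part of the output, because this is the representational form of the string.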
Format Specifications
The default formatting of integers, floating-point numbers, and strings is often good enough, but we can exercise fine control when we want something else. For strings we can control the fill character, the alignment within the field, and the minimum and maximum field widths. A string format specification is introduced with a colon (:), followed by an optional pair of characters: a 'fill character' and an 'alignment character'. The fill character may be any character except a closing curly brace (}), since that would end the replacement field. The alignment character may be < for left, ^ for center, or > for right. Next comes an optional minimum width integer, and if we want a maximum width we put it last as a period followed by an integer. Important: if we specify a fill character, we must also specify an alignment. We leave out the sign and type parts of the format spec because they have no real effect on strings. It is harmless but pointless to have a colon without any of the optional elements. For example:
>>>s = 'The sword of truth'
>>>'{0}'.format(s) #this shows default formatting
'The sword of truth'
If we want to pad the string to a minimum width of 25 characters, we can do this:
>>> '{0:25}'.format(s)    # strings are left-aligned by default, so the padding goes on the right
>>> '{0:>25}'.format(s)   # right-align: the padding goes on the left
>>> '{0:^25}'.format(s)   # center; when the padding can't be split evenly, the extra space goes on the right
>>> '{0:-^25}'.format(s)  # same as above, but fills with '-' instead of spaces
>>> '{0:.10}'.format(s)   # truncates to 10 characters; the '.' here is not a fill character, it introduces the maximum width, so no alignment character is needed
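To see what each of these specifications actually produces, here is a small sketch using the same string. The variable names are mine, and the last line is an extra case I added to show the difference between '.' used as a fill character and '.' used to introduce a maximum width.

```python
s = 'The sword of truth'  # 18 characters long

padded = '{0:25}'.format(s)     # default left alignment, 7 spaces on the right
right = '{0:>25}'.format(s)     # 7 spaces on the left
centered = '{0:^25}'.format(s)  # 7 spaces to share out: 3 left, 4 right
dashed = '{0:-^25}'.format(s)   # same split, filled with '-'
clipped = '{0:.10}'.format(s)   # maximum width 10, nothing to pad
# Fill '.', left-align, minimum width 25, maximum width 10.
both = '{0:.<25.10}'.format(s)

for text in (padded, right, centered, dashed, clipped, both):
    print(repr(text))
```

Printing with repr() makes the padding visible, since trailing spaces would otherwise be hard to spot.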
As previously noted, we can have replacement fields inside format specifications. This makes it possible to have computed formats. Here, for example, are a couple of ways of setting a string's maximum width using a maxwidth variable.
>>> maxwidth = 12
>>> '{0}'.format(s[:maxwidth])      # standard string slicing
'The sword of'
>>> '{0:.{1}}'.format(s, maxwidth)  # an inner replacement field supplies the maximum width
'The sword of'
Both give the same output. The difference is that the first uses standard string slicing while the second uses an inner replacement field. The period is not optional here: it is what marks the number that follows as a maximum width, so without it the 12 would be read as a minimum width.
For integers, the format specs allow us to control the fill character, the alignment, the sign, whether to use a non-locale-aware comma separator to group digits, the minimum field width, and the number base.
To format an integer we begin with a colon, after which we may use an optional fill character (anything but }) and an alignment character. These are the same as before (<, ^, >), plus = which places the fill between the sign and the number. Next is an optional sign character: '+' forces the sign to be output, '-' outputs the sign only for negative numbers, and a space outputs a space for positive numbers and a - for negative numbers. Then comes an optional minimum width integer, which may be preceded by # to get the base prefix in the output (for binary, octal, and hex) and by 0 to get zero padding. After the width we can put an optional comma, which causes the number's digits to be grouped in threes with a comma separating each group. If we want output in a base other than decimal we add a type character: b for binary, o for octal, x for lowercase hex, X for uppercase hex. We can also use 'd', which is the default decimal base. There are two more type characters: 'c', which outputs the Unicode character corresponding to the integer, and 'n', which outputs the number in a locale-sensitive way, meaning it follows the digit grouping conventions of the user's regional settings. Note that it doesn't make sense to combine 'n' with a comma, because 'n' already takes its grouping from the locale.
Okay, time for some examples:
>>> '{0:0=12}'.format(6125)  # fill with 0, placed between the sign and the digits
'000000006125'
>>> '{0:012}'.format(6125)   # 0-padding with a minimum width of 12
'000000006125'
# Both give the same output. The first spells out a fill character (0) and an alignment (=); the second uses the 0 shorthand in front of the width, which implies the same thing.
Important note: we can't specify a maximum field width for integers, because that could cause truncation, and the truncated digits would no longer represent the number. As stated earlier, we can use a comma to group the digits of a number into threes, and this also works for numbers that have decimal points. A locale, I get it now, is the set of regional conventions (language and country) that determines things like the decimal point character and the digit grouping separator.
Using 'n' has the same effect as 'd' when given an integer, and the same effect as 'g' when given a floating-point number. What makes 'n' special is that it respects the current locale, that is, the regional settings in effect for the running program. The default locale is called the C locale, and in this locale the decimal character is a period and the grouping character is an empty string. If we would like to respect the user's locale we can do this:
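Here is a short sketch of the integer options described above (sign, fill, base prefix, digit grouping, and the 'c' type). The numbers are arbitrary samples of my own choosing.

```python
n = 6125

zero_fill = '{0:0=12}'.format(n)     # fill with 0 between sign and digits
negative = '{0:012}'.format(-n)      # the sign comes first, then the zeros
star_fill = '{0:*=+12}'.format(n)    # forced sign, '*' fill after the sign
space_sign = '[{0: }] [{1: }]'.format(5, -5)

grouped = '{0:,}'.format(1234567)    # digits grouped in threes
bases = '{0:b} {0:o} {0:x} {0:X}'.format(255)
prefixed = '{0:#b} {0:#o} {0:#x}'.format(255)  # '#' adds the base prefix
letter = '{0:c}'.format(65)          # the character for code point 65

for text in (zero_fill, negative, star_fill, space_sign,
             grouped, bases, prefixed, letter):
    print(text)
```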
import locale
locale.setlocale(locale.LC_ALL, '')
There is a footnote about multithreaded programs: it is best to call locale.setlocale() only once, at program start-up, before any additional threads have been started, because the function is not usually thread-safe. Passing an empty string as the locale tells Python to try to determine the user's locale automatically, for example by examining the LANG environment variable, falling back to the C locale if this can't be done.
'n' is exceptionally useful for integers but it is more limited with floating-point numbers because when they become large they are output in exponential form. Floating point numbers have these format specs:
1. Fill character
2. Alignment within field
3. The sign
4. Whether to use a non-locale aware comma separator when grouping digits
5. Minimum field width
6. Number of digits after the decimal place
7. Whether to present the number in standard or exp form or as a percentage
We can specify how many digits to place after the decimal point by writing a '.' followed by an integer. We can add a type character at the end: 'e' for exponential form with a lowercase e, 'E' for exponential form with an uppercase E, 'f' for standard floating-point form, and 'g' for general form, which is the same as 'f' unless the number is very large or very small, in which case it is the same as 'e'.
We can also use '%', which multiplies the number by 100 and outputs it in 'f' format with a % symbol appended.
Since Python 3.1, decimal.Decimal numbers can be formatted like floats, including support for ',' to get comma-separated groups.
Python supports formatting complex numbers using the same syntax as for floats, ex:
>>>"{:,.4f}".format(3.59284e6-8.984327843e6j)
'3,592,840.0000-8,984,327.8430j'
A drawback of this approach is that exactly the same formatting is applied to both the real and the imaginary parts, but we can always use the Python 3.0 technique of accessing the complex number's attributes individually if we want to format each one differently.
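The floating-point options above can be tried out with a sketch like this one. The sample values are my own, and the last line shows one way (an assumption about what "accessing the attributes individually" would look like) to format the real and imaginary parts differently.

```python
import decimal

x = 1234567.891

fixed = '{0:.2f}'.format(x)        # two digits after the decimal point
grouped = '{0:,.2f}'.format(x)     # with comma grouping
exp_lower = '{0:.3e}'.format(x)    # exponential form, lowercase e
exp_upper = '{0:.3E}'.format(x)    # exponential form, uppercase E
percent = '{0:.1%}'.format(0.256)  # multiplied by 100, with % appended

# decimal.Decimal numbers accept the same float-style specs.
dec = '{0:,.2f}'.format(decimal.Decimal('1234567.891'))

# Format the two parts of a complex number separately via .real and .imag.
z = 3.59284e6 - 8.984327843e6j
parts = '{0.real:,.1f}{0.imag:+,.3f}j'.format(z)

for text in (fixed, grouped, exp_lower, exp_upper, percent, dec, parts):
    print(text)
```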
Example: print_unicode.py
So far we have examined the str.format() method's format specs and seen many code snippets that show particular aspects. Here we review a small yet useful example that makes use of str.format(), so that we can see format specs in a real context. This example uses some of the string methods we saw in the previous section, and it also introduces a function from the unicodedata module. The program has only 25 lines of executable code. It imports two modules, sys and unicodedata, and defines one custom function, print_unicode_table(). Let's begin by looking at a sample run to see what it does. Then we will look at the code at the end of the program, where processing really starts, and then at the custom function.
After the imports and the creation of the print_unicode_table() function, execution reaches the code at the end of the program. We begin by assuming that the user has not given a word to match on the command line. If a command-line argument is given and is -h or --help, we print the program's usage info and set word to 0 as a flag to indicate that we are finished. Otherwise, we set word to a lowercase copy of the argument the user typed in. If word is not 0, we print the table. When we print the usage info we use a format spec that has just the field name, which in this case is the position number of the argument. We could also write:
print('usage: {0[0]} [string]'.format(sys.argv))
Which would attain the same result.
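The command-line handling described above could look roughly like the sketch below. I have wrapped it in a helper function named choose_word(), which is my own hypothetical name (the program itself does this at the top level of the file), so that the logic can be tried without actually passing command-line arguments.

```python
import sys


def choose_word(argv):
    """Return None (no filter), 0 (usage was shown), or the word to match."""
    word = None  # assume the user gave no word on the command line
    if len(argv) > 1:
        if argv[1] in ('-h', '--help'):
            print('usage: {0} [string]'.format(argv[0]))
            word = 0  # flag meaning "we are finished"
        else:
            word = argv[1].lower()
    return word


# In the real program sys.argv would be passed in.
print(choose_word(['print_unicode.py', 'Arrow']))  # arrow
```

Using 0 as the "finished" flag works because None already means "no word given", so the two cases stay distinguishable.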
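import sys
import unicodedata

def print_unicode_table(word):
    # Title lines: the text 'name' centered in 40 columns, then a row of hyphens.
    print('decimal   hex chr {0:^40}'.format('name'))
    print('------- ----- --- {0:-<40}'.format(''))
    code = ord(' ')
    end = sys.maxunicode
    while code < end:
        c = chr(code)
        name = unicodedata.name(c, '*** unknown ***')
        # Print the row if no word was given or the word occurs in the name.
        if word is None or word in name.lower():
            print('{0:7} {0:5X} {0:^3c} {1}'.format(code, name.title()))
        code += 1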
In this function we used some extra blank space for clarity (which Python thankfully allows). The first two lines of the function's suite print the title lines. The first str.format() prints the text 'name' centered in a field 40 characters wide, while the second prints an empty string in a field 40 characters wide, left-aligned with a hyphen as the fill character, which produces a row of 40 hyphens. We could also use the string replication operator * (as in '-' * 40), but I think this is messier. You could also just type out a super long string of hyphens, but that lacks elegance.
We keep track of Unicode code points in the code variable, initializing it to the code point for a space. Then we set the end variable to the highest Unicode code point available (which can vary depending on how Python was built). In the while loop we get the Unicode character that corresponds to the code point and save it in the variable c. This is done niftily using the chr() function. The unicodedata.name() function returns the Unicode character name for the given Unicode character; its optional second argument is the name to use if no character name is defined. If the user didn't specify a word (word is None), or they did and it occurs in a lowercase copy of the Unicode character name, then we print the corresponding row.
Even though the code variable was passed to the .format() method only once, it is used three times in the format string. First in order to print the code as an integer, second to print the code as an uppercase hex number (X), and thirdly to print the Unicode character that corresponds to the code using the 'c' format specifier. The second argument in the format method is the character's Unicode character name. It is printed using 'title' case which means the first letter of each word is uppercased, and the following letters are lowercased. Let's just say that the .format() method is really versatile and we will use it a lot.
Character Encoding
Computers can only store bytes, when it comes down to it. These are 8-bit values which (if unsigned) range from 0x00 to 0xFF. Each character has to be represented in terms of bytes. In the good ol' days the computer pilgrims devised encoding schemes that assigned a particular character to a particular byte. For example, in the ASCII encoding 'A' is represented by 0x41, 'B' by 0x42, and so on. In the States and Western Europe the Latin-1 encoding was often used; its characters in the range 0x20 to 0x7E are the same as the corresponding characters in 7-bit ASCII, with those in the range 0xA0 to 0xFF used for accented characters and other symbols needed by non-English Latin alphabets. There are a bunch of other encodings, and needless to say, having all these different encodings has proved unwieldy, especially when you are trying to write software that is going to be used across the globe. Therefore, Unicode has been adopted almost universally.
Unicode assigns every character a 'code point', as we have seen before. Unlike the older encodings, Unicode is not limited to one byte per character, so it can represent every character in every language in a single encoding. That makes it better than the others, because it can handle text that mixes characters from several languages rather than just one. Magnificent!
There are currently more than 100,000 Unicode characters defined, so even using signed numbers (where one bit is reserved for the sign), a 32-bit integer is more than adequate to store any Unicode code point. The simplest way to store Unicode characters is as a sequence of 32-bit integers, one integer per code point. This sounds good because it should produce a one-to-one mapping of characters to 32-bit integers, which you would think would make indexing quick. Unfortunately, it isn't quite that simple, because some Unicode characters can be represented by either one or two code points. For example, é can be a single code point (0xE9) or the letter e followed by a combining acute accent (0x65 and 0x301).
These days, Unicode is stored on disk and in memory using UTF-8, UTF-16, or UTF-32. UTF-8 is backward compatible with 7-bit ASCII because its first 128 code points are represented by single-byte values that are the same as the 7-bit ASCII character values. To represent all the other Unicode characters, UTF-8 uses two, three, or four bytes per character. This makes UTF-8 very compact for representing text that is all or mostly English. The Gtk library, which is used by the GNOME desktop environment (among others), uses UTF-8, and UTF-8 seems to be becoming the de facto standard format for XML and for web pages. Java uses UCS-2, which in its original form is the same as UTF-16. UTF-16 uses two or four bytes per character, with the most common characters represented by two bytes. UCS-4 (UTF-32) uses four bytes per character. Using UTF-16 or UTF-32 for storing Unicode in files or for sending it over a network connection has a potential pitfall: if the data is sent as integers then the 'endianness' matters, that is, the order in which the bytes of each integer are written. One solution is to precede the data with a byte order mark so that readers can adapt accordingly. There is no such problem with UTF-8.
At the time the book was written, Python represented Unicode strings internally using either a two-byte format or UTF-32, depending on how it was compiled. The two-byte format is a slightly simplified version of UTF-16 that always uses two bytes per character and so can only represent code points up to 0xFFFF. With UTF-32, Python can represent all code points. The maximum code point is stored in the read-only sys.maxunicode attribute. If the value is 65,535 then Python was compiled with the two-byte format; if it is 1,114,111 (0x10FFFF) then Python can represent every code point (which I think mine is). Since Python 3.3 the value is always 1,114,111.
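The "one or two code points" issue can be demonstrated with the standard unicodedata module. This is a minimal sketch; the choice of é as the sample character is mine.

```python
import unicodedata

composed = '\u00e9'     # é as a single code point
decomposed = 'e\u0301'  # e followed by a combining acute accent

# They look the same when printed, but they are different strings.
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False

# Normalizing to the composed form (NFC) makes them compare equal.
normalized = unicodedata.normalize('NFC', decomposed)
print(normalized == composed)          # True
```

This is why indexing by code point does not always correspond to indexing by what a reader would call a character.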
The str.encode() method returns a sequence of bytes which in reality is a 'bytes' object (covered later). It is encoded according to the encoding argument we supply. Using this method we can see more clearly the difference between encodings and why making incorrect encoding assumptions can lead to errors:
>>>artist = 'Tage Åsén'
>>>artist.encode('Latin1')
b'Tage \xc5s\xe9n'
>>> artist.encode("CP850")
b'Tage \x8fs\x82n'
The opening 'b' indicates a bytes literal rather than a string literal. When creating bytes literals we can use a mixture of printable ASCII characters and hex escapes. We can't properly encode this name using the ASCII encoding, because ASCII doesn't have the Å character or any accented characters, so attempting to do so raises a UnicodeEncodeError. The Latin-1 encoding (also known as ISO-8859-1) is an 8-bit encoding that has all the necessary characters for this name. Some names are not so lucky. Unicode encodings, however, can encode all sorts of names. In UTF-16 the first two bytes are the byte order mark. Endianness is simply the order in which the bytes of a multi-byte value are stored (most significant byte first is big-endian, least significant byte first is little-endian), and the decoding function uses the byte order mark to detect which order was used so that it can adapt accordingly.
Note that the first argument of str.encode() is case-insensitive, and hyphens and underscores in the name are treated the same. Encoding names can also have aliases; for example, 'latin1' and 'latin_1' refer to the same encoding. str.encode() also accepts an optional second argument which tells it how to handle errors. Using it we can encode any string with any encoding, even if the result loses data: we can pass 'ignore' to drop the characters that can't be encoded, or 'replace' to substitute a ? for each of them. If we use 'backslashreplace', the characters that can't be encoded are replaced with \x, \u, and \U escapes. We can also use '{0!a}'.format(artist) to get an ASCII-only representation of the string. (interesting)
str.encode() is complemented by bytes.decode() and bytearray.decode(). These decode the bytes they are called on using the given encoding and return a string. The differences between Latin-1, CP850, and UTF-8 make it evident that guessing encodings probably won't work well. Thankfully, UTF-8 is becoming the lingua franca. In 50 years my kids may not even know others existed.
Python .py files use UTF-8 by default, so Python knows the encoding to use with string literals. This gives us the ability to type any character imaginable into our strings (provided Sublime Text allows such a feat!). When Python reads data from external sources like sockets, it doesn't know what encoding is being used, so it returns bytes which we can then decode accordingly. If Python is reading a text file, it uses the local encoding unless we specify another. Thankfully, some file formats specify their encoding. XML is assumed to use UTF-8 unless stated otherwise in the XML declaration. When reading XML we might extract the first 1000 bytes, look for an encoding specification, and, if one is found, decode the file using it. Otherwise we could fall back to UTF-8 and try that. This should work for any XML or plain text file that uses any of the single-byte encodings supported by Python, except for the EBCDIC-based encodings. It won't work for multibyte encodings like UTF-16 or UTF-32. At least two packages for automatically detecting a file's encoding are available from the Python Package Index (PyPI).
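A small sketch of the byte order mark and endianness ideas, using the same name. Which of the two byte order marks appears depends on the machine, so the code checks for either one rather than assuming.

```python
import codecs

artist = 'Tage Åsén'  # 9 characters

# Plain 'utf-16' prepends a 2-byte byte order mark (BOM) in native order.
utf16 = artist.encode('utf-16')
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, len(utf16))  # True 20

# Naming the byte order explicitly leaves the BOM out.
little = 'A'.encode('utf-16-le')  # least significant byte first
big = 'A'.encode('utf-16-be')     # most significant byte first
print(little, big)                # b'A\x00' b'\x00A'

# UTF-8 has no byte order issue; the two accented letters take 2 bytes each.
print(len(artist.encode('utf-8')))  # 11

# ASCII has no Å or é, so encoding fails.
try:
    artist.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```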
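Here is a sketch of the error-handling argument in action, encoding the same name to ASCII (which cannot represent Å or é). The handler names are the standard ones accepted by str.encode().

```python
artist = 'Tage Åsén'

ignored = artist.encode('ascii', 'ignore')    # unencodable characters dropped
replaced = artist.encode('ascii', 'replace')  # each one becomes '?'
escaped = artist.encode('ascii', 'backslashreplace')  # becomes a \x escape

print(ignored)   # b'Tage sn'
print(replaced)  # b'Tage ?s?n'
print(escaped)   # b'Tage \\xc5s\\xe9n'

# The !a conversion gives an ASCII-only string (not bytes), quotes included.
as_ascii = '{0!a}'.format(artist)
print(as_ascii)  # 'Tage \xc5s\xe9n'

# Encoding names are case-insensitive and treat '-' and '_' alike.
same = artist.encode('Latin-1') == artist.encode('latin_1')
print(same)      # True
```

Only 'backslashreplace' keeps enough information to work out what the original characters were; 'ignore' and 'replace' lose data.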
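To show why guessing an encoding goes wrong, this sketch encodes the name as Latin-1 and then decodes the bytes three ways. It is only an illustration with one sample name.

```python
artist = 'Tage Åsén'
data = artist.encode('latin1')

# Decoding with the encoding that was used gives the original string back.
round_trip = data.decode('latin1')
print(round_trip == artist)   # True

# Decoding with a different single-byte encoding "works" but gives wrong text.
wrong_guess = data.decode('cp850')
print(wrong_guess == artist)  # False

# Decoding as UTF-8 fails outright, because the bytes are not valid UTF-8.
try:
    data.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)            # True
```

The silent wrong answer in the middle case is the dangerous one, since no error is raised to warn us.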
Here are a couple of examples that we will work through to consolidate our learning:
quadratic.py deals with quadratic equations, which have the form ax^2 + bx + c = 0 where a ≠ 0, and which describe parabolas. The roots of these equations are derived from the formula x = (-b ± √(b^2 - 4ac)) / 2a. The b^2 - 4ac part of the formula is called the discriminant. If it is positive there are two real roots, if it is zero there is one real root, and if it is negative there are two complex roots. Let's write a program that accepts the a, b, and c factors from the user (with the b and c factors allowed to be 0), and then calculates and outputs the root(s). Here we go:
We need both the float (math) and the complex (cmath) math libraries, since the square root functions for real and complex numbers are different. We need sys for sys.float_info.epsilon, which we use to compare floating-point numbers with 0.
We also need a function that can get a floating-point number from the user. This function loops until the user enters a valid floating-point number, and it accepts zero only if allow_zero is True. Thanks to the get_float() function it is easy to get the a, b, and c factors; its Boolean second argument says whether zero is an acceptable input. The code looks different from the formula because we begin by calculating the discriminant. If the discriminant is 0, we know that we have one real solution, so we calculate it directly. Otherwise, we take the real or complex square root of the discriminant and calculate the two roots. If we'd like another way to use variables inside printable strings, we can pass the local variables to the format string by name using .format(**locals()), where locals() is a built-in function. We can even omit the field names and leave Python to populate the fields using the positional arguments passed to the str.format() method. But if we do this we just need to remember that we did! It looks like I successfully executed this program. Cheers.
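The calculation described above can be sketched as follows. This is not the book's program: the function name solve_quadratic() is hypothetical, and the interactive get_float() part is left out so that the sketch can run without user input.

```python
import cmath
import math
import sys


def solve_quadratic(a, b, c):
    """Return a tuple of the root(s) of ax^2 + bx + c = 0 (a must not be 0)."""
    discriminant = b ** 2 - 4 * a * c
    # Compare with 0 using the machine epsilon rather than ==.
    if abs(discriminant) < sys.float_info.epsilon:
        return (-b / (2 * a),)            # one real root
    if discriminant > 0:
        root = math.sqrt(discriminant)    # real square root
    else:
        root = cmath.sqrt(discriminant)   # complex square root
    return ((-b + root) / (2 * a), (-b - root) / (2 * a))


print(solve_quadratic(1, -3, 2))  # (2.0, 1.0)
print(solve_quadratic(1, 2, 1))   # (-1.0,)
print(solve_quadratic(1, 0, 1))   # (1j, -1j)
```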
Sometimes we need to take a data set and present it using HTML. Here we develop a program that reads a file in a simple CSV (Comma Separated Value) format and outputs an HTML table containing the file's data. Python has a great module for handling CSV and similar formats, the csv module (import csv). The CSV file we are working with has one record per line, and each record is divided into fields by commas. Each field may be a string or a number. Strings are enclosed in '' or "", and numbers should have no quotes unless they contain commas. Commas can be inside strings and must not be treated as field separators. Let's assume the first record contains field labels. The output we will produce is an HTML table with text left-aligned and numbers right-aligned. It will have one row per record and one cell per field. The program needs to output the HTML table's opening tag, then read each line of data and for each one output an HTML row. At the end it needs to output the table's closing tag. We want to add colors and ensure that the HTML-sensitive characters are properly escaped.
To run the program from the shell we need a little shell syntax: we name the script to run, redirect its input from the data file with <, and redirect its output to the destination file with >.
Sample data is stored in here:
data/co2-sample.csv
The program to operate on this data is:
csv2html.py
The output generated is stored in file:
co2-sample.html
The shell command would look like so:
python C:\py3eg\csv2html.py < data\co2-sample.csv > co2-sample.html
The sample data would look like so:
"COUNTRY","2000","2001","2002","2003",2004
"ANTIGUA AND BARBUDA", 0,0,0,0,0
"ARGENTINA",37,35,33,36,39
We assume the sample data is in the file data\co2-sample.csv and then given the shell command stated before this file will get output that will look like so:
<table border='1'><tr bgcolor='lightgreen'>
<td>Country</td><td align='right'>2000</td><td align='right'>2001</td>
and so on and so forth... Our program will tidy up some of the output and omit some lines. This one uses HTML 4 transitional with no stylesheet.
Although Python doesn't need an entry point the way some languages do, it is still common to create a function called main() and to call it to start off processing. Because no function can be called before it has been created, we must make sure we call main() after the functions it relies on have been defined. The order in which the functions appear in the file (that is, the order in which they are created) does not matter.
In the csv2html.py program the first function we call is main() which in turn calls print_start() and then print_line(). print_line() calls extract_fields() and escape_html(). Take a look at the program structure in the book.
When Python reads a file it begins at the top. So in this program it begins by importing sys, then creates main(), and then creates the other functions in the order in which they appear in the file. When Python finally reaches the call to main() at the end of the file, all the functions that main() will call, and all the functions that those functions will call in turn, will exist. (That is fascinating!) Execution (as we think of it) begins where the call to main() is made. Let's build the program.
The maxwidth variable is used to constrain the number of characters in a cell. If a field is longer than this (100 characters) it will be truncated. The while loop iterates over each line of input (this could be from the user typing at the keyboard, but we expect it to be a redirected file). We set the color we want to use and call print_line() to output the line as an HTML table row. Sometimes we separate out functions to provide logical clarity, even though it isn't strictly necessary in a smaller program, but I suppose it is a good practice.
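The point about definition order can be checked with a tiny sketch. The function names are placeholders of my own, not the ones in csv2html.py.

```python
def main():
    # helper() has not been created yet when this def statement runs.
    # That is fine: the name is only looked up when main() is called.
    return helper()


def helper():
    return 'helper ran'


# By the time we reach this call, both functions exist.
result = main()
print(result)  # helper ran
```

Moving the call to main() above the definition of helper() would raise a NameError, which is why the call goes at the end of the file.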
We can't use str.split(",") to split each line into fields because commas can occur inside quoted strings. In order to deal with this complication we have declared the extract_fields() function. Once we have a list of the fields (as strings, with no surrounding quotes) we can then iterate over them, creating a table cell for each one.
If a field is empty, we output an empty cell. If a field is quoted, it could be a string or it could be a number that has been quoted to allow for internal commas. To account for this possibility we make a copy of the field with the commas removed and try to convert the copy to a float. If the conversion succeeds, we output a right-aligned cell with the number rounded to the nearest whole number and shown as an integer. If the conversion fails, we output the field as a string. In that case we use str.title() to neaten the case of the letters and replace the word And with and to correct for str.title()'s effect. (str.title() always capitalizes the start of every word, so the replacement is there for whenever we stumble upon an 'and'.) If the field isn't longer than maxwidth (100 characters) we use the whole thing; otherwise we truncate it to maxwidth characters and append an ellipsis ('...') to signify that truncation has occurred. No matter what, we escape any special HTML characters that the field may contain.
The function extract_fields() reads the line it is given character by character, accumulating a list of fields (each field is a string without any enclosing quotes). The function copes with fields that are unquoted and with fields that are quoted with single or double quotes. It does this by remembering which quote character opened a field, so a double quote inside a single-quoted field (or a single quote inside a double-quoted field) is kept as an ordinary character, and commas count as field separators only outside quotes. I like how Python has this syntax, allowing a double quote to be used inside a single quoted string.
The escape_html() function replaces each special HTML character with the appropriate HTML entity. We have to replace the ampersands (&) first, because the entities themselves contain ampersands: if we replaced < with &lt; first and & afterward, the result would be double-escaped as &amp;lt;. The order doesn't matter for the angle brackets. Python's standard library includes a slightly more sophisticated version of this function, which we will see later on.
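A character-by-character field extractor along the lines described above might look like this. It is a sketch of the idea rather than a copy of the program's function, and it makes no attempt to handle every CSV corner case (for example, a trailing empty field is dropped).

```python
def extract_fields(line):
    """Split a CSV line into fields, honoring single and double quotes."""
    fields = []
    field = ''
    quote = None  # the quote character we are inside, if any
    for c in line:
        if c in '"\'':
            if quote is None:    # start of a quoted string
                quote = c
            elif quote == c:     # end of the quoted string
                quote = None
            else:                # the other kind of quote, kept as text
                field += c
            continue
        if quote is None and c == ',':  # a separator outside quotes
            fields.append(field)
            field = ''
        else:
            field += c           # ordinary character (or quoted comma)
    if field:
        fields.append(field)     # the last field has no trailing comma
    return fields


print(extract_fields('"ARGENTINA",37,35,33,36,39'))
print(extract_fields('"1,234",\'say "hi"\''))
```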
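The ordering issue is easy to see with a small experiment. The first function is a sketch of the approach described above; the second, wrong_order(), is a deliberately broken variant I wrote for comparison. The last line uses html.escape() from the standard library, which exists in Python 3.2 and later and may or may not be the function the book has in mind.

```python
import html


def escape_html(text):
    """Replace the special HTML characters, ampersands first."""
    text = text.replace('&', '&amp;')
    text = text.replace('<', '&lt;')
    text = text.replace('>', '&gt;')
    return text


def wrong_order(text):
    """Deliberately wrong: the & produced by &lt; gets escaped again."""
    text = text.replace('<', '&lt;')
    text = text.replace('&', '&amp;')
    return text


sample = 'a < b & c'
print(escape_html(sample))  # a &lt; b &amp; c
print(wrong_order(sample))  # a &amp;lt; b &amp; c
print(html.escape(sample))  # a &lt; b &amp; c
```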
We are now at the close of the chapter. To wrap things up we saw a list of keywords that Python utilizes and we saw the rules that govern identifiers. Thanks to Unicode which Python employs, identifiers aren't limited to latin-1 or ASCII. We discovered the int data type, which differs from similar types in most other languages in the sense that it has no intrinsic size limitation. Basically machine memory is the only thing that limits integer length. All of Python's most basic data types are immutable however, we usually don't see the effects of this because the augmented assignment operators work so well. Literal integers are usually written as decimal numbers but we can write binary literals using the 0b prefix, octals with 0o, and hex with 0x.
When two integers are divided uisng '/' the result is always a float. If we want integer division just use '//'. Python has a bool type which holds True of False. Three logical operators and,or,not. The two binary operators (and,or) use short-circuit logic (again, I'll have to investigate more into this concept). There are three kinds of floating-point numbers available: float(),complex(),decimal.Decimal(). We usually use float which is a double precision floating-point number whose exact numerical characteristics depend on the underlying C,C#, or Java library that Python is built on (interesting, lets learn more about this!) Complex numbers are represented as two floats, one holds the real value and the other holds the imaginary number (I don't know much about complex numbers, maybe we can look into this!) Of course as previously stated the decimal.Decimal type is provided by the decimal module (which should be obvious by now!) These numbers default to have 28 decimal places for accuracy, but we can increase or decrease in order to suit the need. All three floating-point types can be used with the appropriate built-in math operators and functions. Additionally, the math module provides a variety of trig, hyperbolic, and log functions that can be used with floats. The cmath module provides a similar set of functions for complex numbers.
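A few of the points in this summary can be checked quickly. This is just a sketch with values I picked; the record() helper is hypothetical and exists only to show which operands actually get evaluated.

```python
import decimal

# '/' always gives a float, '//' gives integer (floor) division.
print(7 / 2, 7 // 2)  # 3.5 3

# Short-circuit logic: the right operand is evaluated only if needed.
calls = []


def record(value):
    calls.append(value)  # remember that this operand was evaluated
    return value


result = record(0) and record(1)  # 0 is false, so record(1) never runs
print(result, calls)              # 0 [0]

# Binary floats cannot store 0.1 exactly; decimal.Decimal can.
print(0.1 + 0.2 == 0.3)           # False
exact = decimal.Decimal('0.1') + decimal.Decimal('0.2')
print(exact == decimal.Decimal('0.3'))  # True

# Integers have no fixed size limit.
print(2 ** 100)  # 1267650600228229401496703205376
```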
We covered a bunch of stuff about strings in this chapter! We can create them using '' or "" without formality. We can also use """ triple-quoted strings if we need to include new lines and quotes without formality. We can also use some escape sequences like tab (\t) and newline (\n) and Unicode characters both using hex escapes and also Unicode character names. Although strings support the same comparison operators as other Python types, we nod that sorting strings that contain non-English characters can be problematic. Strings are sequences, therefore the slice operator([]) can be used to slice and stride strings with simple yet powerful syntax. We can concatenate strings using +, and replicate strings with the *, these can also use the augmented assignment versions, although str.join() is more commonly used to concatenate. There are some other methods such as str.isspace(), str.isalpha() for testing. Some for changing case like str.lower(), and str.title(). Some for searching str.find(), and str.index(). As you can tell python has awesome string support. It lets us extract or compare whole strings or parts of strings in order to replace characters or substrings, and to split strings into a list of substrings. We can also join lists of strings into a single string. str.format() is perhaps the most versatile. With this method we create strings that have replacement fields in which variables may be input. We also have format specs in order to precisely define the characteristics of each field that will be replaced with a value.
Also discussed was character encoding. Python .py files use the Unicode UTF-8 encoding by default, which allows for almost every language to be supported. We can convert a string into a sequence of bytes that represent a particular encoding by using str.encode(), and we can complementarily get the characters back by using bytes.decode(). In addition to the data types covered here, Python provides two other built-in data types, bytes and bytearray, which are covered later on. Python also provides several collection data types, some built-in and others in the standard library (I should learn more about the standard library).
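The encode and decode round trip mentioned above can be sketched like this. The sample text is my own choice, and I'm writing the accented characters with hex escapes so the example doesn't depend on how this page itself is encoded.

```python
# 'naive cafe' with two accented letters, written using hex escapes.
text = "na\u00efve caf\u00e9"

# str.encode() turns characters into bytes for the chosen encoding.
utf8_bytes = text.encode("utf-8")       # accented letters take 2 bytes each
latin1_bytes = text.encode("latin-1")   # every character fits in 1 byte

# bytes.decode() reverses the process, given the same encoding.
round_trip = utf8_bytes.decode("utf-8")
```

The same ten characters come out as twelve bytes in UTF-8 and ten bytes in Latin-1, which is why the encoding has to be known in order to decode correctly.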
Chapter 3 in which we will cover sequence types, set types, mapping types, iterating and copying collections.
We have already learned the most important fundamental data types. Let's now learn how to gather data items together using Python's collection data types. We'll cover tuples and lists, and also introduce new collection data types, including sets and dictionaries, and then we'll cover them in depth. We'll also see how to create data items that are aggregates of other data items (like C or C++ structs or Pascal records). These items can be treated as a single unit when it is convenient, while their items remain individually accessible. We can put aggregate items in collections just like any other items. Putting data items in collections makes it a lot easier to perform operations that must be applied to all of the items, and it also makes it easier to handle collections of items read in from files. We'll also cover the basics of text file handling. After covering the individual collection data types we will look at how to iterate over collections, since the same syntax is used for all of Python's collections, and we'll also explore the issues and techniques involved in copying collections.
Sequence types
A 'sequence type' is one that supports the membership operator (in), the size function (len()), slices ([]), and is also iterable (meaning we can pass over the items inside). Python gives five built-in sequence types: 'bytearray','bytes','list','str', and 'tuple'. We'll deal with bytearray's and bytes later. We also have a few collection types which come in the standard library like collections.namedtuple. When iterated, all of these sequences provide their items in order. Let's cover tuples, named tuples, and lists.
Tuples
A tuple is an ordered sequence of zero or more object references. Tuples support the same slicing and striding as strings do, so we can easily extract items from them. Like strings, tuples are immutable: we can't replace or delete their items. If we want to be able to modify an ordered sequence we should use a list instead, which is mutable. If it happens that we have a tuple and would like to modify it, we can just call list() on it and then apply the changes to the list that it spits out. We can call the tuple data type as a function, like so: tuple(). If we input no args it returns an empty tuple. If we give it a tuple arg it returns a shallow copy of the argument, and with any other argument it attempts to convert the given object to a tuple. It accepts at most one argument. We can also create an empty tuple without using the tuple() function, simply by writing (), and a tuple of one or more items can be created by using commas. Sometimes tuples must be enclosed in () to avoid syntactic ambiguity. For example, to pass the tuple 1, 2, 3 to a function we would write this_is_a_func((1,2,3)). The difference with tuples when it comes to indexing is that tuples simply have an object reference at every position, instead of a character like that of a string.
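Here is a small sketch of the different ways to make a tuple described above, plus the list() trick for "modifying" one. The variable names and sample values are my own.

```python
empty = ()                    # empty tuple literal
also_empty = tuple()          # same thing, calling the type as a function
single = ("gray",)            # the trailing comma is what makes this a tuple
not_a_tuple = ("gray")        # just a string wrapped in parentheses
from_string = tuple("abc")    # any iterable can be converted

# Tuples are immutable, so edit a list copy and convert back if needed.
hair = ("black", "brown", "blonde")
editable = list(hair)
editable[1] = "auburn"
updated = tuple(editable)     # hair itself is left unchanged
```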
With tuples we have two methods: t.count(x), which returns the number of times object x occurs in tuple t, and t.index(x), which returns the index position of the leftmost occurrence of object x in tuple t (this I think answers the question I had posed earlier on about which character is given when indexing a string: tuples behave similarly, and the leftmost one is given for strings as well). If x is not in the tuple at all, t.index(x) raises a ValueError exception. Note: these methods are also available for lists.
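A quick sketch of both methods, including the ValueError case. The eye colors are sample data of my own.

```python
eyes = ("green", "blue", "green", "azul")

green_count = eyes.count("green")    # how many times 'green' occurs
first_green = eyes.index("green")    # position of the leftmost 'green'

# Strings behave the same way: the leftmost match is reported.
first_a = "banana".index("a")

# Asking for the index of a missing item raises ValueError.
missing_raised = False
try:
    eyes.index("hazel")
except ValueError:
    missing_raised = True
```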
Additionally, we can use concatenation, replication, and slicing, as well as the membership tests 'in' and 'not in'. We can also use the augmented assignment operators: even though tuples are immutable, Python will create a new modified one behind the scenes for us (beware of latency if an overlong tuple is used!). We can compare tuples using the standard comparison operators. These comparisons are applied item by item (and recursively for nested items such as tuples inside tuples, 'tuple-ception').
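To see the "new tuple behind the scenes" behavior for ourselves, we can keep a second reference to the original tuple and check whether it changes. This is only a sketch with made-up values.

```python
hair = ("black", "brown")
original = hair              # a second reference to the same tuple object

hair += ("red",)             # builds a new tuple and rebinds the name hair

same_object = hair is original           # the old tuple was not modified
has_red = "red" in hair                  # membership test
doubled = original * 2                   # replication also makes a new tuple

# Comparison goes item by item, descending into nested tuples.
nested_less = (1, (2, 3)) < (1, (2, 4))
```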
Let's do a few slices, starting with extracting one item, and then a slice of items:
>>>hair = 'black', 'brown', 'blonde', 'red'
>>>hair[2]
'blonde'
>>>hair[-3:] #same as hair[1:]
('brown','blonde','red')
These work the same for strings, lists, and any other sequence type. (Cool)
>>>hair[:2], 'gray', hair[2:]
(('black', 'brown'), 'gray', ('blonde', 'red'))
This is a tuple in a tuple, as can be seen by the double parens. As you'll note, this happened because we comma separated two tuple slices with a string. This caused Python to nest tuples. If we would like python to flatten out our tuple then we must do so:
>>>hair[:2] + ('gray',) + hair[2:]
('black', 'brown', 'gray', 'blonde', 'red')

As you can see, doing this operation forced all the slices into a single tuple instead of nesting the collections. Also note that we input 'gray' as a single-item tuple in order to do this, and that we put in the mandatory comma after the single item in the tuple. Had we written just 'gray', without the parentheses and the trailing comma, we would get a TypeError, because Python would think we were trying to concatenate a tuple and a string. From now on the book is going to use a particular coding style when writing tuples. When there are tuples on the left-hand side of a binary operator or on the right-hand side of a unary statement, we will omit the parentheses; in every other case we will use parentheses. Some examples:
a, b = (1, 2) #left of binary operator
del a, b #right of the unary statement
def f(x):
    return x, x ** 2 #right of the unary statement
for x, y in ((1, 1), (2, 4), (3, 9)): #left of the binary operator
    print(x, y)
Just note here that when a comma-separated list is shown without parentheses, we can probably assume that Python is creating a tuple. We don't have to follow this style; some people always like to use parens on their tuples, which matches the representational form. Some people are more laissez-faire.
Let's take a look at this example:
>>>eyes = ('green','blue','green','azul')
>>>colors = (eyes, hair)
>>>colors[0][0:2]
('green', 'blue')
Here we can see that we are drilling down with slices: first we grab index position 0 of the colors tuple, which is the eyes tuple, and then we select index positions 0 up to but excluding 2. We can even drill down further; the nesting can go as deep as we like.
>>> things = (1, -7.5, ("pea", (5, "Xyz"), "queue"))
>>> things[2][1][1][2]
'z'
Note here that the slice is taking tuples and characters from strings with no difference in syntax, cool!
We can hold any item of any data type with a tuple, this includes collection types. This is because these types are really only holding object references. Using complex nested data structures like these can get confusing! One way to mitigate this is by giving names to particular index positions. Like so:
>>>MANUFACTURER, MODEL, SEATING = (0, 1, 2)
>>>MINIMUM, MAXIMUM = (0, 1)
>>>aircraft = ('Airbus', 'A320-200', (100, 220))
>>>aircraft[SEATING][MAXIMUM]
220
In this example we can see how we are assigning identifiers to integers, so that each name stands for an index position in the corresponding tuple. Then we declare aircraft and create a literal that is a nested tuple. We can then pass identifiers into the slice operator because each identifier has a corresponding integer. This simply makes it more human readable. However, we are creating a lot of variables, and it is kinda ugly. When we have a sequence on the right-hand side of an assignment, and we have a tuple on the left-hand side, we say that the right-hand side has been 'unpacked'. Sequence unpacking can be used to swap values, like so:
a, b = (b, a) #what is this madness!
We don't really need the parens on the right of this assignment but because we are going to follow this style we will use it here. We've done sequence unpacking before like this:
for x, y in ((-3, 4), (4, 12), (32, -54)):
    print(math.hypot(x, y))
In this example we loop over a tuple of '2-tuples' (which I suppose is named so because it has two items in it) then we unpack each 2-tuple into variables x and y. (this is super rad, what an excellent thing!)
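Here is a runnable sketch of the swap and of unpacking 2-tuples in a loop. I picked pairs with whole-number hypotenuses so the results are easy to check by hand; they are not the values used above.

```python
import math

# Swapping: the right-hand side is packed into a tuple, then unpacked.
a, b = 1, 2
a, b = b, a

# Unpacking each 2-tuple into x and y as we loop over the outer tuple.
lengths = []
for x, y in ((-3, 4), (5, 12)):
    lengths.append(math.hypot(x, y))   # length of the hypotenuse
```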
Named Tuples
A named tuple behaves like a plain tuple, it has the same performance characteristics. It adds the ability to refer to items in the tuple by name as well as by index position, this allows us to create aggregates of data items. The 'collections' module provides the namedtuple() function. This function is used to create custom tuple data types (cool!) like so:
>>>Sale = collections.namedtuple('Sale', 'productid customerid date quantity price')
The first argument is the name of the custom tuple data type ('Sale'). The second arg is a string of space-separated names, one for each item that our custom tuple will take. The first arg, and the names in the second arg, must all be valid Python identifiers (hmm). The function returns a custom class (which is a data type) that can be used to create named tuples. So, in this case we can treat 'Sale' just like any other Python class and we can create objects of type Sale (sweet!). In object-oriented terms, every class created this way is a subclass of tuple; OOP and subclassing are covered later.
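To check these claims, here is a minimal sketch that builds the Sale type and pokes at one object of it. The field values passed to Sale() are made up purely for illustration.

```python
import collections

Sale = collections.namedtuple("Sale", "productid customerid date quantity price")

# Hypothetical sample values, one per field name.
sale = Sale(432, 921, "2008-09-14", 3, 7.99)

is_tuple_subclass = issubclass(Sale, tuple)   # the class really is a tuple
field_names = Sale._fields                    # names, in declaration order
price_by_name = sale.price                    # access by name
price_by_index = sale[-1]                     # access by index position

# Named tuples are immutable, just like plain tuples.
assignment_blocked = False
try:
    sale.price = 1.00
except AttributeError:
    assignment_blocked = True
```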
This is really cool:
sales = []
sales.append(Sale(432, 921, '2008-09-14', 3, 7.99))
Here sales is just an ordinary list, and it starts out empty. Each call to sales.append() creates one Sale named tuple by calling Sale() with a value for every field name that was given when the data type was created (productid, customerid, date, quantity, price; the values above are made up for illustration), and then adds it to the end of the list. Note that Sale is the custom data type and sales is the list that holds objects of that type, so they are two different things. How is this different from a dictionary? A named tuple is still a tuple: it is immutable, and its items can be reached by index position as well as by name, whereas a dictionary is a mutable mapping from keys to values. What we end up with here is a list comprised of multiple named tuples.
We can refer to items in the tuples using index positions, for example, the price of the first sale item is sales[0][-1], but we can also use names, which can help clear things up:
total = 0
for sale in sales:
    total += sale.quantity * sale.price
print("Total ${0:.2f}".format(total))
Here we declare total as 0, then loop over every sale in the sales list. For each one we multiply its quantity by its price, reached by name as sale.quantity and sale.price, and add the result to total. Finally we print a string with a replacement field that is filled in with the total, formatted to two decimal places. That is genius, and way over my head. The clarity and convenience which namedtuples provide are useful. It is really cool that you can access the items using attribute access. For kicks, let's do the aircraft code again to make it look nice:
Aircraft = collections.namedtuple('Aircraft', 'manufacturer model seating')
Seating = collections.namedtuple('Seating', 'minimum maximum')
aircraft = Aircraft('Airbus', 'A320-200', Seating(100,220))
aircraft.seating.maximum
220
This is super cool! Here we have a nested namedtuple setup. The third value of the Aircraft tuple is itself a namedtuple of the Seating sort. Therefore, drilling down to get info is much more elegant. My brain hurts! When we need to extract named tuple items for use in strings there are three main approaches we can take.
print("{0} {1}".format(aircraft.manufacturer, aircraft.model))
Airbus A320-200
This works, and with Python 3.1+ we can even omit the index values in the field brackets. But in order to see what is going on this requires us to look over into the str.format() portion of the statement. We could also do this:
'{0.manufacturer} {0.model}'.format(aircraft)
This is pretty cool. Basically we are just using format() in order to pull info from the custom tuple. Appending the field name to the index inside the braces gives us the value stored under that name. Perhaps a drawback here is that we have to specify the 0 inside each pair of braces, or else format() won't know what to go looking for.
However, namedtuples have a few 'private' methods. These method names begin with a leading underscore. One of them, namedtuple._asdict(), is useful, so let's do it:
"{manufacturer} {model}".format(**aircraft._asdict())
The private namedtuple._asdict() method returns a mapping of key-value pairs, in which every key is the name of a tuple element and each value is the corresponding value (cool!). Note that private methods in general aren't always available across versions. We used mapping unpacking to convert the mapping into key-value arguments for the str.format() method. This is super cool. The ** operator takes the mapping returned by aircraft._asdict() and passes each of its key-value pairs to format() as a keyword argument, so {manufacturer} and {model} are filled in by name. Note that ._asdict() accepts no arguments.
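Putting the three approaches side by side, here is a runnable sketch that reuses the Aircraft and Seating types from above. All three should produce the same string.

```python
import collections

Aircraft = collections.namedtuple("Aircraft", "manufacturer model seating")
Seating = collections.namedtuple("Seating", "minimum maximum")
aircraft = Aircraft("Airbus", "A320-200", Seating(100, 220))

# 1. Pass the individual items as positional arguments.
by_position = "{0} {1}".format(aircraft.manufacturer, aircraft.model)

# 2. Pass the whole tuple and pick attributes inside the field names.
by_attribute = "{0.manufacturer} {0.model}".format(aircraft)

# 3. Unpack the mapping from _asdict() into keyword arguments.
as_mapping = aircraft._asdict()
by_mapping = "{manufacturer} {model}".format(**as_mapping)
```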
Lists
A list is an ordered sequence of zero or more object references. Lists support slicing and striding as do the others. We can insert, replace, and delete slices of lists because they are mutable. We can call the list data type as a function, list(). If we supply no args then it gives us an empty list. If we put in an argument that is itself a list then it returns a shallow copy. With any other argument it attempts to convert the given object to a list. We can give it no more than one argument. We can also create lists without using the list function. We do this by writing [], and we can put items inside the brackets as well. We can also create lists by using a 'list comprehension', but that is covered later. Since all the items in a list are just object references, lists, in similar fashion to tuples, can hold items of any data type, including other collection types like tuples. Lists can be compared using our standard set of comparison operators (<, <=, ==, !=, >=, >); accordingly, comparisons are applied item by item and recursively, in like fashion to that of tuples.
Lists can be nested, iterated over, and sliced. Lists will give you the same features as tuples and more, because they are mutable. They do membership testing with 'in' and 'not in', concatenation with +, extending with +=, replication with *, and *=. Lists can also be used with the built-in len() function, and also with the del statement (which is for deletion). We can use the slice operator on a list but sometimes we want to take a few pieces. We can do this by using the 'sequence unpacking' operator, an asterisk *. When used with two or more variables on the left-hand side of an assignment, one of which is preceded by *, items are assigned to the variables, with all those left over assigned to the starred variable. This is confusing, so let's see some examples:
>>>first, *rest = [9,2,-4,8,7]
>>>first, rest
(9, [2,-4,8,7])
As you can see the output is a tuple which then includes the list of the remainder of the items in the list that we created. Pretty clever.
>>> first, *mid, last = "Charles Philip Arthur George Windsor".split()
>>> first, mid, last
('Charles', ['Philip', 'Arthur', 'George'], 'Windsor')
>>> *directories, executable = "/usr/local/bin/gvim".split("/")
>>> directories, executable
(['', 'usr', 'local', 'bin'], 'gvim')
Here the .split() method breaks the string it is called on into a list of substrings. Called with no argument it splits on whitespace, and called with "/" it splits wherever that character occurs and drops the "/" itself; the leading "/" in the path is why the first item is an empty string. The 'starred expression' then soaks up whatever is left over once the unstarred variables have taken their items: first takes the first item, last (or executable) takes the last one, and everything else lands in the starred list.
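Breaking the directory example into two steps makes it easier to see what is going on. This is just a sketch of the same idea, with the intermediate list given a name of my own (parts).

```python
path = "/usr/local/bin/gvim"

# Step 1: split() alone returns a plain list of substrings.
parts = path.split("/")

# Step 2: unpacking takes the last item; the star collects the rest.
*directories, executable = parts

# The star can sit in any one position on the left-hand side.
first, *rest = [9, 2, -4, 8, 7]
head, *middle, tail = "Charles Philip Arthur George Windsor".split()
```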
Python also has a related concept called starred arguments. For example, if we have a function that requires three arguments like so:
def product(a, b, c):
    return a * b * c #these aren't starred args, they are just multiplied together
>>>product(2, 3, 5)
30
>>>L = [2,3,5]
>>>product(*L)
30
>>>product(2, *L[1:])
30
Here we can see starred arguments in action: L is a list which houses the integers that are to be used as arguments for the function product(), and the * unpacks them. We can also see that we may use slice operators on them as well, as is standard with lists. The unpacked items are simply matched up, position by position, with the parameters defined in the function def.
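One thing worth trying is what happens when the unpacked list doesn't have the right number of items. In this sketch the two-item list is my own addition to the example above.

```python
def product(a, b, c):
    """Multiply three numbers together."""
    return a * b * c

L = [2, 3, 5]
all_unpacked = product(*L)           # same as product(2, 3, 5)
partly_unpacked = product(2, *L[1:]) # one plain argument plus two unpacked

# Unpacking too few items leaves a parameter unfilled, which is a TypeError.
wrong_length_rejected = False
try:
    product(*[2, 3])
except TypeError:
    wrong_length_rejected = True
```

So the list doesn't "know" anything about the function; Python just spreads the items out into positional arguments and the usual argument-count rules apply.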
Let's cover some list methods:
L.append(x) -- Appends item x to the end of list L
L.count(x) -- Returns the number of times item x occurs in list L
L.extend(m), L += m -- Appends all of the iterable m's items to the end of list L
L.index(x,start,end) -- Returns the index position of the leftmost occurrence of item x in list L. If none is found then ValueError
L.insert(i, x) -- Inserts item x into list L at position i
L.pop() -- Returns and removes the rightmost item of list L
L.pop(i) -- Returns and removes the item at index position i in L
L.remove(x) -- Removes the leftmost occurrence of item x from L
L.reverse() -- Reverses list L in-place
L.sort(...) -- Sorts L in-place; this method accepts the same key and reverse optional arguments as the built-in sorted()
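To watch the methods listed above at work, here is a sketch that runs them one after another on a small list of my own choosing. The comments track what L looks like after each step.

```python
L = [3, 1, 2]

L.append(1)              # [3, 1, 2, 1]
L.extend((7, 8))         # [3, 1, 2, 1, 7, 8]
ones = L.count(1)        # two occurrences of 1
first_one = L.index(1)   # leftmost 1 is at index position 1
L.insert(0, 9)           # [9, 3, 1, 2, 1, 7, 8]
last = L.pop()           # removes and returns 8
second = L.pop(1)        # removes and returns 3 -> [9, 1, 2, 1, 7]
L.remove(1)              # drops the leftmost 1  -> [9, 2, 1, 7]
L.reverse()              # [7, 1, 2, 9]
after_reverse = list(L)  # keep a copy so we can compare later
L.sort()                 # [1, 2, 7, 9]
```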
It is good to note that there is never any syntactic ambiguity regarding whether operator * is the multiplication or the sequence unpacking operator. When it appears on the left side of an assignment it is the unpacking operator; if it appears elsewhere (like in a function call) it is the unpacking operator when used as a unary operator and the multiplication operator when used as a binary operator.
We already saw that we can iterate over items in a list using the syntax 'for item in L:'. However, if we want to change the items in a list the idiom to use is:
for i in range(len(L)):
    L[i] = process(L[i])
The built-in range() function returns an iterable that provides integers. With one integer argument, n, it produces 0, 1, ..., n - 1, so range(len(L)) gives every valid index position of L. (The range function is worth some more study.)
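Here is a sketch of the idiom with a stand-in for process(). The function body is hypothetical (it just multiplies by ten), since the text above doesn't say what process() actually does.

```python
def process(value):
    """Hypothetical stand-in for whatever change we want to apply."""
    return value * 10

numbers = [1, 2, 3, 4]

# range(len(numbers)) provides exactly the valid index positions.
index_positions = list(range(len(numbers)))

# Assigning through the index changes the list itself, item by item.
for i in range(len(numbers)):
    numbers[i] = process(numbers[i])
```

A plain 'for item in numbers:' loop would only rebind the loop variable, leaving the list untouched, which is why the index is needed here.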

The del Statement
Although it kinda looks like delete it doesn't really delete data. When applied to an object reference that refers to a data item that isn't a collection, the del statement unbinds the object reference from the data item and deletes the object reference. For instance:

>>>x = 41
>>>del x
>>>x
Traceback (most recent call last):
  ...
NameError: name 'x' is not defined

However, 41 still exists; if nothing else refers to it, it is just ready for garbage collection. When or even if garbage collection happens may be 'nondeterministic' (depending on the implementation of Python you are running). So if cleanup is required you gotta do it yourself. Python provides two solutions to nondeterminism. One is to use a 'try ... finally' block to ensure that cleanup is done, and another is to use a 'with' statement, as we will see later. When del is used on a collection data type, only the object reference to the collection is deleted. The collection and its items (and for those items that are themselves collections, their items, recursively) are scheduled for garbage collection if no other object references refer to the collection. In the case of mutable collections like lists, del can be applied to individual items or slices. To do this, use the slice operator. The item(s) referred to are removed from the collection, and if there are no other object references referring to them, they are scheduled for garbage collection. This is interesting, because the functionality of collection types is to store object references, and not objects themselves.
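A small sketch shows del at work on list items, on a slice, and on the name itself, followed by a bare-bones try ... finally. The alias and log names are my own, used only to show that the list survives while something still refers to it.

```python
data = [10, 20, 30, 40, 50]
alias = data          # a second object reference to the same list

del data[0]           # remove one item      -> [20, 30, 40, 50]
del data[1:3]         # remove a slice       -> [20, 50]
del data              # unbind the name only; the list itself lives on

survivor = alias      # still reachable through the other reference

# try ... finally runs the cleanup step whether or not the work fails.
log = []
try:
    log.append("work")
finally:
    log.append("cleanup")
```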