Jared Foy Thy word have I hid in my heart that I might not sin against thee

Collection data types and more

Published December 17th, 2018 4:42 pm

Python has an extremely robust assortment of collection types, along with excellent methods to manipulate data.

The dict data type can be called as a function like so: dict(). When no args are added it returns an empty literal dict. With a mapping argument it returns a dictionary based on the argument. For example, returning a shallow copy if the argument is a dictionary. It is also possible to use a sequence argument, that is if each item in the sequence is itself a sequence of two objects the first of which is used as a key and the second of which is used as a value. Alternatively, for dictionaries where the keys are valid Python identifiers, keyword args can be used with the key as the keyword and the value as the key's value. Dictionaries can also be created using braces. Using empty braces {} creates an empty dictionary. Nonempty braces must contain one or more comma-separated items, each of which consists of a key, a literal colon, and a value. Another way of creating dictionaries is to use a dictionary comprehension, which is covered later. This shows different syntaxes which provide the same result:
d1 = dict({'id': 1948, 'name': 'Washer', 'size': 3})
d2 = dict(id=1948, name='Washer', size=3)
d3 = dict([('id', 1948), ('name', 'Washer'), ('size', 3)])
d4 = dict(zip(('id', 'name', 'size'), (1948, 'Washer', 3)))
d5 = {'id': 1948, 'name': "Washer", 'size': 3}
It looks like the zip() method is matching object references which reside in tuples to key-value pairs. Dictionary d1 is created using a dictionary literal. d2 is created using keyword arguments, d3 and d4 are created from sequences, d3 being tuples which reside in a list and d4 using the zip() method on tuples. d5 is created from a dictionary literal that lacks the formality of dict(). As stated the zip() function that is used in d4 returns a list of tuples, the first of which has the first items of each of the zip() function's iterable arguments, the second of which has the second items, an so on. (learn more about zip()). The keyword argument syntax like used in d2 is usually the most compact and convenient, providing the keys are valid identifiers.
Remember, our dictionaries can contain a wide variety of things such as lists and tuples:
d = {'root': 18, 'blue': [75, 'R', 2], 21: 'venus', -14: None, 'mars': 'rover', (4, 11): 18, 0: 45}
Dictionary keys are unique, so if we add a key-value item whose key is the same as an existing key, the effect it to replace that key's value with a new value. Brackets are used to access individual values like so: d['root'] returns the value associated, which is 18, accordingly d[21] returns 'venus'. d[91] returns KeyError because there is no such key in the dictionary. We can also use brackets to add and delete items. To add an item we use the = operator, like so: d[X] = 59. To delete an item we use the del statement like so: del d['mars']. This will delete the item whose key is 'mars'. It will raise a KeyError if no item has that key. Items can also be removed (and returned) from the dictionary using the dict.pop() method. Dictionaries support the built-in len() function, and for their keys, fast membership testing with 'in' and 'not in'. All dictionary methods are listed here:
.clear() -- removes all items from dict
.copy() -- returns a shallow copy of dict
.fromkeys(s,v) -- returns a dict whose keys are the items in sequence s and whose values are None or v if given
.get(k) -- returns key k's associated value, or None if k doesn't exist in dict
.items() -- returns a view of all the (key,value) pairs in dict
.keys() -- returns a view of all the keys in dict
.pop(k) -- returns key k's associated value and removes the item whose key is k, or raises a KeyError if k isn't in dict
.pop(k,v) -- returns key k's associated value and removes the item whose key is k, or returns v if k isn't in dict
.popitem() -- returns and removes an arbitrary (key,value) pair form dict, or raises a KeyError if dict is empty
.setdefault(k,v) -- the same as .get() except that if the key is not in dict, a new item is inserted with the key k, and with a value of None or of v if given
.update(a) -- adds every (key,value) pair from 'a' that isnt in dict and for every key that is in both dict and 'a', replaces the corresponding value in d with the one in 'a'. Note that 'a' can be a dictionary, an iterable of (key,value) pairs, or keyword arguments.
.values() -- returns a view of all the values in dict
Because dictionaries have both keys and values, we might want to iterate over a dictionary by (key,value) items, by values, or by keys. For example, here are two equivalent approaches to iterating by (key,value) pairs:
for item in d.items()
print(item[0], item[1])
for key, value in d.items():
Iterating over a dictionary's values is similar:
for value in d.values():
To iterate over a dictionary's keys we can use dict.keys(), or we can simply treat the dictionary as an iterable that iterates over its keys, like this:
for key in d:
for key in d.keys():
If we want to change the values in a dictionary, the idiom to use is to iterate over the keys and change the values using the brackets operator. For example, here is how we would increment every value in dictionary d, assuming that all the values are numbers:
for key in d:
d[key] += 1
The dict.items(), dict.keys(), and dict.values() methods all return dictionary views. A dictionary view is effectively a read-only iterable object that appears to hold the dictionary's items or keys or values, depending on the view we have asked for. In general, we can simply treat views as iterables, however, two things make a view different from a normal iterable. One is that if the dictionary the view refers to is changed, the view reflects the change. The other is that key and item views support some set-like operations. Given the dictionary view v and set or dictionary view x, the supported operations are:
v & x #intersection
v | x #union
v - x #difference
v ^ x #symmetric difference
We can use the membership operator 'in' to see whether a particular key is in a dictionary, like 'x' in 'd'. We can use the intersection opeator to see which kes from a given set are in a dictionary, like so:
d = {}.fromkeys('ABCD', 3) #d == {'A': 3, 'B': 3, 'C': 3...}
s = set('ACX') #s == {'A', 'C', 'X'}
matches = d.keys() & s #matches == {'A', 'C'}
note that we can use a string to as an argument because the characters are being used as individual items. Because sets are made of unique items, we get 'A', 'C', and 'X' in the set. Dictionaries are often used to keep counts of unique items. An example of this is counting the number of occurrences of each unique word in a file. This is a complete program which lists every word and the number of times it occurs in alphabetical order for all the files listed on the command line:
# shebang me
import string
import sys
words = {} #here we create an empty dictionary called 'words'
strip = string.whitespace + string.punctuation + string.digits + "\"'" #here we create a string which contains all the stuff we want to ignore for filename in sys.argv[1:]: #iterating over each filename on the command line
for line in open(filename):#iterating over each line in file
for word in line.lower().split():#lower case and split into words
word = word.strip(strip)#strip off ignored characters
if len(word) > 2:#if word is atleast three characters
words[word] = words.get(word, 0) + 1
for word in sorted(words):
print("'{0}' occurs {1} time(s)".format(word, words[word]))
I won't add this as a code comment because it requires a longer explanation. We see that words is the empty dict literal. It looks like we use the slice operator to add an item. We call dict.get() with a default value of 0. If the word is already in the dictionary, dict.get() will return the associated number, and this value plus 1 will be set as the item's new value. If the word is not in the dictionary, dict.get() will return the supplied default of 0, and this value plus 1 will be set as the value of a new item whose key is the string held by 'word'. This is pretty genius! Here are two code snippets that do the same thing, although dict.get() is more efficient:
words[word] = words.get(word, 0) + 1 #or this:
if word not in words:
words[word] = 0
words[word] += 1 #this one shows the augmentation better
In the next subsection we cover default dictionaries and see another alternate solution. Once we have accumulated the dictionary of words, we iterate over its keys (which are comprised of the words in the file) in sorted order, we then print each word and the number of times it occurs. Using dict.get() allows us to easily update dictionary values, providing values are single items like numbers or strings. But what happens if each value itself is a collection? This next program will demonstrate this by reading html files given on the command line. It prints a list of each unique website that is referred to in the files with a list of the referring files listed indented below the name of each website.
import sys
sites = {} #create empty dictionary
for filename in sys.argv[1:]:#iterate over each file in typed in bash
for line in open(filename)#iterate over each line in file
i = 0 #set i to 0
while True: #do until this fails to find
site = None #declare site as None
i = line.find("http://", i)
if i > -1:
i += len('http://')#increment by length of 'http://'
for j in range(i, len(line)):
if not (line[j].isalnum() or line[j] in ".-"):
site = line[i:j].lower()
break #this code searched for valid urls
if site and '.' in site:
sites.setdefault(site, set()).add(filename)
i = j
for site in sorted(sites):
print("{0} is referred to in:".format(site))
for filename in sorted(sites[site]), key=str.lower):
print(" {0}".format(filename))
It looks like .find() accepts two arguments, the thing to find, and then a way to mark the position in which it was found. So it seems like this example is using a set as items in a dictionary. Interesting, I will have to study this more. Also, study more what the sorted() function is doing to these colletion types.
Reading and Writing Text Files
Files are opened using the built-in open() function, which returns a 'file object' (of type io.TextIOWrapper for text files). The open() function takes one mandatory argument--the filename, which may include a path--and up to six optional args, two of which we covered here. The second args is the 'mode'--this is used to specify whether the file is to be treated as a text file or as a binary file, and whether the file is to be opened for reading, writing, appending, or a combo of these. For text files Python uses an encoding that is platform-dependent. Where possible it is best to specify the encoding using open()'s encoding argument, so the syntaxes we normally use for opening files are these:
fin = open(filename, encoding='utf-8')#for reading text
fout = open(filename, 'w', encoding='utf-8')#for writing text
open()'s default mode to to read text, and by using a keyword rather than a positional arg for the encoding arg, we can omit the other optional positional args when opening for reading. Similarly, when opening to write we need to give only the args we actually want to use. Argument passing is covered in more depth later. Once a file is open for reading in text mode, we can read the whole file into a single string using the file object's read() method, or into a list of strings using the file object's readlines() method. A common idiom for reading line by line is to treat the file object as an iterator:
for line in open(filename, encoding='utf-8'):
This idiom works because a file object can be iterated over, just like a sequence, with each successive item being a string containing the next line from the file. The lines that we get back include the line termination character \n. If we specify a mode of 'w', the file is opened in 'write text' mode. We write to a file using the file object's write() method, which takes a single string as it's arg. Each line written should end with a '\n'. Python automatically translates between \n and the underlying platform's line termination characters when reading and writing. Once we have finished using a file object we can call its close() method which will cause any outstanding writes to be flushed. In small Python prgrams it is very common not to bother using close() because Python does this automatically when the file object goes out of scope. If a problem occurs, it will be indicated by a raised exception.
Back to dictionaries: a dictionary of websites will probably include a lot of items, but some are small. For these we can print their contents using their keys as field names, and use mapping unpacking to convert the dict's key-value items into key-value args for the str.format() method like so:
>>>greens = dict(green='#0080000', olive="#808000", lime="#00FF00")
>>>print("{green} {olive} {lime}".format(**greens))
Here, using mapping unpacking (**) has the same effect as writing .format(green=greens.green) but its easier to write.
Dictionary Comprehensions
A 'dictionary comprehension' is an expression and a loop with an optional condition enclosed in braces, very similar to a set comprehension. Like list and set comprehensions, two syntaxes are supported:
{keyexpression: valueexpression for key, value, in iterable}
{keyexpression: valueexpression for key, value in iterable if condition}
Here is how we could use a dictionary comp. to create a dictionary where each key is the name of a file in the current directory and each value is the size of the file in bytes:
file_sizes = {name: os.path.getsize(name) for name in os.listdr('.')}
The os ('operating system') module's os.listdir() function returns a list of the files and directories in the path that is passed as an arg, it doesn't include '.', or '..' in the list. The os.path.getsize() method returns the size of the given file in bytes. We can avoid directories and nonfile entries by adding the condition: if os.path.isfile(name). .isfile() returns True if the path passed is a file, and False for all others.
A dictionary comprehension can also be used to create an inverted dictionary. For example, given dictionary 'd', we can produce a new dictionary whose keys are d's values and whose values are d's keys:
inverted_d = {v: k for k,v in d.items()}
The result can be inverted back to the original dictionary if all the original dictionary's values are unique, the inversion will fail if any value is not hashable. Like list and set comprehensions, the iterable in a dictionary comprehension can be another comprehension, so we can nest these things ad nauseum (yikes)
Default dictionaries are dictionaries, they have the operators and methods that dictionaries provide. They are different from plain dictionaries in that they handle missing keys different. In OOP, defaultdict is a subclass of dict. OOP covered later. If we use a nonexistent key (missing) when accessing a dictionary, a KeyError is raised. This is useful because we often want to know whether a key that we expected to be present is absent. In some cases we want every key we use to be present, even if it means that an item with the key is inserted into the dictionary at the time that we first access it. If we have a dictionary 'd' which doesn't have an item with key 'm', the code 'x = d[m]' will raise a KeyError exception. But if 'd' is a suitably created default dictionary, if an item with key m is in the default dictionary, the corresponding value is returned the same as for a dictionary, however if 'm' is not a key in the default dictionary, a new item with key m is created with a default value, and the newly created item's value is returned.
When we create a def. dict., we can pass in a 'factory function'. This is a func that when called returns an object of a particular type. We can use all the built-in data types as factory funcs. The 'factory function' is used to create default values for missing keys. Note that the name of a function is an object reference to the function, so when we want to pass functions as parameters, we just pass the name. When we use a function with parentheses, the parens tell Python that the function should be called. We can create a default dictionary like so: x = collections.defaultdict(int)
The x def dict will never raise a KeyError. If we were to write x = x['xyz'] and there was no item with the key 'xyz' when the access is attempted and the key isn't found, the def dict will immediately create a new item with key 'xyz' and value 0. This is cool because then we don't have to use the .get() method on a dictionary, we can just use collections.defaultdict().
Ordered Dictionaries
collections.OrderedDict can be used as drop-in replacements for unordered dicts because they provide the same API. The difference is that ordered dicts store their items in the order in which they were inserted. If an ordered dict is passed an unordered dict or keyword args when it is created, the item will be arbitrary; this is because under the hood Python passes keyword args using a standard unordered dict. Similarly, this occurs when we use the update() method. For these reasons passing keyword args or an unordered dict when creating an ordered dict or using update() on one is best avoided. However, if we pass a list or tuple of key-value 2-tuples when creating an ordered dict, the ordering is preserved (since they are passed as a single item). We can create an ordered dict like so with 2-tuples:
d = collections.OrderedDict([('z', -4),('e', 190),('l', 7)])
However, it's more common to create ordered dicts incrementally, like so:
tasks = collections.OrderedDict()
tasks[8031] = 'Backup'
tasks[4027] = 'Scan Email'
tasks[5733] = 'Build System'
If we had created unordered dicts the same way and asked for their keys, the order of the returned keys would be arbitrary, for ordered dicts we can rely on the keys being returned in the order they were inserted. For example if we wrote: list(d.keys()), we are guaranteed to get the list in the order of insertion. We can change an item's value if we insert an item with the same key as an existing key, the order isn't changed. If we want to move an item to the end we have to delete it and then reinsert it. We can popitem() to remove and return the last key-value item, or we can popitem(last=False) in which the first item will be removed and returned.
We can create sorted dictionaries with ordered dicts. Given a dict, 'd', we can convert it into a sorted dict like so: d = collections.OrderedDict(sorted(d.items())) Note that if we insert any additional keys they would be inserted at the end. Therefore if we add more and still want it to be sorted, we would have to recall the sorted() func. Apparently doing this insertion and re-creation isn't terribly inefficient because Python uses a great sorting algorithm. Using an ordered dict to produce a sorted dict makes sense only if we expect to iterate over the dict multiple times, and if we don;t expect to do any insertions after it has been sorted.
Iterating and Copying Collections
Once we have collections of data items, we usually like to iterate over the items. This section covers iterators and the oeperators and functions that involve iterators. Sometimes we need to copy a collection. This contains some subtleties because of the way python uses object references, here we see how to copy collections and get the behavior we want.
Iterators and Iterable Operations and Functions
An iterable data type is one that can return each of its items one at a time. Any object that has an __iter__() method, or any sequence, like an item that has a __getitem__() method taking int args starting from 0 is an iterable and can provide an iterator. An iterator is an object that provides a __next__() method which returns each successive item in turn, and raises a StopIteration exception when there are no more items. The order in which items are returned depends on the underlying iterable. In the case of lists and tuples, items are normally returned in sequential order start from index pos 0, but some iterators return items in an arbitrary order, like dicts and set iterators.
The built-in iter() function has two quite different behaviors. When given a collection data type or a sequence it returns an iterator for the object it is passed, or raises a TypeError if the object can't be iterated. This use arises when creating custom collection data types, but is rarely needed in other contexts. The second iter() behavior occurs when the function is passed a callable (a function or method), and a 'sentinel' value. In this case the function passed in is called once at each iteration, returning the function's return value each time, or raise a StopIteratrion exception if the return value equals the sentinel. When we use a for item in iterable loop, Python in effect calls iter(iterable) to get an iterator. This iterator's __next__() method is then called at each loop iteration to get the next item, and when the StopIteration is raised, it is caught and the loop is terminated. Another way to get an iterator's next item is to call the built-in next() function. Here are two equivalent pieces of code (multiplying the values in a list), one using a for...in loop and the other using an explicit iterator:
product = 1
for i in [1,2,3,4]:
product *= i
print(product) #or:
product =1
i = iter([1,2,3,4])
while True:
product *= next(i)
except StopIteration:
Here with this second snippet we can see how much logic is being computed in the first snippet, as these two do the same job. Any finite iterable 'i' can be converted into a tuple by calling tuple(i), or can be converted into a list by calling list(i). The all() and any() functions can be used on iterators and are often used in 'functional-style' programming. Here are a couple usages that show all(), any(), len(), min(), max(), and sum().
>>>x = [-2, 9, 7, -4, 3]
>>>all(x), any(X), len(x), min(x), max(x), sumx(x)
(True, True, 5, 3, -4, 9, 13)
Of these, len() is most frequently used. The enumerate() func takes an iterator an returns and 'enumerator' object. This object can be treated like an iterator, and at each iteration it returns a 2-tuple with the tuple's first item the iteration number (default start is 0), and the second item is the next item from the iterator enumerate() was called on. Let's see this in practice:
The grepword.py program takes a word and one or more filenames on the command line and outputs the filename, line number, and line whenever the line contains the given word. We may see output like so:
grepword.py Dom data/forenames.txt
Data files data/forenames.txt and data/surnames.txt contain unsorted lists of names, one per line.
import sys
if len(sys.argv) < 3:
print("usage: grepword.py word infile1 [infile2 [... infileN]]")
word = sys.argv[1]
for filename in sys.argv[2:]:
for lino, line in enumerate(open(filename), start=1):
if word in line:
print("{0}:{1}:{2:.40}".format(filename, lino, line.rstrip()))
Here is a breakdown of what we just did: We begin by checking that there are at least two command-line args. By assigning a print() and then sys.exit() we functionally keep anything that has less than 3 command-line arguments from running our code. sys.exit() performs an immediate clean termination, closing any open files. sys.exit() accepts an optional int arg which is passed to the calling shell. We then assume that the first argument is the word the user is looking for and that the other arguments are the names of the files to look in. We call open() without specifying an encoding, this causes Python to use platform-dependent encoding. The file object return by the open() function in text mode can be used as an iterator, returning one line of the file on each iteration. By passing the iterator to enumerate() we get an enumerator itrator that returns the iteration number (in variable lino, which stands for 'line number') and a line from the file which is stored in 'line'. If the word the user is looking for is in the line, we print the filename, line number, and the first 40 characters of the line with trailing whitespace. The enumerator() function accepts an optional keyword arg (start), which defaults to 0; we have used this argument set to 1 since by convention, text file line numbers are counted from 1. (interesting) Many times we don't need an enumerator, but rather an iter that returns successive integers. This is what range() provides. If we need a list or tuple of integers, we can convert the iter returned by range() by using the appropriate conversion function:
>>>list(range(5)), list(range(9,14)), tuple(range(10, -11, -5))
notice that we are creating a tuple without formality here, in which resides lists and a tuple. The range() function is mostly used for two purposes: to create lists or tuples of integers, and to provide loop counting in for...in loops. For example, these following equivalent snippets ensure that list 'x's items are all non-negative:
for i in range(len(x)):
x[i] = abs(x[i]) #or:
i = 0
while i < len(x):
x[i] = abs(x[i])
i += 1
In both of these snippets if list x was perhaps [11, -3, -12, 8, -1], after it would be [11, 3, 12, 8, 1], this is because abs() is giving us the absolute value by removing the negative numbers and making them positive. Because we can unpack an iterable using the * operator, we can unpack the iterator returned by the range() function. If we hvae a func called calculate() which takes four args, here are some ways we could call it with args, 1,2,3, and 4:
t = (1,2,3,4)
We can see that the first one is simply using the inputted arguments in a comma seperated syntax. The second is unpacking tuple t using the * operator, the third is simply using a range() function which is unpacked using * as well. Because range() is giving us an 'iterator' all it needs to be is unpacked. (interesting!) Let's now look at a small, complete program to consolidate our learning. This program reads in a file of 'forenames' and a file of 'surnames', creating two lists, then creating the file test-names1.text and writes 100 random names into it. In this program we'll use random.choice() which takes a random item from a sequence. Because of this, we may get duplicates. Let's look at the func that returns the list of names, and then we will look at the rest of the program:
def get_forenames_and_surnames():#declare function
forenames = [] #create empty list literal
surnames = [] #same here
for names, filename in ((forenames, 'data/forenames.txt'), (surnames, 'data/surnames.txt')):
for name in open(filename, encoding='utf-8'):
return forenames, surnames
Let's break this down: In the outer for...in loop, we iterate over two 2-tuples, unpacking each 2-tuple into two variables. Even though the two lists might be quite large, returning them from a function is efficient because Python uses object references, so the only thing that is really returned is a tuple of two object references. Inside Python programs it is convenient to always use Unix-style paths, because they can be typed without the need for escaping, and they work on all platfroms (even Windows). If we have a path we want to present to the user in a variable 'path', we can always import the os module and call path.replace('/', os.sep) to replace forward slashes with the platform specific seperator. Here's some more code:
forenames, surnames = get_forenames_and_surnames()
fh = open('test-names1.txt', 'w', encoding='utf=8')
for i in range(100):
line = '{0} {1}\n'.format(random.choice(forenames), random.choice(surnames))
In this snippet we retrieved the two lists and then open the output file for writing, and keep the file object in variable fh which stands for 'file handle'. We then loop 100 times, in each iteration we create a line to be written to the file. We remember to include a newline '\n' at the end of every line. We dont use the loop variable i, it's only used to satisfy the for...in loop's syntax.
Common Iterable Operators and Functions List
s + t --Returns a sequence that is the concat. of sequences s and t
s * n --Returns a sequence that is int n concatenations of sequence s
x in i --Returns True if item x is in iterable i, accept 'not in'
all(i) --Returns True if every item in iterable i evaluates True
any(i) --Returns True if any item in iterable i evals True
enumerate(i, start) --Normally used in for...in loops to provide a sequence of (index, item) tuples with indexes starting at 0 or start
len(x) --Returns the length of x. If x is a collection it is the number of items; if x is a string it is the number of characters
max(i, key) --Returns the biggest item in iterable i or the item with the biggest key(item) value if a key function is given
min(i, key) --Returns smallest...
range(start,stop,step) --Returns an int iter. With one arg (stop), the iter goes from 0 to stop -1; with two args (start,stop) the iter goes from start to stop -1; with 3 args it goes from start to stop -1 in steps of step.
reversed(i) --Returns an iterator that returns the items from iter i in reversed order
sorted(i,key,reverse) --Returns a list of the items from iter i in sorted order; key is used to provide DSU sorting, if reverse is True the sorting is reversed
sum(i,start) --Returns the sum of the items in iter i plus start (which default to 0) i can't include strings
zip(il, ...,iN) --Returns an iterator of tuples using the iters il to iN
Python functions are objects like any other, they can be passed as args to other functions, and stored in collections without formality. Remember, a functions name is an object reference to the function. The parentheses which follow the object reference which call the function. When a key function is passed (like the abs() func), it is called once for every item in the list (with the item passed as the function's only parameter), to create a 'decorated' list. The the decorated list is sorted, and the sorted list wihout the decoration is returned as the result. We are free to use our own custom function as the key function. For example, we can case-insensitively sort a list of strings by passing the str.lower() method as a key. If we have the list, x, of ['sloop', 'yawl', 'cutter', 'schooner', 'ketch'], we can sort it case-insensitively using DSU which means Decorated, Sort, Undecorated, using a single line of code by passing a key function, or do the DSU explicity, as these two equiv. code snippets show:
x = sorted(x, key=str.lower) #or the equivalent
temp = []
for item in x:
temp.append((item.lower(), item))
x = []
for key, value in sorted(temp):
Python's sort algorithm is an 'adaptive stable mergesort' that is fast and smart, it is well optimized for partially sorted lists. By adaptive we meant that the sort algorithm adapts to circumstances, like taking advantage of partially sorted data. By stable we mean that items that sort equally are not moved in relation to each other, and 'mergesort' is the generic name for the sorting algorithm used. When sorting collections of integers, strings, or other simple types their 'less than' operator is used. Python can sort collections that contain collections, working recursively to any depth. If you want to know more about Tim Peter's algorithm an explanation and discussion is in the file listsort.txt which comes with the Python source code.
We can create our own key function that is passed into the sorted() function. If we want to inverse how the sort is done, by first sorting by strings and then ints we can do like so:
def swap(t)
return t[1], t[0]
This seems to be manipulating the sort, but in order to do this effectively, I would need to know more about how the sorted() function works. Then we can plug this into a sorted() as the key function like so:
sorted(x, key=swap)
Lists can be 'sorted in-place' using list.sort(), which takes the same optional args as sorted(). Sorting can be applied only to collection where all the items can be compared with each other! In other words, you can't sort a list which contains items that are int's and str's. Imagine we have a list that contains integers within strings, integers, and floating-point numbers, we can truly sort these if we give a sort parameter of something that can reasonably account for the differences between the data types. The float type can do such a thing. But instead of converting to float we will simply sort them as if they were all converted to floats. So we can do like this:
sorted(['-1.3', 16, 8.023], key=float) #This will give us a nice sort
Copying Collections
Python uses object references, when we use the assignment operator (=) no copying takes place. If the right-hand operand is a leteral such as a string or a number, the left-hand operand is set to be an object ref that refers to the in-memory object that holds the literal's value. If the right-hand operand is an object ref, the left-hand operand is set to be an object ref that refers to the same object as the right-hand operand. This makes assignment very efficient. We we assing large collections like long lists, the saving become apparent:
>>>songs = ['Because', 'Boys', 'Carol']
>>>beatles = songs
>>>beatles, songs
(['Because', 'Boys', 'Carol'], ['Because', 'Boys', 'Carol'])
Here we can see that a new object ref has been created 'beatles' and that it refers to the same literal that 'songs' refers to, therefore we have a tuple now created which is the very same object. No copying has been done here. Any change made through either object ref is visible to the other. This is most often the behavior we want, since copying large collections can be expensive. It also means that we can pass a list or other mutable collection data type as an argument to a function, modify the collection in the function, and know that the modified collection will be accessible after the function call was completed. Sometimes we really do want to copy the collection. For sequences, when we take a slice like songs[:2] the slice is always an independent copy of the items copied. Therefore we can copy an entire sequence by doing:
songs = ['Because', 'Boys', 'Carol']
beatles = songs[:]
beatles[2] = 'Cayenne'
beatles, songs
(['Because', 'Boys', , 'Cayenne'], ['Because', 'Boys', 'Carol'])
For dicts and sets, copying is done by using dict.copy() and set.copy(). Also, the copy module gives us copy.copy() that returns a copy of the object it is given. We can also copy the built-in collection types by using the type as a function with the collection to be copied as its arg. Like so:
copy_of_dict_d = dict(d)
These are all 'shallow' copying techniques. Which means that only object refs are copied and not the objects themselves. For immutable types like numbers and strings, this has the same effect as copying (except it is more efficient) but for mutable types this means that the objects they refer to are referred to both by the original collection and by the copied collection. In order to copy everything, we need to 'deep-copy'. To do this we must import copy and then utilize copy.deepcopy(). This is interesting, and it says so much about the beauty of python, it doesn't give you more than you need, and if you need more you can have it. But you don't get it dumped on you at the expense of things you would at other times always prefer. What an excellent architecture.
This is the end of the review of the collection types as well as three of the standard library collection types (collections.namedTuple, collections.defaultdict, collections.OrderedDict). Python also provides the collections.deque which is a double ended queue and many other types are available from third parties and from the Python Package Index (pypi) If you want to come back and try out some of the ending examples go to page 149.