Jared Foy Ye know all things

Python Strings and more

Published December 11th, 2018 5:06 pm

Lots of interesting things today!

Comparing strings:

Strings can be compared with the comparison operators we have previously discussed: <, <=, ==, !=, >, and >=. These operators compare the strings' in-memory representation, which in practice means comparing them code point by code point. The difficulty is that in Unicode some characters can be represented by more than one sequence of code points. For example, an accented letter can be stored as one precomposed code point or as a base letter followed by a combining accent. Two strings can then look identical on screen and still compare as unequal. To deal with this we can import the unicodedata module and call unicodedata.normalize(). The first argument names one of four normalization forms: NFC, NFD, NFKC, or NFKD. The second argument is the string we want to normalize. The return value is the same text rewritten in the chosen form, so if we normalize both strings with the same form we can compare them safely.

We have a second problem, which is that different languages sort the same characters differently. Python doesn't make assumptions here (as it shouldn't). It simply orders strings by their Unicode code points, which gives ASCII ordering for English, so all the uppercase letters come before all the lowercase ones. If we lowercase or uppercase the strings before comparing them, we get a more natural English-language order. We probably don't need to normalize at all if the strings don't come from external sources. Python's sorting methods can also be customized. See the documentation for more info on sorting.
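
Here is a small sketch of both problems, just to see them in action. The sample strings and variable names are my own made-up examples, not anything from the book.

```python
import unicodedata

# Two spellings of the same visible text: a precomposed e-acute (U+00E9)
# versus a plain "e" followed by a combining acute accent (U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The code points differ, so a plain comparison says the strings differ.
plain_equal = composed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Code point order puts uppercase ASCII letters before lowercase ones;
# a key function gives a more natural order for English words.
words = ["banana", "Cherry", "apple"]
by_code_point = sorted(words)
case_insensitive = sorted(words, key=str.lower)

print(plain_equal, nfc_equal)   # prints: False True
print(by_code_point)            # prints: ['Cherry', 'apple', 'banana']
print(case_insensitive)         # prints: ['apple', 'banana', 'Cherry']
```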

Slicing and Striding Strings:

We've already learned that we can pick out individual characters from a string using square brackets, [] (the item access operator). The slice operator goes further and extracts a subsequence of characters from a string, hence the name. An interesting note: index positions start at 0 for the first character, and if we use a negative integer we count from the back of the string, starting with -1 for the last character (-0 is just 0, so it can't mean the last character). Also know that everything in a string counts as a character, even whitespace. If you try to access a single index position that is out of range you will trip an IndexError. There are three syntaxes for the slice operator:

seq[start]
seq[start:end]
seq[start:end:step]

It should be apparent that seq stands for sequence. The slice operator can be used on any sequence, such as a list, tuple, or string. seq is just a placeholder: put the actual string or list or what have you where you see it. The start, end, and step values must be integers (or variables that hold integers). We have already used the first syntax, the one with only start. In the second syntax, the slice begins at and includes the start index position and runs up to, but does not include, the end index position.

Interestingly, if we don't put in any arguments inside the brackets, but only a colon, Python starts at the first index position and keeps going until the sequence is exhausted. seq[:] is the same as seq[0:len(seq)], so used this way the slice operator copies the entire sequence. If we use a negative integer as the start and leave out the end, as in seq[-3:], the selection starts at and includes the third character from the end and runs forward to the end of the string.

If we use the third argument, step, we can get some interesting results. We can omit the start and end, keeping the colons that mark their places, and give only a step. With a negative step the slice starts from the end of the string and works toward the front. So seq[::-3] extracts the last character and then every third character before it. Now be careful here, because this works differently than when we omit the third argument. With a negative start, no end, and no step, the selection begins at (and includes) that position and works toward the back of the string. With a negative step it works toward the front of the string instead! If we use a[::-1], we get a string that is exactly the reverse of the original. Using the step argument in the slice operator is called striding. This is usually not done on strings, mostly on other data types.
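
To keep all of these cases straight, here is a quick sketch on a throwaway string (the string and the variable names are just my own illustration):

```python
s = "abcdefgh"

first = s[0]           # first character
last = s[-1]           # last character
middle = s[2:5]        # index 2 up to, but not including, index 5
tail = s[-3:]          # third from the end, running forward to the end
whole = s[:]           # the entire string
every_other = s[::2]   # striding forward, every second character
backward = s[::-1]     # the string reversed
back_by_3 = s[::-3]    # last character, then every third one before it
from_f_back = s[-3::-1]  # start third from the end, walk to the front

print(first, last, middle, tail, whole)
# prints: a h cde fgh abcdefgh
print(every_other, backward, back_by_3, from_f_back)
# prints: aceg hgfedcba heb fedcba
```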

We can combine slices with concatenation as well: take the slices you want and use + to join them with more stuff. But if we want a more efficient method for joining many strings, use the str.join() method.

String Operators and Methods

Strings are immutable sequences, so all the operations that work on immutable sequences can be used with strings. These include membership testing with in, concatenation with +, appending with +=, replication with *, and augmented assignment replication with *=. Because strings are immutable, += and *= build a new string rather than changing the old one in place. Here we will go through these. Strings are sequences, therefore they are 'sized' objects. That means we can call len() with a string as the argument. If this returns 0 then the string is empty. Any other result is the number of characters in the string.

Before, we saw that the + operator is overloaded to provide string concatenation. Overloading just means that the same operator does different things for different types: + adds numbers but concatenates strings. If we want to concatenate a lot of strings, it's better to use the str.join() method. This method takes any sequence of strings as its argument, such as a list or tuple (in fact any iterable of strings will do). It joins the items together into a single string, with the string the method was called on placed between each one. So for instance if I call join on a single-space string and pass it a list of words, I get a string with a space between every item in that list. If we call join on an empty string, we get pure concatenation, with no separation between the items. We can reverse a string using ''.join(reversed(s)), but it's simpler to use the slice operator as mentioned previously, so that you would have something like s = s[::-1]
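
A short sketch of join in action, using a word list I made up for the purpose:

```python
words = ["veni", "vidi", "vici"]

spaced = " ".join(words)   # separator is a single space
glued = "".join(words)     # empty separator gives pure concatenation

# Two ways of reversing a string; the slice is the simpler one.
rev_join = "".join(reversed("python"))
rev_slice = "python"[::-1]

print(spaced)               # prints: veni vidi vici
print(glued)                # prints: venividivici
print(rev_join, rev_slice)  # prints: nohtyp nohtyp
print(len(""), len(glued))  # prints: 0 12
```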

The '*' operator gives us replication (multiplication). This is pretty straightforward. s = '=' * 5 gives us five equals signs. Accordingly, we can also use the augmented assignment operator, *=.

When applied to strings the 'in' membership operator returns True if its left-hand string argument is a substring of, or equal to, its right-hand string argument. When we want to find the position of one string inside the other, we can use two different methods.

Firstly, we can use str.index(); this returns the index position of the substring, or raises a ValueError exception if it fails. We can also use the str.find() method, which returns the index position of the substring, or -1 if it fails to find it. In both cases the result is the index position where the first occurrence of the whole substring begins. The str.index() and str.find() methods accept additional arguments as well, those being the start and end index positions of the part of the string to search. They are passed as plain integers after the substring, as in s.find(sub, start, end).
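
Here is a rough sketch of how the two methods behave, including the optional start and end arguments. The sentence being searched is just something I made up.

```python
s = "the cat sat on the mat"

first_at = s.find("at")        # index where the first "at" begins
later_at = s.find("at", 6)     # start searching at index position 6
windowed = s.find("at", 6, 9)  # search only s[6:9], where "at" does not occur
last_at = s.rfind("at")        # search from the right-hand end
missing = s.find("dog")        # find() signals failure with -1

# index() works like find() but raises ValueError on failure.
try:
    s.index("dog")
    index_failed = False
except ValueError:
    index_failed = True

print(first_at, later_at, windowed, last_at, missing, index_failed)
# prints: 5 9 -1 20 -1 True
```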

The methods str.count(), str.endswith(), str.find(), str.rfind(), str.index(), str.rindex(), and str.startswith() all accept up to two optional arguments; those being a start and an end position.

s.count('m', 6) == s[6:].count('m')

These two expressions give the same count. The str.count() method counts occurrences of 'm', starting at index position 6. The expression on the right uses the slice operator to get a copy of s that begins at index position 6 and runs through the rest of the string, and then counts 'm' in that copy. I like the first one better, though, since it doesn't need to make a copy.

s.count('m', 5, -3) == s[5:-3].count('m')

is also equivalent, because the left side looks for 'm' starting at index position 5 and stopping before the third character from the end of the string, and the slice on the right-hand side covers the same range.
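
We can check both equivalences on a sample string (my own, picked only because it has plenty of m's in it):

```python
s = "summer morning mammals"

total = s.count("m")            # every 'm' in the string
from_six = s.count("m", 6)      # only from index position 6 onward
windowed = s.count("m", 5, -3)  # from index 5 up to the third from the end

# The slice versions count over a copy of the same range.
same_start = from_six == s[6:].count("m")
same_window = windowed == s[5:-3].count("m")

print(total, from_six, windowed)  # prints: 6 4 4
print(same_start, same_window)    # prints: True True
```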

An interesting thing: we can create tuples without formality. This can be done by assigning a bunch of comma-separated things to a variable, no parentheses needed.
>>> s = 'this', 'that', 'how about that'
>>> s
('this', 'that', 'how about that')

We can use str.endswith() and str.startswith() with a single string argument, for example:

>>> s.startswith('From:')

We can also do this with a tuple of strings. Let's see this in action using str.endswith() and str.lower():

if filename.lower().endswith(('.jpg', '.jpeg')):
    print(filename, 'is a JPEG image')

The is*() methods such as isalpha() and isspace() return True if the string they are called on has at least one character, and every character in the string meets the criterion. For example:

>>> '917.5'.isdigit(), ''.isdigit(), '-2'.isdigit(), '203'.isdigit()

This returns False for the first three and True for the last: (False, False, False, True). (A decimal point and a minus sign aren't digits, and the empty string has no characters at all.) It's important to note that the is*() methods work on the basis of Unicode character classifications. This means strings written with Unicode escapes, including some strange symbols that are classified as digits, can also give a True result. Because of this, we can't rely on isdigit() to tell us whether a string can be converted to an integer with int().
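
Here is a sketch of the pitfall. The to_int() function is a hypothetical helper of my own, not something from the standard library; it simply tries the conversion and catches the failure, which seems a safer test than isdigit().

```python
checks = ('917.5'.isdigit(), ''.isdigit(), '-2'.isdigit(), '203'.isdigit())
print(checks)  # prints: (False, False, False, True)

# A superscript two is classified as a digit, but int() cannot convert it.
superscript_two = "\N{SUPERSCRIPT TWO}"
print(superscript_two.isdigit())  # prints: True


def to_int(text):
    """Hypothetical helper: return int(text), or None if conversion fails."""
    try:
        return int(text)
    except ValueError:
        return None


print(to_int("203"), to_int("-2"), to_int(superscript_two), to_int("917.5"))
# prints: 203 -2 None None
```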

When we get strings from external sources they might have unwanted whitespace attached. We can strip it from the left of the string using str.lstrip() and from the right using str.rstrip(). We can do both ends using str.strip(). We can also pass a string argument to these methods containing the characters that we would like removed from the ends of the string instead of whitespace.
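
A quick sketch of the three strip methods, using made-up sample strings:

```python
raw = "  \t some text \n"

left = raw.lstrip()    # leading whitespace removed
right = raw.rstrip()   # trailing whitespace removed
both = raw.strip()     # whitespace removed from both ends

# With an argument, strip() removes any of those characters from the ends.
unbracketed = "<[unbracketed]>".strip("[](){}<>")

print(repr(left))    # prints: 'some text \n'
print(repr(right))   # prints: '  \t some text'
print(repr(both))    # prints: 'some text'
print(unbracketed)   # prints: unbracketed
```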

We can replace strings within strings using the str.replace() method. This method takes two string arguments and returns a copy of the string it is called on with every occurrence of the first string replaced with the second. If we put an empty string as the second argument then it will simply delete the first string given as an argument from the string it was called on.
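
For instance (the sentence here is just my own example):

```python
original = "one fish, two fish"

swapped = original.replace("fish", "bird")  # every occurrence is replaced
deleted = original.replace(" fish", "")     # empty replacement deletes

print(swapped)   # prints: one bird, two bird
print(deleted)   # prints: one, two
print(original)  # prints: one fish, two fish  (the original is unchanged)
```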

Sometimes we need to split a string into a list of strings. For example, we could have a text file of data with one record per line. Perhaps each record's fields were separated by an asterisk. We can use str.split() and pass in the string to split on as its first argument. The second argument is optional, but if you pass it you will be telling it the maximum number of splits to make. If you don't enter a second argument then it will just do as many splits as possible. Passing '*' as the argument to str.split() creates a list that doesn't include the asterisks, but has everything between them as separate strings. Pretty useful for some things, I'm sure. In fact I just used it to calculate the age of Leo Tolstoy when he died. It seems more and more to me that simply being able to manipulate data gets you most of the way there when it comes to programming. What a fascinating thing.
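
Something along these lines works for the Tolstoy calculation. The record layout (name*birth year*death year) is an assumption I made up for this sketch, and subtracting years only gives an approximate age.

```python
# Assumed record layout: name*birth year*death year
record = "Leo Tolstoy*1828*1910"

fields = record.split("*")           # split on every asterisk
name, born, died = fields
age = int(died) - int(born)          # year difference only

first_split = record.split("*", 1)   # at most one split

print(fields)       # prints: ['Leo Tolstoy', '1828', '1910']
print(name, age)    # prints: Leo Tolstoy 82
print(first_split)  # prints: ['Leo Tolstoy', '1828*1910']
```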

The str.maketrans() method creates a translation table which maps characters to characters. It accepts one, two, or three arguments. With two arguments, the first is a string containing the characters to translate from, and the second is a string containing the characters to translate to. These two arguments need to have the same length. The str.translate() method takes a translation table as its argument and returns a copy of its string with the characters translated according to the table. These two methods seem to complement each other pretty well.

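# all ten Bengali digits, zero through nine, map to "0" through "9"
table = "".maketrans("\N{bengali digit zero}\N{bengali digit one}"
    "\N{bengali digit two}\N{bengali digit three}"
    "\N{bengali digit four}\N{bengali digit five}"
    "\N{bengali digit six}\N{bengali digit seven}"
    "\N{bengali digit eight}\N{bengali digit nine}", "0123456789")
print("20749".translate(table))
# prints: 20749
print("\N{bengali digit two}07\N{bengali digit four}"
    "\N{bengali digit nine}".translate(table))
# prints: 20749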

Note that Python concatenates adjacent string literals, which is how the separate \N{...} pieces end up as a single string. The table is created by calling .maketrans() on an empty string (it doesn't use the string it is called on, so str.maketrans() works just as well). Its two arguments are strings, one of Bengali digits and the other of the matching Arabic numerals. Interesting to see it done this way. We can use the third argument accepted by .maketrans() to pass a string containing any unwanted characters, and .translate() will then delete them. If we need higher sophistication we can use a codec, see the codecs module or docs for more info.
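
A minimal sketch of the three-argument form, with sample strings of my own:

```python
# Third argument only: characters to delete (nothing is translated).
strip_punct = str.maketrans("", "", "-, ")
cleaned = "1,234-567 890".translate(strip_punct)

# All three arguments: map a->x, b->y, c->z and delete every "!".
swap = str.maketrans("abc", "xyz", "!")
swapped = "a!b!c!".translate(swap)

print(cleaned)  # prints: 1234567890
print(swapped)  # prints: xyz
```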

There are other library modules that provide more functions for strings, such as the unicodedata module, the difflib module, the io module, the textwrap module, and the string module. There is also a module called 're' which gives support for 'regular expressions'.

String Formatting with str.format()

The str.format() method gives us a very flexible and powerful way of making strings. To use its more complex features we first need to learn some syntax. The str.format() method returns a new string with the 'replacement fields' in its string replaced with its arguments suitably formatted. Replacement fields are identified by a field name in braces. If the field name is just an integer, it is taken to be the index position of one of the arguments passed to str.format(). Let's do an example:

string = 'My favorite epic poem is {0} by {1}'
string.format('Paradise Lost', 'John Milton')

This returns a new string (the original string is left unchanged):
'My favorite epic poem is Paradise Lost by John Milton'

If you need literal braces in your formatted string, double them up: {{ and }} come out as { and }. We can also pass integers to the format method. It will just put the integer's text into the new string. You can concatenate stuff with the format method, but you might as well just use .join() for that.
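
A small sketch of the brace doubling and of passing integers (the strings are my own examples):

```python
# {{ and }} are literal braces; {0} in the middle is a replacement field.
braced = "{{{0}}} is in braces".format("this")

# With no replacement fields at all, doubled braces still collapse.
literal = "{{literal}}".format()

# Integers are simply converted to their text.
numbers = "{0} + {1} = {2}".format(2, 3, 5)

print(braced)   # prints: {this} is in braces
print(literal)  # prints: {literal}
print(numbers)  # prints: 2 + 3 = 5
```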

These are the syntaxes that a replacement field can use:
{field_name}
{field_name!conversion}
{field_name:format_specification}
{field_name!conversion:format_specification}

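Interestingly, a replacement field may also contain a replacement field inside of it. Nested fields live inside the format specification and are used for 'computed formatting specifications', such as a width that is supplied as an argument. They can't be nested any deeper than that.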
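
To get a feel for the four syntaxes and for nesting, here is a rough sketch. The alignment and width specifications used here (> for right-align, a number for width) are standard format specification pieces; the values are made up.

```python
# Nested field: the width comes from the second argument.
padded = "{0:>{1}}".format("hi", 5)

# The nested field can also be a keyword argument.
named = "{0:>{width}}".format(42, width=6)

# !r is a conversion: it uses the repr() of the argument.
repr_form = "{0!r}".format("hi")

# Conversion and format specification together.
both = "{0!r:>6}".format("hi")

print(repr(padded))  # prints: '   hi'
print(repr(named))   # prints: '    42'
print(repr_form)     # prints: 'hi'
print(repr(both))    # prints: "  'hi'"
```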

Field Names

A field name can be either an integer corresponding to one of the str.format() arguments or the name of one of the method's keyword arguments. This would look like so:
'{who} turned {age} this year'.format(who="She",age=88)
'She turned 88 this year'

We can also mix keyword arguments with positional (index) arguments. If we use a keyword argument as well as a positional argument, be sure to put the positional argument first in the call to format(). Don't worry about the order in which the replacement fields appear in the string being formatted.
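
For example (sample values are mine):

```python
# The positional argument (88) comes first in the call, even though
# the keyword field {who} appears first in the string.
mixed = "{who} is {0} years old".format(88, who="She")

# Several of each also works; fields can appear in any order.
several = "{1}, {name}, {0}".format("a", "b", name="c")

print(mixed)    # prints: She is 88 years old
print(several)  # prints: b, c, a
```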

Field names may refer to collection data types like lists. If such is the case, we can provide an index inside the brackets. Don't try to do this with a slice. Here's what it looks like:

stock = ['paper','envelopes','notepads',4]
#this creates the list
string = 'We have {0[1]} and {0[2]} in stock'
string.format(stock)
'We have envelopes and notepads in stock'

The 0 in {0[1]} is the positional argument: it refers to the first argument passed to format(), which here is the stock list. The [1] and [2] are then index positions within that list. Because stock is the only argument, every field starts with 0. Remember, zero is the first index position, so stock[1] is 'envelopes' and stock[2] is 'notepads'.

We can also do things with the str.format() method when working with Python dictionaries. Dictionaries store key-value items.

d = dict(animal='elephant', weight=12000)
string = 'The {0[animal]} weighs {0[weight]}kg'
string.format(d)
'The elephant weighs 12000kg'

Note here that the key is what goes inside the brackets (without quotes), and the value stored under that key is what gets put into the string.

Field names can also refer to attributes, using a dot:

import math, sys
"math.pi=={0.pi} sys.maxunicode=={1.maxunicode}".format(math, sys)

This outputs a formatted string in which each positional argument is a module (math and sys) and the field names pull attributes out of them. This is interesting, it seems to show the consistency of the language when dealing with libraries that are imported. Cool.

So field name syntax allows us to use both positional and keyword arguments. We can refer to lists or dictionaries that we are pulling information from, using the index position of the list or the key of the dictionary. Also important is that from Python 3.1 on we can use empty field names. If we just supply empty curly braces, Python automatically numbers them, matching each pair of braces with the next positional argument passed to format().
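
A quick sketch of empty field names. One thing I noticed while trying it: automatic and explicit numbering can't be mixed in the same string.

```python
# Empty braces are numbered automatically: {} {} means {0} {1}.
auto = "{} turned {} this year".format("She", 88)

# Mixing automatic ({}) and manual ({0}) numbering raises ValueError.
try:
    "{} and {0}".format("a")
    mixing_allowed = True
except ValueError:
    mixing_allowed = False

print(auto)            # prints: She turned 88 this year
print(mixing_allowed)  # prints: False
```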

We can use an advanced technique that is really convenient. So let's learn: The local variables that are currently in scope are available from the built-in locals() function. This function returns a dictionary whose keys are local variable names and whose values are references to the variables' values. We can use 'mapping unpacking' to feed this dictionary into the str.format() method. The operator to use here is ** and it can be applied to a mapping (like a dictionary) in order to produce a set of keyword arguments suitable for passing to a function. This is really cool! Check it out:

element = 'Silver'
number = 47
string = 'I drive a {element} Dodge truck model year \'{number}'
string.format(**locals())
"I drive a Silver Dodge truck model year '47"

This is fascinating, it looks like Python is creating a dictionary from all the local variables, saving key-value pairs for each. And then Python works it out and formats the string. Unpacking a dictionary with the ** operator in a str.format() call allows us to use the dictionary's keys as field names inside the text being formatted. This means that we don't have to spell out every keyword argument inside the .format() call. We can just declare our variables outside in the local environment. Important note: if we want to pass other arguments to the format() method as well, any positional arguments must come before the mapping unpacking (**).
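
The same unpacking works with any dictionary, not just the one from locals(). A minimal sketch, with a dictionary and extra arguments of my own invention:

```python
d = dict(animal='elephant', weight=12000)

# The dictionary's keys become field names; an ordinary keyword
# argument (unit) can be passed alongside the unpacked mapping.
with_keyword = "The {animal} weighs {weight}{unit}".format(unit="kg", **d)

# A positional argument has to come before the ** unpacking.
with_positional = "{0} the {animal}!".format("Look at", **d)

print(with_keyword)     # prints: The elephant weighs 12000kg
print(with_positional)  # prints: Look at the elephant!
```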