Heffalump (2)
#1
I am confused by the statement on p228 of The Quick Python Book, 2e, concerning the raw string r"\\ten":
"The previous RE example then becomes
regexp = re.compile(r"\\ten")
which works as expected. The compiled RE looks for a single backslash followed by the letters ten."

This isn’t what I expected, because earlier on p228 it states: “No matter how you do it, raw string notation can be taken as an instruction to Python saying, Don’t process special sequences in this string.” If Python is not processing special sequences in the string r"\\ten", why doesn’t it interpret it as two backslashes followed by the letters ten?

I ran the following small program to count the lines containing \ten in a file, and this confirmed that Python did interpret regexp = re.compile(r"\\ten") as \ten instead of as \\ten.

    import re

    # Raw string: the regex engine receives two backslashes followed by "ten"
    regexp = re.compile(r"\\ten")
    count = 0
    # Count the lines in which the pattern matches at least once
    with open("RawStrTest.txt", "r") as file:
        for line in file:
            if regexp.search(line):
                count = count + 1
    print(count)

To find out how Python interprets r"\\ten" when re.compile is not used, I did the following.
>>> print("\\ten")
\ten

and

>>> print(r"\\ten")
\\ten

So it seems that if re.compile is not used then r"\\ten" is interpreted as two backslashes followed by the letters ten. This is what I expected.
So why did the authors of the book expect otherwise? Are my expectations naïve or is my interpretation the natural one and is re.compile causing a problem?
Is re.compile best avoided, given that p226 states that it “isn’t strictly necessary”?
Heffalump (2)
#2
Re: Raw Strings
Further to my last post, I did:
>>> r"\\ten"
The response was
'\\\\ten'

Why does this happen? It seems to be going too far. Rather than just not interpreting special characters in a raw string, Python seems to be uninterpreting or reverse engineering some of them. Python seems to ask what string would produce "\\ten" after the special characters had done their work.
naomi.ceder (134)
#3
Re: Raw Strings
What you're seeing is not quite what you think. There is a difference between the actual values and the way that they are shown and entered.

So the "r" in front of r"\\ten" is to ensure that re.compile will receive both backslashes when it compiles the regular expression. Without the r, Python would go ahead and process the escape sequence on input and thus pass "\ten" to re, which would be understood as "<TAB>en", not what we want.
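
As a rough illustration of the point above, here is a minimal sketch comparing the pattern written with and without the r prefix (two backslashes in the source in both cases). The variable names and sample strings are made up for this example and are not from the book.

```python
import re

# With the r prefix, re receives the text \\ten:
# an escaped backslash followed by the letters "ten".
raw_pattern = re.compile(r"\\ten")

# Without it, Python collapses \\ to one backslash first, so re
# receives the text \ten and reads \t as a TAB followed by "en".
plain_pattern = re.compile("\\ten")

backslash_text = "a\\ten"  # a, backslash, t, e, n
tab_text = "a\ten"         # a, TAB, e, n

raw_hits_backslash = bool(raw_pattern.search(backslash_text))
raw_hits_tab = bool(raw_pattern.search(tab_text))
plain_hits_backslash = bool(plain_pattern.search(backslash_text))
plain_hits_tab = bool(plain_pattern.search(tab_text))

print(raw_hits_backslash, raw_hits_tab)      # True False
print(plain_hits_backslash, plain_hits_tab)  # False True
```

So there are two layers of escaping in play: Python's string literal rules first, then the regular expression syntax. The r prefix switches off only the first layer.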

Similarly, when displaying the value that was entered as r"\\ten", the interactive prompt shows its repr, so Python needs to double the number of backslashes, because that's what would be needed to enter the same string if the r wasn't used.

You can think of the r as just a shortcut for escaping everything:

r"\\ten" == "\\\\ten"
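
If it helps, the difference between the stored value and its displayed form can be checked directly. This is just a small sketch, assuming the string is written with two backslashes as r"\\ten"; the variable names are only for illustration.

```python
raw = r"\\ten"        # raw literal: both backslashes are kept
escaped = "\\\\ten"   # ordinary literal: each \\ yields one backslash

same_value = raw == escaped  # both spell the same five characters
length = len(raw)            # 2 backslashes + "t", "e", "n"

shown_by_print = raw        # print() writes the characters themselves
shown_by_repr = repr(raw)   # repr() writes a literal you could type back in

print(same_value, length)
print(shown_by_print)
print(shown_by_repr)
```

The interactive prompt echoes the repr() form, which is why the backslashes appear doubled there while print() shows them as they are stored.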

This is admittedly a confusing topic, but Python is behaving as it should. And this is why people get headaches dealing with regular expressions.

Thanks for writing and I hope this helps.