mejohnsn (22)
#1
The full version of both the failing line and the error message (since they don't fit in the subject) is:


File "./stocktracker2.py", line 43, in parse_stock_html
result['stock_name'] = quote.find('h2').contents[0]
AttributeError: 'NoneType' object has no attribute 'contents'

As you may notice from the above, I get exactly the same failure with stocktracker2.py in the chapter 05 directory of the downloaded source code for this book.

Now I know the text does warn that scraping is an unreliable approach, since the page returned is SO changeable. But this is exactly the kind of thing I would hope gets updated in the downloads.

Also, when running in an emacs buffer so that I can do searches, I can see that the h2 element does not, in fact, occur. I think that is what is causing this error message, which would then be Python's way of saying that the h2 element was not found at all. Yet when I follow the procedure recommended in the book and inspect the element with Firebug, I -do- see it. So I am tantalizingly close.

So what should we do with this chapter, if Yahoo!'s returned page is so unpredictable, or so far off, that we cannot expect to run either the code printed in the book or the code in the download? Is there a new value for the User-Agent that will make things work? Should we be doing a query with options explicitly set? Or would even those changes leave too many modifications still to be made to Listing 5.2?

Message was edited by: mejohnsn to correct Listing #.
mejohnsn
anthony.briggs (30)
#2
Re: Listing 5.3 fails on "quote.find('h2')..." with "Attr... no attr 'contents"
Hi John,

You're correct - if BeautifulSoup doesn't find a tag, it'll return None.

In this case, I would print out the contents of the quote object to see what's in there. It may be that it's finding a different section of the HTML, or that the HTML has changed, etc.
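For example, something along these lines - just a sketch, not the exact listing, and you'll need to swap in whatever div your copy of find_quote_section() is actually looking for:

from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3, as used in the book
import urllib2

html = urllib2.urlopen('http://finance.yahoo.com/q?s=GOOG').read()
soup = BeautifulSoup(html)

# Use whatever tag/attrs your find_quote_section() passes in here.
quote = soup.find('div', attrs={'class': 'yfi_quote_summary'})
print quote                        # None here means the div itself wasn't found

if quote is not None:
    heading = quote.find('h2')     # also None if there's no h2 inside that chunk
    if heading is None:
        print "No h2 in that section - Yahoo has probably moved things around"
    else:
        print heading.contents[0]  # should be the stock name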

If you're still stuck, I'll take a look at Yahoo's page and see what they've changed.

Thanks,

Anthony
mejohnsn (22)
#3
Re: Listing 5.3 fails on "quote.find('h2')..." with "Attr... no attr 'contents"
Hi, Anthony-

Thanks for the reply. I did as you suggested, and the quote object looks very much like the output I described in my original post: there is no h2 element in it.

The object is:


<div class="yfi_quote_summary"><div id="yfi_quote_summary_data" class="rtq_table"><table id="table1"><tr><th scope="row" width="48%">Prev Close:</th><td class="yfnc_tabledata1">565.21</td></tr><tr><th scope="row" width="48%">Open:</th><td class="yfnc_tabledata1">568.00</td></tr><tr><th scope="row" width="48%">Bid:</th><td class="yfnc_tabledata1"><span id="yfs_b00_goog">569.95</span><small> x <span id="yfs_b60_goog">400</span></small></td></tr><tr><th scope="row" width="48%">Ask:</th><td class="yfnc_tabledata1"><span id="yfs_a00_goog">570.50</span><small> x <span id="yfs_a50_goog">300</span></small></td></tr><tr><th scope="row" width="48%">1y Target Est:</th><td class="yfnc_tabledata1">742.91</td></tr><tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">1.15</td></tr><tr><th scope="row" width="54%">Next Earnings Date:</th><td class="yfnc_tabledata1">N/A</td></tr></table><table id="table2"><tr><th scope="row" width="48%">Day's Range:</th><td class="yfnc_tabledata1"><span><span id="yfs_g53_goog">565.82</span></span> - <span><span id="yfs_h53_goog">571.48</span></span></td></tr><tr><th scope="row" width="48%">52wk Range:</th><td class="yfnc_tabledata1"><span>473.60</span> - <span>670.25</span></td></tr><tr><th scope="row" width="48%">Volume:</th><td class="yfnc_tabledata1"><span id="yfs_v53_goog">2,229,125</span></td></tr><tr><th scope="row" width="48%">Avg Vol <span class="small">(3m)</span>:</th><td class="yfnc_tabledata1">2,617,680</td></tr><tr><th scope="row" width="48%">Market Cap:</th><td class="yfnc_tabledata1"><span id="yfs_j10_goog">186.31B</span></td></tr><tr><th scope="row" width="48%">P/E <span class="small">(ttm)</span>:</th><td class="yfnc_tabledata1">17.32</td></tr><tr><th scope="row" width="48%">EPS <span class="small">(ttm)</span>:</th><td class="yfnc_tabledata1">33.00</td></tr><tr class="end"><th scope="row" width="48%">Div & Yield:</th><td class="yfnc_tabledata1">N/A (N/A)</td></tr></table></div></div>

as output by "print quote" placed immediately before "result['stock'] = quote.find('h2').contents[0]" in listing 5.3

BTW: it is not immediately obvious to us Beautiful Soup learners why you are looking for h2 elements in the first place. How does this help you find ah_price and ah_change etc.? Not that I can find ah_price in the above quote object either. Or should I expect to find those only in contents[0] and contents[1] after subsequent calls to find?
anthony.briggs (30)
#4
Re: Listing 5.3 fails on "quote.find('h2')..." with "Attr... no attr 'contents"
Hi John,

I just had a quick look at Yahoo's stock page. I don't have the code for that section to hand right now, but it looks like they've changed the ids of the sections - the section that you've posted is actually the table underneath the stock quote that we're looking at.

If you change the id that the code is using to generate quote to "rtq_div" instead, I suspect that it'll start working, but I'll check that when I get home.

The rationale behind it is that we're chopping sections of the web page out, using the div ids to get just the bit that we want, and then filtering the name of the stock, its most recent price, etc. from that chunk. If you install Firebug and inspect the title of the stock (eg. "Yahoo! Inc. (YHOO)" in http://finance.yahoo.com/q?s=YHOO), you'll see what I mean.
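In code, the two-step pattern is roughly this (a sketch only - rtq_div is my guess from tonight's page, and the ids may well have changed again by the time you read this):

from BeautifulSoup import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://finance.yahoo.com/q?s=YHOO').read()
soup = BeautifulSoup(html)

# Step 1: chop out just the section we care about, using its div id.
quote = soup.find('div', attrs={'id': 'rtq_div'})

# Step 2: filter individual values out of that much smaller chunk -
# here just the title; the price spans follow the same pattern.
if quote is not None:
    print quote.find('h2').contents[0]   # e.g. "Yahoo! Inc. (YHOO)"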

Hope that helps,

Anthony
mejohnsn (22)
#5
Re: Listing 5.3 fails on "quote.find('h2')..." with "Attr... no attr 'contents"
Hi, Anthony-

The change you suggested gets me further, but it still fails, just further along. Now if I already knew Beautiful Soup, that would have been enough of a hint for me to finish it up and get something working. But since this WAS my attempt to learn Beautiful Soup, I am not 100% sure what the right next step is, though I am going over http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Quick Start to try to figure that out myself.

But from reading that, it seems to me that even though the code might run slower, it would have been a better idea to skip the preliminary soup.find('h2') and go straight for the ids we know hold the data we want: wouldn't it have been better to use something like soup.findAll('span', id="yfs_184_goog") for the current price?
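To show what I mean, here is the kind of thing I had in mind (my own sketch, and I am only guessing at the exact span id):

from BeautifulSoup import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://finance.yahoo.com/q?s=GOOG').read()
soup = BeautifulSoup(html)

# Skip the preliminary chopping-out step and go straight for the span
# that (currently) appears to hold the price we want.
price_span = soup.find('span', id='yfs_184_goog')   # id is my guess
if price_span is not None:
    print price_span.string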
anthony.briggs (30)
#6
Re: Listing 5.3 fails on "quote.find('h2')..." with "Attr... no attr 'contents"
Ok, I just had time to look through the code and see what's going on. For the most part it's just that the ids of those span tags have changed, but they've also changed the structure a bit on the last part, from a class for up/down changes to an image. That actually makes things easier, since you no longer need to write a custom function to pull out the right part.

Here's my version of the code (this is for stocktracker3.py -- listing 5.3):


def find_quote_section(html):
    soup = BeautifulSoup(html)
    quote = soup.find('div', attrs={'class': 'yfi_rt_quote_summary'})
    return quote

def parse_stock_html(html, ticker_name):
    quote = find_quote_section(html)
    ...
    # was 'yfs_l91_goog'
    result['ah_price'] = quote.find('span', attrs={'id': 'yfs_l86_'+tick}).string
    ...
    # was yfs_z08_goog
    result['ah_change'] = (quote.find(attrs={'id': 'yfs_c63_'+tick}).contents[-1])
    ...
    # was yfs_l10_goog
    result['last_trade'] = quote.find('span', attrs={'id': 'yfs_l84_'+tick}).string
    ...
    # was yfs_c10_goog + fancy class
    result['change'] = quote.find('span', attrs={'id': 'yfs_c86_'+tick}).string


I've also attached the Python file - the code editing on here is a bit nasty. I got those ids mainly with Firebug - right-click, Inspect Element.

You probably could just grab these ids directly, but there are a few reasons not to, the main one being that it's easier to focus on just the piece of HTML that you're interested in, rather than the whole page. You can print it out and look at it if you need to.

Another is that not all sites organise their HTML with ids the way Yahoo does, so you might find yourself needing to depend more on the structure - something like the first h2 tag within the div with id="header" (there's a quick sketch of that below).

Finally, you might find that the search runs faster if you grab a small chunk and search through that, rather than searching through the entire HTML several times. I haven't tested that, and I doubt it'll make much difference in our case, but if you have a lot of pages to search through it might make a significant difference.
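Here's a quick sketch of that structure-based case, with a made-up snippet of HTML just to illustrate (nothing to do with Yahoo's actual page):

from BeautifulSoup import BeautifulSoup

# A made-up page where the interesting bits have no ids of their own.
html = """
<div id="header">
  <h2>Some Stock Corp (SSC)</h2>
  <span class="price">12.34</span>
</div>
"""

soup = BeautifulSoup(html)

# Chop out the header div first, then rely on the document structure:
# the first h2 inside it is the title, the first span is the price.
header = soup.find('div', attrs={'id': 'header'})
print header.find('h2').string       # Some Stock Corp (SSC)
print header.find('span').string     # 12.34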

Hope that helps,

Anthony
mejohnsn (22)
#7
Re: Listing 5.3 fails on "quote.find('h2')..." with "Attr... no attr 'contents"
Thanks for the file, Anthony, but I have to point out: you did not update the comments to use the new names for the ids :)

I am correcting this in my own copy and expecting everything will work flawlessly once that is done -- until the next time Yahoo! changes something on their site :)

BTW: one thing I noticed from the Beautiful Soup tutorial that would be good to mention in the book is that the API makes extensive use of the Python rules for default arguments. In fact, I still do not understand how, when there is only one argument, the calls to soup.find('nameoftag') work at all, since as I read the signature for find(), there are TWO arguments with no defaults, 'name' and '**kwargs'. So neither in the book examples nor in the online tutorial examples do I understand how Beautiful Soup figures out when it is looking at a single name argument and when it is looking at a single **kwargs, nor how the type resolution works.

To my memory, default method arguments have not even been covered yet in the book by the time the reader reaches Listing 5.3, but I might have missed something.

The idea that methods have 'extra' parameters is mentioned briefly in passing on p. 82, but that is not much preparation for the more advanced use of the language feature in "quote.find('span', attrs={'id': 'yfs_l84_'+tick})".

But this may have proved to be a wild-goose chase on my part: it now appears to me that you use only two arguments of find(), the name and the attrs. Since you only ever filter tags by a given name and attrs, you never need **kwargs. But I still do not understand how you can omit the name in line 62 of stocktracker3.py: everywhere else you include the name 'span'.
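To spell out what finally made it click for me, here is a toy function of my own (not Beautiful Soup's real code, just an illustration of how Python distributes the arguments):

def find(name=None, attrs={}, **kwargs):
    # Toy stand-in for a find()-style signature.
    print "name  =", name
    print "attrs =", attrs
    print "extra =", kwargs

find('span')                                 # one positional argument binds to name
find('span', attrs={'id': 'yfs_l84_goog'})   # name plus an explicit attrs dict
find(attrs={'id': 'yfs_l84_goog'})           # the tag name can simply be omitted; name stays None
find('span', id='yfs_l84_goog')              # unrecognised keywords get swept into **kwargs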

Update: I edited my 'webcrawler2.py' (based on Listings 5.2 and 5.3) to match the yfs_ values you use in the downloaded file, and it works now. Thanks! Now the only mystery is how you decided which value to use for "last traded", but that is not a Python or Beautiful Soup question :)

The downloaded file (stocktracker3.py) itself hangs when run alone, which is why I decided to take the above approach instead.

Message was edited by: mejohnsn to show results of latest testing.
mejohnsn

Message was edited by: mejohnsn to name file explicitly.
mejohnsn