Svideo (21)
I have figured out how to drill down through a web page converted to XML to reach the tagged hrefs, but I would like to get the value that follows the class attribute at the end of the tag (AAA).
Any suggestions on how to get that value?


Dim TableRows = From TR In xdoc...<tr> _
Where TR.@class = TableName

Dim links = From link In TableRows... _
Select link.@class
Svideo (21)
Re: Linq to XML referencing the class value in a href
<a href="" class="BBLink">AAA</a>

Somehow that didn't make it into the first message...
jwooley (123)
Re: Linq to XML referencing the class value in a href
In all honesty, I would probably use a regular expression to parse the page. While I love LINQ to XML, many sites are not truly XHTML compliant. There are tools that take such a site and return it as XHTML, but my success with them has varied in the past.

If you do decide to use LINQ to XML, make sure to specify the namespaces that are included in the source. Most issues people have come from not including the namespace. Remember, XML variables are strongly typed just as CLR types are.
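
As a minimal sketch of the namespace point, assuming the converted page declares the XHTML namespace as its default (the sample markup and the module name are made up for illustration):

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq
' Assumption: the source document uses the XHTML namespace as its default namespace.
Imports <xmlns="http://www.w3.org/1999/xhtml">

Module NamespaceSketch
    Sub Main()
        ' Hypothetical sample standing in for the converted page.
        Dim source As XElement = XElement.Parse(
            "<html xmlns=""http://www.w3.org/1999/xhtml"">" &
            "<body><a href="""" class=""BBLink"">AAA</a></body></html>")

        ' With the Imports line above, <a> resolves to {http://www.w3.org/1999/xhtml}a.
        ' Without it, the same axis query would find nothing in this document.
        For Each link In source...<a>
            Console.WriteLine(link.Value)
        Next
    End Sub
End Module
```

The default namespace import applies to element names only, so the class attribute can still be read as link.@class.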

That being said, see if the following query would work (assuming you have imported the namespaces):

Dim query = from node in source...<a> _
where node.@class.Value = MySearchString _
select node.Value

This will return an IEnumerable(Of String) which you would then iterate over, or use the standard First/Single/etc methods as appropriate.

Svideo (21)
Re: Linq to XML referencing the class value in a href
I decided to use HTMLAgilityPack with the new extended property (HtmlDocumentExtensions) to export XML that can be searched with LINQ to XML. This might make the results a lot more consistent with LINQ. Because some sites return executed JavaScript, I used a WebBrowser control, although I would like to find something a bit more lightweight to build a library on. Maybe someone knows of a library?

My goal was to program a search to find the one table that had 100 rows. So next is to convert all the properties in a site to an object where I can find what I want quickly. Regex is fine, but I will be converting tens of thousands of pages into a database, and the names and positions of the tables are inconsistent. There was a product that did something like this, but they aren't supporting it anymore.

So now I can check every table, find the first one where the table has 100 rows by checking the .count property, and use that table name, instead of hoping regex finds what I want in a changing landscape of names and positions.
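
A rough sketch of that table search, assuming the page is already loaded into an XElement. The sample markup, the variable names, and the row count of 3 (standing in for 100) are made up for illustration:

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module TableSearchSketch
    Sub Main()
        ' Hypothetical stand-in for a converted page: two tables, the second has 3 rows.
        Dim xdoc As XElement =
            <body>
                <table class="nav"><tr><td>x</td></tr></table>
                <table class="data"><tr/><tr/><tr/></table>
            </body>

        Dim wantedRows As Integer = 3 ' would be 100 for the pages described above

        ' First table whose direct <tr> children match the wanted count, or Nothing.
        Dim target As XElement =
            xdoc...<table>.FirstOrDefault(Function(t) t.<tr>.Count() = wantedRows)

        If target IsNot Nothing Then
            Console.WriteLine(target.@class) ' prints: data
        End If
    End Sub
End Module
```

If the rows are wrapped in a tbody element, the child axis t.<tr> finds nothing and the descendant axis t...<tr> would be needed instead, at the cost of also counting rows in nested tables.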

I did try the .value property but I was doing something else wrong and was stuck on stupid.

Svideo (21)
Re: Linq to XML referencing the class value in a href
Using your example code, I'm stuck in the same place.

Where node.@class.Value = SearchName

Error 1 'Value' is not a member of 'String'
jwooley (123)
Re: Linq to XML referencing the class value in a href
You're right. I coded that from memory. It should be:

where node.@class = MySearchString _
select node.Value

The attribute evaluates as a string.
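
Put together as a self-contained sketch (the sample markup and the module name are made up for illustration; MySearchString is the same placeholder as above):

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module AttributeQuerySketch
    Sub Main()
        ' Hypothetical sample standing in for the converted page.
        Dim source As XElement =
            <table>
                <tr><td><a href="" class="BBLink">AAA</a></td></tr>
                <tr><td><a href="" class="Other">BBB</a></td></tr>
            </table>

        Dim MySearchString As String = "BBLink"

        ' node.@class is already a String, so it is compared directly (no .Value).
        Dim query = From node In source...<a> _
                    Where node.@class = MySearchString _
                    Select node.Value

        For Each linkText In query
            Console.WriteLine(linkText) ' prints: AAA
        Next
    End Sub
End Module
```

If an element has no class attribute, node.@class evaluates to Nothing and the comparison simply fails to match.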

The HTMLAgilityPack is the best one that I've heard of as well.

Svideo (21)
Re: Hexadecimal value is invalid character
As a follow-up, it is working well, but one issue is that the conversion to XML can fail with the error "Hexadecimal value 0x is an invalid character".

There is a pretty good article about stripping control characters out of the HTML before conversion to XML.
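
This is not the code from that article, just a minimal sketch of the idea: drop every character that falls outside the ranges XML 1.0 allows before handing the text to the XML conversion. The function name StripInvalidXmlChars is hypothetical.

```vb
Imports System
Imports System.Text

Module XmlSanitizerSketch
    ' Hypothetical helper: keeps only characters allowed by the XML 1.0 Char production.
    Function StripInvalidXmlChars(ByVal text As String) As String
        Dim sb As New StringBuilder(text.Length)
        For Each c As Char In text
            Dim code As Integer = AscW(c)
            ' Tab, LF, CR and the two Basic Multilingual Plane ranges are allowed.
            ' Surrogate halves are kept as-is; their pairing is not validated here.
            Dim allowed As Boolean = (code = &H9 OrElse code = &HA OrElse code = &HD OrElse
                (code >= &H20 AndAlso code <= &HD7FF) OrElse
                (code >= &HE000 AndAlso code <= &HFFFD) OrElse
                Char.IsSurrogate(c))
            If allowed Then sb.Append(c)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim dirty As String = "AAA" & ChrW(0) & ChrW(&H1F) & "BBB"
        Console.WriteLine(StripInvalidXmlChars(dirty)) ' prints: AAABBB
    End Sub
End Module
```

This only removes raw characters. A numeric character reference in the markup that points at a control character would still need separate handling.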