Kouji Matsui (1) [Avatar] Offline
#1
Hi, I'm reading this section and its code.
Your note says "HashSet is an efficient data structure for lookups", but does this mean the HashSet is only used to consolidate the words?

1. The code consolidates the words with a HashSet, but the set is then used inside a PLINQ query. I think the hash-based lookup gives no advantage here, because the from-in-select clause only iterates over the contained words...
2. I tried benchmarking the difference.

// Builds a matcher that returns the dictionary word closest to the
// given word according to the Jaro-Winkler distance.
private static Func<string, string> PartialFuzzyMatch<T>(T wordSet)
    where T : ICollection<string>
{
    return word =>
        (from w in wordSet.AsParallel()
            select new { Word = w, Distance = w.JaroWinkler(word) })
        .OrderByDescending(w => w.Distance)
        .Select(w => w.Word)
        .FirstOrDefault();
}

private static async Task MainAsync()
{
    using (var httpClient = new HttpClient())
    {
        ////////////////////////////////////
        // Sample search words.

        var searchText = await httpClient.GetStringAsync(
            "https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/asynchronous-workflows");
        var searchWords = searchText.
            Split(new[] {' ', '\r', '\n', '\t'}, StringSplitOptions.RemoveEmptyEntries).
            Distinct().
            ToArray();

        ////////////////////////////////////
        // Dictionary words.

        var text = await httpClient.GetStringAsync(
            "https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/computation-expressions");
        text += await httpClient.GetStringAsync(
            "https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/pattern-matching");
        var words = text.
            Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries).
            ToList();

        ////////////////////////////////////
        // Prepare words.

        // By HashSet
        var fastFuzzyMatch = PartialFuzzyMatch(new HashSet<string>(words));
        // By Array (string[])
        //var fastFuzzyMatch = PartialFuzzyMatch(words.Distinct().ToArray());

        ////////////////////////////////////
        // Benchmark beginning.

        var sw = new Stopwatch();
        sw.Start();

        // Compute.
        var results =
            searchWords.
            Select(word => new { Word = word, Result = fastFuzzyMatch(word)}).
            ToArray();

        sw.Stop();
        Console.WriteLine(sw.Elapsed);
    }
}


The results are interesting...

* HashSet --> avg 1.5sec
* Array (string[]) --> avg 2.0sec

(.NET 4.6.2)

Beforehand I expected the Array to be faster than the HashSet, but the real-world results are the opposite...
Do you know what this means?

# I used the JaroWinkler code from https://github.com/admin2210/EditDistance
Stive (9) [Avatar] Offline
#2
The best way to analyse a performance gap is to run a profiler.

In your case, I would say the gap comes from Distinct(), which is not run in the HashSet case.
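
A minimal sketch of that point (assuming the usual System, System.Collections.Generic, and System.Linq usings): constructing a HashSet<string> from the word list already removes duplicates, so the HashSet path never needs an explicit Distinct() call.

// Sketch: HashSet construction de-duplicates by itself,
// so no explicit Distinct() call is needed on that path.
var words = new List<string> { "let", "async", "let", "match", "async" };

var viaHashSet = new HashSet<string>(words);    // 3 unique words
var viaDistinct = words.Distinct().ToArray();   // also 3 unique words

Console.WriteLine(viaHashSet.Count == viaDistinct.Length);  // True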
Riccardo Terrell (31) [Avatar] Offline
#3
In general, HashSet is the better choice for fast lookups because it is a hash-based collection, so lookups are O(1).
However, that reason does not apply here. As Stive mentioned, the HashSet is used for convenience: it removes duplicates without the extra overhead of calling the Distinct operator.
It is the right tool for the job.
Kudos to Stive!
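
To illustrate the general lookup point (a rough sketch, again assuming the usual System, System.Collections.Generic, System.Diagnostics, and System.Linq usings): Contains on a HashSet<string> is a hash probe, while Contains on a List<string> scans every element, so the O(1) advantage only shows up when you actually call Contains.

// Sketch: HashSet.Contains is an O(1) hash probe on average,
// while List.Contains is an O(n) linear scan.
var wordList = Enumerable.Range(0, 1000000).Select(i => "word" + i).ToList();
var wordSet = new HashSet<string>(wordList);

var sw = Stopwatch.StartNew();
bool inList = wordList.Contains("word999999");   // scans the whole list
Console.WriteLine($"List:    {sw.Elapsed} ({inList})");

sw.Restart();
bool inSet = wordSet.Contains("word999999");     // single hash lookup
Console.WriteLine($"HashSet: {sw.Elapsed} ({inSet})");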