Kouji Matsui (1)
#1
Hi, I'm reading this section and its code.
Your note says "HashSet is an efficient data structure for lookups". Does that mean the HashSet here is only used to consolidate the words?

1. The code consolidates the words into a HashSet, but the set is then used as the source of a PLINQ query. I think the hash-based algorithm gives no advantage there, because the from-in-select clause only iterates over the contained words... (see the enumeration-only sketch at the end of this post)
2. So I tried benchmarking the difference.

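// Builds a matcher that returns the word in wordSet most similar to the input
// word, ranked by Jaro-Winkler score; the scan over wordSet runs in parallel.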
private static Func<string, string> PartialFuzzyMatch<T>(T wordSet)
    where T : ICollection<string>
{
    return word =>
        (from w in wordSet.AsParallel()
            select new { Word = w, Distance = w.JaroWinkler(word) })
        .OrderByDescending(w => w.Distance)
        .Select(w => w.Word)
        .FirstOrDefault();
}

private static async Task MainAsync()
{
    using (var httpClient = new HttpClient())
    {
        ////////////////////////////////////
        // Sample search words.

        var searchText = await httpClient.GetStringAsync(
            "https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/asynchronous-workflows");
        var searchWords = searchText.
            Split(new[] {' ', '\r', '\n', '\t'}, StringSplitOptions.RemoveEmptyEntries).
            Distinct().
            ToArray();

        ////////////////////////////////////
        // Dictionary words.

        var text = await httpClient.GetStringAsync(
            "https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/computation-expressions");
        text += await httpClient.GetStringAsync(
            "https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/pattern-matching");
        var words = text.
            Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries).
            ToList();

        ////////////////////////////////////
        // Prepare words.

        // By HashSet
        var fastFuzzyMatch = PartialFuzzyMatch(new HashSet<string>(words));
        // By Array (string[])
        //var fastFuzzyMatch = PartialFuzzyMatch(words.Distinct().ToArray());

        ////////////////////////////////////
        // Benchmark beginning.

        var sw = new Stopwatch();
        sw.Start();

        // Compute.
        var results =
            searchWords.
            Select(word => new { Word = word, Result = fastFuzzyMatch(word) }).
            ToArray();

        sw.Stop();
        Console.WriteLine(sw.Elapsed);
    }
}


The results are interesting...

* HashSet --> avg 1.5sec
* Array (string[]) --> avg 2.0sec

(.NET 4.6.2)

Before running this, I expected the Array to be faster than the HashSet, but the real-world results are the opposite...
Do you know why?

# I used the JaroWinkler code from https://github.com/admin2210/EditDistance
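
For reference, here is a minimal sketch (not something I measured, just to show the idea) that would isolate the pure enumeration cost over the same unique words, without any Jaro-Winkler work. It assumes the same `words` list that MainAsync builds above:

    var uniqueWords = words.Distinct().ToArray();
    var hashSet = new HashSet<string>(uniqueWords);

    var timer = new Stopwatch();
    timer.Start();
    long total1 = 0;
    foreach (var w in hashSet) total1 += w.Length;   // plain scan of the HashSet
    timer.Stop();
    Console.WriteLine("HashSet enumeration: " + timer.Elapsed);

    timer.Restart();
    long total2 = 0;
    foreach (var w in uniqueWords) total2 += w.Length;   // plain scan of the array
    timer.Stop();
    Console.WriteLine("Array enumeration:   " + timer.Elapsed);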
Stive (9)
#2
The best way to analyse a performance gap is to run a profiler.

In your case, I would say the gap comes from Distinct(), which is not run in the HashSet case.
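
For example, a quick sketch (reusing the `words` list from your MainAsync; timings not included here) that times only the two preparation steps, so you can see what Distinct() itself costs:

    var prep = new Stopwatch();

    prep.Start();
    var hashSet = new HashSet<string>(words);      // deduplicates while inserting
    prep.Stop();
    Console.WriteLine("new HashSet<string>(words): " + prep.Elapsed);

    prep.Restart();
    var array = words.Distinct().ToArray();        // deduplicates, then copies into an array
    prep.Stop();
    Console.WriteLine("words.Distinct().ToArray(): " + prep.Elapsed);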
Riccardo Terrell (16)
#3
In general, HashSet is a better choice for fast lookups because it is a hash-based collection, so lookups are O(1).
However, in this case that reasoning does not apply. As Stive mentioned, the HashSet is used here for convenience: it removes the duplicates without the extra overhead of calling the Distinct operator.
It is the right tool for the job.
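
A tiny sketch (with made-up words, not from the book's sample) to show the distinction:

    var sample = new[] { "async", "await", "task", "async" };

    // Construction drops the duplicate "async" without calling Distinct().
    var set = new HashSet<string>(sample);
    Console.WriteLine(set.Count);             // 3

    // O(1) hash lookup: this is where a HashSet really pays off.
    Console.WriteLine(set.Contains("await")); // True

    // The fuzzy-match query only enumerates the set, so the hash lookup
    // above is never exercised; the win here is the cheap de-duplication.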
Kudos to Stive.