Hi, I'm reading this section and code. Your note said "HashSet is an efficient data structure for lookups", is this means only words consolidate? 1. This code is consolidate words by HashSet, but it used into PLINQ query. I think maybe no advantage hash-set algorithm because only iterated contained words by from-in-select clause... 2. I tried benchmarking differents.
private static Func<string, string> PartialFuzzyMatch<T>(T wordSet)
where T : ICollection<string>
{
return word =>
(from w in wordSet.AsParallel()
select new { Word = w, Distance = w.JaroWinkler(word) })
.OrderByDescending(w => w.Distance)
.Select(w => w.Word)
.FirstOrDefault();
}
private static async Task MainAsync()
{
using (var httpClient = new HttpClient())
{
////////////////////////////////////
// Sample search words.
var searchText = await httpClient.GetStringAsync(
"https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/asynchronous-workflows");
var searchWords = searchText.
Split(new[] {' ', '\r', '\n', '\t'}, StringSplitOptions.RemoveEmptyEntries).
Distinct().
ToArray();
////////////////////////////////////
// Dictionary words.
var text = await httpClient.GetStringAsync(
"https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/computation-expressions");
text += await httpClient.GetStringAsync(
"https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/pattern-matching");
var words = text.
Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries).
ToList();
////////////////////////////////////
// Prepare words.
// By HashSet
var fastFuzzyMatch = PartialFuzzyMatch(new HashSet<string>(words));
// By Array (string[])
//var fastFuzzyMatch = PartialFuzzyMatch(words.Distinct().ToArray());
////////////////////////////////////
// Benchmark beginnig.
var sw = new Stopwatch();
sw.Start();
// Compute.
var results =
searchWords.
Select(word => new { Word = word, Result = fastFuzzyMatch(word)}).
ToArray();
sw.Stop();
Console.WriteLine(sw.Elapsed);
}
}
It's interesting results... * HashSet --> avg 1.5sec * Array (string[]) --> avg 2.0sec (.NET 4.6.2) Before, I think these results are Array faster than HashSet, but real world results are opposite... Do you know what means? # I used JaroWinkler code from https://github.com/admin2210/EditDistance
|