Skip to content

migrate from NHunspell to WeCantSpell.Hunspell #309

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Mar 25, 2025

Conversation

hahn-kev
Copy link
Contributor

@hahn-kev hahn-kev commented Aug 16, 2024

Right now LCM is using NHunspell, which hasn't been updated in 9 years and only supports dotnet framework (it does work on windows through compatibility, but not any other platform).

It looks like the suggested alternative is WeCantSpell.Hunspell.

One issue with WeCantSpell is that it makes the wordlist read only once it's been parsed. There's a discussion about that here, from what I can tell that's the biggest hurdle blocking us from adopting the library. I gave it a try, but I don't understand some of the Hunspell details enough to figure out what's going on in our old version.

Waiting for #306 to be merged in so the tests run.


This change is Reviewable

Copy link

github-actions bot commented Aug 16, 2024

LCM Tests

    16 files  ±0      16 suites  ±0   2m 54s ⏱️ -1s
 2 830 tests ±0   2 810 ✅ ± 0   20 💤 ± 0  0 ❌ ±0 
11 268 runs  ±0  11 100 ✅ +11  168 💤  - 11  0 ❌ ±0 

Results for commit 10f5884. ± Comparison against base commit aee8c1a.

♻️ This comment has been updated with latest results.

@aarondandy
Copy link

If it is of interest to you, I added "Add" methods in the following release: https://www.nuget.org/packages/WeCantSpell.Hunspell/6.0.0 .

@hahn-kev hahn-kev marked this pull request as ready for review March 5, 2025 09:25
Copy link
Contributor

@jasonleenaylor jasonleenaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the updated WordList being serialized back out to disk for WeCantSpell, am I missing it? (I'm pretty sure that we do with NHunspell because we end up suggesting people use that dictionary in LibreOffice ect.

Reviewed 5 of 7 files at r1, 2 of 2 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @hahn-kev)


src/SIL.LCModel.Core/SpellChecking/SpellEngine.cs line 93 at r2 (raw file):

		private bool _isVernacular;
		private bool _gotIsVernacular;

Real question. Now that we have the option which is better, two bools or one bool?

@aarondandy
Copy link

aarondandy commented Mar 5, 2025

I don't see the updated WordList being serialized back out to disk for WeCantSpell, am I missing it?

WeCantSpell doesn't support that. Add operates on the in-memory internal state only. It would be a huge pile of problems to solve to save after modification. For example, should comments be preserved on save? What about the ordering of lines in the dic/aff files? What about any simplifications or fixes that occur during load, should those be preserved? OH HECKIN' HECK what about the file encodings 🫠. Just ... so much to stress over 😬. Does NHunspell support saving? I never noticed that in their API.

To work with v6 , you would have to keep something like:

  • base.aff
  • base.dic
  • stuff-to-add.txt

On load, load "base", then add words from your own file. When adding a word in memory, add it to both the WordList and "stuff-to-add.txt", either directly to the file or in to an in memory List<whtvs> representation to be persisted later.

See also: aarondandy/WeCantSpell.Hunspell#103

Real question. Now that we have the option which is better, two bools or one bool?

If you are asking me, one bool?:

private bool? _isVernacular;
public bool IsVernacular
{
  get
  {
    _isVernacular ??= Check(SpellingHelper.PrototypeWord);
    return _isVernacular.GetValueOrDefault();
  }
}

@hahn-kev
Copy link
Contributor Author

hahn-kev commented Mar 6, 2025

@jasonleenaylor I think saving changes is handled outside NHunspell in an Exception file which you can see here

public void SetStatus(string word1, bool isCorrect)
{
var word = Normalizer.Normalize(word1, Normalizer.UNormalizationMode.UNORM_NFC);
if (Check(word) == isCorrect)
return; // nothing to do.
// Review: any IO exceptions we should handle? How??
SetStatusInternal(word, isCorrect);
var builder = new StringBuilder();
var insertedLineForWord = false;
if (File.Exists(ExceptionPath))
{
using (var reader = new StreamReader(ExceptionPath, Encoding.UTF8))
{
string line;
while ((line = reader.ReadLine()) != null)
{
var item = line;
if (item.Length > 0 && item[0] == '*')
item = item.Substring(1);
// If we already got it, or the current line is before the word, just copy the line to the output.
if (insertedLineForWord || string.Compare(item, word, StringComparison.Ordinal) < 0)
{
builder.AppendLine(line);
continue;
}
// We've come to the right place to insert our word.
if (!isCorrect)
builder.Append("*");
builder.AppendLine(word);
insertedLineForWord = true;
if (word != item) // then current line must be a pre-existing word that comes after ours.
builder.AppendLine(line); // so add it in after item
}
}
}
if (!insertedLineForWord) // no input file, or the word comes after any existing one
{
// The very first exception!
if (!isCorrect)
builder.Append("*");
builder.AppendLine(word);
}
// Write the new file over the old one.
File.WriteAllText(ExceptionPath, builder.ToString(), Encoding.UTF8);
}

SetStatusInternal is abstract and implemented by the specific library, this is basically exactly what aarondandy was suggesting. There's no code that I can see which is saving from NHunspell, I'm not even sure how you would do that given it's API.

@hahn-kev
Copy link
Contributor Author

I've updated this PR to remove NHunspell, and remove the strong name requirement, we're no longer strong naming LCM dlls. I'm not sure if this would qualify as a breaking change. If there's a program using LCM which requires strong named assemblies then this would be breaking, but I wouldn't expect that. As it is there's a bunch of solutions on nuget to sign other nuget packages automatically to work around this problem, so anyone downstream that still wants strong names I would recommend that.

Copy link

@imnasnainaec imnasnainaec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 17 files at r5, all commit messages.
Reviewable status: 15 of 19 files reviewed, all discussions resolved (waiting on @jasonleenaylor)


tests/SIL.LCModel.Core.Tests/App.config line 7 at r5 (raw file):

			<listeners>
				<clear />
				<add name="FwTraceListener" type="SIL.LCModel.Utils.EnvVarTraceListener, SIL.LCModel.Utils, Culture=neutral, PublicKeyToken=f245775b81dcfaab" initializeData="assertuienabled='true' logfilename='%temp%/asserts.log'" />

What did Culture=neutral do and why isn't it needed anymore?


src/SIL.LCModel.Core/SpellChecking/SpellEngineWeCantSpell.cs line 0 at r5 (raw file):
.editorconfig has max_line_length = 98 but this file has a number of lines longer than that.


src/SIL.LCModel.Core/SpellChecking/SpellEngineWeCantSpell.cs line 1 at r5 (raw file):

using System;

It looks like using System; isn't needed.


Directory.Build.props line 18 at r5 (raw file):

    <AppendTargetFrameworkToOutputPath>false</AppendTargetFrameworkToOutputPath>
    <PackageOutputPath>$(MSBuildThisFileDirectory)/artifacts</PackageOutputPath>
	<WarningsAsErrors>NU1605;CS8002</WarningsAsErrors>

This file has a mix of space and tab indenting.


src/SIL.LCModel.Core/SpellChecking/SpellEngine.cs line 11 at r5 (raw file):

using System.Text;
using Icu;
using SIL.PlatformUtilities;

Looks like using SIL.PlatformUtilities; is no longer needed.


tests/SIL.LCModel.Core.Tests/SpellChecking/SpellingHelperTests.cs line 8 at r5 (raw file):

using System.IO;
using System.Linq;
using System.Runtime.InteropServices;

Is using System.Runtime.InteropServices; still needed?


src/SIL.LCModel.Core/SpellChecking/SpellEngine.cs line 94 at r2 (raw file):

		private bool _isVernacular;
		private bool _gotIsVernacular;
		public bool IsVernacular

According to https://stackoverflow.com/a/78230896/10210583, an empty /// <inheritdoc /> is useless; since you've dropped it from IsVernacular`, you might as well drop the other ones in this file.

@aarondandy
Copy link

I'm not sure if this would qualify as a breaking change.

It technically is a breaking change and is why I don't want to get myself involved in any strong naming 😅. I think you should be careful about doing this but that doesn't mean you can't obviously. If you understand your user base well enough (maybe your user base is you) it makes it a lot easier. It has been a long while since I have had to deal with GAC, strong naming, or Full Framework dependencies but I think that is where you may run into most issues. That said I haven't researched it fully myself and I forgot everything I might have known about it a decade ago... and I never looked back!

What did Culture=neutral do and why isn't it needed anymore?

This along with the public key token is part of the assembly's fully qualified name. If you aren't doing strong naming, you don't need that.

Copy link

@imnasnainaec imnasnainaec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 17 files at r5.
Reviewable status: 17 of 19 files reviewed, all discussions resolved (waiting on @jasonleenaylor)

@hahn-kev
Copy link
Contributor Author

Thanks for the review @imnasnainaec, I've applied changes from your feedback

Copy link

@imnasnainaec imnasnainaec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r6, all commit messages.
Reviewable status: 18 of 19 files reviewed, all discussions resolved (waiting on @jasonleenaylor)


Directory.Build.props line 18 at r5 (raw file):

Previously, imnasnainaec (D. Ror.) wrote…

This file has a mix of space and tab indenting.

It still does.

@hahn-kev
Copy link
Contributor Author

Alright @imnasnainaec it turns out the props file wasn't caught by the editor config which is why my editor was formatting it incorrectly. I've fixed that and the whitespace.

Copy link

@imnasnainaec imnasnainaec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @hahn-kev)

@hahn-kev hahn-kev enabled auto-merge (squash) March 25, 2025 06:08
@hahn-kev hahn-kev merged commit 4f1aa51 into master Mar 25, 2025
3 of 4 checks passed
@hahn-kev hahn-kev deleted the migrate-to-we-cant-spell branch March 25, 2025 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants