Skip to content

Remove adjust logic in MMapDirectory, #1090 #1138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 19, 2025

Conversation

paulirwin
Copy link
Contributor

@paulirwin paulirwin commented Mar 8, 2025

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Remove adjust logic in MMapDirectory.

Related #1090

Description

See discussion on #1090. This fixes an "adjust" variable hack to allocate an empty buffer instead of subtracting 1 from the offset.

@paulirwin paulirwin added the notes:bug-fix Contains a fix for a bug label Mar 8, 2025
@paulirwin paulirwin requested a review from NightOwl888 March 8, 2025 22:07
@@ -314,7 +314,7 @@ internal virtual ByteBuffer[] Map(MMapIndexInput input, FileStream fc, long offs
input.memoryMappedFile = MemoryMappedFile.CreateFromFile(
fileStream: fc,
mapName: null,
capacity: length,
capacity: fc.Length, // use the full length of the file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this back to the passed in length. Again, there is no telling how broken the implementation is that we are dealing with. FileStream.Length may not be reliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If FileStream.Length is not reliable (I do not believe this to be the case) then we have bigger problems, because the only caller to this internal method passes in fc.Length as the length argument (see line 243). My hypothesis (since this bug is not reproducible) was that the file length on disk was growing between the call to this method and the call to create the memory mapped file as a race condition. I'm fine with reverting it and letting this fail in that case, it might have just been a one-off fluke. But note that if the reason for reverting it is that FileStream.Length is unreliable, then the calling code would be invalid as well (additionally, CreateFromFile would be bogus too, as no matter what we pass, it uses that same property to check the capacity argument). I do not think we should assume that it is unreliable and try to work around that unless you have a proposed solution. But I'm also fine with accepting this code as possibly rarely concurrency-unsafe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene uses file locking to ensure that the Map() method is not called during a write operation, so the length is consistent during this operation. While you are correct that passing 0 is the equivalent operation as calling FileChannel.map() in Java, passing the length should be fine. More importantly, it has been tested very thoroughly using the length parameter and we haven't had any reports of problems.

Also, since length is passed as a parameter to the method, we should be using it to ensure that the end chunk is calculated using the same value. This would fail in Java if there were a race condition here.

We could use 0 on .NET Framework and .NET Core and pass FileStream.Length in all other cases, but it would require some messy platform detection code or additional assemblies for different platforms. Using length is the best way to ensure it is consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright that has been reverted. This means this PR does not attempt to fix the bug in question, so I've updated the title and description to reflect the change to the adjust value alone. Given that a race condition is the only way I can explain the error, and that it has only happened once in tests, I guess we can just accept this for now. I think we should remove this issue from the 4.8.0 scope as well as a result. If someone runs into this in the wild, they'll probably find the issue and chime in with better reproduction steps. Or maybe it was the result of a cosmic ray bit flip and it'll never happen again in the wild...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed #1090 and there are a few things that could be contributing to the failure. It is specifically testing the file locking functionality and in this case there was a failure on Linux.

  1. Technically, Linux supports FileStream.Lock() which could be changed here, however after reading about it on dotnet/runtime and how it is possible to bypass the CLR to change it, we went with using the same approach we used on macOS, which is to use SharingNativeFSLock. We probably could have used FileStream.Lock() (with NativeFSLock) at the time, however, it seems that they have changed the native call in .NET 6 so it no longer gets an exclusive lock on Linux: FileStream.Lock() differs in behaviour between Windows and Linux using .NET 6.0 dotnet/runtime#64634.
  2. On Windows, the HResult values of IOException are known. However, on all other platforms, these values depend on the underlying OS. We determine what the value is for the current OS by provoking the exception during static initialization and recording the value. There is a non-zero chance that this logic could fail to get a value and be set to null. When that happens, the implementation uses the FallbackNativeFSLock, which just does a best effort, but is known to have concurrency problems. Note there is some debug code to determine the HResult value that was found here.
  3. We added a "hole" in the locking implementation specifically so RAMDirectory can copy the lock file. This "hole" could cause the SharingNativeFSLock to fail. Ideally, the share setting would always be set to Share.None to ensure the lock functions properly in all cases. However, we would need to find some alternative for RAMDirectory to be able to copy the contents of the file while there is a lock held on it. This test failure could have occurred because the lock failed to throw its exception due to using Share.Read instead of Share.None. It might be worth a try to see if RAMDirectory can function without the lock file or with a fake lock file when copying the contents of the directory.

Detecting the HResult code by provoking the exception is not ideal. Also, due to the fact that MS changed the underlying native method for FileStream.Lock() in .NET 6, it doesn't sound like that is a viable alternative any longer on Linux. It may be that dropping to native code is the only way to properly solve it on Linux 100% of the time. Note that RavenDB changed the implementation to do all index file access using native code (including the memory mapped bit and file locking bit). I suspect that we could do the file locking in isolation without dropping to native code on everything, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing this a bit more, the exception is almost certainly due to the RAMDirectory "hole". Allowing Share.Read access in Obtain() means the IOException that causes the lock to function will not occur. Without the exception, IsLocked() can never return true. Since the lock file is just a marker that indicates whether or not the directory is locked, using a fake one for RAMDirectory should be fine.

So, we would need to change

dir.Copy(this, file, file, context);

so there is a special case to create a fake lock file instead of copy it. And remove the "hole" from

// LUCENENET: Allow read access of OpenOrCreate for the RAMDirectory to be able to copy the lock file.
// For the Open case, set to FileShare.None to force a file share exception in IsLocked().
share: mode == FileMode.Open ? FileShare.None : FileShare.Read,

so it is always using Share.None.


Alternatively, we could change the SharingNativeFSLock to attempt to write a byte in IsLocked().

// try to find out if the file is locked by writing a byte. Note that we need to flush the stream to find out.
stream.WriteByte(0);
stream.Flush(); // this *may* throw an IOException if the file is locked, but...
// ... closing the stream is the real test

This would cause an IOException when the file uses Share.Read since it would not be writable, but it would be a different exception than NativeFSLockFactory.HRESULT_FILE_SHARE_VIOLATION, so we would need to test for both cases (and depending on the exception, we may need to store another HResult value during initialization).


However, I still think that the best solution would be to change it to use native methods. This would eliminate exceptions thrown just to check true/false whether or not the directory is locked, since we could rely on native error codes instead. After all, it is supposed to be a NativeFSLockFactory...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good finds! Can you point me to how RAMDirectory is related to this test, though? AFAICT it randomly gets one of the FS_DIRECTORIES instances which is SimpleFSDirectory, NIOFSDirectory, or MMapDirectory... not RAMDirectory. I'm not quite following how that is connected to this failure (although I understand how having Read access could mean the lock is never considered held).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't related to this test. The "hole" was added in 10b326a to fix #356. I haven't tested to see whether the SharingNativeFSLock fails when the first file is opened with FileShare.Read, though. It is just a suspect. I wasn't sure about this "fix" when it was added, but it seemed to work (up until now when we discovered it doesn't always work).

The NativeFSLockFactory needs unit tests for each OS and NativeFSLock implementation, but it might be better to wait until after we consider whether to change the approach to use native method calls.

@@ -314,7 +314,7 @@ internal virtual ByteBuffer[] Map(MMapIndexInput input, FileStream fc, long offs
input.memoryMappedFile = MemoryMappedFile.CreateFromFile(
fileStream: fc,
mapName: null,
capacity: length,
capacity: fc.Length, // use the full length of the file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene uses file locking to ensure that the Map() method is not called during a write operation, so the length is consistent during this operation. While you are correct that passing 0 is the equivalent operation as calling FileChannel.map() in Java, passing the length should be fine. More importantly, it has been tested very thoroughly using the length parameter and we haven't had any reports of problems.

Also, since length is passed as a parameter to the method, we should be using it to ensure that the end chunk is calculated using the same value. This would fail in Java if there were a race condition here.

We could use 0 on .NET Framework and .NET Core and pass FileStream.Length in all other cases, but it would require some messy platform detection code or additional assemblies for different platforms. Using length is the best way to ensure it is consistent.

@paulirwin paulirwin changed the title Fix random failing test with MMapDirectory capacity out of range, #1090 Remove adjust logic in MMapDirectory, #1090 Mar 11, 2025
@paulirwin paulirwin merged commit 52ee6bb into apache:master Mar 19, 2025
275 checks passed
@paulirwin paulirwin deleted the issue/1090 branch March 19, 2025 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
notes:bug-fix Contains a fix for a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants