-
Notifications
You must be signed in to change notification settings - Fork 648
Remove adjust logic in MMapDirectory, #1090 #1138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -314,7 +314,7 @@ internal virtual ByteBuffer[] Map(MMapIndexInput input, FileStream fc, long offs | |||
input.memoryMappedFile = MemoryMappedFile.CreateFromFile( | |||
fileStream: fc, | |||
mapName: null, | |||
capacity: length, | |||
capacity: fc.Length, // use the full length of the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this back to the passed in length
. Again, there is no telling how broken the implementation is that we are dealing with. FileStream.Length
may not be reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If FileStream.Length
is not reliable (I do not believe this to be the case) then we have bigger problems, because the only caller to this internal method passes in fc.Length
as the length
argument (see line 243). My hypothesis (since this bug is not reproducible) was that the file length on disk was growing between the call to this method and the call to create the memory mapped file as a race condition. I'm fine with reverting it and letting this fail in that case, it might have just been a one-off fluke. But note that if the reason for reverting it is that FileStream.Length
is unreliable, then the calling code would be invalid as well (additionally, CreateFromFile would be bogus too, as no matter what we pass, it uses that same property to check the capacity argument). I do not think we should assume that it is unreliable and try to work around that unless you have a proposed solution. But I'm also fine with accepting this code as possibly rarely concurrency-unsafe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lucene uses file locking to ensure that the Map()
method is not called during a write operation, so the length
is consistent during this operation. While you are correct that passing 0
is the equivalent operation as calling FileChannel.map()
in Java, passing the length
should be fine. More importantly, it has been tested very thoroughly using the length
parameter and we haven't had any reports of problems.
Also, since length
is passed as a parameter to the method, we should be using it to ensure that the end chunk is calculated using the same value. This would fail in Java if there were a race condition here.
We could use 0
on .NET Framework and .NET Core and pass FileStream.Length
in all other cases, but it would require some messy platform detection code or additional assemblies for different platforms. Using length
is the best way to ensure it is consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright that has been reverted. This means this PR does not attempt to fix the bug in question, so I've updated the title and description to reflect the change to the adjust value alone. Given that a race condition is the only way I can explain the error, and that it has only happened once in tests, I guess we can just accept this for now. I think we should remove this issue from the 4.8.0 scope as well as a result. If someone runs into this in the wild, they'll probably find the issue and chime in with better reproduction steps. Or maybe it was the result of a cosmic ray bit flip and it'll never happen again in the wild...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed #1090 and there are a few things that could be contributing to the failure. It is specifically testing the file locking functionality and in this case there was a failure on Linux.
- Technically, Linux supports
FileStream.Lock()
which could be changed here, however after reading about it on dotnet/runtime and how it is possible to bypass the CLR to change it, we went with using the same approach we used on macOS, which is to useSharingNativeFSLock
. We probably could have usedFileStream.Lock()
(withNativeFSLock
) at the time, however, it seems that they have changed the native call in .NET 6 so it no longer gets an exclusive lock on Linux: FileStream.Lock() differs in behaviour between Windows and Linux using .NET 6.0 dotnet/runtime#64634. - On Windows, the HResult values of IOException are known. However, on all other platforms, these values depend on the underlying OS. We determine what the value is for the current OS by provoking the exception during static initialization and recording the value. There is a non-zero chance that this logic could fail to get a value and be set to
null
. When that happens, the implementation uses theFallbackNativeFSLock
, which just does a best effort, but is known to have concurrency problems. Note there is some debug code to determine the HResult value that was found here. - We added a "hole" in the locking implementation specifically so
RAMDirectory
can copy the lock file. This "hole" could cause theSharingNativeFSLock
to fail. Ideally, theshare
setting would always be set toShare.None
to ensure the lock functions properly in all cases. However, we would need to find some alternative forRAMDirectory
to be able to copy the contents of the file while there is a lock held on it. This test failure could have occurred because the lock failed to throw its exception due to usingShare.Read
instead ofShare.None
. It might be worth a try to see ifRAMDirectory
can function without the lock file or with a fake lock file when copying the contents of the directory.
Detecting the HResult code by provoking the exception is not ideal. Also, due to the fact that MS changed the underlying native method for FileStream.Lock()
in .NET 6, it doesn't sound like that is a viable alternative any longer on Linux. It may be that dropping to native code is the only way to properly solve it on Linux 100% of the time. Note that RavenDB changed the implementation to do all index file access using native code (including the memory mapped bit and file locking bit). I suspect that we could do the file locking in isolation without dropping to native code on everything, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing this a bit more, the exception is almost certainly due to the RAMDirectory
"hole". Allowing Share.Read
access in Obtain()
means the IOException that causes the lock to function will not occur. Without the exception, IsLocked()
can never return true
. Since the lock file is just a marker that indicates whether or not the directory is locked, using a fake one for RAMDirectory
should be fine.
So, we would need to change
dir.Copy(this, file, file, context); |
so there is a special case to create a fake lock file instead of copy it. And remove the "hole" from
lucenenet/src/Lucene.Net/Store/NativeFSLockFactory.cs
Lines 496 to 498 in 54505e8
// LUCENENET: Allow read access of OpenOrCreate for the RAMDirectory to be able to copy the lock file. | |
// For the Open case, set to FileShare.None to force a file share exception in IsLocked(). | |
share: mode == FileMode.Open ? FileShare.None : FileShare.Read, |
so it is always using Share.None
.
Alternatively, we could change the SharingNativeFSLock
to attempt to write a byte in IsLocked()
.
lucenenet/src/Lucene.Net/Store/NativeFSLockFactory.cs
Lines 799 to 802 in 54505e8
// try to find out if the file is locked by writing a byte. Note that we need to flush the stream to find out. | |
stream.WriteByte(0); | |
stream.Flush(); // this *may* throw an IOException if the file is locked, but... | |
// ... closing the stream is the real test |
This would cause an IOException when the file uses Share.Read
since it would not be writable, but it would be a different exception than NativeFSLockFactory.HRESULT_FILE_SHARE_VIOLATION
, so we would need to test for both cases (and depending on the exception, we may need to store another HResult value during initialization).
However, I still think that the best solution would be to change it to use native methods. This would eliminate exceptions thrown just to check true/false whether or not the directory is locked, since we could rely on native error codes instead. After all, it is supposed to be a NativeFSLockFactory
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good finds! Can you point me to how RAMDirectory is related to this test, though? AFAICT it randomly gets one of the FS_DIRECTORIES instances which is SimpleFSDirectory, NIOFSDirectory, or MMapDirectory... not RAMDirectory. I'm not quite following how that is connected to this failure (although I understand how having Read access could mean the lock is never considered held).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't related to this test. The "hole" was added in 10b326a to fix #356. I haven't tested to see whether the SharingNativeFSLock
fails when the first file is opened with FileShare.Read
, though. It is just a suspect. I wasn't sure about this "fix" when it was added, but it seemed to work (up until now when we discovered it doesn't always work).
The NativeFSLockFactory
needs unit tests for each OS and NativeFSLock
implementation, but it might be better to wait until after we consider whether to change the approach to use native method calls.
@@ -314,7 +314,7 @@ internal virtual ByteBuffer[] Map(MMapIndexInput input, FileStream fc, long offs | |||
input.memoryMappedFile = MemoryMappedFile.CreateFromFile( | |||
fileStream: fc, | |||
mapName: null, | |||
capacity: length, | |||
capacity: fc.Length, // use the full length of the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lucene uses file locking to ensure that the Map()
method is not called during a write operation, so the length
is consistent during this operation. While you are correct that passing 0
is the equivalent operation as calling FileChannel.map()
in Java, passing the length
should be fine. More importantly, it has been tested very thoroughly using the length
parameter and we haven't had any reports of problems.
Also, since length
is passed as a parameter to the method, we should be using it to ensure that the end chunk is calculated using the same value. This would fail in Java if there were a race condition here.
We could use 0
on .NET Framework and .NET Core and pass FileStream.Length
in all other cases, but it would require some messy platform detection code or additional assemblies for different platforms. Using length
is the best way to ensure it is consistent.
Remove adjust logic in MMapDirectory.
Related #1090
Description
See discussion on #1090. This fixes an "adjust" variable hack to allocate an empty buffer instead of subtracting 1 from the offset.