Variable order in large-order expansions (at least in besselk) #67
Thank you for writing this up Chris! Very interesting, and I quite enjoyed diving into this code. I'll have to admit my first reaction was to wonder how many users would want to use this feature or even know how to... as in, would they know how many terms to consider in this expansion, and how would this feature play well with the other algorithms for different branches? I'd assume that the ...

It might help to give some more context here, as I don't really think I've documented this. With the SIMD PR, the U-polynomials are starting to be a small fraction of the overall routine. I'll try to show how much with these benchmarks:

julia> @btime besselk(50.0, x) setup=(x=rand()+50.0)
75.063 ns (0 allocations: 0 bytes)
1.9277371284156387e-13

julia> @btime Bessels.besselk_large_orders(50.0, x) setup=(x=rand()+50.0)
57.477 ns (0 allocations: 0 bytes)
2.1893935965831894e-13

julia> @btime Bessels.Uk_poly_Kn(1.2, 50.0, x, Float64) setup=(x=rand()*5.0)
14.679 ns (0 allocations: 0 bytes)
1.0172531392362487

So there are a couple things going on there: just getting from besselk into besselk_large_orders through the branch checks costs about 18 ns (75.1 - 57.5). The U-polynomial function, though, only takes about 20% of the time of the total routine and about 25% of the time of besselk_large_orders itself.

This effect is apparent in the Float32 implementation, which uses 5 terms instead of 10:

julia> @btime Bessels.Uk_poly_Kn(1.2, 50.0, x, Float32) setup=(x=Float32(rand()*5.0))
7.913 ns (0 allocations: 0 bytes)

So using 5 terms instead of 10 cuts the U-polynomial function in half, but in reality that is about 7 ns. We can benchmark the top function for Float32 to see the difference:

julia> @btime besselk(50.0f0, x) setup=(x=Float32(rand()+50.0))
65.242 ns (0 allocations: 0 bytes)

Which is about 10 ns (or ~13%) faster than the Float64 version.

So I think I made the decision to just use a fixed number of U-polynomials for either Float32 or Float64 primarily because it was easier, but also because at the time I didn't see that significant a jump in computation time using 5 terms instead of 10. But we could absolutely think about having an additional check in the Float64 implementation to use a variable number of terms. I do that for the regular Bessel functions, but that's because the expansion is slower to converge, so there was a large difference going from 20 terms to 10 terms.

Steps forward could definitely be improving other parts of the routine, so any performance increase in the U-polynomials would be reflected at the top of the routine. But I think the big thing for me is definitely having a higher precision implementation! I did try to tackle that by deriving the exact coefficients out. Ideally we could have a routine that would work in either Double64 or Float128.
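(For concreteness, a variable-term check could look something like the following sketch. This is not Bessels.jl API: u_series and uk_poly are illustrative names, though the coefficients are the exact u1/u2 values from the tables further down this thread. Dispatching on a Val term count keeps each truncation level a fully unrolled evalpoly with no runtime loop:)

# Sketch only: pick the number of U-polynomial terms via dispatch so each
# variant compiles down to a fixed, unrolled evalpoly (illustrative names).
u_series(p2, ::Val{2}) = (1.0, evalpoly(p2, (3, -5)) / 24)
u_series(p2, ::Val{3}) = (1.0, evalpoly(p2, (3, -5)) / 24,
                          evalpoly(p2, (81, -462, 385)) / 1152)

uk_poly(x, p2, nterms) = evalpoly(x, u_series(p2, nterms))

uk_poly(1.2, 0.3, Val(2))  # 2-term truncation
uk_poly(1.2, 0.3, Val(3))  # 3-term truncation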
Wow. It blows my mind that the branching is so much of the cost of that function call. How does one even go about debugging or trying to improve something like that? I certainly don't have the impression that all branches cost 18 ns, so what is going on with these ones?

Given that, I totally see your point that just fixing the number of polynomials is probably the way to go. Considering that extra branches might actually be pretty expensive, it seems obviously preferable to spend those CPU cycles getting a more accurate answer instead of waiting so long for the faster one that you barely save time.

With regard to computing the polynomials in higher precision, the function to get them is pretty simple with NIST 10.41.9:

using Polynomials

# DLMF 10.41.9: U_{k+1}(p) = p^2 (1 - p^2) U_k'(p) / 2
#                          + (1/8) * ∫_0^p (1 - 5t^2) U_k(t) dt,  with U_0 = 1.
function Uk_polynomials(T, max_order)
    P0 = Polynomial([one(T)])
    out = [P0]
    mul_int  = Polynomial([one(T), zero(T), T(-5)])                           # 1 - 5t^2
    mul_frnt = Polynomial([zero(T), zero(T), one(T), zero(T), -one(T)])/T(2)  # p^2(1 - p^2)/2
    for j in 1:max_order
        Pjm1     = out[end]
        Pjm1_int = integrate(mul_int*Pjm1)/T(8)  # antiderivative; constant term is zero
        Pjm1_drv = derivative(Pjm1)
        newP = mul_frnt*Pjm1_drv + Pjm1_int - Pjm1_int(zero(T))
        push!(out, newP)
    end
    out
end

and then with that function in scope you could do, say,

julia> Uk_polynomials(Rational{BigInt}, 3)
4-element Vector{Polynomial{Rational{BigInt}, :x}}:
Polynomial(1//1)
Polynomial(1//8*x - 5//24*x^3)
Polynomial(9//128*x^2 - 77//192*x^4 + 385//1152*x^6)
Polynomial(75//1024*x^3 - 4563//5120*x^5 + 17017//9216*x^7 - 85085//82944*x^9)

(I originally did the first 10, and that required going to Rational{BigInt} because the coefficients overflow smaller integer types.)

I assume that bringing Polynomials.jl in as a hard dependency is undesirable, though.
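(If hard-coded tables win out, the exact coefficients can still be pulled out of these polynomials with a one-off generation script, so the package never becomes a dependency. A sketch; coeffs is Polynomials.jl API and returns the coefficient vector, zeros included:)

using Polynomials  # generation time only, not a package dependency

# One-off script: extract exact rational coefficient tuples for each U_k,
# ready to be pasted into a source file as integer numerator tables.
Us = Uk_polynomials(Rational{BigInt}, 3)
coef_tuples = [Tuple(coeffs(p)) for p in Us]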
Yes, the branching aspect is definitely expensive. Some of the checks are more expensive than others, as some require computing square roots or cube roots, which adds to that check overhead... but they are expensive, and I do feel like some of these examples (along with examples from some of the other threads 😅) deserve a bigger write-up or something. I am in crunch time for thesis defense stuff, but hopefully I could write something up about branches and prediction etc. So any improvements to figure out more quickly which branch to take would be very welcome, as that will improve the overall function significantly. There is also the significant overhead of just computing the coefficients before the U-polynomials, which is like 75% of the time in the whole besselk_large_orders routine.

I would definitely prefer hard-coding these into the source file to using Polynomials.jl. We have them up to n=21 stored as exact integers:

Bessels.jl/src/U_polynomials.jl (lines 202 to 224 at 4c4b013)

The problem with higher precision is that we want to avoid BigFloats and BigInt because they are so slow and allocate... It would be nice to have dedicated routines for Double64 and Float128 that did not allocate and were fairly fast. This should be possible, but we need a way to store these coefficients in a way that could be used efficiently without allocating... this would honestly be a really great addition here: adding methods of besselk for Double64 and Float128...
I hear that. I also didn't notice that the integer form was already in there. If you didn't use an automated polynomial tool, I shudder to think how much grinding went into that. I admire your dedication. Also, I forgot you're graduating now! I'm happy to leave this on hold and come back when you have more time and bandwidth.

In terms of concrete things that I think I understand how to do, I could certainly provide analogs of those hard-coded tables for the higher-precision types. Or is there some easy way to hard-code those coefficients without bringing in BigInt? Using just Int128, say?

EDIT: Actually, I just remembered that ...
😄 The amount of time I've spent deriving coefficients for all these expansions definitely surpasses any amount of time I've spent on this actual library, haha! I've gotten a little smarter about it now than a year ago, but it is painful. And no worries about the time: I can definitely think and play around with it right now; I just won't be able to flesh it out completely.
Yes - this was a question I grappled with earlier. Oscar might have a better suggestion for this... However, an advantage of having the exact integers is that we really don't have to put too much effort into this (there is a caveat I'll get to later). Consider a test function for the first 10 coefficients using the exact integer representation:

function test(x::T, p2) where T
    u0 = one(T)
    u1 = evalpoly(p2, (3, -5)) / 24
    u2 = evalpoly(p2, (81, -462, 385)) / 1152
    u3 = evalpoly(p2, (30375, -369603, 765765, -425425)) / 414720
    u4 = evalpoly(p2, (4465125, -94121676, 349922430, -446185740, 185910725)) / 39813120
    u5 = evalpoly(p2, (1519035525, -49286948607, 284499769554, -614135872350, 566098157625, -188699385875)) / 6688604160
    u6 = evalpoly(p2, (2757049477875, -127577298354750, 1050760774457901, -3369032068261860, 5104696716244125, -3685299006138750, 1023694168371875)) / 4815794995200
    u7 = evalpoly(p2, (199689155040375, -12493049053044375, 138799253740521843, -613221795981706275, 1347119637570231525, -1570320948552481125, 931766432052080625, -221849150488590625)) / 115579079884800
    u8 = evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125)) / 22191183337881600
    u9 = evalpoly(p2, (6427469716717690265625, -659033454841709672064375, 11921080954211358275362500, -87432034049652400520788332, 334380732677827878090447630, -741743213039573443221773250, 992115946599792610768672500, -790370708270219620781737500, 345821892003106984030190625, -64041091111686478524109375)) / 263631258054033408000
    u10 = evalpoly(p2, (9745329584487361980740625, -1230031256571145165088463750, 27299183373230345667273718125, -246750339886026017414509498824, 1177120360439828012193658602930, -3327704366990695147540934069220, 5876803711285273203043452095250, -6564241639632418015173104205000, 4513386761946134740461797128125, -1745632061522350031610173343750, 290938676920391671935028890625)) / 88580102706155225088000
    return evalpoly(x, (u0, u1, u2, u3, u4, u5, u6, u7, u8, u9, u10))
end

We can neatly use this function in any precision we want without loading dependencies into the package. The user simply has to load them:

julia> using Quadmath
julia> using DoubleFloats
julia> test(big"1.2", big"1.4")
656724.3266940676809279037485970819304152637485970819304152637485970819304153963
julia> test(Double64(1.2), Double64(1.4))
656724.3266942296
julia> @benchmark test(Double64(1.2), Double64(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.864 μs … 2.908 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.867 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.870 μs ± 29.317 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█
██▂▂▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▂▁▁▁▁▁▂▁▁▂▁▁▁▁▁▂▂▁▂▁▂▂▁▂▁▂▂▂▁▂▁▂▂ ▂
1.86 μs Histogram: frequency by time 2.04 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark test(BigFloat(1.2), BigFloat(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 11.761 μs … 5.188 ms ┊ GC (min … max): 0.00% … 68.27%
Time (median): 12.728 μs ┊ GC (median): 0.00%
Time (mean ± σ): 16.669 μs ± 111.479 μs ┊ GC (mean ± σ): 10.28% ± 1.53%
▃▇█▇▅▃▁ ▂▃▃▂▁▁ ▂▁▁▁▂▁▁▁ ▂
███████▇▅▆███████████▇███████████▇▆▅▆▇▆▆▅▆▅▅▄▄▅▃▄▄▄▅▂▃▂▄▄▃▄▃ █
11.8 μs Histogram: log(frequency) by time 30.6 μs <
Memory estimate: 22.77 KiB, allocs estimate: 513.
julia> @benchmark test(Float128(1.2), Float128(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.977 μs … 5.537 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.068 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.076 μs ± 65.653 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▁▄▅▇█▇▇▆▆▆▆▄▃▃▂▁▁ ▁
▁▁▁▁▁▁▂▂▂▃▄▅▅▆▇███████████████████████▇▇▇▇▆▅▅▅▄▃▃▂▂▂▂▂▂▁▁▁ ▄
2.98 μs Histogram: frequency by time 3.18 μs <
Memory estimate: 0 bytes, allocs estimate: 0.

Clearly we can see the performance advantages of using Double64 and Float128 over BigFloat. And so this just works out of the box with any precision! It is definitely an advantage of Julia just being composable like that. It's also why I spent some time getting the exact coefficients down.

julia> typeof(test(Float32(1.2), Float32(1.4)))
Float32
julia> typeof(test(Float64(1.2), Float64(1.4)))
Float64
julia> typeof(test(Double64(1.2), Double64(1.4)))
Double64 (alias for Double64)
julia> typeof(test(Float128(1.2), Float128(1.4)))
Float128
julia> typeof(test(BigFloat(1.2), BigFloat(1.4)))
BigFloat

So there is a caveat to this... all those integers in the function above can be represented by Int128, and promotion with Int128 is well behaved:

julia> promote_type(Float32, Int128)
Float32
julia> promote_type(Float64, Int128)
Float64
julia> promote_type(Double64, Int128)
Double64 (alias for Double64)
julia> promote_type(Float128, Int128)
Float128

This will preserve the input types very well... but that is not the case for BigInt:

julia> promote_type(Float128, BigInt)
BigFloat

Which will promote them all to BigFloat. The higher-order coefficients eventually outgrow Int128, so to mimic that, here is a version where the last coefficient has been padded out so that it parses as a BigInt:

function test2(x::T, p2) where T
    u0 = one(T)
    u10 = evalpoly(p2, (9745329584487361980740625, -1230031256571145165088463750, 27299183373230345667273718125, -246750339886026017414509498824, 1177120360439828012193658602930, -3327704366990695147540934069220, 5876803711285273203043452095250, -6564241639632418015173104205000, 4513386761946134740461797128125, -1745632061522350031610173343750, 29093867692039167193502889062512334455677)) / 88580102706155225088000
    return evalpoly(x, (u0, u10))
end

Now we can see that no matter what we put into the function, we will get BigFloat:

julia> typeof(test2(Float128(1.2), Float128(1.4)))
BigFloat
julia> typeof(test2(Float64(1.2), Float64(1.4)))
BigFloat
Which of course is bad... The simplest thing to do is to just convert the oversized coefficient to T inside the function:

function test2(x::T, p2) where T
    u0 = one(T)
    u10 = evalpoly(p2, (9745329584487361980740625, -1230031256571145165088463750, 27299183373230345667273718125, -246750339886026017414509498824, 1177120360439828012193658602930, -3327704366990695147540934069220, 5876803711285273203043452095250, -6564241639632418015173104205000, 4513386761946134740461797128125, -1745632061522350031610173343750, T(29093867692039167193502889062512334455677))) / 88580102706155225088000
    return evalpoly(x, (u0, u10))
end

So I've just converted the BigInt coefficient to T, which preserves the input types again:

julia> typeof(test2(Float64(1.2), Float64(1.4)))
Float64
julia> typeof(test2(Float128(1.2), Float128(1.4)))
Float128

But it doesn't make this function completely non-allocating:

julia> @benchmark test2(Float128(1.2), Float128(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 183 evaluations.
Range (min … max): 572.754 ns … 10.086 μs ┊ GC (min … max): 0.00% … 93.20%
Time (median): 595.407 ns ┊ GC (median): 0.00%
Time (mean ± σ): 605.754 ns ± 235.024 ns ┊ GC (mean ± σ): 1.02% ± 2.47%
▁▃▄▄▃▃▂▃▄▃▆██▇▅▂ ▁▁ ▁▁ ▁▁ ▁▁ ▂
▄█████████████████▇▆█████▇▇▇█▇██████████▇▅▆▆▇▆▆█▇█████████▇▇▇ █
573 ns Histogram: log(frequency) by time 669 ns <
Memory estimate: 192 bytes, allocs estimate: 4.

Which is also bad 😄 (the oversized literal is a BigInt, so it gets constructed, and allocated, on every call). So if we could fix that issue, I think this could work. Perhaps having some bigger concrete integer type like Int256 would do it.

Edit: It looks like all the integers would fit into an Int256 up until at least n=21. It would be nice if this was just in Base, though...
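(A quick way to check claims like that, using the oversized literal from the test2 example above:)

# Does this coefficient fit in a fixed-width integer type?
c = big"29093867692039167193502889062512334455677"
typemin(Int128) <= c <= typemax(Int128)  # false, which is why it parses as BigInt
ndigits(c, base=2)                       # 135 bits, so a 256-bit integer would hold it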
I looked a bit more into this, and all the methods we use for ... I think the main hurdle is how we want to store the higher-precision coefficients. Kind of a personal gripe, but I do wish there was just native support for wider integer types like Int256.
Sorry to have lost the thread here. Do you have an opinion about using an outside package for wider integer types, or some other way of storing the coefficients?
Honestly, I just have a purely subjective inclination to not take on any dependency unless there's a very good reason 🤷♂️ very unjulian, but I could be convinced. Again, just an unjustified preference.

So are you imagining a file that has all the coefficients stored as constants of some wider type? Or do you mean having all the coefficients listed in the source and converted to the working type at compile time?
Like, can we somehow get the constants converted ahead of time rather than on every call? Compare:

julia> using DoubleFloats
julia> const P = (Double64(big"1.23"), Double64(big"1.8"), Double64(big"2.3"))
(1.23, 1.8, 2.3)
julia> function test(x)
return evalpoly(x, P)
end
test (generic function with 1 method)
julia> function test2(x)
PP = (Double64(big"1.23"), Double64(big"1.8"), Double64(big"2.3"))
return evalpoly(x, PP)
end
test2 (generic function with 1 method)
julia> using BenchmarkTools
julia> @btime test(Double64(rand()))
34.045 ns (0 allocations: 0 bytes)
2.797159851038373
julia> @btime test2(Double64(rand()))
322.688 ns (6 allocations: 312 bytes)
3.375839473234667

EDIT: Or even better, get this to work for any type...

function test3(x::T) where T
    PP = (T(big"1.23"), T(big"1.8"), T(big"2.3"))
    return evalpoly(x, PP)
end

# Maybe better with parse?
function test3(x::T) where T
    PP = (parse(T, "1.23"), parse(T, "1.8"), parse(T, "2.3"))
    return evalpoly(x, PP)
end
Hm. Well, for your question, maybe a generated function would work, so that the coefficient conversion happens once at code-generation time for each concrete type:

@generated function test4(v::V, x::X) where {V,X}
    T = promote_type(V, X)
    coefs = generate_U_coefs(T) # fixed number
    quote
        u20 = evalpoly(...)
        u19 = evalpoly(...)
        [...]
    end
end

But another thought that just occurs to me is that maybe it wouldn't be very hard to write our own little polynomial type that does the few operations we need, so we could use my earlier function to generate the coefs. That is again completely doable at compile time and would cut down a lot on the amount of hard-inlined giant integer coefficient lists.
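(A minimal runnable version of that idea, with made-up coefficient strings and the illustrative name test5, assuming the target type implements parse as in the test3 above:)

# The parse calls run once per concrete T when the method body is generated,
# so each call is just an evalpoly over constants already of type T.
@generated function test5(x::T) where T
    coefs = (parse(T, "1.23"), parse(T, "1.8"), parse(T, "2.3"))
    return :(evalpoly(x, $coefs))
end

That should give the fast const-tuple behavior from the earlier benchmark without needing a global constant per type.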
Yes - I think the generated functions could be a good solution! Ideally, we just need to generate the constants in the right type in a way that works for a bunch of input types. I think your method for the polynomials is nice, but I also want a method that is general enough for all the other expansions. So, for example, for any minimax polynomials we could apply a similar method instead of regenerating all the polynomials. I'm just hoping we can find a good solution that works for the other expansions besides the U-polynomials, and for the minimax polynomials as well! It could be that the generated functions work for all of these cases nicely!
I finally got around to cleaning up the uniform asymptotic expansion code in BesselK.jl, and I ended up with a solution that is within a few ns of Bessels.besselk_large_orders while still allowing a user to pick the order of the expansion. I'm wondering if any part of this new work would be interesting as a PR here. There are a few parts to it:

Auto-generating the U_k polynomials, which I do with this file. The result is that you automatically generate a file that looks like this, and you can see that there's a struct for each polynomial that trims out zeros and has a slightly specialized eval method. The generation uses Polynomials.jl, but once the file is generated you don't need it, and so it would not be a new dependency.

The actual equivalent to Bessels.besselk_large_orders now looks like this (file ref). Maybe that is objectionable because it is a generated function that automatically generates and uses function names. But it does work, and the @nexprs is a way to be completely sure the loop is unrolling and such. I have a feeling that with some smarter inner-loop tweaks this could at least match the current performance.

There are a few potential advantages to this: the U_k generation file uses Polynomials.jl, and so one could actually generate exact rational representations for all of the U_k polynomial coefficients and potentially use something like that to get more accurate coefficients for big floats or whatever.

I haven't really explored either of those options. But if they're potentially interesting, I'm happy to prepare a PR that puts in the generated U_k polynomials and stuff. I'm sure other people here will have thoughts about the custom struct and methods, so I'm happy to discuss all aspects of it if there is any interest in the first place.