New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Aggregate signature module implementation #461
Conversation
In verification I use Bos-Coster to do the multi-exp which is very fast (consistently about 830 iterations for ten random scalar/point pairs, an iteration being a point-add and a scalar-subtract). Interestingly this does not require any pre-computation. However if you use the aggregate verifier with only one or two points the performance is relatively bad, in the one-point case it just breaks down to using a binary addition ladder. Fixing this would require I change the API to require a verify-enabled secp context, which sucks to require for just a special case. Thoughts? |
src/ecmult_multi_impl.h
Outdated
secp256k1_gej max_p; | ||
secp256k1_gej max_p1; | ||
|
||
iter_count++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, hanging debug code. Should I expose this somehow or just delete it?
Should state-machine violations (e.g. trying to create two partial signatures with same key/nonce) be |
01c3343
to
6c2b341
Compare
I expected it to need this for the edge case. What I haven't benchmarked out but think might be interesting: switching to a boring multiexp once enough points have dropped out that bos-coster is not making much progress anymore. (though this wouldn't be the existing very code, it would be optimized for smaller number (e.g. no endomorphism, smaller precomp, perhaps "joint-point" precomp). |
in your heap, I believe you will get better performance if the heap itself is on indexes to entries which themselves don't move around... your comparison code would know how to dereference the indexes. Because you support only a max of 32 right now char would be fine. |
Heh, yeah, on 30-key signatures using indices in the heap gives a 15%+ improvement on verification speed, which means we now take only 47% as long to verify compared to 30 individual verifications. (These numbers to be taken with salt, given I've only done a few measurements and I'm using my battery-backed laptop with a bunch of other stuff running. But they are pretty consistent.) Before we were at 57%. Using an index-only heap not only gets rid of a bunch of copying, it lets us do swaps with the triple-xor trick ( Next I'll try catching scalar 1's and adding their points to a running return value rather than spamming the heap with them, and using the endomorphism, but I may be in the air for the next 10+ hours and unable to report. |
You need a heapify: repeated insert is O(N log N) instead of O(N). While working on this before I came up with an optimization that halved the number of comparison operations; after googling a bunch I found it had a name, but I don't recall the name or the exact optimization and searching again is failing. :( I think (part of) it went like this: when starting pop the first element and set it aside in a separate super-root position effectively outside of the tree. Then in each iteration, read the super-root and root, compute the new values. Swap the root and super-root (since the root will now be the highest value) then rebalance the root. This saves replaces the rebalance in both the delete+insert with just a single rebalance. I keep thinking there was another optimization which came from enforcing the heap property less strongly which was possible because no insertion is needed (only delete and replace, which only sweep the heap in one direction), but I don't recall what that part was. :( It would be useful to know what a profile looks like .. e.g. how much of the time is spent in comparisons-- this would tell how much gain there would be from optimizations like the above or from speedups in the comparison (e.g. storing the number of non-zero words in the scalars, so comparison can compare just those values or if equal start on non-zero words). For your figures above for the 10 point case 12% of your adds will be gej+ge, so it will likely be faster to detect and switch to the special case. (stop me before I make some observation that there is some crazy effective affine trick that can be done by tracing the computation depth of each point, and putting all at the same depth on the same isomorphism) |
Very tired. Gained a fair bit by replacing the scalar subtraction with simpler code that doesn't do a modular negation, and another fair bit by hashing less stuff (pre-hashing all the constant stuff). Tried switching to native int math in a couple ways when the scalars got small enough, that had no effect I could observe. The endomorphism helped but not as much as I wanted it to. Had some frustration with Agreed profiling would be good at this point to see what's actually happening. I'm doubtful at this point that the heap is costing us a lot, but we'll see. Will try the gej+ge thing a shot, that's quick and easy to do. Spent some time thinking about effective affine, I didn't get anywhere, but maybe I'm just missing something obvious. |
Re: on where my arms were waving about effective affine like tricks. on my desktop, gej+ge is .181us faster than gej+gej, subtracting of 0.0644us per pubkey for additional fe multiplies, it would only take a batch of 29 points that would later gej+ge to pay for the cost of a feinv to bring them back to affine. So, e.g. as points are changed to no longer be affine they could be removed from the heap, and once half the points are removed (ratio will depend on the tradeoff in the asymptotic behavior of the algorithm), the whole thing batch inverted so everything is back at z=1. At least that is the general direction my hands were waving, -- to get really interesting it would be useful to get rid of the inverse. Hm. I thought we already handled the negating for the endo... but I guess the wnaf code is doing that.. and manages to do it without negating the point directly but instead permuting the wnaf table entries. Cute trick. (obviously negations could be deferred here by sign tracking, but I don't see how to avoid most of them). RE: degenerate performance when the magnitudes are too different: When I first pointed out these techniques, sipa remarked: 9:59 < gmaxwell> Anyone looked at using bos-coster (or other similar algorithims) for batch verification? |
You could generalise this by diverting "small" multiples (probably 2-4 bits or so) to an entry in a small-multiples table. Say it's 3 bits, so 7 (non-zero) values. This can be finished off with 3 doubles and 9 adds to get the result in the "1" entry, less if there are empty entries. @gmaxwell That 'replace' operation was the first thing I noticed missing from Java's PriorityQueue. In that class, the heap replaces a removed head with the last (array) element and sifts down. Then a subsequent add is placed at the end and sifted up (these operations are careful to keep a dense representation in the underlying array). A replace should be able to just remove the head, bubble up the largest children, then put the new entry in the "hole" and sift up (not sure Floyd's trick refers to this specifically or any case of sifting new entries up from the end). I'm somewhat skeptical that any sort of z->1 "fix-ups" mid-algorithm can work out; my sense is the cost is likely higher than the savings, but it obviously depends on the average number of additions you're performing on all the intermediate points. Removing points (or rather scalars) from the heap could hurt though. Assuming there is any advantage to be gained, a periodic stop-and-fix should be enough; maybe keep a counter of how many non-affine points have crept in. Looking at n1.P1 + n2.P2 + ..., with n1 >= 2^k.n2. As noted above, https://ed25519.cr.yp.to/ed25519-20110926.pdf mentions using (n1 −2^k.n2).P1 +n2.(2^k.P1 + P2). They don't use it there, because the scalars are all randomised (by 128-bit values), and it virtually never occurs. For a general multi-exp, you couldn't assume anything of course. A few thoughts:
I'd expect the endomorphism to have a large impact for small batches, but the advantage should taper off for larger batches, and memory use is a constant factor higher (<2). I'll try and get a rough "scale" from my Java implementation.
Or just track the "longest length" (hey, it's a heap), and do all comparisons to that length. |
src/modules/aggsig/tests_impl.h
Outdated
CHECK(secp256k1_ec_pubkey_create(ctx, &pubkeys[i], seckeys[i]) == 1); | ||
} | ||
|
||
aggctx = secp256k1_aggsig_context_create(none, pubkeys, 10, seed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line makes address sanitizer complain loudly because it is an out of bounds read (pubkeys only has 5 elements).
Needs rebase now, but added several commits:
Some approximate perf numbers on my laptop with 30-signature aggregates:
The numbers are much less impressive with small aggregates (in fact for one or two signatures they're significantly worse) and much more impressive for very large aggregates. We do actually need to deal with the big-number-small-number performance issues, there is a real DoS vector in the current code where if the user passes an "aggregate signature" with a 16-bit s value, say, the endomorphism-enabled code will spend 112 bits of iterations grinding that Edit: callgrind tells me we're spending 87% of our time in I don't see any obvious ways forward here, for uniformly random input this algorithm does a very good job of quickly reducing its input sizes and not repeating any work. |
I see (every time) test failures after building (--enable-endomorphism=yes if that matters):
Actually there are still some gains to be had in the heap implementation. I re-implemented here: peterdettman@aa01f4b . That's at least 5% faster on |
For the pathological inputs (big-small problem), just adding a binary ladder step when a single subtraction isn't enough seems to at least avoid disaster: peterdettman@cea5d3a . |
Very cool, my numbers also show your code performing 5% or so faster than mine. I'll maybe clean up your commit (just renaming We want to use the existing I'm not sure what to make of your test failures, I don't see this on my code or yours, and Travis seems happy with my code at least. |
I may have reproducedg dettman's crash. looking into it, as an aside:
you only zero the first one, but add 20 of them. If tests are run with ncount 0, some of the entries are uninitilized by earlier code. (looks like this error is common in the new tests, not just at this one point) |
src/scalar_4x64_impl.h
Outdated
r->d[1] = a->d[1] - b->d[1] - (a->d[0] < b->d[0]); | ||
r->d[0] = a->d[0] - b->d[0]; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carry propagation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a
is always greater than b
when this function is called. Each carry is propogated, except that there is no carry to propogate to the difference of the lowest-significance words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider a[0,0,0,1] - b[1,0,0,0], (so a > b), carry at lowest limb not propagated to highest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I understand, thanks. I will fix this and add a testcase.
We obviously have to handled the cases where scalars differ by large amounts, to avoid attackers making our validation slow with effort linear in the slowdown-- but it's obvious how to handle that. While thinking about generally adversarial scalars (stronger threat model than we need for aggregate signatures, perhaps, but it's conservative and we , how could we defend against this pattern?
This one doesn't result in gratuitously different magnitudes but it makes the algorithm exceptionally slow to converge. Should we just have some sanity maximum for the main bos-coster iteration that detects if it is not converging and bails out to a simple straus' algorithm without precomp? Even for cases where an attacker cannot directly control the scalars except via grinding this might have give us a piece of mind in not having to prove that hard cases are cryptographically hard to reach, since even that attack could only cause a 5 times slowdown or whatever. |
@gmaxwell Take a look at peterdettman@cea5d3a (again). The handling for the big-small case is perhaps not what you assumed it is. I tried your case in equivalent Java code and it chewed it up (much faster than random cases). If I switch to the big-small handling mentioned in the eddsa paper, then yes, it's much slower. With no special case, it essentially hangs, of course. I don't yet see a need for a bail-out routine; if the main loop guarantees a minimum 1-bit shortening of the largest scalar in any given iteration (without linear number of point doublings), then AFAICT the overall worst-case is already not much worse than the average. That commit shows a simple way to achieve the 1-bit guarantee, but it can certainly be improved on if we care to. |
@peterdettman I confirm that your "internal ladder step when the result is still bigger then the next best" solves my killer test case. Contrary to my earlier belief the simpler corner case handling ( nX + mY = (n-(2^n)m)X + m((2^n)X + Y) ) is enough to make my 'killer' case not run forever (as you observed), though its still a fair bit slower than random cases. Your proposal is, as you note, faster than random for that input. With your code I think the algorithm must converge exponentially and I am doubtful I can construct something that won't and I think this can be proven. (As an aside, I think switching to the binary ladder step feels a lot like the escape to bisection in dekker's root finding mechanism: https://en.wikipedia.org/wiki/Brent%27s_method#Dekker.27s_method ) |
Rebased everything and pulled in Peter Dettmann's heap improvements. At Greg's suggestion spent some time reading Pippenger's Algorithm by djb. Unfortunately the described algorithm appears almost certainly to be slower than Bos-Coster for our usecase (one multiexp with many bases vs many multiexps that share bases); to get the benefit of Pippenger we need to use the "transpose" of the described algorithm, which seems to require a lot of insight/study to obtain. ...Not to knock djb's writing, he did a phenomenal job translating the original Pippenger paper from a series of lemmas about matrix row counts (seriously) into an algorithm! Next steps:
Then Greg has suggested we do something smarter about memory management than this "maximum 32 aggregates because we want fixed stack size requirements". Need more discussion, once we resolve that I think I can take the WIP label off. (Edit re-push last commit with a signature/timestamp on it) |
Would it be worthwhile to separate off the multimul (we can't really call it multiexp, right?) into a separate PR. That is generally useful utility functionality (which at least @bbuenz has asked for before), while the actual signature scheme may warrant more discussion? |
@sipa I think so, though the aggregate signatures are (a) the only API exposure of the multimul (that feels weird :)) and (b) are also the only benchmark exposure. Though I think I might add a ecmult_benchmark anyway since we have so many different versions of that function now.. |
1692df9
to
185fab9
Compare
include/secp256k1_aggsig.h
Outdated
|
||
/** Generate a nonce pair for a single signature part in an aggregated signature | ||
* | ||
* Returns: a newly created context object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns nonce, I assume
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, oops, no, it only returns an error code.
Address Pieter's nits. |
@apoelstra This will make it easier to integrate Strauss: sipa@67aa632 |
@apoelstra New version: sipa@d22829e |
ok, cherry-picked |
@apoelstra This adds Strauss-wNAF secp256k1_ecmult_multi (used for n < 200, much scientific, wow). sipa@23602bd |
Cherry-picked and rebased. Need to refactor to be sure we're not putting multiple types in the scratch space (alignment) and to have better dispatch logic. |
@apoelstra Squashed, restructured, and with Bos-Coster removed: https://github.com/sipa/secp256k1/commits/20170816_aggsig |
Superceded by #473 |
@apoelstra No it isn't, #473 doesn't include aggsig, only ecmult_multi. |
src/modules/aggsig/main_impl.h
Outdated
if (!secp256k1_fe_set_b32(&fe_tmp, sig64 + 32)) { | ||
return 0; | ||
} | ||
if (!secp256k1_ge_set_xquad(&r_ge, &fe_tmp)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When n_pubkeys is 1 (which is probably going to be pretty common), there is a more efficient verification algorithm where you don't decompress R, but instead just compare the X coordinate of the recomputed version, and check that its Y coordinate is a quadratic residue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I don't think this even needs n_pubkeys
to be 1.
8afc544
to
da9c8bc
Compare
Updated to compare x-coordinate only of R (and checking that its y-coord is a quadratic residue). |
da9c8bc
to
d2491a9
Compare
…(haven't yet compiled
d2491a9
to
5fa7589
Compare
Rebased on latest master. One mild surprise is that I can't call |
Yes, if I recall correctly aggsig previously always used bos-coster and now strauss_wnaf which uses a fair amount of scratch space due to the multiplication table:
|
I have a few minor suggestions in my branch https://github.com/jonasnick/secp256k1/commits/aggsig-module
|
It would be helpful if users could compute the optimal scratch space for verification. I suggest jonasnick@e81fc06 |
I'm closing this because I believe it is obsolete given the recent research that led to a better understanding of multi-sigs and aggregate sigs. We can still take it as inspiration for future implementations of aggregate sigs. |
No description provided.