danbruc 2 hours ago

For calculating the XOR of 1 to n there is a closed form solution, so no need to XOR them together in a loop.

  (n & ((n & 1) - 1)) + ((n ^ (n >> 1)) & 1)
Or a much more readable version

  [ n, 1, n + 1, 0 ][n % 4]
which makes it clear that this function cycles through a pattern of length four.

Why this works can be seen if we start with some n that is divisible by four, i.e. it has the two least significant bits clear, and then keep XORing it with its successors. We start with xxxxxx00 which is our n. Then we XOR it with n + 1 which is xxxxxx01 and that clears all the x's and leaves us with 00000001. Now we XOR it with n + 2 which is xxxxxx10 and that yields xxxxxx11 which is n + 3. The cycle finishes when we XOR that with n + 3, which yields 00000000. So the running XOR takes the values n, 1, n + 3, 0. (Note there is no contradiction with the table: the third value is reached at upper bound n + 2, and (n + 2) + 1 = n + 3, which is exactly the table's n + 1 entry evaluated at n + 2.) Then the cycle repeats, as we are back at zero and at n + 4, which is again divisible by four.
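
Both closed forms are easy to sanity-check by brute force, e.g. in Python (a quick sketch; note that Python's arbitrary-precision integers satisfy n & -1 == n, as the bit trick requires):

  from functools import reduce

  def xor_upto(n):  # reference: XOR of 1..n computed by looping
      return reduce(lambda acc, v: acc ^ v, range(1, n + 1), 0)

  for n in range(1, 10000):
      assert xor_upto(n) == [n, 1, n + 1, 0][n % 4]
      assert xor_upto(n) == (n & ((n & 1) - 1)) + ((n ^ (n >> 1)) & 1)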

  • Thorrez 7 minutes ago

    In your array-based equation, you say n+1, but in your explanation you say n+3. Is that a mistake?

  • iotasilly an hour ago

    Testing that formula for the first 10000 numbers using J.

       xor =: (16 + 2b0110) b.
       f =: 3 : 'xor/ 1 + i. y'
       g =: 3 : '(4 | y) { y, 1, (y+1), 0'
       (f"0 -: g"0)  i. 10000
    
    The result is 1, so this corroborates that your formula is correct in this range. Your proof is general and clear.
antirez 28 minutes ago

About one month ago I applied XOR in a similar (but a bit more complicated) way in the Redis Vector Sets implementation, in the context of sanity-checking a vset value as it is loaded from the RDB file. I believe the way it works is quite interesting and kinda extends the applicability of the trick in the post.

The problem is that in vector sets, the HNSW graph has the invariant that each node has bidirectional links to a set of N nodes. If A links to B, then B links to A. This is unlike most other HNSW implementations. In mine, it is required that links are reciprocal, otherwise you get a crash.

Now, combine this with another fact: for speed concerns, Redis vector sets are not serialized as

    element -> vector

And then reloaded and added back to the HNSW. This would be slow. Instead, what I do is serialize the graph itself: each node with its unique ID and all its links. But when I load the graph back, I must be sure it is "sane" and will not crash my systems, and reciprocal links are one of the things to check. Checking that all the links are reciprocal could be done with a hash table (as in the post's problem), but that would be slower and memory consuming, so how do we use XOR instead? Each time I see a link A -> B, I normalize it, swapping A and B in case A > B. So if links are reciprocal I'll see the same normalized pair A -> B two times, and if I use a register to accumulate the XOR of the two IDs, then at the end a register that is NOT null means trouble: some link may not be reciprocal.

However, in this specific case, there is a problem: collisions. The register may be 0 even if there are non-reciprocal links, in the unlucky case where the few non-reciprocal links happen to XOR to 0. So, to fix this part, I don't XOR the raw IDs but a strong (and large) hash of each normalized pair, which makes such a collision extremely unlikely.
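
A minimal Python sketch of the scheme as described (my illustration, not the actual Redis code):

    import hashlib

    def links_look_reciprocal(links):
        # links: iterable of (a, b) node-ID pairs, one entry per directed link
        acc = 0
        for a, b in links:
            lo, hi = (a, b) if a <= b else (b, a)         # normalize A -> B
            digest = hashlib.sha256(b"%d:%d" % (lo, hi)).digest()
            acc ^= int.from_bytes(digest, "big")          # reciprocal pairs cancel
        return acc == 0

    assert links_look_reciprocal([(1, 2), (2, 1), (7, 3), (3, 7)])
    assert not links_look_reciprocal([(1, 2), (2, 1), (7, 3)])

With a 256-bit accumulator, a set of non-reciprocal links whose hashes happen to XOR to exactly zero is astronomically unlikely.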

It is nice now to see this post, since I was not aware of this algorithm when I used it a few weeks ago. Sure, at this point I'm old enough that I never pretend I invented something, so I was sure this was already used in the past. But, well, in case it was not used for reciprocal-link testing before, this is a new interview question you may want to use for advanced candidates.

jonathanlydall 2 hours ago

This was a go-to interview question, to be solved in C#, at a place I worked at a while back, which had developers allocated to projects working on pretty standard line-of-business systems.

The XOR solution was a valid answer, but not the only answer we would have happily accepted.

The interview question was chosen such that it's very easy to understand and quick to solve, meaning it would indicate the candidate knew at least the basics of programming in C#. Almost surprisingly, we actually had candidates applying for "senior" level positions who struggled with this.

It could be solved in a multitude of ways, e.g.:

- XOR as above

- Use of a HashSet<int>

- Use a for loop and a List containing each number and its count.

- Use LINQ to group the numbers and then find the one with the odd count (sketched below).
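
For instance, a minimal sketch of the counting approach in Python (the interviews used C#, and the exact problem statement isn't given here, so this assumes the common "every number appears twice except one" variant):

  from collections import Counter

  def odd_one_out(numbers):
      counts = Counter(numbers)  # number -> occurrence count
      return next(n for n, c in counts.items() if c % 2 == 1)

  assert odd_one_out([4, 1, 2, 1, 2]) == 4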

As long as what they did worked, it was a "valid" answer, we could then often discuss the chosen solution with the candidate and see how they reacted when we let them know of other valid solutions.

It was really great for not being a "one clever trick" question and could act as a springboard to slightly deeper discussions into their technical thought processes and understanding.

woadwarrior01 41 minutes ago

Anyone interested in bit-level tricks like this should have a copy of Hacker's Delight on their bookshelf.

tromp 3 hours ago

It's funny how the author fails to apply the XOR trick in the two missing values problem:

> We can thus search for u by applying this idea to one of the partitions and finding the missing element, and then find v by applying it to the other partition.

Since you already have u^v, you need only search for u, which immediately gives you v.
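
Concretely, a sketch of the repaired routine in Python (assuming the list holds 1..n with u and v missing):

  def two_missing(lst, n):
      uv = 0                      # will end up holding u ^ v
      for x in range(1, n + 1):
          uv ^= x
      for x in lst:
          uv ^= x
      bit = uv & -uv              # any set bit of u ^ v separates u from v
      u = 0
      for x in range(1, n + 1):   # XOR only the partition with that bit set...
          if x & bit:
              u ^= x
      for x in lst:
          if x & bit:
              u ^= x
      return u, uv ^ u            # ...and v follows immediately, no second search

  assert two_missing([1, 2, 5, 6], 6) == (3, 4)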

praptak 6 hours ago

Fun fact: the xor swap fails when the variables are aliases. This was the trick used in one of the underhanded code competitions.

Basically xor swapping a[i] with a[j] triggered the evil logic when i was equal to j.
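
A minimal demonstration in Python (my sketch):

  def xor_swap(a, i, j):
      a[i] ^= a[j]
      a[j] ^= a[i]
      a[i] ^= a[j]

  v = [5, 7]
  xor_swap(v, 0, 1)   # fine: v == [7, 5]
  xor_swap(v, 0, 0)   # aliasing: the first XOR zeroes the slot, so v == [0, 5]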

  • CodesInChaos 2 hours ago

    The submission by David Wagner and Philippe Biondi at https://bingweb.binghamton.edu/~scraver/underhanded/_page_id...

    The state of RC4 consists of a random permutation of bytes. Whenever it outputs a value, it further permutes the state by swapping two of its bytes. The xor-swap trick sets one of these bytes to zero whenever RC4 attempts to swap an item of the permutation with itself. This gradually zeros out the state, until RC4 outputs the plaintext.
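
    A compact sketch of the failure mode in Python, using textbook RC4 with the buggy swap in the output loop (my reconstruction, not the contest code):

        def rc4_with_xor_swap(key, n):
            S = list(range(256))
            j = 0
            for i in range(256):                    # normal key scheduling
                j = (j + S[i] + key[i % len(key)]) % 256
                S[i], S[j] = S[j], S[i]
            i = j = 0
            out = []
            for _ in range(n):                      # output loop with the xor swap
                i = (i + 1) % 256
                j = (j + S[i]) % 256
                S[i] ^= S[j]; S[j] ^= S[i]; S[i] ^= S[j]   # zeroes S[i] when i == j
                out.append(S[(S[i] + S[j]) % 256])
            return out

    Every time i happens to equal j, another state byte is destroyed and replaced by zero; in the long run the state collapses, the keystream degenerates to zeros, and XOR "encryption" with it outputs the plaintext.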

  • vaylian 4 hours ago

    It would set a[i] to zero instead of swapping two values, right?

analog31 10 hours ago

The first thing that occurred to me is that if a number is missing from a list, the sum of that list will fall short. But I like XOR's.

  • meindnoch 3 hours ago

    Sum and xor are the same operation over different fields: xor is just addition in GF(2), applied bitwise.

  • anitil 10 hours ago

    It really tickles my brain in a lovely way that it avoids all overflow risk as well

    • repiret 8 hours ago

      There is no overflow risk. The trick works on any Abelian group. N-bit values form an Abelian group under xor, where 0 is the identity and every element is its own inverse. But N-bit values also form an Abelian group under addition with overflow (wrapping), where 0 is the identity and the two's complement is the inverse.

      If you’re working on an architecture where a single multiplication and a bit shift is cheaper than N xor’s, and where xor, add, and sub are all the same cost, then you can get a performance win by computing the sum as N(N+1)/2; and you don’t need a blog post to understand why it works.

      • ikurei 3 hours ago

        I think they meant that XOR avoids the overflow risk, whereas summing the array to figure out the missing number could cause an overflow.

        • jonathrg 5 minutes ago

          (wrapping) overflow doesn't affect the final result.
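
          For instance (a sketch simulating fixed-width wrapping arithmetic in Python):

              MASK = (1 << 64) - 1   # pretend we only have 64-bit registers

              def missing_by_sum(A, n):
                  expected = (n * (n + 1) // 2) & MASK
                  actual = 0
                  for x in A:
                      actual = (actual + x) & MASK
                  return (expected - actual) & MASK   # the wrap-arounds cancel

              assert missing_by_sum([1, 2, 4, 5], 5) == 3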

      • OjotCewIo an hour ago

        > The trick works on any Abelian group

        (https://en.wikipedia.org/wiki/Abelian_group -- I'll use ⋆ as the Abelian group's operation, and ~ for inversion, below.)

        I believe you are implying:

        (g(1) ⋆ ... ⋆ g(n)) ⋆ ~(g(i(1)) ⋆ g(i(2)) ⋆ ... ⋆ g(i(n-1))) = g(m)

        where "m" is the group element index that is not covered by "i".

        However, for this to work, it is required that you can distribute the inversion ~ over the group operation ⋆, like this:

        ~(g(i(1)) ⋆ g(i(2)) ⋆ ... ⋆ g(i(n-1))) = ~g(i(1)) ⋆ ~g(i(2)) ⋆ ... ⋆ ~g(i(n-1))

        because it is only after this step (i.e., after the distribution) that you can exploit the associativity and commutativity of operation ⋆, and reorder the elements in

        g(1) ⋆ ... ⋆ g(n) ⋆ ~g(i(1)) ⋆ ~g(i(2)) ⋆ ... ⋆ ~g(i(n-1))

        such that they pairwise cancel out, and leave only the "unmatched" (missing) element -- g(m).

        However, where is it stated that inversion ~ can be distributed over group operation ⋆? The above wikipedia article does not spell that out as an axiom.

        Wikipedia does mention "antidistributivity":

        https://en.wikipedia.org/wiki/Distributive_property#Antidist...

        (which does imply the distributivity in question here, once we restore commutativity); however, WP says this property is indeed used as an axiom ("in the more general context of a semigroup with involution"). So why is it not spelled out as one for Abelian groups?

        ... Does distributivity of inversion ~ over operation ⋆ follow from the other Abelian group axioms / properties? If so, how?
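
        Edit: thinking about it more, it does follow. In any group, associativity alone gives the "socks and shoes" identity:

          (x ⋆ y) ⋆ (~y ⋆ ~x) = x ⋆ (y ⋆ ~y) ⋆ ~x = x ⋆ ~x = e,   hence   ~(x ⋆ y) = ~y ⋆ ~x

        and commutativity then lets us reorder the right-hand side into ~x ⋆ ~y, which is exactly the distribution needed above. (For XOR the question is moot anyway: every element is its own inverse, so ~ is the identity map.)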

      • lblume 4 hours ago

        You can also calculate the XOR-accumulation of all values between 1 and n in O(1) using a lookup table like this:

            [n, 1, n+1, 0][n%4]
    • analog31 10 hours ago

      True, I hadn't thought of that. I'm spoiled by Python. ;-)

iotasilly 3 hours ago

Since J allows you to write short code, here are three examples in J. The first uses iota1000, the second a random permutation, and the third uses matrix notation to create a little guessing game.

Example 1: Find the missing number

  xor =: (16 + 2b0110) b.
  iota1000 =: (i. 1000) 
  missingNumber =: (xor/ iota1000) xor (xor/ iota1000 -. 129) 
  echo 'The missing number is ' , ": missingNumber
This prints 'The missing number is 129'

Example 2: Using a random permutation, find the missing number.

   permuted =: (1000 ? 1000)
   missingNumber =: (xor/ permuted) xor (xor/ permuted -. ? 1000)

 
Example 3: find the missing number in this matrix.

  _ (< 2 2) } 5 5 $ (25 ? 25) 

   12  9  1 20 19
    6 18  3  4  8
   24  7  _ 15 23
   11 21 10  2  5
    0 16 17 22 14
Final test: repeat example 3 ten times (with fresh random matrices) and collect the time it takes you to solve each one in a list of times, then compute the linear-regression best fit with

  times %. (1 ,. i. 10)
Did you get better at solving it by playing more times?

I am not affiliated with J, but in case you want to try some J code there is a playground: https://jsoftware.github.io/j-playground/bin/html2/

Edited: It seems I am procrastinating a lot about something I have to do but don't want to.

hsfzxjy 9 hours ago

To derive "The XOR trick" I think both *associativity* and commutativity are needed.

That is, one should also prove a ^ (b ^ c) = (a ^ b) ^ c. Instinctive, but non-trivial.
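
Both laws can be checked by brute force over the truth table, e.g. in Python:

  from itertools import product

  bits = (0, 1)
  assert all(a ^ b == b ^ a for a, b in product(bits, repeat=2))                  # commutativity
  assert all(a ^ (b ^ c) == (a ^ b) ^ c for a, b, c in product(bits, repeat=3))   # associativity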

akovaski 7 hours ago

The partitioning algorithm to find two missing/duplicate numbers is clever, I wouldn't have thought of that. It should also work if you have a list with 1 missing and 1 duplicate, yeah? You'd probably have to do an extra step to actually find out which number is missing and which is a duplicate after you find the two numbers.

> If more than two elements are missing (or duplicated), then analyzing the individual bits fails because there are several combinations possible for both 0 and 1 as results. The problem then seems to require more complex solutions, which are not based on XOR anymore.

If you consider XOR to be a little bit more general, I think you can still use something like the partitioning algorithm. That is to say, considering that XOR on a bit level behaves like XOR_bit(a,b) = (a + b) % 2, you might consider a generalized XOR_bit(a,b,k) = (a + b) % k. With this I think you can decide partitions with up to k missing numbers, but I'm too tired to verify/implement this right now.

ameliaquining 10 hours ago

Ah, my least favorite technical interview question. (I've been asked it, but only after I first read about it online.)

  • phendrenad2 9 hours ago

    Indeed, it kind of feels like asking if someone knows what the number 5318008 means.

  • motorest 5 hours ago

    > Ah, my least favorite technical interview question.

    The epitome of turning technical interviews into a trivia contest to make them feel smart. Because isn't that the point of a tech interview?

    • empiko 5 hours ago

      Is there any other field where they give you random brain teasers in an interview? My friends outside of IT were laughing their heads off when they heard about the usual interview process.

      • sfn42 4 hours ago

        I've always had reasonable interview questions. Get some data from an API and display it in a table. Make a class that can store car data and get them by plate number. Make a class that calculates tax based on a bracket system.

        I haven't even read the article so I don't know what this is about really but if an interviewer seriously asked me about some obscure xor trick I'd laugh at them.

    • ur-whale 3 hours ago

      In what way is that question trivia?

      I believe you under-estimate what a good interviewer is trying to do with questions such as these:

      Either you've seen the trick before, in which case you get an opportunity to show the interviewer that you're an honest person by telling them you have. Huge plus, and the interview can move on to other topics.

      Or you haven't, in which case you can demonstrate to the interviewer your analytical skills by dissecting the problem step by step and understanding what the code actually does and how.

      Bonus if you can see the potential aliasing problem when used to swap two variables.

      Not a trivia question at all.

burnt-resistor 10 hours ago

In ye olden days, bit-manipulation operations were faster than arithmetic operations.

And sometimes even faster than a load immediate, hence XOR AX, AX instead of MOV AX, 0.

  • heisenbit 8 minutes ago

    And in these modern days it matters that an algorithm can use divide and conquer and can be parallelized; XOR plays nicely here. The lack of carry bits and reduced branching also help in the crypto space.

  • GuB-42 9 hours ago

    "xor ax, ax" is still in use today. The main advantage is that it is shorter, just 2 bytes instead of 3 for the immediate, the difference is bigger in 32 and 64 bit mode as you have to have all these zeroes in the instruction.

    Shorter usually means faster, even if the instruction itself isn't faster.

    • sparkie 6 hours ago

      In long mode, compilers will typically emit `xor eax, eax`, as it only needs 2 bytes: The opcode and modrm byte. `xor ax, ax` takes 3 bytes due to the operand size override prefix (0x66), and `xor rax, rax` takes 3 bytes due to the REX.W prefix. `xor eax, eax` will still clear the full 64-bit register.

      Shorter basically means you can fit more in instruction cache, which should in theory improve performance marginally.

      • Someone 4 hours ago

        Size isn’t everything. You should start by reading the manual for your CPU to see what it advises. The micro-architecture may treat only one of the sequences specially. For modern x64, I think that indeed is the shorter xor sequence, where, internally, the CPU just renames the register to a register that always contains zero, making the instruction independent of any earlier instructions using eax.

        IIRC, Intel said a mov was the way to go for some now ancient x86 CPUs, though.

    • tyfighter 8 hours ago

      Modern x86 implementations don't even do the XOR. It just renames the register to "zero".

    • burnt-resistor 9 hours ago

      Barely. x86 is fading. Arm doesn't do this in GCC or Clang.

      > Shorter usually means faster

      It depends, so spouting generalities doesn't mean anything. Instruction-cache line filling vs. cycle reduction vs. reservation-station ordering is typically a compiler constraint-optimization problem.

      • userbinator 8 hours ago

        Arm doesn't do this in GCC or Clang.

        Because Arm64 has a zero register, Arm32 has small immediates, and all instructions are the same length anyway.

mzs 6 hours ago

I like the 'store prev ^ next' trick for lists that can be walked from the front or from the back.
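
Each node stores prev ^ next in a single link field, and you traverse by carrying the previous node's address along. A toy Python version, using list indices in place of pointers (my illustration):

  def build(values):
      # nodes[i] = (value, prev_index ^ next_index); index 0 is the null sentinel
      nodes = [(None, 0)]
      n = len(values)
      for i, v in enumerate(values, start=1):
          prev_i = i - 1
          next_i = i + 1 if i < n else 0
          nodes.append((v, prev_i ^ next_i))
      return nodes

  def walk(nodes, start, came_from=0):
      i, prev = start, came_from
      while i != 0:
          value, link = nodes[i]
          yield value
          i, prev = link ^ prev, i    # next = link ^ prev; remember where we came from

  nodes = build(['a', 'b', 'c', 'd'])
  print(list(walk(nodes, 1)))               # forward:  ['a', 'b', 'c', 'd']
  print(list(walk(nodes, len(nodes) - 1)))  # backward: ['d', 'c', 'b', 'a']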

ZoomZoomZoom 3 hours ago

> XOR is commutative, meaning we can change the order in which we apply XOR. To prove this, we can check the truth table for both x ^ y and y ^ x

This is nonsensical: where does the second truth table come from? Instead you just observe that, by definition, 1^0 == 0^1.

XeO3 8 hours ago

Apart from these applications of XOR, a favourite of mine is using bitwise AND to check whether a number is even or odd.
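
For example:

  def is_odd(n):
      return n & 1 == 1   # the lowest bit carries the parity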

moron4hire 10 hours ago

Why do people hate traditional for loops so much? In a conversation about petty micro optimizations, we end up performing two loops instead of one, all because sticking three operations in one statement is "yucky"?

  • devjab 4 hours ago

    I think you raise a good question, but Python doesn't have a traditional for loop. To do it in one loop, you'd either have to simulate a traditional for loop with something like range, or you'd have to build a C/Zig/Rust lib and use it with cffi (or whatever the Rust equivalent is called, I forget). Or you're going to do it the "pythonic" way and write two loops, probably with a generator. As far as micro-optimisation goes, I'd argue that it depends on what you want: speed, or stable memory consumption? The single loop will be faster (for the most part), but the flip side is that there is a limit on how big a data set it can handle.

    It's all theoretical though. On real world data sets that aren't small I don't see why you wouldn't hand these tasks off to C/Zig/Rust unless you're only running them once or twice.

  • ToValueFunfetti 9 hours ago

    xor(1..n) = switch(n % 4) { case 0: return n; case 1: return 1; case 2: return n + 1; default: return 0; }

    So you don't actually need the first loop (at least for the set of integers 1..n example), but bringing that up is probably out of scope for this article.

  • delifue 10 hours ago

    Its main benefit is to avoid needing an extra data structure (like a hash map) to find the missing or duplicated element, using O(n) time and O(1) space.

    • moron4hire 9 hours ago

      No, again, that's not my point. The code from the article is O(2n) when it could be O(n). I know we're not supposed to care about constant factors, but I've lived in a world where not micro optimizing the ever loving shit out of my software could potentially make people throw up, so this sort of stuff kind of stands out to me.

      • repiret 9 hours ago

        The code in the article is written in Python, whose only for loop is for-each. It is 2N XOR operations, regardless of whether you use one or two loops.

        I probably would have written it with a single loop, using the `enumerate` iterator adapter. But in Python, two loops is almost certainly more efficient.

        • Dylan16807 8 hours ago

          You can loop over the range and then do result ^= i ^ A[i]. If adapters are slow you don't need them here.

          Having only one loop gives you a better memory access pattern, because it's 2 XOR operations in between each memory access. Two loops is the same number of instructions, but it spends one loop ignoring memory and then another loop doing rapid-fire memory accesses. In python there's enough overhead that it's unlikely to matter. But in a faster language running on a busy machine that could make a real difference.
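
          Concretely, for the missing-number variant (a sketch; A holds 1..n with one value absent, so len(A) == n - 1):

              result = n                    # fold in the range's extra element up front
              for i in range(n - 1):
                  result ^= (i + 1) ^ A[i]  # one pass, two XORs between memory accesses
              # result is now the missing value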

        • sfn42 32 minutes ago

          for i in range(n - 1):

          Is pretty much a standard for loop. Between that and

          for n in numbers:

          You can do pretty much the same things as a more conventional language.

          You could also solve it pretty simply like this:

          expected_sum = n * (n + 1) // 2

          missing_num = expected_sum - sum(numbers)

          This only iterates the list once. This would probably be my solution if I was given this task in an interview.

      • NoahZuniga 7 hours ago

        O(2n) doesn't exist. The whole point of big O is that you ignore such "trivial" things as the constant factor in front of the n.

        • moron4hire 7 hours ago

          Did I not say that?

          • haskellshill 6 hours ago

            >we end up performing two loops instead of one, all because sticking three operations in one statement is "yucky"

            You seem to believe that "O(2n)"

              for value in range(1, n + 1):
                result ^= value
              for value in A:
                result ^= value
            
            is slower than "O(n·2)"

              for value in range(1, n + 1):
                result ^= value
                result ^= A[value-1]
            
            simply because the latter has one "for loop" less. Am I misunderstanding you, or if not, why would this matter for speed?
            • a_t48 5 hours ago

              Unless both loops get unrolled, it's ever so slightly slower due to having to check for the end value twice. Plus potentially a cache miss at the start of the second loop.

            • codebje 3 hours ago

              None of this is as straightforward as it seems.

              A "for" loop in Python isn't particularly cheap. It compiles to some static overhead to set up the iterator, then each loop iteration compiles to the "FOR_ITER" opcode, a "STORE_FAST" opcode to assign the iteration value to a variable, and the body of the loop.

              "FOR_ITER" calls the "__next__()" method of the iterator (which is on top of the interpreter object stack), catches the StopIteration exception to know when to terminate the loop (by jumping past the loop body), and stores the iterator value to the top of the stack. What "__next__()" does is totally opaque - we don't know what kind of object A is - but since we've added the overhead of a function call already it wouldn't matter if it was a super tight bit of machine code, we're already paying a (relatively) hefty runtime cost.

              A particularly bad implementation of "__next__()" for some custom iterable collection might be so stupid as to walk through the collection until it reaches the current index's item and returns that, so "for value in A" could in fact be O(n^2).

              Plus, "result ^= A[value-1]" is substantially more work than "result ^= value", so even just on the loop bodies the two examples aren't very similar at all. Evaluating "A[value-1]" may wind up calling a "__getitem__()" method on A.

              If A is, say, a linked list or binary tree, iterating it is very cheap but indexing it is O(n), so the second loop might be O(n^2) where the first is O(n).

              So maybe we should be a bit more Pythonic, and do:

                  for i, value in enumerate(A):
                      result ^= i
                      result ^= value
              
              One loop, no indexing of A! But we've not actually saved anything: the __next__() method of enumerate's iterator will increment its index then call the __next__() method of A's iterator, (approximately) the same work as if we'd done two FOR_ITER, one for an index and one for A.

              Why would this matter for speed? I don't know. Unless 'n' is pretty big a human won't even notice the execution time of any of this code.

            • moron4hire 3 hours ago

              Even assuming Python's foreach loop in these cases gets optimized down to a very bare for loop, the operations being performed are dominated by the looping logic itself, because the loop body is so simple.

              Each iteration of a for loop performs one index update and one termination comparison. For a simple body that is just an XOR, that's the difference between performing 5 operations (update, exit check, read array, XOR with value, XOR with index) per N elements in the one loop case versus 7 operations (update, exit, read array, XOR with value, then update, exit, XOR with index) in the two loop case. So we're looking at a 29% savings in operations.

              It gets worse if the looping structure does not optimize to a raw, most basic for loop and instead constructs some kind of lazy collection iterator generalized for all kinds of collections it could iterate over.

              The smaller the loop body, the higher the gains from optimizing the looping construct itself.

          • codebje 6 hours ago

            You did, but it might not be an effective strategy to mention asymptotic complexity to help forward your argument that one linear implementation is faster than another.

            Whether it's a win in Python to use one or two loops isn't so clear, as a lot is hidden behind complex opcodes and opaque iterator implementations. Empirical testing might help, but a new interpreter version could change your results.

            In any case, if we want to nitpick over performance we should be insisting on a parallel implementation to take advantage of the gobs of cores CPUs now have, but now we're on a micro-optimisation crusade and are ignoring the whole point of the article.

      • perfmode 9 hours ago

        Real-world performance will depend on how much of that N fits in cache, and in which cache it fits (L1, L2, L3). Once loaded, you may not pay much cost to access each value a second time.

        • MindSpunk 8 hours ago

          Doing 2 loops over the data means you have to do a full pass over the data twice. If your N doesn’t fit in L3 then you’re going to load N twice instead of once. Loading twice, even out of L1 is still slower than never loading twice at all.

          • moron4hire 8 hours ago

            Exactly. And there's also the fact that sometimes the data stream we're processing is unbounded and ephemeral. For example, reading values from a physical sensor. It may not match up to this specific example, but the point remains that a single "loop" over a data set might be all you get, so pack as much into that loop as you can.

  • anitil 10 hours ago

    I think it's just an interesting approach to solving particular limited problems. If I needed to solve this I'd end up either using set arithmetic or sorting the list, both of which use more memory and time. Maybe down low in some compiler loop or JVM loop this could be the difference between a sluggish application and a snappy one

    • moron4hire 10 hours ago

      That's not my point. My point is that the exact same code from the original article could be done in a single, traditional for-loop, instead of two for-each loops.

st0le 11 hours ago

Another fun trick I've discovered.

`XOR[0...n] = 0 ^ 1 .... ^ n = [n, 1, n + 1, 0][n % 4]`

  • nullc 10 hours ago

    Tables yuck :P, maybe

    XOR[0...x] = (x&1^(x&2)>>1)+x*(~x&1)

    • bsdz 4 hours ago

      ~Is there a simple proof for this type of identity?~

      Actually I found something through Gemini based on the table mod 4 idea in the previous post. Thanks.

  • tialaramex 10 hours ago

    Right, or in summary: no, you don't need to do all that extra work up front.

ur-whale 3 hours ago

One interesting problem related to the trick (which, as pointed out elsewhere in the thread, fails spectacularly when the two variables alias to the same memory location) is to find other dyadic functions of integers that have the same property.

makeset 6 hours ago

Fun fact: you can show that there is another binary operator that performs the same triple assignment swap.
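
Presumably XNOR, the bitwise complement of XOR; a quick sketch over 8-bit values (my guess at the intended operator):

  M = 0xFF                      # all-ones mask for 8-bit values

  def xnor(x, y):
      return (x ^ y ^ M) & M    # bitwise NOT of x ^ y

  a, b = 0x35, 0x5A
  a = xnor(a, b)
  b = xnor(a, b)                # b now holds the original a
  a = xnor(a, b)                # a now holds the original b
  assert (a, b) == (0x5A, 0x35)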

johnea 11 hours ago

Wow! This is a flashback! Hope you're doing well Andy W!