At work, we've had some 0.8.5 Cassandra clusters running stably for nearly a year now. That's fairly far behind the current stable Cassandra release, but we got things to a stable point and didn't want to mess with it further, so 0.8.5 it is.
So, a node died. A volume went dead, and that was that. Fortunately, the documentation on how to handle node failure is pretty clear, particularly the docs from DataStax.
We went for option two, bringing up a new node at the adjacent token. Unfortunately, the new node saw the dead node it was meant to replace as up, and thus in bootstrapping attempted to stream data from the dead node. Shockingly, that doesn't work.
Why did the new node think the dead node was up? Every other node in the ring saw the dead node as down.
My suspicion, not truly confirmed by a proper review of the code but conveniently strengthened by observed behavior, is that it comes down to seeds.
When we built our cassandra clusters, we used chef. And not just chef, but "cluster chef". As the link shows, there are several generations of cluster chef. When we built our clusters, version 2 was under development, so we based things on version one (if memory serves). That's so old and decrepit that they don't even list it.
Unfortunately, while that version made rolling out a new cluster relatively easy, it was not so great at maintaining existing clusters, and it made some unwise choices when selecting seeds for your Cassandra ring. Specifically:
- Every node known to cluster chef was used as an entry in the seed list
- Including each node's own IP
(Note: it is entirely possible that subsequent releases of cluster chef make better choices. Of course, it's similarly possible they do not.)
(Note further: I was aware of this stupid choice, but frankly, it was working, and development efforts were focused on totally separate parts of the system. So I left it alone, noting but not comprehending the magnitude of this folly. It is thus not unreasonable to conclude that somebody should shoot me in the face.)
Populating the seed list this way made cluster maintenance a real nuisance, because a new node would always come up thinking of itself as a seed, and seeds won't auto-bootstrap. So we would carefully pick out initial tokens and make sure the node's attributes were set to bring it up with auto_bootstrap, and of course it wouldn't bootstrap, and we'd be stuck scrambling to repair a node that had joined the ring without ever receiving any of its range data. Our reliance on QUORUM operations saved us in this case; a sub-QUORUM operation could have led to fairly unpleasant levels of inconsistency (which for us would've been unacceptable, but of course that varies according to use).
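To make that failure mode concrete, here's a minimal sketch of the rule as I observed it (not as verified against the Cassandra source): a node that finds its own address in the seed list skips bootstrapping no matter what auto_bootstrap says. The function name and the IP addresses are made up for illustration; this is not Cassandra's code or cluster chef's.

```python
def will_auto_bootstrap(node_ip, seeds, auto_bootstrap=True):
    """Return True if the node would stream its range data on first start.

    Models the observed behavior only (an assumption, not Cassandra's
    actual implementation): a node listed in its own seed list treats
    itself as a seed and skips bootstrapping.
    """
    if not auto_bootstrap:
        return False
    return node_ip not in set(seeds)


# Hypothetical addresses, for illustration only.
ring = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
new_node = "10.0.0.4"

# What we had: every known node is a seed, including the new node itself.
everything_is_a_seed = ring + [new_node]
# What we wanted: a small, hand-picked list that excludes the new node.
picked_seeds = ["10.0.0.1", "10.0.0.2"]

print(will_auto_bootstrap(new_node, everything_is_a_seed))  # False
print(will_auto_bootstrap(new_node, picked_seeds))          # True
```

With the everything-is-a-seed list, the answer is False regardless of how carefully the token and auto_bootstrap were set, which is exactly the trap we kept falling into.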
Anyway, about seeds. Take a look at the gossip architecture. Note that seeds are special.
I read up more on seeds and general practices within the community, and found that folks in the know seemed to prefer a smaller, actively managed seed list. They kept it consistent across the cluster (or at least across a "data center"), and kept it up to date.
At this point, I:
- Stopped Cassandra on the new node that was stuck in its never-ending-'cause-it's-never-starting streaming state
- Picked two of the servers with marginally lighter load as seeds, and updated the configuration for all other nodes in the ring. (The dead node was explicitly excluded from the possible seeds, naturally.)
- Performed a rolling restart of all other nodes in the ring
- Verified that all the nodes listed the old dead node as, well, dead (down).
- Updated the seed configuration to be consistent on the new node
- Brought up the new node, and verified that it saw the dead node as down, and that it was streaming data from one of the live neighbors instead of the dead one.
- Cheered. Hurray!
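The configuration checks buried in those steps are easy to get wrong by hand, so here's a rough sketch of a pre-flight check one could run before starting the replacement node. It assumes you have already collected each node's configured seed list into a dict somehow; the function name, the dict shape, and the addresses are all hypothetical, and it only inspects configuration, not what gossip actually reports.

```python
def preflight_problems(seed_lists, dead_ip, new_ip):
    """Return a list of human-readable problems; an empty list means go.

    seed_lists maps each node's address to the seed list in its config.
    (Hypothetical input shape; gathering it is left to your tooling.)
    """
    problems = []

    # The seed list should be identical on every node.
    distinct = {tuple(sorted(seeds)) for seeds in seed_lists.values()}
    if len(distinct) > 1:
        problems.append("seed lists differ across nodes")

    # The dead node must not appear as a seed anywhere.
    for node, seeds in sorted(seed_lists.items()):
        if dead_ip in seeds:
            problems.append(f"{node} still lists dead node {dead_ip} as a seed")

    # The replacement must not consider itself a seed, or it won't bootstrap.
    if new_ip in seed_lists.get(new_ip, []):
        problems.append(f"new node {new_ip} lists itself as a seed")

    return problems


# Hypothetical ring: 10.0.0.3 is the dead node, 10.0.0.4 its replacement.
seeds = ["10.0.0.1", "10.0.0.2"]
configs = {ip: list(seeds) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.4"]}

for problem in preflight_problems(configs, dead_ip="10.0.0.3", new_ip="10.0.0.4"):
    print(problem)
```

For the configuration shown, nothing is printed. It's no substitute for actually checking that every live node reports the dead one as down, but it catches the seed-list mistakes before they cost you a bootstrap attempt.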
Some lessons I take from this, which at some point I'll dig into the Cassandra source to verify and understand:
- Don't accept the choices of a tool you don't know well; pick the seeds yourself
- Actively manage the seeds so they are known to you and reflect a reasoned choice.
- If a seed dies, get it out of the seed list and propagate that change throughout the ring before you try to replace the dead node.