Robert Cowham's Weblog

Fast Perforce Checkpointing   03 Feb 06

There was an interesting discussion not too long ago on how to do fast checkpointing for your server.

The basic procedure for checkpointing as part of backups is shown in the System Administrator's Guide. At larger sites a checkpoint can take tens of minutes, and sometimes hours, which is inconvenient if your backup windows are limited because people work in different time zones.

As an aside on the -z option, which zips a checkpoint as it is written: it is worth measuring, on your own server hardware, the CPU overhead of zipping against the time spent writing the checkpoint to disk. In some circumstances it will be worth zipping as you checkpoint, and in others zipping offline afterwards.
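One rough way to make that comparison is to time writing a sample file (an old checkpoint, say) both plain and through gzip. The following is only a sketch: the function names are made up for illustration, Python's gzip module stands in for whatever compression p4d -z actually performs, and operating system caching can flatter the plain write, so treat the numbers as indicative at best.

```python
import gzip
import os
import shutil
import tempfile
import time


def timed_write(src, dst, compress):
    """Copy src to dst, optionally through gzip; return elapsed seconds."""
    opener = gzip.open if compress else open
    start = time.perf_counter()
    with open(src, "rb") as fin, opener(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return time.perf_counter() - start


def compare_plain_vs_gzip(sample_path, target_dir=None):
    """Time a plain and a gzipped copy of sample_path.

    target_dir (hypothetical parameter) should be on the disk where
    checkpoints are normally written; None uses the system temp directory.
    """
    with tempfile.TemporaryDirectory(dir=target_dir) as tmp:
        plain_out = os.path.join(tmp, "sample.plain")
        gzip_out = os.path.join(tmp, "sample.gz")
        result = {
            "plain_seconds": timed_write(sample_path, plain_out, compress=False),
            "gzip_seconds": timed_write(sample_path, gzip_out, compress=True),
            "plain_bytes": os.path.getsize(plain_out),
            "gzip_bytes": os.path.getsize(gzip_out),
        }
    return result
```

If the gzip time is much larger than the plain time, zipping offline after the checkpoint is probably the better choice; if the two are close, zipping as you go saves disk space at little cost.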

The procedure below is reproduced with thanks to Chris Bartz, who posted it to the Perforce User mailing list in such a well-documented fashion:

Okay, here are the gory details. I can't take credit for inventing it; Perforce tech support gave me most of the details and I'm pretty sure others are doing very similar things. To bootstrap the process you need to create an offline database. This is done by:

1) Use "p4 counter journal" to get the journal counter value. The checkpoint name will be checkpoint.<journal counter+1>.
2) "p4 admin checkpoint" (or "p4d -jc" if you prefer)
3) Optional. Zip and back up the truncated journal file
4) Delete the old offline database db.* files
5) Build the offline database with "p4d -r <offlineDir> -jr <checkpoint>"
6) Zip and back up the checkpoint

We do the above steps once a week so that we start each week with a fresh offline database. We currently keep all the journals between rebuilding the offline database so we could recover from a real checkpoint plus journal files if there was some problem with the offline database.
Rebuilding and keeping all the journals in between isn't really required but when I set it up I wasn't 100% confident in the whole process. If I were making other changes to the process I would probably go with once a month rebuilds and maybe not keep all the journals.
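The weekly bootstrap steps can be sketched as a small Python plan builder. This is an illustration, not part of the original posting: the function name, the "/p4/offline" path and the counter value 41 are made up, the checkpoint file is assumed to be reachable from the directory where the command runs, and the zip, backup and delete steps are left as site-specific placeholders.

```python
def weekly_rebuild_plan(journal_counter, offline_root, checkpoint_prefix="checkpoint"):
    """Return the weekly rebuild as an ordered list of (description, argv) steps.

    journal_counter is the value reported by "p4 counter journal" before the
    checkpoint is taken. argv is None for steps left to local tooling.
    """
    # The live checkpoint is named with the journal counter plus one.
    checkpoint = f"{checkpoint_prefix}.{journal_counter + 1}"
    return [
        ("checkpoint the live server", ["p4", "admin", "checkpoint"]),
        ("zip and back up the truncated journal (optional)", None),
        ("delete old db.* files in " + offline_root, None),
        ("rebuild the offline database",
         ["p4d", "-r", offline_root, "-jr", checkpoint]),
        ("zip and back up " + checkpoint, None),
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for description, argv in weekly_rebuild_plan(41, "/p4/offline"):
        print(description, "->", " ".join(argv) if argv else "(site-specific)")
```

Building the plan as data rather than running it directly makes it easy to review the exact commands before wiring them into a scheduled job.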

The offline checkpoint is done daily with:

1) Use "p4 counter journal" to get the value of the journal counter
2) Truncate the journal file with "p4d -r <root> -jj <journal filename>". This creates a file <journal filename>.jnl.<journal counter> and starts a new journal file
3) Read the truncated journal into the offline database with "p4d -r <offline root> -jr <journal filename>.jnl.<journal counter>"
4) Optional. Zip and back up the journal file
5) Checkpoint the offline database with "p4d -r <offline root> -jd <checkpoint>.<journal counter + 1>". The journal counter + 1 is used so that the checkpoint has the same name Perforce would give it if we checkpointed the live database.
6) Optional. Zip and back up the checkpoint
7) Optional. Delete old checkpoints and journals (we keep 3 checkpoints and all journals since the last rebuild of the offline database).

When this is done we have a checkpoint and journal file that should be exactly the same as if we did the "p4 admin checkpoint" on the live database. There is essentially zero downtime (except the weekly rebuild).
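As a sketch of how the daily steps fit together (again an illustration rather than part of the original posting), the three p4d invocations can be generated from the journal counter. The function names, paths and counter value are hypothetical; the flags are the ones quoted in the steps above. The runner defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def daily_offline_checkpoint(journal_counter, live_root, offline_root,
                             journal_prefix, checkpoint_prefix):
    """Return the daily p4d commands, in order, as argument lists.

    journal_counter is the value reported by "p4 counter journal" before
    the journal is truncated.
    """
    rotated_journal = f"{journal_prefix}.jnl.{journal_counter}"
    # Counter + 1 matches the name a live checkpoint would have been given.
    checkpoint = f"{checkpoint_prefix}.{journal_counter + 1}"
    return [
        # Truncate the live journal and start a new one.
        ["p4d", "-r", live_root, "-jj", journal_prefix],
        # Replay the truncated journal into the offline database.
        ["p4d", "-r", offline_root, "-jr", rotated_journal],
        # Checkpoint the offline database.
        ["p4d", "-r", offline_root, "-jd", checkpoint],
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    run_commands(daily_offline_checkpoint(
        41, "/p4/root", "/p4/offline", "journal", "checkpoint"))
```

Stopping at the first failure matters here: replaying a journal that was never truncated, or checkpointing an offline database that missed a journal, would quietly produce a checkpoint that no longer matches the live server.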

The offline database could be on another machine and the checkpoint could be done there if disk space or processing power were an issue.

The depot files are backed up after this process is done. We do not shut Perforce down for that backup. You really don't need to; what Perforce does to handle this is simple and does work.

A further note on that final remark: imagine that the metadata (as restored from the checkpoint plus journal) has information about 9 revisions of a file, but because the depot backup happened a little after the checkpoint (so the journal is a little out of date), the RCS format archive file actually contains 10 revisions. The server will carry on fine. Obviously the opposite case (metadata has 10 revisions and the archive file only 9) is not fine. In both cases, if there is some inconsistency you will potentially lose some work, but most recent activity will be stored in people's workspaces (and they may remember any changelists they have recently submitted).
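The asymmetry between the two cases can be written down as a toy model. This is purely illustrative: the function and its labels are invented here, and it says nothing about how the Perforce server itself checks archives.

```python
def restore_outcome(metadata_revisions, archive_revisions):
    """Classify a restored file by comparing revision counts.

    metadata_revisions: revisions known to the restored database.
    archive_revisions: revisions present in the backed-up archive file.
    """
    if archive_revisions == metadata_revisions:
        return "consistent"
    if archive_revisions > metadata_revisions:
        # Archive is newer than the metadata: the server carries on, but
        # the extra revisions are not known to it.
        return "server carries on"
    # Metadata refers to revisions that the archive does not contain.
    return "missing archive content"


if __name__ == "__main__":
    print(restore_outcome(9, 10))
    print(restore_outcome(10, 9))
```

This is why backing up the depot files after the checkpoint, rather than before, is the safer ordering.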

Thus your disaster recovery scenario needs to include what happens when you get your server back online and what people need to do (e.g. Tech Note 2 - Working Disconnected). Sally Page of Symbian gave an excellent presentation to the UK User Group on Symbian's DR experience and lessons learnt.

 

Copyright © 2008 Robert Cowham