Robert Cowham's Weblog

Fast Perforce Checkpointing   03 Feb 06

There was an interesting discussion not too long back on how to do fast check-pointing for your server.

The basic procedure with check-pointing for backups is shown in the System Administrator's Guide. For larger sites it can take tens of minutes, getting up to hours sometimes, which becomes inconvenient if you have limited windows for backup due to people working in different time-zones etc.

As an aside on the -z option to zip a checkpoint while backing up: it is worth measuring, on your own server hardware, the CPU overhead of zipping against the time taken to write the checkpoint to disk. In some circumstances it will be quicker to zip the checkpoint as it is written, and in others to write it uncompressed and zip it offline.
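One rough way to get a feel for that trade-off is to time the same data written plain and written through gzip. The sketch below is only an illustration: the sample record is made up, gzip stands in for whatever compression you actually use, and the operating system's write cache will flatter the plain figure. A real test would use a real checkpoint on the real disks.

```python
import gzip
import os
import tempfile
import time


def time_checkpoint_write(data: bytes, compress: bool) -> float:
    """Return the seconds taken to write data to a temp file, optionally gzipped."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        opener = gzip.open if compress else open
        with opener(path, "wb") as out:
            out.write(data)
        return time.perf_counter() - start
    finally:
        os.remove(path)


# Made-up stand-in for checkpoint text: one repetitive journal-style record.
sample = b"@pv@ 1 @db.have@ @//depot/some/file.txt@ 1 0\n" * 200_000

plain_seconds = time_checkpoint_write(sample, compress=False)
gzip_seconds = time_checkpoint_write(sample, compress=True)
print(f"plain: {plain_seconds:.3f}s  gzip: {gzip_seconds:.3f}s")
```

If the gzip time is well above the plain time, zipping offline is probably the better choice for that machine.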

With thanks to Chris Bartz who posted it to the Perforce User mailing list in such a well documented fashion:

Okay, here are the gory details. I can't take credit for inventing it; Perforce tech support gave me most of the details and I'm pretty sure others are doing very similar things. To bootstrap the process you need to create an offline database. This is done by:

1) use "p4 counter journal" to get journal counter value. The checkpoint name will be checkpoint.<journal counter+1>.
2) "p4 admin checkpoint" (or "p4d -jc" if you prefer)
3) Optional. Zip and backup the truncated journal file
4) Delete old offline database db.* files
5) Build offline database with "p4d -r <offlineDir> -jr <checkpoint>"
6) Zip and backup checkpoint
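The weekly bootstrap above could be scripted along these lines. This is a minimal sketch, not Chris's actual script: the function name and the idea of returning argument lists rather than running anything are assumptions made for illustration, and the only commands used are the ones quoted in the steps.

```python
from typing import List, Tuple


def weekly_rebuild_commands(journal_counter: int, offline_dir: str) -> Tuple[str, List[List[str]]]:
    """Return the expected checkpoint name and the commands for the weekly rebuild.

    journal_counter is the value reported by "p4 counter journal" (step 1).
    """
    # Step 1: the checkpoint will be named checkpoint.<journal counter + 1>.
    checkpoint = f"checkpoint.{journal_counter + 1}"
    commands = [
        # Step 2: checkpoint the live server.
        ["p4", "admin", "checkpoint"],
        # Step 4 (deleting the old offline db.* files) must happen here,
        # before the restore; it is left as a manual step in this sketch.
        # Step 5: build the offline database from the new checkpoint.
        ["p4d", "-r", offline_dir, "-jr", checkpoint],
    ]
    return checkpoint, commands


name, commands = weekly_rebuild_commands(41, "/p4/offline")
print(name)
for argv in commands:
    print(" ".join(argv))
```

Zipping and backing up (steps 3 and 6) are left out since they depend on local tooling.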

We do the above steps once a week so that we start each week with a fresh offline database. We currently keep all the journals between rebuilding the offline database so we could recover from a real checkpoint plus journal files if there was some problem with the offline database.
Rebuilding and keeping all the journals in between isn't really required but when I set it up I wasn't 100% confident in the whole process. If I were making other changes to the process I would probably go with once a month rebuilds and maybe not keep all the journals.

The offline checkpoint is done daily with:

1) use "p4 counter journal" to get value of journal counter
2) Truncate journal file with "p4d -r <root> -jj <journal filename>". This creates a file <journal filename>.jnl.<journal counter> and starts a new journal file
3) Read truncated journal into offline database with "p4d -r <offline root> -jr <journal filename>.jnl.<journal counter>"
4) Optional. Zip and backup journal file
5) Checkpoint offline database with "p4d -r <offline root> -jd <checkpoint>.<journal counter + 1>". The journal counter + 1 is so it has the same name as perforce would give it if we checkpointed the live database.
6) Optional. Zip and backup checkpoint
7) Optional. Delete old checkpoints and journals (we keep all journals between rebuilding the offline database and 3 checkpoints).
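The daily steps reduce to three p4d invocations whose file names all derive from the journal counter. The sketch below only builds the argument lists so the naming can be checked before anything is run; the function name, the default file names and the example paths are assumptions for illustration.

```python
from typing import List


def daily_offline_checkpoint_plan(
    journal_counter: int,
    root: str,
    offline_root: str,
    journal: str = "journal",
    checkpoint: str = "checkpoint",
) -> List[List[str]]:
    """Return the p4d commands for steps 2, 3 and 5 of the daily procedure.

    journal_counter is the value reported by "p4 counter journal" (step 1).
    """
    # Name of the truncated journal, as described in step 2.
    truncated = f"{journal}.jnl.{journal_counter}"
    return [
        # Step 2: truncate the live journal.
        ["p4d", "-r", root, "-jj", journal],
        # Step 3: replay the truncated journal into the offline database.
        ["p4d", "-r", offline_root, "-jr", truncated],
        # Step 5: checkpoint the offline database under the live naming scheme.
        ["p4d", "-r", offline_root, "-jd", f"{checkpoint}.{journal_counter + 1}"],
    ]


for argv in daily_offline_checkpoint_plan(41, "/p4/root", "/p4/offline"):
    print(" ".join(argv))
```

Each list could be handed to subprocess.run(argv, check=True) once the printed commands look right for your site.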

When this is done we have a checkpoint and journal file that should be exactly the same as if we did the "p4 admin checkpoint" on the live database. There is essentially zero downtime (except the weekly rebuild).

The offline database could be on another machine and the checkpoint could be done there if disk space or processing power were an issue.

The depot files are backed up after this process is done. We do not shut perforce down for that backup. You really don't need to; what perforce does to handle this is simple and does work.

Another note on the final remark: imagine that the metadata (as restored from the checkpoint + journal) has information about 9 revisions of a file, and yet, because the depot backup happened a little after the checkpoint (so the journal is a little out of date), the RCS format archive file actually contains 10 revisions. The server will carry on fine. Obviously the opposite case is not fine (metadata has 10 revisions and the archive file only 9). In both cases, if there is some inconsistency you will potentially lose some work, but most recent activity will be stored in people's workspaces (and they may remember any changelists they have recently submitted).
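The two cases can be stated as a simple comparison of revision counts. The function below is purely illustrative: its name and return strings are made up, and it is not something the server exposes.

```python
def restore_outcome(metadata_revs: int, archive_revs: int) -> str:
    """Classify a restored file by comparing metadata and archive revision counts."""
    if archive_revs == metadata_revs:
        return "consistent"
    if archive_revs > metadata_revs:
        # Archive is newer than the metadata: the server carries on,
        # but the extra revisions are unknown to it.
        return "server fine, newest archive revisions not in metadata"
    # Metadata refers to revisions that the archive file does not contain.
    return "problem, metadata refers to missing archive revisions"


print(restore_outcome(9, 10))
print(restore_outcome(10, 9))
```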

Thus your disaster recovery scenario needs to include what happens when you get your server back online and what people need to do (e.g. Tech Note 2 - Working Disconnected). Sally Page of Symbian gave an excellent presentation to the UK User Group on Symbian's DR experience and lessons learnt.

Review of 59 Minute Scrum   02 Feb 06

I went to the 59 Minute Scrum  session last October which sounded interesting - from the blurb:

"This session will give attendees a unique opportunity to experience agile practices first hand in a non-technical environment."

I had good expectations but was left a little under-whelmed at the actual event. Now I have to admit that I arrived late and missed the intro so maybe it's all my fault and the intro answered everything! (Discussion over a beer afterwards showed I wasn't totally alone though).

The format was to split up into teams and discuss various scenarios. These were implementation projects including marketing, programmes and anything else we fancied on topics such as: a theme park around "spam", a health club for pets and a space tourism project.

The event was fairly tightly timetabled so we had a slot to discuss our backlog, then split up further in our teams to actually come up with ideas on particular areas. Then round 2 for more implementation. Finally, all the teams (8 if I remember correctly - typically 2 per idea) presented their solutions to the whole group and touched briefly on how well they had addressed the backlog/made progress etc.

There was then an all-too-brief Q&A session at the end.

For me the format didn't really work. I was left with too many questions and faffing about in the exercise had very limited benefit. I would much rather have had a well thought through presentation on Scrum with lots of time for questions. At most half an hour of some exercises and then a presentation and lessons learnt would have worked much better for me.

Maybe I am just becoming a stick-in-the-mud?! I am likely to steer clear of such sessions in the future though unless there is time for a solid hour or more afterwards to get into details and ask questions.

Perforce Recovery Story (and tips for processing seriously large journal files)   06 Jan 06

The Problem

I was recently at a client site looking at sorting out a perforce repository which had some database errors (due to disk problems on the NAS server). A quick look at the server log showed entries like:

Perforce server error:
	Date 2005/12/01 10:03:12:
	Operation: user-fstat
	Operation 'dbscan' failed.
	Database scan error on db.have!
	dbscan: db.have: Cannot create a file when that file already exists. 
	Corrupt tree

The Solution

The easiest solution initially seemed to be to restore from a previous checkpoint and the current journal. (I started by copying all the db.* and journal files before starting to make any changes.)

Then things started becoming more complicated. The journal file turned out to be 9Gb in size (and yes that meant they hadn't done a checkpoint for a looong time!).

I tried the journal recovery with high hopes (after removing db.* files in that directory), but...

E:\Perforce>p4d -r . -jr g:\backup.ckp.51 journal
Recovering from g:\backup.ckp.51...
Recovering from journal...
Perforce server error:
        Journal file 'journal' replay failed at line 42327877!
        Bad opcode '' journal record!

Note the size of the journal file (43M lines!):

E:\Perforce>time /t && wc journal && time /t
11:58
43,611,160 398236488 9536756331 journal
12:01
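If wc is not to hand, the line count can be had by streaming the file in chunks so that nothing close to 9Gb is ever held in memory. A minimal sketch (the function name and chunk size are arbitrary choices):

```python
def count_lines(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Count newline characters in a file, reading it one chunk at a time."""
    lines = 0
    with open(path, "rb") as journal:
        while True:
            chunk = journal.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
    return lines
```

Usage would be along the lines of count_lines("journal"). Like wc, it counts newline characters, so a final line without a trailing newline is not counted.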

Fortunately I had installed some Unix utilities for Windows (from unxutils.sf.net and yes that is unx not unix) including wc and as it later turned out, sed.

Trying to edit a 9Gb file with Notepad or Write (all that were available on a Windows 2003 Server) was not possible. I couldn't find any other easily downloadable editor capable of such feats and I was wondering how I could do anything sensible with this journal file. But then I realised I had sed available to me, and it was then fairly easy to start identifying the problem and set about resolving it.

Identifying and Fixing The Corrupted Lines

To print out the block of lines around the error:

E:\perforce>sed -n -e "42327850,42327900{p;}" journal > extract.txt 
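For anyone without sed, the same extraction can be sketched in Python. This is an equivalent written for illustration, not what was used on site; the function name is made up.

```python
from itertools import islice
from typing import List


def extract_lines(path: str, first: int, last: int) -> List[bytes]:
    """Return lines first..last (1-based, inclusive) without loading the whole file."""
    with open(path, "rb") as journal:
        # islice skips the first (first - 1) lines and stops after line 'last'.
        return list(islice(journal, first - 1, last))
```

The equivalent of the sed command above would be extract_lines("journal", 42327850, 42327900), with the result written to extract.txt. The file is read in binary mode so that stray non-text bytes in a damaged journal do not cause decoding errors.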

This showed some spaces or other strange characters at the start of line 42,327,877 (lines chopped for brevity):

@pv@ 1 @db.have@ @//FSmith/x-platform/packages/doc/html-tool/spur.gif@ ...
                     @rv@ 3 @db.user@ @bjones@ @bjones@@somecompany.com@ @@ 1121675402 ...
@rv@ 3 @db.domain@ @UK-A7993@ 99 @UK-A7993@ @d:\dev@ @@ @@ @fredb@ 1125593143 ...

(the problem line is the second with @rv@). I was able to “fix” this line with the following (all on one line):

time /t && 
sed -n -e "1,42327876{p;};42327877,42327877{s/^[^@]*//;p;};42327878,${p;}" journal > 
new.journal && time /t 

The sed command just prints all lines unchanged except for the offending line, on which it runs a regular expression removing the unwanted chars at the start of the line (it turns out they weren't just spaces either). This created new.journal with the offending line fixed (“time /t” just shows current time – it took 5-6 minutes to process the 9Gb file).
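The same single-line repair can be sketched in Python for those without sed. This is an illustrative equivalent rather than the command actually run; the function name is made up, and the pattern additionally refuses to cross a newline, which sed does implicitly.

```python
import re

# Everything before the first "@" on the line, never crossing the newline.
LEADING_JUNK = re.compile(rb"^[^@\n]*")


def fix_line(src: str, dst: str, line_number: int) -> None:
    """Copy src to dst, stripping leading non-@ bytes from one line (1-based)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for number, line in enumerate(fin, start=1):
            if number == line_number:
                line = LEADING_JUNK.sub(b"", line, count=1)
            fout.write(line)
```

The equivalent call would be fix_line("journal", "new.journal", 42327877). As with the sed version, the original journal is left untouched.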

As it later turned out, this new journal file still had some problems since it appeared to be out of order with respect to the latest checkpoint file (the records in it for the journal counter in db.counter did not have the expected value). As a result, I started to lose confidence in the reliability of the journal file at all.

So in the end I took a different tack using the "undocumented" (see "p4 help undoc") commands p4d -xv/-xr to both validate the various database tables (db.*) and then to recover them. There only appeared to be an error in db.have table which is not that worrying (it is a list of all files synced to client workspaces and thus can be reset by the users in the last resort).

Validating db.review
Validating db.have
Problems Summary:
pages which are not connect to tree or freelist
Validating db.label
Validating db.integ

(The -xr option just fixed things).
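With many tables, it can help to pull out just the ones that reported problems. The sketch below assumes the output always looks like the excerpt above ("Validating <table>" lines, with any problems listed after a "Problems Summary:" line); that format is inferred from this one excerpt only, so treat it as a guess.

```python
from typing import Dict, List


def tables_with_problems(output: str) -> Dict[str, List[str]]:
    """Map each table name to the problem lines reported after it."""
    problems: Dict[str, List[str]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Validating "):
            current = line[len("Validating "):]
        elif line != "Problems Summary:" and current is not None:
            problems.setdefault(current, []).append(line)
    return problems


sample_output = """Validating db.review
Validating db.have
Problems Summary:
pages which are not connect to tree or freelist
Validating db.label
Validating db.integ
"""
print(tables_with_problems(sample_output))
```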

And The Moral of the Story is...

Well there are potentially lots of morals here, but a selection is:

  1. Keep your journal file on a different disk (volume) to your database (db.*) files if you can to avoid a single disk problem corrupting both.
  2. Do regular checkpoints! (Once a week is probably the bare minimum, though once every 24 hours is usually ideal). There are various mechanisms for dealing with large databases if checkpoint times become a burden (e.g. many tens of minutes).
  3. When dealing with large files, don't forget those unix tools such as sed which are always there and very powerful - also easy to install on Windows.
  4. Remember to talk to support since they will know about relevant undocumented or otherwise commands (in some circumstances they have "fixed" a checkpoint or journal file using internal tools and resent it to the client). At the very least they will act as a sounding board and confirm that what you are planning to do makes sense - always worth doing given that you are often dealing with the "crown jewels" of a company's intellectual property and also that commands often take a reasonable amount of time to run - hours can flit by unnoticed (well unless you are holding up a project team when every minute is begrudged).
  5. Consider disaster recovery up front (all part of business continuity - look at ITIL/BS15000 for some ideas on this). Spend an appropriate amount of money on your server and disks (RAID etc) to try and avoid these errors in the first place. However, Murphy's law is always lurking and it is often the little things that catch you out (e.g. air conditioning dies and server then dies). Thus you need the backup strategies (checkpointing etc) in place as appropriate.
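The first point is easy to check with a script. The sketch below compares the device identifier the operating system reports for two paths; the function name and the example paths are assumptions for illustration. Note that two different volumes can still sit on the same physical disk or NAS, so a False result is necessary but not sufficient.

```python
import os


def on_same_volume(path_a: str, path_b: str) -> bool:
    """Return True if both paths report the same device (volume) identifier."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

For example, on_same_volume(r"E:\Perforce", r"G:\journals") returning True would be a warning sign.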

Serena Buyout - To Avoid Sarbanes Oxley?!   05 Dec 05

I went to the Serena UK Customer day recently where there was a brief discussion as to the reasons for the £1.2bn Serena buyout by Silver Lake Partners. The FAQ (follow previous link) mentions:

"We expect that under private ownership we will have far greater flexibility to focus on
meeting your needs more effectively and efficiently. Being a private company should
also allow us to make long-term investments in our product and service offerings and
ultimately become a more valuable partner to you."

The presentation mentioned these investments, which might depress the share price if the company remained public. Not explicitly mentioned were the costs and other requirements of Sarbanes Oxley compliance. Interesting how Serena, as a vendor of tools to help other organisations achieve compliance, is itself avoiding the issue by going private!

Does your tool "Own the World"?   27 Nov 05

There was a column a few months back on CMCrossroads on tool selection, and these thoughts escaped our Agile SCM column for various reasons. Meanwhile, Brad has commented on his blog, so I thought I would weigh in too!

Agile SCM requires minimal disruption of flow to developers’ lives, and thus tools that help, not hinder, this process. The right processes are obviously key to effective and efficient development and it would appear that if you have a good process, then the more you can enforce it the safer life will be. However, as we wrote in The Illusion of Control, too much safety leads to much reduced productivity.

As Joel Spolsky writes regarding defect trackers and enforcing process:

Historically, I am opposed to custom fields in principle, because they get abused. People add so many fields to their bug databases to capture everything they think might be important that entering a bug is like applying to Harvard. End result: people don't enter bugs, which is much, much worse than not capturing all that information.

Best of Breed vs Integrated Suite?

This is a classic conundrum and there will always be good arguments for both sides. Indeed it is not possible to come down on one side or the other without knowing the details of a particular organization and all the nitty-gritty requirements.

However, let’s consider how some ideas might help us in our decision making process.

Who Owns the World?

Many tools make the mistake of thinking that they own the world – well at least the environment that they are going to run in. The user (developer) is going to be safely cocooned inside the wonderfully productive environment of the tool and never going to have to leave for the big bad world outside. Thus the vendors make little attempt to provide any interface to the outside world.

Now obviously considering that some things are “beyond the pale” has some advantages. Unfortunately the disadvantages are often considerable.

Over the years many vendors have come up with wonderful development paradigms, development environments, 4th generation languages and similar which were indeed very productive. Often these environments included (very) rudimentary version control. The problem is that they implemented it badly (and often unreliably) and provided no hooks to external SCM tools which were used elsewhere in the organization.

This is a bit like the XP principle of YAGNI and refactoring as opposed to big upfront design. As Martin Fowler writes in Is Design Dead?:

People aren't good at anticipating, so it's best to strive for simplicity. However, people won't get the simplest thing first time, so you need to refactor in order get closer to the goal.

Thus an attempt to anticipate everything developers are likely to need would seem to be rather difficult.

The rise of Eclipse, which is making the basic environment a commodity with a good design for plugins and extensibility, is perhaps a result of this worldview. Companies such as Borland are now repositioning previously standalone components such as JBuilder as add-ons to Eclipse rather than as competition.

This is somewhat similar to the arguments for performing builds using a general purpose language. The grand daddy of them all, Make, has its own arcane syntax and issues which have evolved and been tweaked and extended to try and solve the ever expanding set of problems posed by building systems.

Describing his experiences with Rake (a Ruby based build tool), Martin Fowler suggests:

The fact that rake is an internal DSL [domain specific language] for a general purpose language is a very important difference between it and [make and ant]. It essentially allows me to use the full power of ruby any time I need it, at the cost of having to do a few odd looking things to ensure the rake scripts are valid ruby. […] Furthermore since ruby is a full blown language, I don't need to drop out of the DSL to do interesting things - which has been a regular frustration using make and ant. Indeed I've come to view that a build language is really ideally suited to an internal DSL because you do need that full language power just often enough to make it worthwhile - and you don't get many non-programmers writing build scripts.

My personal current favourite for such tools is SCons (written in Python) which has a similar approach that I have found to be very powerful.

Advantages of Integrated Suites

Of course integrated suites can be a big win if your problem domain is sufficiently close to what the designers had in mind. In addition, if the suite and its existing processes can be tailored easily to your requirements, then of course you should rate it highly in your selection criteria.

But you need to be careful. There’s an awful lot of “shelfware” out there consisting of tools sold as a “silver bullet” and never really used. The psychology seems to be that if you pay enough money then the problem will be taken away from you.

Wolf Suites In Sheep’s Clothing

Then we have the category of suites that purport to be integrated but under the covers all is not what it first seemed. The classic example is of different tools which a company acquires by buying the original vendor, and with a lick of paint and a wave of the magic wand over the marketing materials suddenly has an “integrated suite”. The challenges of integrating tools not designed to do so are considerable and it can take many years for good integration to happen (if it ever does).

Customizability

Customizability is also a two-edged sword, in that you can spend all your time customizing the tool and not enough actually doing the work.

As has been noted by various people on CMCrossroads, workflows implemented by scripting can have a cycle of “script a little, test a little, repeat”. This would suggest that tools which offer the ability to design workflows graphically are immune from these problems. However, that is often not the case and, as has been pointed out before, some of these workflow designers are in fact very difficult both to change control and to debug. A complicated workflow is an inherently hard problem to both comprehend and manage (which does point towards keeping it as simple as possible).

The key area of customizability is that of being able to link the tool (or suite) to the external world in the form of other tools. Thus providing sensible hooks to allow third party tools to link in would seem ideal. And yet this can be rather difficult to do in practice, and thus gets pushed to the back of the queue by the vendor. In addition, I suspect there is often a business rationale that they think if they don't make it easy to link to external tools then the user will be forced to stick with the vendor's tool.

It's not quite on topic, but I couldn't resist commenting on a classic example of third party hooks that don't work very well - Microsoft's SCC integration to Visual Studio .Net. This is based on a lowest common denominator API which used to be released under NDA but is now relatively freely available. It worked reasonably well with Visual Studio 6, but was totally re-implemented for Visual Studio .Net and rather badly it seems.

Conclusion

Thus I would personally tend to come out on the side of linking best of breed tools together rather than going for an integrated suite, since my experience is that integrated suites don't try hard enough to provide clean interfaces to the outside world. Of course, some best of breed tools make integrations with other tools a bit like teaching a pig to sing - "frustrates you and annoys the hell out of the pig"!

So the real answer is that every potential customer for SCM tools needs to draw up their own requirements and evaluate against them. Delaying commitment by avoiding lock-in is a very valuable feature, but it doesn't always pay off as it should perhaps in theory!

Do not turn your brain off when choosing a tool! An ounce of requirements analysis and evaluation is worth a ton of tweaking down the track.

 

Copyright © 2008 Robert Cowham