Overview
--------

Pct is a program counter/instruction pointer sampling-based profiling system.
It allows low-overhead, instruction-level accounting of where programs spend
their time.  No recompiling or code instrumentation is required.  Some
information can be gleaned even from stripped binaries, though reports become
more detailed as more debugging information is available in object files.
A variety of output formats are available.  On some systems LD_PRELOAD can even
elide the re-linking step for dynamically linked executables.

Statistical Profiling Background
--------------------------------

Statistical profiling is amongst the oldest of program profiling techniques.
The basic idea is to let hardware timer interrupts temporarily suspend
program execution, and then to observe where the OS must resume execution.
At this observation point, a counter corresponding to the instruction pointer
(aka program counter) is incremented.  In any time-sharing OS, these timer
interrupts already happen as a matter of course.  Hence the only overhead is in
the memory access to increment the observation counter.  If the program spends
a lot of its time at particular spots then statistically one will observe high
counts in that portion of this "execution interruption histogram".

The basic tradeoff of sampling-based profiling vs. code instrumentation is that
the former allows instruction-level timing, but does not allow reconstruction
of program control flow transfer graphs.  If you really need the latter you can
compile all your objects with -pg and use gprof.  At least for me, profiling
data of such high semantic content is only necessary very rarely, while often
I need to track down basic block or even instruction-level timing.  There are
also occasionally strange interactions with compiler optimizations and code
instrumentation, causing errant behavior or even compiler crashes.

There is also an issue of "Heisenberg observation interference".  Code which
has been instrumented to collect data on its execution is really not the same
as the original code.  Data inaccuracies can result.  Since timer interrupts
happen anyway, the only extra overhead is the main memory access and cache
contention to the buffer of counters.  This error is usually truly tiny,
since timer interrupts are rarely configured to fire more often than once per
100,000 main memory access times.  In any case, this data accuracy issue is
not usually very important in practice.  The performance hit can be much worse
if the counter file is stored on a network file system, since network latencies
can easily be comparable to timer-interrupt periods.  Care should be taken as
to the placement of profiling files.

There is a final issue of statistical accuracy.  The basic assumption of
statistical profiling is that the interruption probability is proportional to
how long it takes to execute the part of the program at a given location.  This
assumes at least that timer interrupts are not synchronized with paths through
a program.  The basic task of statistical profiling amounts to estimating the
multinomial probability density of interruptions at all interruptible program
locations.  One cannot make this estimate without error.  How big is the error?

In the typical situation with many possible locations, a per-location binomial
approximation to the multinomial becomes valid.  When the probability of being
at any one location is small, per-location Poisson approximations become valid.
This is the common case in profiling any reasonably complex program.  In this
case, one can use the fact that the variance of a Poisson distribution equals
its mean, and that an efficient estimator of the mean is simply the observed
rate.  All this basically amounts to the fact that if n counts out of a total
of N samples are seen at location X, then the expected fluctuation in n is
about sqrt(n) (i.e., variance n).

The upshot of this is that caution should be exercised in concluding that a
count of 4 at some location and a count of 1 at another really amounts to much.
It is true that the estimate of the difference in time the program spends at
the two spots will be 3.  But the variance of the sum or difference of two
independent Poisson random variables equals the sum of the summand variances.
So for this example, while the estimate of the difference is 3, the root
variance on that estimate is sqrt(5)=2.2.  Treating the fluctuation as roughly
Poisson with mean 2.2, the probability of seeing 3 or more is about 38%
{ This tail probability is 1-P(diff<3) = 1-P(diff=0)-P(diff=1)-P(diff=2)
= 1-(1+2.2+2.2*2.2/2)*exp(-2.2) = 0.38 }
Hence, simply by chance one will see differences of this size or greater 38%
of the time.  It is thus not statistically significant.  Notice, however, that
if one count were 10 and the other 40, the difference is 30 with a variance of
50, or root variance 7.1.  The analogous tail probability is far below 0.1%.
Hence the difference is significant with very high confidence.

All this detailed probability is just to ward one away from making grandiose
claims about relative timings.  For the most part any parts of the program that
are worth optimizing will have a significant number of counts.

In sum, recompiling or maintaining separate profiling libraries and programs is
seldom warranted.  A much easier to adhere to policy is to *always* compile
with debugging, or at least symbols enabled on any development system.  In
addition to enabling good interpretation of any profile data PCT might collect,
this also has the virtue of allowing core dumps to be more meaningful.
Proprietary system software, of course, often limits how much can occur here,
but one does what one can.  Disk space is really very cheap these days.

Platform Dependencies and System Architecture
---------------------------------------------

The PCT architecture is very portable and very modular.  It should work on
almost any variant of Unix.  Minimal system requirements are something like:
  any way to sample PC's (system call profil(2) or virtual I-timer signals),
  any way to get pct_on() called before much happens, and
  any way to interpret program counters (nm, disassembly, the GNU addr2line).
Notice that there are usually multiple methods for each requirement, and it isn't
usually hard to get at least one of them to work.  This is another part of the
appeal of statistical profiling.  The system tools needed are usually present.

The profil(2) system call seems to date back nearly to AT&T version 7 Unix.
It is trivial to implement and useful.  I have not yet found an OS without it.
Virtual I-timers allow arbitrary signal handlers to be run at timer interrupt
time.  The point of program resumption is almost always available to signal
handlers.  Usually they can grab the PC off of a prior stack frame.  In
latter-day POSIX, the SA_SIGINFO handler interface and its ucontext argument
even export this information.
One benefit of I-timers is that they allow 1-byte rather than 2-byte sampling
granularity.  There is an interface bug in profil(2) which prevents this.  This
only ever matters on instruction sets where an instruction can be an odd number
of bytes (e.g. x86).  Even there it usually creates only a very small error.
In general the great portability of profil(2) makes it attractive, at least for
tracking time spent in the main program text.

Ideally pct_on() can be called automagically.  Any system on which g++ works
even a little bit should allow the gcc extension __attribute__((constructor))
to work, at least in a static linking context.  This relies on the collect2
program to gather together global constructor code.  Many ELF and prior
executable formats
have ".ini" or ".init" object initializer sections, which may work for static
or shared objects or both.  __asm__ or asm directives can probably place
pct_on() into the appropriate section.

On Unix systems with fully functional shared library systems, profiling can also
be activated with an LD_PRELOAD environment variable.  On other Unix systems one
can toggle whether profiling is used via the $PCT environment variable.  If it
is not set, profiling will not occur.

Finally, in a pinch, pct_on() can always be called manually from main().  This
might also be necessary for reliable argv[0] determination.  Profiling isn't
activated if $PCT is not set.  Hence, calling pct_on() does not imply that each
program invocation will result in profiling data.  So it is safe to just leave
in your source.  pct_on() is also careful to not re-run itself, so calling it
twice is fine, in case you link with a shared library whose initializer calls it
before your function in main().

Translating program counters to relevant source text coordinates is a low-level
activity.  Most systems provide at least some tools for this.  Pct provides a
variety of report formats, available as composable Unix filters.  This allows a
lot of flexibility in how data are generated and combined.  It also lets PCT
easily yield incrementally useful reports according to however many of the
system's tools actually work or are available.  The more tools that work, the
more detailed PCT's reports can be.  This choice also lets the filters be very
simple programs,
since they are decoupled from most of the system-specific complexity.

Portability Obstacles
---------------------

One choice that Pct makes that might break is using, by default, argv[0] as a
path prefix for the profil count data.  This choice is motivated by the desire
to make it easy to profile multiple programs whose invocations are all
coordinated by shell scripts.  Since in many interesting cases pct_on is
called from
outside of main(), an inferential technique is used to deduce argv[0] from the
global variable 'environ'. The implementation of this _argv() function is
described in the NOTES file.  Any pitfalls associated with bogus inferred
argv[0]s can be avoided by setting $PCT appropriately to direct output to some
named file or a default "pc.pct" file in the working directory at program
startup time.

Chuck Blake	5/14/2000	cb@mit.edu
