o Must search $PATH for argv[0] in itimer.c !!!!

o When re-using a buffer check that the pathname is the same (and other sorts
  of compatibility constraints).

x Test whether it works on:
    Linux   -- .o/.a works fine.  .so works fine.  LD_PRELOAD works
    FreeBSD -- .o    works fine.  .so works fine.  LD_PRELOAD works
    OpenBSD -- .o/.a works fine.  All .so fail, ld.so doesn't call constructors
    NetBSD  -- ?
    Solaris2-- .o/.a works fine.  .so works fine.  LD_PRELOAD works
    SunOS 4 -- .o/.a works fine.  .so works fine.  LD_PRELOAD does not work
    HP-UX  9,10,11?
    AIX    4-- ?
    Irix   5/6 ?

o For LD_PRELOAD=libprofil.so to work on Linux, the executable must be linked
  with "-export-dynamic".  Can this be made a gcc default?  Hmm.  Not too
  important since LD_PRELOAD=libitimer.so is arguably what everyone will want
  to use anyway.

o Portability-wise, it is often possible to get shared library profiling with
  the itimer mechanism.  'ldd' may report enough information.  For example, the
  recent glibc/linux ldd reports virtual addresses where libraries get loaded.

  ldd is usually just an invocation of the actual program with ld.so eliding the
  main() invocation.  Consequently, it should be a pretty easy hack to have some
  environment variable or other inherited process attribute that tells 'ldd' to
  be verbose in the way that libitimer.so requires.  This is certainly a very
  expensive alternative.

  Hmm.  If one is already hacking ld.so toward this end, it would be better to
  just hack it to build and export a region[] vector that could be used
  directly.  This is all unnecessary if the next idea is workable, which allows
  even better tracking of executable regions.

  All of this is future work.  For now we just munge 'ldd' output and cache the
  answer in a /tmp/pct/maps file.  This can cause some misleading numbers for
  the first run of a program.

o Is it possible to wrap mprotect() with a customized version that updates
  region[] when executability status changes?  Would this allow dlopen()d shared
  objects to be profiled?  (Cool for Python, Perl, etc.) Will this even work or
  does dlopen() use its own inlined syscall and not the libc version?

  Assuming we can modify the libc syscall stubs used by ld.so, we need to wrap
  more than just mprotect().  Since we need to know the object pathname, we
  need to track open, close, mmap, and mprotect.  Only then can we correctly
  associate an observed mprotect() that adds PROT_WRITE with the file-bound
  region it applies to.

* On Linux /proc/PID/maps works to give us what we need.  Profiling into
  dlopen()d modules currently seems to work fine.

  In general there should be a "misses" field in the headers for a PCT file.
  This would be incremented whenever a PC could not be found in region[].
  Given elapsed real-time and whole system/kernel profiling, the misses field
  can also tell us things like how much time is spent in kernel modules or
  post-entry dlopen()d shared objects (the only current "holes" in the system).
  Misses would be printed out by pct-stat and mostly be used as a diagnostic to
  determine if there is a significant amount of 'missed' object code running.
  profil(2) could easily have done this, too, with an interface extension, but
  of course in the 70s shared objects were a Multics thing.  Oh well.  For now
  it seems only SIGVTALRM has any hope of being a kickass profiling hook.

- (Linux only): we can use miss occurrence as a way to instigate rediscovery
  of executable regions (JJ's idea).

o Rather than worrying about analyze/pct-sub we can just have a flag to our
  pct-kern-snap program that uses an existing pct file and subtracts off the
  old values from the new values.  Of course, the fact that we can do decent
  collection of ongoing families of programs over multiple phases of their
  execution does make the general add/subtract desirable...

o Use microsecond timestamp instead of PID to ensure uniqueness.  This assumes
  only A) gettimeofday() actually returns microsecond accurate data and
       B) execve() is very unlikely to complete in less than 1 us.
  B) is good for the next 5..10 years.  A) is more sketchy.  Maybe cycle
  counters can be used on systems which do not have good time queries.  Hmm.
  A global shared counter is needed for serial numbers, and all the
  concomitant hair of concurrent global state management.

o It is perhaps desirable to have the uniqueness prefix be optional, but on a
  per-object basis.  Per-object tuning is currently hard to specify with one
  shared environment variable.  Hmm.

o Issue: gen-maps will stuff *some* absolute pathname into the PCT maps file.
  The program may be re-invoked from a different directory, thus picking up
  different shared objects with the same name, possibly even altering sizes
  and locations.  This is an unavoidable but probably innocuous problem since
  the very common case is to always pick up the same shared library.

o Issue: fork()d processes/threads that truly run in parallel (i.e. multiple
  CPUs) can clobber each other's counter updates.  Basically the result is a
  missed counter increment.  This is extremely rare and probably not an issue
  in practice.

x gen-maps|proc/maps dependence can be (mostly) eliminated and disk space can
  be saved with an alternate type of collection file.  The format would be a
  simple array of the sampled PCs *not* count aggregated, but simply in sample
  order.  A post processor could build a histogram.  Appending to the PC log
  file is as simple as a write().  mmap, etc are unnecessary.  Efficiently
  stored sparse files are unnecessary.

  The only real drawback is that the log file is unbounded in size over time.
  The fixed size (text-segment based) histogram uses bounded real disk space,
  even for very rapid sampling rates.  The PC log file uses 4 or 8 bytes per
  sample.  At the typical 100..1024 samples per second, this is only 400..8192
  bytes per second.  That's very slow growth, and an upper bound on the rate
  for the whole multitasking system.  This is clearly not appropriate for a
  very long
  running program like a server.  For simulations and other intensive tasks,
  thousands of seconds is a fairly long time on modern machines.  With dozens
  or hundreds of GB disks available, a few MB log file is fairly innocuous.
  If bounding space is truly an issue then a circular file could be used with
  some large size limit of say 8..32 MB.  Sampling would then be limited to
  the last N seconds, which is not such a bad tradeoff for bounding space.

  A final observation is that PC logging obviates the need for binary search
  and concomitant main memory accesses in the signal handler.  A word write()
  to a local file is a few hundred cycles overhead.  This is almost surely
  larger than the memory loads/stores of the binary search over known regions,
  but not by a large factor.  The delegation to post-mortem time of correctly
  associating the PC and object files somewhat compensates for the increased
  overhead of making a system call inside the signal handler.  With operating
  system support, e.g. an alternate to profil(2), this overhead could surely
  be eliminated.  In this scenario the locality of a PC log file would surely
  make it faster than the equivalent code histogram.

  A negative aspect of deferring any and all PC interpretation to post-mortem
  time is the loss of the ability to track code in post-startup dynamically
  loaded objects, unless their load locations can be saved somehow a la the
  mprotect trapping idea.

  At the very least this is a useful alternate mode of operation.  It requires
  expanding the behavior of pct data files.  There should be at least three
  formats -- linear PC stream, circular, and the PC histogram.  This is nice
  even in the context of offline program analysis.

o Really should timestamp log records in random-spacing mode so time windows
  can be specified in microseconds, not just record numbers.
