Advanced heapprof

As you use heapprof more, you’ll want to understand the corner cases of memory allocation better. Here are some general advanced tips worth knowing:

  • The performance overhead of heapprof hasn’t yet been measured. From a rough eyeball estimate, it seems to be significant during the initial import of modules (because imports generate so many distinct stack traces) but fairly low (similar to cProfile) during ordinary code execution. Proper measurement, and whatever tuning it turns out to call for, is still to be done.

  • The .hpx file format is optimized to minimize overhead at runtime. The profiler continuously writes to its two open file descriptors and relies on the kernel’s file-system buffering to absorb that I/O; to use that buffering most effectively, it’s important to minimize the number of bytes written, so that flushes are rare. This is why the wire encoding (cf. file_format.*) favors things like varints, which cost a bit more CPU but reduce bytes on the wire (see the sketch just after this list). This also helps keep the size of the generated files under control.

  • The profiler very deliberately uses native C++ types, not Python data types, for its internal operations. This has two advantages: pure C++ types are faster and more compact (thanks to their simpler memory management model), and they eliminate the risk of strange recursions that could arise if the heap profiler were to call back into the Python allocators. Note, however, that this means the heap profiler’s own memory allocation does not show up in its output!

  • More generally, the heap profiler only profiles calls to the Python memory allocators; C/C++ modules which allocate memory separately from that are not counted. This can lead to discrepancies between the output of heapprof and total system usage.

  • Furthermore, real malloc() implementations generally allocate more bytes than requested under the hood (for example, to guarantee memory alignment of the result; see this function in tcmalloc). Unfortunately, there is no implementation-independent way to find out how many bytes were actually allocated, either from the underlying C/C++ allocators or from the higher-level Python allocators. This means that the heap measured by heapprof is the “logical” heap size, which is strictly less than the heap size the process actually requests from the kernel. It is that latter size, however, that is monitored by external systems such as the out-of-memory (OOM) process killers in sandboxed environments.
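
As an illustration of the varint trade-off mentioned above, here is a minimal, self-contained sketch of base-128 varint encoding in Python. It is not heapprof’s actual writer (the real encoding lives in the C++ file_format.* code); it only shows why small integers cost a few bytes on the wire instead of a fixed-width word.

    def encode_varint(value: int) -> bytes:
        """Encode a non-negative integer as a base-128 varint: 7 payload bits
        per byte, with the high bit set on every byte except the last."""
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)  # more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    # A size like 150000 takes 3 bytes on the wire instead of a fixed 4 or 8.
    assert encode_varint(150000) == b'\xf0\x93\x09'
    assert encode_varint(100) == b'\x64'  # small values need only one byte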

Controlling Sampling

The sampling rate controls the probability with which heap events are written. Too high a sampling rate, and the overhead of writing the data will slow your app to a crawl, or the sheer volume of data will overload your disk; too low a sampling rate, and you won’t get a clear picture of events.

heapprof defines sampling rates as a Dict[int, float], which maps the exclusive upper bound of each byte-size range to a sampling probability. For example, the default sampling rate is {128: 1e-4, 8192: 0.1}. This means that allocations of 1-127 bytes are sampled at 1 in 10,000; allocations of 128-8191 bytes are sampled at 1 in 10; and allocations of 8192 bytes or more are always written, without sampling. These values have proven useful for some programs, but they probably aren’t right for everything.
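
To make the lookup rule concrete, here is a short sketch of the semantics just described; it is illustrative only, not heapprof’s internal (C++) implementation.

    import random
    from typing import Dict

    def sample_probability(size: int, rates: Dict[int, float]) -> float:
        """Return the probability with which an allocation of `size` bytes is
        recorded, given a sampling-rate dict such as {128: 1e-4, 8192: 0.1}."""
        for threshold in sorted(rates):
            if size < threshold:
                return rates[threshold]
        return 1.0  # at or above the largest threshold: always record

    def should_record(size: int, rates: Dict[int, float]) -> bool:
        return random.random() < sample_probability(size, rates)

    rates = {128: 1e-4, 8192: 0.1}
    assert sample_probability(64, rates) == 1e-4   # 1-127 bytes: 1 in 10,000
    assert sample_probability(4096, rates) == 0.1  # 128-8191 bytes: 1 in 10
    assert sample_probability(8192, rates) == 1.0  # 8192+ bytes: always written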

As heapprof is in its early days, its tools for picking sampling rates are somewhat manual. The best way to do this is to run heapprof in “stats gathering” mode: you can do this either with

python -m heapprof --mode stats -- mycommand.py args ...

or programmatically by calling heapprof.gatherStats() instead of heapprof.start(filename). In this mode, rather than generating .hpx files, heapprof builds up a distribution of allocation sizes and prints it to stderr when profiling stops. The result will look something like this:

-------------------------------------------
HEAP USAGE SUMMARY
               Size    Count      Bytes
              1 - 1   138553       6646
              2 - 2    30462      60924
              3 - 4     3766      14076
              5 - 8   794441    5293169
             9 - 16  1553664   23614125
            17 - 32 17465509  509454895
            33 - 64 27282873 1445865086
           65 - 128  9489792  801787796
          129 - 256  3506871  567321439
          257 - 512   436393  143560935
         513 - 1024   347668  257207137
        1025 - 2048   410159  466041685
        2049 - 4096   135294  348213256
        4097 - 8192   194711 1026937305
       8193 - 16384    27027  278236057
      16385 - 32768     8910  183592671
      32769 - 65536     4409  200267665
     65537 - 131072     2699  228614855
    131073 - 262144     1478  277347497
    262145 - 524288     1093  306727390
   524289 - 1048576      104   75269351
  1048577 - 2097152       58   83804159
  2097153 - 4194304       37  106320012
  4194305 - 8388608        8   44335352
 8388609 - 16777216        6   69695438
16777217 - 33554432        3   55391152

This tells us that there was a huge number of allocations of 256 bytes or less, which means we can use a small sampling rate, perhaps 1e-5, and still get good data. There seems to be a spike in memory usage in the 4096-8192 byte range, and the 256-8192 byte range as a whole has a few hundred thousand allocations, so we could sample it at a rate of 0.1 or 0.01. Beyond that, the counts drop off sharply, so sampling would be a bad idea. This suggests a sampling rate of {256: 1e-5, 8192: 0.1} for this program. You can set this by running

python -m heapprof -o <filename> --sample '{256:1e-5, 8192:0.1}' -- mycommand.py args...

or

heapprof.start('filename', {256: 1e-5, 8192: 0.1})
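
Putting the two passes together programmatically might look like the following sketch. my_workload is a placeholder for the code you actually want to profile, and heapprof.stop() is assumed to be the call that ends a run (as implied by “when profiling stops” above).

    import heapprof

    def my_workload():
        # Placeholder: the code you actually want to profile.
        chunks = [bytearray(n) for n in range(1, 20000, 7)]
        return sum(len(c) for c in chunks)

    # Pass 1: gather an allocation-size histogram to help pick sampling rates.
    heapprof.gatherStats()
    my_workload()
    heapprof.stop()  # the size distribution is printed to stderr here

    # Pass 2: profile for real, using rates chosen from that distribution.
    heapprof.start('myprogram', {256: 1e-5, 8192: 0.1})
    my_workload()
    heapprof.stop()  # writes the .hpx data for later analysis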

Some tips for choosing a good sampling rate:

  • The most expensive part of logging isn’t writing the events; it’s writing the stack traces. Very small allocations happen at a huge variety of stack traces (nearly every time you create a Python object!), while larger allocations are far less common. This means it’s usually very important to keep the sampling rate low for very small byte sizes (say, no more than 1e-4 for allocations below 64 bytes, and preferably for everything up to 128 bytes), but much less important to keep it low for larger byte sizes.

  • The only reason to keep the sampling rate low is performance; if at any point you can get away with a higher sampling rate, err on that side. The sketch below shows one heuristic for turning the stats output into a starting set of rates.
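
If you want a mechanical starting point, the sketch below derives a sampling-rate dict from condensed bucket counts by aiming for a fixed number of recorded events per bucket. It is a hypothetical helper, not part of heapprof, and its output is only a first guess to adjust by hand using the tips above.

    from typing import Dict

    def suggest_rates(bucket_counts: Dict[int, int],
                      target_samples: int = 10000) -> Dict[int, float]:
        """Given {exclusive_upper_bound: allocation_count} buckets, suggest a
        sampling probability per bucket so that roughly `target_samples`
        events in each bucket get recorded.  Probabilities are capped at 1.0;
        sizes at or above the largest key are always recorded by heapprof."""
        return {bound: min(1.0, target_samples / count)
                for bound, count in bucket_counts.items()}

    # Approximate counts condensed from the summary table above.  Everything
    # at or above 8192 bytes is rare enough that we leave it out of the dict,
    # so heapprof records those allocations unconditionally.
    counts = {256: 60_265_931, 8192: 1_524_225}
    print(suggest_rates(counts))  # roughly {256: 1.7e-4, 8192: 6.6e-3}

The numbers differ from the hand-picked {256: 1e-5, 8192: 0.1} above because this heuristic weights every bucket equally; treat its output as a baseline and round up or down depending on how much write volume you can afford.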