Erlang: Don't Let Your Production Crash

loss of sharing

1> F = fun() ->
1>  L  = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
1>  L2 = [L, L, L, L, L, L, L, L, L, L],
1>  L3 = [L2,L2,L2,L2,L2,L2,L2,L2,L2,L2],
1>  L4 = [L3,L3,L3,L3,L3,L3,L3,L3,L3,L3],
1>  L5 = [L4,L4,L4,L4,L4,L4,L4,L4,L4,L4],
1>  L6 = [L5,L5,L5,L5,L5,L5,L5,L5,L5,L5],
1>  L7 = [L6,L6,L6,L6,L6,L6,L6,L6,L6,L6],
1>  L8 = [L7,L7,L7,L7,L7,L7,L7,L7,L7,L7],
1>  L9 = [L8,L8,L8,L8,L8,L8,L8,L8,L8,L8],
1>  pid(0,0,0) ! L9,
1>  ok
1> end.
#Fun<erl_eval.20.21881191>
2> F().
HUGE size (2222222220)
Aborted
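
The number in the abort message matches the flattened size of L9 in words: sending a message copies the term, and the copy does not preserve sharing. A minimal sketch that measures the same effect on a smaller term, using the debug helpers erts_debug:size/1 and erts_debug:flat_size/1 (not a stable API; the module name sharing_demo is made up for illustration):

```erlang
-module(sharing_demo).
-export([nested/1, sizes/1]).

%% Build the structure from the slide to a given depth:
%% depth 1 is [0..9], each further level is a list of ten
%% references to the previous level.
nested(1) ->
    lists:seq(0, 9);
nested(Depth) when is_integer(Depth), Depth > 1 ->
    lists:duplicate(10, nested(Depth - 1)).

%% Return {SharedWords, FlatWords}: heap words with sharing kept
%% vs. words needed once the term has been copied (for example by
%% a message send).
sizes(Term) ->
    {erts_debug:size(Term), erts_debug:flat_size(Term)}.
```

sharing_demo:sizes(sharing_demo:nested(3)) returns {60, 2220}; every extra level adds 20 shared words but multiplies the flat size by roughly ten.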

reached_max_restart_intensity

Erlang shuts down with 'reached_max_restart_intensity' in logs.

don't copy-paste MaxR and MaxT from templates
supervisor2 from rabbitmq
don't let it crash so often (use circuit breakers)
erl -heart (but see below about a typo in the config file)
don't crash or do long tasks in gen_server init callback
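
One way to keep long tasks out of init, sketched with handle_continue (needs OTP 21 or later). The module name slow_start_worker and the slow_setup/0 placeholder are hypothetical; a real slow step should also handle its own errors (retry with a delay) rather than crash in a loop:

```erlang
-module(slow_start_worker).
-behaviour(gen_server).
-export([start_link/0, ready/1]).
-export([init/1, handle_continue/2, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Ask whether the deferred setup has finished.
ready(Pid) ->
    gen_server:call(Pid, ready, 5000).

%% init/1 returns at once, so the supervisor and its other
%% children are not blocked while the slow step runs.
init([]) ->
    {ok, #{ready => false}, {continue, load}}.

%% Runs right after init/1, before any other message is handled.
handle_continue(load, State) ->
    ok = slow_setup(),
    {noreply, State#{ready := true}}.

handle_call(ready, _From, #{ready := Ready} = State) ->
    {reply, Ready, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Placeholder for a slow step such as opening a connection
%% (hypothetical, here just a short sleep).
slow_setup() ->
    timer:sleep(100),
    ok.
```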

timeout infinity

application doesn't respond, 0% cpu usage

get stacktraces:

file:write_file("/tmp/procs.txt",erlang:system_info(procs)).

find them all: grep -n infinity src/*.erl

trollface: and replace with the default 5 seconds
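
Joke aside, a finite timeout only helps if the caller handles it. A minimal sketch of a wrapper (the module name safe_call is made up) that refuses infinity and turns the timeout exit into an error tuple:

```erlang
-module(safe_call).
-export([call/3]).

%% gen_server:call/3 with an explicit, finite timeout. The exit
%% raised in the caller on timeout (or when the server does not
%% exist) becomes an error tuple, so the caller can decide what
%% to do instead of waiting forever.
call(Server, Request, TimeoutMs)
  when is_integer(TimeoutMs), TimeoutMs >= 0 ->
    try
        {ok, gen_server:call(Server, Request, TimeoutMs)}
    catch
        exit:{timeout, _} -> {error, timeout};
        exit:{noproc, _} -> {error, noproc}
    end.
```

Depending on the OTP release, a late reply can still land in the caller's mailbox after a timeout, so a long-lived caller should be ready to drop unexpected messages.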

gen_server calls itself

application doesn't respond, 0% cpu usage, (optional) timeout failures for internal calls

read logs, get stacktraces:

file:write_file("/tmp/procs.txt",erlang:system_info(procs)).

or get process inbox and current call via

process_info(Pid)

fix deadlock
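
A small helper for the process_info step above, as a sketch (the module name stuck is made up). It asks only for cheap items; the messages item copies the whole inbox and is better requested separately, and only for small queues:

```erlang
-module(stuck).
-export([inspect/1]).

%% Snapshot of what a possibly stuck process is doing. Returns
%% undefined if the process is dead. A gen_server blocked in a
%% call to itself, or in a call cycle with another server,
%% typically shows gen:do_call in the stacktrace while its
%% message queue grows.
inspect(Pid) when is_pid(Pid) ->
    process_info(Pid, [registered_name, status, current_function,
                       current_stacktrace, message_queue_len]).
```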

cast an OOM

eheap_alloc: Cannot allocate 100500XXX bytes of memory (of type "old_heap")

you can make it even worse with the undocumented erl +snsp flag

or unoptimized selective receive

monitor your queues (vmstats, etop)
gen_server:call instead of cast
flow control
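
If vmstats or etop are not at hand, queue lengths can be read directly from the shell. A minimal sketch (the module name queues is made up) that lists the N processes with the longest inboxes:

```erlang
-module(queues).
-export([top/1]).

%% Return up to N {QueueLen, Pid} pairs, longest queue first.
%% Processes that die while we iterate return undefined from
%% process_info/2 and are skipped by the pattern in the generator.
top(N) when is_integer(N), N > 0 ->
    Lens = [{Len, Pid}
            || Pid <- processes(),
               {message_queue_len, Len}
                   <- [process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Lens)), N).
```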

error_logger

same as above
switch to github.com/basho/lager
learn to use tracing via dbg or redbug from github.com/massemanet/eper instead of debug logging
slightly related: don't print to stdout
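
A sketch of time-boxed tracing with plain dbg (the module name short_trace is made up). Unlike redbug, dbg has no built-in message limit, so this is only reasonable for functions that are not called at a high rate:

```erlang
-module(short_trace).
-export([calls/3]).

%% Trace calls to Mod:Fun (any arity) in all processes for
%% Millis milliseconds, printing arguments, return values and
%% exceptions, then switch tracing off again.
calls(Mod, Fun, Millis) ->
    {ok, _Tracer} = dbg:tracer(),
    {ok, _} = dbg:p(all, call),
    %% x is dbg's built-in match spec alias for exception tracing.
    {ok, _} = dbg:tpl(Mod, Fun, x),
    timer:sleep(Millis),
    %% On some older OTP releases dbg:stop_clear/0 is needed to
    %% also remove the trace patterns.
    dbg:stop().
```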

send a lot of binaries to an idle process

OOM again

simple one-liner:

[{M, P, process_info(P, [registered_name, initial_call, current_function, dictionary]), B} || {P, M, B} <- lists:sublist(lists:reverse(lists:keysort(2, [case process_info(P, binary) of {_, Bins} -> SortedBins = lists:usort(Bins), {_, Sizes, _} = lists:unzip3(SortedBins), {P, lists:sum(Sizes), SortedBins}; _ -> {P, 0, []} end || P <- processes()])), 5)].

read blog.heroku.com/archives/2013/11/7/logplex-down-the-rabbit-hole
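
An idle process does not garbage collect, so the large binaries it has seen stay referenced. A rough way to confirm that this is the problem is to force a collection everywhere and see how much binary memory comes back. A sketch only (the module name binaries_gc is made up); collecting every process is itself expensive on a big node:

```erlang
-module(binaries_gc).
-export([sweep/0]).

%% Force a garbage collection in every process and return
%% {BytesBefore, BytesAfter, BytesFreed} for binary memory.
sweep() ->
    Before = erlang:memory(binary),
    _ = [erlang:garbage_collect(P) || P <- processes()],
    After = erlang:memory(binary),
    {Before, After, Before - After}.
```

If the freed amount is large, the usual longer-term options are hibernating the idle process or doing the binary-heavy work in a short-lived process.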

small subbinaries of large binaries

OOM

same one-liner

apply binary:copy/1 carefully
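
"Carefully" here means copying only when the small piece keeps a much larger binary alive. A minimal sketch using binary:referenced_byte_size/1 (the module name subbin and the factor of two are arbitrary choices for illustration):

```erlang
-module(subbin).
-export([compact/1]).

%% Copy Bin only if it references a binary more than twice its
%% own size; otherwise return it unchanged to avoid a useless
%% copy.
compact(Bin) when is_binary(Bin) ->
    case binary:referenced_byte_size(Bin) > 2 * byte_size(Bin) of
        true -> binary:copy(Bin);
        false -> Bin
    end.
```

For a 10 byte slice of a 1 MB binary, binary:referenced_byte_size/1 reports 1048576 before compact/1 and 10 after.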

binaries in ETS

OOM

invisible to process_info(_, binary)

compare erlang:memory(binary) with sum of all binaries held by processes
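
The comparison can be sketched like this (the module name bin_gap is made up). The numbers are approximate: processes keep running while they are scanned, and binaries shared between processes are deduplicated by their {Id, Size, RefCount} entry:

```erlang
-module(bin_gap).
-export([report/0]).

%% Compare total binary memory with the binaries that processes
%% are known to reference. A large unaccounted part hints at
%% binaries held elsewhere, for example in ETS tables.
report() ->
    PerProcess = [Bins || P <- processes(),
                          {binary, Bins} <- [process_info(P, binary)]],
    Held = lists:usort(lists:append(PerProcess)),
    ByProcesses = lists:sum([Size || {_Id, Size, _RefCount} <- Held]),
    Total = erlang:memory(binary),
    #{total => Total,
      held_by_processes => ByProcesses,
      unaccounted => Total - ByProcesses}.
```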

erlang limits (man erl)

ports
processes
ets tables
distribution buffer busy limit +zdbbl
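
The current values and limits can be read from a running node with erlang:system_info/1. A minimal sketch (the module name limits is made up):

```erlang
-module(limits).
-export([usage/0]).

%% Current usage next to the configured limit for the resources
%% listed on the slide. The distribution buffer busy limit is
%% reported in bytes and has no "used" counterpart here.
usage() ->
    [{processes, erlang:system_info(process_count),
                 erlang:system_info(process_limit)},
     {ports, erlang:system_info(port_count),
             erlang:system_info(port_limit)},
     {ets_tables, length(ets:all()),
                  erlang:system_info(ets_limit)},
     {dist_buf_busy_limit_bytes,
      erlang:system_info(dist_buf_busy_limit)}].
```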

cpu usage 100%

it might not be that bad but if you see random timeouts ->

monitor run queue length (vmstats)
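
Without vmstats, the run queue can be sampled by hand with erlang:statistics(run_queue). A minimal sketch (the module name runq is made up); a queue that stays well above the number of schedulers means processes are waiting for CPU:

```erlang
-module(runq).
-export([sample/2]).

%% Take Count samples of the total run queue length, sleeping
%% IntervalMs milliseconds before each one.
sample(Count, IntervalMs)
  when is_integer(Count), Count > 0,
       is_integer(IntervalMs), IntervalMs >= 0 ->
    [begin
         timer:sleep(IntervalMs),
         erlang:statistics(run_queue)
     end || _ <- lists:seq(1, Count)].
```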

system limits

swap vs. OOM
ulimit -n
TIME_WAIT sockets
ephemeral ports limit

events caught by riak_sysmon

large heap
long gc
busy ports
distribution buffer full
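
These events come from erlang:system_monitor/2, which riak_sysmon wraps. A bare sketch of subscribing to them directly (the module name sysmon_sketch is made up). Two caveats: a node has only one system monitor process, so this replaces any existing one, and unlike riak_sysmon it does no rate limiting:

```erlang
-module(sysmon_sketch).
-export([start/2]).

%% Start a process that logs long_gc, large_heap, busy_port and
%% busy_dist_port events and register it as the system monitor.
start(LongGcMs, LargeHeapWords) ->
    Pid = spawn(fun loop/0),
    _Previous = erlang:system_monitor(
                  Pid, [{long_gc, LongGcMs},
                        {large_heap, LargeHeapWords},
                        busy_port, busy_dist_port]),
    Pid.

loop() ->
    receive
        {monitor, Subject, Event, Info} ->
            error_logger:warning_msg("sysmon ~p in ~p: ~p~n",
                                     [Event, Subject, Info]),
            loop();
        stop ->
            ok
    end.
```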

fprof all the things

don't do that in production because overhead is too large

apply it only to a handful of processes
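
A sketch of what that can look like (the module name narrow_fprof is made up). It traces only the given pids for a short time, using fprof's default trace file fprof.trace in the current directory, and writes the analysis to a file:

```erlang
-module(narrow_fprof).
-export([profile/3]).

%% Profile only Pids for Millis milliseconds and write the
%% analysis to OutFile. The traced processes still slow down
%% noticeably while the trace is running.
profile(Pids, Millis, OutFile) when is_list(Pids) ->
    ok = fprof:trace([start, {procs, Pids}]),
    timer:sleep(Millis),
    ok = fprof:trace(stop),
    ok = fprof:profile(),
    ok = fprof:analyse([{dest, OutFile}]).
```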

-heart + typo in config file

leads to an endless race of 2 processes restarting each other

killall -9 heart beam.smp until it hits both of them
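
The race can be avoided by checking that the config file parses before the node is (re)started. A minimal sketch using file:consult/1 (the module name cfg_check is made up); it checks syntax only, not whether the settings make sense:

```erlang
-module(cfg_check).
-export([check/1]).

%% A sys.config style file must contain exactly one term, a
%% list, terminated by a full stop. Returns ok or
%% {error, Description}.
check(Path) ->
    case file:consult(Path) of
        {ok, [Terms]} when is_list(Terms) ->
            ok;
        {ok, _Other} ->
            {error, "expected a single list term"};
        {error, Reason} ->
            {error, file:format_error(Reason)}
    end.
```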

non-obvious things to check

error_logger inbox from crash dump (there might be some messages)
erlang stdout
dmesg
use github.com/lpgauth/vmstats/

Q&A

Anton Lebedevich

mabrek@gmail.com

twitter.com/widdoc

github.com/mabrek