Erlang: Don't Let Your Production Crash

loss of sharing

1> F = fun() ->
1>  L  = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
1>  L2 = [L, L, L, L, L, L, L, L, L, L],
1>  L3 = [L2,L2,L2,L2,L2,L2,L2,L2,L2,L2],
1>  L4 = [L3,L3,L3,L3,L3,L3,L3,L3,L3,L3],
1>  L5 = [L4,L4,L4,L4,L4,L4,L4,L4,L4,L4],
1>  L6 = [L5,L5,L5,L5,L5,L5,L5,L5,L5,L5],
1>  L7 = [L6,L6,L6,L6,L6,L6,L6,L6,L6,L6],
1>  L8 = [L7,L7,L7,L7,L7,L7,L7,L7,L7,L7],
1>  L9 = [L8,L8,L8,L8,L8,L8,L8,L8,L8,L8],
1>  pid(0,0,0) ! L9,
1>  ok
1> end.
#Fun<erl_eval.20.21881191>
2> F().
HUGE size (2222222220)
Aborted
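
The abort happens because sending a message copies the term flat, so every shared sublist is duplicated. A minimal sketch on a smaller list, assuming the undocumented erts_debug:size/1 and erts_debug:flat_size/1 helpers are available (the name Sharing is just for illustration):

```erlang
%% Everything runs inside one fun so the shell does not copy (and flatten)
%% the intermediate bindings between commands.
Sharing = fun() ->
    L  = lists:seq(0, 9),
    L2 = lists:duplicate(10, L),    % ten references to the same L
    L3 = lists:duplicate(10, L2),   % ten references to the same L2
    %% {words with sharing preserved, words after flattening}
    {erts_debug:size(L3), erts_debug:flat_size(L3)}
end.
Sharing().   % returns {60, 2220}
```

Each extra nesting level adds 20 words to the shared size but multiplies the flat size by about ten, which is how L9 reaches 2222222220 words.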

reached_max_restart_intensity

Erlang shuts down with 'reached_max_restart_intensity' in logs.
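
The limit that triggers this is set in the supervisor flags. A minimal sketch of where it lives, assuming a hypothetical example_sup supervisor with a hypothetical example_worker child (the numbers are illustrative, not recommendations):

```erlang
-module(example_sup).          % hypothetical module name
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% At most 5 restarts within 10 seconds. One more restart and this
    %% supervisor terminates itself and all of its children.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    %% example_worker is a hypothetical child module.
    Child = #{id => example_worker,
              start => {example_worker, start_link, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.
```

If the supervisor that gives up is the top supervisor of a permanent application, the whole node goes down with it.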

timeout infinity

application doesn't respond, 0% CPU usage

get stacktraces:

file:write_file("/tmp/procs.txt",erlang:system_info(procs)).

find them all: grep -n infinity src/*.erl

(trollface) and replace them with the default 5 seconds
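
Whatever finite timeout is used, the caller then has to handle the timeout exit. A minimal sketch, where SafeCall is a hypothetical helper and not part of OTP:

```erlang
%% Wrap gen_server:call/3 so a timeout or a missing server becomes a
%% return value instead of crashing the caller.
SafeCall = fun(Server, Request, TimeoutMs) ->
    try
        {ok, gen_server:call(Server, Request, TimeoutMs)}
    catch
        exit:{timeout, _} -> {error, timeout};
        exit:{noproc, _}  -> {error, noproc}
    end
end.
```

Usage: SafeCall(my_server, ping, 5000), where my_server is a placeholder for a registered gen_server.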

gen_server calls itself

application doesn't respond, 0% cpu usage, (optional) timeout failures for internal calls

read logs, get stacktraces:

file:write_file("/tmp/procs.txt",erlang:system_info(procs)).

or get process inbox and current call via

process_info(Pid)

fix deadlock
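
For a single suspect process it is often enough to ask for a few specific items. A minimal sketch (Inspect is a hypothetical helper name):

```erlang
%% Returns the registered name, the function the process is currently in,
%% its stack trace, and the contents of its inbox.
Inspect = fun(Pid) ->
    process_info(Pid, [registered_name, current_function,
                       current_stacktrace, message_queue_len, messages])
end.
```

A gen_server blocked in a call to itself will typically be sitting in the gen call code with its own request still unread in messages, though the exact function name depends on the OTP release.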

cast an OOM

eheap_alloc: Cannot allocate 100500XXX bytes of memory (of type "old_heap")

you can make it even worse with the undocumented erl +snsp flag

or unoptimized selective receive
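
Before memory runs out, the overloaded receiver usually stands out by its inbox length. A minimal sketch for finding it (TopQueues is a hypothetical helper name):

```erlang
%% Return the N processes with the longest message queues as {Length, Pid}.
TopQueues = fun(N) ->
    Queues = [{Len, P} || P <- processes(),
                          %% dead processes return undefined and are skipped
                          {message_queue_len, Len} <-
                              [process_info(P, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Queues)), N)
end.
```

Usage: TopQueues(5).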

error_logger

send a lot of binaries to an idle process

OOM again

simple one-liner:

[{M, P, process_info(P, [registered_name, initial_call, current_function, dictionary]), B} || {P, M, B} <- lists:sublist(lists:reverse(lists:keysort(2, [case process_info(P, binary) of {_, Bins} -> SortedBins = lists:usort(Bins), {_, Sizes, _} = lists:unzip3(SortedBins), {P, lists:sum(Sizes), SortedBins}; _ -> {P, 0, []} end || P <- processes()])), 5)].

read blog.heroku.com/archives/2013/11/7/logplex-down-the-rabbit-hole
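
An idle process rarely garbage collects, so it can keep references to large binaries alive for a long time. One way to check whether that is the cause, as a rough sketch (GcAll is a hypothetical helper; forcing a collection on every process is itself expensive, so use it sparingly):

```erlang
%% Force a garbage collection in every process and report how much
%% binary memory was released: {Before, After, Freed} in bytes.
GcAll = fun() ->
    Before = erlang:memory(binary),
    [erlang:garbage_collect(P) || P <- processes()],
    After = erlang:memory(binary),
    {Before, After, Before - After}
end.
```

A large Freed value suggests binaries were being held only by stale references in processes that had not collected yet.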

small subbinaries of large binaries

OOM

same one-liner

apply binary:copy/1 carefully
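
A minimal sketch of the effect, using binary:referenced_byte_size/1 to show how much memory a small piece keeps alive (SubBin is a hypothetical name and the sizes are arbitrary):

```erlang
SubBin = fun() ->
    Big   = binary:copy(<<0>>, 1024 * 1024),   % 1 MB binary
    Small = binary:part(Big, 0, 100),          % sub-binary pointing into Big
    Copy  = binary:copy(Small),                % fresh 100-byte binary
    %% {size of the piece, bytes kept alive by it, bytes kept alive by the copy}
    {byte_size(Small),
     binary:referenced_byte_size(Small),
     binary:referenced_byte_size(Copy)}
end.
SubBin().
```

The copy costs an allocation, so it pays off only when the small piece outlives the large binary, for example when it is stored in process state or ETS.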

binaries in ETS

OOM

invisible to process_info(_, binary)

compare erlang:memory(binary) with sum of all binaries held by processes
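
The comparison can be sketched as follows (BinaryGap is a hypothetical helper name):

```erlang
%% Returns {TotalBinaryBytes, BytesHeldByProcesses, Gap}.
BinaryGap = fun() ->
    %% Collect {Id, Size, RefCount} entries from every live process and
    %% de-duplicate, since one binary can be referenced by many processes.
    Bins = lists:usort(lists:append(
             [B || P <- processes(),
                   {binary, B} <- [process_info(P, binary)]])),
    Held  = lists:sum([Size || {_Id, Size, _RefCount} <- Bins]),
    Total = erlang:memory(binary),
    {Total, Held, Total - Held}
end.
```

The numbers are approximate because processes keep running while they are sampled, but a large and growing gap hints at binaries held outside process heaps, for example in ETS tables.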

erlang limits (man erl)
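
Current usage can be compared against the configured limits from a running node. A minimal sketch (Limits is a hypothetical helper; the atom items assume OTP 20 or later):

```erlang
%% Returns [{Resource, CurrentCount, Limit}].
Limits = fun() ->
    [{processes,  erlang:system_info(process_count),
                  erlang:system_info(process_limit)},
     {ports,      erlang:system_info(port_count),
                  erlang:system_info(port_limit)},
     {ets_tables, length(ets:all()),
                  erlang:system_info(ets_limit)},
     {atoms,      erlang:system_info(atom_count),
                  erlang:system_info(atom_limit)}]
end.
```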

cpu usage 100%

it might not be that bad, but if you see random timeouts ->

monitor run queue length (vmstats)
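
Without a metrics library the run queue can be sampled by hand. A minimal sketch (SampleRunQueue is a hypothetical helper name):

```erlang
%% Take Count samples of the total run queue length, IntervalMs apart.
SampleRunQueue = fun(Count, IntervalMs) ->
    [begin
         timer:sleep(IntervalMs),
         erlang:statistics(run_queue)
     end || _ <- lists:seq(1, Count)]
end.
```

Usage: SampleRunQueue(10, 1000). A run queue that stays well above the number of schedulers means processes are waiting for CPU time.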

system limits

events caught by riak_sysmon
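
riak_sysmon is built on erlang:system_monitor/2, which can also be tried directly in a shell. A minimal sketch with arbitrary example thresholds (Watch is a hypothetical helper name):

```erlang
%% Subscribe the calling process to system monitor events and wait for
%% the first one. Only one process per node can be the system monitor,
%% so this replaces any monitor that is already installed.
Watch = fun(TimeoutMs) ->
    erlang:system_monitor(self(), [{long_gc, 100},           % GC longer than 100 ms
                                   {large_heap, 10000000},   % heap above 10M words
                                   busy_port,
                                   busy_dist_port]),
    receive
        {monitor, Pid, Event, Info} -> {Pid, Event, Info}
    after TimeoutMs ->
        no_events
    end
end.
```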

fprof all the things

don't do that in production because overhead is too large

apply it only to a handful of processes
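
Limiting fprof to chosen processes can be sketched like this (ProfilePids is a hypothetical helper; the output path is an arbitrary example, and the trace file fprof.trace is written to the current directory):

```erlang
%% Trace only the given pids for DurationMs, then write the analysis.
ProfilePids = fun(Pids, DurationMs) ->
    fprof:trace([start, {procs, Pids}]),
    timer:sleep(DurationMs),
    fprof:trace(stop),
    fprof:profile(),
    fprof:analyse([{dest, "/tmp/fprof.analysis"}])
end.
```

Usage: ProfilePids([whereis(my_server)], 1000), where my_server is a placeholder for a registered process. Keep the duration short, since even a single traced process slows down noticeably.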

-heart + typo in config file

leads to an endless race of 2 processes restarting each other

killall -9 heart beam.smp until it hits both of them

non-obvious things to check

Q&A

Anton Lebedevich

mabrek@gmail.com

twitter.com/widdoc

github.com/mabrek
