IPv4 Scan 2021 - Multiprobes Analysis
September 19, 2021
To my surprise, the dataset preprocessed in my previous post has duplicated entries: scans of the same host and port but with different timestamps.
Questions like “which open port is more likely?” will be biased because the same host-port pair may be counted more than once.
On the other hand, this opens new questions:
- what is the reason to scan the same port more than once? If the number of probes is fixed by the scanner, we can deduce that ports appearing only once were actually probed more times but the other probes failed, and we can estimate how many.
- is the same port reported open due to different reasons?
- could we characterize the scanner’s scanning patterns based on the timestamps?
The second surprise was that even working with small samples (around 100 MB), Pandas/Dask have serious performance problems:
- they consume much more memory (gigabytes)
- the CPU is at 100% all the time
- simple operations like `groupby` take forever.
Goodbye Pandas, hello Julia?
Julia’s DataFrames
First we need to install a few packages:
julia> import Pkg
julia> Pkg.add("DataFrames")
julia> Pkg.add("Parquet")
julia> Pkg.add("CategoricalArrays")
julia> Pkg.add("StatsBase")
julia> Pkg.add("Statistics")
julia> Pkg.add("StatsPlots")
Then we load the dataframe:
julia> using Parquet, DataFrames, CategoricalArrays, StatsBase, Statistics, StatsPlots
julia> df = DataFrame(read_parquet("scans"))
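A quick sanity check that the columns survived the round-trip; these should be the same five columns written in the previous post:
julia> names(df)
5-element Vector{String}:
 "timestamp"
 "port"
 "ttl"
 "reason"
 "ip"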
Categorical data
`Parquet.jl` does not load the categories (or Pandas’ `to_parquet` is not writing them). This consumes more RAM because the `reason` and `port` columns are loaded as plain strings.
We can make them categorical again. The `compress=true` is needed so the column will be of the smallest type that can represent the categories, in our case `UInt8`; otherwise `CategoricalArrays.jl` uses `UInt32` by default:
julia> df[!, :reason] = categorical(df[:, :reason], compress=true)
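If you want to see the saving, `Base.summarysize` reports how many bytes a value occupies; measuring the column before and after the conversion makes the drop visible (the exact numbers depend on the data, so I omit them):
julia> Base.summarysize(df.reason)  # compare with the value measured before the conversion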
Ordinal data
We do the same for the `port` column but we additionally mark the categorical as ordered.
julia> df[!, :port] = categorical(df[:, :port], compress=true, ordered=true)
As explained earlier, the ports don’t have a natural order; however, I took this as an opportunity to explore and document how to work with ordinals.
`CategoricalArrays.jl` orders lexicographically by default. To change the order we need to do it later with `levels!`.
First we get the port labels (strings):
julia> s = levels(df.port)
Then we parse them as integers and sort them numerically:
julia> s = sort(parse.(UInt16, s))
We convert them back to string labels:
julia> s = string.(s)
Finally we rewrite the levels of the ordinal column:
julia> levels!(df.port, s)
Now the `port` column is an ordinal column and the order is implied by the numerical interpretation of its labels.
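A toy example (made-up ports, not our data) shows what the reordering buys us: before `levels!` the comparisons follow the lexicographic levels, after it they follow the numeric ones.
julia> a = categorical(["80", "443", "8080"], compress=true, ordered=true)
julia> a[1] < a[2]  # lexicographically "443" comes before "80"
false
julia> levels!(a, string.(sort(parse.(UInt16, levels(a)))))
julia> a[1] < a[2]  # numerically 80 < 443
true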
Is the same port open due to different reasons?
The `:foo` are symbols, in our case the names of the columns. The `∘` (`\circ` in LaTeX) is the composition operator: `(length ∘ unique)(x)` is equivalent to `length(unique(x))`. The whole `:reason => length ∘ unique => :nunique` reads as: take the `reason` column, count how many unique values are in each group and store the result (one per group) in the column `nunique`.
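A throwaway REPL check of the composition (toy data, not our dataset):
julia> (length ∘ unique)(["syn-ack", "syn-ack", "rst"])
2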
julia> g = groupby(df, [:ip, :port])
julia> df2 = combine(g,
:reason => length ∘ unique => :nunique)
julia> countmap(df2.nunique)
Dict{Int64, Int64} with 1 entry:
1 => 64787998
Nope, for each open port there is only one reason why it is open.
What is the reason to scan the same port more than once?
`masscan` supports a `--retries` flag. From the documentation:
`--retries`: the number of retries to send, at 1 second intervals. Note that since this scanner is stateless, retries are sent regardless if replies have already been received.
This means that `masscan` will send \(N\) probes to each port, always, one second apart.
Let’s check that.
What is the distribution of probes per open port?
`nrow` is a special value that `DataFrames`’ `combine` interprets as “count the rows of each group”. The rest follows the usual meaning: `nrow => :count` means store the count in a column named `count`.
julia> g = groupby(df, [:ip, :port])
julia> df2 = combine(g, nrow => :count)
julia> countmap(df2.count)
Dict{Int64, Int64} with 5 entries:
5 => 1
4 => 27
2 => 13038
3 => 750
1 => 64774182
Notice how most of the ip-port tuples were scanned only once.
So either `masscan` didn’t send \(N\) probes to each port, or it did but some probes were never answered (why? who knows).
This could explain why some ports were scanned twice while others only once.
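To put a number on it, the fraction of ip-port tuples probed more than once is tiny (the counts are the ones from the `countmap` above):
julia> count(>(1), df2.count) / nrow(df2)  # (13038 + 750 + 27 + 1) / 64787998, about 0.02%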
What is the distribution of intervals between probes for each port?
Sanity check: from the distribution of probes per port we know that we have 13038 ports with 2 probes that will contribute 13038 rows to the difference dataframe;
750 ports with 3 probes that will contribute 750 * 2 rows; 27 ports with 4 probes contributing 27 * 3 rows; and finally 1 port with 5 probes contributing 1 * 4 rows.
The expected total is 14623, which is exactly the row count of `df2`.
julia> g = groupby(df, [:ip, :port])
julia> df2 = combine(g,
:timestamp => diff ∘ sort => :interval)
julia> countmap(df2.interval)
Dict{Union{Missing, Int32}, Int64} with 30 entries:
0 => 320
1 => 2793
2 => 585
3 => 2625
4 => 843
5 => 159
6 => 316
7 => 1364
8 => 512
9 => 1435
10 => 495
11 => 149
12 => 103
13 => 52
14 => 139
15 => 1715
16 => 828
17 => 15
18 => 6
19 => 1
23 => 2
24 => 5
25 => 3
28 => 2
29 => 1
31 => 55
32 => 92
33 => 4
34 => 3
40 => 1
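And indeed the total matches the sanity check:
julia> nrow(df2)
14623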
Certainly a histogram is better for this case:
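The plot can be produced with `StatsPlots` (loaded at the beginning); the styling here is just one possible choice:
julia> histogram(df2.interval, bins=0:41, xlabel="interval (secs)", ylabel="count", legend=false)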
This was more spread out than I expected. Most of the intervals are in the low range but there is a non-negligible count around the 15-second interval.
Some quick statistics for the intervals:
julia> describe(df2, :mean, :std, :min, :q25, :median, :q75, :max, cols=:interval)
1×8 DataFrame
Row │ variable mean std min q25 median q75 max
│ Symbol Float64 Float64 Int32 Float64 Float64 Float64 Int32
─────┼─────────────────────────────────────────────────────────────────────
1 │ interval 6.8051 5.75237 0 2.0 5.0 10.0 40
Or, statistic by statistic:
julia> mean(df2.interval)
6.805101552349039
julia> median(df2.interval)
5.0
julia> mode(df2.interval)
1
julia> std(df2.interval)
5.752370354791699
julia> quantile(df2.interval, [.25, .5, .75])
3-element Vector{Float64}:
2.0
5.0
10.0
julia> iqr(df2.interval)
8.0
The median confirms our first analysis: the distribution is right-skewed (the mean is to the right of the median).
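`StatsBase` can also quantify the skew directly; a positive value confirms it (output omitted, run it on your data):
julia> skewness(df2.interval)  # > 0 means right-skewed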
What about the zero interval?
We can filter the rows with such an interval in two ways, the second one being the preferred and fastest:
julia> filter(dfrows -> dfrows.interval == 0, df2, view=true)
julia> filter(:interval => ==(0), df2, view=true)
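As a cross-check, the number of matching rows agrees with the `0 => 320` entry of the `countmap` above:
julia> nrow(filter(:interval => ==(0), df2, view=true))
320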
Choosing one of the returned IPs we can get its probes:
julia> filter(:ip => ==(22207380), df, view=true)
2×5 SubDataFrame
Row │ timestamp port ttl reason ip
│ Int32? Cat…? Int32? Cat…? Int64?
─────┼──────────────────────────────────────────────
1 │ 1619740697 80 42 syn-ack 22207380
2 │ 1619740697 80 42 syn-ack 22207380
As you can see, `timestamp` has 1-second resolution (4 bytes).
We could assume then that these two probes were sent 1 second apart but, due to the low resolution of the clock, we got the same timestamp.
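As a side note, the timestamps are Unix epoch seconds; the `Dates` standard library can render them as human-readable dates:
julia> using Dates
julia> unix2datetime(1619740697)
2021-04-29T23:58:17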
Some thoughts
My initial idea was to use this walk-through to learn and practice Pandas. Having a dataset of a non-trivial size, I knew that this was going to be a challenge.
But what better opportunity to work with Dask too!
I really tried to make it work but even processing 10% of the dataset made no difference: Pandas and Dask consumed so much memory that I couldn’t finish a single group-by + aggregation.
It is obvious that too many temporary copies are being made.
Writing a home-made custom aggregation function, I successfully bypassed the memory problem but I ended up with another one: a CPU-at-100%, never-finishing execution.
The custom aggregation function was written in Python, of course, and calling Python code for each row is incredibly slow.
And all of this with a reduced dataset!
I’m talking about processing 10% of the data, and it didn’t finish after running for a whole night.
After a week of trying and failing, it was clear that Pandas+Dask need more love.
That’s when I considered Julia.
Julia code is compiled to machine code and, because it deduces the types (most of the time), it can pack the data in arrays with high locality and generate fast code à la C.
It is not magic: the libraries are designed to work this way and avoid any sort of temporary copies.
On the other hand, Julia’s libraries are much more modest in capabilities compared with Python’s.
It is a non-trivial trade-off.
Related tags: pandas, julia, categorical, ordinal, parquet, statistics, seaborn