I was curious about what a phone does behind the scenes when it has a wifi connection. In this post I will go step by step through how to turn a wifi card into an access point and how to set up DHCP, DNS, routing and firewall rules to turn a computer into a man-in-the-middle box.
There are no hacks, spoofing or poisoning. I will go just through the happy case where I control the target device. Despite this, it was an interesting problem to solve.
And when I made the MITM box work and was celebrating, a careful inspection of the traffic showed a few surprises.
In the previous post we reviewed why and how a length extension attack is possible on some hash functions and how we can use it to break a prefix-keyed MAC.
Long story short: when the full internal state of the hash is exposed, an extension attack is possible.
So hash functions that don’t expose their internal state are okay, right?
Yes in general but the devil is in the details…
SHA-224 does not expose its full state and it is one of those “safe” hash functions (it is sometimes described as such in the literature) but…
In this quick post we will see why SHA-224 is vulnerable to a length extension attack even if its internal state is not fully exposed.
A trusted party with access to a private key k can compute an authentication code or MAC.
The message authentication code (MAC) is computed as H(k ∥ m).
In theory only those who know the secret key k can create and verify these, but no, this scheme is broken.
This post covers the Matasano challenges 28 to 30, so spoiler alert.
A keyed hash prefixes the message with the key k and computes a hash like SHA-1. The resulting hash is the MAC for the given message.
Then, someone who also knows k can verify whether a message is authentic or not by computing the MAC and comparing it with the one provided with the message.
If the computed hash matches the one provided, the message is authentic, otherwise it is not.
Unfortunately this prefix-keyed hash MAC is broken.
Some very well known hash functions expose their internal state, which allows an adversary to append data to the message, continue the hash computation and generate a new valid MAC.
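As a point of reference, here is a minimal sketch of the construction under attack, a prefix-keyed MAC on top of SHA-1; the key and message are toy values for illustration only.

import hashlib
import hmac

def prefix_mac(key, message):
    # MAC = H(k || m), the prefix-keyed construction discussed above
    return hashlib.sha1(key + message).hexdigest()

def verify(key, message, mac):
    # constant-time comparison of the provided MAC against a fresh one
    return hmac.compare_digest(prefix_mac(key, message), mac)

key = b'a toy secret key'
msg = b'from=alice&amount=10'
tag = prefix_mac(key, msg)
print(verify(key, msg, tag), verify(key, msg + b'&amount=999', tag))   # True False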
There are a lot of hash algorithms for different use cases but tabulation hashing caught my attention years ago for its incredible simplicity and nice independence properties.
Fast and simple.
I will explore a Cython implementation and see how fast it really is.
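To fix ideas, this is a minimal pure-Python sketch of simple tabulation hashing for 32-bit keys, the idea that a Cython implementation can build on: one random table per key byte and a xor per byte.

import random

# one 256-entry table of random 32-bit values per key byte
rng = random.Random(42)
TABLES = [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]

def tab_hash(key):
    # xor one table entry per byte of the 32-bit key
    h = 0
    for i in range(4):
        byte = (key >> (8 * i)) & 0xff
        h ^= TABLES[i][byte]
    return h

print(hex(tab_hash(0xdeadbeef)))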
Let’s say that we want to compare two Pandas’ dataframes for unit testing.
One is the expected dataframe, crafted by us and it will be the source of truth for the test.
The other is the obtained dataframe that is the result of the experiment that we want to check.
Doing a naive comparison will not work: first, we may want to tolerate some minor differences due to computational imprecision; and second, and most important, we don’t want to know just whether the dataframes are different or not.
We want to know where the differences are.
Knowing exactly what is different makes debugging much easier: trying to figure out by hand in which column and in which row there is a difference is not fun.
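A rough sketch of the idea, with made-up column names: compare with a tolerance and report where the dataframes differ instead of a plain yes/no.

import numpy as np
import pandas as pd

expected = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
obtained = pd.DataFrame({"x": [1.0, 2.0001, 3.0], "y": [10.0, 20.0, 31.0]})

# tolerate tiny numerical differences...
close = np.isclose(obtained, expected, rtol=1e-3)

# ...and point at the cells that are really different
for r, c in zip(*np.where(~close)):
    col = expected.columns[c]
    print(f"row {r}, column {col!r}: expected {expected.iloc[r, c]}, got {obtained.iloc[r, c]}")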
With the passing of the years one keeps storing files, music in my case.
A few thousand.
But technology has improved in this sense and new encoders exist that can compress (lossily) the same audio with the same quality but at a much smaller bit rate, resulting in much smaller files.
byexample is a tool that reads snippets of code from your documentation, executes them and compares the obtained results with the expected ones, from your docs too.
If a mismatch happens we say that the example in your documentation failed, which could mean one of two things:
your code (the snippet) does not do what you expect, so it has a bug,
or the code does exactly what it is supposed to do but you forgot to update your doc.
Very useful for testing and for keeping your docs in sync!
But byexample does not really execute anything by itself. Having to code an interpreter for Ruby, Java, C++ and others would be insane.
Instead, byexample sends the snippets of code to a standard interpreter like IRB for Ruby or cling for C++.
Interpreting the output from them is not always trivial.
When an interpreter prints to the terminal, it may write special escape/control sequences, invisible to human eyes but interpreted by the terminal.
That’s how IRB can tell your terminal to print something in red and blue.
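For example, before comparing the output one may need to strip those invisible sequences; here is a minimal sketch that removes the common CSI escape sequences with a regex (real terminals understand more kinds of sequences than this).

import re

# ESC [ <parameters> <intermediates> <final byte>: covers colors, cursor moves, ...
CSI_RE = re.compile(r'\x1b\[[0-9;?]*[ -/]*[@-~]')

raw = '\x1b[31mhello\x1b[0m world'     # "hello" in red, as IRB could print it
print(CSI_RE.sub('', raw))             # -> hello world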
In the previous post we saw that we cannot enable a controller or add a process to a cgroup freely. That would make the implementation of each controller harder.
In v2 the hierarchy is subject to the no internal process constraint, which ensures that a controller will have all its processes in the leaves of its domain tree.
This is the last of a 3-post series about cgroups and certainly this no internal process constraint was the hardest to understand.
A system has hundreds of resources that processes may use (and exhaust!): controlling them is not trivial and requires precise intervention by the kernel.
We’ll use the simplest resource to understand the rules: the number of process ids, or pids.
While CPU and memory are the most common resources that can be exhausted, the process id space is not infinite either: a pid is an integer typically in the range of 1 to \(2^{16}\).
A malicious or really buggy program can trivially consume all the available pids by spawning thousands of processes and threads, long before other resources get exhausted. This is the so-called fork bomb.
Once you run out of pids, no other process can be started, leaving the system unusable.
In this post we will explore the rules of resource distribution in a cgroup hierarchy and in particular how to keep fork bombs from exploding.
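As a teaser, this is roughly how the pids controller caps a group of processes, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup, the pids controller enabled in the parent’s cgroup.subtree_control and enough privileges; the cgroup name and the limit are arbitrary values for the example.

import os

cg = "/sys/fs/cgroup/sandbox"          # hypothetical cgroup for this example
os.makedirs(cg, exist_ok=True)         # mkdir in cgroupfs creates the cgroup

with open(os.path.join(cg, "pids.max"), "w") as f:
    f.write("64")                      # at most 64 pids (processes/threads) in here

with open(os.path.join(cg, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))          # move the current process into the cgroup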
The following snippet loads any Python module in the ./plugins/ folder.
This is the Python 3.x way to load code dynamically.
>>> import pkgutil, importlib.util, sys

>>> def load_modules():
...     dirnames = ["plugins/"]
...     loaded = []
...
...     # For each plugin folder, see which Python files are there
...     # and load them
...     for importer, name, is_pkg in pkgutil.iter_modules(dirnames):
...
...         # Find and load the Python module
...         spec = importer.find_spec(name)
...         module = importlib.util.module_from_spec(spec)
...         spec.loader.exec_module(module)
...         loaded.append(module)
...
...         # Add the loaded module to sys.modules so it can be
...         # found by pickle
...         sys.modules[name] = module
...
...     return loaded
The loaded modules work like any other Python module. In a plugin system you will typically look up a specific function or class that serves as the entry point or hook for the plugin.
For example, in byexample the plugins must define classes that inherit from ExampleFinder, ExampleParser, ExampleRunner or Concern. These extend byexample functionality to find, parse and run examples in different languages and hook –via Concern– most of the execution logic.
Imagine now that one of the plugins implements a function exec_bg that needs to be executed in the background, in a separate Python process.
We could do something like:
>>> loaded = load_modules()            # loading the plugins
>>> mod = loaded[0]                    # pick the first, this is just an example
>>> target = getattr(mod, 'exec_bg')   # lookup plugin's exec_bg function

>>> import multiprocessing
>>> proc = multiprocessing.Process(target=target)
>>> proc.start()                       # run exec_bg in a separated process
>>> proc.join()
This is a plain simple use of multiprocessing… and it will not work.
Well, it will work on Linux but not on macOS or Windows.
In this post I will show why it will not work for dynamically loaded code (like from a plugin) and how to fix it.
The file system of a docker container is ephemeral: it will disappear as soon as the container is destroyed.
To prevent data loss you can mount a folder of the host into the container; it will survive the destruction of the container.
But it is not uncommon to find issues with the ownership and permissions of the shared files.
The file system of the host represents the owner of each file with a user number and a group number.
Plain numbers.
Humans, on the other hand, think in terms of user names: “this file is Alice’s; this other one is Bob’s”.
The mapping between the numbers that the file system uses and the names that humans understand is stored in /etc/passwd and /etc/group.
And here we have a problem.
These two files, /etc/passwd and /etc/group, live in the host’s file system and they are used to map the files’ numbers to names when you look at the files from the host.
When you enter the docker container (or run a command inside it), the shared files, mounted with -v, will have the same user and group numbers.
But, and here is the twist, inside the container the mapping will use the /etc/passwd and /etc/group files of the container, not those of the host.
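A quick way to see the numbers versus the names, sketched in Python with a hypothetical file path: the file system only stores the uid/gid, and the name comes from the /etc/passwd and /etc/group of whatever system runs the code.

import os, pwd, grp

st = os.stat("shared/somefile.txt")      # any file in the mounted folder (made-up path)
print(st.st_uid, st.st_gid)              # what the file system stores: plain numbers

print(pwd.getpwuid(st.st_uid).pw_name)   # name resolved via /etc/passwd
print(grp.getgrgid(st.st_gid).gr_name)   # group resolved via /etc/group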
Histogram of observed TTLs: the peaks indicate the different operating systems, and their position relative to the expected values estimates the mean distance between the hosts and the scanner in hop count.
In the Multiprobes Analysis, we explored the statistics of the hosts and the communication to them.
This is a follow up to keep exploring the data:
what is the average distance between a host and the scanner in terms of hop count?
The time to live (TTL) is a number between 0 and 255 set by the sending host. On its journey, the packet goes through different routers, each of which decrements the TTL by one.
Once a packet has a TTL of zero, the router will discard it.
That is assuming the packet comes straight to us. If there are loops in the routing system, the packet will loop between the devices going nowhere. That’s the point of the TTL: to catch and drop packets looping around.
Of course the host sets the TTL to some reasonably high number so the packet should reach its destination (us) before the TTL drops to zero.
We cannot know exactly which route(s) the packets took but we can count how many hops they went through: any change in the hop count will give us different TTLs, and that is evidence that they took different routes.
The inverse is not true: a constant hop count does not imply that the route taken is the same, only that the route(s) have the same number of hops.
In this post I will continue exploring how stable a route is in terms of hop count.
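The hop count itself comes from the usual heuristic: assume the sender used one of the common initial TTLs (64, 128 or 255) and take the smallest one not below the observed value; a minimal sketch:

def estimated_hops(observed_ttl):
    # the difference between the assumed initial TTL and what we observed
    for initial in (64, 128, 255):
        if observed_ttl <= initial:
            return initial - observed_ttl
    return None

print(estimated_hops(52))    # 12 hops from a host that started at TTL 64
print(estimated_hops(116))   # 12 hops from a host that started at TTL 128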
To my surprise the dataset preprocessed in my previous post has duplicated entries. These are scans to the same host and port but with a different timestamp.
Histogram of the interval between probes to the same host-port in seconds.
Questions like “which open port is more likely” will be biased because the same host-port may be counted more than once.
On the other hand, this opens new questions:
what is the reason for scanning the same port more than once? If the number of probes is fixed by the scanner, we can deduce that ports seen only once were actually scanned more times but the other probes failed, and get an estimation of how often that happens.
is the same port open due to different reasons?
could we characterize the scanner based on the timestamps, like its scanning patterns?
The second surprise was that even working with small samples (around 100MB), Pandas/Dask has serious performance problems:
When defining a Python function you can define the default value of its parameters.
The defaults are evaluated once and bound to the function’s signature.
That means that mutable defaults are a bad idea: if you modify them in a call, the modification will persist across calls because for Python it is the same object.
>>> def foo(a, b='x', c=[]):
...     b += '!'
...     c += [2]
...     print(f"a={a} b={b} c={c}")

>>> foo(1)              # uses the default list
a=1 b=x! c=[2]

>>> foo(1)              # uses the *same* default list
a=1 b=x! c=[2, 2]

>>> foo(1, 'z', [3])    # uses another list
a=1 b=z! c=[3, 2]

>>> foo(1)              # uses the *same* default list, again
a=1 b=x! c=[2, 2, 2]
A mutable default can be used as the function’s private state, an alternative to traditional functional closures and object-oriented classes.
But in general a mutable default is most likely to be a bug.
Could Python have a way to prevent such a thing? Or better, could Python have a way to restart or refresh the mutable defaults on each call?
This question came up on the python-list. Let’s see how far we can get.
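For reference, the usual workaround today is a sentinel default, rebuilding the mutable object inside the call; the question is whether Python could do something like this for us.

>>> def foo(a, b='x', c=None):
...     if c is None:
...         c = []          # a fresh list on every call
...     b += '!'
...     c += [2]
...     print(f"a={a} b={b} c={c}")

>>> foo(1)
a=1 b=x! c=[2]

>>> foo(1)                  # no state carried over between calls
a=1 b=x! c=[2]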
SMT/SAT problems are in general undecidable: basically we cannot even know if we will get an answer. The search space is in general infinite.
But several concrete problems have a reduced space and may be tractable: while theoretically really hard, we can find a solution in a reasonable time.
Still, a solver may need some extra help to navigate across the search space.
A strategy is a guide that hopefully will lead to a solution (sat or unsat) in a reasonable time using a reasonable amount of resources.
It is quite an art and the concepts have been around for more than 30 years.
This post is a quick overview of concepts like strategies, tactics, tacticals, merit order, cost functions, heuristic functions, goals, approximations, redundancy control, search strategies, probes and a few more.
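A tiny taste of what this looks like in Z3’s Python API (assuming the z3-solver package is installed): tactics are named, composed with combinators like Then, and the composition can be turned into a solver.

from z3 import Int, Then

x, y = Int('x'), Int('y')

# simplify the goal, eliminate equalities, then hand it to the core SMT engine
strategy = Then('simplify', 'solve-eqs', 'smt')
solver = strategy.solver()

solver.add(x + y == 10, x - y == 4)
print(solver.check(), solver.model())    # sat, with x = 7 and y = 3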
A branch is an expensive operation even in modern CPUs because the CPU knows which path is taken only in the late stages of its pipeline.
In the meantime, the CPU stalls.
Modern CPUs use branch prediction, speculative execution and instruction reordering to minimize the impact of a branch.
They do a good job, but a branch is still potentially expensive, so branches are sometimes replaced by branchless variants.
Minimum check-elapsed time (y axis) per branch/branchless function (x axis).
If-Then-Else, or ITE for short, is a symbolic expression that denotes a value chosen from two possible values based on a condition. It is the symbolic counterpart of a branch.
Naturally, we could rewrite a symbolic ITE as a symbolic branchless expression.
The question is: which is better for a solver like Z3? Which makes the SMT/SAT solver faster?
After two weeks working on this post I still don’t have an answer but at least I know some unknowns.
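To make the question concrete, here are the two encodings for a 32-bit absolute value in Z3’s Python API, one with a symbolic ITE and one branchless, plus a proof that they agree; which one the solver digests faster is the open question.

from z3 import BitVec, If, prove

x = BitVec('x', 32)

abs_ite = If(x < 0, -x, x)           # the symbolic "branch"

mask = x >> 31                       # arithmetic shift: all zeros or all ones
abs_branchless = (x + mask) ^ mask   # classic branchless absolute value

prove(abs_ite == abs_branchless)     # prints: proved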
I’ve been writing for a long time. I’m far from being a good writer but having the consistency to write at least a blog post every month gave me more expressive power.
Being a non-native English speaker also put me in an uncomfortable position, out of the comfort zone, but looking back, even with all my mistakes, I really improved.
I’m using a bunch of different technologies to assist me:
Plotting data was always a weak point for me: it always took me a lot of time to make plots and graphs, reading the documentation, googling how to do one particular tweak and things like that.
So I challenged myself: could I build a cheatsheet about plotting?
Yes, yes I could.
Seaborn is a very simple but powerful plotting library on top of matplotlib designed for statistical analysis.
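The kind of one-liner the cheatsheet collects, using one of seaborn’s bundled example datasets (a recent seaborn version is assumed):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                                # bundled example data
sns.histplot(data=tips, x="total_bill", hue="time", kde=True)  # histogram split by a category
plt.show()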
A space company is planning the next 4 years. It has several projects, each one with its own budget requirement per year, but the company has a limited budget to invest.
Moreover, some projects depend on others to be feasible and some projects cannot be done together with others due to unbreakable restrictions.
Project              1st    2nd    3rd    4th    Depends     Not       Profit
1 Cube-1 nano-sat    1.1    2                                          12
2 Cube-2 nano-sat    2.5    2                                          12
3 Infrared sat       1      4.1                  on 6        with 4    18
4 Colored img sat    2      8                                with 3    15
5 Mars probe         2      8      4.4           on 1 & 2              12
6 Microwave tech     4      2.3    2                                   1
Under an incredible amount of assumptions and good luck, what is the best strategy to maximize the profit?
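A sketch of the 0/1 formulation with PuLP (assuming it is installed), using the profits, dependencies and exclusions from the table; the per-year costs and the yearly budget below are placeholders, not the real planning data.

from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

projects = [1, 2, 3, 4, 5, 6]
profit = {1: 12, 2: 12, 3: 18, 4: 15, 5: 12, 6: 1}      # from the table above

cost = {p: [1, 1, 1, 1] for p in projects}              # placeholder yearly costs
budget = [5, 5, 5, 5]                                   # placeholder yearly budget

x = {p: LpVariable(f"x{p}", cat=LpBinary) for p in projects}

prob = LpProblem("space_plan", LpMaximize)
prob += lpSum(profit[p] * x[p] for p in projects)       # total profit to maximize

for year in range(4):                                   # budget limit per year
    prob += lpSum(cost[p][year] * x[p] for p in projects) <= budget[year]

prob += x[3] <= x[6]            # project 3 depends on project 6
prob += x[5] <= x[1]            # project 5 depends on projects 1 and 2
prob += x[5] <= x[2]
prob += x[3] + x[4] <= 1        # projects 3 and 4 cannot be done together

prob.solve()
print([p for p in projects if x[p].value() == 1])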
A small business sells two types of chocolate packs: A and B.
The pack A has 300 grams of bittersweet chocolate, 500 grams of chocolate with walnuts and 200 grams of white chocolate.
The pack B has 400 grams of bittersweet chocolate, 200 grams of chocolate with walnuts and 400 grams of white chocolate.
The pack A has a price of 120$ while the pack B has a price of 90$.
Let’s assume that this small business has, for today, 100 kilograms of bittersweet chocolate, 120 kilograms of chocolate with walnuts and 100 kilograms of white chocolate.
How many packs of type A and type B should be packed to maximize the profit?
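This one is small enough to write down directly; a sketch with PuLP (assuming it is installed), with the stock converted to grams and the packs forced to be whole numbers:

from pulp import LpProblem, LpVariable, LpMaximize

A = LpVariable("packs_A", lowBound=0, cat="Integer")
B = LpVariable("packs_B", lowBound=0, cat="Integer")

prob = LpProblem("chocolate_packs", LpMaximize)
prob += 120 * A + 90 * B                     # profit to maximize

prob += 300 * A + 400 * B <= 100_000         # bittersweet chocolate (grams)
prob += 500 * A + 200 * B <= 120_000         # chocolate with walnuts
prob += 200 * A + 400 * B <= 100_000         # white chocolate

prob.solve()
print(A.value(), B.value(), 120 * A.value() + 90 * B.value())   # 200.0 100.0 33000.0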
Qubes OS has an interesting network system to isolate more-or-less trusted application virtual machines (App) from absolutely untrusted network VMs (Net).
The latter have the drivers required to handle ethernet and wifi cards, which exposes them to any potentially deadly bug lurking in those drivers.
An additional VM is put in the middle between the App VMs and the Net VMs. This absolutely trusted proxy VM serves as a safe firewall (Proxy).
In this post we will explore how these VMs connect and how the packets are forwarded up and down along this chain of VMs.
A ghost writer is a person who writes a document, essay or paper, but the work is presented by another person who claims to be the author.
Detecting deterring ghostwritten papers is an article written by an (ex)ghost writer and explains what happens behind the scenes when a student pays for this dark service.
Would it be possible to detect this in an automated way?
Given a set of documents, could we determine whether or not they were written by the person or people who claim to be the authors?
This problem is known as authorship attribution and I will show a few papers that I read about it, in particular around the concept of stylometry: the fingerprints that the real author leaves when he or she writes.
While implementing a bounded queue or ring buffer in a single-threaded universe is relatively easy, doing the same when you have two threads, that is, a lock-free queue, is more challenging.
In this first part we will analyse and implement a lock-free single-producer single-consumer queue. A multi-producer multi-consumer queue is described in the second part.
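The core of the single-producer single-consumer design, sketched in Python only to show the index logic (each index is written by exactly one thread); a real lock-free queue relies on atomic operations and explicit memory ordering, which Python does not give you.

class SPSCQueue:
    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)   # one slot stays empty to tell full from empty
        self.head = 0                        # written only by the consumer
        self.tail = 0                        # written only by the producer

    def push(self, item):                    # called only by the producer thread
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False                     # full
        self.buf[self.tail] = item
        self.tail = nxt                      # publish the item last
        return True

    def pop(self):                           # called only by the consumer thread
        if self.head == self.tail:
            return None                      # empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item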
When two or more concurrent tasks perform non-atomic read/write operations over the same data we have a race condition and the system will be in an undefined state.
But what exactly is that supposed to mean? What is behind the generic undefined state?
Same animation lasting 1 second in a loop. From top to bottom: the first is an animation without any frame lost; the second lost some frames but draw() is still in sync; the last one lost the same amount of frames but draw() used its own notion of time and got out of sync. code
The concept of Index of Coincidence was introduced by William Friedman in 1922-1923 in the paper titled “The Index of Coincidence and its Applications in Cryptoanalysis” [1].
The Navy Department (US) worked on this topic in the following years, coming up, in 1955, with a paper titled “The Index of Coincidence” by Howard Campaigne [2].
However, I found some confusing and misleading concepts in the modern literature.
This post is a summary of what I could understand from Howard’s work.
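For reference, the index of coincidence of a text is usually computed like this (counting only letters, with no normalization factor applied):

from collections import Counter

def index_of_coincidence(text):
    letters = [ch for ch in text.upper() if ch.isalpha()]
    n = len(letters)
    freqs = Counter(letters)
    # probability that two letters picked at random (without replacement) coincide
    return sum(f * (f - 1) for f in freqs.values()) / (n * (n - 1))

print(index_of_coincidence("the quick brown fox jumps over the lazy dog"))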
Before building complex state machines we need to learn the basic blocks.
When the solution to a problem can be seen as a set of states with transitions from one to another, modeling it as a nondeterministic finite automaton makes clear how the solution works and allows us to spot deficiencies.
A regular expression is an example of this. As an introductory step let’s review how to turn a regex into an NFA.
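As a teaser, here is a tiny hand-built NFA for the regex ab*c, with None acting as the epsilon move and the match done by tracking the set of reachable states; turning an arbitrary regex into such a machine is what the post walks through.

# states and labelled transitions of an NFA for "ab*c"; None is an epsilon move
NFA = {
    0: [('a', 1)],
    1: [(None, 2), (None, 4)],
    2: [('b', 3)],
    3: [(None, 2), (None, 4)],
    4: [('c', 5)],
    5: [],
}
START, ACCEPT = 0, {5}

def eps_closure(states):
    # every state reachable through epsilon moves only
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, nxt in NFA[s]:
            if label is None and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def matches(text):
    current = eps_closure({START})
    for ch in text:
        current = eps_closure({nxt for s in current
                                   for label, nxt in NFA[s] if label == ch})
    return bool(current & ACCEPT)

print(matches("abbbc"), matches("ac"), matches("abx"))   # True True False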
When you need to access different hosts, jumping from one to another, forwarding the agent is much more secure than copying and pasting your keys around.
But if one host gets compromised it will expose your agent: even if the attacker does not get your secret keys, he will be able to log into any system as you.
You cannot prevent this, but you can restrict this to reduce the splash damage.
Given a secret key matrix \(K\) with shape \(n \times n\), the Hill cipher splits the plaintext into blocks of length \(n\) and, for each block, computes the ciphertext block doing a linear transformation modulo \(m\)
$$ K p_i = c_i\quad(\textrm{mod } m)$$
For decryption, we apply the inverse of \(K\)
$$ p_i = [K]^{-1} c_i \quad(\textrm{mod } m)$$
To make sense, the secret key \(K\) must be chosen such that its inverse exists modulo \(m\).
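A toy numpy sketch of both directions for a 2x2 key modulo 26; the key and the block are made-up values, and the inverse is computed from the determinant and the adjugate of the matrix.

import numpy as np

m = 26
K = np.array([[3, 3],
              [2, 5]])          # invertible mod 26 (det = 9, gcd(9, 26) = 1)

def encrypt_block(p):
    return (K @ p) % m

def decrypt_block(c):
    det = int(round(np.linalg.det(K))) % m
    det_inv = pow(det, -1, m)                     # modular inverse of the determinant
    adj = np.array([[K[1, 1], -K[0, 1]],
                    [-K[1, 0], K[0, 0]]])         # adjugate of a 2x2 matrix
    K_inv = (det_inv * adj) % m
    return (K_inv @ c) % m

p = np.array([7, 8])            # "HI" encoded as numbers 0..25
c = encrypt_block(p)
print(c, decrypt_block(c))      # decrypt_block(c) gives back [7 8]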
The Mersenne Twister 19937, or just MT19937, is one of the most used pseudo random number generators, with a quite large cycle length and nice randomness quality.
However, it was not designed to be used for crypto.
The Counter mode, or just CTR mode, turns a block cipher into a stream cipher.
More specifically, it builds a pseudo random generator (PRG) from a block cipher and then generates a random key stream with the PRG to encrypt/decrypt the payload with a simple xor.
The idea is to initialize the PRG with a different seed each time; if this does not happen, all the plaintexts will be encrypted with the same pseudo random key stream, which is totally insecure.
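Roughly, the construction looks like this, sketched with the cryptography package (assumed installed, a recent version) and an assumed layout of 8-byte nonce plus 8-byte little-endian counter per block:

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def ctr_keystream(key, nonce, nblocks):
    # the PRG: encrypt successive counter blocks with the block cipher
    enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    for counter in range(nblocks):
        block = nonce + counter.to_bytes(8, 'little')
        yield from enc.update(block)

def ctr_xcrypt(key, nonce, data):
    # encryption and decryption are the same operation: xor with the key stream
    nblocks = (len(data) + 15) // 16
    keystream = ctr_keystream(key, nonce, nblocks)
    return bytes(d ^ k for d, k in zip(data, keystream))

key, nonce = os.urandom(16), os.urandom(8)
ct = ctr_xcrypt(key, nonce, b"attack at dawn")
print(ctr_xcrypt(key, nonce, ct))      # applying it twice recovers b'attack at dawn'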
AES and other ciphers work on blocks; if the plaintext length is not a multiple of the block size, a padding is added.
If during the decryption the pad is checked and an error is returned on a bad pad, we can use this to build a padding oracle: a function that will tell us whether an encrypted plaintext has a valid pad or not.
It may not sound too exciting, but armed with this padding oracle we can break CBC one byte at a time.
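The yes/no answer that such an oracle leaks is just a PKCS#7 validity check, something like this minimal sketch:

def pkcs7_pad_is_valid(plaintext, block_size=16):
    if not plaintext or len(plaintext) % block_size != 0:
        return False
    n = plaintext[-1]
    # the last byte says how many pad bytes there are, and all of them must match
    return 1 <= n <= block_size and plaintext[-n:] == bytes([n]) * n

print(pkcs7_pad_is_valid(b"ICE ICE BABY\x04\x04\x04\x04"))   # True
print(pkcs7_pad_is_valid(b"ICE ICE BABY\x01\x02\x03\x04"))   # False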
Despite being 4 years old, Shellshock is still a very interesting topic to me, not for the vulnerability itself but for the large number of ways to trigger it, even in the most unexpected places.
Take the 4 characters () { and open the world.
Fragments found in the field.
Creativity and a few hours reading man pages are all you need.
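The canonical, widely published probe looks like this, driven from Python: a function definition smuggled in an environment variable. An unpatched bash executes the trailing command while importing the variable; a patched one does not.

import subprocess

env = {
    "PATH": "/usr/bin:/bin",
    "x": "() { :; }; echo vulnerable",     # the infamous characters, plus a harmless payload
}
subprocess.run(["bash", "-c", "echo this is a test"], env=env)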
In this post we will review the Cipher Block Chaining mode (or CBC) and how we can build an ECB/CBC detection oracle to distinguish ECB from CBC using cryptonita.
– Spoiler Alert! –
This will be the basis for breaking ECB in a later post.
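The detection boils down to a well-known heuristic: under ECB, equal plaintext blocks encrypt to equal ciphertext blocks, so repeated blocks betray it. A minimal standalone sketch, independent of cryptonita:

def looks_like_ecb(ciphertext, block_size=16):
    blocks = [ciphertext[i:i + block_size]
              for i in range(0, len(ciphertext), block_size)]
    # any repeated block is strong evidence of ECB
    return len(blocks) != len(set(blocks))

print(looks_like_ecb(b"A" * 16 + b"B" * 16 + b"A" * 16))   # True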
A plaintext was encrypted via XOR with a key of unknown bytes and unknown length, repeating the key as much as needed to cover the full length of the plaintext.
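The cipher being attacked is just this, sketched here with a toy key:

from itertools import cycle

def xor_repeating_key(data, key):
    # repeat the key as much as needed and xor byte by byte
    return bytes(d ^ k for d, k in zip(data, cycle(key)))

ct = xor_repeating_key(b"Burning 'em, if you ain't quick and nimble", b"ICE")
print(xor_repeating_key(ct, b"ICE"))    # xoring again recovers the plaintext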
HTTP proxies blacklisting evil domains, firewalls blocking weird traffic and IDSs looking for someone who shouldn’t be there are reasonable and understandable policies for a corporate environment.
But when a friend opened his browser this week and went to google.com, things got odd.
The browser refused the connection, warning him that the SSL certificate of the server wasn’t issued by google.com at all nor signed by a trusted authority.
A friend of mine called me: a girlfriend of his was hopelessly trying to recover her thesis from a corrupted USB stick three days before her presentation.
She had been working on her thesis for almost a year, saving all the progress on that USB stick. But what she didn’t know was that a USB memory has a limited number of writes and with enough writes eventually the file system gets corrupted.
This is the real story behind a forensics rally trying to recover her one-year work.