Every week, I hear about a new secure software designed to protect your privacy, thwart the NSA/GCHQ and save kittens. Most of the time, though, they're started by people that are very enthusiastic yet unskilled.
They tend to concentrate directly on choosing algorithms and writing code, instead of stepping back and thinking a bit about what they want to develop.
Sure, they probably spent some time saying things like:
- that piece of data should absolutely be encrypted
- users will all have key pairs to authenticate themselves
- we should use AES, that's the safest choice (what are theses "modes" you're talking about?)
That is not how you design a protocol. That is not how you design a software using encryption. And that is not how you will design the next secure distributed social network.
To design your system, you need three things:
- a good threat model
- theoretical tools addressing the threats
- algorithms implementing these theoretical tools
As you see, most of the projects only have the third item, and that's insufficient to design a correct system. If you don't have a good threat model, you don't have a good mental model of your users and attackers, their means and their objectives. If you don't have the theoretical tools, you will try to shoehorn your favorite algorithm on the problem without knowing if it really fits (example: using hash algorithms to store passwords :p).
So, in this post, I'll provide those (simplified) theoretical definitions. You will probably recognize some of them.
High level view
First, you need to forget notions like "privacy", use any of these terms to describe the properties you want to achieve:
- authentication: you can recognize which entity you are communicating with
- authorization: the entity cannot get access to data it has no permission on (note that it is different from authentication)
- confidentiality: the data should not be readable by an entity that has no permission on it (it can be protected by crypto, but also by policies in the code)
- integrity: unauthorized modification of the data can be detected and marked as invalid
- non repudiation: an entity cannot deny it has executed an action
- deniability: an entity _can_ deny it has executed an action
Ok, now that we have some basic properties, let's apply them: think for a long time about the actors of the system (users, malicious users, admins, sysadmins, random attacker on the network, etc), what authorizations they have, what they should not get access to, what data moves on the network and between whom.
You should now have a very basic threat model and a rough overview of your system or protocol: you know what part of the network communications should be confidential, you know where you would need to authenticate.
You will now need some ideas about the type of attacks that could happen to your system, because you probably did not think of everything. Separate your systems in logical parts (like "client", and "server", etc), observe them, and observe how they communicate.
Here are some security properties that will be useful when you will try to choose algorithms later:
- Random oracle: a system that answers deterministically every question with a random answer from its answer space. You cannot predict what it will answer, but if you send the same question twice, it will answer the same both times.
- Perfect secrecy: for any two plaintext messages of same size, an attacker cannot distinguish which plaintext maps to which ciphertext. Basically, the adversary learns nothing from the ciphertext only.
- Semantic security: same thing as perfect secrecy, except what the adversary learns on the plaintext is negligible. example: One Time Pad
those properties can be tested by creating a "game", where the attacker tries to guess information on the data:
- IND-CPA (indistinguishability under the chosen plaintext attack) test: the adversary can generate as many ciphertexts as he wants. Then, he chooses two messages m0 and m1 of same length, those messages are encrypted, and one of the ciphertexts is sent back to him. The adversary should not be able to guess which message was used to generate that ciphertext (note that this is just one way of testing for CPA, there are many other schemes, some with stronger properties)
- IND-CCA (indistinguishability under the chosen ciphertext attack): the attacker can get the decryption of arbitrary ciphertexts, but should not, from this, be able to decrypt any other ciphertext.
Here are some common attack types that can be applied to crypto protocols. The list is not exhaustive, and covers only crypto attacks: there are many more ways to attack a system.
- Replay attacks: the attacker has observed some valid (encrypted or not) data going on the wire, and tries to send it again. Obviously, it should not be accepted
- MITM (Man In The Middle): the attacker can observe and modify live data running between two actors of the system. In the worst case, the attacker should not be able to forge valid data, decrypt data, or impersonate one of the users. In the best case, it should be detectable.
- Oracle attack: when an algorithm or protocol has a part that can act as an oracle (can be asked something and give answer, like a server), an attacker could exploit flaws in the algorithm to get useful information on the data (then the oracle is not a random oracle). Timing attacks are part of this type of attack. See also padding oracle attacks, or the recent BREACH attack on TLS.
- Offline attack: the attacker got access to some encrypted data (on the wire, or by accessing a disk somewhere), stored it, and tries to decrypt it for an amount of time
You should now have a better view of the system: what are the parts of the system that need protection, what attacks they must resist, and what properties they should have.
That means we can go to the next part: choosing the tools to implement the solution.
General cryptographic functions
No, we will not choose algorithms right now. That would be too easy :D
We will choose from a list of cryptographic constructions that implement some of the security properties of the system, and combine them to meet all the needed properties:
- Secure Pseudo Random Function: function from spaces K (keys) and X (message) to Y (other message space). The basic definition is that if you choose randomly one of these functions (like choosing randomly a k from K), its output will appear totally random (testable with IND-CPA).
- Pseudo Random Permutation: this is a PRF where X and Y are the same space. It is bijective (every y from Y maps to exactly one x from X, and every y is an output of the function), and there exists an efficient inversion function for it (from Y to X). Example: AES (in ECB, which is unsafe for common use)
- Message Authentication Code: defines a pair of algorithms. One of them takes a key k and a message m and outputs a code c. The other algorithm takes k, m, c and outputs True or False. An attacker should not be able to forge a valid c without knowing k. A PRF could be used to construct a MAC system. A hash function too. Example: HMAC
- Authenticated encryption: an encryption function (with semantic security under the CPA) where the attacker cannot forge new ciphertexts that decrypt correctly. Example: AES-GCM.
- Hash function: collision resistance (cannot find different messages m1, m2 so that H(m1) == H(m2), with different collision levels). Not easily invertible. Usually, they are fast. Example: SHA2.
- Trapdoor function: a function that is easy to compute, for which finding its inverse is hard, unless you have specific information. (secure under IND-CCA). examples: RSA, DSA.
- Zero knowledge proof: a way to prove something to the other party in a communication, without giving her any info, except the proof.
There are a lot of other constructions, depending on your needs, from low level algorithms like the Diffie Hellman key exchange to higher level protocols like OTR. Again, the constructions will depend on the security properties.
Choosing the algorithms
Can we do it now? YES! But there are rules. You must not choose an algorithm because it's hype or because someone said so in an old book. Basically, you choose algorithms that implement the properties you need (like authenticated encryption), and you choose the parameters of the algorithm (key size, exponent, elliptic curve) depending on the strength you need. Basically, a key size can define how much time encrypted data should remain impossible to decrypt. Those parameters also define the performance of the algorithm. Don't choose them without consulting experts, or you will face problems similar to those encountered by the projects that used low RSA exponents (it looks good from a performance standpoint, but it introduces very bad security).
Am I done now?
Nope. We have only define some very high level parts. Creating a protocol implies a lot of thoughts on:
- how you establish the communications
- protocol negotiation (version, algorithms, etc)
- key exchange
- nonce usage
- storing sessions
- handling lost connections
- closing the connection
As you can see, designing a protocol involves a lot more than choosing a few algorithms. Note that this was only a very rough overview of what you would need to create a safe system. And we did not even start coding!
So, if you want to build the next privacy protecting system, please talk to experts. They don't necessarily want to make you feel bad. They just have a lot of formal tools and the experience needed to see what will not work.