Nicolas314

All my geeky stuff ends up here. Mostly Unix-related

Convergent Encryption

with 2 comments

If you have not followed the latest buzz, an Internet startup has been getting a lot of attention lately: Bitcasa. Their offer seems just too good to be true: for $10/month you get unlimited remote storage for your data on their servers. The best part is that they claim your data will be encrypted on their servers, so that even they will not be able to access your file contents. They also claim they can offer unlimited storage by using de-duplication: the first time a file is uploaded it is actually stored on their servers, and every subsequent attempt to upload the same file just returns a pointer to the copy already stored.

First reaction: this sounds like bullshit. How could you both encrypt a file so that only its legitimate user can access it, and still identify redundancies on the servers? If everybody has a different encryption key, the same file will encrypt differently for each user, preventing any attempt at de-duplication. And if everybody has the same encryption key, that kind of defeats the whole point of encryption, doesn’t it?

The Bitcasa founder recently mentioned convergent encryption in an interview, which prompted me to look further into the topic. I have tried to wrap my head around it and summarize my understanding of the whole process here.

Let us do a thought experiment:

The client is given a list of files to store on the server.

Each file is cut into 4 kB blocks, padding the last block where necessary. The same process is then repeated for each block:

  1. Compute K = SHA256(block)
  2. Compute H = SHA256(K)
  • K will be used as the encryption key for this block.
  • H will be used as an index to store/retrieve the block server-side.

Now the client queries the server for H. If the server already has this block, it notifies the client that the block does not need to be uploaded. If the block has never been seen before, it is encrypted with AES-256 using K as the key, then uploaded to the server.
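To make this concrete, here is a minimal Python sketch of the per-block logic. The server interface (server_has, server_put) is entirely made up, and so are the cipher details: nothing above specifies a mode or a padding scheme, so AES-256 in CBC mode with a fixed IV and zero padding are my own assumptions.

    import hashlib
    from Crypto.Cipher import AES  # pycryptodome

    BLOCK_SIZE = 4096

    def pad(block):
        # Pad the last block of a file with zero bytes up to 4 kB.
        return block + b"\x00" * (BLOCK_SIZE - len(block))

    def process_block(block, server_has, server_put):
        block = pad(block)
        k = hashlib.sha256(block).digest()   # K: encryption key for this block
        h = hashlib.sha256(k).digest()       # H: server-side index
        if not server_has(h):
            # A fixed IV is deliberate: identical plaintext blocks must produce
            # identical ciphertext, otherwise de-duplication cannot work.
            cipher = AES.new(k, AES.MODE_CBC, iv=b"\x00" * 16)
            server_put(h, cipher.encrypt(block))
        return h, k

The client keeps the returned (H,K) pair; the server only ever sees H and ciphertext.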

Once all blocks have been either uploaded or identified as already known, the client can store the list of all (H,K) pairs together with enough metadata to rebuild complete files from individual blocks. Rebuilding the whole archive is then a matter of N requests for blocks identified by H, N decryptions with the associated Ks, and file reconstruction based on the metadata.
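Under the same assumptions (made-up server_get, AES-256-CBC with a fixed IV, zero padding), rebuilding a file is just the reverse walk over the stored (H,K) list:

    from Crypto.Cipher import AES

    def rebuild_file(pairs, original_size, server_get):
        # pairs: (H,K) tuples in file order; original_size comes from the metadata.
        plaintext = b""
        for h, k in pairs:
            cipher = AES.new(k, AES.MODE_CBC, iv=b"\x00" * 16)
            plaintext += cipher.decrypt(server_get(h))
        return plaintext[:original_size]   # drop the zero padding on the last block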

Store this metadata on the local client, and keep one copy on the server too for good measure. To ensure only the client can read it back, the metadata can be encrypted with a key known only to the client, or with a key derived from a user passphrase.
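As an illustration of the passphrase option, the metadata blob could be protected along these lines. The PBKDF2 parameters, the JSON serialization and the choice of CBC are all arbitrary on my part, not anything Bitcasa has documented:

    import hashlib, json, os
    from Crypto.Cipher import AES

    def encrypt_metadata(metadata, passphrase, salt):
        # metadata: e.g. {"name": ..., "size": ..., "blocks": [(H.hex(), K.hex()), ...]}
        key = hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 100_000)
        blob = json.dumps(metadata).encode()
        blob += b"\x00" * (-len(blob) % 16)   # pad to the AES block size
        iv = os.urandom(16)                   # random IV, stored with the ciphertext
        return iv + AES.new(key, AES.MODE_CBC, iv).encrypt(blob)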

What did we achieve?

Client-side: the initial files are now reduced to a set of (H,K) pairs and the metadata needed to rebuild them. Since a copy is stored on the server, the client’s local archive can safely be erased and rebuilt from scratch.

Server-side: duplicated blocks are stored only once. Blocks are encrypted with keys unknown to the server and indexed by a hash that does not yield any information about the key or the contents. The metadata are also stored on the server, but encrypted with a key known only to the user.

Executive summary:

  • The server does not have any knowledge about the data it is storing.
  • The server cannot determine which files are owned by a given user without attacking their metadata.

This scheme is a good way to deal with snooping authorities, but what about dictionary attacks? An attacker who has prepared a list of known plaintext blocks can easily discover whether they are already present on the server. To be fair, dictionary attacks are cited as the main vulnerability of convergent encryption schemes in every paper I have seen so far.
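The attack itself is trivially simple, which is the worrying part: anyone who can talk to the server (same hypothetical server_has as before) only has to hash a candidate block twice.

    import hashlib

    def block_is_on_server(candidate_block, server_has):
        # Recompute K and H exactly as a legitimate client would.
        k = hashlib.sha256(candidate_block).digest()
        h = hashlib.sha256(k).digest()
        return server_has(h)   # True means somebody already stored this exact block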

What are we trying to protect against?

If you offer the world a way to store its files remotely, chances are you will soon end up with a very long list of all the movies and music ever produced by humanity on your servers. Your main opponents are copyright holders, who do not want the public to share their productions so easily.

I believe dictionary attacks are hardly going to be usable by copyright holders. The same data block could very well be present in two very different files; I do not see how you could ever prevent somebody from publicly storing 4 kB of data that happen to match a 4 kB block of a copyright-protected file. Copyright protection applies to a complete work, not to individual 4 kB components. You would need to prove that the same user has access to all the individual blocks for a copyright violation to be established, and that is not possible in the described scheme.

Come to think of it, this suddenly looks like it could work.


Written by nicolas314

Monday 19 September 2011 at 11:15 pm

2 Responses


  1. If you have to store (H,K) for each block per user, this gives you a 64-fold reduction. Good but not ideal. I don’t know if they documented it but they could apply the same technique recursively on (H,K).

    Mathias

    Tuesday 20 September 2011 at 2:53 pm

    • I only took a 4k block size as an example. Since nothing is currently documented on their website this whole post is pure conjecture.

      nicolas314

      Tuesday 20 September 2011 at 3:29 pm

