All my geeky stuff ends up here. Mostly Unix-related

Posts Tagged ‘network storage’

Convergent Encryption

with 2 comments

In case you have not followed the latest buzz, an Internet startup has been getting a lot of attention lately: Bitcasa. Their offer seems just too good to be true: for $10/month you get unlimited remote storage for your data on their servers. The best part is: they claim your data will be encrypted on their servers so that even they will not be able to access your file contents. They also claim you can get unlimited storage thanks to de-duplication: the first time a file is uploaded it is truly stored on their servers, and all subsequent attempts to upload the same file will just return a pointer to the already stored one.

First reaction: this sounds like bullshit. How could you both encrypt a file so that only its legitimate user can access it, and identify redundancies on the servers? If everybody has a different encryption key, the same file will encrypt differently for each user and prevent any attempt at de-duplication. And if everybody has the same encryption key it kind of defeats the whole point of encryption, doesn’t it?

The Bitcasa founder recently mentioned convergent encryption during an interview, which pushed me into looking further into the topic. I have tried to wrap my head around it and summarize my understanding of the whole process here.

Let us do a thought experiment:

The client is given a list of files to store on the server.

Each file is cut into 4 kB blocks, using padding wherever necessary. The same process is repeated for each block:

  1. Compute K = SHA256(block)
  2. Compute H = SHA256(K)
  • K will be used as encryption key for this block.
  • H will be used as an index to store/retrieve the block server-side.
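The two steps above can be sketched in a few lines of Python using only `hashlib`. This is a minimal sketch, not anybody's actual implementation: the helper names `split_into_blocks` and `derive_key_and_index` are made up for illustration, and zero-padding the last block is an assumption (the post only says "padding").

```python
import hashlib

BLOCK_SIZE = 4096  # 4 kB blocks, as in the thought experiment


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks; the last one is zero-padded (assumed scheme)."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append(block.ljust(block_size, b"\0"))
    return blocks


def derive_key_and_index(block: bytes) -> tuple[bytes, bytes]:
    """Return (K, H) for one block: K = SHA256(block), H = SHA256(K)."""
    k = hashlib.sha256(block).digest()  # encryption key for this block
    h = hashlib.sha256(k).digest()      # server-side index for this block
    return k, h
```

Since both K and H depend only on the block contents, two clients holding the same block compute the same pair without ever talking to each other. The original file length has to be kept in the metadata so the padding can be stripped on restore.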

Now the client queries the server for H. If the server already has this block, it notifies the client that the block needs no uploading. If the block has never been seen before, it is encrypted with AES-256 using K as the key, then uploaded to the server.

Once all blocks have been either uploaded or identified as already known, the client can store a list of all (H,K) pairs and enough metadata to rebuild complete files from individual blocks. Re-building the whole archive is now a matter of N requests for blocks identified by H, and N decryptions with the associated Ks, followed by file reconstruction based on metadata.
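Here is a rough end-to-end sketch of the store and rebuild steps described above, with an in-memory dictionary playing the server. Everything named here (`BlockServer`, `store_file`, `restore_file`, the manifest layout) is hypothetical. To stay within the standard library, `toy_cipher` is a stand-in keystream XOR built from SHA-256: it is NOT AES-256 and not meant to be secure, it only fills the slot where a real deterministic cipher would go.

```python
import hashlib

BLOCK_SIZE = 4096


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Stand-in for AES-256 (NOT secure). XOR with a SHA-256 keystream;
    applying it twice with the same key gives back the input."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(a ^ b for a, b in zip(chunk, pad))
    return bytes(out)


class BlockServer:
    """Hypothetical server: stores opaque ciphertext indexed by H."""

    def __init__(self):
        self.blocks = {}

    def has(self, h: bytes) -> bool:
        return h in self.blocks

    def put(self, h: bytes, ciphertext: bytes) -> None:
        self.blocks[h] = ciphertext

    def get(self, h: bytes) -> bytes:
        return self.blocks[h]


def store_file(server: BlockServer, data: bytes) -> dict:
    """Upload unseen blocks only; return the metadata needed to rebuild."""
    pairs = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
        k = hashlib.sha256(block).digest()
        h = hashlib.sha256(k).digest()
        if not server.has(h):            # block never seen: encrypt and upload
            server.put(h, toy_cipher(k, block))
        pairs.append((h, k))
    return {"length": len(data), "pairs": pairs}


def restore_file(server: BlockServer, manifest: dict) -> bytes:
    """N requests by H, N decryptions with K, then strip the padding."""
    plain = b"".join(toy_cipher(k, server.get(h)) for h, k in manifest["pairs"])
    return plain[:manifest["length"]]
```

Storing the same data a second time, even from another client, adds nothing to `server.blocks`: every H is already known, so only a new manifest is produced.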

Store these metadata on the local client, and store one copy on the server too for good measure. To ensure only the client can read them back, metadata can be encrypted with a key only known to the client, or using a key derived from a user passphrase.
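For the passphrase option, one common way to turn a passphrase into a 256-bit key is PBKDF2, which ships with Python's `hashlib`. This is only a sketch under assumptions: the post does not say which derivation to use, and the function name `derive_metadata_key`, the 16-byte salt and the iteration count are my own choices.

```python
import hashlib
import os


def derive_metadata_key(passphrase: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte key from a passphrase with PBKDF2-HMAC-SHA256.
    Salt size and iteration count are illustrative assumptions."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is random but not secret: it can be stored next to the
# encrypted metadata so the key can be re-derived later.
salt = os.urandom(16)
metadata_key = derive_metadata_key("my long passphrase", salt)
```

Unlike the per-block keys, this key must not be derivable from the data, otherwise the server could read the (H,K) list.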

What did we achieve?

Client-side: initial files are now reduced to a set of (H,K) pairs and the metadata needed to rebuild them. Since an encrypted copy of these is stored on the server, the client’s local archive can safely be erased and rebuilt from scratch.

Server-side: duplicated blocks are stored only once. Blocks are encrypted with keys unknown to the server and indexed by a hash that does not yield any information about key or contents. Metadata are also stored on the server but encrypted with a key only known to the user.

Executive summary:

  • The server does not have any knowledge about the data it is storing
  • The server cannot determine which files are owned by a given user without attacking their metadata.

This scheme is a good way to deal with snooping authorities, but what about dictionary attacks? An attacker who has prepared a list of known plaintext blocks can easily discover whether they are already present on the server. To be fair, dictionary attacks are cited as the main vulnerability of convergent encryption schemes in all papers I have seen so far.
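To make the dictionary attack concrete, here is a small sketch. It assumes the attacker can ask the server whether a given H exists, modelled here as a plain set of known indexes; `block_index` and `probe` are hypothetical names.

```python
import hashlib


def block_index(block: bytes) -> bytes:
    """H = SHA256(SHA256(block)), computable by anyone who holds the block."""
    return hashlib.sha256(hashlib.sha256(block).digest()).digest()


def probe(server_index: set[bytes], candidate_blocks: list[bytes]) -> list[bytes]:
    """Return the candidate plaintext blocks the server already stores."""
    return [b for b in candidate_blocks if block_index(b) in server_index]


# Example: the server holds one block; the attacker tests two guesses.
server_index = {block_index(b"A" * 4096)}
found = probe(server_index, [b"A" * 4096, b"B" * 4096])
```

The attacker learns that the block exists somewhere on the server, but nothing in this exchange says which user's metadata points to it.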

What are we trying to protect against?

If you offer the world a way to remotely store files, chances are that you will soon end up with a very long list of all movies and music ever produced by humanity on your servers. Your main opponents are copyright holders who do not want the public to share their productions so easily.

I believe dictionary attacks are hardly going to be usable by copyright holders. The same data block could very well be present in two very different data files, and I do not see how you could ever prevent somebody from publicly storing 4 kB of data that match a 4 kB block in a copyright-protected data file. Copyright protection applies to a complete work, not to individual 4 kB components. You would need to prove that the same user has access to all individual blocks for the copyright violation to be proven, but that is not possible in the described scheme.

Come to think of it, this suddenly looks like it could work.

Written by nicolas314

Monday 19 September 2011 at 11:15 pm

My 2c on Amazon

with one comment

Hide the family jewels

As an early adopter I have enjoyed digital cameras at home for over 12 years now. This translates into about 20 GB of JPEGs on my home partition which I absolutely do not want to lose. I had the painful experience of getting burglarized a few years back and was lucky enough to recover my computers from the police station a couple of days later. The hardware itself has no importance to me but the pictures are of course priceless. This calls for a drastic solution: backup, backup, and remote backup. The first two steps are easy: multiply the copies of your pictures using rsync on various hard drives around the house and you are covered against single hard drive failure. Make sure you get into the habit of sync’ing them all every time you get a new bunch of pics and you are set. Now what are the solutions for remote backups?

Store it at work

The obvious solution is to encrypt a disk and leave it somewhere in my office, but that has obvious drawbacks. The first is that I have to think about bringing the disk home every time I add more data. I tried it for a while and could never remember to update the drive. The second is that there are lots of people going through my office every day. Even if I trust my colleagues, it is always tempting to borrow a USB hard drive you have seen sitting around the office for ages. The contents are of course encrypted, which makes the drive appear unformatted to the untrained eye.

I do not want to lock stuff in drawers. Last time I did, I lost the keys and had to destroy a drawer to get to my stuff. Kinda cryptography in the real world, except brute force actually works.

Network storage

Network storage solutions are a dime a dozen and literally exploding these days. I tried a lot of them and came to the conclusion that Dropbox is by far the best in terms of usability and functionality. It is the only solution I tried that has clients for Windows, Mac and Linux and that can dig through the firewall and HTTP proxy at work without me configuring anything. It also has an iPhone app to review your files on the go and this is absolutely gorgeous. I can finally have the illusion of having the same disk at home on all machines, at work, and in my pocket.

I will probably become a paid subscriber at some point. The remaining detail I have to fix is to figure out how to upload 20 gigs of data to their servers with my puny 100 kB/s home DSL connection. Dropbox also does not offer encryption, so I have to figure out a way to encrypt everything on the fly while still keeping contents accessible for easy retrieval, through an index or equivalent.
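For a sense of scale on that initial upload, a back-of-the-envelope calculation. It assumes decimal units (1 GB = 10^9 bytes, 1 kB = 10^3 bytes) and a link that stays saturated the whole time with no protocol overhead, which is optimistic; `upload_time_seconds` is just an illustrative helper.

```python
def upload_time_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to push size_bytes through a link running at a constant rate."""
    return size_bytes / rate_bytes_per_s


seconds = upload_time_seconds(20 * 10**9, 100 * 10**3)  # 20 GB at 100 kB/s
days = seconds / 86400
print(f"{seconds:.0f} s, about {days:.1f} days")  # prints "200000 s, about 2.3 days"
```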

Amazon S3

Another shot at network storage solutions brought me to Amazon S3. This service offered by Amazon is mostly aimed at developers who want to host large amounts of data like a database backend for a dynamic web site. It is a bit rough around the edges. Lots of people have tried disguising the whole thing as a network disk without much success. Reviewing existing Python APIs and fuse-based stuff did not reveal anything revolutionary or stable. Anyway, I felt I just had to try it out.

My tests consisted of creating a dedicated directory (a bucket in Amazon terms) and uploading 100 MB of data to see how easy it would be. I want to be able both to sync my picture directories and to encrypt all contents on the way up without having to recode too much stuff. I ended up with a little bit of Python glue around rsync and gpg that was not too satisfactory. It worked for basic tests but I would not have relied on my own code for production :-)

Amazon S3 is not a free service, but it isn’t expensive either. Doing my whole test set ended up with a bill for less than 2 euros. Fair. But this is where it hurts: Amazon billed me in US dollars and that triggers international charges on my credit card that are far above these 2 euros. In the end I might make my bank richer and will not bring anything to Amazon.

Pained by what I had discovered on my monthly bank statement, I decided to close the lid on the S3 experience and deleted all data from the bucket I created. The next month I was charged $0.02 for this operation, which turned into an absolutely ridiculous amount in euros with a sizeable charge attached from the credit card company because they did not appreciate my micro-payment.

This is probably the last time I ever use S3. I really do not understand why Amazon can bill me in euros for books (even when I buy in the UK) and not for services. Another good idea could be for them to accumulate bills until they reach a reasonable sum like 10 or 15 euros. It would not change much for their cash flow and would really avoid unnecessary bank feeding.

My 2c on Amazon S3 have cost me more than my phone bill this month.

Written by nicolas314

Thursday 10 December 2009 at 11:09 pm