Scaling Copyright

Let's say we want to register copyright for all the new pages on the web, in real time as they're posted. Every article, blog post, and comment. Maybe we're going to store them all in Swarm now. How do we do it?

We could use a contract like this:

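contract Registrar {
    // Maps a content hash to the address that first registered it.
    mapping (bytes32 => address) public hashes;
    function register(bytes32 hash) {
        // Unset entries read as zero, so only the first registration sticks.
        if (hashes[hash] == 0) {
            hashes[hash] = msg.sender;
        }
    }
}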

(There's a security flaw here; see my last post. But we're going to move on from this anyway.)

That works, at least at small scale, but it's storage on the blockchain for every registration. You have to pay 20K gas for it. It bloats the chain. Can we do better?

One way is to use events instead of storage. The registrations are mainly for human consumption anyway; it's not like other contracts have any use for the data. (I think. Any ideas?)

So we could switch to using events:

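contract Registrar {
    // The hash is indexed so clients can filter events by it.
    event logOwner(bytes32 indexed hash, address owner);
    function register(bytes32 _hash) {
        // Log the claim instead of writing to storage; duplicates are not rejected.
        logOwner(_hash, msg.sender);
    }
}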

That's a little better (though it has the same security flaw). We save about 19K gas. There's nothing to prevent multiple registrations of the same hash, but we can search the events and find the oldest. Events barely add to the blockchain size; the transactions are in the blockchain anyway, and if you store an event, all Ethereum does is squish it down into a bloom filter. You can query the bloom filter to see whether the data's there (with some risk of false positives, which are fine), and if the bloom filter says yes, the node just reruns the transactions and calculates the event data from scratch. Indexed arguments add a little, but it's still pretty cheap.
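Here's a rough back-of-the-envelope for that saving. It assumes the fee schedule constants from the Ethereum yellow paper (20000 gas to set a fresh storage slot, 375 per log, 375 per topic, 8 per byte of log data) and ignores everything the two versions have in common, like the base transaction cost. Treat it as a sketch of where a number like 19K could come from, not an exact measurement.

```javascript
// Gas constants assumed from the yellow paper fee schedule.
const G_SSET = 20000;    // writing a non-zero value into an empty storage slot
const G_LOG = 375;       // base cost of emitting a log
const G_LOGTOPIC = 375;  // per topic
const G_LOGDATA = 8;     // per byte of unindexed log data

// logOwner carries two topics (the event signature and the indexed hash)
// and 32 bytes of data (the owner address, padded to a full word).
const topics = 2;
const dataBytes = 32;

const logCost = G_LOG + topics * G_LOGTOPIC + dataBytes * G_LOGDATA;
const saving = G_SSET - logCost;

console.log(logCost); // 1381
console.log(saving);  // 18619
```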

If we want to look up any hashed data and find its owner, that's pretty much the best we can do. Find the oldest event registering the hash and see who claimed it.
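To make "find the oldest event" concrete, here's a minimal sketch that works on an array of log entries. The entries are made up, shaped roughly like what a web3 event filter hands back (`blockNumber`, `transactionIndex`, `logIndex`, `args`); `firstClaim` is a hypothetical helper name, and the owner values are placeholders rather than real addresses.

```javascript
// Made-up log entries for illustration only.
const logs = [
  { blockNumber: 120, transactionIndex: 3, logIndex: 5,  args: { hash: "0xaa", owner: "bob" } },
  { blockNumber: 118, transactionIndex: 7, logIndex: 9,  args: { hash: "0xaa", owner: "alice" } },
  { blockNumber: 118, transactionIndex: 9, logIndex: 12, args: { hash: "0xaa", owner: "carol" } },
  { blockNumber: 101, transactionIndex: 0, logIndex: 0,  args: { hash: "0xbb", owner: "dave" } },
];

// Order logs by position in the chain: block first, then position within the block.
function chainOrder(a, b) {
  return a.blockNumber - b.blockNumber ||
         a.transactionIndex - b.transactionIndex ||
         a.logIndex - b.logIndex;
}

// Return the earliest log that registers the given hash, or null if there is none.
function firstClaim(allLogs, hash) {
  const matches = allLogs.filter((log) => log.args.hash === hash);
  return matches.length ? matches.sort(chainOrder)[0] : null;
}

console.log(firstClaim(logs, "0xaa").args.owner); // alice
console.log(firstClaim(logs, "0xcc"));            // null
```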

But it doesn't really prevent plagiarism. Anyone can get a new hash by altering a single bit in the source file. To check for plagiarism we need some kind of approximate hashing to find near-matches of content. If you download a song and want to see who owns it, you can't just hash it and look it up. You need perceptual hashing to see whether someone else registered similar content.
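A quick way to see the problem: flip one bit and hash again. This sketch uses SHA-256 from Node's standard library as a stand-in for Ethereum's sha3 (that substitution is my assumption, made so the snippet runs anywhere); any cryptographic hash behaves the same way here.

```javascript
const crypto = require("crypto");

// SHA-256 stands in for sha3 in this sketch.
function digest(buffer) {
  return crypto.createHash("sha256").update(buffer).digest("hex");
}

const original = Buffer.from("Every article, blog post, and comment.");
const altered = Buffer.from(original); // copy the bytes
altered[0] ^= 1;                       // flip the lowest bit of the first byte

console.log(digest(original) === digest(altered)); // false
```

The two files differ by one bit, but an exact-match lookup treats them as unrelated content.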

Since plagiarism is so easy, we're just pretending we can look up the real owner of the content from the hash, using the blockchain alone. We can get even cheaper registrations if we don't pretend. Instead of doing lookups on chain, let the copyright owner (or supposed owner) hang onto the proof. He can present us a document saying "I registered a hash of X on block Y" and we can verify that that's true. That's all the consensus we really need; plagiarism detection can be off-chain.

With that approach, here's our registrar contract:

contract Registrar {}

Seriously, that's it.

If you want to register a copyright, you send a transaction to this contract, with a hash in the transaction data. You could just send sha3(content), but we still have the security flaw mentioned above; namely, anyone could copy your transaction and maybe get theirs in the blockchain first, with themselves as sender. So instead we use this:

sha3(ownerAddress, sha3(content))

Paste the result into the data field of any client, send the transaction. Get back the transaction hash.
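If you'd rather compute that value in a script, something like this would do it. Two assumptions to flag: SHA-256 stands in for sha3 (Keccak-256) so the sketch runs on Node's standard library alone, and the address is a made-up placeholder. The arguments are simply concatenated before hashing, which is how I'm reading the formula above.

```javascript
const crypto = require("crypto");

// Stand-in for Ethereum's sha3 (Keccak-256); real digests would differ.
function hash(...parts) {
  const h = crypto.createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
}

// sha3(ownerAddress, sha3(content)), returned as a 0x-prefixed hex string.
function commitment(ownerHex, content) {
  const owner = Buffer.from(ownerHex.slice(2), "hex"); // drop the "0x" prefix
  return "0x" + hash(owner, hash(content)).toString("hex");
}

const ownerAddress = "0x" + "11".repeat(20); // placeholder 20-byte address
console.log(commitment(ownerAddress, "My blog post"));
```

Because the owner address is part of the hashed data, someone who copies the transaction data and resends it from their own account still ends up registering a hash that names you.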

Now publish the following on your website, Swarm, whatever:

  • the transaction hash
  • the owner address
  • the content

Anyone can run the content and address through the hashing formula, then look up the transaction on the blockchain by its hash with this JavaScript:

web3.eth.getTransaction(transactionHash);

This will return a JavaScript object which includes:

  • the block number the transaction went in
  • the address of the sender
  • the data sent along with the transaction (in this case just the sha3 hash), which should match the hash they computed
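The check itself is a string comparison. Here's a sketch with a mocked transaction object so it runs without a node; the `input`, `from`, and `blockNumber` fields follow what web3 returns, but the values are invented, `verifyClaim` is a hypothetical helper, and SHA-256 again stands in for sha3.

```javascript
const crypto = require("crypto");

// Stand-in for Ethereum's sha3 (Keccak-256); real digests would differ.
function hash(...parts) {
  const h = crypto.createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
}

// sha3(ownerAddress, sha3(content)) as a 0x-prefixed hex string.
function commitment(ownerHex, content) {
  const owner = Buffer.from(ownerHex.slice(2), "hex");
  return "0x" + hash(owner, hash(content)).toString("hex");
}

// Returns the block number if the transaction data matches the claim, else null.
function verifyClaim(tx, ownerHex, content) {
  return tx.input === commitment(ownerHex, content) ? tx.blockNumber : null;
}

// Mocked result of web3.eth.getTransaction(transactionHash).
const owner = "0x" + "11".repeat(20);
const content = "My blog post";
const tx = { blockNumber: 1234567, from: owner, input: commitment(owner, content) };

console.log(verifyClaim(tx, owner, content));       // 1234567
console.log(verifyClaim(tx, owner, content + "!")); // null
```

Note that the check never looks at `tx.from`. The owner is whoever's address is inside the hash, not whoever paid for the transaction.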

Now you've really got minimal registration. You're sending 32 bytes of message data to the network, and you're not even logging it. You're not adding anything to blockchain storage other than the transaction itself.

But...we're still not there yet. We still have one transaction per article. That might be ok when we've got massive sharding and we're doing 100,000 tx/sec but right now, 10 tx/sec is more our speed. We can't copyright all the pages that way.

If you're copyrighting a batch of your own stuff, it's easy. Say you're copyrighting three pages, send this hash in your transaction message:

sha3(ownerAddress, sha3(page1), sha3(page2), sha3(page3))

Same process, but with just one transaction, anyone can verify that you registered all three pages. You just have to publish the list of pages.
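Sketched in code, the batch version might look like this. As before, SHA-256 is a stand-in for sha3, the address is a placeholder, and `batchHash` and `verifyPage` are hypothetical names. One detail I'm assuming: a verifier only needs the list of page hashes plus the one page being checked, not the contents of every page.

```javascript
const crypto = require("crypto");

// Stand-in for Ethereum's sha3 (Keccak-256); real digests would differ.
function hash(...parts) {
  const h = crypto.createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
}

// sha3(ownerAddress, sha3(page1), sha3(page2), ...)
function batchHash(ownerHex, pageHashes) {
  return hash(Buffer.from(ownerHex.slice(2), "hex"), ...pageHashes);
}

// A page is covered if its hash is in the list and the list reproduces the registered hash.
function verifyPage(registered, ownerHex, pageHashes, page) {
  const pageHash = hash(page);
  const listed = pageHashes.some((h) => h.equals(pageHash));
  return listed && batchHash(ownerHex, pageHashes).equals(registered);
}

const owner = "0x" + "11".repeat(20); // placeholder address
const pageHashes = ["page one", "page two", "page three"].map((p) => hash(p));
const registered = batchHash(owner, pageHashes); // this goes in the transaction data

console.log(verifyPage(registered, owner, pageHashes, "page two"));  // true
console.log(verifyPage(registered, owner, pageHashes, "page four")); // false
```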

But remember: because the hashed data includes your address, it's ok to publish that hash before it's in the blockchain. It isn't registered with a timestamp until it's in the chain, but nobody can steal it so long as you don't publish the content before it's timestamped.

So you can just publish the outer hash on your website, Swarm, Whisper, whatever. Maybe you've got some friends or collaborators who are doing the same. Any of you can scarf up all these hashes and publish them to the blockchain in one fell swoop.

So Alice publishes a hash:

aliceHash = sha3(aliceAddress, sha3(page1), sha3(page2), sha3(page3))

She just publishes aliceHash, not the data that made it. Now Bob publishes another hash:

bobHash = sha3(bobAddress, sha3(page4), sha3(page5))

And then you add some of your own and publish the whole thing:

sha3(yourHash, aliceHash, bobHash)

You get the transaction hash and tell Alice and Bob about it:

"Hey Alice I posted [yourHash, aliceHash, bobHash], and here's the transaction hash"

Now all three of you are free to reveal how you made your hashes. You could have collectively registered a thousand pages and it all went into a transaction with 32 bytes of msg.data.
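Here's the whole round trip as a sketch, with the usual caveats: SHA-256 stands in for sha3, the addresses are placeholders, and `verifyMember` is a hypothetical helper. It checks that a given hash is in the posted list and that the list, hashed in the posted order, reproduces the transaction data.

```javascript
const crypto = require("crypto");

// Stand-in for Ethereum's sha3 (Keccak-256); real digests would differ.
function hash(...parts) {
  const h = crypto.createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
}

// sha3(address, sha3(page1), sha3(page2), ...)
function batchHash(ownerHex, pages) {
  return hash(Buffer.from(ownerHex.slice(2), "hex"), ...pages.map((p) => hash(p)));
}

// True if memberHash is in the posted list and the list hashes to the transaction data.
function verifyMember(txData, postedList, memberHash) {
  const listed = postedList.some((h) => h.equals(memberHash));
  return listed && hash(...postedList).equals(txData);
}

const aliceHash = batchHash("0x" + "aa".repeat(20), ["page1", "page2", "page3"]);
const bobHash = batchHash("0x" + "bb".repeat(20), ["page4", "page5"]);
const yourHash = batchHash("0x" + "cc".repeat(20), ["page6"]);

const postedList = [yourHash, aliceHash, bobHash];
const txData = hash(...postedList); // the 32 bytes that go on chain

console.log(verifyMember(txData, postedList, aliceHash));                     // true
console.log(verifyMember(txData, [aliceHash, bobHash, yourHash], aliceHash)); // false
```

The second check fails because the list is in a different order, so whoever posts the transaction has to tell everyone the exact order they hashed.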

Of course, there needs to be some incentive to get people to publish each other's hashes. But as long as they're publishing on a regular basis anyway, including someone else's hash costs nothing extra. It's just tit-for-tat: you scratch my back, I'll scratch yours. Keep track of who posts your hashes, and occasionally post theirs in return. Sometimes do a favor for someone new, and if they return the favor, do it some more. This is pretty much how BitTorrent works.
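The bookkeeping for that could be as simple as a pair of counters per peer. This is purely a hypothetical sketch; `FavorLedger`, its method names, and the `generosity` parameter (how many unreturned favors we tolerate) are all invented for illustration.

```javascript
// Hypothetical tit-for-tat ledger: count favors given and received per peer.
class FavorLedger {
  constructor(generosity = 1) {
    this.generosity = generosity; // unreturned favors we tolerate per peer
    this.given = new Map();       // times we posted a peer's hash
    this.received = new Map();    // times a peer posted our hash
  }

  recordGiven(peer) {
    this.given.set(peer, (this.given.get(peer) || 0) + 1);
  }

  recordReceived(peer) {
    this.received.set(peer, (this.received.get(peer) || 0) + 1);
  }

  // Post for a peer unless they already owe us too many favors.
  shouldPost(peer) {
    const owed = (this.given.get(peer) || 0) - (this.received.get(peer) || 0);
    return owed < this.generosity;
  }
}

const ledger = new FavorLedger();
console.log(ledger.shouldPost("alice")); // true: do a favor for someone new
ledger.recordGiven("alice");
console.log(ledger.shouldPost("alice")); // false: wait for her to return it
ledger.recordReceived("alice");
console.log(ledger.shouldPost("alice")); // true: she returned it, so keep going
```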