Yes, another post about url shorteners. Recently, apart from complaining about them, I have been thinking about how URL shortening services work; services such as bit.ly and tinyurl.
Many of these services reduce a URL to a small string; typically a length of 3 to 6 characters. As a result, they can’t be simply hashing the URL. For example, if you use MD5 to hash the domain name of this site you get (in hexadecimal): 4302e8ae08795f0c67c932338f516e2f. The resulting hash value is longer than the URL itself! Not very useful for a URL shortening service.
So how do they work? I still don’t know but here’s one approach that I took:
Let’s say we want to produce a code using characters from the following alphabet [a-zA-Z0-9]; that gives a total of 62 different alphanumeric characters. For a 5 character code there are 62*62*62*62*62 (=916,132,832) possible combinations. If we associate each code with a given URL – a simple one to one mapping – then that’s a lot of URLs! The key point is that, unlike a hash function, I don’t think the URL is used as input to determine what comes out of the other end; a “random” character code is generated and then is just stored with the URL so it can be retrieved with a simple table look-up.
I came up with a probabilistic approach to generating these “hash” codes. I say probabilistic as it just generates a code at random. If there is a collision, it just tries again and generates another one. So how likely are collisions to occur? Well according to the birthday paradox we should expect to see a collision after generating 2n/2 items, or approx. every 215 items for a 5 character code using the example code, assuming, of course, that all generated codes are equally likely to occur. It’s an example, it will do!
You can look at the source code here. In the example I use a bit vector to record what codes have already been generated; the bit vector is limited to representing 231-1 different values therefore the example code is restricted to generating a maximum of 5 character codes; each character requires 6 bits. I’ll let you do the math as I have already done it 🙂
If you do run it you may have to increase the maximum heap size, e.g. -Xmx256m. I ran out of the heap space the first time I ran it using Eclipse!
It’s a first attempt so there is likely some room for improvement but it’s a start. Would be great to hear any alternative thoughts on how these things work.