Updated: Dec 8, 2022
Having a good and efficient system is very important when it comes to creating a new system. A good system design makes sure that the system is resilient, highly available, easily scalable, and fault tolerant. In this blog, I will be designing a URL shortener application (also known as Tiny URL) that would be strong enough to work at scale.
Let's dive right in -
The Problem Statement
Design a highly scalable system to shorten URLs, i.e when the user inputs a long URL, the system needs to return a short URL. When any person hits the short URL, it needs to return the longer URL, and redirect the browser to this URL.
What is a Tiny-URL or URL shortener?
A tiny URL or URL shortener is a service that creates a shorter version of long URLs.
It's a system similar to solutions that companies like `bitly` provide. People generally prefer the shortened version of the URL as it is easier to manage and share.
For example, if the input to the system is
our system should provide output something like:
You may think that the solution to the above problem looks simple! Right?
Generate a random number and store it in a map with the longer URL!
The problem with the above solution is that it is not scalable. By the end of this blog, we will be coming up with a solution that would work great at scale.
Modularising the problem statement:
Before jumping right in, I will divide the problem statement into the following sub-problem statements :
- Designing the final output string
- Methods to generate the output string discussed in the above point
- The above output string should not be predictable
Apart from the above, we should also make sure that the short link will redirect users to the original long URL with minimal latency. Our system should also be highly available across the globe. A nice-to-have feature would be if the users can provide short custom short URLs and an expiration time for each of these short URLs.
Let's start by designing the final output string
Designing the final output string (characters and its length)
We know the output string should look like
The variable in the above part is `random-string`. I am assuming that my `random-string` will contain the following characters :
( A - Z )
( a - z )
( 0 - 9 )
Analysis and estimation of the output URL
The length of this `random-string` depends on the scale that our system works on.
I will assume that our system needs to handle 2000 requests/second.
Let's say this system needs to be active for at least 100 years, then the total number of unique random strings generated would be ~6 trillion strings. It's simple multiplication -
100 years = 100 * 365 * 24 * 60 * 60 seconds
2000(req/sec) * 100 years = ~ 6 trillion
To calculate the length of the output string, I will use permutation and combination formulas. If you are not aware of the way permutation and combination works, it's fine, just assume that the calculation below is correct. I came up with the following equation to calculate the length of the string ( x is the length of the string) :
x ^ 62 = 6 * (10^12)
, where `^` is the exponent
7 ^ 62 is 3.5 trillion, and 8 ^ 62 is 218 trillion
We can't have the length in decimals, so we will go with the random string having a length of 8 digits in the base 62 number system. All the numbers to 6 trillion can be stored in 43 bits.
To summarise, the length of the URL should be 8 characters long if we want our system to be active for 100 years serving 2000 req/sec.
Generate the output string discussed in the above problem
Now comes the challenging part, how do we design a system so that it can efficiently serve the requests worldwide and generate unique 8-digit numbers at 2000 req/sec.
Use a random string generator to generate a random string. Check the database for this random string. If no such entry is present, use this string as a short URL and enter this mapping into the database.
This solution works perfectly when there is only one server. In our case (since our system needs to work at scale ) there can be multiple servers, so there can be chances of collision. To understand collision let's look at the example below.
For example in the diagram below, both servers - Server 1 and Server 2 get requests for different URLs. Both of them generate the same random string and check for its availability in the database. They don't find the shortened version and assume that it's okay to go ahead with this URL and make an entry in the database. Guess what? Out of these two, for one URL, the data would be corrupted.
One way to mitigate this problem is that, after pushing the entry into the database, the servers should do a get query on the database to fetch the mapping for the randomly generated URL. Then check if the resultant long URL corresponds to the same URL, which was present in the request. If not, then generate a new random number and repeat the same process until the entry is present in the database.
The problem with this approach is that there will be at least one get and one put operation on the database, which will increase the response time of the operation. There can be multiple if the random number generator isn't a good one.
The second approach makes use of a hashing function called MD5. The MD5 hashing function generates an output of length 128 bits. The idea is to input the long URL into the function, get the 128-bit output and take the first 43 bits of the resultant 128-bit string and use it as the short URL. We truncate after 43 bits because 6 trillion can be stored in 43 bits.
Since we are truncating the 128-bit to 43 bits, there can be a chance that 2 different URLs have the same first 43 bits of the generated 128-bit string. To mitigate this, we can check the database if any mapping is already present for the 43-bit and then alter the shortened URL.
The biggest advantage of this method is that for multiple requests having the same long URL we can have a single entry in the database. This will help to reduce a lot of space.
The third approach makes use of a counter. The idea is that there's a micro-service called "counter" which is common to all the servers. The initial count set in the counter will be 0. Whenever a server requests a string, it will return the current number and increment the counter by one.
This seems to be a nice idea - the problem of collision is solved and there are no database calls for getting as was the case for the above methods. But there's a big problem with this approach. There's a single point of failure! If the rate of requests increases, there's a possibility that this counter will go down, and till it re-instantiates, all of the requests in all the servers will fail.
To mitigate this problem of a single point of failure, we can have this micro-service assign a range of numbers to the servers. And the servers can internally use a counter in this range to generate the shortened URL string. When the servers exhaust this range, they can go back to the micro-service and ask it for another range that is unassigned.
For example in the above image, there are 2 servers - Server 1 and Server 2 for URL generation. The micro-service assigns a range of 0-1,000,000 to Server 1, similarly, Server 2 was assigned a range of 1,000,001 to 2,000,000. There are other ranges that are unassigned. These servers are expected to have an internal counter in this range. Whenever a request comes to Server 1, it will check the counter, increment it to 1 and use this string. Whenever any server's range gets exhausted ( i.e. the counter has used all the numbers in the range ), it will ask the micro-service for a new range. Then the micro-service will look for an unused range and assign it to the server.
Any problems with this approach?
Well yes! since the servers are using a counter, it becomes easy to predict what may be the next URL or range of URLs, and this is a security threat. This information can be exploited in various ways. To remove this predictability, the servers can append random 4-5 extra bits to the generated string.
An important point to note while designing this system is that the part of generating the URLs can be slow but the Get operation should be extremely fast. These URLs can be stored in cache or cloud-front for seamless delivery of long URLs from the shorter URLs.
In the comments, let me know your feedback on the above approaches and how you would like to do things differently.
And that's a wrap! Hi, I am Gourav Dhar, a software developer and I write blogs on Backend Development and System Design. Subscribe to my Newsletter and learn something new every week - https://thegeekyminds.com/subscribe