2.6 KiB

Raw Permalink Blame History

9.5. Cache

A web server for a simplified search engine has 100 machines to respond to search queries, which may then call to processSearch(string query) to another cluster of machines. The machine responding to a query is chosen at random. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Explain how to update the cache when data changes

I would implement a stack (chapter 3, stacks and queues) which uses a LIFO system. last in first out. We can keep the last 100 queries, as long as it's not empty we keep filling it, when it's full, a new query comes in, the oldest query is removed, and the new query is added to the stack.

Solution

Assumptions

Calling between machines is fast
We are caching a lot of queries
The most popular queries are very popular, so they are always in the cache

System requirements

Efficient lookups given a key
Expiration of old data so it can be replaced with new data
Quick updating and handling of cache

1. Design a cache for a single system

A linked list allows easy purging of old data, by moving fresh items to the front. A hash table allows efficient lookup f data, but doesn't ordinarily allow easy data purging. We have a linked list where a node is moved is moved to the front each time it's accessed, and the end of the linked list contains the stalest information. Also, we have a hash table mapping from a query to the corresponding node in the linked list.

2. Expand to many machines

Each machine has its own cache
- it's quick, no machine calls, but not effective. many repeat queries are treated as fresh
Each machine has a copy of the cache
- the entire data structure of hash table and linked list is duplicated. Updating the cache means sending data to N machines. because the cache would take up more space, we could store less data
Each machine stores a part of the cache
- when machine i needs to look up the results for a query, it finds which machine has it, and then asks machine j for it. How to know this? The cache can be divided based on some formula hash(query) % N. Machine i finds out machine j has the data, asks machine j for it, and machine j either checks cache or calls the function. then updates its cache.

3. Updating results when content changes

Some pages are so popular they are always cached, we need to refresh the cache either periodically or on demand, when content changes.

The content at a url changes, or the page is removed
The ordering of results change in response to the rank of a page changing
New pages appear related to a particular query

See book page 384 for the solutions for this.

2.6 KiB Raw Permalink Blame History