ctci/09. System design and scala.../9.5. Cache.md

43 lines
2.6 KiB
Markdown

# 9.5. Cache
> A web server for a simplified search engine has 100 machines to respond to search queries, which may then call to processSearch(string query) to another cluster of machines. The machine responding to a query is chosen at random. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Explain how to update the cache when data changes
I would implement a stack (chapter 3, stacks and queues) which uses a LIFO system. last in first out. We can keep the last 100 queries, as long as it's not empty we keep filling it, when it's full, a new query comes in, the oldest query is removed, and the new query is added to the stack.
## Solution
### Assumptions
* Calling between machines is fast
* We are caching a lot of queries
* The most popular queries are very popular, so they are always in the cache
### System requirements
* Efficient lookups given a key
* Expiration of old data so it can be replaced with new data
* Quick updating and handling of cache
### 1. Design a cache for a single system
A linked list allows easy purging of old data, by moving fresh items to the front. A hash table allows efficient lookup f data, but doesn't ordinarily allow easy data purging. We have a linked list where a node is moved is moved to the front each time it's accessed, and the end of the linked list contains the stalest information. Also, we have a hash table mapping from a query to the corresponding node in the linked list.
### 2. Expand to many machines
* Each machine has its own cache
* it's quick, no machine calls, but not effective. many repeat queries are treated as fresh
* Each machine has a copy of the cache
* the entire data structure of hash table and linked list is duplicated. Updating the cache means sending data to N machines. because the cache would take up more space, we could store less data
* Each machine stores a part of the cache
* when machine i needs to look up the results for a query, it finds which machine has it, and then asks machine j for it. How to know this? The cache can be divided based on some formula hash(query) % N. Machine i finds out machine j has the data, asks machine j for it, and machine j either checks cache or calls the function. then updates its cache.
### 3. Updating results when content changes
Some pages are so popular they are always cached, we need to refresh the cache either periodically or on demand, when content changes.
* The content at a url changes, or the page is removed
* The ordering of results change in response to the rank of a page changing
* New pages appear related to a particular query
See book page 384 for the solutions for this.