ctci/09. System design and scala.../9.5. Cache.md

2.6 KiB

9.5. Cache

A web server for a simplified search engine has 100 machines to respond to search queries, which may then call to processSearch(string query) to another cluster of machines. The machine responding to a query is chosen at random. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Explain how to update the cache when data changes

I would implement a stack (chapter 3, stacks and queues) which uses a LIFO system. last in first out. We can keep the last 100 queries, as long as it's not empty we keep filling it, when it's full, a new query comes in, the oldest query is removed, and the new query is added to the stack.

Solution

Assumptions

  • Calling between machines is fast
  • We are caching a lot of queries
  • The most popular queries are very popular, so they are always in the cache

System requirements

  • Efficient lookups given a key
  • Expiration of old data so it can be replaced with new data
  • Quick updating and handling of cache

1. Design a cache for a single system

A linked list allows easy purging of old data, by moving fresh items to the front. A hash table allows efficient lookup f data, but doesn't ordinarily allow easy data purging. We have a linked list where a node is moved is moved to the front each time it's accessed, and the end of the linked list contains the stalest information. Also, we have a hash table mapping from a query to the corresponding node in the linked list.

2. Expand to many machines

  • Each machine has its own cache
    • it's quick, no machine calls, but not effective. many repeat queries are treated as fresh
  • Each machine has a copy of the cache
    • the entire data structure of hash table and linked list is duplicated. Updating the cache means sending data to N machines. because the cache would take up more space, we could store less data
  • Each machine stores a part of the cache
    • when machine i needs to look up the results for a query, it finds which machine has it, and then asks machine j for it. How to know this? The cache can be divided based on some formula hash(query) % N. Machine i finds out machine j has the data, asks machine j for it, and machine j either checks cache or calls the function. then updates its cache.

3. Updating results when content changes

Some pages are so popular they are always cached, we need to refresh the cache either periodically or on demand, when content changes.

  • The content at a url changes, or the page is removed
  • The ordering of results change in response to the rank of a page changing
  • New pages appear related to a particular query

See book page 384 for the solutions for this.