# 9.3. Web crawler

> If you were designing a web crawler, how would you avoid getting into infinite loops?

An infinite loop happens when one page links to another and that page links back to the first, so the crawler keeps bouncing between them. We can allow such a loop to be followed once or twice, but after that the path should be discarded.

I would create a hash map from URL to the number of times it has been visited; each visit increments the count by one. Set a threshold, and once a URL reaches it, do not visit that URL again.

The hash map could also include the previous URL, i.e. the URL we are visiting from, so that only specific paths (graph-style) are discarded rather than every URL that happens to be visited often.

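A minimal sketch of the visit-count idea, assuming a caller-supplied `fetch_links(url)` helper that returns a page's outgoing links (that helper is not part of the original notes):

```python
from collections import defaultdict

VISIT_THRESHOLD = 2  # assumption: a URL may be visited at most this many times

def crawl(seed_url, fetch_links):
    """Breadth-first crawl that stops revisiting a URL once the threshold is hit."""
    visit_count = defaultdict(int)  # url -> how many times we have visited it
    queue = [seed_url]
    while queue:
        url = queue.pop(0)
        if visit_count[url] >= VISIT_THRESHOLD:
            continue  # likely part of a loop; drop this path
        visit_count[url] += 1
        queue.extend(fetch_links(url))
```
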
## Hints

> How do you define whether two pages are the same? By URL, by content? Both of these can be flawed; why?

If the URL is the same, the page should be the same... unless it was visited a long time ago and its content has changed since. Then we could include a time threshold in the hash map as well.

## Solution

The page `www.careercup.com/page?pid=microsoft-interview-questions` is very different from `www.careercup.com/page?pid=google-interview-questions`, yet the URLs differ only in a query parameter. Conversely, you can append parameters to a URL arbitrarily without changing the page, as long as the web application does not recognize and handle them. So the URL alone does not reliably identify a page.

There is no perfect way to define a "different" page. One way to tackle this is to estimate similarity: if, based on the content and the URL, a page is deemed sufficiently similar to pages already seen, we deprioritize crawling its children. We can create a signature based on the URL and the content.

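A rough sketch of what such a signature could look like; the normalization rules (dropping the query string, hashing the raw content) are assumptions, not the book's definition:

```python
import hashlib
from urllib.parse import urlsplit

def page_signature(url, content):
    """Signature built from the URL minus its query string plus a content digest."""
    parts = urlsplit(url)
    normalized_url = f"{parts.scheme}://{parts.netloc}{parts.path}".lower()
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return hashlib.sha256(f"{normalized_url}|{digest}".encode("utf-8")).hexdigest()
```

Pages whose signatures collide would be treated as likely duplicates, and their children crawled with lower priority.
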
# 9.4. Duplicate URLs

> Given 10 billion URLs, how do you detect duplicate documents? Assume duplicate means that the URLs are identical.

10 billion URLs is a lot of data to keep on just one machine (at, say, 100 characters per URL, it is on the order of terabytes). To start with a simple version, assume we can: we create a hash table where each URL maps to true once it has been found. When that does not fit in memory, there are two solutions.

## Solution 1: disk storage

We make two passes over the data. First we split the list of URLs into 4000 chunks of about 1 GB each: each URL u is written to a file named x.txt, where x = hash(u) % 4000, so all URLs with the same hash value end up in the same file. In the second pass we apply the simple solution from before to each file: load it into memory, build a hash table, and look for duplicates.

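A small sketch of both passes, assuming the URLs arrive one per line in a file called `urls.txt` (the chunk file names and the 4000-bucket count follow the text above):

```python
import hashlib

NUM_CHUNKS = 4000

def bucket(url):
    # stable hash so both passes agree on which chunk a URL belongs to
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % NUM_CHUNKS

def scatter(input_path="urls.txt"):
    """Pass 1: write every URL to its chunk file; duplicates land in the same file."""
    with open(input_path) as source:
        for line in source:
            url = line.strip()
            with open(f"{bucket(url)}.txt", "a") as chunk:
                chunk.write(url + "\n")

def find_duplicates():
    """Pass 2: each chunk fits in memory, so an in-memory set spots the duplicates."""
    duplicates = []
    for x in range(NUM_CHUNKS):
        seen = set()
        try:
            with open(f"{x}.txt") as chunk:
                for line in chunk:
                    url = line.strip()
                    if url in seen:
                        duplicates.append(url)
                    seen.add(url)
        except FileNotFoundError:
            continue
    return duplicates
```
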
## Solution 2: multiple machines

Instead of writing chunk x to a file, we send each URL to machine x = hash(u) % 4000. This parallelizes the work so that all 4000 chunks are processed at the same time. The downside is that we now need many machines, which may not be realistic, and we have to consider how to handle machine failure.

# 9.5. Cache

> A web server for a simplified search engine has 100 machines to respond to search queries, which may then call processSearch(string query) on another cluster of machines. The machine responding to a query is chosen at random. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Explain how to update the cache when data changes.

I would keep the most recent 100 queries in a queue (chapter 3, stacks and queues), which uses a FIFO system, first in first out. While it is not full we keep adding queries; when it is full and a new query comes in, the oldest query is removed and the new one is added.

## Solution

### Assumptions

* Calling between machines is fast
* We are caching a lot of queries
* The most popular queries are very popular, so they are effectively always in the cache

### System requirements

* Efficient lookups given a key
* Expiration of old data so it can be replaced with new data
* Quick updating and replacement of cached data

### 1. Design a cache for a single system

A linked list allows easy purging of old data by moving fresh items to the front. A hash table allows efficient lookup of data, but doesn't ordinarily allow easy purging. We combine both: a linked list in which a node is moved to the front each time it is accessed, so the end of the list always holds the stalest entry, plus a hash table mapping each query to its corresponding node in the list.

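A minimal single-machine sketch of this structure (the class and method names are my own, not the book's):

```python
class Node:
    """Doubly linked list node holding one cached query and its results."""

    def __init__(self, query, results):
        self.query, self.results = query, results
        self.prev = self.next = None


class QueryCache:
    """Hash table for O(1) lookup plus a list ordered freshest (head) to stalest (tail)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lookup = {}  # query -> Node
        self.head = self.tail = None

    def _unlink(self, node):
        if node.prev:
            node.prev.next = node.next
        if node.next:
            node.next.prev = node.prev
        if node is self.head:
            self.head = node.next
        if node is self.tail:
            self.tail = node.prev
        node.prev = node.next = None

    def _push_front(self, node):
        node.next = self.head
        if self.head:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node

    def get(self, query):
        node = self.lookup.get(query)
        if node is None:
            return None
        self._unlink(node)      # accessing a query makes it the freshest entry
        self._push_front(node)
        return node.results

    def put(self, query, results):
        if query in self.lookup:
            self._unlink(self.lookup.pop(query))
        elif len(self.lookup) >= self.capacity and self.tail is not None:
            stale = self.tail   # evict the least recently used query
            self._unlink(stale)
            del self.lookup[stale.query]
        node = Node(query, results)
        self._push_front(node)
        self.lookup[query] = node
```

On every `get` or `put` the query moves to the head of the list; when the cache is full, the tail, i.e. the least recently used query, is evicted.
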
### 2. Expand to many machines

* Each machine has its own cache
  * Quick, with no calls between machines, but not very effective: the same query hitting different machines is treated as fresh each time
* Each machine has a copy of the cache
  * The entire hash table and linked list is duplicated on every machine, and updating the cache means sending the data to N machines. Because each copy takes up the full space, we can store fewer distinct queries
* Each machine stores a part of the cache
  * When machine i needs the results for a query, it works out which machine owns them and asks that machine, j. How does it know? The cache can be divided by a formula such as hash(query) % N. Machine i determines that machine j owns the query and asks it; machine j either answers from its cache or calls processSearch, then updates its cache (a sketch of this routing follows the list)
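A tiny sketch of that routing decision; `ask_machine` and `search_local` are caller-supplied stand-ins (assumptions) for the remote call and for the local cache / processSearch path:

```python
NUM_MACHINES = 100  # from the problem statement

def owner_of(query):
    """The cache is partitioned: the machine with this index owns the query."""
    return hash(query) % NUM_MACHINES

def route(query, my_id, ask_machine, search_local):
    """Machine my_id received the query; forward it to the owner or handle it here."""
    j = owner_of(query)
    if j != my_id:
        return ask_machine(j, query)   # remote call to machine j
    return search_local(query)         # check own cache, else call processSearch
```
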
### 3. Updating results when content changes

Some queries are so popular that their results are effectively always cached, so we need to refresh the cache either periodically or on demand when the underlying content changes. Results can become stale because:

* The content at a URL changes, or the page is removed
* The ordering of results changes in response to the rank of a page changing
* New pages appear that are relevant to a particular query

See book page 384 for the solutions to this.

# 9.6. Sales rank

> A large ecommerce company wants to list the best-selling products, overall and by category. One product might be the #1056th best-selling product overall but the #13th under "sports equipment" and #24th under "safety". Describe how you would implement the system.

I am assuming I am building the storage system, the backend, and that we get notified whenever a product is added, sold, or deleted.

Each product holds several rank values: one per category it belongs to, plus the overall ranking. Because there are many products, they might be stored on different machines. In the beginning, each product is ranked and assigned a number in each of its categories. A central machine, which holds the list of products for each category, coordinates the rest of the machines.

For an edit (a ranking change):

When the ranking changes within category X, the products of category X (whose list is kept on the central machine) are visited and their rankings updated. The rankings could be saved in a linked list so that only the affected range of positions needs renumbering: if a product climbs from position 15 to 5, only positions 5 through 15 change, not every product. If a product falls from 4 to 24, the products previously at positions 5 through 24 each move up by one, the product previously at position 4 takes rank 24, and the rest are unchanged. A small sketch of this range-limited renumbering follows below.

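A small sketch of that renumbering, using a plain Python list in place of the linked list; the function name and the explicit `ranks` map are my own:

```python
def move_product(ranking, ranks, product_id, new_pos):
    """ranking: product ids ordered best-first; ranks: product id -> 1-based rank.
    Moving one product only renumbers the slice between its old and new positions."""
    old_pos = ranking.index(product_id)
    ranking.pop(old_pos)
    ranking.insert(new_pos, product_id)
    lo, hi = min(old_pos, new_pos), max(old_pos, new_pos)
    for pos in range(lo, hi + 1):   # products outside this slice keep their ranks
        ranks[ranking[pos]] = pos + 1

ranking = ["a", "b", "c", "d", "e"]
ranks = {p: i + 1 for i, p in enumerate(ranking)}
move_product(ranking, ranks, "d", 1)  # only ranks 2..4 change; "a" and "e" keep theirs
```
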
For a new addition:

For the category of the new product plus the overall category, the ranking of the new product is calculated and the surrounding products have their rankings shifted accordingly.

For deletion:

For the deletion of product X with ranking Y, the linked-list node at position Y-1 simply points to the node at Y+1 instead of to Y, and the rankings of the following elements are decreased by one.

For lookup:

Locate the product via the central machine, access it, and read off its rankings.

## Hints

> What is the expectation on availability and accuracy?

I would say both are important, but accuracy for the products at the top of the list matters more than for the products near the bottom.

> Purchases occur very frequently, so you should limit database writes.

We can write to the database only every 50, 100 or 1000 purchases, batching the updates.

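A sketch of that batching idea; the batch size of 100 and the caller-supplied `write_to_db` function are assumptions:

```python
BATCH_SIZE = 100  # assumption: flush aggregated counts every 100 purchases

class PurchaseBuffer:
    """Aggregate purchase counts in memory and write them to the database in batches."""

    def __init__(self, write_to_db):
        self.write_to_db = write_to_db  # persistence function supplied by the caller
        self.counts = {}
        self.pending = 0

    def record(self, product_id):
        self.counts[product_id] = self.counts.get(product_id, 0) + 1
        self.pending += 1
        if self.pending >= BATCH_SIZE:
            self.write_to_db(self.counts)  # one write instead of BATCH_SIZE writes
            self.counts.clear()
            self.pending = 0
```
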
> Where would it be appropriate to cache data or queue up tasks?

When purchases are very frequent, we only write to the database periodically (say, every minute), queueing up the updates in between. We can also cache the information of the most frequently sold products, similar to the cache in 9.5.

## Solution

Scope the problem: should we list the ranking for the past week, the past month, or all time? Discuss this with the interviewer.

As an assumption, the ranking accuracy of the more popular products matters more than that of the less popular ones.

The book's solution goes for a SQL-type database; see book pages 397-400.

# 9.7. Personal financial manager

> Design a personal financial manager, which connects to your bank accounts, analyzes your spending habits, and makes recommendations.

Assumptions, two possible modes of operation:

* The program is a browser extension that runs in the background: every time you open your bank account, the service is activated and the data is updated in real time (same on mobile).
* Otherwise, it is a program that the user opens when they want to see their data; the app then connects to the bank, processes the new data, and shows metrics or gives recommendations.

We choose option 2.

The user will have to provide the bank details somehow, through an API or some other secure connection. The program keeps a hidden timestamp recording when it last fetched data. On the first run, the program fetches all of the bank history to date and saves the current timestamp. On each subsequent run it fetches only the data between the saved timestamp and now, to avoid re-fetching everything every time. The program can also offer a 'Refresh' button to fetch new data on demand, in which case the timestamp is updated as well.

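A minimal sketch of that timestamp bookkeeping; `STATE_FILE` and the caller-supplied `fetch_transactions(since)` bank connection are assumptions:

```python
import json
import time

STATE_FILE = "sync_state.json"  # assumed location of the last-fetch timestamp

def load_last_sync():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["last_sync"]
    except (FileNotFoundError, KeyError, ValueError):
        return 0  # first run: fetch the full history

def sync(fetch_transactions):
    """Fetch only the transactions newer than the previous run, then save the time."""
    since = load_last_sync()
    transactions = fetch_transactions(since)
    with open(STATE_FILE, "w") as f:
        json.dump({"last_sync": time.time()}, f)
    return transactions
```
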
Ideally, the fetched data comes in some readable format such as JSON or XML. The service parses it and keeps the information locally. Given that this program serves a single user, not many transactions are expected per day; we can assume no more than about 20. For security reasons, this data can stay on the user's PC or phone in encrypted form.

There can be a list of all the user's accounts, and at each fetch the new transactions are appended to it. NLP tools or simple regex rules could categorize each transaction (groceries, transport, salary income, and so on). A visualization function can then show a graph of money flow over time, overall and per category.

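A toy sketch of the regex-based categorization; the keyword rules and category names are assumptions that a real system would refine per user:

```python
import re

# Assumed keyword rules; a real system would learn or let the user edit these.
CATEGORY_RULES = [
    (re.compile(r"grocery|supermarket", re.I), "Groceries"),
    (re.compile(r"uber|taxi|metro|bus", re.I), "Transport"),
    (re.compile(r"salary|payroll", re.I), "Income"),
]

def categorize(description):
    """Return the first matching category for a transaction description."""
    for pattern, category in CATEGORY_RULES:
        if pattern.search(description):
            return category
    return "Uncategorized"
```
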
We can use traditional ML models such as decision trees or random forests for predictive analysis of future expenses. The user can set a monthly budget per account or category, and the system can notify them when spending approaches the limit.

## Hints

> Try to reduce unnecessary DB queries. If you don't need to permanently store the data in the DB, you might not need it in the DB at all.

Does that mean fetching all the data every time the user asks for it? That would be a lot of data each time.

> As much work as possible should be done asynchronously.

## Solution

The user should be able to correct a category when it is assigned improperly, and the categorizer would learn from the correction. Notifications could be sent by email, either on a schedule or when some limit is reached, rather than only when the user opens the program.

Another assumption is that updates don't have to be instantaneous; a delay of up to 24 hours could be acceptable, so we can pull the data periodically. For the asynchronous work, we can queue up tasks, each with a priority. Low-priority tasks are delayed but not starved: even if new higher-priority tasks keep arriving, the low-priority ones are eventually executed as well. A small sketch of such a queue follows below.

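One way to get that "eventually runs" behaviour is to age waiting tasks; the class and the aging rate below are assumptions, not part of the book's solution:

```python
import time

class AgingTaskQueue:
    """Lower priority numbers run first, but waiting tasks gain priority over time,
    so low-priority work is delayed rather than starved."""

    AGING_RATE = 1.0  # assumed: priority points gained per second of waiting

    def __init__(self):
        self._tasks = []  # list of (priority, enqueue_time, task)

    def push(self, priority, task):
        self._tasks.append((priority, time.time(), task))

    def pop(self):
        now = time.time()

        def effective(entry):
            priority, enqueued, _ = entry
            return priority - self.AGING_RATE * (now - enqueued)

        best = min(self._tasks, key=effective)  # raises ValueError if the queue is empty
        self._tasks.remove(best)
        return best[2]
```
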
The book's solution also includes some tips on categorizing transactions.