ctci ch9 theory and 2 exercises
# Chapter 8 Recursion and dynamic programming
Questions p146, solutions p353
You can tell a problem is recursive when it can be built off of subproblems. If a problem asks you to compute the nth ..., the first n ..., or all ..., it might be recursive.
## How to approach
# 9.1. Stock data
> You are building a service that will be called by up to 1k client applications to get end-of-day stock price information (open, close, high, low). Assume you already have the data, and you can store it in any format you wish. Design the client-facing service that provides the info to client applications. Design the development, rollout, and ongoing monitoring and maintenance.
The service is not continuous; I assume it will be called on demand by many different clients. There are four parameters per stock, I already have all the data, and it is stored in a way that is efficient to search. The database should have fast lookups; new additions can take time. Because the system should scale to many queries over many stocks, a NoSQL database such as MongoDB is preferable.
The client will see a list of stocks, or can search for a specific stock. The front-end module sends a request to the back-end, which looks up the stock's information in the database and returns the four parameters to the front-end. They are then displayed in an infographic with the opening, highest, lowest, and closing values.
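The lookup flow above can be sketched with a plain dictionary standing in for the real database; the symbols, prices, and function name here are illustrative assumptions, not real data.

```python
# In-memory stand-in for the database; schema is an assumption.
END_OF_DAY = {
    # symbol -> (open, close, high, low)
    "AAPL": (189.10, 191.45, 192.00, 188.70),
    "MSFT": (410.00, 412.30, 415.20, 408.90),
}

def get_stock_info(symbol):
    """Return the four end-of-day parameters for a symbol, or None if unknown."""
    record = END_OF_DAY.get(symbol)
    if record is None:
        return None
    open_, close, high, low = record
    return {"open": open_, "close": close, "high": high, "low": low}
```

The back-end would answer each front-end request with exactly this four-field record and nothing more.
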
For maintenance, there will be two queries to the stock market per day: one at opening time, to get the stock's opening value, and one at closing time, to obtain the closing value, high, and low. In the simplest version, a client who makes a request in the middle of the day will only see the opening value. In a more sophisticated scenario, each time a client selects a stock, regardless of time, the service queries the stock market for the high, low, and last value up to that moment, saves that data in the database, and shows it to the client.
For security, clients can only query a stock; the four parameters are simply returned to them.
## Solution
The book mentions using a SQL database, but it is much heavier than the requirements demand, and it needs an additional layer to view and maintain the data, which increases implementation costs. Clients should not have access to additional information, and if they make expensive and inefficient queries, our database bears the cost.
Another approach is XML: save the stocks, the date, and the four parameters in an XML file. It is easy to distribute, easily read by machines and humans, and most languages have an XML parsing library, so clients can implement it easily. It is also easy to add new data. But this solution sends the clients **all** of the information, even if they only want part of it, and performing queries on the data means parsing the entire file.
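The XML feed might look roughly like this; the element and attribute names are my own assumptions. The snippet also shows the drawback: the client has to parse the whole feed even when it only wants one stock.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed layout; element and attribute names are assumptions.
FEED = """
<stocks date="2016-05-12">
  <stock symbol="AAPL" open="189.10" close="191.45" high="192.00" low="188.70"/>
  <stock symbol="MSFT" open="410.00" close="412.30" high="415.20" low="408.90"/>
</stocks>
"""

def parse_feed(xml_text):
    """Parse the whole feed into {symbol: {param: value}}.
    Note the client must parse everything even to read a single stock."""
    root = ET.fromstring(xml_text)
    return {
        s.get("symbol"): {k: float(s.get(k)) for k in ("open", "close", "high", "low")}
        for s in root.findall("stock")
    }
```
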
# 9.2. Social network
> Design the data structures for a very large social network. Design the algorithm to show the shortest path between two people (me -> Bob -> Susan -> Jason -> You)
This is a graph problem with many nodes and many edges. There is probably a Python library to use, or we can build a hash table with each node as key and its connections as values. Then we run a breadth-first search toward the destination person and stop once we reach them. Once a person is visited, a flag should be set so they are discarded from further exploration; this prevents loops.
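A minimal sketch of the hash-table-plus-BFS idea, using only the standard library (the names are illustrative):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """BFS over an adjacency hash table {person: [friends]}.
    Returns the path as a list, or None if no path exists."""
    if start == goal:
        return [start]
    visited = {start}           # per-search visited flags, prevents loops
    queue = deque([[start]])    # queue of partial paths
    while queue:
        path = queue.popleft()
        for friend in graph.get(path[-1], []):
            if friend in visited:
                continue
            if friend == goal:
                return path + [friend]
            visited.add(friend)
            queue.append(path + [friend])
    return None
```

With the chain from the question, `shortest_path(graph, "me", "You")` walks me -> Bob -> Susan -> Jason -> You.
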
**Bidirectional breadth-first search**: you can also search from the destination toward the origin, but then my hash table data structure doesn't work: each value also needs a list of preceding nodes, i.e. which nodes lead to this node. We then search from both ends; when the two searches collide, we know we have found a path.
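A rough sketch of bidirectional BFS, assuming friendships are mutual so the same adjacency table can be searched from both ends (a directed graph would need the predecessor lists mentioned above):

```python
from collections import deque

def bidirectional_bfs(graph, start, goal):
    """Expand one BFS layer alternately from each end; when the frontiers
    collide we have found a connecting path. Assumes mutual friendships.
    Returns the number of edges on the path found, or None."""
    if start == goal:
        return 0
    dist_s, dist_g = {start: 0}, {goal: 0}
    frontier_s, frontier_g = deque([start]), deque([goal])
    while frontier_s and frontier_g:
        for frontier, dist, other in ((frontier_s, dist_s, dist_g),
                                      (frontier_g, dist_g, dist_s)):
            for _ in range(len(frontier)):   # expand one full layer
                node = frontier.popleft()
                for nb in graph.get(node, []):
                    if nb in other:          # the two searches collided
                        return dist[node] + 1 + other[nb]
                    if nb not in dist:
                        dist[nb] = dist[node] + 1
                        frontier.append(nb)
    return None
```

Each side only explores out to roughly half the path length, which is why this beats a single BFS on large graphs.
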
Bidirectional BFS needs access to both the origin and destination points, which is not always the case. And what about scalability?
If a connection between two people is queried often, it can be saved in a cache.
With millions of users, the whole database cannot be kept on one machine. Instead of keeping a person's friends as objects, we can keep the IDs of where they are stored in the database: which machine and which section.
Jumping from machine to machine is expensive, so we can batch: if five of someone's friends are on a certain machine, we can visit that machine once for all of them. Also, instead of storing people randomly across machines, we can store them based on country of residence.
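The batching idea is just a grouping step before the lookups; how an ID maps to a machine is an assumption here (in a real system it might be derived from the ID itself):

```python
from collections import defaultdict

def batch_by_machine(friend_ids, machine_of):
    """Group friend IDs by the machine that stores them, so each
    machine is visited once per batch instead of once per friend."""
    batches = defaultdict(list)
    for fid in friend_ids:
        batches[machine_of[fid]].append(fid)
    return dict(batches)
```
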
Since at scale many queries run at the same time, we cannot mark nodes as visited globally. We could create a different flag per query, or keep an additional hash table recording whether a node has been visited in this particular query.
Possible problems:
* What happens when a server fails?
* How can you take advantage of caching?
* What if no path is found: do you continue forever, or when do you give up?
* Some people have more friends and therefore more chances of leading to new people. How can you use this information to choose where to start traversing?
# Chapter 9 System design and scalability
Your goal in these problems is to understand use cases, scope a problem, make reasonable assumptions, create a solid design based on those assumptions, and be open about the weaknesses of your design. Do not expect something perfect.
## Handling the questions
1. Communicate: stay engaged with the interviewer and be open about the issues in the system.
2. Go broad first: don't dive straight into the algorithm part or get excessively focused on one component.
3. Use the whiteboard: draw a picture of what you're proposing; it helps the interviewer follow your design.
4. Acknowledge interviewer concerns: don't brush off the interviewer's concerns; validate them, acknowledge them, and make changes accordingly.
5. Be careful about assumptions.
6. State your assumptions explicitly: when you do make assumptions, state them. It allows the interviewer to correct you if you're mistaken, and it shows that you know what assumptions you're making.
7. Estimate when necessary: use other data you know to estimate what you don't.
8. Drive: talk to your interviewer, ask questions, drive the car. Go deeper, make improvements.
## Design: step-by-step
As an example, your manager might ask you to design a system such as TinyURL.
### 1. Scope the problem
Make sure you're building what the interviewer wants, and find out what specifically you're being asked to implement. Do people create their own short URLs or are they auto-generated? Do we keep track of stats? Does the URL stay alive forever? Make a list of the major features or use cases:
* Shortening a URL into a TinyURL
* Analytics for a URL
* Retrieving the URL associated with a TinyURL
* User accounts and link management
### 2. Make reasonable assumptions
|
||||
|
||||
Don't assume you will only deal with 100 users per day, or that you have unlimited memory available. But you can assume you will have a max of one million URLs per day, and you can estimate how much data your system might have to store.
|
||||
|
||||
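As an example of such a back-of-envelope estimate (the per-URL size is an assumption):

```python
# Back-of-envelope storage estimate for the TinyURL example.
urls_per_day = 1_000_000
bytes_per_url = 100            # assumed average: long URL + short code + metadata
bytes_per_year = urls_per_day * 365 * bytes_per_url

print(bytes_per_year / 10**9)  # 36.5 GB per year
```

So even at a million URLs per day, a year of data fits comfortably on one machine, which shapes the rest of the design.
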
Some assumptions require "product sense". It is not acceptable for links to become ready only after ten minutes; users will want them available immediately. But it is OK for stats to take ten minutes to update.
### 3. Draw the major components
Go to the whiteboard and draw a diagram of the major components. You might need a frontend server that pulls data from the backend's data store, other servers that crawl the internet for some data, and another that processes analytics. Draw a picture of what this system might look like, and walk through it from end to end to show the flow.
### 4. Identify the key issues
What are the bottlenecks or major challenges of the system? Some links might be accessed frequently, while others can suddenly peak, and you don't necessarily want to hit the database constantly. The interviewer might provide some guidance; use it.
### 5. Redesign for the key issues
Adjust the system for the key issues. Stay up at the whiteboard and update the diagram. Be open about limitations in your design.
## Algorithms that scale: step-by-step
Sometimes you are asked to design an algorithm, but in a scalable way.
1. Ask questions: the interviewer might have left out details, intentionally or unintentionally.
2. Make believe: pretend that the data fits on one machine and there are no memory limitations. Then solve the problem.
3. Get real: how much data can you store on one machine, and what problems occur when you split the data? How do you logically divide the data, and how does one machine identify where to look up a different piece of data?
4. Solve problems: think about how to solve the issues identified in step 3. The solution for an issue might remove it entirely or simply mitigate it. Work iteratively: once you have solved the problems from step 3, tackle the new problems that arise.
## Key concepts
> Horizontal vs. vertical scaling
* Vertical scaling: increasing the resources of a specific node, for example adding memory to a server to improve its ability to handle load changes.
* Horizontal scaling: increasing the number of nodes, for example adding more servers to decrease the load on each individual server.
> Load balancer
Typically some frontend parts of a scalable website are thrown behind a load balancer, allowing the system to distribute the load evenly so that one server doesn't crash and take down the whole system. To do this, you need a network of cloned servers that all have the same code and access to the same data.
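A toy sketch of the idea with simple round-robin dispatch; real load balancers also track server health and current load, which is not modeled here:

```python
import itertools

class RoundRobinBalancer:
    """Hand out cloned servers in rotation so no single one takes all traffic."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        """Return the next server to receive a request."""
        return next(self._cycle)
```
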
> Database denormalization and NoSQL
Joins in a relational database such as SQL can get very slow as the system grows, so you would generally avoid them. Denormalization is one answer to this: it means adding redundant information to a database to speed up reads, e.g. extra tables that duplicate commonly joined data.
You can also go with a NoSQL database, which does not support joins and structures data in a different way. It is designed to scale better.
> Database partitioning, sharding
It means splitting the data across multiple machines while ensuring you know which data is on which machine.
* Vertical partitioning: partitioning by feature. If you are building a social network, you can have one partition for tables related to profiles and another for messages. If one table gets very large, you might need to repartition that database.
* Key-based / hash-based partitioning: uses some part of the data (an ID, for example) to partition it. For example, allocate n servers and put each piece of data on server mod(key, n). But the number of servers must be fixed, and adding new servers means reallocating the data, which is very expensive.
* Directory-based partitioning: maintain a lookup table for where the data can be found. You can add new servers easily, but the lookup table can be a single point of failure, and constantly accessing it impacts performance.
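A sketch of key-based partitioning, and of why adding a server is expensive: changing n moves most keys to a different server.

```python
def partition(key, n_servers):
    """Key-based partitioning: server index is mod(key, n)."""
    return key % n_servers

# Going from 10 to 11 servers changes the assignment of most keys,
# which is why resizing forces a costly data reallocation.
moved = sum(1 for key in range(1000) if partition(key, 10) != partition(key, 11))
```
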
> Caching
An in-memory cache can deliver very fast results. It pairs keys to values and typically sits between the application layer and the data store. When the application requests a piece of information, it first tries the cache; if the cache does not contain the key, it looks it up in the data store. You can cache a query and its result directly, or cache the specific object.
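The read path just described, sketched with dictionaries standing in for the cache and the data store:

```python
# Dict stand-ins; a real system would use e.g. a cache server and a database.
DATA_STORE = {"user:1": {"name": "Alice"}, "user:2": {"name": "Bob"}}
cache = {}

def get(key):
    """Try the cache first; on a miss, read the data store and fill the cache."""
    if key in cache:
        return cache[key]
    value = DATA_STORE.get(key)
    if value is not None:
        cache[key] = value
    return value
```
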
> Asynchronous processing and queues
Slow operations should ideally be done asynchronously; otherwise the user might get stuck waiting for a process to complete. Sometimes this can be done in advance, as preprocessing. If we are running a forum, we could periodically re-render the page that lists the most popular posts and the number of comments. The list might be slightly out of date, but that is better than a user stuck waiting for the website to load simply because someone added a new comment and invalidated the cached version of the page.
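A toy sketch of moving the re-render off the request path with a worker and a queue; the page content and job format are assumptions:

```python
import queue
import threading

# Readers always get the last pre-rendered copy, possibly slightly stale.
rendered_page = {"html": "<ul></ul>"}
jobs = queue.Queue()

def worker():
    """Re-render the popular-posts page whenever a job arrives."""
    while True:
        posts = jobs.get()
        if posts is None:                  # shutdown sentinel
            break
        items = "".join(f"<li>{title} ({n})</li>" for title, n in posts)
        rendered_page["html"] = f"<ul>{items}</ul>"
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
jobs.put([("Hello", 3)])                   # e.g. a new comment was added
jobs.join()                                # meanwhile readers saw the old copy
jobs.put(None)
t.join()
```
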
> Networking metrics
* Bandwidth: the maximum amount of data that can be transferred per unit of time under the best conditions (e.g. bits/s, GB/s)
* Throughput: the actual amount of data that is transferred
* Latency: how long it takes data to go from one end to the other; the delay between the sender sending information and the receiver receiving it
> MapReduce
A MapReduce program is used to process large amounts of data. It requires a map step and a reduce step. See more on page 642 of the book.
* Map step: takes data and emits <key, value\> pairs
* Reduce step: takes a key and a set of associated values and 'reduces' them in some way, emitting a new key and value
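The two steps can be illustrated with the classic word-count example, run locally in plain Python rather than on a real MapReduce cluster:

```python
from collections import defaultdict

def map_step(document):
    """Emit a <word, 1> pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_step(key, values):
    """Reduce a key's values to a single <key, total> pair."""
    return (key, sum(values))

def map_reduce(documents):
    """Run map, group by key (the framework's shuffle phase), then reduce."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_step(doc):
            grouped[key].append(value)
    return dict(reduce_step(k, v) for k, v in grouped.items())
```

On a real cluster the map and reduce calls run in parallel across machines; the shuffle/grouping in the middle is what the framework provides.
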
## Considerations
* Failures: any part of the system can fail; plan accordingly
* Availability and reliability: availability is the percentage of time the system is operational; reliability is the probability that the system is operational for a certain unit of time
* Read-heavy vs. write-heavy: if an application will write many times, you could queue up the writes (but think about potential failures); if it is read-heavy, cache
* Security: think about security issues and design around them
See the example problem on page 143 of the book.
* Permutations of string with unique characters
* Permutations of string with duplicate characters
## Chapter 9 System design and scalability
* Stock data
* Social network
* Web crawler
## Chapter 10 Sorting and searching
* Sorted matrix search