# 9.3. Web crawler
> If you were designing a web crawler, how would you avoid getting into infinite loops?

An infinite loop arises when one page links to another that eventually links back to the first, so the crawler keeps going around the cycle. Following such a loop once or twice is acceptable, but after that the path should be discarded.
I would keep a hash map from urls to the number of times they have been visited; each visit increments the count by one. Once a configured threshold is reached, that url is not visited again.
The map could also record the previous url, i.e. the url the crawler arrived from, so that only specific paths (graph-style) are discarded rather than any url that simply happens to be visited often.
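A minimal sketch of this idea, assuming a `fetch_links(url)` helper (hypothetical, supplied by the caller) that returns a page's outgoing links. Keying the counts on the (previous url, current url) edge is what lets the crawler cut off a looping path without blacklisting a url that is legitimately linked from many places:

```python
from collections import defaultdict, deque

MAX_VISITS = 2  # threshold: allow a looping path a couple of passes, then drop it

def crawl(seed_url, fetch_links):
    # fetch_links(url) -> list of urls is assumed to exist; it is not defined here.
    visit_count = defaultdict(int)      # (from_url, to_url) edge -> times traversed
    queue = deque([(None, seed_url)])   # (previous url, url to visit)

    while queue:
        prev, url = queue.popleft()
        edge = (prev, url)
        visit_count[edge] += 1
        if visit_count[edge] > MAX_VISITS:  # this particular path loops; discard it
            continue
        for link in fetch_links(url):
            queue.append((url, link))
```

The threshold and the traversal order (plain BFS here) are only placeholders; a real crawler would tune both.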
## Hints
> How do you decide whether two pages are the same? By url, by content? Both of these can be flawed. Why?

If the url is the same, the page should usually be the same... unless it was last visited a long time ago and the content has since changed. To handle that, we can store a timestamp of the last visit in the hash map as well and only treat entries younger than some freshness threshold as already seen.
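Sketched under the same assumptions, the visit record can also carry a timestamp, so a url is only re-fetched once a freshness window has passed (the window value below is purely illustrative):

```python
import time

FRESHNESS_SECONDS = 24 * 60 * 60  # illustrative: re-crawl a url at most once a day

last_visited = {}  # url -> timestamp of the last crawl

def should_visit(url):
    now = time.time()
    if url in last_visited and now - last_visited[url] < FRESHNESS_SECONDS:
        return False              # recently crawled; assume the content is unchanged
    last_visited[url] = now
    return True
```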
## Solution
Url alone is flawed: the page `www.careercup.com/page?pid=microsoft-interview-questions` is completely different from `www.careercup.com/page?pid=google-interview-questions`, even though the two urls differ only in a query parameter, so parameters cannot simply be ignored. Yet you can also append url parameters arbitrarily to any url without changing the page, as long as it's not a parameter that the web application recognizes and handles, so the same page can hide behind many different urls.
There is no perfect way to define a "different" page. One way to tackle this is to estimate similarity: if, based on the content and the url, a page is deemed sufficiently similar to pages already crawled, we deprioritize crawling its children instead of discarding them outright. A signature derived from the url and the content can serve as the key for this comparison.
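One way to sketch such a signature, assuming an exact hash of the trimmed url plus the raw content is an acceptable stand-in for similarity (a production crawler would normalize more aggressively and compare fuzzily, e.g. with shingling or simhash):

```python
import hashlib
import heapq
from urllib.parse import urlsplit

def page_signature(url, content):
    # Keep only the parts of the url likely to identify the page, then hash them
    # together with the content; near-duplicate pages produce the same digest.
    parts = urlsplit(url)
    normalized = f"{parts.netloc}{parts.path}?{parts.query}"
    digest = hashlib.sha256()
    digest.update(normalized.encode("utf-8"))
    digest.update(content.encode("utf-8"))
    return digest.hexdigest()

def add_to_frontier(frontier, seen_signatures, url, content):
    # Pages whose signature has been seen before get a lower priority rather
    # than being dropped, matching the "deprioritize" idea above.
    sig = page_signature(url, content)
    priority = 1 if sig in seen_signatures else 0   # 0 = crawl soon, 1 = later
    seen_signatures.add(sig)
    heapq.heappush(frontier, (priority, url))
```

Deprioritizing instead of skipping keeps the crawler from permanently missing a page that only looked like a duplicate.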