Greplin searches:
- Greplin helps you search all your personal information, wherever it is.
- As Michael Arrington of TechCrunch said, we’ve “attacked the other half of search.”
- Greplin supports over a dozen services today, with more added constantly.
Requirements
• Many inserts
• Fewer searches
• Low per-user cost
- Overall, we handle about 50M documents per GB of RAM with median search latencies
around 200ms.
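A quick back-of-the-envelope check of what that density implies. This is only arithmetic on the figure above, assuming 1 GB = 10^9 bytes (with 2^30 bytes the result is about 21.5):

```python
# Rough RAM budget per document implied by "50M documents per GB of RAM".
# Assumption: 1 GB is taken as 10**9 bytes.
docs_per_gb = 50_000_000
bytes_per_gb = 10**9

bytes_per_doc = bytes_per_gb / docs_per_gb
print(f"about {bytes_per_doc:.0f} bytes of RAM per document")
```

At roughly 20 bytes per document, the index itself cannot live in RAM, so most of it has to stay on disk.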
Memory
- Ranking requires pulling a few field values and norms into memory.
- Assuming an optimized index, searching the number of docs we have per machine with
1GB of RAM is impossible without swapping.
- Over the last decade, the trend has been to stop manually managing what goes on disk and
what goes in RAM, instead trusting the operating system’s virtual memory and paging
systems to swap data in/out appropriately.
- For example, the caching HTTP proxy Varnish trusts the OS’s virtual memory, and is thus
significantly simpler and faster than Squid, which tries to manage what belongs in memory
vs. what belongs on disk itself.
- This philosophy has been jokingly summarized as “You’re not smarter than Linus, so don’t
try to be.”
We’re Smarter than Linus!*
* When we cheat
- Many signals (such as user logins) let us predict, better than the OS can, which users are
likely to search.
- By keeping each user’s data in a separate index, we save memory and improve
performance.
- We only keep open IndexSearchers for users who are likely to do searches.
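The idea of keeping searchers open only for likely searchers can be sketched as a small bounded cache. This is a minimal illustration, not Greplin's implementation: `SearcherCache`, `FakeSearcher`, `on_signal`, and the least-recently-used eviction policy are all hypothetical, and `FakeSearcher` merely stands in for a real per-user Lucene IndexSearcher.

```python
from collections import OrderedDict


class FakeSearcher:
    """Hypothetical stand-in for an open per-user IndexSearcher."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.closed = False

    def close(self):
        self.closed = True


class SearcherCache:
    """Keeps at most max_open per-user searchers open (assumed LRU policy)."""

    def __init__(self, open_searcher, max_open):
        self._open_searcher = open_searcher  # callable: user_id -> searcher
        self._max_open = max_open
        self._open = OrderedDict()  # user_id -> searcher, least recent first

    def on_signal(self, user_id):
        """Call when a signal (e.g. a login) suggests a search is likely."""
        return self.get(user_id)

    def get(self, user_id):
        searcher = self._open.pop(user_id, None)
        if searcher is None:
            searcher = self._open_searcher(user_id)
        self._open[user_id] = searcher  # mark as most recently used
        while len(self._open) > self._max_open:
            _, evicted = self._open.popitem(last=False)
            evicted.close()  # release the memory held by the idle user
        return searcher

    def is_open(self, user_id):
        return user_id in self._open
```

A login handler would call `cache.on_signal(user_id)` so the searcher is already warm when the first query arrives; users who have gone quiet are closed as others take their place.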
Other Benefits
• Deletion Filters
• MultiSearcher
• Flush planning
- A user search encompasses a ‘filtered’ view of the RAM Index, the currently flushing index,
plus their disk index.
- We open IndexWriters for each user in turn and flush documents from RAM to disk.
- Interesting cases, including updates and deletions, are handled with temporary filters on the
disk index.
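The three-tier search described above can be modeled with plain lists. This is a toy sketch under stated assumptions: the RAM and flushing indexes are assumed to be shared across users (hence the per-user filter), documents are plain dicts, and `search_user` and `deleted_ids` are hypothetical names rather than Lucene APIs.

```python
def search_user(user_id, term, ram_index, flushing_index, disk_index, deleted_ids):
    """Return ids of one user's documents matching term across all three tiers."""

    def matches(doc):
        return term in doc["text"]

    hits = []
    # Shared in-memory tiers: filter down to this user's documents.
    for tier in (ram_index, flushing_index):
        hits += [d["id"] for d in tier if d["user"] == user_id and matches(d)]
    # Per-user disk index: a temporary filter hides documents that were
    # updated or deleted but whose old copy is still on disk.
    hits += [d["id"] for d in disk_index
             if d["id"] not in deleted_ids and matches(d)]
    return hits


ram = [
    {"id": 3, "user": "alice", "text": "lunch plans updated"},  # update of doc 3
    {"id": 9, "user": "bob", "text": "lunch with bob"},
]
flushing = [{"id": 4, "user": "alice", "text": "lunch receipt"}]
alice_disk = [
    {"id": 1, "user": "alice", "text": "lunch menu"},
    {"id": 2, "user": "alice", "text": "old lunch invite"},  # deleted
    {"id": 3, "user": "alice", "text": "lunch plans"},       # stale copy
]
results = search_user("alice", "lunch", ram, flushing, alice_disk, deleted_ids={2, 3})
```

In this model an update is simply a new copy in RAM plus a filter entry that masks the old disk copy, so each document appears once in the results.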
Amazon Cloud
• Script everything
[Chart: disk throughput in KB/sec (axis 0 to 112,500) for sequential write, sequential read, random read, and random write; values not recoverable]
• Custom MergeScheduler
shaneal@greplin.com
@smanek