
Thousands of Indexes in the Cloud

Greplin searches:

- Greplin helps you search all your personal information, wherever it is.

- As Michael Arrington of TechCrunch said, we’ve “attacked the other half of search.”

- Greplin supports over a dozen services today, with more added constantly.

Requirements

• Many inserts
• Fewer searches
• Low per-user cost

- We insert up to 5,000 documents/second

- Average document size of 2KB-4KB

- A fully loaded server is an Amazon c1.medium machine responsible for up to 80,000,000 3KB documents.
- Each machine has just 1.7GB of RAM!

- Overall, we handle about 50M documents per GB of RAM with median search latencies
around 200ms.

Memory

• Per doc: 2 longs + 1 int + 1 String (avg 5 letters) into the FieldCache, and an average of 10 norm’d fields/doc
• 27 bytes/doc * 50M docs = 1.3GB

- Ranking requires pulling a few field values and norms into memory.

- For 50M documents, this requires well over 1.3GB of memory.

- Assuming an optimized index, searching the number of docs we have per machine with
1GB of RAM is impossible without swapping.

- We benchmarked using a single-index + swapping: search times were multi-second.
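The back-of-the-envelope arithmetic above can be sanity-checked in a few lines (the 27 bytes/doc figure is taken from the slide; JVM object overhead is deliberately ignored):

```java
public class MemoryEstimate {
    // Rough per-document footprint from the slide: 2 longs (16 B) + 1 int (4 B)
    // + a short String in the FieldCache, plus norms for ~10 fields per doc.
    static final long BYTES_PER_DOC = 27;
    static final long DOCS = 50_000_000L;

    static long totalBytes() {
        return BYTES_PER_DOC * DOCS;
    }

    public static void main(String[] args) {
        // 1,350,000,000 bytes, i.e. roughly 1.3 GB -- more than fits
        // comfortably in a 1.7 GB c1.medium alongside everything else.
        System.out.println(totalBytes() + " bytes");
    }
}
```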


“Virtual memory was meant to make it
easier to program when data was larger
than the physical memory, but people
have still not caught on.”
Poul-Henning Kamp, Varnish architect and coder.
What’s Wrong With 1975 Programming
http://www.varnish-cache.org/trac/wiki/ArchitectNotes

- Over the last decade, the trend has been to stop manually managing what goes on disk and
what goes in RAM, instead trusting the operating system’s virtual memory and paging
systems to swap data in/out appropriately.

- For example, the caching HTTP proxy Varnish trusts the OS’s virtual memory, and is thus significantly simpler and faster than Squid, which tries to manage what belongs in memory vs. what belongs on disk itself.

- This philosophy has been jokingly summarized as “You’re not smarter than Linus, so don’t
try to be.”
We’re Smarter than Linus!*

* When we cheat

- Many signals (such as user logins) let us predict, better than the OS can, which users are likely to search.

- By keeping each user’s data in a separate index, we save memory and improve
performance.

- We only keep open IndexSearchers for users who are likely to do searches.
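One way to sketch the “only keep searchers open for likely searchers” idea is an LRU cache keyed by user, evicting the least-recently-active users. This is an assumption about mechanism — the talk does not show Greplin’s actual code — and in practice `S` would be a Lucene IndexSearcher:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-user searcher cache: only the N most recently active
// users keep an open searcher; everyone else is evicted (and, in real
// code, their IndexSearcher would be closed on eviction).
public class SearcherCache<S> {
    private final int maxOpen;
    private final Map<String, S> open;

    public SearcherCache(int maxOpen) {
        this.maxOpen = maxOpen;
        // accessOrder=true makes iteration order least-recently-used first.
        this.open = new LinkedHashMap<String, S>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, S> eldest) {
                return size() > SearcherCache.this.maxOpen;
            }
        };
    }

    // Called on signals like a user login: warm this user's searcher.
    public void touch(String userId, S searcher) {
        open.put(userId, searcher);
    }

    public S get(String userId) {
        return open.get(userId); // null if this user's searcher is closed
    }

    public int openCount() {
        return open.size();
    }
}
```

A cache like this keeps memory bounded regardless of total user count, which is what makes thousands of per-user indexes affordable on a 1.7GB machine.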

Other Benefits

• tar -cvzf user.tar.gz user && mv user.tar.gz
• du -h
• Smaller ‘corruption domain’

By keeping each user’s index separate, we can:

- more easily move users between servers

- figure out their space usage

- ensure index corruption affects only one user


RAM Index

• Deletion Filters
• MultiSearcher
• Flush planning

- Inspired by Zoie (http://sna-projects.com/zoie/)

- All incoming documents are first added to a RAM Index.

- A user search encompasses a ‘filtered’ view of the RAM Index, the currently flushing index,
plus their disk index.

- When the RAM index is ‘full’ we create a new RAM index.

- We open IndexWriters for each user in turn and flush documents from RAM to disk.

- Interesting cases, such as updates and deletions, are handled with temporary filters on the disk index.
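A minimal sketch of the double-buffered flow described above — class and method names are illustrative, strings stand in for documents, and the real system uses Lucene RAM and disk indexes searched together via a MultiSearcher, with flushing done per-user in the background:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative double-buffered index: new documents land in a RAM buffer;
// when it fills, it becomes the "flushing" buffer and a fresh RAM buffer
// starts; a search spans disk + flushing + RAM views.
public class BufferedIndex {
    private final int ramLimit;
    private List<String> ram = new ArrayList<>();
    private List<String> flushing = new ArrayList<>();
    private final List<String> disk = new ArrayList<>();

    public BufferedIndex(int ramLimit) {
        this.ramLimit = ramLimit;
    }

    public void add(String doc) {
        ram.add(doc);
        if (ram.size() >= ramLimit) {
            // The full RAM buffer becomes the flushing buffer and a new
            // RAM buffer is created, so inserts never block on the flush.
            flushing = ram;
            ram = new ArrayList<>();
            flush(); // real code does this in the background, per user
        }
    }

    private void flush() {
        disk.addAll(flushing); // stand-in for per-user IndexWriter flushes
        flushing = new ArrayList<>();
    }

    // Every document is visible in exactly one of the three views,
    // so a search over all of them misses nothing.
    public List<String> searchAll() {
        List<String> all = new ArrayList<>(disk);
        all.addAll(flushing);
        all.addAll(ram);
        return all;
    }
}
```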

Amazon Cloud
• Script everything

• XFS+LVM expandability and snapshots are helpful

• Some pain is unavoidable
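As an illustration of why XFS+LVM expandability and snapshots help here — the volume-group and mount names below are made up — growing an index volume online and snapshotting it for backup looks roughly like:

```shell
# Grow the logical volume backing the indexes by 50 GB, then grow
# the filesystem; XFS can be grown while mounted, so no downtime.
lvextend -L +50G /dev/vg0/indexes
xfs_growfs /mnt/indexes

# Take an LVM snapshot for a consistent point-in-time backup
# of the index data.
lvcreate --snapshot --size 5G --name indexes-snap /dev/vg0/indexes
```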


EBS Performance

[Chart: throughput in KB/sec (0–150,000) for sequential write, sequential read, random read, and random write, comparing a single EBS volume, RAID10 EBS, instance store, and RAID0 EBS.]

More info at: http://tech.blog.greplin.com/aws-best-practices-and-benchmarks


Other Cool Stuff
• ‘kill -9’ at any time with no data loss, via a Protocol Buffer Write-Ahead Log

• Detect duplicate documents with Bloom Filter

• Dynamically sized SoftReference Cache

• Custom MergeScheduler

• Custom FieldCache for multi-valued or sparse fields

• Efficient result clustering and faceting
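The Bloom-filter duplicate detection mentioned above can be sketched as follows (the sizing and hash mixing here are illustrative, not Greplin’s implementation; a production filter would size itself from the expected document count and target false-positive rate):

```java
import java.util.BitSet;

// Minimal Bloom filter for duplicate-document detection: membership
// queries may return false positives but never false negatives, so a
// "not seen" answer is always safe to trust.
public class DedupeFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public DedupeFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from the key's hash (illustrative mixing).
    private int index(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public void add(String docId) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(docId, i));
        }
    }

    // True means "possibly seen before"; false means "definitely new".
    public boolean mightContain(String docId) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(docId, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because false negatives are impossible, a negative answer lets an insert skip the expensive duplicate check entirely, which matters at 5,000 documents/second.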


Some of this is open source: https://github.com/Greplin


Questions?
Suggestions?

Robby Walker
Shaneal Manek

shaneal@greplin.com
@smanek

We’re hiring: http://www.greplin.com/jobs
