When it comes to the collection and analysis of massive volumes of log data, there are really only two fighters worthy of facing off in the ring: Splunk and Elastic.

When it comes to trash-talking their competition, Splunk is clearly the more vocal of the two. From sharing presentations on their website, to offering TCO-oriented workshops, they desperately want to convince you that open-source is prohibitively expensive to own and operate.

In a manner typical of such ramblings, they present Elastic in a worst-case scenario as if it were the norm. Look, I get it… they are trying to sell their wares. I just find it unfortunate that they can’t be more honest as they do so.

In this series of posts, we will take a closer look at Splunk’s claims about Elastic. We will see how these claims measure up to the real-world facts that we live and breathe every day with our customers.

Servers and Storage - How much do you really need?

Splunk’s criticism of Elastic starts with storage and, by extension, the number of servers that are required. We will take a look at their claims and compare them with real-world data from a deployment of our sýnesis™ Security Analytics solution.

Splunk’s presentation from .conf2017 uses an example of a 297-character message from a web server access log.

          Original Log    Indexed Size    % of Original
Splunk    297 bytes       149 bytes       50%
Elastic   297 bytes       1,090 bytes     367%

Doing the math, Splunk’s claim is that Elastic requires roughly 7.3 times as much storage for the same original log data. We took a look at real-world data from various customer deployments, and we are convinced that Splunk’s math is off by a considerable amount.

One of the many sources supported out-of-the-box by sýnesis™ Security Analytics is Fortinet’s FortiOS-based solutions, such as the FortiGate Next-Generation Firewalls. This support includes FortiOS’s native syslog messages, CEF-formatted messages, as well as NetFlow data.

A typical FortiOS log message looks like this:

<189>date=2018-05-04 time=15:06:22 devname=fw1 devid=FGT30D3X15021113 logid=0001000014 type=traffic subtype=local level=notice vd=root srcip=192.168.1.249 srcname="localhost" srcport=8001 srcintf="lan4" dstip=224.0.0.7 dstport=8001 dstintf=unknown-0 sessionid=5053785 proto=17 action=deny policyid=0 dstcountry="Reserved" srccountry="Reserved" trandisp=noop service="udp/8001" app="udp/8001" duration=0 sentbyte=0 rcvdbyte=0 sentpkt=0 appcat="unscanned" crscore=30 craction=131072 crlevel=high devtype="Media Streaming" osname="TV" mastersrcmac=14:bb:6e:30:7b:96 srcmac=14:bb:6e:30:7b:96

Based on an investigation of gigabytes of raw data, the average size of these FortiOS logs works out to 590 bytes per message.

The Elasticsearch REST API provides rich statistics about all aspects of its inner workings and the data it contains. Querying statistics for an index containing only such FortiOS logs returns the following:

"docs": {
    "count": 634847
},
"store": {
    "size_in_bytes": 444499743
},
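For reference, this excerpt comes straight from the index stats API. The request is a one-liner (the index name here is just an example), and dividing size_in_bytes by count gives the average on-disk footprint per document:

GET /fortios-syslog-2018.05/_stats/docs,store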

According to Splunk, indexing a 590 byte log into Elasticsearch should consume 2,165 bytes. However, the data clearly shows that Elasticsearch consumes only about 700 bytes per log! Instead of requiring 1.37 GB to store 634,847 log messages, as Splunk would claim, we clearly see that only a little under 445 MB is needed. The data doesn’t lie.

WAIT A MINUTE!!! Surely there is some trickery going on here!!!

Not at all! But let’s dig a little deeper and you will see for yourself.

Splunk points out that to minimize storage requirements, it is possible to “optimize” Elasticsearch and the incoming data. They go on to list many of these options, and the many supposed “consequences” of using them. We believe in full disclosure. So let’s dig into each point and see which were at play in our real-world FortiOS example, and what their use really means.

For each optimization Splunk mentions, here is whether we actually use it and what the real-world results look like.

Delete original message field

Not only do our solutions keep the original message, we duplicate it. The original is indexed as an analyzed text field, while the copy is stored as a keyword.

This method ensures that both full-text search and aggregations are possible. The result is that users have the full functionality they expect from Elasticsearch and Kibana.
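As a rough sketch of what this looks like in a mapping (the index and field names are illustrative, and the single mapping type reflects Elasticsearch 6.x):

PUT /fortios-syslog-2018.05
{
    "mappings": {
        "doc": {
            "properties": {
                "message": {
                    "type": "text",
                    "fields": {
                        "keyword": { "type": "keyword", "ignore_above": 1024 }
                    }
                }
            }
        }
    }
}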

Disable the _all field

The _all field was removed entirely as of Elasticsearch 6.0, and even before that it could usually be disabled without much consequence.

By keeping the original message field as analyzed text (as mentioned above), full-text search capabilities are maintained and the _all field is not missed.
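For anyone still running a 5.x cluster, where the _all field does exist, disabling it is a one-line mapping option (index and type names illustrative):

PUT /old-syslog-index
{
    "mappings": {
        "doc": {
            "_all": { "enabled": false }
        }
    }
}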

Disable the _source field

We keep the _source field as it ensures that many operational tasks are easily performed. The impact is minimal, and the benefits (update API, reindex API and highlighting) make it well worth a little extra storage space.
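The reindex API is a good example of why keeping _source is worth it, since reindexing works by reading each document’s _source. A minimal sketch, with illustrative index names:

POST /_reindex
{
    "source": { "index": "fortios-syslog-2018.05" },
    "dest": { "index": "fortios-syslog-2018.05-v2" }
}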

Set optimal index/analyze options for each source

I think that Splunk may feel this is a big challenge, perhaps because of the way Splunk works, but we don’t see it as difficult with Elastic.

All sýnesis™ Solutions are designed to leverage the KOIOS Data Model, a common schema that ensures users enjoy a seamlessly integrated UI and analytics experience.

Because the schema is optimized once, every new integration is automatically optimized as well; all that remains is a straightforward mapping of the new source’s data into the model.
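One way to apply such a schema once across every matching index is an index template. Here is a simplified sketch (the template name and index pattern are illustrative, and the two fields are taken from the FortiOS log above rather than from the actual KOIOS schema):

PUT /_template/synesis-logs
{
    "index_patterns": ["fortios-syslog-*"],
    "order": 0,
    "mappings": {
        "doc": {
            "properties": {
                "srcip": { "type": "ip" },
                "sentbyte": { "type": "long" }
            }
        }
    }
}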

Use best_compression option

Of course we use this option!!! We do so because there is absolutely no reason not to. The primary drivers of Elasticsearch CPU utilization are queries, and log analytics is primarily an ingest-heavy workload.

Of course it is necessary to query the data for visualizations and analytics. However, typical logging use-cases do not require the same complex, resource-intensive queries as website search and the other areas where Elasticsearch is used.

In almost every logging, metrics and network flow environment that we have seen, there has been abundant CPU to spare. Why not use it for compression? I can’t think of any reason not to.

The real bottleneck is always storage performance (also true for Splunk). Using compressed storage means less data to write and read, which means fewer IOPS. Trading a little extra CPU utilization for a lower IOPS load is a worthwhile trade, especially when you have CPU to spare.
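Enabling it is a single index setting, typically baked into an index template so that every new index picks it up automatically (names illustrative):

PUT /_template/synesis-logs-compression
{
    "index_patterns": ["fortios-syslog-*"],
    "settings": {
        "index.codec": "best_compression"
    }
}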

Speaking of functionality… Splunk’s example shows that significant additional data was added to the original message indexed by Elasticsearch. They argue that this extra information bloats the storage requirements significantly and is a weakness of the Elastic Stack.

Splunk and Elastic take different approaches to data ingestion and rendering. Splunk ingests raw data almost completely unchanged, and schema isn’t applied until data is rendered. This ensures that Splunk can efficiently ingest large amounts of data without much consideration beforehand for how that data will be used. Splunk’s Search Processing Language (SPL) is leveraged at query time to parse, format and enrich data so that it can be rendered on dashboards and further analyzed. They call this “schema on read”.

In contrast, Elastic follows a “schema on write” paradigm. As data is ingested it is parsed, formatted and enriched prior to being indexed by Elasticsearch. Of course this approach means that considerations must be made about how the data will be used as data source integrations are developed. Any later schema changes could require historical data to be reindexed.
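To make that concrete, here is a deliberately simplified sketch of parsing and enrichment at ingest time, expressed as an Elasticsearch ingest pipeline (the real FortiOS processing handles quoted values, type conversions and much more, and the geoip processor assumes the ingest-geoip plugin is installed):

PUT /_ingest/pipeline/fortios-enrich
{
    "description": "Simplified schema-on-write example: split key=value pairs and add GeoIP data",
    "processors": [
        { "kv": { "field": "message", "field_split": " ", "value_split": "=" } },
        { "geoip": { "field": "srcip", "target_field": "src_geoip" } }
    ]
}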

Both approaches have advantages and disadvantages. Elastic will index more data and require more storage for that data. However, querying and rendering data is blazingly fast, as additional processing is minimal (mostly limited to aggregations). Splunk does have lower storage requirements (although I have shown that this is not as significant as they want you to believe). You will, however, pay for this storage efficiency with significantly longer query times and slower rendering of dashboards. We have seen this consistently with identical datasets. We also have some benchmarking planned for the future that will quantify this in greater detail.

The question that you have to ask yourself is which do you value more? Our customers generally agree: If they have a major outage or are experiencing a critical cyber-attack, having instant access to their data is far more valuable than saving a bit of cash on a little extra storage.

I discussed this point with a Splunk Solutions Architect a while back, and he pointed out that there are options to do parsing, formatting and enrichment of the data as it is ingested. However, he added that doing so meant “doubling the store size”, which kind of shoots a big hole in their “Elastic uses too much disk” argument. Perhaps this is why you don’t hear them speak of it much.

So let’s revisit our FortiOS log. Since sýnesis™ Security Analytics is built on the Elastic Stack we apply schema on write, and we do a lot of enrichment of the data. This includes:

  • GeoIP information
  • Autonomous System information
  • IP Reputation information
  • DNS lookups
  • Client/Server detection (from src/dst)
  • User-friendly Protocol and Service names
  • Established Connection detection
  • TCP Flag decoding
  • Detection and tagging for various traffic types
  • Plus breaking out all of the vendor-specific information in the logs

As I said, we do A LOT of enrichment. Yet we still average only 700 bytes to store what was originally a 590-byte log. Based on what Splunk claims, they would need only 295 bytes to store this log.

Elastic does require more storage for the same volume of logs. As we have shown, it is about 2.4 times as much… NOT the 7.3 multiplier claimed by Splunk!

However, what you gain for the price of that storage is blazing fast queries, dashboards and analytics, when compared to what can be delivered with Splunk’s schema on read approach.

Let’s finish up by looking at the number of servers required.

Splunk takes the scenario of ingesting 1 TB of raw logs per day, and then uses the example of Cloud-based Elasticsearch offerings to claim that you need at least 635 servers to handle that load. The problem with this approach is that none of the Cloud offerings they mention are built for logging use-cases. They are either a compromise between the needs of logging and search use-cases, or are simply search-focused offerings. This is an apples-to-oranges comparison, and completely irrelevant to the proper design of a large-scale logging architecture.

Splunk was actually on the right track when they described the “sweet spot” server as 8 cores, 64 GB of RAM and 6 TB of SSD storage (you can usually get away with RAID-0 HDDs as well). Let’s use this as the foundation and walk through a proper sizing.

As I have demonstrated, 590 bytes of raw log results in 700 bytes of fully parsed, formatted and enriched data being written to disk (a ratio of 700/590, or about 1.19). So in order to store 1 TB of raw logs per day we would need to write 1.186 TB to disk each day. Given a 90-day retention requirement, we would need enough capacity for roughly 107 TB.

Elasticsearch will only allocate new shards to nodes whose disks are less than 85% full. This can be tuned, but that is the default. So we need to adjust our disk requirements to ensure that we can write all of the data and stay below that threshold. Given the need to write 107 TB, we will need about 126 TB of storage capacity (107 / 0.85 ≈ 126).
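That 85% threshold is the low disk watermark. It can be inspected or changed through the cluster settings API; the request below simply restates the default:

PUT /_cluster/settings
{
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%"
    }
}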

Now back to our “sweet spot” server, which has 6 TB of storage. 126 divided by 6 equals 21. We need 21 servers to store the data.

It is safe to assume that an organization with such logging requirements also wants high-availability, which requires data redundancy. Redundancy means that the storage requirement will increase by the replication factor, as will the number of servers. This is true if it is Elastic, Splunk or any other solution. There is no breaking the laws of physics.

If we configure our cluster to write one replica of all of the data, we will need a total of 42 servers. A cluster of this size will also require dedicated master nodes, raising the total by 3 to 45. I could even throw in another 5 servers to provide headroom for occasional spikes in log volume, and the resulting total of 50 servers would still be far, far fewer than the 635 that Splunk wants you to believe are necessary.
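For reference, “one replica” is just the index.number_of_replicas setting (which also happens to be the default), and it can be changed dynamically on existing indices (index pattern illustrative):

PUT /fortios-syslog-*/_settings
{
    "index": { "number_of_replicas": 1 }
}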

The other factor that must be considered is event throughput. 1 TB per day of 590-byte logs works out to 19,617 logs per second. We have a number of single-server deployments where the entire Elastic Stack runs on one server, and we can still handle over 5,000 logs per second. We have also performed benchmarking using records that each contained 90 unique fields, and we were able to index 19,000 events/s before hitting the limit of the server’s 1Gb/s network card (it was an older machine). Spreading nearly 20,000 logs/s across 21 nodes isn’t going to stress any individual server in the least. All is good!

Conclusions

Hard facts from data observed on real-world implementations of the Elastic Stack make it clear: Splunk’s claims about the infrastructure costs related to using the Elastic Stack are grossly misleading.

Don’t believe the FUD from Splunk!

At Koiossian we help our customers make the right decisions about proper sizing of their environment, for both cost efficiency and outstanding performance!