The sheer scale of the data required for and new developments in monitoring IT infrastructures with traditional SIEM (Security Information and Event Management) solutions has been prompting changes for all but the most naive of these systems, and most of these changes involve dealing with and analysing large data sets, hence the connection with the whole Big Data movement.
Big Data is changing the landscape for SIEM providers; in most cases it's not just a difference of scale - just throwing bigger and faster hardware just won't do.
Some of the issues that arise in the day-to-day operation of such systems are as follows:
Long Time Horizons
Data (in the form of logs, mostly) needs to be stored for increasingly long periods of time because sometimes the context is what separates a real threat from false positives.
One small incident is perhaps not relevant if it happens only once but the same issue happening every day for six months might be indicative of something lurking around the corner.
This means that an effective SIEM system needs to have elements to detect and act upon these APTs (Advanced Persistent Threats).
Most SIEM solutions are based on a traditional, relational Database Management System, which are not meant for this type of large, unstructured and relatively static data.
Inconsistent Data Formats
The sheer variety of log types and formats presents, in and of itself, a challenge for traditional SIEMs which are generally based upon database systems which really need some sort of regularity to the data. Companies are trying to move away from having to define each new log format in terms the underlying persisting layer can understand.
Store Once, Read Multiple Times
Logs and other types of monitoring information (both real-time and otherwise) aren't meant to be edited or changed in any way. They are mostly timestamped and automatically generated by devices and/or applications.
Many companies therefore find themselves using technologies meant for other types of data, which further contribute to aggravate the problem.
Not Knowing what to Look For
Users don't always know what they must look for when trying to establish a correlation between different events (now and/or in the past); maybe after an incident has taken place they want to carry out a forensic examination.
SIEM solutions must allow for ad hoc reporting and visualization so that end-users can use the system in ways the original designer didn't think about.
Stretching this notion a little bit, we can see many users using their SIEMs as some sort of log search engine which provides unopinionated visualization for the logs, providing tools for users themselves to see correlations and connections between the data sources rather than doing it itself.
Similar Data that Doesn't Look So
Different devices sometimes describe data in specific ways that makes it extra difficult for systems to determine what's similar and what's not.
For example, you might have two firewalls in your network and one logs drops as
DROP: <IP> <TIMESTAMP> and the other logs it as
DENY <TIMESTAMP> <IP> or something like that. Systems need to be able to infer similarities like these and treat them as a single entities (Firewall Drops) and smooth out small noise like this.