Traditional DBMSs store data that are finite and persistent: the data is available whenever we want it. In big data analytics and data mining, by contrast, data is often assumed to arrive in streams; if it is not processed immediately, it is lost. A data stream is a continuous flow of data. Sensor data, network traffic, call-center records, satellite images, and data from electric power grids are some popular examples of data streams. Data streams possess several unique properties:
A new class of data may evolve that is difficult to fit into the existing classes (concept evolution).
The relationship between the input data and the target output may change over time (concept drift).
Apart from these unique characteristics, data stream mining poses several challenges:
- It is not practical to manually label all the data points in the stream.
- It is not feasible to store or archive this data stream in a conventional database.
- Concept drift
- Concept evolution
- Speed and huge volume make it difficult to mine the data; only single-scan (one-pass) algorithms are feasible.
- Difficult to query with SQL-based tools due to lack of schema and structure.
Stream Data Model
A streaming data architecture is a framework of software components built to ingest and process large volumes of streaming data from various sources. A streaming data model processes the data immediately as it is generated, and only then persists it to storage. The architecture also includes additional tools for real-time processing, data manipulation, and analysis.
Benefits of stream processing
- Able to deal with infinite or never-ending streams of data
- Real-time processing
- Easy data scalability
Modern stream processing infrastructure is hyper-scalable, able to handle gigabytes of data per second with a single stream processor. This allows you to deal with growing data volumes without infrastructure changes.
The four building blocks of a stream architecture are:
1. Stream Processor/ Message broker
- Takes data coming from various sources, translates it into a standard format, and streams it on an ongoing basis.
- Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams.
2. Batch and Real-time ETL tools
- ETL stands for Extract, Transform, and Load: the process of moving large volumes of (often unstructured) data from one system to another. ETL is essentially a data integration process.
- ETL tools aggregate data streams from one or more message brokers.
- An ETL tool or platform receives queries from users, fetches events from message queues, and applies the query to generate a result. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream.
- A few examples of open-source ETL tools for streaming data are Apache Storm, Spark Streaming, and WSO2 Stream Processor.
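The receive-query / fetch-events / apply-query loop described above can be sketched in plain Python. This is a minimal, hypothetical sketch: the in-memory queue stands in for a real message broker, and the event fields (`sensor`, `temp`) are made-up illustrations.

```python
import queue

def apply_query(events, predicate):
    """Apply a user query (here just a filter predicate) to a batch of events."""
    return [e for e in events if predicate(e)]

# Hypothetical in-memory stand-in for a queue fed by a message broker.
events = queue.Queue()
for reading in [{"sensor": "s1", "temp": 21.5},
                {"sensor": "s2", "temp": 38.0},
                {"sensor": "s1", "temp": 40.2}]:
    events.put(reading)

# Fetch all currently queued events, then run the query over the batch.
batch = []
while not events.empty():
    batch.append(events.get())

# The query result here plays the role of an "alert": overheated sensors.
alerts = apply_query(batch, lambda e: e["temp"] > 35.0)
```

In a real deployment, the batch would be fetched from a broker such as Kafka, and the result could trigger an API call or feed a new stream instead of being held in a list.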
3. Query Engine
- After streaming data is prepared for consumption by the stream processor, it must be analyzed to provide value. Some of the commonly used data analytics tools are Amazon Athena, Amazon Redshift, Elasticsearch, Cassandra.
4. Data Storage
- Streams may be archived in a large archival store, but it is not feasible to answer queries quickly from the archival store.
- A working store of limited capacity is used into which summaries or parts of streams may be placed, and which can be used for answering queries. The working store might be a disk, or it might be the main memory, depending on how fast we need to process queries.
- The advent of low-cost storage technologies paved the way for organizations to store streaming event data.
- There are several ways in which events can be stored: in a database or a data warehouse, in the message broker, or in a data lake.
- A data lake is the most flexible and inexpensive option for storing event data, but its latency (the time required to retrieve data from storage) is too high for real-time analysis.
Streaming data architectures enable developers to develop applications that use both bound and unbound data in new ways. For example, Alibaba’s search infrastructure team uses a streaming data architecture powered by Apache Flink to update product details and inventory information in real-time.
Netflix also uses Flink to support its recommendation engines, and ING, the global bank based in the Netherlands, uses the architecture to prevent identity theft and provide better fraud protection. Other platforms that can accommodate both stream and batch processing include Apache Spark, Apache Storm, Google Cloud Dataflow, and AWS Kinesis.
There are two types of stream queries: standing queries and ad-hoc queries
1. Standing queries
Permanently executing; they produce output at appropriate times. Consider a temperature sensor bobbing about in the ocean, sending back to a base station a reading of the surface temperature each hour. The data produced by this sensor is a stream of real numbers. Here we can ask a standing query: what is the maximum temperature ever recorded by the sensor?
To answer this query we need not store the entire stream. When a new stream element arrives, we compare it with the stored maximum and set the maximum to whichever is larger. Similarly, if we want the average temperature over all time, we only have to record two values: the number of readings ever sent in the stream and the sum of those readings.
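The two standing queries above can be sketched in Python with constant state: the maximum needs one stored value, and the average needs only the running count and sum (the temperature readings below are made-up illustrations):

```python
class StandingQueries:
    """Maintain answers to standing queries (max and average) over a
    temperature stream using O(1) state; no archiving is needed."""

    def __init__(self):
        self.maximum = float("-inf")
        self.count = 0
        self.total = 0.0

    def observe(self, reading):
        # Compare the new element with the stored maximum.
        if reading > self.maximum:
            self.maximum = reading
        # For the average we only need the count and the running sum.
        self.count += 1
        self.total += reading

    def average(self):
        return self.total / self.count

q = StandingQueries()
for temp in [21.0, 23.5, 19.8, 25.1]:
    q.observe(temp)
```

Each new element is processed once and discarded, which is exactly the single-scan constraint mentioned earlier.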
2. Ad-hoc queries
- Asked only when a particular piece of information is needed; not permanently executing.
- Asked once about the current state of the stream. Such queries are difficult to answer because we do not archive the entire stream.
- To answer such ad-hoc queries, we have to store parts or summaries of the stream.
The sliding window approach can be used to answer ad-hoc queries.
Each sliding window stores the most recent n elements of the stream, for some n. Alternatively, it can hold all the elements that arrived within the last t time units, e.g. one day.
The length of the sliding window is specified by its range. Stride specifies the portion of the window that is omitted when the window moves forward.
There are two types of sliding windows:
Time-based sliding windows
- Range and stride are specified by time intervals.
- For example, a sliding window with range = 10 mins and stride = 2 mins produces a window that covers the data from the last 10 minutes; a new window is created every 2 minutes.
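A time-based window with range = 10 and stride = 2 can be sketched as follows. This is a minimal illustration, assuming timestamps measured in minutes and a made-up stream of (timestamp, temperature) tuples:

```python
def time_window(events, now, range_minutes=10):
    """Return the events whose timestamp falls within the last
    `range_minutes`, i.e. in the half-open interval (now - range, now].
    `events` is an iterable of (timestamp_in_minutes, value) tuples."""
    return [(ts, v) for ts, v in events if now - range_minutes < ts <= now]

stream = [(1, 20.1), (4, 20.7), (9, 21.3), (12, 22.0)]

# With stride = 2, this window would be re-evaluated every 2 minutes;
# here we evaluate it once, at t = 12, covering timestamps in (2, 12].
w = time_window(stream, now=12)
```

A production system would evict expired events incrementally rather than re-scanning the stream, but the window contents are the same.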
Count-based sliding windows
- Range and stride are specified in terms of the number of stream elements (tuples) rather than time.
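A count-based window that keeps the most recent n elements is easy to sketch in Python with `collections.deque`, whose `maxlen` argument evicts the oldest element automatically (the values below are made-up illustrations):

```python
from collections import deque

# Count-based sliding window keeping the most recent n = 3 elements.
window = deque(maxlen=3)
for value in [5, 1, 4, 2, 8]:
    window.append(value)  # the oldest element is evicted automatically

recent = list(window)      # the last 3 stream elements
window_max = max(window)   # e.g. an ad-hoc query over the window
```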
A related problem is sampling data in a stream. Suppose we want to store a representative sample of a stream of search queries:
- The obvious approach would be to generate a random number, say an integer from 0 to 9, in response to each search query.
- Store the tuple if and only if the random number is 0.
- Each user then has, on average, 1/10th of their queries stored.
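The sampling scheme above can be sketched directly; each incoming tuple is kept with probability 1/10 (the query identifiers below are made-up, and the seed is fixed only so the sketch is reproducible):

```python
import random

def keep_tuple(rng):
    """Store a tuple iff a random digit in 0..9 comes up 0,
    i.e. sample roughly 1/10th of the stream."""
    return rng.randint(0, 9) == 0

rng = random.Random(42)  # fixed seed for a reproducible sketch
queries = range(10_000)  # stand-in identifiers for incoming search queries
stored = [q for q in queries if keep_tuple(rng)]
fraction = len(stored) / 10_000  # close to 1/10
```

Note that deciding per tuple (rather than per user) can distort per-user statistics such as the fraction of repeated queries; hashing the user ID to pick 1/10th of the users is a common refinement.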
Geethu is a lecturer by profession and a blogger by passion. She loves writing, and on ClassRounder she shares the tutorials and notes she prepared for her students. If you have any doubts about the topic, you can contact her at [email protected]