As mentioned earlier a data stream is a massive, infinite dataset. Hence it is not possible to store the entire stream. While mining a data stream a typical question that can arise is how can we answer certain critical queries without storing the entire stream. In some cases, we can get an answer from certain samples in the stream, without examining the entire stream.
Here, we have to keep in mind two things; one is the sample should be unbiased. The second one is the typical sample should be able to answer the queries. Choosing the right samples is critical. Carelessly choosing samples can destroy the results of the query. While sampling we must take care of some pitfalls.
Consider a search engine like Google that receives a stream of queries. Google wants to study the behavior of users. A typical question that can be asked is “What fraction of queries asked past the month are unique?”. Only 1/10th of the stream element is stored.
The obvious approach would be to generate a random number, say an integer from 0 to 9, in response to each search query. Store the tuple if and only if the random number is 0. If we do so, each user has, on average, 1/10th of their queries stored. Statistical fluctuations will introduce some noise into the data, but if users issue many queries, the law of large numbers will assure us that most users will have a fraction of quite close to 1/10th of their queries stored.
[Sample will contain s/10 of the singleton queries and 2d/10 of the duplicate queries at least once.But only d/100 pairs of duplicates d/100 = 1/10 * 1/10 * d]
This scheme gives us the wrong answer to the query asking for the average number of duplicate queries for a user. Suppose a user has issued s search queries one time in the past month, d search queries twice, and no search queries more than twice. If we have a 1/10th sample, of queries, we shall see in the sample for that user an expected s/10 of the search queries issued once. Of the d search queries issued twice, only d/100 will appear twice in the sample; that fraction is d times the probability that both occurrences of the query will be in the 1/10th sample. Of the queries that appear twice in the full stream, 18d/100 will appear exactly once.
To see why, note that 18/100 is the probability that one of the two occurrences will be in the 1/10th of the stream that is selected, while the other is in the 9/10th that is not selected.[Of d “duplicates” 18d/100 appear once 18d/100 = ((1/10*9/10)+(9/10*1/10))*d]. The correct answer to the query about the fraction of repeated searches is d/(s+d). However, the answer we shall obtain from the sample is d/(10s+19d).
d/100 appears twice, while s/10+18d/100 appears once. Thus, the fraction appearing twice in the sample is d/100 divided by d/100 + s/10 + 18d/100. This ratio is d/(10s + 19d).
A solution for the above scenario is to sample the users.
- Pick 1/10th of users and take all their searches in the sample.
- Use a hash function that hashes the user name or user id uniformly into 10 buckets.
- Each time a search query arrives in the stream, we look up the user to see whether or not they are in the sample. If so, we add this search query to the sample, and if not, then not.
- By using a hash function, one can avoid keeping the list of users.
- Hash each user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, then accept this search query for the sample, and if not, then not.
- Stream of tuples with keys.
- Key is some subset of each tuple’s components e.g., a tuple is (user, search, time).
- Key is user Choice of the key depends on the application
- To get a sample of size a/b: Hash each tuple’s key uniformly into b buckets Pick the tuple if its hash value is at most a.
Geethu is a lecturer by profession and a blogger by passion. She loves writing and in ClassRounder she is looking to share the tutorials and the notes she prepared for the students. If you have any doubts within the topic, you can contact her at [email protected]