Skip to content

jFriedli/Bloom-Filter

Repository files navigation

Java Project Analysis: "Bloom Filter Implementation"

Overview

This Java project implements a Bloom Filter, a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. In other words, a query returns either "possibly in set" or "definitely not in set". Bloom filters are used to reduce the need for costly operations such as disk or network access. This project includes classes for the Bloom Filter itself and a Main class for testing its functionality.

What is a Bloom Filter?

A Bloom Filter is an efficient data structure that can quickly determine whether an element is potentially in a set. It's designed to be space-efficient at the cost of a certain probability of false positives. The Bloom Filter uses multiple hash functions to map each element to several positions in a bit array. When checking for an element's presence, if any of the bits at these positions are not set, the element is definitely not in the set. If all are set, the element might be in the set.

How Does a Bloom Filter Work?

  • Initialization: It starts with an array of bits, all set to 0.
  • Adding Elements: When an element is added, it is hashed multiple times (using different hash functions), and the corresponding bits in the array are set to 1.
  • Querying: To check if an item is in the set, the same hash functions are applied to the item. If any of the corresponding bits in the array are 0, the item is not in the set. If all are 1, the item might be in the set (hence, a false positive is possible).

BloomFilter Class

  • Purpose: Implements the Bloom Filter data structure.
  • Key Components:
    • boolean[] hashSet: Array of bits.
    • int n: Number of expected elements.
    • double p: False positive probability.
    • int m: Length of the bit array.
    • int k: Number of hash functions.
    • Methods:
      • Constructor: Initializes the filter with expected item count n and false positive probability p.
      • exists: Checks if an item is possibly in the set.
      • add: Adds an item to the set.
      • getHashIndex: Generates a hash index for a given string and seed.
      • calculateM: Calculates the length m of the bit array.
      • calculateK: Determines the optimal number of hash functions k.

Main Class

  • Purpose: Demonstrates the usage of the Bloom Filter with a dataset.
  • Key Components:
    • File reading to load words into a set.
    • Creation and population of a Bloom Filter with the words.
    • Testing for false positives and actual positives.
    • Methods:
      • main: Entry point for the demonstration.
      • getRandomWords: Generates a list of random strings.
      • generateRandomString: Creates a random string of a given length.
      • getRandomIntInRange: Returns a random integer within a specified range.

Example Usage

  • The Bloom Filter is demonstrated by reading a set of words from a file, adding them to the filter, and then checking for the existence of these words and randomly generated words to calculate the rate of false positives.

Key Takeaways

  • The project showcases an efficient implementation of a Bloom Filter in Java.
  • It demonstrates the probabilistic nature of Bloom Filters in handling false positives.
  • The Main class provides practical examples of how to use the Bloom Filter with real data.

Conclusion

This Java project successfully implements a Bloom Filter, demonstrating its practical use in handling large datasets where a quick, space-efficient, and probabilistic check for set membership is required.

About

Simple bloom filter in Java using Google Guava for murmur3 hashing

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages