Apologies if this is the wrong forum, but it looks reasonably close.
I have a practical problem: estimating the number of actual entries in a large table from the fill level of a hash table. I have a content-addressable file store in which individual files are stored at a path determined by their SHA1 hash value. The storage is a multilevel hierarchy of directories, with level R named by the Rth octet of the hash value.
When there are a lot of files (tens of millions), it is time-consuming to traverse the file tree and count them. I believe there should be a way to estimate the overall population size, maybe to within a factor of 2, based on the fill level of the buckets, assuming that the octet values in the SHA1 hash are uniform, which they should be.
It's a kind of reverse application of the birthday problem, only I am going to count the number of holes with no pigeons and want to estimate the total number of pigeons based on the holes that are empty.
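One possible way to sketch this inversion, assuming each file lands in one of N buckets uniformly and independently: the chance that a given bucket stays empty after n files is (1 - 1/N)^n, so an observed count of E empty buckets suggests n ≈ ln(E/N) / ln(1 - 1/N). The function and parameter names below (`estimate_count`, `total_buckets`, `empty_buckets`) are hypothetical, chosen only for illustration.

```python
import math


def estimate_count(total_buckets: int, empty_buckets: int) -> float:
    """Estimate how many items were hashed uniformly into total_buckets,
    given how many buckets were observed to be empty.

    Assumption: items are placed independently and uniformly at random.
    """
    if not 0 < empty_buckets <= total_buckets:
        # With zero empty buckets the estimate is unbounded.
        raise ValueError("need at least one empty bucket and at most total_buckets")
    # P(bucket empty) = (1 - 1/N)^n  =>  n = ln(E/N) / ln(1 - 1/N)
    return math.log(empty_buckets / total_buckets) / math.log1p(-1.0 / total_buckets)
```

The estimate is only informative while a reasonable share of buckets is still empty. With tens of millions of files spread over 2^16 prefixes, each prefix would hold well over a hundred files on average, so essentially no prefix would be empty; under these assumptions the empty-bucket count would have to be taken at a level with many more buckets, such as the 2^24 leaf directories.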
The directory tree is 3 levels deep, with 16M possible end nodes, and a possibility of many files in each final directory. Although it is impractical to walk the whole tree, I can directly count the top couple of levels to see which of the 2^16 prefixes exist. At lower levels I can sample directories to get a reasonable approximation of the fill density.
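A rough sketch of the sampling idea, under some loudly stated assumptions: directory names are taken to be two lowercase hex digits per level (for example `root/3f/a0/7c/`), files sit directly in the leaf directory, and a missing directory counts as zero files. The layout and the names `random_leaf`, `count_files`, and `estimate_total` are hypothetical, not taken from the actual store. It picks leaf directories uniformly at random, counts the files in each, and scales the sample mean by the number of possible leaves.

```python
import math
import os
import random
import statistics


def random_leaf(root, levels=3, rng=random):
    """Pick one leaf directory path uniformly at random.
    Assumption: each level is named by two lowercase hex digits."""
    parts = [format(rng.randrange(256), "02x") for _ in range(levels)]
    return os.path.join(root, *parts)


def count_files(path):
    """Count regular files directly inside path; a missing directory counts as 0."""
    try:
        with os.scandir(path) as entries:
            return sum(1 for entry in entries if entry.is_file())
    except FileNotFoundError:
        return 0


def estimate_total(root, samples=1000, levels=3, rng=random):
    """Return (estimated total files, standard error of that estimate).
    Requires samples >= 2 so a sample standard deviation exists."""
    counts = [count_files(random_leaf(root, levels, rng)) for _ in range(samples)]
    leaves = 256 ** levels
    mean = statistics.fmean(counts)
    stderr = statistics.stdev(counts) / math.sqrt(samples)
    return leaves * mean, leaves * stderr
```

Because the sampled leaves are chosen uniformly and empty or missing leaves are included as zeros, the scaled mean is an unbiased estimate of the total, and the returned standard error gives a sense of whether the factor-of-2 target is met for a given sample size.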
I don't think the math is too hard, but my counting problem experience is very old now, and I am not familiar with the problem domain. Maybe this is a problem that's been written up somewhere and someone here would know where.
Any ideas or help would be most welcome. Thanks.