
View Full Version : Is it Random enough?



endotherm
November 13th, 2010, 03:20 AM
Hello all,

I have stumbled across a somewhat odd phenomenon with a utility I've written to select a random file from a directory hierarchy, and I wanted to get someone else's take on it.

The algorithm is quite simple.


1. Create a list of all the files within the hierarchy, recursively.
2. Generate a random int between 0 and the number of files in the list minus 1.
3. Return the file in the list at the random index.
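Those three steps can be sketched in a few lines. This is a hedged illustration in Python (the poster's actual utility is VB.NET, and `pick_random_file` is a made-up name):

```python
import os
import random

def pick_random_file(root):
    """Walk the hierarchy, collect all files, and return one at random."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            files.append(os.path.join(dirpath, name))
    if not files:
        raise ValueError("no files found under " + root)
    index = random.randint(0, len(files) - 1)  # inclusive on both ends
    return files[index]
```

With a uniform generator, each file should come back with probability 1/N per run.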


The utility appears to work fine, and correctly launches a file when spawned. The set I am testing against contains about 175 files.

Here's where my query comes in. At runtime, the algorithm appears to be far less than random. From a purely anecdotal perspective, it appears to launch 1 of 20 or so specific files about 50% of the time. It also appears to favor files in the 130+ range, and very rarely returns a number less than 52. This is all when the program is in use, not in testing.

So the first thing I checked was the list-building algorithm. It correctly detects all the files it should, so I started checking the random number generation algorithm. I added a millisecond-based seed, but that didn't change anything, so I wrote a quick diagnostic feature that generates 175 random numbers within the range and accumulates the frequency of each value generated.

When I run the diagnostics however, everything appears to be working perfectly. Out of 175 values, there are on average 130-140 unique values, covering between 67-77% of the possible values at least once. No value is selected more than 3 times, and both 0 and 174 were selected.
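A diagnostic of that shape is easy to reproduce. A Python sketch (the original is VB.NET; `frequency_report` is an illustrative name):

```python
import random
from collections import Counter

def frequency_report(n_values=175, n_draws=175):
    """Draw n_draws random indices in [0, n_values) and tally frequencies."""
    counts = Counter(random.randrange(n_values) for _ in range(n_draws))
    unique = len(counts)            # how many distinct values appeared
    coverage = unique / n_values    # fraction of the range hit at least once
    max_freq = max(counts.values()) # most repeats for any single value
    return unique, coverage, max_freq
```

For reference, with 175 independent uniform draws over 175 values, about 63% of the values are expected to be hit at least once on average (1 - (1 - 1/175)^175, roughly 1 - 1/e).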

So, since I have statistical evidence that indicates everything is working fine, why does the use of the program feel so flawed? Is it all in my head (observational bias) or do you think there is a problem with my analysis? A potential technical pitfall?

Any ideas?

StunnerAlpha
November 13th, 2010, 03:36 AM
I would say it is all in your head. If the data says it's good, it's good. You have to realize, though, that in the end every random number generated by a computer is really pseudo-random; that is, it has to take a seed of some sort (the randomizer is based on something), even if you don't actually feed one into the function. I think this is what you are subconsciously concerned about. I have no idea what this algorithm is for, but in terms of effective randomization, your algorithm seems legitimate.

techgeek314
November 13th, 2010, 04:03 AM
You may need to seed the random number generator to a different number on each run; you might be getting the generator starting with the same seed every time. Try seeding the random number generator with the current time.
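The classic pattern is to seed once at program start from the clock, then let the generator run. A Python sketch of that pattern (note that Python's `random` module actually seeds itself from OS entropy on import, so the explicit call is only needed where the generator would otherwise start from a fixed default, as in classic VB or C):

```python
import random
import time

# Seed once at program start; every later call continues the same stream.
random.seed(time.time())

values = [random.randint(0, 174) for _ in range(5)]
```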

dwhitney67
November 13th, 2010, 04:12 AM
When you roll a die, with numbers 1 through 6, the possibility of rolling a particular number is in theory 1-in-6 (or 1/6 or 16.7%). In reality, it is 1-in-2 (50%)... either you roll the desired number or you don't.

Randomization is based on your perception. If I randomly choose to draw a number from 0 to 1000, and coincidentally the same number comes up twice in a row, or even twice in, say, ten draws, does that mean the randomization is wrong? Probably not; it is merely coincidence, or as some say, chance.

P.S. You never did indicate from where or how you are drawing your random numbers. Maybe you do have a flaw in your algorithm.

endotherm
November 13th, 2010, 04:18 AM
Thanks for the replies

Yes, it could well be all in my head, and ultimately this algorithm may not be fit for the task I had imagined.

I am reseeding the RNG with each selection, based on the current millisecond, but according to the documentation, the default Randomize() call uses the full datestamp, so I may have made it less specific. I've followed the best practices on random value generation, so I feel like it should work. Maybe I need to use a crypto library for better RNG.

At runtime this utility only needs to produce one value per run, so unless there is a bias in the simple RNG generation (perhaps something initialized at boot or login), I would guess that the RNG instance is not carrying over state from a previous run of the process.

Perhaps the best bet is to record the items selected, and reject them unless an arbitrary period of time has passed since their last selection.
What do you all think?

endotherm
November 13th, 2010, 04:33 AM
@DWhitney, you are quite right; I've tried to treat this as theory, partly to hide my embarrassment. For professional reasons, I'm faced with the unattractive requirement of getting back in touch with VB.NET, so I am using this util for practice.

Here is the code that directly touches the random file selection:



Imports System.IO
Imports System.Diagnostics

Public Class RandEp
    Private _root As DirectoryInfo
    Private _Eps As New List(Of FileInfo)
    Private _Types As New List(Of String) From {".mkv", ".avi", ".ogm", ".ogg", ".mp4"}
    Private _DirExclusions As New List(Of String) From {"extra", "extras", "sample"}

    Public Sub New(ByVal iRoot As DirectoryInfo)
        ' AndAlso short-circuits, so iRoot.Exists is never evaluated when iRoot is Nothing.
        If iRoot IsNot Nothing AndAlso iRoot.Exists Then
            Me._root = iRoot
        Else
            Throw New DirectoryNotFoundException("The series root directory was not found.")
        End If
    End Sub

    Public Function Go() As FileInfo
        BuildList()
        Return _Eps(GetEpIndex())
    End Function

    Public Sub BuildList()
        ScanDir(_root)
    End Sub

    Public Function GetEpIndex() As Integer 'public for diagnostics
        ' Reseeds from the current millisecond on every call.
        Randomize(DateTime.Now.Millisecond)
        Return CInt(Int(_Eps.Count * Rnd()))
    End Function

    Private Sub ScanDir(ByVal dir As DirectoryInfo)
        If dir IsNot Nothing AndAlso Not _DirExclusions.Contains(dir.Name.ToLower()) Then
            For Each File As FileInfo In dir.GetFiles()
                If _Types.Contains(File.Extension.ToLower()) Then
                    _Eps.Add(File)
                End If
            Next
            For Each subDir As DirectoryInfo In dir.GetDirectories()
                If Not _DirExclusions.Contains(subDir.Name.ToLower()) Then
                    ScanDir(subDir)
                End If
            Next
        End If
    End Sub
End Class

worksofcraft
November 13th, 2010, 04:39 AM
You should only seed the pseudo-random sequence once. Typically, once seeded, it will not repeat the sequence it generates for a very large number of calls; e.g. 2^32 is, I think, the guaranteed minimum sequence length for srand/rand in C.

So you seed it once with something like the current time and that positions you at an unpredictable spot in this pseudo sequence.

However, if you keep seeding it over and over, then you could well be positioning the pseudo-sequence at roughly the same spot each time, depending on the granularity of the time you get :shock:

e.g. the function time() returns a "time_t" object, but on many systems this is just the number of integral seconds elapsed since a particular date, so it's only really using a rather narrow range of starting points from the available 2^32 it could be seeded with!
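The granularity problem is easy to demonstrate: seed twice with the same value and the "random" output repeats exactly. A Python sketch (the same holds for C's srand/rand and VB's Randomize):

```python
import random

random.seed(1289600000)           # e.g. an integral-seconds timestamp
first = [random.randint(0, 174) for _ in range(3)]

random.seed(1289600000)           # reseeding within the same second...
second = [random.randint(0, 174) for _ in range(3)]

print(first == second)            # True: identical sequence, nothing random about it
```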

trent.josephsen
November 13th, 2010, 04:40 AM
Maybe it's different in VB, but the C programmer in me is screaming "Don't reseed! Don't reseed!"

Re-seeding after every call or set of calls makes the pseudo-random algorithm less pseudo-random. Seed it once and leave it alone.

Edit: whoops, too slow!

endotherm
November 13th, 2010, 05:06 AM
Good point, and thanks for the tip.

At runtime, the utility generates only one value per run of the process, but reseeding would impact the diagnostics, since I generate 175 numbers in a row.

I will separate the initial Randomize() call from the code that gets the random number and see how that affects my diagnostics, but since the diagnostics show a good spread despite my reseeding mistake, I wonder if this will make a huge difference.

endotherm
November 13th, 2010, 05:11 AM
Interestinger. I moved the Randomize() call to the constructor to prevent it from being called multiple times, and the number of unique values generated in 175 runs has dropped from 130-140 to 100-120, reducing the range coverage to 60-68%. The highest frequency for a value is now 4.

Tony Flury
November 13th, 2010, 08:13 AM
Just a thought: 175 runs is far too small a sample; to test the randomness you need to be doing thousands or tens of thousands of runs.

worksofcraft
November 13th, 2010, 09:04 AM
I don't quite follow what you want, but you can calculate what to expect.

Think of your set of files as an N-sided die, so selecting one at random is like rolling that die. The number of times each file gets chosen should then follow what is known as a "binomial distribution". You can calculate the expected frequency with:

P(k) = n! / (k! * (n-k)!) * p^k * (1-p)^(n-k)

p = the probability of picking a given file, i.e. 1/N for the files in your set
k = the number of times you can expect the same file
n = the number of samples, i.e. 175 in your case

endotherm
November 13th, 2010, 09:19 AM
Something is definitely not right here. As some of you may have guessed from my code snippet, this utility is for queuing up a random episode of something from my video collection. I added a series of 200 files, but have had to skip at least 8 of them in the last 12-16 hours of use, probably about 30 runs. I think I'm going to try salting the number a little and see where that gets me.

ad_267
November 13th, 2010, 09:28 AM
Maybe the problem is that it's too random. Each item has an equal chance of being picked every time, regardless of whether it's already been selected. Instead, you might want to have a list of all the items and, after selecting one randomly, remove it from the list.
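That "remove after picking" idea is often called a shuffle bag. A Python sketch, under the assumption that the whole item list fits comfortably in memory (`ShuffleBag` is an illustrative name):

```python
import random

class ShuffleBag:
    """Yields each item exactly once per cycle, in random order."""

    def __init__(self, items):
        self._items = list(items)
        self._bag = []

    def next(self):
        if not self._bag:              # refill and reshuffle when empty
            self._bag = self._items[:]
            random.shuffle(self._bag)
        return self._bag.pop()
```

Every item plays once before anything repeats, which tends to *feel* more random than true independent draws.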

This is an interesting read on the topic: http://blogs.msdn.com/b/shawnhar/archive/2009/12/17/the-psychology-of-randomness.aspx

worksofcraft
November 13th, 2010, 10:06 PM
I found a binomial chance calculator (http://stattrek.com/Tables/Binomial.aspx) online.

Assuming a true random distribution, it calculated that if you pick one of your 200 files at random 175 times (i.e. the single-trial success probability is 0.005), then on average about 42% of your files will not be chosen at all, about 36% will be chosen once, 16% will be played twice, and less than 6% more than twice.
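Those percentages check out against the binomial formula given earlier. A quick verification sketch in Python:

```python
from math import comb

n, p = 175, 1 / 200               # 175 draws over 200 files

def pmf(k):
    """Binomial probability of a given file being picked exactly k times."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p0, p1, p2 = pmf(0), pmf(1), pmf(2)
p_more = 1 - p0 - p1 - p2

print(f"never: {p0:.0%}, once: {p1:.0%}, twice: {p2:.0%}, more: {p_more:.1%}")
# roughly: never 42%, once 37%, twice 16%, more 5.8%
```

So having to skip a handful of repeats in ~30 runs is entirely consistent with a fair uniform pick.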

endotherm
November 15th, 2010, 06:34 AM
Well, I've come to some interesting conclusions.

I decided to breed my own RNG, by multiplying two randoms together and modding them, just for salt. I noticed that my liking for the new algorithm went up, even though all the metrics on the diagnostics went down.

@works: Yeah, that describes the output I'm getting pretty well. Neat to see the math in action!

Ultimately, you guys are right; "random" is not really the path to the ends I seek, so I bit the bullet and added features to record and administer a history of viewed files. A file won't replay unless there is no history data for it within the last 6 months, but I'm sure I'll have to tweak that. I already had data persistence code in place, so adding on to it was trivial enough.
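A minimal sketch of that history filter in Python (illustrative names; the poster's VB.NET persistence layer isn't shown, so this assumes an in-memory dict mapping file to last-play time):

```python
import random
from datetime import datetime, timedelta

REPLAY_WINDOW = timedelta(days=180)    # "no replay within 6 months"

def pick_episode(files, history, now=None):
    """Pick a random file, excluding any played within the replay window.

    `history` maps file path -> datetime of last play; it is updated in place.
    """
    now = now or datetime.now()
    eligible = [f for f in files
                if f not in history or now - history[f] > REPLAY_WINDOW]
    if not eligible:                   # everything played recently: fall back
        eligible = files
    choice = random.choice(eligible)
    history[choice] = now
    return choice
```

The fallback branch matters: without it, the picker would crash once the whole collection has been watched inside the window.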

This has been a very interesting conversation. Thanks, everyone, for your thoughts and ideas!