I observed something a little unusual today. I received a dataset with over 2 million records and was asked to do some compute-intensive operations on them. I didn't want to process all of the records at once because it might have crashed my computer, or at least tied it up for the whole day. I also wanted to take a small random sample of the records ahead of time and test out what I was going to do in order to make sure it worked.
I noticed that one of the fields contained the date and time that each record was created. My idea was to split the records up by the minute of the hour in which each one was created, dividing the 2 million records into 60 groups of roughly equal size. If a record was created at 11:27, then that record was in group 27.
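Here is a minimal sketch of that grouping in Python. The field name `created`, the timestamp format, and the sample records are all made up for illustration; the real dataset may look different.

```python
from collections import defaultdict
from datetime import datetime


def minute_group(created, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the minute of the hour (0-59) from a timestamp string."""
    return datetime.strptime(created, fmt).minute


def group_by_minute(records, field="created"):
    """Bucket records by the minute in which each one was created."""
    groups = defaultdict(list)
    for record in records:
        groups[minute_group(record[field])].append(record)
    return groups


# Hypothetical sample records, not taken from the real data.
sample = [
    {"id": 1, "created": "2011-03-14 11:27:05"},
    {"id": 2, "created": "2011-03-14 16:27:59"},
    {"id": 3, "created": "2011-03-15 09:00:00"},
]

groups = group_by_minute(sample)
for minute in sorted(groups):
    print(minute, len(groups[minute]))
```

Records 1 and 2 were created in different hours but in the same minute of the hour, so they land in the same group.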
This would let me experiment on 1/60th of the data before I started processing, and also let me control how many records I processed at once when I got down to business. When it came time to process the records, I could do 1/10th of them at a time by specifying in the query string to only do records with creation minute 0 through 5 (the minute values run from 0 to 59). When that group finished processing, I would run another batch with creation minute 6 through 11, and so on.
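The batch boundaries can be generated rather than typed by hand. This is only a sketch: the minute of the hour runs from 0 to 59, so ten batches of six minutes cover every group. The column name `creation_minute` and the SQL-style `BETWEEN` filter are assumptions; the actual query syntax depends on the system holding the data.

```python
def minute_batches(batch_size=6, minutes_per_hour=60):
    """Return inclusive (first, last) minute ranges that cover 0-59."""
    return [
        (start, min(start + batch_size, minutes_per_hour) - 1)
        for start in range(0, minutes_per_hour, batch_size)
    ]


def where_clause(first, last, column="creation_minute"):
    """Build a filter for one batch; the column name is hypothetical."""
    return f"{column} BETWEEN {first} AND {last}"


batches = minute_batches()
for first, last in batches:
    print(where_clause(first, last))
```

Changing `batch_size` makes it easy to shrink a batch if one turns out to be much heavier than the others.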
The weird thing is that I expected to get sixty groups of roughly 36,000 records each. I expected some variation between the groups, maybe give or take 1 or 2 thousand, but overall I thought they would be pretty close to each other. What I got was a huge range: the smallest group had 31,400 records, and the largest had almost 88,000!
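As a rough check on how surprising that is, here is a back-of-the-envelope calculation. It assumes the creation minutes are uniform and independent, so each group's count is binomial with p = 1/60, and it assumes a total of 60 × 36,000 records, since all I know for sure is "over 2 million". Both are assumptions for the sketch, not facts about the data.

```python
import math

groups = 60
total_records = groups * 36_000  # assumed total, not the exact record count
p = 1 / groups                   # chance that a record lands in a given minute

# Mean and standard deviation of a binomial count.
expected_mean = total_records * p
expected_sd = math.sqrt(total_records * p * (1 - p))


def z_score(count):
    """How many standard deviations a group's count sits from the mean."""
    return (count - expected_mean) / expected_sd


print(f"expected mean: {expected_mean:.0f}")
print(f"expected sd:   {expected_sd:.0f}")
print(f"z for 31,400:  {z_score(31_400):.1f}")
print(f"z for 88,000:  {z_score(88_000):.1f}")
```

Under those assumptions the chance variation is only a couple of hundred records per group, so even my "give or take 1 or 2 thousand" guess was generous, and a group of almost 88,000 cannot be explained by chance alone. Something is making certain minutes much more popular than others.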
When I get back to work tomorrow I am going to plot out the groups and see if the distribution is normal and what the standard deviation looks like.
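A first pass at that could be as simple as the sketch below, which uses only the standard library. The counts in it are invented placeholders to show the shape of the code; the real input would be the sixty per-minute group sizes.

```python
import statistics


def summarize(counts):
    """Return basic summary statistics for a list of group sizes."""
    return {
        "mean": statistics.mean(counts),
        "stdev": statistics.pstdev(counts),  # population standard deviation
        "min": min(counts),
        "max": max(counts),
    }


def text_bars(counts, width=40):
    """Return one bar of '#' characters per group, scaled to the largest."""
    largest = max(counts)
    return ["#" * round(width * c / largest) for c in counts]


# Placeholder counts, not the real group sizes.
counts = [5, 7, 6]
summary = summarize(counts)
print(summary)
for minute, bar in enumerate(text_bars(counts)):
    print(f"{minute:2d} {bar}")
```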