I am very new to statistics and bioinformatics. For my project, I have been creating a number of sets of n patients and splitting each set into two subsets, say HA and HB, containing an equal number of patients. The idea is to create different distributions of patients. For this purpose, I have been using 'random seeds': each set is shuffled using one of these seeds, and further analysis involving ML follows. The random seeds I have been using are the integers 1 to 100. My supervisor says that random seeds also need to be picked randomly, so I want to ask: is it a problem that the random seeds are sequential and ordered? Is there any paper, reason, statistical proof or theorem that supports or rejects my approach? Thanks in advance. (Please be kind, I am still learning.)
What do you mean by "random seeds are sequential and ordered"? Do you pick one seed, say uniformly at random, and then obtain the later seeds by taking the next seed according to some ordering? As an aside, I do think your question is legitimate, but I feel that 1) the question is better suited to stats.stackexchange.com, and 2) I don't have an ML background, but I think you'd have to include more details to increase the chances of getting an answer: how the shuffling is done, what the dataset is, what you are trying to learn from the dataset, etc. (To me the question seems a bit open-ended.) – advocateofnone, commented 23 hours ago
1 Answer
TL;DR: Your advisor's observation is not unreasonable, though I suspect that in practice you'll probably be fine with what you did, as long as correctness is not too critical.
Longer answer: It depends on the PRNG. If you want to be safe without thinking about this too hard, or if the stakes are high, use true-random seeds, since those are safe with every PRNG. With most PRNGs, though, it probably won't matter and the procedure you followed will be fine.
Some PRNGs are specifically designed so that you can use them with seeds like 0, 1, 2, etc. They mix the seed into the state well before producing any output.
Other PRNGs are not designed that way. So, the outputs from seed 0 might be related to the outputs from seed 1, at least for the first few outputs. Also, if you're particularly unlucky, seeds like 0 might be "bad", e.g., cycle with a very short period. Hopefully your PRNG doesn't have those limitations, but it could.
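To make the "related outputs" concern concrete, here is a minimal sketch using a toy linear congruential generator that is seeded directly, with no mixing step. The generator and its constants (`A`, `C`, `M`) are assumptions chosen purely for illustration; this is not claimed to be the PRNG you are using.

```python
# Toy linear congruential generator (LCG), for illustration only.
# The constants are a commonly quoted 32-bit LCG parameter set; the
# generator is a hypothetical stand-in, not the asker's actual PRNG.
A = 1664525
C = 1013904223
M = 2**32


def lcg_first_output(seed: int) -> int:
    """Return the first output when the raw seed is used as the initial state."""
    return (A * seed + C) % M


# First output for each of the sequential seeds 1..5.
first_outputs = [lcg_first_output(seed) for seed in range(1, 6)]

# Difference between first outputs of consecutive seeds, modulo M.
differences = [(b - a) % M for a, b in zip(first_outputs, first_outputs[1:])]

# With this toy generator every difference equals A, so the first outputs
# of sequential seeds are perfectly predictable from one another.
print(differences)
```

A PRNG that mixes the seed into its state before producing output does not show this kind of simple linear relationship between neighbouring seeds.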
There is usually no guarantee about which kind of PRNG you're actually using. So, if you want to be safe, follow your advisor's recommendation. For instance, you could use MD5(0), MD5(1), MD5(2), ..., as your seeds, if you want something reproducible and deterministic.
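The hashed-seed idea could look roughly like the following sketch, which uses only Python's standard library. The helper names `derived_seed` and `split_patients`, the patient IDs, and the choice to keep only the first 8 bytes of the digest are all assumptions made for illustration; adapt them to whatever shuffling code you actually have.

```python
import hashlib
import random


def derived_seed(index: int) -> int:
    """Map a small integer index to a well-mixed 64-bit seed via MD5.

    Hypothetical helper: hashes the decimal string of `index` and keeps
    the first 8 bytes of the digest as an unsigned integer.
    """
    digest = hashlib.md5(str(index).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def split_patients(patient_ids, index):
    """Shuffle the patients with a seed derived from `index`, then halve them."""
    rng = random.Random(derived_seed(index))  # independent generator per split
    shuffled = list(patient_ids)              # copy, so the input is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Made-up patient identifiers, purely for demonstration.
patients = [f"P{i:03d}" for i in range(20)]

# One split per index 1..100; rerunning the script reproduces the same splits.
splits = {index: split_patients(patients, index) for index in range(1, 101)}
ha, hb = splits[1]
```

The indices 1 to 100 still identify each split, so the experiment stays reproducible, but the PRNG only ever sees the hashed values.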
That said, for a quick experiment, I probably wouldn't worry about it.