No data is better than bad data.
Nah, bad data is valuable to science, because you learn that what you did had a confounding variable.
Maybe not exactly what it was, but investigating and identifying it is literally the scientific method.
Even early psych experiments like the Stanford Prison Experiment produced completely useless data, but figuring out why it was useless led to modern rules for experimental design that control for all the ways Zimbardo fucked up.
But when talking about funding and/or employment…
A good scientist can defend their position with any data. The complete absence of any data would be when you’re fucked.
It’s not even just science: metrics are a thing almost everywhere, and they’re just statistical analysis done by people who have never heard those two words put together. It’s trivial to game the metrics to make things look better, but what’s better is to explain why the problem is the metrics themselves.
If you’re less than ethical you could do that even if you’ve not been doing your job.
You’re talking about data that doesn’t back the initial hypothesis. That isn’t bad data in this context, and you’re correct that it is still valuable for reforming hypotheses and re-running the experiment.
Bad data in this context refers to data quality: things like inconsistent collection, inadequate or missing data, free text instead of controlled input, etc. In those cases the data can become almost useless (and this is usually known by the people working on a project, but not necessarily by their management). That creates pressure to turn shit into gold when that just isn’t possible.
Imagine that your boss wants you to predict what the temperature will be next Tuesday. To do this, your company has provided you with the temperature from every Tuesday for the past 12 years.

As if that weren’t bad enough, the dates were originally recorded in DDMMYY format, but 10 years ago they switched to MMDDYY. However, some records were still collected in the legacy DDMMYY format due to a lack of training in the temperature collection department, and there is no way to distinguish the correct date. Also, one employee who was close to retirement only recorded the temperature as “Hot” or “Cold”, because that is how he was trained when he was first hired 50 years ago and he never bothered to learn the new system.

Now, you can probably build a model that tracks weekly temperature over time and approximates next Tuesday’s temperature from something like seasonality, the historical average, and the most recent Tuesday. But you’ll know it’s not the best estimate, you’ll know there is far better data out there, and you’ll probably be able to make a simpler, more accurate estimate just by averaging the temperatures from Saturday, Sunday, and Monday.
That’s bad data.
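To make the two nastiest problems in that scenario concrete, here is a minimal Python sketch (the sample records are invented for illustration, not taken from any real log). Any six-digit date whose first two and middle two digits are both 12 or less parses as a valid date under both DDMMYY and MMDDYY, so those rows can never be recovered with certainty, and the “Hot”/“Cold” readings cannot be turned back into numbers at all.

```python
from datetime import datetime

def possible_dates(raw: str):
    """Every calendar date a six-digit record could mean.

    A record is unrecoverable whenever both the DDMMYY and the MMDDYY
    reading are valid dates, i.e. whenever the first two and the middle
    two digits are both 12 or less.
    """
    readings = set()
    for fmt in ("%d%m%y", "%m%d%y"):
        try:
            readings.add(datetime.strptime(raw, fmt).date())
        except ValueError:
            pass  # that reading is not a real calendar date
    return readings

def parse_temperature(entry: str):
    """Best-effort numeric parse of a logged reading; gaps stay gaps."""
    try:
        return float(entry)
    except ValueError:
        return None  # "Hot" / "Cold" carry no recoverable number

# Invented sample rows: (date string, logged temperature)
records = [
    ("130617", "71.2"),  # 13 can only be a day, so the date is unambiguous
    ("070315", "68.0"),  # 7 Mar 2015 or 3 Jul 2015: no way to tell
    ("021116", "Hot"),   # legacy collector: the reading itself is lost
]

for raw_date, raw_temp in records:
    print(raw_date, possible_dates(raw_date), parse_temperature(raw_temp))
```

Run it and two of the three invented rows come back with either two possible dates or no usable reading at all, and no amount of modeling effort downstream can put that information back.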
This guy datas.
Sorry, I maintain that processing data that is full of (known) systematic problems, or data that is known to be insufficiently sensitive to detect the effect you’re after, is a drain on limited resources.
You’re exactly right.
here’s the thing …