Hadoop’s assumptions about a datacenter do not hold in a virtualized environment.
Storage is usually one or more of transient virtual drives, transient local physical drives, persistent local virtual drives, or remote SAN-mounted block stores or file systems.
Storage on virtual hard drives may cause a lot of seeking, even when the access appears sequential to the VM.
Networking may be slower and throttled by the infrastructure provider.
Virtual Machines are requested on demand from the infrastructure; the machines may be allocated anywhere in the infrastructure, possibly on servers that are running other VMs at the same time.
The other VMs may be heavy CPU and network users, which can cause the Hadoop jobs to suffer. Alternatively, the heavy CPU and network load of Hadoop can cause problems for the other users of the server.
VMs can be suspended and restarted without OS notification; this can cause clocks to jump forward by many seconds.
Other users on the network may be able to listen to traffic, to disrupt it, and to access ports that do not authenticate all access.
Some infrastructures may move VMs around; this can actually move clocks backwards when the new physical host’s clock is behind that of the original host.
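Because suspend/resume and migration affect the wall clock but not (on most platforms) the monotonic clock, a jump in either direction can be detected by comparing the two. The sketch below is illustrative only; the threshold value and class structure are assumptions, not part of Hadoop.

```python
import time

class ClockJumpDetector:
    """Detect wall-clock jumps by comparing wall-clock deltas against a
    monotonic clock, which is not reset by VM suspend or migration.
    Illustrative sketch; the threshold is an arbitrary assumption."""

    def __init__(self, threshold_seconds=5.0):
        self.threshold = threshold_seconds
        self._last_wall = time.time()
        self._last_mono = time.monotonic()

    def check(self, now_wall=None, now_mono=None):
        """Return the apparent jump in seconds (positive = forward,
        negative = backward), or 0.0 if within the threshold.
        The optional arguments allow injecting timestamps for testing."""
        now_wall = time.time() if now_wall is None else now_wall
        now_mono = time.monotonic() if now_mono is None else now_mono
        wall_delta = now_wall - self._last_wall
        mono_delta = now_mono - self._last_mono
        self._last_wall, self._last_mono = now_wall, now_mono
        jump = wall_delta - mono_delta
        return jump if abs(jump) > self.threshold else 0.0
```

A forward jump after a suspend, or a backward jump after migration to a host with a lagging clock, both show up as a large difference between the two deltas.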
Replication to (transient) hard drives is no longer a reliable way to persist data.
The network topology is not visible to the Hadoop cluster, though latency and bandwidth tests may be used to infer "closeness" and build a de facto topology.
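One way to build such a de facto topology is to group hosts whose measured round-trip times to each other fall below a "near" threshold into the same pseudo-rack. The function below is a minimal sketch of that idea; the data structure, names, and threshold are assumptions for illustration, not a Hadoop API.

```python
def infer_racks(rtt_ms, near_threshold_ms=0.5):
    """Group hosts into pseudo-racks from measured latencies.

    rtt_ms maps (host_a, host_b) pairs to measured round-trip times in
    milliseconds. Hosts whose RTT to every current member of a rack is
    below near_threshold_ms are assumed "close" (e.g. same rack, or
    even the same physical server). Sketch only; real measurements
    would be repeated and smoothed."""
    hosts = sorted({h for pair in rtt_ms for h in pair})
    racks = []  # list of sets of hosts
    for host in hosts:
        for rack in racks:
            if all(rtt_ms.get((host, m), rtt_ms.get((m, host), float("inf")))
                   < near_threshold_ms for m in rack):
                rack.add(host)
                break
        else:
            racks.append({host})
    return racks
```

The resulting groups could then be fed to Hadoop's rack-awareness mechanism in place of a statically configured topology.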
The correct way to deal with a VM that is showing recurring failures is to release it and request a new one, rather than blacklisting it.
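That release-and-replace policy can be sketched as a small failure tracker. The class, names, and failure threshold below are hypothetical, intended only to show the decision the text describes.

```python
class VmHealthPolicy:
    """Track failures per VM and recommend release-and-replace once a
    VM shows recurring failures, instead of blacklisting the node as
    Hadoop would on physical hardware. The threshold is an assumption."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}  # vm_id -> observed failure count

    def record_failure(self, vm_id):
        """Record one failure; return "release" once the VM has failed
        max_failures times ("release" meaning: hand the VM back to the
        infrastructure and request a fresh one), else "keep"."""
        self.failures[vm_id] = self.failures.get(vm_id, 0) + 1
        if self.failures[vm_id] >= self.max_failures:
            return "release"
        return "keep"
```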
The JobTracker may want to request extra VMs when there is extra demand.
The JobTracker may want to release VMs when there is idle time.
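The two elasticity points above amount to a simple scaling policy: request VMs when there is a task backlog, release them when capacity sits idle. The toy function below sketches such a decision; the parameters, names, and warm-reserve heuristic are assumptions, not anything the JobTracker actually implements.

```python
def scaling_decision(pending_tasks, idle_vms, tasks_per_vm=2, min_idle_vms=1):
    """Return a signed VM count: positive = VMs to request,
    negative = VMs to release, 0 = no change.

    A toy elasticity policy: size requests to drain the backlog,
    and release idle VMs beyond a small warm reserve."""
    if pending_tasks > 0:
        # Backlog: request enough VMs to run all pending tasks,
        # counting idle VMs we already have. -(-a // b) is ceil(a / b).
        needed = -(-pending_tasks // tasks_per_vm)
        return max(needed - idle_vms, 0)
    # No backlog: release idle VMs beyond the warm reserve.
    return -max(idle_vms - min_idle_vms, 0)
```

In practice such a policy would also damp oscillation (e.g. by requiring sustained demand before scaling), but the core trade-off is as shown.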
A failure of the hosting infrastructure can lose all machines simultaneously.