AI takes command: No rest for the data centre warriors in India

AI is reshaping data centre operations with unblinking precision—predicting failures, cutting energy waste, and fighting downtime in real time, while redefining human roles.

Punam Singh

As Artificial Intelligence (AI) continues to evolve, data centres are undergoing a fundamental transformation. The demand for AI applications now necessitates specialised infrastructure and new operating models. AI is also being harnessed to manage the very hardware that powers it, shifting data centre operations from manual oversight to intelligent, automated systems.

According to Rohan Sheth, Head – Colocation and Data Centre Services at Yotta, while many AI use cases are still emerging, applications developed over the past five to ten years have already played a pivotal role in predictive operations. “Data centres are mission-critical facilities where downtime is unacceptable,” Sheth remarks. “The main selling point of a good data centre is zero downtime throughout the year, which is possible through predictive maintenance.”

Traditional data centres are designed to manage general computing workloads, such as hosting websites or running databases. In contrast, AI-focused data centres are built for high-performance computing and for training and deploying AI models. This necessitates a unique setup: AI data centres use specialised processors such as Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), and dedicated AI accelerators capable of handling parallel processing.

Common choices for such tasks include NVIDIA’s H100, B200, and GB200 chips. While these chips take on the heavy computational load, Central Processing Units (CPUs) continue to manage data flow and support the accelerators.

High-speed data transfer between processors is critical in AI environments. AI workloads generate immense east-west traffic, particularly between GPU clusters, requiring both high bandwidth and ultra-low latency. To meet this demand, AI data centres employ low-latency connections, such as InfiniBand, and advanced Ethernet standards, like 400G and 800G.

These technologies ensure that large volumes of data move efficiently, avoiding bottlenecks that can impede model training. A spine-leaf architecture is often used in these networks, offering predictable, low-latency paths and scalability. Fibre optics remain the backbone of these high-speed networks.

How AI Monitors Hardware and Predicts Failures

AI systems are now essential for continuously monitoring data centre hardware. They collect and analyse data in real time using sophisticated algorithms to predict issues and optimise performance.

A network of sensors captures data across the facility. Environmental sensors detect temperature, humidity, air quality, and leaks. Power meters monitor power capacity and usage. Additional sensors track variables such as pressure, vibration, CPU and memory utilisation, disk health, and network traffic. This sensor information typically consists of time-series data, enabling the analysis of temporal trends.

An AI data pipeline collects and processes this raw sensor data, converting it into formats suitable for AI applications. This includes ingestion, cleaning, validation, and transformation of data. Time-series databases are essential for storing sensor data in chronological order. They support high write throughput, compress data efficiently, and allow rapid access to both real-time and historical data. Data lakes also provide flexible storage for vast volumes of structured and unstructured data.
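As a rough illustration of the pipeline described above, the sketch below (Python with pandas; the sensor name, timestamps, and validity thresholds are invented for the example) ingests raw readings, rejects implausible values, and resamples them into the regular time series that a time-series database or downstream model would typically expect.

```python
import pandas as pd

# Hypothetical raw sensor readings: irregular timestamps, one faulty sample
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:03", "2024-01-01 00:00:58",
        "2024-01-01 00:02:01", "2024-01-01 00:02:59",
    ]),
    "rack_inlet_temp_c": [24.1, 24.3, -999.0, 24.9],  # -999 marks a sensor fault
})

# Cleaning and validation: discard physically implausible readings
clean = raw[(raw["rack_inlet_temp_c"] > 0) & (raw["rack_inlet_temp_c"] < 60)]

# Transformation: index by time and resample to a regular 1-minute series
series = (
    clean.set_index("timestamp")["rack_inlet_temp_c"]
         .resample("1min")
         .mean()
         .interpolate()  # fill the gap left by the rejected sample
)
print(series)
```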

Machine learning and AI algorithms then process the data for key operational insights. Anomaly detection, often deployed through AIOps (AI for IT Operations) platforms, identifies deviations from normal behaviour. Techniques such as Isolation Forest, Local Outlier Factor, and One-Class Support Vector Machines are employed for unsupervised anomaly detection. Deep learning models such as Long Short-Term Memory (LSTM) networks excel at uncovering patterns in time-series data.
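Isolation Forest is the most self-contained of the techniques named above, so a minimal sketch helps show the shape of unsupervised anomaly detection. The example assumes scikit-learn and uses synthetic telemetry; the feature set and contamination rate are illustrative, not drawn from any operator's deployment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "normal" operating data: [cpu_utilisation_%, inlet_temp_c, fan_rpm]
normal = np.column_stack([
    rng.normal(55, 8, 2000),
    rng.normal(24, 1.5, 2000),
    rng.normal(3000, 150, 2000),
])

# Fit on historical normal behaviour only (no failure labels needed)
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# New readings: the second row runs hot with a near-stalled fan
new_readings = np.array([
    [57.0, 24.2, 3050.0],
    [58.0, 38.5, 400.0],
])
print(model.predict(new_readings))  # 1 = normal, -1 = anomaly
```

The same pattern extends to Local Outlier Factor or a One-Class SVM by swapping the estimator; the pipeline around it stays the same.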

Predictive analytics is one of the most valuable applications of AI. By analysing historical data on CPU usage, disk health, temperature, and system errors, AI can anticipate hardware failures before they occur. This allows proactive scheduling of repairs, reducing unplanned downtime by up to 50% and cutting maintenance costs by 10–40%.
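A hedged sketch of the supervised side of predictive analytics might look like the following: a classifier trained on historical server metrics to estimate failure risk within a maintenance window. The features, labels, and data here are entirely synthetic; real systems would draw on far richer telemetry such as disk SMART attributes and event logs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5000

# Hypothetical historical features per server:
# reallocated disk sectors, peak temperature, correctable errors per day
X = np.column_stack([
    rng.poisson(2, n),
    rng.normal(30, 4, n),
    rng.poisson(1, n),
])

# Synthetic label: failure within 30 days, more likely as the metrics worsen
risk = 0.02 * X[:, 0] + 0.05 * np.maximum(X[:, 1] - 35, 0) + 0.03 * X[:, 2]
y = (rng.random(n) < np.clip(risk, 0, 0.9)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

# Rank servers by predicted failure probability to schedule maintenance first
probs = clf.predict_proba(X_te)[:, 1]
print("highest-risk servers:", np.argsort(probs)[-5:])
```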

Reinforcement Learning, another advanced AI technique, is applied to optimise complex systems, such as cooling. AI agents simulate various operating conditions to learn how to dynamically manage coolant flow and temperature. A notable example is Google DeepMind’s use of deep reinforcement learning, which achieved up to a 40% reduction in cooling costs.
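DeepMind's production system is considerably more sophisticated, but a toy tabular Q-learning loop conveys the principle: an agent learns, through simulated operating conditions, which cooling level to apply in each temperature state so as to minimise energy use while avoiding overheating. Every state, action, and reward value below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy problem: 5 temperature buckets (cold .. hot), 3 cooling levels (low/med/high)
N_STATES, N_ACTIONS = 5, 3
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(state, action):
    """Invented thermal model: stronger cooling lowers temperature but costs energy."""
    drift = rng.integers(0, 2)  # heat load nudges temperature upward
    next_state = int(np.clip(state + drift - action, 0, N_STATES - 1))
    energy_cost = action
    overheat_penalty = 10 if next_state == N_STATES - 1 else 0
    return next_state, -(energy_cost + overheat_penalty)

state = 2
for _ in range(20000):
    # Epsilon-greedy exploration over cooling levels
    action = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Standard Q-learning update
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print("learned cooling level per temperature bucket:", np.argmax(Q, axis=1))
```

In practice, as the article notes later, such agents are trained and validated against simulations or digital twins before they are allowed anywhere near live cooling plant.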

The Benefits of AI-Driven Monitoring in Data Centres

AI-enhanced monitoring brings considerable benefits. Ashish Arora, CEO of Nxtra by Airtel, notes, “AI is reshaping modern data centre operations, delivering measurable improvements in uptime, energy efficiency, and operational agility.”

By predicting hardware failures in advance, AI enables IT teams to take action before problems escalate, thereby extending the lifespan of equipment. Sheth explains, “AI-driven procedures now predict when maintenance is due, when oil needs changing, or when equipment needs servicing. This predicts the behaviour of the equipment and gives an indication months before a potential breakdown.”

Given that cooling systems consume a large share of a data centre’s energy, AI dynamically adjusts these systems based on real-time sensor data, significantly improving efficiency. Google DeepMind’s initiative reduced cooling energy usage by 40% and improved Power Usage Effectiveness (PUE) by 15%. Platforms such as KAYTUS KSManage aim to lower PUE even further, to below 1.3. AI also optimises energy use by identifying underutilised servers and redistributing workloads.
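PUE itself is a straightforward ratio: total facility energy divided by the energy delivered to IT equipment, so a PUE of 1.3 implies roughly 30% overhead for cooling, power conversion, and lighting. A quick illustration with made-up figures:

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: 1.0 is the theoretical ideal (zero overhead)."""
    return total_facility_kwh / it_equipment_kwh

# Hypothetical monthly figures: cutting cooling energy lowers the ratio
print(pue(1_300_000, 1_000_000))   # 1.3
print(pue(1_180_000, 1_000_000))   # 1.18 after cooling optimisation
```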

Arora adds, “Our data centres are equipped with thousands of sensors that continuously monitor environmental and operational parameters. AI analyses this data in real time, dynamically adjusting cooling, power distribution, and airflow to optimise energy use without impacting performance. This is not just about cost savings—it is about sustainability.”

Beyond energy optimisation, AI automates repetitive tasks like system health checks and software updates. It continually monitors workloads and dynamically allocates resources such as compute, storage, and network bandwidth. AI-powered platforms can forecast computing demand with up to 95% accuracy, reducing over-provisioning by 30–40%. Arora notes, “Through intelligent automation, we can instantly balance workloads, allocate resources, and respond to sudden spikes in demand. This means our customers benefit from consistent, high-performance services, even as their needs evolve.”
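Production forecasting platforms rely on much richer models to reach that kind of accuracy, but the underlying workflow can be sketched with a simple baseline: learn the demand pattern from history, then provision capacity with a modest buffer rather than a blanket over-provision. All figures below are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical hourly compute demand (GPU-hours) with a daily cycle plus noise
hours = np.arange(24 * 28)
demand = 400 + 120 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 20, hours.size)

# Simple baseline forecast: average demand for each hour of day over past weeks
hour_of_day = hours % 24
forecast = np.array([demand[hour_of_day == h].mean() for h in range(24)])

# Provision tomorrow's capacity with a 10% buffer instead of over-provisioning broadly
provisioned = 1.10 * forecast
print("peak forecast:", round(float(forecast.max()), 1),
      "provisioned peak:", round(float(provisioned.max()), 1))
```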

Challenges of Integrating AI into Data Centre Operations

While the benefits are significant, AI deployment in data centres presents notable hurdles. For example, integrating AI solutions with legacy infrastructure can be challenging. Existing systems may not be built for AI, necessitating redesigns of data pipelines and models.

JP Singh, Head of IP (Networks Infrastructure) at Nokia India, highlights: “Scaling AI in large data centre environments presents two key challenges: data complexity and trust. While AI can sift through vast telemetry streams and suggest operational actions, the accuracy of those suggestions cannot always be guaranteed, especially with transformer-based models that are prone to hallucination.” According to him, full implementation of privacy-preserving AI systems takes an average of 18.7 months.

The computational demands of AI workloads, particularly those involving deep learning, are substantial. Privacy techniques such as federated learning and homomorphic encryption can increase compute overheads by as much as 165% during initial deployment. Additionally, the power consumption of AI data centres places enormous stress on existing power grids, with some grid connection requests taking up to seven years to fulfil.

Arora notes, “First, the environment is complex itself. We are talking about thousands of interconnected devices, sensors, and workloads—each generating data in different formats and at different speeds. Making sense of that data in real time, and doing it reliably at scale, is no small task.”

Model performance is highly dependent on the quality of the data. Inaccurate or incomplete data can lead to flawed predictions. The large-scale collection of operational data also raises privacy concerns, particularly regarding unauthorised access and potential re-identification from anonymised data. Regulatory frameworks often lag behind the pace of AI development, complicating governance.

Rajesh Tapadia, CEO – Data Centres, India at Iron Mountain, comments, “One of the key challenges in scaling AI-driven automation across distributed data centre infrastructures is the fragmented nature of data storage. This fragmentation creates hurdles—namely in ensuring consistent data quality, interoperability, and seamless integration of AI systems across diverse environments.” A recent report by Iron Mountain and FT Longitude found that 43% of Indian organisations cite cybersecurity and compliance risks as their top concerns.

Human-Machine Collaboration and the Path Forward

Looking ahead, AI will become increasingly capable of making decisions autonomously, particularly in networking and resource optimisation. Yet, human oversight remains indispensable.

Digital twin technology is emerging as a game-changer. By creating virtual replicas of data centre systems, operators can run simulations and test control strategies without risking live operations. NVIDIA’s Omniverse platform is one such example. Singh highlights Nokia India’s approach: “EDA’s integrated CI/CD pipeline and Digital Twin provide a robust validation framework. Before any AI-driven action reaches production, it is tested in a virtual environment that mirrors the production state. This eliminates guesswork and mitigates risk, ensuring that automation decisions are credible and safe.”

The rise of low-latency applications is accelerating the adoption of edge computing. By processing data closer to the source, edge computing reduces latency and eases network congestion. Edge nodes perform real-time inference, while central data centres focus on complex tasks like large-scale model training.

As AI assumes greater control, human roles are evolving. Tapadia observes, “Human expertise will increasingly shift towards strategic oversight, governance, and exception handling. While AI can optimise tasks such as energy use, predictive maintenance, and workload distribution, human roles will evolve to ensure compliance with regulatory standards, manage complex transitions, and maintain operational resilience.” He also notes a shortage of skilled personnel for AI data management, affecting 40% of Indian organisations.

Arora concurs: “AI will automate the repetitive, but humans will continue to lead the design, direction, and governance of these systems.” He foresees a future shaped by “strong human-machine collaboration,” where AI offers insights, but it is “human judgement that ensures these insights translate into effective action.”

The future of AI in data centres hinges on continuous assessment, clear objectives, and the agility to adapt. AI is both the driver of new computational challenges and the most effective solution for managing them. Navigating this dual role requires commitment, strategic planning, and collaboration across every layer of data centre infrastructure and operations.