Why the cloud native telco still needs work

Telcos feel they are yet to see the full benefits of cloud-native operations. TMN talks to operators, vendors and cloud platform providers about the cloud-native journey.

Telcos hope to gain new revenues and enable new business models by offering network capabilities as a platform; witness the ongoing programmes in areas like network APIs and programmable network slices. Achieving this relies on an underlying network architecture that is software-based, automatable and flexible. In other words, to get to these new revenues and service capabilities operators will rely on a cloud-native network, just like the hyperscaler platforms with whom they hope to compete and co-operate.

However, despite operating networks which include containerised functions from core to (in a few cases) the RAN, many operators say that they are yet to unlock the full benefits of cloud-native operations. 

Back in December 2023 the Cloud Native Computing Foundation produced a white paper, co-written by many mobile network operators with feedback from vendors, that outlined the steps still required to give telcos the true benefits of cloud-native operations. Leaning on the NGMN’s Cloud Native Manifesto, it highlighted a number of areas where the interaction between vendor CNFs and the underlying CaaS (cloud environments such as Red Hat, VMware, Wind River and vendor-specific clouds) was still far from optimal.

Operator wish list

The writers said that part of the problem is that operator requirements are not yet stable and still emerging – making things difficult for vendors. But they added that operators’ reticence has been in part because vendors have been averse to giving up a lucrative professional services business model and their control over vertical integration.

For a new model to work, vendors and CSPs must provide mutual SLAs, the paper said. The CSPs must guarantee a certain level of quality at the platform layer, while CNF vendors need to guarantee that the application will perform on the platform, meeting defined KPIs. 

The paper’s requirements included pre-validation of a CNF against a reference that is common to all players. The paper also asked for the ability for a CSP’s DevOps team or a vendor’s delivery team to adapt CNF artifacts without special change requests or complex R&D processes. CNFs should also be delivered with a series of automated tests, and CNF deployment and configuration should be fully automated (“everything as a code”) and done exclusively with declarative cloud native mechanisms like GitOps. The CNFs should also be completely independent from underlying infrastructure platforms, and constructed in a way that fully tolerates “graceful cluster rolling upgrade procedures” without blocking them and without service interruptions.

Ideally, the paper said that CNFs should work with any certified Kubernetes. “Network interfaces and CNI plug-ins create hardware dependencies and tie them to specific infrastructure. Therefore a wise, but far-reaching, approach would be to develop all networks and I/O acceleration purely in software. Broader adoption of this software separation might take years to mature. In the short term, the task is to improve the automation to maximise the agility of the applications.”

Further, the paper said that applications should also be stateless. “The legacy approach uses local storage, which makes network functions heavy and slow. To improve workload mobility, the application state should be stored in custom resource definitions (CRDs) or a separate database.”
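To make the stateless idea concrete, here is a minimal Python sketch. It is not taken from the paper: the class names are hypothetical, and a plain in-memory dictionary stands in for the "separate database" the paper mentions. The point it illustrates is that a replacement worker can pick up a session exactly where a failed one left off, because nothing is held locally.

```python
class ExternalStore:
    """Hypothetical stand-in for a separate database or CRD-backed store."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class StatelessWorker:
    """Keeps no session state of its own; every read and write goes to the store."""

    def __init__(self, store):
        self.store = store

    def handle(self, session_id, nbytes):
        # Read the running total from the external store, update it, write it back.
        total = self.store.get(session_id, 0) + nbytes
        self.store.put(session_id, total)
        return total


store = ExternalStore()
first = StatelessWorker(store)
first.handle("session-1", 100)

# The first worker is discarded; a new one continues from the stored state.
replacement = StatelessWorker(store)
running_total = replacement.handle("session-1", 50)  # 150
```

A real CNF would also need to deal with concurrent writers and store latency, which this sketch ignores.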

Cloud native telco applications also need to support horizontal scaling (across multiple machines) and vertical scaling (between different machine sizes), so that operators can start with a very small application and grow it as needed.
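For the horizontal case, the Kubernetes Horizontal Pod Autoscaler documents a simple rule: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is close enough to 1. The Python sketch below reproduces that rule for illustration only. The function name, the replica limits and the load figures are assumptions, and the 0.1 tolerance mirrors the autoscaler's documented default rather than anything in the paper.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count following the documented HPA ratio rule.

    current_metric is the observed average load per replica and
    target_metric is the load each replica should carry.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the workload alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, wanted))


# Two replicas each seeing 900 sessions against a target of 500 per replica.
scaled_out = desired_replicas(2, 900, 500)   # 4
# Four replicas each seeing 100 sessions against the same target.
scaled_in = desired_replicas(4, 100, 500)    # 1
```

The second call shows the "start small and grow as needed" behaviour in reverse: capacity is released when subscribers are not using it.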

They should be configurable using DevOps tools and observable via an OpenMetrics interface that monitoring tools can use. They should also be portable, so that CNFs declare their platform requirements without implying a specific implementation. And cloud native applications need to minimise divergence between development and production, enabling continuous deployment for maximum agility.

Real world problems

Telcos themselves – even those that have already done a large amount of lifting their applications onto cloud platforms – are aware that they are not yet extracting the full benefits of being “cloud-native”.

On a panel debate hosted by TMN (The softwarised network and how to get there), Telefónica’s Cayetano Carbajo, Director for Core, Transport and Service Platforms, GCTIO, identified one key issue: that its cloud deployments are still dependent on a specific certification with the cloud layer.

“The situation today is the majority of operators are using the CaaS from the vendor in the majority of cases. We don’t have the transversal approach. The effort and the cost of certifying on so many different platforms is not making it fly. When you count the number of demanding network functions on top of transversal CaaS, there are not so many cases in the world. This is a pity because we are losing capabilities for auto-healing and scaling, or reusing capacity for services. We are not there.”

Mark Henry, Director of Network and Spectrum Strategy at BT, concurs that the operator, which has all its core as software on its own BT network cloud, has also not yet unlocked the full cloud-native toolbox. That’s partly because the operator was forced to switch its core away from Huawei to Ericsson under pressure of a government-imposed deadline.

Although Henry points out it was a successful migration that took a lot less time than previous migrations, the prime requirement for stability and reliability (99% of its customers and traffic are on the new platform) meant that the operator has been “cautious” and therefore hasn’t yet harnessed the full benefits of being cloud native.

“We integrated the core network functions into our own network cloud and felt the pain of it,” Henry says.

“When you migrate millions of customers, you’re just very focused on stability and resilience. So we’ve got that, it’s been pretty much transparent to the customers that we’ve migrated millions of them. That is an amazing achievement by Reza [Reza Rahnama, MD Mobile Networks, BT] and team.

“But what that means is that we’re not really leveraging the benefits of cloud. Yes it is containerised, it is multi tenant (we have a few of our fixed functions on there) – but the sharing of resources, the scaling… in the end I wouldn’t say we’re fully cloud native yet. And that is a development journey, with us hand in glove with Ericsson, plus our other VNF vendors.”

Henry agrees that certification and integration could be an ongoing challenge for the industry. 

“There’s different views. If you talk to non-telco people it’s not a religious debate – you will run on many different plans. But where we are in telco at the moment there’s some heavy lifting integration and I can see that for vendors, having to integrate with many different variants before they have sufficiently abstracted their software is challenging. And we’re seeing that play out.”

Henry says that he thinks a set of de facto requirements will start to emerge, towards which vendors and others can work.

“Without a standard, I think we’ll just coalesce around a similar set of functional requirements each cloud would have – and with the likes of BT, DT, SingTel and others, there’s a few front runners driving that.”

The vendor challenge

For vendors too, there’s a recognition that diverse operator requirements are hard to meet. Nokia is the prime example, in that it declared last year that it would move away from its own cloud platform and instead steer operators towards Red Hat OpenShift. Part of its reasoning at the time was that supporting the range of different operator cloud platforms was just not sustainable for it as a business.

Mavenir’s Bejoy Pankajakshan, EVP – Chief Technology & Strategy Officer, also confirms the issue.

“We are deployed in different stacks with different operators on different infrastructure. From a maturity perspective it helps make software stacks that are more mature – running across these different infrastructures. We need to build abstraction layers that allow us to hide the complexity of the underlying CaaS/PaaS layer. When we port our application – if we have to re-engineer our software every different time that is not sustainable. So getting there is a key strategic investment area for us.”

Red Hat the platform layer player

It’s an issue that the platform providers are also aware of. 

Rimma Iontel, Chief Architect, Telco, Red Hat, says, “One of the things that’s like a big gap right now is the certification of the workloads.”

“We have a certification programme: we work with our partners and we certify the workloads, but it doesn’t do functionality testing. We’re just certifying that they can turn on the container on top of our platform and that it doesn’t break anything.”

When Red Hat is working with customers and they have specific needs from the testing point of view, Red Hat accommodates that.

“But it’s not like you can just dump any CNF on the platform and you have a pre-made test suite that you run through so then you can say, yeah, it’s a cloud native application, it complies with cloud native principles of deployment of lifecycle management, of scaling, scheduling, all of that.”

“What we’ve seen a lot is that most CNFs do not comply with that. They do not scale, they do not use Kubernetes-native constructs to deploy and just use Helm charts, which gives you the deployment but doesn’t give you the lifecycle management. And you don’t get scaling, oftentimes, because you have to resize it – so even if you don’t have enough subscribers utilising the resources it’s already locked in. Because of that, we’ve seen our customers planning services the old way and locking in resources that are sitting there waiting for service uptake, which might or might not happen.”

Another issue is on resiliency.

“Kubernetes makes sure that that service that you have deployed, that number of pods, is always there. If one fails it’s not a problem because a new one would just pop up automatically. That’s how Kubernetes works, because it’s a declarative model. And that’s how our platform works, automatically, because all of it is based on Kubernetes-native principles. But once you start looking at a CNF, hardly any of them do that,” Iontel says.
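The declarative model Iontel describes can be shown with a toy reconciliation loop. The Python sketch below is not Kubernetes code and uses none of its APIs; the class and pod names are hypothetical. It only illustrates the principle that the operator declares a desired count and a control loop keeps closing the gap between that and what is actually running.

```python
import itertools


class ToyCluster:
    """Toy model of declarative replica management (not real Kubernetes)."""

    def __init__(self, desired_replicas):
        self.desired_replicas = desired_replicas  # the declared state
        self.pods = []                            # the observed state
        self._ids = itertools.count()

    def fail(self, pod_name):
        """Simulate a pod crashing."""
        self.pods.remove(pod_name)

    def reconcile(self):
        """Bring the observed state back in line with the declared state."""
        while len(self.pods) < self.desired_replicas:
            self.pods.append(f"pod-{next(self._ids)}")
        while len(self.pods) > self.desired_replicas:
            self.pods.pop()
        return list(self.pods)


cluster = ToyCluster(desired_replicas=3)
cluster.reconcile()      # starts pod-0, pod-1, pod-2
cluster.fail("pod-1")    # one pod dies
cluster.reconcile()      # a replacement, pod-3, is started automatically
```

A CNF that expects a named instance to be restored by hand, rather than replaced by an interchangeable new one, does not fit this loop, which is the gap the quote points to.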

While Red Hat and its OpenShift platform is probably one of the most established platforms in telco cloud, along with the likes of VMware and Wind River, there are other companies with an eye on solving the platform-CNF lifecycle issues.

SUSE susses opportunity

One is SUSE. With its latest Adaptive Telco Infrastructure Platform 3.0 release, it claims it has a platform that is optimised for telco use cases, and that meets the requirements laid down by Project Sylva, the Linux Foundation-hosted programme being led by major European telcos that is seeking to specify requirements for telco cloud infrastructure.

Keith Basil, GM, Edge Business Unit at SUSE, says “We do telco specific optimisations to get direct high speed access to the data streams coming into those workloads. So we deploy all of that, we lifecycle manage all of that, we test that, we certify that, we QA that, and then that is essentially the product.”

It is this product that Basil says is effectively a commercial implementation of Sylva’s reference architecture.

“We are participants in the upstream project Sylva, and contribute code to Sylva based on the requirements and specs that come out of the working groups there. Sylva has published a 1.0 reference spec and so what ATIP effectively is, is a downstream commercialisation of that reference architecture that’s fully supported.”

According to Basil, one partner, vRAN company Parallel Wireless, credited the platform – along with an automated deployment capability called Edge Image Builder – with its ability to support much quicker product deployment, and also to support its chip-independent software architecture.

“It’s allowed them to be very aggressive with going to market with their own marketing. And it gives them coverage for Intel based platforms and ARM based platforms. So it’s expanded their speed and agility to address their own market.”

“Probably the biggest thing that has changed was getting back to the notion of open source always wins. It provides choice. And so we decided to embrace the Cluster API. In the Kubernetes world there’s a lot of adoption for that in non-telco use cases, but that same API can be used to point to our capability to carve out and provision bare metal machines.

“So all of the integration glue that is needed for a C-API defined cluster, to orchestrate the building of a machine from an off state to a running cluster, we’ve done all of that.

“So if a telco customer, for example, has their own tooling, or command and control facilities, we can just give them the Cluster API which they can target, and then they can build their own clusters using our entire toolset from that point forward. So there’s a lot of freedom and flexibility.”

Basil says that telcos are beginning to talk to SUSE about its work.

“The last two MWCs we were aggressive, hungry, knocking on other people’s doors. This MWC there’s been a wholesale shift in approach because people are knocking on our doors now.”

Wind River says not easy, but happening

Another platform company that says it is enabling true cloud native operations is Wind River. Randy Cox, Vice President, Product Management, Intelligent Cloud, points out that Wind River Studio supports cloud telco deployments at scale, including a deployment of Samsung vRAN DU basebands at thousands of edge sites with Verizon, as well as supporting Vodafone’s transition towards vRAN in the UK.

“We are commercially deployed in Verizon with the largest vRAN deployment in the world. We’re completely containerised, we’re cloud native, we have the ability to scale.”

As well as its Studio product Wind River also claims its VxWorks real time operating system extends the network to the device edge, meaning Studio can now manage that container all the way to the device. At MWC Wind River had a demo of its containerised OS sitting on a device on a car that is software defined. “Now that car actually looks like a sub cloud to Studio. So that’s not possible if you’re not doing cloud-native,” Cox said. 

That said, the company is now targeting operations at scale as a key focus over the next year.

“I’m not suggesting that this is easy. This is hard stuff. It’s taken us a long time and we’ve gone through a lot of learnings. In fact one of our focusses in the next 12 months is operations at scale, because operating a network in a cloud environment – there’s a tremendous amount to learn, along with our partners as well as the carrier themselves.

“So if you hear a carrier say, ‘Well, it’s not where we want to be’, I get that because there’s a lot to learn and it’s a completely different way of doing things in terms of skill set at the operator.”

Cox says that Wind River itself has added more automation capability.

“Part of our operations at scale focus also includes automation. We’ve done quite a bit of work there, but we’re not done yet either.”

As an example, Cox says that Wind River has developed automated life cycle management that starts with an audit of the hardware infrastructure, makes sure that it is in the right state, then instals its cloud platform software and analytics, before deploying the nodes across the network.

One example Cox gave is a deployment of the core network User Plane Function in Elisa’s network [Ericsson is Elisa’s core network supplier – TMN], in addition to which it is running a network monitoring function from Elisa-Polystar. Automating the deployment and life cycle management of those functions had shown about 96% savings in terms of operational staff, he claimed.

“We think that’s a very good kind of proof point. And by the way, because that’s a core deployment it has limited amounts of automation [potential]. We can’t really show and demonstrate our true capabilities until we get all the way to the DU and the edge.

“But if we extrapolate those kinds of savings, our savings actually get better, because with Studio Conductor we can go to the entire network at the same time to do this kind of deployment or upgrade, so that kind of efficiency is going to increase and scale out more when we get to the far edge.”

Wrapping up

Operators have defined their problems and their desire to get the full benefits of automated, cloud-native network operations. They know too that this challenges them internally, both in terms of skills and in terms of changing their own operating practices. Platform players are jostling for position to wrap up their platforms to be able to offer more manageability, flexibility and automation to their operator customers. The onus is going to be on software vendors to fully build abstraction layers for their software, meaning they will probably have to cede control over integration and changes to their artefacts, accepting test and deployment processes that are more automated. If that does happen, then operators will feel they have more tools at their disposal to enable the new business models they are hoping for. Outside of all of this sit the hyperscalers and their telco businesses, primed for the telco opportunity, but as yet not welcomed into the heart of most operators’ businesses. But the clock is ticking.