Hi,

This is a follow-up TSV-ART review. Summary rating: Ready with Issues.

Sorry, I missed this email before Christmas. I just took another look at the updated draft -14. It does address some of the issues that I raised. However, I think the following issues should be addressed before publication.

1. Security considerations: As you note in your reply, in the worst case, managing to instruct the endpoint to combine the wrong streams could at a minimum result in incorrect output, and may also crash the decoder. Thus, I think the security considerations need a bit more than the standard boilerplate here, and should be explicit about the need to be able to trust the signalling for which streams to combine into one output, as well as the need for source authentication on all the streams that are used as input when decoding.

2. When it comes to the congestion control considerations, I would note that here too some discussion appears to be needed about considering the full aggregate when adapting the bit-rates, to ensure that the adaptation results in a proportional quality degradation, and not a much larger one due to adapting the wrong stream.

When it comes to the general multiplexing description, including using SSRC grouping, I think addressing this will have a significant impact on the time to publish this document. Whether that is worth it depends heavily on whether there exists usage of this that would benefit from that description. However, as you have a BUNDLE case, it might be fine for the initial usage. Thus, it might be simpler to just go ahead for now, and if actual deployments run into use cases that need things expressed more clearly, for example multiple sources per direction in the same set of RTP session(s), then extensions may be warranted.

I still think this draft could have benefited from an additional architectural section between Section 4 and 5 that would have discussed how RTP sessions vs. streams (SSRCs) are best used for some use cases.
I think that would have simplified the rest of the description.

Cheers

Magnus

From: Lauri Ilola (Nokia)
Date: Tuesday, 9 December 2025 at 14:42
To: Magnus Westerlund, tsv-art@ietf.org
Cc: avt@ietf.org, draft-ietf-avtcore-rtp-v3c.all@ietf.org, last-call@ietf.org
Subject: RE: draft-ietf-avtcore-rtp-v3c-12 ietf last call Tsvart review

Hello Magnus!

Thanks for the thorough feedback. Let me try to address these over email here. I've implemented your suggestions below, except for the few clarifications that I wanted to ask about.

Regarding Section 9.3.

You are correct that there are multiple ways of transmitting the atlas data and the video data. V3C has a concept that allows packing multiple video components into the same video frame, so you can end up needing only one video stream. Together with the video you'll need the atlas stream in case the atlas data is dynamic. Alternatively, the draft allows sending atlas data as part of the SDP if it doesn't change over the session, which is the scenario that allows you to stream only one video for the volumetric experience. I'll try to clarify these two methods more clearly in the draft to avoid any confusion on the reader's part.

Your point on ssrc-group is also well made and could be yet another way of grouping the different components. Would it make sense to add this as yet another way of grouping the data under the clarified section on grouping V3C components? It would probably just need a new parameter to clarify the nature of the grouping, correct?

Regarding Section 8.

> Due to how the full media representation when using V3C is dependent on having both the ATLAS as well as the component video streams, the response to congestion control limitations is far from trivial.
> I think some clarification for the implementer is needed here on how it should behave when forced to reduce the aggregate bandwidth, and how to consider inter-stream prioritization. This issue is clearly different from what scalable video codecs encounter when being bandwidth limited, where it is usually clear how to reduce the bit-rate.

This is an astute observation. You are absolutely correct that this is far from trivial and could be something that sets one implementation apart from another. Many services and receivers may have different opinions on how the adaptation should be performed, depending on the available hardware and processing at hand. Specifying it here could be rather limiting, and as such we propose to follow the bare-minimum methods as written in the draft. I don't believe proper adaptation is as simple as defining media stream priorities, but some streams are certainly more important than others. For example, for one application it may be absolutely fine to drop colour or texture information and stream only black-and-white data as a method of adaptation. Another application may prefer increasing the noise in the rendering, by dropping occupancy information and trying to derive occupancy from depth and colour videos. Do you consider this a road-blocker if we don't definitively fix the adaptation in the specification?

Regarding Section 11.

> I think this format needs an additional security consideration due to the grouping. That is, for correct decoding the signalling system needs to correctly indicate the combination of the V3C Atlas stream and the component streams. If an attacker is able to manipulate this information, the sender's intention will not be represented.

This would mean that an attacker, if able to manipulate the SDP, would be able to direct atlas data to a video decoder and vice versa, or that video codec components would be reconstructed incorrectly. This would likely cause the decoder to crash.
Similar problems would occur if a video and audio streaming session were attacked and the bitstreams were directed to incorrect decoders. This sounds like something that should have a default mechanism to protect against this kind of attack. Do you know if there is a standard that addresses this?

> If I manipulate the ATLAS information, can I significantly increase the decoding effort? For example, forcing orders of magnitude more iterations over the underlying component video stream data to create the volumetric representation?

Manipulation of the atlas data would likely cause mis-indexing of video textures and result in crashing the decoder. How decoders handle falsified atlas data is very much left to the decoder implementation. Smart implementations would have means of detecting such manipulations (for example, counting how many texel read operations are made per pixel), but less sophisticated decoders could end up in infinite loops if not careful. I'm unsure how this sort of attack could be prevented other than urging carefulness from decoder implementers. Would it be sufficient to add a note urging such carefulness?

Thanks again for the constructive suggestions. Looking forward to your suggestions.

Kind regards,
-Lauri

-----Original Message-----
From: Magnus Westerlund via Datatracker
Sent: Tuesday, October 28, 2025 3:50 PM
To: tsv-art@ietf.org
Cc: avt@ietf.org; draft-ietf-avtcore-rtp-v3c.all@ietf.org; last-call@ietf.org
Subject: draft-ietf-avtcore-rtp-v3c-12 ietf last call Tsvart review

Document: draft-ietf-avtcore-rtp-v3c
Title: RTP Payload Format for Visual Volumetric Video-based Coding (V3C)
Reviewer: Magnus Westerlund
Review result: Almost Ready

This document has been reviewed as part of the transport area review team's ongoing effort to review key IETF documents.
These comments were written primarily for the transport area directors, but are copied to the document's authors and WG to allow them to address any issues raised, and also to the IETF discussion list for information. When done at the time of IETF Last Call, the authors should consider this review as part of the last-call comments they receive. Please always CC tsv-art@ietf.org if you reply to or forward this review.

High level issue:

I think this document is not clear enough on the different alternatives that are actually supported for transmitting the ATLAS data and the component video data. Section 4.1 gives the impression that one can combine all data needed for one V3C representation into a single video stream, i.e. sent over a single RTP SSRC. Section 9.2 instead talks about how to have a separate V3C stream with the atlas data, and then component video streams over other RTP streams (SSRCs). For the latter there exists a plethora of possible multiplexing models with what is being defined in Sections 9.2-9.4. With the defined grouping of V3C one can clearly do both RTP-session-based multiplexing as well as bundled. The examples in Section 9.3 appear to indicate that one needs unique media lines in SDP per complete V3C representation, and that one can't set up one media line per type and simply use multiple SSRCs in each, with one complete set across the media lines generating one media representation. Or even just establish one payload type per type and then use RFC 5576 ssrc-group to indicate a set of SSRCs that are part of one representation. Wouldn't it make sense to have an ssrc-group for V3C?

Having read the document, I think there is a need for a dedicated section that defines which combinations are possible and what support external to RTP/RTCP these need for providing the grouping. Can you confirm that you have not identified any way of using existing RTP/RTCP mechanisms to identify the set of SSRCs that are part of one representation?
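To make the ssrc-group question concrete, here is a rough, hypothetical sketch of what RFC 5576-style source grouping could look like in SDP. The grouping semantics token "V3C", the payload format names, ports, and SSRC values are all invented for illustration; a real mechanism would require the new grouping semantics to be defined and registered.

```
m=video 40000 RTP/AVP 96 97
a=rtpmap:96 v3c/90000
a=rtpmap:97 H265/90000
a=ssrc-group:V3C 1111 2222 3333
a=ssrc:1111 cname:sender@example.com
a=ssrc:2222 cname:sender@example.com
a=ssrc:3333 cname:sender@example.com
```

In this sketch, SSRC 1111 might carry the atlas stream (payload type 96) while 2222 and 3333 carry component video streams (payload type 97), with the hypothetical "V3C" semantics declaring that the three sources together form one volumetric representation.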
Another significant issue is the one for Section 8, regarding bit-rate adaptation for this payload format and its component streams.

Section 7.1:

Published specification: Please refer to [ISO.IEC.23090-5]

I think this needs to indicate the RFC that defines the RTP payload format, as that is the specification for which the media type is being registered.

Restrictions on usage: N/A

I think the recommended text from RFC 8088 for this field still applies: This media type depends on RTP framing and, hence, is only defined for transfer via RTP [RFC3550]. Transport within other framing protocols is not defined at this time.

Section 8:

Due to how the full media representation when using V3C is dependent on having both the ATLAS as well as the component video streams, the response to congestion control limitations is far from trivial. I think some clarification for the implementer is needed here on how it should behave when forced to reduce the aggregate bandwidth, and how to consider inter-stream prioritization. This issue is clearly different from what scalable video codecs encounter when being bandwidth limited, where it is usually clear how to reduce the bit-rate.

Section 9:

Please add a reference to RFC 8866 in the first sentence.

Section 9.1:

I would recommend making it clear that "byte-string" uses the definition that exists in RFC 8866.

Section 11:

I think this format needs an additional security consideration due to the grouping. That is, for correct decoding the signalling system needs to correctly indicate the combination of the V3C Atlas stream and the component streams. If an attacker is able to manipulate this information, the sender's intention will not be represented.

Secondly:

This RTP payload format and its media decoder do not exhibit any significant non-uniformity in the receiver-side computational complexity for packet processing, and thus are unlikely to pose a denial-of-service threat due to the receipt of pathological data.
Nor does the RTP payload format contain any active content.

If I manipulate the ATLAS information, can I significantly increase the decoding effort? For example, forcing orders of magnitude more iterations over the underlying component video stream data to create the volumetric representation?
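As a footnote to the discussion above about detecting falsified atlas data by "counting how many texel read operations are made per pixel", here is a minimal sketch of such a guard. Everything in it is an assumption for illustration: the patch representation (lists of (x, y, texel) reads), the function name, and the budget constant are invented, not taken from the V3C specification.

```python
# Illustrative guard against runaway decoding effort caused by falsified
# atlas data: cap how many times any output pixel may be written.
MAX_READS_PER_PIXEL = 8  # assumed budget; a real decoder would tune this


def reconstruct(width, height, patches):
    """Apply patch reads to an output frame, aborting if any pixel is
    touched more often than the per-pixel budget allows."""
    reads = [[0] * width for _ in range(height)]
    out = [[None] * width for _ in range(height)]
    for patch in patches:
        for x, y, texel in patch:
            if not (0 <= x < width and 0 <= y < height):
                continue  # drop out-of-range indices from bad atlas data
            reads[y][x] += 1
            if reads[y][x] > MAX_READS_PER_PIXEL:
                raise ValueError("per-pixel read budget exceeded")
            out[y][x] = texel
    return out
```

The point of the sketch is only that the check is cheap (one counter per pixel) and bounds the total work to width * height * MAX_READS_PER_PIXEL operations, regardless of how the atlas data was manipulated.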