The Vision Framework (iOS 18 Edition)

Which Vision?

This article is about 'Vision', the framework, where Vision means 'computer vision'. It is not about visionOS, and it is also not about VisionKit, which is something different again. I highlighted the one I'm talking about in the image on the right. For my SSC25 project I decided to use the new Vision framework.

Why New?

The framework first debuted at WWDC17 (iOS 11). But last year at WWDC24 it was marked as 'Legacy', because a new, native Swift implementation was added for iOS 18. Of course I had to try that one.
TL;DR: it's delightful! Basically you remove the 'VN' prefix from all the classes and then you get access to the new features.
If you want to take a look at some of the things I am talking about: I started with the 'old' Hand Drawing Sample Project. In it, we detect where the thumb and index finger of the most prominent hand are and show those two points; if they touch, we draw some lines. As a reference for the new classes I looked at the new OCR Sample Project, which is built with the new framework and detects text in a picture, placing rectangles where the text was found.
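Just to give a flavour of the new API before we get to the hand stuff, text recognition looks roughly like the sketch below. I'm writing this from memory of the OCR sample rather than copying it, so treat the exact spellings (RecognizeTextRequest, RecognizedTextObservation, topCandidates, the CGImage initialiser) as my assumptions.

import CoreGraphics
import Vision

// Rough sketch, not copied from the sample: recognise text in a CGImage.
// Type and method names are my best recollection of the new API.
func recognizeText(in image: CGImage) async throws -> [String] {
    let request = RecognizeTextRequest()
    let handler = ImageRequestHandler(image) // assuming the handler takes a CGImage like it takes a pixel buffer
    let observations = try await handler.perform(request)
    // Each observation carries ranked candidates; take the best string of each.
    return observations.compactMap { $0.topCandidates(1).first?.string }
}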
Also, as I am writing this, I feel it will probably read kinda badly: I am trying to describe my opinion of a framework based on only one specific use case. It might read better if you treat it as a collection of misunderstandings I had with the framework as a whole.

Hand Detection

For my project I needed hand detection. In the 'old' framework it looks something like this (taken and adapted from the Hand Drawing sample):

var handPoseRequest = VNDetectHumanHandPoseRequest()
handPoseRequest.maximumHandCount = 1 // the request has a max of one hand
extension CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    public func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .up, options: [:])
        do {
            // Perform VNDetectHumanHandPoseRequest
            try handler.perform([handPoseRequest])
            // Continue only when a hand was detected in the frame.
            // Since we set the maximumHandCount property of the request to 1, there will be at most one observation.
            guard let observation = handPoseRequest.results?.first else { return }
            // Get points for thumb 
            let thumbPoints = try observation.recognizedPoints(.thumb)
            // Look for tip points.
            guard let thumbTipPoint = thumbPoints[.thumbTip] else { return }
            // Ignore low confidence points.
            guard thumbTipPoint.confidence > 0.3 else { return }
            // Convert points from Vision coordinates to AVFoundation coordinates.
            thumbTip = CGPoint(x: thumbTipPoint.location.x, y: 1 - thumbTipPoint.location.y)
        } catch {
            print(error)
        }
    }
}

With the new Vision it looks like this:

var handPoseRequest = DetectHumanHandPoseRequest()
handPoseRequest.maximumHandCount = 1 // the request has a max of one hand

nonisolated func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    Task {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        let handler = ImageRequestHandler(pixelBuffer, orientation: .up) // orientation is optional
        do {
            // Perform DetectHumanHandPoseRequest (now async).
            let observations = try await handler.perform(handPoseRequest)
            // Get the thumb joints of the first (and, with maximumHandCount = 1, only) hand.
            guard let thumbPoints = observations.first?.allJoints(in: .thumb), let tip = thumbPoints[.thumbTip] else { return }
            // Ignore low confidence points.
            guard tip.confidence > 0.3 else { return }
            // Convert points from Vision coordinates to AVFoundation coordinates.
            let thumbTip = CGPoint(x: tip.location.x, y: 1 - tip.location.y)
        } catch {
            print(error)
        }
    }
}

If you're now thinking 'that's nearly the same!', you'd be correct. A lot of it stays the same. You lose all the VN prefixes and gain some nice async/await features (or maybe not so nice, if you'd rather not think about concurrency). The only thing that really changes is how you get the joints. It used to be that you'd go through the observation's recognizedPoints(_:) and then look for the joint you wanted. Now that is a bit easier: the observation offers access to all the joints, but you can also ask only for a group of joints, such as those belonging to the thumb, the index finger, or another finger. The returned dictionary lets you look up a Joint by its HumanHandPoseObservation.JointName.
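To make that difference concrete, here is a minimal side-by-side sketch, assuming you already have an observation from each API; the helper names are mine, and both versions do the same Vision-to-AVFoundation y-flip as the snippets above.

import CoreGraphics
import Vision

// Old Vision: ask for a group of recognized points, then pick the joint out of the returned dictionary.
func thumbTipOld(from observation: VNHumanHandPoseObservation) throws -> CGPoint? {
    let thumbPoints = try observation.recognizedPoints(.thumb)
    guard let tip = thumbPoints[.thumbTip] else { return nil }
    return CGPoint(x: tip.location.x, y: 1 - tip.location.y)
}

// New Vision: same idea, but the lookup doesn't throw and the dictionary is keyed
// by HumanHandPoseObservation.JointName (a.k.a. PoseJointName).
func thumbTipNew(from observation: HumanHandPoseObservation) -> CGPoint? {
    guard let tip = observation.allJoints(in: .thumb)[.thumbTip] else { return nil }
    return CGPoint(x: tip.location.x, y: 1 - tip.location.y)
}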

JointName and PoseJointName

Sometimes JointName shows up as PoseJointName instead; that is just a typealias, so they are the same thing.

Confidence Scores

Whenever you perform a VisionRequest with an ImageRequestHandler and the algorithm actually recognises something in the supplied image, you are left with something that conforms to VisionObservation. In my case, with the DetectHumanHandPoseRequest, I get a HumanHandPoseObservation. (By the way, if you feel lost with all of the long names: yeah, same here. I have the docs open so that I get them right.) Every VisionObservation has an ID, a confidence, and some other things that aren't relevant here. The ID is unique for every observation. At first I thought I could use the ID to identify the object that was recognised, but that is obviously not the case, as Vision could not keep a stable ID for an object across what are essentially unconnected images. The confidence is a different matter. Every observation has one; it is a Float ranging from 0 to 1, and the higher, the better. Now, if you read the documentation carefully:

the value of 1.0 indicates not just the highest confidence, but it can also mean that the observation doesn't support or assign meaning to the confidence score.

If you ask for the confidence of a HumanHandPoseObservation, you will always get 1.0: the framework only gives confidence scores for the joints, not for the hand itself. I did not know that. Personally, I felt that was kinda weird and would have liked more of a warning (or something that would have made me aware of it, and its implications, sooner). I had been filtering my results for a confidence score above 0.4, but I had been doing that on the hand's confidence score, not on the joints' scores.
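So, for the record, this is roughly what I should have been doing instead. A small sketch; the function name is mine and 0.4 is just the threshold I happened to use.

import CoreGraphics
import Vision

// Sketch: filter on the joints' confidence, not on the observation's.
func reliableThumbTip(in observations: [HumanHandPoseObservation]) -> CGPoint? {
    guard let hand = observations.first else { return nil }

    // hand.confidence is always 1.0 for hand pose observations, so this check tells us nothing:
    // guard hand.confidence > 0.4 else { return nil }

    // The joints are where the real confidence scores live.
    let thumbJoints = hand.allJoints(in: .thumb)
    guard let tip = thumbJoints[.thumbTip], tip.confidence > 0.4 else { return nil }
    return CGPoint(x: tip.location.x, y: 1 - tip.location.y)
}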

Chirality

If you're working with hands, you might want to know the chirality (handedness) of a hand. The HumanHandPoseObservation does come with a chirality attribute that tells you whether you're looking at a left or a right hand; it's an enum that's either .left or .right. Pretty easy, right? For my project I wanted one hand to act as a base and one to do the pointing, so I thought simply going with left and right (since, ideally, my user has exactly one left and one right hand) would be easiest. But I quickly realised why chirality would really need a confidence score of its own: in my case it was very often wrong, basically returning .right in nearly all cases.
This probably has multiple reasons. How would you recognise whether something is a left or a right hand? Say I want to recognise left hands: I would look for the arm it's attached to. What if the arm isn't in the frame? Then I'd look at where the thumb is, but now I also need to know whether I'm looking at the front or the back of the hand, and hands can be rotated in any direction. What if the image I'm looking at is actually mirrored? I feel the most reliable way to tell left from right would be to look at the rest of the body. In the end I moved away from this: I didn't use chirality at all and used my own calculations to decide which hand is which.
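For completeness, my original plan looked roughly like this. A sketch only, assuming the property is spelled chirality like on the old VNHumanHandPoseObservation, and with a function name I just made up; I ended up throwing this approach out.

import Vision

// Sketch of the idea I abandoned: pick the "base" and "pointing" hand by chirality.
// In practice chirality came back as .right for nearly every hand I showed it.
func assignHands(_ hands: [HumanHandPoseObservation]) -> (base: HumanHandPoseObservation?, pointing: HumanHandPoseObservation?) {
    let base = hands.first { $0.chirality == .left }      // left hand as the base...
    let pointing = hands.first { $0.chirality == .right } // ...right hand for pointing
    return (base, pointing)
}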

Async and Swift6

I am not an expert on Swift 6. What I really liked about the new implementation is that it is simply much less code, because the framework is built for structured concurrency and the Swift-y way of doing things.

AVCaptureVideoDataOutputSampleBufferDelegate

If you are trying to analyse images in real time (say, while the user is filming something), you will probably work with the aforementioned delegate. That delegate has a function called captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection), which hands you the image as (or rather, inside) a CMSampleBuffer. The buffer, as well as any CVImageBuffer you get from it, is a pre-concurrency type and will cause the compiler to yell at you if you try to move it into another function. The way I understand it, that's because it is explicitly not Sendable. I've read somewhere on the forums that AVFoundation will try to reuse the buffer as soon as it decides that the delegate method has exited.
I also found this discussion on the forums and went with annotating the functions as nonisolated. I do not know if that is good or even necessary; I would love to learn more about this in the future. Currently I am just unsure whether the amount of work I am doing even warrants offloading it from the main actor, or whether I would just create more problems by trying to.
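For context, here is a minimal sketch of the setup around that delegate as I understand it. The class name, queue label, @MainActor annotation and front-camera choice are mine, not from the sample; the point is just that the callbacks arrive on a background queue, which is why the method ends up nonisolated.

import Foundation
import AVFoundation

@MainActor
final class CameraController: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    // Delegate callbacks arrive on this queue, not on the main actor.
    private let videoQueue = DispatchQueue(label: "video.output.queue") // hypothetical label

    func configure() throws {
        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front) else { return }
        let input = try AVCaptureDeviceInput(device: device)
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: videoQueue)

        session.beginConfiguration()
        if session.canAddInput(input) { session.addInput(input) }
        if session.canAddOutput(output) { session.addOutput(output) }
        session.commitConfiguration()
    }

    // nonisolated, because AVFoundation calls this from videoQueue, and the
    // buffer should not escape: it may be reused once this method returns.
    nonisolated func captureOutput(_ output: AVCaptureOutput,
                                   didOutput sampleBuffer: CMSampleBuffer,
                                   from connection: AVCaptureConnection) {
        // Hand the frame to Vision here, as in the snippet further up.
    }
}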

Orientation (Confusion)

Please read this part as total confusion; I hope to update it as I figure out why things are the way they are.
When you're using an ImageRequestHandler, you can pass an orientation (CGImagePropertyOrientation) to the handler. This is one thing I still sort of struggle with, and it might also play into the chirality part I was talking about. The way I read the documentation, the camera captures images in its natural landscape orientation; if you take a picture in portrait, you then need to rotate the image to display it correctly. The docs also state that this can be important for computer-vision tasks like recognising faces, so I would assume that knowing the correct orientation also matters for recognising hands.

Orientation values are commonly found in image metadata, and specifying image orientation correctly can be important both for displaying the image and for certain image processing tasks such as face recognition. For example, the pixel data for an image captured by an iOS device camera is encoded in the camera sensor’s native landscape orientation.

I've seen in the hand drawing sample code that Apple specifies the orientation as .up. Sadly, the sample only works in one orientation of the iPad (portrait) and not if you rotate it. Also, I am using the front camera, not the back camera as shown in the documentation. So what is the natural landscape of an iPad's front camera? For my old iPad Pro, which has the camera on the shorter side, I first assumed the natural landscape is the iPad held in landscape with the camera on the left, because I read it as 'landscape' rather than 'natural landscape'. By now I would say the camera's natural orientation is actually portrait, with the camera at the top of the iPad. That would mean the rotation, with the iPad in landscape (camera on the left), has to be .right, or .rightMirrored. But no: the correct result comes with .downMirrored. I assume this is the correct rotation because it places the points on the screen correctly without me rotating the content. The hand drawing sample does rotate the content: it converts points from Vision to AVFoundation coordinates, which I stopped doing. That conversion mirrors the y-axis, so their .up becomes .down for me.
So we move on, giving the rotation as .downMirrored. Why mirrored? I do not know. I assume it has something to do with it being a front camera, even though Apple's sample doesn't have to mirror.
Now, if I rotate the iPad, the orientation would have to change, right? It doesn't actually seem to: the orientation I pass along with the buffer apparently does not need to change when I rotate the iPad, which confused me even more.
Then I had some friends test my app. One of them has one of the newer iPads, where the front camera is no longer on the short side but on the long side, directly under where the Apple Pencil attaches. When he opened my app, the points I display (where the hands are) were on different parts of the screen; for him, the orientation I gave to the handler seemed to be off. If he runs the hand drawing sample (or I run it on my mom's iPad), the points drawn by that sample are also in the wrong place, so I assume this has nothing to do with the code I use to rotate the image when the iPad's rotation changes. And if I rotate the new iPad until its camera is at the bottom, the points match up with the hand again.
By testing all possible CGImagePropertyOrientation values, I found that .upMirrored is correct for the new iPads, no matter the rotation. This is still kinda confusing to me. The 'mirrored' part seems to come from it being a front camera, while the .up versus .down part seems to stem from where the camera sits. But that would mean the front camera in those iPads differs not by 90 degrees but by 180 degrees, and that is where I gave up and accepted that I do not really get where 'natural landscape' is for a front camera. Or maybe I am misunderstanding the whole concept, or it's something about converting between coordinate systems. For my project I simply check the name of the device and change the orientation if it's one of the new iPads. That is kind of a bad solution, I know, but I had a deadline to meet.
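If it helps, the workaround boils down to something like the sketch below. The helper name is hypothetical (my real check compares the device name against a hard-coded list of the newer iPads), and the two orientation values are simply the ones that worked on my test devices.

import ImageIO

// Hypothetical stand-in for my device check; the real implementation compares
// the device name against a hard-coded list of iPads whose front camera sits
// on the long edge.
func frontCameraIsOnLongEdge() -> Bool {
    false // device check elided
}

// .downMirrored worked on my older iPad Pro (camera on the short edge),
// .upMirrored on the newer iPads (camera on the long edge).
func frontCameraOrientation() -> CGImagePropertyOrientation {
    frontCameraIsOnLongEdge() ? .upMirrored : .downMirrored
}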

Conclusion

In conclusion: when I set out to write this, I wanted to write about how much I like the new Vision. It is so much nicer to use than the old one, much less code and much easier to understand what's going on. But now that I am close to done, I realise that that's not what I've written; this text is much more about all the things I found confusing. Maybe it helps you, or maybe you know the answers to my confusion, in which case I would love to hear from you!
Oh, and I also linked a WWDC session about the new Vision; I really enjoyed watching it.
