Tesla CEO Elon Musk recently unveiled the company's Tesla Bot. The robot, code-named Optimus, shuffled across a stage, waved its hand, and pumped its arms in a slow dance move. Musk predicts the robot could cost $20,000 within three to five years if all goes according to plan. But the question is, what can it do for us? Before we get into that, let's look at the main devices that drive the Tesla Bot.
Tesla Bot Actuators
Actuators are the main drive system for any robot. You could say a robot is nothing more than a PC with moving parts; in other words, a robot is a PC with actuators and sensors. Tesla has developed its own actuators for the Bot: three types of rotary actuators and three types of linear actuators.
If you are wondering why Tesla didn't use standardized linear actuators like the FIRGELLI actuator, it's because they face several constraints that force them to develop their own systems: the robot has to be lightweight, power efficient, high in power density, and low cost. Tesla has said it wants the Bot to retail for $20,000. That in itself is a tall order for something that requires 28 actuators, a powerful PC, lots of sensors, a battery pack that lasts more than a few hours, and a strong skeleton to hold it all together.
Tesla Bot Linear Actuators
The linear actuators Tesla developed are built for one specific role, which means they would not be of much use in any application other than a robot. The actuators employ what Tesla calls a planetary roller screw, which is essentially a variation on the ball-screw leadscrew design, and instead of a traditional magnetic armature coil in the middle of the motor, they use a brushless motor design. The roller-screw design is very efficient and uses less power, but it is also more expensive. The brushless drive means the lifespan will be significantly longer, and it allows highly specific drive modes controlled by the software.
The length of travel is only about 2 inches, and as the demo showed the actuator lifting a 500 kg piano, that is a lot of weight. You may wonder why it needs to lift so much. When installed in a metal skeleton, the actuator's short travel has to be amplified into the much larger stroke of whatever it is moving. If it is moving the leg of a robot, the leg needs to swing through about 150 degrees, so over a 2-foot length the leg sweeps roughly a 3-foot arc. The human body, which has evolved over hundreds of thousands of years, lets us do this with our leg muscles, but getting a linear actuator to do it is no easy task. So the point is that even though the actuator can lift 500 kg over 2 inches, once it is connected to a lever the force gets reduced significantly, depending on the leverage ratio, while the speed increases, which makes for a nice trade-off.
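To make that trade-off concrete, here is a minimal sketch in Python; the 500 kg rating and roughly 2-inch stroke come from the article above, while the 1:10 lever ratio is an assumed, illustrative number rather than anything Tesla has published:

```python
# Illustrative lever trade-off: force at the actuator vs. force and travel at the joint.
# The 500 kg / 2-inch figures come from the demo; the 1:10 lever ratio is assumed.

G = 9.81                      # m/s^2
actuator_force_n = 500 * G    # ~4,900 N of linear force at the actuator rod
actuator_stroke_m = 0.05      # ~2 inches of travel
lever_ratio = 10              # assumed: the limb end moves 10x farther than the rod

# Work in ~= work out (ignoring friction), so force drops while travel and speed rise
output_force_n = actuator_force_n / lever_ratio
output_travel_m = actuator_stroke_m * lever_ratio

print(f"Force at the limb end: ~{output_force_n:.0f} N "
      f"(~{output_force_n / G:.0f} kg equivalent)")
print(f"Travel at the limb end: ~{output_travel_m:.2f} m for a 2-inch stroke")
```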
Tesla Bot Presentation
Here is what Tesla themselves had to say at the latest Bot presentation, given on September 30, 2022:
Elon Musk presents: We've got some really exciting things to show you; I think you'll be pretty impressed. I do want to set some expectations with respect to our Optimus robot. As you know, last year it was just a person in a robot suit, but we've come a long way, and compared to that I think it's going to be very impressive. We're going to talk about the advancements in AI for full self-driving, as well as how they apply more generally to real-world AI problems like a humanoid robot, and even beyond that. I think there's some potential that what we're doing here at Tesla could make a meaningful contribution to AGI, and I think Tesla is actually a good entity to do it from a governance standpoint, because we're a publicly traded company with one class of stock, and that means the public controls Tesla. I think that's actually a good thing; if I go crazy, you can fire me. This is important. Maybe I'm not crazy, I don't know. So we're going to talk a lot about our progress in AI and Autopilot, as well as our progress with Dojo, and then we're going to bring the team out and do a long Q&A so you can ask tough questions, whatever you'd like, existential questions, technical questions. We want to have as much time for Q&A as possible, so with that, let's get going.
Hey guys, I'm Milan, I work on Autopilot. And I'm Lizzie, a mechanical engineer on the project as well. Okay, should we bring out the bot? This is the first time we've tried this robot without any backup support: no cranes, no mechanical mechanisms, no cables, nothing. The bot runs on the same self-driving computer that runs in your Tesla cars, by the way. It's literally the first time the robot has operated without a tether, and that was on stage tonight. The robot can actually do a lot more than we just showed you; we just didn't want it to fall on its face. So we'll show you some videos now of the robot doing a bunch of other things, which are less risky.
We wanted to show a little bit more of what we've done over the past few months with the bot, beyond just walking around and dancing on stage. These are humble beginnings, but you can see the Autopilot neural networks running as-is, just retrained for the bot directly on the new platform. That's my watering can. You can see a rendered view: that's the world the robot sees, and it's very clearly identifying objects, like the object it should pick up, and picking it up. We use the same process as we did for Autopilot to collect data and train neural networks that we then deploy on the robot. That's an example that illustrates the upper body a little bit more, something that we'll try to nail down to perfection over the next few months. This is a real station in the Fremont factory that it is working at.
That's not the only thing we have to show today. What you saw was what we call Bumble C, our rough development robot using semi-off-the-shelf actuators, but we've actually gone a step further than that already. The team has done an incredible job, and we have an Optimus bot with fully Tesla-designed actuators, battery pack, control system, everything. It wasn't quite ready to walk, but I think it will walk in a few weeks. We wanted to show you the robot, something that's actually fairly close to what will go into production, and show you all the things it can do, so let's bring it out.
This is what we expect to have in the Optimus production unit one: the ability to move all the fingers independently, and a thumb with two degrees of freedom, so it has opposable thumbs on both the left and right hand and is able to operate tools and do useful things. Our goal is to make a useful humanoid robot as quickly as possible, and we've designed it using the same discipline that we use in designing the car, which is to say designed for manufacturing, so that it's possible to make the robot in high volume, at low cost, with high reliability. That's incredibly important. You've all seen very impressive humanoid robot demonstrations, and that's great, but what are they missing? They're missing a brain; they don't have the intelligence to navigate the world by themselves. They're also very expensive and made in low volume, whereas Optimus is designed to be an extremely capable robot made in very high volume, probably ultimately millions of units, and it is expected to cost much less than a car. I would say probably less than twenty thousand dollars would be my guess.
The potential of Optimus is, I think, appreciated by very few people. As usual, Tesla demos are coming in hot, so that's good. The team has put in an incredible amount of work, working seven days a week, burning the 3 a.m. oil, to get to the demonstration today. I'm super proud of what they've done; they've really done a great job, and I'd just like to give a hand to the whole Optimus team. There's still a lot of work to be done to refine Optimus and improve it; obviously this is just Optimus version one, and that's really why we're holding this event: to convince some of the most talented people in the world, like you guys, to join Tesla and help make it a reality and bring it to fruition at scale, such that it can help millions of people. The potential really boggles the mind, because you have to ask, what is an economy? An economy is sort of productive entities times productivity: capital times output per capita. At the point at which there is no limitation on capital, it's not clear what an economy even means; an economy becomes quasi-infinite. Taken to fruition, in the hopefully benign scenario, this means a future of abundance, a future where there is no poverty, where you can have whatever you want in terms of products and services. It really is a fundamental transformation of civilization as we know it. Obviously we want to make sure that transformation is a positive one, and safe. That's also why I think Tesla, as an entity doing this, being a single class of stock publicly traded and owned by the public, is very important and should not be overlooked. I think this is essential, because if the public doesn't like what Tesla's doing, the public can buy shares in Tesla and vote differently.
This is a big deal. It's very important that I can't just do what I want; sometimes people think that, but it's not true. It's very important that the corporate entity that makes this happen is something the public can properly influence, and I think the Tesla structure is ideal for that. Like I said, self-driving cars will certainly have a tremendous impact on the world. I think they will improve the productivity of transport by at least a half order of magnitude, perhaps an order of magnitude, perhaps more. Optimus, I think, has maybe a two-order-of-magnitude potential improvement in economic output; it's not clear what the limit actually even is. But we need to do this in the right way: we need to do it carefully and safely, and ensure that the outcome is one that is beneficial to civilization and one that humanity wants. This is extremely important, obviously. I hope you will consider joining Tesla to achieve those goals. At Tesla we really care about doing the right thing, and we aspire to always do the right thing and really not pave the road to hell with good intentions. I think the road to hell is mostly paved with bad intentions, but every now and again there's a good intention in there, so we want to do the right thing. So consider joining us and helping make it happen. With that, let's move on to the next phase. Right on, thank you, Elon.
All right, so you've seen a couple of robots today. Let's do a quick timeline recap. Last year we unveiled the Tesla Bot concept, but a concept doesn't get us very far. We knew we needed a real development and integration platform to get real-life learnings as quickly as possible, so that robot that came out and did the little routine for you guys, we had that within six months, and we've been working on software integration and hardware upgrades in the months since then. In parallel, we've also been designing the next generation, this one over here. This one is rooted in the foundation of the vehicle design process; we're leveraging all of the learnings we already have.
Obviously a lot has changed since last year, but a few things are still the same. You'll notice we still have this really detailed focus on the true human form. We think that matters for a few reasons, but it's also fun: we spend a lot of time thinking about how amazing the human body is. We have this incredible range of motion and typically really amazing strength. A fun exercise: if you put your fingertip on the chair in front of you, you'll notice that there's a huge range of motion you have in your shoulder and your elbow, for example, without moving your fingertip; you can move those joints all over the place. But the robot's main function is to do real, useful work, and it doesn't necessarily need all of those degrees of freedom right away, so we've stripped it down to a minimum of 28 fundamental degrees of freedom, plus of course our hands in addition to that.
Humans are also pretty efficient at some things and not so efficient at others. For example, we can eat a small amount of food and sustain ourselves for several hours, which is great; but when we're just sitting around, no offense, we're kind of inefficient, just burning energy. So on the robot platform what we're going to do is minimize that idle power consumption, drop it as low as possible, so that we can just flip a switch and immediately the robot turns into something that does useful work.
So let's talk about this latest generation in some detail, shall we? On the screen here you'll see actuators in orange, which we'll get to in a little bit, and our electrical system in blue. Now that we have our human-based research and our first development platform, we have both research and execution to draw from for this design. Again, we're using that vehicle design foundation: taking it from concept through design and analysis, and then build and validation. Along the way we're going to optimize for things like cost and efficiency, because those are the critical metrics for taking this product to scale. How are we going to do that? We're going to reduce our part count and the power consumption of every element possible. We're going to do things like reducing the sensing and wiring at our extremities; you can imagine that a lot of mass in your hands and feet is going to be quite difficult and power-hungry to move around. And we're going to centralize both our power distribution and our compute at the physical center of the platform.
In the middle of our torso, actually it is the torso, we have our battery pack, sized at 2.3 kilowatt-hours, which is perfect for about a full day's worth of work. What's really unique about this battery pack is that it has all of the battery electronics integrated into a single PCB within the pack, so everything from sensing to fusing, charge management, and power distribution is in one place. We're also leveraging both our vehicle products and our energy products to roll all of those key features into this battery: streamlined manufacturing, really efficient and simple cooling methods, battery management, and safety. And of course we can leverage Tesla's existing infrastructure and supply chain to make it.
Moving on to our brain: it's not in the head, but it's pretty close. Also in our torso we have our central computer. As you know, Tesla already ships full self-driving computers in every vehicle we produce, and we want to leverage both the Autopilot hardware and software for the humanoid platform, but because the requirements and form factor are different, we're going to change a few things. It still has to do everything that a human brain does: processing vision data, making split-second decisions based on multiple sensory inputs, and communications. To support communications it's equipped with wireless connectivity as well as audio support, and it also has hardware-level security features, which are important to protect both the robot and the people around it. Now that we have our core, we're going to need some limbs on this guy. We'd love to show you a little bit about our actuators and our fully functional hands, but before we do that I'd like to introduce Malcolm, who's going to speak about the structural foundation of the robot. Thank you.
Tesla has the capability to analyze highly complex systems, and it doesn't get much more complex than a crash. You can see here a simulated crash of a Model 3 superimposed on top of the actual physical crash; it's incredible how accurate it is. Just to give you an idea of the complexity of this model: it includes every nut, bolt, and washer, every spot weld, and it has 35 million degrees of freedom. It's quite amazing, and it's true to say that if we didn't have models like this, we wouldn't be able to make the safest cars in the world.
So can we utilize our capabilities and our methods from the automotive side for the robot? Well, we can make a model, and since we have crash software, we used the same software here: we can make it fall down. The purpose of this is to make sure that if it falls down, and ideally it doesn't, the damage is only superficial. We don't want to, for example, break the gearbox in its arms; that's the equivalent of a dislocated shoulder for a robot, difficult and expensive to fix. We want it to dust itself off and get on with the job it's been given. We can also take the same model and drive the actuators using the input from a previously solved model, bringing it to life. This produces the motions for the tasks we want the robot to do: picking up boxes, turning, squatting, walking upstairs, whatever the set of tasks is, we can play them through the model. This is showing just simple walking. We can compute the stresses in all the components, and that helps us optimize the components.
These are not dancing robots; these are actually the modal behavior, the first five modes of the robot. Typically, when people make robots, they make sure the first mode is up in the high single figures, towards 10 hertz. The reason to do this is to make the control of walking easier: it's very difficult to walk if you can't guarantee where your foot is because it's wobbling around. That's okay if you want to make one robot, but we want to make thousands, maybe millions. We haven't got the luxury of making them from carbon fiber and titanium; we want to make them in plastic, and things are not quite so stiff, so we can't hit these high targets, I'll call them dumb targets. We've got to make them work at lower targets. Is that going to work? Well, if you think about it, sorry about this, but we're just bags of soggy jelly with bones thrown in; we're not high frequency. If I stand on one leg, I don't vibrate at 10 hertz; people operate at low frequency. So we know the robot can too, it just makes the controls harder. We take the modal data and the stiffness information and feed that into the control system, and that allows it to walk.
Changing tack slightly, let's look at the knee. We can take some inspiration from biology and look at what the mechanical advantage of the knee is. It turns out it behaves quite similarly to a four-bar link, and that's quite non-linear. That's not surprising, really, because if you think about bending your leg down, the torque on your knee is much higher when it's bent than when it's straight, so you'd expect a non-linear function, and in fact the biology is non-linear; this matches it quite accurately. So that's the representation: the knee is obviously not physically a four-bar link, as I said, but the characteristics are similar. Me bending down isn't very scientific, so let's be a bit more scientific. We've played all the tasks through this graph: it shows walking, squatting, the tasks I mentioned, plotting the torque seen at the knee against the knee bend angle on the horizontal axis. This is the requirement for the knee to do all these tasks, and then we put a curve through it, riding over the top of the peaks, which says: this is what's required to make the robot do these tasks. If we look at the four-bar link, that's the green curve, and it says the non-linearity of the four-bar link actually linearizes the force characteristic. What that really means is that it lowers the force, and that's what lets the actuator work at the lowest possible force, which is the most efficient; we want to burn energy slowly. What's the blue curve? The blue curve is what you'd get if we didn't have a four-bar link and just had an arm sticking out of the leg with an actuator on it, a simple two-bar link. That's the best you could do with a simple two-bar link, and it shows that it would create much more force in the actuator, which would not be efficient.
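To see why a linkage that "linearizes" the force curve helps, here is a toy comparison; the knee-torque curve and both transmission-ratio functions are invented shapes, and only the qualitative conclusion mirrors the talk:

```python
# Illustrative comparison of actuator force for a simple two-bar lever vs. a
# four-bar-style linkage at the knee. All curves below are made-up stand-ins.
import math

def knee_torque_nm(bend_deg):
    # Assumed requirement: torque grows as the knee bends (as described above)
    return 30 + 220 * (bend_deg / 90.0) ** 2

def two_bar_ratio_m(bend_deg):
    # Simple lever: effective moment arm shrinks as the knee bends
    return 0.06 * max(math.cos(math.radians(bend_deg)), 0.2)

def four_bar_ratio_m(bend_deg):
    # Four-bar-style linkage: moment arm grows with bend, tracking the torque demand
    return 0.04 + 0.05 * (bend_deg / 90.0)

for bend in (0, 30, 60, 90):
    t = knee_torque_nm(bend)
    f2 = t / two_bar_ratio_m(bend)
    f4 = t / four_bar_ratio_m(bend)
    print(f"bend {bend:2d} deg: torque {t:5.0f} N*m | "
          f"two-bar force {f2:7.0f} N | four-bar force {f4:7.0f} N")
```

The point of the sketch is only the shape of the result: the two-bar force blows up at deep knee bend, while the four-bar-style ratio keeps the peak actuator force far lower.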
So what does that look like in practice? As you'll see, it's very tightly packaged into the knee. It will go transparent in a second, and you'll see the four-bar link there operating on the actuator; this is what determines the force and displacement on the actuator. And now I'll pass you over to Konstantinos.
I would like to talk to you about the design process and the actuator portfolio in our robot. There are many similarities between a car and the robot when it comes to powertrain design; the most important things that matter here are energy, mass, and cost, and we are carrying over most of our design experience from the car to the robot. In this particular case you see a car with two drive units, and the drive units are used to accelerate the car, for a 0-to-60 mph time, or to drive a city drive cycle. The robot, on the other hand, has 28 actuators, and it's not obvious what the tasks are at the actuator level. We have tasks that are higher level, like walking, climbing stairs, or carrying a heavy object, which need to be translated into joint specs. Therefore we use a model that generates the torque-speed trajectories for our joints, which are subsequently fed into our optimization model and run through the optimization process.
the optimization process this is one of the scenarios that the
robot is capable of doing which is turning and walking so when we have this torque speed
trajectory we laid over an efficiency map of an actuator and we are able along
the trajectory to generate the power consumption and the energy accumulative
energy for the task versus time so this allows us to define the system
cost for the particular actuator and put a simple Point into the cloud then we do
this for hundreds of thousands of actuators by solving in our cluster and the red line denotes the Pareto front
which is the preferred area where we will look for optimal so the X denotes
the preferred actuator design we have picked for this particular joint so now we need to do this for every joint we
have 28 joints to optimize and we parse our cloud we parse our Cloud again for every joint
spec and the red axis this time denotes the bespoke actuator designs for every
joint the problem here is that we have too many unique actuator designs and
even if we take advantage of the Symmetry still there are too many in order to make something Mass
manufacturable we need to be able to reduce the amount of unique actuator designs therefore we run something
called commonality study which we parse our Cloud again looking this time for
actuators that simultaneously Meet The Joint performance requirements for more than one joint at the same time so the
resulting portfolio is six actuators and they show in a color map the middle figure
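The selection process described here, a per-joint Pareto front followed by a commonality pass, can be sketched in a few lines; everything below (the candidate designs, the requirements, the scoring) is made up for illustration and is not Tesla's optimization code:

```python
# Toy sketch of the actuator-selection idea: score candidate designs per joint,
# keep the Pareto front in (mass, cost), then prefer designs covering several joints.
import random

random.seed(0)
JOINTS = ["knee", "hip", "shoulder", "elbow"]

# Candidate designs: (design_id, mass_kg, cost_usd, torque_nm) -- all invented
candidates = [(i, random.uniform(1, 5), random.uniform(50, 400), random.uniform(20, 200))
              for i in range(1000)]
# Assumed per-joint torque requirement
requirement = {"knee": 150, "hip": 120, "shoulder": 60, "elbow": 30}

def pareto_front(points):
    """Keep designs not dominated in (mass, cost)."""
    return [p for p in points
            if not any(q[1] <= p[1] and q[2] <= p[2] and q != p for q in points)]

feasible = {j: [c for c in candidates if c[3] >= requirement[j]] for j in JOINTS}
fronts = {j: pareto_front(feasible[j]) for j in JOINTS}

# Commonality pass: greedily pick front designs that cover the most joints
portfolio, uncovered = [], set(JOINTS)
while uncovered:
    best = max({c for j in uncovered for c in fronts[j]},
               key=lambda c: sum(c in fronts[j] for j in uncovered))
    portfolio.append(best[0])
    uncovered -= {j for j in uncovered if best in fronts[j]}

print("Shared actuator designs chosen:", portfolio)
```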
The actuators can also be viewed in this slide. We have three rotary and three linear actuators, all of which have excellent output force or torque per unit mass. The rotary actuator in particular has a mechanical clutch integrated on the high-speed side, an angular-contact ball bearing on the high-speed side, and a cross-roller bearing on the low-speed side; the gear train is a strain-wave gear, there are three integrated sensors, and it has a bespoke permanent-magnet machine. The linear actuator has planetary rollers and an inverted planetary screw as its gear train, which allows efficiency, compactness, and durability. To demonstrate the force capability of our linear actuators, we set up an experiment to test one at its limits, and I will let you enjoy the video.
Our actuator is able to lift a half-tonne, nine-foot concert grand piano. This is a requirement, not something merely nice to have, because our muscles can do the same when they are directly driven: our quadriceps can do the same thing. It's just that the knee is an up-gearing linkage system that converts that force into velocity at the end effector, our heels, for the purpose of giving the human body agility. This is one of the amazing things about the human body. I'm concluding my part at this point, and I would like to welcome my colleague Mike, who's going to talk to you about hand design. Thank you very much. Thanks, Konstantinos.
We just saw how powerful a human and a humanoid actuator can be; however, humans are also incredibly dexterous. The human hand can move at 300 degrees per second, it has tens of thousands of tactile sensors, and it has the ability to grasp and manipulate almost every object in our daily lives. For our robotic hand design we were inspired by biology: we have five fingers and an opposable thumb, and our fingers are driven by metallic tendons that are both flexible and strong. We have the ability to complete wide-aperture power grasps while also being optimized for precision gripping of small, thin, and delicate objects.
So why a human-like robotic hand? The main reason is that our factories and the world around us are designed to be ergonomic. That means objects in our factory are guaranteed to be graspable, and it also means that new objects we may never have seen before can be grasped by the human hand, and by our robotic hand as well. The converse is pretty interesting, because it says these objects are designed to fit our hand, instead of us having to make changes to our hand to accommodate every new object.
Some basic stats about our hand: it has six actuators and 11 degrees of freedom. It has an in-hand controller which drives the fingers and receives sensor feedback. Sensor feedback is really important for learning more about the objects we're grasping, and also for proprioception, the ability to recognize where our hand is in space. One of the important aspects of our hand is that it's adaptive: this adaptability is built in essentially as complex mechanisms that allow the hand to adapt to the object being grasped. Another important part is that we have a non-backdrivable finger drive; this clutching mechanism allows us to hold and transport objects without having to keep the hand motors turned on. You've just heard how we went about designing the Tesla Bot hardware; now we'll hand it off to Milan and our autonomy team to bring this robot to life. Thanks, Mike.
All right, so all those cool things we showed earlier in the video were made possible in just a matter of a few months, thanks to the amazing work we've done on Autopilot over the past few years. Most of those components ported quite easily over to the bot's environment: if you think about it, we're just moving from a robot on wheels to a robot on legs, so some of those components are pretty similar, and some others require more heavy lifting. For example, our computer vision neural networks ported directly from Autopilot to the bot's situation. It's exactly the same occupancy network, which we'll talk about in a bit more detail later with the Autopilot team, that is now running on the bot in this video; the only thing that really changed is the training data, which we had to recollect. We're also trying to find ways to improve those occupancy networks using the work done on neural radiance fields, to get really great volumetric rendering of the bot's environment, for example some machinery here that the bot might have to interact with.
Another interesting problem to think about is indoor environments, mostly with the absence of a GPS signal: how do you get the bot to navigate to its destination, say, to find its nearest charging station? We've been training more neural networks to identify high-frequency features, keypoints, within the bot's camera streams, and to track them across frames over time as the bot navigates its environment. We're using those points to get a better estimate of the bot's pose and trajectory within its environment as it's walking.
We also did quite a lot of work on the simulation side. This is literally the Autopilot simulator, into which we've integrated the robot's locomotion code, and this is a video of the motion control code running in the simulator, showing the evolution of the robot's walk over time. As you can see, we started quite slowly in April and then accelerated as we unlocked more joints and deeper, more advanced techniques like balancing with the arms over the past few months. Locomotion is specifically one component that's very different as we move from the car to the bot's environment, so I think it warrants a little more depth, and I'd like my colleagues to start talking about this now.
Hi everyone, I'm Felix, I'm a robotics engineer on the project, and I'm going to talk about walking. Seems easy, right? People do it every day; you don't even have to think about it. But there are some aspects of walking which are challenging from an engineering perspective. For example, physical self-awareness: having a good representation of yourself, what is the length of your limbs, what is the mass of your limbs, what is the size of your feet; all of that matters. Also having an energy-efficient gait: you can imagine there are different styles of walking, and not all of them are equally efficient. Most important: keep balance, don't fall. And of course, coordinate the motion of all of your limbs together. Humans do all of this naturally, but as engineers or roboticists we have to think about these problems, and I'm going to show you how we address them in our locomotion planning and control stack.
We start with locomotion planning and our representation of the bot, that is, a model of the robot's kinematics, dynamics, and contact properties. Using that model and the desired path for the bot, our locomotion planner generates reference trajectories for the entire system, meaning trajectories that are feasible with respect to the assumptions of our model. The planner currently works in three stages: it starts by planning footsteps and ends with the motion of the entire system. Let's dive a little deeper into how this works. In this video we see footsteps being planned over the planning horizon, following the desired path. We start from this and then add foot trajectories that connect these footsteps, using toe-off and heel strike just as humans do; this gives us a larger stride and less knee bend, for higher efficiency of the system. The last stage is then finding a center-of-mass trajectory, which gives us a dynamically feasible motion of the entire system that keeps it balanced.
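The three-stage pipeline lends itself to a simple skeleton; the sketch below is a toy stand-in with invented step lengths, timings, and math, intended only to show the footsteps → foot trajectories → center-of-mass structure:

```python
# Toy skeleton of the three-stage locomotion planner described above.
from dataclasses import dataclass

@dataclass
class Footstep:
    x: float
    y: float
    t: float       # touchdown time

def plan_footsteps(path_length_m: float, step_m: float = 0.5, dt: float = 0.6):
    """Stage 1: lay alternating footsteps along a straight desired path."""
    n = int(path_length_m / step_m)
    return [Footstep(x=i * step_m, y=0.1 * (-1) ** i, t=i * dt) for i in range(n + 1)]

def foot_trajectory(a: Footstep, b: Footstep, samples: int = 5):
    """Stage 2: connect consecutive footsteps with a swing arc (toe-off to heel strike)."""
    return [(a.x + (b.x - a.x) * s / samples,           # forward progress
             0.05 * (1 - (2 * s / samples - 1) ** 2))   # parabolic swing height
            for s in range(samples + 1)]

def com_trajectory(steps, samples_per_step: int = 5):
    """Stage 3: keep the center of mass moving smoothly between support feet."""
    com = []
    for a, b in zip(steps, steps[1:]):
        for s in range(samples_per_step):
            alpha = s / samples_per_step
            com.append(((1 - alpha) * a.x + alpha * b.x, (1 - alpha) * a.y + alpha * b.y))
    return com

steps = plan_footsteps(2.0)
swing = foot_trajectory(steps[0], steps[1])
com = com_trajectory(steps)
print(f"{len(steps)} footsteps, {len(swing)} swing samples, {len(com)} CoM samples")
```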
But as we all know, plans are good, but we also have to realize them in reality, so let's see how we can do this.
Thank you, Felix. Hello everyone, my name is Anand, and I'm going to talk to you about controls. Let's take the motion plan that Felix just talked about, put it in the real world on a real robot, and see what happens.
It takes a couple of steps and falls down. Well, that's a little disappointing, but we are missing a few key pieces here which will make it work. As Felix mentioned, the motion planner is using an idealized version of itself and of the reality around it, and that is not exactly correct. It also expresses its intention through trajectories and wrenches, bundles of forces and torques that it wants to exert on the world in order to locomote. Reality is way more complex than any simple model, and the robot itself is not simple either: it has vibrations and modes, compliance, sensor noise, and on and on. So what happens when you put the bot in the real world? The unexpected forces cause unmodeled dynamics which the planner essentially doesn't know about, and that causes destabilization, especially for a system that has to be dynamically stabilized, like bipedal locomotion.
So what can we do about it? Well, we measure reality: we use sensors and our understanding of the world to do state estimation. Here you can see the attitude and pelvis pose, which is essentially the vestibular system in a human, along with the center-of-mass trajectory being tracked while the robot is walking in the office environment. Now we have all the pieces we need to close the loop: we use our better bot model, we use the understanding of reality that we've gained through state estimation, and we compare what we want versus what we estimate reality is doing to us, in order to add corrections to the behavior of the robot. Here the robot certainly doesn't appreciate being poked, but it does an admirable job of staying upright.
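The "compare what we want versus what we estimate and add corrections" loop can be illustrated with a one-axis toy; the gains, noise level, and dynamics below are assumptions, not the robot's actual controller:

```python
# Minimal sketch of closing the loop: compare the planned center-of-mass state
# with the estimated one and add a feedback correction. All numbers are invented.
import random

random.seed(1)
KP, KD = 40.0, 8.0        # assumed feedback gains
dt = 0.01                 # 100 Hz control loop

planned_pos, planned_vel = 0.0, 0.1   # reference CoM position/velocity (m, m/s)
est_pos, est_vel = 0.0, 0.0           # state estimate fed back from sensors

for step in range(300):
    planned_pos += planned_vel * dt
    # Correction = feedback on the estimation error against the plan
    correction = KP * (planned_pos - est_pos) + KD * (planned_vel - est_vel)
    disturbance = random.gauss(0.0, 0.5)            # unmodeled forces ("pokes")
    est_vel += (correction + disturbance) * dt      # toy dynamics: force -> velocity
    est_pos += est_vel * dt

print(f"tracking error after 3 s: {abs(planned_pos - est_pos):.3f} m")
```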
The final point here is that a robot that walks is not enough; it needs to use its hands and arms to be useful, so let's talk about manipulation.
Hi everyone, my name is Eric, I'm a robotics engineer on Tesla Bot, and I want to talk about how we've made the robot manipulate things in the real world. We wanted to manipulate objects while looking as natural as possible, and also get there quickly, so what we've done is break this process down into two steps: first, generating a library of natural motion references, or we could call them demonstrations, and second, adapting these motion references online to the current real-world situation. Let's say we have a human demonstration of picking up an object. We can get a motion capture of that demonstration, which is visualized right here as a bunch of keyframes representing the locations of the hands, the elbows, and the torso. We can map that to the robot using inverse kinematics, and if we collect a lot of these, we have a library we can work with. But a single demonstration is not generalizable to the variation in the real world; for instance, this would only work for a box in one very particular location. So what we've also done is run these reference trajectories through a trajectory-optimization program, which solves for where the hand should be and how the robot should balance when it needs to adapt the motion to the real world. For instance, if the box is in this other location, our optimizer will create this trajectory instead.
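Here is a hedged sketch of that two-step recipe, with a simple offset warp standing in for the trajectory optimizer and entirely invented keyframes:

```python
# Toy sketch of the manipulation recipe described above: keep a library of
# demonstrated hand keyframes, then adapt the closest one to where the object
# actually is. The "adaptation" is a simple offset warp, not Tesla's optimizer.

# Library of demonstrations: name -> list of (x, y, z) hand keyframes (invented)
library = {
    "pick_box": [(0.30, 0.00, 0.90), (0.45, 0.00, 0.60), (0.50, 0.00, 0.40)],
    "place_box": [(0.50, 0.00, 0.40), (0.45, 0.20, 0.60), (0.30, 0.20, 0.90)],
}

def adapt(reference, demo_target, actual_target):
    """Warp a demonstrated trajectory so its final keyframe lands on the object."""
    dx, dy, dz = (a - d for a, d in zip(actual_target, demo_target))
    n = len(reference) - 1
    # Blend the offset in gradually so the start of the motion stays natural
    return [(x + dx * i / n, y + dy * i / n, z + dz * i / n)
            for i, (x, y, z) in enumerate(reference)]

demo_box_location = library["pick_box"][-1]
actual_box_location = (0.55, 0.15, 0.35)   # where perception says the box really is

trajectory = adapt(library["pick_box"], demo_box_location, actual_box_location)
for keyframe in trajectory:
    print(tuple(round(v, 2) for v in keyframe))
```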
Next, Milan is going to talk about what's next for Optimus at Tesla. Thanks. Thanks, Eric.
Right, so hopefully by now you guys have a good idea of what we've been up to over the past few months. We started with something that's usable, but it's far from being useful; there's still a long and exciting road ahead of us. I think the first thing, within the next few weeks, is to get Optimus at least on par with Bumble C, the other bot prototype you saw earlier, and probably beyond. We're also going to start focusing on a real use case at one of our factories, and really try to nail this down and iron out all the elements needed to deploy this product in the real world, the ones I was mentioning earlier: indoor navigation, graceful fall management, even servicing, all the components needed to scale this product up. I don't know about you, but after seeing what we've shown tonight, I'm pretty sure we can get this done within the next few months or years, make this product a reality, and change the entire economy. I would like to thank the entire Optimus team for the hard work over the past few months; I think it's pretty amazing that all of this was done in barely six or eight months. Thank you very much.
Thank you. Hey everyone, I'm Ashok, I lead the Autopilot team alongside Milan. It's going to be hard to top that Optimus section, but we'll try nonetheless.
Every Tesla that has been built over the last several years, we think, has the hardware to make the car drive itself, and we have been working on the software to add higher and higher levels of autonomy. This time around last year we had roughly 2,000 cars driving our FSD Beta software. Since then we have significantly improved the software's robustness and capability, and we have now shipped it to 160,000 customers as of today. This did not come for free; it came from the sweat and blood of the engineering team over the last year. For example, we trained 75,000 neural network models in just the last year, roughly a model every eight minutes coming out of the team. We evaluate them on our large clusters, and we shipped 281 of those models that actually improved the performance of the car.
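That cadence is easy to sanity-check from the numbers quoted:

```python
# Quick check of the "roughly a model every eight minutes" figure quoted above,
# using the 75,000-models-per-year number from the talk.
models_per_year = 75_000
minutes_per_year = 365 * 24 * 60

print(f"{minutes_per_year / models_per_year:.1f} minutes per model")  # ~7 minutes
```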
This pace of innovation is happening throughout the stack: the planning software, the infrastructure, the tools, even hiring; everything is progressing to the next level. The FSD Beta software is quite capable of driving the car. It should be able to navigate from parking lot to parking lot, handling city street driving, stopping for traffic lights and stop signs, negotiating with objects at intersections, making turns, and so on. All of this comes from the camera streams that go through our neural networks running on the car itself; it's not going back to a server or anything, it runs on the car and produces all the outputs that form the world model around the car, and the planning software drives the car based on that. Today we'll go into a lot of the components that make up the system. The occupancy network acts as the base geometry layer of the system. This is a multi-camera video neural network that, from the images, predicts the full physical occupancy of the world around the robot: anything that's physically present, trees, walls, buildings, cars, whatever it may be; if it's physically present, the network predicts it, along with its future motion.
On top of this base level of geometry, we have more semantic layers. In order to navigate the roadways we need the lanes, of course, but roadways have lots of different lanes and they connect in all kinds of ways, so it's actually a really difficult problem for typical computer vision techniques to predict the set of lanes and their connectivity. So we reached all the way into language technologies and pulled in the state of the art from other domains, not just computer vision, to make this task possible. For vehicles, we need their full kinematic state in order to control for them, and all of this comes directly from neural networks: raw video streams come into the networks, go through a lot of processing, and the networks output the full kinematic state, positions, velocities, acceleration, jerk; all of that comes directly out of the networks with minimal post-processing. That's really fascinating to me, because how is this even possible? What world do we live in that this magic is possible, that these networks predict such high derivatives of position, when people thought we couldn't even detect these objects?
My opinion is that it did not come for free; it required tons of data. We built sophisticated auto-labeling systems that churn through raw sensor data and run a ton of offline compute on the servers, which can take a few hours, running expensive neural networks and distilling the information into labels that train our in-car neural networks. On top of this, we also use our simulation system to synthetically create images, and since it's a simulation we trivially have all the labels. All of this goes through a well-oiled data-engine pipeline: we first train a baseline model with some data, ship it to the car, and see what the failures are. Once we know the failures, we mine the fleet for the cases where it fails, provide the correct labels, and add that data to the training set. This process systematically fixes the issues, and we do it for every task that runs in the car.
And to train these new, massive neural networks, this year we expanded our training infrastructure by roughly 40 to 50 percent, which sits us at about 14,000 GPUs today across multiple training clusters in the United States. We also worked on our AI compiler, which now supports the new operations needed by these neural networks and maps them onto the best of our underlying hardware resources. Our inference engine today is capable of distributing the execution of a single neural network across two independent systems-on-chip, essentially two independent computers interconnected within the single self-driving computer. To make this possible, we have to keep tight control of the end-to-end latency of this new system, so we deployed more advanced scheduling code across the full FSD platform. All of these neural networks running in the car together produce the vector space, which is the model of the world around the robot, or the car, and then the planning system operates on top of this, coming up with trajectories that avoid collisions, are smooth, and make progress towards the destination, using a combination of model-based optimization plus a neural network that helps make the optimization really fast. Today we are really excited to present progress in all of these areas. We have the engineering leads standing by to come in and explain these various blocks, and they power not just the car: the same components also run on the Optimus robot that Milan showed earlier. With that, I'll welcome Paril to start talking about the planning section.
Hi all, I'm Paril Jain. Let's use this intersection scenario to dive straight into how we do planning and decision-making in Autopilot. We are approaching this intersection from a side street, and we have to yield to all the crossing vehicles. Right as we are about to enter the intersection, the pedestrian on the other side of the intersection decides to cross the road without a crosswalk. Now we need to yield to this pedestrian, yield to the vehicles from the right, and also understand the relation between the pedestrian and the vehicle on the other side of the intersection. There are a lot of these inter-object dependencies that we need to resolve in a quick glance. Humans are really good at this: we look at a scene, understand all the possible interactions, evaluate the most promising ones, and generally end up choosing a reasonable one.
So let's look at a few of the interactions that the Autopilot system evaluated. We could have gone in front of this pedestrian with a very aggressive launch and lateral profile; obviously we'd be being a jerk to the pedestrian, and we would spook the pedestrian and his cute pet. We could have moved forward slowly and shot for a gap between the pedestrian and the vehicle from the right; again we'd be being a jerk to the vehicle coming from the right, but you should not outright reject this interaction in case it is the only safe interaction available. Lastly, the interaction we ended up choosing: stay slow initially, find the reasonable gap, and then finish the maneuver after all the agents pass.
Now, evaluating all of these interactions is not trivial, especially when you care about modeling the higher-order derivatives for other agents: for example, what is the longitudinal jerk required of the vehicle coming from the right when you assert in front of it? Relying purely on collision checks with multimodal predictions will only get you so far, because you will miss out on a lot of valid interactions. This basically boils down to solving a multi-agent joint trajectory planning problem over the trajectories of the ego vehicle and all the other agents. Now, however much you optimize, there's going to be a limit to how fast you can run this optimization problem; it will be on the order of 10 milliseconds even after a lot of incremental approximations. For a typical crowded, unprotected left turn, say you have more than 20 objects, each object having multiple different future modes, the number of relevant interaction combinations blows up, and the planner needs to make a decision every 50 milliseconds. So how do we solve this in real time? We rely on a framework we call interaction search, which is basically a parallelized tree search over a bunch of maneuver trajectories. The state space here corresponds to the kinematic state of ego, the kinematic states of the other agents, their nominal multimodal future predictions, and all the static entities in the scene. The action space is where things get interesting: we use a set of maneuver trajectory candidates to branch over a bunch of interaction decisions, and also over incremental goals for a longer-horizon maneuver. Let's walk through this search very quickly to get a sense of how it works.
We start with a set of vision measurements, namely lanes, occupancy, and moving objects. These get represented as sparse abstractions as well as latent features. We use these to create a set of goal candidates: lanes, again from the lanes network, or unstructured regions, which correspond to a probability mask derived from human demonstrations. Once we have a bunch of these goal candidates, we create seed trajectories using a combination of classical optimization approaches and our neural network planner, again trained on data from the customer fleet. Once we have a bunch of these seed trajectories, we use them to start branching on the interactions. We find the most critical interaction; in our case this would be the interaction with respect to the pedestrian, whether we assert in front of it or yield to it. Obviously the option on the left is a high-penalty option, so it likely won't get prioritized; we branch further on the option on the right, and that's where we bring in more and more complex interactions, building this optimization problem incrementally with more and more constraints. The search keeps flowing like that, branching on more interactions and branching on more goals.
Now, a lot of the tricks here lie in the evaluation of each node of the search. Inside each node, we initially started by creating trajectories using classical optimization approaches, where the constraints I described would be added incrementally, and this would take close to one to five milliseconds per action. Even though that is a fairly good number, it does not scale when you want to evaluate more than 100 interactions, so we ended up building lightweight, queryable networks that you can run in the loop of the planner. These networks are trained on human demonstrations from the fleet as well as on offline solvers with relaxed time limits. With this we were able to bring the runtime down to close to 200 microseconds per action.
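A quick budget check with the quoted numbers shows why the learned evaluators matter:

```python
# Rough budget check using the figures quoted above: ~1-5 ms per classical node
# evaluation, ~200 us per learned evaluation, a 50 ms planning cycle, 100+ nodes.

planning_budget_ms = 50
nodes_to_evaluate = 100

classical_ms_per_node = 3.0      # midpoint of the 1-5 ms range
learned_ms_per_node = 0.2        # ~200 microseconds

print(f"classical: {nodes_to_evaluate * classical_ms_per_node:.0f} ms "
      f"(budget {planning_budget_ms} ms) -> does not fit")
print(f"learned:   {nodes_to_evaluate * learned_ms_per_node:.0f} ms "
      f"(budget {planning_budget_ms} ms) -> fits, with room for search overhead")
```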
Doing this alone is not enough, because you still have this massive search to go through, and you need to prune the search space efficiently. So you need to do scoring on each of these trajectories. A few of these scores are fairly standard: you do a bunch of collision checks, and you do comfort analysis, what are the jerk and acceleration required for a given maneuver. The customer fleet data plays an important role here again: we run two sets of lightweight queryable networks, each augmenting the other. One of them is trained on interventions from the FSD Beta fleet and gives a score for how likely a given maneuver is to result in an intervention over the next few seconds; the second is trained purely on human demonstrations, human-driven data, and gives a score for how close the selected action is to a human-driven trajectory. The scoring helps us prune the search space, keep branching further on the interactions, and focus the compute on the most promising outcomes. The cool part about this architecture is that it allows us to blend data-driven approaches, where you don't have to rely on a lot of hand-engineered costs, with grounding in reality through physics-based checks.
A lot of what I described was with respect to the agents we can observe in the scene, but the same framework extends to objects behind occlusions. We use the video feed from the eight cameras to generate the 3D occupancy of the world. The blue mask here corresponds to what we call the visibility region; it basically gets blocked at the first occlusion you see in the scene. We consume this visibility mask to generate what we call ghost objects, which you can see on the top left. If you model the spawn regions and the state transitions of these ghost objects correctly, and if you tune your control response as a function of their existence likelihood, you can extract some really nice human-like behaviors. Now I'll pass it on to Phil to describe more of how we generate these occupancy networks.
Hey guys, my name is Phil, and I will share the details of the occupancy network we built over the past year. This network is our solution for modeling the physical world in 3D around our cars. It is currently not shown in our customer-facing visualization, and what you see here is the raw network output from our internal dev tool. The occupancy network takes the video streams of all eight cameras as input and produces a single unified volumetric occupancy in vector space: directly, for every 3D location around the car, it predicts the probability of that location being occupied. Since it has video context, it is capable of predicting obstacles even when they are momentarily occluded. For each location it also produces a set of semantics, such as curb, car, pedestrian, and low debris, color-coded here. Occupancy flow is also predicted, for motion. Since the model is a generalized network that does not explicitly distinguish static and dynamic objects, it is able to produce and model random motions, such as the swerving trailer here. This network is currently running in all Teslas with FSD computers, and it is incredibly efficient: it runs about every 10 milliseconds on our neural accelerator.
So how does this work? Let's take a look at the architecture. First we rectify each camera's images using the camera calibration. The images given to the network are actually not the typical 8-bit RGB images; as you can see from the first image on top, we're giving the network the 12-bit raw photon-count image. Since it has four more bits of information, it has 16 times better dynamic range, as well as reduced latency, since we no longer run the image signal processor in the loop. We use a set of RegNets and BiFPNs as the backbone to extract image-space features. Next we construct a set of 3D position queries which, along with the image-space features as keys and values, are fed into an attention module. The output of the attention module is a set of high-dimensional spatial features. These spatial features are aligned temporally using vehicle odometry to derive motion. Finally, these spatio-temporal features go through a set of deconvolutions to produce the final occupancy and occupancy-flow outputs. These come out as a fixed-size voxel grid, which might not be precise enough for planning and control, so in order to get higher resolution we also produce per-voxel feature maps, which are fed into an MLP with 3D spatial point queries to get the occupancy and semantics at any arbitrary location.
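For readers who think in code, here is a heavily hedged, toy-sized sketch of that shape: image features as keys and values, 3D queries, attention, a voxel head, and a point-query MLP. The dimensions and modules are illustrative stand-ins, and temporal alignment is omitted; this is not Tesla's architecture, only its rough silhouette:

```python
import torch
import torch.nn as nn

B, CAMS, C, D = 1, 8, 64, 128          # batch, cameras, image feature dim, model dim
VOX = 8                                 # tiny voxel grid per side for the demo

image_feats = torch.randn(B, CAMS * 100, C)        # stand-in for RegNet/BiFPN output
kv_proj = nn.Linear(C, D)
queries = nn.Parameter(torch.randn(B, VOX**3, D))  # one 3D position query per coarse voxel
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

kv = kv_proj(image_feats)
spatial_feats, _ = attn(queries, kv, kv)           # (B, VOX^3, D)

# Convolutional head -> coarse occupancy grid
grid = spatial_feats.transpose(1, 2).reshape(B, D, VOX, VOX, VOX)
occupancy = torch.sigmoid(nn.Conv3d(D, 1, kernel_size=1)(grid))

# Per-voxel features + MLP queried at arbitrary 3D points for finer detail
point_mlp = nn.Sequential(nn.Linear(D + 3, 64), nn.ReLU(), nn.Linear(64, 1 + 4))
xyz = torch.rand(B, 5, 3)                           # 5 arbitrary query points
nearest_voxel_feats = spatial_feats[:, :5, :]       # toy "lookup" of per-voxel features
fine = point_mlp(torch.cat([nearest_voxel_feats, xyz], dim=-1))  # occupancy + semantics

print(occupancy.shape, fine.shape)   # torch.Size([1, 1, 8, 8, 8]) torch.Size([1, 5, 5])
```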
Now that we know the model better, let's take a look at another example. Here we have an articulated bus parked along the right side of the road, highlighted as L-shaped voxels. As we approach, the bus starts to move: the front of the bus turns blue first, indicating the model predicts the front of the bus has non-zero occupancy flow, and as the bus keeps moving, the entire bus turns blue. You can also see that the network predicts the precise curvature of the bus. This is a very complicated problem for a traditional object-detection network, because you have to decide whether to use one cuboid, or perhaps two, to fit the curvature; but for the occupancy network, since all we care about is the occupancy in the visible space, we're able to model the curvature precisely.
Besides the voxel grid, the occupancy network also produces a drivable surface. The drivable surface has both 3D geometry and semantics, which are very useful for control, especially on hilly and curvy roads. The surface and the voxel grid are not predicted independently; instead, the voxel grid actually aligns with the surface implicitly. Here we are at a hill crest, where you can see the 3D geometry of the surface being predicted nicely. The planner can use this information to decide that perhaps we need to slow down more for the crest, and as you can also see, the voxel grid aligns with the surface consistently.
Besides the voxels and the surface, we're also very excited about the recent breakthroughs in neural radiance fields, or NeRFs. We're looking into both incorporating some of the color features into occupancy-network training and using our network output as the input state for NeRFs. As a matter of fact, Ashok is very excited about this; it has been his personal weekend project for a while.
On these NeRFs: I think academia is building a lot of foundation models for language, using tons of large datasets, but I think for vision, NeRFs are going to provide the foundation models for computer vision, because they are grounded in geometry, and geometry gives us a nice way to supervise these networks. It frees us of the requirement to define an ontology, and the supervision is essentially free, because you just have to differentiably render these images. So I think this occupancy-network idea, where images come in and the network produces a consistent volumetric representation of the scene that can then be differentiably rendered into any image that was observed, is, I personally think, the future of computer vision. We're doing some initial work on it right now, but I think in the future, both at Tesla and in academia, we will see this combination with one-shot prediction of volumetric occupancy; that's my personal bet.
So here's an example of an early result of a 3D reconstruction from our fleet data. Instead of focusing on getting perfect RGB reprojection in image space, our primary goal here is to accurately represent the world's 3D space for driving, and we want to do this for all our fleet data across the world, in all weather and lighting conditions. Obviously this is a very challenging problem, and we're looking for you guys to help. Finally, the occupancy network is trained with a large auto-labeled dataset, without any humans in the loop. With that, I'll pass it to Tim to talk about what it takes to train this network. Thanks, Phil.
All right, hey everyone, let's talk about some training infrastructure. We've seen a couple of videos, four or five I think, but we care about and worry about a lot more clips than that. Looking at the occupancy network from Phil's videos: it takes 1.4 billion frames to train that network, what you just saw. If you have a hundred thousand GPUs, that would take one hour, but if you have one GPU it would take a hundred thousand hours, and that is not a humane time period to wait for your training job to run; we want to ship faster than that.
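The arithmetic behind that claim is straightforward, assuming perfect scaling:

```python
# Sanity check on the GPU-hours figure quoted above: 1.4 billion frames for one
# network, scaled between one GPU and a hundred thousand GPUs. The per-GPU
# throughput is only what the quoted numbers imply, not an independent fact.
frames = 1_400_000_000
gpu_hours_total = 100_000            # "one GPU would take a hundred thousand hours"

print(f"{gpu_hours_total / 100_000:.0f} hour on 100,000 GPUs (perfect scaling assumed)")
print(f"{gpu_hours_total / 1_000:.0f} hours on 1,000 GPUs (~{gpu_hours_total / 1_000 / 24:.0f} days)")
print(f"implied throughput: ~{frames / (gpu_hours_total * 3600):.1f} frames/s per GPU")
```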
That means you're going to need to go parallel; you need more compute for that, which means you're going to need a supercomputer. This is why we've built, in-house, three supercomputers comprising 14,000 GPUs, where we use 10,000 GPUs for training and around 4,000 GPUs for auto-labeling. All these videos are stored in 30 petabytes of a distributed, managed video cache. You shouldn't think of our datasets as fixed, the way you'd think of ImageNet or something with a million frames; you should think of them as a very fluid thing. We've got half a million of these videos flowing in and out of these clusters every single day, and we track 400,000 of these Python video instantiations every second. That's a lot of calls we need to capture in order to govern the retention policies of this distributed video cache. Underlying all of this is a huge amount of infrastructure, all of which we build and manage in-house.
in-house so you cannot just buy you know 40 000
gpus and then a 30 petabytes of Flash mvme and just put it together and let's go train uh it actually takes a lot of
work and I'm gonna go into a little bit of that what you actually typically want to do is you want to take your accelerator so
that it could be the GPU or Dojo which we'll talk about later and because that's the most expensive
component that's where you want to put your bottleneck and so that means that every single part of your system is
going to need to outperform this accelerator and so that is really complicated that
means that your storage is going to need to have the size and the bandwidth to deliver all the data down into the nodes
these nodes need to have the right amount of CPU and memory capabilities to feed into your machine learning
framework this machine learning framework then needs to hand it off to your GPU and then you can start training but then you
need to do so across hundreds or thousands of GPU in a reliable way in
logstap and in a way that's also fast so you're also going to need an interconnect extremely complicated we'll talk more
about dojo in a second so first I want to take you to some
optimizations that we've done on our cluster so we're getting in a lot of videos and
video is very much unlike let's say training on images or text which I think is very well established video is quite
literally a dimension more complicated um and so that's why we needed to go end
to end from the storage layer down to the accelerator and optimize every single piece of that because we train on the photon count
videos that come directly from our Fleet we train on those directly we do not post process those at all
the way it's just done is we seek exactly to the frames we select for our batch we load those in including the
frames that they depend on so these are your iframes or your keyframes we package those up move them into shared
memory move them into a double bar from the GPU and then use the hardware decoder that's only accelerated to
actually decode the video so we do that on the GPU natively and this is all in a very nice python pytorch extension
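(The talk describes a custom PyTorch extension that decodes video with the GPU's hardware decoder. That extension isn't public, but the seek-to-keyframe-and-decode-forward pattern it relies on can be sketched with the open-source PyAV library on the CPU; treat this only as an illustration of the access pattern, with made-up file names and timestamps.)

import av  # PyAV: a CPU stand-in for the GPU hardware-decode extension described above

def decode_selected_frames(path, target_pts_list):
    """Seek to the keyframe preceding each requested frame, then decode forward to it."""
    frames = []
    container = av.open(path)
    stream = container.streams.video[0]
    for target_pts in target_pts_list:
        container.seek(target_pts, stream=stream)   # lands on the preceding keyframe (I-frame)
        for frame in container.decode(stream):
            if frame.pts is not None and frame.pts >= target_pts:
                frames.append(frame.to_ndarray(format="rgb24"))
                break
    container.close()
    return frames

# batch = decode_selected_frames("clip.mp4", [0, 90_000])  # path and pts values are illustrative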
Doing so unlocked more than a 30% training speed increase for the occupancy networks and freed up basically a whole CPU to do other things. You cannot do training with just videos, of course; you need some kind of ground truth, and that is an interesting problem as well. The objective for storing your ground truth is to get to the ground truth you need in the minimal number of file system operations, and to load the minimal size of what you need, in order to optimize for aggregate cross-cluster throughput, because you should see a compute cluster as one big device with internal, fixed constraints and thresholds.
For this we rolled out a format that is native to us, called small. We use it for our ground truth, our feature cache, and any inference outputs, so there are a lot of tensors in there. As a cartoon: say this is the table you want to store; that's how it would look rolled out on disk. You take anything you'd want to index on, for example video timestamps, and put those all in the header, so that on your initial header read you know exactly where to go on disk. Then if you have any tensors, you try transposing the dimensions to put a different dimension last as the contiguous dimension, and you also try different types of compression; you check which one was most optimal and store that one. This is actually a huge step: if you do feature caching, storing output from the machine learning network, and you rotate the dimensions around a little, you can get up to a 20% increase in storage efficiency. Then when we store it, we also order the columns by size, so that all your small columns and small values are together; when you seek for a single value you're likely to overlap with a read of more values you'll use later, so that you don't need to do another file system operation.
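(A hypothetical miniature of such a ground-truth format, in Python: an index header up front so a reader can fetch any column in one seek and one read, with each tensor stored contiguously and compressed. This is not the actual Tesla file format, just a sketch of the layout ideas described above.)

import json, struct, zlib
import numpy as np

def write_table(path, columns):
    """Write {name: ndarray} with a self-describing header so a reader can
    fetch any single column with one seek plus one read."""
    blobs, index, offset = [], {}, 0
    for name, arr in columns.items():
        # Keep the dimension we stream over contiguous, then compress the bytes.
        blob = zlib.compress(np.ascontiguousarray(arr).tobytes())
        index[name] = {"offset": offset, "size": len(blob),
                       "dtype": str(arr.dtype), "shape": list(arr.shape)}
        blobs.append(blob)
        offset += len(blob)
    header = json.dumps(index).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))   # header length comes first
        f.write(header)
        for blob in blobs:
            f.write(blob)

def read_column(path, name):
    with open(path, "rb") as f:
        (hdr_len,) = struct.unpack("<I", f.read(4))
        index = json.loads(f.read(hdr_len))
        meta = index[name]
        f.seek(4 + hdr_len + meta["offset"])           # one seek...
        raw = zlib.decompress(f.read(meta["size"]))    # ...one read
    return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])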
I could go on and on; I just touched on two projects that we have internally, but this is part of a huge continuous effort to optimize the compute that we have in-house. Accumulating and aggregating all these optimizations, we now train our occupancy networks twice as fast, just because the pipeline is twice as efficient, and when we add in a bunch more compute and go parallel, we can now train them in hours instead of days. And with that I'd like to hand it off to the biggest user of compute, John.
Hi everybody, my name is John Emmons, and I lead the Autopilot vision team. I'm going to cover two topics with you today: the first is how we predict lanes, and the second is how we predict the future behavior of other agents on the road.

In the early days of Autopilot we modeled the lane detection problem as an image-space instance segmentation task. Our network was super simple, though; in fact it was only capable of predicting lanes of a few different kinds of geometries. Specifically, it would segment the ego lane, it could segment adjacent lanes, and it had some special casing for forks and merges. This simplistic modeling of the problem worked for highly structured roads like highways, but today we're trying to build a system that's capable of much more complex maneuvers. Specifically, we want to make left and right turns at intersections, where the road topology can be quite a bit more complex and diverse, and when we try to apply this simplistic modeling of the problem there, it just totally breaks down.

Taking a step back for a moment, what we're trying to do is predict the sparse set of lane instances and their connectivity, and what we want is a neural network that basically predicts this graph, where the nodes are the lane segments and the edges encode the connectivity between those lanes.

So we have our lane detection neural network, made up of three components. In the first component we have a set of convolutional layers, attention layers, and other neural network layers that encode the video streams from our eight cameras on the vehicle and produce a rich visual representation. We then enhance this visual representation with coarse, road-level map data, which we encode with a set of additional neural network layers that we call the lane guidance module. This map is not an HD map, but it provides a lot of useful hints about the topology of lanes inside intersections, the lane counts on various roads, and a set of other attributes that help us. The first two components produce a dense tensor that sort of encodes the world, but what we really want is to convert this dense tensor into a sparse set of lanes and their connectivities. We approach this like an image captioning task, where the input is the dense tensor and the output is text predicted in a special language that we developed at Tesla for encoding lanes and their connectivities. In this language of lanes, the words and tokens are the lane positions in 3D space, and the ordering of the tokens and predicted modifiers in the tokens encode the connectivity relationships between these lanes. By modeling the task as a language problem we can capitalize on recent autoregressive architectures and techniques from the language community for handling the multimodality of the problem. We're not just solving the computer vision problem at Autopilot; we're also applying the state of the art in language modeling and machine learning more generally. I'm now going to dive into a little bit more detail on this language component.
What I have depicted on the screen here is a satellite image which sort of represents the local area around the vehicle. The set of nodes and edges is what we refer to as the lane graph, and it's ultimately what we want to come out of this neural network. We start with a blank slate, and we're going to make our first prediction here at this green dot. The green dot's position is encoded as an index into a coarse grid which discretizes the 3D world. Now, we don't predict this index directly, because it would be too computationally expensive to do so; there are just too many grid points, and predicting a categorical distribution over them has implications at both training time and test time. So instead we discretize the world coarsely first: we predict a heatmap over the possible locations and then latch onto the most probable one. We then refine the prediction and get the precise point.

Now we know the position of this token, but we don't know its type. In this case it's the beginning of a new lane, so we call it a start token, and because it's a start token there are no additional attributes in our language. We then take the predictions from this first forward pass and encode them using a learned additional embedding, which produces a set of tensors that we combine together; this is the first word in our language of lanes, and we add it to the first position in our sentence. We then continue this process by predicting the next lane point in a similar fashion. Now, this lane point is not the beginning of a new lane; it's a continuation of the previous lane, so it gets a continuation token type. It's not enough just to know that this lane is connected to the previously predicted lane; we also want to encode its precise geometry, which we do by regressing a set of spline coefficients. We then take this lane point, encode it again, and add it as the next word in the sentence, and we continue predicting these continuation lanes until we get to the end of the prediction grid.

We then move on to a different lane segment; you can see that cyan dot there. Now, it's not topologically connected to that pink point; it's actually forking off of that green point there, so it gets a fork type, and fork tokens point back to the previous token from which the fork originates. You can see here the fork point predicted is actually index zero, so it's referencing back to tokens it has already predicted, like you would in language. We continue this process over and over again until we've enumerated all of the tokens in the lane graph, and then the network predicts the end-of-sentence token.
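(The decoding loop described above can be sketched in PyTorch. Everything here is an assumption for illustration: the head names, token attributes and shapes are invented to mirror the steps in the talk, not Tesla's actual model, and a random dummy model is included only so the loop runs end to end.)

import torch

START, CONT, FORK, END = 0, 1, 2, 3   # token types described in the talk

class DummyLaneModel:
    """Stand-in with random heads so the decoding loop below actually runs."""
    def __init__(self, grid=50 * 50, dim=32):
        self.grid, self.dim = grid, dim
    def heatmap_head(self, feats, sentence):        return torch.randn(self.grid)
    def refine_head(self, feats, sentence, cell):   return torch.rand(3)
    def type_head(self, feats, sentence, cell):
        return torch.randn(4) if len(sentence) < 10 else torch.tensor([0.0, 0.0, 0.0, 9.0])
    def spline_head(self, feats, sentence, cell):   return torch.randn(6)
    def fork_pointer_head(self, feats, sentence, cell):
        return torch.randn(max(len(sentence), 1))
    def embed_token(self, token):                   return torch.randn(self.dim)

@torch.no_grad()
def decode_lane_graph(model, scene_features, max_tokens=64):
    """Toy autoregressive decoder for a 'language of lanes'."""
    tokens, sentence = [], []
    for _ in range(max_tokens):
        # 1) coarse heatmap over a discretized grid; latch onto the most probable cell
        cell = int(model.heatmap_head(scene_features, sentence).argmax())
        # 2) refine to a precise 3D point inside that cell
        point = model.refine_head(scene_features, sentence, cell)
        # 3) token type: start / continuation / fork / end-of-sentence
        ttype = int(model.type_head(scene_features, sentence, cell).argmax())
        if ttype == END:
            break
        token = {"cell": cell, "point": point, "type": ttype}
        if ttype == CONT:    # geometry to the previous point as spline coefficients
            token["spline"] = model.spline_head(scene_features, sentence, cell)
        if ttype == FORK:    # index of the earlier token this lane forks from
            token["parent"] = int(model.fork_pointer_head(scene_features, sentence, cell).argmax())
        tokens.append(token)
        # 4) embed the prediction and append it as the next "word" in the sentence
        sentence.append(model.embed_token(token))
    return tokens

lane_graph = decode_lane_graph(DummyLaneModel(), scene_features=None)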
Yeah, I just want to note that the reason we do this is not because we want to build something complicated; it almost feels like a Turing-complete machine built with neural networks. It's that we tried simple approaches, for example just segmenting the lanes along the road, but the problem is when there's uncertainty. Say you cannot see the road clearly, and there could be two lanes or three lanes and you can't tell: a simple segmentation-based approach would just draw both of them in a kind of 2.5-lane situation, and the post-processing algorithm would hilariously fail when the predictions are like that.

Yeah, and the problems don't end there. You need to predict these connective lanes inside of intersections, which is just not possible with the approach Ashok is mentioning; with overlaps like this, the segmentation would just go haywire, and even if you try very hard to put them on separate layers, it's a really hard problem. Language offers a really nice framework for modeling this, getting a sample from the posterior, as opposed to trying to do all of it in post-processing.

But this doesn't stop at Autopilot, right, John? This can be used for Optimus too. I guess they wouldn't be called lanes, but you could imagine, in the stage here, that you might have paths that encode the possible places people could walk. Yeah, basically if you're in a factory or in a home setting, you can ask the robot, okay, please walk to the kitchen, or please route to some location in the factory, and then we predict a set of pathways that go through the aisles, take the robot, and say okay, this is how you get to the kitchen. It really gives us a nice framework to model these different paths, which simplifies the navigation problem for the downstream planner.

All right, so ultimately what we get from this lane detection network is a set of lanes and their connectivities, and it comes directly from the network; there's no additional step here for simplifying these dense predictions into sparse ones. This is just the direct, unfiltered output of the network.
Okay, so I talked a little bit about lanes; I'm going to briefly touch on how we model and predict the future paths and other semantics of objects, going quickly through two examples. In the video on the right here, we've got a car that's actually running a red light and turning in front of us. What we do to handle situations like this is predict a set of short time-horizon future trajectories for all objects; we can use these to anticipate the dangerous situation and apply whatever braking and steering action is required to avoid a collision. In the other video there are two vehicles in front of us. The one in the left lane is parked, apparently being loaded or unloaded; I don't know why the driver decided to park there, but the important thing is that our neural network predicted it was stopped, which is the red color there. The vehicle in the other lane, as you notice, is also stationary, but that one is obviously just waiting for the red light to turn green. So even though both objects are stationary and have zero velocity, it's the semantics that really matter, so that we don't get stuck behind that awkwardly parked car.

Predicting all of these agent attributes presents some practical problems when you're trying to build a real-time system. We need to maximize the frame rate of our object detection stack so that Autopilot can quickly react to the changing environment; every millisecond really matters here. To minimize inference latency, our neural network is split into two phases. In the first phase we identify locations in 3D space where agents exist; in the second phase we pull out tensors at those 3D locations, append additional data that's on the vehicle, and then do the rest of the processing. This sparsification step allows the neural network to focus compute on the areas that matter most, which gives us superior performance for a fraction of the latency cost.
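(A rough sketch of that two-phase, sparsified inference pattern in PyTorch: a cheap presence head proposes the top-k locations, and the expensive attribute head only runs on features gathered at those locations. Shapes and head modules are assumptions, not the production network.)

import torch

def two_phase_detect(bev_features, presence_head, attribute_head, k=20):
    """Phase 1: find likely agent locations; Phase 2: run the heavy head
    only on features gathered at those locations (the sparsification step)."""
    B, C, H, W = bev_features.shape
    presence = presence_head(bev_features)                 # (B, 1, H, W) logits
    flat = presence.flatten(2)                             # (B, 1, H*W)
    topk = flat.topk(k, dim=-1).indices.squeeze(1)         # (B, k) candidate cells

    feats = bev_features.flatten(2)                        # (B, C, H*W)
    idx = topk.unsqueeze(1).expand(-1, C, -1)              # (B, C, k)
    gathered = feats.gather(2, idx).permute(0, 2, 1)       # (B, k, C) per-candidate features

    return attribute_head(gathered)                        # kinematics/semantics per candidate

# Example wiring (shapes only; modules are placeholders):
# presence_head  = torch.nn.Conv2d(C, 1, kernel_size=1)
# attribute_head = torch.nn.Sequential(torch.nn.Linear(C, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))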
So putting it all together, the Autopilot vision stack predicts more than just the geometry and kinematics of the world; it also predicts a rich set of semantics, which enables safe and human-like driving. I'm now going to hand things off to Sri, who will tell us how we run all these cool neural networks on our FSD computer. Thank you.
[Applause]
Hi everyone, I'm Sri. Today I'm going to give a glimpse of what it takes to run these FSD networks in the car and how we optimize for inference latency; I'm going to focus just on the FSD lanes network that John just talked about.

When we started down this track, we wanted to know if we could run this lanes network natively on the TRIP engine, which is the in-house neural network accelerator we built into the FSD computer. When we built this hardware we kept it simple and made sure it could do one thing ridiculously fast: dense dot products. But this architecture is autoregressive and iterative, crunching through multiple attention blocks in the inner loop and producing sparse points directly at every step. So the challenge was: how can we do this sparse point prediction and sparse computation on a dense dot-product engine? Let's see how we did it on TRIP.

The network predicts a heatmap of the most probable spatial locations of the point. We do an argmax and a one-hot operation, which gives the one-hot encoding of the index of that spatial location. Now we need to select the embedding associated with this index from an embedding table that is learned during training. To do this on TRIP we built a lookup table in SRAM, and we engineered the dimensions of this embedding such that we could achieve all of this with just matrix multiplication.
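(The trick of turning a table lookup into dense matrix math is easy to show in a few lines of PyTorch; the sizes below are arbitrary.)

import torch

vocab, dim = 1024, 256
embedding_table = torch.randn(vocab, dim)   # learned at training time, resident in on-chip SRAM
heatmap_logits = torch.randn(vocab)         # per-location scores from the network

index = heatmap_logits.argmax()
one_hot = torch.nn.functional.one_hot(index, num_classes=vocab).to(embedding_table.dtype)
embedding = one_hot @ embedding_table       # (1 x vocab) @ (vocab x dim): a dense dot product

assert torch.allclose(embedding, embedding_table[index])   # same result as a table lookup

Because the one-hot row is all zeros except a single one, the matrix multiply reproduces the table row exactly, so the "lookup" becomes the one operation the accelerator is built for.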
Not just that: we also wanted to store this embedding in a token cache, so that we don't recompute it every iteration but instead reuse it for future point predictions. Again we pulled some tricks here, doing all of these operations on the dot-product engine alone. It's actually cool that our team found creative ways to map all these operations onto the TRIP engine, in ways that were not even imagined when the hardware was designed. But that's not the only thing we had to do to make this work: we implemented a whole lot of operations and features to make this model compilable, to improve int8 accuracy, and to optimize performance. All of this helped us run the 75-million-parameter model at just under 10 milliseconds of latency, consuming just 8 watts of power.
But this is not the only architecture running in the car; there are many other architectures, modules, and networks we need to run. To give a sense of scale, there are about a billion parameters across all the networks combined, producing around 1,000 neural network signals, so we need to optimize them jointly, such that we maximize compute utilization and throughput and minimize latency.

So we built a compiler just for neural networks that shares its structure with traditional compilers. It takes the massive graph of neural nets, with 150k nodes and 375k connections, partitions it into independent subgraphs, and compiles each of those subgraphs natively for the inference devices. Then we have a neural network linker, which shares its structure with a traditional linker, where we perform link-time optimization: we solve an offline optimization problem with compute, memory, and memory-bandwidth constraints, so that it comes up with an optimized schedule that gets executed in the car.
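(As a toy illustration of the "partition into independent subgraphs" step, here's a connected-components pass over a tiny op graph in Python. The real compiler obviously does far more, such as scheduling against compute and bandwidth constraints; this only shows the partitioning idea, with invented node names.)

from collections import defaultdict

def connected_subgraphs(nodes, edges):
    """Partition an op graph into independent subgraphs (connected components)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], []
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.append(cur)
            stack.extend(adj[cur] - seen)
        parts.append(comp)
    return parts

# Toy graph: two independent subgraphs that could be compiled and scheduled separately.
nodes = ["conv1", "bn1", "head_a", "conv2", "head_b"]
edges = [("conv1", "bn1"), ("bn1", "head_a"), ("conv2", "head_b")]
print(connected_subgraphs(nodes, edges))
# [['conv1', 'bn1', 'head_a'], ['conv2', 'head_b']]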
On the runtime side, we designed a hybrid scheduling system that does heterogeneous scheduling on one SoC and distributed scheduling across both SoCs, to run these networks in a model-parallel fashion. To get maximum compute utilization we need to optimize across all the layers of software, from tuning the network architecture and the compiler all the way to implementing a low-latency, high-bandwidth RDMA link across both SoCs, and in fact going even deeper, to understanding and optimizing the cache-coherent and non-coherent data paths of the accelerator in the SoC. It's a lot of optimization at every level to make sure we get the highest frame rate, because every millisecond counts here.

This is a visualization of the neural networks running in the car; this is our digital brain, essentially. As you can see, these operations are nothing but matrix multiplications and convolutions, to name a few of the real operations running in the car. To train a network with a billion parameters you need a lot of labeled data, so Jurgen is going to talk about how we achieve this with the auto-labeling pipeline.
Thank you, Sri.

Hi everyone, I'm Jurgen Zhang, and I lead geometric vision at Autopilot. So let's talk about auto-labeling. We have several kinds of auto-labeling frameworks to support various types of networks, but today I'd like to focus on the lanes network. To successfully train and generalize this network to everywhere, we think we need tens of millions of trips, from probably one million intersections or even more. So how do we do that? It is certainly achievable to source a sufficient number of trips, because, as Tim explained earlier, we already cache around 500,000 trips per day. However, converting all that data into a training form is a very challenging technical problem.

To solve this challenge we tried various ways of manual and auto labeling. From the first column to the second, and from the second to the third, each advance provided nearly a 100x improvement in throughput, but we still wanted an even better auto-labeling machine, one that provides good quality, diversity, and scalability. To meet all these requirements, despite the huge amount of engineering effort required, we've developed a new auto-labeling machine powered by multi-trip reconstruction. It can replace 5 million hours of manual labeling with just 12 hours on a cluster for labeling 10,000 trips.
So how did we solve it? There are three big steps. The first step is high-precision trajectory and structure recovery by multi-camera visual-inertial odometry. Here all the features, including the ground surface, are inferred from videos by neural networks, then tracked and reconstructed in the vector space. The typical drift rate of this trajectory in the car is about 1.3 centimeters per meter and 0.45 milliradians per meter, which is pretty decent considering its compact compute requirement. The recovered surfaces and road details are also used as strong guidance for the later manual verification step. This is enabled in every FSD vehicle, so we get pre-processed trajectories and structures along with the trip data.

The second step is multi-trip reconstruction, which is the big, core piece of this machine. The video shows how the previously shown trip is reconstructed and aligned with other trips, basically trips from different people, not the same vehicle. This is done in multiple intermediate steps, like coarse alignment, pairwise matching, joint optimization, and further surface refinement, and in the end a human analyst comes in and finalizes the label. Each of these steps is already fully parallelized on the cluster, so the entire process usually takes just a couple of hours.
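(The coarse-alignment step between two trips can be illustrated with the textbook rigid-alignment solution, the Kabsch/Procrustes method, on matched 3D points. This is only a stand-in sketch in NumPy, not the multi-trip optimizer described above, which also does pairwise matching, joint optimization, and surface refinement.)

import numpy as np

def rigid_align(src, dst):
    """Kabsch/Procrustes: least-squares rotation R and translation t
    mapping matched 3D points src -> dst (a stand-in for 'coarse alignment')."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy check: recover a known transform between two "trips" seeing the same structure.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([5.0, -2.0, 0.1])
R, t = rigid_align(pts, pts @ R_true.T + t_true)
assert np.allclose(R, R_true, atol=1e-5) and np.allclose(t, t_true, atol=1e-5)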
The last step is actually auto-labeling the new trips. Here we use the same multi-trip alignment engine, but only between the pre-built reconstruction and each new trip, so it's much, much simpler than fully reconstructing all the clips together. That's why it takes only 30 minutes per trip to auto-label, instead of several hours of manual labeling, and this is also the key to the scalability of this machine: it easily scales as long as we have available compute and trip data. About 50 trips were newly auto-labeled from this scene, and some of them are shown here, 53 from different vehicles. This is how we capture and transform the space-time slices of the world into network supervision.
Yeah, one thing I'd like to note is that Jurgen just talked about how we auto-label our lanes, but we have auto-labelers for almost every task that we do, including our planner, and many of these are fully automatic, with no humans involved. For example, for objects, their kinematics, their shapes, their futures, everything just comes from auto-labeling, and the same is true for occupancy too. We have really just built a machine around this.

Yeah, so if you can go back one slide, not one more: it says "parallelized on cluster", and that sounds pretty straightforward, but it really wasn't. Maybe it's fun to share how something like this comes about. A while ago we didn't have any auto-labeling at all, and then someone makes a script, it starts to work, it starts working better, until we reach a volume that's pretty high and we clearly need a real solution. So there were two other engineers on our team who thought this was an interesting problem. What we needed to do was build a whole graph of essentially Python functions that we need to run one after the other: first you pull the clip, then you do some cleaning, then some network inference, then another network inference, until you finally get the result. But you need to do this at large scale, so I told them we probably need to shoot for 100,000 clips per day, or 100,000 items; that seemed good. And the engineers said, well, with a bit of Postgres and a bit of elbow grease we can do it. Fast forward a bit, and we're now doing 20 million of these function executions every single day. Again, we pull in around half a million clips, and on those we run a ton of functions, each in a streaming fashion. That's the kind of back-end infra that's also needed, not just to run training but also auto-labeling.

Yeah, it really is like a factory that produces labels: production lines, yield, quality, inventory, all of the same concepts apply to this label factory that apply to the factory for our cars. That's right.
Okay, thanks. So, concluding this section, I'd like to share a few more examples that are challenging and interesting for the network, and probably even for humans. From the top, there are examples of missing lane lines, a foggy night, a roundabout, heavy occlusions by parked cars, and even a rainy night with raindrops on the camera lenses. These are challenging, but once their original scenes are fully reconstructed from other clips, all of them can be auto-labeled, so that our cars can drive even better through these challenging scenarios. So now let me pass the mic to David, to learn more about how Sim is creating a new world on top of these labels. Thank you.
Thank you. Again, my name is David, and I'm going to talk about simulation. Simulation plays a critical role in providing data that is difficult to source and/or hard to label. However, 3D scenes are notoriously slow to produce. Take, for example, the simulated scene playing behind me, a complex intersection from Market Street in San Francisco: it would take two weeks for artists to complete, and for us that is painfully slow. But I'm going to talk about using Jurgen's automated ground-truth labels, along with some brand new tooling, to procedurally generate this scene, and many like it, in just five minutes. That's an amazing thousand times faster than before.

So let's dive into how a scene like this is created. We start by piping the automated ground-truth labels into our simulated-world creator tooling inside the software Houdini. Starting with road boundary labels, we can generate a solid road mesh and re-topologize it with the lane graph labels; this helps inform important road details like crossroads, slope, and detailed material blending. Next we can use the line data, sweep geometry across its surface, and project it onto the road, creating lane paint decals. Next, using median edges, we can spawn island geometry and populate it with randomized foliage, which drastically changes the visibility of the scene. Now the outside world can be generated through a series of randomized heuristics: modular building generators create visual obstructions, randomly placed objects like hydrants can change the color of the curbs, and trees can drop leaves below them, obscuring lines or edges. Next we can bring in map data to inform the positions of things like traffic lights and stop signs; we can trace along their normals to collect important information like the number of lanes, and even get accurate street names on the signs themselves. Next, using the lane graph, we can determine lane connectivity and spawn directional road markings on the road, along with their accompanying road signs. And finally, with the lane graph itself, we can determine lane adjacency and other useful metrics to spawn randomized traffic permutations inside our simulator. Again, this is all automatic, no artists in the loop, and it happens within minutes. And now this sets us up to do some pretty cool things.
Since everything is based on data and heuristics, we can start to fuzz parameters to create visual variations of a single ground truth. It can be as subtle as object placement and random material swapping, or as drastic as entirely new biomes or environment types like urban, suburban, or rural. This allows us to create infinite targeted permutations for specific ground truths that we need more ground truth for, all within the click of a button.

We can even take this one step further by altering the ground truth itself. Say John wants his network to pay more attention to directional road markings, to better detect an upcoming captive left-turn lane. We can procedurally alter our lane graph inside the simulator to create entirely new flows through this intersection, helping focus the network's attention on the road markings to create more accurate predictions. This is a great example of how this tooling lets us create new data that could never be collected from the real world.

And the true power of this tool is in its architecture and how we can run all tasks in parallel to infinitely scale. You saw the tile creator tool in action, converting the ground-truth labels into their sim counterparts. Next we use our tile extractor tool to divide this data into geohash tiles, about 150 meters square in size. We then save that data out into separate geometry and instance files, which gives us a clean source of data that's easy to load and lets us stay rendering-engine agnostic for the future.
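(Geohash tiling is straightforward to sketch: a precision-7 geohash cell is roughly 150 meters on a side, which matches the tile size mentioned. The pygeohash library and the record fields below are assumed choices for illustration, not the actual tooling.)

from collections import defaultdict
import pygeohash as pgh   # assumed library choice; any geohash implementation would do

def bucket_labels_into_tiles(labels, precision=7):
    """Group ground-truth label records into ~150 m geohash tiles
    (precision-7 geohash cells are roughly 153 m x 153 m)."""
    tiles = defaultdict(list)
    for rec in labels:
        tile_id = pgh.encode(rec["lat"], rec["lon"], precision=precision)
        tiles[tile_id].append(rec)
    return tiles

# labels = [{"lat": 37.7936, "lon": -122.3958, "kind": "lane_graph"}, ...]
# tiles = bucket_labels_into_tiles(labels)   # -> {"9q8yy...": [...], ...}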
Then, using a tile loader tool, we can summon any number of those cached tiles using a geohash ID. Currently we're loading about five-by-five or three-by-three tile sets, usually centered around fleet hotspots or interesting lane graph locations. The tile loader also converts these tile sets into assets for consumption by the Unreal Engine and gives you a finished product like what you saw on the first slide.

And this really sets us up for size and scale. As you can see on the map behind us, we can easily generate most of San Francisco's city streets, and this didn't take years or even months of work, but rather two weeks by one person. We can continue to manage and grow all this data using our PDG network inside the tooling; this allows us to throw compute at it and regenerate all these tile sets overnight, which ensures all environments are of consistent quality and feature set. That's super important for training, since new ontologies and signals are constantly released.

And now, to come full circle: because we generated all these tile sets from ground-truth data that contains all the weird intricacies of the real world, we can combine that with procedural visual and traffic variety to create limitless targeted data for the network to learn from. That concludes the Sim section; I'll pass it to Kate to talk about how we use all this data to improve Autopilot. Thank you.
Thanks, David. Hi everyone, my name is Kate Park, and I'm here to talk about the data engine, which is the process by which we improve our neural networks via data. We're going to show you how we deterministically solve interventions via data, and walk you through the life of this particular clip. In this scenario, Autopilot is approaching a turn and incorrectly predicts that the crossing vehicle is stopped for traffic, and thus a vehicle we would slow down for. In reality there's nobody in the car; it's just awkwardly parked. We've built tooling to identify the misprediction, correct the label, and categorize this clip into an evaluation set. This particular clip happens to be one of 126 that we've diagnosed as challenging parked cars at turns, and because of this infra we can curate this evaluation set without any engineering resources custom to this particular challenge case.

Actually solving that challenge case requires mining thousands of examples like it, which is something Tesla can trivially do. We simply use our data-sourcing infra to request data and use the tooling shown previously to correct the labels. By surgically targeting the mispredictions of the current model, we only add the most valuable examples to our training set. We surgically fixed 13,900 clips, and because those were examples where the current model struggles, we don't even need to change the model architecture: a simple weight update with this new, valuable data is enough to solve the challenge case. So you see we no longer predict that crossing vehicle as stopped, shown in orange, but as parked, shown in red.

In academia we often see that people keep data constant, but at Tesla it's very much the opposite. We see time and time again that data is one of the best, if not the most deterministic, levers for solving these interventions. We just showed you the data engine loop for one challenge case, namely parked cars at turns, but there are many challenge cases, even for the single signal of vehicle movement. We apply this data engine loop to every challenge case we've diagnosed, whether it's buses, curvy roads, stopped vehicles, or parking lots, and we don't just add data once; we do this again and again to perfect the semantic. In fact, this year we updated our vehicle movement signal five times, and with every weight update trained on the new data, we push our vehicle movement accuracy up and up.
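(The loop Kate describes can be summarized in a few lines of toy Python: mine the clips the current model gets wrong, split them into a curated evaluation set and surgically targeted training data, then do a weight update and re-measure. The "clips" and "labels" below are random stand-ins, not real Tesla tooling or data.)

import random
random.seed(0)

# A toy stand-in: a "clip" is just a dict with the model's prediction and the
# corrected label from the labeling tooling. Everything here is illustrative.
fleet_clips = [{"id": i,
                "model_pred": random.choice(["parked", "stopped_for_traffic"]),
                "corrected":  random.choice(["parked", "stopped_for_traffic"])}
               for i in range(10_000)]

def mine_challenge_case(clips):
    """Surgically target the clips the current model gets wrong."""
    return [c for c in clips if c["model_pred"] != c["corrected"]]

mispredictions = mine_challenge_case(fleet_clips)
eval_set  = mispredictions[:126]     # curated evaluation set for this challenge case
train_set = mispredictions[126:]     # corrected labels added to training

print(len(eval_set), len(train_set))
# The next step in the real loop is just a weight update on train_set, no architecture
# change, after which the eval set is re-scored and the loop repeats.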
This data engine framework applies to all our signals, whether they're 3D or multi-cam video, whether the data is human-labeled, auto-labeled, or simulated, and whether it's an offline model or an online model. Tesla is able to do this at scale because of the fleet advantage, the infra that our engineering team has built, and the labeling resources that feed our networks. To train on all this data we need a massive amount of compute, so I'll hand it off to Pete and Ganesh to talk about the Dojo supercomputing platform. Thank you. [Applause]
Thank you. Thank you, Kate.

Thanks everybody, thanks for hanging in there, we're almost there. My name is Pete Bannon; I run the custom silicon and low-voltage teams at Tesla. And my name is Ganesh Venkataramanan; I run the Dojo program.

[Applause]

Thank you. I'm frequently asked why a car company is building a supercomputer for training, and this question fundamentally misunderstands the nature of Tesla. At its heart, Tesla is a hardcore technology company. All across the company, people are working hard in science and engineering to advance the fundamental understanding and methods that we have available to build cars, energy solutions, robots, and anything else we can do to improve the human condition around the world. It's a super exciting thing to be a part of, and it's a privilege to run a very small piece of it in the semiconductor group.

Tonight we're going to talk a little bit about Dojo and give you an update on what we've been able to do over the last year, but before we do that, I wanted to give a little background on the initial design that we started a few years ago. When we got started, the goal was to provide a substantial improvement in training latency for our Autopilot team. Some of the largest neural networks they train today run for over a month, which inhibits their ability to rapidly explore alternatives and evaluate them. So a 30x speedup would be really nice, if we could provide it in a cost-competitive and energy-competitive way.

To do that, we wanted to build a chip with a lot of arithmetic units that we could utilize at very high efficiency. We spent a lot of time studying whether we could do that using DRAM and various packaging ideas, all of which failed, and in the end, even though it felt like an unnatural act, we decided to reject DRAM as the primary storage medium for this system and instead focus on SRAM embedded in the chip. SRAM provides, unfortunately, a modest amount of capacity, but extremely high bandwidth and very low latency, and that enables us to achieve high utilization of the arithmetic units.

That particular choice led to a whole bunch of other choices. For example, if you want virtual memory you need page tables; they take up a lot of space; we didn't have space, so no virtual memory. We also don't have interrupts: the accelerator is a bare-bones, raw piece of hardware that's presented to a compiler, and the compiler is responsible for scheduling everything that happens, in a deterministic way, so there's no need, or even desire, for interrupts in the system. We also chose to pursue model parallelism as a training methodology, which is not the typical situation; most machines today use data parallelism, which consumes additional memory capacity that we obviously don't have. All of those choices led us to build a machine that is pretty radically different from what's available today.

We also had a whole bunch of other goals, and one of the most important was "no limits": we wanted to build a compute fabric that would scale in an unbounded way, for the most part (obviously there are physical limits here and there), so that if your model was too big for the computer, you just had to go buy a bigger computer. That's what we were looking for. The way machines are packaged today, there's a pretty fixed ratio of, for example, GPUs, CPUs, DRAM capacity, and network capacity, and we really wanted to disaggregate all of that so that, as models evolved, we could vary the ratios of those elements and make the system more flexible to meet the needs of the Autopilot team.

Yeah, and it's so true: that no-limits philosophy was our guiding star
all the way. All of our choices were centered around it, to the point that we didn't want traditional data center infrastructure to limit our capacity to execute these programs at speed. That's why we vertically integrated our entire data center. By doing a vertical integration of the data center, we could extract new levels of efficiency: we could optimize power delivery, cooling, and system management across the whole data center stack, rather than doing it box by box and then integrating those boxes into data centers. And to do this we also wanted to integrate early, to figure out the limits of scale for our software workloads, so we integrated the Dojo environment into our Autopilot software very early, and we learned a lot of lessons. Today, Bill Chang will go over our hardware update, as well as some of the challenges that we faced along the way, and Rajiv Kurian will give you a glimpse of our compiler technology, as well as go over some of our cool results.

There you go.
Thanks Pete, thanks Ganesh. I'll start tonight with a high-level vision of our system, which will help set the stage for the challenges and problems we're solving, and then for how software leverages this for performance.

Our vision for Dojo is to build a single unified accelerator, a very large one. Software would see a seamless compute plane with globally addressable, very fast memory, all connected together with uniform high bandwidth and low latency. To realize this, we need to use density to achieve performance, and we leverage technology to get that density in order to break levels of hierarchy, all the way from the chip to the scaled-out system. Silicon technology has done this for decades: chips have followed Moore's law for density and integration to get performance scaling.

A key step in realizing that vision was our training tile: not only can we integrate 25 dies at extremely high bandwidth, but we can scale that to any number of additional tiles by just connecting them together. Last year we showcased our first functional training tile, and at that time we already had workloads running on it. Since then, the team has been working hard and diligently to deploy this at scale. We've made amazing progress and hit a lot of milestones along the way, and of course we've had a lot of unexpected challenges, but this is where our fail-fast philosophy has allowed us to push our boundaries.
Pushing density for performance presents all new challenges. One area is power delivery: here we need to deliver the power to our compute die, which directly impacts our top-line compute performance, but we need to do it at unprecedented density. We need to be able to match our die pitch with a power density of almost one amp per square millimeter, and because of the extreme integration, this needs to be a multi-tiered vertical power solution. And because there's a complex, heterogeneous material stack-up, we have to carefully manage the material transitions, especially CTE.

Why does the coefficient of thermal expansion matter here? CTE is a fundamental material property, and if it's not carefully managed, that stack-up would literally rip itself apart. We started this effort by working with vendors to develop this power solution, but we realized we actually had to develop it in-house. To balance schedule and risk, we built quick iterations to support both our system bring-up and software development, and also to find the optimal design and stack-up that would meet our final production goals. In the end we were able to reduce CTE by over 50 percent and improve our performance 3x over the initial version. Needless to say, finding this optimal material stack-up while maximizing performance at density is extremely difficult.
We did have unexpected challenges along the way. Here's an example where pushing the boundaries of integration led to component failures. This started when we scaled up to larger and longer workloads: intermittently, a single site on a tile would fail. These started out as recoverable failures, but as we pushed to much higher power, they became permanent failures.

To understand this failure, you have to understand why and how we build our power modules. Solving density at every level is the cornerstone of actually achieving our system performance. Because our X-Y plane is used for high-bandwidth communication, everything else must be stacked vertically; this means all components other than our die must be integrated into the power modules, and that includes our clocks, our power supplies, and our system controllers.

In this case, the failures were due to losing clock output from our oscillators, and after an extensive debug we found that the root cause was vibration of the module caused by piezoelectric effects in nearby capacitors. Now, singing caps are not a new phenomenon, and in fact they're very common in power design, but normally clock chips are placed in a very quiet area of the board and are rarely affected by power circuits. Because we needed to achieve this level of integration, these oscillators had to be placed in very close proximity, and due to our switching frequency and the vibration resonance it created, we got out-of-plane vibration on our MEMS oscillator that caused it to crack.

The solution to this problem is a multi-pronged approach: we can reduce the vibration by using soft-terminal caps, we can update our MEMS part with a lower Q factor in the out-of-plane direction, and we can also update our switching frequency to push the resonance further away from these sensitive bands.

In addition to density at the system level, we've been making a lot of progress at the infrastructure level. We knew we had to re-examine every aspect of the data center infrastructure in order to support our unprecedented power and cooling density.
We brought in a fully custom-designed CDU to support Dojo's dense cooling requirements, and the amazing part is that we're able to do it at a fraction of the cost of buying something off the shelf and modifying it. And since our Dojo cabinet integrates enough power and cooling to match an entire row of standard IT racks, we needed to carefully design the cabinet and the infrastructure together; we've already gone through several iterations of this cabinet to optimize it. Earlier this year we started load-testing our power and cooling infrastructure, and we were able to push it over two megawatts before we tripped our substation and got a call from the city.

Last year we introduced only a couple of components of our system, the custom D1 die and the training tile, but we teased the ExaPOD as our end goal. We'll walk through the remaining parts of the system that are required to build out this ExaPOD.

The system tray is a key part of realizing our vision of a single accelerator. It enables us to seamlessly connect tiles together, not only within a cabinet but between cabinets: we can connect these tiles at very tight spacing across the entire accelerator, and that is how we achieve our uniform communication. It is a laminated bus bar that allows us to integrate very high power with mechanical and thermal support in an extremely dense package: it's 75 millimeters in height and supports six tiles at 135 kilograms, the equivalent of three to four fully loaded high-performance racks.

Next, we need to feed data to the training tiles, and this is where we've developed the Dojo Interface Processor. It provides our system with high-bandwidth DRAM to stage our training data, and it provides full memory bandwidth to our training tiles using TTP, our custom protocol that we can use to communicate across the entire accelerator. It also has high-speed Ethernet, which lets us extend this custom protocol over standard Ethernet, with native hardware support and little to no software overhead. Lastly, we can connect to it through a standard Gen 4 PCIe interface. We pair 20 of these cards per tray, which gives us 640 gigabytes of high-bandwidth DRAM and provides the disaggregated memory layer for our training tiles. These cards are a high-bandwidth ingest path, both through PCIe and Ethernet, and they also provide a high-radix Z-connectivity path that allows shortcuts across our large Dojo accelerator.
We integrate the hosts directly underneath the system tray. These hosts provide our ingest processing and connect to our interface processors through PCIe; they can provide hardware video decoder support for video-based training, and our user applications land on these hosts so we can provide a standard x86 Linux environment. We can put two of these assemblies into one cabinet and pair them with redundant power supplies that do direct conversion of three-phase 480-volt AC power to 52-volt DC power.

By focusing on density at every level, we can realize the vision of a single accelerator: starting with the uniform nodes on our custom D1 die, we connect them together in our fully integrated training tile, and then finally connect them seamlessly across cabinet boundaries to form our Dojo accelerator. All together, we can house two full accelerators in our ExaPOD, for a combined one exaflop of ML compute. Altogether, this amount of technology and integration has only been done a couple of times in the history of compute. Next, we'll see how software can leverage this to accelerate its performance.
[Applause]
Thanks, Bill. My name is Rajiv, and I'm going to talk some numbers. Our software stack begins with a PyTorch extension, which speaks to our commitment to one standard: PyTorch models out of the box. We're going to talk more about our JIT compiler and the ingest pipeline that feeds the hardware with data. Abstractly, performance is TOPS times utilization times accelerator occupancy. We've seen how the hardware provides the peak performance; it's the job of the compiler to extract utilization from the hardware while code is running on it, and it's the job of the ingest pipeline to make sure that data can be fed at a throughput high enough that the hardware never starves.
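(That performance identity is worth writing down; the numbers here are purely illustrative assumptions, not measured Dojo figures.)

# Performance = peak TOPS x utilization x accelerator occupancy.
peak_tops   = 362      # assumed per-die peak, for the sake of the arithmetic only
utilization = 0.50     # fraction of peak the compiler extracts while kernels run
occupancy   = 0.97     # fraction of time the accelerator actually has data to chew on

effective_tops = peak_tops * utilization * occupancy
print(f"effective throughput: {effective_tops:.0f} TOPS")   # ~176 TOPS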
So let's talk about why communication-bound models are difficult to scale, but before that, let's look at why ResNet-50-like models are easier to scale. You start off with a single accelerator and run the forward and backward passes, followed by the optimizer. To scale this up, you run multiple copies on multiple accelerators, and while the gradients produced by the backward pass do need to be reduced, which introduces some communication, this can be pipelined with the backward pass. This setup scales fairly well, almost linearly.

For models with much larger activations, we run into a problem as soon as we want to run the forward pass: the batch size that fits in a single accelerator is often smaller than the batch-norm surface. To get around this, researchers typically run the setup on multiple accelerators in sync-batch-norm mode, which introduces latency-bound communication into the critical path of the forward pass, and now we already have a communication bottleneck. While there are ways to get around this, they usually involve tedious manual work best suited for a compiler, and ultimately there's no skirting around the fact that if your state does not fit in a single accelerator, you can be communication-bound. Even with significant effort from our ML engineers, we see that such models don't scale linearly.
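(For reference, the plain data-parallel recipe described at the top of this passage, one replica per accelerator with gradient all-reduce overlapped with the backward pass, is what PyTorch's DistributedDataParallel implements; here's a minimal sketch, assuming the process group and device assignment are set up by the launcher and using a placeholder model.)

from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def make_replica(local_rank: int) -> nn.Module:
    model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(), nn.Linear(64, 10)).to(local_rank)
    return DDP(model, device_ids=[local_rank])   # hooks gradient buckets for all-reduce

def train_step(model, batch, targets, optimizer):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(batch), targets)
    loss.backward()      # gradient all-reduce overlaps with this call, bucket by bucket
    optimizer.step()     # every replica applies identical averaged gradients
    return loss.item()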
The Dojo system was built to make such models work at high utilization. The high-density integration was built to accelerate not only the compute-bound portions of a model, but also the latency-bound portions, like a batch norm, and the bandwidth-bound portions, like a gradient all-reduce or a parameter all-gather. A slice of the Dojo mesh can be carved out to run any model; the only thing users need to do is make the slice large enough to fit the batch-norm surface for their particular model. After that, the partition presents itself as one large accelerator, freeing users from having to worry about the internal details of execution, and it's the job of the compiler to maintain this abstraction.

Fine-grained synchronization primitives and uniform low latency make it easy to accelerate all forms of parallelism across integration boundaries. Tensors are usually stored sharded in SRAM and replicated just in time for a layer's execution, and we depend on the high Dojo bandwidth to hide this replication time. Tensor replication and other data transfers are overlapped with compute, and the compiler can also recompute layers when it's profitable to do so.

We expect most models to work out of the box. As an example, we took the recently released Stable Diffusion model and got it running on Dojo in minutes; out of the box, the compiler was able to map it in a model-parallel manner onto 25 Dojo dies. Here are some pictures of a Cybertruck on Mars generated by Stable Diffusion running on Dojo.

[Applause]

Looks like it still has some way to go before matching the Tesla Design Studio team.
So we've talked about how communication bottlenecks can hamper scalability. Perhaps an acid test of a compiler and the underlying hardware is executing a cross-die batch-norm layer; as mentioned before, this can be a serial bottleneck. The communication phase of a batch norm begins with nodes computing the local mean and standard deviation, then coordinating to reduce those values, then broadcasting them back, after which the nodes resume their work in parallel.

So what would an ideal batch norm look like on 25 Dojo dies? Let's say the previous layer's activations are already split across dies. We would expect the 350 nodes on each die to coordinate and produce die-local mean and standard deviation values; ideally these would get further reduced, with the final value ending up somewhere towards the middle of the tile, and we would then hope to see a broadcast of this value radiating from the center.

Let's see how the compiler actually executes a real batch-norm operation across 25 dies. The communication trees were extracted from the compiler, and the timing is from real hardware. We're about to see 8,750 nodes on 25 dies coordinating to reduce and then broadcast the batch-norm mean and standard deviation values: die-local reduction, followed by global reduction towards the middle of the tile, then the reduced value broadcast radiating from the middle, accelerated by the hardware's broadcast facility. This operation takes only five microseconds on 25 Dojo dies; the same operation takes 150 microseconds on 24 GPUs. That is an orders-of-magnitude improvement over GPUs.
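(The communication pattern being timed here, local statistics, one all-reduce, then local normalization, is what a synchronized batch norm does; here's a generic sketch against torch.distributed, assuming an already-initialized process group. It illustrates the primitive, not Dojo's hardware-accelerated broadcast.)

import torch
import torch.distributed as dist

def sync_batch_norm_stats(x, eps=1e-5):
    """Cross-accelerator batch-norm statistics for x of shape (N, C, H, W):
    each worker computes local sums, a single all-reduce combines them, and
    every worker then normalizes locally."""
    n_local = torch.tensor([x.numel() / x.shape[1]], device=x.device)   # samples per channel
    local_sum = x.sum(dim=(0, 2, 3))                                    # per-channel sum
    local_sqsum = (x * x).sum(dim=(0, 2, 3))

    packed = torch.cat([n_local, local_sum, local_sqsum])
    dist.all_reduce(packed, op=dist.ReduceOp.SUM)       # the reduce-and-broadcast step

    C = x.shape[1]
    n = packed[0]
    mean = packed[1:1 + C] / n
    var = packed[1 + C:] / n - mean * mean
    return (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + eps)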
While we talked about an all-reduce in the context of a batch norm, it's important to reiterate that the same advantages apply to all other communication primitives, and those primitives are essential for large-scale training.

So how about full-model performance? While we think ResNet-50 is not a good representation of real-world Tesla workloads, it is a standard benchmark, so let's start there. We are already able to match the A100 die for die; however, perhaps a hint of Dojo's capabilities is that we're able to hit this number with just a batch of 8 per die. But Dojo was really built to tackle larger, more complex models, so when we set out to tackle real-world workloads, we looked at the usage patterns of our current GPU cluster, and two models stood out: the auto-labeling networks, a class of offline models that are used to generate ground truth, and the occupancy networks that you heard about. The auto-labeling networks are large models with high arithmetic intensity, while the occupancy networks can be ingest-bound. We chose these models because together they account for a large chunk of our current GPU cluster usage, and they challenge the system in different ways.

So how do we do on these two networks? The results we're about to see were measured on multi-die systems for both the GPU and Dojo, but normalized to per-die numbers. On our auto-labeling network, we're already able to surpass the performance of an A100 with our current hardware running on our older-generation VRMs; on our production hardware with our newer VRMs, that translates to doubling the throughput of an A100. And our models show that with some key compiler optimizations we could get to more than 3x the performance of an A100. We see even bigger leaps on the occupancy network: almost 3x with our production hardware, with room for more.

[Applause]

At this level of compiler performance, we could replace the ML compute of one, two, three, four, five, six GPU boxes with just a single Dojo tile, and this Dojo tile costs less than one of those GPU boxes. What it really means is that networks that took more than a month to train now take less than a week.
Alas, when we measured things, it did not turn out so well: at the PyTorch level we did not see our expected performance out of the gate, and this timeline chart shows our problem. The teeny tiny little green bars are the compiled code running on the accelerator; the row is mostly white space where the hardware is just waiting for data. With our dense ML compute, Dojo hosts effectively have 10x more ML compute than a GPU host, and the data loader running on this one host simply couldn't keep up with all that ML hardware.

To solve our data loader scalability issues, we knew we had to get past the limit of this single host. The Tesla Transport Protocol moves data seamlessly across hosts, tiles, and ingest processors, so we extended the Tesla Transport Protocol to work over Ethernet. We then built the Dojo network interface card, the DNIC, to leverage TTP over Ethernet; this allows any host with a DNIC card to DMA to and from other TTP endpoints. So we started with the Dojo mesh, then we added a tier of data-loading hosts equipped with the DNIC card, and we connected these hosts to the mesh via an Ethernet switch. Now every host in this data-loading tier can reach all TTP endpoints in the Dojo mesh via hardware-accelerated DMA.

After these optimizations went in, our occupancy went from 4 percent to 97 percent. The data-loading gaps have shrunk drastically and the ML hardware is kept busy; we actually expect this number to go to 100 percent pretty soon. After these changes went in, we saw the full expected speedup from the PyTorch layer, and we were back in business.
So, we started with a hardware design that breaks through traditional integration boundaries in service of our vision of a single giant accelerator, and we've seen how the compiler and software layers build on top of that hardware. After proving our performance on these complex real-world networks, we knew what our first large-scale deployment would target: our high-arithmetic-intensity auto-labeling networks. Today that occupies 4,000 GPUs across 72 GPU racks; with our dense compute and high performance, we expect to provide the same throughput with just four Dojo cabinets. [Applause]
These four Dojo cabinets will be part of our first ExaPOD, which we plan to build by Q1 2023, and this alone will more than double Tesla's auto-labeling capacity. [Applause] The first ExaPOD is part of a total of seven ExaPODs that we plan to build in Palo Alto, right here across the wall. [Applause] And we have a display cabinet from one of these ExaPODs for everyone to look at: six tiles densely packed on a tray, 54 petaflops of compute, 640 gigabytes of high-bandwidth memory, with power and host to feed it.
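For scale, the quoted numbers work out as follows; the per-tile division and the consolidation ratio are ours, derived from the figures above, not stated by Tesla:

```python
# Back-of-the-envelope arithmetic from the quoted specs.
tiles_per_tray = 6
tray_petaflops = 54          # compute quoted for the display tray

print(f"~{tray_petaflops / tiles_per_tray:.0f} PFLOPS per tile")        # ~9 PFLOPS

# The claimed consolidation: 4,000 GPUs across 72 racks vs. four Dojo cabinets.
gpus_today, gpu_racks, dojo_cabinets = 4000, 72, 4
print(f"~{gpus_today / dojo_cabinets:.0f} GPUs of work per Dojo cabinet")
print(f"~{gpu_racks / dojo_cabinets:.0f} GPU racks of work per Dojo cabinet")
```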
We're building out new versions of all our cluster components and constantly improving our software to hit new limits of scale, and we believe we can get another 10x improvement with our next-generation hardware. To realize these ambitious goals we need the best software and hardware engineers, so please come talk to us or visit tesla.com/AI. Thank you. [Applause]
All right, all right. So, hopefully that was enough detail, and now we can move to questions. I think the team should come back out on stage. We really wanted to show the depth and breadth of Tesla in artificial intelligence, compute hardware, robotics, and actuators, and try to shift the perception of the company. A lot of people think we're just a car company, or that we make cool cars, whatever, but most people have no idea that Tesla is arguably the leader in real-world AI, hardware and software, and that we're building arguably some of the most radical computer architecture since the Cray supercomputer. If you're interested in developing some of the most advanced technology in the world, technology that's going to affect the world in a positive way, Tesla is the place to be. So yeah, let's fire away with some questions. I think there's a mic at the front and a mic at the back.
Thank you very much. I was very impressed by Optimus, but I wonder: why did you choose a tendon-driven approach for the hand? Tendons are not very durable. And why spring-loaded?

Well, this is pretty cool. That's a great question. When it comes to any type of actuation scheme, there are trade-offs between a tendon-driven system and some type of linkage-based system. (Just keep the mic close to your mouth, a little bit closer. Cool.) The main reason we went for a tendon-based system is that, first, we actually investigated some synthetic tendons, but we found that metallic Bowden cables are a lot stronger. One of the advantages of these cables is that they're very good for part reduction: we do want to make a lot of these hands, so having a bunch of parts and a bunch of small linkages ends up being a problem when you're making a lot of something. One of the big reasons tendons are better than linkages, in a sense, is that they can be anti-backlash; anti-backlash essentially means you don't have any gaps or stuttery motion in your fingers. As for spring-loading, mainly what it allows us to do is have passive opening: instead of needing two actuators, one to drive the fingers closed and one to open them, we have the tendon drive them closed and the springs passively extend them. And this is something you see in our own hands as well: we have the ability to actively flex, and we also have the ability to extend.
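A toy model of that single-actuator, spring-return arrangement (our illustration, not Tesla's controller): the tendon can only pull, so flexion torque comes from cable tension and extension comes from the return spring. All parameters below are made up:

```python
import math

# Toy single-joint finger: a tendon pulls to flex, a torsion spring returns it.
PULLEY_RADIUS_M = 0.008   # tendon moment arm at the joint
SPRING_K = 0.15           # torsion spring stiffness, N*m/rad
REST_ANGLE = 0.0          # spring equilibrium = finger extended

def net_torque(tendon_tension_n: float, angle_rad: float) -> float:
    """Net joint torque: tendon flexes (+), spring extends (-)."""
    flexion = max(tendon_tension_n, 0.0) * PULLEY_RADIUS_M  # a cable cannot push
    extension = SPRING_K * (angle_rad - REST_ANGLE)
    return flexion - extension

# Tension needed to hold the finger at 90 degrees of flexion:
hold_angle = math.pi / 2
hold_tension = SPRING_K * (hold_angle - REST_ANGLE) / PULLEY_RADIUS_M
print(f"tension to hold 90 deg: {hold_tension:.1f} N")
print(f"net torque with zero tension at 90 deg: {net_torque(0.0, hold_angle):.3f} N*m")
```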
I mean, our goal with Optimus is to have a robot that is maximally useful as quickly as possible. There are a lot of ways to solve the various problems of a humanoid robot, and we're probably not barking up the right tree on all the technical solutions. I should say that we're open to evolving the technical solutions you see here over time; they're not locked in stone. But we do have to pick something, and we want to pick something that will allow us to produce the robot as quickly as possible and have it, like I said, be useful as quickly as possible. We're trying to follow the goal of the fastest path to a useful robot that can be made at volume. We're going to test the robot internally at Tesla, in our factory, and just see how useful it is, because you have to close the loop on reality to confirm that the robot is in fact useful. So we're just going to use it to build things. We're confident we can do that with the hand we have currently designed, but I'm sure there will be a hand version two, version three, and we may change the architecture quite significantly over time.
Hi, the Optimus robot is really impressive; you did a great job. Bipedal robots are really difficult. But what I noticed might be missing from your plan is acknowledging the utility of the human spirit. I'm wondering if Optimus will ever get a personality and be able to laugh at our jokes while it folds our clothes.

Yeah, absolutely. I think we want to have really fun versions of Optimus, so that Optimus can be utilitarian and do tasks, but can also be kind of like a friend and a buddy and hang out with you. I'm sure people will think of all sorts of creative uses for this robot. Once you have the core intelligence and actuators figured out, then you can put all sorts of costumes, I guess, on the robot; you can skin the robot in many different ways, and I'm sure people will find very interesting ways to make their own versions of Optimus.
Thanks for the great presentation. I wanted to know if there's an equivalent to interventions in Optimus. It seems like labeling the moments where humans disagree with what's going on is important, and in a humanoid robot that might also be a desirable source of information.

Yes. I think we'll have ways to remote-operate the robot and intervene when it does something bad, especially while we're training the robot and bringing it up. And hopefully we design it in a way that we can stop the robot if it's about to hit something; we can just hold it and it will stop, it won't crush your hand or anything, and those are all intervention data. We can also learn a lot from our simulation systems, where we can check for collisions and supervise that those are bad actions.

Yeah, I mean, with Optimus we want it over time to be kind of the android you'd see in sci-fi movies like Star Trek: The Next Generation, like Data, but obviously we could program the robot to be less robot-like and more friendly, and it can learn to emulate humans and feel very natural. So as AI in general improves, we can add that to the robot, and it should obviously be able to follow simple instructions, or even intuit what it is that you want: you could give it a high-level instruction, and it can break that down into a series of actions and take those actions.
Hi. It's exciting to think that with Optimus you could achieve orders of magnitude of improvement in economic output. When Tesla started, the mission was to accelerate the advent of sustainable energy and sustainable transport. With Optimus, do you still see that as Tesla's mission statement, or is it going to be updated, say, with a mission to accelerate the advent of, I don't know, infinite abundance, or a limitless economy?

Yeah. Strictly speaking, Optimus is not directly in line with accelerating sustainable energy. To the degree that it is more efficient at getting things done than a person, it does, I guess, help with sustainable energy, but I think the mission effectively does somewhat broaden with the advent of Optimus to, I don't know, making the future awesome. You look at Optimus and, I don't know about you, but I'm excited to see what Optimus will become. With any given technology, ask yourself: do you want to see what it's like in a year, two years, three years, five years, ten? I'd say for sure you want to see what's happened with Optimus, whereas a bunch of other technologies have sort of plateaued; I'm not going to name names here. [Laughter] I think Optimus is going to be incredible in five years, ten years, like mind-blowing, and I'm really interested to see that happen. I hope you are too.
I have a quick question here. I'm Justin, and I was wondering: are you planning to extend conversational capabilities for the robot? And my follow-up question is: what's the end goal with Optimus?

Yeah, Optimus will definitely have conversational capabilities, so you'd be able to talk to it and have a conversation and it would feel quite natural. From an end-goal standpoint, I don't know; I think it's going to keep evolving and I'm not sure where it ends up, but some place interesting for sure. We always have to be careful not to go down the Terminator path. I thought maybe we should start off with a video of the Terminator, you know, the skull-crushing scene, but I don't know if you want to take that too seriously. So yeah, we do want Optimus to be safe, so we are designing in safeguards where you can locally stop the robot, with basically a localized control ROM that you can't update over the Internet, which I think is quite important, essential frankly. So, like a localized stop button or remote control, something like that, that cannot be changed. But it's definitely going to be interesting; it won't be boring.
Okay, yeah. I see today you have a very attractive product with Dojo and its applications, so I'm wondering: what's the future for the Dojo platform? Would you like to provide infrastructure as a service, like AWS, or would you sell it as a chip, like Nvidia? Basically, what's the future? Because you're using a seven-nanometer process, the development cost is easily over ten million US dollars, so how do you make it work business-wise?

Yeah, I mean, Dojo is a very big computer, and it will actually use a lot of power and needs a lot of cooling, so I think it's probably going to make more sense to have Dojo operate in an Amazon Web Services manner than to try to sell it to someone else. The most efficient way to operate Dojo is just to have it be a service that you can use, that's available online, where you can train your models way faster and for less money. And as the world transitions to software 2.0 — that's on the bingo card, so someone, I think, now has to drink five tequilas [Laughter] — we'll use a lot of neural net training. So it kind of makes sense that over time, as there's more and more neural net work that people want to do, they'll want the fastest, lowest-cost neural net training system, and I think there's a lot of opportunity in that direction.
Hi, my name is Ali Jahanian. Thank you for this event, it's very inspirational. My question is: what is your vision for humanoid robots that understand our emotions and art and can contribute to our creativity?

Well, you're already seeing AI that is able to generate very interesting art, like DALL-E and DALL-E 2, and I think we'll start seeing AI that can generate even movies that have coherence, interesting movies, and tell jokes. It's quite remarkable how fast AI is advancing, at many companies besides Tesla. We're headed for a very interesting future. Do you guys want to comment on that?

Yeah, I guess the Optimus robot can come up with physical art, not just digital art. You can ask for some dance moves in text or voice, and it can produce those in the future, so it's physical art, not just digital art.

Oh yeah, computers can absolutely make physical art, yeah, 100 percent. Like dance, sure, or play soccer, whatever. It needs to get more agile, but over time, for sure.
Thanks so much for the presentation. On the Tesla Autopilot slides, I noticed that the models you're using are heavily motivated by language models, and I was wondering what the history of that was and how much of an improvement it gave. I thought it was a really interesting, curious choice to use language models for the lanes.

So there are two aspects to why we transitioned to language modeling. (Talk loud and close — okay, got it.) The language models help us in two ways. The first is that they let us predict lanes that we couldn't have otherwise. As Ashok mentioned earlier, when we predicted lanes in a dense 3D fashion, we could only model certain kinds of lanes, but we want to get those criss-crossing connections inside intersections, and it's just not possible to do that without making it a graph prediction; if you try to do it with dense segmentation, it just doesn't work. Also, lane prediction is a multimodal problem: sometimes you just don't have sufficient visual information to know precisely how things look on the other side of the intersection, so you need a method that can generalize and produce coherent predictions. You don't want to be predicting two lanes and three lanes at the same time; you want to commit to one, and a generative model, like these language models, provides that.
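To make that "language of lanes" idea concrete, here is a minimal sketch assuming a token vocabulary of discretized lane-graph actions (our invention for illustration; the real vocabulary and model are not described here). The point is that an autoregressive decoder commits to one coherent hypothesis token by token instead of blurring multiple modes together the way dense segmentation does:

```python
import torch
from torch import nn

# Hypothetical token vocabulary for a lane-graph "sentence":
# e.g. 0 = end-of-graph, 1..K = discretized node positions / connectivity ops.
VOCAB_SIZE = 1024
EMBED_DIM = 256

class LaneGraphDecoder(nn.Module):
    """Autoregressive decoder over lane-graph tokens, conditioned on BEV features."""
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        layer = nn.TransformerDecoderLayer(EMBED_DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, tokens, bev_features):
        # tokens: (B, T) previously emitted lane tokens
        # bev_features: (B, S, EMBED_DIM) flattened bird's-eye-view features
        t = tokens.size(1)
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        x = self.decoder(self.token_embed(tokens), bev_features, tgt_mask=causal_mask)
        return self.head(x)  # next-token logits: the model commits to one mode

# Greedy decoding sketch: start from a begin token and emit one token at a time.
model = LaneGraphDecoder()
bev = torch.randn(1, 100, EMBED_DIM)
tokens = torch.zeros(1, 1, dtype=torch.long)
for _ in range(20):
    logits = model(tokens, bev)
    nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, nxt], dim=1)
    if nxt.item() == 0:      # end-of-graph token
        break
```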
Hi, my name is Giovanni. Thanks for the presentation, it's really nice. I have a question for the FSD team: for the neural networks, how do you do software unit tests? Do you have, I don't know, thousands of test cases that the neural network, after you train it, has to pass before you release it as a product? Basically, what's your software unit-testing strategy for this?

Yeah, glad you asked. There's a series of tests that we've defined, starting with unit tests for the software itself. Then, for the neural network models, we have VIP sets defined: we find that just having a large test set is not enough; we need sophisticated VIP sets for different failure modes, and we curate them and grow them over the lifetime of the product. So over the years we've curated hundreds of thousands of examples where we've failed in the past, and for any new model we test against the entire history of those failures and keep adding to the test set. On top of this we have shadow modes, where we ship these models silently to the car and get data back on where they're failing or succeeding. And there's an extensive QA program; it's very hard to ship a regression, there are something like nine levels of filters before it hits customers, and we have really good infra to make this all efficient.

And I'm one of the QA testers — I QA the car. [Laughter] I'm constantly in the car, just QA-ing whatever the latest alpha build is that doesn't totally crash. That finds a lot of bugs.
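A stripped-down version of that kind of regression gate, with hypothetical file layout and thresholds (the term "VIP set" is Tesla's; everything else here is ours): each failure mode gets its own curated set, and a new model only passes if it does not regress on any of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "VIP set" here is just a named, curated list of hard examples for one
# failure mode, plus the accuracy the previously shipped model achieved on it.
@dataclass
class VipSet:
    name: str
    examples: List[dict]       # curated hard cases for this failure mode
    baseline_accuracy: float   # what the shipping model scores today

def evaluate(model: Callable[[dict], int], vip: VipSet) -> float:
    correct = sum(model(ex) == ex["label"] for ex in vip.examples)
    return correct / len(vip.examples)

def regression_gate(model, vip_sets: List[VipSet], tolerance: float = 0.0) -> Dict[str, bool]:
    """Pass only if every curated failure-mode set is at least as good as baseline."""
    return {v.name: evaluate(model, v) >= v.baseline_accuracy - tolerance
            for v in vip_sets}

# Toy usage with a dummy model and two made-up failure modes.
dummy_model = lambda ex: ex["label"]   # stand-in "perfect" model
vips = [
    VipSet("stopped_cars_on_highway", [{"label": 1}, {"label": 0}], 0.90),
    VipSet("rain_at_night",           [{"label": 1}, {"label": 1}], 0.85),
]
print(regression_gate(dummy_model, vips))   # {'stopped_cars_on_highway': True, ...}
```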
Hi, great event. I have a question about foundation models. We've all seen that when you scale up big models with data and parameters, from GPT-3 to PaLM, they can actually start to do reasoning. Do you see it as essential to scale up foundation models with data and size, so that you at least get a teacher model that can potentially solve all the problems, and then distill it down to a student model? Is that how you see foundation models being relevant here?

I mean, that's quite similar to our auto-labeling models. We don't just have models that run in the car; we train models that run entirely offline, that are extremely large and can't run in real time on the car. We run those offline on the servers, producing really good labels that then train the online networks, so that's one form of distillation, a teacher-student model. In terms of foundation models, we're building some really, really large datasets, multiple petabytes, and we're seeing that some of these tasks work really well when we have those large datasets, like the kinematics I mentioned: video in, and all the kinematics of all the objects out, up to the fourth derivative. People thought we couldn't do detection with cameras — detection, depth, velocity, acceleration — and imagine how precise these have to be for those higher-order derivatives to be accurate. That all comes from these kinds of large datasets and large models, so we are seeing the equivalent of foundation models, in our own way, for geometry and kinematics and things like that. Do you want to add anything, John?

Yeah, I'll keep it brief. Basically, whenever we train on a larger dataset we see big improvements in our model performance, and whenever we initialize our networks with some pre-training step from some other auxiliary task, we see improvements. Self-supervised or supervised pre-training with large datasets both help a lot.
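A minimal sketch of that offline-teacher / online-student pattern, with placeholder models (nothing here reflects Tesla's actual architectures): a large model that is too slow for the car runs offline to produce pseudo-labels, and a small model is trained on them.

```python
import torch
from torch import nn

# Placeholder networks: a big offline "auto-labeling" teacher and a small
# real-time student. Sizes are arbitrary.
teacher = nn.Sequential(nn.Linear(128, 2048), nn.ReLU(),
                        nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 10))
student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

unlabeled_clips = torch.randn(512, 128)     # stand-in for raw fleet data

# Offline pass: the teacher can be as slow as we like; it just produces labels.
with torch.no_grad():
    pseudo_labels = teacher(unlabeled_clips).argmax(dim=-1)

# Online training: the student is what would actually ship to the car.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(10):
    loss = loss_fn(student(unlabeled_clips), pseudo_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final distillation loss: {loss.item():.3f}")
```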
Hey, at the beginning Elon said that Tesla was potentially interested in building artificial general intelligence systems. Given the potentially transformative impact of technology like that, it seems prudent to invest in technical AGI safety expertise specifically. I know Tesla does a lot of technical narrow-AI safety research; I was curious whether Tesla intends to build expertise in technical artificial general intelligence safety specifically.

Well, if it starts looking like we're going to make a significant contribution to artificial general intelligence, then we'll for sure invest in safety. I'm a big believer in AI safety. I think there should be an AI regulatory authority at the government level, just as there is a regulatory authority for anything that affects public safety: we have regulatory authorities for aircraft and cars and food and drugs because they affect public safety, and AI also affects public safety. This is not really something governments understand yet, but I think there should be a referee trying to ensure public safety for AGI. And if you think about what elements are necessary to create AGI, the accessible dataset is extremely important. If you've got a large number of cars and humanoid robots processing petabytes of video and audio data from the real world, just like humans do, that might be the biggest dataset — it probably is — because in addition to that you can obviously incrementally scan the internet, but what the internet can't quite do is have millions or hundreds of millions of cameras out in the real world, like I said, with audio and other sensors as well. So I think we probably will have the most data, and probably the most training power, and therefore we probably will make a contribution to AGI.
Hey, I noticed the Semi was back there, but we haven't talked about it too much. For the Semi truck, what changes are you thinking about from a sensing perspective? I imagine there are very different requirements than for a car — and if you don't think that's true, why not?

No, I think basically, think about what drives any vehicle: it's a biological neural net with eyes, with cameras essentially. Your primary sensors are two cameras on a slow gimbal — a very slow gimbal, that's your head. So if a biological neural net with two cameras on a slow gimbal can drive a semi truck, then if you've got eight cameras with continuous 360-degree vision, operating at a higher frame rate and with a much faster reaction time, I think it's obvious that you should be able to drive a semi, or any vehicle, much better than a human.
Hi, my name is Akshay, thank you for the event. Assuming Optimus will be used for different use cases and will evolve at a different pace for each of them, would it be possible to develop different software and hardware components independently and deploy them into Optimus, so that overall feature development is faster?

Okay, all right — unfortunately our neural net did not comprehend the question. So, next question.
I want to switch gears to Autopilot. When do you plan to roll out the FSD beta to countries other than the US and Canada? And my next question: what's the biggest bottleneck or technological barrier in the current Autopilot stack, and how do you envision solving it to make Autopilot considerably better than a human in terms of performance metrics, safety assurance, and human confidence? I think you're also, for FSD beta, going to combine the highway and city driving into a single stack, with some big architectural improvements — can you maybe expand a bit on that? Thank you.

Well, that's a whole bunch of questions. We're hopeful — I think from a technical standpoint, FSD beta should be possible to roll out worldwide by the end of this year, but for a lot of countries we need regulatory approval, so we are somewhat gated by regulatory approval in other countries. From a technical standpoint, though, I think it will be ready to go to a worldwide beta by the end of this year. And there's quite a big improvement that we're expecting to release next month that will be especially good at assessing the velocity of fast-moving cross traffic, among a bunch of other things. Does anyone want to elaborate?
Yeah, on the objects: there used to be a lot of differences between production Autopilot and the full self-driving beta, but those differences have been getting smaller and smaller over time. Just a few months ago we started using the same vision-only object detection stack in both FSD and production Autopilot on all vehicles. There are still a few differences, the primary one being the way we predict lanes right now: we upgraded the lane model so it can handle the more complex geometries I mentioned in the talk, while production Autopilot still uses a simpler lane model, but we're extending our current FSD beta models to work in all sorts of highway scenarios as well.

Yeah, and the version of FSD beta that I drive actually does have the integrated stack, so it uses the FSD stack both on city streets and on the highway, and it works quite well for me. But we need to validate it in all kinds of weather, like heavy rain, snow, and dust, and just make sure it's working better than the production stack across a wide range of environments. We're pretty close to that. I think it'll definitely be before the end of the year, maybe November.

Yeah, in our personal drives, the FSD stack already drives way better on the highway than the production stack we have, and we also expect to include the parking-lot stack as part of the FSD stack before the end of this year. So that will basically let you sit in the car in a parking lot and have it drive to a parking spot at the other end, before the end of this year.
Yeah, and in terms of the fundamental metric to optimize against, it's miles between necessary interventions: massively improving how many miles the car can drive in full autonomy before a safety-critical intervention is required. That's the fundamental metric we're measuring every week, and we're making radical improvements on it.
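As a formula, this metric is just total autonomous miles divided by the number of safety-critical interventions. A toy weekly rollup might look like this (field names and numbers are invented):

```python
from collections import defaultdict

# Invented fleet log: (software_version, miles_driven, safety_critical_interventions)
drives = [
    ("build_A", 1200.0, 3),
    ("build_A",  800.0, 1),
    ("build_B", 1500.0, 2),
    ("build_B",  900.0, 1),
]

totals = defaultdict(lambda: [0.0, 0])
for version, miles, interventions in drives:
    totals[version][0] += miles
    totals[version][1] += interventions

for version, (miles, interventions) in sorted(totals.items()):
    mpi = miles / interventions if interventions else float("inf")
    print(f"{version}: {mpi:.0f} miles per safety-critical intervention")
```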
Hi, thank you so much for the presentation, very inspiring. My name is Daisy, and I actually have a non-technical question for you. If you were back in your 20s, what are some of the things you wish you knew back then? What advice would you give your younger self?

Well, I'm trying to figure out something useful to say. "Join Tesla" would be one thing. [Laughter] I think, generally, try to expose yourself to as many smart people as possible, and read a lot of books — I did do that, though. I also think there's some merit to not being too intense and to enjoying the moment a bit more. I would tell twenty-something me to stop and smell the roses occasionally; that would probably be a good idea. You know, when we were developing the Falcon 1 rocket on the Kwajalein Atoll, we had this beautiful little island we were developing the rocket on, and not once during that entire time did I even have a drink on the beach. I should have had a drink on the beach. That would have been fine.
Thank you very much. I think you've excited all of the robotics people with Optimus. This feels very much like self-driving ten years ago, but as driving has proved to be harder than it looked ten years ago, what do we know now that we didn't know ten years ago that would make, for example, AGI on a humanoid come faster?

Well, it seems to me that AI is advancing very quickly; hardly a week goes by without some significant announcement. At this point AI seems to be able to win at almost any rule-based game, it's able to create extremely impressive art, it can engage in conversations that are very sophisticated, it can write essays, and these just keep improving. There are so many more talented people working on AI, and the hardware is getting better. I think AI is on a strong exponential curve of improvement, independent of what we do at Tesla, and obviously we'll benefit somewhat from that curve. Tesla just also happens to be very good at actuators: motors, gearboxes, controllers, power electronics, batteries, sensors. Like I say, the biggest difference between the robot on four wheels and the robot with arms and legs is getting the actuators right — actually, it's an actuators-and-sensors problem, and obviously how you control those actuators and sensors. So you have to have the ingredients necessary to create a compelling robot, and we're doing it.
Hi Elon. You are literally bringing humanity to the next level — Tesla and you are bringing humanity to the next level. You said Optimus will be used in the next Tesla factory. My question is: will a new Tesla factory be fully run by the Optimus program? And when can the general public order a humanoid?

Yeah, I think we're going to start Optimus with very simple tasks in the factory, like maybe just loading a part, like you saw in the video — carrying a part from one place to another, or loading a part into one of our more conventional robot cells that welds the body together. So we'll start by just figuring out how to make it useful at all, and then gradually expand the number of situations where it's useful. I think the number of situations where Optimus is useful will grow exponentially, really, really fast. In terms of when people can order one, I don't know, I think it's not that far away — well, I think you mean when people can receive one. So I don't know, I'd say probably within three years, and not more than five years: within three to five years you could probably receive an Optimus.
I feel the best way to make progress toward AGI is to involve as many smart people across the world as possible. Given the size and resources of Tesla compared to robotics companies, and given the state of humanoid research at the moment, wouldn't it make sense for Tesla to open-source some of the simulation and hardware parts? Tesla could still be the dominant platform, something like Android OS or iOS for humanoid research. Would you consider that, rather than keeping Optimus just for Tesla researchers or the factory itself — open it up and let the whole world explore humanoid research?

I think we have to be careful about Optimus potentially being used in ways that are bad, because that is one of the possible things people could do. So I think we would provide Optimus in a way where you can give it instructions, but where those instructions are governed by some laws of robotics that you cannot overcome — so, not doing harm to others — and I think it will have quite a few safety-related things built in. All right, well, we'll just take maybe a few more questions, and then thank you all for coming.
Two questions, one deep and one broad. On the deep one, for Optimus: what's the current and what's the ideal controller bandwidth? And the broader question: this has been a big advertisement for the depth and breadth of the company — what is it, uniquely, about Tesla that enables that?

Anyone want to tackle the bandwidth question?

Yeah. For the bandwidth question, you have to understand, or figure out, the task you want to do: if you take a frequency transform of that task, it tells you what you want your limbs to do, and that's where your bandwidth comes from. It's not a number you can just state; you need to understand your use case, and the bandwidth follows from that.

What was the broad question? I don't quite remember — the breadth-and-depth thing; I can answer that. But on the back of the bandwidth question: I think we probably will just keep increasing the bandwidth, which translates to the dexterity and reaction time of the robot. You could say its current state isn't one hertz, and maybe you don't need to go all the way to 100 hertz, but I don't know, maybe 10 or 25. Over time I think the bandwidth will increase quite a bit, or, translated into dexterity and latency, you'd want to minimize latency and maximize dexterity over time.
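That "frequency transform of the task" idea can be illustrated with a few lines of NumPy (entirely our example, with a made-up reference trajectory): take the spectrum of the joint motion the task requires and see where most of the energy sits; the controller bandwidth needs to comfortably cover that band.

```python
import numpy as np

# Made-up reference trajectory for one joint: a reaching motion with a small
# corrective wiggle, sampled at 200 Hz for 2 seconds.
fs = 200.0
t = np.arange(0, 2.0, 1.0 / fs)
trajectory = np.sin(2 * np.pi * 0.8 * t) + 0.1 * np.sin(2 * np.pi * 6.0 * t)

# Frequency content of the task.
spectrum = np.abs(np.fft.rfft(trajectory - trajectory.mean()))
freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / fs)

# Bandwidth estimate: the frequency below which 95% of the signal energy lies.
energy = np.cumsum(spectrum ** 2)
f95 = freqs[np.searchsorted(energy, 0.95 * energy[-1])]
print(f"~95% of task energy below {f95:.1f} Hz -> controller bandwidth should exceed this")
```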
In terms of breadth and depth: we're a pretty big company at this point, so we have a lot of different areas of expertise that we necessarily had to develop in order to make electric cars, and then in order to make autonomous electric cars. Tesla is basically a whole series of startups, and so far they've almost all been quite successful, so we must be doing something right. I consider one of my core responsibilities in running the company to be maintaining an environment where great engineers can flourish. I think at a lot of companies, maybe most companies, if somebody is a really talented, driven engineer, their talents end up being suppressed. At some companies the engineering talent is suppressed in a way that isn't obviously bad: it's just so comfortable, and you're paid so much money, and the output you actually have to produce is so low, that it's like a honey trap. There are a few honey-trap places in Silicon Valley that don't necessarily seem like bad places for engineers, but if a good engineer went in, what did they get out? The output of that engineering talent seems very low, even though they seem to be enjoying themselves — that's why I call them honey-trap companies. Tesla is not a honey trap: we're demanding, and you're going to get a lot done, and it's going to be really cool. It's not going to be easy, but if you are a super-talented engineer, your talents will be used to a greater degree, I think, than anywhere else. SpaceX is also that way.
Hi Elon, I have two questions, both for the Autopilot team. I've been following your progress for the past few years. Today you talked about changes to lane detection: you said that previously you were doing instance and semantic segmentation, and now you've built transformer models for predicting the lanes. So what are some other common challenges you're facing right now, that you'll be solving in the future, so that we as researchers can start working on them? And the second question: I'm really curious about the data engine. You described a case where a car is stopped — how are you finding cases that are very similar to that in the data you have? A little more on the data engine would be great.

I'll start with the first question, using the occupancy network as an example. What you saw in the presentation did not exist a year ago; we spent only about one year on the occupancy network, and having one foundation model that represents the entire physical world around you, everywhere, in all conditions, is actually really, really challenging. Only a little over a year ago we were driving in a kind of 2D world: if there was a wall or a curb, we represented both with the same static edge, which is obviously not ideal — there's a big difference between a curb and a wall; when you drive, you make different choices. Once we realized we had to go to 3D, we had to basically rethink the entire problem and think about how to address it. That's one example of a challenge we've conquered in the past year.
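The shift described there, from labeling 2D edges to predicting 3D occupancy, can be pictured with a toy voxel grid (illustrative only; the real network predicts occupancy from camera video): a curb and a wall that collapse to the same "static edge" in 2D look very different once height is represented.

```python
import numpy as np

# Toy bird's-eye voxel grid: x (forward) x y (lateral) x z (height), 0.5 m voxels.
grid = np.zeros((40, 20, 8), dtype=bool)

grid[:, 18, 0] = True   # a curb: a low obstacle, one voxel tall, on the right
grid[:, 1, :6] = True   # a wall: a tall obstacle on the left

def max_obstacle_height(occupancy: np.ndarray, voxel_m: float = 0.5) -> np.ndarray:
    """Per-(x, y) cell: height of the tallest occupied voxel, in meters."""
    heights = (np.arange(occupancy.shape[2]) + 1) * voxel_m
    return np.where(occupancy, heights, 0.0).max(axis=2)

h = max_obstacle_height(grid)
print("curb column max height (m):", h[:, 18].max())   # 0.5
print("wall column max height (m):", h[:, 1].max())    # 3.0
# In a flat 2D edge map both would be the same "static edge"; in 3D they differ.
```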
To answer the question about how we actually source examples of the tricky stopped cars: there are a few ways to go about it, but here are two. One, we can trigger on disagreements within our signals — say the "parked" bit flickers between parked and driving; we'll trigger on that and get the clip back. Two, we can leverage more of the shadow-mode logic: if the customer ignores the car but we think we should stop for it, we'll get that data back too. These are just various kinds of trigger logic that let us get those data campaigns back.
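A minimal sketch of that kind of trigger logic (field names and thresholds are invented): scan a clip's frame-by-frame attributes for a flickering "parked" flag, or for a shadow-mode disagreement, and flag the clip for upload.

```python
from typing import Dict, List

def parked_bit_flickers(frames: List[Dict], min_flips: int = 3) -> bool:
    """Trigger if the per-frame 'parked' attribute flips back and forth too often."""
    flips = sum(1 for a, b in zip(frames, frames[1:]) if a["parked"] != b["parked"])
    return flips >= min_flips

def shadow_disagreement(frames: List[Dict]) -> bool:
    """Trigger if the shadow model wants to stop while the driver just keeps going."""
    return any(f["shadow_wants_stop"] and not f["driver_braked"] for f in frames)

def should_upload(frames: List[Dict]) -> bool:
    return parked_bit_flickers(frames) or shadow_disagreement(frames)

# Toy clip: the parked flag flickers, so this clip would be pulled into the campaign.
clip = [{"parked": p, "shadow_wants_stop": False, "driver_braked": False}
        for p in (True, False, True, False, True)]
print(should_upload(clip))   # True
```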
Hi, thank you for the amazing presentation. There are a lot of companies focusing on the AGI problem, and one of the reasons it's such a hard problem is that the problem itself is so hard to define: several companies have several different definitions and focus on different things. So how is Tesla defining the AGI problem, and what are you focusing on specifically?

Well, we're not actually specifically focused on AGI. I'm simply saying that AGI seems likely to be an emergent property of what we're doing, because we're creating all these autonomous cars and autonomous humanoids that sit within a truly gigantic data stream that's coming in and being processed. It's by far the largest amount of real-world data, and it's data you can't get by just searching the internet, because you have to be out there in the world, interacting with people and with the roads, and Earth is a big place and reality is messy and complicated. So it just seems likely to be an emergent property: if you've got tens or hundreds of millions of autonomous vehicles, and maybe a comparable number of humanoids — maybe more on the humanoid front — that's just the largest amount of data, and if that video is being processed, it seems likely that the cars will get way better than human drivers, and the humanoid robots will become increasingly indistinguishable from humans, perhaps, and then, like you said, you have an emergent property of AGI.
And arguably humans, collectively, are sort of a superintelligence as well, especially as we improve the data rate between humans. Back in the early days, the internet was like humanity acquiring a nervous system: all of a sudden, any one element of humanity could know all of the knowledge of humans, or certainly a huge part of it, by connecting to the internet. Previously we exchanged information by osmosis: to transfer data you had to write a letter, someone had to carry the letter to another person, and there were a whole bunch of steps in between. It's insanely slow when you think about it. And even if you were in the Library of Congress, you still didn't have access to all the world's information, and you certainly couldn't search it — and obviously very few people are in the Library of Congress. So one of the great equalizing elements: the internet has been the biggest equalizer in history in terms of access to information and knowledge. Any student of history, I think, would agree with this. Go back a thousand years: there were very few books, books were incredibly expensive, only a few people knew how to read, and an even smaller number of people even owned a book. Now you can access any book instantly and learn almost anything, basically for free. It's pretty incredible. I was asked recently what period of history I would most prefer to be in, and my answer was: right now. This is the most interesting time in history, and I read a lot of history. So let's do our best to keep that going.
And to go back to one of the earlier questions: the thing that's happened over time with Tesla Autopilot is that the neural nets have gradually absorbed more and more of the software. In the limit, you could simply take the videos as seen by the car and compare them to the steering inputs — the steering wheel and pedals, which are very simple inputs — and in principle you could train with nothing in between, because that's what humans are doing with a biological neural net: you'd train on video, and what supervises the training is the movement of the steering wheel and the pedals, with no other software in between. We're not there yet, but it's gradually going in that direction.
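As a sketch of that limit case — video in, steering and pedal commands out, with nothing hand-written in between — a behavior-cloning setup looks roughly like this (a toy architecture of our own; Tesla's production networks are far more structured):

```python
import torch
from torch import nn

class VideoToControls(nn.Module):
    """Toy end-to-end policy: a short clip of frames in, [steer, accel, brake] out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                     # per-frame image encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.temporal = nn.GRU(32, 64, batch_first=True)  # fuse frames over time
        self.head = nn.Linear(64, 3)                      # steer, accel, brake

    def forward(self, clips):                             # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1)).view(b, t, -1)
        _, h = self.temporal(feats)
        return self.head(h[-1])

# The training signal is just what the human did: logged wheel and pedal positions.
model = VideoToControls()
clips = torch.randn(4, 8, 3, 96, 96)          # fake camera clips
human_controls = torch.randn(4, 3)            # fake logged steering/pedal values
loss = nn.functional.mse_loss(model(clips), human_controls)
loss.backward()
print(f"behavior-cloning loss: {loss.item():.3f}")
```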
All right, last question. I think we've got a question in the front here — hello, right there.

Hi, thanks for such a great presentation. With FSD being used by so many people, how do you evaluate the company's risk tolerance in terms of performance statistics? And do you think there needs to be more transparency, or regulation from third parties, as to what's good enough — defining thresholds for performance across so many miles?

Sure. Well, the number one design requirement at Tesla is safety, and that goes across the board. In terms of the mechanical safety of the car, we have the lowest probability of injury of any cars ever tested by the government, and that's just passive mechanical safety: essentially crash structure, airbags, and whatnot. We have the highest rating for active safety as well, and I think it's going to get to the point where the active safety is so ridiculously good that it's just absurdly better than a human. Then, with respect to Autopilot, we do publish, broadly speaking, the statistics on miles driven: Tesla cars with no autonomy, cars with Hardware 1, Hardware 2, Hardware 3, and then the ones in FSD beta, and we see steady improvements all along the way. Sometimes there's this dichotomy: should you wait until the car is, I don't know, three times safer than a person before deploying any autonomy? But I think that's actually morally wrong. At the point at which you believe that adding autonomy reduces injury and death, I think you have a moral obligation to deploy it, even though you're going to get sued and blamed by a lot of people, because the people whose lives you've saved don't know their lives were saved, and the people who do occasionally die or get injured definitely know, or their estate does, that there was, whatever, a problem with Autopilot. That's why you have to look at the numbers: total miles driven, how many accidents occurred, how many accidents were serious, how many fatalities. We've got well over three million cars on the road, so that's a lot of miles driven every day. It's not going to be perfect, but what matters is that it is very clearly safer than not deploying it. Yeah, so I think — last question. What's the last question here?
Okay, hi. I don't work on hardware, so maybe the hardware team can enlighten me: why is it required that there be symmetry in the design of Optimus? Humans have handedness, right? We use some sets of muscles more than others, so over time there's wear and tear, and maybe you'll start to see some joint or actuator failures more on one side — I understand this is an extremely early stage. Also, we as humans have based so much fantasy and fiction on superhuman capabilities; none of us wants to walk over there, we want to extend our arms — we have all these fantastical designs. So, considering everything else that's going on in terms of batteries and the intensity of compute, maybe you can leverage all those aspects into coming up with something, I don't know, more interesting in terms of the robot you're building, and I'm hoping you're able to explore those directions.
Yeah, I mean, I think it would be cool to, you know, make Inspector Gadget real; that would be pretty sweet. But right now we just want to make a basic humanoid work well, and our goal is the fastest path to a useful humanoid robot. I think this will ground us in reality, literally, and ensure that we're doing something useful. One of the hardest things to do is to be useful, and then to have high utility under the curve: how many people did you help, how much help did you provide to each person on average, and then the total utility. Actually shipping a useful product that people like, to a large number of people, is so insanely hard it boggles the mind. That's why I say there's a hell of a difference between a company that has shipped product and one that has not; it's night and day. And then, even once you ship a product, can you make the value of the output worth more than the cost of the input? That, again, is insanely difficult, especially with hardware. But over time, I think it would be cool to do creative things — have eight arms, whatever — and have different versions. Maybe there will be some hardware companies that are able to add things to an Optimus; maybe we'll add a power port or something like that, so you can add attachments to your Optimus the way you can add them to your phone. There are a lot of cool things that could be done over time, and there could be an ecosystem of companies that make add-ons for Optimus. So with that, I'd just like to thank the team for their hard work.
You guys are awesome. Thank you, and thank you all for coming, and for everyone online, thanks for tuning in. I think this will be one of those great videos where you can fast-forward to the bits you find most interesting: we tried to give you a tremendous amount of detail, literally so you can watch the video at your leisure, focus on the parts you find interesting, and skip the rest. So thank you all. We'll try to do this every year, and we might even do a monthly podcast. I think it'll be great to bring you along for the ride and show you what cool things are happening. Yeah, thank you.