Strange Loop / September 12-14 2019 / Stifel Theatre / St. Louis, MO

Democratizing AI - Back-fitting End-to-end Machine Learning at LinkedIn Scale

Joel Young and Bo Long

27 September 2018

Transcript

00:00

Good afternoon! I'm Joel Young. I lead the Machine Inference Infrastructure team at LinkedIn. This is my partner Bo Long. He leads the Machine Learning Algorithms team. Together we're leading a complete rethinking of how machine learning is done at LinkedIn.

00:24

Today we'll walk through what happened to start us on this journey. We'll give an overview of our products and how AI contributes, the opportunities we're facing, and how we're scaling our technology and our people to meet those opportunities.

00:41

Here are some of our key products. All of these have one thing in common: you can think of them as providing a view into the economic graph. Consider your LinkedIn connections, your connections' connections, and so on. This explicitly defines a graph -- a little snapshot of the economic graph. But it's not the full graph; it's not exactly right. There are people you work with on a day-to-day basis that you haven't connected to, and there are also people -- certainly ones in this room -- who, if you were introduced to them, if you got to meet them (it's probably part of why you're here today), would bring value to you and you would bring value to them. You can think of the People You May Know product as addressing this problem.

Basically, finding the missing edges in the graph.

We have the Feed product finding the information that you need right now to bring value: things like news articles, updates from your connections, and even job opportunities.

When you are looking for a new job -- time for the next play -- our job recommendations product helps you find the right job to move to.

Sometimes you'll find that for that job there's a skills gap -- you might be missing something. Our learning product recommends courses from the large library in LinkedIn Learning to help fill in the gap. By the way, there are courses in there for almost everything I'm going to talk about today, including deep learning, TensorFlow, etc.

On our Sales Navigator platform we help salespeople identify the influencers and decision makers at companies. It can help find a path to them to bring value.

Likewise, on the Recruiter product we help companies find exactly the right talent -- people like you -- to drive their mission.

02:43

So, it's pretty clear that all of these products have AI and machine learning at their core. How do we go about designing the AI and machine learning to address them? The key is to start from the product goals and work with the product managers and with our customers. What are the needs? What is the budget? How much time?

From that we derive product metrics that we can measure success against.

Then we work with data science and analytics to refine the product metrics into relevance metrics: concrete, measurable signals -- things that will become the labels when we move on to the machine learning stage.

We then join the labels with features drawn from the economic graph and move into the modeling and training stage -- we'll talk a lot more about this stage later on.

Once we've trained a model, we ship it to production. But we don't just throw it out there and start serving. We use an extensive A/B testing system to scientifically evaluate whether or not the new model is meeting our business goals. We evaluate against product metrics, relevance metrics, performance metrics, etc., and make informed decisions about whether to continue to ramp or to back off -- and this is a continuous cycle.

It's going continuously -- constant product refinement.

04:17

So, it's pretty clear that machine learning is driving all of this. It's also in hidden places in our stack. You have things like anti-abuse, spam detection, fraudulent account detection, and also even in the build system!

You might be deploying a product and the deploy fails. Historically when the deploy fails it goes back to the dev who is pushing the service and they try again -- it might work this time!

Why do they try again? Well, the failure could be their own mistake, it could be a random hardware malfunction somewhere out on the grid, it could be an actual bug, or it could be nondeterminism in the deployment stack. If we can apply a model to this -- look at the situation when the failure happens and direct the fault to the team that's most likely to be able to fix it -- we can shave off a significant amount of time and help build the developers' trust in the build system.
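As a toy illustration of that idea, here's a hedged sketch of routing a failed deploy to a likely owning team. The signal names, team names, and weights are all invented for illustration; LinkedIn's actual triage model is not public.

```python
# All signal names, teams, and weights here are invented for
# illustration; they are not LinkedIn's actual system.
FAILURE_SIGNALS = {
    "compile_error":    {"service-owner": 0.9, "infra": 0.05, "build-tools": 0.05},
    "host_unreachable": {"service-owner": 0.1, "infra": 0.8,  "build-tools": 0.1},
    "flaky_test":       {"service-owner": 0.3, "infra": 0.1,  "build-tools": 0.6},
}

def route_failure(signals):
    """Sum per-team scores across the observed signals and route the
    failure to the highest-scoring team; default to the service owner."""
    totals = {}
    for sig in signals:
        for team, w in FAILURE_SIGNALS.get(sig, {}).items():
            totals[team] = totals.get(team, 0.0) + w
    return max(totals, key=totals.get) if totals else "service-owner"
```

A real system would learn these weights from labeled failure history rather than hand-coding them.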

05:20

AI is eating the software world.

Many of our products have evolved over the past decade and they're built from substantial amounts of carefully engineered code -- in fact we have over a hundred search verticals and recommendation services providing relevance -- machine learning -- to our products. Each of these services was custom-built by different teams.

In the past we've had a few efforts to standardize the machine learning components in the stack. These have often worked in part -- they bring value, they help bring in a new technology -- but what they don't end up doing is reducing complexity. They don't replace things; they just add another part.

What I call second system syndrome.

Also, we have many different modeling technologies in use: simple logistic regression, tree ensembles, deep learning, and additive ensembles of all three of the above.

Our workflows are built with things like Pig, Dali, Hive, Spark, Scalding, Cascading, etc.

Tons and tons of different technologies.

06:37

Also, all of these services are mission critical to the company.

Whatever we do as we try to advance the state of the art, we need to reduce the risk of site issues, and when they do happen, we need to make them easier to troubleshoot.

06:53

Also, there are some secular trends happening in the world. Machine learning is becoming democratized. It used to be an ivory-tower skill -- you needed a master's degree or a PhD in machine learning, or maybe physics -- but now we're seeing people come out of boot camps and online courses, people taking Coursera courses, people taking pretty high-quality courses even in their undergrad programs.

At LinkedIn what we're finding is almost every engineering team -- product teams, UI teams -- all of them have people with meaningful practical AI skills.

Remember, again: AI is eating the software world. The engineers want to solve their problems, and now they have the skills to do it. They don't want to be blocked by the machine learning gods.

07:49

But there's too much friction in the stacks. Too many stacks. Too many technologies. Too much manual work to gather the features, design and train the models, deploy them into production, and to keep them running.

08:06

Another of the secular trends is that the rate of advancement in machine learning approaches -- and even more in the tooling to build them -- is accelerating. Almost every day, if you follow the right news sources, you see big companies and labs announcing new tooling: things like TensorFlow, Caffe2, TLC, ONNX, Spark MLlib. Every day the list is longer.

08:31

So, we spent a fair amount of time thinking, "Hey! Is the right solution to go all-in on one of these stacks? Maybe TensorFlow, maybe PyTorch?" But we can't know which one of them is going to win the shakeout wars that are almost certainly coming, and most likely each of these stacks is going to carve out a portion of the space -- each bringing value to different classes of models, different styles of problems -- and we don't want to bind ourselves in.

If you look at the Clean Architecture book, the author talks about guidelines; one of them is to delay your decision-making. And in any case, we already have a few technologies that we know we have to support. So, as we design and build our future for machine learning, it needs to be extensible -- not only by the core infrastructure and machine learning teams but also by our end users. Even the UI teams.

If they come up with a new technology that exactly solves their problem, they need to be able to bind it into what we're building, rather than building their own thing whose tech debt we'll have to clean up later on.

09:41

You've probably heard of the fundamental rule of engineering: fast, good, cheap -- pick two. Historically we've picked fast and cheap, or good and cheap. That brought very quick value to particular products, but it's also part of why we always end up with second-system syndrome: we don't actually reduce complexity. This time Bo and I are able to go in with the other two choices: fast and good. And by fast what we're really focusing on is fast iteration speed for the modelers -- a fast experimentation cycle. We aren't forgetting cheap, but now it's cheap to serve, not cheap to build.

10:31

So, what is our goal? Our concrete target is to more than double modeler efficiency -- modeler productivity. One of the things we've found is that there's a direct connection between modeler iteration speed (as long as that iteration is coupled with early feedback) and improved models -- improved product metrics.

We're doing this by focusing on the model creation stage, the deployment stage, and then maintenance: running and continuously monitoring the models in production.

11:10

We've divided the problem into a set of layers.

We have experimentation and authoring: exploring the data science, exploring the economic graph, trying modeling approaches.

That drives down into the training layer: vast clusters, GPUs, distributed training systems.

Then down into packaging up the artifacts -- the artifacts can be surprisingly complicated and large -- moving them into the right production environments, and then running in production.

Running in production as well as in training are the two places where the "fast and cheap" comes in. Threading through all this are two pillars.

We have what we call the Feature Marketplace. This is where we look at better technologies for generating features: things like sliding-window feature aggregation, count services, and a feature pub/sub system. As a development or modeling team, you might come up with a pretty cool feature. Maybe you have a new deep embedding that captures a member profile and you're getting a lot of leverage from it. You've heard of this idea of transfer learning -- you want to share it with another team. Historically that's been extremely complicated. What we're doing is separating the data for the feature from its semantics. Features have names. As a modeler, all you need to do is know the name of the feature. The providers of the data provide anchors in each of the execution environments: offline, I'll get it from this Avro file, this ORC file, this Dali view; online, it'll come from this key-value store or this REST API; down in search, it'll come from the query, or from the inverted index -- or the forward index.

On the other side we have the assurance pillar -- health assurance. Constant monitoring, constant checking to make sure the systems are staying healthy. When we're designing a model, all the way down in the authoring stage, we check that the features are available online and give warnings. Maybe it's okay that they're not available yet, but you want to know that part of your deployment process is going to be getting those features into production, checking for offline/online consistency, etc. I'll talk a little more about this later.

13:39

OK. So how are we organizing this? Some of you may have heard of McChrystal's team-of-teams model. Historically, organizations are very hierarchical: you have managers, with managers reporting to those managers, and you have teams. We're not running this project that way. Instead, for each of those layers and each of those pillars we have a lead -- usually an engineer, a tech lead -- and the engineers working on it come from across the organization. Some of them are up in product engineering, some are in our tool chain, some are working on modeling in the relevance org, some are in infrastructure.

Also, we have a leadership team, of which Bo and I are part. The leadership team isn't directive -- we're not telling people "go do this, go do that." What we're setting is the direction and vision for the project, and we're looking for places where friction is building up -- where people don't know what they need to know about the other parts. Sometimes there's too much collaboration: the way things are set up, teams are having to talk too much, and it's slowing down progress. Then we'll help dig in and find a better API -- a better boundary between the projects. In other places there's not enough communication.

So, going back -- since we are talking about machine learning -- one way you can think about this is that the leadership team is really just a stochastic gradient descent optimizer. We're trying to find a more global optimum for the project. Hopefully not too stochastic, though.

15:29

It's also distributed across the world. We have participation -- active participation from our teams in Bangalore, from Europe, and from our multiple teams in the US.

15:40

So, what exactly do we mean when we talk about a model? If there's one thing you can take away from this: a model, for our purposes, is just a directed acyclic graph. It's just a DAG: you have a set of input features, you have transformations on those features, and that's it. Some of the transformations can be really simple -- things like converting a categorical feature into a one-hot encoding. They can be linear algebra operations or interaction features -- maybe a cross product or an inner product between two vectors. They can also be very rich transformations, like using TensorFlow to compute a deep embedding.

Some of the transformations in the DAG are also trainable. There are unknowns in them -- wildcards that need to be figured out.
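A minimal sketch of that idea -- a model as a DAG of feature transformations, some carrying trainable weights -- might look like this. The node names, vocabulary, and weights are invented for illustration.

```python
def one_hot(value, vocab):
    """Convert a categorical value into a one-hot vector."""
    return [1.0 if value == v else 0.0 for v in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Each node: (input node names, transformation). The interaction node's
# weights [0.4, 0.1] stand in for the "trainable" unknowns.
dag = {
    "title_vec": ([], lambda: one_hot("engineer", ["engineer", "designer"])),
    "seniority": ([], lambda: [0.7]),
    "interaction": (["title_vec", "seniority"],
                    lambda t, s: [dot(t, [0.4, 0.1]) * s[0]]),
}

def evaluate(dag, node, cache=None):
    """Recursively evaluate a node of the model DAG, memoizing results."""
    cache = {} if cache is None else cache
    if node not in cache:
        inputs, fn = dag[node]
        cache[node] = fn(*(evaluate(dag, i, cache) for i in inputs))
    return cache[node]
```

Training would then amount to searching for the unknown weights inside the trainable nodes while the DAG structure stays fixed.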

16:35

So, this is a notional DAG for one of our job recommendation models. We'll talk about this a lot more later on when we discuss training. But at a high level, there's a set of input features: we have raw member profile text, we have raw job text. From the raw member text, maybe, we compute a deep embedding, then project it down to, say, a hundred-dimensional real vector. Maybe we also have a set of standardized features -- standardized job title, seniority, geo locations, etc. -- that have been processed through streaming pipelines. From these we might compute a deep interaction model -- maybe a probability of click-through using the job embedding and member embedding. And we might take the output of that, plus all our other features, and learn a random forest with something like XGBoost. We have some other models that go into the mix -- we'll talk about this more later. And maybe we combine them all in a linear combination.
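That final step -- blending sub-model scores in a linear combination -- can be sketched like this. The sub-model names, scores, and weights are hypothetical.

```python
def blend(scores, weights):
    """Weighted linear combination of sub-model outputs (e.g. a deep
    interaction model and a tree ensemble) into one ranking score."""
    return sum(weights[name] * s for name, s in scores.items())

# Hypothetical sub-model outputs and learned blend weights.
final_score = blend({"deep_interaction": 0.62, "xgboost": 0.71},
                    {"deep_interaction": 0.4, "xgboost": 0.6})
```

In practice the blend weights themselves are trainable parameters, learned alongside or after the sub-models.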

17:44

So, how does an engineer actually go about building a model like this? For our hard-core modelers, the way they often like to work is that we just provide IntelliJ bindings for the DSLs (domain-specific languages). They work on the models just like other code and then use the build and deploy tooling to run the trainers and the exploration on Spark and Hadoop. For everyone else we provide a Jupyter-based exploration and authoring environment. In addition to a great interactive experience, we find that using JupyterHub really motivates engineers not to pull personal member information down onto their development boxes.

On this slide you can see a view into a Jupyter notebook with a notional two-dimensional title embedding. Actual title embeddings have far, far higher dimensionality, but they're hard to project.

18:53

So then building the model.

Now, we talked about having a DAG -- the model is just a DAG -- but when we represent it, what we've found is that the models get quite interesting and complex. A graphical drag-the-boxes-around interface seems cool -- people go "ooh," and pointy-haired bosses like me sometimes get excited about them -- but they don't speed up development, they don't speed up learning. They maybe speed up the first half-hour and remove the fear factor.

What we need is a lot more control, a lot more insight into what's happening. And for this we have a declarative DSL that we've built in-house called Quasar -- short for Quick Ass Scoring And Ranking. Our previous one, back four years ago, was called LOSER -- short for logistic scorer -- but no more negative names!

So here you can see a little section of the model. We're finding the cosine angle between two embeddings: a member embedding and a job embedding. We're also applying a tree model to compute scores. The scores might be a probability of apply, or a probability of click-through. If you can get the signal, maybe it's even a probability of what's called, in the ads world, a conversion -- the probability that the person actually gets the job. And we can figure that out -- it just takes a month or two for people to update their profiles. We know that they applied, we know they clicked, and eventually we'll know if they got the job.

Also, the DSL provides what you might call first-order operations: it can work on lists of entities. It can work as a ranker, not just a scorer.

So, you can see at the bottom we're taking a list of all the jobs -- all the documents -- ordered by the score. We can do top-K, we can do filters, we can do diversity -- split lists apart, bring them together. All sorts of operations on the lists of documents themselves.
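Quasar itself is internal to LinkedIn, but the list-level operations described -- score, filter, top-K -- can be sketched in plain Python. The embeddings and job ids here are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def rank(member_emb, jobs, k):
    """Score each job against the member embedding, filter out
    non-positive scores, and return the top-k job ids by score."""
    scored = [(cosine(member_emb, j["emb"]), j["id"]) for j in jobs]
    scored = [(s, jid) for s, jid in scored if s > 0]
    return [jid for s, jid in sorted(scored, reverse=True)[:k]]

# Toy inventory of three jobs with 2-d embeddings.
jobs = [{"id": "a", "emb": [1.0, 0.0]},
        {"id": "b", "emb": [0.0, 1.0]},
        {"id": "c", "emb": [0.6, 0.8]}]
```

A declarative DSL expresses the same pipeline -- score, filter, top-K -- as composable operators rather than imperative code, which is what makes the model easy to inspect and optimize.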

20:53

OK! All right! So, of course people want to be able to explore their data using the Jupyter notebook, and you get that for free. You can tie into matplotlib and the Python tools that everyone knows and loves.

21:10

And we also provide hooks so you can drive your larger-scale work. You don't want to do heavy lifting within the notebook itself -- the data is far too big to run on a single box even if we had a few terabytes of RAM installed. Instead you drive it over to Spark running on the big clusters. You can see a little notional comparison of the results for two versions of the model: accuracy and area under the curve.

So, Bo is going to walk us through how we train the models.

21:43

Thanks Joel!

So, like Joel mentioned, the goal is to help our engineers do model training productively. Let's take a look at how we train a model.

22:02

So, to train a machine-learned model we need two key components.

One is the features -- the data we prepare for model training; normally we refer to each variable as a feature. The other key component is the algorithm, which takes the features, optimizes an objective function, and learns a model. So, with features and an algorithm together, we can learn a model.

22:34

Let's take a look at the features.

First: in the age of big data, even a single model can have millions of features. At LinkedIn we train a lot of different models for different products, so the overall feature space is huge.

As you all know, feature engineering is a major part of the effort in model training.

A professor at Stanford once said, "Applied machine learning is basically just feature engineering."

So, at LinkedIn, to make our engineers more productive, we built a framework called the Feature Marketplace for engineers to share, discover, and explore features easily.

So, with this Feature Marketplace engineers can contribute their own features to the Feature Marketplace.

And the big thing is that, through the Feature Marketplace, people can easily get all the information about a feature -- how it was generated, which models use it, and the history of its distributions. We also provide a lot of other powerful tools for feature selection, validation, and monitoring of those features.

So, with the Feature Marketplace, engineers can find the features for their model training very productively.

24:07

I also want to mention one challenge specific to our Feature Marketplace: how do we make sure engineers actually get what they ask for?

I'm talking about offline/online consistency. When we do a lot of offline experiments to find good features, it works very well. But there's no guarantee you can actually get a stable feature online, because there are always a lot of steps -- often generated by scripts -- between offline and online.

So, at LinkedIn we built Frame, a platform that generates features both offline and online from common libraries and the same configurations, to ensure offline/online consistency. We also hide the feature-generation details from the engineers.

So, the modelers only need to specify the feature name; then they can easily get the feature for offline training and also when they do online model scoring.

Frame also provides other powerful tools, like feature joining, to make sure joins are done very efficiently.
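The core idea described here -- a feature referenced by name, with per-environment anchors feeding one shared transformation so offline and online values agree -- might be sketched like this. The registry API and feature name are hypothetical; Frame's real API is internal to LinkedIn.

```python
# All names here are illustrative, not Frame's actual API.
REGISTRY = {}

def register(name, transform):
    """Publish a named feature with its single, shared transformation."""
    REGISTRY[name] = transform

def get_feature(name, anchor):
    """anchor: a no-arg callable fetching raw data for the current
    environment (e.g. an HDFS file offline, a key-value store online).
    The same registered transform runs in both environments, which is
    what keeps offline and online values consistent."""
    return REGISTRY[name](anchor())

register("member_title_length", lambda raw: float(len(raw)))

# Different anchors, same transform, same result.
offline_value = get_feature("member_title_length", lambda: "staff engineer")
online_value = get_feature("member_title_length", lambda: "staff engineer")
```

The key design point is that only the anchor varies by environment; the semantics of the feature live in exactly one place.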

So, in summary, with the Feature Marketplace our engineers can easily identify good features for model training, with offline/online consistency.

25:32

So, after we have the features, let's look at the algorithms. At LinkedIn we have three major types of algorithm. The first is deep learning, a popular approach based on multi-layer neural networks. At LinkedIn we use TensorFlow for training deep learning models, and we use deep learning for both feature generation and prediction.

For feature generation, we generate all kinds of embeddings for models. For example, we generate word embeddings for user profiles, which can be used for job recommendations.

We also use deep learning for prediction, especially for challenging tasks like language understanding and image understanding.

The second type of algorithm we use at LinkedIn is trees -- basically, ensembles of trees. Specifically, we use the boosted-trees algorithm implemented in the open-source package XGBoost.

Boosted trees are very good at capturing nonlinear interactions within data, so we use them for both feature generation and prediction.

For the third type of algorithm, we developed our own package based on the generalized linear mixed model, which is a powerful model for recommendation applications.

I assume you know generalized linear models. They're popular for response prediction: for example, logistic regression for binary responses, linear regression for numeric responses, and Poisson regression for count responses. They are all specialized types of the generalized linear model.

When we use a generalized linear model for recommendation, we also want to capture information specific to a particular user or a particular item. For example, in job recommendation, you want to capture information specific to that particular user.

Normally we do this by incorporating user-ID features or item-ID features into the model -- you cross the other features with the user ID. When we do this, we are actually building a generalized linear mixed model. The intuition is very simple: we have a global model which uses all the data and all the features.

But on top of the global model, we use the data for a specific user to train a small per-user model. Similarly, you also train a per-item model just for that item -- for example, for a job. By doing this you get a richer representation.

So, the idea is simple, but the computation for the algorithm is very challenging. Think about it: we have 500 million members on LinkedIn. Even if each member model has only one thousand features, that's a huge number of parameters.

That's why we developed our own algorithm package for these generalized linear mixed models. We also open-sourced it -- the package is called Photon ML. Please feel free to give it a try.

More than that, we also generalized the generalized linear mixed model to non-linear models. Now we also support training a non-linear model -- for example, a tree model as the global model, with a per-user model on top based on, say, logistic regression. So, it's very flexible and powerful for recommendations.
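As a hedged sketch of the structure just described -- a global model plus per-member and per-job residual models, summed in the log-odds before the sigmoid -- consider the following. The coefficients and feature vectors are invented; the real implementation is LinkedIn's open-source Photon ML.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glmix_score(global_w, member_w, job_w, x_all, x_job, x_member):
    """Generalized linear mixed model score: the global, per-member,
    and per-job linear terms are summed in the log-odds, then
    squashed to a click/apply probability."""
    z = dot(global_w, x_all) + dot(member_w, x_job) + dot(job_w, x_member)
    return sigmoid(z)
```

With the per-member and per-job weights at zero this reduces to the plain global model, which is also how a brand-new member with no interaction history would be scored.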

29:26

So, after we have the features and the algorithms, we also built a training engine, Photon Connect, to connect all the training processes together.

This engine uses Frame, like I mentioned before, to prepare the features -- making sure our offline/online features are consistent. It calls the different types of algorithms, does all the feature transformations based on the DAG, and finishes the final model training; then we have Quasar to do the online model serving. Of course, it's a repeating process: you keep getting feedback and then you improve your model with another training cycle.

30:05

Now let's take a look at a specific example: LinkedIn job recommendation, where we recommend jobs to our members. First, take a look at the features. We have member features and job features.

For member features we have two types. One is member-profile text features, which are based on raw text -- very rich information, but quite noisy.

The other is member standardized features, which are based on structured data -- information extracted into a standard format. Much cleaner, but with less information.

On the job side, we have a similar situation.

So, let's look at how we generate the features. The raw text features, like I mentioned, have rich information but also noise -- that's why we use deep learning to extract embeddings that remove the noise and capture the right features. So, we have deep learning for the member and also for the job.

Together we also have interaction features, also based on a neural network. So now we have the deep-learning features, and we also have the standardized features. The features are all ready.

Now we do the final model training. Like I mentioned before, we use a generalized linear mixed model. So first we pick a forest -- an ensemble of trees -- as our global model, which takes all the data and all the features.

The global model can do the prediction on its own, but on top of the global model we also train a per-user model and a per-job model.

You can see the per-user model only uses that user's data, based on job features. On the other side, the per-job model only uses the member features.

Put together, they make the final model, which can recommend jobs to our users.

Now that we have this model, Joel will talk about how we deploy it and how we use it.

32:23

Thanks Bo!

So, think about the model that Bo just talked about. If you think about it naively -- brute force -- we have over five hundred million members and over ten million jobs in our inventory. If we tried to do an all-pairs solve, we'd have at least five gazillion member-job pairs to score (sorry, I can't say that number out loud). But fortunately we don't have to do it that way!

33:00

There's structure to the model. There's the global component -- the deep interaction model combined with XGBoost (the tree model). We take those two parts and the Quasar DAG, package them up, and push them as an artifact down to all of our searcher nodes.

Then, if you think about the per-member components, these are really just a vector of numbers -- the coefficients for a logistic regression. We can just put those in an online key-value store, indexed by, say, the member ID.

We can take the per-job coefficients -- again, just a set of numbers -- treat them as features, and write them into the sharded forward index that's deployed to the searchers.
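Putting the three storage locations together, serving-time scoring might be sketched like this. The dict-based stores, ids, and the additive scoring form are illustrative assumptions, not LinkedIn's actual code.

```python
# Stand-ins for the real stores: an online key-value store for
# per-member coefficients, and the searcher's forward index holding
# per-job coefficients alongside the job's features.
member_coeff_store = {"m42": [0.3, -0.1]}
forward_index = {"j7": {"job_coeffs": [0.2], "features": [1.0, 0.5]}}

def score_on_searcher(member_id, job_id, global_model, member_features):
    """Assemble the three pieces at query time: per-member coefficients
    from the KV store, per-job coefficients and features from the
    forward index, and the global model shipped with the artifact."""
    m_w = member_coeff_store[member_id]
    doc = forward_index[job_id]
    return (global_model(member_features + doc["features"])
            + sum(w * f for w, f in zip(m_w, doc["features"]))
            + sum(w * f for w, f in zip(doc["job_coeffs"], member_features)))
```

This layout is what lets the per-member part travel with the query while the per-job part stays sharded with the documents.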

33:56

Now think about when a model like this is actually used: we're not scoring all the members, we're scoring the member who is on the site. So we don't have 500 million members to deal with -- we have one. That makes the problem five hundred million times easier. But even ten million jobs is still a lot of work. How do we do it in practice?

34:12

The member comes to the site and navigates to the jobs page, and when they do, a REST call goes down to our job recommendations mid-tier. The job of the mid-tier is to integrate everything we know about the situation. It reaches out to our standardized data stores and pulls things like: what's the member's current job? What's their industry? Where are they (geo-location)?

It reaches out to the model store and pulls that member's coefficients. Which coefficients should it pull? Well, it has already checked with the A/B test platform which model this particular member is seeing. It also uses the Frame system: beyond the standardized features, there may be other features that the model down in the searchers is going to need. It bundles everything up and sends it in the query down to the search cluster.

The broker takes the query apart and shards it out to a whole bunch of searchers. Each searcher takes the per-member portion coming in from the query, the features from the forward index, and the global model, puts it all together, and runs the Quasar inference engine.

The inference engine supports lazy modes, columnar modes, batching so it can keep vector pipelines hot, etc.

Then the results come back out. And we can do this fast enough to give a real-time experience to our members.

35:39

So, we've got it running in production. How do we keep it healthy?

Well, one of the things we've found -- it's obvious, but not when you're stuck in the loop -- is that the more work you do to keep your models healthy early in the lifecycle, the easier it is to keep them healthy when they're running in production.

So, in the Health Assurance pillar we add a lot of verification and validation steps, even down in the authoring environment.

You've crafted this model. It uses a set of Frame features. Are those features available online? Are they updated fast enough in the online environment to give you the robustness you need? We're constantly checking whether the offline versions of the features are consistent with the online versions, trying to use the same calculations on both sides, trying to avoid some of the problems we see with the lambda architecture. Basically, constant monitoring, constant checking.

One of the common patterns you see with machine-learned models: they're trained on a set of features, and the performance of the model naturally starts degrading as the world drifts from the world that was in place at the time you trained.

Historically, what's detected is: "Oh, we're starting to see business metric impacts -- must be time to retrain the model." The natural inclination is to delay that as long as possible, because it's been difficult. You have to make sure the data pipelines still work. Maybe it's been a month since you last retrained.

So, the next step is to look at your data and make a guess: "All right, if I retrain and deploy weekly, maybe with something like a cron job, it'll work well enough."

But the reality is that weekly is sometimes too slow and sometimes too fast, so we're either throwing away product lift or throwing away compute.

With the monitoring system we can track this degradation, and when it reaches a threshold we can automatically trigger a retrain and deploy.
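The trigger itself can be sketched in a few lines. The metric, threshold value, and function name here are illustrative assumptions, not the actual system:

```python
def should_retrain(deploy_time_auc, live_auc, max_degradation=0.02):
    """Trigger a retrain once the live metric has degraded more than
    `max_degradation` below the value measured at deploy time."""
    return (deploy_time_auc - live_auc) > max_degradation
```

Instead of a fixed weekly cron, a monitor evaluates this on every metric refresh and kicks off the retrain-and-deploy pipeline only when it fires -- so you retrain exactly as often as the world's drift demands.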

37:47

So, another part is continuous online anomaly detection.

In a sense -- not in a sense, in reality -- anomaly detection is just another kind of machine learning model, so we're monitoring our model deployment system with a model trained in our model deployment system. Part of our alerting system is to actually ask the people receiving the alert: "Hey, was this a true anomaly? Was it an anomaly, but actually a new trend in the world -- not a flaw in the system, just new information that we need to integrate into our modeling processes? Was it a false positive? Or do you just not know yet?"
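As a sketch of both halves -- a deliberately trivial anomaly model plus the feedback labels the alert asks for -- something like the following (all names are hypothetical, and a real detector would be a learned model rather than a z-score rule):

```python
from enum import Enum
import statistics

class AlertFeedback(Enum):
    """The choices offered to whoever receives the alert."""
    TRUE_ANOMALY = "true_anomaly"
    NEW_TREND = "new_trend"          # a real change in the world, not a flaw
    FALSE_POSITIVE = "false_positive"
    UNKNOWN = "unknown"              # "huh, don't know yet"

def is_anomalous(history, value, z_threshold=3.0):
    """Flag a metric value more than `z_threshold` standard deviations
    away from its recent history -- the simplest possible detector."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold
```

The feedback labels then become training data: each answered alert is a labeled example for improving the detector itself.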

38:34

So, we've talked a lot about what we're doing to scale up the technology -- make it easy. But that's not enough. More and more people are coming in with machine learning skills, but not everybody.

Also, we have our own stack, so people who learned with Weka or TensorFlow or this toolkit or that toolkit have to come in and learn ours. We've set up something called AI Academy -- we have a blog post on it, so you can read more.

One of the interesting classes is a course for managers. They're not going to be doing the machine learning themselves -- they'll be leading people who are. What kinds of problems are actually solvable? What are they reasonably likely to get good results from? And what are still what we call AI-hard problems? Also, how to evaluate success: concepts like precision, recall, F-score, and area under the curve, and how to create well-structured A/B tests and understand significance.

For the people in the audience who do machine learning -- you know that it's fundamentally an R&D problem. It's very hard to say, "In one week we're going to have a model that's going to bring a half-percent lift." Probably not going to happen. It could take two months, depending on how hard the problem is.

40:02

Who's heard of the idea of a success trap? A few people. So, what's a success trap?

It's when a company, a business, a team, or an individual gets stuck -- since this is a machine learning talk -- in a local optimum. They've learned something about the world. They've had success, but they've over-learned. They've overfit to that piece of information, and they miss opportunities.

A classic example is Kodak with digital cameras -- missing the transition even though they were the ones who invented the digital camera. There's a particular success trap, which I call the support trap, that I've seen happen many times with new infrastructure components.

You onboard customers too quickly. Each customer that joins has a real support cost. If you bring in more customers than your team can support, the team is now frozen: you can't continue doing strategic development. You only have two choices. One is to evict customers -- that's a painful thing to do. The other is to try to get more resources -- find more engineers to bring in.

But if you waited too long, now you're a bad bet, right? People have started losing confidence, and your customers aren't happy since you're not quite serving them. It can be hard to get the head count, and even if you do, it can be slow to bring the new people up to speed.

It's much better to avoid this, so in the Productive ML project we're not doing a wholesale adoption. Instead we're going component by component, even sub-component by sub-component, through the layers and pillars.

Teams adopt each piece, and each adoption makes the next one easier. Maybe they want to use the Quasar modeling language and the inference engine? Maybe they want to use the Feature Marketplace? Maybe they want to use the monitoring? We also rate each component pre-alpha, alpha, beta, or general availability, and we limit adoption depending on where it is on that curve.
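That adoption gating can be thought of as a simple policy table. The limits below are made-up numbers just to illustrate the shape, not actual LinkedIn policy:

```python
# Hypothetical cap on adopting teams at each maturity level.
MATURITY_LIMITS = {
    "pre-alpha": 2,    # named design partners only
    "alpha": 5,
    "beta": 20,
    "ga": None,        # no limit at general availability
}

def can_adopt(maturity, current_adopters):
    """Gate new adoptions by the component's maturity rating."""
    limit = MATURITY_LIMITS[maturity]
    return limit is None or current_adopters < limit
```

The point of the cap is exactly the support trap above: the team's support load grows with each adopter, so adoption is throttled until the component has matured enough to carry it.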

Pre-alpha and alpha will have very specific partners that have agreed to accept some of the risk involved. The benefit, of course, is that when it pans out they're way ahead of the curve on the adoption cycle.

42:34

So, in conclusion, we've presented our approach to completely overhaul the machine learning stack at LinkedIn. You can find more at our engineering blog.

Question and Answer Session

42:48

And we're open to questions. You can put questions up on sli.do -- I haven't seen anybody using it so far, but you're welcome to. I'll start the app up, or you can just ask and we'll repeat the question back. So, there are still no questions on sli.do -- does anyone have any questions?

43:14

Go Ahead

43:36

Yup, yup. That's a great question. So, the question was: "When one of our modeling teams comes up with a new feature and wants to publish it for sharing, how do we provision the infrastructure? Who's responsible for running that infrastructure?"

To be honest, we're still working through the details on that. Feature sharing is a lot like getting leverage with your code. By sharing, you're bringing efficiency -- which is awesome -- but you're also binding yourself, because now you can't change your API anymore.

You have to deprecate and provide end-of-life plans and all that. But right now, the provider of the feature owns their infrastructure. The key-value stores have SRE teams that run them; the Kafka and Samza streaming teams have SREs that run the various clusters.

Just as they were responsible for the infrastructure they used before they published the new feature, part of their decision to publish is to continue that responsibility.

If a feature gets a lot of leverage, it's going to move to one of a couple of teams: the Standardization team, which produces features that are generally usable for business metrics and for machine learning and relevance, or a common-features team that's more focused on generally leverageable relevance-driving features.

Thank you!

45:12

There's a question over here. Yup yup.

45:34

So, the question is: "For a small team of modelers, it can be a big challenge to move from that to getting a full-scale model into production."

Did I get that about right? OK.

So, this is a big part of what we're trying to do, and this is where the democratization comes in: using the components in our architecture, starting with the Jupyter notebook tied in with the Quasar DSL support and the Frame support, tying over to Photon Connect that Bo talked about.

Basically, if a team uses this infrastructure, they're greatly increasing the probability that what they build will be deployable using the rest of the infrastructure.

Still, we're not going to block anybody. If somebody wants to go out and try something new -- do their own thing -- they can, but they're going to be on their own. They're not going to get support from the relevant site reliability engineers, and so on.

Did that answer your question?

46:41

Any other questions?

46:53

Yep. So, the question is "What are we doing with hyper-parameter tuning?"

Bo, do you want to talk about this?

So that's a very good question. Actually, I should mention that, like we said, we have a huge feature space. That basically means we have a huge parameter space, so we built a component called Auto Tuning that can automatically tune the parameters.

Even further, we can tune complicated models -- a non-linear model like a tree and a linear component together -- using an approach based on a Gaussian process. That helps hyper-parameter tuning a lot.
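To illustrate the idea, here is a toy Bayesian-optimization loop over one hyper-parameter, using a small Gaussian-process posterior and an upper-confidence-bound rule to pick the next trial. This is a from-scratch sketch, not LinkedIn's Auto Tuning component, and the lengthscale and exploration constant are arbitrary choices for a [0, 1] search range:

```python
import numpy as np

def rbf(a, b, length=0.2):
    """RBF kernel between two 1-D point sets (lengthscale suits [0, 1])."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_seen, y_seen, x_cand, noise=1e-6):
    """Gaussian-process posterior mean and std at candidate points."""
    K = rbf(x_seen, x_seen) + noise * np.eye(len(x_seen))
    Ks = rbf(x_seen, x_cand)
    alpha = np.linalg.solve(K, y_seen)     # K^-1 y
    mean = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)             # K^-1 Ks
    var = np.diag(rbf(x_cand, x_cand)) - np.sum(Ks * v, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def propose_next(x_seen, y_seen, x_cand, kappa=2.0):
    """Pick the candidate maximizing mean + kappa * std (UCB):
    spend trials where the model is promising or uncertain."""
    mean, std = gp_posterior(np.asarray(x_seen, float),
                             np.asarray(y_seen, float),
                             np.asarray(x_cand, float))
    return float(x_cand[int(np.argmax(mean + kappa * std))])
```

Each round, the tuner evaluates the proposed hyper-parameter (say, a regularization weight), appends the observed metric, and proposes again -- which is how a Gaussian process lets you explore a huge parameter space with far fewer training runs than a grid search.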

47:54

Any other questions?

48:01

Yep. So, the question is: "We have the course for the managers. Have we had any issues with the managers being part of the problem instead of the solution?"

So, truth in advertising: the machine learning infrastructure we have right now is, at this point in Productive ML, more like the old way, where it's awkward and difficult. Right now, in our classes, every person who goes through the machine learning process in AI Academy is partnered with somebody on a relevance team for an apprenticeship. So, we're only taking product engineering teams that have partner relevance teams -- machine learning teams.

And they're working through it together, so there's a fair chance we'll see issues like you're talking about. It's actually pretty likely, because engineering managers at LinkedIn are engineers -- they actually build things -- and that means some of the managers might take both classes: the how-to-do-it class and the how-to-manage-it class.

Hopefully with the two they'll avoid being part of the problem, but we welcome the stress. It's only when people try things at the edge of our envelope that we learn in advance. So, there are going to be challenges, and they're going to be good ones, I think.

49:49

Other questions? Going once, going twice ...

Session is closed!

[Applause]